Nigel Clay's Blog

Tuesday, 23 February 2016

Setting up Scientific Python on a Mac with El Capitan

Since it may be of use to others and so I don't forget it myself, here's how I set up my Mac to run a Python scientific bundle in a virtual environment. Both of my macs use El Capitan as the OS but this should work for some earlier operating system versions as well.

Step 1 - Chown /usr/local

When El Capitan came out it implemented something Apple calls System Integrity Protection (SIP). This protects certain system files from being modified or tampered with. Unfortunately we need to do exactly that, so to work around this you'll need to chown the '/usr/local' directory. Run the following in terminal:

 sudo chown $(whoami):admin /usr/local && sudo chown -R $(whoami):admin /usr/local

You'll be prompted for your admin password when you do this.

Step 2 - Xcode and Command Line Tools

Install Xcode and Command Line Tools. This can be downloaded from the app store for free. Once this is done you'll need to fire up Xcode to complete the set up. Once this is done you'll need to get the Command Line tools. Run the following in terminal:

 xcode-select --install

Step 3 - Homebrew

Install homebrew by typing the following into terminal:

 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

You might also want to make some changes to your ~/.bashrc file as well. I added

 export PATH="/usr/local/bin:usr/local/sbin:$PATH"

Step 4 - Get Python

In terminal type:

 brew install python3

The latest version of python will now install. It is possible to install specific versions of python using homebrew but I have never had the need to do it. You should also add the following to your ~/.bashrc file

 export PATH="/usr/local/share/python:$PATH"

Step 5 - Install freetype, libpng and ffmpeg

Freetype is a software library used to render fonts, libpng is used for portable network graphics (PNG) and ffmpeg is used to record, convert and stream audio and video. To install them type the following into terminal:

 Brew install libpng

 Brew install freetype  
 Brew install ffmpeg

Step 6 - Install gcc and pkg-config

gcc provides a front end for C, C++ and Fortran. Some python libraries make use of this. When you install this don't be surprised if it takes a while and terminal appears to hang. Don't worry it's doing stuff in the background.

 Brew install gcc  
 Brew install pkg-config

Now we'll create a virtual environment where we can run and create python projects

Step 7 - Install Virtualenv and create a virtual environment

As it's name suggests virtualenv allows us to create virtual environments. To make it easier to use it's a good idea to install virtualenvwrapper as well.

 Pip3 install virtualenv  
 Pip3 install virtualenvwrapper

You might also have to add the following to your ~/.bashrc file:

 export WORKON_HOME=$HOME/.virtualenvs  
 export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python3  
 source /usr/local/bin/virtualenvwrapper.sh

You can now create your first virtual environment by typing the following into terminal:

 mkvirtualenv env1

I've called the environment env1 but you can call it whatever you like.

Your terminal window should now look similar to this.

All the remaining steps take place in this virtual environment, so don't deactivate it.

Step 8 - Install iPython

The iPython (Jupyter) notebook is one of the most popular ways to create and edit python code. I use a mixture of this or PyCharm depending on what I am working on at the time. Type the following into terminal:

 pip install iPython[all]

Step 9 - Matplotlib, Numpy, Pandas and Scipy

In most scientific python bundles there are four workhorse libraries. Matplotlib has great data visulaisation tools. Numpy is used for a lot of numeric and array processing. Pandas is used for data analysis. Scipy is used for scientific computing tasks. To install these type the following into terminal:

 pip install matplotlib  
 pip install numpy  
 pip install pandas  
 pip install scipy

It is possible that you might have to let matplotlib know which backend you are using. To do this you will need to edit the matplotlbrc file which is in '~/.virtualenvs/yourenv/lib/yourpythonversion/site-packages/matplotlib/mpl-data/' You should now have python and a great bunch of libraries running on your mac. To fire it up first make sure that your virtual environment is active, then type:

 jupyter notebook

You'll now get a new Jupyter page in your browser.

Of course you may need to add other libraries from time to time and you can do that much in the same way I've shown here for adding libraries to python.

Lastly at some point in the future you may want to update all of your libraries to the most current version. Rather than do this one library at a time here's a neat bit script to do it.

 pip freeze --local | grep -v '^\-e' | cut -d = -f 1 | xargs -n1 pip install -U

Saturday, 6 February 2016

MISG 2016

I spent the week of 1-5 February at the University of South Australia where I attended MISG 2016. This is a conference where representatives of industry present some thorny problems they are working on and we maths dudes and dudettes try to help them. There were several industry problems being worked on and I joined a group looking at inference on a knowledge base. I won't go into too much detail about what the industry client does (I'm sure you can work it out) but I will offer a tantalizing clue in that the introductory presentation was headed "Unclassified." The problem was pretty interesting though and could be summarized with two questions:

Entity resolution: How do we disambiguate a knowledge base?
Activity based intelligence: What conclusions can be drawn from a noisy knowledge base?

Given the problem domain it makes perfect sense that we were not given real information. Instead, we were given some manufactured GPS tracking data which some of us used and some of us didn't. I opted not to use the supplied data and instead used a publicly available data set of the network behind the Madrid Train Bombing.

The Madrid Train Bombing Network

My approach was graph based. Mainly because where you have a graph you can also represent it as a matrix, and where you have a matrix, you can use linear algebra to say some useful things about it. The graph is shown below. The vertices have been numbered as names would cause too much clutter. Also for my purposes I was looking at the structure of the graph and names do not assist with that.

The Madrid Train Bombing Network

So, as I said, if you have a graph you have a matrix. One of the useful things you can do with a matrix is a similarity transform. For those that are interested in a refresher on what this is, consider a matrix \(\mathbf{A}\). Find two matrices \(\tilde{\mathbf{A}}\) and \(\mathbf{P}\) such that: \[\mathbf{P}\mathbf{\tilde{A}P^{-1}}=\mathbf{A}\]
Some of you may know this as matrix factorisation, diagonalisation or eigenvector decomposition. The thing is, this transform tells you some interesting stuff. For example, if we use only the largest eigenvalue/eigenvector pair we can isolate the largest community structure in this graph:

The largest community structure in the Madrid Train bombing network

Adding Noise

Now, I wanted to know how much noise could be added to the edge weights on the graph before the largest structure was no longer able to be isolated. It turns out, for this graph at least, that the structure is fairly robust. Yes, you can swamp it with noise and make it unusable, but if the edge weights varied by \(\pm 40 \%\) I could still recover most of the useful structure with a few extra vertices added and at \(\pm 20\%\) I could recover it perfectly. There are still other things to check though, and I didn't have the time to do all of them: what out deeper structures? will it hold for other networks? how do the edge thresholds I used change the result?

Backing Noise into a Corner

If you sort the eigenvalue/eigenvector pairs from largest to smallest the similarity transform causes structure to bubble up and noise to be concentrated in the smaller eigenvalue/eigenvector pairs. It is then possible to consider the structure that manifests using the k-smallest pairs. I was interested in the order that vertices "leave" the graph. In the Madrid Train Bombing leaving out the two largest eigenvector/eigenvalue pairs causes six vertices to detach. Think about what this is saying for a bit. These guys participate in the larger structures in the graph but not the smaller structures. They're probably very important.

Remove the two largest eigenvalue/eigenvector pairs

We can continue to remove information. Keeping track of the order in which vertices detach from the graph suggests a way to prioritize any consequent actions. When 5 pairs were removed I got the following:

Removing the five largest eigenvalue/eigenvector pairs

Continuing to do this eventually the causes the graph to fall apart. The important structure is captured by the largest pairs while the smallest contain noise and unimportant structure. Effectively, the noise is backed into a corner. Or, put another way, the information in a graph can be approximated pretty well with fewer pairs than there are vertices.


Removing the 15 largest eigenvalue/eigenvector pairs

So, it is possible to deal to the noise that is present in the edge weights and it is possible to isolate vertices that are important to the structure and can use this to prioritise. But the real world is not that co-operative. In any graph of a network of individuals you can only record relationships (edges) between vertices if you actually observe them. What about those that you can't see?

Revealing what can't be seen.

Remember, I represent the graph as a matrix. If a relationship between two vertices is observed a weight is associated with the edge between them that, in a noisy and numerical sense, describes the importance of that relationship. This value appears in the matrix where the two vertices intersect. If no relationship is observed then the edge has a weight of 0 and this is also recorded in the matrix at the appropriate location. But there's structure in this particular graph - that has been shown. Does structure imply the existence of edges that were not observed? I can demonstrate this with a digital image. A digital image is nothing but a matrix where the cell values correspond to colour or to a greyscale. In the image below a (very) large proportion of the information has been removed randomly.

An image (matrix) with missing information

If you look at it you might see some areas of light and others that are dark but not much more than that. If you've got this far you might remember (its a theme I've been pushing) that the similarity transform concentrates information in the larger eigenvalue/eigenvector pairs. However you cannot do a similarity transform on a matrix with missing information. That is, unless you try to "learn" the eigenvalue/eigenvector pairs numerically. That's exactly what I did to the image above to obtain the image below.

The image (matrix) reconstructed from only those values (structure) which could be observed.

You now have a pretty good idea of what the image is. For comparison here's the original image.

The original image (matrix)

But can we do this to a graph? I removed an edge in the graph of the Madrid Train Bombing. It happens that this particular edge participates in the largest community structure and has an edgeweight of 1. By removing it we're saying we didn't observe that edge.

Running the numerical approximation to the similarity transform while treating the edge as 'missing' gave the estimate of the edge weight as 0.9812. Now, if an organisation were investigating potential relationships, the approximations could be ranked largest to smallest and used to prioritize any consequent activity. Of course, one edge does not prove anything. So I iterated though all of the edges, removing them one at a time, and testing if the edge could be put back. The result showed that important edges could be recovered and while unimportant edges were not.

Conclusions

In some graphs noise is present in the weights on the edges and in the structure of the graph itself. In concept, and provided there's not too much noise, it is possible to extract important structures in noisy graphs. It is also possible to filter out noise (or push into a corner) by using the similarity transform of the underlying matrix and this can be used to isolate important vertices using the order they detach to prioritize actions. But noise also manifests by the absence of edges. Provided there is structure in the matrix of a graph it is possible to use that structure to approximate the weights of unobserved edges. Moreover, using this estimate it is also possible to prioritize any consequent activity.

Outstanding Questions

In reality it would be unlikely that we would be presented something that looks as 'nice' as the example shown. The graph would be embedded deep within a larger structure of unimportant information. So an outstanding question is how can you isolate the important structure of a relatively small number of individuals in a vastly larger graph? This is like adding hundreds, thousands, or tens of thousands of additional vertices. Bystanders, that happened to be in the same coffee shop or stopped to tie their shoe in the wrong location. There may, or may not, be some basis by which edges in this larger graph form. Random attachment is one way, preferential attachment is another. Ramsey theory tells us that any sufficiently large set of elements will guarantee the existence of some property of interest. That, and apophenia (or, specifically in the case of visual phenomena, pareidolia), is why we see patterns in the stars, faces on Mars, Jesus on toast and subsribe to some conspiracy theories. So a sufficiently large graph would contain a lot of red herrings to sort though.

Lastly, it would be interesting to know how well the approaches I described work on other graphs. Does this just 'happen' to work on the graph I chose?

Saturday, 30 January 2016

The Eleven Socks of Karl Broman

On 17th October 2014 Karl Broman, a professor at the University of Wisconsin-Madison, tweeted about his laundry and caused a bit of discussion.

Is it true that if you find that the first eleven socks you remove from your washing machine are unmatched it suggests that there are a lot more socks in the washing machine? I intend to find out by building a Bayesian model in Python to test it.

Defining a Prior

Bayesian models require a prior. This is similar to how humans process information. That is, the conclusions we draw and the predictions we make are based on our prior experience. Bayesian models make this idea more formal.

We have several bits of prior information about this problem.

That there were eleven unmatched socks suggests that the minimum number of socks in the washing machine is 11.
He doesn't have an infinite quantity of socks (who does right?) so there should be a finite upper bound. To work out what this might be I weighed my socks (yes, I actually did this) to work out how many would fit in the washing machine. My socks were 100g (50g each) and the capacity of my washing machine is 7kg so I could (in theory) wash 70 pairs of (or 140 single) socks. In the photo with Karl Broman's tweet its clear that there are kids socks and other items of clothing as well. It might be reasonable to suggest that the most socks that could be in the machine is 200 but to allow for the additional clothing items I'll set the maximum at 100.
I know from experience with kids that often unmatched socks go into the wash.

We now have our priors. For clarity let \(X\) be the total number of socks put into the washing machine and let \(Y\) be the proportion of those socks that are unmatched at the beginning of the wash. \[X\sim \text{DU}\left(11, 100\right)\] \[Y \sim \text{UNIF}\left(0,1\right)\]

The Model

Now we simulate loading Karl Broman's washing machine. We put some socks in, some in pairs, some as singles and we then sample eleven socks. If the eleven socks are all different we count that as a success and record the number of pairs and number of singles that went into the wash. If there is at least one pair in the first eleven sock we discard the observation that caused it. We do this 1,000,000 times. What we end up with is an array containing the combinations of odd and paired socks that gave us the outcome we're interested in. If the combination only generates eleven unmatched socks very rarely then it does not appear in the array very often. On the other hand if it generates the outcome fairly frequently it's in the array more often. Here's my python code to do this.

 import numpy as np  
 import random  
 from matplotlib.colors import LogNorm  
 import matplotlib.pyplot as plt  
 actual = 11 #The number of unmatched socks observed.  
 simlen = 1000000  
 i = 0  
 outcomes = np.zeros((simlen, 2))  
 while i < simlen:  
   no_socks = random.randrange(11, 100) # Prior on the total number of socks  
   prop_odd = random.random() # Prior on the proportion of socks that are odd  
   no_pairs = int(np.floor(no_socks * (1 - prop_odd) / 2))  
   no_odd = no_socks - no_pairs * 2  
   socks = list(range(no_pairs)) + list(range(no_pairs + no_odd)) # socks put into the washing machine  
   sample = np.random.choice(socks, 11, replace=False) # Sample eleven socks  
   if len(set(sample)) == actual:  
     outcomes[i, 0] = no_pairs  
     outcomes[i, 1] = no_odd  
     i += 1  
 %matplotlib inline  
 plt.hist2d(outcomes[:,0], outcomes[:,1], bins = 49, norm=LogNorm())  
 plt.colorbar()  
 plt.title("Density cases where the first eleven socks were odd.")  
 plt.ylabel("Number of odd socks in machine")  
 plt.xlabel("Number of paired socks in machine")  
 plt.axis([0,50,0,100])  
 plt.show()

Results

When the code is run we get the following density plot of the combinations that resulted in eleven odd socks.

The outcome is not too surprising. However, the bit that is surprising is that the actual number of socks in Karl Broman's washing machine is 21 pairs and 3 singletons.

This would be in the pale blue area of the chart suggesting that it is one to two orders of magnitude away from what might otherwise be expected.

So, my conclusion. The fact that the first eleven socks were unmatched does not suggest that there are more socks in the machine. It suggests that there was a high proportion of unmatched socks. The fact the the real outcome was different does not change the conclusion. It just means that the real outcome was unlikely based on the information we had at hand. Sure, with more information we could have come up with a more refined prior. But in the absence of information you do what you can.

Welcome

Thought I'd do the studenty thing and set up a blog on some of my meanderings in the realm of geekdom. At least this time I am blogging under my own name rather than a pseudonym. I hope you enjoy my musings. If so, drop me a line. If not, well...