Tuesday, 31 July 2012 at 2:05 pm

Performing principal component analysis with matplotlib is extremely easy.

The input is a 2d numpy array where columns are the dimensions you want reduced and rows are samples. For example, a set of transcriptome data:

        Condition1   Condition2   Condition3   Condition 4   Condition5   
geneA 231 321 4 221 2312
geneB 211 4 53 34 53
geneC 4 343 .. .. ..
geneD  43 .. .. .. ..
.. .. .. .. .. .. 

Here is all you need to do a PCA (assuming you have already gotten your data into a native python 2d array):

from matplotlib.mlab import PCA
#construct your numpy array of data
myData = numpy.array(data) 
results = PCA(myData) 

#this will return an array of variance percentages for each component

#this will return a 2d array of the data projected into PCA space

Here is a great how-to on how to plot a 3d graph using matplotlib and Tk. Here is the code from the how-to plotting the PCA results from above:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

x = []
y = []
z = []
for item in result.Y:

plt.close('all') # close all latent plotting windows fig1 = plt.figure() # Make a plotting figure
ax = Axes3D(fig1) # use the plotting figure to create a Axis3D object.
pltData = [x,y,z]
ax.scatter(pltData[0], pltData[1], pltData[2], 'bo') # make a scatter plot of blue dots from the data

# make simple, bare axis lines through space:
xAxisLine = ((min(pltData[0]), max(pltData[0])), (0, 0), (0,0)) # 2 points make the x-axis line at the data extrema along x-axis
ax.plot(xAxisLine[0], xAxisLine[1], xAxisLine[2], 'r') # make a red line for the x-axis.
yAxisLine = ((0, 0), (min(pltData[1]), max(pltData[1])), (0,0)) # 2 points make the y-axis line at the data extrema along y-axis
ax.plot(yAxisLine[0], yAxisLine[1], yAxisLine[2], 'r') # make a red line for the y-axis.
zAxisLine = ((0, 0), (0,0), (min(pltData[2]), max(pltData[2]))) # 2 points make the z-axis line at the data extrema along z-axis
ax.plot(zAxisLine[0], zAxisLine[1], zAxisLine[2], 'r') # make a red line for the z-axis.

# label the axes
ax.set_xlabel("x-axis label")
ax.set_ylabel("y-axis label")
ax.set_zlabel("y-axis label")
ax.set_title("The title of the plot")
plt.show() # show the plot

  Wednesday, 18 July 2012 at 12:07 pm

XSQ files are binary files conforming to the HDF5 format used to store ABI SOLiD's color-space and quality data. Manipulating the data can be done with NGS_plumbing and XSQ_tools. If you just want to do a simple converstion or basic stats on the XSQ file, I recommend to just use NGS_plumbing. In this post, I will go through how the data is stored in the XSQ file and how to access it with python. 

  Thursday, 12 July 2012 at 09:33 am

Information theory as an interdiciplinary field consisting of engineering, physics, bioinformatics, mathematics, and many others started with Claude E. Shannon's 1948 paper, "A mathematical theory of communication". Advancements in this field have been instrumental in improving communication across the world from data storage on disc drives to satellite communcations. 

In this blog post, I will briefly go over what information theory is all about in an intuitve way and it's practical application to the field of bioinformatics and evolution. I am not an expert on information theory, so I welcome any corrections.