Friday, 03 August 2012 at 4:00 pm

SciPy is a great library for performing complex operations used commonly in science. Hierarchical clustering is one of them. Briefly, here is how to perform this with SciPy:

Input data is a 2d array where rows are genes and columns are your conditions. For example:

        Condition1   Condition2   Condition3   Condition 4   Condition5   
geneA 231 321 4 221 2312
geneB 211 4 53 34 53
geneC 4 343 .. .. ..
geneD  43 .. .. .. ..
.. .. .. .. .. .. 

To get this into a 2d array you can:

dataFile = open('path to your file','r')

dataMatrix = []

#skip first line because it's the column headers

for line in dataFile:
#append each column except the first column which is the gene ID
 dataMatrix.append([float(x) for x in line.strip().split('\t')[1:]])

After you have gotten the data into a 2d array, you can do the following to perform the clustering:

import numpy, scipy
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

#get your data into a 2d array where rows are genes, and columns
#are conditions
data = numpy.array(dataMatrix)

#calculate a distance matrix distMatrix = dist.pdist(data)

#convert the distance matrix to square form. The distance matrix
#calculated above contains no redundancies, you want a square form
#matrix where the data is mirrored across the diagonal.
distSquareMatrix = dist.squareform(distMatrix)

#calculate the linkage matrix linkageMatrix = hier.linkage(distSquareMatrix)

 You can used various parameters for different distance calculations or linkage types. Read the API for more:

What you can with this linkageMatrix now is to create a dendrogram by using the dendrogram class:

dendro = hier.dendrogram(linkageMatrix)

#get the order of rows according to the dendrogram
leaves = dendro['leaves'] 

#reorder the original data according to the order of the
#dendrogram. Note that this slice notation is numpy specific.
#It just means for every row specified in the 'leaves' array,
#get all the columns. So it effectively remakes the data matrix
#using the order from the 'leaves' array.
transformedData = data[leaves,:]

Here is a great python scrip by Nathan Salomonis that'll take in your tab delimited expression file, perform a hierarchical clustering on rows + columns, and produce a heatmap.

  Tuesday, 31 July 2012 at 2:05 pm

Performing principal component analysis with matplotlib is extremely easy.

The input is a 2d numpy array where columns are the dimensions you want reduced and rows are samples. For example, a set of transcriptome data:

        Condition1   Condition2   Condition3   Condition 4   Condition5   
geneA 231 321 4 221 2312
geneB 211 4 53 34 53
geneC 4 343 .. .. ..
geneD  43 .. .. .. ..
.. .. .. .. .. .. 

Here is all you need to do a PCA (assuming you have already gotten your data into a native python 2d array):

from matplotlib.mlab import PCA
#construct your numpy array of data
myData = numpy.array(data) 
results = PCA(myData) 

#this will return an array of variance percentages for each component

#this will return a 2d array of the data projected into PCA space

Here is a great how-to on how to plot a 3d graph using matplotlib and Tk. Here is the code from the how-to plotting the PCA results from above:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

x = []
y = []
z = []
for item in result.Y:

plt.close('all') # close all latent plotting windows fig1 = plt.figure() # Make a plotting figure
ax = Axes3D(fig1) # use the plotting figure to create a Axis3D object.
pltData = [x,y,z]
ax.scatter(pltData[0], pltData[1], pltData[2], 'bo') # make a scatter plot of blue dots from the data

# make simple, bare axis lines through space:
xAxisLine = ((min(pltData[0]), max(pltData[0])), (0, 0), (0,0)) # 2 points make the x-axis line at the data extrema along x-axis
ax.plot(xAxisLine[0], xAxisLine[1], xAxisLine[2], 'r') # make a red line for the x-axis.
yAxisLine = ((0, 0), (min(pltData[1]), max(pltData[1])), (0,0)) # 2 points make the y-axis line at the data extrema along y-axis
ax.plot(yAxisLine[0], yAxisLine[1], yAxisLine[2], 'r') # make a red line for the y-axis.
zAxisLine = ((0, 0), (0,0), (min(pltData[2]), max(pltData[2]))) # 2 points make the z-axis line at the data extrema along z-axis
ax.plot(zAxisLine[0], zAxisLine[1], zAxisLine[2], 'r') # make a red line for the z-axis.

# label the axes
ax.set_xlabel("x-axis label")
ax.set_ylabel("y-axis label")
ax.set_zlabel("y-axis label")
ax.set_title("The title of the plot") # show the plot

  Tuesday, 22 May 2012 at 10:35 am

When trying to tie together several packages/scripts into an ad hoc pipeline, retrieving standard output becomes important for downstream analysis. To run console commands in python and retrieve the standard output, we can use the popen2 command:

import popen2
output = popen2.popen3('your console command')

This popen2 command will run your console command and return 3 objects to the output variable: 1) stdout file object. 2) stdin file object. 3) stderror file object. 

Here is how to access those file objects:

import popen2
output = popen2.popen3('your console command')
standardOutput = output[0].read()
standardInput = output[1].read()
standardError = output[2].read()

Simple example would be to run the 'ls' terminal command to list all objects in the current directory and retrieve the output:

import popen2
listDirectory = popen2.popen3('ls')[0].read()
#you can further parse the text to get an array of directory items
listDirectory = listDirectory.strip().split('\n') 

Note that you can use the subprocess class instead of popen2. I just find popen2 to be simpler.

  Sunday, 13 May 2012 at 2:20 pm

When processing large files, we often want to print a log to the screen to keep track of the script's process. For example, if I was parsing a tab delimited file with 200 million lines:

dataFile = open('myfile','r')
lineNum = 1
for line in dataFile:
#parse and process my data
#print log to the screen
print 'finished processing line number ' + str(lineNum)
lineNum += 1

After parsing and processing every line, the script will print out a message indicating the number of lines it has processed.

With large files, printing out a log for every line is too verbose and just leads to clutter on the screen. To have the script print out a line every interval, we can use the modulus operator.

The modulus operator is represented in most languages by the '%' sign. It returns the remainder after division. 20 / 3 is equal to 3 * 6 + 2. The remainder is 2. 20 % 3 would result in 2. 

Here is an example of using the modulus operator to print out a log message every 10,000 lines:

dataFile = open('myfile','r')
lineNum = 1
for line in dataFile:
#parse and process my data
#print log to the screen
if lineNum % 10000 == 0:
print 'finished processing line number ' + str(lineNum)
lineNum += 1

Whenever the line number is divisible by 10000 without a remainder, the script will print a message.

  Friday, 06 April 2012 at 12:13 pm

There are plenty of helpful shell commands one can use for working with tab delimited files. The five most commonly used ones for me are:

  1. cut - cut the files according to space delimiter 
  2. grep - search for string in a line
  3. uniq - get uniq counts of element
  4. wc - count words, lines
  5. sort - sort your file

Using these commands with pipes allows you to do many useful things:

#output the first column of your tab delimited file
cut -f 1

#output all lines that have the word 'gene' in it
grep 'gene'  

#output number of lines that have the word 'gene' in in it
grep -c "gene"

#ouput number of lines that have the word "gene" in its third column
#this works on a GTF/GFF file and will output how many gene features there are in the file
cut -f 3 | grep -c "gene" 

#output number of each type of word in the third column
#If you run this on a GTF/GFF file, you'll get number of features for each feature type
cut -f 3 | sort | uniq -c

#ouput number of unique elements in column 1
#useful for checking how many queries had blast result without redundancy in a tab delimited blast output
cut -f 1 | sort | uniq -c | wc -l

  Wednesday, 04 April 2012 at 5:37 pm

You can define a dictionary by using an array of key/value tuple. For example the following dictionary:

myDict = {'keyA':'valueA','keyB':'valueB'}

Can be defined by:

myDict = dict([('keyA','valueA'),('keyB','valueB')]) 

This can be very useful when you want to parse text files into dictionaries. For example, a tab-delimited file where the first column is the key and the rest of the columns are values:

myDict = dict([(line.split('\t')[0],line.split('\t')[1:]) for line in open('pathToFile','r')])

This single line will create a dictionary where the first column is the key and the value is an array of the rest of the columns.

  Thursday, 29 March 2012 at 12:54 pm

It is a very common operation to sort an array of non-scalar variables (arrays, tuples, dicionaries, objects) in python. For example:

#array of tuples
myTuples = [('a',122),('b',4322),('c',531),('e',1)] 

If I want to sort the above array with respect to the second element of the tuple, I can use the 'itemgetter' function from the 'operator' module:

import operator
myTuples.sort(key = operator.itemgetter(1)) #1 is the index of the second element in the tuple

If you have an array of dictionaries, you can still use the itemgetter function by using the name of the key:

#array of dictionaries
myDicts = [{'name':'a','count':122},{'name':'b','count':4322},{'name':'c','count':531},{'name':'e','count':1}] 
myDicts.sort(key = operator.itemgetter('count'))

If you have an array of objects where you want to sort by an attribute, you can use 'attrgetter', instead of 'itemgetter':

#array of objects
myDicts = [objectA, objectB, ojbectC, objectD] 
myDicts.sort(key = operator.attrgetter('count'))

To sort in ascending or descending order, use the 'reverse' argument in the sort function:

myTuples.sort(key = operator.itemgetter(1), reverse = True)

If you do not want to use the operator module or using version of python that doesn't have the module, you can always use a lambda function to define the key:

myTuples.sort(key = lambda x : x[1]) 

  Monday, 26 March 2012 at 5:20 pm

If you want to extract reads that mapped only to a certain strand of the reference, you can use the -F and -f flags in samtools:

#for positive strand
samtools view -F 0x10 -b input.bam > postiveStrand.sam

#for negative strand 
samtools view -f 0x10 -b input.bam > negativeStrand.sam 

The 0x10 flag of the bitwise flags indicates whether the read is reverse complemented or not. If 0x10 is set, then the read mapped to the reverse strand. If it is not set, then the read mapped to the positive strand.

The -F parameter with 0x10 will skip anything that has 0x10 set, meaning it will skip anything that mapped to the reverse strand. The -f parameter will only keep anything that has 0x10 set.