Monday, 20 August 2012 at 10:34 am

As an extension on the post earlier on finding overlapping/consensus regions of features, this is a post on how to find features given a coordinate range. For example, given a range of coordinates and a reference contig, how does a genome browser efficiently find all features within that range?

The brute-force method would be to go through every feature one by one and determine whether the feature is within the given range by checking for overlap. That would make the search time linear relative to the number of total features (big-O notation of  O(n)). As it turns out, a binary search tree method called an interval tree can be applied to this problem that'll make searches O(log n). Here is an extremely good guide on the theory behind interval trees.

I am currently trying to write a client-side genome browser. Interval trees solves the problem of finding features effciently. Below is code on how to construct a interval tree in python and how to query the tree in python/javascript.

  Wednesday, 08 August 2012 at 11:20 am

view the demo here
Python script and HTML source is in the post.
Alternatively, you can download a zip file of source code and example data here

I posted some brief code on how to do hierarchical clustering of expression values with SciPy few days ago. I am going to expand on how that's done and also show you how you can visualize it as a heatmap in a browser (works best with Chrome or Safari) with javascript and d3.js. 

I will also post something on visualizing dendrograms in the coming weeks.

  Tuesday, 07 August 2012 at 10:21 am

A common task in analyzing annotations or alignments is to see what the consensus or overlapping regions are. For example:

How do you calculate the overlapping region (red) or the consensus region (blue) between these 2 features on a reference sequence? The best available tool for this type of analysis is bedtools. However, just for fun, I will show you how to perform consensus and overlap in python without using any external libraries.

  Friday, 03 August 2012 at 4:00 pm

SciPy is a great library for performing complex operations used commonly in science. Hierarchical clustering is one of them. Briefly, here is how to perform this with SciPy:

Input data is a 2d array where rows are genes and columns are your conditions. For example:

        Condition1   Condition2   Condition3   Condition 4   Condition5   
geneA 231 321 4 221 2312
geneB 211 4 53 34 53
geneC 4 343 .. .. ..
geneD  43 .. .. .. ..
.. .. .. .. .. .. 

To get this into a 2d array you can:

dataFile = open('path to your file','r')

dataMatrix = []

#skip first line because it's the column headers

for line in dataFile:
#append each column except the first column which is the gene ID
 dataMatrix.append([float(x) for x in line.strip().split('\t')[1:]])

After you have gotten the data into a 2d array, you can do the following to perform the clustering:

import numpy, scipy
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

#get your data into a 2d array where rows are genes, and columns
#are conditions
data = numpy.array(dataMatrix)

#calculate a distance matrix distMatrix = dist.pdist(data)

#convert the distance matrix to square form. The distance matrix
#calculated above contains no redundancies, you want a square form
#matrix where the data is mirrored across the diagonal.
distSquareMatrix = dist.squareform(distMatrix)

#calculate the linkage matrix linkageMatrix = hier.linkage(distSquareMatrix)

 You can used various parameters for different distance calculations or linkage types. Read the API for more:

What you can with this linkageMatrix now is to create a dendrogram by using the dendrogram class:

dendro = hier.dendrogram(linkageMatrix)

#get the order of rows according to the dendrogram
leaves = dendro['leaves'] 

#reorder the original data according to the order of the
#dendrogram. Note that this slice notation is numpy specific.
#It just means for every row specified in the 'leaves' array,
#get all the columns. So it effectively remakes the data matrix
#using the order from the 'leaves' array.
transformedData = data[leaves,:]

Here is a great python scrip by Nathan Salomonis that'll take in your tab delimited expression file, perform a hierarchical clustering on rows + columns, and produce a heatmap.