Thursday, 29 March 2012 at 12:54 pm

It is a very common operation to sort an array of non-scalar variables (arrays, tuples, dicionaries, objects) in python. For example:

#array of tuples
myTuples = [('a',122),('b',4322),('c',531),('e',1)] 

If I want to sort the above array with respect to the second element of the tuple, I can use the 'itemgetter' function from the 'operator' module:

import operator
myTuples.sort(key = operator.itemgetter(1)) #1 is the index of the second element in the tuple

If you have an array of dictionaries, you can still use the itemgetter function by using the name of the key:

#array of dictionaries
myDicts = [{'name':'a','count':122},{'name':'b','count':4322},{'name':'c','count':531},{'name':'e','count':1}] 
myDicts.sort(key = operator.itemgetter('count'))

If you have an array of objects where you want to sort by an attribute, you can use 'attrgetter', instead of 'itemgetter':

#array of objects
myDicts = [objectA, objectB, ojbectC, objectD] 
myDicts.sort(key = operator.attrgetter('count'))

To sort in ascending or descending order, use the 'reverse' argument in the sort function:

myTuples.sort(key = operator.itemgetter(1), reverse = True)

If you do not want to use the operator module or using version of python that doesn't have the module, you can always use a lambda function to define the key:

myTuples.sort(key = lambda x : x[1]) 

  Wednesday, 28 March 2012 at 2:17 pm

View the demo here
Download source code here

There are many ways of displaying gene ontology (GO) information. The most common way is to use software like Cytoscape to generate a graph diagram of GO with terms of interest highlighted. However, it is much more interesting to graph your data in an interactive way.

Here is a demo graphing only a small subset of GO. The demo starts with the root GO term, biological process and allows you to click on children terms (in orange) to navigate down the GO graph. It is a force-directed graph so you can also drag the nodes around. 

The two links on the top (fix and unfix) will turn off/on the force physics. There are also two sample GO term links that'll allow you to jump to the specified GO term.

Since the entire JSON formatted data of GO (biological process only) is around 4MB, this demo does not contain all possible GO terms. You can download a version with all GO terms here

The graph was rendered using D3.js. The data was parsed and formatted with python. I am not going to into great detail about the source code as there are too many things to cover. I'll just generally talk about the steps in creating this visualization.

  Monday, 26 March 2012 at 5:47 pm

Anyone who deals with .sam mapping files has seen the bitwise flag column. It is a single number that can somehow indicate settings for a bunch of different parameters. As per the sam file specifications:

Bit    Description
0x1 template having multiple segments in sequencing
0x2 each segment properly aligned according to the aligner
0x4 segment unmapped
0x8 next segment in the template unmapped
0x10 SEQ being reverse complemented
0x20 SEQ of the next segment in the template being reversed
0x40 the rst segment in the template
0x80 the last segment in the template
0x100 secondary alignment
0x200 not passing quality controls
0x400 PCR or optical duplicate

What does all that mean and how does bitwise flags work? This post will probably go into more detail than you need to for understand bitwise flags.

  Monday, 26 March 2012 at 5:20 pm

If you want to extract reads that mapped only to a certain strand of the reference, you can use the -F and -f flags in samtools:

#for positive strand
samtools view -F 0x10 -b input.bam > postiveStrand.sam

#for negative strand 
samtools view -f 0x10 -b input.bam > negativeStrand.sam 

The 0x10 flag of the bitwise flags indicates whether the read is reverse complemented or not. If 0x10 is set, then the read mapped to the reverse strand. If it is not set, then the read mapped to the positive strand.

The -F parameter with 0x10 will skip anything that has 0x10 set, meaning it will skip anything that mapped to the reverse strand. The -f parameter will only keep anything that has 0x10 set.

  Monday, 26 March 2012 at 4:14 pm

Here is a simple way to get intersection of two lists (listA and listB) in python:

intersection = list(set(listA) & set(listB))

It will first make listA and listB unique (remove all redundancies) by making them into sets. Then it will make a new list with elements that are in both listA and listB.

To get items that are exclusive to either lists:

exclusve = list(set(listA) ^ set(listB))

For example:

listA = [1, 2, 3, 4, 5, 6, 7]
listB = [1, 2, 3, 4, 5, 8, 9]

intersection = list(set(listA) & set(listB))
exclusve = list(set(listA) ^ set(listB))

#that would yield: [1, 2, 3, 4, 5] #intersection [6, 7, 8, 9] #exclusive

Going a step further, you can do this to get items in listA that is not found in listB:

uniqueToListA = list(set(listA) & set(set(listA) ^ set(listB)))

#which would yield:
[6, 7,] #unique to listA 

  Saturday, 24 March 2012 at 11:21 am

I had no idea there were so many styles of lab coats available. Some of these don't even look like lab coats. Here is a link to the lab coats section of


There is a whole section on Grey's Anatomy lab coats....

  Thursday, 22 March 2012 at 2:41 pm

View the demo here
HTML source is at the bottom of the post

Computers and the internet have changed academia in dramatic ways from greater sharing of data to a larger sense of community. Science journals are now all digitized and available online either through your web browser or downloadble as a .pdf. 

Even with all the technology available for presenting data, most published papers still only contain static figures. I am not undervaluing the importance of having nicely formatted figures and graphs. But I do want to show how data can be presented with all the tools available now. 

Science papers are generally viewed on a computer through a web browser like Chrome, Firefox, or Safari which use javascript/html/css for displaying information. Therefore, browser languages are ideal for ensuring accessibility of your data. Javascript is often touted as the most prevalent programming language in the world since every computer has a browser and most browsers can interpret javascript.

Here are a bunch of examples of interactive figures made using browser technologies, specifically D3.js.

  Thursday, 22 March 2012 at 1:07 pm

There are times when you just don't want to spend the time to look up the API of your favorite python module just to see if you are spelling a method or property correctly. Use the dir() function to list all attributes of an object.

For example, if you want to look at the attributes of a String object:

>>> dir(str)
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

  Thursday, 22 March 2012 at 08:44 am

A common task in my lab is blasting two .fasta files together. Often times someone would ask me to blast a few known genes against a set of transcripts to get some quick information. It became such a frequent task that I decided to just write a script that will blast two .fastile files against each other and generate a tab-delimited file. 

I am using python and blast 2.2.25+. User manual for blast 2.2.X is located here.

  Wednesday, 21 March 2012 at 04:08 am

Gene Ontology (GO) is a controlled dictionary of terms used commonly among biologists to describe genes. Using a strict set of classifiers allows computational biologists to analyze qualitative data in a quantitative manner. 

GO is structured as a directed acylic graph. Basically it is a hierarchy of terms where children terms cannot be the parent of ancestors. It is important to note that parents can have multiple children and children can have multiple parents.

GO terms can be accessed via the AmiGO web interface, downloaded and installed as a SQL database, or downloaded as a flat-file (.obo). There are also plenty of software and packages available.

One of the most common operations in GO analysis is finding ancestors or descendents of a specific term. AmiGO is great for looking up details for a single term, but it doesn't display information in an easily parsable way. SQL queries can be made to a database to get this information, but for instances where there is no database available or no pre-installed packages, you can use the .obo flat-file to get this information.

  Tuesday, 20 March 2012 at 11:55 am

Most gene enrichment websites out there only allow you to find enrichments for popular model organisms using pre-established gene ontology annotations. I ran into this problem early on during my phd when confronted with having to generate enrichment data on Schmidtea mediterranea.

  Tuesday, 20 March 2012 at 09:53 am

Linux bash shell or OSx terminal commands can be used for more than shuffling files and running processes. Using a combination of grep, awk, sed and pipes, one can extract or input data into files. Here are 4 useful one-liners for manipulating .fasta files.