Wednesday, 31 October 2012 at 4:23 pm

Bioinformatics is a rapidly evolving field, with new software and methodologies coming out constantly. It is increasingly important not to be constrained by the field's traditional tools. Furthermore, we should look outside academia for outside-the-box inspiration.

In this post, I'll list a few tools that are not conventionally used in bioinformatic analysis, but have the potential to be effective.

  Friday, 26 October 2012 at 4:21 pm

Python's collections module provides a set of specialized container datatypes that can be extremely useful as an extension of the basic built-in structures (list, dict, tuple):

  • Default dictionaries
  • Named tuples
  • Ordered dictionaries

These structures were added over several releases: defaultdict in Python 2.5, namedtuple in 2.6, and OrderedDict in 2.7. Import them from the collections module:

from collections import defaultdict, namedtuple, OrderedDict

Descriptions and code examples of these structures are in this post.
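A quick taste of each (the data below is toy data of my own, just for illustration):

```python
from collections import defaultdict, namedtuple, OrderedDict

# defaultdict: missing keys get a default value -- handy for tallying
base_counts = defaultdict(int)
for base in "ATATGCGC":
    base_counts[base] += 1
print(base_counts["A"])   # 2
print(base_counts["N"])   # 0 -- no KeyError for unseen keys

# namedtuple: a lightweight record with fields accessible by name
Gene = namedtuple("Gene", ["name", "chrom", "start", "end"])
g = Gene("myGene", "chr1", 1000, 5000)
print(g.chrom)            # chr1

# OrderedDict: remembers the order in which keys were inserted
d = OrderedDict()
d["first"] = 1
d["second"] = 2
print(list(d.keys()))     # ['first', 'second']
```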

  Monday, 22 October 2012 at 11:49 am

I recently came across a paper mentioned in this Biostar post:

Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Wagner GP, Kin K, Lynch VJ. Theory Biosci. 2012.

The paper argues that RPKM (reads per kilobase per million mapped reads) is an inconsistent measure of mRNA abundance across samples and proposes an alternative measure, TPM (transcripts per million).

In this post, I'll go over RPKMs and how they are inconsistent among samples as explained by the paper.
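To make the two measures concrete, here is a toy implementation of my own (a sketch, not code from the paper): RPKM divides a gene's read count by gene length in kilobases and by total mapped reads in millions, while TPM first divides counts by length and then rescales so the values sum to one million.

```python
def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads."""
    total_reads = float(sum(counts))
    return [c / (l / 1000.0) / (total_reads / 1e6)
            for c, l in zip(counts, lengths)]

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then rescale."""
    rates = [c / float(l) for c, l in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]

# Toy example: two samples over the same three genes
lengths = [1000, 2000, 4000]
sample1 = [100, 200, 400]
sample2 = [100, 200, 800]

# TPM values always sum to one million, so they are comparable across
# samples; the sum of RPKM values drifts with each sample's read
# composition, which is the inconsistency the paper describes.
print(sum(tpm(sample1, lengths)), sum(tpm(sample2, lengths)))
print(sum(rpkm(sample1, lengths)), sum(rpkm(sample2, lengths)))
```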

update (October 23, 2012)

Here is a cool paper comparing 7 popular methods of normalizing tag counts:

Marie-Agnès Dillies, Andrea Rau, Julie Aubert, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics. doi:10.1093/bib/bbs046.

  Wednesday, 17 October 2012 at 10:24 am

I recently wrote a script to bootstrap correlations between RNA-seq datasets. The script randomly chooses 50 pairs of expression values from two datasets and calculates a Spearman correlation. It then repeats this process 1000 times, resulting in a file of 1000 correlation values.
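The heart of the script looks something like this (a sketch of the approach, not the original code; the function and parameter names are my own):

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman(x, y, n_pairs=50, n_boot=1000, seed=0):
    """Sample n_pairs expression values (with replacement) from two
    datasets and record the Spearman correlation, n_boot times."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.RandomState(seed)
    corrs = []
    for _ in range(n_boot):
        idx = rng.randint(0, len(x), n_pairs)   # random gene indices
        rho = spearmanr(x[idx], y[idx])[0]
        corrs.append(rho)
    return corrs
```

Writing the returned list to disk, one value per line, gives the file of 1000 correlations described above.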

Input to this script is RNA-seq data in CSV format for 10 conditions. I wanted all non-redundant pair-wise bootstrap correlations among the 10 conditions, so I used the itertools.combinations function to generate every possible pair.
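For reference, combinations with r=2 over 10 condition names yields exactly the 45 non-redundant pairs (the condition names here are placeholders):

```python
from itertools import combinations

conditions = ["cond%d" % i for i in range(1, 11)]
pairs = list(combinations(conditions, 2))
print(len(pairs))    # 45, i.e. 10 choose 2
print(pairs[0])      # ('cond1', 'cond2')
print(pairs[-1])     # ('cond9', 'cond10')
```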

In total, the script took 48.7 seconds to run, producing 45 files, one per pair-wise comparison.

I wasn't satisfied with the run speed, so I looked up how to run jobs in parallel in Python and rewrote the script. The difference between the single-thread and multi-thread scripts is literally 5 lines: instead of running the bootstrap correlation on each comparison one after another, I ran 10 at a time with the multiprocessing module.

The multi-threaded version turned out to take 1.09 seconds to run on 10 threads!

I have no idea why 10 threads improved the run-time by almost 50X. I am running this script on a CentOS server with 24 cores and plenty of RAM.

The two versions of the bootstrap correlation script are included here. The scripts are commented with what the input data should look like. Requires Python 2.6+, numpy, and scipy.

single-thread version

multi-thread version

update (October 17th, 2012)

As Aaron correctly pointed out in the comments, the multi-thread script is actually running all the processes at once: the nested for loops launch 10 processes at a time, but they do not wait for the first 10 to finish before launching the next 10.

And actually, running 10 processes at a time isn't how it should work in the first place. I should divide the total number of jobs by the number of threads, round, and hand each thread that many jobs. For the 45 jobs in this bootstrap script spread over 10 threads, that works out to 4.5 jobs per thread: rounding up gives 5 jobs per thread across 9 threads; rounding down gives 4 jobs per thread across 11 threads (with the last thread taking 5).

Here is a corrected multi-threading script that takes 5.8 seconds to run. I guess this is more in line with what I was expecting. Good to know that I can launch all 45 threads at once and finish in ~1 second, though.

updated multi-thread script

  Friday, 12 October 2012 at 11:20 am

I've been trying to deal with the problem of having TOO much data recently. When you have several sets of annotations for a genome, how do you discard redundant annotations? One solution might be to just use Cuffmerge, which will combine various annotations into one merged set.

There really is no simple solution to this problem. A step in the right direction would be to cluster similar annotations together. In this post, I will present a way to calculate a correlation coefficient for annotations and subsequently use MCL to cluster the annotations.