Monday, 30 April 2012 at 4:51 pm

I am going to try to explain the basics of data correlation. I am no mathematician, so hopefully I'll be able to convey it in a more intuitive way. 

I am sure we've all calculated correlation at some point in our scientific careers. One common task in RNA-seq analysis is to calculate the correlation of tag counts between two replicate samples to check for technical or biological errors. This correlation value falls between -1 and 1, with 1 meaning directly correlated, 0 meaning not correlated at all, and -1 meaning inversely correlated.

Let's take a very simple example of two lines (red and blue):

[Figure: A. Linear relationship, high correlation; B. Zero correlation; C. Inverse relationship, negative correlation]

Correlation is basically the cosine of the angle between the two lines. Remember that cosine(0°) = 1, cosine(90°) = 0 and cosine(180°) = -1.

The angle between the two lines is very small in figure A (<90), indicating high correlation; the angle is 90 degrees in figure B, indicating no correlation; and the angle is very large (>90) in figure C, indicating negative correlation. 

Correlation between two RNA-seq samples would basically be the cosine of the angle between the regression lines of the two RNA-seq data sets. That's pretty much the gist of it. In the rest of the blog post, I'll go into the mathematics of calculating correlation.
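To make the geometry concrete, here is a small Python sketch (the function name and toy data are mine, not from any particular package) showing that the Pearson correlation of two samples is exactly the cosine of the angle between their mean-centered vectors:

```python
import math

def pearson(x, y):
    """Pearson correlation = cosine of the angle between mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mx for v in x]           # center x around its mean
    yc = [v - my for v in y]           # center y around its mean
    dot = sum(a * b for a, b in zip(xc, yc))
    norm = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return dot / norm                  # cos(theta)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))   # 1.0  (angle 0, direct correlation)
print(pearson(x, [10, 8, 6, 4, 2]))   # -1.0 (angle 180, inverse correlation)
```

The centering step is why correlation ignores each sample's overall expression level and only measures how the two vary together.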

  Friday, 20 April 2012 at 2:06 pm

Phred quality scores are the de facto standard for evaluating the quality of individual bases in DNA sequences. They were used originally in Phred, a program for calling bases from sequencing trace files. The concept of Phred quality scores has not changed much over the years; however, the scale of the scores has changed due to Illumina's decision to shift the ASCII offset.

In this post, I will talk about what Phred scales represent and what the various scales are.
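As a quick preview of the underlying math (a minimal sketch; the helper names are mine): a Phred score Q encodes the base-calling error probability P as Q = -10·log10(P), and in FASTQ files Q is stored as a single ASCII character at some offset — 33 for Sanger/modern Illumina, 64 for older Illumina pipelines:

```python
def phred_from_char(c, offset=33):
    """Decode an ASCII quality character to a Phred score (offset 33 or 64)."""
    return ord(c) - offset

def error_prob(q):
    """Q = -10*log10(P)  =>  P = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)

print(phred_from_char('I'))        # 40 under Phred+33
print(error_prob(40))              # 0.0001, i.e. 1 error in 10,000 calls
```

The same character decodes to different scores under different offsets, which is exactly why the scale shift matters.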

  Wednesday, 18 April 2012 at 7:25 pm

view the demo here
HTML source is at the bottom of the post

In parts 1 and 2 of this series, I showed you how to graph a simple chart and how to use Chrome's developer tools to more effectively debug and test your code. In this part of the series, I will show you how to add interactivity to your data visualizations.

The key method in d3 used for adding interactivity is the '.on' method. I will go through a few examples of how to use the '.on' method and briefly go over some more advanced topics like event bubbling and using mouse coordinates.

  Monday, 16 April 2012 at 2:21 pm

An article came out a few days ago on the topic of sharing source code as part of publications:

Shining Light into Black Boxes, Science 13 April 2012: Vol. 336 no. 6078 pp. 159-160 DOI: 10.1126/science.1218263

With the wealth of sequencing data available comes more room for mistakes. The recent controversy surrounding RNA editing (M. Li et al. Science doi:10.1126/science.1207018; 2011) is a prime example of this 'black box' of computational analysis. A summary of how the analysis was performed is just not good enough nowadays.

I agree with the article that scripts/software should be open sourced. But I also wonder how the journals will determine whether a script is acceptable for publication. Are bioinformaticians going to have to spend extra time turning scripts into a pipeline? Are all analytical scripts going to have to be curated and packaged? Do we run into the danger of turning the biology source code landscape into a mess of spaghetti code by mandating that everyone submit their own version of even simple operations?

  Tuesday, 10 April 2012 at 4:51 pm

I've been working on our lab website the past couple of days. I am happy to say that it is pretty much done. We just have to add in text and content. Like this blog, it uses PivotX as a backend CMS. For the front-page slideshow, I used TinySlideshow. It's not the most feature-rich slideshow library, but it is the simplest and doesn't depend on jQuery like most others do.

Check it out here:

I am glad all those years of screwing around with web development/design have not been wasted....

  Tuesday, 10 April 2012 at 10:47 am

Note, April 11th, 2012: I decided to not use any of the data I generated in this analysis, as I felt I was overcomplicating the problem.

I gave an overview of the 3 planarian transcriptomes in my last blog post. Since we do not have immediate access to all the raw data that went into the transcriptomes, we have to resort to merging the assembled transcriptomes according to their strengths and weaknesses.

Here are some thoughts on the 3 transcriptomes that'll need to be addressed when performing the merge:

  • The BIMSB transcriptome contains the most full-length, non-redundant set of transcripts, but it also has the fewest transcripts.
  • The AAA and Heidelberg transcriptomes have better coverage of the transcriptome due to high depth of sequencing.
  • The BIMSB and AAA transcriptomes both have a small population of very short sequences. It looks like Heidelberg discarded transcripts below 100 base pairs.
  • The Heidelberg transcriptome contains a lot of Ns.
  • There may be some isoform information contained in all the transcriptomes.
  • The AAA transcriptome may contain elements of both asexual and sexual sequences since the SOLiD reads were reference assembled.
  • Strandedness in the BIMSB and Heidelberg transcripts is mainly based on ORF or homology evidence.

  Friday, 06 April 2012 at 12:13 pm

There are plenty of helpful shell commands one can use for working with tab delimited files. The five most commonly used ones for me are:

  1. cut - extract columns (tab delimited by default)
  2. grep - search for a string in each line
  3. uniq - collapse repeated adjacent lines, optionally counting them
  4. wc - count words, lines
  5. sort - sort your file

Using these commands with pipes allows you to do many useful things:

#output the first column of your tab delimited file
cut -f 1

#output all lines that have the word 'gene' in it
grep 'gene'  

#output number of lines that have the word "gene" in them
grep -c "gene"

#output number of lines that have the word "gene" in the third column
#this works on a GTF/GFF file and will output how many gene features there are in the file
cut -f 3 | grep -c "gene" 

#output number of each type of word in the third column
#If you run this on a GTF/GFF file, you'll get number of features for each feature type
cut -f 3 | sort | uniq -c

#output number of unique elements in column 1
#useful for checking how many queries had blast result without redundancy in a tab delimited blast output
cut -f 1 | sort | uniq -c | wc -l

  Friday, 06 April 2012 at 10:11 am

GFF and GTF are data formats heavily used for storing annotation information. It's common to see these two formats used interchangeably. However, GFFs (general feature format) are actually meant to be used for any genomic feature, while GTF (gene transfer format) is strictly used for genes. 

Both of these formats are very similar, making conversion pretty simple. The only problem in conversion arises when the lines of the GFF file are not arranged in feature blocks. This entry will show you the differences between these two formats and how to interconvert between them.
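The most visible difference is the ninth column: GTF writes attributes as `key "value";` pairs, while GFF3 uses `key=value` pairs separated by semicolons. Here is a minimal sketch of that attribute conversion (my own helper; it ignores the escaping and ID/Parent bookkeeping a real converter needs):

```python
import re

def gtf_attrs_to_gff3(attrs):
    """Turn a GTF attribute field (gene_id "g1"; ...) into GFF3 style (gene_id=g1;...)."""
    # each attribute is a bare key followed by a double-quoted value
    pairs = re.findall(r'(\S+)\s+"([^"]*)"', attrs)
    return ';'.join('%s=%s' % (k, v) for k, v in pairs)

print(gtf_attrs_to_gff3('gene_id "g1"; transcript_id "g1.t1";'))
# gene_id=g1;transcript_id=g1.t1
```

The other eight columns (seqid, source, feature, start, end, score, strand, frame) are the same in both formats, which is why conversion is mostly an attribute-column rewrite.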

  Thursday, 05 April 2012 at 11:45 am

CDHIT is a program commonly used to cluster nucleotide/protein sequences. It is used routinely by NCBI to get rid of redundant sequences in the NR (non-redundant) database. It is extremely fast compared to a traditional all-vs-all BLAST and subsequent pairwise clustering.

I am going to attempt to explain the algorithm behind CDHIT and the associated advantages and disadvantages.
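The core idea is greedy incremental clustering: sort sequences longest first, make the longest one a cluster representative, and assign each subsequent sequence to the first representative it matches above an identity threshold (otherwise it founds a new cluster). Here is a toy Python sketch of that loop; the crude positional identity function is my stand-in for CD-HIT's fast short-word filtering, not the real algorithm:

```python
def identity(a, b):
    """Crude stand-in: fraction of matching positions over the shorter length."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / float(n)

def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering: longest sequences become representatives."""
    reps, clusters = [], []
    for s in sorted(seqs, key=len, reverse=True):
        for i, r in enumerate(reps):
            if identity(r, s) >= threshold:   # close enough: join this cluster
                clusters[i].append(s)
                break
        else:                                  # matched no representative
            reps.append(s)
            clusters.append([s])
    return clusters

print(greedy_cluster(["ATCGATCGAT", "ATCGATCGAA", "GGGGG"]))
# [['ATCGATCGAT', 'ATCGATCGAA'], ['GGGGG']]
```

The speed of the real program comes from counting shared short words to reject most pairs before doing any alignment, which is what I'll get into below.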

  Wednesday, 04 April 2012 at 5:37 pm

You can define a dictionary by using a list of key/value tuples. For example, the following dictionary:

myDict = {'keyA':'valueA','keyB':'valueB'}

Can be defined by:

myDict = dict([('keyA','valueA'),('keyB','valueB')]) 

This can be very useful when you want to parse text files into dictionaries. For example, a tab-delimited file where the first column is the key and the rest of the columns are values:

myDict = dict([(line.rstrip('\n').split('\t')[0], line.rstrip('\n').split('\t')[1:]) for line in open('pathToFile','r')])

This single line will create a dictionary where the first column is the key and the value is a list of the remaining columns.
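A quick runnable check with a throwaway file (the file contents are made up for the demo; note the rstrip('\n'), which keeps the trailing newline out of the last column):

```python
import os
import tempfile

# write a tiny tab-delimited file to parse
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    f.write("geneA\t10\t20\ngeneB\t5\t7\n")

myDict = dict([(line.rstrip('\n').split('\t')[0],
                line.rstrip('\n').split('\t')[1:])
               for line in open(path, 'r')])
os.remove(path)

print(myDict)  # {'geneA': ['10', '20'], 'geneB': ['5', '7']}
```

Without the rstrip, the last value on each line would come out as '20\n' and '7\n'.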

  Tuesday, 03 April 2012 at 11:29 am

In terms of sequences, there is currently quite a lot of data in the planarian (Schmidtea mediterranea) field. We have an assembled genome from University of Washington's genome institute and various transcriptome assemblies using different sequencing platforms. The four main transcriptome assemblies that we have are:

I am going to briefly go over the various transcriptomes and give some thoughts on what can be improved.

  Tuesday, 03 April 2012 at 11:11 am

arXiv was set up back in the early '90s predominantly for physicists and engineers to publish preprints of their work. Nature Precedings was an effort to emulate that open publishing model, and it recently announced its termination.

In recent years, arXiv has added a quantitative biology section with a lot of interesting papers. Since arXiv is not strictly peer reviewed, there are bound to be some iffy papers, but generally they are a good read. I also find that since most of the authors do not come from a biological background, they tend to explain things in less technical terms, making some of the methods they use easier to understand.

Here are some interesting papers; some have been published in other journals and others only on arXiv:

  • Abolfazl Motahari, Guy Bresler and David Tse. Information Theory of DNA sequencing. arXiv:1203.6233.
  • Wang Liang. Segmenting DNA sequence into 'words' based on statistical language model. arXiv:1202.2518
  • T. Chandrasekhar, K. Thangavel, E. Elayaraja. Effective Clustering Algorithms for Gene Expression Data. arXiv:1201.4914
  • Gang Fang, Michael Steinbach, Chad L. Myers, Vipin Kumar. Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach. arXiv:1101.3474
  • Lior Pachter. Models for transcript quantification from RNA-Seq. arXiv:1104.3889. (very math heavy, but pretty good overview from what I can understand)

I'll throw this one in just to demonstrate the convergence of sciences we have on arXiv: Fabio Pichierri. Quantum Proteomics. arXiv:1107.5853. I have no idea about the quality of this paper since I don't understand most of it, but it does have a cool title.


  Sunday, 01 April 2012 at 11:18 am

In part 1 of this series of posts, I showed you how to graph a simple bar graph. In this post, I will show you how to use Chrome's developer tools to better debug your code. The JavaScript console in developer tools is an extremely powerful resource that allows you to access and run JavaScript within the browser. You can even run all your code in the console to render a figure if you choose to.

To turn on the developer tools in Chrome, click the wrench icon in the upper right-hand corner of the browser window -> Tools -> Developer tools. Your browser will now have a new partition at the bottom showing the developer tools. The two most useful features in the developer tools are the 'Elements' and 'Console' tabs.