Thursday, 27 September 2012 at 3:37 pm

Correlations are commonly calculated in RNA-seq data to check whether there is any dependence between two sets of data. For example, technical or biological replicates are expected to be closely correlated, i.e. to show similar expression values for transcripts between the two samples. For a more in-depth explanation of correlation, check out my previous post.

Pearson and Spearman are two widely used methods for calculating correlation. In this post, I will attempt to explain the differences between the two methods and their usage in RNA-seq data.
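To give a rough feel for the difference, here is a minimal sketch in plain Python. It assumes the usual definitions: Pearson measures linear association on the raw values, while Spearman is Pearson computed on the ranks of the values. The function names and the toy expression values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data (hypothetical): sample_b rises with sample_a, but not linearly.
sample_a = [1, 2, 3, 4, 5]
sample_b = [v ** 3 for v in sample_a]

print("Pearson: ", pearson(sample_a, sample_b))
print("Spearman:", spearman(sample_a, sample_b))
```

Because the relationship in the toy data is monotonic but not linear, Spearman comes out at essentially 1 while Pearson is noticeably lower.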

  Thursday, 12 July 2012 at 09:33 am

Information theory, an interdisciplinary field spanning engineering, physics, bioinformatics, mathematics, and many others, started with Claude E. Shannon's 1948 paper, "A mathematical theory of communication". Advancements in this field have been instrumental in improving communication across the world, from data storage on disc drives to satellite communications.

In this blog post, I will briefly go over what information theory is all about in an intuitive way and its practical application to the field of bioinformatics and evolution. I am not an expert on information theory, so I welcome any corrections.

  Monday, 30 April 2012 at 4:51 pm

I am going to try to explain the basics of data correlation. I am no mathematician, so hopefully I'll be able to convey it in a more intuitive way. 

I am sure we've all calculated correlation at some point in our scientific careers. One common task in RNA-seq analysis is to calculate the correlation of tag counts between two replicate samples to check for technical or biological errors. This correlation value falls between -1 and 1: 1 means directly correlated, 0 means not correlated at all, and -1 means inversely correlated.

Let's take a very simple example of two lines (red and blue):

[Figure: A. Linear relationship, high correlation; B. Zero correlation; C. Inverse relationship, negative correlation]

Correlation is basically the cosine of the angle between the two lines. Remember that, with angles in degrees, cosine(0) = 1, cosine(90) = 0 and cosine(180) = -1.

The angle between the two lines is very small in figure A (<90), indicating high correlation; the angle is 90 degrees in figure B, indicating no correlation; and the angle is very large (>90) in figure C, indicating negative correlation. 

Correlation between two RNA-seq samples would basically be the cosine of the angle between the regression lines of the two RNA-seq data sets. That's pretty much the gist of it. In the rest of the blog post, I'll go into the mathematics of calculating correlation.
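One way to make the cosine picture concrete is to treat each sample as a vector of counts, subtract its mean, and take the cosine of the angle between the two centered vectors; that cosine is the Pearson correlation. The sketch below assumes this vector interpretation, and the function names and toy counts are hypothetical.

```python
import math

def centered(values):
    """Subtract the mean so the vector is centered on zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

def correlation(x, y):
    """Pearson correlation as the cosine between mean-centered vectors."""
    return cosine(centered(x), centered(y))

def angle_degrees(x, y):
    """Angle (in degrees) corresponding to the correlation value."""
    r = max(-1.0, min(1.0, correlation(x, y)))  # guard against rounding
    return math.degrees(math.acos(r))

# Toy tag counts (hypothetical) for three comparisons.
replicate_1 = [1, 2, 3, 4]
replicate_2 = [2, 4, 6, 8]      # same trend: angle near 0
unrelated = [1, -1, -1, 1]      # no linear trend: angle of 90
reversed_trend = [4, 3, 2, 1]   # opposite trend: angle near 180

for other in (replicate_2, unrelated, reversed_trend):
    print(correlation(replicate_1, other), angle_degrees(replicate_1, other))
```

The three comparisons correspond to the three cases described earlier: a small angle for high correlation, 90 degrees for no correlation, and a large angle for negative correlation.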

  Friday, 20 April 2012 at 2:06 pm

Phred quality scores are the de facto standard for evaluating the quality of individual bases in DNA sequences. They were originally used in Phred, a program that calls bases from sequencing trace files. The concept of Phred quality scores has not changed much over the years; however, the scale of the scores has changed due to Illumina's decision to shift the ASCII scale.

In this post, I will talk about what Phred scales represent and what the various scales are.
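As a rough illustration of what a Phred score encodes: the standard definition is Q = -10 * log10(P), where P is the estimated probability that the base call is wrong, and FASTQ files store each score as a single ASCII character shifted by some offset. The sketch below assumes the two offsets commonly seen in practice (33 for Sanger and Illumina 1.8+, 64 for Illumina 1.3 to 1.7); the function names are hypothetical.

```python
import math

def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def phred_to_error_prob(q):
    """Inverse of the above: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset is an assumption about the file's encoding:
    33 for Sanger / Illumina 1.8+, 64 for Illumina 1.3 to 1.7.
    """
    return [ord(char) - offset for char in quality_string]

# The same scores written in the two encodings.
print(decode_quality("II5", offset=33))   # [40, 40, 20]
print(decode_quality("hhT", offset=64))   # [40, 40, 20]

# A score of 20 means roughly a 1 in 100 chance the base is wrong.
print(phred_to_error_prob(20))
```

Decoding with the wrong offset shifts every score by 31, which is why it matters to know which scale a file uses.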

  Thursday, 05 April 2012 at 11:45 am

CDHIT is a program commonly used to cluster nucleotide/protein sequences. It is used routinely by NCBI to get rid of redundant sequences in the NR (non-redundant) database. It is extremely fast compared to a traditional all-vs-all BLAST search followed by pairwise clustering.

I am going to attempt to explain the algorithm behind CDHIT and the associated advantages and disadvantages.

  Monday, 26 March 2012 at 5:47 pm

Anyone who deals with .sam mapping files has seen the bitwise flag column. It is a single number that can somehow indicate settings for a bunch of different parameters. As per the SAM file specification:

Bit    Description
0x1 template having multiple segments in sequencing
0x2 each segment properly aligned according to the aligner
0x4 segment unmapped
0x8 next segment in the template unmapped
0x10 SEQ being reverse complemented
0x20 SEQ of the next segment in the template being reversed
0x40 the first segment in the template
0x80 the last segment in the template
0x100 secondary alignment
0x200 not passing quality controls
0x400 PCR or optical duplicate

What does all that mean, and how do bitwise flags work? This post will probably go into more detail than you need to understand bitwise flags.
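As a quick preview, here is a minimal sketch of how a flag value can be decoded. Each row of the table above is one bit of the number, so a flag is just the sum of the bits that are switched on, and a bitwise AND tests whether a given bit is set. The table contents come from the list above; the names `SAM_FLAG_BITS` and `decode_flag` are hypothetical.

```python
# (bit, description) pairs copied from the table above.
SAM_FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reversed"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality controls"),
    (0x400, "PCR or optical duplicate"),
]

def decode_flag(flag):
    """Return the descriptions of every bit that is set in the flag."""
    return [description for bit, description in SAM_FLAG_BITS if flag & bit]

# 99 = 0x1 + 0x2 + 0x20 + 0x40 (1 + 2 + 32 + 64)
for description in decode_flag(99):
    print(description)
```

For a flag of 99 this prints the descriptions for bits 0x1, 0x2, 0x20 and 0x40, and no others.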