Thursday, 27 September 2012 at 3:37 pm

Correlations are commonly calculated in RNA-seq data to check whether there is any dependence between two sets of data. For example, technical or biological replicates are expected to be closely correlated, i.e. to have similar expression values for transcripts between two samples. For a more in-depth explanation of correlation, check out my previous post.

Pearson and Spearman are two widely used methods for calculating correlation. In this post, I will attempt to explain the differences between the two methods and their usage in RNA-seq data.

Pearson vs Spearman correlation

My previous blog post gave a geometric interpretation of Pearson's correlation. It is basically the cosine of the angle between the two data sets being compared, when each is treated as a mean-centered vector. The cosine of a small angle approaching 0 degrees, signifying high positive correlation, would be close to 1. The cosine of an angle approaching 180 degrees, signifying high negative correlation, would be close to -1. The cosine of a 90 degree angle (orthogonal), signifying no correlation, would be 0.

Spearman correlation is a rank correlation: the two data sets are first ranked, and then a Pearson correlation is calculated from the ranks. Ranking is done by simply ordering the data set and then assigning an increasing number to each data point. For example, the following RNA-seq data for two conditions are ranked by expression value.

Gene  Condition 1 expression value  Condition 2 expression value  Condition 1 rank  Condition 2 rank
A     10                            351                           1                 2
B     22                            2                             2                 1
C     34                            451                           3                 3
D     50                            124495614                     4                 5
E     78                            23541                         5                 4

To calculate the Pearson correlation, we would use the expression values. To calculate the Spearman correlation, we would use the rank numbers.
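As a rough sketch of that difference, the Python below computes both correlations for the five expression values in the table above. The helper names are my own, and the tie handling (tied values share their average rank) is an assumption that does not matter here because the example has no ties.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation from the covariance and the two variances."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

condition1 = [10, 22, 34, 50, 78]
condition2 = [351, 2, 451, 124495614, 23541]

pearson_r = pearson(condition1, condition2)
spearman_rho = spearman(condition1, condition2)

print(average_ranks(condition2))  # [2.0, 1.0, 3.0, 5.0, 4.0]
print(round(pearson_r, 2), round(spearman_rho, 2))  # about 0.24 and 0.8
```

On these five genes the rank-based value is much higher than the value-based one, which is driven almost entirely by the single very large condition 2 value.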

From the above example we can already notice something about using ranks vs raw values. The range of values does not matter after ranking. Notice that gene D's condition 2 value is clearly an outlier, but it is simply represented as the last rank. This is the first important difference between Pearson and Spearman: the presence of outliers will influence your Pearson correlation a lot more than your Spearman correlation.
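A small sketch of why the size of the outlier does not matter once the data are ranked: below, the condition 2 values are ranked twice, once as given and once with gene D's value replaced by a much smaller number that is still the largest in the set. The replacement value of 30000 is hypothetical and only there for illustration, and the helper assumes there are no ties.

```python
def simple_ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

condition2 = [351, 2, 451, 124495614, 23541]
# Hypothetical variant: gene D is far less extreme but still the largest value.
condition2_tamed = [351, 2, 451, 30000, 23541]

ranks_original = simple_ranks(condition2)
ranks_tamed = simple_ranks(condition2_tamed)

print(ranks_original)  # [2, 1, 3, 5, 4]
print(ranks_tamed)     # [2, 1, 3, 5, 4]
```

Because the ranks are identical, any Spearman correlation computed from them is identical too, whereas a Pearson correlation on the raw values would change.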

I want to emphasize that there is no 'best' method of calculating correlation. Various statistical methods are implemented to give you different perspectives on the data. There are no 'correct' methods, just methods that illustrate your point more clearly.

Monotonicity vs linearity

Correlation calculations are used to quantify the amount of dependence that exists between two data sets. As an observation increases in one data set, does it also increase in the other data set? Does it decrease? Does it not matter? Monotonic is a term describing this increasing or decreasing trend, and Spearman correlation is used to determine monotonicity.

What is the difference between monotonicity and linearity? Think of linearity as a type of monotonicity. Since monotonicity only cares about whether one data set increases or decreases with the other, how it increases or decreases doesn't matter. It is very possible to have two data sets that show good monotonicity but bad linearity. For example, two data sets related by an exponential curve would be monotonic, but not linear.
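The exponential case can be checked with a few lines of Python. This is only a sketch on made-up data (x from 1 to 10, y = exp(x)), with helper names of my own choosing and a ranking function that assumes no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation from the covariance and the two variances."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def simple_ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Made-up data: y always increases with x, but along an exponential curve.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

pearson_r = pearson(x, y)
spearman_rho = pearson(simple_ranks(x), simple_ranks(y))

print(round(pearson_r, 2))     # noticeably below 1
print(round(spearman_rho, 2))  # 1.0
```

The ranks of x and y are identical, so the rank-based correlation is perfect, while the value-based correlation is pulled down because the points do not lie on a straight line.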

Pearson correlation is a measure of linearity while Spearman correlation is for measuring monotonicity. 

RNA-seq correlation

What do we care about when looking at two expression profiles? If we are just looking to see how similar the two profiles are to each other, then we would care about linearity.

Two samples that are monotonic but not linear would not tell us how similar the two samples are. It would, however, tell us that certain ranges of expression values tend to correlate differently than other ranges, which might be interesting by itself.

It's not a simple matter of choosing Pearson or Spearman. Both calculations have their pros and cons depending on your data. We've already seen that Pearson correlation can be heavily influenced by outliers and that Spearman correlation doesn't give us a strict measure of linearity.

The following are scatter plots (plotted in R with ggplot2) of a few data sets I am working with. The expression values are normalized log10 scaled tag counts. The first figure shows two closely correlated samples with correlation values of 0.98 (Pearson) and 0.99 (Spearman). Both correlation methods have p-values of pretty much zero.

The red line is the 1:1 diagonal line. Any points near this line have very similar expression values in both samples. The blue line is the regression line for the two data sets. The more tightly the points cluster around the blue line, the higher the Pearson correlation; the closer the blue line is to the red line, the more similar the expression values are between the two samples.
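For readers who want to compare the two lines numerically rather than by eye, here is a minimal sketch. It assumes the blue line is an ordinary least-squares fit of sample 2 on sample 1, which this post does not actually state, and it uses made-up log10 values rather than my real data.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical log10-scaled expression values for two similar samples.
sample1 = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
sample2 = [0.6, 0.9, 1.6, 1.9, 2.6, 2.9]

slope, intercept = fit_line(sample1, sample2)

# The 1:1 diagonal has slope 1 and intercept 0.
print(round(slope, 3), round(intercept, 3))
```

A slope near 1 and an intercept near 0 mean the fitted line sits close to the diagonal.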

This second figure shows two disparate samples with correlation values of 0.22 (Pearson) and 0.88 (Spearman). Both have p-values of zero. 

This second figure is interesting because of the large difference between the Pearson and Spearman correlation values. The Spearman correlation value is high because there is an increasing trend; however, the Pearson correlation is low because this trend is not linear. The presence of large and small outliers probably also contributed to the low Pearson correlation.

There is no single method that will give you the best view of your data set. Different statistical methods give you different perspectives on the same data. In terms of seeing how similar two expression profiles are, I think Pearson correlation is a more relevant measure because we want to know if expression values are the same between two samples (linearity), not just whether they have an increasing or decreasing trend. My advice would be to remove outliers and then use Pearson's correlation. Also make sure you don't have an Anscombe's quartet situation by visually inspecting your data set.
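For anyone who has not met Anscombe's quartet: it is a set of four small data sets that look completely different when plotted but share nearly the same summary statistics. The sketch below uses the published quartet values and a hand-written Pearson helper (the function name is mine) to show that all four give almost the same correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation from the covariance and the two variances."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Anscombe's quartet: data sets I to III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson(x, y) for name, (x, y) in quartet.items()}

for name, r in correlations.items():
    print(name, round(r, 3))  # each is about 0.816
```

Only a scatter plot reveals that one set is roughly linear, one is curved, and two are dominated by a single point, which is why the number alone is not enough.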