Monday, 30 April 2012 at 4:51 pm

I am going to try to explain the basics of data correlation. I am no mathematician, so hopefully I'll be able to convey it in a more intuitive way. 

I am sure we've all calculated correlation at some point in our scientific careers. One common task in RNA-seq analysis is to calculate the correlation of tag counts between two replicate samples to check for technical or biological errors. This correlation value falls between -1 and 1: 1 means directly correlated, 0 means not correlated at all, and -1 means inversely correlated.

Let's take a very simple example of two lines (red and blue):

A. Linear relationship, high correlation
B. Zero correlation
C. Inverse relationship, negative correlation

Correlation is basically the cosine of the angle between the two lines. Remember that, with angles in degrees, cosine(0) = 1, cosine(90) = 0 and cosine(180) = -1.

The angle between the two lines is very small in figure A (<90), indicating high correlation; the angle is 90 degrees in figure B, indicating no correlation; and the angle is very large (>90) in figure C, indicating negative correlation. 
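To make the three cases concrete, here is a minimal Python 3 sketch that evaluates the cosine at the three reference angles. The helper name cosine_of_degrees is just something I made up for illustration.

```python
import math

def cosine_of_degrees(angle_in_degrees):
    """Return the cosine of an angle given in degrees."""
    # math.cos expects radians, so convert first.
    return math.cos(math.radians(angle_in_degrees))

# 0 degrees: same direction, 90 degrees: perpendicular, 180 degrees: opposite.
for angle in (0, 90, 180):
    # Rounding hides the tiny floating point error at 90 degrees.
    print(angle, round(cosine_of_degrees(angle), 6))
```

Any angle below 90 degrees gives a positive value and any angle above 90 degrees gives a negative one, which matches figures A and C.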

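Correlation between two RNA-seq samples is basically the cosine of the angle between the two samples, with each sample treated as a vector of tag counts. (Strictly speaking, this is the uncentered version; Pearson's correlation is the same cosine calculated after subtracting each sample's mean from its values.) That's pretty much the gist of it. In the rest of this blog post, I'll go into the mathematics of calculating it.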

 

Geometry

Take the following two vectors, represented by the red line, ending at (2, 5), and the blue line, ending at (4, 2):

The correlation is equal to cosine (r), where r is the angle between the two vectors. Let's review our geometry and calculate all the variables:

  • a = the magnitude of the red vector = Pythagorean theorem = sqrt ((2 * 2) + (5 * 5)) = sqrt (29)
  • b = the magnitude of the blue vector = Pythagorean theorem = sqrt ((4 * 4) + (2 * 2)) = sqrt (20)
  • d = the distance between the two end points = distance formula = sqrt ((4 - 2)^2 + (2 - 5)^2) = sqrt (13)
  • r = law of cosines = arccos (((a * a) + (b * b) - (d * d)) / (2 * a * b))
  • cos (r) = ((a * a) + (b * b) - (d * d)) / (2 * a * b) = (29 + 20 - 13) / (2 * sqrt (29) * sqrt (20)) = 0.75 (rounded)
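The list above can be checked numerically. This is a small Python 3 sketch of the same law of cosines calculation; the variable names (red, blue, cos_r, r_degrees) are mine and only there for illustration.

```python
import math

red = (2, 5)   # end point of the red vector
blue = (4, 2)  # end point of the blue vector

# Magnitudes via the Pythagorean theorem.
a = math.hypot(red[0], red[1])    # sqrt(29)
b = math.hypot(blue[0], blue[1])  # sqrt(20)

# Distance between the two end points.
d = math.hypot(blue[0] - red[0], blue[1] - red[1])  # sqrt(13)

# Law of cosines, solved for the cosine of the angle r.
cos_r = (a ** 2 + b ** 2 - d ** 2) / (2 * a * b)
r_degrees = math.degrees(math.acos(cos_r))

print(round(cos_r, 4))      # 0.7474
print(round(r_degrees, 1))  # 41.6
```

So the angle between the two vectors is a bit under 42 degrees, and its cosine is the 0.75 from the list once rounded to two decimals.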


Dot product

An easier way to calculate all of the above is to use the dot product, which is defined as:

  • Given two vectors a and b with n dimensions
  • dot product = a1 * b1 + a2 * b2 + a3 * b3 + ... + an * bn

For example given two vectors:

  • a = (2,5) and b = (4,2)
  • dot product = 2 * 4 + 5 * 2 = 18

Knowing all of that, we can simplify our correlation equation to:

  • correlation = dot product / (magnitude of a * magnitude of b)
  • cos (r) = dot product / (magnitude of a * magnitude of b) = (2 * 4 + 5 * 2) / (sqrt (29) * sqrt (20)) = 0.75 (rounded)


Python implementation

A simple Python implementation of the above (written for Python 3) would be:

vectorA = [2, 5]
vectorB = [4, 2]

# Dot product: multiply matching elements and add up the results
dotProduct = sum(a * b for a, b in zip(vectorA, vectorB))
# Magnitude: square root of the sum of squared elements
magnitudeA = sum(v ** 2 for v in vectorA) ** 0.5
magnitudeB = sum(v ** 2 for v in vectorB) ** 0.5
correlation = dotProduct / (magnitudeA * magnitudeB)

print(dotProduct)
print(magnitudeA)
print(magnitudeB)
print(correlation)
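The same calculation can be wrapped in a reusable function. The sketch below is written for Python 3; the function name cosine_similarity and its two error checks are my own additions for illustration, not part of the code above.

```python
import math

def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length vectors."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same number of dimensions")

    dot_product = sum(x * y for x, y in zip(vector_a, vector_b))
    magnitude_a = math.sqrt(sum(x * x for x in vector_a))
    magnitude_b = math.sqrt(sum(y * y for y in vector_b))

    # The angle is undefined if either vector has zero length.
    if magnitude_a == 0 or magnitude_b == 0:
        raise ValueError("cosine is undefined for a zero-length vector")

    return dot_product / (magnitude_a * magnitude_b)

print(round(cosine_similarity([2, 5], [4, 2]), 4))  # 0.7474
```

The length check matters for real data: two samples can only be compared if they list the same transcripts in the same order.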


Higher dimensions

This will also work for higher-dimensional data sets. Replicate samples of RNA-seq data covering 25,000 transcripts are vectors with 25,000 dimensions, one per transcript. The dot product and magnitude are calculated the same way as for a two-dimensional vector. For example (assuming normalization has been performed):

         Dataset A   Dataset B
geneA    52          12
geneB    31          41
geneC    54          341
geneD    54          3412

  • Vector for dataset A = [52, 31, 54, 54]
  • Vector for dataset B = [12, 41, 341, 3412]
  • Dot product = 52 * 12 + 31 * 41 + 54 * 341 + 54 * 3412
  • Magnitude of dataset A = sqrt (52 * 52 + 31 * 31 + 54 * 54 + 54 * 54)
  • Magnitude of dataset B = sqrt (12 * 12 + 41 * 41 + 341 * 341 + 3412 * 3412)
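
To finish the arithmetic for this four-gene example, here is a short Python 3 sketch using the values from the table. The variable names are mine; the numbers come straight from the table above.

```python
import math

dataset_a = [52, 31, 54, 54]
dataset_b = [12, 41, 341, 3412]

# Same recipe as in two dimensions, just with more terms in each sum.
dot_product = sum(x * y for x, y in zip(dataset_a, dataset_b))
magnitude_a = math.sqrt(sum(x * x for x in dataset_a))
magnitude_b = math.sqrt(sum(y * y for y in dataset_b))
correlation = dot_product / (magnitude_a * magnitude_b)

print(dot_product)            # 204557
print(round(correlation, 3))  # 0.612
```

Nothing in the code depends on the number of genes, so the same lines would work on two lists of 25,000 values.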






