I am going to try to explain the basics of data correlation. I am no mathematician, so hopefully I'll be able to convey it in a more intuitive way.
I am sure we've all calculated correlation at some point in our scientific careers. One common task in RNA-seq analysis is to calculate the correlation of tag counts between two replicate samples to check for technical or biological errors. The correlation value falls between -1 and 1: 1 means the samples are directly correlated, 0 means they are not correlated at all, and -1 means they are inversely correlated.
Let's take a very simple example of two lines (red and blue):
[Figure, three panels: A. Linear relationship, high correlation | B. Zero correlation | C. Inverse relationship, negative correlation]
Correlation is basically the cosine of the angle between the two lines. Remember that cosine(0°) = 1, cosine(90°) = 0 and cosine(180°) = -1.
The angle between the two lines is small in figure A (well under 90 degrees), indicating high correlation; the angle is 90 degrees in figure B, indicating no correlation; and the angle is large (over 90 degrees) in figure C, indicating negative correlation.
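To make the three reference angles concrete, here is a minimal sketch using Python's standard `math` module. The helper name `cosine_of_degrees` is made up for this illustration; the only point is that the cosine slides from 1 down to -1 as the angle opens from 0 to 180 degrees.

```python
import math

def cosine_of_degrees(angle_in_degrees):
    """Return the cosine of an angle given in degrees (hypothetical helper)."""
    return math.cos(math.radians(angle_in_degrees))

# Small angle -> close to 1, right angle -> 0, wide angle -> negative
for angle in (0, 30, 90, 150, 180):
    # Rounding hides the tiny floating-point error at 90 degrees
    print(angle, round(cosine_of_degrees(angle), 3))
```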
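Correlation between two RNA-seq samples would basically be the cosine of the angle between the two data sets when each one is treated as a vector of counts. (Strictly speaking, the cosine computed without any centering is called cosine similarity; the Pearson correlation coefficient is the same cosine taken after subtracting each sample's mean from its values.) That's pretty much the gist of it. In the rest of the blog post, I'll go into the mathematics of calculating it.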
Geometry
Take the example of the following two vectors, both starting at the origin: the red line ends at (2, 5) and the blue line ends at (4, 2).
The correlation is equal to cosine (r), where r is the angle between the two vectors. Let's review our geometry and calculate all the variables:
- a = the magnitude of the red vector = Pythagorean theorem = sqrt ((2 * 2) + (5 * 5)) = sqrt (29)
- b = the magnitude of the blue vector = Pythagorean theorem = sqrt ((4 * 4) + (2 * 2)) = sqrt (20)
- d = the distance between the two end points = distance formula = sqrt ((4 - 2)^2 + (2 - 5)^2) = sqrt (13)
- r = law of cosines = arccos (((a * a) + (b * b) - (d * d)) / (2 * a * b))
- cos (r) = ((a * a) + (b * b) - (d * d)) / (2 * a * b) = (29 + 20 - 13) / (2 * sqrt (29) * sqrt (20)) = 36 / (2 * sqrt (580)) ≈ 0.75
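The steps above can be checked numerically. This is only a sketch of the same law-of-cosines calculation; the variable names `red` and `blue` are my own labels for the two end points, and `math.dist` needs Python 3.8 or newer.

```python
import math

red = (2, 5)   # end point of the red vector
blue = (4, 2)  # end point of the blue vector

a = math.hypot(*red)        # magnitude of the red vector, sqrt(29)
b = math.hypot(*blue)       # magnitude of the blue vector, sqrt(20)
d = math.dist(red, blue)    # distance between the end points, sqrt(13)

# Law of cosines, solved for the cosine of the angle between a and b
cos_r = (a ** 2 + b ** 2 - d ** 2) / (2 * a * b)
r_degrees = math.degrees(math.acos(cos_r))

print(round(cos_r, 2))      # about 0.75
print(round(r_degrees, 1))  # the angle itself, in degrees
```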
Dot product
An easier way to calculate all of the above is to just use a dot product. A dot product is defined as:
- Given two vectors a and b with n dimensions
- dot product = a1 * b1 + a2 * b2 + a3 * b3 + ... + an * bn
For example given two vectors:
- a = (2,5) and b = (4,2)
- dot product = 2 * 4 + 5 * 2 = 18
Knowing all of that, we can simplify our correlation equation to:
- correlation = dot product / (magnitude of a * magnitude of b)
- cos (r) = dot product / (a * b) = (2 * 4 + 5 * 2) / (sqrt (29) * sqrt (20)) = 18 / sqrt (580) ≈ 0.75
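For anyone wondering why the dot product can stand in for the law of cosines from the Geometry section, here is a short derivation sketch. The notation is mine: bold letters are the vectors, double bars are their magnitudes, and d is the distance between the two end points.

```latex
\begin{aligned}
d^2 &= (a_1 - b_1)^2 + (a_2 - b_2)^2 \\
    &= (a_1^2 + a_2^2) + (b_1^2 + b_2^2) - 2\,(a_1 b_1 + a_2 b_2) \\
    &= \lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - 2\,(\mathbf{a} \cdot \mathbf{b}) \\[4pt]
\cos r &= \frac{\lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - d^2}
               {2\,\lVert \mathbf{a} \rVert\,\lVert \mathbf{b} \rVert}
        = \frac{\mathbf{a} \cdot \mathbf{b}}
               {\lVert \mathbf{a} \rVert\,\lVert \mathbf{b} \rVert}
\end{aligned}
```

With the example numbers, 29 + 20 - 13 = 36, which is exactly twice the dot product of 18, so both routes give the same cosine.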
Python implementation
A simple Python implementation of the above would be:

    vectorA = [2, 5]
    vectorB = [4, 2]

    # Dot product: sum of the element-wise products
    dotProduct = sum(a * b for a, b in zip(vectorA, vectorB))
    # Magnitude: square root of the sum of squares
    magnitudeA = sum(v ** 2 for v in vectorA) ** 0.5
    magnitudeB = sum(v ** 2 for v in vectorB) ** 0.5
    correlation = dotProduct / (magnitudeA * magnitudeB)

    print(dotProduct)
    print(magnitudeA)
    print(magnitudeB)
    print(correlation)
Higher dimensions
This will also work for higher-dimensional data sets. Replicate samples of RNA-seq data covering 25,000 transcripts are vectors with 25,000 dimensions. The dot product and magnitude are calculated the same way as for a two-dimensional vector. For example (assuming normalization has been performed):
Gene | Dataset A | Dataset B |
geneA | 52 | 12 |
geneB | 31 | 41 |
geneC | 54 | 341 |
geneD | 54 | 3412 |
- Vector for dataset A = [52, 31, 54, 54]
- Vector for dataset B = [12, 41, 341, 3412]
- Dot product = 52 * 12 + 31 * 41 + 54 * 341 + 54 * 3412
- Magnitude of dataset A = sqrt (52 * 52 + 31 * 31 + 54 * 54 + 54 * 54)
- Magnitude of dataset B = sqrt (12 * 12 + 41 * 41 + 341 * 341 + 3412 * 3412)
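To finish the four-gene example, the same calculation can be wrapped in a small function that works for any number of dimensions. This is a sketch, not a reference implementation: the name `cosine_similarity` is my own choice, and the function assumes both vectors are the same length and that neither one is all zeros.

```python
import math

def cosine_similarity(xs, ys):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(xs, ys))
    magnitude_x = math.sqrt(sum(x * x for x in xs))
    magnitude_y = math.sqrt(sum(y * y for y in ys))
    # Note: an all-zero vector would cause a division by zero here
    return dot / (magnitude_x * magnitude_y)

# Counts for geneA..geneD, taken from the table above
dataset_a = [52, 31, 54, 54]
dataset_b = [12, 41, 341, 3412]

dot_product = sum(x * y for x, y in zip(dataset_a, dataset_b))
print(dot_product)                                        # 204557
print(round(cosine_similarity(dataset_a, dataset_b), 2))  # 0.61
```

The same function handles a 25,000-transcript vector without any changes, since nothing in it depends on the number of dimensions.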