I've been trying to deal with the problem of having TOO much data recently. When you have several sets of annotations for a genome, how do you discard redundant annotations? One solution might be to just use Cuffmerge, which will combined various annotations into one merged set.

There really is no simple solution to this problem. A step in the right direction would be to cluster similar annotations together. In this post, I will present a way to calculate a correlation coefficient for annotations and subsequently use MCL to cluster the annotations.

## Correlation coefficient

To calculate correlation between two annotations, I used Matthew's correlation coefficient (MCC). It is typically used in machine learning as a metric for assessing the quality of the predicted value to the observed value. As a metric for comparing annotations, it takes into account the correctness AND the incorrectness (true/false positives/negatives) of the comparison, allowing for a more balanced score among different comparisons.

The equation for calculating MCC as per wikipedia is:

Where:

- TP = true positives
- TN = true negatives
- FP = false positives
- FN = false negatives

Let's take the example of this pair of annotations, feature A in red and feature B in blue. The annotation coordinates are also marked at the bottom. In terms of feature annotations, true positives and true negatives are:

True positives are the number of base pair overlaps between the two features. True negatives are the number of bases where both features have no annotations (ie. introns).

False positives and false negatives are a bit of a misnomer in the context of comparing annotations. If we assume feature A is the true annotation and feature B is the predicted annotation, false positives and false negatives would be:

Coordinates 5-10 is considered a false negative because feature B doesn't contain annotations in that region where there should be annotation (remember feature A is assumed to be correct). Conversely, coordinates 20-25 is a false positive because feature B do not contain annotations in that region where there should not be.

However, we are not concerned with whether the annotation is right or not. We are concerned with how similar feature A is to feature B. Notice if we now assume feature B is the true annotation and feature A is the predicted, we would end up with:

The false positives and false negatives are switched compared with the previous assumption that feature A is the true annotation. Let's take this a step further and look at the equation for calculating MCC. If we switched FP and FN around, do we end up with the same MCC value? The answer is yes, we do. Whether feature A or feature B is the true annotation doesn't matter. The MCC value is the same under either assumptions.

MCC, similar to other correlation coefficients is a value between 1 and -1. In terms of annotations, we really only care about a value larger than 0.

## Calculate annotation MCCs in python

There are a few considerations when calculating MCCs for annotations:

- We only care about MCC for annotations that overlap. Annotations that don't overlap will have true positive of 0, making the MCC 0.
- If a pair of overlapping annotations has no intron/exon structure, then there will be 0 true negatives, falsely making the MCC to be 0. In this case, we need to set the true negatives to 1.
- False positives and false negatives in terms of annotations have no meaning unless we are assuming one of the annotations is true. We are really just calculating the number of bases of one annotation that is not present in the other.

import copy def sortCoord(d): coords = [] for coord in d: coords.append(('s',coord[0])) coords.append(('e',coord[1])) coords.sort(key = lambda x : x[0], reverse = True) #sort by coordinate coords.sort(key = lambda x : x[1]) return coords def overlapLength(coordsA, coordsB): c = copy.deepcopy(coordsA) c.extend(coordsB) c = sortCoord(c) count = 0 posA = 0 level = 1 out = [] for pos in c: if pos[0] == 's': count = 1 level += 1 posA = pos[1] if pos[0] == 'e': level -=1 count -=1 if count == 0: if level > 1: out.append((posA, pos[1], level)) return sum([x[1] - x[0] + 1 for x in out])

The overlapLength function takes in two arrays of annotation coordinates. Each array is an array of tuples where first element is the start coordinate and the second element is the end coordinate:

coordsA = [(5,10),(15,20),(25,30)]

coordsB = [(5,10),(15,20),(25,35)]

ovLen = overlapLength(coordsA, coordsB)

#the overlap length (TP) in this case would be 18

Once we have the overlap length, we can calculate the interchangeable FP and FN which are the sum length of an annotation minus the overlap length:

lengthA = sum([x[1] - x[0] + 1 for x in coordsA])

legnthB = sum([x[1] - x[0] + 1 for x in coordsB])

aDifference = lengthA - ovLen

bDifference = lengthB - ovLen

The value for true negative is the total length of the comparison minus ovLength, aDifference, and bDifference:

import copy

combinedCoords = copy.deepcopy(coordsA)

combinedCoords.extend(coordsB)

compLength = max([x[1] for x in combinedCoords]) - min([x[0] for x in combinedCoords]) + 1

TN = compLength - ovLength - aDifference - bDifference

Taken all together, here is the function to calculate MCC:

def calculateMCC(coordsA, coordsB): lengthA = sum([x[1] - x[0] + 1 for x in coordsA]) lengthB = sum([x[1] - x[0] + 1 for x in coordsB]) TP = overlapLength(coordsA, coordsB) if TP > 0: combinedCoords = copy.deepcopy(coordsA) combinedCoords.extend(coordsB) compLength = max([x[1] for x in combinedCoords]) - min([x[0] for x in combinedCoords]) + 1 aDiff = lengthA - TP bDiff = lengthB - TP TN = compLength - TP - aDiff - bDiff if TN == 0: TN = 1 return ((TP * TN) - (aDiff * bDiff)) / (((TP + aDiff) * (TP + bDiff) * (TN + aDiff) * (TN + bDiff)) ** 0.5) else: return -1

*Note that this MCC function will only calculate if the two annotations overlap (TP > 0). It will return -1 if TP is 0.

## Generate an edge file and run MCL

The easist way to input data into MCL is to generate a tab-delimited edge file of the format:

nodeA nodeB score

nodeC nodeD score

...

For our annotation data, each node is an annotation and score is the MCC. Depending on what format your annotations are in, you will have to write a script to parse the data into arrays of coordinates and use the functions above to calculate the MCC.

Remember that most genomes are broken up in many big reference contigs and annotations are specific to a reference contig. Only calculate MCCs for annotations on the same reference contig.

Once you've generated the tab-delimited edge file, you can run MCL (assuming you've installed it already) by:

mcl myEdgeFile.tab --abc -o outputFile

For more advanced options, read the MCL manual. For example, you can adjust the granularity of the clustering using the -I option.

The output file will be a tab delimited file where each line is a cluster group. For example, here is an output showing three clusters. Cluster 1 contains Transcript A,B,C. Cluster 2 contains Transcript O,D,E,H. Cluster 3 contains Transcript K, I.

TranscriptA TranscriptB TranscriptC

TranscriptO TranscriptD TranscriptE TranscriptH

TranscriptK TranscriptI

## Conclusion

Now that you've clustered your annotations, you have the difficult task of choosing which annotation is the best representative of the cluster or how you might want to merge all the annotations in a cluster.