Tuesday, 07 August 2012 at 10:21 am

A common task in analyzing annotations or alignments is to see what the consensus or overlapping regions are. For example:

How do you calculate the overlapping region (red) or the consensus region (blue) between these 2 features on a reference sequence? The best available tool for this type of analysis is bedtools. However, just for fun, I will show you how to perform consensus and overlap in python without using any external libraries.


Input

The example shown above is the simplest case for this type of analysis, but what happens when you have something like this:

I am going to assume the input to this script will be an array of tuples representing a series of start and end coordinates. So for the complicated example above, the data will be: 

data = [(5,15),(20,38),(25,50),(30,35),(50,60),(55,70),(80,90)]
# A B C D E F G 


Overlap comparisons algorithm

One method of finding consensus is to first cluster all the coordinates into groups where each feature in the group overlaps with at least one other feature in the group. For the above example, there would be 3 groups:

  • Group 1 - A
  • Group 2 - B, C, D, E, F
  • Group 3 - G

Then to get the consensus region, you can just find the minimum start coordinate and the maximum end coordinate for each group. However, finding overlapping region is more complicated. Instead of writing about methods that requires you to do series of coordinate comparisons, I will present a simpler solution.


Lexical algorithm

As it turns out, you can treat this problem of finding consensus and overlaps as a lexical parsing problem. Lexical parsing is essential to how compilers and interpreters understand code. It is also something humans can do intuitively to interpret language and grammar. In the following mathematical equation, we understand the order in which we do the operation because we can interpret the parentheses grouping parts of the equation together:

((1 + 2) * (3 + 4)) - (5 + 6)

We understand that 1 + 2 and 3 + 4 are to be performed first. Then we multiply the previous two operations and end with a final subtraction of 5 + 6.  

So how do we apply this to consensus and overlap? Let's take the example above and label the start/end coordinates of each feature as 'S' and 'E'. If we write out the 'S' and 'E' in coordinate order, we get:

S E S S S E E S E S E E S E

From this string of sorted start and end coordinates, we can extrapolate the consensus and overlaps by:

[S E] S S [S E] E [S E] [S E] E [S E]   
every inner-most lexical 'S E' is a potential overlap region

[S E] [S S S E E S E S E E] [S E] 
every outer-most lexical 'S . E' is a consensus region

If we look at the string of 'S's and 'E's and treat them as parentheses where 'S' is '(' and 'E' is ')':

( )    ( ( ( ) ) ( ) ( ) )    ( )

The inner-most nested parentheses that do not contain any more parenthesis within itself are the inner-most 'S E'. The outer-most parentheses that are not nested within any other parentheses are what I mean by the outer-most 'S . E'. That's basically the gist of this algorithm. 

There are a few things to watch out for. In cases where the start coordinate of one feature is the same as the end coordinate of another feature, we sort the positions by start coordinate first. So in the above example, the start coodinate of feature E (50,60) will come before the end coordinate of feature C (25,50). 

Another issue is how we want to define overlap. Finding the inner-most lexical 'S E' will give us overlaps and any stand-alone features (ie. feature A and G). Counting the number of consecutive 'S' + 1 before the inner-most lexical 'S E' will give us the number of features that composes the overlap. For example:

[S E] S S [S E] E [S E] [S E] E [S E]   
0 + 1 2 + 1 0 + 1 0 + 1 0 + 1 #number of features that makes up the overlap

If we want to discard the stand-alone features, we can simply just remove any overlaps consisting of only 1 feature.


Code

Here are some python code to get consensus and overlap. We first sort the coordinates then pass the sorted coordinates to the consenus/overlap functions. Code is commented.

data = [(5,15),(20,38),(25,50),(30,35),(50,60),(55,70),(80,90)]

def sortCoord(d):
	coords = []
        #add each start/end position into a new array as a tuple where the
        #first element represents whether it is a start or end
	for coord in d:
		coords.append(('s',coord[0]))
		coords.append(('e',coord[1]))

	#sort by start and end first. In case of event where
	#a start and end coordinate are the same, we want the start
	#coordinate to come first. For example feature C's end coordinate
	#and feature E's start coordinate are the same, we want to make
	#sure feature E's start cooodinate will be first
	coords.sort(key = lambda x : x[0], reverse = True)

	#sort by coordinate
	coords.sort(key = lambda x : x[1])

	return coords

def overlap(c):
	#not going to explain the logic of this code, it's
	#just a simple lexical tokenizer
	count = 0
	posA = 0

	#this will tell you how many 'levels' there are
	#to the current overlap. Essentially how many
	#features makes up the overlap.
	level = 1
	out = []
	for pos in c:
		if pos[0] == 's':
			count = 1
			level += 1
			posA = pos[1]
		if pos[0] == 'e':
			level -=1
			count -=1

		if count == 0:
                        #only output overlap if there are more than 1 feature
                        #making up the overlap
			if level > 1:
				out.append((posA, pos[1], level))

	return out

def consensus(c):
	#not going to explain the logic. It basically just
	#looks for the outer-most SE's
	count = 0
	posA = 0
	out = []
	for pos in c:
		if count == 0:
			posA = pos[1]
		if pos[0] == 's':
			count += 1
		if pos[0] == 'e':
			count -=1

		if count == 0:
			out.append((posA, pos[1]))

	return out

sortedCoords = sortCoord(data)
print sortedCoords
#[('s', 5), ('e', 15), ('s', 20), ('s', 25), ('s', 30), ('e', 35), ('e', 38), ('s', 50), ('e', 50), ('s', 55), ('e', 60), ('e', 70), ('s', 80), ('e', 90)]

consensusData = consensus(sortedCoords)
print consensusData
#[(5, 15), (20, 70), (80, 90)]

overlapData = overlap(sortedCoords)
print overlapData
#its a tuple of (start, end, number of features overlapping)
#[(30, 35, 3), (50, 50, 2), (55, 60, 2)]







Search

Categories


Archive