Monday, 20 August 2012 at 10:34 am

As an extension on the post earlier on finding overlapping/consensus regions of features, this is a post on how to find features given a coordinate range. For example, given a range of coordinates and a reference contig, how does a genome browser efficiently find all features within that range?

The brute-force method would be to go through every feature one by one and determine whether the feature is within the given range by checking for overlap. That would make the search time linear relative to the number of total features (big-O notation of  O(n)). As it turns out, a binary search tree method called an interval tree can be applied to this problem that'll make searches O(log n). Here is an extremely good guide on the theory behind interval trees.

I am currently trying to write a client-side genome browser. Interval trees solves the problem of finding features effciently. Below is code on how to construct a interval tree in python and how to query the tree in python/javascript.

Constructing the interval tree in python

Here is the python code for constructing an interval tree. The input is an array of elements where each element has start coordinate, end coordinate, and ID. Here are some sample data:

#features where first and second numbers are start and end coordinates. 
#Third element is the id of the feature.
features = [ [1,90,'A'],[90,125,'B'],[25,60,'C'],[100,170,'D'],[170,220,'E'],[220,250,'F'],[600,700,'G'],[120,125,'H'],[500,550,'I'],[1000,1200,'J'],[800,850,'K'],[1100,1500,'L'] ]

The general steps of building this tree are:

  1. Convert the data into elementary intervals 
  2. Recursively build an empty tree with the elementary intervals. This tree is essentially a BST (binary search tree) with an addition of empty leave nodes.
  3. Recursively add the data into the empty tree.
  4. Trim the tree for empty data nodes
Here is code for the interval tree class with comments.

To use this class, all you have to do is:

myTree = intervalTree(features, 0, 1, 1, 2000)

The second and third arguments are the index/key for the start and end coodinates in the data array. For our data, our elements are in the format of:

[start coordinate, end coordinate, feature id]

The start coordinate is the 0 index and end coordinate is the 1st index. 

The third and fourth argument is the start and end of the total search space. For example if your genome is a million base pairs, the start and end range would be 1 and 1000000. 

Querying the interval tree in python

The 'findRange' function in the class will allow you to find features within a range:

#this will find all features between 20 and 200
results = myTree.findRange([20,200])

Querying the interval tree in javascript

To query this data structure in javascript, we first need to output the tree in a javascript file, which is basically just printing the tree as a string:

print 'var myTree = ' + myTree.tree + ";"

Save this data as a javascript file, tree.js. Here is the HTML/javascript template for querying the tree in javascript:

<script type="text/javascript" src="tree.js"></script>
//accessory functions for finding overlap function ptWithin(pt, subject) {
if (pt >= subject[0] && pt <= subject[1]) return true; return false;
function overlap(query, subject) {
if (self.ptWithin(query[0], subject) || self.ptWithin(query[1], subject) || self.ptWithin(subject[0], query) || self.ptWithin(subject[1], query)) return true; return false;

//accessory function to remove redundancies in an array function unique(a) {
var temp = {};
var len = a.length;

if (len > 0) { do {
len --;
temp[a[len]] = true;
} while (len)
var r = [];
for (var k in temp) r.push(k);
return r;

//recursively finds features within find range, same as the python implementation function find(node, findRange, start, end) {
var data = []; var left = [start, node[0]];
var right = [node[0], end]; if (overlap(left, findRange)) {
data = data.concat(node[3]);
if (node[1] != -1) data = data.concat(find(node[1], findRange, left[0], left[1]));
} if (overlap(right, findRange)) {
data = data.concat(node[3]);
if (node[2] != -1) data = data.concat(find(node[2], findRange, right[0], right[1]));
} return unique(data);

//to use the previous functions
results = find(myTree, [1,220], 1, 2000);
//results now contains the ids of the features between 1 and 220 </script>