Wednesday, 19 December 2012 at 4:26 pm

As the gaps between scientific disciplines narrow, researchers are starting to see that problems in one field can be answered in another, and that the same concepts are simply presented in different contexts. Information and complexity are not concepts exclusive to mathematics or engineering. 

Looking at the universe through the lens of information gives us another perspective on life. The data that makes up each of us is enormous. How would we even come up with the "data" that describes each of us? A recording of every atom in our body? (Star Trek transporters apparently can do this.) However, all of the "data" that makes us up is algorithmically compressed into our DNA. If we take the universe as a giant computer (Turing machine) and our DNA as the source code, we are the output. 

Geneticists are essentially studying how this compression algorithm works. This post will touch on how concepts of algorithmic complexity, data compression, and similarity distances can be applied to biological data.

Algorithmic complexity

In my previous post on information theory, I presented the idea that the information/complexity of a message reflects the entropy of the system that emitted it. Getting a result of 2 from rolling a six-sided die contains more information than getting a result of 2 from a three-sided die, because there were 6 possible results (more entropy) in the first case versus 3 in the second. Likewise, the sequence of a gene with 6 alleles in the population contains more information than the sequence of a gene with 3 alleles.
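In code, the information carried by one equally likely outcome is just the base-2 logarithm of the number of possible outcomes:

```python
import math

# Information (in bits) of one equally likely outcome = log2(number of outcomes)
bits_d6 = math.log2(6)  # six-sided die: ~2.58 bits per roll
bits_d3 = math.log2(3)  # three-sided die: ~1.58 bits per roll
print(bits_d6, bits_d3)
```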

An offshoot of information theory is algorithmic information theory, which measures information in a different way. The algorithmic complexity (also known as Kolmogorov complexity) of a piece of data is the size of the smallest possible program that can generate the data, assuming a fixed programming language.

A very random piece of data will require a longer program to generate than an ordered piece of data. For example:

Data A. Random piece of data:           101 110 001 010 100 011 111
Data B. Ordered piece of data:          100 100 100 100 100 100 100

The worst program you can write for any piece of data is just a program that outputs the exact same data. For example here is a program for outputting Data A (in python):

print('101 110 001 010 100 011 111')

Pretty dumb, isn't it? If a piece of data is completely random, then this worst-case program is the only way to output it. The algorithmic complexity of completely random data is therefore equal to the length of the data, plus a constant that accounts for the programming language (the 'print' statement, for example).

For data B, we can simply write (in python):

print(' '.join(['100'] * 7))

One very important thing about algorithmic complexity is that there is no way to determine or calculate it. I won't go into exactly why (it relates to Turing's halting problem), but intuitively: to determine the algorithmic complexity of a piece of data, we would need to know the absolute shortest program that generates it, and there is simply no way to prove that a given program is the shortest. 
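While the true complexity is uncomputable, a real compressor gives a usable upper bound: random data barely shrinks, while ordered data collapses. A minimal sketch using Python's zlib:

```python
import random
import zlib

random.seed(0)
random_data = bytes(random.getrandbits(8) for _ in range(10_000))  # like Data A
ordered_data = b'100 ' * 2500                                      # like Data B

# Compressed size upper-bounds algorithmic complexity (plus compressor overhead).
print(len(zlib.compress(random_data)))   # stays close to 10,000 bytes
print(len(zlib.compress(ordered_data)))  # collapses to a few dozen bytes
```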


Similarity distances

I will briefly introduce the concept of similarity distances and discuss how it relates to algorithmic complexity. The choice of distance measure when performing any kind of similarity analysis on a data set can significantly influence the results, because each distance measure makes some kind of assumption about the data.

Hamming distance assumes that the least possible number of changes needed to transform one piece of data into another reflects their similarity. Is that really the best assumption to make for all types of data? Imagine the Hamming distance between a digital photograph and its negative. Every pixel of the photograph is inverted in the negative, resulting in a huge Hamming distance. But are a picture and its negative really that dissimilar? Let's come up with a new distance measure called 'negative distance', which measures deviations of a pixel from its negative. With this new distance, the photograph and its negative are now perfectly similar. 
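As a toy illustration (using a hypothetical 8-bit grayscale image flattened to a list of pixel values), Hamming distance says an image and its negative disagree everywhere, while the made-up 'negative distance' says they match perfectly:

```python
image = [0, 64, 128, 192, 255, 30, 200, 90]  # hypothetical 8-bit pixel values
negative = [255 - p for p in image]

def hamming(a, b):
    """Number of positions where the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def negative_distance(a, b):
    """Total deviation of b's pixels from the exact negative of a."""
    return sum(abs((255 - x) - y) for x, y in zip(a, b))

print(hamming(image, negative))            # 8: every single pixel differs
print(negative_distance(image, negative))  # 0: a perfect match
```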

All of these similarity distances are essentially trying to describe how to transform one piece of data into another. Euclidean distance describes the geometric length needed to change one vector into another. Hamming distance describes the number of substitutions needed to change one string into another. The made-up 'negative distance' describes the deviation from the exact inverse. Each is describing an algorithm for transforming one piece of data into another.

Is there a non-parametric way to calculate distance? Is there a way to get similarity without defining a set of rules? 

Information distance and compression distance

The smallest algorithm that converts one piece of data into another is the information distance. It requires no rules and is non-parametric. But you may remember from the first section that there is no way to determine the smallest algorithm: information distance is not computable. 

We can, however, approximate the smallest algorithm using compression algorithms. Through some involved mathematics, we arrive at the normalized compression distance (NCD) for data x and y, where C(x) denotes the compressed size of x:

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

This is really a distance measure parameterized by the compression algorithm: the closer the compressor gets to the smallest theoretical conversion algorithm, the closer the result gets to the true information distance.
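A minimal sketch of NCD using zlib as a stand-in compressor (the sequences below are made up for illustration):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximated with zlib."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

seq_a = b'ACGTACGTGGCCAA' * 40  # made-up "DNA" sequence
seq_b = b'ACGTACGTGGCCAA' * 40  # identical to seq_a
seq_c = b'TTAGGCATCCGATA' * 40  # unrelated sequence

print(ncd(seq_a, seq_b))  # near 0: the second copy adds almost no new information
print(ncd(seq_a, seq_c))  # noticeably larger: seq_c cannot be predicted from seq_a
```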

Using popular compression algorithms, we can attempt to build phylogenetic trees by compressing DNA data. A couple of papers have attempted this with promising results.

Semantic distance

Another application of this concept is discovering the semantic distance between objects from a set of observations. A cool variation is the normalized Google distance, where the number of web pages mentioning each object is used to calculate the semantic distance.

This may also be applied to Gene Ontology (GO) terms using a representative set of annotated sequences (UniProt). A formula for this would be:

a = number of genes in uniprot with GO term A
b = number of genes in uniprot with GO term B
c = number of genes in uniprot with both GO term A and B
n = total number of genes in uniprot

distance = (max(log(a), log(b)) - log(c)) / (log(n) - min(log(a), log(b)))
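As a sketch, with made-up counts rather than real UniProt numbers:

```python
import math

def go_distance(a: int, b: int, c: int, n: int) -> float:
    """Semantic distance between GO terms A and B from annotation counts.

    a, b: genes annotated with term A / term B
    c:    genes annotated with both terms
    n:    total number of genes in the set
    """
    return ((max(math.log(a), math.log(b)) - math.log(c))
            / (math.log(n) - min(math.log(a), math.log(b))))

# Hypothetical counts, not real UniProt numbers:
print(go_distance(a=500, b=300, c=120, n=20000))  # between 0 and 1
print(go_distance(a=500, b=500, c=500, n=20000))  # 0.0: the terms always co-occur
```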

Note that this completely disregards the GO graph structure and may not be applicable. For semantic similarities, people often use the MICA (most informative common ancestor) in their measures. These papers describe three different variations of semantic similarity: 

  1. Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1 (IJCAI'95), Chris S. Mellish (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 448-453.
  2. Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296-304.
  3. J. J. Jiang and D. W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference on Research on Computational Linguistics (ROCLING X), pages 9008+.


Compression distance might be a great method for calculating a non-parametric similarity measure. While it hasn't been used much in bioinformatics, it has proven useful in data mining. One fault one may find with using information distance for phylogenetics is that the shortest algorithm for transforming one sequence into another may not actually be representative of how evolution works. 

A great primer on this topic is the paper Normalized Information Distance.