Information theory as an interdiciplinary field consisting of engineering, physics, bioinformatics, mathematics, and many others started with Claude E. Shannon's 1948 paper, "A mathematical theory of communication". Advancements in this field have been instrumental in improving communication across the world from data storage on disc drives to satellite communcations.
In this blog post, I will briefly go over what information theory is all about in an intuitve way and it's practical application to the field of bioinformatics and evolution. I am not an expert on information theory, so I welcome any corrections.
What is it
Information theory is a framework used to describe communication. It formalizes the abstract ideas of "information" into a mathematical (more statistical) framework.
It is important to note that information theory provides system solutions to data problems and not physical solutions. For example, to improve the signal quality of a channel such as phone lines, we can physically improve the hardware by replacing old cables, make the transmitter more sensitive, or a host of other physical solutions. A system solution, on the other hand is a way of organizing the data we are transmitting to ensure fidelity. A simple system solution might be to just transmit duplicated data so if one copy is missing data, there are redundancies to make up for it.
Shannon's paper essentially presents two ideas: source coding theory and channel capacity theory. Source coding theory deals with data compression  taking a piece of data and using clever algorithms to reduce the size of the data while maintaing the information content. Channel capacity deals with data encoding  organizing a piece of data in such a way to ensure informational fidelity when transmitted across an imperfect channel, usually results in make the transmitted data larger. If you want to know more about these two theories, they are presented at the end of this blog post. For relevance to evolution, you will only need to know what information is as defined by Shannon.
Information
Shannon's definition of information can also be thought of as complexity. Intuitively, we think a system is more complex when there are more number of possible configurations. A mechanical system with 2 parts where each part can be adjusted to 6 different settings will have 36 different configurations. A biological system such as a cell with an enormous number of parts (organelles, molecules) will have an exponentially larger amount of configurations.
Let's take a simple system of a 2sided coin. The outcome of flipping an unbased coin is either a head or a tail. There are 2 possible configurations of this system. On the other hand, if we toss a coin where both sides are heads, we will always get head as the outcome. There is only 1 configuration of this 2headed coin system. Tossing this 2headed coin will tell us nothing new because we know it will always be heads.
The information content of a system is quantified by the complexity, or entropy of the system. The more entropic a system is, the more possibilities there are for the outcome. More possibilities for the outcome makes a message more significant and contain more information. Conversely, A less entropic system where there are less variation and less possibilities makes the message less significant and contain less information.
It is important to note that Information as defined by Shannon is independent of the meaning or interpretation of the message. Two pieces of data describing complete disparate ideas can have the same quantity of information content.
Quantities of information is conventionally meaured in the unit of a bit (0 or 1). 1 bit would give you 2 outcomes, 2 bits would give you 4 outcomes (00, 01, 10, 11), 3 bits would give you 8 outcomes (000, 001, 011, 111, 010, 110, 100, 101). Any base system is actually fine for measuring information. For example we might use a base of 4 for measuring nucleotide information or a base of 20 for measuring protein residue information. The previous example of flipping an unbased 2sided coin would contain 1 bit of information (0 or 1 is same as head or a tail). Conversely, flipping a 2headed coin would give us 0 information.
Mathematically, entropy of a system is defined as:(in bits)
 each possible outcome/configuration of a system are: (x1, x2, x3, x4...xi).
 entropy of each outcome: p(xi) * log2 (p(xi)) where p(xi) is probability of the outcome xi.
 entropy of the system is the sum of each individual outcome entropy.
For a system of an unbiased 2sided coin, the entropy is:
 p(heads) = 0.5
 p(tail) = 0.5
 entropy of heads = H(heads) = 0.5 * log2(0.5) = 0.5
 entripy of tails = H(tails) = 0.5 * log2(0.5) = 0.5
 sum entropy= 0.5 + 0.5 = 1 bit
Information theory and evolution
Information theory is practically applicable to NGS data in terms of data compression and data modelling. But can we apply information theory to evolution and perhaps gain insight? Is the inheritance of genetic material similar to transmission of information across a channel?
The more well known (wrong)application of information theory to evolution was actually an attempt to prove intelligent design. This argument was presented to Richard Dawkins in the late 90's by a film crew pretending to be unbiased film makers: "give an example of a genetic mutation or an evolutionary process which can be seen to increase the information in the genome". Dawkins, more surprised by the creationist agenda of the film crew didn't give a satisfactory answer and was subsequently portrayed as being stumped in the documentary.
The question revealed a lack of understanding of information theory on the film maker's part. Their question is essentially: "Can random mutations cause complex structures to form, ie. eyes, brain, etc". They chose to dress the question up in a pseudoscientific manner to give themselves legitimacy.
Shannon's information is different from the general public's definition of information. The film makers were really using the term 'information' as changes in the genome that give rise to a functional phenotype. They made the mistake of bringing functional relevance into the definition of Shannon's information. Remember that Shannon's information says nothing about the interpretation of the message.
Mutations actually will increase information in the genome. Adding more variation, increasing entropy, will increase information content of the message. Maximal entropy is when all nucleotides have equal chance of occuring at every base position. However, just because variation is being added, does not mean the new variations have any functional or practical significance in terms of the animal's fitness. The layman's definition of "functional information" is what the film makers were actually referring to.
Of course the part of the puzzle that the film makers were missing is natural selection. Mutations produce variation, increasing entropy, increasing the information content of the message. Natual selection in a sense IS the message. The instance of genomic variation that was selected for and transmitted to the offspring is reflective of the selective pressures.
Evolution, as frameworked by information theory would generally go through these phases:
 New generation of animals are born. Entropy is at it's lowest as natural selection has taken a subset of the previous generation, reducing variation in the possible genotypes. Information content is lowest at this point.
 Any factor that causes variation (mutations, crossing over, retroviral insertions..) increases the entropy of the gene pool, increasing the information content of the system.
 Transmission of the message occurs with natural selection selecting for the variations that give best fitness. This process is analgous to flipping a coin or tossing a die under biased circumstances. This message will have information content relative to the entropy of the gene pool.
 This new generation of animals produced is in essence, the message transmitted, reflective of the natural selection that has taken place previously.
Dawkin's lengthy reply to the film maker's question after the documentary's release points out several salient issues:
 There are information contained within the genomes of organisms, but what is considered relevant information? Just the coding sequence? What about noncoding regulatory regions? Do we care about information content of the entire genome?
 Why is an increase in information (more complexity) assumed to be the default trajectory of evolution?
 Mutations causes uncertainty (variation), natural selection funnels the variation, narrowing the uncertainty.
Analysis of gene entropy
Whether we can perform this type of analysis or not is constrained by the data available to us. Performing this type of analysis would be very similar to performing phylogenetic analysis using SNP data. Information content of genes can be calculated if we have SNP data representing the probabilities of individual nucleotides at a certain position. Genes with less SNPs have less entropy, and less information content. Vice versa for genes with more SNPs.
If we were able to get our hands on SNP data for ancient species, we would be able to compare the information content of individual genes and perhaps make a statement relating a change in entropy to a change in selective pressure. Perhaps a function that is lost gradually throughout evolution will be represented by a gradual increase in entropy of the group of genes responsible for the function.
Instead of looking at entropy of genes at the species level with SNP data, Dr. Adami recently looked at the entropy of gene families across taxons (C. Adami. The Use of Information Theory in Evolutionary Biology, Annals NY Acad. Sciences 1256 (2012) 4965). Specifically, he looked at HOX (homeobox) gene family and a COX2 protein. The HOX family genes are essential in determining patterning during development and COX2 is a subunit of a larger complex involved in cellular respiration. A nonfunctioning HOX gene will likely lead to severe developmental aberrations. Impaired COX2 has a milder effect and is usually not fatal.
Graphing the level of entropy of HOX and COX2 by taxon levels taken from NCBI, Adami showed a small range of entropies for HOX genes and a wider range of entropies for COX2 across the taxons. The differential entropy ranges represents the selective pressure on these genes. Since HOX is essential in most animals, the entropy range is smaller, meaning there is pressure on this gene family to not vary too much. COX2 on the other hand do not have a severe phenotype and has less selective pressure acting on it, resulting in a larger range of variations.
One big concern on using entropy as a measure of variation is how we setup the groupings of data. For example in Adami's entropy graph, it looks like HOX genes of mouse and human are a lot more entropic than HOX genes of drosophila. Does that necessarily mean Drosophila's HOX genes had gone through higher selective pressure or is this difference reflective of differential amount of HOX genes among the organisms. Adami is taking all possible HOX genes from mouse and human and comparing that to all possible HOX genes from Drosophila. Mammals have 4 banks of HOX genes (A, B, C and D) with multiple genes in each bank while Drosophila have 8 HOX genes. The higher entropy of mouse and humans may just be artificially inflated by the variation within species.
Summary
Information theory can be applied to study biological evolution. But are we trying to fit a round peg into a square hole? Information in biology can exist on so many levels (nucleotide, protein, structure, function, phenotype...), it seems that to get a holistic picture we would need to incorporate as much data as we can from all levels. However, not all biological data are digital like nucleotides and amino acids to allow us a straightforward way to calculate information.
The interpretation agnostic aspect of information theory where the meaning or function of the information is separate from the quantity of information can be a double edged sword. While it gives us an unbiased method to measure variation, it also detaches your data from higher levels of information. Entropy in nucleotide without protein substitution data can falsely give higher entropy, same with protein sequence and structural data, and so on. However, that is not a problem unique to this type of analysis.



*This rest of the blog post is for people who want to know more about Shannon's paper
Source coding theory (data compression)
Data compression is something that is desparately needed in the NGS field due to the deluge of sequencing data coming out of sequencers all over the world. As reads get better in sequencing quality and longer in length, we may reach a point where we can afford to discard raw data after processing.
Data can be compressed by many clever algorithms. A common compression method is to create a symbol table where commonly found symbols in your data are indexed and assigned a shorter bit representation. For example if I wanted to transmitt this data:
AAAAAAAAAAAA B C D EEEEEEEEEEEE
The letters 'A' and 'E' occur most in this piece of data. In the symbol table, I would assign a shorter bitcode to A and E:
A  0
B  00
C  01
D  10
E  1
Using this symbol table, when transmitting the above data, I would encode it as:
000000000000 00 01 10 111111111111 *total of 30 bits
Conversely, if I didn't optimize my symbol table and I assigned longer bitcodes to frequently occuring symbols, I would end up with this:
A  01
B  0
C  1
D  00
E  10
010101010101010101010101 0 1 00 101010101010101010101010 *total of 52 bits
A symbol table is not the only way to compress data, there are numerous other algorithms optimized for different types of data.
What is significant about Shannon's theory of source coding is that the information content of a piece of data is representative of the theoretical average data compression one can obtain under an optimal compression algorithm.
It is important to note the word "average" because it is easy to fit an algorithm to a specific piece of data and compress the data down to 1 bit. This algorithm will then be overfit for this specific case and work poorly for other data sequences (source coding also relates to the field of machine learning in this sense).
The sequence described above has the information content:
 27 symbols total, 12 As, 1 B, 1 C, 1D, 12 Es.
 p(A) = 12 / 27, p(B) = 1 / 27, p(C) = 1 / 27, p(D) = 1 / 27. p(E) = 12 / 27
 information per symbol = 0.520 + 0.176 + 0.176 + 0.176 + 0.520 = 1.568
 1.568 * 27 symbols = 42.33 bits
As seen previously, a simple optimized symbol table can compress the sequence down to 30 bits and a bad symbol table will yield 50 bits. Other algorithms might be able to do much better or much worse. The best general algorithm will theoretically give a compression of 42.33 bits on average.
Depending on what we know about the data, we might be able to come up with extremely good algorithms to fit that data. But there is always the danger of overfitting the algorithm. The 'best' and extremely overfitted compression algorithm will be this symbol table:
A  01
B  11
C  1
D  00
E  10
AAAAAAAAAAAABCDEEEEEEEEEEEE  0
Under this ridiculous symbol table, we would be able to compress the data down to 1 bit.
Source coding theory gives us the amount of bits for a theoretical optimal compression algorithm, but it makes no comment on the compression algorithm itself. As seen from the ridiculous symbol table, one can easily compress the entirety of human genome down to 1 bit, but the algorithm symbol table will be the size of the human genome, rendering the algorithm useless practically.
Channel capacity theory (data encoding)
Channel capacity theory deals with organizing data in a way that information fidelity is preserved when transmitted across a noisy channel. This process of data organization is also called encoding. Encoding data for fidelity results in an increase of the size of data being transmitted. This is why compression and encoding go hand in hand. Data is first compressed to get rid of redundancy and then encoded for fidelity across a channel. The increase in size from encoding can be offset by the initial compression.
Just like data compression, there are numerous encoding systems used by communication industries all over the world. A simple example of encoding is a parity check encoding. For example, if we are transmitting the following series of numbers:
1 2 3 0 1 2 3
We can calculate a sum parity check number which is just a sum of the above numbers:
sum parity = 1 + 2 + 3 + 0+ 1 + 2 + 3 = 12
After transmitting the series of numbers, we will also transmit the sum parity number. The decoder on the receiving end will then perform the same addition of numbers and check if it corresponds with the parity number. If one of the numbers was transmitted wrongly:
1 2 3 0 1 2 3 transmitted> 1 2 2 0 1 2 3
We will be able to deduce that there is an error.
This is a extremely simplistic view of what parity encoding is. Parity checks are usually a bitwise XOR operation that will allow the decoder to error correct a single error. But I won't get into that.
Fidelity of the signal can be represented by the probability of an error occuring after transmission. The increase in size after encoding is also called 'rate' and is represented as a ratio of original data size and postencoding data size. If we represent these two measures as a graph:
Xaxis is the rate of encoding. The larger the rate, the close the postencoding size is to the original data size. The smaller the rate, the larger the postencoding size is compared to original data size. On the yaxis is the probability of error after decoding. Ideally, we want an encoding algorithm that will maximize the rate (smaller postencoding size) and minimize the probability of error.
Before Shannon's paper, coding scientists assumed there is always a tradeoff between rate and probability of error. If you want a better error rate, you will have to increase your postencoding file size. The optimal limit of encoding algorithms was seen as a line that goes through the origin of the above graph.
Howerver, through a series of statistical maths, Shannon was able to prove a maximum channel capacity. Instead of a limit that goes through the origin, the limit is actually a line that vertically falls somewhere around the rate of 0.5:
This was significant as this means for the tradeoff of doubling your postencoding data size, you can theoretically get extremely good probability rates.