Enrichment analysis are applied when you have categorical data associated with your dataset. For example gene ontology, pfam families, molecular pathways, enzymatic activity...etc. The gist of the analysis is to see whether a certain category (GO term, pfam…) are over-represented in a subset of your data.

Let’s take an example. Let’s say I have:

- A transcriptome of 20,000 genes.
- 400 genes out of 20,000 are categorized as “cell cycle”.
- We found 1,000 genes to be differentially expressed under a certain condition.
- 300 genes have the “cell cycle” category out of the 1,000 differentially expressed genes.

What is the significance of this? In other words, if we pick 1,000 genes randomly from the total pool of 20,000 genes, what are the chances there will be more than 300 genes with the cell cycle category?

In this post I will go through the basics of how enrichment analysis is performed and some thoughts on how informative this analysis is as applied to biological systems.

## Calculating p-value with a simple hypergeometric distribution

A more intuitive way to think about how calculating the p-value of enrichment works is to imagine how you would bootstrap this value. You would write a script to randomly pick 1,000 genes from the total population of 20,000 genes. Then look at this set of 1,000 genes and see if more than 300 are cell cycle. Record this result as success or failure. Then repeat this process millions of times. The probability value will just be the number of success divided by the number of test cases (repeats). This is not the most efficient method obviously.

The statistical way of calculating this p-value is to use a hypergeometric distribution. I won’t go into the mathematics of this probability distribution. I’ll present a more intuitive description. Let’s take our cell cycle example again.

Let’s refer to a cell cycle gene as True and not cell cycle gene as False. And instead of picking 1,000 genes, we are only picking 2 genes. How many combinations (without replacement) of 2 genes can we get out of the total population of 20,000 genes? Using a binomial coefficient we can calculate this to be 199,990,000.

Out of the 199,990,000 possible combinations, how many are True False, how many are True True, how many are False False?

The number of True True is the same as asking “out of total population of 400 cell cycle genes, how many combinations of 2 genes are there?”. Again, with a binomial coefficient, the answer is 79,800. The probability of getting a True True is 79,800 / 199,990,000 = 0.0004

The number of False False is the same as asking “out of the total population of 19,600 not cell cycle genes, how many combinations of 2 genes are there?”. With a binomial coefficient, the answer is 192,070,200. The probability of getting a False False is 192,070,200 / 199,990,000 = 0.9604

The number of True False is the same as the product of “out of total population of 400 cell cycle genes, how many combinations of 1 gene are there” (the answer is simply 400) and “out of total population of 19,600 not cell cycle genes, how many combinations of 1 gene are there” (again the answer is just 19,600). The number of True False, by multiplying 400 and 19,600, is 7,840,000. The probability of getting a True False is 7,840,000 / 199,990,000 = 0.0392

The numbers of possible True True, True False, and False False naturally add up to the total number of possible combinations of 2 genes which is 199,990,000.

Now imagine expanding this to our set of differentially expressed 1,000 genes. How many combinations of 1,000 genes are there in our total 20,000 genes? How many of these combinations are 300 Trues and 700 Falses? How many are 301 Trues and 699 Falses? How many are 302 Trues and 698 Falses? Add up all the possible combinations where there are more than 300 Trues, divide by the total possible combinations and you will get your p-value.

Here is a python script requiring scipy that can be used to calculate hypergeometric p-value:

import sys import scipy.stats as stats print print 'total number in population: ' + sys.argv[1] print 'total number with condition in population: ' + sys.argv[2] print 'number in subset: ' + sys.argv[3] print 'number with condition in subset: ' + sys.argv[4] print print 'p-value <= ' + sys.argv[4] + ': ' + str(stats.hypergeom.cdf(int(sys.argv[4]) + 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3]))) print print 'p-value >= ' + sys.argv[4] + ': ' + str(stats.hypergeom.sf(int(sys.argv[4]) - 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3]))) print

Save the script and use it by:

python script.py [total number of genes in the list] [total number of genes in the list with condition A] [total number of genes in the list with condition B] [number of genes with both condition A and B]

The script will print out:

- a p-value where by random chance number of genes with both condition A and B will be <= to your number with condition A and B
- a p-value where by random chance number of genes with both condition A and B will be >= to your number with condition A and B

## Statistical probability space and biological probability space

The probability space is the number of possible combinations when we pick a certain number of items out of a total pool of things. For the example above, there are 199,990,000 possible combinations when we pick 2 genes out of 20,000 genes.

Applying it to the list of 1,000 differentially expressed genes, we are asking “if we randomly designate 1,000 genes to be differentially expressed, what are the chances 300 or more of these genes are cell cycle?”

However, a simple hypergeometric method of calculating enrichment essentially assumes each possible combination has the same chance of occuring as any other possible combination. This does not reflect reality. Certain combinations probably have a very low chance of occuring in a real biological system.

Are all the mathematically possible combinations of 1,000 differentially expressed genes biologically possible? Do certain sets of 1,000 genes ever become differentially expressed naturally? Maybe certain sets of 1,000 genes being differentially expressed would just outright kill the organism? If we assume, in natural environments, the organism has a very limited configurations as to which 1,000 genes can be differentially expressed, then our probability space becomes much much smaller; and our p-value becomes much much bigger.