Monday, 22 October 2012 at 11:49 am

I recently came across a paper mentioned in this Biostar post:

Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Wagner GP, Kin K, Lynch VJ. Theory Biosci. 2012.

The paper argues that RPKM (reads per kilobase per million) as a measure for mRNA abundance is inconsistent and proposes another measure, TPM (transcripts per million).

In this post, I'll go over RPKMs and how they are inconsistent among samples as explained by the paper.

update (October 23, 2012)

Here is a cool paper comparing 7 popular methods of normalizing tag counts:

Marie-Agnès Dillies, Andrea Rau, Julie Aubert. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics. 10.1093/bib/bbs046.


The number of sequenced reads that align to a gene of interest is conventionally called a tag count. To make tag counts comparable among samples, a normalization must be performed. RPKM (reads per kilobase per million) is a method of normalization that is widely used in RNA-seq analysis.

A normalization to library size is the most basic normalization that needs to be done. Two replicate samples with different total library size would produce proportionally different tag counts for the same gene. There are many ways of normalizing by library size. DESeq's normalization by factor size is a popular method where the median of geometric mean to gene tag count ratio is used as the factor size for each sample. RPKM method takes a simple approach of just dividing the tag count by the total library size. 

The RPKM method also involves another normalization to transcript length. Since most RNA-seq library preparation process goes through an initial transcript fragmentation step followed by size selection, longer transcripts will tend to produce more fragments than shorter transcripts. To remove this fragmentation bias, the RPKM method divides the tag count by the length of the transcript in kilobases. 

Here is the equation for calculating RPKM:

(tag count  * 1,000,000) / (total number of tags * kilobase of transcript)

The multiplication by a million is done to make the numbers easier to read and work with. 

Let's say we have a sample with 100,000 tags total. Gene A is 2 kilobases long and contains 10 tags. The RPKM would be:

10 * 1,000,000 / 100,000 * 2 = 50

Average normalized abundance

The paper presents the idea that the average normalized abundance of a set of transcripts should be constant among samples as long as the same set of transcripts are being assayed. 

We'll look at an example to illustrate this idea of constant average normalized abundance. Let's say we have a magical machine that will tell us the exact number of transcripts for each gene in our sample. We use this machine on two samples, X and Y taken from the same organism under different conditions. The organism has 5 genes: A, B, C, D, and E. The following table shows the total number of transcripts and the number of each transcript in the two samples.

Sample Total number of transcripts A transcripts B transcripts transcripts transcripts transcripts
X 100  80 10
Y 500  20 20 10  50  400 

Just by looking at the total transcript number we can say sample Y's transcript library size is 5 times larger than sample X's library size. To normalize the transcript numbers, we can simply divide sample Y by 5 or multiply sample X by 5. We can also divide each transcript count by the total number of transcripts giving us:


normalized A

normalized B normalized C normalized D normalized E
X 0.8 0.1 0.06  0.03  0.01 
0.04 0.04 0.02  0.1  0.8 

Given an ideal situation where we have this magical machine to tell us the exact number of transcripts, this would be the correct way of normalizing the abundance of transcripts.

These normalized abundance values are essentially the percentage of each transcript in the total population of transcripts. The average of these normalized values for both samples is 0.2:

sample X = (0.8 + 0.1 + 006 + 0.03 + 0.01) / 5 = 0.2
sample Y = (0.04 + 0.04 + 0.02 + 0.01 + 0.8) / 5 = 0.2

Since these normalized values are just percentages of the total transcript number, they have to sum up to 1 (100%). We are essentially just dividing 1 by the number of genes, which is 5, to get the average normalized abundance.

This shows that, in an ideal situation, the composition of transcripts in the two samples doesn't matter in calculating average normalized abundance. The number of A, B, C, D, E transcripts does not matter because after normalization, the average abundance will always be 0.2 as long as the two samples contain the same set of genes. We can even go as far as saying that the average normalized abundance will be constant among any sample with the same number of genes.

Inconsistency with RPKM

Any metric of mRNA abundance will be a proportion of the true transcript numbers in the sample. As a result, it should also conform to the described above constancy of average normalized abundance. However, RPKM normalization does not produce constant average RPKMs among samples.

Let's reuse the data from the example above. Instead of transcript numbers, let's just say they are tag counts instead. The gene lengths are:

Gene Gene length (kilobases)
B 50
C 25
D 5
E 1

The RPKMs for sample X and Y according to the gene lengths are:

Sample Total number of tags A RPKM RPKM RPKM RPKM RPKM
X 100  8,000 2,000 2,400 6,000 10,000 
Y 500  400 800 800  20,000  800,000 

The average RPKM for sample X is 5,680. The average RPKM for sample Y is 164,400. What is causing the significant difference in the average RPKMs? 

Note that we can use transcript numbers, molar concentration of transcripts or any other units we want for transcript abundance. Also note that to normalize by library size, we divide the abundance of individual transcripts by the total abundance of transcripts. The units are the same in the numerator and denominater of this normalization because we are only propotionally changing the abundance of individual genes.

Are sequenced reads an appropriate unit for transcript abundance? Yes, they are.

If we only normalized the individual tag counts by total library tag count, we would get a constant average normalized abundance. The numerator and denominater in this normalization are both in the same unit - tag counts.

For RPKMs, we are normalizing the tags by gene length first, and then normalizing by library size. The first normalization by length produces the unit of tag count/kilobase. The second normalization by library size divides tags/kilobase by tag count. This improper unit of normalization in the denominater is what is causing the inconsistent average RPKMs.

Instead of normalizing the reads to library size by total tag count division, we should divide by the sum of all length normalized tag counts:

sum(number of reads for transcript n / length of transcript n) where N is the set of all transcripts.

This reads/kilobase sum for sample X and Y would be:

sample X = 80 / 100 + 10 / 50 + 6 / 25 + 3 / 5 + 1/ 1 = 2.84
sample Y = 20 / 100 + 20 / 50 + 10 / 25 + 50 / 5 + 400 / 1 = 411

If we now use these new numbers as the denominators for the RPKM equation, we would get these RPKMs:

Sample New total number A
X 2.84 281690.14 70422.53 84507.04 211267.60 352112.67
Y 411  486.61 973.23 973.23 24330.90 973236.00

 The average RPKM for the two samples are now both ~200,000.


This inconsistency of RPKM among samples essentially stems from the wrongful division by total tag counts in the library normalization step after normalizing by length. However, before normalization by length or any other non-library size normalization, make sure the normalization is actually necessary in your analysis. If you are comparing gene expression among samples only, there really is no reason to normalize by length as you will be dividing each gene among the samples by a constant (gene length). You only need to use RPKM when you are comparing transcript expression within one sample.

In the bigger picture, there are still many assumptions associated with using tag counts as an unit of abundance. One assumption is the even distribution of reads across your transcript of interest. For example if we have 100 tags mapping to a transcript stacked up in one region on the transcript, are the 100 reads coming from 100 different transcripts? Is the maximal depth along a transcript more representative of abundance? If there is an uneven distribution of reads mapping across a transcript, should we even use the tag count as a measure of abundance? Could that uneven-ness be due to sequencing chemistry bias or wrong fragmentation?