Tuesday, 10 April 2012 at 10:47 am

Note, April 11th, 2012: I decided to not use any of the data I generated in this analysis, as I felt I was overcomplicating the problem.

I gave an overview of the 3 planarian transcriptomes in my last blog post. Since we do not have immediate access to all the raw data that went into the transcriptomes, we have to resort to merging the assembled transcriptomes according to their strengths and weaknesses.

Here are some thoughts on the 3 transcriptomes that'll need to be addressed when performing the merge:

  • The BIMSB transcriptome contains the most full-length, non-redundant set of transcripts, but it also has the fewest transcripts.
  • The AAA and Heidelberg transcriptomes have better coverage of the transcriptome due to high depth of sequencing.
  • The BIMSB and AAA transcriptomes both have a small population of very short sequences. It looks like Heidelberg discarded transcripts below 100 base pairs.
  • The Heidelberg transcriptome contains a lot of Ns.
  • There may be some isoform information contained in all the transcriptomes.
  • The AAA transcriptome may contain elements of both asexual and sexual sequences since the SOLiD reads were reference assembled.
  • Strandness in the BIMSB and Heidelberg transcripts is mainly based on ORF or homology evidence.

Data quality

There are 5 aspects of the data that we want to conserve as much as possible from the 3 transcriptomes. In order from highest to lowest priority:

  • Completeness - BIMSB is the most complete
  • Coverage - AAA and Heidelberg have more coverage
  • Parsimony - BIMSB is least redundant
  • Noise - Heidelberg seems to be the most noisy based on genome mapping
  • Strandness - Will prioritize strand based on: homology, ORF, tag counts of quantification libraries

While having isoform data would be ideal, for the sake of functional analysis of RNA-seq data, parsimony would be more important.

Curation of individual transcriptomes

The first step is to do a strict curation of the individual transcriptomes based on their weaknesses. We can afford to be strict under the assumption that a transcript filtered out in one dataset probably would not be filtered out in another. However, there will be cases where a transcript, due to its sequence chemistry, is harder to sequence/assemble, resulting in a poor assembly in all three datasets.

The AAA transcriptome has redundancies due to pseudo-genes from the SOLiD reference assembly, isoforms, or fragmented read coverage. A quick way to detect and remove redundant transcripts is to use CAP3 and CDHIT-EST. CAP3 is a basic sequence assembler, and CDHIT is a sequence clustering program used by NCBI to remove redundant sequences. A second filter would also need to be done to remove short transcripts.
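Both tools are driven from the command line. As a rough sketch, the helpers below only build the argument lists; the filenames and the 0.95 identity threshold are illustrative assumptions, not the values used in this analysis:

```python
def cap3_cmd(fasta):
    # CAP3 takes the FASTA file as its only required argument; it writes
    # <fasta>.cap.contigs and <fasta>.cap.singlets next to the input and
    # prints the per-contig assembly report to stdout.
    return ["cap3", fasta]

def cdhit_est_cmd(fasta, out, identity=0.95):
    # cd-hit-est clusters sequences at the given identity threshold and
    # writes a representative FASTA plus a .clstr membership file.
    # The 0.95 threshold here is illustrative, not the value used in this post.
    return ["cd-hit-est", "-i", fasta, "-o", out, "-c", str(identity)]
```

These lists can be handed straight to subprocess.run, redirecting CAP3's stdout to a file for later parsing.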

Adamidi et al. did a great job of making sure transcripts in the BIMSB transcriptome code for something, with homology and proteomics evidence to back it up. The only curation that really needs to be done on the BIMSB dataset is to filter out the really short transcripts: their shortest transcript is an 8bp stretch of Ts, most likely an artifact of the Newbler assembly. We will also do a CAP3 + CDHIT-EST pass just to see the level of redundancy.

The first thing to do with the Heidelberg transcriptome is to deal with transcripts containing a lot of Ns. The long stretches of Ns seen in some transcripts are probably due to paired-end read assembly, where the inner sequence between the pair-ends is not defined. Instead of removing these transcripts outright, we can split them at long stretches of Ns to conserve sequence information, even though the results will be fragmented. A CAP3 + CDHIT-EST step should also be done.

CAP3 results

CAP3 generates several files after it finishes assembling the sequences. The three files we want to look at are the assembled contig fasta file, the singlet non-assembled sequences file and the stdout file that gets printed to the terminal screen (piped to a file). 

The first portion of the stdout file looks something like this:

******************* Contig 1 ********************
AAA.454ESTABI.18480+ is in AAA.454ESTABI.36+

******************* Contig 10 ********************
AAA.454ESTABI.1006+
AAA.454ESTABI.1005+

This tells us what each assembled contig consists of and also roughly how it was assembled. A contig can be assembled due to overlapping regions, due to one sequence being within another sequence, or a combination of both. 

Using a Python script to parse the stdout file, we get:

                             AAA   BIMSB   Heidelberg
Number assembled             157   386     1,379
Within sequences only         77   303     813
Overlapping sequences only    74    75     402
Both                           6     8     164
Transcripts assembled        329   868     3,218
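A minimal version of that parser might look like the sketch below. The classification rules are inferred from the stdout excerpt above, not from the CAP3 documentation: "X is in Y" lines mark contained reads, bare read names mark reads joined by end overlaps.

```python
from collections import Counter

def classify_cap3_report(lines):
    # Tally CAP3 contigs by how their reads were joined: contained reads
    # only ("within only"), end overlaps only ("overlap only"), or a mix.
    counts = Counter()
    kinds = None  # join types seen in the current contig

    def flush():
        if kinds is None:
            return
        if kinds == {"within"}:
            counts["within only"] += 1
        elif kinds == {"overlap"}:
            counts["overlap only"] += 1
        elif kinds:
            counts["both"] += 1

    for raw in lines:
        line = raw.strip()
        if line.startswith("***") and "Contig" in line:
            flush()           # close out the previous contig
            kinds = set()
        elif kinds is not None and line:
            kinds.add("within" if " is in " in line else "overlap")
    flush()                   # close out the final contig
    return counts
```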

At first glance, it seems strange that AAA contains the fewest clusters. However, it could just be that AAA is heavily fragmented and there are no overlapping fragments. BIMSB has more clusters than expected, which could be due to Newbler's assembly method, where it tries to generate multiple isoforms of a gene. The Heidelberg transcriptome contains the largest number of assembled contigs, probably caused by a combination of N repeats and redundancy.

CDHIT results

CDHIT generates a cluster file that lists the representative sequence of each cluster, the cluster members, and their respective identity scores. The number of clusters for each dataset is:

                         AAA   BIMSB   Heidelberg
Clusters                 443   806     1,462
Transcripts clustered    957   1,859   3,346
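These numbers come from walking the .clstr file, which lists a ">Cluster N" header followed by one member line per sequence. A short sketch of that summary (counting only clusters that actually merged more than one sequence):

```python
def summarize_clstr(lines):
    # Count clusters with more than one member, and the number of
    # transcripts that fell into those clusters, from a CD-HIT .clstr file.
    sizes = []
    for line in lines:
        if line.startswith(">Cluster"):
            sizes.append(0)          # start a new cluster
        elif line.strip():
            sizes[-1] += 1           # one member line per sequence
    multi = [s for s in sizes if s > 1]
    return {"clusters": len(multi), "transcripts clustered": sum(multi)}
```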

Compared to CDHIT, the 'within sequence' numbers from CAP3 are significantly smaller. CDHIT's similarity algorithm is based on a short-word filter and is not as comprehensive as CAP3's, resulting in more relaxed clustering. Read about CDHIT's algorithm in my previous post.

Filter results by length

I've decided to just use the CAP3 results because CDHIT's similarity scoring system seems too relaxed. Since it was designed to work fast on large datasets, it sacrifices a lot of similarity resolution. To filter the transcripts by length, we first look at the length distributions of the transcripts before and after CAP3. Here are the distributions of the transcripts under 1k base pairs:

[Figure: AAA pre-CAP3]  [Figure: AAA post-CAP3]
[Figure: Heidelberg pre-CAP3]  [Figure: Heidelberg post-CAP3]

The length distributions do not really change much pre/post CAP3. CAP3 didn't assemble that many contigs, so it could just be that this small amount of change is not well reflected by these graphs.

However, these graphs do show something interesting. The AAA transcripts have a large peak around 50bp. Aside from that large first peak, the AAA and BIMSB transcripts have very similar length distributions; both seem to have a peak around 400-500bp. The Heidelberg transcripts peak very early, at 100bp.

Based on these graphs, I am just going to bin any transcripts shorter than 100bp from the AAA and BIMSB datasets. Maybe more information can be extracted from these short transcripts later by seeing if they are found in both the AAA and BIMSB datasets.
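The binning step itself is simple. A sketch, with a minimal FASTA reader included and the 100bp cutoff chosen above:

```python
def read_fasta(lines):
    # Minimal FASTA reader yielding (header, sequence) pairs.
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def bin_short_transcripts(records, min_len=100):
    # Split records into (kept, binned) by the 100 bp cutoff; the binned
    # short transcripts are set aside rather than discarded outright.
    kept, binned = [], []
    for header, seq in records:
        (kept if len(seq) >= min_len else binned).append((header, seq))
    return kept, binned
```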

Dealing with Ns in the Heidelberg transcriptome

Out of 28,926 transcripts in the Heidelberg transcriptome, 8,659 have at least one N in them. Here are a few graphs of those 8,659 transcripts:

[Figure: Distribution of percentage of Ns]
[Figure: Count of Ns vs length of transcript]
[Figure: Percentage of Ns vs length of transcript]
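The numbers behind these plots are easy to tabulate. A sketch of the per-transcript statistics, with `records` as (id, sequence) pairs:

```python
def n_stats(records):
    # For every transcript containing at least one N, report
    # (id, N count, transcript length, percent N).
    rows = []
    for header, seq in records:
        n = seq.upper().count("N")
        if n:
            rows.append((header, n, len(seq), 100.0 * n / len(seq)))
    return rows
```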

It doesn't look like there is any correlation between the number of Ns and the length of the transcript. The assembly process used velvet/oases to generate multiple k-mer assemblies, which were merged together again with velvet/oases as pseudo long reads. The 454 and EST sequences were then added as scaffolds in yet another velvet/oases pass.

The large number of Ns is probably due to insufficient coverage between paired reads. But would that be reflected in an N count vs length graph? Can we assume that transcripts with lengths close to the pair-end distance are more likely to contain Ns, since longer assembled transcripts are more likely to consist of multiple scaffolds?

To conserve as much data as possible, I've decided to split any transcript that has a stretch of 10 or more Ns, then get rid of any resulting fragments of 50bp or shorter.
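A sketch of that splitting rule; the `.partK` id suffix is my own convention for the fragments, not something from the original pipeline:

```python
import re

def split_on_ns(records, min_ns=10, min_frag=50):
    # Split each transcript at runs of >= min_ns Ns, then keep only
    # fragments longer than min_frag bp (the kept range in the results
    # below starts at 51 bp).
    out = []
    pat = re.compile("N{%d,}" % min_ns, re.IGNORECASE)
    for header, seq in records:
        frags = [f for f in pat.split(seq) if len(f) > min_frag]
        if len(frags) == 1 and frags[0] == seq:
            out.append((header, seq))  # no long N run: keep as-is
        else:
            out.extend(("%s.part%d" % (header, i + 1), f)
                       for i, f in enumerate(frags))
    return out
```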

8,625 out of the 8,659 transcripts have a stretch of 10 or more Ns. This splitting generated 22,358 fragments, of which 22,026 are longer than 50 bp. In total, the Heidelberg transcriptome after curation consists of 22,026 fragments from the 8,625 split transcripts plus 18,462 unfragmented transcripts.

                       Fragmented   Unfragmented
Number of sequences    22,026       18,462
Mean length            615.93       748.52
Median length          359          480
Range of lengths       51-13,056    101-12,244


With the three transcriptomes individually curated, we now need to assemble them together. Several options are available. We could just use CAP3 again; however, CAP3 will only remove redundancy or extend transcripts, and might even falsely merge two separate transcripts.

Since BIMSB is the most complete transcriptome, perhaps we should use it as a base set of transcripts and only add new transcripts from AAA and Heidelberg.
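One possible route for that (a sketch, not a settled decision): cd-hit-est-2d compares a second dataset against a base dataset and reports the sequences in the second that have no match in the first, which fits the "BIMSB as base" idea. The wrapper below only builds the command; the identity threshold is illustrative.

```python
def cdhit_est_2d_cmd(base, other, out, identity=0.95):
    # cd-hit-est-2d clusters 'other' (-i2) against 'base' (-i) and writes
    # the sequences from 'other' that do not cluster with anything in
    # 'base', i.e. candidate novel transcripts to append to the base set.
    return ["cd-hit-est-2d", "-i", base, "-i2", other, "-o", out,
            "-c", str(identity)]
```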

In the next few days, I will report part 2 of this series.