Sunday, 28 April 2013 at 4:22 pm

Genome Biology recently hosted a five-part bioinformatics challenge, with the last challenge concluding on the 60th anniversary of the discovery of the structure of DNA (Thursday, April 25th, 2013).

I was able to solve the first 4 parts pretty easily. Unfortunately, I gave up after half an hour on the last challenge because I was hungry and wanted to go home.

In this post, I'll show what I did to solve/hack/cheat the first 4 challenges.

  Thursday, 18 April 2013 at 10:03 am

When using hmmscan with various HMM databases (Pfam, TIGRFAM, HMMSmart, ...), you can choose a thresholding method for filtering out false positives:

  • Gathering threshold - Introduced and mainly used by Pfam. This is the threshold Pfam curators manually determine for inclusion in the HMM alignment. The main criterion for inclusion is minimizing domain/sequence overlaps with other protein families.
  • Trusted cutoff - Score of the lowest-scoring sequence within the HMM alignment.
  • Noise cutoff - Score of the highest-scoring sequence that is NOT in the HMM alignment. Obtained by scanning all other protein family sequences with the HMM in question.
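In HMMER3, these three methods correspond to the hmmscan options `--cut_ga`, `--cut_tc` and `--cut_nc`. Here is a minimal sketch of building the command line from Python. The helper name `build_hmmscan_command` and the file names are placeholders made up for illustration, and the chosen database must actually carry the relevant cutoffs for the option to work.

```python
import shlex

# Map each thresholding method to its HMMER3 hmmscan option.
CUTOFF_FLAGS = {
    "gathering": "--cut_ga",
    "trusted": "--cut_tc",
    "noise": "--cut_nc",
}


def build_hmmscan_command(method, hmm_db, fasta, domtblout):
    """Return the hmmscan argument list for the chosen cutoff method.

    hmm_db, fasta and domtblout are hypothetical paths supplied by the caller.
    """
    if method not in CUTOFF_FLAGS:
        raise ValueError(f"unknown thresholding method: {method!r}")
    return [
        "hmmscan",
        CUTOFF_FLAGS[method],
        "--domtblout", domtblout,  # per-domain tabular output
        hmm_db,
        fasta,
    ]


if __name__ == "__main__":
    cmd = build_hmmscan_command(
        "gathering", "Pfam-A.hmm", "proteins.fasta", "hits.domtbl"
    )
    # Print the command; pass `cmd` to subprocess.run to actually execute it.
    print(shlex.join(cmd))
```

The list form can be handed straight to `subprocess.run(cmd, check=True)` once hmmscan is installed and the database has been prepared with hmmpress.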

Anything scoring above the gathering threshold or trusted cutoff is most likely not a false positive, as these are the strictest cutoffs. Anything between the noise cutoff and the gathering/trusted cutoff is a maybe. Anything below the noise cutoff can be discarded as a false positive.
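The three-way rule above can be sketched as a small function. This is only an illustration: the name `classify_hit` and the labels are made up here, and treating a score exactly equal to a cutoff as passing that cutoff is an assumption, not something the rule itself specifies.

```python
def classify_hit(score, strict_cutoff, noise_cutoff):
    """Classify a bit score against a family's cutoffs.

    strict_cutoff is the gathering threshold or trusted cutoff;
    noise_cutoff is the family's noise cutoff.
    Assumption: a score equal to a cutoff counts as passing it.
    """
    if noise_cutoff > strict_cutoff:
        raise ValueError("noise cutoff should not exceed the strict cutoff")
    if score >= strict_cutoff:
        return "likely true"
    if score >= noise_cutoff:
        return "maybe"
    return "discard"


if __name__ == "__main__":
    # Hypothetical cutoffs and scores, for illustration only.
    for score in (30.0, 22.5, 10.0):
        print(score, classify_hit(score, strict_cutoff=25.0, noise_cutoff=20.0))
```

In practice the cutoffs differ per family, so each hit would be checked against the values stored with its own HMM rather than a single global pair.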