Sunday, 28 April 2013 at 4:22 pm

Genome biology recently hosted a 5 part bioinformatics challenge event, with the last challenge concluding on the 60th anniversary of the discovery of DNA (Thursday, April 25th. 2013).

I was able to solve the first 4 parts pretty easily. Unfortunately, I gave up after half an hour on the last challenge because I was hungry and wanted to go home.

In this post, I'll show what I did to solve/hack/cheat the first 4 challenges.


Step 1

The first challenge requires the contestant to find the longest 7-mer in a given fasta file. I wrote a quick and dirty python script to solve it:

import sys
from Bio import SeqIO
inFile = open(sys.argv[1],'r')
mer = {}
for record in SeqIO.parse(inFile,'fasta'):
        seq = str(record.seq)
        for i in range(len(seq) - 7):
                m = seq[i:i+7]
                if not mer.has_key(m):
                        mer[m] = 0
                mer[m] += 1
largest = 0
motiff = ''
for m, count in mer.items():
        if count > largest:
                largest = count
                motiff = m
print motiff

The answer for this challenge is: TAGCGAC


Step 2

The second challenge required the contestant to find all open read frames in a given genomic sequence, translate all ORFs, sort ORFs by size, then get the 25th amino acids from the top 15 ORFs. A bit more involved, but ultimately pretty trivial if you already have an ORF finding tool.

I used the EMBOSS getorf tool and generated all possible ORFs. Then I used a python script to sort and extract the answer:

import sys
from Bio import SeqIO
inFile = open(sys.argv[1],'r')
seqs = []
for record in SeqIO.parse(inFile,'fasta'):
        seqs.append(str(record.seq))
seqs.sort(key = lambda x : len(x), reverse = True)
print ''.join(x[24] for x in seqs[:15])

The answer is: THESECRETQFLIFE


Step 3

This step required the contestants to find the most differentially expressed gene given a set of fastq files and gene annotations on a genome. I cheated on this step. I didn't bother with mapping the reads or finding differentially expressed genes. 

In these challenges, a contestant advance to the next round by appending the answer of the previous round to the url: http://genomebiology.com/about/update/DNA60_

There were 90 genes in the annotation file, meaning the answer has to be one of the 90 genes. So I extracted the possible gene names from the file and wrote a script to try all possible urls until I found the correct url:

import sys,os
inFile = open('geneNames.data','r')
for line in inFile:
        name = line.strip().upper()
        url = 'http://genomebiology.com/about/update/DNA60_' + name
        cmd = 'curl ' + url +  " > temp"
        os.system(cmd)
        outFile = open('temp','r')
        data = outFile.read()
        outFile.close()
        if data.find('Sorry! We have looked everywhere') == -1:
                print url       
                print 'FOUND'
                sys.exit()

The answer is: CARB

*I would not have done it this way if the possible answers were thousands of genes. I do not codone DDoS-ing the genome biology website.


Step 4

In this step, given a set of 16s RNA metagenomic sequences file, the contestant needs to find the genus of the bacteria that is most abundant. 

Given that this is a genome biology contest, I assumed the data given will be ideal and most likely not real-world, noisy and messy sequences. I simply CD-HIT-EST all the sequences, found the CDHIT cluster with the most members, blasted the cluster representative on NCBI blastn and got my answer.

The answer is: HELICOBACTER








Search

Categories


Archive