Sunday, 28 April 2013 at 4:22 pm

Genome Biology recently hosted a 5-part bioinformatics challenge event, with the last challenge concluding on the 60th anniversary of the discovery of the structure of DNA (Thursday, April 25th, 2013).

I was able to solve the first 4 parts pretty easily. Unfortunately, I gave up after half an hour on the last challenge because I was hungry and wanted to go home.

In this post, I'll show what I did to solve/hack/cheat the first 4 challenges.

  Thursday, 18 April 2013 at 10:03 am

When using hmmscan with various HMM databases (Pfam, TIGRFAMs, SMART, ...), you can choose a thresholding method for filtering out false positives:

  • Gathering threshold - Introduced and mainly used by Pfam. This is the threshold Pfam curators manually determined for inclusion in the HMM alignment. The main criterion for inclusion is minimizing domain/sequence overlaps with other protein families. 
  • Trusted cutoff - Score of the lowest-scoring sequence within the HMM alignment. 
  • Noise cutoff - Score of the highest-scoring sequence that is NOT in the HMM alignment, obtained by scanning all other protein family sequences with the HMM in question.

Anything scoring above the gathering threshold or trusted cutoff is most likely a true positive, as these are the strictest cutoffs. Anything between the noise cutoff and the gathering/trusted cutoff is a maybe. Anything below the noise cutoff can be discarded as a false positive.
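In practice, hmmscan will apply the curated thresholds for you via the --cut_ga, --cut_tc, and --cut_nc flags. If you want to triage hits yourself from raw scores, the logic above boils down to something like this sketch (the function name and the example cutoff values are made up for illustration; real cutoffs come from the model file):

```python
def classify_hit(score, ga, nc):
    """Triage an HMM hit score against Pfam-style cutoffs.

    ga: gathering threshold (curated inclusion threshold)
    nc: noise cutoff (highest score seen outside the family)
    Typically nc <= ga for a well-curated family.
    """
    if score >= ga:
        return "confident"   # at or above the curated inclusion threshold
    elif score > nc:
        return "borderline"  # between noise and gathering cutoffs: a maybe
    else:
        return "discard"     # at or below the noise cutoff

# Hypothetical cutoff values, for illustration only
print(classify_hit(45.0, ga=25.0, nc=18.0))  # confident
print(classify_hit(20.0, ga=25.0, nc=18.0))  # borderline
print(classify_hit(10.0, ga=25.0, nc=18.0))  # discard
```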

  Friday, 22 March 2013 at 2:29 pm

Here are a few tips when processing or analyzing data:

  1. Use consistent conventions when naming files and organizing folders. There are no widely adopted standards for organizing folders, so develop your own and stick to it. A consistent naming scheme and folder structure will let you find that essential file later. There is a great BioStar post on how various people organize their folders.

  2. Keep a log file in folders where naming conventions are not enough to describe the files. There are going to be situations where you are performing many complex operations on your data and your naming conventions won't adequately explain what you did. In these situations, create a simple log file that briefly describes what you've done.

  3. Keep complex terminal commands in a file and run them using 'sh'. When running software that needs a bunch of flags and parameters, save the command in a file. Chances are you will be running the same command again or will need to change a parameter later. The file also serves as a log of your parameters (e.g. e-value thresholds for BLAST, filter settings for SAMtools...).

  4. Rename your fasta files and keep a separate naming index. This might be a bit controversial, but I almost always rename fasta files that I get. For example if I have a set of species X ESTs, I'll rename each sequence to X.EST.001, X.EST.002... And I'll keep an index file for mapping the new names to the old names. The reason for this is that fasta headers tend to be kind of a mess and in turn can affect how some software or your own scripts deal with them. 

  5. Ad hoc files do not have to be human-readable. When parsing or processing raw data, you will often generate a bunch of temporary ad hoc files with your own formatting. These do not have to be human-readable, as you will probably check them with a script anyway. Instead of generating data with a variable number of rows or columns, which can be a pain to parse, feel free to throw in as many fixed columns or rows as you want for easier parsing down the line.

  6. Don't get distracted by the method. You can easily waste hours trying to perfect the most elegant script or Linux one-liner for processing your data. Always be aware of the cost-benefit tradeoff of what you are trying to accomplish. Is it really worth the extra 20 minutes of micro-optimization for a 5-second saving? Is it really worth the effort of constructing a clever Linux one-liner when you can accomplish the same task faster with a Perl/Python script? 

  7. Automate EVERYTHING. Use a script or terminal commands even when your data can be processed by hand in a couple of minutes. This is not because of laziness (maybe a little bit). It's for the sake of consistency and error prevention. A tiny mistake made while manually processing data can propagate down the line and eventually become a headache to debug. If you make a mistake with a script, at least the mistake will be consistent and more easily detected.

  8. 'wc -l' is not always accurate. The word count command 'wc' with the '-l' flag counts newline characters in a file. If the last line in your file isn't followed by a newline character, the line count will come up one short. For example, if you create a file by redirecting the output of echo -n 'data' into it, the file contains one line of text but no newline characters, so 'wc -l' on it reports 0.
  9. There is no perfect data. Biological data is fuzzy and noisy. There needs to be a compromise between striving for perfect data and knowing when to stop massaging the data and start interpreting it.
  10. There aren't any hard rules. Everyone has their own workflow for data wrangling. You will read articles and blogs stating very strict and hard rules of what to do or not to do. I would just take them under advisement and figure out what works best for you.
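To illustrate tip 4, here is a minimal sketch of a renaming script, assuming plain FASTA input. The file names, the prefix, and the function name are all hypothetical; the index file maps each new name back to the original header:

```python
def rename_fasta(in_path, out_path, index_path, prefix="X.EST"):
    """Rewrite FASTA headers as PREFIX.001, PREFIX.002, ... and
    record the new-to-old name mapping in a tab-separated index file."""
    count = 0
    with open(in_path) as fin, \
         open(out_path, "w") as fout, \
         open(index_path, "w") as fidx:
        for line in fin:
            if line.startswith(">"):
                count += 1
                new_name = "%s.%03d" % (prefix, count)
                old_name = line[1:].strip()
                fidx.write("%s\t%s\n" % (new_name, old_name))
                fout.write(">%s\n" % new_name)
            else:
                # sequence lines pass through untouched
                fout.write(line)
```

Running it on a file whose first header is ">gi|12345| some messy description" would emit ">X.EST.001" in the renamed file and "X.EST.001&lt;tab&gt;gi|12345| some messy description" in the index, so you can always recover the original names.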

  Tuesday, 19 March 2013 at 9:00 pm

Came across Fivetran today, a Y Combinator-backed data analysis site. It's essentially spreadsheets on steroids. You can upload huge files as spreadsheets and perform SQL- and Linux-like commands on the data, similar to how you would write formulas in Excel. It uses cloud computing on the backend to perform the computations. 

This may not be that useful for seasoned bioinformaticians, but it looks quite promising for non-computationally inclined biologists who are tired of their spreadsheets crashing when loading thousands of rows. The most useful features are probably going to be the SQL-like joins and the row/column filters the site provides. 

However, there does seem to be a bit of a learning curve with their "step" system for analyzing data. Hopefully the gratuitous amount of tool tips and tutorials will be appreciated by their intended audience.

  Monday, 25 February 2013 at 8:03 pm

Just discovered a crowd-funding site for scientific research.

It's an extremely cool idea that gives the public a bit of power in steering the direction of research while at the same time serving as scientific outreach. It also allows undergraduate or graduate students to raise some extra money for reagents or equipment, in addition to being a great platform for communicating their projects.

I do have a few questions that don't seem to be answered on the website:

  • What criteria does the site use to approve proposals?
  • What is really the incentive for the public to donate money? For example, Kickstarter gives rewards for donations.
  • How do academic institutions view this source of money? The site says money is given officially as "gifts" to the academic institution and put under the researcher's project budget. 
  • Does this only work for US institutions currently?

Looks like a potentially good platform to crowdfund some sequencing projects...

  Sunday, 10 February 2013 at 1:24 pm

There has been a lot of heated discussion on what exactly the role of bioinformatics is and what contributions it has made to the biological sciences. The discussion started with the re-discovery of Fred Ross's "farewell to bioinformatics" blog entry posted in the middle of last year, in which he used many colorful words to describe the inadequacies of the field. From there, it was posted on several biology, bioinformatics, and programming news aggregator sites (this post on BioStar tracks the discussions), sparking debates.

I can't claim to be very experienced in the bioinformatics field. I am still trying to finish my PhD. However, I have been a hobbyist programmer for quite a while now, and I've also got a decent amount of experience in academia as a lab technician, out-sourced programmer, lab manager, and grad student. 

So here is my two-cents on this discussion.

  Monday, 04 February 2013 at 3:18 pm

For PhD students currently writing their thesis/dissertation, try out an online LaTeX editor: it automatically saves your work and lets you compile on the fly and view the resulting .pdf. Here is a pretty standard LaTeX template for any report-type document:

\documentclass{report}

\title{my thesis}
\author{my name}
\date{my date}

\begin{document}
\maketitle

\section{Organism X biology}
\subsection{Life cycle}
\subsubsection{Phase 1}
My text content
\subsubsection{Phase 2}
My text content

\section{Organism X bioinformatics}
\subsection{Available Data}
My text content

\chapter{Organism X genome assembly}
My text content

\end{document}



  Tuesday, 22 January 2013 at 12:47 pm

The star (*) symbol can be used before a list or other sequence in Python to mark it for unpacking into separate arguments when passing it to a function. For example:

def doSomething(x, y, z):
    return x + y + z

data = [1,2,3]

#Your data is in a list and you want to use doSomething on it.
#The annoying way would be to do this:
doSomething(data[0], data[1], data[2]) #returns 6

#Using star notation
doSomething(*data) #returns 6
The star notation before the 'data' variable unpacks the list into 3 separate arguments for the function to process. 

Similarly, you can use double star (**) notation to unpack a dictionary into keyword arguments. Note that the keys have to match the function's parameter names:

dataDict = {'x':1, 'y':2, 'z':3}
doSomething(**dataDict) #same as doSomething(x=1, y=2, z=3), returns 6

You can also write a function with a starred parameter to accept a variable number of arguments:

def doAnything(*myValues):
    return sum(myValues)

doAnything(1,2,3,4) #returns 10
doAnything(1,2) #returns 3
doAnything(1) #returns 1 

A useful application of star notation is transposing tables (making rows into columns and columns into rows):

#data of two rows and 4 columns
#1 2 3 4
#5 6 7 8
data = [(1,2,3,4),(5,6,7,8)]

transposedData = zip(*data)
#transposedData = [(1, 5), (2, 6), (3, 7), (4, 8)]