Thursday, 29 November 2012 at 1:30 pm




Big Data


  Tuesday, 27 November 2012 at 12:18 pm

Recursion is the most common way to traverse a tree data structure. However, there are language-specific limitations to recursion. Python cannot recurse deeper than a configurable limit (1000 frames by default; see sys.getrecursionlimit()). Recursion can also use a lot of memory, because every pending call keeps its own stack frame. An iterative approach using an infinite loop is also possible for tree traversal, but it is often longer to write and requires more conditional checks.

Here I'll go through the basics of using recursion and iteration for tree traversal in Python.
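
As a rough illustration of the two approaches, here is a minimal sketch of a pre-order traversal written both ways. The Node class and the function names are made up for this example, and the iterative version uses an explicit stack with a "while stack" loop rather than a literal infinite loop.

```python
class Node:
    """Hypothetical tree node: a value plus a list of child nodes."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []


def traverse_recursive(node, visited=None):
    """Pre-order traversal using the call stack (limited by the recursion limit)."""
    if visited is None:
        visited = []
    visited.append(node.value)
    for child in node.children:
        traverse_recursive(child, visited)
    return visited


def traverse_iterative(root):
    """Pre-order traversal using an explicit stack instead of recursion."""
    visited = []
    stack = [root]
    while stack:
        node = stack.pop()
        visited.append(node.value)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return visited


tree = Node('a', [Node('b', [Node('d')]), Node('c')])
print(traverse_recursive(tree))
print(traverse_iterative(tree))
```

Both functions visit the nodes in the same order. The iterative one keeps working on trees deeper than the recursion limit, at the cost of managing the stack by hand.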

  Friday, 23 November 2012 at 1:06 pm

Here is a list of some great data visualization libraries in JavaScript:


  • D3.js 
    Binds data to DOM elements and is an agile DOM manipulator. Great for customizing data visualizations. 
  • Recline.js 
    Extremely impressive set of objects that allows you to perform data exploration in a web browser. 
  • Processing.js 
    JavaScript port of the visual programming language Processing.
  • NVD3.js 
    Still a very young library. It uses D3.js to create reusable functions for common charts. There was some controversy earlier this month about its status as an open source project, but that has been resolved now. 
  • Javascript InfoVis Toolkit 
    Not as fancy and flashy as the other libraries. It plots common charts and some uncommon ones. It gets the job done. 
  • Highcharts 
    In my opinion, the best looking charting library.
  • Dygraphs 
    Simple and responsive. Seems to work well with large amounts of data. 

Data manipulation

Specialized graphs


  Thursday, 22 November 2012 at 6:00 pm

Here is a list of science/programming articles I found interesting this week, mostly collected from Reddit, Twitter, and Biostar.



Big Data


  Monday, 19 November 2012 at 4:32 pm

If a Linux process is taking longer than you expected and you want to protect it from hangups (as nohup would) so it won't terminate when you log out, you can:

  • If the process is not in the background, press Ctrl + Z to suspend it, then type 'bg' to resume it in the background.
  • Type 'disown -h' to mark the current job so that it will not get a SIGHUP when you log out or close the terminal. In bash, 'disown -a -h' does the same for all active jobs.

  Thursday, 15 November 2012 at 9:42 pm

With so many different data formats out there for various tasks and programs, it seems like most of what we do is converting between formats. I am sure most data scientists can sympathize.

However, what all these data formats have in common is an underlying pattern. The data may be delimited by commas, tabs, or pipes; certain columns may be set as ids or attributes; or the data may be binary, with the information sitting at known byte offsets. Either way, a pattern has to be there for us to parse the document.

Can we describe this pattern in a set language? Essentially, can we come up with a data format to describe all other data formats? Can we then write a universal parser that takes a description of a data format in this language and parses any file type we want?

A quick search on the internet yielded the Data Format Description Language (DFDL). It is an XML-based language that is used to describe text or binary formats. It doesn't seem to be widely used, as I couldn't find examples of people using it. I am not sure whether they've implemented a universal parser that can take DFDL as input to parse files.

While XML allows for a very rich description of data formats, the verbosity of the schema and its loose structuring leave much to be desired. I think a simple JSON way to describe formats would be better and easier to parse.
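
To make the idea concrete, here is a minimal sketch of what a JSON description of a delimited text format could look like, together with a tiny generic parser driven by it. The description keys ("delimiter", "comment_prefix", "columns") and the function name are invented for this example; they are not part of DFDL or any existing standard.

```python
import json

# Hypothetical JSON description of a tab-delimited file with three typed columns.
FORMAT_JSON = '''
{
  "delimiter": "\\t",
  "comment_prefix": "#",
  "columns": [
    {"name": "id", "type": "str"},
    {"name": "position", "type": "int"},
    {"name": "score", "type": "float"}
  ]
}
'''

# Map the type names used in the description to Python converters.
CASTS = {"str": str, "int": int, "float": float}


def parse_with_description(text, description):
    """Parse delimited text into a list of dicts using a format description."""
    delimiter = description["delimiter"]
    prefix = description.get("comment_prefix")
    columns = description["columns"]
    records = []
    for line in text.splitlines():
        # Skip blank lines and comment/header lines.
        if not line or (prefix and line.startswith(prefix)):
            continue
        fields = line.split(delimiter)
        records.append({col["name"]: CASTS[col["type"]](value)
                        for col, value in zip(columns, fields)})
    return records


description = json.loads(FORMAT_JSON)
sample = "#id\tposition\tscore\nrs1\t100\t0.5\nrs2\t200\t1.25\n"
print(parse_with_description(sample, description))
```

This only covers flat, delimited text. Nested fields and binary offsets would need a richer description.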

So can we describe the popular formats used in the bioinformatics field with a controlled language? I've always found .VCF files to be the most complicated file format in the field. How would we go about creating a language that can describe the genotype fields of a .VCF file?

  Wednesday, 07 November 2012 at 3:20 pm

I came across this new R package called Shiny today. It allows users to build interactive charts/graphs in R and display them in a web browser without much knowledge of JavaScript/HTML/CSS. The user can run the application like this:

R -e "shiny::runApp('~/myApplication')"

A minimal server will run on localhost, serving the application. Looks promising.

Sign up for the beta:

  Tuesday, 06 November 2012 at 7:56 pm

The GL and PL genotype fields in the .VCF file format contain likelihoods (log10-scaled in GL, Phred-scaled in PL) for each possible genotype defined by the REF and ALT bases (columns 4 and 5 of the file). 

The ordering of the allele combinations is defined in the VCF 4.1 specification as:

If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.  In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. 

To get this ordering, you can use this Python snippet:

bases = ['A', 'G', 'C']  # the first base in this list, A, is the REF base
# G and C are the ALT bases
for i in range(len(bases)):
    base = bases[i]
    for a in range(i + 1):
        print(bases[a] + base)

Running this snippet will output the genotypes in the correct order, one per line: AA, AG, GG, AC, GC, CC.
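
The formula from the specification can also be applied directly. Below is a small sketch that computes F(j/k) = (k*(k+1)/2)+j for 0-based allele indices (0 is REF, 1 and up are the ALT alleles) and uses it to place each genotype at its position. The function names are my own, not something defined by the spec.

```python
def genotype_index(j, k):
    """Position of genotype j/k in the GL/PL list, using 0-based allele indices."""
    if j > k:
        j, k = k, j  # the formula assumes j <= k
    return k * (k + 1) // 2 + j


def ordered_genotypes(alleles):
    """Return all diploid genotypes in GL/PL order; alleles[0] is the REF allele."""
    n = len(alleles)
    result = [None] * (n * (n + 1) // 2)
    for k in range(n):
        for j in range(k + 1):
            result[genotype_index(j, k)] = alleles[j] + alleles[k]
    return result


print(ordered_genotypes(['A', 'G', 'C']))
```

Having the index function is handy when you want to look up the likelihood of one specific genotype without listing all of them.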


  Tuesday, 06 November 2012 at 7:16 pm

I was looking for something that would give me allele counts from samtools' mpileup output today. I tried running mpileup into a .vcf file and then using vcftools to get base frequencies/counts, but that didn't work for me: I ended up with 0 counts and 'NaN' (not a number) frequencies. 

I read over the mpileup format and decided to write my own script to get base counts. The script seems to work, but I have a few questions that hopefully someone can address:

  • I interpreted deletions as follows: when there is a -[0-9]+[ACGTNacgtn]+ pattern in the read base column, the next [0-9]+ bases are deleted; the current base is not deleted. Is that correct?
  • Does an insertion mean that the inserted bases sit between the current base and the next base?
  • What does the "^" do? I read over the description several times and still don't understand it (from the pileup format guide):
    Also at the read base column, a symbol `^' marks the start of a 
    read segment which is a contiguous subsequence on the read
    separated by `N/S/H' CIGAR operations.
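
For reference, here is a minimal sketch of the kind of base counting described above. It is not the original script, just an illustration under my reading of the pileup format: '.' and ',' are treated as matches to the reference base, '^' is skipped together with the mapping-quality character that follows it, '$' is skipped, and indel patterns are skipped without being counted. The function name count_bases is made up for this example.

```python
import re
from collections import Counter

_INDEL_LENGTH = re.compile(r'\d+')


def count_bases(ref_base, read_bases):
    """Count bases at one pileup position from its read-base column."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == '^':
            i += 2  # skip '^' and the mapping-quality character after it
        elif c in '+-':
            # Indel: sign, length digits, then that many bases; all skipped here.
            m = _INDEL_LENGTH.match(read_bases, i + 1)
            if m is None:
                i += 1
            else:
                i = m.end() + int(m.group())
        elif c in '.,':
            counts[ref_base.upper()] += 1  # match on forward/reverse strand
            i += 1
        elif c.upper() in 'ACGTN':
            counts[c.upper()] += 1  # mismatch on forward/reverse strand
            i += 1
        else:
            i += 1  # '$', '*' and anything else is not counted
    return counts


print(count_bases('A', '.,^I.$+2AGt-1Cg*'))
```

Whether indels and '*' placeholders should contribute to the counts depends on what you need downstream; this sketch simply ignores them.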