Tuesday, 28 May 2013 at 12:02 pm

I am in the middle of my last year as a PhD student, working on my thesis and trying to get some last-minute analysis finished. I've also decided to start building up a GitHub profile by working on a few visualization web tools. Hopefully it will add to my CV. The goals for these tools are:

  • Client-side only web tools (HTML/JavaScript/CSS). No server required.
  • Clean and minimalistic design
  • Help bioinformaticians who work at sequencing facilities by giving them an interactive output to give to clients

The only JavaScript dependency I'll be using is D3.js.

The first tool I'll be making is an annotation viewer for visualizing features on a set of sequences. It should be useful for people who want to load 50-100 sequences with protein domain annotations and look at the domain compositions, or who want to load various genomic loci with transcript annotations.

I've been working on it for the past couple of days and I've got most of the feature rendering code finished. You can check out a very preliminary demo here. The sample data is a bunch of SET domain-containing genes and Pfam domain annotations:


It might not look like much, but coding contextual menus that respond smartly to the edges of the window and refactoring the code for loose coupling took some effort.

It will probably only work on Chrome/Safari/Firefox. Each feature and sequence has a contextual menu that lets the user choose which features are shown (left click on features and sequence names). The end goal is to let the user save the rendering as an SVG file for further editing.

Here is my GitHub page.

  Sunday, 28 April 2013 at 4:22 pm

Genome Biology recently hosted a 5-part bioinformatics challenge event, with the last challenge concluding on the 60th anniversary of the publication of the DNA double helix structure (Thursday, April 25th, 2013).

I was able to solve the first 4 parts pretty easily. Unfortunately, I gave up after half an hour on the last challenge because I was hungry and wanted to go home.

In this post, I'll show what I did to solve/hack/cheat the first 4 challenges.

  Tuesday, 19 March 2013 at 9:00 pm

Came across this Y Combinator-backed data analysis site today, Fivetran. It's essentially spreadsheets on steroids. You can upload huge files as spreadsheets and run SQL/Linux-like commands on the data, similar to how you would write formulas in Excel. It uses cloud computing in the backend to perform the computations.

This may not be that useful for seasoned bioinformaticians, but it looks quite promising for biologists who are not computationally inclined and are tired of their spreadsheets crashing when loading thousands of rows. The most useful features are probably going to be the SQL-like joins and the row/column filters the site provides.

However, there does seem to be a bit of a learning curve with their "step" system for analyzing data. Hopefully the generous number of tooltips and tutorials will be appreciated by their intended audience.

  Monday, 25 February 2013 at 8:03 pm

Just discovered this crowd-funding site for scientific research: https://www.microryza.com/

It's an extremely cool idea that gives the public a bit of power in steering the direction of research and at the same time acts as scientific outreach. It also gives undergraduate or graduate students some extra money for reagents or equipment, in addition to being a great platform for communicating their projects.

I do have a few questions that don't seem to be answered on the website:

  • What criteria does the site use to approve proposals?
  • What is really the incentive for the public to donate money? For example, Kickstarter gives rewards for donations.
  • How do academic institutions view this source of money? The site says money is given officially as "gifts" to the academic institution and put under the researcher's project budget.
  • Does this only work for US institutions currently?

Looks like a potentially good platform to crowdfund some sequencing projects...

  Sunday, 10 February 2013 at 1:24 pm

There has been a lot of heated discussion about what exactly the role of bioinformatics is and what contributions it has made to the biological sciences. The discussion started with the rediscovery of Fred Ross's farewell to bioinformatics blog entry, posted in the middle of last year, in which he used many colorful words to describe the inadequacies of the field. From there, it was posted on several biology, bioinformatics, and programming news aggregator sites (this post on BioStar tracks the discussions), sparking debates.

I can't claim to be very experienced in the bioinformatics field. I am currently still trying to finish my PhD. However, I have been a hobbyist programmer for quite a while now and I've also got a decent amount of experience in academia as a lab technician, out-sourced programmer, lab manager, and grad student.

So here are my two cents on this discussion.

  Saturday, 12 January 2013 at 7:39 pm

Aaron Swartz is perhaps best known as a co-founder of Reddit, among many other accomplishments including co-authoring the RSS 1.0 specification, creating web.py, and co-founding Demand Progress. Recently he was indicted on federal charges and was possibly facing a long prison sentence. The charges stemmed from downloading around 4 million academic articles from JSTOR and allegedly planning to distribute them on p2p networks. Aaron Swartz committed suicide on January 11th, 2013 in New York City. He was 26 years old.

  Thursday, 15 November 2012 at 9:42 pm

With so many different data formats out there for various tasks and programs, it seems like most of what we do is converting between formats. I am sure most data scientists can sympathize.

However, what all these data formats have in common is an underlying pattern. That pattern might be that the data is delimited by commas, tabs, or pipes; that certain columns are designated as ids or attributes; or, for binary data, that the information sits at known byte offsets. A pattern has to be there for us to parse the document.

Can we describe this pattern in a set language? Essentially, can we come up with a data format to describe all other data formats? Can we then write a universal parser that takes a description of a data format in this language and parses any file type we want?

A quick search on the internet yielded the Data Format Description Language (DFDL). It is an XML-based language that is used to describe text or binary formats. It doesn't seem to be widely used, as I couldn't find examples of people using it. I am not sure whether anyone has implemented a universal parser that can take DFDL as input to parse files.

While XML allows for a very rich description of data formats, the verbosity of the schema and its loose structuring leave much to be desired. I think a simple JSON way to describe formats would be preferable and easier to parse.
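As a rough illustration of the JSON idea, here is a minimal Python sketch. The description keys ("delimiter", "comment", "columns", "type") and the BED-like example are made up for this sketch and are not part of DFDL or any existing standard. A single generic parser reads the description and uses nothing else to decide how to split and type each record.

```python
import csv
import io
import json

# Hypothetical format description; every key name here is an assumption
# invented for this sketch, not an existing schema.
BED_LIKE = json.loads("""
{
  "delimiter": "\\t",
  "comment": "#",
  "columns": [
    {"name": "chrom", "type": "str"},
    {"name": "start", "type": "int"},
    {"name": "end",   "type": "int"},
    {"name": "name",  "type": "str"}
  ]
}
""")

# Maps the type names used in the description to Python converters.
CASTS = {"str": str, "int": int, "float": float}


def parse(text, spec):
    """Yield one dict per record, driven only by the format description."""
    # Skip blank lines and comment lines before handing rows to csv.
    lines = (line for line in io.StringIO(text)
             if line.strip() and not line.startswith(spec["comment"]))
    for row in csv.reader(lines, delimiter=spec["delimiter"]):
        yield {col["name"]: CASTS[col["type"]](value)
               for col, value in zip(spec["columns"], row)}
```

Supporting another delimited format would then only mean writing another description, not another parser. Binary formats with byte offsets would need a richer description than this sketch covers.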

So can we describe the popular formats used in the bioinformatics field with a controlled language? I've always found .VCF files to be the most complicated file format in the field. How would we go about creating a language that can describe the genotype fields of a .VCF file?
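One possible starting point, sketched in Python: in VCF, the FORMAT column lists colon-separated keys (for example GT:GQ:DP) and each sample column carries the matching colon-separated values, with "." marking a missing value. A description would therefore need a way to say "this column defines the sub-fields of the columns after it". The description keys below ("key_column", "separator", "types") are hypothetical names invented for this sketch.

```python
# Hypothetical description of the genotype fields; key names are assumptions.
GENOTYPE_SPEC = {
    "key_column": "FORMAT",   # column whose value names the sub-fields
    "separator": ":",         # splits both the key list and each sample value
    "types": {"GT": "str", "GQ": "int", "DP": "int"},
}

CASTS = {"str": str, "int": int, "float": float}


def parse_genotypes(format_value, sample_values, spec):
    """Return one dict per sample, keyed by the names in the FORMAT value."""
    keys = format_value.split(spec["separator"])
    parsed = []
    for sample in sample_values:
        fields = {}
        for key, raw in zip(keys, sample.split(spec["separator"])):
            if raw == ".":
                # "." is the VCF marker for a missing value.
                fields[key] = None
            else:
                # Keys without a declared type are kept as strings.
                fields[key] = CASTS[spec["types"].get(key, "str")](raw)
        parsed.append(fields)
    return parsed
```

For example, parse_genotypes("GT:GQ:DP", ["0|0:48:1", "1/1:.:5"], GENOTYPE_SPEC) gives one dict per sample, with GQ set to None for the second sample. The sketch ignores the header lines that declare each key's type and count, which a real description language would also have to cover.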

  Wednesday, 07 November 2012 at 3:20 pm

I came across this new R package called Shiny today. It allows users to build interactive charts/graphs in R and display them in a web browser without much knowledge of JavaScript/HTML/CSS. The user can run the R script like this:

R -e "shiny::runApp('~/myApplication')"

A minimal server will run on localhost, serving the application. Looks promising.

Website: http://rstudio.github.com/shiny/tutorial/
Sign up for the beta: http://shiny.rstudio.org/