Friday, 20 September 2013 at 1:49 pm

Enrichment analysis is applied when you have categorical data associated with your dataset: for example, gene ontology terms, Pfam families, molecular pathways, enzymatic activities, etc. The gist of the analysis is to see whether a certain category (GO term, Pfam family, …) is over-represented in a subset of your data.

Let’s take an example. Let’s say I have:

  •  A transcriptome of 20,000 genes.
  • 400 genes out of 20,000 are categorized as “cell cycle”.
  • We found 1,000 genes to be differentially expressed under a certain condition.
  • 300 genes have the “cell cycle” category out of the 1,000 differentially expressed genes.

What is the significance of this? In other words, if we pick 1,000 genes at random from the total pool of 20,000 genes, what are the chances that 300 or more of them will carry the cell cycle category?
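
In its simplest form this is a hypergeometric test. A toy javascript sketch (function names are mine; in practice you'd reach for R's phyper or scipy.stats.hypergeom):

```javascript
// log(n!) by summing logs -- slow but simple, and it never overflows
function logFactorial(n) {
  var s = 0;
  for (var i = 2; i <= n; i++) s += Math.log(i);
  return s;
}

// log of the binomial coefficient C(n, k)
function logChoose(n, k) {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// P(X >= k) when drawing n genes from N total, of which K carry the category
function hypergeomTail(N, K, n, k) {
  var p = 0;
  for (var i = k; i <= Math.min(n, K); i++) {
    p += Math.exp(logChoose(K, i) + logChoose(N - K, n - i) - logChoose(N, n));
  }
  return p;
}
```

With the numbers above, hypergeomTail(20000, 400, 1000, 300) is vanishingly small (the expected overlap by chance is only 20 genes), so "cell cycle" would be called strongly enriched.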

In this post I will go through the basics of how enrichment analysis is performed and some thoughts on how informative this analysis is as applied to biological systems.

  Monday, 02 September 2013 at 5:13 pm

I've been attending the UK NGS/Genomic Sciences meeting since it started four years ago. While there are great talks every year, this year they were able to get Clive Brown to give the keynote talk about Oxford Nanopore. For people in the NGS field, I don't think I need to say much about what Nanopore is (check out Oxford Nanopore's website for more details).

Before the talk, Clive put up a slide telling people he would prefer no tweets about the talk since he would be covering a great deal of technical detail (which he did). I found that kind of strange. It seems like he doesn't want the content of his talk to be public? Why not just have all of us sign an NDA if that's the case? However, I will comply with his request and will not write much about the technical aspects of his talk. Instead, I will talk about what I think of Oxford Nanopore and its potential impact on the field.

  Saturday, 06 July 2013 at 10:11 pm

I put some finishing touches on Seeker: Annotation Viewer last week for visualizing sequence features such as protein domains, primers, etc. Now I am working on a genome browser. Here is an extremely early prototype (there are around 1.8 MB of files to load):

It should work on the latest versions of Chrome/Safari/Firefox. It will most likely NOT work on IE or Opera. Hopefully it won't crash your browser. It is completely client-side: you can distribute these files on a USB stick and anyone with a modern browser will be able to open them.

The loaded data is human chromosome 1, parsed from a .gtf file downloaded from UCSC. The parsed data is around 1 MB (980 KB). These interactions are possible right now:

  • Dragging on the tracks will allow you to scroll through the reference chromosome
  • WASD movement: press 'A' to scroll left, 'D' to scroll right, 'W' to scroll up, and 'S' to scroll down. Anyone who plays computer games should be familiar with this layout.
  • Clicking on the bottom overview bar (blue bar) will let you jump to that position.
  • You can also click and drag on the bottom bar, but depending on how good your computer is, it might be jittery.
  • The line graph on the bottom overview bar represents feature density. The higher the amplitude, the more features there are at that locus.
  • Right now it's displaying 1-million-base-pair windows. I've tested up to 5 million with little trouble on my early 2012 MacBook Pro. I'll probably set the maximum window size to 1 million.
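
As an aside, the feature density line boils down to a simple binning pass over the features. A toy sketch (this is not the browser's actual code, and the names are made up):

```javascript
// Count feature midpoints per bin to drive the overview density line
function featureDensity(features, chromLength, nBins) {
  var bins = [];
  for (var i = 0; i < nBins; i++) bins.push(0);
  for (var j = 0; j < features.length; j++) {
    var mid = (features[j].start + features[j].end) / 2;
    // clamp so a feature ending exactly at chromLength stays in the last bin
    var b = Math.min(nBins - 1, Math.floor((mid / chromLength) * nBins));
    bins[b] += 1;
  }
  return bins;
}
```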

I'll go into more detail about how the rendering works in the future. I've implemented a "rubber-banding" scrolling system instead of the usual Google Maps style tiling system.
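
As a teaser, the gist of the idea in a toy sketch (the real implementation is considerably more involved): render a strip wider than the viewport, just translate it during drags, and only re-render once the view scrolls past the buffered region.

```javascript
// Toy rubber-band scroller: translate cheaply, redraw rarely
function RubberBand(viewWidth, bufferWidth, render) {
  this.offset = 0;               // current translation of the rendered strip
  this.viewWidth = viewWidth;
  this.bufferWidth = bufferWidth;
  this.render = render;          // callback that redraws the strip at a new origin
}
RubberBand.prototype.drag = function (dx) {
  this.offset += dx;
  // dragged past the buffered region: redraw and snap the strip back
  if (Math.abs(this.offset) > this.bufferWidth - this.viewWidth) {
    this.render(this.offset);
    this.offset = 0;
  }
};
```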

  Monday, 01 July 2013 at 09:24 am

I've wanted to learn how to build web apps with webGL ever since I saw the crazy Unreal engine ported to HTML5 and webGL (as a side note, three.js is a very popular javascript 3D library that leverages webGL). It has a lot of potential for data visualization. Imagine a genome browser running on a GPU; it would be able to render millions of objects easily.

I came across this developer preview library today of a framework that allows for data visualizations using webGL and webworkers for multi-threading:

It is only a developer preview. But it looks extremely cool.

Of course the downside (as with anything running in a browser) is cross-browser compatibility. The framework also seems to use webCL, which doesn't look like it will be widely adopted anytime soon. Perhaps someone can make a modified Node-webkit?

  Thursday, 20 June 2013 at 8:02 pm

After several rounds of refactoring, version 1.0 of the annotation viewer is finished. You can use the app here:

Input to the app right now is either an HMMScan domain table result or a tab-delimited file. The tab-delimited file has five columns: sequence name, feature name, start position, end position, and sequence length. There is sample input data in the app for clarity.

I am not sure how cross-browser it is. It was developed mostly with Chrome in mind, but it should work on the latest versions of Chrome/Safari/Firefox.

On the technical side of things, this web app uses D3.js heavily for SVG rendering and many DOM manipulations. All I can say is that D3.js is almost magical in how fast it re-renders objects. I also rolled my own MVC system instead of going with the popular backbone.js, angular.js, etc. frameworks. It was definitely an eye-opening experience to see how much work goes into these MVC systems.

My MVC system is not really a full MVC. A more proper description is a view-centric MVC.

Components like menus, checkboxes, sliders, and drop-downs were built with a data binding system that allows them to react to changes in the data. These components are the view of the MVC pattern. However, there are no formal models in this system, hence "view-centric". Data are just native javascript objects or arrays, allowing JSON-typed input. When the data is bound to a view, methods are added to the data that update the view whenever the data changes. Yes, I am aware that adding methods to the data object is dirty and a hack.

Here is an example of this view-centric MVC system:

var data = {'name':'next gen sequencing conference','attending':false};
var checkbox = new seeker.checkbox();
checkbox.bind({'text':data,'checkbox':data}, {'text':'name','checkbox':'attending'});

The data is an object consisting of two key:value pairs. To bind this piece of data to a checkbox component, where the label corresponds to "name" and the checkbox state corresponds to "attending", we use the .bind function. This function takes two argument objects: data and keys.

Specific keys let the checkbox component understand which data corresponds to the label and which to the checkbox. Both the 'text' key, which corresponds to the label, and the 'checkbox' key, which corresponds to the checkbox itself, are bound to the "data" object. The keys in the data object that correspond to 'text' and 'checkbox' are 'name' and 'attending'.

This system is a bit unwieldy in argument construction. I might have to mess around with that part to make it more elegant. I also still have to optimize data unbinding; I am sure there are tons of memory leaks right now.

  Friday, 14 June 2013 at 10:00 am

Any developer who deals with a lot of javascript and CSS understands the pain involved in positioning an element exactly right on a page while also having it respond well to window resizes.

A common frustration is creating a "drop-up" menu, where a selection menu is displayed above a button on user click. The height of the menu needs to be known to correctly position it above the button. If the menu is dynamically generated, the height can change every time the user clicks the button.

The height of the menu can be obtained from the element's offset properties (offsetHeight, offsetWidth, ...). However, these properties are only available if the element is actually rendered (laid out) by the browser; for a hidden element they read as zero.

A common technique to hide an element while still keeping it rendered (and still available to screen readers) is to simply move it off the screen: = '-10000px';

Versus the CSS method, which removes the element from layout entirely (and hides it from screen readers): = 'none';

A good overview of various DOM hiding methods can be found here.

I don't really like the offscreen method; it is kind of a hack, and I can envision new browser versions one day breaking web apps that use it. However, it seems to be the only effective way around this problem.

Is there any performance difference between the offscreen and CSS methods? I set up a jsperf test to find out:

I created a new wrapped DOM class and prototyped hide() and offscreen() functions for the CSS and offscreen methods respectively. Then I created 100 wrapped DOM elements as the test setup. 

For the two test cases, I applied either hide() or offscreen() to the 100 DOM elements. The offscreen() method is almost 74% slower than hide(). I expected it to be slower, since offscreen elements still participate in layout, but 74% is a pretty large performance hit considering most modern web apps can easily have a few hundred elements.

  Friday, 07 June 2013 at 09:25 am

I've been slowly working on my javascript library for visualizing bioinformatics data. I've decided not to use too many external frameworks or libraries, for two reasons: 1) it is a great learning opportunity for figuring out the nuts and bolts of how javascript runs in browsers; 2) there really aren't many javascript frameworks designed to handle a lot of data, simply because browsers were traditionally not seen as a platform for working with tons of data.

There are generally two camps among javascript framework developers for DOM creation and manipulation. One camp extends the DOM via the prototype, essentially defining new methods on native element/node objects. The other camp wraps the DOM by creating a new object which itself creates a native DOM element/node. All methods of the new object are then applied to the internal DOM element/node.

I prefer the wrapped DOM approach because it offers more flexibility in having private variables and lets me feel secure about not colliding with native methods. However, the downside to wrapped DOM is the computational overhead of creating a wrapper for every new DOM element/node. How much overhead is it really?
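
For reference, a bare-bones version of the wrapped DOM pattern (hypothetical; in a browser the node argument would come from document.createElement):

```javascript
// Wrap a native node in a new object; all methods delegate to the internal node
function Wrapped(node) {
  this.node = node;        // the internal native element
  this._dirty = false;     // "private" state a raw DOM node can't carry as cleanly
}
Wrapped.prototype.setText = function (text) {
  this.node.textContent = text;  // delegate to the wrapped node
  this._dirty = true;
  return this;                   // chainable, like jQuery or D3
};
```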

I've set up a quick javascript benchmark comparing native DOM creation to wrapped DOM creation with varying numbers of prototyped methods:

This test contains six cases, where the first is native DOM creation with a simple document.createElement(). The other cases are wrapped DOM objects with 1, 5, 10, 15, and 100 prototyped methods. I expected the wrapped DOM to be more expensive than the native object, but does the number of prototyped methods matter during object creation?

I get inconsistent results across multiple runs of the test. In the worst cases I get around 30% of native performance when creating wrapped DOMs; in the best cases wrapped and native perform similarly. I am not exactly sure why. It might have something to do with the browser's garbage collector, as the node might be marked for collection after every iteration (see the update below).

However, the number of prototyped methods doesn't seem to make much of a difference. In fact, the object with 100 prototyped methods often performed best out of the wrapped DOMs.

Even though a possible 25% decrease in performance is quite a lot, that is still equivalent to around 300,000 creations per second, which is probably more than enough for rendering application UI elements. The next step is to see how wrapped SVG elements compare with native SVG elements.


I've updated the jsperf test to append each new node to the DOM after creation, to make sure it doesn't get collected by the garbage collector. After each case, the nodes are removed for the next case. Now I get more consistent results: wrapped DOM runs at around 75% of native DOM performance, at roughly 300,000 creations and insertions per second.

  Tuesday, 28 May 2013 at 12:02 pm

I am in the middle of my last year as a PhD student, working on my thesis and trying to get some last minute analysis finished. I've also decided to start building up a github profile by working on a few visualization web tools. Hopefully it will add to my CV. The goal of these tools will be to:

  • Client-side html/javascript/css only web tools. No server required.
  • Clean and minimalistic design
  • Help bioinformaticians who work at sequencing facilities by giving them an interactive output to give to clients

The only javascript dependency I'll be using is D3.js.

The first tool I'll be making is an annotation viewer for visualizing features on a set of sequences. Something that's useful for people who want to load 50-100 sequences with protein domain annotations and look at the domain compositions; or for loading various genomic loci with transcript annotations.  

I've been working on it for the past couple of days and I've got most of the feature-rendering code finished. You can check out a very preliminary demo here. The sample data is a bunch of SET-domain-containing genes with PFAM domain annotations:

It might not look like much, but coding contextual menus that respond smartly to the edge of the window, and refactoring the code for loose coupling, took some effort.

It will probably only work on Chrome/Safari/Firefox. There are contextual menus for each feature and sequence that let the user control which features are shown (left-click on features and sequence names). The end goal is to let the user save the rendering as an SVG file for further editing.

Here is my github page.