Friday, 22 March 2013 at 2:29 pm

Here are a few tips for processing or analyzing data:

  1. Use conventions when naming files and organizing folders. There are no widely adopted conventions for organizing folders, so develop your own and stick to it. A consistent naming scheme or folder structure will let you find that essential file later. There is a great BioStar post on how various people organize their folders.
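
  As a minimal sketch of one such convention (the project and folder names here are hypothetical, not a prescription): one folder per project, with dated subfolders named after each analysis step.

  ```shell
  # Hypothetical convention: one folder per project,
  # dated subfolders named after each analysis step
  mkdir -p speciesX_assembly/2013-03-22_raw_reads
  mkdir -p speciesX_assembly/2013-03-22_trimming
  mkdir -p speciesX_assembly/2013-03-23_blast_vs_nr

  ls speciesX_assembly
  ```

  Six months later, sorting by name gives you the analysis history for free.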

  2. Keep a log file in folders where naming conventions are not enough to describe the files. There will be situations where you are performing many complex operations on your data and your naming conventions won't adequately explain what you did. In these situations, create a simple log file that briefly describes what you've done.
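
  A log entry can be as simple as a dated one-liner appended to a text file (the file names here are made up for illustration):

  ```shell
  # Append a dated, one-line note describing what was done (hypothetical file names)
  echo "$(date '+%Y-%m-%d'): filtered reads.fastq at quality >= 20, wrote reads.filtered.fastq" >> LOG.txt

  cat LOG.txt
  ```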

  3. Keep complex terminal commands in a file and run them using 'sh'. When running software that needs a bunch of flags and parameters, save the command in a file. Chances are you will run the same command again or will need to change a parameter later. The file also serves as a log of your parameters (e.g. e-value thresholds for BLAST, filter settings for samtools...).
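
  A saved command might look like this (the BLAST parameters are just an example, not a recommendation):

  ```shell
  # Save the full command, flags and all, in a file
  cat > run_blast.sh <<'EOF'
  #!/bin/sh
  # blastn of species X ESTs against nr: e-value cutoff 1e-5, tabular output
  blastn -query speciesX.ests.fa -db nr -evalue 1e-5 -outfmt 6 -out ests_vs_nr.tsv
  EOF

  # Re-run it any time with:
  #   sh run_blast.sh
  ```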

  4. Rename your fasta files and keep a separate naming index. This might be a bit controversial, but I almost always rename fasta files that I get. For example, if I have a set of species X ESTs, I'll rename each sequence to X.EST.001, X.EST.002... and keep an index file mapping the new names to the old ones. The reason is that fasta headers tend to be a mess, which in turn can affect how some software or your own scripts handle them.
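
  A rough sketch of this renaming with awk (the input sequences and headers below are invented):

  ```shell
  # Hypothetical messy input
  cat > speciesX.fa <<'EOF'
  >gi|12345|emb|AB000001.1| some messy description [species X]
  ACGTACGT
  >gi|67890|emb|AB000002.1| another messy header
  TTGGCCAA
  EOF

  # Rename sequences to X.EST.001, X.EST.002, ... and keep a
  # tab-separated index mapping the new names to the old headers
  awk '/^>/ { n++; new = sprintf("X.EST.%03d", n)
              printf "%s\t%s\n", new, substr($0, 2) > "speciesX.index.tsv"
              print ">" new; next }
       { print }' speciesX.fa > speciesX.renamed.fa
  ```

  The renamed file is safe to feed to fussy software, and the index lets you map results back to the original headers.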

  5. Ad hoc files do not have to be human legible. When parsing or processing raw data, you will often generate a bunch of temporary ad hoc files with your own formatting. These do not have to be human legible, since you will probably check them with a script anyway. Instead of generating data with a variable number of rows or columns, which can be a bitch to parse, feel free to throw in as many fixed columns or rows as you want for easier parsing down the line.
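
  For example, padding missing values with a placeholder keeps every row the same shape (the file format here is made up):

  ```shell
  # Hypothetical intermediate file: always exactly 4 columns,
  # with "NA" filling in missing hits, so every row parses the same way
  cat > hits.tmp <<'EOF'
  X.EST.001 hitA hitB NA
  X.EST.002 NA NA NA
  X.EST.003 hitA hitB hitC
  EOF

  # Downstream scripts can count non-NA fields without special cases
  awk '{ c = 0; for (i = 2; i <= NF; i++) if ($i != "NA") c++; print $1, c }' hits.tmp
  ```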

  6. Don't get distracted by the method. You can easily waste hours trying to perfect the most elegant script or Linux one-liner for processing your data. Always be aware of the cost-benefit tradeoff of what you are trying to accomplish. Is it really worth an extra 20 minutes of micro-optimization for a 5-second saving? Is it really worth the effort of constructing a cool Linux one-liner when you can accomplish the same task faster with a perl/python script?

  7. Automate EVERYTHING. Use a script or terminal commands even when your data can be processed by hand in a couple of minutes. This is not because of laziness (maybe a little bit); it's for the sake of consistency and error prevention. A tiny mistake made while manually processing data can propagate down the line and eventually become a headache to debug. If you make a mistake with your script, at least the mistake will be consistent and more easily detected.

  8. 'wc -l' is not always accurate. The word count command 'wc' with the '-l' flag counts newline characters in a file. If the last line in your file isn't followed by a newline character, the line count will be one less than you expect. For example, create a file with echo -n 'data' > file.txt, so the file contains a line without a newline after it. Running 'wc -l' on it will report 0 because there are no newline characters in the file.
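
  To see this in action:

  ```shell
  # A one-line file with no trailing newline
  printf 'data' > no_newline.txt
  wc -l < no_newline.txt     # reports 0

  # The same content with a trailing newline
  printf 'data\n' > with_newline.txt
  wc -l < with_newline.txt   # reports 1
  ```
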

  9. There is no perfect data. Biological data is fuzzy and noisy. There needs to be a compromise between striving for perfect data and knowing when to stop massaging the data and start interpreting it.

  10. There aren't any hard rules. Everyone has their own workflow for data wrangling. You will read articles and blogs stating very strict rules about what to do or not to do. I would just take them under advisement and figure out what works best for you.

Tuesday, 19 March 2013 at 9:00 pm

Came across this Y Combinator-backed data analysis site today, Fivetran. It's essentially spreadsheets on steroids. You can upload huge files as spreadsheets and perform SQL/Linux-like commands on the data, similar to how you would write formulas in Excel. It uses cloud computing on the backend to perform the computations.

This may not be that useful for seasoned bioinformaticians, but it looks quite promising for non-computationally inclined biologists who are tired of their spreadsheets crashing when loading thousands of rows. The most useful features are probably going to be the SQL-like joins and filters on rows/columns the site provides.

However, there does seem to be a bit of a learning curve with their "step" system for analyzing data. Hopefully the generous number of tooltips and tutorials will be appreciated by their intended audience.