Wednesday, 17 October 2012 at 10:24 am

I recently wrote a script to bootstrap correlations between RNA-seq datasets. The script randomly chooses 50 pairs of expression values from two datasets and calculates a Spearman correlation. It then repeats this process 1000 times, resulting in a file of 1000 correlation values.
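The bootstrap step can be sketched roughly like this (the function name, sample counts, and defaults are my own illustration, not the author's actual code):

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman(x, y, n_pairs=50, n_iter=1000, seed=0):
    """Sample n_pairs expression values (with replacement) from two
    equal-length datasets and compute a Spearman correlation, n_iter times."""
    rng = np.random.RandomState(seed)
    x, y = np.asarray(x), np.asarray(y)
    rhos = []
    for _ in range(n_iter):
        idx = rng.randint(0, len(x), size=n_pairs)  # random pair indices
        rho, _pval = spearmanr(x[idx], y[idx])
        rhos.append(rho)
    return rhos  # 1000 correlation values to write out, one per line
```

Each call produces the list of values that ends up in one output file.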

The input to this script is RNA-seq data in CSV format for 10 conditions. I wanted all non-redundant pair-wise bootstrap correlations among the 10 conditions, so I used the itertools.combinations function to get every possible pair-wise combination.
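For 10 conditions that works out to 10 choose 2 = 45 pairs; a quick sketch (the condition names here are made up):

```python
import itertools

conditions = ['cond%d' % i for i in range(1, 11)]  # 10 conditions
pairs = list(itertools.combinations(conditions, 2))  # non-redundant pairs
print(len(pairs))  # 45
```

Each tuple in pairs drives one bootstrap comparison and one output file.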

The script took 48.7 seconds to run in total, producing 45 files, one per pair-wise comparison.

I wasn't satisfied with the run speed, so I looked up how to run multiple processes in Python and rewrote the script. The difference between the single-thread and multi-thread scripts is literally 5 lines: instead of running the bootstrap correlation on each comparison one after another, I ran them 10 at a time with the multiprocessing module.
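A minimal sketch of farming comparisons out with multiprocessing.Process (the worker function and pair list are placeholders for the actual bootstrap code):

```python
import multiprocessing

def worker(pair):
    # stand-in for the real bootstrap-correlation work on one pair
    print('bootstrapping %s vs %s' % pair)

def run_parallel(pairs):
    """Launch one process per comparison, then wait for all of them."""
    procs = [multiprocessing.Process(target=worker, args=(p,)) for p in pairs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # block until this process finishes
    return [p.exitcode for p in procs]

if __name__ == '__main__':
    run_parallel([('a', 'b'), ('a', 'c'), ('b', 'c')])
```

The join() calls are what make the parent actually wait for the children, which matters for the update below.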

The multi-threaded version turned out to take 1.09 seconds to run on 10 threads!

I have no idea why 10 threads made the run-time improve by almost 45X (48.7 s down to 1.09 s). I am running this script on a CentOS server with 24 cores and plenty of RAM.

The two versions of the bootstrap correlation script are included here. The scripts are commented with what the input data should look like. They require Python 2.6+, numpy, and scipy.

single-thread version

multi-thread version

update (October 17th, 2012)

As Aaron correctly pointed out in the comments, the multi-thread script is actually running all of the processes at once. The nested for loops launch 10 processes at a time, but they do not wait for the first 10 to finish before launching the next 10.

And really, launching 10 processes at a time isn't how it should work in the first place. I should divide the total number of jobs by the number of threads, round up, and give each thread that many jobs. For the 45 comparisons in this bootstrap script distributed among 10 threads, that is 4.5 jobs per thread: rounding up gives 5 jobs per thread across 9 threads; rounding down gives 4 jobs per thread across 11 threads (with the last thread taking 5 jobs).
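The chunking arithmetic above can be sketched like this (function name is my own; the numbers match the post):

```python
import math

def chunk_jobs(jobs, n_workers):
    """Split jobs into chunks of ceil(len(jobs)/n_workers) jobs each."""
    size = int(math.ceil(len(jobs) / float(n_workers)))
    return [jobs[i:i + size] for i in range(0, len(jobs), size)]

chunks = chunk_jobs(list(range(45)), 10)
print([len(c) for c in chunks])  # 9 chunks of 5 jobs each
```

Rounding up to 5 jobs per chunk means only 9 worker processes are needed for the 45 comparisons.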

Here is a corrected multi-threading script that takes 5.8 seconds to run. I guess this is more like what I was expecting. Good to know that I can run all 45 threads at once and finish in ~1 second, though.
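An alternative worth noting: multiprocessing.Pool handles this job distribution automatically, so the chunking doesn't have to be done by hand. A sketch (the worker is a stand-in for the real bootstrap code):

```python
import multiprocessing

def worker(pair):
    # placeholder for the real bootstrap-correlation work on one pair
    return 'done: %s vs %s' % pair

if __name__ == '__main__':
    pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]  # 45 pairs
    pool = multiprocessing.Pool(processes=10)
    results = pool.map(worker, pairs)  # blocks until all 45 jobs finish
    pool.close()
    pool.join()
    print(len(results))  # 45
```

Pool.map splits the 45 jobs across the 10 worker processes and, unlike the nested-loop version, waits for everything to finish before returning.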

updated multi-thread script