Wednesday, 08 August 2012 at 11:20 am

view the demo here
Python script and HTML source is in the post.
Alternatively, you can download a zip file of source code and example data here

I posted some brief code on how to do hierarchical clustering of expression values with SciPy few days ago. I am going to expand on how that's done and also show you how you can visualize it as a heatmap in a browser (works best with Chrome or Safari) with javascript and d3.js. 

I will also post something on visualizing dendrograms in the coming weeks.

 

Input

We will use a very small dataset of 5 rows for ease of explanation: (These are z-scores, hence the negative values)

geneID cond1 cond2 cond3 cond4 cond5 cond6 cond7
geneA -1.4329212 -1.0906895 0.68953205 0.06726148 1.53164572 0.78973271 -0.5545612
geneB -0.5905256 -1.0240392 0.88541103 1.13568288 1.23907981 -0.2573127 -1.3882961
geneC -1.1951908 -0.9463196 0.24280532 -0.2322689 -0.3737134 0.45025184 2.05443572
geneD 1.24941084 1.86599302 -0.5468524 -0.6963787 -0.5777332 -0.6199323 -0.6745070
geneE -0.3257655 2.15308584 -0.1346761 -0.0635392 -0.1081008 -0.0299163 -1.4910876

Let's get this data into a 2d python array:

#open the file assuming the data above is in a file called 'dataFile'
inFile = open('dataFile','r')
#save the column/row headers (conditions/genes) into an array colHeaders = inFile.next().strip().split()[1:] rowHeaders = [] dataMatrix = []
for line in inFile: data = line.strip().split('\t') rowHeaders.append(data[0]) dataMatrix.append([float(x) for x in data[1:]])

#convert native data array into a numpy array
dataMatrix = numpy.array(dataMatrix) 


Distance matrix

Now that we have a numpy array, we can use SciPy's spatial.distance module to compute the pair-wise distances of our data: 

distanceMatrix = dist.pdist(dataMatrix)

You can define different distance metrics in the second paramter of the pdist function. The pdist API has a list of all the possible distances you can measure:

distanceMatrix = dist.pdist(dataMatrix,'hamming') #use hamming function
distanceMatrix = dist.pdist(dataMatrix,'euclidean') #use euclidean function

 If we print out this distance matrix we ge an array of 10 distances:

[ 1.94190068 3.30441329 4.94971629 4.08435735 4.19643258 4.59208862
3.81052537 4.83303044 4.83787386 2.0903242 ]

Since we have 5 genes, we have 4 + 3 + 2 + 1 = 10 possible pair-wise comparisons. (N * (N - 1) / 2).

edit** July 3rd, 2013

It looks like the documentation on linkage() is wrong in stating that it can take in a redundant squareform matrix. As brought to my attention by Hong Bo in the comments, converting the distance matrix to squareform will make the subsequent linkage() function think the distance matrix is the original set of observations. So we will input the condensed distance matrix in directly instead without converting to squareform. 

Now we need to get this distance array into a squareform array for the linkage matrix calculation:

distanceSquareMatrix = dist.squareform(distanceMatrix)

Basically it just converts the the distance array into a redundant square form matrix:

 [ 0. 1.94190068 3.30441329 4.94971629 4.08435735]
[ 1.94190068 0. 4.19643258 4.59208862 3.81052537]
[ 3.30441329 4.19643258 0. 4.83303044 4.83787386]
[ 4.94971629 4.59208862 4.83303044 0. 2.0903242 ]
[ 4.08435735 3.81052537 4.83787386 2.0903242 0. ]


Linkage matrix

To calculate the linkage matrix: (Check the linkage  API for different linkage methods)

linkageMatrix = hier.linkage(distanceMatrix)

This returns:

 [ 0  1  1.94190068  2  ]
[ 3 4 2.0903242 2 ]
[ 2 5 3.30441329 3 ]
[ 6 7 3.81052537 5 ]

Each row of this matrix is a cluster where:

  • The first two numbers are indeces of clusters, explained in more detail below.
  • Third number is the linkage distance
  • Fourth number is the number of genes within the cluster

The original data of 5 genes are represented by indeces 0-4 (5 rows). Any of the first two numbers in a row of the linkage matrix that range from 0-4 are referring to these 5 genes by row number. Numbers greater than 4 are new cluster nodes formed.

For example, the first row representes a cluster of geneA (index = 0) and geneB (index = 1) with a linkage distance of 2.92. The cluster has two genes as represented by the last number in the row.

The third row represents a cluster of geneC (index = 2) and the cluster defined by the first row (index = 5) with a linkage distance of 5.24 and containing 3 genes (geneC + geneA and gene B). Here is the same matrix annotaed with the cluster ids:

 [ 0  1  2.92241962 2  ] # geneA, index 0 + geneB, index 1 -> index 5
[ 3 4 3.17783279 2 ] # geneD, index 3 + geneE, index 4 -> index 6
[ 2 5 5.2442932 3 ] # geneC, index 2 + index 6 -> index 7
[ 6 7 6.34827534 5 ] # index 6 + index 7 -> root


Re-order data matrix according to clustering

To generate a heatmap, we need to reorder the rows of the original data according to the clustering. This order is essentially the ordering of the "leaves" in a dendrogram. If we are to plot a dendrogram with the linkage matrix above we would generate something like this: (excuse the crude representation)

          D E C A B
|_| | |_|
| |__|
|___|

A heatmap would require us to reorder the rows (genes) according to the leaves of this dendrogram. The order would be:

#gene    D   E   C   A   B
#index  3 4 2 0 1

To get this order we can use the 'leaves_list' function:

heatmapOrder = hier.leaves_list(linkageMatrix)
#outputs  [3 4 2 0 1]

Now we need to reorder the original data. To do this, we can use one of numpy's useful slice notation: (note that this only works with numpy arrays)

orderedDataMatrix = dataMatrix[heatmapOrder,:]
#basically what this mean is to take all columns from dataMatrix that
#matches the indeces in heatmapOrder array. It will perform this in the same
#order as the heatmapOrder array, creating a new array in the same order. 

rowHeaders = numpy.array(rowHeaders)
orderedRowHeaders = rowHeaders[heatmapOrder,:]
#do the same for the row headers 


Output JSON data for javascript/D3.js

Now that we have an ordered data matrix, we can start outputting data into a JSON format so we can visualize it in a browser with javascript using D3.js. Refer to my tutorial series on D3.js for more information on loading external data using pseudo javascript libraries.

I am not going to go into great detail on this as you can peruse the javascript/html code yourself. The data structure I am outputting includes:

  • A 2d array in the same order as the sorted data matrix where each element contains the expression z-score, row index, column index.
  • The maximum expression z-score 
  • The minimum expression z-score
  • The ordered row headers
  • The column headers

Here is the entire python script with JSON output at the end of the script: 

import sys, numpy, scipy
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

#import the data into a native 2d python array inFile = open(sys.argv[1],'r') colHeaders = inFile.next().strip().split()[1:] rowHeaders = [] dataMatrix = [] for line in inFile: data = line.strip().split() rowHeaders.append(data[0]) dataMatrix.append([float(x) for x in data[1:]])
#convert native python array into a numpy array dataMatrix = numpy.array(dataMatrix)

#calculate distance matrix and convert to squareform distanceMatrix = dist.pdist(dataMatrix) distanceSquareMatrix = dist.squareform(distanceMatrix)

#calculate linkage matrix linkageMatrix = hier.linkage(distanceSquareMatrix)

#get the order of the dendrogram leaves heatmapOrder = hier.leaves_list(linkageMatrix)

#reorder the data matrix and row headers according to leaves orderedDataMatrix = dataMatrix[heatmapOrder,:] rowHeaders = numpy.array(rowHeaders) orderedRowHeaders = rowHeaders[heatmapOrder,:]
#output data for visualization in a browser with javascript/d3.js matrixOutput = [] row = 0 for rowData in orderedDataMatrix: col = 0 rowOutput = [] for colData in rowData: rowOutput.append([colData, row, col]) col += 1 matrixOutput.append(rowOutput) row += 1 print 'var maxData = ' + str(numpy.amax(dataMatrix)) + ";" print 'var minData = ' + str(numpy.amin(dataMatrix)) + ";" print 'var data = ' + str(matrixOutput) + ";" print 'var cols = ' + str(colHeaders) + ";" print 'var rows = ' + str([x for x in orderedRowHeaders]) + ";"

Save this script as yourName.py. Use it on the expression data, which I assume to be space delimited and output it as a javascript file.

yourName.py yourData > data.js


HTML/Javascript with D3.js

Here is the html/javascript code to display the previously generated output (data.js) as a heatmap. I added some interactivity where mouse over will reveal the gene name and expression values. Code is commented:

<html>
<head>
<script type="text/javascript" src="d3.v2.js"></script>
<script type="text/javascript" src="data.js"></script>
<style>
body {
margin: 0px;
padding: 0px;
font: 12px Arial;
}
</style>
</head>
<body>
<script type="text/javascript">
//height of each row in the heatmap
var h = 5;
//width of each column in the heatmap
var w = 70;

//attach a SVG element to the document's body
var mySVG = d3.select("body")
.append("svg")
.attr("width", (w * cols.length) + 400)
.attr("height", (h * rows.length + 100))
.style('position','absolute')
.style('top',0)
.style('left',0);

//define a color scale using the min and max expression values
var colorScale = d3.scale.linear()
.domain([minData, 0, maxData])
.range(["blue", "white", "red"]);
//generate heatmap rows
var heatmapRow = mySVG.selectAll(".heatmap")
.data(data)
.enter().append("g");

//generate heatmap columns var heatmapRects = heatmapRow
.selectAll(".rect")
.data(function(d) {
return d;
}).enter().append("svg:rect")
.attr('width',w)
.attr('height',h)
.attr('x', function(d) {
return (d[2] * w) + 25;
})
.attr('y', function(d) {
return (d[1] * h) + 50;
})
.style('fill',function(d) {
return colorScale(d[0]);
});

//label columns
var columnLabel = mySVG.selectAll(".colLabel")
.data(cols)
.enter().append('svg:text')
.attr('x', function(d,i) {
return ((i + 0.5) * w) + 25;
})
.attr('y', 30)
.attr('class','label')
.style('text-anchor','middle')
.text(function(d) {return d;});
//expression value label
var expLab = d3.select("body")
.append('div')
.style('height',23)
.style('position','absolute')
.style('background','FFE53B')
.style('opacity',0.8)
.style('top',0)
.style('padding',10)
.style('left',40)
.style('display','none');
//heatmap mouse events
heatmapRow
.on('mouseover', function(d,i) {
d3.select(this)
.attr('stroke-width',1)
.attr('stroke','black')
output = '<b>' + rows[i] + '</b><br>';
for (var j = 0 , count = data[i].length; j < count; j ++ ) {
output += data[i][j][0] + ", ";
} expLab
.style('top',(i * h))
.style('display','block')
.html(output.substring(0,output.length - 3));
})
.on('mouseout', function(d,i) {
d3.select(this)
.attr('stroke-width',0)
.attr('stroke','none') expLab
.style('display','none')
});
</script>
</body>
</html>







Search

Categories


Archive