Wednesday, 18 July 2012 at 12:07 pm

XSQ files are binary files conforming to the HDF5 format used to store ABI SOLiD's color-space and quality data. Manipulating the data can be done with NGS_plumbing and XSQ_tools. If you just want to do a simple converstion or basic stats on the XSQ file, I recommend to just use NGS_plumbing. In this post, I will go through how the data is stored in the XSQ file and how to access it with python. 

 

Requirements

 

Using h5py

Here is how to open a file for reading with h5py:

import numpy
import h5py

xsq = h5py.File('path to your file','r') 

The variable xsq is essentially now a python dictionary formatted according to the XSQ structure.

You can use these standard python dictionary commands to get key, value data:

xsq.keys()   #list of keys
xsq.value    #value of the dictionary if there is data present 

To get an overview of the internal file structure, you can use the python command line and go through it by printing the keys and values. You can also download HDF5View.

 

XSQ file structure

There are plenty of other information like the run meta data and filtering data in the XSQ file, however we are only concerned with getting the sequence and quality scores for now.

Here are the key pieces of data within an XSQ file: 

barcode -> panels ->  adaptor -> ColorCallQV
                  -> fragment -> xylocation  

HDF5 format is essentially just a hierarchical format similar to how folders and files work on most operating systems. The above rough scheme shows the hierarchical structure. The uppermost groups (similar to folders) of a XSQ file are barcodes. If there was only one barcode, there will only be one group available called "DefaultLibrary".

Within the barcode groups, there are thousands of groups (around 2000) corresponding to the panels of flow cell. Each panel group contain groups for various adaptors (F3, R3, F5..) depending on your sequencing chemistry. Within the adaptor groups contain the colorspace sequence and quality data (in the ColorCallQV group). It also contain a group called "fragment" containing x, y coordinates of the read. 

Using python dictionary syntax we can access the data by:

#return a numpy array of bit-encoded colorspace/quality of the first read in the F3 adaptor of panel 1020
xsq['DefaultLibrary']['1020']['F3']['ColorCallQV'].value[0]

#get xylocation of the above read
xsq['DefaultLibrary']['1020']['Fragments']['xylocation'].value[0]

Read IDs are named based on the flow cell panel, x, y coordinate, and adaptor sequence. On old BioScope software, input .csfsata and .qual files needed to have read ids in numerical order. This format and order is also how SOLiD outputs it's .csfasta and .qual files:

x_y_panel_adaptor

The new SOLiD 5 sequencer outputs read IDs in a slightly different format: *this is wrong

panel_x_y_adaptor

The current LifeScope software doesn't seem to care how the reads are named, so I wouldn't worry too much about naming the reads correctly for LifeScope. They just have to be unique.


Parsing bit-encoded data

The colorspace sequence and quality scores are bit-encoded where the first 6 bits of the byte (8 bits in a byte) is the quality score and the last 2 bits contain information on the sequence. Here is how to get the sequence and quality:

#get the bit-compressed data
bitData = xsq['DefaultLibrary']['1020']['F3']['ColorCallQV'].value[0]

#parse the first value of the data array for sequence using the AND bitwise operator
sequence = bitData[0] & 3

#parse the first value of the data array for quality score by shifting the bits 2 to the right
quality = bitData[0] >> 2 


Putting it all together

The following script will parse the header, color-space sequence, and quality score from an XSQ file:

import sys
import numpy
import h5py

xsq = h5py.File(sys.argv[1],'r')

for barcode, barcodeData in xsq.items():
	if barcode != 'RunMetadata':
		for panelID, panelData in barcodeData.items():
			numReads = len(panelData['F3'])

			for i in range(numReads):
				seqData = panelData['F3']['ColorCallQV'][i]
				xyLocation = panelData['Fragments']['yxLocation'][i]
                                #modify this accordingly
				readHeader = panelID + "_" + str(xyLocation[0]) + "_" + str(xyLocation[1]) + "_F3"

				readSeq = []
				readQual = []

				for d in seqData:
					readSeq.append(str(d & 3))
					readQual.append(str(d >> 2))

				#now do what you want with readSeq, readQual and readHeader variables

		


Conclusion

While the HDF5 format is great for storing read data due to it's fast read access, the internal data structure of XSQ files are still a bit muddy. ABI should probably have thought ahead and made a more rigid structure that will allow reads from various platforms to be stored. As of now, the XSQ isn't platform agnostic as not all sequencing platforms will have panels, x, y locations defined in the same way as ABI SOLiD.









Search

Categories


Archive