Tuesday, 06 November 2012 at 7:16 pm

I was looking for something that'll give me allele counts from samtools' mpileup output today. I've tried: mpileup into a .vcf file. -> use vcftools to give me base frequencies/counts. But that didn't seem to work for me. I ended up with 0 counts and 'NaN' (not a number) frequencies. 

I read over the mpileup format and decided to write my own script to get base counts. The script seems to work, but I have a few concerns hopefully someone can address:

  • I interpreted deletions as: when there is a -[0-9]-[AGTC] in the read base column, that means the next [0-9] base is deleted. The current base is not deleted. Is that correct?
  • Insertions mean there is an insertion between the current base and the next base?
  • What does the "^" do? I read over the description several times and still don't understand it (from the pileup format guide):
    Also at the read base column, a symbol `^' marks the start of a 
    read segment which is a contiguous subsequence on the read
    separated by `N/S/H' CIGAR operations.

This script will output the count of A,G,T,C,insertion,deletions of each base. It will also include a comma delimited list of the inserted sequence and a comma delimited list of ambiguous base, if any exist. It will ignore forward/reverse orientation of reads.

Here is an example output:

bp    A    G    C    T    del   ins   inserted  ambiguous
4 0 0 1 0 0 0 . .
5 0 0 709 0 0 0 . .
6 0 0 0 715 0 0 . .
7 0 0 1 730 0 0 . .
8 0 736 0 0 0 0 . .
9 0 2 0 742 0 2 G,G .
10 0 740 1 0 2 2 C,C .
11 0 0 741 4 1 1 T .
12 0 3 0 740 3 1 G R,R

It also ignores "^" symbols which I honestly don't really understand. I would appreciate if anyone can enlighten me on what "^" signify.

Save this script as yourName.py and use by:

python yourName.py sample.mpileup > sample.counts
import sys

inFile = open(sys.argv[1],'r')

print 'bp\tA\tG\tC\tT\tdel\tins\tinserted\tambiguous'
for line in inFile:
        data = line.strip().split('\t')
        bp = data[1]
        bases = data[4].upper()
        ref = data[2].upper()
        
        types = {'A':0,'G':0,'C':0,'T':0,'-':0,'+':[],'X':[]}

        i = 0
        while i < len(bases):
                base = bases[i]
                if base == '^' or base == '$':
                        i += 1
                elif base == '-':
                        i += 1
                elif base == '*':
                        types['-'] += 1
                elif base == '+':
                        i += 1
                        addNum = int(bases[i])
                        addSeq = ''
                        for a in range(addNum):
                                i += 1
                                addSeq += bases[i]

                        types['+'].append(addSeq)
                elif base == '.' or base == ',':
                        types[ref] += 1
                else:
                        if types.has_key(base):
                                types[base] += 1
                        else:
                                types['X'].append(base)

                i += 1

        adds = '.'
        if len(types['+']) > 0:
                adds = ','.join(types['+'])

        amb = '.'
        if len(types['X']) > 0:
                amb = ','.join(types['X'])

        out = [bp,types['A'],types['G'],types['C'],types['T'],types['-'],len(types['+']),adds,amb]
        print '\t'.join([str(x) for x in out])







Search

Categories


Archive