Friday, 20 April 2012 at 2:06 pm

Phred quality scores are the de facto standard for evaluating the quality of individual bases in DNA sequences. They were originally used in Phred, a program that calls bases from sequencing trace files. The concept of Phred quality scores has not changed much over the years; however, the way the scores are encoded has changed, because Illumina decided to shift the ASCII range used to store them.

In this post, I will talk about what Phred scores represent and what the various scales are.


Phred score

In the original implementation of Phred scores, information about the size and shape of peaks in sequencing trace files was used to generate an error probability, which was then converted to a log score. As next-generation sequencers use different base calling methods depending on the sequencing chemistry, how Phred scores are assigned varies among sequencers.

Phred scores are basically a log value of the error probabilities:

Phred score = - 10 * log10(error probability)

Here is some basic algebra. A Phred score of 10 would mean:

10 = -10 * log10(error probability)
-1 = log10(error probability)
10^-1 = error probability *this means 10 to the -1 power
1 / 10 = error probability

An error probability of 1/10, or 0.1, means that a base call with this score has a 1 in 10 chance of being incorrect.
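The same algebra can be sketched in Python. The function names here are my own, chosen for illustration; they are not part of Phred or any library.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)

# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Every increase of 10 in the Phred score makes the error probability ten times smaller.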


ASCII

If Phred scores are numbers, why are the quality scores we see in our quality files a mix of letters, digits and symbols? For example, here is a .fastq entry:

@MyRead sequence
TTACTCTGCGTTGATACCACTGCTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACT
+MyRead quality
FFFFFFFFFFFFDB<4444340///29==:766234466666777689<344=<<;444744442<<9889<888@?<<<<<<<<888<?9444<==@A?888<<<<<>?>?A===????DDDDDDDDAAA=:89<?44488<<<<888=>432;=64

The quality scores are encoded using ASCII (American Standard Code for Information Interchange). ASCII is basically a character encoding system that maps a number to a character. For example, the character 'A' is represented by 65 in the ASCII encoding table. The character '%' is represented by 37 in the table. The character '1' is represented by 49.

To truly understand why ASCII is used, you'll have to understand binary encoding. You can read about binary encoding in my previous blog post on bitwise flags.

Each ASCII number can be converted to a character using the ASCII table. Each character you type into a simple text editor, like TextEdit on OS X or Notepad on Windows, is represented by a number in this table. The number is then stored in binary as 1 byte, which consists of 8 bits. 8 bits allow for 256 possible values, so a byte could in principle represent 256 different characters; ASCII itself defines only the first 128 (0 to 127).

One important thing to note about the ASCII table and how computers understand characters is that non-displayed characters are also in the ASCII table. These include tabs, spaces and line breaks. Even though our text editing programs do not display them as visible characters, they are still represented by a number in the ASCII table.

Using the ASCII system of mapping numbers to characters allows us to represent a number with a single character. For example, instead of saving the quality score 10 as the two characters '1' and '0' (2 bytes, 16 bits: '0011000100110000'), we can save it as one equivalent ASCII character (1 byte).
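Here is a small Python sketch of that saving. The variable names are only for illustration.

```python
score = 10

# Stored as text, the score takes two characters: '1' (49) and '0' (48).
two_chars = str(score).encode("ascii")
bits = "".join(format(byte, "08b") for byte in two_chars)
print(len(two_chars), bits)

# Stored as a single byte whose numeric value is the score itself.
one_char = bytes([score])
print(len(one_char))
```

Note that the byte with value 10 is a line break, which is one of the non-displayed characters mentioned above, so storing the raw value directly would not give a readable quality string.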


Phred scales

When people talk about Phred scales, they mean the range of ASCII numbers used to define the quality scores. The original Phred scale ranged from 33 to 126. Here are some of the ASCII character representations of this number range.

! - 33
" - 34
# - 35
.
..
...
~ - 126

Why did they choose to start with 33? Because the ASCII characters before 33 are mostly non-displayed characters, such as 32, which is a blank space, or 10, which is a line break.
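To see the whole range at once, the characters for ASCII 33 to 126 can be generated with Python's built-in chr function. This is just a quick sketch, and the variable name is my own.

```python
# Build the string of every character from ASCII 33 ('!') to 126 ('~').
sanger_chars = "".join(chr(code) for code in range(33, 127))

print(len(sanger_chars))  # number of distinct characters in the range
print(sanger_chars)
```

All of these are visible characters, so a quality string built from them can be read and edited in any plain text editor.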

The original Phred scaling of 33 to 126 is also called the Sanger scale. There are also two other scales, used by Illumina, that shift the range up:

  • Illumina 1.0 format uses ASCII 59 to 126, representing scores -5 to 62.
  • Illumina 1.3+ format uses ASCII 64 to 126, representing scores 0 to 62.
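The ranges above can be checked with a few lines of Python. The offsets in this sketch (33 for Sanger, 64 for both Illumina formats) are inferred from the ASCII and score ranges listed above, and the names FORMATS and score_range are made up for illustration.

```python
# (format name, first ASCII code, last ASCII code, assumed offset)
FORMATS = [
    ("Sanger", 33, 126, 33),
    ("Illumina 1.0", 59, 126, 64),
    ("Illumina 1.3+", 64, 126, 64),
]

def score_range(first, last, offset):
    """Return the lowest and highest score an ASCII range can encode."""
    return first - offset, last - offset

for name, first, last, offset in FORMATS:
    low, high = score_range(first, last, offset)
    print(f"{name}: {chr(first)!r} to {chr(last)!r} encodes scores {low} to {high}")
```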

To convert between these scales, you need to get the ASCII number of each character and subtract the offset of its scale (33 for Sanger, 64 for the Illumina formats) to recover the score. A useful Python function for this is 'ord', which returns the ASCII number of a character. For example:

ord('A') returns 65
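Putting this together, here is a minimal sketch of decoding a quality string and shifting it from one scale to another. The function names decode_quality and shift_to_sanger are hypothetical, and the sketch assumes the quality string in the .fastq entry above uses the Sanger scale (offset 33), since it contains characters below ASCII 59.

```python
def decode_quality(quality, offset=33):
    """Turn a quality string into a list of scores by subtracting the offset."""
    return [ord(ch) - offset for ch in quality]

def shift_to_sanger(quality, offset=64):
    """Re-encode a quality string from an offset-64 scale to offset 33."""
    return "".join(chr(ord(ch) - offset + 33) for ch in quality)

# A few characters taken from the quality line of the .fastq entry above.
print(decode_quality("FDB<4"))

# 'h' on an offset-64 scale and 'I' on the Sanger scale encode the same score.
print(shift_to_sanger("h"), decode_quality("h", offset=64), decode_quality("I"))
```

The shift only changes which character stands for a score; the score itself stays the same.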






