What should a bioinformatician know about DNA sequencing, and why?
What are the error types and rates of the different platforms?
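Quality scores
Phred quality score: Q = -10 log10(e), where e is the estimated probability that the base call is wrong

Quality score   Prob. wrong base call   Accuracy of base call
10              1/10                    90%
20              1/100                   99%
30              1/1,000                 99.9%
40              1/10,000                99.99%
50              1/100,000               99.999%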
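The Phred relationship can be checked numerically. The following is a minimal sketch; the function names `phred_to_error_prob` and `error_prob_to_phred` are illustrative and do not come from any particular library.

```python
import math


def phred_to_error_prob(q):
    """Return the probability e of a wrong base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(e):
    """Return the Phred score Q = -10 * log10(e) for error probability e."""
    return -10 * math.log10(e)


# Print the error probability and accuracy for the scores in the table.
for q in (10, 20, 30, 40, 50):
    e = phred_to_error_prob(q)
    print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):.3f}%")
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability.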
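FASTQ format
4 lines per record: sequence + quality (+ optional description)

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line 1: @ followed by the sequence identifier and an optional description
Line 2: the sequence
Line 3: + with an optional repeat of line 1, often left as just the + character to save space
Line 4: the quality string, one character per base

But beware! There are at least 3 different FASTQ file standards, indistinguishable in format but incompatible with each other.
Source: Wikipedia.org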
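The four-line record layout can be read with a few lines of Python. This is a sketch under the assumption that every record occupies exactly four lines (wrapped, multi-line FASTQ is not handled); the function name `read_fastq` and the example record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a strict 4-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Hypothetical two-record input; the second record repeats line 1 after '+'.
example = io.StringIO("@read1 demo\nACGTN\n+\nII5#!\n@read2\nGG\n+read2\nII\n")
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, seq, qual)
```

Note that the parser cannot tell which quality encoding the file uses; that has to be known or inferred separately.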
FASTQ variants

Name                                  ASCII range, offset   Q score type   Q score range
Sanger standard; fastq-sanger         33-126, 33            PHRED          0 to 93 (raw 0 to 40)
Solexa/Illumina <1.3; fastq-solexa    59-126, 64            Solexa         -5 to 62 (raw -5 to 40)
Illumina 1.3+; fastq-illumina         64-126, 64            PHRED          0 to 62 (raw 0 to 40)
Illumina 1.5+                         67-126, 64            PHRED          3 to 62 (raw 3 to 40)
Illumina 1.8+                         33-126, 33            PHRED          0 to 93 (raw 0 to 41)
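Decoding a quality string means subtracting the variant's ASCII offset from each character code. The sketch below illustrates this and shows why the variants are incompatible: the same characters give different scores under different offsets. The `OFFSETS` table and function names are illustrative; the Solexa-to-PHRED conversion follows from the Solexa definition Q = -10 log10(e / (1 - e)).

```python
import math

# ASCII offsets for the three named variants (labels as used in the table).
OFFSETS = {"fastq-sanger": 33, "fastq-solexa": 64, "fastq-illumina": 64}


def decode_quality(qual, variant="fastq-sanger"):
    """Convert a quality string to a list of integer scores for the given variant."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in qual]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


print(decode_quality("I5!", "fastq-sanger"))    # [40, 20, 0]
print(decode_quality("hT@", "fastq-illumina"))  # [40, 20, 0]

# The same string read with the wrong offset gives misleading scores.
print(decode_quality("hT@", "fastq-sanger"))    # [71, 51, 31]
```

Solexa and PHRED scores are nearly identical at high quality and diverge only for low-quality bases, which is where a mistaken encoding does the most damage to downstream filtering.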
What use is the quality score?
What factors should be considered in the choice of a DNA sequencing platform?