Published by Eric Tate. Modified over 9 years ago.
1
Base quality and read quality: How should data quality be measured?
Gabor T. Marth
Boston College Biology Department
1000 Genomes Meeting, Cold Spring Harbor Laboratory
May 5-6, 2008
2
Read accuracy / quality values
Quality values help distinguish between sequencing error and allelic difference.
Some aligners (e.g. MAQ) use quality values to find the correct read mapping position.
The more accurate the read, the easier and faster it is to map.
The more errors the aligner must tolerate, the fewer reads can be uniquely aligned.
Read accuracy; well-calibrated quality values.
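A minimal sketch of how a quality value relates to an error probability, assuming the quality values are Phred-scaled (Q = -10 log10 P). The slides do not state the scale explicitly, although the later pairing of Q 30 with P=0.001 is consistent with it. The function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality value for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)

# Print the implied error probability for a few quality values.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Under this scale, each step of 10 in Q corresponds to a tenfold change in the implied error probability.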
3
How to tabulate sequencing error rates?
Align a set of reads to the corresponding organismal reference genome sequences.
Register positions of mismatches / gaps.
reference sequence: …atggatgagtataacgtcaggctaaactgtagtatatggataaaatgacca*acga…
PE read: tggatgagtataa*gtcagg (D)
PE read: tatgcataaaatgaccatacg (S, I)
measured fragment length (L)
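A rough sketch of the registration step: given a read and its reference window written as equal-length gapped strings, count substitutions (S), insertions (I) and deletions (D) column by column. The `*` gap character follows the slide; the function name and the reference windows cut out of the slide's reference string are assumptions made for illustration.

```python
from collections import Counter

def tabulate_errors(ref_aln, read_aln, gap="*"):
    """Count S, I and D events in one gapped pairwise alignment.

    A gap in the reference marks an insertion in the read; a gap in the
    read marks a deletion.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    counts = Counter()
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == gap and read_base == gap:
            continue  # padding column, carries no information
        if ref_base == gap:
            counts["I"] += 1
        elif read_base == gap:
            counts["D"] += 1
        elif ref_base != read_base:
            counts["S"] += 1
        counts["aligned_columns"] += 1
    return counts

# The two reads from the slide, paired with the reference windows they align to.
first = tabulate_errors("tggatgagtataacgtcagg", "tggatgagtataa*gtcagg")
second = tabulate_errors("tatggataaaatgacca*acg", "tatgcataaaatgaccatacg")
print(dict(first))
print(dict(second))
```

Summing these counters over many reads, and dividing by the number of aligned columns, gives per-type error rates.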
4
Caveat #1 – paralogous mapping
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
read: accttactttaccttgtactg
At the correct map location the read matches the reference; at the incorrect map location it shows a spurious error.
This can be avoided by using exhaustive alignment, which reveals that there are multiple possible map locations.
Reads that don’t map uniquely should not be used for error analysis…
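To illustrate why an exhaustive search exposes the problem, here is a small sketch that scans every ungapped placement of the read on the slide's reference and keeps those within a mismatch limit. The brute-force scan is a stand-in for a real exhaustive aligner (it ignores gaps), and the function name and the limit of 4 mismatches are assumptions.

```python
def candidate_locations(reference, read, max_mismatches):
    """Return (offset, mismatches) for every ungapped placement within the limit."""
    hits = []
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits

reference = "aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc"
read = "accttactttaccttgtactg"

hits = candidate_locations(reference, read, max_mismatches=4)
is_unique = len(hits) == 1  # only uniquely placed reads should feed error statistics
print(hits, is_unique)
```

The read has an exact placement and a second placement with one mismatch, so it is not unique and would be excluded from the error tabulation.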
5
Caveat #2 – local misalignment
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
correct alignment: accttactttgccttgtact*a (D)
incorrect alignment: accttactttgccttgtacta (spurious S)
Typically happens at the ends of reads.
A consequence of the scoring scheme… difficult to fix…
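A small sketch of why the scoring scheme can prefer the wrong alignment near a read end. The scores used here (match +1, mismatch -3, gap -5) are hypothetical values chosen for illustration, not those of any particular aligner; the reference windows are taken from the slide's reference string.

```python
def alignment_score(ref_aln, read_aln, match=1, mismatch=-3, gap=-5, gap_char="*"):
    """Score an equal-length gapped alignment with a simple linear scheme."""
    score = 0
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == gap_char or read_base == gap_char:
            score += gap
        elif ref_base == read_base:
            score += match
        else:
            score += mismatch
    return score

# Correct alignment: one deletion close to the end of the read.
with_deletion = alignment_score("accttactttgccttgtactga", "accttactttgccttgtact*a")
# Incorrect alignment: the last base is placed as a substitution instead.
with_substitution = alignment_score("accttactttgccttgtactg", "accttactttgccttgtacta")
print(with_deletion, with_substitution)
```

With these assumed scores the single matching base after the gap does not pay for the gap penalty, so the aligner reports a substitution where a deletion actually occurred.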
6
Caveat #3 – polymorphic dataset / ref errors
reference sequence: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
resequenced individual: aaccttactttgccttgcactgaaattactacgtacaccttactttaccttgtactgc
read: actttgccttgcactgaaatt (spurious S at the SNP)
An important source of error: Θ ≈ 1/1,000 for humans.
Use resequencing data from a haploid DNA source (e.g. BAC).
In polymorphic datasets, maybe do SNP calling and exclude reads that overlap a SNP (this should also work for errors in the reference sequence).
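The exclusion step can be sketched as an interval check against a sorted list of called SNP positions. This is only one possible way to do it; the function name, the 0-based coordinates and the half-open interval convention are assumptions.

```python
from bisect import bisect_left

def overlaps_snp(read_start, read_length, sorted_snp_positions):
    """True if any SNP lies in [read_start, read_start + read_length)."""
    i = bisect_left(sorted_snp_positions, read_start)
    return (i < len(sorted_snp_positions)
            and sorted_snp_positions[i] < read_start + read_length)

# In the slide's example the two sequences differ at 0-based position 17,
# and the read aligns from position 6 with length 21.
snps = [17]
print(overlaps_snp(6, 21, snps))   # read covering the SNP: excluded from error stats
print(overlaps_snp(30, 21, snps))  # read elsewhere: kept
```

The same check applies unchanged if the position list also holds suspected reference errors.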
7
Caveat #4 – very low quality reads
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
read: acaatgcgttgca***agatt
These reads are hard to even align, so they elude error statistics.
Only tabulate error stats for reads that can be aligned?
8
Study design
Took 3 random lanes each from 3 random runs of PE Illumina reads from Sanger (100,000 random pairs per lane).
Mapped the reads to NCBI build36 using MosaikPE.
Alignment conditions: paired-end alignments, unique-unique end-reads, maximum 4 mismatches per end-read.
Fraction aligned: on average, X fraction of the reads used.
Analysis by Derek Barnett.
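The alignment conditions can be read as a filter on read pairs. The sketch below is not MosaikPE's interface; `EndAlignment` and its fields are hypothetical names used only to make the conditions concrete.

```python
from dataclasses import dataclass

@dataclass
class EndAlignment:
    # Hypothetical record for one end of a pair.
    num_locations: int  # number of map locations found for this end-read
    mismatches: int     # mismatches at the reported location

def pair_passes(end1, end2, max_mismatches=4):
    """Keep a pair only if both ends map uniquely with at most max_mismatches."""
    return all(end.num_locations == 1 and end.mismatches <= max_mismatches
               for end in (end1, end2))

print(pair_passes(EndAlignment(1, 2), EndAlignment(1, 4)))  # both unique, within limit
print(pair_passes(EndAlignment(1, 0), EndAlignment(3, 0)))  # second end not unique
print(pair_passes(EndAlignment(1, 5), EndAlignment(1, 0)))  # too many mismatches
```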
9
Fragment length distribution
10
Base error rate and error type
11
Base error rate by substitution type
12
Per-read error rates
13
Position-specific base error rates
14
Accuracy across lanes and runs
15
Assigned Q-value vs. base error rate
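One way to produce the comparison named in this slide title is to group aligned bases by their assigned Q-value and convert the observed error rate in each group back to a quality value. This is a sketch under the assumption that Q is Phred-scaled; the input format (one `(assigned_q, is_error)` pair per aligned base) and the function name are illustrative.

```python
import math
from collections import defaultdict

def empirical_quality(observations):
    """Summarize observed error rates per assigned Q-value.

    observations: iterable of (assigned_q, is_error), one per aligned base.
    Returns {assigned_q: (n_bases, n_errors, empirical_q)}, where empirical_q
    is None when no error was observed in that group.
    """
    totals = defaultdict(lambda: [0, 0])
    for assigned_q, is_error in observations:
        totals[assigned_q][0] += 1
        totals[assigned_q][1] += int(is_error)
    table = {}
    for assigned_q, (n_bases, n_errors) in sorted(totals.items()):
        empirical_q = -10 * math.log10(n_errors / n_bases) if n_errors else None
        table[assigned_q] = (n_bases, n_errors, empirical_q)
    return table

# Toy input: 1 error in 1000 bases at Q30, 1 error in 10 bases at Q10.
toy = ([(30, False)] * 999 + [(30, True)]
       + [(10, False)] * 9 + [(10, True)])
for q, row in empirical_quality(toy).items():
    print(q, row)
```

Quality values are well calibrated when the empirical value tracks the assigned one; groups with few bases or no observed errors need more data before any conclusion is drawn.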
16
Does the Q-value depend on base cycle?
17
Quality value calibration
[Table: axes base cycle and raw Q; axis labels 30, 10. Values in extraction order:]
32.0 31.1 30.9 30.0 31.1
31.3
31.4 30.2 30.7
31.3 30.0 28
26.0
29.7 28.3 27.7 26.7 25.4
28.3 28.6 27.8 26.2 25.1
18
Q-values for read simulations
Introduce error in read at base cycle 10 with P=0.001.
[Table: axes base cycle and Q; axis labels 30, 10. Values in extraction order:]
0.7 0.9 0.8
0.1
3.6 2.3 2.1 2.0 1.0
20.1 19.4 18.3 18.2 16.3
9.0 10.1 9.2 7.9 7.5
7.9 6.7 7.1 5.4 4.3
Weichun Huang, see poster at Genome Meeting
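A minimal sketch of the simulation idea: each base cycle gets an error probability, and an error is introduced at that cycle with that probability. This is not the simulator referred to on the slide. It assumes substitution errors only, drawn uniformly from the other three bases, and the names are hypothetical.

```python
import random

def simulate_read(true_bases, error_probs, rng):
    """Return a copy of true_bases with per-cycle substitution errors.

    error_probs[i] is the probability of an error at base cycle i.
    Assumption: a wrong base is chosen uniformly from the other three.
    """
    out = []
    for base, p in zip(true_bases, error_probs):
        if rng.random() < p:
            base = rng.choice([b for b in "acgt" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
template = "accttactttgccttgtactg"
# P=0.001 at every cycle, as in the slide's example for a single cycle.
error_probs = [0.001] * len(template)
print(simulate_read(template, error_probs, rng))
```

Feeding in per-cycle probabilities derived from calibrated quality values, rather than a constant, would let the simulated reads follow the position-specific error profile.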
19
Thanks: Derek, Aaron, Weichun, Michael, Chip