
1 Base quality and read quality: How should data quality be measured?
Gabor T. Marth, Boston College Biology Department
1000 Genomes Meeting, Cold Spring Harbor Laboratory, May 5-6, 2008

2 Read accuracy / quality values
- Quality values help distinguish between sequencing error and allelic difference.
- Some aligners (e.g. MAQ) use quality values to find the correct read mapping position.
- The more accurate the read, the easier and faster it is to map.
- The more errors the aligner must tolerate, the fewer reads can be uniquely aligned.
- Read accuracy; well-calibrated quality values.

3 How to tabulate sequencing error rates?
- Align a set of reads to the corresponding organismal reference genome sequences.
- Register positions of mismatches / gaps.
[Figure: PE read aligned to the reference sequence, with the measured fragment length (L) and the error types S, I and D marked]
reference sequence: …atggatgagtataacgtcaggctaaactgtagtatatggataaaatgacca*acga…
read: tatgcataaaatgaccatacg (S, I)
read: tggatgagtataa*gtcagg (D)
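The tabulation step can be sketched in a few lines of Python. This is a minimal illustration and not the analysis code behind the slides: the function names `tabulate_errors` and `error_rate` are hypothetical, the `*` gap character follows the slide, and dividing by the number of alignment columns is an assumed choice of denominator. The two example alignments are the read/reference segments shown on the slide.

```python
from collections import Counter


def tabulate_errors(ref_aln: str, read_aln: str, gap: str = "*") -> Counter:
    """Count matches, substitutions (S), insertions (I) and deletions (D)
    column by column in a gapped pairwise alignment."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    counts = Counter()
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == gap and read_base == gap:
            continue  # empty column, nothing to register
        if ref_base == gap:
            counts["I"] += 1  # read has a base the reference lacks
        elif read_base == gap:
            counts["D"] += 1  # read is missing a reference base
        elif ref_base != read_base:
            counts["S"] += 1
        else:
            counts["match"] += 1
    return counts


def error_rate(counts: Counter) -> float:
    """Errors per alignment column (an assumed denominator)."""
    columns = sum(counts.values())
    errors = counts["S"] + counts["I"] + counts["D"]
    return errors / columns if columns else 0.0


# Alignments taken from the slide
deletion_read = tabulate_errors("tggatgagtataacgtcagg", "tggatgagtataa*gtcagg")
sub_ins_read = tabulate_errors("tatggataaaatgacca*acg", "tatgcataaaatgaccatacg")
print(deletion_read, sub_ins_read)
```

Summing such counters over all uniquely aligned reads gives the per-type error totals; keeping the column index as well would give position-specific rates.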

4 Caveat #1 – paralogous mapping
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
read at correct map location: accttactttaccttgtactg
read at incorrect map location: accttactttaccttgtactg (spurious error)
- Spurious errors can be avoided by using exhaustive alignment, which reveals when a read has multiple possible map locations.
- Reads that don't map uniquely should not be used for error analysis.
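As a rough illustration of why an exhaustive search exposes this problem, the sketch below scans every offset of the slide's reference for the slide's read and reports all placements within a mismatch budget. It is a simplification under stated assumptions: ungapped, forward strand only, and quadratic in time, so it is not how a production aligner works. The names `find_all_hits` and `is_uniquely_mapped` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_all_hits(reference: str, read: str, max_mismatches: int):
    """Return (start, mismatches) for every ungapped placement of the read
    with at most max_mismatches mismatches."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        mismatches = hamming(reference[start:start + len(read)], read)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def is_uniquely_mapped(reference: str, read: str, max_mismatches: int) -> bool:
    """True only if exactly one placement fits the mismatch budget."""
    return len(find_all_hits(reference, read, max_mismatches)) == 1


reference = "aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc"
read = "accttactttaccttgtactg"

# With one mismatch tolerated, the read fits at two places (0-based offsets
# 1 and 36), so it would be excluded from the error analysis.
print(find_all_hits(reference, read, max_mismatches=1))
print(is_uniquely_mapped(reference, read, max_mismatches=1))
```

An aligner that stops at the first acceptable placement could report offset 1 and register one mismatch there, which is the kind of spurious error the slide warns about.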

5 Caveat #2 – local misalignment
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
correct alignment: accttactttgccttgtact*a (D)
incorrect alignment: accttactttgccttgtacta (spurious S)
- Typically happens at the ends of reads.
- A consequence of the scoring scheme; difficult to fix.
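The effect of the scoring scheme can be seen by scoring both alignments from the slide. The sketch below assumes a simple linear scheme; the values of `MATCH`, `MISMATCH` and `GAP` are hypothetical and chosen only for illustration, and they are not the parameters of any particular aligner.

```python
# Hypothetical scoring parameters, for illustration only
MATCH, MISMATCH, GAP = 1, -1, -3


def score_alignment(ref_aln: str, read_aln: str, gap: str = "*") -> int:
    """Sum per-column scores of a gapped pairwise alignment."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == gap or read_base == gap:
            score += GAP
        elif ref_base == read_base:
            score += MATCH
        else:
            score += MISMATCH
    return score


# Correct alignment: one deleted base near the 3' end of the read
gapped = score_alignment("accttactttgccttgtactga", "accttactttgccttgtact*a")
# Incorrect alignment: no gap, the last base becomes a substitution
ungapped = score_alignment("accttactttgccttgtactg", "accttactttgccttgtacta")

print(gapped, ungapped)
```

Under these assumed values the ungapped alignment scores 19 against 18 for the gapped one. Only one matching base follows the gap, which is not enough to pay for the gap penalty, so a score-maximizing aligner reports a substitution instead of the deletion.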

6 Caveat #3 – polymorphic dataset / ref errors
reference sequence: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
resequenced individual: aaccttactttgccttgcactgaaattactacgtacaccttactttaccttgtactgc
read: actttgccttgcactgaaatt (spurious S at the SNP)
- Important source of error: Θ ≈ 1/1,000 for humans.
- Use resequencing data from a haploid DNA source (e.g. BAC).
- In polymorphic datasets, maybe do SNP calling, and exclude reads that overlap a SNP (this should also work for errors in the reference sequence).
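Excluding reads that overlap a called SNP amounts to an interval test against a sorted list of SNP positions. The sketch below is one possible way to do it: the `(name, start, end)` read tuples, the 0-based half-open coordinates, and the function names are all assumptions made for illustration. In the slide's example the SNP is at 0-based position 17 of the reference, and the read covers positions 6 to 26.

```python
from bisect import bisect_left


def overlaps_snp(start: int, end: int, sorted_snps: list) -> bool:
    """True if any SNP position lies in the half-open interval [start, end)."""
    i = bisect_left(sorted_snps, start)  # first SNP at or after start
    return i < len(sorted_snps) and sorted_snps[i] < end


def exclude_snp_reads(reads, snp_positions):
    """Keep only reads (name, start, end) that overlap no SNP position."""
    snps = sorted(snp_positions)
    return [r for r in reads if not overlaps_snp(r[1], r[2], snps)]


# Hypothetical read placements on the slide's reference
reads = [("read_over_snp", 6, 27), ("read_clear_of_snp", 30, 51)]
kept = exclude_snp_reads(reads, snp_positions=[17])
print(kept)
```

The same filter applies to suspected reference errors if their positions are added to the list.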

7 Caveat #4 – very low quality reads
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
read: acaatgcgttgca***agatt
- These reads are hard to even align, so they elude error statistics.
- Only tabulate error stats for reads that will be aligned?

8 Study design
- Took 3 random lanes each from 3 random runs of PE Illumina reads from Sanger (100,000 random pairs per lane).
- Mapped the reads to NCBI build36 using MosaikPE.
- Alignment conditions: paired-end alignments, unique-unique end-reads, maximum 4 mismatches per end-read.
- Fraction aligned: on average, X fraction of the reads used.
- Analysis by Derek Barnett.

9 Fragment length distribution

10 Base error rate and error type

11 Base error rate by substitution type

12 Per-read error rates

13 Position-specific base error rates

14 Accuracy across lanes and runs

15 Assigned Q-value vs. base error rate

16 Does the Q-value depend on base cycle?

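17 Quality value calibration
[Table: values indexed by base cycle and raw Q, with raw Q 30 and base cycle 10 marked. Extracted values, original grid layout lost: 32.0 31.1 30.9 30.0 31.1 / 31.3 / 31.4 30.2 30.7 / 31.3 30.0 28 / 26.0 / 29.7 28.3 27.7 26.7 25.4 / 28.3 28.6 27.8 26.2 25.1]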
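A calibration table of this kind can be built by binning aligned bases by base cycle and assigned (raw) Q, then converting each bin's observed error rate to the Phred scale, Q = -10 log10(P). The sketch below is an assumed reconstruction of that idea and not the method used for the slide: the `(base_cycle, raw_q, is_error)` input tuples and the function names are hypothetical, and bins with no observed errors are returned as `None` because their empirical Q cannot be estimated, only bounded.

```python
import math
from collections import defaultdict


def phred(p_error: float) -> float:
    """Phred-scaled quality for an error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p_error)


def calibrate(observations):
    """Map (base_cycle, raw_q) -> empirical Q from observed error rates.

    observations: iterable of (base_cycle, raw_q, is_error) tuples, one per
    aligned base (a hypothetical input format).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [bases, errors]
    for cycle, raw_q, is_error in observations:
        totals[(cycle, raw_q)][0] += 1
        totals[(cycle, raw_q)][1] += int(is_error)
    table = {}
    for key, (bases, errors) in totals.items():
        # No observed errors: empirical Q is undefined for this bin
        table[key] = phred(errors / bases) if errors else None
    return table


# Toy data: 1,000 bases at cycle 10 with raw Q 30, one of them wrong
toy = [(10, 30, False)] * 999 + [(10, 30, True)]
table = calibrate(toy)
print(table)
```

If the assigned values are well calibrated, the empirical Q in each bin should be close to the raw Q regardless of base cycle.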

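18 Q-values for read simulations
- Introduce error in read at base cycle 10 with P = 0.001.
- Weichun Huang, see poster at Genome Meeting.
[Table: values indexed by base cycle and Q, with Q 30 and base cycle 10 marked. Extracted values, original grid layout lost: 0.7 0.9 0.8 0.1 / 3.6 2.3 2.1 2.0 1.0 / 20.1 19.4 18.3 18.2 16.3 / 9.0 10.1 9.2 7.9 7.5 / 7.9 6.7 7.1 5.4 4.3]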
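The simulation step described on this slide can be sketched as follows. An error probability of P = 0.001 corresponds to Q = 30 on the Phred scale. Everything else here is an assumption made for illustration: the function names, the 1-based cycle convention, the lowercase `acgt` alphabet, and the uniform choice among the three alternative bases are not taken from the poster.

```python
import random


def q_to_p(q: float) -> float:
    """Error probability for a Phred-scaled quality: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def introduce_error(read: str, cycle: int, p_error: float,
                    rng: random.Random) -> str:
    """With probability p_error, substitute the base at the given 1-based
    cycle with one of the three other bases, chosen uniformly."""
    if rng.random() >= p_error:
        return read  # no error introduced this time
    i = cycle - 1
    alternatives = [base for base in "acgt" if base != read[i]]
    return read[:i] + rng.choice(alternatives) + read[i + 1:]


rng = random.Random(0)  # fixed seed so repeated runs give the same reads
template = "accttactttgccttgtactg"
p = q_to_p(30)  # approximately 0.001
simulated = [introduce_error(template, 10, p, rng) for _ in range(100_000)]
n_errors = sum(r != template for r in simulated)
print(p, n_errors / len(simulated))
```

With enough simulated reads the observed error fraction at cycle 10 should approach the chosen P, which gives a known truth to compare assigned Q-values against.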

19 Thanks
Derek, Aaron, Weichun, Michael, Chip

