Base quality and read quality: How should data quality be measured? Gabor T. Marth Boston College Biology Department 1000 Genomes Meeting Cold Spring Harbor.

Base quality and read quality: How should data quality be measured?
Gabor T. Marth
Boston College Biology Department
1000 Genomes Meeting, Cold Spring Harbor Laboratory, May 5-6, 2008

Read accuracy / quality values
Read accuracy: the more accurate the read, the easier and faster it is to map. The more errors the aligner must tolerate, the fewer reads can be uniquely aligned.
Well-calibrated quality values: quality values help distinguish between sequencing error and allelic difference. Some aligners (e.g. MAQ) use quality values to find the correct read mapping position.

How to tabulate sequencing error rates?
Align a set of reads to the corresponding organismal reference genome sequences.
Register positions of mismatches / gaps.
[Figure: paired-end (PE) reads aligned to the reference sequence, with the measured fragment length (L) marked. Mismatches and gaps are labeled S, I and D.]
reference sequence: …atggatgagtataacgtcaggctaaactgtagtatatggataaaatgacca*acga…
read: tatgcataaaatgaccatacg (S, I)
read: tggatgagtataa*gtcagg (D)
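The "register positions of mismatches / gaps" step could be sketched roughly as follows. This is a minimal illustration, not the analysis code used for the talk. It assumes the alignment is already available as two equal-length gapped strings, that `*` is the gap symbol as on the slide, and that S, I and D stand for substitution, insertion and deletion. The function name `register_events` is hypothetical.

```python
from collections import Counter

GAP = "*"  # gap symbol as drawn on the slide (assumption)


def register_events(ref_aln, read_aln):
    """Return (column, type) for every mismatch or gap in a gapped alignment.

    Types: "S" = substitution, "I" = insertion in the read (gap in the
    reference), "D" = deletion in the read (gap in the read).
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    events = []
    for col, (r, q) in enumerate(zip(ref_aln, read_aln)):
        if r == q:
            continue  # matching column, nothing to register
        if q == GAP:
            events.append((col, "D"))
        elif r == GAP:
            events.append((col, "I"))
        else:
            events.append((col, "S"))
    return events


# The deletion example from the slide, cut down to the aligned window.
ref_aln = "tggatgagtataacgtcagg"
read_aln = "tggatgagtataa*gtcagg"
events = register_events(ref_aln, read_aln)
counts = Counter(kind for _, kind in events)
print(events, dict(counts))
```

Summing such per-read event lists over all uniquely aligned reads would give the error-rate tables by type and by position.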

Caveat #1 – paralogous mapping
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
read at the correct map location: accttactttaccttgtactg
read at the incorrect map location: accttactttaccttgtactg (spurious error)
The spurious error can be avoided by using exhaustive alignment, which reveals that there are multiple possible map locations. Reads that don’t map uniquely should not be used for error analysis…
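To make the idea of exhaustive alignment concrete, here is a toy sketch that scans every ungapped offset of the slide's reference for the slide's read. It is not how MosaikPE or MAQ work internally. The function names and the mismatch threshold of 1 are illustrative assumptions.

```python
def all_placements(reference, read, max_mismatches):
    """Exhaustively scan every ungapped offset; return (offset, mismatches) hits."""
    hits = []
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits


def is_unique(hits):
    """A read is usable for error analysis only if it has exactly one placement."""
    return len(hits) == 1


reference = "aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc"
read = "accttactttaccttgtactg"
hits = all_placements(reference, read, max_mismatches=1)
status = "unique" if is_unique(hits) else "not unique: exclude from error analysis"
print(hits, status)
```

A mapper that stopped at the first acceptable placement would report the one-mismatch location and count a spurious error. The exhaustive scan also finds the exact match, so the read is flagged as non-unique.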

Caveat #2 – local misalignment
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
correct alignment: accttactttgccttgtact*a (D)
incorrect alignment: accttactttgccttgtacta (spurious S)
Typically happens at the ends of reads. It is a consequence of the scoring scheme… difficult to fix…
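Why the scoring scheme produces this can be seen by scoring both alignments from the slide. The sketch below uses made-up scores (match +1, mismatch -1, gap -3). These values are assumptions for illustration only, not the parameters of any particular aligner.

```python
def alignment_score(ref_aln, read_aln, match=1, mismatch=-1, gap=-3, gap_char="*"):
    """Sum column scores of a gapped pairwise alignment (linear gap cost)."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for r, q in zip(ref_aln, read_aln):
        if r == gap_char or q == gap_char:
            score += gap
        elif r == q:
            score += match
        else:
            score += mismatch
    return score


# Correct alignment: one deleted base near the end of the read.
gapped = alignment_score("accttactttgccttgtactga", "accttactttgccttgtact*a")
# Incorrect alignment: the last base is forced against the reference.
ungapped = alignment_score("accttactttgccttgtactg", "accttactttgccttgtacta")
print(gapped, ungapped)
```

Whenever a gap costs more than a mismatch plus the one match gained beyond the gap, the aligner prefers the spurious substitution. Near the end of a read there are too few bases after the gap to pay that cost back.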

Caveat #3 – polymorphic dataset / ref errors
reference sequence: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
resequenced individual: aaccttactttgccttgcactgaaattactacgtacaccttactttaccttgtactgc
read: actttgccttgcactgaaatt (the SNP appears as a spurious S)
An important source of error: Θ ≈ 1/1,000 for humans.
Use resequencing data from a haploid DNA source (e.g. BAC).
In polymorphic datasets, maybe do SNP calling, and exclude reads that overlap a SNP (this should also work for errors in the reference sequence).
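The "exclude reads that overlap a SNP" suggestion might look like the sketch below. It assumes reads are given as hypothetical `(start, length)` tuples in 0-based reference coordinates and that SNP positions come from a separate calling step. The coordinates are read off the slide's toy sequences.

```python
from bisect import bisect_left


def overlaps_snp(read_start, read_length, sorted_snp_positions):
    """True if any SNP lies in [read_start, read_start + read_length)."""
    i = bisect_left(sorted_snp_positions, read_start)
    return (i < len(sorted_snp_positions)
            and sorted_snp_positions[i] < read_start + read_length)


def reads_for_error_analysis(reads, snp_positions):
    """Keep only reads that do not overlap a called SNP."""
    snps = sorted(snp_positions)
    return [(start, length) for start, length in reads
            if not overlaps_snp(start, length, snps)]


snp_positions = [17]           # t -> c difference in the slide's example
reads = [(6, 21), (30, 21)]    # first read is the one drawn on the slide
kept = reads_for_error_analysis(reads, snp_positions)
print(kept)
```

The same filter would work for known reference errors, by adding their positions to `snp_positions`.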

Caveat #4 – very low quality reads
reference: aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc
read: acaatgcgttgca***agatt
These reads are hard to even align, so they elude error statistics.
Only tabulate error stats for reads that will be aligned?

Study design
Took 3 random lanes each from 3 random runs of PE Illumina reads from Sanger (100,000 random pairs per lane).
Mapped the reads to NCBI build36 using MosaikPE.
Alignment conditions: paired-end alignments, unique-unique end-reads, maximum 4 mismatches per end-read.
Fraction aligned: on average, X fraction of the reads used.
Analysis by Derek Barnett.

Fragment length distribution

Base error rate and error type

Base error rate by substitution type

Per-read error rates

Position-specific base error rates

Accuracy across lanes and runs

Assigned Q-value vs. base error rate
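Comparing assigned Q-values with observed error rates relies on the standard Phred relation Q = -10 log10(P). As a rough sketch, an empirical Q for a group of bases that share the same assigned Q can be computed from mismatch counts. The function names and the example counts below are hypothetical, not data from the talk.

```python
import math


def error_rate_from_phred(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10.0 ** (-q / 10.0)


def empirical_q(mismatches, bases):
    """Phred-scaled observed error rate, or None if no errors were seen."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    if mismatches == 0:
        return None  # no finite estimate without at least one error
    return -10.0 * math.log10(mismatches / bases)


# Hypothetical counts: 10 mismatches among 10,000 bases with assigned Q 30.
print(error_rate_from_phred(30), empirical_q(10, 10_000))
```

Quality values are well calibrated when the empirical Q tracks the assigned Q across the whole range.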

Does the Q-value depend on base cycle?

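Quality value calibration
[Table with axes base cycle and raw Q (axis marks: 30, 10). Values in extraction order: 32.0, 31.1, 30.9, 30.0, 31.1; 31.3; 31.4, 30.2, 30.7; 31.3, 30.0, 28; 26.0; 29.7, 28.3, 27.7, 26.7, 25.4; 28.3, 28.6, 27.8, 26.2, 25.1]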

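Q-values for read simulations
[Table with axes base cycle and Q (axis marks: 30, 10). Values in extraction order: 0.7, 0.9, 0.8, 0.1; 3.6, 2.3, 2.1, 2.0, 1.0; 20.1, 19.4, 18.3, 18.2, 16.3; 9.0, 10.1, 9.2, 7.9, 7.5; 7.9, 6.7, 7.1, 5.4, 4.3]
Introduce an error in the read at base cycle 10 with P = 0.001.
Weichun Huang, see poster at Genome Meeting.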
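The simulation idea, drawing an error at each base cycle with the probability implied by its Q-value, can be sketched as below. This is not Weichun Huang's simulator. The function name is hypothetical, and replacing an erroneous base uniformly at random with one of the other three bases is a simplifying assumption.

```python
import random


def simulate_read_errors(read, qualities, rng):
    """Introduce substitution errors per base with P = 10^(-Q/10)."""
    if len(read) != len(qualities):
        raise ValueError("need one quality value per base")
    bases = "acgt"
    out = []
    for base, q in zip(read, qualities):
        p_error = 10.0 ** (-q / 10.0)  # Q 30 gives P = 0.001
        if rng.random() < p_error:
            # Uniform choice among the other bases (simplifying assumption).
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(2008)  # fixed seed so runs are repeatable
true_read = "accttactttaccttgtactg"
qualities = [30] * len(true_read)
print(simulate_read_errors(true_read, qualities, rng))
```

Feeding in position-specific Q-values, such as a calibration table by base cycle, would reproduce the decline in accuracy toward the end of the read.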

Thanks: Derek, Aaron, Weichun, Michael, Chip
