Base quality and read quality: How should data quality be measured? Gabor T. Marth Boston College Biology Department 1000 Genomes Meeting Cold Spring Harbor.

Presentation transcript:

Base quality and read quality: How should data quality be measured? Gabor T. Marth Boston College Biology Department 1000 Genomes Meeting Cold Spring Harbor Laboratory May

Read accuracy / quality values
Well-calibrated quality values:
  quality values help distinguish between sequencing error and allelic difference
  some aligners (e.g. MAQ) use quality values to find the correct read mapping position
Read accuracy:
  the more accurate the read, the easier and faster it is to map
  the more errors the aligner must tolerate, the fewer reads can be uniquely aligned
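The slide does not define the quality scale. As a minimal sketch, assuming the usual Phred convention (Q = -10 log10 of the error probability), the relationship between a quality value and the error rate it claims can be written as follows. The function names are illustrative and do not come from the presentation.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality value for an error probability 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

A quality value is "well calibrated" in this sense when the observed error rate among bases assigned Q is close to `phred_to_error_prob(Q)`.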

How to tabulate sequencing error rates?
Align a set of reads to the corresponding organismal reference genome sequences
Register positions of mismatches / gaps (S = substitution, I = insertion, D = deletion)
Reference sequence with the two end-reads of a PE read aligned below it; the measured fragment length (L) spans both end-reads:
…atggatgagtataacgtcaggctaaactgtagtatatggataaaatgacca*acga…
  tggatgagtataa*gtcagg             tatgcataaaatgaccatacg
               D                       S            I
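A minimal sketch of the "register positions of mismatches / gaps" step, assuming each alignment is available as two equal-length gapped strings that use `*` as the gap character, as in the slide's example. The function name `tabulate_errors` and the output layout are illustrative assumptions, not the tool used in the study.

```python
from collections import Counter

GAP = "*"  # gap symbol as drawn on the slide


def tabulate_errors(ref_aln: str, read_aln: str) -> Counter:
    """Count matches, substitutions (S), insertions (I) and deletions (D).

    An insertion is a read base opposite a gap in the reference;
    a deletion is a gap in the read opposite a reference base.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    counts = Counter()
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == GAP and read_base == GAP:
            continue  # column carries no information
        if ref_base == GAP:
            counts["I"] += 1
        elif read_base == GAP:
            counts["D"] += 1
        elif ref_base != read_base:
            counts["S"] += 1
        else:
            counts["match"] += 1
    return counts


if __name__ == "__main__":
    # The two end-reads from the slide, paired with the reference segments they cover.
    print(tabulate_errors("tggatgagtataacgtcagg", "tggatgagtataa*gtcagg"))
    print(tabulate_errors("tatggataaaatgacca*acg", "tatgcataaaatgaccatacg"))
```

Summing these counters over all aligned reads and dividing by the number of aligned bases gives a per-base error rate by error type.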

Caveat #1 – paralogous mapping
aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc   reference sequence
 accttactttaccttgtactg                                       incorrect map location (spurious error)
                                    accttactttaccttgtactg    correct map location
The spurious error can be avoided by using exhaustive alignment, which reveals that there are multiple possible map locations.
Reads that don’t map uniquely should not be used for error analysis…
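To illustrate why exhaustive alignment exposes paralogous placements, here is a brute-force sketch that reports every ungapped placement of a read within a mismatch budget. It is a toy stand-in for a real aligner, and `find_map_locations` and `is_uniquely_mapped` are hypothetical names.

```python
def find_map_locations(reference: str, read: str, max_mismatches: int):
    """Return (start, mismatches) for every ungapped placement within budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def is_uniquely_mapped(reference: str, read: str, max_mismatches: int) -> bool:
    """A read is usable for error analysis only if exactly one placement exists."""
    return len(find_map_locations(reference, read, max_mismatches)) == 1


if __name__ == "__main__":
    reference = "aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc"
    read = "accttactttaccttgtactg"
    print(find_map_locations(reference, read, max_mismatches=1))
    # -> [(1, 1), (36, 0)]
    print(is_uniquely_mapped(reference, read, max_mismatches=1))
    # -> False
```

With the slide's sequences, the read places exactly at offset 36 and with one mismatch at offset 1, so an aligner that stops at the first acceptable hit could report a spurious substitution.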

Caveat #2 – local misalignment
aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc   reference sequence
 accttactttgccttgtact*a                                      correct alignment (D)
 accttactttgccttgtacta                                       incorrect alignment (spurious S)
typically happens at the ends of reads
consequence of the scoring scheme… difficult to fix…
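The "consequence of the scoring scheme" can be made concrete with a small sketch. The scores below (match +1, mismatch -1, gap -3) are illustrative assumptions, not the parameters of any particular aligner. Under such a scheme, a substitution at the read end outscores the true deletion.

```python
def alignment_score(ref_aln: str, read_aln: str,
                    match: int = 1, mismatch: int = -1, gap: int = -3) -> int:
    """Score a gapped pairwise alignment column by column ('*' marks a gap)."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "*" or read_base == "*":
            score += gap
        elif ref_base == read_base:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    # Correct alignment: one deleted base near the read end.
    correct = alignment_score("accttactttgccttgtactga", "accttactttgccttgtact*a")
    # Incorrect alignment: the last base is forced onto the reference as a mismatch.
    incorrect = alignment_score("accttactttgccttgtactg", "accttactttgccttgtacta")
    print(correct, incorrect)  # -> 18 19
```

Because only one base follows the gap, the single match gained by opening the gap cannot pay for the gap penalty, so the aligner prefers the spurious substitution.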

Caveat #3 – polymorphic dataset / ref errors
aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc   reference sequence
aaccttactttgccttgcactgaaattactacgtacaccttactttaccttgtactgc   resequenced individual (carries a SNP)
      actttgccttgcactgaaatt                                  read (spurious S at the SNP)
important source of error: Θ ≈ 1/1,000 for humans
use resequencing data from haploid DNA source (e.g. BAC)
in polymorphic datasets, maybe do SNP calling, and exclude reads that overlap a SNP (this should also work for errors in the reference sequence)
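The suggestion to exclude reads that overlap a SNP could be sketched as a simple interval filter. This assumes reads are described by a 0-based start and a length on the reference, and that SNP positions come from a separate calling step; both the data layout and the function names are hypothetical.

```python
def overlaps_snp(read_start: int, read_length: int, snp_positions) -> bool:
    """True if any SNP position falls inside [read_start, read_start + read_length)."""
    read_end = read_start + read_length
    return any(read_start <= pos < read_end for pos in snp_positions)


def exclude_snp_reads(reads, snp_positions):
    """Keep only (start, length) reads that do not cover a called SNP."""
    snps = set(snp_positions)
    return [r for r in reads if not overlaps_snp(r[0], r[1], snps)]


if __name__ == "__main__":
    # In the slide's example the SNP is at 0-based reference position 17,
    # and the 21-base read starts at position 6, so it is dropped.
    reads = [(6, 21), (30, 21)]
    print(exclude_snp_reads(reads, snp_positions=[17]))
    # -> [(30, 21)]
```

For genome-scale data a sorted position array with binary search, or an interval tree, would replace the linear scan used here.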

Caveat #4 – very low quality reads
aaccttactttgccttgtactgaaattactacgtacaccttactttaccttgtactgc   reference sequence
acaatgcgttgca***agatt                                        read
these reads are hard to even align, so they elude error statistics
only tabulate error stats for reads that will be aligned?

Study design
Took 3 random lanes each from 3 random runs of PE Illumina reads from Sanger (100,000 random pairs per lane)
Mapped the reads to NCBI build36 using MosaikPE
Alignment conditions:
  paired-end alignments, unique-unique end-reads
  maximum 4 mismatches per end-read
Fraction aligned: on average, X fraction of the reads used
Analysis by Derek Barnett

Fragment length distribution

Base error rate and error type

Base error rate by substitution type

Per-read error rates

Position-specific base error rates

Accuracy across lanes and runs

Assigned Q-value vs. base error rate
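The figure for this slide is not preserved in the transcript. As a hedged sketch of how such a comparison could be computed, the code below groups aligned bases by their assigned Q-value and converts the observed mismatch rate in each group back to a Phred-scaled empirical quality. The input layout, a stream of `(assigned_q, is_error)` pairs, and the function name are assumptions.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Map each assigned Q to the Phred-scaled observed error rate.

    `observations` yields (assigned_q, is_error) pairs, one per aligned base.
    Bins with no observed errors are reported as math.inf.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for assigned_q, is_error in observations:
        totals[assigned_q] += 1
        errors[assigned_q] += int(bool(is_error))
    result = {}
    for q in sorted(totals):
        rate = errors[q] / totals[q]
        result[q] = -10 * math.log10(rate) if rate > 0 else math.inf
    return result


if __name__ == "__main__":
    # Synthetic data: 1,000 bases labelled Q20 with 10 errors, 500 labelled Q30 with 5 errors.
    obs = [(20, i < 10) for i in range(1000)] + [(30, i < 5) for i in range(500)]
    for q, emp in empirical_quality(obs).items():
        print(q, round(emp, 2))
```

In this synthetic example the Q20 bin is well calibrated (1% observed errors), while the Q30 bin is overconfident: its observed error rate is also 1%, which corresponds to an empirical quality of about 20.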

Does the Q-value depend on base cycle?

Quality value calibration
[Figure axes: base cycle, raw Q]

Q-values for read simulations
Weichun Huang, see poster at Genome Meeting
introduce error in read at base cycle 10 with P=0.001
[Figure axes: base cycle, Q]

Thanks: Derek, Aaron, Weichun, Michael, Chip