Data analysis methods for next-generation sequencing technologies Gabor T. Marth Boston College Biology Department Epigenomics & Sequencing Meeting July.

Presentation transcript:

Data analysis methods for next-generation sequencing technologies. Gabor T. Marth, Boston College Biology Department. Epigenomics & Sequencing Meeting, July 14-15, 2008, Boston, MA

T1. Roche / 454 FLX system: pyrosequencing technology; variable read length; the only new technology with >100bp reads; tested in many published applications; supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer: fixed-length short-read sequencer; read properties are very close to those of traditional capillary sequences; very low INDEL error rate; tested in many published applications; paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system: fixed-length short-read sequencer; employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy; requires color-space informatics; published applications underway / in review; paired-end read protocols support up to 10kb separation size [Figure: 2-base encoding table, 1st base (A, C, G, T) by 2nd base (A, C, G, T)]
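To make the 2-base encoding idea concrete, here is a minimal sketch of color-space encoding and decoding. It assumes the commonly described scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes; the function names are hypothetical and this is not vendor software.

```python
# Assumed 2-bit base codes; the color of two adjacent bases is the XOR of their codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(seq):
    """Return one color (0-3) per adjacent base pair in seq."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a list of colors."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # XOR with the color gives the next base code
        bases.append(CODE_BASE[code])
    return "".join(bases)


if __name__ == "__main__":
    ref = encode_colors("ACGT")
    snp = encode_colors("ATGT")  # single base change at position 1
    print(ref, snp)
    print(decode_colors("A", ref))
```

In this scheme a true single-base difference changes two adjacent colors, while an isolated color error changes only one, which is the property that can be used for error reduction.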

T4. Helicos / Heliscope system: experimental short-read sequencer system; single molecule sequencing; no amplification; variable read length; error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs. (1) sequence alignment; (2) dealing with non-unique mapping; (3) looking for allelic differences

A2. Structural variation detection: structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations; copy number (for amplifications, deletions) from depth of read coverage
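The paired-end signal can be sketched as a simple test on mapped separation. The sketch below assumes the library's fragment lengths are summarized by a mean and standard deviation and flags pairs more than a chosen number of standard deviations away; the function name, the threshold, and the labels are illustrative assumptions, not a published SV caller.

```python
def classify_pair(separation, mean_frag, sd_frag, n_sd=3.0):
    """Label a read pair by how its mapped separation compares to the library.

    separation: distance between the two mapped ends on the reference (bp)
    mean_frag, sd_frag: assumed fragment-length mean and standard deviation (bp)
    """
    if separation > mean_frag + n_sd * sd_frag:
        # Ends map too far apart: sequence is missing in the sample (deletion).
        return "deletion"
    if separation < mean_frag - n_sd * sd_frag:
        # Ends map too close together: extra sequence in the sample (insertion).
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    for sep in (3100, 5000, 1000):
        print(sep, classify_pair(sep, mean_frag=3000, sd_frag=300))
```

Inversions and translocations would need mate orientation and chromosome as well, which this sketch leaves out.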

A3. Identification of protein-bound DNA (CHIP-SEQ): chromatin structure (Mikkelsen et al. Nature 2007); transcription binding sites (Robertson et al. Nature Methods, 2007) [Figure labels: genome sequence, aligned reads]

A4. Novel transcript discovery (genes) Mortazavi et al. Nature Methods

A5. Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

A6. Expression profiling by tag counting. Jones-Rhoades et al. PLoS Genetics, 2007 [Figure labels: aligned reads, gene]
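Tag counting reduces to counting aligned reads per annotated feature. A minimal sketch follows, assuming each read is represented only by its mapped start coordinate and each gene by a half-open interval on a single chromosome; both representations and the function name are assumptions made for illustration.

```python
def count_tags(read_positions, genes):
    """Count reads whose mapped start falls inside each gene interval.

    read_positions: iterable of integer start coordinates
    genes: dict of gene name -> (start, end), half-open [start, end)
    """
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    genes = {"g1": (0, 100), "g2": (200, 300)}
    print(count_tags([5, 50, 250, 150], genes))
```

The nested loop is fine for a toy example; a real data set would call for a sorted or indexed interval lookup.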

A7. De novo organismal genome sequencing. Lander et al. Nature 2001 [Figure labels: assembled sequence contigs, short reads, longer reads, read pairs]

C1. Read length [Figure: read length [bp] by technology, marked (var), (fixed), (fixed), (var); axis values mostly lost, ~400 remains]

When does read length matter? Short reads are often sufficient where the entire read length can be used for mapping: SNPs, short-INDELs, SVs; CHIP-SEQ; short RNA discovery; counting (mRNA, miRNA). Longer reads are needed where one must use parts of reads for mapping: de novo sequencing; novel transcript discovery. [Figure: reads aacttagacttaca, gacttacatacgta, accgattactatacta relative to Known exon 1 and Known exon 2]

C2. Read error rate: error rate dictates the stringency of the read mapper; error rate typically %; the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
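One way to see why the error rate drives mapper stringency is to ask what fraction of reads stays within a given error allowance. The sketch below assumes errors are independent with the same per-base rate at every position (a simplification, since error rates vary along the read) and uses the binomial distribution; the function name and the example values are hypothetical.

```python
import math


def prob_at_most_k_errors(read_len, error_rate, k):
    """Probability that a read of read_len bases contains at most k errors,
    assuming independent errors with a uniform per-base error_rate."""
    return sum(
        math.comb(read_len, i) * error_rate**i * (1 - error_rate) ** (read_len - i)
        for i in range(k + 1)
    )


if __name__ == "__main__":
    # Illustrative values only: a 35bp read at a 1% per-base error rate.
    for k in range(3):
        print(k, prob_at_most_k_errors(35, 0.01, k))
```

A mapper that tolerates k mismatches recovers roughly this fraction of reads from error alone, but every extra tolerated mismatch also makes more genomic locations look like acceptable matches.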

Error rate grows with each cycle; this phenomenon limits useful read length

Substitutions vs. INDEL errors

C3. Representational biases / library complexity: fragmentation biases; amplification biases (PCR); sequencing biases [Figure labels: sequencing, low/no representation, high representation]

Dispersal of read coverage: this affects variation discovery (deeper starting read coverage is needed); it should have a major impact on counting applications

Amplification errors: many reads from clonal copies of a single fragment; early PCR errors in “clonal” read copies lead to false positive allele calls; an early amplification error gets propagated onto every clonal copy
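A common way to limit the effect of clonal copies is to collapse reads that share the same mapped coordinates before allele calling. The sketch below assumes reads are tuples of (chromosome, start, strand, mean quality) and keeps the highest-quality read per position and strand; that representation and the keep-the-best rule are assumptions for illustration, not a method stated in the slides.

```python
def collapse_duplicates(reads):
    """Keep one read per (chrom, start, strand), choosing the highest mean quality.

    reads: iterable of (chrom, start, strand, mean_quality) tuples
    """
    best = {}
    for chrom, start, strand, quality in reads:
        key = (chrom, start, strand)
        if key not in best or quality > best[key][3]:
            best[key] = (chrom, start, strand, quality)
    return sorted(best.values())


if __name__ == "__main__":
    reads = [
        ("chr1", 100, "+", 30),
        ("chr1", 100, "+", 35),  # likely clonal copy of the read above
        ("chr1", 100, "-", 20),
        ("chr1", 200, "+", 25),
    ]
    print(collapse_duplicates(reads))
```

At very deep coverage independent fragments can share a start position by chance, so this kind of collapsing trades some true coverage for fewer propagated PCR errors.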

C4. Paired-end reads. Fragment amplification: fragment length bp; fragment length limited by amplification efficiency. Circularization: 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007). Paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
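The rescue idea in the last point can be sketched as follows: when one end maps uniquely and the other has several candidate locations, keep the candidate that is consistent with the expected fragment length. The function name, the coordinate representation, and the rule of rescuing only when exactly one candidate fits are assumptions made for this illustration.

```python
def rescue_mate(anchor_pos, candidates, min_frag, max_frag):
    """Pick the mate location consistent with the fragment-length constraint.

    anchor_pos: position of the uniquely mapped end
    candidates: possible positions of the non-uniquely mapped end
    Returns the single consistent position, or None if zero or several fit.
    """
    consistent = [
        pos for pos in candidates if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]
    return consistent[0] if len(consistent) == 1 else None


if __name__ == "__main__":
    # Anchor at 10,000; the mate hits three repeat copies, only one ~3kb away.
    print(rescue_mate(10_000, [13_100, 250_000, 900_000], 2_000, 4_000))
```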

Technologies / properties / applications
Technology | Roche/454 | Illumina/Solexa | AB/SOLiD
Read properties:
Read length | bp | 20-50bp | 25-50bp
Error rate | <0.5% | <1.0% | <0.5%
Dominant error type | INDEL | SUB
Quality values available | yes | not really
Paired-end separation | < 10kb (3kb optimal) | bp | 500bp - 10kb (3kb optimal)
Applications:
SNP discovery | ● | ● | ○
short-INDEL discovery | ● | ○
SV discovery | ○ | ○ | ●
CHIP-SEQ | ○ | ● | ●
small RNA/gene discovery | ○ | ● | ●
mRNA Xcript discovery | ● | ○ | ○
Expression profiling | ○ | ● | ●
De novo sequencing | ● | ? | ?

Resequencing-based SNP discovery: (ii) micro-repeat analysis; (iii) read mapping (pair-wise alignment to genome reference); (iv) read assembly; (v) SNP calling; (vi) SNP validation; (vii) data viewing, hypothesis generation [Figure labels: REF, IND]

The “toolbox”: base callers, microrepeat finders, read mappers, SNP callers, structural variation callers, assembly viewers

Reference guided read mapping. Reference-sequence guided mapping: …you get the pieces… AND they give you the cover on the box. Some pieces are more unique than others

MOSAIK: an anchored aligner / assembler. Step 1: initial short-hash scan for possible read locations. Step 2: evaluation of candidate locations with the Smith-Waterman (SW) method. Michael Stromberg
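The two-step idea (hash scan, then Smith-Waterman evaluation) can be illustrated with a toy seed-and-extend mapper. This is a sketch of the general technique only, not MOSAIK's implementation: the k-mer size, scoring values, window padding, and all function names are assumptions.

```python
from collections import defaultdict


def build_index(ref, k):
    """Step 1a: hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1b: implied read start positions from every k-mer hit."""
    starts = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            starts.add(i - j)
    return sorted(starts)


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def map_read(read, ref, k=4, pad=2):
    """Return (start, score) of the best-scoring candidate, or None if no seed hits."""
    index = build_index(ref, k)
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(ref), start + len(read) + pad)
        score = sw_score(read, ref[lo:hi])
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    ref = "TTTTACGTACGGATCCATTTT"
    print(map_read("ACGGATCC", ref))  # exact match
    print(map_read("ACGGTTCC", ref))  # one mismatch, still seeded by ACGG
```

The hash scan keeps the expensive dynamic programming restricted to a few short reference windows, which is where the speed of this kind of aligner comes from.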

Non-unique mapping, gapped alignments 1. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented) 2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles

Read types aligned, paired-end read strategy. 3. Aligns and co-assembles customary read types: ABI/capillary, Illumina/Solexa, AB/SOLiD, Roche/454, Helicos/Heliscope. 4. Paired-end read alignments [Figure labels: ABI/capillary, 454 FLX, 454 GS20, Illumina]

Other mainstream read mappers: ELAND (Tony Cox, Illumina) -- the “official” read mapper supplied by Illumina, fast; MAQ (Li Heng + Richard Durbin, Sanger) -- the most widely used read mapper, low RAM footprint; SOAP (Beijing Genomics Institute) -- a new mapper developed for human next-gen reads; SHRIMP (Michael Brudno, University of Toronto) -- full Smith-Waterman

Speed

Polymorphism / mutation detection sequencing error polymorphism

Determining genotype directly from sequence [Figure: reads AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA from individual 1, individual 2, individual 3, with genotype labels A/C, C/C, A/A]
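Calling a genotype from the bases observed at one site can be sketched with a simple diploid likelihood model. This is a generic illustration, not the author's caller: it assumes independent reads, equal sampling of both chromosomes, errors split evenly among the three other bases, positive PHRED qualities, and candidate alleles limited to those observed.

```python
import math
from itertools import combinations_with_replacement


def call_genotype(bases, quals):
    """Return the maximum-likelihood diploid genotype, e.g. "A/C".

    bases: observed base from each read at the site
    quals: PHRED base quality for each observed base (assumed > 0)
    """
    best = None
    for a, b in combinations_with_replacement(sorted(set(bases)), 2):
        loglik = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # PHRED quality to error probability
            p_a = 1 - err if base == a else err / 3
            p_b = 1 - err if base == b else err / 3
            loglik += math.log(0.5 * p_a + 0.5 * p_b)
        if best is None or loglik > best[1]:
            best = (f"{a}/{b}", loglik)
    return best[0]


if __name__ == "__main__":
    print(call_genotype("AAAA", [30, 30, 30, 30]))
    print(call_genotype("ACAC", [30, 30, 30, 30]))
    print(call_genotype("AAAC", [30, 30, 30, 5]))  # lone low-quality C
```

The last call shows why quality values matter: a single low-quality discordant base is explained more cheaply as a sequencing error than as a second allele.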

Software SNP INS

Data visualization. 1. aid software development: integration of trace data viewing, fast navigation, zooming/panning; 2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays; 3. promote hypothesis generation: integration of annotation tracks. Weichun Huang

Applications. 1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster); 2. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans); 3. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis) (image from Nature Biotech.)

Our software is available for testing

Credits: Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt). Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.). Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby

Accuracy: as is the case for all heuristic alignment algorithms, accuracy and speed are option- and parameter-dependent

C3. Quality values are important for allele calling: PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles; inaccurate or poorly calibrated base quality values hinder allele calling; Q-values should be accurate … and high!
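For reference, a PHRED quality value Q corresponds to an estimated error probability P through Q = -10 log10(P). The short sketch below converts in both directions; the function names are hypothetical.

```python
import math


def phred_to_prob(q):
    """Estimated probability that a base call with PHRED quality q is wrong."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """PHRED quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_prob(q))
```

So Q20 means an estimated 1 in 100 chance of error and Q30 means 1 in 1,000, which is why a well calibrated, high Q-value makes a discordant base credible as a true allele.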

Software tools for next-gen sequence analysis

Next-generation sequencing technologies and applications