Polymorphism discovery in next-generation re-sequencing data. Gabor T. Marth, Boston College Biology Department. Illumina workshop, Washington, DC, November 19-20, 2007

Presentation transcript:

Polymorphism discovery in next-generation re-sequencing data. Gabor T. Marth, Boston College Biology Department. Illumina workshop, Washington, DC, November 19-20, 2007

Why do we care about genetic variations? They underlie phenotypic differences, cause inherited diseases, and allow tracking of ancestral human history.

Variation types: SNPs; structural variations; epigenetic variations (Mikkelsen et al. Nature 2007)

Sequence resources for polymorphism discovery. [Figure: read length (10 bp to 1,000 bp) vs. bases per machine run (1 Mb to 1 Gb). Platforms shown: ABI capillary sequencer; 454 pyrosequencer (100 Mb in ~250 bp reads); Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in bp reads)]

Resequencing-based SNP discovery: (ii) micro-repeat analysis; (iii) read mapping (pair-wise alignment to genome reference); (iv) read assembly; (v) SNP calling; (vi) SNP validation; (vii) data viewing, hypothesis generation. [Figure labels: REF, IND]

Talk topics. Tools for resequencing read analysis: base calling; resequenceability analysis; read mapping / alignment / assembly; SNP calling; structural variation discovery; read data visualization. Data mining projects: SNP and short-INDEL discovery in C. elegans; complete mutational profiling in Pichia stipitis.

Reference-guided read alignment. Reference-sequence guided assembly: …they give you the pieces… AND the cover on the box

Some pieces are easier to place than others… pieces with unique features vs. pieces that look like each other…

Resequenceability: unique read placement. Reads from repeats cannot be uniquely mapped. RepeatMasker does not capture all repeats at the read length scale. Near-perfect repeats can also be a problem because of sequencing errors and / or SNPs. [Figure: fraction of reads vs. number of mismatches]

Finding micro-repeats is not easy. Hash-based methods: fast, but only work out to a couple of mismatches. Exact methods: very slow, but find every repeat copy. Heuristic methods: fast, but miss a fraction of the repeats.
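To make the hash-based idea concrete, here is a minimal Python sketch that indexes every k-mer of a reference and reports which ones occur more than once. It is only an illustration under simplifying assumptions: it finds exact repeats on the forward strand only (no mismatches, no reverse complement), and the function names are hypothetical rather than taken from any tool mentioned in the talk.

```python
from collections import defaultdict


def kmer_positions(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def repeated_kmers(reference: str, k: int) -> dict:
    """Return only the k-mers that start at more than one position."""
    index = kmer_positions(reference, k)
    return {kmer: pos for kmer, pos in index.items() if len(pos) > 1}


def unique_fraction(reference: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    index = kmer_positions(reference, k)
    total = len(reference) - k + 1
    unique = sum(1 for pos in index.values() if len(pos) == 1)
    return unique / total


if __name__ == "__main__":
    ref = "ACGTACGTTT"  # toy reference; ACGT occurs twice
    print(repeated_kmers(ref, 4))
    print(unique_fraction(ref, 4))
```

Here k plays the role of the read length: a read drawn from a position whose k-mer is repeated cannot be placed uniquely by exact matching.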

Presenting repeats for downstream analysis: masking bases vs. masking fragments. Bases in repetitive fragments may be resequenced with reads representing other, unique fragments → fragment-level repeat annotations spare a higher fraction of the genome than base-level repeat masking.

Fragment level annotation is economical

Paired-end reads will not make the question go away

Read alignment

INDELs require gapped alignment. Sequences, often from different machine types (ABI/cap., 454/FLX, 454/GS20, Illumina), must be assembled together. Billions of sequences must be aligned.

MOSAIK: an anchored aligner / assembler (Michael Stromberg). Step 1: initial short-hash scan for possible read locations. Step 2: evaluation of candidate locations with the SW (Smith-Waterman) method.
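The two-step idea (hash scan for candidate locations, then Smith-Waterman evaluation) can be sketched in a few lines of Python. This is a generic seed-and-extend illustration, not MOSAIK's implementation: the function names, the scoring values (match +1, mismatch -1, gap -2) and the `pad` window are assumptions chosen for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1: hash scan. Each seed hit votes for an implied read start."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            votes[i - j] += 1
    # Most-supported candidates first; ties broken by position.
    return sorted(votes, key=lambda s: (-votes[s], s))


def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Step 2: best local alignment score between a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,
                prev[j - 1] + (match if ca == cb else mismatch),
                prev[j] + gap,
                curr[j - 1] + gap,
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def align(read, reference, index, k, pad=2):
    """Score each candidate window and return (best_start, best_score)."""
    best_start, best_score = None, -1
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman(read, reference[lo:hi])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


if __name__ == "__main__":
    ref = "TTGACCATGCAAGGCTTACGGATCCA"
    idx = build_index(ref, 4)
    print(align("GCAAGGCTTA", ref, idx, 4))  # exact copy of ref[8:18]
    print(align("GCAAGCCTTA", ref, idx, 4))  # same read with one mismatch
```

The hash scan keeps the expensive dynamic-programming step confined to a handful of short windows instead of the whole reference, which is the design choice the slide describes.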

MOSAIK performance. Solexa read alignments to C. elegans genome: 100 million reads aligned in 95 minutes; 18,000 reads / second. 454 reads to Pichia (yeast-size) genome: GS20: 2,000 reads / second; FLX: 300 reads / second. Solexa read alignments to masked human genome: 40 seconds for 1 million reads; 18,000 reads / second; 5.5 GB RAM used (more for longer initial hash sizes).

Polymorphism detection. Goal: to discern true variation from sequencing error. [Figure labels: sequencing error; polymorphism]

Using base quality values: base quality values help us decide whether mismatches are true polymorphisms or sequencing errors.
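For reference, a Phred-scaled base quality Q corresponds to an error probability P = 10^(-Q/10), so Q20 means a 1-in-100 chance that the base call is wrong. A small Python sketch of the conversion follows; the helper names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality value."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

A mismatch at a Q10 base is therefore far more likely to be a sequencing error than a mismatch at a Q40 base, which is the information a SNP caller can use.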

Bayesian detection algorithm. Inputs: base call + base quality; expected polymorphism rate; base composition; depth of coverage. Possible true-base assignments are either monomorphic combinations (AAAAAAAAAA, CCCCCCCCCC, TTTTTTTTTT, GGGGGGGGGG) or polymorphic combinations. Output: Bayesian posterior probability, i.e. the SNP score.
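The following Python sketch shows one way such a posterior can be computed for a single alignment column. It is a deliberately simplified model and not the PolyBayes formula: it assumes a uniform base composition, independent reads, and, in the polymorphic case, exactly two alleles that each read samples with equal probability. The function names and the default prior of 0.001 are assumptions for illustration.

```python
from itertools import combinations

BASES = "ACGT"


def call_likelihood(called, true_base, error_prob):
    """P(called base | true base), spreading errors evenly over the 3 other bases."""
    return 1.0 - error_prob if called == true_base else error_prob / 3.0


def snp_posterior(column, prior_snp=0.001):
    """Posterior probability that a column of (base, phred_quality) calls is polymorphic."""
    obs = [(base, 10 ** (-q / 10)) for base, q in column]

    # Monomorphic hypothesis: every read comes from the same true base.
    mono = 0.0
    for b in BASES:
        like = 1.0
        for called, err in obs:
            like *= call_likelihood(called, b, err)
        mono += like / len(BASES)

    # Polymorphic hypothesis (assumed): two alleles, each read samples either one.
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a, b in pairs:
        like = 1.0
        for called, err in obs:
            like *= 0.5 * call_likelihood(called, a, err) \
                  + 0.5 * call_likelihood(called, b, err)
        poly += like / len(pairs)

    numerator = prior_snp * poly
    return numerator / (numerator + (1.0 - prior_snp) * mono)


if __name__ == "__main__":
    print(snp_posterior([("A", 30)] * 10))                    # all reads agree
    print(snp_posterior([("A", 30)] * 5 + [("C", 30)] * 5))   # two well-supported alleles
    print(snp_posterior([("A", 30)] * 9 + [("C", 5)]))        # one low-quality mismatch
```

The prior stands in for the expected polymorphism rate, the per-read error probabilities come from the base qualities, and deeper coverage sharpens the posterior in either direction.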

The PolyBayes software Marth et al. Nature Genetics

Data visualization (Weichun Huang). 1. Aid software development: integration of trace data viewing, fast navigation, zooming/panning. 2. Facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays. 3. Promote hypothesis generation: integration of annotation tracks.

SNP calling in short-read coverage. Goal: detect polymorphisms between the Pasadena and the Bristol strains. Data: 5 runs (~120 million) of Illumina reads from Wash. U. (Elaine Mardis): Bristol, N2 strain (3 ½ machine runs) and Pasadena, CB4858 (1 ½ machine runs), against the C. elegans reference genome (Bristol, N2 strain). Results: aligned / assembled the reads (< 4 hours on 1 CPU); found 44,642 SNP candidates (2 hours on our 160-CPU cluster); SNP density: 1 in 1,630 bp (of non-repeat genome sequence).

Polymorphism discovery in C. elegans. SNP calling error rate very low: validation rate = 97.8% (224/229); conversion rate = 92.6% (224/242); missed SNP rate = 3.75% (26/693). INDEL candidates validate and convert at similar rates to SNPs: validation rate = 89.3% (193/216); conversion rate = 87.3% (193/221). [Figure labels: SNP, INS]
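The quoted percentages can be re-derived from the counts in parentheses. The short Python check below does just that; the dictionary keys and the helper name are hypothetical labels, and the comparison allows 0.1 percentage point of slack because the slide's figures may be rounded or truncated.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage."""
    return 100.0 * numerator / denominator


# (count, total, percentage quoted on the slide)
QUOTED = {
    "SNP validation": (224, 229, 97.8),
    "SNP conversion": (224, 242, 92.6),
    "SNP missed": (26, 693, 3.75),
    "INDEL validation": (193, 216, 89.3),
    "INDEL conversion": (193, 221, 87.3),
}

if __name__ == "__main__":
    for name, (num, den, quoted) in QUOTED.items():
        computed = rate(num, den)
        agrees = abs(computed - quoted) <= 0.1
        print(f"{name}: {computed:.2f}% (slide: {quoted}%, agrees: {agrees})")
```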

Mutational profiling: deep 454/Illumina data (collaboration with Doug Smith at Agencourt). Pichia stipitis converts xylose to ethanol (bio-fuel production). One mutagenized strain had especially high conversion efficiency; the goal was to determine where the mutations that caused this phenotype were. We resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads. [Figure: Pichia stipitis reference sequence; image from JGI web site]

Mutational profiling: comparisons
Technology   Coverage   Nominal coverage   FP   FN   Total error
454/FLX      2 runs     12.9x
454/FLX      1 run      9.8x               6    1    7
Illumina     7 lanes    53.5x              0    0    0
Illumina     3 lanes    23.4x              0    0    0
Illumina     2 lanes    15.6x              2    0    2
Illumina     1 lane     7.6x               2    2    2
SOLiD        -          30.0X              0    0    0
SOLiD        -          20.0X              0    0    0
SOLiD        -          10.0X              0    0    0
SOLiD        -          8.0X               0    4    4
SOLiD        -          6.0X               0    6    6

Our software is available for testing

Credits: Elaine Mardis (Washington University); Doug Smith (Agencourt); Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby. Research supported by: NHGRI (G.T.M.); BC Presidential Scholarship (A.R.Q.).

Resequencing of diploid individual genomes. [Figure labels: Ind. 1, Ind. 2, Ind. 3, Ind. 4]

How do we find sequence variations? Compare multiple sequences from the same genome region.

Resequencing applications of next-gen sequencers. Polymorphism discovery: organismal SNP discovery; complete mutational profiling; individual human resequencing for SNP, INDEL and structural variation discovery. Emerging applications: DNA-protein interaction analysis (ChIP-Seq); epigenetic analysis (methylation profiling); novel transcript discovery; quantification of gene expression. [Figure labels: reference genome; resequenced individual; SNP; DEL]

Task 5. Dealing with massive data volumes. Two connected working groups to define standard data formats: Short-read format working group (Asim Siddiqui, UBC); Assembly format working group (Boston College).

SNP calling in low 454 coverage (with Andy Clark, Cornell, and Elaine Mardis, Wash. U.). Question: can we detect SNPs in survey-style 454 read coverage? Data: 10 different African and American D. melanogaster isolates (DNA courtesy of Chuck Langley, UC Davis); 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total). Pipeline: base-calling with PYROBAYES; alignment to the 120 Mb euchromatic reference sequence with MOSAIK; SNP detection with POLYBAYES.

SNP calling results: 658,280 SNPs; θ ≈ 5×10^-3 (1 SNP / 200 bp); 92.9% validation rate (1,342 / 1,443); 2.0% missed SNP rate (25 / 1,247). [Figure labels: iso-1 reference; read 46-2; ABI reads (2 fwd + 2 rev)]

Flow signal vs. actual base number

Reference-guided read alignment

PYROBAYES: a 454 base caller program (Aaron Quinlan). Better correlation between assigned and measured quality values; higher fraction of high-quality bases.

454 errors: over and under-calls

Validation / score calibration

Traditional SNP discovery data: capillary sequences (ABI); clonal (haploid) sequences.

The SNP score. [Figure labels: polymorphism; specific variation]

SNPs and short INDELs: single-base substitutions (SNPs); insertion-deletion polymorphisms (INDELs).

Structural variations. Translocations: DNA exchange between different chromosomes. Inversion of long chromosomal tracts. “Simple” duplications and deletions. Multiple duplications (copy number changes).

Epigenetic variations: e.g. changes in methylation / chromatin structure that do not strictly involve base changes (Mikkelsen et al. Nature 2007).

Task 1. Base calling / base accuracy estimation. How do we translate the machine readouts to base calls? How do we estimate and represent sequencing errors (base quality values)?

454 pyrosequencing errors

454 pyrosequencer error profile: INDEL errors dominate.

454 base quality values: most bases have low quality values, which is not optimal for SNP discovery; native 454 base quality values underestimate true accuracy.

Illumina/Solexa base accuracy. Most errors are substitutions → PHRED quality values work. Measured base quality is a function of base position within the read (i.e. there is a need for quality value calibration).

Illumina/Solexa base accuracy. Error rate grows as a function of base position within the read. A large fraction of the reads contains 1 or 2 errors.
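One simple way to measure base quality as a function of read position is to count mismatches against the reference at each position and convert the observed rate to a Phred-scaled value. The Python sketch below does this under strong assumptions that are not from the talk: reads are aligned without gaps, every mismatch is treated as a sequencing error (true SNPs are ignored), and add-one smoothing is used to avoid taking the logarithm of zero. The function name is hypothetical.

```python
import math


def empirical_quality_by_position(reads, references):
    """Return one Phred-scaled empirical quality per read position.

    reads[i] and references[i] are ungapped, aligned strings of equal length.
    """
    n = max(len(r) for r in reads)
    errors = [0] * n
    totals = [0] * n
    for read, ref in zip(reads, references):
        for pos, (called, expected) in enumerate(zip(read, ref)):
            totals[pos] += 1
            if called != expected:
                errors[pos] += 1

    qualities = []
    for err, tot in zip(errors, totals):
        smoothed_rate = (err + 1) / (tot + 2)  # add-one smoothing
        qualities.append(-10 * math.log10(smoothed_rate))
    return qualities


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "ACGT", "ACCA"]
    refs = ["ACGT"] * 4
    print(empirical_quality_by_position(reads, refs))
```

Comparing these measured values with the qualities the base caller assigned at the same positions gives the calibration table the slide calls for.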

Task 2. Read mapping and assembly. De novo assembly: … is similar to a jigsaw puzzle… … that you have to put together all by yourself

Structural variation discovery. Copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage. Structural rearrangements (inversions and translocations) require paired-end reads.
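A minimal Python sketch of depth-based copy number detection follows: average the per-base read depth in fixed windows, compare each window with the median window, and flag outliers. The window size, the 0.5 and 1.5 ratio thresholds, and the function names are illustrative assumptions, not values from the talk; a real caller would also need to account for mappability and sampling noise.

```python
from statistics import median


def window_means(depth, window):
    """Mean read depth in consecutive, non-overlapping windows."""
    return [
        sum(depth[i:i + window]) / window
        for i in range(0, len(depth) - window + 1, window)
    ]


def call_copy_number(depth, window, low=0.5, high=1.5):
    """Label each window by its depth relative to the median window depth."""
    means = window_means(depth, window)
    baseline = median(means)
    if baseline == 0:
        raise ValueError("median window depth is zero; cannot normalize")
    calls = []
    for idx, mean_depth in enumerate(means):
        ratio = mean_depth / baseline
        if ratio < low:
            label = "deletion"
        elif ratio > high:
            label = "amplification"
        else:
            label = "normal"
        calls.append((idx * window, label))
    return calls


if __name__ == "__main__":
    # Toy depth profile: a low-coverage stretch and a high-coverage stretch.
    depth = [10] * 20 + [2] * 5 + [10] * 10 + [25] * 5 + [10] * 10
    for start, label in call_copy_number(depth, window=5):
        if label != "normal":
            print(start, label)
```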

Task 4. Data visualization make screenshot with annotation

Applications. 1. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans) (image from Nature Biotech.). 2. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis).