Next-generation sequencing: the informatics angle

Presentation transcript:

Next-generation sequencing: the informatics angle
Gabor T. Marth, Boston College Biology Department

Next-generation sequencing
[Figure: bases per machine run (1 Mb to 10 Gb) vs. read length (10 bp to 1,000 bp)]
Illumina, AB/SOLiD short-read sequencers: 5-15 Gb in 25-70 bp reads
454 pyrosequencer: 100-400 Mb in 200-450 bp reads
ABI capillary sequencer

Individual human resequencing

Whole-genome mutational profiling

Expression analysis

Technologies

Roche / 454 system
pyrosequencing technology
variable read-length
the only new technology with >100 bp reads

Illumina / Solexa Genome Analyzer
fixed-length short-read sequencer
very high throughput
read properties are very close to traditional capillary sequences
low INDEL error rate

AB / SOLiD system
fixed-length short reads
very high throughput
2-base encoding system
color-space informatics
[Figure: 2-base color-code matrix, 1st base vs. 2nd base (A, C, G, T)]
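
The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described SOLiD color code, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function names are made up for this example and are not part of any vendor software.

```python
# 2-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colorspace(seq):
    """Return the leading base followed by one color (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(cs):
    """Rebuild the base sequence from the leading base and the color string."""
    bases = [cs[0]]
    for color in cs[1:]:
        # XOR is its own inverse, so the previous base plus the color gives the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    encoded = encode_colorspace("ATGGC")
    print(encoded, decode_colorspace(encoded))
```

Because every base is decoded from the one before it, a single wrong color changes all downstream bases after decoding, which is one reason color-space reads are usually handled in color space rather than translated first.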

Helicos / Heliscope system
short-read sequencer
single molecule sequencing
no amplification
variable read-length
error rate reduced with 2-pass template sequencing

Data characteristics

Read length
[Figure: read length [bp] by platform; labeled ranges 20-60 bp (variable) and 25-50 bp (fixed); axis ticks at 100, 200, 300, 400 bp]

Paired fragment-end reads
Fragment amplification: fragment length 100-600 bp; fragment length limited by amplification efficiency
Circularization: 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity (Korbel et al. Science 2007)
Paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
Instrumental for structural variation discovery
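
The "rescue" idea can be sketched in a few lines: when one end maps uniquely and the other has several candidate placements, only placements that imply a plausible fragment length are kept. The function name, the coordinate convention, and the default 100-600 bp window (taken from the fragment amplification range above) are assumptions for illustration only.

```python
def consistent_pairs(left_starts, right_ends, min_frag=100, max_frag=600):
    """Return (left, right) placements whose implied fragment length is plausible.

    left_starts: candidate positions of the fragment's leftmost base
    right_ends:  candidate positions of the fragment's rightmost base
    """
    pairs = []
    for left in left_starts:
        for right in right_ends:
            frag_len = right - left + 1
            if min_frag <= frag_len <= max_frag:
                pairs.append((left, right))
    return pairs


if __name__ == "__main__":
    # One end maps uniquely; the mate has three candidate placements.
    print(consistent_pairs([1000], [1299, 50299, 910000]))
```

If exactly one pair survives, the ambiguous end has been rescued; if none or several survive, the pair stays unresolved.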

Representational biases
“dispersed” coverage distribution
this affects genome resequencing (deeper starting read coverage is needed)
will have a major impact on counting applications

Amplification errors
an early amplification error gets propagated into every clonal copy
many reads come from clonal copies of a single fragment
early PCR errors in “clonal” read copies lead to false positive allele calls
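
One widely used heuristic for limiting the effect of clonal copies is to keep a single read per mapped start position and strand. The sketch below shows that idea only; the tuple layout and function name are hypothetical, and the talk does not say which approach the authors' tools take.

```python
def collapse_clonal_reads(reads):
    """Keep the first read seen for each (chrom, start, strand) key.

    reads: iterable of (chrom, start, strand, sequence) tuples (hypothetical layout).
    """
    seen = set()
    kept = []
    for chrom, start, strand, seq in reads:
        key = (chrom, start, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, strand, seq))
    return kept


if __name__ == "__main__":
    reads = [
        ("chr1", 100, "+", "ACGTACGT"),
        ("chr1", 100, "+", "ACGTACCT"),  # likely clonal copy carrying a PCR error
        ("chr1", 100, "-", "ACGTACGT"),
        ("chr1", 250, "+", "TTGACCAA"),
    ]
    print(len(collapse_clonal_reads(reads)))
```

The trade-off is that at very deep coverage, independent fragments can share a start position by chance and would be collapsed too.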

Read quality

Error rate (Solexa)

Error rate (454)

Per-read errors (Solexa): distribution of reads according to number of errors

Per-read errors (454)

Applications

Genome resequencing for variation discovery
SNPs
short INDELs
structural variations
the most immediate application area

Genome resequencing for mutational profiling
Organismal reference sequence
likely to change “classical genetics” and mutational analysis

De novo genome sequencing
(Lander et al. Nature 2001)
difficult problem with short reads
promising, especially as reads get longer

Identification of protein-bound DNA
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)
Transcription factor binding sites (Robertson et al. Nature Methods 2007)
DNA methylation (Meissner et al. Nature 2008)
natural applications for next-gen sequencers

Transcriptome sequencing: transcript discovery
(Mortazavi et al. Nature Methods 2008; Ruby et al. Cell 2006)
high-throughput, but short reads pose challenges

Transcriptome sequencing: expression profiling
(Cloonan et al. Nature Methods 2008; Jones-Rhoades et al. PLoS Genetics 2007)
high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

Analysis software

Individual resequencing
[Figure: workflow diagram with REF and IND labels]
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation

The variation discovery “toolbox”
base callers
read mappers
SNP callers
SV callers
assembly viewers

1. Base calling
base sequence
base quality value sequence
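
A base quality value is conventionally reported on the Phred scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slide does not name the scale, so treating these values as Phred-scaled is an assumption; the helper names below are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality value."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```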

Base quality value calibration

Recalibrated base quality values (Illumina)
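
A minimal sketch of what calibration can mean in practice: bases are aligned to a trusted reference, mismatches are counted per reported quality value, and each reported value is replaced by the empirically observed one. The input layout, the add-one smoothing, and the function name are assumptions for illustration, not a description of the recalibration method behind these slides.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality value to an empirical Phred-scaled quality.

    observations: iterable of (reported_quality, is_error) pairs, where is_error
    says whether the base disagreed with a trusted reference (hypothetical input).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in observations:
        totals[q] += 1
        errors[q] += int(is_error)

    table = {}
    for q, n in totals.items():
        # Add-one smoothing keeps the estimated rate away from exactly 0 or 1.
        rate = (errors[q] + 1) / (n + 2)
        table[q] = -10 * math.log10(rate)
    return table


if __name__ == "__main__":
    # Bases reported as Q30 that are actually wrong about 1% of the time.
    obs = [(30, True)] * 9 + [(30, False)] * 989
    print(empirical_qualities(obs))
```

Real recalibration schemes typically condition on more than the reported value (for example position in the read), which this sketch leaves out.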

2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
…and they give you the picture on the box
Problem is, some pieces are easier to place than others…

Strategies to deal with non-unique mapping

Mapping probabilities (qualities)
[Figure: a read with three candidate map positions, with mapping probabilities 0.8, 0.19 and 0.01]
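
One way such probabilities can arise is by normalizing per-position alignment likelihoods so that they sum to one. The sketch below assumes equal prior weight for every candidate position and a Phred-style scaling for the quality; the function names and the cap of 60 are illustrative choices.

```python
import math


def mapping_posteriors(likelihoods):
    """Normalize alignment likelihoods into per-position mapping probabilities."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]


def mapping_quality(posterior, cap=60):
    """Phred-scaled probability that the chosen placement is wrong, capped at `cap`."""
    wrong = 1.0 - posterior
    if wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(wrong)))


if __name__ == "__main__":
    posteriors = mapping_posteriors([80, 19, 1])
    print(posteriors, mapping_quality(max(posteriors)))
```

With the values from the slide, the best placement has probability 0.8, which corresponds to a low mapping quality: the read is placed, but not confidently.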

Paired-end read alignments
Paired-end read alignments help unique read placement
PE sequences are now the “norm” for genome sequencing

Gapped alignments
Gapped alignments allow mapping reads with insertion or deletion errors, and reads with bona fide INDEL alleles
The ability to map reads with INDEL errors also improves the certainty of unique mapping

3. SNP and short-INDEL discovery capillary sequences: either clonal or diploid traces

SNP and short-INDEL discovery (II)
New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection
[Figure labeled INS]

New demands on SNP calling

Rare alleles in 100s / 1,000s of samples

More samples or deeper coverage / sample?

Determining genotype directly from sequence
individual 1: AACGTTAGCATA, AACGTTCGCATA (genotype A/C)
individual 2: AACGTTCGCATA (genotype C/C)
individual 3: AACGTTAGCATA (genotype A/A)
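
Genotype calling from reads can be sketched as a small Bayesian calculation over the three diploid genotypes at a biallelic site. The error model (a miscalled base is equally likely to be any of the other three bases), the flat prior, and the function name are simplifying assumptions for illustration, not the method used by any particular caller.

```python
import math


def genotype_posteriors(bases, error_probs, allele1, allele2, priors=None):
    """Posterior probability of each diploid genotype given the bases at one site.

    bases:       base calls covering the site, one per read
    error_probs: per-base error probabilities (e.g. derived from quality values)
    priors:      prior for (a1/a1, a1/a2, a2/a2); flat if omitted (an assumption)
    """
    genotypes = [(allele1, allele1), (allele1, allele2), (allele2, allele2)]
    if priors is None:
        priors = [1 / 3] * 3

    log_post = []
    for (a, b), prior in zip(genotypes, priors):
        ll = math.log(prior)
        for base, err in zip(bases, error_probs):
            # Each read samples one of the two chromosomes with probability 1/2.
            p_a = 1 - err if base == a else err / 3
            p_b = 1 - err if base == b else err / 3
            ll += math.log(0.5 * p_a + 0.5 * p_b)
        log_post.append(ll)

    # Normalize in log space to avoid underflow at high coverage.
    top = max(log_post)
    weights = [math.exp(x - top) for x in log_post]
    total = sum(weights)
    return {a + "/" + b: w / total for (a, b), w in zip(genotypes, weights)}


if __name__ == "__main__":
    post = genotype_posteriors("AACCAC", [0.01] * 6, "A", "C")
    print(max(post, key=post.get), post)
```

A mix of A and C reads favors the heterozygote A/C, while reads showing only C or only A favor the corresponding homozygote, matching the three individuals on the slide.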

4. Structural variation discovery software
[Screenshot panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region]
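
The fragment-length signal shown in such a viewer can be turned into a simple screen: read pairs whose mapped fragment length is far from the library's typical length are candidate evidence for a structural variant. The median/MAD threshold and the function name below are assumptions chosen for illustration, not a description of the software shown on the slide.

```python
from statistics import median


def flag_discordant(fragment_lengths, n_mads=5.0):
    """Return indices of fragments whose length is far from the library median.

    Uses the median absolute deviation (MAD) as a robust spread estimate.
    """
    med = median(fragment_lengths)
    mad = median([abs(x - med) for x in fragment_lengths]) or 1.0
    return [
        i for i, length in enumerate(fragment_lengths)
        if abs(length - med) > n_mads * mad
    ]


if __name__ == "__main__":
    lengths = [300, 310, 295, 305, 298, 302, 2300]
    print(flag_discordant(lengths))
```

A pair that maps much farther apart than expected is consistent with a deletion in the sample relative to the reference; a cluster of such pairs, together with a drop in depth of coverage, is stronger evidence than any single pair.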

5. Data visualization (assembly viewers)
software development
data validation
hypothesis generation

New analysis tools are needed
Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing)
Analysis pipelines and viewers that focus on the essential results, e.g. the few mutations in a mutant, or a comparison of 1000 genome sequences (but hide most details)
Work-bench style tools to support downstream analysis

Data storage and data standards

What level of data to store?
traces
images
base quality values
base-called reads

Data standards
Different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data)
Even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling)
Requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM)
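
The value of a binary format with random access can be shown with fixed-width records: record i starts at byte i times the record size, so a single record can be fetched with one seek instead of loading the whole file. The record layout below (read id, position, mapping quality) is hypothetical and does not correspond to any of the formats discussed in the talk.

```python
import io
import struct

# Hypothetical fixed-width record: read id (uint32), position (uint32),
# mapping quality (uint8), little-endian with no padding.
RECORD = struct.Struct("<IIB")


def write_records(stream, records):
    """Append each (read_id, position, mapq) tuple as one fixed-width record."""
    for rec in records:
        stream.write(RECORD.pack(*rec))


def read_record(stream, index):
    """Fetch a single record by index without reading the rest of the stream."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a file opened in binary mode
    write_records(buf, [(1, 100, 60), (2, 5000, 37), (3, 123456, 0)])
    print(RECORD.size, read_record(buf, 1))
```

Variable-length data such as read sequences would need a separate index of byte offsets, which is where the "flexible indexing" requirement comes in.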

Data standards (II)
Sequence Read Format, SRF (Asim Siddiqui, UBC): ssrformat@ubc.ca
Assembly format working group: http://assembly.bc.edu
Genotype Likelihood Format (Richard Durbin, Sanger)

Summary

Conclusions: next-gen sequencing software
Next-generation sequencing is a boon for mass-scale human resequencing, whole-genome mutational profiling, expression analysis and epigenetic studies
Informatics tools are already effective for basic applications
There is a need both for “generic” analysis tools (e.g. flexible read aligners) and for specialized tools tailored to specific applications (e.g. expression profiling)
Move toward tools that focus on biological analysis
Most challenges are technical in nature (e.g. data storage, useful data formats, fast read mapping)… many of these will be addressed at this conference

Credits
Michael Stromberg, Chip Stewart, Aaron Quinlan, Michele Busby, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang

Credits
Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Kristen Stoops, Ed Thayer