Next-generation DNA sequencing

Next-generation DNA sequencing. Boston College Biology Department, BI420 Introduction to Bioinformatics, Fall 2012

Traditional DNA sequencing

Genetics of living organisms: chromosomes, DNA

Radioactive label gel sequencing

Four-color capillary sequencing (figure: ABI 3700 four-color sequence trace; figure labels ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb)

Individual human resequencing

Next-generation sequencing

… offer vast throughput … & many applications (plot: bases per machine run, 1 Mb to 1 Tb, against read length, 10 bp to 1,000 bp, positioning ABI / capillary, 454, and Illumina / SOLiD)

Sequencing chemistries: DNA base extension; DNA ligation (Church, 2005)

Template clonal amplification Church, 2005

Massively parallel sequencing Church, 2005

Chemistry of paired-end sequencing
Double-stranded DNA is folded into a bridge shape and then separated into single strands. The end of each strand is then sequenced. (Figure courtesy of Illumina)

Paired-end reads
Fragment amplification: fragment length 100-600 bp; fragment length is limited by amplification efficiency.
Circularization: 500 bp to 10 kb (sweet spot ~3 kb); fragment length is limited by library complexity.
Korbel et al. Science 2007

Features of NGS data
Short sequence reads: 100-200 bp, or 25-35 bp (micro-reads).
Huge amount of sequence per run: up to gigabases per run.
Huge number of reads per run: up to hundreds of millions.
Higher error rate than Sanger sequencing.
Error profile different from that of Sanger sequencing.

T1. Roche / 454 FLX system
Pyrosequencing technology. A very small quantity of one type of dNTP is added at a time, and light is emitted only when that nucleotide is incorporated opposite the template. The procedure is repeated nucleotide by nucleotide, and one checks which addition produces a signal.
Lengths of homopolymer runs (AAA, CCCC, etc.) are quantified by the brightness of the signal. This is the largest source of error.
Variable read length.
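The homopolymer point can be illustrated with a toy flowgram decoder. This is a minimal sketch, not the 454 base caller: the flow order `TACG`, the function name `decode_flowgram`, the example intensities, and the simple rounding rule are all assumptions made for illustration.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Turn per-flow light intensities into a base sequence.

    Each flow adds one nucleotide type. The signal is assumed to be roughly
    proportional to the number of bases incorporated, so it is rounded to the
    nearest integer to estimate the homopolymer length.
    """
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # round half up; signals are non-negative
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical intensities: one T, no A, a CC run, and a GGG run.
clean = decode_flowgram([1.02, 0.05, 1.96, 3.10])
# A slightly weaker signal on the last flow loses one G: an INDEL-type error.
noisy = decode_flowgram([1.02, 0.05, 1.96, 2.45])
```

Because a long run must be told apart from its neighbors by brightness alone, small intensity noise turns into an inserted or deleted base rather than a substitution.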

T2. Illumina / Solexa Genome Analyzer
Fixed-length short-read sequencer.
Read properties are very close to traditional capillary sequences.
Very low INDEL error rate.

T3. AB / SOLiD system
Fixed-length short-read sequencer.
Employs a 2-base encoding system (figure: color matrix indexed by 1st base and 2nd base).
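A 2-base encoding can be sketched in a few lines. The mapping below is the commonly described color-space scheme, in which each color is the XOR of 2-bit codes for adjacent bases; the specific codes (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration, not taken from the slides.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(sequence):
    """Encode each pair of adjacent bases as one color in 0..3."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Recover the bases from a known first base and the list of colors."""
    bases = [first_base]
    for color in colors:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
decoded = decode_colors("A", colors)
# Changing a single color alters every base after it when decoded naively.
corrupted = decode_colors("A", [3, 0, 0, 3])
```

Every base contributes to two adjacent colors, so a real substitution changes two colors while an isolated color change is more likely a measurement error.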

T4. Pacific Biosciences Single Molecule Real Time (SMRT)
DNA polymerase is fixed in place.
The polymerase is altered so that, as bases are added onto the second strand, a base-specific fluorescence signal is emitted.
Single-molecule optical readout is finely controlled using waveguides.
Long read lengths (>1,000 bp).
SMRT Technology overview: http://www.pacificbiosciences.com/aboutus/video-gallery?videoImage=80k_1.jpg

Applications

Application areas
Genome resequencing: variant discovery, somatic mutation detection, mutational profiling.
De novo sequencing.
Identification of protein-bound DNA: chromatin structure, methylation, transcription factor binding sites.
RNA-Seq: expression, transcript discovery.
Mikkelsen et al. Nature 2007; Cloonan et al. Nature Methods, 2008

SNP and short-INDEL discovery

Mutational profiling in deep 454 data
Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production).
One specific mutagenized strain had especially high conversion efficiency; the goal was to determine which mutations caused this phenotype.
We analyzed 10 runs of 454 data (~3 million reads, ~20x coverage of the 15 Mb genome) and found 39 mutations.
Figure: Pichia stipitis reference sequence (image from JGI web site).
Smith et al. Genome Research 2008
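The coverage figure on this slide follows from simple arithmetic: mean coverage is the total number of sequenced bases divided by the genome size. The sketch below checks the slide's numbers; the function names are hypothetical, and the mean read length is not stated on the slide, so it is derived here rather than assumed.

```python
def mean_coverage(n_reads, mean_read_length, genome_size):
    """Average number of times each genome position is sequenced."""
    return n_reads * mean_read_length / genome_size


def implied_read_length(n_reads, coverage, genome_size):
    """Mean read length needed to reach a given coverage."""
    return coverage * genome_size / n_reads


# Slide values: ~3 million reads, ~20x coverage, 15 Mb genome.
read_length = implied_read_length(3_000_000, 20, 15_000_000)  # 100.0 bp
coverage = mean_coverage(3_000_000, read_length, 15_000_000)  # 20.0x
```

The numbers are consistent with an average read of roughly 100 bp.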

Structural variation detection
Structural variations (deletions, insertions, inversions and translocations) are detected from paired-end read map locations.
Copy number (for amplifications and deletions) is estimated from depth of read coverage.
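The paired-end signal can be sketched as a rule that compares where the two ends of a fragment map against what the library should produce. This is an illustrative simplification, not a production caller: the function name, its arguments, and the labels are hypothetical, and the 100-600 bp window is borrowed from the fragment-length range quoted earlier for amplified libraries.

```python
def classify_pair(left_pos, right_pos, same_chromosome=True,
                  expected_orientation=True, min_fragment=100, max_fragment=600):
    """Label one read pair by how its mapping deviates from expectation."""
    if not same_chromosome:
        return "translocation candidate"
    if not expected_orientation:
        return "inversion candidate"
    span = right_pos - left_pos  # distance between the mapped ends on the reference
    if span > max_fragment:
        # Ends map too far apart: the sample lacks sequence present in the reference.
        return "deletion candidate"
    if span < min_fragment:
        # Ends map too close together: the sample carries extra sequence.
        return "insertion candidate"
    return "concordant"


labels = [
    classify_pair(1000, 1300),
    classify_pair(1000, 6000),
    classify_pair(1000, 1040),
]
```

In practice a call is supported by many discordant pairs clustered at the same location, not by a single pair.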

Identification of protein-bound DNA
Figure labels: genome sequence, aligned reads.
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007).
Transcription factor binding sites (Robertson et al. Nature Methods, 2007).

Novel transcript discovery (genes)
Novel exons; novel transcripts containing known exons.
Mortazavi et al. Nature Methods

Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

Expression profiling
Tag counting (e.g. SAGE, CAGE): aligned reads are counted per gene.
Shotgun transcript sequencing.
Jones-Rhoades et al. PLoS Genetics, 2007
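Tag counting reduces to counting how many aligned reads fall within each gene. The sketch below assumes a single chromosome, half-open gene intervals, and reads represented only by their alignment start; the function name and the example coordinates are made up for illustration.

```python
def count_reads_per_gene(read_starts, genes):
    """Count reads whose alignment start lies inside each gene interval.

    genes maps a gene name to a (start, end) pair, end exclusive.
    """
    counts = {name: 0 for name in genes}
    for position in read_starts:
        for name, (start, end) in genes.items():
            if start <= position < end:
                counts[name] += 1
    return counts


genes = {"geneA": (100, 200), "geneB": (500, 800)}
counts = count_reads_per_gene([105, 150, 199, 200, 510, 900], genes)
```

Raw counts depend on gene length and sequencing depth, so they are normally normalized before expression levels are compared.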

De novo genome sequencing
Short reads, read pairs and longer reads are assembled into sequence contigs.
Lander et al. Nature 2001

Technologies / properties / applications
Technologies compared: Roche/454, Illumina/Solexa, AB/SOLiD
Read properties
Read length: 200-450 bp (Roche/454), 75-150 bp (Illumina/Solexa), 25-50 bp (AB/SOLiD)
Error rate: <0.5%, <1.0%
Dominant error type: INDEL, SUB
Quality values available: yes, not really
Paired-end separation: <10 kb, 3 kb optimal (Roche/454); 100-600 bp (Illumina/Solexa); 500 bp to 10 kb, 3 kb optimal (AB/SOLiD)
Applications (rated per technology with ● / ○ on the slide): SNP discovery, short-INDEL discovery, SV discovery, ChIP-seq, small RNA/gene discovery, mRNA transcript discovery, expression profiling, de novo sequencing (?)

The Bioinformatics angle

Trace extraction

Base calling
Machine read-outs are quite different.
Read length, read accuracy, and sequencing error profiles are variable (and change rapidly as machine hardware, chemistry, optics, and noise filtering improve).

Base error rate
The error rate is typically 0.4-1%.
The more errors the aligner allows, the lower the fraction of the reads that can be uniquely aligned.
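The trade-off on this slide can be made concrete by asking how often a read contains more errors than the aligner tolerates. The sketch below uses a binomial model with independent errors at a uniform per-base rate, which is a simplifying assumption (real error rates vary along the read); the function name and the 100 bp read length are hypothetical.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Probability that a read contains k or fewer base errors (binomial model)."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


# For a 100 bp read at a 1% error rate:
error_free = prob_at_most_k_errors(100, 0.01, 0)  # about 0.37
up_to_two = prob_at_most_k_errors(100, 0.01, 2)
```

Under this model only about a third of such reads are error-free, so an aligner that allows no mismatches would discard most of the data, while each additional allowed mismatch makes more genomic locations look like plausible matches.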

Error rate grows with each cycle
This phenomenon limits useful read length. A key challenge in sequencing technology is how to get long reads that remain accurate.

Read mapping
Read mapping is similar to a jigsaw puzzle where they give you the cover on the box.

Some pieces are easier to place than others: pieces that look like each other are hard to place, and pieces with unique features are easy.

Repeats lead to the multiple mapping problem (Lander et al. 2001)

Dealing with multiple mapping

Mapping quality values (figure values for three candidate locations: 0.8, 0.19, 0.01)
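One common way to turn placement probabilities into mapping quality values is the Phred scale, Q = -10 log10(P(placement is wrong)). The sketch below applies it to the three values on this slide, on the assumption that they are the probabilities of a read belonging to each of three candidate locations; the function name and the cap of 60 are illustrative choices.

```python
from math import log10


def mapping_qualities(placement_probs, cap=60):
    """Phred-scaled mapping quality for each candidate placement of a read."""
    total = sum(placement_probs)
    qualities = []
    for prob in placement_probs:
        p_wrong = 1.0 - prob / total  # probability this placement is not the true one
        if p_wrong <= 0:
            qualities.append(cap)  # a unique placement gets the maximum value
        else:
            qualities.append(min(cap, round(-10 * log10(p_wrong))))
    return qualities


qualities = mapping_qualities([0.8, 0.19, 0.01])
```

Even the best of the three placements scores only 7 here, which shows how quickly competing locations in repeats erode confidence in an alignment.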

Paired-end (PE) reads
Fragment length: 100-600 bp
Fragment length: 1-10 kb
PE reads are now the standard for whole-genome short-read sequencing.
Korbel et al. Science 2007

Gapped alignments (for INDELs)
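In the SAM alignment format, a gapped alignment is recorded as a CIGAR string, where `I` marks bases inserted in the read and `D` marks reference bases deleted from it. The sketch below pulls INDEL events out of a CIGAR string; the function name, the example string, and the convention of reporting an insertion at the next reference position are assumptions for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, ref_start):
    """Return (kind, reference_position, length) for each I or D operation."""
    events = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I":
            # Inserted bases sit just before the current reference position.
            events.append(("insertion", ref_pos, length))
        elif op == "D":
            events.append(("deletion", ref_pos, length))
        if op in "MDN=X":  # only these operations consume reference bases
            ref_pos += length
    return events


events = indels_from_cigar("20M2D15M1I10M", 100)
```

An ungapped aligner would have to report the same read as a run of mismatches after the first INDEL, which is why gapped alignment matters for short-INDEL discovery.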

Read mapping programs
Many mappers are available. Points of comparison:
Handling of read pairs
Handling of non-unique mapping
Speed and accuracy
Flexibility vis-à-vis sequencing technologies
Stability and support

Data storage requirements

Duplicate reads

Local misalignments

Base quality value recalibration

Multiple read types: ABI/capillary, 454/FLX, 454/GS20, Illumina

Alignment visualization
Integrating genomic context (e.g. gene annotations).
Too much data: indexed browsing.
Too much detail: color coding, show/hide.
Structural variant visualization is more difficult.

Standard data formats: SRF/FASTQ, GVF/VCF, SAM/BAM

Standard data formats
Reads: FASTQ
Alignments: SAM/BAM
Variants: VCF
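As a concrete look at the read format, a FASTQ record is four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The parser below is a minimal sketch that assumes strict four-line records and Phred+33 quality encoding (some older Illumina files used an offset of 64); the function name and the example record are hypothetical.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality_text = handle.readline().rstrip("\n")
        qualities = [ord(char) - 33 for char in quality_text]  # Phred+33 decoding
        yield header[1:], sequence, qualities


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
```

Each quality value Q encodes an estimated base error probability of 10^(-Q/10), so `I` (Q40) marks a far more reliable base call than `5` (Q20).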