Next-generation DNA sequencing technologies – theory & practice
Outline
- Next-generation sequencing (NGS) technologies – overview
- NGS targeted re-sequencing – fishing out the regions of interest
- NGS workflow: data collection and processing – the exome sequencing pipeline
PART I: NGS technologies
Next-generation sequencing (NGS) technologies – overview
DNA Sequencing – the next generation
The automated Sanger method is considered a 'first-generation' technology, and newer methods are referred to as next-generation sequencing (NGS).
Landmarks in DNA sequencing
- 1953: Discovery of the DNA double helix structure
- 1977: A. Maxam and W. Gilbert, "DNA seq by chemical degradation"; F. Sanger, "DNA sequencing with chain-terminating inhibitors"
- 1984: DNA sequence of the Epstein-Barr virus, 170 kb
- 1987: Applied Biosystems, first automated sequencer
- 1991: Sequencing of human genome in Venter's lab
- 1996: P. Nyrén and M. Ronaghi, pyrosequencing
- 2001: A draft sequence of the human genome
- 2003: Human genome completed
- 2004: 454 Life Sciences markets first NGS machine
2005
DNA Sequencing – the next generation
- Random genome sequencing: 25 Mb, 300k reads, 110 bp
- Sanger sequencing: targeted, 700-1000 bp
DNA Sequencing – the next generation
The newer technologies constitute various strategies that rely on a combination of:
- Library/template preparation
- Sequencing and imaging
DNA Sequencing – the next generation
Commercially available technologies:
- Roche – 454: GS FLX Titanium, Junior
- Illumina: HiSeq 2000, MiSeq
- Life: SOLiD 5500xl, Ion Torrent
- Helicos BioSciences – HeliScope
- Pacific Biosciences – PacBio RS
Template preparation: STEP1 Produce a non-biased source of nucleic acid material from the genome
Template preparation
Produce a non-biased source of nucleic acid material from the genome. Current methods:
- Randomly break genomic DNA into smaller fragments
- Ligate adaptors
- Attach or immobilize the template to a solid surface or support
- The spatially separated template sites allow thousands to billions of sequencing reactions to be performed simultaneously
Template preparation
- Clonal amplification: Roche – 454, Illumina – HiSeq, Life – SOLiD
- Single molecule sequencing: Helicos BioSciences – HeliScope, Pacific Biosciences – PacBio RS
Template preparation: Clonal amplification
- In solution – emulsion PCR (emPCR): Roche – 454, Life – SOLiD
- Solid phase – bridge PCR: Illumina – HiSeq
Template preparation: Clonal amplification - emPCR
Sequencing SOLiD 454
Pyrosequencing Picotitre plate Pyrosequencing
Pyrosequencing
Sequencing by ligation
Template preparation: Clonal amplification – Bridge PCR
Template preparation: Single molecule templates
HeliScope, PacBio
HiSeq Heliscope
DNA Sequencing – the next generation
- The major advance offered by NGS is the ability to cheaply produce an enormous volume of data.
- The arrival of NGS technologies in the marketplace has changed the way we think about scientific approaches in basic, applied and clinical research.
PART II: NGS targeted re-sequencing – fishing out the regions of interest
The beginning
- Random genome sequencing: ???
- Sanger sequencing: targeted, 700-1000 bp
DNA Sequencing – the next generation
- Library/template preparation
- Library enrichment for target
- Sequencing and imaging
Target enrichment strategies
- Random genome sequencing
- Hybrid capture
- PCR based
- Sanger sequencing
Target enrichment strategies
Target enrichment strategies: MIP
Hybrid Capture
- In solution: Agilent, Nimblegen, ...
- Solid phase: Febit
Hybrid Capture
- In solution: relatively cheap; high throughput is possible; small amounts of DNA sufficient
- Solid phase: straightforward method; flexible; higher amounts of DNA needed
Target enrichment strategies
PCR based approaches
- Uniplex
- Multiplex
- Fluidigm
- Raindance
- Multiplicon
- Long-range PCR products
PCR based approaches: Raindance
PCR based approaches: Fluidigm 48.48 Access Array
Target enrichment strategies
PART III: NGS workflow: data collection and processing – the exome sequencing pipeline
Whole Exome Sequencing
The human genome:
- Genome = 3 Gb
- Exome = 30 Mb
- 180 000 exons
- Protein coding genes constitute only approximately 1% of the human genome
- It is estimated that 85% of the mutations with large effects on disease-related traits can be found in exons or splice sites
Exome sequencing
gDNA (3 Gb) -> exome (38 Mb) -> NGS
The past, present & future
Exome sequencing capacity
HiSeq specifications:
- 2 flow cells
- 16 lanes (8 per flow cell)
- 200-300 Gbases per flow cell
- 10 days for a single run
Exome throughput:
- 96 @ 60x coverage per run
- 3000 @ 60x coverage per year
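The throughput figures above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the 38 Mb target size is taken from the exome sequencing slide, while the on-target fraction of 0.5 (the share of sequenced bases that actually land on the exome) and the function names are assumptions, not values from the slides.

```python
import math


def exomes_per_run(flow_cells, gb_per_flow_cell, target_mb, coverage, on_target_fraction):
    """Estimate how many exomes fit in one run at the requested mean coverage."""
    total_bases = flow_cells * gb_per_flow_cell * 1e9
    # Raw bases needed per exome: target size x coverage, inflated by off-target reads.
    bases_per_exome = target_mb * 1e6 * coverage / on_target_fraction
    return math.floor(total_bases / bases_per_exome)


def runs_per_year(days_per_run, days_per_year=365):
    """Upper bound on back-to-back runs per year (no downtime assumed)."""
    return days_per_year // days_per_run


# Assumed on-target fraction of 0.5; 200-300 Gbases per flow cell as on the slide.
low = exomes_per_run(2, 200, 38, 60, 0.5)
high = exomes_per_run(2, 300, 38, 60, 0.5)
print(low, high, runs_per_year(10) * 96)
```

Under these assumptions the estimate brackets the 96 exomes per run quoted above, and 96 exomes times the maximum number of 10-day runs gives an upper bound slightly above the quoted 3000 per year, which would be consistent with some instrument downtime.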
Data processing workflow
- Data formatting & QC
- Mapping & QC
- Variant calling
- Variant annotation
- Variant filtering/comparison
Data processing
- DATA GENERATION
- DATA PROCESSING
- DATA STORAGE
- INTERPRETATION RESULTS
- REPORTING & VALIDATION
DATA GENERATION
- Prepare sample library
- Perform exome capture
- Perform sequencing
DATA GENERATION, DATA PROCESSING, DATA STORAGE
- Data processing: image processing, base calling
- Data storage: sequence data (10-15 Gb / exome)
NGS data processing: overview
1. Mapping
2. Duplicate marking
3. Local realignment
4. Base quality recalibration
5. Analysis-ready mapped reads
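To make the duplicate marking step concrete, here is a toy sketch. It assumes single-end reads that are already mapped, and it treats reads sharing chromosome, start position and strand as copies of one original fragment, keeping the copy with the highest summed base quality. The `Read` class and `mark_duplicates` function are hypothetical names for illustration; production tools take more into account (read pairs, clipping, library information).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    pos: int        # leftmost mapped position
    strand: str     # "+" or "-"
    qual_sum: int   # sum of base qualities
    is_duplicate: bool = False


def mark_duplicates(reads):
    """Flag all but the best-quality read at each (chrom, pos, strand)."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)
    for group in groups.values():
        # On ties, max() keeps the first read encountered.
        best = max(group, key=lambda r: r.qual_sum)
        for read in group:
            read.is_duplicate = read is not best
    return reads


reads = mark_duplicates([
    Read("r1", "chr1", 100, "+", 300),
    Read("r2", "chr1", 100, "+", 350),
    Read("r3", "chr1", 100, "-", 200),
    Read("r4", "chr2", 100, "+", 100),
])
print([r.name for r in reads if r.is_duplicate])
```

Reads are flagged rather than deleted, so later steps can decide whether to ignore them.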
DATA GENERATION, DATA PROCESSING, DATA STORAGE
- Data processing: image processing, base calling, QC sequencing, mapping sequences, QC capture experiment
- Data storage: sequence data (10-15 Gb / exome)
DATA PROCESSING: QC NGS, mapping, QC HC
DATA GENERATION, DATA PROCESSING, DATA STORAGE
- Data processing: image processing, base calling, QC sequencing, mapping sequences, QC capture experiment, variant calling, variant annotation
- Data storage: sequence data (10-15 Gb / exome), mapping results (5 Gb / exome)
DATA GENERATION, DATA PROCESSING, DATA STORAGE
- Data processing: image processing, base calling, QC sequencing, mapping sequences, QC capture experiment, variant calling, variant annotation
- Data storage: sequence data (10-15 Gb / exome), mapping results (5 Gb / exome), variant calls (100 Mb / exome)
SNPs vs Indels
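The SNP/indel distinction can be expressed as a simple comparison of allele lengths. The sketch below assumes VCF-style reference and alternate allele strings; `classify_variant` is a hypothetical helper, and the "MNP" label for same-length multi-base substitutions is a common convention rather than something stated on the slide.

```python
def classify_variant(ref, alt):
    """Classify a variant from its reference and alternate allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNP"  # same length, more than one base substituted


for ref, alt in [("A", "G"), ("A", "AT"), ("ATG", "A"), ("AT", "GC")]:
    print(ref, alt, classify_variant(ref, alt))
```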
exonic vs non-exonic
Exonic
DATA GENERATION, DATA PROCESSING, DATA STORAGE
- Data processing: image processing, base calling, QC sequencing, mapping sequences, QC capture experiment, variant calling, variant annotation, variant filtering
- Data storage: sequence data (10-15 Gb / exome), mapping results (5 Gb / exome), variant calls (100 Mb / exome), database of known variants (public & private)
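Variant filtering against a database of known variants might look like the sketch below. It is a minimal illustration under stated assumptions: variants are keyed by (chromosome, position, ref, alt), the known-variant database is a plain dictionary mapping each key to a population allele frequency, and the 1% frequency cut-off is an arbitrary example value, not one given in the slides.

```python
def filter_variants(calls, known_freq, max_freq=0.01):
    """Keep variants that are absent from the database or rare in it."""
    kept = []
    for variant in calls:
        # Variants not in the database are treated as frequency 0 (novel).
        freq = known_freq.get(variant, 0.0)
        if freq <= max_freq:
            kept.append(variant)
    return kept


# Hypothetical example data.
calls = [
    ("chr1", 1000, "A", "G"),   # common known variant
    ("chr1", 2000, "C", "T"),   # rare known variant
    ("chr2", 3000, "G", "GA"),  # not in the database
]
known_freq = {
    ("chr1", 1000, "A", "G"): 0.25,
    ("chr1", 2000, "C", "T"): 0.002,
}
print(filter_variants(calls, known_freq))
```

The same function works for public and private databases alike, as long as both are loaded into the same key format.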
DATA GENERATION, DATA PROCESSING, DATA STORAGE, INTERPRETATION RESULTS, REPORTING & VALIDATION
- Data processing: image processing, base calling, QC sequencing, mapping sequences, QC capture experiment, variant calling, variant annotation, variant filtering
- Data storage: sequence data (10-15 Gb / exome), mapping results (5 Gb / exome), variant calls (100 Mb / exome), database of known variants (public & private)
- Interpretation results: validated variants in candidate genes
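The per-exome storage figures in the workflow add up quickly. The sketch below totals them and scales to the 3000 exomes per year quoted in the capacity slide. It assumes that "Gb" and "Mb" on the slides mean gigabytes and megabytes, that 1 GB = 1000 MB, and that all three data types are kept for every exome; the function names are illustrative.

```python
def storage_per_exome_gb(sequence_gb, mapping_gb=5.0, variants_gb=0.1):
    """Total storage per exome: sequence data + mapping results + variant calls."""
    return sequence_gb + mapping_gb + variants_gb


def storage_per_year_tb(exomes_per_year, per_exome_gb):
    """Yearly storage in terabytes (1 TB = 1000 GB assumed)."""
    return exomes_per_year * per_exome_gb / 1000


for sequence_gb in (10.0, 15.0):
    per_exome = storage_per_exome_gb(sequence_gb)
    per_year = storage_per_year_tb(3000, per_exome)
    print(f"{sequence_gb:.0f} GB raw: {per_exome:.1f} GB per exome, {per_year:.1f} TB per year")
```

Under these assumptions each exome needs roughly 15 to 20 GB, or on the order of 45 to 60 TB per year at full capacity.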