Next Generation Sequencing: The past, present, and future of DNA sequencing*
*DNA sequencing: determining the number and order of nucleotides that make up a given molecule of DNA.
Alex V. Postma, PhD, Department of Anatomy, Embryology & Physiology, Academic Medical Center
(Relevant) Trivia How many base pairs (bp) are there in a human genome? How much did it cost to sequence the first human genome? How long did it take to sequence the first human genome? When was the first human genome sequence complete? Whose genome was it?
(Relevant) Trivia. How many base pairs (bp) are there in a human genome? ~3 billion (haploid). How much did it cost to sequence the first human genome? ~$2.7 billion. How long did it take to sequence the first human genome? ~13 years. When was the first human genome sequence complete? 2000-2003.
Genome Sequencing. Goal: figuring out the order of nucleotides across a genome. Problem: current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 kbp). Solution: sequence the short stretches and then use computers to assemble the small pieces.
Genome Sequencing [Figure: the genome is broken into short fragments of DNA, which are read as short DNA sequences (e.g. ACGTGGTAA, CGTATACAC, TAGGCCATA, GTAATGGCG, CACCCTTAG, TGGCGTATA) and assembled by overlap into the sequenced genome (ACGTGGTAATGGCGTATACACCCTTAGGCCATA...).]
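The "sequence, then assemble" idea can be sketched with a toy greedy overlap assembler. This is only an illustration, not the method used by any real assembler: the function names and the minimum overlap of 3 bases are assumptions, and the six reads are the short example reads shown on the slide. Real genomes have repeats and sequencing errors that this sketch ignores.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap (toy example)."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Example reads from the slide
reads = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA", "GTAATGGCG", "CACCCTTAG", "TGGCGTATA"]
print(greedy_assemble(reads))  # ['ACGTGGTAATGGCGTATACACCCTTAGGCCATA']
```

The greedy choice is the simplest possible strategy; it works here because each read overlaps exactly one neighbour by at least 3 bases.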
Sanger Sequencing. Mix DNA with dNTPs and ddNTPs. Amplify. Run in gel. Fragments migrate a distance that depends on their size: shorter fragments travel farther.
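How the gel yields a sequence can be shown with a small simulation. This is a simplified sketch under stated assumptions: the function names are hypothetical, the newly synthesized strand is taken to be the given string itself, and every possible termination position is assumed to be represented among the fragments.

```python
def sanger_fragments(strand):
    """For each ddNTP lane, list the lengths of the chain-terminated fragments.

    A fragment of length n ends in the lane of base X when a ddNTP for X
    was incorporated at position n of the synthesized strand.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(strand, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering all bands from shortest to longest fragment."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = sanger_fragments("ACGTGGTAA")
print(lanes["G"])       # fragment lengths ending in G: [3, 5, 6]
print(read_gel(lanes))  # ACGTGGTAA
```

Because the shortest fragments travel farthest, reading the bands from the far end of the gel back toward the wells gives the bases in order.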
Sanger Sequencing
Sanger Sequencing. Advantages: long reads (~900 bp); suitable for small projects. Disadvantages: low throughput; expensive.
Sanger Sequencing timeline (1980-2007). 1982: lambda virus, DNA stretches up to 30-40 kbp (Sanger et al.). 1994: H. influenzae, 1.8 Mbp (Fleischmann et al.). 2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.). 2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.).
Next Generation Sequencing: Why Now? Motivation: HGP and its derivatives, personalized medicine. Short-read applications: (re-)sequencing, other methods (e.g. gene expression). Advancements in technology.
NGS is a general term referring to all post-Sanger sequencing technologies that enable massive sequencing at low cost. NGS may be further divided into polony-based technologies, which require amplification of the DNA prior to sequencing, and single-molecule sequencing, which does not. The motivation for new technologies derives not only from potential commercial uses such as personalised medicine, but also from government-supported projects such as the HGP and the 1000 Genomes Project, which aims to sequence the genomes of 1000 individuals around the world, with the price tag for sequencing a single genome set at $50,000. Beyond de novo sequencing, potential applications include re-sequencing and gene expression analysis; both can make use of the short reads offered by all current technologies. So despite the read-length barrier of the new technologies, the sequencers still became commercial. And of course, advancements in chemistry, microscopy and other related technologies enabled the new sequencing technologies.
High Parallelism is Achieved in Polony Sequencing (Sanger vs. polony)
Polony sequencing refers to all commercial technologies except for Helicos. Polony sequencing takes place on an array of polonies, in which all amplicons of the same DNA fragment are clustered together in the same region of the array. These groups of amplicons were termed polonies, short for polymerase colonies. The degree of parallelism that can be achieved through Sanger sequencing is only a fraction of what can be achieved in polony sequencing.
Generation of Polony array: DNA Beads (454, SOLiD). DNA beads are generated using emulsion PCR.
The polony array is generated as follows. The process begins by mixing the DNA fragments, ligated to adapters, with beads, PCR components and primers in water. The components are then mixed with oil to create "microreactors": droplets of water containing all the components necessary for PCR. Next, PCR is performed, with the new copies in each microreactor being attached to the bead. Finally, the emulsion and the empty beads are removed, leaving only the DNA-containing beads.
Generation of Polony array: DNA Beads (454, SOLiD). DNA beads are placed in wells.
The beads are loaded onto an array containing picoliter-scale wells. The DNA beads are placed into the wells together with smaller beads carrying the enzymes required for the reactions.
Generation of Polony array: Bridge-PCR (Solexa). DNA fragments are attached to the array and used as PCR templates. Steps: create a DNA library; place it on the array; perform bridge-PCR (primers are attached to the array). Result: ~1M colonies with ~1K sequences in each.
Single Molecule Sequencing: HeliScope. Direct sequencing of DNA molecules: no amplification stage. DNA fragments are attached to an array, as in Illumina. Potential benefits: higher throughput, fewer errors. Sequencing is asynchronous, using a highly sensitive fluorescence detection system. Based on work from Stephen Quake's group (Stanford).
In work published by Quake's lab, a human genome was fully sequenced at a cost of $40K.
Genome Sequencer 20 (454); Genome Analyzer (Solexa); Ion Torrent; MinION
Technology Summary

Platform    Read length   Sequencing technology   Throughput (per run)   Cost (1 Mbp)*
Sanger      ~800 bp       -                       400 kbp                $500
454         ~400 bp       Polony                  500 Mbp                $60
Solexa      75 bp         Polony                  20 Gbp                 $2
SOLiD       -             Polony                  60 Gbp                 -
Helicos     30-35 bp      Single molecule         25 Gbp                 $1

*Source: Shendure & Ji, Nat Biotech, 2008

Instrument cost should be taken into account: the instrument cost of 454, Solexa and ABI is ~40% of that of the HeliScope.
454 Life Sciences: FLX Titanium series. Run = 10 hours; a cluster of computers is required (only a single processor for the standard FLX). http://www.454.com/products-solutions/system-features.asp#titanium
ABI SOLiD 3: http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/overviewofsolidsystem/index.htm
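To make the throughput and cost columns concrete, the arithmetic below estimates how many runs and how much reagent cost a ~3 Gbp human genome would take on each platform. It is a rough sketch: the numbers come from the table above, the 30x coverage in the example is an assumption (the slides do not state a coverage), SOLiD is left out because its cost per Mbp is not recoverable from the table, and instrument cost is ignored.

```python
GENOME_BP = 3_000_000_000  # ~3 billion bp (haploid human genome)

# platform: (throughput per run in bp, cost per Mbp in USD), from the table
PLATFORMS = {
    "Sanger": (400_000, 500),
    "454": (500_000_000, 60),
    "Solexa": (20_000_000_000, 2),
    "Helicos": (25_000_000_000, 1),
}


def plan(coverage):
    """Return {platform: (runs needed, sequencing cost in USD)} for a given fold coverage."""
    total_bp = GENOME_BP * coverage
    result = {}
    for name, (bp_per_run, usd_per_mbp) in PLATFORMS.items():
        runs = -(-total_bp // bp_per_run)            # ceiling division
        cost = total_bp // 1_000_000 * usd_per_mbp   # Mbp sequenced times cost per Mbp
        result[name] = (runs, cost)
    return result


for name, (runs, cost) in plan(30).items():  # 30x coverage is an assumed example value
    print(f"{name:8s} runs: {runs:>7,d}   cost: ${cost:>12,d}")
```

At 1x coverage the same function gives 7,500 Sanger runs and $1.5 million, which shows why Sanger sequencing does not scale to whole genomes.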
Comparing Different Technologies: Sanger Sequencing. Advantages: lowest error rate; long read length (~750 bp); can target a primer. Disadvantages: high cost per base; long time to generate data; need for cloning; amount of data per run.
Comparing Different Technologies: 454 Sequencing. Advantages: low error rate; medium read length (~400-600 bp). Disadvantages: relatively high cost per base; must run at large scale; medium/high startup costs.
Comparing Different Technologies: Ion Torrent Sequencing. Advantages: low startup costs; scalable (10-1000 Mb of data per run); medium/low cost per base; low error rate; fast runs (<3 hours). Disadvantages: new, developing technology; cost not as low as Illumina; read lengths only ~100-200 bp so far.
Comparing Different Technologies: Illumina Sequencing. Advantages: low error rate; lowest cost per base; tons of data. Disadvantages: must run at very large scale; short read length (50-75 bp); runs take multiple days; high startup costs; de novo assembly difficult.
Comparing Different Technologies: PacBio Sequencing. Advantages: can use a single molecule as template; potential for very long reads (several kb+). Disadvantages: high error rate (~10-15%); medium/high cost per base; high startup costs.
NGS Platforms Overview. Platforms differ in design and chemistries. Fundamentally related: sequencing of thousands to millions of clonally amplified molecules in a massively parallel manner. Orders of magnitude more information; will continue to evolve. Attractive for clinical applications: individual sequencing assays are costly and laborious (serial "gene by gene" analysis). Companies: Pacific Biosciences; Helicos Biosciences; NABsys; VisiGen Biotechnologies; Complete Genomics; Oxford Nanopore Technologies.
What, When and Why. Sanger: small projects (less than 1 Mbp). 454: de novo sequencing, metagenomics. Solexa, SOLiD, HeliScope: gene expression, protein-DNA interactions, resequencing.
Sequencing the Human Genome [Figure: log10(price) versus year, 2000-2010.] 2001: Human Genome Project, $2.7G, 11 years. 2001: Celera, $100M, 3 years. 2007: 454, $1M, 3 months. 2008: ABI SOLiD, $60K, 2 weeks. 2009: Illumina, Helicos, $40-50K. 2010: $5K, a few days? 2012: $100, <24 hrs?
I would like to begin with an overview of the history of human genome sequencing. Despite significant improvements … it was clear that Sanger sequencing would not make massive DNA sequencing at a low cost and high speed feasible. Several technologies were developed at the time, of which the 454 Life Sciences sequencer was the first to become commercial, in 2005. Two years later it was used for … Whether …, but the direction is clear: in a few years from now very fast and cheap sequencing technologies will be available for commercial and research purposes.
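The size of the price drop on this slide can be worked out directly. The sketch below uses only the prices listed above and reports log10(price), the fold reduction, and an average halving time. The helper names are hypothetical, and the halving time assumes a smooth exponential decline between two points, which is a simplification.

```python
import math

# (year, project, price in USD) as listed on the slide
milestones = [
    (2001, "Human Genome Project", 2.7e9),
    (2001, "Celera", 100e6),
    (2007, "454", 1e6),
    (2008, "ABI SOLiD", 60e3),
]

for year, project, usd in milestones:
    print(f"{year}  {project:22s} log10(price) = {math.log10(usd):.2f}")


def fold_drop(first_usd, last_usd):
    """How many times cheaper the later price is."""
    return first_usd / last_usd


def halving_time_years(year0, usd0, year1, usd1):
    """Average time for the price to halve, assuming exponential decline."""
    return (year1 - year0) * math.log(2) / math.log(usd0 / usd1)


print(fold_drop(2.7e9, 60e3))                       # 45000.0
print(halving_time_years(2001, 2.7e9, 2008, 60e3))  # roughly 0.45 years
```

A 45,000-fold drop in seven years means the price halved about every five and a half months on average over that period.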
Sequencing costs have fallen
Next Generation Sequencing Applications: mutation detection; foreign DNA detection; non-invasive diagnosis of aneuploidy; population characterization; cancer genetics; ancient DNA (Neanderthal); expression analysis; transcription factor binding; chromosomal interaction; etc.
Cell-free fetal DNA. In this work the authors were able to detect abnormalities in the number of chromosomes using massive sequencing of plasma extracted from a blood sample collected from the mother. Glossary: chromosomal aneuploidy: an abnormal number of chromosomes; amniocentesis: sampling of amniotic fluid; chorionic villus sampling: sampling of placental villi.
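One common way to turn read counts from maternal plasma into an aneuploidy call is to compare the fraction of reads mapping to a chromosome against a set of euploid reference samples using a z-score. The slide does not say which statistic the authors used, so the sketch below is an assumption, and all numbers, names and the cutoff of 3 are hypothetical illustration values.

```python
from statistics import mean, stdev


def chromosome_fraction(read_counts, chromosome):
    """Fraction of all mapped reads that fall on the given chromosome."""
    return read_counts[chromosome] / sum(read_counts.values())


def z_score(sample_fraction, reference_fractions):
    """How many standard deviations the sample lies from the euploid reference mean."""
    return (sample_fraction - mean(reference_fractions)) / stdev(reference_fractions)


# Hypothetical chr21 read fractions from euploid reference pregnancies
reference = [0.0130, 0.0131, 0.0129, 0.0130, 0.0130]

# Hypothetical test sample: 1,000,000 mapped reads, 13,500 of them on chr21
sample_counts = {"chr21": 13_500, "other": 986_500}

z = z_score(chromosome_fraction(sample_counts, "chr21"), reference)
print(f"z = {z:.2f}")
print("chr21 over-represented" if z > 3 else "within reference range")  # cutoff 3 is an assumption
```

The excess is small because only part of the cell-free DNA in maternal plasma is fetal, which is why very large numbers of reads are needed to detect it.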
Exome Sequencing Identifies a Tibetan Adaptation (Yi et al., Science 2010)
The widespread mutation in Tibetans is near a gene called EPAS1, a so-called "super athlete gene" identified several years ago and so named because some variants of the gene are associated with improved athletic performance. The gene codes for a protein involved in sensing oxygen levels and perhaps in balancing aerobic and anaerobic metabolism.
Ancient Genomes Resurrected. Degraded state of the sample. mtDNA sequencing. Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp). Problems: contamination with modern human DNA and co-isolation of bacterial DNA.
NGS Application Examples: Inherited Conditions. Discovery tool: single-gene disorders, e.g. AD (autosomal dominant) Kabuki syndrome (MLL2). Causative mutations for multigenic diseases: superior to the "one by one" approach of traditional sequencing. Diagnostic advancements for diseases with overlapping symptoms and multiple possible syndromes/genes.
Variant detection through next generation sequencing Meyerson et al. NRG 2010
Inherited Conditions: Challenges and Opportunities. Challenges. Example, monogenic disorders: novel missense mutations; structural aberrations; germline mosaicism; imprinting effects; epigenetic factors. Example, multifactorial disease: risk loci are more often in non-coding or intergenic regions; pathogenicity of variants is often unclear; less testing vs. monogenic disease. Opportunities: reference human genome cataloguing of variants = more test offerings.
Sequencing of a Single Individual with Family Data Lupski et al. NEJM 2010
The First 8 Human Genomes
SNP Distribution in Proband
Nonsynonymous SNPs in Known Disease Genes
NGS Application Examples: Neoplastic Conditions. Cancer susceptibility genes; risk assessment; risk management; tumor sub-typing; micro-RNAs; prognosis; alterations in gene expression; molecular profiling; patient stratification; predictions of therapeutic response; personalized treatment; therapeutic monitoring; somatic/driver mutations; methylation; epigenetic changes.
Exome Sequencing in Prostate Cancer Barbieri et al. Nature Genetics 2012
Nonsynonymous Somatic Mutations in Neuroblastoma Molenaar et al. Nature 2012
Mutation count associated with age, stage, and survival Molenaar et al. Nature 2012
Next Generation Sequencing. NGS diagnostics have shifted towards data analysis rather than the technical component. NGS infrastructures must include appropriate expertise and computational hardware. Unprecedented amounts of medical data and various processing algorithms necessitate adequate tools for: data management (alignment and assembly); QC of the image processing, base calling, filtering, alignment and SNP finding/application steps; archiving.
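As a small illustration of the base-calling QC and filtering steps, the sketch below decodes per-base quality scores and drops low-quality reads. It assumes qualities are stored as Phred scores in the common ASCII offset-33 encoding used by FASTQ files; the slides do not name a file format, and the function names and the mean-quality cutoff of 20 are hypothetical choices.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error), so P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_filter(quality_string, min_mean_q=20):
    """Keep a read only if its mean Phred score reaches the cutoff (cutoff is an assumption)."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_q


print(phred_scores("II5!"))    # [40, 40, 20, 0]
print(error_probability(20))   # about 0.01, i.e. 1 wrong call in 100
print(passes_filter("IIII"))   # True
print(passes_filter("!!!!"))   # False
```

Real pipelines apply further checks (trimming, duplicate removal, alignment quality), but each follows the same pattern of computing a metric and filtering on a threshold.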
Considerations. Evaluation of the variant positions “called” involves queries of all known relevant databases. The lack of databases curated to accepted clinical standards is likely the most significant challenge in managing and reporting genome sequencing data. EHR considerations: test ordering, archiving of NGS reports, patient consent, data (reinterpretation?).
NGS: Post-Analytical Considerations. Expert interpretation and guidance: correlation with age, gender, clinical presentation and family history. A team approach is ideal: pathologists, geneticists, other providers. Proficiency testing and alternative assessment are challenging. Proficiency testing schemes based on NGS methods rather than specific genes are likely.
Professional Considerations: Reimbursement and Gene Patents. Challenging reimbursement issues. Genome sequencing may potentially involve numerous patented gene sequences. Development of an affordable system of common access to genes? What about mutations in known disease genes that are not evident from the patient's phenotype?