George Church Thu 27-Apr-2006 9:30-11 Broad-MPG Thanks to: New Sequencing Technologies & Diploid Personal Genomes NHGRI Seq Tech 2004: Agencourt, 454,


New Sequencing Technologies & Diploid Personal Genomes
George Church, Thu 27-Apr-2006 9:30-11, Broad-MPG
Thanks to: NHGRI Seq Tech 2004: Agencourt, 454, Microchip; 2005: Nanofluidics, Network, VisiGen; Affymetrix, Helicos, Solexa-Lynx

‘Next Generation’ Technology Development
Company | Our role
Multi-molecule
Affymetrix | Software
Gorfinkel | Polony to Capillary
454 LifeSci | Paired ends, emulsion
Lynx/Solexa | Multiplexing & polony
Agencourt | Seq by Ligation (SbL)
Single molecules
Helicos Biosci | SAB, cleavable fluors
Pacific Biosci | -
Agilent | Nanopores
Visigen Biotech | -
Complete Genomics | SbL

Sequencing components
1. Applications & goals
2. Cost, accuracy, continuity goals
3. Source, consent, ELSI
4. Sample prep
5. Technology development, deployment, scaling
6. Software: data acquisition to interpretation
7. Human interface, education

Sequencing applications
1. Environment (genetic): maternal, allergens, microbes
2. Small mutations: whole genome vs targeted
3. DNA copy number & rearrangements (paired ends)
4. Exons conserved &/or mutable regions
5. Haplotype: LD &/or causative combinations in cis
6. RNA Digital Analysis of Gene Expression (by counting)
7. RNA splicing (that arrays can’t handle)
8. Proteomics: MS, Ab, aptamers
9. Metabolomics: MS, Ab, aptamers
10. Microbial evolution resequencing (needs consensus accuracy)
11. Cancer resequencing
12. Gene synthesis by sequencing (needs raw accuracy)
13. DNA methylation

Why single chromosome sequencing? (or single cell or single particle?)
(1) When we only have one cell, as in Preimplantation Genetic Diagnosis (PGD) or environmental samples
(2) Sequence relations >100 kbp (haplotypes)
(3) Prioritizing or pooling (rare) species based on an initial DNA screen
(4) Anything relating 2 or more chromosomes (in a cell or virus)
(5) Cell-cell interactions (e.g. predator-prey, symbionts, commensals, parasites, etc.)

Sequencing/genotyping on single human chromosomes
Method #1: ‘in situ’ haplotyping
Zhang et al. Nature Genet. Mar 2006
[Figure label: 153 Mbp]

Sequencing/genotyping on single human chromosomes
Method #2: Chromosome dilution library
QC: Reverse-FISH of amplicons
[Figure: Reverse-FISH images; panels: Amplicon 19, Amplicon 6q]

Single chromosome molecule sequencing
How?
- Isothermal Strand Displacement Amplification from a single chromosome (Ploning)
- Shotgun sequencing on the amplicon
Challenges
- Non-specific amplification competes with a single template molecule
- Amplicons have high-order DNA structures, which create issues in sequencing library construction

Single cell chromosome molecule sequencing
Reduce chimeras when cloning from SDA Plones
- Phi-29 debranching
- S1 nuclease digestion
- DNA pol I nick translation
From 19% to 6%

Single cell chromosome molecule sequencing
Ploning & sequencing 2.5 Mbp molecules
Chromosome # | #1 | #2
# Good seq reads | 7,166 | 10,660
Average length (bp) |
Total length (bp) | 5,513,520 | 7,212,556
# unknown seqs | 1210
# vectors | 2344
# other seqs | 742
% genome sampled | 63% | 67%
Plone amplification errors: < 1.7×10^-5

Integrated Polony Sequencing Pipeline (open source hardware, software, wetware)
- In vitro paired tag libraries
- Bead polonies via emulsion PCR
- Monolayer gel immobilization
- Enrich amplified beads
- SBE or SBL sequencing
- Epifluorescence & Flow Cell
- SOFTWARE: Images → Tag Sequences; Tag Sequences → Genome
Shendure, Porreca, Reppas, Lin, McCutcheon, Rosenbaum, Wang, Zhang, Mitra, Church (2005) Science 309:1728.

Paired-end libraries
[Figure: library construction workflow; step labels: Shear or Nla III digest; select; dilute, ligate; hRCA; digest Mme I; ligate; amplify; ePCR; tag labels L, M, R]
Shendure, Porreca, et al. (2005) Science 309: 1728
Margulies et al. (2005) Nature 437: 376.

Distribution of Distances Between Mate-Paired Tags
[Figure: histogram of frequency vs distance (bp), axis marks at 1.0 kb and 2.0 kb; annotations: 980 ± 96 bp; 10.7 bp FT]

[Figure: amplicon on an ePCR bead, 3’ to 5’, with Tag 1 and Tag 2 and segments L, M, R; read lengths 7 bp, 6 bp, 7 bp, 6 bp]
4 positions for paired-end anchor 'primers'
Each yields 6 to 7 bp of contiguous sequence
34 bp new sequence per 135 bp amplicon

Sequencing by Ligation (SBL) with fluorescent combinatorial 9-mers
5'PO4 ACUCAUC…
(3’)…TAGAGT????????????????TGAGTAG…(5’)
5’-Cy5-nnnnAnnnn-3’
5’-Cy3-nnnnGnnnn-3’
5’-TR-nnnnCnnnn-3’
5’-Cy3+Cy5-nnnnTnnnn-3’
[Figure: excitation and emission spectra (nm)]
Shendure, Porreca, et al. (2005) Science 309:1728
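
The dye-to-base pairing on the SBL slide (Cy5 for A, Cy3 for G, TR for C, Cy3+Cy5 for T at the query position of the 9-mer) can be read as a small lookup table. The following is a minimal sketch of that lookup only, not the authors' base-caller: the per-dye signal dictionaries, the threshold value, and the function names are assumptions made for illustration. It returns the base carried by the ligated probe; the template base would be its complement.

```python
# Hypothetical sketch: decode one SBL cycle from per-dye signals.
# The dye-to-base pairing comes from the slide; the threshold and the
# dict-based input format are assumptions, not part of the original method.
DYES_TO_BASE = {
    frozenset({"Cy5"}): "A",
    frozenset({"Cy3"}): "G",
    frozenset({"TR"}): "C",
    frozenset({"Cy3", "Cy5"}): "T",
}


def call_base(signal, threshold=0.5):
    """Return the probe base whose dye set matches the lit channels, else 'N'."""
    lit = frozenset(dye for dye, value in signal.items() if value >= threshold)
    return DYES_TO_BASE.get(lit, "N")


def call_read(cycles, threshold=0.5):
    """Concatenate one call per ligation cycle into a read string."""
    return "".join(call_base(signal, threshold) for signal in cycles)
```

Any dye combination outside the four listed ones (for example no signal, or TR together with Cy3) is reported as `N` rather than guessed.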

Automation Schematic
[Figure: HPLC autosampler (96 wells), syringe pump, flow-cell, temperature control, microscope & xyz stage]

Off the Shelf Instrumentation $140,000 Mitra Porreca Shendure

Image Collection & Data Processing
- 514 raster positions x 4 images per cycle
- 26 cycles of sequencing
- 2 additional image sets for object-finding algorithms
- images (1000 x 1000, 14-bit)
- 100 GBytes, 5M reads, $500 run
Porreca et al.
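
The figures on the image-collection slide can be multiplied out as a rough consistency check against the quoted 100 GBytes. This sketch rests on two assumptions that the slide does not state: each of the 2 extra image sets holds one image per raster position, and each 14-bit pixel is stored in a 16-bit word. All constant and function names are illustrative.

```python
# Back-of-envelope data volume for one run, using the slide's numbers.
# Assumptions (not from the slide): one image per raster position in each
# extra image set, and 14-bit pixels stored as 2 bytes.
RASTER_POSITIONS = 514
IMAGES_PER_CYCLE = 4
CYCLES = 26
EXTRA_IMAGE_SETS = 2
PIXELS_PER_IMAGE = 1000 * 1000
BYTES_PER_PIXEL = 2


def image_count():
    """Images from sequencing cycles plus the object-finding image sets."""
    sequencing = RASTER_POSITIONS * IMAGES_PER_CYCLE * CYCLES
    object_finding = RASTER_POSITIONS * EXTRA_IMAGE_SETS
    return sequencing + object_finding


def raw_bytes():
    """Uncompressed size of all images under the storage assumption above."""
    return image_count() * PIXELS_PER_IMAGE * BYTES_PER_PIXEL
```

Under these assumptions the run comes to roughly 54 thousand images and about 109 GB uncompressed, the same order as the slide's figure.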

Open Source Readmapper
v1.0 (Shendure, Porreca et al)
- Hash all the reads (n)
- Scan genome (m), and for each window:
  - Does current window exist in hash?
  - If so, move downstream, scan d positions & test hash for membership
- m + (n * d) = 10+ hours, 20 nodes, 1.6e6 reads
v2.0 (Gary Gao, Sasha Wait)
- Hash all possible reads from genome (m)
- Scan the reads (n), and for each:
  - Does it occur in the hash?
  - If so, does the second exist?
  - If so, take union (k)
- n * k = 10 hours, 1 node, 1.6e6 reads
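
The two read-mapping strategies on this slide differ in what gets hashed: the reads (then the genome is scanned) or the genome (then the reads are scanned). A toy sketch of both is given here, assuming exact matches, a single strand, and a simple allowed-gap filter for combining the two tags of a pair. These simplifications, and every function and parameter name, are assumptions for illustration; this is not the released readmapper.

```python
from collections import defaultdict


def index_genome(genome, k):
    """Map every k-mer in the genome to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_pairs_genome_hash(genome, pairs, k, min_gap, max_gap):
    """Hash the genome, then look up both tags of each read pair."""
    index = index_genome(genome, k)
    placements = {}
    for tag1, tag2 in pairs:
        placements[(tag1, tag2)] = [
            (p1, p2)
            for p1 in index.get(tag1, [])
            for p2 in index.get(tag2, [])
            if min_gap <= p2 - p1 <= max_gap
        ]
    return placements


def map_pairs_read_hash(genome, pairs, k, min_gap, max_gap):
    """Hash the reads by first tag, then scan the genome window by window."""
    by_first = defaultdict(set)
    for tag1, tag2 in pairs:
        by_first[tag1].add(tag2)
    placements = {pair: [] for pair in pairs}
    last_start = len(genome) - k
    for p1 in range(last_start + 1):
        window = genome[p1:p1 + k]
        for tag2 in by_first.get(window, ()):
            # Scan the allowed downstream positions for the mate tag.
            for p2 in range(p1 + min_gap, min(p1 + max_gap, last_start) + 1):
                if genome[p2:p2 + k] == tag2:
                    placements[(window, tag2)].append((p1, p2))
    return placements
```

Both functions return the same placements on the same input; they trade memory (a genome-sized index) against the repeated downstream scan per matching window.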

Error quantitation
Median raw Polony = 3E-3 (99.7%)
454 raw = 4E-2 (96%)
Shendure, Porreca et al,
X consensus <3E-7 [>Q65, %]
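
The error rates, percentages, and Q value on this slide are related by the standard Phred definition, Q = -10 log10(p). A short sketch of the conversion follows; the function names are illustrative.

```python
import math


def phred(error_rate):
    """Phred-scaled quality: Q = -10 * log10(error probability)."""
    return -10.0 * math.log10(error_rate)


def accuracy_percent(error_rate):
    """Per-base accuracy as a percentage."""
    return 100.0 * (1.0 - error_rate)
```

With these definitions, a raw error rate of 3E-3 corresponds to 99.7% accuracy, 4E-2 to 96%, and a consensus error rate of 3E-7 to a quality just above Q65.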

Cost vs consensus error rate
[Figure: cost vs consensus error rate; series labels: ABI, 454 Sep05, Polony Sep05, Polony Feb 06]
$7 $ M 300K $30K
Paired ends: yes, no, yes
Device $: 300K, 500K, 140K

Why low error rates?
Goal of genotyping & resequencing → Discovery of variants, e.g. cancer somatic mutations ~1E-6 (or lab evolved cells)
Consensus error rate | Total errors (E.coli) | (Human)
1E-4 Bermuda/Hapmap ,000 4E ,000 3E E-8 Goal for
Also, effectively reduce (sub)genome target size by enrichment for exons or common SNPs to reduce cost & # false positives.
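
The argument for low consensus error rates is arithmetic: expected false calls scale as error rate times the number of bases examined, and they must stay below the number of real variants being sought. A minimal sketch is given here. The genome sizes are approximate values assumed for illustration and are not taken from the slide, and the function names are hypothetical.

```python
# Approximate genome sizes, assumed for illustration only.
ECOLI_BP = 4.6e6         # roughly one E. coli K-12 genome
HUMAN_DIPLOID_BP = 6e9   # roughly one diploid human genome


def expected_errors(consensus_error_rate, genome_bp):
    """Expected number of erroneous consensus calls over a genome."""
    return consensus_error_rate * genome_bp


def false_to_true_ratio(consensus_error_rate, variant_rate):
    """Erroneous calls per real variant when variants occur at variant_rate."""
    return consensus_error_rate / variant_rate
```

For example, at a 1E-4 consensus error rate and a somatic mutation frequency near 1E-6, errors would outnumber real variants by about 100 to 1, which is why the target error rate has to sit well below the variant frequency, or the target has to be shrunk by enrichment.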

Mutation Discovery in Engineered/Evolved E.coli
Shendure, Porreca, et al. (2005) Science 309:1728
Position | Type | Gene | Location | ABI Confirm | Comments
986,334 | T > G | ompF | Promoter-10 | | Only in evolved strain
985,797 | T > G | ompF | Glu > Ala | | Only in evolved strain
931,960 | Δ 8 bp | lrp | frameshift | | Only in evolved strain
3,957,960 | C > T | ppiC | 5' UTR | | MG1655 heterogeneity
 | T > C | cI | Glu > Glu | | λ red heterogeneity
 | T > C | ORF61 | Lys > Gly | | λ red heterogeneity

Sequence monitoring of evolution (optimize small molecule synthesis/transport) Sequence trp - Reppas, Lin & Church

ompF - non-specific transport channel
Glu-117 → Ala (in the pore): charged residue known to affect pore size and selectivity
Promoter mutation at position (-12): makes -10 box more consensus-like (AAAGAT → CAAGAT)
Can increase import & export capability simultaneously

Co-evolution of mutual biosensors sequenced across time & within each time-point
3 independent lines of Trp/Tyr co-culture frozen.
OmpF: 42 R->G, L, C; 113 D->V; 117 E->A
Promoter: -12 A->C, -35 C->A
Lrp: 1bp deletion, 9bp deletion, 8bp deletion, IS2 insertion, R->L in DBD.
Heterogeneity within each time-point reflecting colony heterogeneity.

Using paired ends, rearrangement & copy-number detection is >1000X easier than point mutation detection (6X vs 6000X)
[Figure: proximal tag placement vs distal tag placement, 1,206k to 1,210k; mixture of wild & 2kb inversion (pin); pairs marked for incorrect distance; red = same strand, black = opposite strand]
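
The paired-end figure flags two kinds of anomalous pairs: tags at an incorrect distance and tags with an unexpected strand relationship. A minimal classifier along those lines is sketched here. It assumes the 980 ± 96 bp mate-pair distance reported earlier in the deck as the expected spacing and a 3-standard-deviation tolerance, and it takes the expected strand relationship as an explicit argument because the slide does not spell it out. All names are illustrative.

```python
def classify_pair(pos1, strand1, pos2, strand2, expected_same_strand,
                  mean=980.0, sd=96.0, n_sd=3.0):
    """Label a mapped tag pair as consistent or as one of two anomaly types.

    expected_same_strand is an assumption the caller must supply; the
    tolerance of n_sd standard deviations is likewise an illustrative choice.
    """
    if (strand1 == strand2) != expected_same_strand:
        return "orientation flip"      # candidate inversion
    distance = abs(pos2 - pos1)
    if abs(distance - mean) > n_sd * sd:
        return "incorrect distance"    # candidate insertion, deletion, or rearrangement
    return "consistent"
```

A rearrangement breakpoint is supported by many independent anomalous pairs, so modest coverage suffices, whereas a point mutation needs deep coverage at one base; that contrast is the slide's 6X vs 6000X comparison.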

Human Diplome Sequencing Strategies
- 1M Causative Genome Changes, CGCs (10X MIP pool $20)
- Exons & conserved 3% (6X $9K)
- Diplome chromosome dilution shotgun (0.01X $300)
- 40K RNA diplome (10X MIP pool $20)
- Strand displacement amplification (ploning)
- Polony sequencing, 7E8 pixels
- Chip Genotyping/Haplotyping
- Personal Genome Project (ELSI)
- Open source hardware, software, wetware

Padlock, Molecular Inversion Probes (MIPs)
Causative Genomic Changes (CGCs, e.g. conserved 3%) (not restricted to Single Nucleotides or Polymorphisms >1%)
[Figure: MIP schematic; labels: Genomic DNA, Alternative alleles (CG, CA, TG), Universal primers, L, R, Optional multiplex tag]
Hardenbol.. Landegren Davis et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol :
“10,000 targeted SNPs genotyped in a single tube assay.” Genome Res :269
Vitkup, Sander, Church (2003) The Amino-acid Mutational Spectrum of Human Genetic Disease. Genome Biol. 4: R72. (CG to CA, TG)

MIPs for VDJ Polonies
Over the whole field of human T-cells
cDNA, 1 TRAC + 2 TRBC primers
47 TRAV * 50 TRAJ + 46 TRBV * 13 TRBJ = 2948 MIP oligos
or 47 TRAV * 1 TRAC + 46 TRBV * 2 TRBC = 139 MIP oligos
In situ RCA or PCR for each T-cell
Polony sequencing of tag &/or gap fill (e.g. 18 to 33bp in CDR3) (two tags per cell sufficient?)
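
The two oligo counts on this slide follow from simple products: one probe per V/J combination in the first design, one probe per V/C combination in the second. The sketch below reproduces the arithmetic with the segment counts given on the slide. The constant and function names are illustrative, and the beta-chain V count is named generically.

```python
# Segment counts as given on the slide.
N_ALPHA_V, N_ALPHA_J, N_ALPHA_C = 47, 50, 1
N_BETA_V, N_BETA_J, N_BETA_C = 46, 13, 2


def oligos_v_by_j():
    """One MIP per V/J combination, summed over alpha and beta chains."""
    return N_ALPHA_V * N_ALPHA_J + N_BETA_V * N_BETA_J


def oligos_v_by_c():
    """One MIP per V/C combination, summed over alpha and beta chains."""
    return N_ALPHA_V * N_ALPHA_C + N_BETA_V * N_BETA_C
```

Anchoring on the constant region instead of the J segment cuts the pool by roughly a factor of 20, at the cost of a longer gap to fill across the junction.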

‘Next Generation’ Technology Development
Company | Our role
Multi-molecule
Affymetrix | Software
Gorfinkel | Polony to Capillary
454 LifeSci | Paired ends, emulsion
Lynx/Solexa | Multiplexing & polony
Agencourt | Seq by Ligation (SbL)
Single molecules
Helicos Biosci | SAB, cleavable fluors
Pacific Biosci | -
Agilent | Nanopores
Visigen Biotech | -
Complete Genomics | SbL

Human subjects consent
“Because the database will be public, people who do identity testing, such as for paternity testing or law enforcement, may also use the samples, the database, and the HapMap, to do general research. However, it will be very hard for anyone to learn anything about you personally from any of this research because none of the samples, the database, or the HapMap will include your name or any other information that could identify you or your family.”
YRI = Yoruba, Ibadan, Nigeria
JPT = Japan, Tokyo
CHB = China (Han) Beijing
CEU = CEPH (N&W Europe) Utah

Is anonymity in genomics realistic?
(1) Re-identification after “de-identification” using other public data. Group Insurance Commission list of birth date, gender, and zip code was sufficient to re-identify medical records of Governor Weld & family via voter-registration records (1998)
(2) Hacking. “Drug Records, Confidential Data vulnerable via Harvard ID number & PharmaCare loophole” (2005). A hacker gained access to confidential medical info at the U. Washington Medical Center files (names, conditions, etc, 2000)
(3) Combination of surnames from genotype with geographical info. An anonymous sperm donor was traced on the internet in 2005 by his 15 year old son, who used his own Y chromosome genealogy to access surname relations.
(4) Inferring phenotype from genotype. Markers for eye, skin, and hair color, height, weight, racial features, dysmorphologies, etc. are known & the list is growing.
(5) Unexpected self-identification. An example of this at Celera undermined confidence in the investigators. Kennedy D. Science :1237. Not wicked, perhaps, but tacky.
(6) A tiny amount of DNA data in the public domain with a name leverages the rest. This would allow the vast amount of DNA data in the HapMap (or other study) to be identified. This can happen for example in court cases even if the suspect is acquitted.
(7) Identification by phenotype. If CT or MR imaging data is part of a study, one could reconstruct a person’s appearance. Even blood chemistry can be identifying in some cases.

"Open-source" Personal Genome Project (PGP)
Harvard Medical School IRB Human Subjects protocol submitted Sep-2004, approved Aug-2005, renewed Feb
Start with 3 highly-informed individuals consenting to non-anonymous genomes & extensive phenotypes (medical records, imaging, omics).
Cell lines in Coriell NIGMS Repository
G M Church GM (2005) The Personal Genome Project Nature Molecular Systems Biology doi: /msb
Kohane IS, Altman RB. (2005) Health-information altruists--a potentially critical resource. N Engl J Med. 10;353(19):

Discussion: Ascertainment bias vs. risk of disclosure without consent.
It is likely that less-privileged citizens ‘might be’ less likely to volunteer & will be more likely to volunteer due to higher financial risk. These same people ‘might be’ even less likely to volunteer if the data might become public. These same folks might be especially impacted socially if identifying (genome and/or phenome) data were to get out after they were assured that it would not.

Proposal for multi-tiered (re)consent of subjects in genomic studies
Five categories:
1) Withdrawal from studies due to new information on risks (all data destroyed).
2) Highest security (possibly higher than the original study): encryption, aggressive de-identification, only expert access with IRB approval of each person, not whole teams. Consent form clearly states the risks (see previous slides).
3) Medium security, similar to current practice, but consented as above. IRB approval for teams to download de-identified data.
4) Open-PGP-type security. Click-through agreement. IRB approval only for data collection, not for data reading.
5) Fully open. No IRB approval; full web access, e.g. subject initiated.
