Canadian Bioinformatics Workshops www.bioinformatics.ca
Informatics on High Throughput Sequencing Data: Introduction to next-gen sequencing. Francis Ouellette, francis@oicr.on.ca, July 25th 2008
Outline
- Sequencing DNA
- Next Generation Technologies: Solexa, SOLiD, 454, Helicos
- AB’s color space
- What next, and things to keep in mind!
Adapted from John McPherson, OICR Biological Research
History of DNA Sequencing (timeline figure; adapted from Eric Green, NIH, and from Messing & Llaca, PNAS (1998))
- 1870: Miescher discovers DNA
- 1940: Avery proposes DNA as ‘Genetic Material’
- 1953: Watson & Crick, double helix structure of DNA
- 1965: Holley sequences yeast tRNA-Ala
- 1970: Wu sequences cohesive end DNA
- 1977: Sanger, dideoxy chain termination; Gilbert, chemical degradation
- 1980: Messing, M13 cloning
- 1986: Hood et al., partial automation
- Later milestones (axis marks 1990, 2002, 2008): cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes; next generation sequencing; improved enzymes and chemistry; improved image processing
Efficiency axis (bp/person/year), in increasing order along the timeline: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
Basics of the “old” technology
- Clone the DNA.
- Generate a ladder of labeled (colored) molecules that differ in length by 1 nucleotide.
- Separate the mixture on some matrix.
- Detect the fluorochrome by laser.
- Interpret the peaks as a string of DNA.
- Strings are 500 to 1,000 letters long.
- 1 machine generates 57,000 nucleotides/run.
- Assemble all strings into a genome.
Basics of the “new” technology
- Get DNA.
- Attach it to something.
- Extend and amplify the signal with some color scheme.
- Detect the fluorochrome by microscopy.
- Interpret the series of spots as short strings of DNA.
- Strings are 30-300 letters long.
- Multiple images are interpreted as 0.4 to 1.2 Gb (gigabases) per run (1,200,000,000 letters/day).
- Map or align strings to one or many genomes.
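The per-run figures quoted for the two generations can be turned into a rough comparison. The following is a minimal Python sketch, assuming a 3 Gb human-sized genome and a single pass with no redundancy, overlap or failed reads (these are illustrative assumptions, not figures from the slides); `runs_for_one_pass` is a hypothetical helper name.

```python
GENOME_SIZE = 3_000_000_000  # assumption: a ~3 Gb, human-sized genome


def runs_for_one_pass(bases_per_run: int, genome_size: int = GENOME_SIZE) -> int:
    """Runs needed to produce as many bases as the genome holds (1x, no redundancy)."""
    # Integer ceiling division: a partially used run still counts as a run.
    return -(-genome_size // bases_per_run)


# "Old" technology: 57,000 nucleotides per run.
old_runs = runs_for_one_pass(57_000)

# "New" technology: 0.4 to 1.2 billion letters per run.
new_runs_low_yield = runs_for_one_pass(400_000_000)
new_runs_high_yield = runs_for_one_pass(1_200_000_000)

print(f"old technology: {old_runs:,} runs")
print(f"new technology: {new_runs_high_yield} to {new_runs_low_yield} runs")
```

A real project needs many-fold coverage, so both counts would scale up by the same factor; the ratio between them is the point of the sketch.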
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Differences between the various platforms:
- Nanotechnology used.
- Resolution of the image analysis.
- Chemistry and enzymology.
- Signal to noise detection in the software.
- Software/images/file size/pipeline.
- Cost $$$
Next Generation DNA Sequencing Technologies
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
[Figure; surviving label: “3 Gb ==”]
Solexa
Solexa-based Whole Genome Sequencing
Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome” http://tinyurl.com/5f3alk
Solexa flow cell: ~50M clusters are sequenced per flow cell.
Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
454
Roche / 454: GS FLX
- Real Time Sequencing by Synthesis
- Chemiluminescence detection in picotiter plates
- Amplification: emulsion PCR
- Pyrosequencing
- Up to 400,000 reads / run
- On average 250 bases / read (and longer)
- Up to 100 Mb / run
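The three throughput figures on this slide agree with each other: 400,000 reads at an average of 250 bases comes to 100 Mb. A small Python sketch of that arithmetic follows, extended to the mean coverage such a run would give on a bacterial genome. The 5 Mb genome size is an illustrative assumption, not a figure from the slides, and `mean_coverage` is a hypothetical helper.

```python
READS_PER_RUN = 400_000    # from the slide: up to 400,000 reads / run
MEAN_READ_LENGTH = 250     # from the slide: on average 250 bases / read

bases_per_run = READS_PER_RUN * MEAN_READ_LENGTH
megabases_per_run = bases_per_run / 1_000_000


def mean_coverage(total_bases: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Assumption: a 5 Mb bacterial genome, chosen only for illustration.
bacterial_coverage = mean_coverage(bases_per_run, 5_000_000)

print(f"{megabases_per_run:.0f} Mb per run")
print(f"{bacterial_coverage:.0f}x mean coverage of a 5 Mb genome")
```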
Roche / 454: GS FLX
- Made for de novo sequencing.
- Too expensive for resequencing.
- For example, this platform will be used a lot by laboratories sequencing new bacterial genomes.
- Baylor Genome Center, involved in the Sea Urchin, Bee and Platypus genomes, has a number of 454 machines.
Helicos
Single Molecule Sequencing (Helicos Biosciences Corp.)
Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh
[Figure labels: microscope slide; single DNA molecule; primer; dNTP-Cy3 (*); super-cooled TIRF microscope]
Helicos: Approximate Data Production per Run at Current Peak Throughput (1 strand/µm²)

                                Single Pass     Dual Pass
                                (7 day run)     (14 day run)
Image Data:                     35 TB           60 TB
Diagnostic Images:              350 GB          600 GB
Object Table:                   3.5 TB          6 TB
Sequence Data:                  350 GB          600 GB
Log Files:                      350 GB          600 GB
Total (w/o full image stack):   ~4.5 TB         ~7.8 TB
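The totals on this slide can be checked by summing every row except the image data. A minimal Python sketch, assuming decimal units (1 TB = 1,000 GB); the dictionary layout and the function name are illustrative only.

```python
GB_PER_TB = 1000  # assumption: decimal units

# Per-run volumes in GB, as (single pass, dual pass), taken from the slide.
per_run_gb = {
    "Image Data": (35_000, 60_000),
    "Diagnostic Images": (350, 600),
    "Object Table": (3_500, 6_000),
    "Sequence Data": (350, 600),
    "Log Files": (350, 600),
}


def total_without_image_stack(table: dict, column: int) -> float:
    """Sum one column in TB, leaving out the full image stack."""
    kept = (sizes[column] for name, sizes in table.items() if name != "Image Data")
    return sum(kept) / GB_PER_TB


single_pass_tb = total_without_image_stack(per_run_gb, 0)
dual_pass_tb = total_without_image_stack(per_run_gb, 1)

print(f"single pass: {single_pass_tb:.2f} TB, dual pass: {dual_pass_tb:.2f} TB")
```

The sums come to 4.55 TB and 7.8 TB, which matches the quoted ~4.5 TB and ~7.8 TB; keeping the image stack would add another 35 TB or 60 TB per run.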
ABI SOLiD
File management
SOLiD color space
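In color space, a read is reported as a known first base followed by a series of colors, where each color encodes a pair of adjacent bases. The following Python sketch shows the usual two-base encoding with four colors (0 to 3), written as an XOR of 2-bit base codes. The function names are hypothetical, and the sketch is an illustration of the idea, not the vendor's software.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq: str) -> tuple[str, list[int]]:
    """Return the first base plus one color per pair of adjacent bases."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base: str, colors: list[int]) -> str:
    """Rebuild the bases by walking the colors from the known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


first, colors = encode("ATGGCA")
round_trip = decode(first, colors)

# A single wrong color changes every base after it when decoded this way.
one_bad_color = decode("A", [3, 1, 1, 3, 1])

print(first, colors, round_trip, one_bad_color)
```

Because one miscalled color shifts all downstream bases, reads are generally compared with the reference in color space instead of being decoded first.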
It’s more complicated!
- Get files with quality scores.
- Get files with mismatches.
- Need to align them to a reference genome.
- Multiple tools do this today … and there will be more later.
- What do you do? Do it all!
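Aligning a short read to a reference while tolerating a few mismatches can be sketched very simply: slide the read along the reference and count the differing positions at each offset. The Python below is a deliberately naive illustration with made-up sequences and hypothetical function names; real mapping tools index the reference and are far faster.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str, max_mismatches: int = 2) -> list[tuple[int, int]]:
    """Return (position, mismatches) for every placement within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = hamming(read, reference[pos:pos + len(read)])
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Made-up example data: the read matches reference[8:18] except at one base.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCGTTAG"

hits = map_read(read, reference, max_mismatches=1)
print(hits)
```

This scan costs time proportional to reference length times read length for every read, which is why it does not scale to hundreds of millions of reads against a 3 Gb genome.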
Things to keep in mind
- Everyone is learning. If you don’t know, ask; they probably won’t know either, and you can figure it out together!
- The technology is changing: this workshop next year will be totally different!
- We can only do so much in two days. You will need to find things, find people who can help you, and teach your friends!
Other factors
- Changing technology
- New and disappearing companies?
- Changing price structure: cost of machine; cost of operation (reagents/people); service from the company; 1 machine vs (2 or 3 machines) vs 40 machines
- Changing software and processing
Pacific Biosciences (PacBio)
Questions? Coffee break!
Day 1