Sequence the 3 billion base pairs of human DNA and identify the 100,000 genes contained in the human genome
Goals of the Human Genome Project 1. Sequence:
  Organism         Genome size (bp)
  Human            3.0 x 10^9
  Mouse            3.0 x 10^9
  Drosophila       1.1 x 10^8
  Worm             1.0 x 10^8
  Dictyostelium    3.4 x 10^7
  Yeast            1.2 x 10^7
  Bacteria         1.0 - 5.0 x 10^6
BCM-HGSC
Goals of the Human Genome Project 2. Characterize all genes and enable studies of genetics, evolution and function.
BERMUDA 1996: ‘Primary Genomic Sequence Should be in the Public Domain’ - ‘Should be Rapidly Released’
Quality - < 1 error/10,000 (Polymorphism rate is 1/1,000) - No gaps or ‘mis-assemblies’ - Merit for high quality data only - ‘Slippery Slope’ Arguments
Technology - ABD 4 color Fluorescence - Mapped-Clone Approach - Random Phase - Directed Phase - Modular, 96 well Automation
3.0 Gb by Oct 2005? (Feb ‘98)
May 98: P/E ‘Celera’ Scheme - New Capillary Instrument - >10 runs/day x 96 samples - Total 230 Instruments - $330M Private Funds - Total 250,000 reads/day - Whole Genome Shotgun
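The throughput figures on this slide can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes exactly 10 runs per day (the slide says more than 10) and one read per sample, and the variable names are not from the slides.

```python
# Figures taken from the slide; runs_per_day is a lower bound (">10").
instruments = 230
runs_per_day = 10        # assumption: the slide only says ">10"
samples_per_run = 96     # assumption: one read per sample

reads_per_day = instruments * runs_per_day * samples_per_run
print(f"{reads_per_day:,} reads/day at exactly 10 runs/day")

# Runs per instrument per day needed to reach the quoted 250,000 reads/day.
target_reads_per_day = 250_000
runs_needed = target_reads_per_day / (instruments * samples_per_run)
print(f"{runs_needed:.1f} runs/day per instrument for 250,000 reads/day")
```

Under these assumptions the quoted total is reachable only with slightly more than 10 runs per instrument per day, which is consistent with the ">10" on the slide.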
P/E ‘Celera’ Scheme: Release Policy - ‘Public Release’, 3 months Delay - Consensus sequence only - All SNPs held - Drosophila, Mouse
Regional mapping - Minimal tiling path selected for sequencing.
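Selecting a minimal tiling path can be sketched as a greedy interval cover. This is a simplified illustration, not the procedure used at the HGSC: it assumes each clone has known start and end coordinates on the regional map, and the function name, coordinates and clone names are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start_kb, end_kb) on the regional map


def minimal_tiling_path(clones: List[Clone], start: int, end: int) -> List[str]:
    """Greedy interval cover: fewest clones that span [start, end]."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[str] = []
    covered_to = start
    i = 0
    while covered_to < end:
        best = None
        # Among clones that begin at or before the covered point,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"map gap at {covered_to} kb")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones covering a 450 kb region.
clones = [("A", 0, 150), ("B", 50, 180), ("C", 120, 300),
          ("D", 160, 320), ("E", 290, 450), ("F", 310, 400)]
tiling_path = minimal_tiling_path(clones, 0, 450)
print(tiling_path)
```

A real tiling path would also require a minimum overlap between neighbouring clones; here any shared coordinate counts as an overlap.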
Restriction fragment fingerprinting - BAC clones are grown in 96-well format - Hind III digest - 1% agarose - Molecular weight marker every 5th lane - Gel size labels: >20 kbp, ~300 bp
Contig assembly: FPC* - Overlap identification by restriction pattern similarities - Facilitated contig assembly - Clones A B C D E F G (figure legend: insert fragments, vector fragments) - All restriction fragments within a clone selected for the tiling path must be verified by their presence in overlapping clones. *Sanger Centre: C. Soderlund, I. Longden and R. Mott
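The two ideas on this slide, overlap detection by shared restriction fragments and verification of tiling-path fragments in overlapping clones, can be sketched as follows. This is a toy illustration and does not reproduce FPC's actual scoring; the 2% size tolerance, the function names and the fingerprints are all assumptions.

```python
from typing import List


def shared_bands(a: List[int], b: List[int], tol: float = 0.02) -> int:
    """Count fragments of a matching an unused fragment of b within a relative tolerance."""
    remaining = sorted(b)
    shared = 0
    for size in sorted(a):
        for j, other in enumerate(remaining):
            if abs(size - other) <= tol * size:
                shared += 1
                del remaining[j]  # each fragment of b may be matched only once
                break
    return shared


def unverified_fragments(clone: List[int], neighbours: List[List[int]],
                         tol: float = 0.02) -> List[int]:
    """Fragments of a tiling-path clone that appear in no overlapping clone."""
    return [
        size for size in clone
        if not any(abs(size - other) <= tol * size
                   for nb in neighbours for other in nb)
    ]


# Hypothetical fingerprints (fragment sizes in bp).
fp_a = [12000, 8300, 5100, 2400, 900]
fp_b = [8320, 5080, 2400, 1500, 700]
fp_c = [15000, 1500, 700, 400]

overlap_ab = shared_bands(fp_a, fp_b)
overlap_ac = shared_bands(fp_a, fp_c)
missing_b = unverified_fragments(fp_b, [fp_a, fp_c])
print(overlap_ab, overlap_ac, missing_b)
```

In this example clone B shares bands with both neighbours and every one of its fragments is confirmed elsewhere, so it would pass the verification rule stated on the slide.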
Shotgun Sequencing I: RANDOM PHASE - BAC Clone: 100-200 kb - Sheared DNA: 1.0-2.0 kb - Sequencing Templates - Random Reads
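How much of a clone random reads cover can be explored with a small simulation. The sketch below assumes a 150 kb clone (within the slide's range) and 500 bp reads (not stated on this slide), and compares the result with the standard approximation 1 - e^(-c) for c-fold coverage, which ignores clone ends and cloning bias.

```python
import math
import random


def simulate_shotgun(clone_len: int, read_len: int, n_reads: int, seed: int = 0) -> float:
    """Return the fraction of clone bases covered by at least one random read."""
    rng = random.Random(seed)
    depth = [0] * clone_len
    for _ in range(n_reads):
        start = rng.randrange(clone_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return sum(1 for d in depth if d > 0) / clone_len


clone_len = 150_000   # assumption: within the slide's 100-200 kb BAC range
read_len = 500        # assumption: read length is not given on this slide
for fold in (1, 4, 8):
    n_reads = fold * clone_len // read_len
    observed = simulate_shotgun(clone_len, read_len, n_reads)
    expected = 1 - math.exp(-fold)
    print(f"{fold}x: simulated {observed:.3f}, approximation {expected:.3f}")
```

Even at high coverage a few bases typically remain uncovered, which is why a separate finishing phase follows the random phase.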
Shotgun Sequencing II: ASSEMBLY - Low Base Quality - Single Stranded Region - Mis-Assembly (Inverted) - Sequence Gap - Consensus
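Three of the problem classes on this slide can be flagged from per-base assembly statistics. The sketch below is an assumption-laden illustration: it takes per-base forward and reverse read depth plus a consensus quality value, uses a hypothetical quality threshold of 20, and does not attempt to detect mis-assemblies.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def classify(fwd: int, rev: int, qual: int, min_qual: int = 20) -> str:
    """Label one consensus position from its strand depths and quality."""
    if fwd + rev == 0:
        return "sequence gap"
    if fwd == 0 or rev == 0:
        return "single stranded"
    if qual < min_qual:
        return "low base quality"
    return "ok"


def problem_regions(fwd: Sequence[int], rev: Sequence[int], qual: Sequence[int],
                    min_qual: int = 20) -> List[Tuple[int, int, str]]:
    """Collapse per-base labels into half-open (start, end, label) runs."""
    labels = [classify(f, r, q, min_qual) for f, r, q in zip(fwd, rev, qual)]
    regions = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label != "ok":
            regions.append((pos, pos + length, label))
        pos += length
    return regions


# Hypothetical 10-base stretch of an assembly.
fwd = [2, 2, 1, 1, 0, 0, 0, 1, 2, 2]
rev = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
qual = [40, 40, 30, 30, 0, 0, 35, 15, 40, 40]
regions = problem_regions(fwd, rev, qual)
for region in regions:
    print(region)
```

Each reported run is a candidate for targeted finishing work.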
Shotgun Sequencing III: FINISHING - Problems resolved in turn: Low Base Quality, Single Stranded Region, Sequence Gap, Mis-Assembly (Inverted) - Consensus
Shotgun Sequencing III: FINISHING - High Accuracy Sequence: < 1 error/10,000 bases
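The accuracy target can be related to per-base quality values. The sketch below assumes Phred-style qualities, Q = -10 log10(P), which the slides do not name; under that convention 1 error in 10,000 bases corresponds to Q40. The function names and example qualities are illustrative.

```python
import math
from typing import Sequence


def phred_from_error_rate(p: float) -> float:
    """Convert an error probability to a Phred-style quality value."""
    return -10 * math.log10(p)


def expected_errors(quals: Sequence[int]) -> float:
    """Sum of per-base error probabilities implied by the quality values."""
    return sum(10 ** (-q / 10) for q in quals)


def meets_standard(quals: Sequence[int], max_rate: float = 1e-4) -> bool:
    """True if the expected error rate is below 1 error per 10,000 bases."""
    return expected_errors(quals) / len(quals) < max_rate


print(phred_from_error_rate(1e-4))

# Hypothetical 10 kb consensus: mostly Q50, with a stretch of Q30 bases.
draft = [50] * 9000 + [30] * 1000
finished = [50] * 9900 + [30] * 100
print(meets_standard(draft), meets_standard(finished))
```

The example shows how a modest stretch of lower-quality bases can push an otherwise good consensus over the threshold.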
Whole Genome Shotgun Sequencing - Whole Genome: 3,000 Mb - Sheared DNA: 1.0-2.0 kb - Sequencing Templates - Random Reads
Whole Genome Shotgun Sequencing: Assembly - Low Base Quality - Single Stranded Region - Mis-Assembly (Inverted) - Sequence Gap - Consensus
Whole Genome Shotgun Sequencing: Assembly - Sequence Gap - Low Base Quality - Consensus
P/E ‘Celera’ Scheme: 10 X coverage in three years - Regions not covered - Regions very densely covered - Contigs 1.0-15 kb - # Gaps? >100,000? - Base Quality High or Low? - Mis-Assemblies? - Duplications?
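The read count implied by 10 X coverage follows from the figures quoted elsewhere in the deck (3,000 Mb genome, 500 bp reads, 250,000 reads/day). The sketch below is a back-of-envelope calculation only: it assumes every read succeeds and that full throughput is available from day one, and it says nothing about gaps, base quality or assembly time.

```python
genome_bp = 3_000_000_000   # 3,000 Mb whole genome
coverage = 10               # "10 X coverage"
read_len_bp = 500           # assumption: 500 bp reads, every read usable
reads_per_day = 250_000     # quoted total throughput, assumed sustained

reads_needed = coverage * genome_bp // read_len_bp
sequencing_days = reads_needed / reads_per_day

print(f"{reads_needed:,} reads needed")
print(f"{sequencing_days:.0f} days of sequencing at full throughput")
```

Raw read production is therefore not the only constraint on the three-year schedule; the open questions on the slide concern what can be assembled from those reads.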
‘Complete an accurate, high quality sequence of the human genome by the end of 2003, … a working draft can be completed … within the next three years …’ ‘That (draft) sequence will be of lower accuracy and contiguity … will be useful for finding genes … and other features …’
Human Genome Sequencing Project: Integrating Multiple Sources of Data - Chromosome, Map location, Clone, Fingerprint, Project XYZ ??? - Celera: Random sequences, 500 bp reads, consensus (3-5 kb) - NHGRI: Mapped projects (~100 kb), 5-20 contigs (10-20 kb) - How to use Celera data in NHGRI assemblies? Lichtarge Lab - HGSC, Baylor College of Medicine