1
Whole Genome Sequencing, Comparative Genomics, & Systems Biology
Gene Myers
University of California Berkeley
2
A History of Genome Sequencing
1981: Sanger et al. sequence Lambda (50Kbp) by the shotgun method.
Cloning: BACs permit 100-250Kbp inserts.
Technology: Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends. Capillaries improve accuracy & efficiency.
1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.
3
Whole Genome Shotgun Sequencing
~55 million reads
– Collect 6-10x sequence in a 5-5-1 ratio of three types of read pairs: Short (2Kbp), Long (10Kbp), Extra Long (50-150Kbp).
– Assemble into “scaffolds”, ordered runs of contigs with known spacing.
– Map scaffolds to genome with STS or other markers.
Pros: single highly automated process; only a handful of library constructions.
Con: assembly is much more difficult.
[Figure labels: Contig; Gap (mean & std. dev. known); Read pair (mates)]
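The "gap (mean & std. dev. known)" idea can be illustrated with a minimal sketch. It assumes each read pair linking two contigs is summarized by two hypothetical offsets: the distance from the first read's start to the end of the left contig, and the distance from the start of the right contig to the second read's end. The function name, the input layout, and the independence assumption behind the standard error are illustrative choices, not the assembler's actual method.

```python
import math

def estimate_gap(links, insert_mean, insert_sd):
    """Estimate the gap between two contigs from mate-pair links.

    links: list of (left_offset, right_offset) pairs in bp (hypothetical layout).
    insert_mean, insert_sd: mean and std. dev. of the library insert size.
    Returns (gap_mean, gap_sd).
    """
    if not links:
        raise ValueError("at least one linking read pair is required")
    # Each pair implies: insert length = left_offset + gap + right_offset
    gaps = [insert_mean - left - right for left, right in links]
    gap_mean = sum(gaps) / len(gaps)
    # Assumption: links are independent draws, so the error of the mean shrinks as 1/sqrt(k)
    gap_sd = insert_sd / math.sqrt(len(gaps))
    return gap_mean, gap_sd

# Three 2Kbp-library pairs bridging the same two contigs (made-up numbers)
gap_mean, gap_sd = estimate_gap([(800, 700), (900, 500), (600, 1000)], 2000, 200)
```

More linking pairs tighten the estimate, which is one reason several mates per contig pair are useful when ordering scaffolds.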
4
How to accomplish WGA in a nutshell
– Identify and assemble all the unique genomic segments
– Link together into scaffolds with paired reads
– Back-fill interspersed repeats with “anchored reads”
5
A History of Genome Sequencing
1981: Sanger et al. sequence Lambda (50Kbp) by the shotgun method.
1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.
2001: 97% of the chromatin of the human genome has been determined. Mouse, Drosophila, Rice, Fugu, and Anopheles have all been sequenced with a whole genome shotgun approach.
Cloning: BACs permit 100-250Kbp inserts.
Technology: Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends. Capillaries improve accuracy & efficiency.
7
Case Study: 3 Dros. Assemblies vs. Release 3
Input: (Celera) 3.2M reads, 732K 2Kbp pairs, 548K 10Kbp pairs; (BDGP) 12K BAC pairs.
WGS1: Dec. 1999, reported in Science 2000.
Repeat walking removed, Stones debugged, SNP handling
WGS2: March 2001, time of Human publication
Error correction introduced, improvements in unitig classification
WGS3: July 2002, last run on melanogaster
8
Coverage of Release 3

                                  WGS1     WGS2     WGS3     Rel. 3
# of Scaffolds Covering Rel. 3    55       63       53       13
Total Mb Spanned                  116.39   117.44   117.6    116.91
Total Mb of Rel. 3 Spanned        116.4    116.5    116.8    --
Total Mb of Sequence              114.15   115.83   116.42   116.87
Total Mb of Rel. 3 Sequence       114.1    115      115.6    --
N50 Scaffold Length (in Mb)       10.85    14.45    13.89    18.5
Number of Gaps                    2,173    2,315    1,130    44
Mean Contig Length (in kb)        52.2     49.5     102      2,335
Mean Gap Length (in bp)           1,531    912      1,335    --

Percentages shown on the slide: 98.93%, 99.91%.
In addition, 20.7Mbp of heterochromatic sequence was assembled (WGS3), containing 31 known proteins and 266 newly predicted genes.
58% of Rel. 3 gaps were interspersed repeats, 12% were tandem repeats (WGS3).
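For readers unfamiliar with the N50 statistic reported above, here is a small sketch of the usual definition: the length L such that scaffolds of length at least L hold half or more of the total assembled bases. The function name and the toy lengths are illustrative and are not taken from the Drosophila assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half of all bases are covered
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Toy example with five scaffolds (arbitrary units)
toy_n50 = n50([2, 3, 4, 5, 6])
```

N50 rewards a few long scaffolds over many short ones, which is why it is reported alongside the gap count and mean contig length.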
9
O&O (order and orientation) Errors vs. Release 3

                      WGS1                   WGS2                   WGS3
                      # segs  # base pairs   # segs  # base pairs   # segs  # base pairs
Aligned Segments      2,125   113.30 Mb      2,270   114.41 Mb      1,087   114.99 Mb
Local Errors          9       68.33 kb       7       9.80 kb        3       5.64 kb
Repeat Errors         25      42.52 kb       1       0.66 kb        1       0.98 kb
Gross misassemblies   3       10.69 kb       0                      0
10
Sequencing Error Rates vs. Release 3

Errors / 10 kb             WGS1    WGS2    WGS3
All Sequence               4.12    2.23    1.1
In Tandem Repeats          95.2    61.4    48.8
In Interspersed Repeats    78.2    15.8    9.62
In Unique Sequence         1.82    1.31    0.38
> 10 bp from gap           1.37    1.02    0.29
> 50 bp from gap           1.32    0.95    0.26
12
Solid State Sequencing in Pico-wells: Operational next year
25-50Mbp per instrument/day in 50bp reads, 0.3-1Kbp pairs (vs. 1-2Mbp per inst./day in 800bp reads, 2-10Kbp pairs)
Applications: Resequencing, BAC drafts at 99%
Detecting dNTP incorporations by fixed PolII complex: Operational 5-10 years from now
1-10Gbp per instrument/day in 100Kbp reads (they can be 30-50% noise)! Assembly will not be difficult.
Nanopore
My opinion: not knowable, could be 50 years.
14
Mouse is smaller than Human: ~15% expansion of euchromatin
Sequence anchor: >50bp at >75% id. & bidirectionally unique
[Figure: Syntenic Anchors, Human (21) vs. Mouse (16), axes in Mbp]
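The anchor criterion can be sketched as a filter over pairwise alignment hits. The slide does not define "bidirectionally unique"; this sketch assumes it means that, among hits passing the length and identity thresholds, the human segment matches only that mouse segment and vice versa. The `Hit` record and its field names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple

class Hit(NamedTuple):
    human_seg: str    # identifier of the human segment (hypothetical field)
    mouse_seg: str    # identifier of the mouse segment (hypothetical field)
    length: int       # alignment length in bp
    identity: float   # fraction of identical positions, 0..1

def select_anchors(hits, min_len=50, min_id=0.75):
    """Keep hits that are >50bp, >75% identical, and unique in both genomes."""
    good = [h for h in hits if h.length > min_len and h.identity > min_id]
    human_count = Counter(h.human_seg for h in good)
    mouse_count = Counter(h.mouse_seg for h in good)
    # Assumed reading of "bidirectionally unique": one partner in each direction
    return [h for h in good
            if human_count[h.human_seg] == 1 and mouse_count[h.mouse_seg] == 1]

hits = [
    Hit("h1", "m1", 120, 0.90),   # kept
    Hit("h2", "m2", 80, 0.80),    # dropped: h2 also matches m3
    Hit("h2", "m3", 90, 0.85),    # dropped: h2 also matches m2
    Hit("h4", "m4", 40, 0.95),    # dropped: too short
    Hit("h5", "m5", 200, 0.70),   # dropped: identity too low
]
anchors = select_anchors(hits)
```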
15
Evolution as Genomic Rearrangements
Based on sequence anchor blocks
Courtesy Lisa Stubbs, Oak Ridge National Laboratory
16
Orthologous Pairs of Proteins
17
Protein-level synteny
Human chromosome 6, Mouse chromosome 17
18
Computational Gene Finding
Computational gene finding: identification of the coordinates of coding regions.
‘Clues’ differentiate coding from non-coding regions.
Cellular machinery (ribosome, spliceosome) recognizes specific signals that mark gene boundaries.
[Figure: GENE and TRANSCRIPT diagram marking the start codon (ATG), donor site (GT), acceptor site (AG), and stop codon]
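As a rough illustration of the signals named on this slide, the sketch below scans a DNA string for candidate donor (GT) and acceptor (AG) dinucleotides and for open reading frames that run from an ATG to the first in-frame stop codon. It is a toy: real gene finders score these signals statistically rather than matching them exactly, and the function names and the `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def signal_positions(seq, signal):
    """Return 0-based positions where `signal` (e.g. "GT" or "AG") occurs."""
    n = len(signal)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == signal]

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs: ATG through the first in-frame stop codon.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop, ATG included.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Step codon by codon in the reading frame set by this ATG
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs

orfs = find_orfs("CCATGAAATTTTAGGG")          # ATG AAA TTT TAG
donors = signal_positions("AGGTAAGT", "GT")   # candidate donor sites
```

Exact-match scanning like this finds far too many candidates on genomic sequence, which is why the methods on the following slides combine signals with homology or probabilistic models.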
19
Computational Gene Finding (Homology)
Comparative (Genewise, Procrustes, Sim4)
– Performs well when the homolog has strong similarity.
– Performance tapers off with decrease in sequence similarity.
– Performance is (or should be) independent of sequence composition.
– Difficult to find good homologs.
20
Full Length cDNAs: Alternate Splicing
Courtesy Terry Gaasterland, Rockefeller
21
Gene Finding (Ab Initio Methods)
Gene structure is identified by the most likely parse of the sequence through an appropriate HMM (weighted finite automaton) (ex: Genscan, Genie…).
Fairly accurate, with well understood procedures for training models and parsing.
Recent results (multi-gene examples) indicate that further improvements are desirable (Guigo’99).
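The "most likely parse" through an HMM is typically computed with the Viterbi algorithm. Below is a minimal log-space sketch on a toy two-state model (N = non-coding, C = coding). The states, probabilities, and sequence are invented for illustration and bear no relation to the trained models in Genscan or Genie, which use many more states (exons, introns, splice signals).

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for the observation string `obs`."""
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for symbol in obs[1:]:
        new_scores, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this position
            prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[prev] + log_trans[prev][s] + log_emit[s][symbol]
            ptr[s] = prev
        scores = new_scores
        backptrs.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: scores[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy model (all numbers are assumptions): coding regions are G/C-rich
states = ["N", "C"]
log_start = {"N": math.log(0.5), "C": math.log(0.5)}
log_trans = {
    "N": {"N": math.log(0.9), "C": math.log(0.1)},
    "C": {"N": math.log(0.1), "C": math.log(0.9)},
}
log_emit = {
    "N": {b: math.log(p) for b, p in {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05}.items()},
    "C": {b: math.log(p) for b, p in {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45}.items()},
}
parse = "".join(viterbi("AAAAGGGGAAAA", states, log_start, log_trans, log_emit))
```

Working in log space avoids numerical underflow on long sequences, since the products of many small probabilities become sums.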
22
1D Methods: Summary
Homology: Very specific and accurate. Can sample only abundant genes, and full-length is hard.
Ab Initio: Good sensitivity for presence (85%) but weak for exon (60%) and gene (10%); also very non-specific (20%).
Main drivers of recognition are:
– Splice site
– No stop codon in exon
– Some bias in hexamer coding frequency
Mouse vs. Human Homology (50-100 million years):
– 85% of exons in a TBlastX hit
– 85% amino acid identity in a hit
– 25% of TBlastX hits contain a true exon
23
2D: Homology (Sagot et al., Huson & Bafna)
Require gene models (splice sites + start + no-stop) in both genomes that have high homology.
[Figure labels: Human; Mouse]
Performance is better than 1D HMM with weak splice site model
24
2D HMMs:
Twinscan (Brent et al.): Given training set of known genes and evidence mask, learn HMM over {0/1}
SLAM (Pachter et al., Durbin et al.): Given training set of known genes and “correct” alignments, learn HMM over k
[Figure labels: Target; Evidence Mask (0/1); cDNA, other evidence]
25
Outcomes
Exon prediction (must get splice junctions right)
SN    63%    68%
SP    58%    66%
Gene prediction (must get every exon)
SN    15%    24%
SP    10%    14%
A lot of improvement possible?
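A sketch of how exon-level SN and SP figures like these are commonly computed. It assumes the convention usual in gene-prediction evaluation: an exon counts as correct only if both boundaries match exactly, SN is the fraction of annotated exons recovered, and SP is the fraction of predicted exons that are correct. The slide does not state its exact definitions, and the coordinates below are invented.

```python
def exon_accuracy(predicted, annotated):
    """Return (SN, SP) for exons given as (start, end) coordinate pairs."""
    pred, true = set(predicted), set(annotated)
    correct = len(pred & true)          # exact match on both splice junctions
    sn = correct / len(true) if true else 0.0
    sp = correct / len(pred) if pred else 0.0
    return sn, sp

annotated = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410), (500, 600), (700, 800)]
sn, sp = exon_accuracy(predicted, annotated)   # 2 of 3 found, 2 of 4 correct
```

Gene-level scoring is stricter still, since one wrong exon makes the whole gene count as missed, which is consistent with the much lower gene-level numbers above.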