BME 130 – Genomes Lecture 5 Genome assembly I The good old days
Administrivia
Homework 1 – on the website today, due Friday; homework policy
Student-led paper discussion – choose groups and pick a paper
Guest lecture Friday – Bob Kuhn will demo the UCSC genome browser
Genomics in the news
Gilbert C, Feschotte C (2010) Genomic Fossils Calibrate the Long-Term Evolution of Hepadnaviruses. PLoS Biol 8(9): e1000495. doi:10.1371/journal.pbio.1000495
Figure 4.10 Genomes 3 (© Garland Science 2007)
Figure 4.10 part 1 of 2 Genomes 3 (© Garland Science 2007)
Figure 4.10 part 2 of 2 Genomes 3 (© Garland Science 2007)
Sequence assembly
de novo: overlap, layout, consensus
reference-guided: reads s1 ... s6 aligned to a reference sequence
de novo sequence assembly: overlap (reads s1 ... s6)
Most CPU- and memory-demanding stage
Phrap: "banded" alignment of reads around k-mer matches; tolerates alignment mismatches at low-quality bases
Phusion: group reads sharing >= 11 k-mers of 17 bases
Celera: k-mer seed-and-extend alignment of reads
Arachne: 24-mer seed-and-extend alignment of reads
newbler: flowgram similarities (?)
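The k-mer seeding idea behind the overlap stage can be sketched in a few lines. This is a toy illustration in the spirit of the Phusion rule above (group reads that share enough k-mers), not the code of any of the assemblers listed. The function names, the toy reads, and the small values `k=5` and `min_shared=3` are assumptions chosen so the example fits on a slide; the slide's Phusion figures are 17-base k-mers and at least 11 shared. Only the forward strand is considered.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_overlaps(reads, k=5, min_shared=3):
    """Return {(read_a, read_b): shared_kmer_count} for every pair of reads
    sharing at least min_shared distinct k-mers (forward strand only)."""
    index = defaultdict(set)            # k-mer -> names of reads containing it
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    shared = defaultdict(int)           # (read, read) -> number of shared k-mers
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Hypothetical toy reads: the end of s1 matches the start of s2; s3 is unrelated.
reads = {
    "s1": "ACGTTGCAAGGCTA",
    "s2": "GCAAGGCTATTCGA",
    "s3": "TTTTTTTTTTTTTT",
}
print(candidate_overlaps(reads))
```

Indexing k-mers first means only reads that share a seed are ever compared, which is why real assemblers avoid aligning every read against every other read. Candidate pairs would still need a proper alignment to confirm the overlap.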
de novo sequence assembly: layout
Generate alignments between reads (s1 ... s6)
Find connected components
Wide range of strategies for the layout stage, many using mate-pair information
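Finding connected components from pairwise alignments can be sketched with a union-find structure. This covers only the grouping step named on the slide, not a full layout algorithm, and it ignores mate-pair information. The overlap list below is hypothetical and is not read off the slide's diagram.

```python
def connected_components(reads, overlaps):
    """Group reads into connected components of the overlap graph.

    reads:    iterable of read names
    overlaps: iterable of (read_a, read_b) pairs that align to each other
    """
    parent = {r: r for r in reads}

    def find(r):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a     # merge the two groups

    groups = {}
    for r in reads:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


reads = ["s1", "s2", "s3", "s4", "s5", "s6"]
# Hypothetical overlaps, for illustration only.
overlaps = [("s1", "s2"), ("s2", "s3"), ("s4", "s6"), ("s5", "s1")]
print(connected_components(reads, overlaps))
```

Each component is a set of reads that could end up in the same contig; ordering and orienting the reads inside a component is the harder part of layout.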
de novo sequence assembly: consensus (reads s1 ... s6)
PHRAP: consensus base is the base with the highest quality score; the quality score for the position is based on the quality scores of all reads
PCAP/CAP3: sum the quality scores for each base and take the base with the highest sum; quality score for the position = highest sum – all other sums
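The two consensus rules on this slide can be compared on a single alignment column. The sketch below follows the slide's wording only; it is not the actual PHRAP or CAP3 code. The function names and the example column are assumptions, and the PHRAP-style function returns only the base because the slide does not say how the position's quality is derived.

```python
from collections import defaultdict


def phrap_style_base(column):
    """column: list of (base, quality) pairs, one per read covering the position.
    Return the base carried by the single highest-quality read."""
    base, _quality = max(column, key=lambda bq: bq[1])
    return base


def cap3_style_call(column):
    """Sum qualities per base, take the base with the highest sum, and score
    the position as that sum minus the sums of all other bases."""
    sums = defaultdict(int)
    for base, quality in column:
        sums[base] += quality
    best = max(sums, key=sums.get)      # ties go to the base seen first
    score = sums[best] - sum(s for b, s in sums.items() if b != best)
    return best, score


# Hypothetical column: two medium-quality A calls and one high-quality G call.
column = [("A", 30), ("A", 20), ("G", 40)]
print(phrap_style_base(column))   # G: the single best read wins
print(cap3_style_call(column))    # ('A', 10): A sums to 50, G to 40
```

The example is chosen so the two rules disagree: one confident read outvotes two weaker reads under the first rule but not under the second.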
Reference-guided sequence assembly
Advantages: (much) faster; (much) less memory
Disadvantages: indels/rearrangements; lack of a closely related reference; bias towards reference similarity
Pop M et al., "Comparative Genome Assembly" Brief Bioinform. 2004 Sep;5(3):237-48.
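A minimal sketch of the reference-guided idea: place each read where it matches the reference best, then call a consensus per reference position. Everything here is an assumption made for illustration (brute-force placement by mismatch count, majority vote, `N` for uncovered positions, the toy sequences); real tools use indexed seeds and gapped alignment. Because the placement allows no gaps, the sketch also shows why indels and rearrangements are listed as disadvantages.

```python
from collections import Counter


def best_offset(read, reference):
    """Reference offset where the read aligns with the fewest mismatches (no indels)."""
    def mismatches(offset):
        window = reference[offset:offset + len(read)]
        return sum(a != b for a, b in zip(read, window))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def reference_guided_consensus(reads, reference):
    """Pile reads onto the reference and take the majority base per position."""
    columns = [Counter() for _ in reference]
    for read in reads:
        offset = best_offset(read, reference)
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


# Hypothetical data: the sample differs from the reference at one position (G -> A).
reference = "ACGTACGGTCAT"
reads = ["ACGTAC", "TACGAT", "CGATCA", "GATCAT"]
print(reference_guided_consensus(reads, reference))
```

No read-versus-read comparison is needed, which is where the speed and memory savings come from; the cost is that reads too different from the reference are placed badly or not at all.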
Why is this called a sequence gap and not a physical gap? Figure 4.11a Genomes 3 (© Garland Science 2007)
Closing a physical gap means finding a physical clone to sequence that will span the gap
Genomic DNA is template for this PCR Figure 4.11b Genomes 3 (© Garland Science 2007)
Chromosome walking (is slow) Figure 4.12 Genomes 3 (© Garland Science 2007)
PCR from clone library Insert 1 connects to who? Figure 4.13 Genomes 3 (© Garland Science 2007)
Figure 4.14 Genomes 3 (© Garland Science 2007)
Figure 4.15 Genomes 3 (© Garland Science 2007)
Figure 4.15a Genomes 3 (© Garland Science 2007)
Figure 4.15b Genomes 3 (© Garland Science 2007)
Figure 4.15c Genomes 3 (© Garland Science 2007)
Figure 4.15d Genomes 3 (© Garland Science 2007)
Assembly can be validated by mate-pair information Figure 4.16 Genomes 3 (© Garland Science 2007)
Figure 4.16a Genomes 3 (© Garland Science 2007)
Figure 4.16b Genomes 3 (© Garland Science 2007)
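Validating an assembly with mate pairs comes down to checking that the two reads from each clone land in the assembly at the expected distance and in the expected orientation. The sketch below is one hedged way to express that check. The tuple layout, the function name, the "+ then -" orientation convention, and the insert size of 2000 with a tolerance of 300 are all assumptions for illustration.

```python
def validate_mate_pairs(pairs, mean_insert, tolerance):
    """Return a list of (pair_name, problem) for mate pairs that contradict the assembly.

    pairs: list of (name, left_start, left_strand, right_end, right_strand),
           with coordinates on the assembled contig.
    A consistent pair has the left read on '+', the right read on '-',
    and spans mean_insert +/- tolerance bases.
    """
    problems = []
    for name, left_start, left_strand, right_end, right_strand in pairs:
        span = right_end - left_start
        if (left_strand, right_strand) != ("+", "-"):
            problems.append((name, "orientation"))
        elif abs(span - mean_insert) > tolerance:
            problems.append((name, "distance"))
    return problems


# Hypothetical placements on one contig.
pairs = [
    ("p1", 100, "+", 2100, "-"),   # span 2000: consistent
    ("p2", 500, "+", 5500, "-"),   # span 5000: too far apart
    ("p3", 900, "+", 2950, "+"),   # both reads on the same strand
]
print(validate_mate_pairs(pairs, mean_insert=2000, tolerance=300))
```

A cluster of distance or orientation violations in one region suggests a misassembly there, for example a collapsed repeat or a wrongly joined contig.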
Figure 4.17a Genomes 3 (© Garland Science 2007)
Figure 4.17b Genomes 3 (© Garland Science 2007)
Figure 4.18 Genomes 3 (© Garland Science 2007)