Published by Evan McDowell. Modified over 9 years ago.
1
Genome Reconstruction: A Puzzle With a Billion Pieces Phillip Compeau & Pavel Pevzner, University of California, San Diego
2
Genome Reconstruction: A Puzzle With a Billion Pieces Outline 1. Introduction to Genome Sequencing 2. The Newspaper Problem 3. DNA Chips: A First Shot at Sequencing with Short Reads 4. Two Mathematical Detours 5. Introduction to Graph Theory 6. Euler’s Theorem 7. ECP vs. HCP and Algorithmic Complexity 8. From Euler and Hamilton to Fragment Assembly 9. De Bruijn and a Final Solution to Fragment Assembly 10. Generalizing Fragment Assembly
3
Genome Reconstruction: A Puzzle With a Billion Pieces Section 1: Introduction to Genome Sequencing
4
Genome Reconstruction: A Puzzle With a Billion Pieces What Is Genome Sequencing? A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A, T, G, and C. A human genome has roughly 3 billion nucleotides. Genome sequencing is the process of determining the sequence of nucleotides that make up a genome. …CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGA TCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACA GATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATG ATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGA TGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTAT GACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGA CTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACT GCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGA CTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTC AGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…
5
Genome Reconstruction: A Puzzle With a Billion Pieces What Is Genome Sequencing? Different people have slightly different genomes: all humans share 99.9% of the same genetic code. The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc. CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA TCGACTACAGATAAAACATGCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA TCGACTACAGATGAAACATGCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
6
Genome Reconstruction: A Puzzle With a Billion Pieces Species Sequencing vs. Individual Genome Sequencing Species Sequencing: Determine the “consensus genome” of an entire species.
7
Genome Reconstruction: A Puzzle With a Billion Pieces Species Sequencing vs. Individual Genome Sequencing Individual Sequencing: Determine how an individual differs from its species.
8
Genome Reconstruction: A Puzzle With a Billion Pieces Why Would We Want to Sequence a Genome? Species genome sequencing: Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for brain development). Reveal evolutionary relationships between species. Determine the genetic makeup of our evolutionary ancestors.
9
Genome Reconstruction: A Puzzle With a Billion Pieces Why Would We Want to Sequence a Genome? Individual genome sequencing: Unearth the genetic basis of many diseases. Forensic applications. Example: In 2010, 6-year-old Nicholas Volker became the first human being to be saved because of genome sequencing. Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries. Genome sequencing revealed a rare mutation in a gene linked to a defect in his immune system. This led doctors to use advanced immunotherapy, which saved the child.
10
Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Genome Sequencing Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods. 1980: They share the Nobel Prize in Chemistry. Still, their sequencing methods were too expensive for large genomes: at a cost of $1 per nucleotide, it would cost $3 billion to sequence the roughly 3 billion nucleotides of the human genome. [Photos: Walter Gilbert, Frederick Sanger]
11
Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Genome Sequencing 1990: The public Human Genome Project, later headed by Francis Collins, sets out to sequence the human genome. 1998: Craig Venter founds Celera Genomics, a private firm, with the same goal. [Photos: Francis Collins, Craig Venter]
12
Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Mammalian Genome Sequencing 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.
13
Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Mammalian Genome Sequencing 2000s: Many more mammalian genomes are sequenced.
14
Genome Reconstruction: A Puzzle With a Billion Pieces The Arrival of Personal Genomics 2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude. 2010: The market for sequencing machines takes off. Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000. Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month. Beijing Genomics Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center. 23andMe offers partial genome sequencing for $499. Many universities introduce new courses in which students study their own genomes.
15
Genome Reconstruction: A Puzzle With a Billion Pieces The Future of Genome Sequencing 2010s?: Genome sequencing will hopefully continue to bloom. The $1,000 human genome may arrive as early as 2012. Hopefully, sequencing an individual genome will soon become as routine as an X-ray.
16
Genome Reconstruction: A Puzzle With a Billion Pieces What Makes Genome Sequencing So Difficult? When we read a book, we can read the entire book one letter at a time from the beginning to the end. However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces. Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads. But we have no idea which genomic positions these reads come from! We must figure out how to put the reads back together to assemble a genome.
17
Genome Reconstruction: A Puzzle With a Billion Pieces Section 2: The Newspaper Problem and Genome Sequencing
18
Genome Reconstruction: A Puzzle With a Billion Pieces The Newspaper Problem
24
Genome Reconstruction: A Puzzle With a Billion Pieces The Newspaper Problem as an “Overlap Puzzle” The newspaper problem is not the same as a jigsaw puzzle: We have multiple copies of the same edition of a newspaper. Plus, some pieces of paper got blown to bits in the explosion. Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said. This gives us a giant overlap puzzle!
25
Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing is Harder than Newspaper Problem In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper). However, the “language” of DNA remains largely unknown.
26
Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing is Harder than Newspaper Problem There are lots of repeated substrings in every genome (roughly 50% of the human genome is formed by repeats). Example: GCTT occurs 5 times in the following: AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult to solve (even with just 16 pieces).
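The repeat example above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper name `count_occurrences` is an assumption, and it simply counts every position at which the pattern starts, so overlapping occurrences would also be counted.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count the positions in text at which pattern starts (overlaps allowed)."""
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )


# The example string from the slide.
example = "AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG"
gctt_count = count_occurrences(example, "GCTT")
print(gctt_count)
```

For the slide's string, GCTT starts at five positions. In a real genome the same kind of count, applied to much longer repeats, is what makes overlaps between reads ambiguous.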
27
Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing a Genome: Lab + Computation Read Generation (Experimental): Generate many reads from multiple copies of the same genome. Fragment Assembly (Computational): Use these reads to algorithmically put the genome back together.
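To make the read generation step concrete, here is a small simulation sketch. It is an idealized assumption, not the laboratory procedure: reads are error-free substrings sampled uniformly at random, and the function name `generate_reads`, its parameters, and the short genome string are all chosen for illustration.

```python
import random


def generate_reads(genome: str, read_length: int, num_reads: int, seed: int = 0) -> list[str]:
    """Sample error-free reads of a fixed length from random genome positions.

    The positions are thrown away, mirroring the fact that we do not know
    where in the genome each read came from.
    """
    last_start = len(genome) - read_length
    if last_start < 0:
        raise ValueError("read_length must not exceed the genome length")
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_length] for s in starts]


# A short illustrative genome; real genomes are millions to billions of letters long.
toy_genome = "GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC"
toy_reads = generate_reads(toy_genome, read_length=10, num_reads=20)
```

Fragment assembly is then the inverse task: given only `toy_reads`, recover `toy_genome`.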
32
Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing a Genome: Illustration [Figure: multiple (unsequenced) genome copies → read generation → reads → fragment assembly → sequenced genome … GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC …]
33
Genome Reconstruction: A Puzzle With a Billion Pieces Section 3: DNA Chips: A First Shot at Sequencing with Short Reads
34
Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: From an Idea to a New Industry 1989: Radoje Drmanac, Andrey Mirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation. Key Idea: Generate all k-mers from the genome in the hope that they can be assembled to reconstruct the genome. k-mer: A string of length k (in an alphabet of 4 nucleotides). 1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.” 2000: Arrays are a multi-billion dollar industry. [Photos: Southern, Mirzabekov, Drmanac]
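As a rough illustration of the key idea, the set of k-mers contained in a string can be listed with a sliding window. The function name `kmer_composition` and the fragment `ATGGCG` are assumptions made for this sketch; they do not come from the slides.

```python
def kmer_composition(text: str, k: int) -> list[str]:
    """Return the distinct k-mers of text in sorted order.

    Sorting discards the order in which the k-mers occur, which is
    exactly the information a DNA array does not give us.
    """
    return sorted({text[i:i + k] for i in range(len(text) - k + 1)})


toy_fragment = "ATGGCG"  # hypothetical fragment
toy_kmers = kmer_composition(toy_fragment, 3)
print(toy_kmers)
```

For the hypothetical fragment this yields the four 3-mers ATG, GCG, GGC, and TGG.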
35
Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Implementation 1. Synthesize a distinct k-mer in each of the 4^k cells of the array. 2. Cover the array with multiple copies of a fluorescently labeled unknown DNA fragment. 3. The DNA will hybridize with a k-mer if it contains the complement of that k-mer. 4. Use a spectroscope to determine which sites emit light. The complements of these sites reveal the k-mers in the unknown DNA fragment: these are our reads!
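The four steps can be mimicked in a few lines. This is an idealized sketch under stated assumptions: hybridization is treated as perfect and noise-free, "complement" is interpreted as the reverse complement of a probe, and the function names and the fragment `ATGGCGTGCAAT` are hypothetical choices for illustration.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer: str) -> str:
    """Complement every nucleotide and read the result in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def all_kmers(k: int) -> list[str]:
    """Step 1: one probe per cell, 4**k cells in total."""
    return ["".join(letters) for letters in product("ACGT", repeat=k)]


def lit_probes(fragment: str, k: int) -> list[str]:
    """Steps 2-4: a probe emits light if the fragment contains its complement."""
    return [p for p in all_kmers(k) if reverse_complement(p) in fragment]


def reads_from_probes(probes: list[str]) -> list[str]:
    """Translate the lit probes back into k-mers of the unknown fragment."""
    return sorted(reverse_complement(p) for p in probes)


fragment = "ATGGCGTGCAAT"  # hypothetical unknown fragment
probes = lit_probes(fragment, 3)
reads = reads_from_probes(probes)
```

With k = 3 the array has 64 cells; for this fragment ten of them light up, and taking complements recovers the ten 3-mers that the fragment contains.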
36
Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Illustration
37
Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Example What are our reads? [Figure: a DNA array with 4^3 = 64 cells, one for each 3-mer from AAA to TTT.]
38
Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Example What are our reads? [Figure: ten cells of the array emit light: CAC, CGC, TGC, CAT, CCA, GCA, GCC, ACG, TTG, and ATT.] The probe CAT hybridizes with its complement, ATG. So the 3-mer ATG must occur in the genome!
45
Genome Reconstruction: A Puzzle With a Billion Pieces Red 3-mers Must Occur in the Genome What are our reads? Replacing each fluorescent probe with its complement gives the reads: CAC → GTG, CGC → GCG, CAT → ATG, TGC → GCA, ACG → CGT, ATT → AAT, CCA → TGG, GCA → TGC, GCC → GGC, TTG → CAA.
64
Genome Reconstruction: A Puzzle With a Billion Pieces From Biological Data to Computational Problem Reads: GTG, GCG, GCA, ATG, TGG, TGC, GGC, CGT, CAA, AAT. Aim: Construct a shortest possible genome containing all our reads. This is now a computational problem!
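Stating the aim precisely helps: a candidate genome is acceptable if every read occurs in it as a substring, and among acceptable candidates we want a shortest one. The sketch below only verifies a candidate; it does not search for one. The two candidate strings are assumptions worked out by hand for illustration and do not appear on the slides.

```python
def contains_all_reads(genome: str, reads: list[str]) -> bool:
    """Check that every read occurs somewhere in the candidate genome."""
    return all(read in genome for read in reads)


# The ten 3-mer reads recovered from the array.
reads = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]

# Hypothetical candidates, not taken from the slides.
candidate_a = "ATGGCGTGCAAT"
candidate_b = "ATGCGTGGCAAT"

# A string of length n has only n - k + 1 windows of length k, so n distinct
# k-mers cannot fit in fewer than n + k - 1 letters.
lower_bound = len(set(reads)) + len(reads[0]) - 1
```

Both candidates contain all ten reads and both have length 12, which equals the lower bound, so neither can be shortened. The example also shows that a shortest genome need not be unique.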
65
Genome Reconstruction: A Puzzle With a Billion Pieces Section 4: Two Mathematical Detours
66
Genome Reconstruction: A Puzzle With a Billion Pieces The Bridges of Königsberg The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.
67
Genome Reconstruction: A Puzzle With a Billion Pieces The Bridges of Königsberg They wondered if they could walk through the city, cross each bridge (blue) exactly once, and return to where they started.
68
Genome Reconstruction: A Puzzle With a Billion Pieces The Bridges of Königsberg 1735: Leonhard Euler develops an approach to answer this question for any city, even for a “city” with a million islands. We will soon discuss Euler’s method as well as how it applies to genome sequencing. [Portrait: Leonhard Euler]
69
Genome Reconstruction: A Puzzle With a Billion Pieces The Icosian Game Over a century passes… 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 “islands” connected by “bridges.” Goal: find a walk that visits every island exactly once and returns to where it started. [Images: William Hamilton; the Icosian Game]
70
Genome Reconstruction: A Puzzle With a Billion Pieces Similar Problems with Very Different Fates These two stories have something in common: Find a walk that uses every bridge once (Königsberg Bridge Problem). Find a walk that visits every island once (Icosian Game). However, while Euler solved the first problem (even for a city with a million bridges), no efficient method is known for the second problem, even for a city with only a thousand islands. But where are the genomes???
71
Genome Reconstruction: A Puzzle With a Billion Pieces Section 5: Introduction to Graph Theory
72
Genome Reconstruction: A Puzzle With a Billion Pieces Graphs A graph is a network composed of two sets of objects: Vertices: each vertex is represented by a point. Edges: each edge is represented by a segment connecting two vertices. Graph theory can be applied to all kinds of different problems. Transportation networks Disease epidemics Computer viruses spreading through the internet. And, yes…genome sequencing!
73
Genome Reconstruction: A Puzzle With a Billion Pieces Königsberg Bridges Graph For the Königsberg Bridge Problem, we create a graph: Vertices = 4 land masses of the city Edges = 7 bridges connecting land areas Note: We don’t need to worry about the exact placement of vertices or the shape of bridges.
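One possible way to store this graph is as a list of edges, with a repeated pair wherever two bridges join the same land masses. The vertex names A to D are labels invented for this sketch, and the bridge layout follows the commonly cited historical arrangement rather than anything stated on the slide.

```python
from collections import Counter

# Assumed labels: A = central island, B = north bank, C = south bank,
# D = eastern land mass. Each pair is one bridge.
BRIDGES = [
    ("A", "B"), ("A", "B"),  # two bridges from the island to the north bank
    ("A", "C"), ("A", "C"),  # two bridges from the island to the south bank
    ("A", "D"),
    ("B", "D"),
    ("C", "D"),
]


def degrees(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Count how many bridge ends touch each land mass."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


bridge_degrees = degrees(BRIDGES)
print(bridge_degrees)
```

Only the pairs matter here; as the note on the slide says, the exact placement of vertices and the shape of the bridges play no role.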
74
Genome Reconstruction: A Puzzle With a Billion Pieces Icosian Game Graph For the Icosian Game, we create a graph: Vertices = islands Edges = bridges connecting the islands
75
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G.
76
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G. “Here I go!”
77
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G. “…He wakes up in the morning…”
78
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G. “…goes to visit his mommy…”
79
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G. “…when all the little ants are marching…”
80
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G. “…they all do it the same way…”
81
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G. “Oh no! I’m back where I started!”
82
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Two questions: 1.Is there a cycle of G in which the ant walks through each edge exactly once? 2.Is there a cycle of G in which the ant walks through each vertex exactly once? “???!!!”
83
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Two questions: 1.Is there a cycle of G in which the ant walks through each edge exactly once? Eulerian cycle 2.Is there a cycle of G in which the ant walks through each vertex exactly once? Hamiltonian cycle “I wish someone would name a cycle after me…I’m the one doing all the walking here!”
84
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles An Eulerian cycle is a cycle that traverses each edge exactly once. A graph containing such a cycle is called Eulerian. If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph. However, no such cycle exists.
85
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles An Eulerian cycle is a cycle that traverses each edge exactly once. A graph containing such a cycle is called Eulerian. If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph. However, no such cycle exists. If we add two more edges, there will be such a cycle; see it?
94
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles An Eulerian cycle is a cycle that traverses each edge exactly once. A graph containing such a cycle is called Eulerian. If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph. However, no such cycle exists. If we add two more edges, there will be such a cycle; see it? [Figure: the nine edges of the extended graph, numbered 1 through 9.]
95
Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. For example, the graph corresponding to the Icosian game is Hamiltonian. This means that the Icosian game has a solution!
116
Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. [Figure: the 20 vertices of the Icosian Game graph, numbered 1 through 20.]
117
Genome Reconstruction: A Puzzle With a Billion Pieces Finding Eulerian Cycles vs Hamiltonian Cycles Given a graph G, we now have two questions that we can program a computer to answer about G. Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian. Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.
118
Genome Reconstruction: A Puzzle With a Billion Pieces Section 6: Euler’s Theorem
119
Genome Reconstruction: A Puzzle With a Billion Pieces Euler’s Theorem We will now discuss how Euler solved the Königsberg Bridge Problem. You might guess: He used graph theory! This is not entirely accurate. A better statement would be: He invented graph theory!
120
Genome Reconstruction: A Puzzle With a Billion Pieces Directed Graphs Directed Graph: A graph in which each edge has a direction (represented by an arrow). You might like to think of directed edges as “one-way bridges.” [Figure: an undirected graph beside a directed graph.]
121
Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles in Directed Graphs An Eulerian cycle in a directed graph is simply a cycle that travels down all the edges in the correct direction. A directed graph is Eulerian if it contains an Eulerian cycle. Is this graph Eulerian? Why?
122
Genome Reconstruction: A Puzzle With a Billion Pieces Balanced Graphs indegree(v) = the number of edges leading into vertex v. outdegree(v) = the number of edges leading out of v. A graph is balanced if indegree(v) = outdegree(v) for every vertex v. Label each vertex v with (indegree(v), outdegree(v)). This graph isn’t balanced since some vertices don’t have equal indegree and outdegree. [Figure labels: (1, 2) (2, 1) (1, 0) (2, 1) (1, 1) (0, 2) (1, 1)]
123
Genome Reconstruction: A Puzzle With a Billion Pieces Balanced Graphs indegree(v) = the number of edges leading into vertex v. outdegree(v) = the number of edges leading out of v. A graph is balanced if indegree(v) = outdegree(v) for every vertex v. Label each vertex v with (indegree(v), outdegree(v)). Adding some edges makes the graph balanced. [Figure labels: (2, 2) (1, 1) (2, 2) (1, 1) (2, 2) (1, 1)]
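The labeling described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `degree_labels` and `is_balanced` and the two toy graphs are assumptions made for the example, not part of the original slides, and a directed graph is assumed to be given as a list of (from, to) pairs.

```python
from collections import Counter

def degree_labels(edges):
    """Map each vertex v to its (indegree(v), outdegree(v)) label."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1   # the edge leaves u
        indeg[v] += 1    # the edge enters v
    vertices = set(indeg) | set(outdeg)
    return {v: (indeg[v], outdeg[v]) for v in vertices}

def is_balanced(edges):
    """True when indegree(v) == outdegree(v) for every vertex v."""
    return all(i == o for i, o in degree_labels(edges).values())

# Hypothetical toy graphs (not the graph drawn on the slide).
unbalanced = [("a", "b"), ("b", "c"), ("a", "c")]
balanced = [("a", "b"), ("b", "c"), ("c", "a")]

print(degree_labels(unbalanced))
print(is_balanced(unbalanced), is_balanced(balanced))
```

Only vertices that touch at least one edge receive a label in this sketch.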
124
Genome Reconstruction: A Puzzle With a Billion Pieces Euler’s Theorem Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced. A graph is connected if for every pair of vertices {u, v}, an ant can travel either from u to v or from v to u. Connected + Balanced = Eulerian. [Figure: a graph that is not connected, beside the balanced example with labels (2, 2) (1, 1) (2, 2) (1, 1) (2, 2) (1, 1).]
125
Genome Reconstruction: A Puzzle With a Billion Pieces Section 7: ECP vs. HCP and Algorithmic Complexity
126
Genome Reconstruction: A Puzzle With a Billion Pieces Solving the ECP By Euler’s Theorem, to determine whether a connected graph G contains an Eulerian cycle, we only need to check if G is balanced. So we go to each vertex v and perform one simple check: does indegree(v) equal outdegree(v)? If every vertex is balanced, then G must contain an Eulerian cycle. If some vertex is not balanced, then G cannot contain an Eulerian cycle.
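As a rough sketch of the test on this slide, the Python below checks balance at every vertex and also checks connectivity, since Euler's Theorem is stated for connected graphs. The name `has_eulerian_cycle` and the edge-list input format are assumptions for illustration. Connectivity is checked while ignoring edge directions, which is enough here because a balanced graph that is connected in this weaker sense can always be traversed in the legal directions as well.

```python
from collections import defaultdict, deque

def has_eulerian_cycle(edges):
    """Decide whether a directed graph (list of (u, v) pairs) is Eulerian."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    neighbors = defaultdict(set)          # adjacency ignoring direction
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    vertices = set(neighbors)
    if not vertices:
        return False                      # no edges, nothing to traverse
    # Balance check: one comparison per vertex.
    if any(indeg[v] != outdeg[v] for v in vertices):
        return False
    # Connectivity check: breadth-first search from any vertex.
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        x = queue.popleft()
        for y in neighbors[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen == vertices

print(has_eulerian_cycle([("a", "b"), ("b", "c"), ("c", "a")]))
```

Both checks visit each edge a constant number of times, so the whole test takes time proportional to the size of the graph.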
127
Genome Reconstruction: A Puzzle With a Billion Pieces Connected + Balanced = Eulerian Recall our example directed graph from before. As first drawn, the graph is not balanced, and so it clearly isn’t Eulerian. Adding the edges to make the graph balanced will mean that an Eulerian cycle must exist. One vital question remains: Where did this Eulerian cycle come from? [Figure: the unbalanced graph with labels (1, 2) (2, 1) (1, 0) (2, 1) (1, 1) (0, 2) (1, 1), then the balanced graph with its edges numbered 1 to 11 along an Eulerian cycle.]
130
Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes. The ant cannot walk along any edge that has been previously traversed. The ant must always walk along edges in the legal direction. At each step, we update the remaining indegree and outdegree of each vertex. Cycle! But not Eulerian yet… [Figure: vertex labels after the first walk: (2, 2) (1, 1) (0, 0) (1, 1) (0, 0)]
135
Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s cut out the cycle that the ant has found. Next delete vertices that are no longer connected to anything. [Figure: remaining vertex labels (2, 2) (1, 1)]
139
Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Again, let the ant walk through the graph however it chooses. We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an edge leading out of any vertex that it enters. The only place the walk can end is the vertex where it began. Ant: “I really don’t see how this is going to give us an Eulerian cycle in the original graph…I knew I shouldn’t have left the house this morning!” Cycle! But still not Eulerian… [Figure: remaining vertex labels (1, 1) (0, 0) (1, 1)]
144
Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s trim out this cycle one more time. The ant is stranded, so let’s move it to a vertex. Ant: “Hmph! Dragged halfway across the screen…I guess I don’t have any say in the matter…” Now there’s only one way that the ant can walk through the graph. Cycle! And Eulerian to boot…so we have run out of edges. What do we do now? Ant: “Yes! What DO we do now?” [Figure: the last remaining vertex labels drop to (0, 0).]
152
Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s bring back our original graph. Highlight the three cycles that the ant found. Start at the ant’s original position, and follow the green cycle. Cycle formed: we can continue along the blue cycle. Cycle formed; however, we now have no new edges to follow! Ant: “???” [Figure: edges labeled 1 through 8 in the order walked.]
164
Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it. Ant: “Backtracking? But I’m not evolved to walk backwards!” Success! Now let’s follow the orange cycle. Rejoin the blue cycle… Ant: “I smell something good!” And we have the same Eulerian cycle from before! Ant: “Yay! Now can I go home please?” [Figure: edges relabeled 1 through 11 along the finished Eulerian cycle.]
172
Genome Reconstruction: A Puzzle With a Billion Pieces What’s the Big Deal? The great thing about this method is that it can be easily generalized to any balanced graph to give an Eulerian cycle. “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?” Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…
173
Genome Reconstruction: A Puzzle With a Billion Pieces What’s the Big Deal? More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer. Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!
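To make the efficiency claim concrete, here is a minimal sketch of one standard way to implement the walk-and-splice idea, usually called Hierholzer's algorithm. It is a stack-based variant rather than a literal transcription of the ant's cut-and-paste steps: the ant walks forward while unused edges remain, and backtracks (recording vertices) when it runs out. The function name `eulerian_cycle` and the edge-list input are assumptions for illustration, and the code assumes the input graph is already known to be connected and balanced.

```python
from collections import defaultdict

def eulerian_cycle(edges, start=None):
    """Return an Eulerian cycle as a list of vertices (first == last).

    Assumes the directed graph given by `edges` is connected and balanced.
    """
    out_edges = defaultdict(list)
    for u, v in edges:
        out_edges[u].append(v)
    if start is None:
        start = edges[0][0]
    stack = [start]        # the ant's current walk
    cycle = []             # finished part of the cycle, built in reverse
    while stack:
        v = stack[-1]
        if out_edges[v]:
            # An unused edge leaves v: walk along it and mark it used.
            stack.append(out_edges[v].pop())
        else:
            # No unused edges at v: backtrack and record v.
            cycle.append(stack.pop())
    cycle.reverse()
    return cycle

# Hypothetical toy graph: two triangles sharing vertex "a".
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "e"), ("e", "a")]
print(eulerian_cycle(toy))
```

Each edge is pushed once and popped once, so the running time grows in proportion to the number of edges, which is what makes graphs with billions of edges feasible.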
174
Genome Reconstruction: A Puzzle With a Billion Pieces What’s the Big Deal? “Yeah, but computers are supermachines! They don’t really need 300-year-old mathematics to help them solve problems. Aren’t they going to take over the world anyway?” So let’s examine the case of finding a Hamiltonian cycle…
175
Genome Reconstruction: A Puzzle With a Billion Pieces Searching for an Efficient Algorithm for HCP Key Point: No one has ever found a similar efficient test to determine whether a graph is Hamiltonian. Of course, we could examine every possible (ant) walk through the graph to solve the HCP. However, this brute force approach is just not efficient: there are more walks through a graph on just 1,000 vertices than there are atoms in the universe!
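A brute-force search of the kind described here might look like the sketch below. It is an illustration under assumptions: instead of enumerating every walk, it tries every ordering of the vertices, and the names `hamiltonian_cycle_brute_force` and `orderings_to_try` are made up for the example. The comparison at the end uses 10**80, a commonly quoted rough estimate for the number of atoms in the observable universe.

```python
from itertools import permutations
from math import factorial

def hamiltonian_cycle_brute_force(vertices, edges):
    """Try every vertex ordering; return a Hamiltonian cycle or None."""
    edge_set = set(edges)
    vertices = list(vertices)
    if not vertices:
        return None
    first, rest = vertices[0], vertices[1:]
    for order in permutations(rest):
        cycle = [first, *order, first]
        # Accept the ordering only if every consecutive pair is an edge.
        if all((a, b) in edge_set for a, b in zip(cycle, cycle[1:])):
            return cycle
    return None

def orderings_to_try(n):
    """Number of orderings examined in the worst case for n vertices."""
    return factorial(n - 1)

square = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(hamiltonian_cycle_brute_force([1, 2, 3, 4], square))
print(orderings_to_try(1000) > 10**80)
```

The search is fine for a handful of vertices, but the number of orderings grows factorially, which is why it is hopeless at genome scale.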
176
Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems In fact, the HCP has been classified as NP-Complete. In layman’s terms, this means that the HCP belongs to a collection containing thousands of computational problems for which no one knows a method that solves large input data sets quickly. NP-Complete problems are all equivalent to each other: find an efficient solution to one, and you have an efficient solution to them all.
177
Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems Attempting to solve any NP-Complete problem is difficult. “I can't find an efficient algorithm, I guess I'm just too dumb.” (Cartoon from Garey and Johnson, Computers and Intractability, 1979.)
178
Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems Attempting to solve any NP-Complete problem is difficult. The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist. “I can't find an efficient algorithm, because no such algorithm is possible.” (Cartoon from Garey and Johnson, Computers and Intractability, 1979.)
179
Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems Attempting to solve any NP-Complete problem is difficult. The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist. The present state of affairs is somewhere in between. “I can't find an efficient algorithm, but neither can all these famous people.” (Cartoon from Garey and Johnson, Computers and Intractability, 1979.)
180
Genome Reconstruction: A Puzzle With a Billion Pieces The NP-Completeness of the HCP The question of whether or not NP-Complete problems (including the HCP) can be solved efficiently is one of seven Millennium Problems in mathematics. Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will get $1 million. However, if you become a mathematician, odds are that you are not in it for the $$$...recently, Grigory Perelman solved one of these problems but turned down the prize. Grigory Perelman, True Legend
181
Genome Reconstruction: A Puzzle With a Billion Pieces Section 8: From Euler and Hamilton to Fragment Assembly
182
Genome Reconstruction: A Puzzle With a Billion Pieces Simplifying Assumptions for Fragment Assembly 1. Every k-mer occurring in the genome is generated by some read. 2. Reads are error-free. 3. Every k-mer occurring in the genome occurs exactly once. 4. The underlying genome consists of a single circular-shaped chromosome. Note: In the final section, we will relax these assumptions.
183
Genome Reconstruction: A Puzzle With a Billion Pieces First Try: The Graph H Create a vertex for every read detected by our array. Reads: GTG GCG GCA ATG TGG TGC GGC CGT CAA AAT
195
Genome Reconstruction: A Puzzle With a Billion Pieces First Try: The Graph H Create a vertex for every k-mer detected by our array. Prefix: First k – 1 nucleotides of a k-mer (CA for CAA). Suffix: Last k – 1 nucleotides of a k-mer (AA for CAA). Different 3-mers may share a prefix/suffix: TG is the suffix of ATG and CTG and the prefix of TGA. Vertices: ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG
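The prefix and suffix definitions translate directly into string slices. The helper names `prefix` and `suffix` below are chosen for illustration only.

```python
def prefix(kmer):
    """First k - 1 nucleotides of a k-mer."""
    return kmer[:-1]

def suffix(kmer):
    """Last k - 1 nucleotides of a k-mer."""
    return kmer[1:]

print(prefix("CAA"), suffix("CAA"))
# ATG, TGA and CTG all share the 2-mer TG as a prefix or suffix.
print(suffix("ATG"), prefix("TGA"), suffix("CTG"))
```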
196
Genome Reconstruction: A Puzzle With a Billion Pieces First Try: The Graph H As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w. Vertices: ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG
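The edge rule for H can be sketched as follows. The function name `overlap_graph` and the dictionary-of-lists representation are assumptions for illustration; the reads are the ten 3-mers used in the slides.

```python
def overlap_graph(reads):
    """Build H: an edge v -> w whenever suffix(v) == prefix(w)."""
    return {v: [w for w in reads if v[1:] == w[:-1]] for v in reads}

reads = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]
H = overlap_graph(reads)
for v, targets in H.items():
    print(v, "->", " ".join(targets))
```

This sketch compares every pair of reads, so it takes time quadratic in the number of reads; it is meant to show the rule, not to scale.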
207
Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles in H Here we have a Hamiltonian cycle in H: ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG. Walking along the cycle and adding the last nucleotide of each new 3-mer spells ATGGCGTGCAATG. Because the genome is circular, the final ATG overlaps the starting ATG, so we drop the repeated end. Genome: ATGGCGTGCA
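Reading the genome off a Hamiltonian cycle can be sketched like this. The function names `spell_path` and `circular_genome` are assumptions for illustration, and the cycle is assumed to be given as a list of k-mers whose first and last entries are the same vertex.

```python
def spell_path(cycle):
    """Glue consecutive k-mers together, one new nucleotide per step."""
    text = cycle[0]
    for prev, kmer in zip(cycle, cycle[1:]):
        if prev[1:] != kmer[:-1]:
            raise ValueError(f"{prev} does not overlap {kmer}")
        text += kmer[-1]
    return text

def circular_genome(cycle):
    """Drop the wrapped-around end: one nucleotide per distinct k-mer."""
    return spell_path(cycle)[: len(cycle) - 1]

cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG",
         "TGC", "GCA", "CAA", "AAT", "ATG"]
print(spell_path(cycle))
print(circular_genome(cycle))
```

Under the assumption that every k-mer occurs exactly once, the circular genome has exactly as many nucleotides as there are distinct k-mers in the cycle.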
235
Genome Reconstruction: A Puzzle With a Billion Pieces Problem with H Ultimately, we must solve the HCP on H in order to find a candidate DNA sequence… This idea motivated the method used for assembling the human genome from 50 million (long and expensive) reads in 2000, but the computational strain was overwhelming: assembling the human genome took several computers a period of months, working around the clock. For that matter, newer sequencing technologies produce billions of (short and inexpensive) reads: we need a new idea.
236
Second Try: The Graph E
Form a different graph E as follows:
Create a vertex for each distinct prefix/suffix from reads.
Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG
Vertices: CA, GC, CG, TG, GT, GG, AT, AA
258
Second Try: The Graph E
Form a different graph E as follows:
Create a vertex for each distinct prefix/suffix from reads.
Connect vertex v to vertex w with a directed edge if there is a read whose prefix is v and whose suffix is w.
Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG
Vertices: CA, GC, CG, TG, GT, GG, AT, AA
Edges (one per read): ATG (AT → TG), TGG (TG → GG), GGC (GG → GC), GCG (GC → CG), CGT (CG → GT), GTG (GT → TG), TGC (TG → GC), GCA (GC → CA), CAA (CA → AA), AAT (AA → AT)
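The two construction steps can be sketched in a few lines of Python. The function name build_graph_e and the adjacency-list representation are assumptions made for illustration.

```python
def build_graph_e(reads):
    """Map each (k-1)-mer vertex to the list of vertices its outgoing edges reach."""
    graph = {}
    for read in reads:
        prefix, suffix = read[:-1], read[1:]
        graph.setdefault(prefix, []).append(suffix)  # one directed edge per read
        graph.setdefault(suffix, [])                 # make sure the suffix is a vertex too
    return graph


reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
E = build_graph_e(reads)
for vertex in sorted(E):
    print(vertex, "->", sorted(E[vertex]))
```

Building E takes one pass over the reads, so its cost grows linearly with the number of reads.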
270
Eulerian Cycles in E
We have an Eulerian cycle in E. Numbering the edges in the order they are traversed:
1. ATG  2. TGG  3. GGC  4. GCG  5. CGT  6. GTG  7. TGC  8. GCA  9. CAA  10. AAT
282
Eulerian Cycles in E
We have an Eulerian cycle in E:
ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT → ATG
This is the same sequence of 3-mers that we had in H! Thus we will obtain the same sequenced genome as before.
Genome: ATGGCGTGCA
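One standard way to find such a cycle is Hierholzer's algorithm, which this section of the slides does not name; the sketch below uses it as an assumption. It presumes the graph is balanced and connected. The function names are illustrative.

```python
def eulerian_cycle(graph, start):
    """Hierholzer's algorithm, iterative form. Returns vertices; first == last."""
    remaining = {v: list(ws) for v, ws in graph.items()}  # unused outgoing edges
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            stack.append(remaining[v].pop())  # follow any unused edge out of v
        else:
            cycle.append(stack.pop())         # v is exhausted: emit it
    return cycle[::-1]


def spell_genome(vertex_cycle):
    """Circular genome: first letter of each vertex, skipping the repeated last vertex."""
    return "".join(v[0] for v in vertex_cycle[:-1])


reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
graph = {}
for read in reads:
    graph.setdefault(read[:-1], []).append(read[1:])
    graph.setdefault(read[1:], [])

cycle = eulerian_cycle(graph, "AT")
genome = spell_genome(cycle)
print(cycle, genome)
```

Each edge is pushed and popped once, so the running time is linear in the number of reads. The cycle found may differ from the one on the slide when E has several Eulerian cycles.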
285
Analysis of E
Good news: We now only have to find an Eulerian cycle in the graph E, which could be done on this computer.
Bad news:
1. There may be more than one Eulerian cycle in E. We won't discuss this issue here, but it can be resolved.
2. How do we know that E even has an Eulerian cycle? By Euler's Theorem, we only need to show that E is a balanced graph. To do this, we need one more piece of mathematical history…
286
Section 9: De Bruijn and Fragment Assembly
287
De Bruijn's Question
1946: The Dutch mathematician Nicolaas de Bruijn asks: can we design a circular superstring of minimal length that contains every binary string of length k?
Example for k = 3: the circular superstring '00011101' contains all eight binary strings of length 3. (The slide's figure marks the locations of '000' and '110' on the string.)
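The k = 3 example is easy to check mechanically. This sketch reads every length-3 window of the circular string, wrapping around the end; circular_kmers is an illustrative helper name.

```python
from itertools import product


def circular_kmers(text, k):
    """All len(text) windows of length k, wrapping around the end of the string."""
    doubled = text + text[: k - 1]
    return [doubled[i:i + k] for i in range(len(text))]


superstring = "00011101"
found = circular_kmers(superstring, 3)
all_binary = {"".join(bits) for bits in product("01", repeat=3)}
print(found)
print(set(found) == all_binary)
```

The string has length 8 and there are 8 binary 3-mers, so each one appears exactly once and no shorter circular string could contain them all.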
288
De Bruijn's Question
De Bruijn introduced a special class of graph B(n, k):
Vertices = all n^(k - 1) possible (k - 1)-mers in an n-letter alphabet.
An edge connects v to w if there is a k-mer whose prefix = v and whose suffix = w.
The slide's figure shows B(2, 4), assuming that our alphabet contains 0 and 1.
289
De Bruijn's Question
For any choice of n and k, B(n, k) must be balanced/Eulerian. Why? Because the indegree and the outdegree of every vertex are both equal to the size of the alphabet (n): every (k - 1)-mer occurs as the prefix of n different k-mers and as the suffix of n different k-mers.
In the slide's figure, red numbers show the order of edges in an Eulerian cycle.
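The degree argument can be checked by enumerating every k-mer over a small alphabet. This is a sketch under the definition of B(n, k) given above; de_bruijn_degrees is an illustrative name.

```python
from collections import Counter
from itertools import product


def de_bruijn_degrees(alphabet, k):
    """Indegree and outdegree of every (k-1)-mer vertex in B(len(alphabet), k)."""
    indegree, outdegree = Counter(), Counter()
    for letters in product(alphabet, repeat=k):
        kmer = "".join(letters)
        outdegree[kmer[:-1]] += 1  # the edge leaves its prefix
        indegree[kmer[1:]] += 1    # and enters its suffix
    return indegree, outdegree


indeg, outdeg = de_bruijn_degrees("01", 4)     # B(2, 4)
indeg4, outdeg4 = de_bruijn_degrees("ACGT", 3)  # B(4, 3)
print(len(outdeg), set(outdeg.values()), indeg == outdeg)
```

For B(2, 4) this yields 8 vertices, each with indegree and outdegree 2. For B(4, 3) it yields 16 vertices, each with indegree and outdegree 4.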
290
De Bruijn's Question
The graph E we have constructed is contained in the graph B(4, k). We have n = 4 since there are four possible nucleotides.
E must be balanced/Eulerian too! The indegree and outdegree of any (k - 1)-mer vertex both equal the number of times this (k - 1)-mer appears in the genome.
Genome: ATGGCGTGCA
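A short check of this claim on the running example, assuming the genome is circular and that every 3-mer of the genome appears as exactly one read (so each 3-mer contributes one edge):

```python
from collections import Counter

genome, k = "ATGGCGTGCA", 3
n = len(genome)
wrapped = genome + genome[: k - 1]  # unroll the circle so windows can wrap around

kmers = [wrapped[i:i + k] for i in range(n)]  # one edge of E per k-mer
outdegree = Counter(kmer[:-1] for kmer in kmers)
indegree = Counter(kmer[1:] for kmer in kmers)
occurrences = Counter(wrapped[i:i + k - 1] for i in range(n))

for vertex in sorted(occurrences):
    print(vertex, indegree[vertex], outdegree[vertex], occurrences[vertex])
```

Every occurrence of a (k - 1)-mer in the circular genome is preceded by one letter and followed by one letter, which is why the three counts agree for every vertex.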
301
Section 10: Generalizing Fragment Assembly
302
Simplifying Assumptions for Fragment Assembly
Recall the assumptions we have already made:
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular-shaped chromosome.
Our aim is to relax each of these assumptions and determine how the problem changes.
303
Assumption 1: Generating (nearly) all k-mers
100-nucleotide reads generated by Illumina sequencing technology capture only a small fraction of the 100-mers from the genome (even for high-coverage sequencing projects), thus violating this key assumption of the de Bruijn graph approach.
However, if we break these reads into shorter k-mers, the resulting k-mers often represent nearly all k-mers from the genome for sufficiently small k.
For example, modern assemblers often break every 100-nucleotide read into 46 overlapping 55-mers and then assemble the resulting 55-mers using de Bruijn graphs.
304
Assumption 1: Generating (nearly) all k-mers
Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6.
Genome: ATGCAAGCTAGCT
Reads: ATGCAA, CAAGCT, CTAGCT, and a fourth read drawn in two pieces at the ends of the genome (ATGC … CT)
We didn't generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.
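A small sketch of read splitting. A read of length L yields L - k + 1 overlapping k-mers, which gives 100 - 55 + 1 = 46 for the Illumina example. For the toy genome, the code treats the genome as a linear string and uses only the three reads that are fully legible on the slide; both choices are simplifying assumptions, and split_into_kmers is an illustrative name.

```python
def split_into_kmers(read, k):
    """All overlapping k-mers of a read: len(read) - k + 1 of them."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# A 100-nucleotide read breaks into 46 overlapping 55-mers.
pieces_per_read = len(split_into_kmers("A" * 100, 55))

genome = "ATGCAAGCTAGCT"
reads = ["ATGCAA", "CAAGCT", "CTAGCT"]  # the three fully legible reads

genome_6mers = set(split_into_kmers(genome, 6))
genome_3mers = set(split_into_kmers(genome, 3))
read_3mers = {kmer for read in reads for kmer in split_into_kmers(read, 3)}

print(pieces_per_read)
print(sorted(genome_6mers - set(reads)))  # 6-mers that no read captured
print(read_3mers == genome_3mers)         # but every 3-mer is recovered
```

Smaller k makes full coverage more likely, at the price of more repeated k-mers in the genome, which is the subject of Assumption 3.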
318
Assumption 2: Handling Errors in Reads
What happens to the graph E when some reads have errors?
Example: Say our graph E for genome ATGGCGTGCAATG should look like this. [Figure: error-free graph E]
If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead. [Figure: graph E with an extra path through the erroneous 3-mers GGA, GAG, and AGT]
This is called a bulge in the graph E.
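To see where a bulge comes from, the sketch below compares the edges produced by the correct read with those produced when an erroneous copy is also present. It assumes both copies appear in the data and that reads are split into 3-mers; kmer_edges is an illustrative name.

```python
def kmer_edges(reads, k):
    """Set of (prefix, suffix) edges contributed by every k-mer in the reads."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


correct = kmer_edges(["TGGCGTG"], 3)
with_error = kmer_edges(["TGGCGTG", "TGGAGTG"], 3)

bulge = sorted(with_error - correct)  # edges that exist only because of the error
print(bulge)
```

The extra edges form a second path from GG to GT (GG → GA → AG → GT) running parallel to the correct path GG → GC → CG → GT. A single substitution corrupts up to k consecutive k-mers of a read; here it corrupts three.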
320
Assumption 2: Handling Errors in Reads
Most reads have errors, resulting in millions of bulges in E.
2004: Pevzner et al. provide an algorithm for bulge removal.
321
Assumption 3: Handling Repeated k-mers
The genome ACGTACGT has only four distinct 3-mers: ACG, CGT, GTA, and TAC. We would obtain the graph E listed below and reconstruct this genome as ACGT. In other words, we can't represent repeated k-mers in the genome!
Vertices of E: AC, CG, GT, TA
Edges of E: ACG, CGT, GTA, TAC
322
Assumption 3: Handling Repeated k-mers
Define the multiplicity of a k-mer as the number of times it occurs in a genome.
We will add edges to E in order to form a new graph E* for which the number of edges connecting two vertices represents the multiplicity of the k-mer on that edge.
An Eulerian cycle in E* still gives a candidate genome.
323
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
324
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
325
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
326
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
327
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
328
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
329
Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA
340
Determining k-mer multiplicities
How can we find the multiplicity of a k-mer in the genome? The multiplicity of a k-mer is directly related to the frequency with which that k-mer occurs in our reads. So a k-mer that appears 5 times in the genome is expected to occur about 5 times as often in our reads as a k-mer that appears only once.
Figure: graph E* with vertices AA, AT, CA, CG, GC, GG, GT, TG and edges labeled ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT.
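One simple way to turn this idea into numbers is sketched below. It is a toy illustration, not the method used by any particular assembler: it assumes error-free reads and assumes we already know how often a single-copy k-mer is read on average (the hypothetical parameter `unit_coverage`). The function names are also made up for this example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_multiplicities(counts, unit_coverage):
    """Estimate genome multiplicity as (read count) / (single-copy count).

    unit_coverage is an assumed, externally supplied value: the number of
    times a k-mer occurring once in the genome is expected to be read.
    """
    return {kmer: max(1, round(c / unit_coverage)) for kmer, c in counts.items()}

# Toy data: ten perfect reads, each covering the whole linear segment.
reads = ["ATGCGTGGCGTGCA"] * 10
counts = count_kmers(reads, 3)
estimates = estimate_multiplicities(counts, unit_coverage=10)
for kmer in sorted(estimates):
    print(kmer, counts[kmer], estimates[kmer])
```

With real data the counts are noisy and the single-copy coverage has to be estimated as well, so the rounding step here should be read as a simplification.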
341
Assumption 4: From Circular to Linear Genomes
The genomes of all complex organisms are split across a number of linear chromosomes (46 in humans). So in order to sequence the human genome, geneticists simply sequenced all of these linear chromosomes.
Question: How do we sequence a linear segment of DNA?
343
Assumption 4: From Circular to Linear Genomes
Say our linear DNA segment is ATGCGTGGCGTGCA. Then the 3-mers from this segment are the same as for the circular genome before, except that the segment doesn't "wrap around," so we lose two 3-mers: CAA and AAT.
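The difference between the two readings of the same string can be checked directly. In this small sketch (the function names are chosen for illustration), the circular version appends the first k-1 letters to the end so that k-mers spanning the wrap-around point are included.

```python
from collections import Counter

def linear_kmers(text, k):
    """k-mers of a linear string: no wrap-around."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def circular_kmers(text, k):
    """k-mers of a circular string: one k-mer starts at every position."""
    wrapped = text + text[:k - 1]
    return Counter(wrapped[i:i + k] for i in range(len(text)))

genome = "ATGCGTGGCGTGCA"
# Counter subtraction keeps only the k-mers that the linear reading loses.
lost = circular_kmers(genome, 3) - linear_kmers(genome, 3)
print(sorted(lost.elements()))
```

In general a linear string of length n has n - k + 1 k-mers while the circular one has n, so exactly k - 1 k-mers are lost.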
348
Assumption 4: From Circular to Linear Genomes
Let's depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*. Get rid of the vertex AA as well, since no remaining edge touches it. So to sequence our segment ATGCGTGGCGTGCA, we need to find a path through E* that starts at AT, ends at CA, and uses every edge exactly once.
Figure: graph E* with vertices AT, CA, CG, GC, GG, GT, TG and edges labeled ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA.
349
Assumption 4: From Circular to Linear Genomes
An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once. So an Eulerian path is just like an Eulerian cycle, except that we don't have to start and end at the same vertex. Luckily, Euler's Theorem generalizes to efficiently determine whether a graph has an Eulerian path and then find this path.
Euler's Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced, or exactly two vertices are not balanced: one whose outdegree exceeds its indegree by 1 (where the path starts) and one whose indegree exceeds its outdegree by 1 (where the path ends).
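The degree condition in the theorem is cheap to test. The sketch below assumes the graph is already known to be connected and uses the usual refinement that the two unbalanced vertices must be off by exactly one edge each. The function name `classify` and its return convention are invented for this example.

```python
from collections import Counter

def classify(edges):
    """Check the degree conditions of Euler's Theorem II.

    edges is a list of (u, v) pairs; parallel edges are simply repeated.
    Connectivity is assumed, not checked.
    Returns ("cycle", None, None), ("path", start, end) or ("none", None, None).
    """
    diff = Counter()  # outdegree minus indegree for each vertex
    for u, v in edges:
        diff[u] += 1
        diff[v] -= 1
    unbalanced = {v: d for v, d in diff.items() if d != 0}
    if not unbalanced:
        return ("cycle", None, None)
    if sorted(unbalanced.values()) == [-1, 1]:
        start = next(v for v, d in unbalanced.items() if d == 1)
        end = next(v for v, d in unbalanced.items() if d == -1)
        return ("path", start, end)
    return ("none", None, None)

segment = "ATGCGTGGCGTGCA"
kmers = [segment[i:i + 3] for i in range(len(segment) - 2)]
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
print(classify(edges))
print(classify([("A", "B"), ("A", "B")]))
```

The second call shows why the "off by exactly one" detail matters: two parallel edges from A to B leave exactly two unbalanced vertices, yet no single path can use both edges.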
350
Assumption 4: From Circular to Linear Genomes
By Euler's Theorem II, E* must contain an Eulerian path, because AT and CA (the endpoints of our segment) are the only two vertices that aren't balanced: AT has one more outgoing edge than incoming, and CA has one more incoming edge than outgoing. Hence in every case we have solved our giant puzzle!
Figure: graph E* with vertices AT, CA, CG, GC, GG, GT, TG and edges labeled ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA.
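Actually finding the path can be done with Hierholzer's algorithm, a standard technique that the slides do not name. The sketch below assumes a connected graph that already satisfies the degree conditions; the function names are illustrative.

```python
from collections import Counter, defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm with an explicit stack.

    Assumes the graph is connected and satisfies the degree conditions
    for an Eulerian path (or cycle). edges is a list of (u, v) pairs.
    """
    adj = defaultdict(list)
    balance = Counter()  # outdegree minus indegree
    for u, v in edges:
        adj[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())    # dead end: vertex is finished
    return path[::-1]

def spell(path):
    """Glue overlapping vertex labels back into one string."""
    return path[0] + "".join(v[-1] for v in path[1:])

def kmers(text, k):
    return [text[i:i + k] for i in range(len(text) - k + 1)]

segment = "ATGCGTGGCGTGCA"
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers(segment, 3)]
genome = spell(eulerian_path(edges))
print(genome)
```

This graph has more than one Eulerian path (for example, ATGGCGTGCGTGCA has exactly the same 3-mers as the original segment), so the reconstructed string is guaranteed to match the 3-mer data but not necessarily the original segment.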
351
What’s Next?
352
Personal Genomics: Millions of Human Genomes
Personal genome sequencing began with the sequencing of a few scientists' genomes in 2009 and will soon expand to millions of individuals. Thousands of cancer genomes have already been sequenced, and genome sequencing will soon become a routine technique in medicine. At the heart of this revolution are bioinformaticians, who must harness precise methods in order to analyze the growing volume of data.
Image: 10 scientists and entrepreneurs who made their genomes publicly available in 2009.
353
Genome 10K and Beyond
2010: Scientists launch an ambitious project to sequence the genomes of 10,000 species.
201x?: We will hopefully be able to reconstruct the “tree of life” and uncover the genomes of ancestors that lived millions of years ago.
20xx?: Maybe, just maybe, we will be able to discover why giraffes grew necks and humans grew brains.