
1 Genome Reconstruction: A Puzzle With a Billion Pieces. Phillip Compeau & Pavel Pevzner, University of California-San Diego

2 Genome Reconstruction: A Puzzle With a Billion Pieces Outline
1. Introduction to Genome Sequencing
2. The Newspaper Problem
3. DNA Chips: A First Shot at Sequencing with Short Reads
4. Two Mathematical Detours
5. Introduction to Graph Theory
6. Euler’s Theorem
7. ECP vs. HCP and Algorithmic Complexity
8. From Euler and Hamilton to Fragment Assembly
9. De Bruijn and a Final Solution to Fragment Assembly
10. Generalizing Fragment Assembly

3 Genome Reconstruction: A Puzzle With a Billion Pieces Section 1: Introduction to Genome Sequencing

4 Genome Reconstruction: A Puzzle With a Billion Pieces What Is Genome Sequencing? A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A, T, G, and C. A human genome has roughly 3 billion nucleotides. Genome sequencing is the process of determining the sequence of nucleotides that make up a genome.
…CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGA…

5 Genome Reconstruction: A Puzzle With a Billion Pieces What Is Genome Sequencing? Different people have slightly different genomes, but all humans share about 99.9% of the same genetic code. The remaining 0.1% accounts for differences in height, eye color, susceptibility to high cholesterol, etc.

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA
TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT
CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA
CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA
TCGACTACAGATAAAACATGCTAGTACAACAGTATACATA
GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA
TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT
CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA
CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA
TCGACTACAGATGAAACATGCTAGTACAACAGTATACATA
GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
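To make the idea of "slightly different genomes" concrete, the following minimal sketch lists the positions where two equal-length sequences differ. The function name `mismatch_positions` is an assumption made for illustration; the two strings are the second 40-nucleotide rows of the two sequences shown on this slide.

```python
def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Second 40-nucleotide row of each sequence on the slide.
person_1 = "TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT"
person_2 = "TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT"

diffs = mismatch_positions(person_1, person_2)
print(diffs)
```

A position-by-position comparison like this only works when the two sequences are already aligned and have the same length, which is an assumption of the sketch.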

6 Genome Reconstruction: A Puzzle With a Billion Pieces Species Sequencing vs. Individual Genome Sequencing Species Sequencing: Determine the “consensus genome” of an entire species.

7 Genome Reconstruction: A Puzzle With a Billion Pieces Species Sequencing vs. Individual Genome Sequencing Individual Sequencing: Determine how an individual differs from its species.

8 Genome Reconstruction: A Puzzle With a Billion Pieces Why Would We Want to Sequence a Genome? Species genome sequencing:
Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for brain development).
Reveal evolutionary relationships between species.
Determine the genetic makeup of our evolutionary ancestors.

9 Genome Reconstruction: A Puzzle With a Billion Pieces Why Would We Want to Sequence a Genome? Individual genome sequencing:
Unearth the genetic basis of many diseases.
Forensic applications.
Example: In 2010, 6-year-old Nicholas Volker became the first human being to be saved because of genome sequencing. Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries. Genome sequencing revealed a rare mutation in a gene linked to a defect in his immune system. This led doctors to use advanced immunotherapy, which saved the child.

10 Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Genome Sequencing
Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods.
1980: They share the Nobel Prize in Chemistry.
Still, their sequencing methods were too expensive for large genomes: at a cost of $1 per nucleotide, it would cost $3 billion to sequence the human genome.
Pictured: Walter Gilbert, Frederick Sanger.

11 Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Genome Sequencing
1990: The public Human Genome Project, headed by Francis Collins from 1993, sets out to sequence the human genome.
1998: Craig Venter founds Celera Genomics, a private firm, with the same goal.
Pictured: Francis Collins, Craig Venter.

12 Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Mammalian Genome Sequencing 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.

13 Genome Reconstruction: A Puzzle With a Billion Pieces Brief History of Mammalian Genome Sequencing 2000s: Many more mammalian genomes are sequenced.

14 Genome Reconstruction: A Puzzle With a Billion Pieces The Arrival of Personal Genomics
2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude.
2010: The market for sequencing machines takes off.
Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000.
Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.
Beijing Genomics Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.
23andMe offers partial genome sequencing for $499.
Many universities introduce new courses in which students study their own genomes.

15 Genome Reconstruction: A Puzzle With a Billion Pieces The Future of Genome Sequencing 2010s?: Genome sequencing will hopefully continue to bloom. The $1,000 human genome may arrive as early as 2012. Hopefully, sequencing an individual genome will soon become as routine as an X-ray.

16 Genome Reconstruction: A Puzzle With a Billion Pieces What Makes Genome Sequencing So Difficult? When we read a book, we can read the entire book one letter at a time from the beginning to the end. However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces. Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads. But we have no idea which genomic positions these reads come from! We must figure out how to put the reads back together to assemble a genome.

17 Genome Reconstruction: A Puzzle With a Billion Pieces Section 2: The Newspaper Problem and Genome Sequencing

18 Genome Reconstruction: A Puzzle With a Billion Pieces The Newspaper Problem


24 Genome Reconstruction: A Puzzle With a Billion Pieces The Newspaper Problem as an “Overlap Puzzle” The newspaper problem is not the same as a jigsaw puzzle: We have multiple copies of the same edition of a newspaper. Plus, some pieces of paper got blown to bits in the explosion. Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said. This gives us a giant overlap puzzle!

25 Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing is Harder than Newspaper Problem In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper). However, the “language” of DNA remains largely unknown.

26 Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing is Harder than Newspaper Problem There are lots of repeated substrings in every genome (50% of the human genome is formed by repeats). Example: GCTT occurs 5 times in the following: AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult to solve (even with just 16 pieces).
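The repeat example can be checked mechanically. The sketch below counts every position at which a pattern occurs in a string, including overlapping occurrences; the function name `count_occurrences` is a hypothetical helper, not part of the slides.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    width = len(pattern)
    return sum(
        1
        for start in range(len(text) - width + 1)
        if text[start:start + width] == pattern
    )


# The example string from the slide.
example = "AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG"
gctt_count = count_occurrences(example, "GCTT")
print(gctt_count)
```

A sliding window is used instead of `str.count` because `str.count` skips overlapping matches, which matter for repeats such as AAAA.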

27 Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing a Genome: Lab + Computation Read Generation (Experimental): Generate many reads from multiple copies of the same genome. Fragment Assembly (Computational): Use these reads to algorithmically put the genome back together.
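The read generation step can be imitated in software to build intuition. The sketch below samples fixed-length substrings at random positions of a known string; the function `generate_reads`, its parameters, and the seed are assumptions for illustration, and real sequencing machines also introduce errors that are ignored here. The sample string is the short sequence fragment shown on the illustration slides.

```python
import random


def generate_reads(genome: str, read_length: int, num_reads: int, seed: int = 0) -> list[str]:
    """Sample num_reads substrings of length read_length at random start positions."""
    last_start = len(genome) - read_length
    if last_start < 0:
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_length] for s in starts]


genome = "GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC"
reads = generate_reads(genome, read_length=10, num_reads=20)
for read in reads[:5]:
    print(read)
```

The reads carry no position information, which is exactly what makes fragment assembly a puzzle.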


32 Genome Reconstruction: A Puzzle With a Billion Pieces Sequencing a Genome: Illustration
Multiple (Unsequenced) Genome Copies → Read Generation → Reads → Fragment Assembly → Sequenced Genome
… GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC …

33 Genome Reconstruction: A Puzzle With a Billion Pieces Section 3: DNA Chips: A First Shot at Sequencing with Short Reads

34 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: From an Idea to a New Industry
1989: Radoje Drmanac, Andrey Mirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation.
Key Idea: Generate all k-mers from the genome in the hope that they can be assembled to reconstruct the genome.
k-mer: A string of length k (in an alphabet of 4 nucleotides).
1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.”
2000: Arrays are a multi-billion dollar industry.
Pictured: Southern, Mirzabekov, Drmanac.
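The k-mer definition translates directly into code. The sketch below enumerates all strings of length k over the four nucleotides and extracts the k-mers that occur in a given string; `all_kmers` and `kmers_of` are hypothetical helper names, and the short input string is made up for the example.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_kmers(k: int) -> list[str]:
    """Every possible k-mer over the 4-letter nucleotide alphabet (4**k strings)."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=k)]


def kmers_of(text: str, k: int) -> list[str]:
    """The k-mers occurring in text, listed left to right."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(len(all_kmers(3)))
print(kmers_of("ATGGCG", 3))
```

Because the number of possible k-mers grows as 4 to the power k, an array can only hold all of them when k is small.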

35 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Implementation
1. Synthesize a distinct k-mer in each of the 4^k cells in the array.
2. Cover the array with multiple copies of a fluorescently-labeled unknown DNA fragment.
3. The DNA will hybridize with a k-mer if it contains the complement of that k-mer.
4. Use a spectroscope to determine which sites emit light. The complements of these sites reveal the k-mers in the unknown DNA fragment: these are our reads!
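The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: "complement" is interpreted as the reverse complement (A pairs with T, C pairs with G, and the paired strand is read in the opposite direction), which matches the CAT and ATG pairing in the example slides; the fragment `ATGCA` and the function names are hypothetical.

```python
# Base pairing: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def lit_sites(fragment: str, k: int) -> list[str]:
    """Array sites that would emit light: reverse complements of the fragment's k-mers."""
    kmers = {fragment[i:i + k] for i in range(len(fragment) - k + 1)}
    return sorted(reverse_complement(kmer) for kmer in kmers)


def reads_from_sites(sites: list[str]) -> list[str]:
    """Step 4: recover the fragment's k-mers (the reads) from the lit sites."""
    return sorted(reverse_complement(site) for site in sites)


fragment = "ATGCA"  # hypothetical unknown DNA fragment
sites = lit_sites(fragment, 3)
print(sites)
print(reads_from_sites(sites))
```

Taking the reverse complement twice returns the original k-mer, which is why the reads can be recovered from the lit sites alone.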

36 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Illustration

37 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Example What are our reads?
AAA AGA CAA CGA GAA GGA TAA TGA
AAC AGC CAC CGC GAC GGC TAC TGC
AAG AGG CAG CGG GAG GGG TAG TGG
AAT AGT CAT CGT GAT GGT TAT TGT
ACA ATA CCA CTA GCA GTA TCA TTA
ACC ATC CCC CTC GCC GTC TCC TTC
ACG ATG CCG CTG GCG GTG TCG TTG
ACT ATT CCT CTT GCT GTT TCT TTT

38 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Example What are our reads?
Array sites emitting light: CAC CGC TGC CAT CCA GCA GCC ACG TTG ATT
Current site: CAT

39 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Example What are our reads?
Array sites emitting light: CAC CGC TGC CAT CCA GCA GCC ACG TTG ATT
Site CAT pairs with its complement ATG (the two strands are read in opposite directions).


44 Genome Reconstruction: A Puzzle With a Billion Pieces DNA Chips: Example What are our reads? So 3-mer ATG must occur in the genome!
Array with site CAT replaced by its read ATG: CAC CGC TGC ATG CCA GCA GCC ACG TTG ATT

45 Genome Reconstruction: A Puzzle With a Billion Pieces Red 3-mers Must Occur in the Genome What are our reads?
Array: CAC CGC TGC ATG CCA GCA GCC ACG TTG ATT
CAC → GTG
CGC → GCG
CAT → ATG
TGC → GCA
ACG → CGT
ATT → AAT
CCA → TGG
GCA → TGC
GCC → GGC
TTG → CAA


63 Genome Reconstruction: A Puzzle With a Billion Pieces Red 3-mers Must Occur in the Genome What are our reads?
Array with every site replaced by its read: GTG GCG GCA ATG TGG TGC GGC CGT CAA AAT
CAC → GTG
CGC → GCG
CAT → ATG
TGC → GCA
ACG → CGT
ATT → AAT
CCA → TGG
GCA → TGC
GCC → GGC
TTG → CAA
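The site-to-read table in this example can be reproduced with a few lines of code. The sketch assumes, as the CAT and ATG pairing suggests, that each read is the reverse complement of its array site; `probe_to_read` is a hypothetical name.

```python
# Base pairing used for hybridization.
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def probe_to_read(probe: str) -> str:
    """Walk the probe backwards and pair every base: the reverse complement."""
    return "".join(PAIR[base] for base in reversed(probe))


# The ten array sites from the example, in the order the slides decode them.
probes = ["CAC", "CGC", "CAT", "TGC", "ACG", "ATT", "CCA", "GCA", "GCC", "TTG"]
reads = [probe_to_read(probe) for probe in probes]

for probe, read in zip(probes, reads):
    print(probe, "->", read)
```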

64 Genome Reconstruction: A Puzzle With a Billion Pieces From Biological Data to Computational Problem
Reads: GTG GCG GCA ATG TGG TGC GGC CGT CAA AAT
Aim: Construct a shortest possible genome containing all our reads. This is now a computational problem!
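Stating the aim as a computational problem means a proposed answer can be checked by a program. The sketch below tests whether a candidate string contains all ten reads and compares its length with a simple lower bound: a string of length n has n - k + 1 positions for k-mers, so covering m distinct k-mers needs at least m + k - 1 letters. The candidate `ATGGCGTGCAAT` is a hypothetical answer chosen for illustration, not one given at this point in the slides, and the sketch only verifies a candidate rather than searching for one.

```python
def contains_all_reads(genome: str, reads: set[str]) -> bool:
    """True if every read occurs somewhere in the candidate genome."""
    return all(read in genome for read in reads)


def length_lower_bound(reads: set[str], k: int) -> int:
    """Minimum possible length of any string containing len(reads) distinct k-mers."""
    return len(reads) + k - 1


reads = {"GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"}
candidate = "ATGGCGTGCAAT"  # hypothetical candidate genome

print(contains_all_reads(candidate, reads))
print(len(candidate), length_lower_bound(reads, 3))
```

When a candidate contains every read and its length equals the lower bound, no shorter genome can exist, although other genomes of the same length might.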

65 Genome Reconstruction: A Puzzle With a Billion Pieces Section 4: Two Mathematical Detours

66 Genome Reconstruction: A Puzzle With a Billion Pieces The Bridges of Königsberg The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.

67 Genome Reconstruction: A Puzzle With a Billion Pieces The Bridges of Königsberg They wondered if they could walk through the city, cross each bridge (blue) exactly once, and return where they started.

68 Genome Reconstruction: A Puzzle With a Billion Pieces The Bridges of Königsberg 1735: Leonhard Euler develops an approach to answer this question for any city, even for a “city” with a million islands. We will soon discuss Euler’s method as well as how it applies to genome sequencing. Pictured: Leonhard Euler.

69 Genome Reconstruction: A Puzzle With a Billion Pieces The Icosian Game Over a century passes… 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 “islands” connected by “bridges.” Goal: find a walk that visits every island exactly once and returns to where it started. Pictured: William Hamilton, the Icosian Game.

70 Genome Reconstruction: A Puzzle With a Billion Pieces Similar Problems with Very Different Fates These two stories have something in common:
Find a walk that uses every bridge once (Königsberg Bridges Problem).
Find a walk that visits every island once (Icosian Game).
However, while Euler solved the first problem (even for a city with a million bridges), mathematicians still do not know an efficient way to solve the second problem, even for a city with a thousand islands. But where are the genomes???

71 Genome Reconstruction: A Puzzle With a Billion Pieces Section 5: Introduction to Graph Theory

72 Genome Reconstruction: A Puzzle With a Billion Pieces Graphs A graph is a network composed of two sets of objects:
Vertices: each vertex is represented by a point.
Edges: each edge is represented by a segment connecting two vertices.
Graph theory can be applied to all kinds of different problems:
Transportation networks
Disease epidemics
Computer viruses spreading through the internet
And, yes…genome sequencing!

73 Genome Reconstruction: A Puzzle With a Billion Pieces Königsberg Bridges Graph For the Königsberg Bridge Problem, we create a graph:
Vertices = 4 land masses of the city
Edges = 7 bridges connecting land areas
Note: We don’t need to worry about the exact placement of vertices or the shape of bridges.
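As a minimal sketch, the Königsberg graph can be written down as a list of edges, one per bridge. The vertex labels A to D are assumptions made for this example (A for the central island, B and C for the two river banks, D for the eastern land mass), and the bridge list follows the well-known historical layout rather than anything spelled out on the slide. The code counts how many bridges touch each land mass.

```python
from collections import Counter

# One (land mass, land mass) pair per bridge; repeated pairs are parallel bridges.
bridges = [
    ("A", "B"), ("A", "B"),  # two bridges between the island and one bank
    ("A", "C"), ("A", "C"),  # two bridges between the island and the other bank
    ("A", "D"),              # island to the eastern land mass
    ("B", "D"),
    ("C", "D"),
]


def degrees(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Number of edge endpoints at each vertex of an undirected multigraph."""
    count: Counter[str] = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return dict(count)


print(len(bridges))
print(degrees(bridges))
```

An edge list keeps parallel bridges as separate entries, which a plain set of vertex pairs would lose.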

74 Genome Reconstruction: A Puzzle With a Billion Pieces Icosian Game Graph For the Icosian Game, we create a graph: Vertices = islands Edges = bridges connecting the islands

75 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns where it started, the result of its walk forms a cycle of G.

76 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Ant: “Here I go!”

77 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Ant: “…He wakes up in the morning…”

78 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Ant: “…goes to visit his mommy…”

79 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Ant: “…when all the little ants are marching…”

80 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Ant: “…they all do it the same way…”

81 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Ant: “Oh no! I’m back where I started!”

82 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Two questions:
1. Is there a cycle of G in which the ant walks through each edge exactly once?
2. Is there a cycle of G in which the ant walks through each vertex exactly once?
Ant: “???!!!”

83 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian and Hamiltonian Cycles Two questions:
1. Is there a cycle of G in which the ant walks through each edge exactly once? (Eulerian cycle)
2. Is there a cycle of G in which the ant walks through each vertex exactly once? (Hamiltonian cycle)
Ant: “I wish someone would name a cycle after me…I’m the one doing all the walking here!”

84 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles An Eulerian cycle is a cycle that traverses each edge exactly once. A graph containing such a cycle is called Eulerian. If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph. However, no such cycle exists.

85 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles If we add two more edges to the Königsberg graph, there will be an Eulerian cycle; see it?

86 to 94 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles [Animation: the nine edges of the extended graph are numbered 1 through 9, one per slide, in the order an Eulerian cycle traverses them.]

95 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that visits each vertex exactly once. A graph containing such a cycle is called Hamiltonian. For example, the graph corresponding to the Icosian game is Hamiltonian. This means that the Icosian game has a solution!

96 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1

97 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2

98 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3

99 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4

100 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4 5

101 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4 5 6

102 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4 5 6 7

103 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4 5 6 7 8

104 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4 5 6 7 8 9

105 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. 1 2 3 4 5 6 7 8 9 10

116 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once. A graph containing such a cycle is called Hamiltonian. (Figure: the cycle traced step by step through the graph, with labels 1 through 20.)

117 Genome Reconstruction: A Puzzle With a Billion Pieces Finding Eulerian Cycles vs Hamiltonian Cycles Given a graph G, we now have two questions that we can program a computer to answer about G. Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian. Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.

118 Genome Reconstruction: A Puzzle With a Billion Pieces Section 6: Euler’s Theorem

119 Genome Reconstruction: A Puzzle With a Billion Pieces Euler’s Theorem We will now discuss how Euler solved the Königsberg Bridge Problem. You might guess: He used graph theory! This is not entirely accurate. A better statement would be: He invented graph theory!

120 Genome Reconstruction: A Puzzle With a Billion Pieces Directed Graphs Directed Graph: A graph in which each edge has a direction (represented by an arrow). You might like to think of directed edges as “one-way bridges.” (Figure: an undirected graph and a directed graph, side by side.)

121 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles in Directed Graphs An Eulerian cycle in a directed graph is simply a cycle that travels down all the edges in the correct direction. A directed graph is Eulerian if it contains an Eulerian cycle. Is this graph Eulerian? Why?

122 Genome Reconstruction: A Puzzle With a Billion Pieces Balanced Graphs indegree(v) = the number of edges leading into vertex v. outdegree(v) = the number of edges leading out of v. A graph is balanced if indegree(v) = outdegree(v) for every vertex v. Label each vertex v with (indegree(v), outdegree(v)). This graph isn’t balanced since some vertices don’t have equal indegree and outdegree. (Figure: vertex labels (1, 2) (2, 1) (1, 0) (2, 1) (1, 1) (0, 2) (1, 1).)

123 Genome Reconstruction: A Puzzle With a Billion Pieces Balanced Graphs indegree(v) = the number of edges leading into vertex v. outdegree(v) = the number of edges leading out of v. A graph is balanced if indegree(v) = outdegree(v) for every vertex v. Label each vertex v with (indegree(v), outdegree(v)). Adding some edges makes the graph balanced. (Figure: vertex labels (2, 2) (1, 1) (2, 2) (1, 1) (2, 2) (1, 1).)
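The labeling on these two slides can be sketched in a few lines of Python. This is only an illustration: the function names `degree_labels` and `is_balanced` and the three-vertex example are assumptions made here, not part of the deck, and the graph is not the one drawn in the figure.

```python
from collections import Counter

def degree_labels(edges):
    """Map each vertex to its (indegree, outdegree) label.

    `edges` is a list of directed (u, v) pairs (hypothetical input format).
    """
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1   # the edge leaves u
        indeg[v] += 1    # the edge enters v
    vertices = set(indeg) | set(outdeg)
    return {v: (indeg[v], outdeg[v]) for v in vertices}

def is_balanced(edges):
    """True when indegree(v) == outdegree(v) for every vertex v."""
    return all(i == o for i, o in degree_labels(edges).values())

# A made-up example: a path a -> b -> c is not balanced,
# but adding the edge c -> a balances it.
path = [("a", "b"), ("b", "c")]
cycle = path + [("c", "a")]
```

Just as on the slides, adding an edge (here c -> a) is what turns the unbalanced example into a balanced one.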

124 Genome Reconstruction: A Puzzle With a Billion Pieces Euler’s Theorem Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced. A graph is connected if for every pair of vertices {u, v}, an ant can travel either from u to v or from v to u. Connected + Balanced = Eulerian. (Figure: the balanced example graph with vertex labels (2, 2) (1, 1) (2, 2) (1, 1) (2, 2) (1, 1), next to a graph marked “Not Connected.”)

125 Genome Reconstruction: A Puzzle With a Billion Pieces Section 7: ECP vs. HCP and Algorithmic Complexity

126 Genome Reconstruction: A Puzzle With a Billion Pieces Solving the ECP By Euler’s Theorem, to determine whether a connected graph G contains an Eulerian cycle, we only need to check if G is balanced. So we simply go to each vertex v and perform this simple check: is indegree(v) = outdegree(v)? If every vertex is balanced, then G must contain an Eulerian cycle. If some vertex is not balanced, then G cannot contain an Eulerian cycle.
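Here is a minimal sketch of that yes/no test, written under a few assumptions of our own: the graph arrives as a list of directed (u, v) pairs, vertices with no edges are ignored, and an empty graph is treated as "no". The name `has_eulerian_cycle` is hypothetical. For the connectivity part the sketch ignores edge directions, which is enough once every vertex is known to be balanced.

```python
from collections import defaultdict

def has_eulerian_cycle(edges):
    """Connected + balanced test for a directed graph given as (u, v) pairs."""
    balance = defaultdict(int)     # outdegree minus indegree per vertex
    neighbors = defaultdict(set)   # adjacency with directions ignored
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Balanced check: every vertex must have indegree == outdegree.
    if not edges or any(b != 0 for b in balance.values()):
        return False

    # Connectivity check: a depth-first search from any vertex must
    # reach every vertex that appears in some edge.
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        for w in neighbors[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(neighbors)
```

The whole test looks at each edge only a constant number of times, which is what makes the ECP so cheap to decide.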

129 Genome Reconstruction: A Puzzle With a Billion Pieces Connected + Balanced = Eulerian Recall our example directed graph from before. Here the graph is not balanced, and so it clearly isn’t Eulerian. Adding the edges to make the graph balanced will mean that an Eulerian cycle must exist. One vital question remains: Where did this Eulerian cycle come from? (Figure: the balanced graph with the edges of an Eulerian cycle numbered 1 to 11.)

134 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes. The ant cannot walk along any edge that has been previously traversed. The ant must always walk along edges in the legal direction. At each step, we update the remaining indegree and outdegree of each vertex. Cycle! But not Eulerian yet… (Figure: remaining vertex labels (2, 2) (1, 1) (0, 0) (1, 1) (0, 0).)

138 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s cut out the cycle that the ant has found. Next delete vertices that are no longer connected to anything. (Figure: remaining vertex labels (2, 2) (1, 1).)

142 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Again, let the ant walk through the graph however it chooses. We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an edge leading out of any vertex that it enters. The only place the walk can end is the vertex where it started, which is why the ant always closes a cycle. Ant: “I really don’t see how this is going to give us an Eulerian cycle in the original graph…I knew I shouldn’t have left the house this morning!”

143 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Again, let the ant walk through the graph however it chooses. We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an edge leading out of any vertex that it enters. Cycle! But still not Eulerian… (Figure: remaining vertex labels (1, 1) (0, 0) (1, 1).)

146 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s trim out this cycle one more time. The ant is stranded, so let’s move it to a vertex. Ant: “Hmph! Dragged halfway across the screen…I guess I don’t have any say in the matter…”

151 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s trim out this cycle one more time. The ant is stranded, so let’s move it to a vertex. Now there’s only one way that the ant can walk through the graph. Cycle! And Eulerian to boot…so we have run out of edges. What do we do now? Ant: “Yes! What DO we do now?”

153 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Let’s bring back our original graph. Highlight the three cycles that the ant found (green, blue, and orange in the figure).

163 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph Start at the ant’s original position, and follow the green cycle. Cycle formed: we can continue along the blue cycle. Cycle formed; however, we now have no new edges to follow! Ant: “???” (Figure: edges traversed so far are numbered 1 to 8.)

164 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it. Ant: “Backtracking? But I’m not evolved to walk backwards!”

170 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it. Success! Now let’s follow the orange cycle. Rejoin the blue cycle… Ant: “I smell something good!” (Figure: edges numbered 1 to 10.)

171 Genome Reconstruction: A Puzzle With a Billion Pieces Making an Eulerian Cycle from a Balanced Graph To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it. Success! Now let’s follow the orange cycle. Rejoin the blue cycle… And we have the same Eulerian cycle from before! Ant: “Yay! Now can I go home please?” (Figure: edges numbered 1 to 11.)

172 Genome Reconstruction: A Puzzle With a Billion Pieces What’s the Big Deal? The great thing about this method is that it can be easily generalized to any connected balanced graph to give an Eulerian cycle. “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?” Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…

173 Genome Reconstruction: A Puzzle With a Billion Pieces What’s the Big Deal? More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer. Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!
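To make "implemented extremely efficiently" a little more concrete, here is one possible sketch of the ant's method in Python. The walk-until-stuck, backtrack, and splice idea is commonly known as Hierholzer's algorithm; the function name `eulerian_cycle`, the edge-list input, and the figure-eight example graph are assumptions made for this illustration. The sketch assumes the input is already connected and balanced and does not check it.

```python
from collections import defaultdict

def eulerian_cycle(edges):
    """Return an Eulerian cycle as a list of vertices (first == last).

    Assumes `edges` is a non-empty list of directed (u, v) pairs forming
    a connected, balanced graph; this is NOT verified here.
    """
    unused = defaultdict(list)          # unused outgoing edges per vertex
    for u, v in edges:
        unused[u].append(v)

    stack = [edges[0][0]]               # the ant's current walk
    cycle = []
    while stack:
        v = stack[-1]
        if unused[v]:
            # The ant walks along any edge it has not used yet.
            stack.append(unused[v].pop())
        else:
            # No unused edge leaves v: backtrack, recording v.
            cycle.append(stack.pop())
    cycle.reverse()
    return cycle

# A made-up balanced graph: two triangles sharing the vertex "a".
example_edges = [("a", "b"), ("b", "c"), ("c", "a"),
                 ("a", "d"), ("d", "e"), ("e", "a")]
example_cycle = eulerian_cycle(example_edges)
```

Each edge is pushed once and popped once, so the running time grows in proportion to the number of edges. The backtracking step plays the role of the ant walking back along the blue cycle until it finds a vertex with an unused edge.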

174 Genome Reconstruction: A Puzzle With a Billion Pieces What’s the Big Deal? “Yeah, but computers are supermachines! They don’t really need 300-year-old mathematics to help them solve problems. Aren’t they going to take over the world anyway?” So let’s examine the case of finding a Hamiltonian cycle…

175 Genome Reconstruction: A Puzzle With a Billion Pieces Searching for an Efficient Algorithm for HCP Key Point: No one has ever found a similar efficient test to determine whether a graph is Hamiltonian. Of course, we could examine every possible (ant) walk through the graph to solve the HCP. However, this brute force approach is just not efficient: there are more walks through a graph on just 1,000 vertices than there are atoms in the universe!
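The brute force approach can be sketched as follows, purely to show why it does not scale. The function name `hamiltonian_cycle_brute_force` and the input format (a vertex list plus directed (u, v) pairs) are assumptions for this example. It fixes the first vertex and tries every ordering of the others, so a graph on n vertices can require up to (n - 1)! attempts.

```python
from itertools import permutations

def hamiltonian_cycle_brute_force(vertices, edges):
    """Try every vertex ordering; return a Hamiltonian cycle or None."""
    vertices = list(vertices)
    if not vertices:
        return None
    edge_set = set(edges)
    first, rest = vertices[0], vertices[1:]
    for perm in permutations(rest):
        order = [first, *perm]
        n = len(order)
        # Every consecutive pair (wrapping around) must be a directed edge.
        if all((order[i], order[(i + 1) % n]) in edge_set for i in range(n)):
            return order + [first]
    return None
```

For 10 vertices that is at most 362,880 orderings, which is fine. For 1,000 vertices the count is astronomically large, which is the point the slide is making.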

176 Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems In fact, the HCP has been classified as NP-Complete. In layman’s terms, this means that the HCP belongs to a collection containing thousands of computational problems for which no one knows an algorithm that runs quickly on large input data sets. NP-Complete problems are all equivalent to each other: find an efficient solution to one, and you have an efficient solution to them all.

177 Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems Attempting to solve any NP-Complete problem is difficult. Cartoon: “I can't find an efficient algorithm, I guess I'm just too dumb.” (From Garey and Johnson, Computers and Intractability, 1979.)

178 Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems Attempting to solve any NP-Complete problem is difficult. The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist. Cartoon: “I can't find an efficient algorithm, because no such algorithm is possible.” (From Garey and Johnson, Computers and Intractability, 1979.)

179 Genome Reconstruction: A Puzzle With a Billion Pieces NP-Complete Problems Attempting to solve any NP-Complete problem is difficult. The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist. The present state of affairs is somewhere in between. Cartoon: “I can't find an efficient algorithm, but neither can all these famous people.” (From Garey and Johnson, Computers and Intractability, 1979.)

180 Genome Reconstruction: A Puzzle With a Billion Pieces The NP-Completeness of the HCP The question of whether or not NP-Complete problems (including the HCP) can be solved efficiently is one of seven Millennium Problems in mathematics. Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will get $1 million. However, if you become a mathematician, odds are that you are not in it for the $$$...recently, Grigory Perelman solved one of these problems but turned down the prize. Grigory Perelman, True Legend

181 Genome Reconstruction: A Puzzle With a Billion Pieces Section 8: From Euler and Hamilton to Fragment Assembly

182 Genome Reconstruction: A Puzzle With a Billion Pieces Simplifying Assumptions for Fragment Assembly 1. Every k-mer occurring in the genome is generated by some read. 2. Reads are error-free. 3. Every k-mer occurring in the genome occurs exactly once. 4. The underlying genome consists of a single circular-shaped chromosome. Note: In the final section, we will relax these assumptions.

194 Genome Reconstruction: A Puzzle With a Billion Pieces First Try: The Graph H Create a vertex for every read detected by our array. Vertices: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG

195 Genome Reconstruction: A Puzzle With a Billion Pieces First Try: The Graph H Create a vertex for every k-mer detected by our array. Prefix: the first k – 1 nucleotides of a k-mer (CA for the 3-mer CAA). Suffix: the last k – 1 nucleotides of a k-mer (AA for CAA). Different 3-mers may share a prefix/suffix: TG is the suffix of ATG and CTG and the prefix of TGA. Vertices: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG

196 Genome Reconstruction: A Puzzle With a Billion Pieces First Try: The Graph H As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w. Vertices: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG
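The construction of H can be sketched directly from that rule. The function name `overlap_graph` and the dictionary-of-lists representation are choices made for this example only. The ten 3-mers are the ones from the slides.

```python
def overlap_graph(kmers):
    """Build H: an edge v -> w whenever suffix(v) == prefix(w).

    The suffix of a k-mer is its last k - 1 letters (v[1:]);
    the prefix is its first k - 1 letters (w[:-1]).
    """
    return {v: [w for w in kmers if v[1:] == w[:-1]] for v in kmers}

reads = ["ATG", "CGT", "GGC", "AAT", "GTG",
         "TGG", "TGC", "CAA", "GCA", "GCG"]
H = overlap_graph(reads)
# For instance, ATG has suffix TG, so it points to TGG and TGC.
```

Note that this sketch compares every pair of k-mers, which is fine for ten reads but would be far too slow for billions. It is meant to mirror the definition, not to be efficient.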

234 Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles in H Here we have a Hamiltonian cycle in H: ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT → ATG. Overlapping consecutive 3-mers along the cycle spells ATGGCGTGCAATG. Because the genome is circular, the final ATG wraps around onto the first one, leaving ten nucleotides. Genome: ATGGCGTGCA
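Reading the genome off a Hamiltonian cycle can be sketched like this. The helper name `spell_circular_genome` is hypothetical. The idea is that each vertex on the cycle contributes exactly one new nucleotide, so taking the first letter of every k-mer (leaving out the repeated closing vertex) gives the circular genome.

```python
def spell_circular_genome(cycle):
    """Spell a circular genome from a closed cycle of k-mers.

    `cycle` lists the k-mers in order, with the first one repeated
    at the end to close the cycle.
    """
    if cycle[0] != cycle[-1]:
        raise ValueError("cycle must start and end at the same k-mer")
    for v, w in zip(cycle, cycle[1:]):
        if v[1:] != w[:-1]:
            raise ValueError(f"{v} does not overlap {w}")
    # One nucleotide per distinct vertex: the first letter of each k-mer.
    return "".join(kmer[0] for kmer in cycle[:-1])

cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG",
         "TGC", "GCA", "CAA", "AAT", "ATG"]
genome = spell_circular_genome(cycle)   # "ATGGCGTGCA"
```

Of course, this only helps once a Hamiltonian cycle has been found, and finding one is the hard part.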

235 Genome Reconstruction: A Puzzle With a Billion Pieces Problem with H Ultimately, we must solve the HCP on H in order to find a candidate DNA sequence… This idea motivated the method used for assembling the human genome from 50 million (long and expensive) reads in 2000, but the computational strain was overwhelming: sequencing the human genome took several computers a period of months, working around the clock. For that matter, newer sequencing technologies produce billions of (short and inexpensive) reads: we need a new idea.

236 Genome Reconstruction: A Puzzle With a Billion Pieces Second Try: The Graph E Form a different graph E as follows: Create a vertex for each distinct prefix/suffix from reads. Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG. Vertices: CA, GC, CG, TG, GT, GG, AT, AA.

258 Genome Reconstruction: A Puzzle With a Billion Pieces Second Try: The Graph E Form a different graph E as follows: Create a vertex for each distinct prefix/suffix from reads. Connect vertex v to vertex w with a directed edge if there is a read whose prefix is v and whose suffix is w. Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG. Vertices: CA, GC, CG, TG, GT, GG, AT, AA. Edges, each labeled by its read: AT → TG (ATG), TG → GG (TGG), GG → GC (GGC), GC → CG (GCG), CG → GT (CGT), GT → TG (GTG), TG → GC (TGC), GC → CA (GCA), CA → AA (CAA), AA → AT (AAT).
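
The two construction rules for E can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_graph_e` and the adjacency-list representation are hypothetical choices, not something the slides prescribe.

```python
def build_graph_e(reads):
    """Build graph E: one vertex per distinct prefix/suffix, one edge per read.

    Returns a dict mapping each vertex to a list of (successor, read) pairs.
    """
    graph = {}
    for read in reads:
        prefix, suffix = read[:-1], read[1:]   # the two (k-1)-mers of the read
        graph.setdefault(prefix, []).append((suffix, read))
        graph.setdefault(suffix, [])           # make sure the suffix is a vertex too
    return graph


reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
graph_e = build_graph_e(reads)

for vertex in sorted(graph_e):
    successors = [f"{w} ({read})" for w, read in graph_e[vertex]]
    print(vertex, "->", ", ".join(successors))
```

For the ten reads of the running example this yields eight vertices and ten edges, one edge per read.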

270 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles in E We have an Eulerian cycle in E. Its edges, in the order they are traversed: 1. ATG, 2. TGG, 3. GGC, 4. GCG, 5. CGT, 6. GTG, 7. TGC, 8. GCA, 9. CAA, 10. AAT. That is: ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT.

282 Genome Reconstruction: A Puzzle With a Billion Pieces Eulerian Cycles in E We have an Eulerian cycle in E: ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT. This is the same sequence of 3-mers that we had in H! Thus we will obtain the same sequenced genome as before. Genome: ATGGCGTGCA
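
One way to find such a cycle automatically is Hierholzer's algorithm, a standard method that visits each edge once. The slides do not name a specific algorithm, so treat the following as a minimal sketch; the function names `eulerian_cycle` and `spell_circular_genome` are hypothetical.

```python
def eulerian_cycle(reads, start):
    """Hierholzer's algorithm on graph E (one edge prefix -> suffix per read).

    Returns the cycle as a list of vertices whose first and last entries coincide.
    Assumes the graph is balanced and connected.
    """
    out = {}
    for read in reads:
        out.setdefault(read[:-1], []).append(read[1:])

    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if out.get(v):                 # an unused edge leaves v: follow it
            stack.append(out[v].pop())
        else:                          # v is exhausted: emit it
            cycle.append(stack.pop())
    return cycle[::-1]


def spell_circular_genome(cycle):
    """Each edge contributes one letter; the closing vertex is dropped."""
    return "".join(v[0] for v in cycle[:-1])


reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
cycle = eulerian_cycle(reads, "AT")
genome = spell_circular_genome(cycle)
print(" -> ".join(cycle))
print(genome)
```

Because E can have more than one Eulerian cycle, the genome printed depends on the order in which edges are stored; any result is a circular string whose 3-mers are exactly the reads.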

285 Genome Reconstruction: A Puzzle With a Billion Pieces Analysis of E Good News: We now only have to find an Eulerian cycle in the graph E, which could be done on this computer. Bad News: 1. There may be more than one Eulerian cycle in E. We won’t discuss this issue here, but it can be resolved. 2. How do we know that E even has an Eulerian cycle? By Euler’s Theorem, we only need to show that E is a balanced graph. To do this, we need one more piece of mathematical history…

286 Genome Reconstruction: A Puzzle With a Billion Pieces Section 9: De Bruijn and Fragment Assembly

287 Genome Reconstruction: A Puzzle With a Billion Pieces De Bruijn’s Question 1946: The Dutch mathematician Nicolaas de Bruijn asks: can we design a circular superstring of minimal length that contains every binary string of length k? Example for k = 3: the circular superstring ‘00011101’ contains all eight binary strings of length 3. We illustrate the locations of ‘000’ and ‘110’ on the string.
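
The k = 3 example can be checked mechanically. The sketch below reads the superstring circularly and compares the substrings found against all binary strings of length 3; the helper name `circular_kmers` is a hypothetical choice.

```python
from itertools import product


def circular_kmers(s, k):
    """All k-mers of s read circularly, one per starting position."""
    wrapped = s + s[:k - 1]            # append the wrap-around letters
    return [wrapped[i:i + k] for i in range(len(s))]


superstring = "00011101"
k = 3
found = circular_kmers(superstring, k)
all_binary = {"".join(bits) for bits in product("01", repeat=k)}

print(found)
print("covers every binary 3-mer exactly once:",
      set(found) == all_binary and len(found) == len(all_binary))
```

Eight positions and eight distinct 3-mers mean no shorter circular string could contain them all, which is why this superstring has minimal length.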

288 Genome Reconstruction: A Puzzle With a Billion Pieces De Bruijn’s Question De Bruijn introduced a special class of graph B(n, k): Vertices = all n^(k – 1) possible (k – 1)-mers in an n-letter alphabet. An edge connects v to w if there is a k-mer whose prefix = v and whose suffix = w. At right is B(2, 4), assuming that our alphabet contains 0 and 1.

289 Genome Reconstruction: A Puzzle With a Billion Pieces De Bruijn’s Question For any choice of n and k, B(n, k) must be balanced/Eulerian. Why? Because both the indegree and the outdegree of every vertex are equal to the size of the alphabet (n): every (k – 1)-mer occurs as the prefix of n different k-mers and as the suffix of n different k-mers. Red numbers show the order of edges in an Eulerian cycle.
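
The degree argument can be verified by brute force for small cases. This sketch enumerates every k-mer over an alphabet, treats it as an edge from its prefix to its suffix, and counts degrees; the function name `de_bruijn_degrees` is a hypothetical choice.

```python
from collections import Counter
from itertools import product


def de_bruijn_degrees(alphabet, k):
    """Indegree and outdegree of every vertex of B(n, k), n = len(alphabet)."""
    indegree, outdegree = Counter(), Counter()
    for letters in product(alphabet, repeat=k):
        kmer = "".join(letters)
        outdegree[kmer[:-1]] += 1      # the edge leaves the prefix
        indegree[kmer[1:]] += 1        # and enters the suffix
    return indegree, outdegree


# B(2, 4): binary alphabet, 3-mer vertices, 4-mer edges.
indegree, outdegree = de_bruijn_degrees("01", 4)
for vertex in sorted(outdegree):
    print(vertex, "in:", indegree[vertex], "out:", outdegree[vertex])

# B(4, 3): the nucleotide alphabet, as used for graph E with 3-mer reads.
dna_in, dna_out = de_bruijn_degrees("ACGT", 3)
print(all(dna_in[v] == dna_out[v] == 4 for v in dna_out))
```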

290 Genome Reconstruction: A Puzzle With a Billion Pieces De Bruijn’s Question The graph E we have constructed is contained in the graph B(4, k). We have n = 4 since there are four possible nucleotides. E must be balanced/Eulerian too! The indegree and outdegree of any (k – 1)-mer vertex both equal the number of times this (k – 1)-mer appears in the genome. Eulerian cycle (edges in order): 1. ATG, 2. TGG, 3. GGC, 4. GCG, 5. CGT, 6. GTG, 7. TGC, 8. GCA, 9. CAA, 10. AAT. Genome: ATGGCGTGCA
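
The claim about degrees can be checked directly on the running example. The sketch below treats the genome ATGGCGTGCA as circular, derives the edges of E from its 3-mers, and compares each vertex's indegree and outdegree with the number of times that 2-mer occurs in the genome. Variable names are illustrative only.

```python
from collections import Counter

genome = "ATGGCGTGCA"                  # circular genome from the running example
k = 3

# Every k-mer of the circular genome is one edge of E.
wrapped = genome + genome[:k - 1]
kmers = [wrapped[i:i + k] for i in range(len(genome))]

outdegree = Counter(kmer[:-1] for kmer in kmers)   # edges leaving each prefix
indegree = Counter(kmer[1:] for kmer in kmers)     # edges entering each suffix

# Occurrences of every (k-1)-mer in the circular genome.
wrapped_short = genome + genome[:k - 2]
occurrences = Counter(wrapped_short[i:i + k - 1] for i in range(len(genome)))

for vertex in sorted(occurrences):
    print(vertex, "in:", indegree[vertex], "out:", outdegree[vertex],
          "occurrences:", occurrences[vertex])
```

Vertices TG and GC each occur twice in the genome and so have two incoming and two outgoing edges; all other vertices have one of each.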

301 Genome Reconstruction: A Puzzle With a Billion Pieces Section 10: Generalizing Fragment Assembly

302 Genome Reconstruction: A Puzzle With a Billion Pieces Simplifying Assumptions for Fragment Assembly Recall the assumptions we have already made: 1. Every k-mer occurring in the genome is generated by some read. 2. Reads are error-free. 3. Every k-mer occurring in the genome occurs exactly once. 4. The underlying genome consists of a single circular chromosome. Our aim is to relax each of these assumptions and determine how the problem changes.

303 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 1: Generating (nearly) all k-mers The 100-nucleotide reads generated by Illumina sequencing technology capture only a small fraction of the 100-mers from the genome (even for high-coverage sequencing projects), thus violating this key assumption of the de Bruijn graph approach. However, if we break these reads into shorter k-mers, the resulting k-mers often represent nearly all k-mers from the genome for sufficiently small k. For example, modern assemblers often break every 100-nucleotide read into 46 overlapping 55-mers and then assemble the resulting 55-mers using de Bruijn graphs.
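
The count of 46 follows from the fact that a read of length L contains L – k + 1 overlapping k-mers. A minimal sketch, in which the function name `split_into_kmers` and the synthetic read are made up for illustration:

```python
def split_into_kmers(read, k):
    """All overlapping k-mers of a read, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGT" * 25                     # a made-up 100-nucleotide read
kmers = split_into_kmers(read, 55)

print(len(read), "nucleotides ->", len(kmers), "overlapping 55-mers")  # 100 - 55 + 1 = 46
```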

304 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 1: Generating (nearly) all k-mers Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6: ATGCAA, CAAGCT, CTAGCT, and CTATGC (the last wraps around the ends of the circular genome). We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.
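
This example can be checked in code. One assumption is made loudly here: the fourth read is taken to be CTATGC, read across the ends of the circular genome, which is an interpretation of the slide's figure (drawn there as ATGC at the start and CT at the end). The helper names are hypothetical.

```python
def kmers_of(seq, k):
    """Set of all k-mers of a linear string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def circular_kmers(genome, k):
    """Set of all k-mers of a circular genome."""
    return kmers_of(genome + genome[:k - 1], k)


genome = "ATGCAAGCTAGCT"
# Fourth read is an assumption: it spans the wrap-around of the circular genome.
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]

missing_6mers = circular_kmers(genome, 6) - set(reads)
read_3mers = set().union(*(kmers_of(read, 3) for read in reads))
missing_3mers = circular_kmers(genome, 3) - read_3mers

print("6-mers not seen as reads:", sorted(missing_6mers))
print("3-mers not seen in reads:", sorted(missing_3mers))
```

Most 6-mers of the genome never appear as reads, yet the set of missing 3-mers is empty, which is the point of splitting reads into shorter pieces.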

318 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 2: Handling Errors in Reads What happens to the graph E when some reads have errors? Example: Say our graph E for genome ATGGCGTGCAATG should look like this. If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead. This is called a bulge in the graph E.
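
To see where the bulge comes from, compare the paths that the correct and the erroneous read trace through E when both are split into 3-mers. This is a small sketch with hypothetical helper names; it does not implement bulge removal.

```python
def kmers_of(seq, k):
    """Overlapping k-mers of a read, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def path_of(seq, k):
    """Vertices of E, which are (k-1)-mers, visited by the read's k-mers."""
    return kmers_of(seq, k - 1)


correct, erroneous = "TGGCGTG", "TGGAGTG"   # a single C -> A substitution
k = 3

good_edges = kmers_of(correct, k)
bad_edges = kmers_of(erroneous, k)
spurious = [kmer for kmer in bad_edges if kmer not in good_edges]

print("correct path:  ", " -> ".join(path_of(correct, k)))
print("erroneous path:", " -> ".join(path_of(erroneous, k)))
print("spurious 3-mers:", spurious)
```

Both paths leave vertex GG and rejoin at vertex GT, but pass through different intermediate vertices. A single substitution changes at most k of a read's k-mers, so the detour is short, and those two parallel routes form the bulge.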

320 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 2: Handling Errors in Reads Most reads have errors, resulting in millions of bulges in E. 2004: Pevzner et al. provide an algorithm for bulge removal.

321 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers The genome ACGTACGT has only four distinct 3-mers: ACG, CGT, GTA, and TAC. We would obtain the graph E below and reconstruct this genome as ACGT. In other words, we can’t represent repeated k-mers in the genome! Graph E: vertices AC, CG, GT, TA; edges ACG, CGT, GTA, TAC.
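
The loss of information is easy to see by counting. The sketch below lists the 3-mers of ACGTACGT, treated as circular in line with the earlier assumptions, and shows that each distinct 3-mer occurs twice, so a graph with one edge per distinct 3-mer has only four edges.

```python
from collections import Counter

genome = "ACGTACGT"
k = 3

wrapped = genome + genome[:k - 1]      # read the genome circularly
counts = Counter(wrapped[i:i + k] for i in range(len(genome)))

print("distinct 3-mers:", sorted(counts))
print("occurrences:   ", dict(counts))
print("edges in E:", len(counts), "but genome length:", len(genome))
```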

322 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Define the multiplicity of a k-mer as the number of times it occurs in a genome. We will add edges to E in order to form a new graph E* for which the number of edges connecting two vertices represents the multiplicity of the k-mer on that edge. An Eulerian cycle in E* still gives a candidate genome.

323 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA

324 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA

325 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA

326 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA

340 Genome Reconstruction: A Puzzle With a Billion Pieces Determining k-mer multiplicities How can we find the multiplicity of a k-mer in the genome? The multiplicity of a k-mer is roughly proportional to the frequency with which that k-mer occurs in our reads. So a k-mer that appears 5 times in the genome is expected to occur about 5 times as often in our reads as a k-mer that appears once. [Figure: the graph E*]
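One way to turn this proportionality into numbers is sketched below. It is only an illustration under strong assumptions: reads are error-free, coverage is uniform, and most k-mers occur once in the genome, so the median read count approximates the count of a single-copy k-mer. The function names and the counts in `example_counts` are hypothetical, not data from the presentation.

```python
from collections import Counter
from statistics import median

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_multiplicities(counts):
    """Estimate genome multiplicity as read count divided by the typical count.

    Assumption: the median count corresponds to a k-mer of multiplicity 1.
    """
    unit = median(counts.values())
    return {kmer: max(1, round(c / unit)) for kmer, c in counts.items()}

# Hypothetical read counts, chosen only to illustrate the rounding step.
example_counts = {"ATG": 10, "TGG": 11, "GGC": 9, "GCG": 21,
                  "CGT": 19, "AAT": 10, "GCA": 10}
estimated = estimate_multiplicities(example_counts)
toy_counts = count_kmers(["ATGCG", "TGCGT"], 3)
```

With real data the single-copy count would more likely be read off the main peak of the k-mer count histogram; the median is used here only to keep the sketch short.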

341 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 4: From Circular to Linear Genomes The genomes for all complex organisms are split across a number of linear chromosomes (46 in humans). So in order to sequence the human genome, geneticists simply sequenced all of these linear chromosomes. Question: How do we sequence a linear segment of DNA?

343 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 4: From Circular to Linear Genomes Say our linear DNA segment is ATGCGTGGCGTGCA. Then the 3-mers from this segment are the same as for the circular segment before, but the segment doesn’t “wrap around,” so we will lose two 3-mers: CAA and AAT.
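The two lost 3-mers can be recovered by comparing the linear and circular 3-mer collections directly. This is a small sketch with hypothetical helper names (`linear_kmers`, `circular_kmers`); the circular version simply appends the first k-1 letters to the end before sliding the window.

```python
from collections import Counter

def linear_kmers(text, k):
    """Multiset of k-mers read left to right without wrapping."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def circular_kmers(text, k):
    """Multiset of k-mers of a circular string: wrap the first k-1 letters."""
    return linear_kmers(text + text[:k - 1], k)

segment = "ATGCGTGGCGTGCA"
# Counter subtraction keeps only k-mers that the circular reading has in excess.
lost = circular_kmers(segment, 3) - linear_kmers(segment, 3)
```

For this segment the circular reading yields 14 3-mers and the linear reading 12.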

344 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 4: From Circular to Linear Genomes Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*. [Figure: graph E* with vertices CA, GC, CG, TG, GT, GG, AT, AA and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT]

348 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 4: From Circular to Linear Genomes Having deleted the edges CAA and AAT from E*, we get rid of the vertex AA as well, since no remaining edge touches it. So to sequence our segment ATGCGTGGCGTGCA, we need to find a path through E* that starts at AT, ends at CA, and uses every edge exactly once. [Figure: graph E* with vertices CA, GC, CG, TG, GT, GG, AT and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA]

349 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 4: From Circular to Linear Genomes An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once. So an Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same vertex. Luckily, Euler’s Theorem generalizes, so we can efficiently determine whether a graph has an Eulerian path and then find this path. Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced or exactly two vertices are not balanced, one with one more outgoing than incoming edge (the start of the path) and the other with one more incoming than outgoing edge (the end of the path).
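The condition in Euler’s Theorem II is cheap to test. The sketch below is one possible check, not code from the presentation. It uses the standard precise form of the degree condition (one vertex with outdegree minus indegree equal to 1, one with the difference equal to -1) and treats "connected" as connected when edge directions are ignored. The function name `has_eulerian_path` is hypothetical.

```python
from collections import Counter, defaultdict

def has_eulerian_path(edges):
    """edges: list of (u, v) pairs of a directed multigraph."""
    if not edges:
        return True
    indeg, outdeg = Counter(), Counter()
    neighbors = defaultdict(set)  # undirected adjacency, for connectivity only
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Degree condition: all balanced, or exactly one +1 vertex and one -1 vertex.
    diffs = Counter(outdeg[v] - indeg[v] for v in neighbors)
    unbalanced = sum(c for d, c in diffs.items() if d != 0)
    degree_ok = unbalanced == 0 or (
        unbalanced == 2 and diffs[1] == 1 and diffs[-1] == 1)
    # Connectivity: depth-first search over the undirected adjacency.
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        for w in neighbors[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return degree_ok and len(seen) == len(neighbors)

segment = "ATGCGTGGCGTGCA"
linear_edges = [(segment[i:i + 2], segment[i + 1:i + 3])
                for i in range(len(segment) - 2)]
segment_ok = has_eulerian_path(linear_edges)
# Two unbalanced vertices, but each is off by 2, so no Eulerian path exists.
double_edge_ok = has_eulerian_path([("A", "B"), ("A", "B")])
```

The second example shows why the size of the imbalance matters: having exactly two unbalanced vertices is not enough on its own.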

350 Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 4: From Circular to Linear Genomes By Euler’s Theorem II, E* must contain an Eulerian path: it is connected, and AT and CA (the endpoints of our segment) are the only two vertices that aren’t balanced, with AT having one extra outgoing edge and CA one extra incoming edge. Hence in every case we have solved our giant puzzle! [Figure: graph E* with vertices CA, GC, CG, TG, GT, GG, AT and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA]
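To make the last step concrete, here is a minimal sketch that finds an Eulerian path and spells out the sequence it represents. It uses Hierholzer’s algorithm, which is a standard choice but is not named in the slides, and it assumes the graph already satisfies Euler’s Theorem II. The names `eulerian_path` and `spell` are hypothetical.

```python
from collections import Counter, defaultdict

def eulerian_path(edges):
    """Return one Eulerian path as a list of vertices, assuming one exists."""
    graph = defaultdict(list)
    balance = Counter()  # outdegree minus indegree for each vertex
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: vertex is finished
    return path[::-1]

def spell(path):
    """Glue overlapping (k-1)-mers along a path back into a string."""
    return path[0] + "".join(v[-1] for v in path[1:])

segment = "ATGCGTGGCGTGCA"
kmers = [segment[i:i + 3] for i in range(len(segment) - 2)]
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
genome = spell(eulerian_path(edges))
```

The reconstructed string starts with AT, ends with CA, and has exactly the same 3-mers as the original segment, but it need not equal the segment itself: this graph has more than one Eulerian path (ATGGCGTGCGTGCA is another valid spelling), so the 3-mers alone do not always determine the sequence uniquely.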

351 Genome Reconstruction: A Puzzle With a Billion Pieces What’s Next?

352 Genome Reconstruction: A Puzzle With a Billion Pieces Personal Genomics: Millions of Human Genomes Personal genome sequencing began with the genomes of a few scientists in 2009 and will soon expand to millions of individuals. Thousands of cancer genomes have already been sequenced, and genome sequencing will soon become a routine technique in medicine. At the heart of this revolution are bioinformaticians, who must harness precise methods in order to analyze the growing data. [Figure: 10 scientists and entrepreneurs who made their genomes publicly available in 2009]

353 Genome Reconstruction: A Puzzle With a Billion Pieces Genome 10K and Beyond 2010: Scientists launch an ambitious project to sequence the genomes of 10,000 species. 201x?: We will hopefully be able to reconstruct the “tree of life” and uncover the genomes of ancestors that lived millions of years ago. 20xx?: Maybe, just maybe, we will be able to discover why giraffes grew necks and humans grew brains.

