Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau & Pavel Pevzner, University.


Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau & Pavel Pevzner, University of California-San Diego

Outline
1. Introduction to Genome Sequencing
2. The Newspaper Problem
3. DNA Chips: A First Shot at Sequencing with Short Reads
4. Two Mathematical Detours
5. Introduction to Graph Theory
6. Euler’s Theorem
7. ECP vs. HCP and Algorithmic Complexity
8. From Euler and Hamilton to Fragment Assembly
9. De Bruijn and a Final Solution to Fragment Assembly
10. Generalizing Fragment Assembly

Section 1: Introduction to Genome Sequencing

What Is Genome Sequencing? A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A, T, G, and C. A human genome has roughly 3 billion nucleotides. Genome sequencing is the process of determining the sequence of nucleotides that make up a genome.
...CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGA
TCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACA
GATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATG
ATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGA
TGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTAT
GACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGA
CTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACT
GCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGA
CTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTC
AGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…

What Is Genome Sequencing? Different people have slightly different genomes, yet any two humans share about 99.9% of their genetic code. The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc. Two example sequences that differ at only a few positions:
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA TCGACTACAGATAAAACATGCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA TCGACTACAGATGAAACATGCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
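Comparing two genomes position by position can be sketched in a few lines of Python. This is only an illustration: the function name `mismatch_positions` and the variable names are assumptions, and the two strings are one 40-nucleotide stretch copied from each of the two sequences on the slide.

```python
def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# One 40-nucleotide stretch from each of the two slide sequences (illustrative names).
person_1 = "TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT"
person_2 = "TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT"

diffs = mismatch_positions(person_1, person_2)
print(diffs)

# Scale of the 0.1% figure: 0.1% of roughly 3 billion nucleotides.
expected_differences = 3_000_000_000 // 1000
print(expected_differences)
```

At the scale of a whole human genome, the 0.1% figure works out to roughly 3 million differing positions between two people.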

Species Sequencing vs. Individual Genome Sequencing
Species sequencing: Determine the “consensus genome” of an entire species.
Individual sequencing: Determine how an individual differs from its species.

Why Would We Want to Sequence a Genome? Species genome sequencing:
Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for brain development).
Reveal evolutionary relationships between species.
Determine the genetic makeup of our evolutionary ancestors.

Why Would We Want to Sequence a Genome? Individual genome sequencing:
Unearth the genetic basis of many diseases.
Forensics applications.
Example: In 2010, 6-year-old Nicholas Volker became the first human being to be saved because of genome sequencing. Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries. Genome sequencing revealed a rare mutation in a gene linked to a defect in his immune system. This led doctors to use advanced immunotherapy, which saved the child.

Brief History of Genome Sequencing
Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods.
1980: They share the Nobel Prize in Chemistry. Still, their sequencing methods were too expensive for large genomes: at a cost of $1 per nucleotide, it would cost $3 billion to sequence the human genome.
1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome.
1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.
Pictured: Walter Gilbert, Frederick Sanger, Francis Collins, Craig Venter.

Brief History of Mammalian Genome Sequencing
2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.
2000s: Many more mammalian genomes are sequenced.

The Arrival of Personal Genomics
2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude.
2010: The market for sequencing machines takes off.
Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000.
Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.
Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.
23andMe offers partial genome sequencing for $499.
Many universities introduce new courses in which students study their own genomes.

The Future of Genome Sequencing. 2010s?: Genome sequencing will hopefully continue to bloom. The $1,000 human genome may arrive as early as in … Hopefully, sequencing an individual genome will soon become as routine as an X-ray.

What Makes Genome Sequencing So Difficult? When we read a book, we can read the entire book one letter at a time from beginning to end. Modern sequencing machines, however, cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces. Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads, but we have no idea which genomic positions these reads come from! We must figure out how to put the reads back together to assemble a genome.
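The loss of position information can be mimicked with a small simulation. The sketch below is a toy model, not a description of any real sequencing machine: the function `generate_reads`, its parameters, and the short example genome are all assumptions chosen for illustration.

```python
import random


def generate_reads(genome: str, read_length: int, num_reads: int, seed: int = 0) -> list[str]:
    """Sample fixed-length substrings at random positions and forget where they came from."""
    last_start = len(genome) - read_length
    if last_start < 0:
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)  # seeded so the toy run is repeatable
    reads = [genome[s:s + read_length]
             for s in (rng.randint(0, last_start) for _ in range(num_reads))]
    rng.shuffle(reads)  # the order of the reads carries no positional information
    return reads


# A short made-up genome; real reads are ~100 nucleotides from billions.
genome = "GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC"
reads = generate_reads(genome, read_length=10, num_reads=25, seed=1)
for r in reads[:5]:
    print(r)
```

Each read is a true substring of the genome, but nothing in the output says where it belongs; recovering that is the assembly problem.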

Section 2: The Newspaper Problem and Genome Sequencing

The Newspaper Problem

The Newspaper Problem as an “Overlap Puzzle.” The newspaper problem is not the same as a jigsaw puzzle: we have multiple copies of the same edition of a newspaper, and some pieces of paper were blown to bits in the explosion. Instead of fitting edges together, we must use overlapping shreds of paper to reconstruct what the newspaper said. This gives us a giant overlap puzzle!

Sequencing Is Harder than the Newspaper Problem. In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper). However, the “language” of DNA remains largely unknown.

Sequencing Is Harder than the Newspaper Problem. There are lots of repeated substrings in every genome (about 50% of the human genome is formed by repeats). Example: GCTT occurs 5 times in the following string: AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG. Analogy: The Triazzle puzzle contains lots of repeated figures, which makes it very difficult to solve (even with just 16 pieces).
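Repeats like this can be located with a direct scan. The sketch below is a minimal illustration (the helper name `find_occurrences` is an assumption); it counts overlapping matches too, which matters for repeats in DNA.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return every 0-based start position of pattern in text, overlaps included."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


text = "AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG"
positions = find_occurrences(text, "GCTT")
print(len(positions), positions)
```

For the slide's example string, the scan reports five start positions: 2, 11, 20, 24, and 29.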

Sequencing a Genome: Lab + Computation
Read Generation (Experimental): Generate many reads from multiple copies of the same genome.
Fragment Assembly (Computational): Use these reads to algorithmically put the genome back together.

Sequencing a Genome: Illustration. Multiple (unsequenced) genome copies → read generation → reads → fragment assembly → sequenced genome: … GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC …

Section 3: DNA Chips: A First Shot at Sequencing with Short Reads

DNA Chips: From an Idea to a New Industry
1989: Radoje Drmanac, Andrey Mirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation.
Key idea: Generate all k-mers from the genome in the hope that they can be assembled to reconstruct the genome. A k-mer is a string of length k over the alphabet of 4 nucleotides.
1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.”
2000: Arrays are a multi-billion dollar industry.
Pictured: Southern, Mirzabekov, Drmanac.

DNA Chips: Implementation
1. Synthesize a distinct k-mer in each of the 4^k cells in the array.
2. Cover the array with multiple copies of a fluorescently labeled unknown DNA fragment.
3. DNA will hybridize with a k-mer if it contains the complement of that k-mer.
4. Use a spectroscope to determine which sites emit light. The complements of these sites reveal the k-mers in the unknown DNA fragment: these are our reads!
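The four steps above can be mimicked in software. The following is a rough sketch under simplifying assumptions: hybridization is modeled as an exact match between a site and the reverse complement of some k-mer in the fragment, with no experimental noise. The function names and the toy fragment `ATGCA` are illustrative, not from the slides.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def all_kmers(k: int) -> list[str]:
    """Step 1: one array cell per k-mer, 4**k cells in total."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def reverse_complement(s: str) -> str:
    """Complement each nucleotide and read the result in the opposite direction."""
    return "".join(COMPLEMENT[b] for b in reversed(s))


def lit_sites(fragment: str, k: int) -> list[str]:
    """Steps 2-4: a site lights up when the fragment contains its complement."""
    kmers_in_fragment = {fragment[i:i + k] for i in range(len(fragment) - k + 1)}
    return [site for site in all_kmers(k)
            if reverse_complement(site) in kmers_in_fragment]


sites = lit_sites("ATGCA", 3)  # toy fragment, chosen only for illustration
reads = [reverse_complement(s) for s in sites]
print(len(all_kmers(3)), sites, reads)
```

Note that the chip reports only which k-mers are present, not how many times each occurs or in what order.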

DNA Chips: Illustration

DNA Chips: Example. The array holds all 64 possible 3-mers. What are our reads?
AAA AGA CAA CGA GAA GGA TAA TGA
AAC AGC CAC CGC GAC GGC TAC TGC
AAG AGG CAG CGG GAG GGG TAG TGG
AAT AGT CAT CGT GAT GGT TAT TGT
ACA ATA CCA CTA GCA GTA TCA TTA
ACC ATC CCC CTC GCC GTC TCC TTC
ACG ATG CCG CTG GCG GTG TCG TTG
ACT ATT CCT CTT GCT GTT TCT TTT

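DNA Chips: Example. What are our reads? The lit sites on the array are CAC, CGC, TGC, CAT, CCA, GCA, GCC, ACG, TTG, and ATT. Site CAT lights up because it pairs with its complement in the DNA fragment; read in the forward direction, that complement is ATG. So the 3-mer ATG must occur in the genome!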

Red 3-mers Must Occur in the Genome. What are our reads? Replacing each lit site with its complement gives: CAC → GTG, CGC → GCG, CAT → ATG, TGC → GCA, ACG → CGT, ATT → AAT, CCA → TGG, GCA → TGC, GCC → GGC, TTG → CAA. The reads are therefore GTG, GCG, GCA, ATG, TGG, TGC, GGC, CGT, CAA, and AAT.
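The site-to-read conversion on this slide is the reverse complement of each lit 3-mer, which can be checked mechanically. A minimal sketch (the names `lit` and `reads` are illustrative):

```python
def reverse_complement(kmer: str) -> str:
    """Swap A<->T and C<->G, then reverse the string."""
    return kmer.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Lit sites in the order they are listed on the slide.
lit = ["CAC", "CGC", "CAT", "TGC", "ACG", "ATT", "CCA", "GCA", "GCC", "TTG"]
reads = [reverse_complement(site) for site in lit]
for site, read in zip(lit, reads):
    print(site, "->", read)
```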

From Biological Data to Computational Problem. Our reads: GTG, GCG, GCA, ATG, TGG, TGC, GGC, CGT, CAA, AAT. Aim: Construct a shortest possible genome containing all our reads. This is now a computational problem!
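To make the aim concrete, the sketch below checks a candidate genome against the reads and computes a simple lower bound on its length: a string of length n has n − k + 1 windows of length k, so covering d distinct k-mers requires n ≥ d + k − 1. The candidate string is hand-built for this illustration and is an assumption, not an answer taken from the slides; other strings of the same length may work as well.

```python
reads = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]


def contains_all(genome: str, reads: list[str]) -> bool:
    """True if every read occurs somewhere in the genome."""
    return all(read in genome for read in reads)


def min_possible_length(reads: list[str]) -> int:
    """Lower bound: d distinct k-mers need at least d + k - 1 characters."""
    k = len(reads[0])
    return len(set(reads)) + k - 1


candidate = "ATGGCGTGCAAT"  # hypothetical candidate, built by hand
print(contains_all(candidate, reads), len(candidate), min_possible_length(reads))
```

Because the candidate contains every read and meets the lower bound, no shorter string can do the job for this read set. Finding such a string automatically, at genome scale, is the hard part.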

Section 4: Two Mathematical Detours

The Bridges of Königsberg. The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks. They wondered if they could walk through the city, cross each bridge exactly once, and return to where they started.
1735: Leonhard Euler develops an approach to answer this question for any city, even for a “city” with a million islands. We will soon discuss Euler’s method as well as how it applies to genome sequencing.

The Icosian Game. Over a century passes… 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 “islands” connected by “bridges.” Goal: find a walk that visits every island exactly once and returns to where it started.

Similar Problems with Very Different Fates. These two stories have something in common:
Find a walk that uses every bridge once (Königsberg Bridge Problem).
Find a walk that visits every island once (Hamilton’s game).
However, while Euler solved the first problem (even for a city with a million bridges), mathematicians still do not know an efficient way to solve the second problem, even for a city with a thousand islands. But where are the genomes???

Section 5: Introduction to Graph Theory

Graphs. A graph is a network composed of two sets of objects:
Vertices: each vertex is represented by a point.
Edges: each edge is represented by a segment connecting two vertices.
Graph theory can be applied to all kinds of different problems: transportation networks, disease epidemics, computer viruses spreading through the internet, and, yes…genome sequencing!
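In code, a graph of this kind is often stored as a list of edges plus an adjacency map built from it. The sketch below shows one such representation; the helper name `build_adjacency` and the four-vertex example are assumptions made for illustration.

```python
from collections import defaultdict


def build_adjacency(edges: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each vertex to the list of its neighbors in an undirected graph."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)  # each edge is recorded in both directions
        adjacency[v].append(u)
    return dict(adjacency)


# A small made-up graph: a triangle a-b-c with one extra vertex d attached to c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
adjacency = build_adjacency(edges)
degrees = {vertex: len(neighbors) for vertex, neighbors in adjacency.items()}
print(adjacency)
print(degrees)
```

Keeping neighbors in a list rather than a set allows parallel edges, which matters for a city where two land masses are joined by more than one bridge.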

Königsberg Bridges Graph. For the Königsberg Bridge Problem, we create a graph:
Vertices = the 4 land masses of the city
Edges = the 7 bridges connecting the land masses
Note: We don’t need to worry about the exact placement of vertices or the shape of bridges.

Icosian Game Graph. For the Icosian Game, we create a graph:
Vertices = islands
Edges = bridges connecting the islands

Eulerian and Hamiltonian Cycles. Now, consider an ant standing on a vertex of a graph G. The ant can walk from vertex to vertex along the edges of G. If the ant returns to where it started, its walk forms a cycle of G.

Eulerian and Hamiltonian Cycles. Two questions:
1. Is there a cycle of G in which the ant walks through each edge exactly once? (Eulerian cycle)
2. Is there a cycle of G in which the ant walks through each vertex exactly once? (Hamiltonian cycle)
The ant: “I wish someone would name a cycle after me…I’m the one doing all the walking here!”

Eulerian Cycles. An Eulerian cycle is a cycle that traverses each edge exactly once. A graph containing such a cycle is called Eulerian. If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in the Königsberg graph. However, no such cycle exists. If we add two more edges, there will be such a cycle; see it?
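Whether an Eulerian cycle exists can be tested without searching for one, using the standard criterion that a connected graph has an Eulerian cycle exactly when every vertex has even degree. The sketch below applies it to the Königsberg graph. The vertex names and the usual historical bridge layout are assumptions here, and so is the particular pair of added edges, since the transcript does not say which two the slide adds.

```python
from collections import Counter


def degrees(edges):
    """Count how many edge endpoints touch each vertex (parallel edges allowed)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_connected(edges):
    """Depth-first search over the vertices that appear in at least one edge."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)


def has_eulerian_cycle(edges):
    """Connected and every degree even."""
    return is_connected(edges) and all(d % 2 == 0 for d in degrees(edges).values())


# Seven bridges between four land masses (hypothetical vertex names).
konigsberg = [("north", "island"), ("north", "island"),
              ("south", "island"), ("south", "island"),
              ("north", "east"), ("south", "east"), ("island", "east")]
# One possible choice of two extra bridges that makes every degree even.
extended = konigsberg + [("north", "south"), ("island", "east")]

print(dict(degrees(konigsberg)), has_eulerian_cycle(konigsberg))
print(dict(degrees(extended)), has_eulerian_cycle(extended))
```

In the original graph all four land masses have odd degree, which is why no walk can cross every bridge once and return home.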

Hamiltonian Cycles
A Hamiltonian cycle in a graph is a cycle that visits each vertex exactly once. A graph containing such a cycle is called Hamiltonian. For example, the graph corresponding to the Icosian game is Hamiltonian. This means that the Icosian game has a solution!

Finding Eulerian Cycles vs. Hamiltonian Cycles
Given a graph G, we now have two questions that we can program a computer to answer about G.
Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.
Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.

Section 6: Euler’s Theorem

Euler’s Theorem
We will now discuss how Euler solved the Königsberg Bridge Problem. You might guess: He used graph theory! This is not entirely accurate. A better statement would be: He invented graph theory!

Directed Graphs
Directed graph: A graph in which each edge has a direction (represented by an arrow). You might like to think of directed edges as “one-way bridges.”
[Figure: an undirected graph beside a directed graph.]

Eulerian Cycles in Directed Graphs
An Eulerian cycle in a directed graph is a cycle that traverses every edge exactly once, always in the edge’s direction. A directed graph is Eulerian if it contains an Eulerian cycle. Is this graph Eulerian? Why?

Balanced Graphs
indegree(v) = the number of edges leading into vertex v.
outdegree(v) = the number of edges leading out of vertex v.
A graph is balanced if indegree(v) = outdegree(v) for every vertex v.
Label each vertex v with the pair (indegree(v), outdegree(v)).
[Figure: the example directed graph with vertex labels (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1).] This graph isn’t balanced, since some vertices don’t have equal indegree and outdegree.
[Figure: the same graph with extra edges; every vertex is now labeled (1, 1) or (2, 2).] Adding some edges makes the graph balanced.
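The labeling step can be sketched in a few lines of Python. This is only an illustration: it assumes the graph is given as a list of directed (u, v) edge pairs, and the function names and the tiny three-vertex example are hypothetical, not the graph drawn on the slide.

```python
from collections import Counter

def degree_labels(edges):
    """Return {vertex: (indegree, outdegree)} for a list of directed edges (u, v)."""
    indeg = Counter(v for _, v in edges)   # edges leading into each vertex
    outdeg = Counter(u for u, _ in edges)  # edges leading out of each vertex
    vertices = set(indeg) | set(outdeg)
    return {v: (indeg[v], outdeg[v]) for v in vertices}

def is_balanced(edges):
    """A graph is balanced if indegree(v) == outdegree(v) for every vertex v."""
    return all(i == o for i, o in degree_labels(edges).values())

# Hypothetical example (not the slide's graph): vertex "a" only sends edges out.
unbalanced = [("a", "b"), ("b", "c"), ("a", "c")]
# Adding two edges from "c" back to "a" evens out every vertex.
balanced = unbalanced + [("c", "a"), ("c", "a")]
```

Counting with `Counter` means a vertex that never appears as a source (or target) simply gets a count of zero, which matches labels such as (1, 0) and (0, 2) above.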

Euler’s Theorem
Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.
A graph is connected if for every pair of vertices {u, v}, an ant can travel either from u to v or from v to u.
Connected + Balanced = Eulerian
[Figure: a graph that is not connected, beside the connected, balanced example graph with vertex labels (2, 2) and (1, 1).]

Section 7: ECP vs. HCP and Algorithmic Complexity

Solving the ECP
By Euler’s Theorem, to determine whether a connected graph G contains an Eulerian cycle, we only need to check whether G is balanced. So we simply go to each vertex v and check whether indegree(v) = outdegree(v).
If every vertex is balanced, then G must contain an Eulerian cycle.
If some vertex is not balanced, then G cannot contain an Eulerian cycle.
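A hedged sketch of this test in Python follows. The slide takes connectivity for granted; the sketch checks it explicitly, relying on the fact that in a balanced graph it is enough to check connectivity while ignoring edge directions. The edge-list input format and the function name are assumptions made for illustration, and treating an empty graph as non-Eulerian is an arbitrary choice.

```python
from collections import defaultdict, deque

def has_eulerian_cycle(edges):
    """Euler's test for a directed graph given as a list of (u, v) edges."""
    if not edges:
        return False  # arbitrary convention for the empty graph
    indeg, outdeg = defaultdict(int), defaultdict(int)
    neighbors = defaultdict(set)  # adjacency with directions ignored
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Balance: every vertex needs indegree == outdegree.
    if any(indeg[x] != outdeg[x] for x in neighbors):
        return False
    # Connectivity: breadth-first search from any vertex must reach all of them.
    start = next(iter(neighbors))
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in neighbors[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == len(neighbors)
```

Both checks visit each edge a constant number of times, so the whole test takes time proportional to the size of the graph.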

Connected + Balanced = Eulerian
Recall our example directed graph from before. Here the graph is not balanced, and so it clearly isn’t Eulerian. Adding the edges that make the graph balanced means that an Eulerian cycle must exist.
One vital question remains: Where did this Eulerian cycle come from?

Making an Eulerian Cycle from a Balanced Graph
Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes. The ant cannot walk along any edge that has been previously traversed. The ant must always walk along edges in the legal direction. At each step, we update the remaining indegree and outdegree of each vertex.
[Figure frames: the (indegree, outdegree) labels shrink toward (0, 0) as the ant uses up edges.]
Cycle! But not Eulerian yet…
Let’s cut out the cycle that the ant has found. Next delete vertices that are no longer connected to anything.
Again, let the ant walk through the graph however it chooses. We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way (other than its starting vertex, where the walk closes into a cycle), because it will always have an edge leading out of any vertex that it enters.
Ant: “I really don’t see how this is going to give us an Eulerian cycle in the original graph…I knew I shouldn’t have left the house this morning!”
Cycle! But still not Eulerian…
Let’s trim out this cycle one more time. The ant is stranded, so let’s move it to a vertex that still has edges.
Ant: “Hmph! Dragged halfway across the screen…I guess I don’t have any say in the matter…”
Now there’s only one way that the ant can walk through the graph. Cycle! And Eulerian to boot…so we have run out of edges. What do we do now?
Ant: “Yes! What DO we do now?”
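The cut-out-a-cycle-and-repeat procedure can be sketched as follows. This is a minimal illustration, assuming a balanced directed graph given as a list of (u, v) edges; the name `peel_cycles` is hypothetical. Instead of letting the ant choose freely, the sketch always takes the most recently listed unused edge, which is one legal choice among many.

```python
from collections import defaultdict

def peel_cycles(edges):
    """Split the edges of a balanced directed graph into edge-disjoint cycles.

    Each cycle is a list of vertices whose first and last entries coincide.
    """
    unused = defaultdict(list)  # unused outgoing edges per vertex
    for u, v in edges:
        unused[u].append(v)
    cycles = []
    for start in list(unused):
        # Keep placing the ant on `start` while it still has unused edges.
        while unused[start]:
            cycle = [start]
            current = start
            while True:
                # In a balanced graph, a vertex the ant has just entered
                # always has an unused edge leaving it, so pop() succeeds.
                current = unused[current].pop()
                cycle.append(current)
                if current == start:
                    break  # the walk has closed into a cycle: cut it out
            cycles.append(cycle)
    return cycles

# Hypothetical example: two loops sharing the vertex "a".
example_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
example_cycles = peel_cycles(example_edges)
```

Popping an edge from `unused` plays the role of deleting it from the picture, so vertices whose lists become empty are the ones "no longer connected to anything."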

Making an Eulerian Cycle from a Balanced Graph
Let’s bring back our original graph. Highlight the three cycles that the ant found (green, blue, and orange).
Start at the ant’s original position, and follow the green cycle. Cycle formed: we can continue along the blue cycle. Cycle formed; however, we now have no new edges to follow!
Ant: “???”
To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.
Ant: “Backtracking? But I’m not evolved to walk backwards!”
Success! Now let’s follow the orange cycle. Rejoin the blue cycle…
Ant: “I smell something good!”
And we have the same Eulerian cycle from before!
Ant: “Yay! Now can I go home please?”
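Walking forward until stuck, then backtracking to a vertex that still has an unused edge, is the idea behind what is usually called Hierholzer's algorithm. The sketch below is one common stack-based formulation, not necessarily the exact procedure the slides animate. It assumes a connected, balanced directed graph given as a non-empty list of (u, v) edges, and the function name is hypothetical.

```python
def eulerian_cycle(edges):
    """Return an Eulerian cycle (list of vertices) of a connected, balanced graph."""
    unused = {}  # unused outgoing edges per vertex
    for u, v in edges:
        unused.setdefault(u, []).append(v)
        unused.setdefault(v, [])
    stack = [edges[0][0]]  # the ant's current walk, starting anywhere
    cycle = []
    while stack:
        v = stack[-1]
        if unused[v]:
            # Walk forward along an edge that has not been used yet.
            stack.append(unused[v].pop())
        else:
            # No edges left here: backtrack, recording the vertex as we leave.
            cycle.append(stack.pop())
    cycle.reverse()  # vertices were recorded in reverse order of traversal
    return cycle

# Hypothetical example: a triangle a -> b -> c -> a with a side loop b -> d -> b.
example_edges = [("a", "b"), ("b", "d"), ("d", "b"), ("b", "c"), ("c", "a")]
example_cycle = eulerian_cycle(example_edges)
```

Every edge is pushed once and popped once, so the running time is proportional to the number of edges.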

What’s the Big Deal?
The great thing about this method is that it generalizes easily to any connected balanced graph to give an Eulerian cycle.
“Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?”
Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…
More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer: its running time grows in proportion to the number of edges. Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!
“Yeah, but computers are supermachines! They don’t really need 300-year-old mathematics to help them solve problems. Aren’t they going to take over the world anyway?”
So let’s examine the case of finding a Hamiltonian cycle…

Searching for an Efficient Algorithm for HCP
Key Point: No one has ever found a similarly efficient test to determine whether a graph is Hamiltonian.
Of course, we could examine every possible (ant) walk through the graph to solve the HCP. However, this brute force approach is just not efficient: there are more walks through a graph on just 1,000 vertices than there are atoms in the universe!
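To make the brute-force idea concrete, here is a minimal sketch that tries every ordering of the vertices. It assumes a directed graph given as a vertex list plus a list of (u, v) edges; the function name is hypothetical. Fixing the first vertex leaves (n − 1)! orderings to try for n vertices, and for n = 1,000 that count has more than 2,500 digits, while commonly quoted estimates for the number of atoms in the observable universe have only about 80.

```python
from itertools import permutations
from math import factorial

def hamiltonian_cycle_brute_force(vertices, edges):
    """Try every vertex ordering; return a Hamiltonian cycle or None."""
    edge_set = set(edges)
    first, *rest = vertices
    for order in permutations(rest):
        cycle = (first, *order, first)
        # The ordering is a cycle only if every consecutive pair is an edge.
        if all((a, b) in edge_set for a, b in zip(cycle, cycle[1:])):
            return list(cycle)
    return None

# Number of orderings the loop above would face on 1,000 vertices.
orderings_for_1000 = factorial(999)
```

This is fine for a handful of vertices and hopeless beyond a few dozen, which is the contrast with the Eulerian case that the slides are drawing.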

NP-Complete Problems
In fact, the HCP has been classified as NP-Complete. In layman’s terms, this means that the HCP belongs to a collection containing thousands of computational problems for which no one knows an algorithm that runs quickly on large input data sets.
NP-Complete problems are all equivalent to each other: find an efficient solution to one, and you have an efficient solution to them all.

NP-Complete Problems
Attempting to solve any NP-Complete problem is difficult.
“I can't find an efficient algorithm, I guess I'm just too dumb.”
The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist.
“I can't find an efficient algorithm, because no such algorithm is possible.”
The present state of affairs is somewhere in between.
“I can't find an efficient algorithm, but neither can all these famous people.”
(Captions from Garey and Johnson, Computers and Intractability, 1979.)

The NP-Completeness of the HCP
The question of whether or not NP-Complete problems (including the HCP) can be solved efficiently is one of seven Millennium Problems in mathematics. Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will get $1 million.
However, if you become a mathematician, odds are that you are not in it for the $$$...Grigory Perelman solved one of these problems (the Poincaré conjecture) but turned down the prize.
[Photo: Grigory Perelman, True Legend]

Section 8: From Euler and Hamilton to Fragment Assembly

Simplifying Assumptions for Fragment Assembly
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular chromosome.
Note: In the final section, we will relax these assumptions.

First Try: The Graph H
Create a vertex for every read (k-mer) detected by our array. In this example the reads are the ten 3-mers AAT, ATG, CAA, CGT, GCA, GCG, GGC, GTG, TGC, and TGG.
Prefix: the first k – 1 nucleotides of a k-mer (CA in CAA).
Suffix: the last k – 1 nucleotides of a k-mer (AA in CAA).
Different 3-mers may share a prefix/suffix: TG is the suffix of ATG and CTG and the prefix of TGA.
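In code, the two definitions are one-line string slices. A minimal sketch, with hypothetical function names and k-mers represented as plain Python strings:

```python
def prefix(kmer):
    """First k - 1 nucleotides of a k-mer."""
    return kmer[:-1]

def suffix(kmer):
    """Last k - 1 nucleotides of a k-mer."""
    return kmer[1:]

# The 3-mers from the slide that share the 2-mer TG.
shared = {suffix("ATG"), prefix("TGA"), suffix("CTG")}
```

Because the slices depend only on the string length, the same helpers work for any k.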

First Try: The Graph H
As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w. For example, ATG is connected to both TGG and TGC, since the suffix TG of ATG is the prefix of each.
[Figure: the graph H on the vertices ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG.]

Hamiltonian Cycles in H
Here we have a Hamiltonian cycle in H:
ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT → ATG
Walking along the cycle and appending the last nucleotide of each new 3-mer spells ATGGCGTGCAATG. The final ATG wraps around onto the start, so the circular genome has ten nucleotides:
Genome (circular): ATGGCGTGCA

Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles in H Here we have a Hamiltonian cycle in H: ATG  TGG  GGC  GCG  CGT  GTG  TGC  GCA  CAA  AAT  ATG ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG ATGGCGTGCAATG Genome: A T G G C G T G C A

Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles in H Here we have a Hamiltonian cycle in H: ATG  TGG  GGC  GCG  CGT  GTG  TGC  GCA  CAA  AAT  ATG ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG ATGGCGTGCAATG Genome: A T G G C G T G C A

Genome Reconstruction: A Puzzle With a Billion Pieces Hamiltonian Cycles in H Here we have a Hamiltonian cycle in H: ATG  TGG  GGC  GCG  CGT  GTG  TGC  GCA  CAA  AAT  ATG ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG ATGGCGTGCA Genome: A T G G C G T G C A
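The step from a cycle of 3-mers to a circular genome can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name spell_circular_genome is hypothetical.

```python
def spell_circular_genome(cycle):
    """Spell a circular genome from a closed walk of k-mers.

    `cycle` lists the k-mers in order and ends on the k-mer it started with.
    Consecutive k-mers must overlap by k - 1 letters.
    """
    if cycle[0] != cycle[-1]:
        raise ValueError("the walk must return to its starting k-mer")
    for a, b in zip(cycle, cycle[1:]):
        if a[1:] != b[:-1]:
            raise ValueError(f"{a} and {b} do not overlap by k - 1 letters")
    # Each k-mer except the repeated last one contributes its first letter.
    return "".join(kmer[0] for kmer in cycle[:-1])


cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG",
         "TGC", "GCA", "CAA", "AAT", "ATG"]
genome = spell_circular_genome(cycle)
print(genome)
```

Taking one letter per k-mer works because each step along the cycle advances the genome by exactly one position.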

Problem with H
Ultimately, we must solve the Hamiltonian Cycle Problem (HCP) on H in order to find a candidate DNA sequence. This idea motivated the method used for assembling the human genome from 50 million (long and expensive) reads in 2000, but the computational strain was overwhelming: the job kept several computers working around the clock for months. Moreover, newer sequencing technologies produce billions of (short and inexpensive) reads, so we need a new idea.

Second Try: The Graph E
Form a different graph E as follows:
Create a vertex for each distinct prefix/suffix from reads.
Connect vertex v to vertex w with a directed edge if there is a read whose prefix is v and whose suffix is w.
Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG
Vertices: CA, GC, CG, TG, GT, GG, AT, AA
Edges (one per read): ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT
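The two construction rules for E can be sketched as follows. This is an illustrative sketch that assumes the prefix and suffix of a read are its first and last k – 1 letters; the name build_graph_e is hypothetical.

```python
from collections import defaultdict


def build_graph_e(reads):
    """Return adjacency lists: prefix -> list of suffixes, one entry per read."""
    graph = defaultdict(list)
    for read in reads:
        prefix, suffix = read[:-1], read[1:]
        graph[prefix].append(suffix)   # the read itself labels this edge
        graph[suffix]                  # touch the suffix so every vertex is a key
    return dict(graph)


reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
graph_e = build_graph_e(reads)
for vertex in sorted(graph_e):
    print(vertex, "->", sorted(graph_e[vertex]))
```

Reads become edges rather than vertices, so the graph has one edge per read and far fewer vertices than H.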

Eulerian Cycles in E
We have an Eulerian cycle in E: ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT
This is the same sequence of 3-mers that we had in H! Thus we will obtain the same reconstructed genome as before: ATGGCGTGCA.

Analysis of E
Good News: We now only have to find an Eulerian cycle in the graph E, which could be done on this computer.
Bad News:
1. There may be more than one Eulerian cycle in E. We won’t discuss this issue here, but it can be resolved.
2. How do we know that E even has an Eulerian cycle? By Euler’s Theorem, we only need to show that E is a balanced graph. To do this, we need one more piece of mathematical history…
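To make the good news concrete, here is a minimal sketch of Hierholzer's algorithm, a standard linear-time method for finding an Eulerian cycle. The slides do not name an algorithm, so this choice is an assumption. The graph is E from the running example, written out by hand.

```python
def eulerian_cycle(graph, start):
    """Hierholzer's algorithm on a directed multigraph.

    `graph` maps each vertex to a list of successors (one entry per edge).
    Assumes the graph is balanced and connected; a basic sanity check
    raises ValueError if the walk found does not use every edge.
    """
    remaining = {v: list(ws) for v, ws in graph.items()}  # copy; edges get consumed
    total_edges = sum(len(ws) for ws in remaining.values())
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())   # follow an unused edge
        else:
            cycle.append(stack.pop())          # dead end: vertex is finished
    cycle.reverse()
    if len(cycle) != total_edges + 1 or cycle[0] != cycle[-1]:
        raise ValueError("no Eulerian cycle found from this start vertex")
    return cycle


graph_e = {
    "AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GC"], "GC": ["CG", "CA"],
    "CG": ["GT"], "GT": ["TG"], "CA": ["AA"], "AA": ["AT"],
}
cycle = eulerian_cycle(graph_e, "AT")
genome = "".join(v[0] for v in cycle[:-1])
print(" -> ".join(cycle))
print(genome)
```

Depending on the order in which edges are followed, the cycle found may differ from the one on the slide, which illustrates the first piece of bad news: E can have more than one Eulerian cycle, and each spells a candidate genome with the same 3-mers.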

Section 9: De Bruijn and Fragment Assembly

De Bruijn’s Question
1946: The Dutch mathematician Nicolaas de Bruijn asks: can we design a circular superstring of minimal length that contains every binary string of length k?
Example for k = 3: The circular superstring ‘ ’ contains all eight binary strings of length 3. We illustrate the locations of ‘000’ and ‘110’ on the string.
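A short checker makes de Bruijn's requirement precise. The candidate string 00011101 below is one known solution for k = 3, used here as an assumption; it is not necessarily the string shown on the slide. The function names are hypothetical.

```python
from itertools import product


def circular_kmers(s, k):
    """All length-k substrings of s read circularly, one per start position."""
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]


def is_binary_de_bruijn(s, k):
    """True if circular s contains every binary k-mer exactly once."""
    wanted = sorted("".join(bits) for bits in product("01", repeat=k))
    return sorted(circular_kmers(s, k)) == wanted


candidate = "00011101"  # assumed example string, not taken from the slide
print(circular_kmers(candidate, 3))
print(is_binary_de_bruijn(candidate, 3))
```

A string of length 2^k is the shortest possible, since each of its 2^k start positions must supply a different k-mer.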

De Bruijn’s Question
De Bruijn introduced a special class of graph B(n, k):
Vertices = all n^(k – 1) possible (k – 1)-mers in an n-letter alphabet.
An edge connects v to w if there is a k-mer whose prefix = v and whose suffix = w.
At right is B(2, 4), assuming that our alphabet contains 0 and 1.

De Bruijn’s Question
For any choice of n and k, B(n, k) must be balanced/Eulerian. Why? Because the indegree and the outdegree of every vertex both equal the size of the alphabet (n): every (k – 1)-mer occurs as the prefix of n different k-mers and as the suffix of n different k-mers. Red numbers show the order of edges in an Eulerian cycle.
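The degree argument can be checked by brute force for small cases. The sketch below builds B(2, 4) from the definition and counts degrees; the function name de_bruijn_graph is hypothetical.

```python
from collections import Counter
from itertools import product


def de_bruijn_graph(alphabet, k):
    """Edges of B(n, k) as (prefix, suffix, k-mer) triples, n = len(alphabet)."""
    return [(kmer[:-1], kmer[1:], kmer)
            for kmer in ("".join(p) for p in product(alphabet, repeat=k))]


alphabet, k = "01", 4
edges = de_bruijn_graph(alphabet, k)
outdeg = Counter(v for v, _, _ in edges)
indeg = Counter(w for _, w, _ in edges)
vertices = set(outdeg) | set(indeg)
print(len(vertices), len(edges))  # n^(k - 1) vertices and n^k edges
print(all(indeg[v] == outdeg[v] == len(alphabet) for v in vertices))
```

Swapping in alphabet = "ACGT" gives the nucleotide graph B(4, k), where every vertex has indegree and outdegree 4.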

De Bruijn’s Question
The graph E we have constructed is contained in the graph B(4, k). We have n = 4 since there are four possible nucleotides. E must be balanced/Eulerian too! The indegree and outdegree of any (k – 1)-mer vertex both equal the number of times this (k – 1)-mer appears in the genome.
Example: in the circular genome ATGGCGTGCA, the 2-mer TG appears twice, and the vertex TG has indegree 2 (edges ATG and GTG) and outdegree 2 (edges TGG and TGC).
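The claim that each vertex's indegree and outdegree equal its number of occurrences can be checked on the running example. This sketch assumes, as the simplifying assumptions do, that the reads are exactly the 3-mers of the circular genome.

```python
from collections import Counter


def circular_kmers(genome, k):
    """All length-k substrings of a circular genome, one per start position."""
    doubled = genome + genome[:k - 1]
    return [doubled[i:i + k] for i in range(len(genome))]


genome, k = "ATGGCGTGCA", 3
reads = circular_kmers(genome, k)                  # every 3-mer of the genome
occurrences = Counter(circular_kmers(genome, k - 1))
outdeg = Counter(read[:-1] for read in reads)      # read leaves its prefix
indeg = Counter(read[1:] for read in reads)        # read enters its suffix
for v in sorted(occurrences):
    print(v, indeg[v], outdeg[v], occurrences[v])
```

Each occurrence of a (k – 1)-mer in the circular genome is entered by exactly one k-mer and left by exactly one k-mer, which is why the three counts agree.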

Section 10: Generalizing Fragment Assembly

Simplifying Assumptions for Fragment Assembly
Recall the assumptions we have already made:
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular chromosome.
Our aim is to relax each of these assumptions and determine how the problem changes.

Assumption 1: Generating (nearly) all k-mers
The 100-nucleotide reads generated by Illumina sequencing technology capture only a small fraction of the 100-mers in the genome (even for high-coverage sequencing projects), thus violating this key assumption of the de Bruijn graph approach. However, if we break these reads into shorter k-mers, the resulting k-mers often represent nearly all k-mers from the genome for sufficiently small k. For example, modern assemblers often break every 100-nucleotide read into 46 overlapping 55-mers (100 – 55 + 1 = 46) and then assemble the resulting 55-mers using de Bruijn graphs.

Assumption 1: Generating (nearly) all k-mers
Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6. We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.
Genome: ATGCAAGCTAGCT
Reads: ATGCAA, CAAGCT, CTAGCT, and a fourth read that wraps around the circular genome (drawn as CT at the end and ATGC at the start)
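Splitting reads into shorter k-mers is simple to sketch. The fourth read below is written as CTATGC, which is an assumption based on the wrap-around drawing; the helper names are hypothetical.

```python
def kmers(s, k):
    """All length-k substrings of a linear string."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def circular_kmers(genome, k):
    """All length-k substrings of a circular genome."""
    return kmers(genome + genome[:k - 1], k)


genome = "ATGCAAGCTAGCT"
# Assumption: the fourth read wraps around the circular genome
# (CT at the end, ATGC at the start), i.e. CTATGC.
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]

missing_6mers = set(circular_kmers(genome, 6)) - set(reads)
read_3mers = {kmer for read in reads for kmer in kmers(read, 3)}
missing_3mers = set(circular_kmers(genome, 3)) - read_3mers

print(len(kmers("A" * 100, 55)))  # number of 55-mers in a 100-nucleotide read
print(sorted(missing_6mers))      # 6-mers of the genome that no read captured
print(sorted(missing_3mers))      # 3-mers of the genome missing from the reads
```

Smaller k makes full coverage more likely, at the cost of more repeated k-mers in the genome.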

Assumption 2: Handling Errors in Reads
What happens to the graph E when some reads have errors?
Example: Say our graph E for genome ATGGCGTGCAATG should look like this. If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead. This is called a bulge in the graph E.
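To see where the bulge comes from, compare the edges contributed by the correct and the erroneous read. This sketch assumes k = 3, as in the earlier examples; the helper names are hypothetical.

```python
def kmers(s, k):
    """All length-k substrings of a linear string."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def edges(read, k):
    """Edges of E contributed by one read: (prefix, suffix) per k-mer."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmers(read, k)]


k = 3
correct = edges("TGGCGTG", k)
erroneous = edges("TGGAGTG", k)
shared = [e for e in erroneous if e in correct]
spurious = [e for e in erroneous if e not in correct]
print(correct)
print(shared)    # edges the erroneous read has in common with the correct one
print(spurious)  # extra edges introduced by the single-letter error
```

The spurious edges form a second path from GG to GT that runs alongside the correct path GG → GC → CG → GT. That pair of parallel paths is the bulge.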

Assumption 2: Handling Errors in Reads
Most reads have errors, resulting in millions of bulges in E.
2004: Pevzner et al. provide an algorithm for bulge removal.

Assumption 3: Handling Repeated k-mers
The genome ACGTACGT has only four distinct 3-mers: ACG, CGT, GTA, and TAC. We would obtain the graph E below and reconstruct this genome as ACGT. In other words, we can’t represent repeated k-mers in the genome!
Vertices of E: AC, CG, GT, TA
Edges of E: ACG, CGT, GTA, TAC

Assumption 3: Handling Repeated k-mers
Define the multiplicity of a k-mer as the number of times it occurs in the genome. We will add edges to E in order to form a new graph E*, in which the number of edges connecting two vertices equals the multiplicity of the k-mer labeling those edges. An Eulerian cycle in E* still gives a candidate genome.

Genome Reconstruction: A Puzzle With a Billion Pieces Assumption 3: Handling Repeated k-mers Say that we have the following read multiplicities: Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA Multiplicity 2: GCG, CGT, GTG, TGC We reflect multiplicities as multiple edges Candidate genome: E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. CAGC CG TG GT GG AT AA ATG TGG GGC GCG CGT GTG TGC GCA CAAAAT ATGCGTGGCGTGCA

Determining k-mer multiplicities
How can we find the multiplicity of a k-mer in the genome? The multiplicity of a k-mer is directly related to the frequency with which that k-mer occurs in our reads. So a k-mer that appears 5 times in the genome is expected to occur about 5 times as often in our reads as a k-mer that appears only once.
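One simple way to turn this idea into numbers is to divide each k-mer's read count by the count expected for a single-copy k-mer. The sketch below is not the method from the slides: it assumes most k-mers occur once in the genome, so the median count can stand in for single-copy coverage, and the read counts are made-up values for illustration.

```python
from statistics import median

def estimate_multiplicities(read_counts):
    """Estimate each k-mer's multiplicity from its count in the reads.

    Assumption: most k-mers are single-copy, so the median count
    approximates the coverage of a k-mer that occurs once.
    """
    single_copy = median(read_counts.values())
    return {kmer: max(1, round(count / single_copy))
            for kmer, count in read_counts.items()}

# Hypothetical counts of each 3-mer across all reads.
read_counts = {
    "ATG": 31, "TGG": 29, "GGC": 30, "GCA": 28, "CAA": 33, "AAT": 30,
    "GCG": 61, "CGT": 58, "GTG": 63, "TGC": 60,
}

estimates = estimate_multiplicities(read_counts)
print(estimates["ATG"], estimates["GCG"])  # 1 2
```

Real data is noisier than this toy table: coverage varies along the genome and sequencing errors add spurious low-count k-mers, so any such estimate should be treated as approximate.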

Assumption 4: From Circular to Linear Genomes
The genomes for all complex organisms are split across a number of linear chromosomes (46 in humans). So in order to sequence the human genome, geneticists simply sequenced all of these linear chromosomes.
Question: How do we sequence a linear segment of DNA?

Assumption 4: From Circular to Linear Genomes
Say our linear DNA segment is ATGCGTGGCGTGCA. Then the 3-mers from this segment are the same as for the circular genome before, except that the segment doesn’t “wrap around,” so we lose two 3-mers: CAA and AAT.
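The two lost 3-mers can be found by subtracting the linear k-mer counts from the circular ones. This is a minimal sketch; `kmer_counts` and its `circular` flag are hypothetical names.

```python
from collections import Counter

def kmer_counts(genome, k, circular=False):
    """Count k-mers, optionally wrapping around the end of the sequence."""
    text = genome + genome[:k - 1] if circular else genome
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

segment = "ATGCGTGGCGTGCA"

# Counter subtraction keeps only the k-mers that occur more often
# in the circular reading than in the linear one.
lost = kmer_counts(segment, 3, circular=True) - kmer_counts(segment, 3)

print(sorted(lost))  # ['AAT', 'CAA']
```

A linear segment of length n has n - k + 1 k-mers rather than n, so for k = 3 exactly two are lost.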

Assumption 4: From Circular to Linear Genomes
Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*. Get rid of the vertex AA as well. So to sequence our segment ATGCGTGGCGTGCA, we need to find a path through E* that starts at AT, ends at CA, and uses every edge exactly once.
Figure: graph E* with vertices AT, TG, GG, GC, CG, GT, CA after the edges CAA and AAT and the vertex AA have been removed.

Assumption 4: From Circular to Linear Genomes
An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once. So an Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same vertex. Luckily, Euler’s Theorem generalizes to efficiently determine whether a graph has an Eulerian path and then find this path.
Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced or exactly two vertices are not balanced, one having one more outgoing edge than incoming edges (the start of the path) and the other having one more incoming edge than outgoing edges (the end of the path).
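The degree condition in Euler’s Theorem II can be tested directly. The sketch below uses the usual precise form of the condition: the two unbalanced vertices must differ by exactly one edge, in opposite directions. It assumes the graph is already known to be connected, and `degree_check` is a hypothetical helper name.

```python
from collections import Counter

def degree_check(edges):
    """Classify a connected directed multigraph by its vertex degrees.

    Returns "cycle" if every vertex is balanced, "path" if exactly two
    vertices are off by one in opposite directions, and "none" otherwise.
    Connectivity is assumed, not verified.
    """
    out_deg = Counter(u for u, _ in edges)
    in_deg = Counter(v for _, v in edges)
    diffs = [out_deg[v] - in_deg[v] for v in set(out_deg) | set(in_deg)]
    unbalanced = sorted(d for d in diffs if d != 0)
    if not unbalanced:
        return "cycle"
    if unbalanced == [-1, 1]:
        return "path"
    return "none"

segment = "ATGCGTGGCGTGCA"
linear_edges = [(segment[i:i + 2], segment[i + 1:i + 3])
                for i in range(len(segment) - 2)]

print(degree_check(linear_edges))              # path
print(degree_check([("A", "B"), ("B", "A")]))  # cycle
print(degree_check([("A", "B"), ("A", "C")]))  # none
```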

Assumption 4: From Circular to Linear Genomes
By Euler’s Theorem II, E* must contain an Eulerian path, because AT and CA (the endpoints of our segment) are the only two vertices that aren’t balanced: AT has one extra outgoing edge and CA has one extra incoming edge. Hence in every case we have solved our giant puzzle!
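One standard way to actually find such a path is Hierholzer's algorithm, which the slides do not name; the version below is a sketch under the assumption that the graph is connected and satisfies the degree condition. The names `eulerian_path` and `spell` are illustrative.

```python
from collections import Counter, defaultdict

def eulerian_path(edges):
    """Return the vertices of an Eulerian path (Hierholzer's algorithm).

    Assumes the graph is connected and has an Eulerian path or cycle.
    """
    graph = defaultdict(list)
    balance = Counter()  # outdegree minus indegree
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with an extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: record and back up
    return path[::-1]

def spell(path):
    """Glue overlapping vertex labels back into one string."""
    return path[0] + "".join(v[-1] for v in path[1:])

segment = "ATGCGTGGCGTGCA"
three_mers = [segment[i:i + 3] for i in range(len(segment) - 2)]
edges = [(km[:-1], km[1:]) for km in three_mers]

candidate = spell(eulerian_path(edges))
print(candidate)
```

A graph can have more than one Eulerian path, so the candidate is guaranteed only to start with AT, end with CA, and contain the same 3-mers with the same multiplicities as the original segment; here both ATGCGTGGCGTGCA and ATGGCGTGCGTGCA satisfy that.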

What’s Next?

Personal Genomics: Millions of Human Genomes
Personal genome sequencing started from sequencing the genomes of a few scientists in 2009 and will soon expand to millions of individuals. Thousands of cancer genomes have already been sequenced, and genome sequencing will soon become a routine technique in medicine. At the heart of this revolution are bioinformaticians, who must harness precise methods in order to analyze the growing data.
Figure: 10 scientists and entrepreneurs who made their genomes publicly available in 2009.

Genome 10K and Beyond
2010: Scientists launch an ambitious project to sequence the genomes of 10,000 species.
201x?: We will hopefully be able to reconstruct the “tree of life” and uncover the genomes of ancestors that lived millions of years ago.
20xx?: Maybe, just maybe, we will be able to discover why giraffes grew necks and humans grew brains.