Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.

Slides:



Advertisements
Similar presentations
CSE 211 Discrete Mathematics
Advertisements

CS 336 March 19, 2012 Tandy Warnow.
Introduction to Graph Theory Instructor: Dr. Chaudhary Department of Computer Science Millersville University Reading Assignment Chapter 1.
Midwestern State University Department of Computer Science Dr. Ranette Halverson CMPS 2433 – CHAPTER 4 GRAPHS 1.
Graph Algorithms in Bioinformatics. Outline Introduction to Graph Theory Eulerian & Hamiltonian Cycle Problems Benzer Experiment and Interval Graphs DNA.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Chapter 8: Graph Algorithms July/23/2012 Name: Xuanyu Hu Professor: Elise de Doncker.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Graph Theory: Euler Circuits Christina Mende Math 480 April 15, 2013.
Lecture 21 Paths and Circuits CSCI – 1900 Mathematics for Computer Science Fall 2014 Bill Pine.
Combinatorial Algorithms
Section 2.1 Euler Cycles Vocabulary CYCLE – a sequence of consecutively linked edges (x 1,x2),(x2,x3),…,(x n-1,x n ) whose starting vertex is the ending.
IEOR266 © Classification of MCNF problems.
Approximation Algorithms for the Traveling Salesperson Problem.
Computational Genomics and Proteomics Sequencing genomes.
Genome Assembly Charles Yan Fragment Assembly Given a large number of fragments, such as ACC AC AT AC AT GG …, the goal is to figure out the original.
Sequencing tutorial Peter HANTZ EMBL Heidelberg.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
GRAPH Learning Outcomes Students should be able to:
Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.
Take a Tour with Euler Elementary Graph Theory – Euler Circuits and Hamiltonian Circuits Amro Mosaad – Middlesex County Academy.
Physical Mapping of DNA Shanna Terry March 2, 2004.
Sequence Assembly: Concepts BMI/CS 576 Sushmita Roy September 2012 BMI/CS 576.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
EECS 203: It’s the end of the class and I feel fine. Graphs.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
Lecture 3 1.Protein Function prediction using network concepts 2.Application of network concepts in DNA sequencing.
394C March 5, 2012 Introduction to Genome Assembly.
Sequence Assembly Fall 2015 BMI/CS 576 Colin Dewey
Fuzzypath – Algorithms, Applications and Future Developments
Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey.
CSE 326: Data Structures NP Completeness Ben Lerner Summer 2007.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute.
6.3 Advanced Molecular Biological Techniques 1. Polymerase chain reaction (PCR) 2. Restriction fragment length polymorphism (RFLP) 3. DNA sequencing.
Hashing Algorithm and its Applications in Bioinformatics By Zemin Ning Informatics Division The Wellcome Trust Sanger Institute.
Sequencing tutorial Peter HANTZ EMBL Heidelberg.
CIRCUITS, PATHS, AND SCHEDULES Euler and Königsberg.
Lecture 52 Section 11.2 Wed, Apr 26, 2006
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
Splicing Exons: A Eukaryotic Challenge to Gene Prediction Ian McCoy.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Review: Graph Theory in Bioinformatics Yunkai Liu Assistant Professor Computer Science Department University of South Dakota.
1 Euler and Hamilton paths Jorge A. Cobb The University of Texas at Dallas.
Grade 11 AP Mathematics Graph Theory Definition: A graph, G, is a set of vertices v(G) = {v 1, v 2, v 3, …, v n } and edges e(G) = {v i v j where 1 ≤ i,
1 GRAPH Learning Outcomes Students should be able to: Explain basic terminology of a graph Identify Euler and Hamiltonian cycle Represent graphs using.
Detecting DNA with DNA probes arrays. DNA sequences can be detected by DNA probes and arrays (= collection of microscopic DNA spots attached to a solid.
Graph Algorithms © Jones and Pevzner © Robert Simons
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
Short reads: 50 to 150 nt (nucleotide)
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
CSCI2950-C Genomes, Networks, and Cancer
CSCI2950-C Lecture 3 DNA Sequencing and Fragment Assembly
EECS 203 Lecture 19 Graphs.
CS296-5 Genomes, Networks, and Cancer
Eulerian tours Miles Jones MTThF 8:30-9:50am CSE 4140 August 15, 2016.
Lecture 3 Protein Function prediction using network concepts
CSE 5290: Algorithms for Bioinformatics Fall 2011
Graph Algorithms in Bioinformatics
Graph Theory.
Genome Assembly.
Graph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics
CSE 5290: Algorithms for Bioinformatics Fall 2009
Euler and Hamiltonian Graphs
Graphs CS 2606.
Presentation transcript:

Graph Theory And Bioinformatics Jason Wengert

Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing DNA Shortest Superstring Problem Sequencing by Hybridization (SBH) SBH as Hamiltonian Cycle Problem SBH as Eulerian Path Problem

Graph Theory A graph is a set of vertices connected by edges Edges connect two vertices and can be weighted, directed or undirected

The Bridge Obsession Problem Bridges of Königsberg Find a tour crossing every bridge just once Leonhard Euler, 1735

Eulerian Cycle Problem Find a cycle that visits every edge exactly once Linear time Solutions exist if each vertex has even degree For a path, the start and end vertices can have odd degree More complicated Königsberg

Hamiltonian Cycle Problem Find a cycle that visits every vertex exactly once NP – complete  Equivalent to Travelling Salesperson Problem Game invented by Sir William Hamilton in 1857

Interval Graph Vertices are segments on a line Edges exist between two vertices if any part of their segments overlap

Seymour Benzer and T4 phages Benzer studied mutated bacteriophages Non-mutated T4 phages kill bacteria The mutants had continuous segments missing from their DNA These mutants lost the ability to kill the bacteria Two mutant strains with non-overlapping segments missing can kill bacteria together

Complementation between pairs of mutant T4 bacteriophages

Interval Graph: Linear Genes

Interval Graph: Branched Genes

Sequencing DNA Sanger Method:  Make copies of fragments of different length by nucleotide starvation during copying  Separate the fragments via electrophoresis  Each starvation experiment produces a ladder of fragments of various lengths  Generate several-hundred-nucleotide fragment sequences (reads) from these ladders

Sanger Method: Generating Read 1. Start at primer (restriction site) 2. Grow DNA chain 3. Include ddNTPs 4. Stops reaction at all possible points 5. Separate products by length, using gel electrophoresis

Combining Reads Need a method to take the reads and combine them into a genome You don't know where the reads came from in the DNA

Shortest Superstring Problem Problem: Given a set of strings, find a shortest string that contains all of them Input: Strings s 1, s 2,…., s n Output: A string s that contains all strings s 1, s 2,…., s n as substrings, such that the length of s is minimized Complexity: NP – complete Note: this formulation does not take into account sequencing errors

Shortest Superstring Problem: Example

Reducing SSP to TSP Define overlap ( s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. aaaggcatcaaatctaaaggcatcaaa Construct a graph with n vertices representing the n strings s 1, s 2,…., s n. Insert edges of length overlap ( s i, s j ) between vertices s i and s j. Find the shortest path which visits every vertex exactly once. This is the Traveling Salesman Problem (TSP), which is also NP – complete.

Sequencing By Hybridization The presence of a target DNA sequence can be determined by creating a probe, a single-strand fragment with the complementary sequence If the target sequence exists in the sample, it will hybridize with the probe

How SBH Works Attach all possible DNA probes of length l to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array. Apply a solution containing fluorescently labeled DNA fragment to the array. The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment.

How SBH Works (cont’d) Using a spectroscopic detector, determine which probes hybridize to the DNA fragment to obtain the l–mer composition of the target DNA fragment. Apply the combinatorial algorithm (below) to reconstruct the sequence of the target DNA fragment from the l – mer composition.

Hybridization on DNA Array

Spectrum Spectrum(s,l) is the multiset of all l-mers that appear in s Example:  Spectrum(TATGGTGC, 3) = {TAT, ATG, TGG, GGT, GTG, TGC}

The SBH Problem Goal: Reconstruct a string from its l-mer composition Input: A set S, representing all l-mers from an (unknown) string s Output: String s such that Spectrum ( s,l ) = S

SBH As Hamiltonian Path Vertices of the graph are every l-mer in the spectrum A directed edge leads from vertex u to vertex v if the last l-1 characters of u equal the first l-1 characters of v. S = { ATG TGG TGC GTG GGC GCA GCG CGT }

SBH: Hamiltonian Path Approach S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1: ATGCGTGGCA ATGGCGTGCA Path 2:

SBH As Eulerian Path Vertices are (l-1)-mers Edges represent the l-mers from the spectrum AT GT CG CA GC TG GG S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT }

SBH: Eulerian Path Approach S = { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths: ATGGCGTGCA ATGCGTGGCA AT TG GC CA GG GT CG AT GT CG CA GCTG GG

Euler Theorem A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing edges: in(v)=out(v) Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.

Constructing an Eulerian Cycle Start with an arbitrary vertex and form an arbitrary cycle until you reach a dead end  Because the graph is Eulerian, this dead end will be the starting point If the resulting cycle is not Eulerian, select a vertex from the cycle with untraversed edges, and repeat step 1 starting there. Combine the two cycles and repeat step 2.

Constructing an Eulerian Cycle

Some Difficulties with SBH Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and 1 or 2 mismatches Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology. Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques

Summary Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing DNA Shortest Superstring Problem Sequencing by Hybridization (SBH) SBH as Hamiltonian Cycle Problem SBH as Eulerian Path Problem

References Jones, N. and Pevzner, P. An introduction to Bioinformatics Algorithms. Some slides courtesy of the book's website at bix.ucsd.edu/bioalgorithms/slides.php