A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem Zhihong Ding, Vladimir Filkov, Dan Gusfield RECOMB 2005, pp. 585–600

Presentation transcript:

A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem Zhihong Ding, Vladimir Filkov, Dan Gusfield RECOMB 2005, pp. 585–600 Date: Nov. 23, 2005 Introducer: Hsing-Yen Ann Modified from:

2 Abstract Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002, the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented. Simulations show that it is much faster in practice than prior methods. The value of a linear-time solution to the PPH problem is partly conceptual and partly for use in the inner loop of algorithms for more complex problems, where the PPH problem must be solved repeatedly.

3 Haplotypes to Genotypes Two haplotypes per individual are merged into one genotype for the individual (experimental results). At each site: two 0s → 0; two 1s → 1; one 0 + one 1 → 2.
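The merge rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `merge_haplotypes` and the list-of-integers encoding are assumptions, not something the slides define.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into one genotype over {0, 1, 2}.

    Hypothetical helper: sites are list positions, alleles are 0 or 1.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal alleles keep their value; differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


genotype = merge_haplotypes([0, 1, 1, 0], [0, 1, 0, 1])
print(genotype)  # [0, 1, 2, 2]
```

Once merged, the genotype no longer records which haplotype carried the 1 at a site of value 2, and that lost information is what haplotype inference has to recover.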

4 Genotypes to Haplotypes Each genotype must be split back into two haplotypes per individual. At each site: 0 → (0, 0); 1 → (1, 1); 2 → (1, 0) or (0, 1). 2^k possible solutions!! Haplotype Inference Problem: Given a set of n genotypes (on the same sites), determine the original set of n haplotype pairs that generated the n genotypes.
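To make the count concrete, here is a small Python sketch that enumerates the candidate haplotype pairs for one genotype. It assumes that k is the number of sites of value 2 in the genotype, and it counts ordered pairs, so (h1, h2) and (h2, h1) are listed separately, which is what gives 2^k. If the two orders are treated as the same pair, the count halves for k ≥ 1. The function name is hypothetical.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Yield every ordered haplotype pair that merges to `genotype`."""
    # Positions of the ambiguous (value 2) sites.
    het = [i for i, g in enumerate(genotype) if g == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het, bits):
            h1[site] = bit       # one haplotype gets `bit`
            h2[site] = 1 - bit   # the other gets the opposite allele
        yield tuple(h1), tuple(h2)


pairs = list(haplotype_pairs([2, 0, 2, 1]))
print(len(pairs))  # 4, i.e. 2**k with k = 2
```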

5 The Perfect Phylogeny Model of Haplotype Evolution [Figure labels: sites; ancestral haplotype; extant haplotypes at the leaves; site mutations on edges] Perfect: a site never mutates twice.

6 The Perfect Phylogeny Haplotyping (PPH) Problem Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
Genotype matrix (columns: site 1, site 2)
a  2 2
b  0 2
c  1 0
Haplotype matrix (columns: site 1, site 2)
a  1 0
a  0 1
b  0 0
b  0 1
c  1 0
c  1 0
[Figure: perfect phylogeny with edges labeled by sites 1 and 2; node labels (a,c,c), (a,b), (b)]
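A brute-force sketch can make the problem statement concrete. It is not the paper's linear-time algorithm: it tries every way of splitting the genotypes and accepts the first haplotype set that passes the standard four-gamete test (no pair of sites shows all of 00, 01, 10, 11), which characterizes binary matrices with a perfect phylogeny when the root is not fixed in advance. The function names and the example values (three genotypes a, b, c over two sites) are assumptions for illustration.

```python
from itertools import combinations, product


def has_perfect_phylogeny(haplotypes):
    """Four-gamete test: no pair of sites may show 00, 01, 10 and 11."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        if len({(h[i], h[j]) for h in haplotypes}) == 4:
            return False
    return True


def resolutions(genotype):
    """Yield unordered haplotype pairs explaining one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    # Fix the first ambiguous site to 0 in h1 so each pair appears once.
    for rest in product((0, 1), repeat=max(len(het) - 1, 0)):
        bits = (0,) + rest if het else ()
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, bits):
            h1[site], h2[site] = bit, 1 - bit
        yield tuple(h1), tuple(h2)


def brute_force_pph(genotypes):
    """Return one haplotype list fitting a perfect phylogeny, or None."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if has_perfect_phylogeny(haplotypes):
            return haplotypes
    return None


# Hypothetical genotypes a, b, c over two sites.
print(brute_force_pph([[2, 2], [0, 2], [1, 0]]))
# Adding a genotype that forces haplotype 11 leaves no solution.
print(brute_force_pph([[2, 2], [0, 2], [1, 0], [1, 1]]))
```

The search is exponential in the number of sites of value 2, which is why a row-by-row linear-time method is of interest.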

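7 The Perfection An example that does not fit a perfect phylogeny.
Genotype matrix (columns: site 1, site 2)
a  2 2
b  0 2
c  1 0
Haplotype matrix (columns: site 1, site 2)
a  1 1
a  0 0
b  0 0
b  0 1
c  1 0
c  1 0
Not perfect!! [Figure: tree with node labels (b), (a,b), (c,c), (a), in which a site labels more than one edge]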

8 Prior Work Several existing algorithms: A complex nearly-linear-time algorithm with a little bug runs in O(nm α(nm)) time. Two simpler but slower algorithms run in O(nm^2) time. Contribution of this paper: A linear-time (O(nm)) algorithm. It uses a simple data structure, the Shadow Tree, and some simple operations on it.

9 Shadow Tree (1/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

10 Shadow Tree (2/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

11 Shadow Tree (3/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

12 Shadow Tree (4/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

13 Shadow Tree (5/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

14 Shadow Tree (6/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

15 Shadow Tree (7/7) [Figure legend: root; tree edge; shadow edge; class; free link; flipping; fixed link; classes merge]

16 The Algorithm Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree. While processing an element in one row, there are at most 4+3 cases, and all the cases can be done in constant time. Assumption: The genotype matrix only contains entries of value 0 and 2.

17 OldEntryList [Figure: genotype matrix] OldEntryList for row 3: 1, 2, 3, 5. OldEntryList: column indices that have entries of value 2 in this row and also have entries of value 2 in some previous rows.
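The definition of OldEntryList translates directly into a short Python sketch. The matrix below is hypothetical: the slide's own matrix did not survive in the transcript, so these values were chosen only so that row 3 yields 1, 2, 3, 5. The function name is also an assumption. Column indices are 1-based, as on the slide.

```python
def old_entry_lists(genotype_matrix):
    """For each row, list the 1-based columns that hold a 2 in this row
    and also held a 2 in some previous row."""
    seen = set()   # columns that had a 2 in an earlier row
    result = []
    for row in genotype_matrix:
        twos = [c for c, g in enumerate(row, start=1) if g == 2]
        result.append([c for c in twos if c in seen])
        seen.update(twos)
    return result


# Hypothetical 0/2 matrix (not the one from the slide).
matrix = [
    [2, 2, 0, 0, 2, 0],
    [0, 2, 2, 0, 0, 0],
    [2, 2, 2, 2, 2, 0],
]
print(old_entry_lists(matrix)[2])  # [1, 2, 3, 5]
```

In this example column 4 has its first 2 in row 3, so it is a new entry rather than an old one. The whole computation touches each matrix cell once.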

18 Shadow Tree After Processing the First Two Rows root Genotype Matrix OldEntryList for row 3 : 1, 2, 3,

19 Algorithm – FirstPath [Figure: shadow tree with OldEntryList and CheckList during FirstPath] Edges 4 and 5 cannot be on the same path to the root in any PPH solution.

20 Algorithm – SecondPath [Figure: shadow tree during SecondPath] CheckList: 3. OldEntryList: 1, 2, 3, 5.

21 Shadow Tree to PPH Solutions (1/2) [Figure: genotype matrix, final shadow tree, and one PPH solution]

22 Shadow Tree to PPH Solutions (2/2) [Figure: final shadow tree and a second PPH solution; labels: a,d; b,c; b,d; a,c]

23 The End

24 A P-Class of PPH Solutions [Figure: genotype matrix and one PPH solution (rooted tree); labels: a,d; a,c; b,d; b,c] P-Class: Maximum common subgraph in all PPH solutions. Each P-Class consists of two subtrees.

25 P-Class Property of PPH Solutions All PPH solutions can be obtained by choosing how to flip each P-Class. [Figure: one PPH solution and a second PPH solution, with the switching points marked]

26 The Key Theorem Every PPH solution can be obtained by choosing a flip for each P-Class. Conversely, after fixing one P-Class, every distinct choice of flips of the remaining P-Classes leads to a distinct PPH solution. If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
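The counting argument in the theorem can be checked with a tiny Python sketch. It only enumerates flip choices and does not build the trees themselves. Representing a choice as a tuple of booleans, with the first P-Class held unflipped, is an assumption made for illustration.

```python
from itertools import product


def flip_choices(k):
    """Yield one flip vector per PPH solution for k P-Classes.

    The first P-Class is held fixed (False); each remaining class is
    either left as is (False) or flipped (True).
    """
    if k < 1:
        raise ValueError("need at least one P-Class")
    for rest in product((False, True), repeat=k - 1):
        yield (False,) + rest


choices = list(flip_choices(3))
print(len(choices))  # 4, i.e. 2 ** (3 - 1)
```

One class is held fixed because flipping every class at once gives back the same solution, so only the flips relative to that class matter.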

27 Shadow Tree Contains classes. Each class in the shadow tree is a subgraph of a P-Class. Merging classes results in larger classes; classes are never split. Contains tree edges and shadow edges.

28 Overview of the Algorithm for One Row Procedure FirstPath; Procedure SecondPath; Procedure FixTree; Procedure NewEntries.

29 Procedures FirstPath and SecondPath FirstPath: Construct a first path towards the root of the shadow tree which passes through tree edges of as many columns in OldEntryList as possible. SecondPath: Construct a second path towards the root of the shadow tree which passes through tree edges of columns in OldEntryList and not on the first path.