A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield, Department of Computer Science

Paper: Zhihong Ding, Vladimir Filkov, Dan Gusfield. A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem. RECOMB 2005, pp. 585–600.

1 A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield. Department of Computer Science, University of California, Davis. RECOMB 2005.

2 Haplotypes to Genotypes
Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two states, denoted by 0 and 1. From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, then the genotype has state 0; the same rule holds for state 1. If the two haplotypes have states 0 and 1, or 1 and 0, then the state of the genotype is 2.
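The merging rule on this slide can be sketched in a few lines of Python. The function name `merge_haplotypes` and the example haplotypes are illustrative assumptions, not part of the original slides.

```python
def merge_haplotypes(h1, h2):
    """Merge two haplotypes (sequences of 0/1) into a genotype (list of 0/1/2)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states carry over unchanged; differing states (0,1) or (1,0) become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Hypothetical pair of haplotypes over four sites.
print(merge_haplotypes([1, 0, 0, 1], [0, 0, 1, 1]))  # [2, 0, 2, 1]
```

Merging is order-independent: swapping the two haplotypes gives the same genotype, which is why the phase information is lost.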

3 Haplotypes to Genotypes
[Figure: the two haplotypes of an individual are merged, site by site, into the genotype for the individual.]

4 Genotypes to Haplotypes
[Figure: the genotype for an individual and the two haplotypes per individual that explain it.]
For each site, if the genotype has state 0 or 1, then the two haplotypes must have states 0, 0 or 1, 1, respectively. If the genotype has state 2, the two haplotypes can have either states 0, 1 or states 1, 0.

5 Haplotype Inference Problem
For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is harder and more expensive to collect. Haplotype Inference Problem: given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. The NIH leads the HapMap project to find common haplotypes in the human population.

6 Haplotype Inference Problem
If the genotype has state 2 at k sites, there are $2^{k-1}$ possible explaining haplotype pairs. How do we determine which haplotype pair is the original one that generated the genotype? We need a model of haplotype evolution to help solve the haplotype inference problem.
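To make the count concrete, the following minimal sketch enumerates every unordered haplotype pair that explains one genotype. The function name `explaining_pairs` is an assumption for illustration; the sketch also handles the case of no heterozygous site, where exactly one pair exists.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return all unordered haplotype pairs that merge into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to 0 in h1 so that each unordered
    # pair is generated exactly once; the other k-1 sites are free.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# k = 3 heterozygous sites, so 2**(3-1) = 4 pairs.
print(len(explaining_pairs([2, 0, 2, 2])))  # 4
```

The exponential growth in k is the reason brute-force enumeration does not scale and a model of haplotype evolution is needed to choose among the candidates.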

7 The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a rooted tree over the sites, with the ancestral haplotype at the root, site mutations on the edges, and the extant haplotypes at the leaves.]

8 Assumptions of Perfect Phylogeny Model
No recombination, only mutation. Infinite-sites assumption: one mutation per site.

9 The Perfect Phylogeny Haplotyping (PPH) Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix
Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix
Site  1 2
a     1 0
a     0 1
b     0 0
b     0 1
c     1 0
c     1 0

[Figure: a perfect phylogeny whose leaves are the six haplotypes c, c, a, a, b, b.]
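Whether a given haplotype matrix fits a perfect phylogeny can be checked with the standard pairwise column test. The sketch below assumes the ancestral haplotype is all zeros (an assumption, since the slide does not state the root); under that assumption two sites conflict exactly when the state pairs (0,1), (1,0), and (1,1) all occur. This is a quadratic-time check on haplotypes, not the linear-time PPH algorithm of the talk, and the function name is hypothetical.

```python
from itertools import combinations


def has_perfect_phylogeny(haplotypes):
    """Test a 0/1 matrix for a perfect phylogeny rooted at the all-0 haplotype."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        # With an all-0 ancestor, sites i and j conflict exactly when
        # the pairs (0,1), (1,0) and (1,1) all occur.
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


# Haplotype matrix from the slide: rows a, a, b, b, c, c.
haplotype_matrix = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
print(has_perfect_phylogeny(haplotype_matrix))  # True
```

Resolving genotype a as (1,1) and (0,0) instead would still merge to 2 2, but together with b's haplotype (0,1) and c's haplotype (1,0) it would introduce the conflicting triple, so that choice does not fit a perfect phylogeny.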

10 Prior Work
Several existing algorithms solve the PPH problem, but none of them runs in linear time. Our contribution: a linear-time algorithm. Our implementation is about 250 times faster than the fastest previous algorithm on large data sets.

11 A P-Class of PPH Solutions
P-Class: a maximum common subgraph in all PPH solutions. Each P-Class consists of two subtrees.
[Figure: a genotype matrix with genotypes a, b, c, d and one PPH solution, a rooted tree carrying the labels a,d; a,c; b,d; b,c.]

12 P-Class Property of PPH Solutions
All PPH solutions can be obtained by choosing how to flip each P-Class.
[Figure: one PPH solution and a second PPH solution, both rooted trees over the same labels, with the switching points between them marked.]

13 The Key Theorem
Every PPH solution can be obtained by choosing a flip for each P-Class. Conversely, after fixing one P-Class, every distinct choice of flips of the P-Classes leads to a distinct PPH solution. If there are k P-Classes, there are $2^{k-1}$ distinct PPH solutions.
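The counting argument can be illustrated by enumerating the flip choices directly. This sketch only lists the choices; it does not build the corresponding trees, and the name `flip_choices` is an assumption for illustration.

```python
from itertools import product


def flip_choices(k):
    """Yield one tuple of flips per PPH solution for k P-Classes (k >= 1)."""
    # The first P-Class is held fixed (False); each of the remaining
    # k-1 classes is either left as is (False) or flipped (True).
    for rest in product((False, True), repeat=k - 1):
        yield (False,) + rest


print(len(list(flip_choices(3))))  # 4
```

Holding one class fixed is what removes the factor of two: flipping every class at once yields the same solution.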

14 Shadow Tree
The shadow tree contains classes. Each class in the shadow tree is a subgraph of a P-Class. Merging classes results in larger classes; classes are never split. The shadow tree contains tree edges and shadow edges.

15 The Algorithm
Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree. The genotype matrix contains only entries of value 0 and 2.

16 Overview of the Algorithm for One Row
Procedure FirstPath, Procedure SecondPath, Procedure FixTree, Procedure NewEntries.

17 OldEntryList
OldEntryList: the column indices that have entries of value 2 in this row and also have entries of value 2 in some previous row.
[Figure: a genotype matrix with row 3 highlighted. OldEntryList for row 3: 1, 2, 3, 5.]
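The definition of OldEntryList translates directly into code. The genotype matrix on the slide did not survive extraction, so the matrix below is hypothetical, chosen only so that row 3 yields the list 1, 2, 3, 5 quoted on the slide; the function name is likewise an assumption.

```python
def old_entry_list(matrix, row):
    """Columns (1-based) with a 2 in `row` that also hold a 2 in an earlier row."""
    return [
        col + 1
        for col, value in enumerate(matrix[row])
        if value == 2 and any(prev[col] == 2 for prev in matrix[:row])
    ]


# Hypothetical 0/2 genotype matrix (not the one from the slide).
genotypes = [
    [2, 2, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 0],
    [2, 2, 2, 0, 2, 2],
]
print(old_entry_list(genotypes, 2))  # [1, 2, 3, 5]
```

Column 6 holds a 2 in row 3 but in no earlier row, so it is a new entry rather than an old one. This direct version rescans earlier rows for clarity; a linear-time implementation would instead remember which columns have already seen a 2.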

18 Procedures FirstPath and SecondPath
FirstPath: construct a first path towards the root of the shadow tree that passes through the tree edges of as many columns in OldEntryList as possible. SecondPath: construct a second path towards the root of the shadow tree that passes through the tree edges of the columns in OldEntryList that are not on the first path.

19 Shadow Tree After Processing the First Two Rows
[Figure: the genotype matrix and the rooted shadow tree after the first two rows are processed, shown with the OldEntryList for row 3.]

20 Algorithm – FirstPath
[Figure: animation of FirstPath on the rooted shadow tree, tracking OldEntryList and CheckList.]
Edges 4 and 5 cannot be on the same path to the root in any PPH solution.

21 Algorithm – SecondPath
[Figure: animation of SecondPath on the rooted shadow tree. OldEntryList: 1, 2, 3, 5. CheckList: 3.]

22 Shadow Tree to PPH Solutions
[Figure: the genotype matrix with genotypes a, b, c, d, the final shadow tree, and one PPH solution read off from it.]

23 Shadow Tree to PPH Solutions
[Figure: the final shadow tree and a second PPH solution read off from it, carrying the labels a,d; b,c; b,d; a,c.]

24 Implementation – Leaf Count
Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i. L[i] is the number of leaves below mutation i in every perfect phylogeny for the genotype matrix. Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.
[Figure: example genotype matrix with rows a = 1100, b = 0220, c = 2020 and a row d, annotated with the leaf count of each column.]
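The leaf-count formula is a one-pass column sum. The sketch below uses a small hypothetical genotype matrix (not the one from the slide, whose last row is missing from the transcript); the function name is an assumption.

```python
def leaf_counts(matrix):
    """L[i] = (number of 2's) + 2 * (number of 1's) in column i."""
    return [
        sum(1 if g == 2 else 2 if g == 1 else 0 for g in column)
        for column in zip(*matrix)
    ]


# Hypothetical genotype matrix with three individuals and three sites.
genotypes = [
    [1, 0, 2],
    [2, 2, 0],
    [0, 2, 2],
]
print(leaf_counts(genotypes))  # [3, 2, 2]
```

The weights follow from the two haplotypes per individual: a 1 means both copies carry the mutation (two leaves below it), a 2 means exactly one copy does.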

25 Time Complexity
Each row requires a constant number of simple operations on each edge. Each traversal of the shadow tree goes through $O(m)$ edges. The algorithm performs a constant number of traversals of the shadow tree for each row. Total time: $O(nm)$, where n and m are the numbers of rows and columns in the genotype matrix.

26 Results
Average running times (seconds).
[Table with columns: Sites (m); Individuals (n); Dataset; DPPH, $O(nm^2)$; Our Alg., $O(nm)$.]

27 Thank you!
Paper and program can be downloaded at: