1 A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield
Department of Computer Science, University of California, Davis
RECOMB 2005
2 Haplotypes to Genotypes
- Each individual has two “copies” of each chromosome.
- At each site, each chromosome has one of two states, denoted 0 and 1.
- From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, the genotype has state 0; likewise for state 1. If the two haplotypes have states 0 and 1 (in either order), the genotype has state 2.
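The merging rule above can be sketched in a few lines of Python. This is only an illustration; the function name `merge_haplotypes` is hypothetical, and the two input haplotypes are the nine-site example used on the next slide.

```python
def merge_haplotypes(h1, h2):
    """Combine two haplotypes (sequences of 0/1) into one genotype (0/1/2)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states carry over unchanged; differing states become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```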
3 Haplotypes to Genotypes
Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0
The two haplotypes of an individual are merged to give the genotype for the individual.
4 Genotypes to Haplotypes
Genotype:     2 1 2 1 0 0 1 2 0
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
For each site, if the genotype has state 0 or 1, the two haplotypes must have states 0, 0 or 1, 1 respectively. If the genotype has state 2, the two haplotypes can have states 0, 1 or 1, 0.
5 Haplotype Inference Problem
- For disease association studies, haplotype data is more valuable than genotype data, but it is harder and more expensive to collect.
- Haplotype Inference Problem: given a set of n genotypes, determine the original set of n haplotype pairs that generated them.
- The NIH leads the HapMap project to find common haplotypes in the human population.
6 Haplotype Inference Problem
- If the genotype has state 2 at k sites, there are 2^(k-1) possible explaining haplotype pairs.
- How do we determine which haplotype pair is the original one that generated the genotype?
- We need a model of haplotype evolution to help solve the haplotype inference problem.
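To see where the count 2^(k-1) comes from, the explaining pairs of a single genotype can be enumerated directly. The sketch below is illustrative only (the name `explaining_pairs` is hypothetical): each of the k sites with state 2 can be resolved two ways, and swapping the two haplotypes of a pair gives the same unordered pair, so 2^k assignments collapse to 2^(k-1) pairs for k >= 1.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return the set of unordered haplotype pairs that merge to `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            # One haplotype gets b at this site, the other gets the opposite.
            h1[site], h2[site] = b, 1 - b
        # frozenset makes (h1, h2) and (h2, h1) the same pair.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]  # k = 3 sites with state 2
print(len(explaining_pairs(genotype)))  # 4, i.e. 2^(3-1)
```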
7 The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a perfect phylogeny over sites 1 to 5. The root is the ancestral haplotype 00000, each edge is labeled with the site that mutates on it, and the leaves are the extant haplotypes 10100, 10000, 01011, 00010 and 01010.]
8 Assumptions of the Perfect Phylogeny Model
- No recombination, only mutation.
- Infinite-site assumption: one mutation per site.
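A standard pairwise test, which the slides do not state but which is well known for this model, checks whether a set of binary haplotypes is consistent with these assumptions: with an all-zero ancestral haplotype, a perfect phylogeny exists exactly when no two sites show all three of the patterns 01, 10 and 11. The sketch below (hypothetical function name) applies it to the five extant haplotypes from the previous slide.

```python
from itertools import combinations


def fits_rooted_perfect_phylogeny(haplotypes):
    """True if no pair of sites shows all of the patterns 01, 10 and 11.

    Assumes binary haplotypes and an all-zero ancestral haplotype.
    """
    n_sites = len(haplotypes[0])
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if forbidden <= seen:
            return False
    return True


extant = ["10100", "10000", "01011", "00010", "01010"]
haps = [[int(c) for c in h] for h in extant]
print(fits_rooted_perfect_phylogeny(haps))  # True
```

The test compares every pair of sites, so it is quadratic in the number of sites; it is a consistency check, not part of the linear-time algorithm.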
9 The Perfect Phylogeny Haplotyping (PPH) Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
Genotype matrix
Site:  1 2
a      2 2
b      0 2
c      1 0
Haplotype matrix
Site:  1 2
a      1 0
a      0 1
b      0 0
b      0 1
c      1 0
c      1 0
[Figure: perfect phylogeny for the haplotype matrix, with one edge for each of sites 1 and 2 and leaves 10 (a, c, c), 01 (a, b) and 00 (b).]
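For a matrix this small, the PPH problem can be solved by exhaustive search, which makes the definition concrete. The sketch below is an exponential-time illustration and is not the linear-time algorithm of the talk. It assumes an all-zero ancestral haplotype and uses the standard pairwise test (no two sites showing all of 01, 10 and 11); all function names are hypothetical.

```python
from itertools import combinations, product


def explaining_pairs(genotype):
    """All unordered haplotype pairs that merge to `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def fits_rooted_perfect_phylogeny(haplotypes):
    """Pairwise test, assuming an all-zero ancestral haplotype."""
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(len(haplotypes[0])), 2):
        if forbidden <= {(h[i], h[j]) for h in haplotypes}:
            return False
    return True


def brute_force_pph(genotypes):
    """Try every combination of explaining pairs; keep those that fit."""
    solutions = []
    for choice in product(*(explaining_pairs(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if fits_rooted_perfect_phylogeny(haplotypes):
            solutions.append(choice)
    return solutions


genotypes = [[2, 2], [0, 2], [1, 0]]  # rows a, b, c of the slide
for solution in brute_force_pph(genotypes):
    print(solution)
# (((0, 1), (1, 0)), ((0, 0), (0, 1)), ((1, 0), (1, 0)))
```

The single solution found matches the haplotype matrix on the slide: a is explained by 10 and 01, b by 00 and 01, and c by 10 and 10.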
10 Prior Work
- Several existing algorithms solve the PPH problem, but none of them runs in linear time.
- Our contribution: a linear-time algorithm.
- On large data sets, our implementation is about 250 times faster than the fastest previous algorithm.
11 A P-Class of PPH Solutions
Genotype Matrix
Sites:  1 2 3 4 5
a       2 2 2 0 0
b       2 0 0 2 2
c       2 2 2 0 2
d       2 0 0 2 0
[Figure: one PPH solution for this matrix, drawn as a tree below the root with edges labeled by sites 1 to 5 and leaves labeled a,d; a,c; b,d; b,c.]
- P-Class: maximum common subgraph in all PPH solutions.
- Each P-Class consists of two subtrees.
12 P-Class Property of PPH Solutions
- All PPH solutions can be obtained by choosing how to flip each P-Class.
[Figure: one PPH solution and a second PPH solution for the same genotypes, each drawn below the root with edges labeled by sites 1 to 5 and leaves labeled a,d; a,c; b,c; b,d. The switching points are marked.]
13 The Key Theorem
- Every PPH solution can be obtained by choosing a flip for each P-Class.
- Conversely, after fixing one P-Class, every distinct choice of flips of the P-Classes leads to a distinct PPH solution.
- If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
14 Shadow Tree
- Contains classes.
- Each class in the shadow tree is a subgraph of a P-Class.
- Merging classes results in larger classes; classes are never split.
- Contains tree edges and shadow edges.
15 The Algorithm
- Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.
- The genotype matrix contains only entries of value 0 and 2.
16 Overview of the Algorithm for One Row
- Procedure FirstPath
- Procedure SecondPath
- Procedure FixTree
- Procedure NewEntries
17 OldEntryList
Genotype Matrix
Sites:  1 2 3 4 5
Row 1:  2 2 2 0 0
Row 2:  2 0 0 2 2
Row 3:  2 2 2 0 2
Row 4:  2 0 0 2 0
- OldEntryList: column indices that have entries of value 2 in this row and also have entries of value 2 in some previous row.
- OldEntryList for row 3: 1, 2, 3, 5
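The OldEntryList definition above translates directly into code. The following is a minimal sketch with a hypothetical function name, using 1-based row and column indices to match the slide. It rescans the earlier rows for clarity; an implementation aiming at the O(nm) bound would presumably track which columns have already seen a 2 instead.

```python
def old_entry_list(matrix, row):
    """Columns (1-based) with a 2 in `row` (1-based) and in some earlier row."""
    current = matrix[row - 1]
    earlier = matrix[: row - 1]
    return [
        col + 1
        for col, value in enumerate(current)
        if value == 2 and any(prev[col] == 2 for prev in earlier)
    ]


matrix = [
    [2, 2, 2, 0, 0],
    [2, 0, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [2, 0, 0, 2, 0],
]
print(old_entry_list(matrix, 3))  # [1, 2, 3, 5]
```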
18 Procedures FirstPath and SecondPath
- FirstPath: construct a first path towards the root of the shadow tree that passes through the tree edges of as many columns in OldEntryList as possible.
- SecondPath: construct a second path towards the root of the shadow tree that passes through the tree edges of the columns in OldEntryList that are not on the first path.
19 Shadow Tree After Processing the First Two Rows
[Figure: the shadow tree after rows 1 and 2 of the genotype matrix from slide 17 have been processed, with tree edges and shadow edges for columns 1 to 5 below the root.]
OldEntryList for row 3: 1, 2, 3, 5
20 Algorithm – FirstPath
[Figure: FirstPath applied to the shadow tree for row 3, showing OldEntryList and CheckList as the path is built.]
- Edges 4 and 5 cannot be on the same path to the root in any PPH solution.
21 Algorithm – SecondPath
[Figure: SecondPath applied to the shadow tree for row 3. OldEntryList: 1, 2, 3, 5. CheckList: 3.]
22 Shadow Tree to PPH Solutions
[Figure: the final shadow tree for the genotype matrix from slide 17 (genotypes a to d, sites 1 to 5) and one PPH solution derived from it.]
23 Shadow Tree to PPH Solutions
[Figure: the final shadow tree and a second PPH solution derived from it, with leaves labeled a,d; b,c; b,d; a,c.]
24 Implementation – Leaf Count
- Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.
- L[i] is the number of leaves below mutation i in every perfect phylogeny for the genotype matrix.
- Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.
Example:
Column:      1 2 3 4
a            1 1 0 0
b            0 2 2 0
c            2 0 2 0
d            2 0 0 2
Leaf count:  4 3 2 1
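The leaf count defined above takes one pass over the matrix to compute. The sketch below (hypothetical function name) reproduces the counts for the example matrix on this slide: a 2 contributes one leaf because only one of the two haplotypes carries the mutation, and a 1 contributes two because both do.

```python
def leaf_counts(matrix):
    """L[i] for each column: number of 2's plus twice the number of 1's."""
    weight = {0: 0, 1: 2, 2: 1}
    n_cols = len(matrix[0])
    return [sum(weight[row[c]] for row in matrix) for c in range(n_cols)]


matrix = [
    [1, 1, 0, 0],  # a
    [0, 2, 2, 0],  # b
    [2, 0, 2, 0],  # c
    [2, 0, 0, 2],  # d
]
print(leaf_counts(matrix))  # [4, 3, 2, 1]
```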
25 Time Complexity
- Constant number of simple operations on each edge per row.
- Each traversal in the shadow tree goes through O(m) edges.
- The algorithm does a constant number of traversals in the shadow tree for each row.
- Total time: O(nm), where n and m are the numbers of rows and columns in the genotype matrix.
26 Results
Average Running Times (seconds)
Sites (m)  Individuals (n)  Datasets  DPPH O(nm^2)  Our Alg. O(nm)
300        150              30        1.07          0.05
500        250              30        5.72          0.13
1000       500              30        45.85         0.48
2000       1000             10        467.18        1.89
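The speedup implied by these timings can be worked out directly, which also shows where the "about 250 times faster" figure from the Prior Work slide comes from. The snippet below only divides the two timing columns of the table; the variable and function names are illustrative.

```python
# (sites m, individuals n, DPPH seconds, linear-time algorithm seconds)
timings = [
    (300, 150, 1.07, 0.05),
    (500, 250, 5.72, 0.13),
    (1000, 500, 45.85, 0.48),
    (2000, 1000, 467.18, 1.89),
]


def speedups(rows):
    """Ratio of DPPH time to the linear-time algorithm's time, per row."""
    return [round(dpph / ours, 1) for _, _, dpph, ours in rows]


print(speedups(timings))  # [21.4, 44.0, 95.5, 247.2]
```

The ratio roughly doubles each time m doubles, as expected when comparing an O(nm^2) method with an O(nm) one.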
27 Thank you!
Paper and program can be downloaded at: http://wwwcsif.cs.ucdavis.edu/~gusfield/lpph/