1
Haplotyping via Perfect Phylogeny: A Direct Approach
Dan Gusfield, CS, UC Davis. Joint work with V. Bafna, G. Lancia and S. Yooseph.
2
Genotypes and Haplotypes
Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states), denoted by 0 and 1 (motivated by SNPs).
[Figure: two haplotypes per individual; merging the haplotypes gives the genotype for the individual.]
3
Haplotyping Problem
Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.
Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.
4
The Perfect Phylogeny Model
We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed. In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
5
The Perfect Phylogeny Model
[Figure: a perfect phylogeny on sites 1 to 5. The ancestral haplotype 00000 is at the root, the site mutations 1, 2, 3, 4, 5 label the edges, and the extant haplotypes 00010, 10100, 10000, 01010, 01011 are at the leaves.]
6
Justification for Perfect Phylogeny Model
In the absence of recombination, each haplotype of any individual has a single parent, so tracing back the history of the haplotypes in a population gives a tree. There is recent strong evidence for long regions of DNA with no recombination; this is key to the NIH haplotype mapping project (see NY Times, October 30, 2002). Mutations are rare at selected sites, so they are assumed non-recurrent. Connection with coalescent models.
7
The Haplotype Phylogeny Problem
Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny. A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2.
[Figure: genotype matrix S with rows a, b, c and sites 1, 2.]
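The merge operation can be sketched in a few lines of Python. This is only an illustration: the function names `merge` and `explains` are hypothetical, and the encoding (a site where both haplotypes agree keeps that value, a site where they differ becomes 2) is the convention assumed from the example on the slide.

```python
def merge(h1, h2):
    """Merge two haplotypes (strings over '0'/'1') into a genotype string.

    Assumed convention: agreeing sites keep their value, differing sites become '2'.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explains(h1, h2, genotype):
    """Return True if the haplotype pair (h1, h2) explains the genotype."""
    return merge(h1, h2) == genotype
```

For instance, `merge("01", "10")` gives `"22"`, matching the example above.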
8
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
[Figure: the genotype matrix for a, b, c at sites 1 and 2, next to an explaining haplotype matrix.]
9
The Haplotype Phylogeny Problem (PPH problem)
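Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
[Figure: the genotype matrix for a, b, c at sites 1 and 2, an explaining haplotype matrix, and a perfect phylogeny with root 00 whose leaves carry the haplotypes of a, b and c.]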
10
The Alternative Explanation
No tree is possible for this explanation.
[Figure: the genotype matrix for a, b, c at sites 1 and 2, next to an alternative explaining haplotype matrix.]
11
When does a set of haplotypes fit a perfect phylogeny?
Classic NASC: Arrange the haplotypes in a matrix, two haplotypes for each individual. Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all three pairs: 0,1 and 1,0 and 1,1. This is the 3-Gamete Test.
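A minimal sketch of the 3-Gamete Test, assuming the haplotypes are given as equal-length strings over 0 and 1. The function name is hypothetical, and the check is the direct pairwise column comparison, not an optimized method.

```python
from itertools import combinations


def passes_three_gamete_test(haplotypes):
    """Return True if no two columns contain all of 0,1 and 1,0 and 1,1."""
    if not haplotypes:
        return True
    forbidden = {("0", "1"), ("1", "0"), ("1", "1")}
    num_sites = len(haplotypes[0])
    for i, j in combinations(range(num_sites), 2):
        # Collect every pair of values that appears in columns i and j.
        seen = {(row[i], row[j]) for row in haplotypes}
        if forbidden <= seen:
            return False
    return True
```

Applied to the five extant haplotypes of the earlier tree figure (00010, 10100, 10000, 01010, 01011), the test passes, as expected for leaves of a perfect phylogeny with all-0 root.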
12
We can remove the red words to obtain another
true statement. Also, we can consider an unrooted version of the problem, where the 4-gamete test is used, but in this talk we consider the simpler, rooted version. See the full paper for the unrooted version.
13
The Alternative Explanation
No tree is possible for this explanation.
[Figure: the genotype matrix for a, b, c at sites 1 and 2, next to the alternative explaining haplotype matrix.]
14
The Tree Explanation Again
[Figure: the genotype matrix for a, b, c at sites 1 and 2, the explaining haplotype matrix, and the perfect phylogeny with root 0 0.]
15
The case of the unknown root
The 3-Gamete Test is for the case when the root is assumed to be the all-0 vector. When the root is not known, the NASC is that the submatrix with rows 0 0, 0 1, 1 0 and 1 1 must not appear in the matrix (after row and column permutations). This is called the 4-Gamete Test.
16
Solving the Haplotype Phylogeny Problem (PPH) in nearly linear O(nm alpha(nm)) time
Gusfield, RECOMB, April 2002 Simple Tools based on classical Perfect Phylogeny Problem. Complex Tools based on Graph Realization Problem (graphic matroid realization). But in this talk, we develop a simpler, but somewhat slower version.
17
Program PPH
Program PPH solves the perfect phylogeny haplotyping problem using the graph realization approach. It solves problems with 50 sites and 100 individuals in about 1 second. Program PPH can be obtained at
18
The Combinatorial Problem
Input: A ternary matrix (0,1,2) M with 2N rows partitioned into N pairs of rows, where the two rows in each pair are identical. Def: If a pair of rows (r,r’) in the partition have entry values of 2 in a column j then positions (r,j) and (r’,j) are called Mates.
19
Output: A binary matrix M’ created from M by replacing each 2 in M with either 0 or 1, such that
a) A position is assigned 0 if and only if its Mate is assigned 1.
b) M’ passes the 3-Gamete Test, i.e., does not contain a 3x2 submatrix (after row and column permutations) with all three combinations 0,1; 1,0; and 1,1.
20
Initial Observations
If two columns of M contain the following rows
2 0
2 0 (mates)
0 2
0 2 (mates)
then M’ will contain a row with 1 0 and a row with 0 1 in those columns. This is a forced expansion.
21
Initial Observations
Similarly, if two columns of M contain the mates
2 1
2 1
then M’ will contain a row with 1 1 in those columns. This is a forced expansion.
22
If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any mates
2 2
2 2
in those columns must be set to be
0 1
1 0
We say that two columns are forced out-of-phase.
If a forced expansion of two columns creates 1 1 in those columns, then any mates
2 2
2 2
in those columns must be set to be
1 1
0 0
We say that two columns are forced in-phase.
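The forced-expansion check for one pair of columns can be sketched as follows. The sketch assumes one genotype string per individual (rather than the duplicated mate rows of M), 0-based column indices, and hypothetical function names and return labels ("in", "out", "fail"); it is an illustration of the rules on these slides, not the code of Program DPPH.

```python
def forced_pairs(genotypes, i, j):
    """Value pairs that every expansion must create in columns i and j."""
    pairs = set()
    for g in genotypes:
        a, b = g[i], g[j]
        if a == "2" and b == "2":
            continue  # a 2 2 row is not forced by itself
        # A lone 2 expands to both 0 and 1 across the two mates.
        for x in ("01" if a == "2" else a):
            for y in ("01" if b == "2" else b):
                pairs.add((x, y))
    return pairs


def forced_phase(genotypes, i, j):
    """Return 'in', 'out', 'fail' (3-Gamete violation) or None (not forced)."""
    pairs = forced_pairs(genotypes, i, j)
    in_phase = ("1", "1") in pairs
    out_of_phase = ("0", "1") in pairs and ("1", "0") in pairs
    if in_phase and out_of_phase:
        return "fail"
    if in_phase:
        return "in"
    if out_of_phase:
        return "out"
    return None
```

With genotypes `["20", "02"]` the columns are forced out-of-phase, and with `["21"]` they are forced in-phase, matching the two initial observations.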
23
Example: Columns 1 and 2, and 1 and 3 are forced in-phase. Columns 2 and 3 are forced out-of-phase.
[Figure: example matrix with mate row pairs a, b, c, d, e.]
24
Immediate Failure
It can happen that the forced expansion of cells creates a 3x2 submatrix that fails the 3-Gamete Test. In that case, there is no PPH solution for M.
Example:
2 0
1 1
0 2
Will fail the 3-Gamete Test.
25
An O(nm^2)-time Algorithm
Find all the forced phase relationships by considering columns in pairs. Find all the inferred, invariant, phase relationships. Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred. Result: An implicit representation of all solutions to the PPH problem.
26
A running example.
[Figure: matrix M with mate row pairs a, b, c, d, e.]
27
Graph G
Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns. The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
[Figure: graph G on nodes 1 to 7.]
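Building the edge set of G is a direct scan over the rows. The sketch below assumes one genotype string per individual and 0-based column indices (the slides number columns from 1); `build_graph` is a hypothetical name.

```python
from itertools import combinations


def build_graph(genotypes):
    """Return the edges (i, j), i < j, of G: column pairs sharing a row with 2's."""
    edges = set()
    for g in genotypes:
        twos = [j for j, value in enumerate(g) if value == "2"]
        # Every pair of 2-columns in this row becomes an edge.
        edges.update(combinations(twos, 2))
    return edges
```

For example, `build_graph(["220", "022", "100"])` yields the edges `(0, 1)` and `(1, 2)`.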
28
Graph Gc
Each Red edge indicates that the columns are forced in-phase. Each Blue edge indicates forced out-of-phase. Let Gf be the subgraph of Gc defined by the red and blue edges.
[Figure: graph Gc on nodes 1 to 7.]
29
Graph Gf has three connected components.
[Figure: graph Gf on nodes 1 to 7.]
30
The Central Theorem
There is a solution to the PPH problem for M if and only if there is a coloring of the dashed edges of Gc with the following property: For any triangle (i,j,k) in Gc, where there is one row containing 2’s in all three columns i,j and k (any triangle containing at least one dashed edge will be of this type), the coloring makes either 0 or 2 of the edges blue (out-of-phase).
Nice, but how do we find such a coloring?
31
Note on CMU talk Feb. 28, 2003 In that talk I oversimplified the central theorem, focusing only on the triangles with at least one dashed edge. This approach can be made to work, but wasn’t quite right as stated in the talk. The statement in the prior slide is correct.
32
Triangle Rule
Theorem 1: If there are any dashed edges whose ends are in the same connected component of Gf, at least one edge is in a triangle where the other edges are not dashed, and in every PPH solution, it must be colored so that the triangle has an even number of Blue (out of Phase) edges. This is an “inferred” coloring.
[Figure: graph Gf on nodes 1 to 7.]
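The parity condition in the triangle rule reduces to an exclusive-or once colors are encoded as bits. The encoding below (0 for Red/in-phase, 1 for Blue/out-of-phase) and the function names are assumptions made for illustration only.

```python
RED, BLUE = 0, 1  # assumed encoding: RED = in-phase, BLUE = out-of-phase


def infer_third_edge(color_a, color_b):
    """Color the third edge so the triangle has an even number of BLUE edges."""
    return color_a ^ color_b


def triangle_ok(color_a, color_b, color_c):
    """Return True if the triangle has 0 or 2 BLUE edges."""
    return (color_a + color_b + color_c) % 2 == 0
```

So one Red and one Blue edge force the dashed edge to be Blue, while two edges of the same color force it to be Red.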
36
Corollary
Inside any connected component of Gf, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pairwise column comparisons, or by triangle-based inferred colorings. Hence, the phase relationships of all the columns in a connected component of Gf are INVARIANT over all the solutions to the PPH problem.
37
The dashed edges in Gf can be ordered so that
the inferred colorings can be done in linear time. Modification of DFS. See the paper for details, or assign it as a homework exercise.
38
Finishing the Solution
Problem: A connected component C of G may contain several connected components of Gf, so any edge crossing two components of Gf will still be dashed. How should they be colored?
39
How should we color the remaining dashed edges in a connected component C of Gc?
[Figure: graph Gc on nodes 1 to 7.]
40
Answer
For a connected component C of G with k connected components of Gf, select any subset S of k-1 dashed edges in C, so that S together with the red and blue edges span all the nodes of C. Arbitrarily, color each edge in S either red or blue. Infer the color of any remaining dashed edges by successive use of the triangle rule.
41
Pick and color edges (2,5) and (3,7). The remaining dashed edges are colored by using the triangle rule.
[Figure: graph Gc on nodes 1 to 7.]
43
Theorem 2
Any selected S works (allows the triangle rule to work) and any coloring of the edges in S determines the colors of any remaining dashed edges. Different colorings of S determine different colorings of the remaining dashed edges. Each different coloring of S determines a different solution to the PPH problem. All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.
44
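How does the coloring determine a PPH solution?
Each component of G is handled independently. So, assume only one component of G. Arbitrarily set the 2’s in column 1, say as 1 in one row of each pair of Mates and 0 in the other.
[Figure: the running-example matrix with mate row pairs a, b, c, d, e.]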
45
For j from 2 to m: If a row in column j has a 2, scan to the left for a column j’ in M with a 2 in that row. If j’ is found, use the phase relationship between j and j’ to set those 2’s in col. j. Otherwise, set them arbitrarily.
[Figure: the running-example matrix with mate row pairs a, b, c, d, e.]
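The left-to-right scan can be sketched per individual. The sketch assumes a valid edge coloring is already available through a caller-supplied function `phase(j_prev, j)` returning "in" or "out" (a hypothetical interface), processes one genotype string at a time rather than column by column, uses 0-based columns, and always compares against the leftmost 2 in the row, which is one of the allowed choices of j’.

```python
def expand_row(genotype, phase):
    """Expand one genotype string into its haplotype pair.

    phase(j_prev, j) must return 'in' or 'out' for two columns that both
    hold a 2 in this row (assumed to come from a valid edge coloring).
    """
    h1, h2 = [], []
    first_two = None  # leftmost column of this row that holds a 2
    for j, value in enumerate(genotype):
        if value != "2":
            h1.append(value)
            h2.append(value)
        elif first_two is None:
            # No 2 to the left: the choice is arbitrary.
            h1.append("1")
            h2.append("0")
            first_two = j
        else:
            same = phase(first_two, j) == "in"
            h1.append(h1[first_two] if same else h2[first_two])
            h2.append(h2[first_two] if same else h1[first_two])
    return "".join(h1), "".join(h2)
```

For a row `"2212"` where columns 0 and 1 are out-of-phase and columns 0 and 3 are in-phase, this produces the pair `"1011"` and `"0110"`, whose merge is again `"2212"`.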
46
PPH solution derived from the edge coloring.
[Figure: the expanded binary matrix with mate row pairs a, b, c, d, e.]
47
A biologically more meaningful restatement?
Once a PPH solution is found we use the connected components of Gf to partition the columns (sites) into blocks. Inside each block, the haplotype pairs are fixed. But in any block, all the shaded 0’s and 1’s can be switched, changing the complete haplotypes, formed from all the blocks.
48
Starting from a PPH Solution, if all shaded cells in a block switch value, then the result is also a PPH solution, and any PPH solution can be obtained in this way, i.e. by choosing in each block whether to switch or not.
[Figure: the PPH solution matrix with mate row pairs a, b, c, d, e and shaded cells in each block.]
49
Corollary
In a single connected component C of G with k connected components in Gf, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C. If G has r connected components and t connected components of Gf, then there are exactly 2^(t-r) solutions to the PPH problem. There is one unique PPH solution if and only if each connected component in G is a connected component in Gf.
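The count 2^(t-r) only needs the component counts of G and Gf, which a small union-find computes. The sketch assumes that at least one PPH solution exists, that columns are numbered from 0, and that every forced (red or blue) edge is also an edge of G; the function names are hypothetical.

```python
def count_components(num_nodes, edges):
    """Count connected components with a simple union-find."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(num_nodes)})


def count_pph_solutions(num_columns, g_edges, forced_edges):
    """Return 2^(t-r), assuming a solution exists and forced_edges is within g_edges."""
    r = count_components(num_columns, g_edges)
    t = count_components(num_columns, forced_edges)
    return 2 ** (t - r)
```

With seven columns, one component in G and three components in Gf (as in the running example's counts), the result is 4, which agrees with coloring k-1 = 2 selected edges in all possible ways.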
50
Algorithm
Build Graph G and find its connected components.
Solve each connected component C of G separately.
Find the forced (red or blue) edges. Let Gf be the subgraph of C containing colored edges.
Find each connected component of Gf and make the inferred edge colorings (phase decisions).
Find a spanning tree of uncolored edges in C, and color those edges arbitrarily, and follow the inferred edge colorings.
51
Secondary information and optimization
The partition shows explicitly what added phase information is useful and what is redundant. Phase information for an edge is redundant if and only if the edge is inside a component of Gf. Apply this successively as additional phase information is obtained. Problem: Minimize the number of haplotype pairs (individuals) that need be laboratory determined in order to find the correct tree. Minimize the number of (individual, site1, site2) triples whose phase relationship needs to be determined, in order to find the correct tree.
52
The implicit representation of all solutions provides
a framework for solving these secondary problems, as well as other problems involving the use of additional information, and specific tree-selection criteria.
53
A Phase-Transition Problem: As the ratio of sites to genotypes changes, how does the probability that the PPH solution is unique change? For greatest utility, we want genotype data where the PPH solution is unique. Intuitively, as the ratio of genotypes to sites increases, the probability of uniqueness increases.
54
Frequency of a unique solution with 50 and 100 sites, 5% rule and 2500 datasets per entry
# geno.   Frequency of unique solution
10        0.0018
20        0.0032
22        0.7646
40        0.7488
42        0.9611
70        0.994
130       0.999
140       1

# geno.   Frequency of unique solution
10
20
22        0.78
40        0.725
42        0.971
60        0.983
100       0.999
110       1
55
Program DPPH
Program DPPH implements the solution to the PPH problem discussed in this talk. It can be obtained at wwwcsif.cs.ucdavis.edu/~gusfield/
56
Observed running times
The following are typical running times of Program DPPH running on an 800 MHz Mac G4 Powerbook. The first number is the number of genotypes and the second the number of sites.
[Table: running times in seconds for inputs with 20, 50, 100 and 300 genotypes.]
57
The full paper Technical Report from UCD, July 17, 2002
can be found on the recent papers page through wwwcsif.cs.ucdavis.edu/~gusfield