1
WABI 2005
Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event
Yun S. Song, Yufeng Wu and Dan Gusfield
University of California, Davis
2
Haplotyping Problem
Diploid organisms have two (not identical) copies of each chromosome.
A single copy is a haplotype: a vector of Single Nucleotide Polymorphisms (SNPs).
SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1.
The mixed description of the two copies is a genotype: a vector over 0, 1, 2.
– If both haplotypes are 0, the genotype is 0
– If both haplotypes are 1, the genotype is 1
– If one is 0 and the other is 1, the genotype is 2
3
Haplotypes and Genotypes
Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0
Two haplotypes per individual; merging the haplotypes gives the genotype for the individual.
Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes.
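The merge rule from the slide can be written as a short function. This is a minimal sketch: the name `merge_haplotypes` is hypothetical, and the two haplotypes are the ones shown on the slide.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; differing alleles are recorded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Merging is easy; the HI problem is the inverse direction, where a genotype with k sites equal to 2 can come from 2^(k-1) different haplotype pairs.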
4
Perfect Phylogeny Haplotyping (PPH)
Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice among solutions.
Gusfield (2002) introduced the PPH problem.
PPH: find HI solutions that fit into a perfect phylogeny.
There are nice results for PPH, including a linear-time algorithm.
5
The Perfect Phylogeny Model for Haplotypes
[Figure: a tree rooted at the ancestral sequence 00000, with site mutations 1 to 5 on the edges and the extant sequences at the leaves.]
The tree derives the set M (sites 1 2 3 4 5): 10100, 10000, 01011, 01010, 00010
Assume at most 1 mutation at each site.
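A standard way to check whether a set of binary sequences fits a perfect phylogeny is the four-gamete test: no pair of sites may show all four combinations 00, 01, 10 and 11. The sketch below assumes the unrooted version of the model (ancestral sequence not fixed); the function name is hypothetical, and the second input is the four-sequence data set from the back/recurrent mutation slide.

```python
from itertools import combinations


def has_perfect_phylogeny(haplotypes):
    """Four-gamete test: True if no pair of sites shows 00, 01, 10 and 11."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


M = ["10100", "10000", "01011", "01010", "00010"]
print(has_perfect_phylogeny(M))  # True

# Sites 1 and 2 of this set show all four gametes.
print(has_perfect_phylogeny(["000", "010", "101", "110"]))  # False
```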
6
PPH Example
[Figure: genotypes, inferred haplotypes, and the resulting perfect phylogeny.]
7
Imperfect Phylogeny Haplotyping (IPPH): Extending PPH
Often, real biological data does not have PPH solutions.
Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic).
Our approach: IPPH with an explicit genetic model, allowing a small amount of
– Homoplasy, i.e. back or recurrent mutation
– Recombination
Goal: extend the usage of PPH
– Real data: may be a small perturbation from PPH
– Haplotype block: low recombination or homoplasy
8
Back/Recurrent Mutation for Haplotypes
Data: 000, 010, 101, 110
[Figure: tree deriving the data, with site mutations 2, 1, 3 and 1 on the edges.]
More than one mutation at a site
9
Recombinations: Single Crossover
Recombination is one of the principal genetic forces shaping genetic variation.
Two equal-length sequences generate a third sequence of the same length:
110001111111001 (contributes the prefix)
000110000001111 (contributes the suffix)
110000000001111 (recombinant; the breakpoint separates prefix from suffix)
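A single crossover is a prefix of one sequence glued to the suffix of the other. The sketch below uses a hypothetical function name and reads the slide's sequences as two 15-site parents; the breakpoint position 5 is inferred from them and is not stated on the slide.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Take the first `breakpoint` sites of one parent and the rest of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 < breakpoint < len(prefix_parent):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


p1 = "110001111111001"
p2 = "000110000001111"
child = single_crossover(p1, p2, 5)
print(child)  # 110000000001111
```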
10
IPPH (Imperfect Phylogeny Haplotyping) Problems
Small deviation from PPH
H1-IPPH problem
– Find a tree that allows exactly one site to mutate twice
– Each of the remaining sites mutates at most once
– Derive haplotypes for the given genotypes
R1-IPPH problem
– Find a network that has exactly one recombination event
– Each site mutates at most once
– Derive haplotypes for the given genotypes
11
Number of Minimum Recombinations for Haplotypes
Frequency of the minimum number of recombinations (Rmin) for small rho (scaled recombination rate); 20 sequences, 30 sites, 500 simulations.
Rmin   Rho=1   Rho=3   Rho=5
0      60.8%   23.6%    8.4%
1      31.8%   35.2%   27.6%
2       6.8%   24.8%   27.8%
3              11.6%   21.6%
4               3.8%    9.0%
5               0.8%    3.6%
6               0.2%    1.4%
12
Haplotyping with One Homoplasy
More than one mutation at a site
Genotype:
      s1 s2 s3
a      0  2  0
b      1  2  2
Haplotype:
      s1 s2 s3
a1     0  0  0
a2     0  1  0
b1     1  0  1
b2     1  1  0
[Figure: homoplasy tree rooted at 000 with leaves a1, a2, b1, b2 and site mutations 2, 1, 3 and 1 on the edges.]
13
Algorithm for H1-IPPH
For each site s in the input genotype data M
– Test whether M-{s} has PPH solutions
– If not, move to the next site
– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
– If yes, stop and report the result
Assume only one PPH solution for M-{s}.
But how to find solutions with 1 homoplasy at s efficiently?
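The first test of the loop (does M-{s} have a PPH solution?) can be illustrated on tiny inputs by brute force. This sketch is not the method from the talk: it swaps in exhaustive enumeration of phasings plus the four-gamete test (unrooted model) for the linear-time PPH algorithm, takes exponential time, and does not implement the one-homoplasy check at s. All function names are hypothetical, and sites are numbered from 0.

```python
from itertools import combinations, product


def phasings(genotype):
    """Yield every haplotype pair that merges into the 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)


def four_gamete_ok(haplotypes):
    """True if no pair of sites shows all four gametes."""
    n_sites = len(haplotypes[0])
    return all(len({(h[i], h[j]) for h in haplotypes}) < 4
               for i, j in combinations(range(n_sites), 2))


def has_pph_solution(genotypes):
    """Brute force: try every combination of phasings (exponential time)."""
    for choice in product(*(list(phasings(g)) for g in genotypes)):
        haps = [h for pair in choice for h in pair]
        if four_gamete_ok(haps):
            return True
    return False


def candidate_homoplasy_sites(genotypes):
    """Sites s (0-based) for which the data without s has a PPH solution."""
    n_sites = len(genotypes[0])
    return [s for s in range(n_sites)
            if has_pph_solution([g[:s] + g[s + 1:] for g in genotypes])]


# Genotypes a = 020 and b = 122 from the one-homoplasy slide.
M = [(0, 2, 0), (1, 2, 2)]
print(has_pph_solution(M))           # False
print(candidate_homoplasy_sites(M))  # [0, 1]
```

Only the sites returned by `candidate_homoplasy_sites` would go on to the second check in the loop.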
14
Example
[Figure: genotype matrix M split at site i3 into M-{i3} and {i3}.]
15
[Figure: PPH turns M-{i3} into the haplotype matrix Mh-{i3}; column {i3} becomes h{i3}. Rows r2, r2’, s2, s2’ are marked.]
Assume Mh-{i3} is fixed.
Combine Mh-{i3} with h{i3}.
Haplotypes for the same genotype must pair up: two ways to pair.
16
4 ways to try pairing i3.
Exponential number in general, even for one PPH solution.
Need a polynomial-time method to avoid trying all the pairings.
[Figure: Mh-{i3} and h{i3} combined into candidate matrices Mh1, Mh2.]
17
Move to Trees
Convert the perfect phylogeny tree from the PPH solution to an unrooted tree.
[Figure: Mh-{i3} and h{i3}.]
18
1 Homoplasy: from T to Tr, Ts
Tree T: recurrent mutation at site s (leaf sets L1, L2, O1, O2)
Site s induces a split Ts: L1, L2 on one side and O1, O2 on the other
Deleting s induces the tree Tr
19
From Tr, Ts to T
Find two subtrees Ts1, Ts2 in Tr such that Ts1, Ts2 correspond to one side of Ts.
[Figure: tree Tr, the split Ts with sides L and O, and the tree T with leaf sets L1, L - L1, O1, O2.]
21
1. Pick one side of the partition from Ts
2. Pick the leaves of Tr corresponding to the chosen partition side
3. Check whether the selected leaves fit into two subtrees
22
1. May need to refine a non-binary vertex before picking a subtree
s2 can pair with r2’
23
Solution
24
Algorithms and Results
Efficient graph-coloring-based method to select two subtrees (skipped)
Implemented in C++
Simulation on data generated with the program ms
Comparison with PHASE (a haplotyping program)
– Accuracy: comparable
– Speed: at least 10x faster
– 100x100 data: about 3 seconds
Can identify the homoplasy site with high accuracy: >95% in simulation
25
Algorithm for R1-IPPH
Split M by cutting between two sites, giving a left part ML and a right part MR.
26
PPH Solutions
Build a perfect phylogeny for each of the two partitions.
27
1-SPR operation
SPR: subtree-prune-regraft operation
The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
28
Algorithm for R1-IPPH
The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.
Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time (not in paper).
29
Conclusions
Contributions
– Assuming a bounded number of PPH solutions
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event
Open problems
– Haplotyping with more than 1 recombination efficiently
– Remove the assumption that the number of PPH solutions for M-{s} is bounded
30
Thank you Questions?