WABI 2005 Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event Yun S. Song, Yufeng Wu and Dan Gusfield University of California, Davis
Haplotyping Problem
Diploid organisms have two (not identical) copies of each chromosome.
A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs)
SNP: a site where two nucleotide types occur frequently, coded 0 or 1
The mixed description is a genotype, a vector of 0, 1, 2
–If both haplotypes are 0, genotype is 0
–If both haplotypes are 1, genotype is 1
–If one is 0 and the other is 1, genotype is 2
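The three merge rules can be written as a few lines of Python. This is a minimal sketch: the function name `genotype` and the list encoding are illustrative assumptions, not something from the talk.

```python
def genotype(h1, h2):
    """Merge two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal alleles keep their value; a 0/1 mix becomes 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]

print(genotype([0, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 2, 2]
```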
Haplotypes and Genotypes
[Figure: the two haplotypes of an individual are merged, site by site, into the genotype for the individual]
Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
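To see why HI needs a model to choose among solutions, it helps to list every haplotype pair that explains a single genotype: a genotype with k sites equal to 2 (k at least 1) has 2^(k-1) unordered explaining pairs. The sketch below enumerates them; the name `explaining_pairs` is a hypothetical helper, not part of the talk.

```python
from itertools import product

def explaining_pairs(g):
    """Return every unordered haplotype pair that merges into genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]  # ambiguous sites
    if not het:
        return [(tuple(g), tuple(g))]  # no ambiguity: a single pair
    pairs = []
    # Fix the first haplotype to 0 at the first ambiguous site so that
    # each unordered pair is listed exactly once.
    for rest in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, bit in zip(het, (0,) + rest):
            h1[i], h2[i] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

for pair in explaining_pairs([0, 2, 2]):
    print(pair)
```

For the genotype 0 2 2 this lists two pairs: (000, 011) and (001, 010).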
Perfect Phylogeny Haplotyping (PPH) Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice among solutions. Gusfield (2002) introduced the PPH problem: find HI solutions that fit into a perfect phylogeny. Nice results for PPH, including a linear-time algorithm
The Perfect Phylogeny Model for Haplotypes
[Figure: a tree with the ancestral sequence at the root, site mutations on the edges, and the extant sequences at the leaves; the tree derives the set M]
Assume at most 1 mutation at each site
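Whether a set of 0/1 sequences can be derived by such a tree can be checked pairwise. A minimal sketch, assuming the ancestral sequence is all 0s (an assumption made here for illustration): the sequences fit a perfect phylogeny exactly when no two sites show all three patterns (0,1), (1,0) and (1,1). The function name and the example matrices are hypothetical.

```python
from itertools import combinations

def has_perfect_phylogeny(rows):
    """Pairwise test for 0/1 rows, assuming an all-0 ancestral sequence."""
    n_sites = len(rows[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(r[i], r[j]) for r in rows}
        # Both sites mutated, sometimes together and sometimes apart:
        # impossible with at most one mutation per site.
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

tree_like = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 0, 0)]
conflict = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
print(has_perfect_phylogeny(tree_like))  # True
print(has_perfect_phylogeny(conflict))   # False
```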
PPH Example
[Figure: genotypes, the inferred haplotypes, and their perfect phylogeny]
Imperfect Phylogeny Haplotyping (IPPH): Extending PPH
Real biological data often has no PPH solution.
Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic)
Our approach: IPPH with an explicit genetic model that allows a small amount of
–Homoplasy, i.e. back or recurrent mutation
–Recombination
Goal: extend the usage of PPH
–Real data: may be a small perturbation of PPH
–Haplotype block: low recombination or homoplasy
Back/Recurrent Mutation for Haplotype Data More than one mutation at a site
Recombination: Single Crossover Recombination is one of the principal genetic forces shaping genetic variation. Two equal-length sequences generate a third sequence of the same length: the prefix of one, up to the breakpoint, followed by the suffix of the other.
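A single crossover is easy to state in code. The sketch below treats sequences as strings over 0/1; the function name `recombine` and the convention that the breakpoint counts the sites taken from the prefix parent are assumptions made for illustration.

```python
def recombine(prefix_parent, suffix_parent, breakpoint):
    """Single crossover: prefix of one parent, suffix of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 < breakpoint < len(prefix_parent):
        raise ValueError("breakpoint must fall between two sites")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]

print(recombine("00110", "11001", 2))  # 00001
```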
IPPH (Imperfect Phylogeny Haplotyping) Problems
Small deviation from PPH
H1-IPPH problem
–Find a tree that allows exactly one site to mutate twice
–Every other site mutates at most once
–Derive haplotypes for the given genotypes
R1-IPPH problem
–Find a network that has exactly one recombination event
–Each site mutates at most once
–Derive haplotypes for the given genotypes
Number of Minimum Recombinations for Haplotypes
Frequency of the minimum number of recombinations (Rmin) for small rho (scaled recombination rate); 20 sequences, 30 sites, 500 simulations

Rmin   Rho=1    Rho=3    Rho=
0      %        23.6%    8.4%
1      31.8%    35.2%    27.6%
2      6.8%     24.8%    27.8%
3               11.6%    21.6%
4               3.8%     9.0%
5               0.8%     3.6%
6               0.2%     1.4%
Haplotyping with One Homoplasy
More than one mutation at a site

Genotype
     s1  s2  s3
a    0   2   0
b    1   2   2

Haplotype
     s1  s2  s3
a1   0   0   0
a2   0   1   0
b1   1   0   1
b2   1   1   0

[Figure: homoplasy tree for the haplotypes above]
Algorithm for H1-IPPH
For each site s in the input genotype data M
–Test whether M-{s} has PPH solutions
–If not, move to the next site
–Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
–If yes, stop and report the result
Assume only one PPH solution for M-{s}
But how to find solutions with 1 homoplasy at s efficiently?
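The outer loop can be illustrated on a simplified analogue. The sketch below works on already phased haplotypes, not on genotypes, and replaces the PPH test with a pairwise gamete test that assumes an all-0 ancestral sequence; it covers only the filtering step "does M-{s} fit a perfect phylogeny" and not the check for one homoplasy at s. All names are hypothetical, and the data are the four haplotypes 000, 010, 101, 110 of the one-homoplasy example.

```python
from itertools import combinations

def fits_perfect_phylogeny(rows):
    """Pairwise gamete test, assuming an all-0 ancestral sequence."""
    for i, j in combinations(range(len(rows[0])), 2):
        if {(0, 1), (1, 0), (1, 1)} <= {(r[i], r[j]) for r in rows}:
            return False
    return True

def candidate_homoplasy_sites(rows):
    """Indices s such that the matrix without column s passes the test."""
    candidates = []
    for s in range(len(rows[0])):
        reduced = [r[:s] + r[s + 1:] for r in rows]  # M - {s}
        if fits_perfect_phylogeny(reduced):
            candidates.append(s)
    return candidates

haplotypes = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
print(candidate_homoplasy_sites(haplotypes))  # [0, 1]
```

Here sites s1 and s2 (indices 0 and 1) conflict with each other, so removing either one leaves a matrix that passes; removing s3 does not.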
Example
[Figure: matrix M split at site i3 into M-{i3} and the column {i3}]
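Combine Mh-{i3} with h{i3}
[Figure: PPH turns M-{i3} into the haplotype matrix Mh-{i3}, with rows r2, r2’ and s2, s2’; the column {i3} becomes h{i3}]
Assume Mh-{i3} is fixed. Haplotypes for the same genotype must pair up, so each genotype can be paired in two ways.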
4 ways to try pairing i3. The number is exponential in general, even for one PPH solution. Need a polynomial-time method that avoids trying all the pairings. [Figure: combining Mh-{i3} with h{i3} gives candidate matrices such as Mh1 and Mh2]
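The count is easy to reproduce by brute force. In this sketch each genotype that is 2 at the removed site contributes one fixed haplotype pair, and the 0 and 1 of that site can be attached to the pair in two ways, so k such genotypes give 2^k combinations. The function name, the example pairs, and the choice to append the site as the last column are assumptions for illustration.

```python
from itertools import product

def all_pairings(fixed_pairs):
    """Yield every way to attach the removed site's 0/1 to fixed pairs.

    fixed_pairs holds one (h, h') pair of haplotype tuples, without the
    removed site, per genotype that is 2 at that site.
    """
    for choice in product([0, 1], repeat=len(fixed_pairs)):
        yield [(h + (b,), h2 + (1 - b,))
               for (h, h2), b in zip(fixed_pairs, choice)]

fixed = [((0, 0), (0, 1)), ((1, 0), (1, 1))]
options = list(all_pairings(fixed))
print(len(options))  # 4
```

With ten such genotypes the same loop would produce 1024 candidates, which is what a polynomial-time method has to avoid.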
Move to Trees Convert the perfect phylogeny tree from the PPH solution into an unrooted tree [Figure: Mh-{i3} and h{i3}]
1 Homoplasy: from T to Tr, Ts
–s induces a split Ts: L1, L2 | O1, O2
–Deleting s induces the tree Tr
[Figure: tree T with recurrent site s and leaf sets L1, L2, O1, O2, shown next to the split Ts and the tree Tr]
From Tr, Ts to T
Find two subtrees Ts1, Ts2 in Tr such that Ts1, Ts2 correspond to one side of Ts
[Figure: the split Ts with sides L and O, the tree Tr, and the resulting tree T with leaf sets L1, L - L1, O1, O2]
1. Pick one side of the partition from Ts
2. Pick the leaves from Tr corresponding to the chosen partition side
3. Check whether the selected leaves fit into two subtrees
1. May need to refine a non-binary vertex before picking a subtree [Figure: s2 can pair with r2’]
Solution
Algorithms and Results
Efficient graph-coloring based method to select two subtrees (skipped)
Implemented in C++
Simulation on data generated with the program ms
Compared to PHASE (a haplotyping program)
–Accuracy: comparable
–Speed: at least 10x faster
–100x100 data: about 3 seconds
Can identify the homoplasy site with high accuracy: >95% in simulation
Algorithm for R1-IPPH Split M into ML and MR by cutting between two sites
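Enumerating the possible cuts is straightforward. A minimal sketch, assuming M is a list of genotype rows with one entry per site; the name `splits` is hypothetical, and handing each side to a PPH solver is not shown.

```python
def splits(M):
    """Yield (cut, ML, MR) for every cut between two adjacent sites."""
    n_sites = len(M[0])
    for cut in range(1, n_sites):
        ML = [row[:cut] for row in M]  # sites left of the cut
        MR = [row[cut:] for row in M]  # sites right of the cut
        yield cut, ML, MR

M = [(0, 2, 1, 0), (2, 2, 0, 1)]
for cut, ML, MR in splits(M):
    print(cut, ML, MR)
```

A matrix with m sites has m - 1 candidate cuts, so this step adds only a linear factor.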
PPH Solutions Build a perfect phylogeny for each of the two partitions
1-SPR operation SPR: subtree-prune-regraft operation. The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
Algorithm for R1-IPPH The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary. Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time (not in paper).
Conclusions
Contributions (assuming a bounded number of PPH solutions)
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event
Open problems
–Efficient haplotyping with more than 1 recombination
–Remove the assumption that the number of PPH solutions for M-{s} is bounded
Thank you Questions?