WABI 2005 Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event Yun S. Song, Yufeng Wu and Dan Gusfield University of California, Davis


WABI 2005
Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event
Yun S. Song, Yufeng Wu and Dan Gusfield
University of California, Davis

Haplotyping Problem
Diploid organisms have two (not necessarily identical) copies of each chromosome.
A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs).
SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1.
The mixed description is a genotype, a vector over 0, 1, 2:
–If both haplotypes are 0, the genotype is 0
–If both haplotypes are 1, the genotype is 1
–If one is 0 and the other is 1, the genotype is 2
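
The three encoding rules above can be sketched in a few lines of Python. The function name `merge_haplotypes` and the list-of-integers representation are assumptions made for illustration; they do not come from the slides.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value (0 or 1); differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


print(merge_haplotypes([0, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 2, 2]
```

The merge loses information: at every site coded 2, the genotype no longer says which haplotype carried the 1.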

Haplotypes and Genotypes
[Figure: the two haplotypes of an individual are merged, site by site, into the genotype for that individual.]
Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes.
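
To see why the HI problem is ambiguous, the sketch below enumerates every unordered haplotype pair that merges into a single genotype. A genotype with k >= 1 sites coded 2 has 2^(k-1) such pairs. The name `haplotype_pairs` is hypothetical, and this brute-force enumeration is only an illustration, not the method of the talk.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Yield every unordered haplotype pair that merges into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No site coded 2: both haplotypes equal the genotype itself.
        yield list(genotype), list(genotype)
        return
    # Fix the first heterozygous site to 0 in h1 so that each unordered
    # pair is produced exactly once.
    for bits in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield h1, h2


for pair in haplotype_pairs([1, 2, 2]):
    print(pair)
# ([1, 0, 0], [1, 1, 1])
# ([1, 0, 1], [1, 1, 0])
```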

Perfect Phylogeny Haplotyping (PPH)
Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice among solutions.
Gusfield (2002) introduced the PPH problem.
PPH is to find HI solutions that fit into a perfect phylogeny.
Nice results are known for PPH, including a linear-time algorithm.

The Perfect Phylogeny Model for Haplotypes
[Figure: a tree with the ancestral sequence at the root, site mutations on the edges, and the extant sequences at the leaves. The tree derives the set M of extant sequences.]
Assume at most 1 mutation at each site.
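
A standard way to test whether a set of 0/1 haplotypes fits this model is the four-gamete test: when the root sequence is not fixed in advance, a perfect phylogeny exists exactly when no pair of sites shows all four combinations 00, 01, 10, 11. The slides do not spell this test out, so the sketch below is an assumption about how one might check the model, with hypothetical function names.

```python
from itertools import combinations


def conflicting_site_pairs(haplotypes):
    """Return the site pairs (i, j) in which all four gametes occur."""
    n_sites = len(haplotypes[0])
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            conflicts.append((i, j))
    return conflicts


def has_perfect_phylogeny(haplotypes):
    """Four-gamete test (root not fixed): True if no site pair conflicts."""
    return not conflicting_site_pairs(haplotypes)


print(has_perfect_phylogeny([[0, 0], [0, 1], [1, 0]]))           # True
print(conflicting_site_pairs([[0, 0], [0, 1], [1, 0], [1, 1]]))  # [(0, 1)]
```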

PPH Example
[Figure: genotypes, the inferred haplotypes, and their perfect phylogeny.]

Imperfect Phylogeny Haplotyping (IPPH): Extending PPH
Often, real biological data does not have a PPH solution.
Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic).
Our approach: IPPH with an explicit genetic model, allowing a small amount of
–Homoplasy, i.e. back or recurrent mutation
–Recombination
Goal: extend the usage of PPH
–Real data may be only a small perturbation away from PPH
–Haplotype blocks: low recombination or homoplasy

Back/Recurrent Mutation for Haplotype Data
More than one mutation at a site

Recombination: Single Crossover
Recombination is one of the principal genetic forces shaping genetic variation.
Two equal-length sequences generate a third sequence of the same length: a prefix of one parent, up to the breakpoint, joined to the suffix of the other.
[Figure: prefix, suffix, breakpoint.]
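
A single crossover is easy to state in code. The sketch below assumes sequences are Python strings and that the breakpoint counts how many leading sites come from the prefix parent; the function and parameter names are illustrative only.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Join the first `breakpoint` sites of one parent to the rest of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 < breakpoint < len(prefix_parent):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


print(single_crossover("00110", "11001", 2))  # 00001
```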

IPPH (Imperfect Phylogeny Haplotyping) Problems
Small deviation from PPH
H1-IPPH problem
–Find a tree that allows exactly one site to mutate twice
–Each remaining site mutates at most once
–Derive haplotypes for the given genotypes
R1-IPPH problem
–Find a network that has exactly one recombination event
–Each site mutates at most once
–Derive haplotypes for the given genotypes

Number of Minimum Recombinations for Haplotypes
Frequency of the minimum number of recombinations (Rmin) for small rho (scaled recombination rate); 20 sequences, 30 sites, 500 simulations.
Rmin   Rho=1      Rho=3   Rho=[missing]
0      [missing]  23.6%   8.4%
1      31.8%      35.2%   27.6%
2      6.8%       24.8%   27.8%
3      -          11.6%   21.6%
4      -          3.8%    9.0%
5      -          0.8%    3.6%
6      -          0.2%    1.4%

Haplotyping with One Homoplasy
More than one mutation at a site
Genotype:
    s1 s2 s3
a   0  2  0
b   1  2  2
Haplotype:
    s1 s2 s3
a1  0  0  0
a2  0  1  0
b1  1  0  1
b2  1  1  0
[Figure: homoplasy tree rooted at 000 with leaves a1, a2, b1, b2.]

Algorithm for H1-IPPH
For each site s in the input genotype data M
–Test whether M-{s} has PPH solutions
–If not, move to the next site
–Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
–If yes, stop and report the result
Assume only one PPH solution for M-{s}.
But how to find solutions with 1 homoplasy at s efficiently?
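
The outer loop of this algorithm can be sketched for tiny inputs as follows. This is a brute-force illustration only: it covers just the first test (does M-{s} have a PPH solution?), it enumerates all phasings in exponential time instead of using an efficient PPH algorithm, and it uses the four-gamete test with the root not fixed. All function names are hypothetical. The example matrix is the genotype data from the "Haplotyping with One Homoplasy" slide (a = 020, b = 122).

```python
from itertools import combinations, product


def resolutions(genotype):
    """All unordered haplotype pairs that merge into one genotype row."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    out = []
    for bits in product([0, 1], repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out


def four_gamete_ok(haplotypes):
    """True if no pair of sites shows all of 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    return all(len({(h[i], h[j]) for h in haplotypes}) < 4
               for i, j in combinations(range(n_sites), 2))


def has_pph_solution(genotypes):
    """Brute force: try every combination of per-row resolutions."""
    for choice in product(*(resolutions(g) for g in genotypes)):
        haps = [h for pair in choice for h in pair]
        if four_gamete_ok(haps):
            return True
    return False


def candidate_homoplasy_sites(genotypes):
    """Sites s such that M-{s} has a PPH solution (first step of the loop)."""
    candidates = []
    for s in range(len(genotypes[0])):
        reduced = [g[:s] + g[s + 1:] for g in genotypes]
        if has_pph_solution(reduced):
            candidates.append(s)
    return candidates


M = [[0, 2, 0], [1, 2, 2]]
print(has_pph_solution(M))           # False
print(candidate_homoplasy_sites(M))  # [0, 1]
```

Sites are 0-indexed here, so the output says that removing either s1 or s2 leaves data with a PPH solution. Whether a single homoplasy at one of those sites then gives an HI solution is the second, harder check, which this sketch does not attempt.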

Example
[Figure: the genotype matrix M is separated at site i3 into M-{i3} and the column {i3}.]

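[Figure: PPH on M-{i3} yields the haplotype matrix Mh-{i3}, and column {i3} yields h{i3}; rows r2, r2', s2, s2' are marked.]
Assume Mh-{i3} is fixed.
Combine Mh-{i3} with h{i3}: haplotypes for the same genotype must pair up, and there are two ways to pair.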

4 ways to try pairing i3.
Exponential number in general, even for one PPH solution.
Need a polynomial-time method to avoid trying all the pairings.
[Figure: Mh-{i3}, h{i3}, Mh1, Mh2.]

Move to Trees
Convert the perfect phylogeny tree from the PPH solution to an unrooted tree.
[Figure: Mh-{i3} and h{i3}.]

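1 Homoplasy: from T to Tr, Ts
Tree T contains the recurrent site s twice; its leaves fall into the groups L1, L2, O1, O2.
s induces a split Ts: L1, L2 on one side and O1, O2 on the other.
Deleting s induces tree Tr.
[Figure: tree T, split Ts, tree Tr.]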
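
At the level of the haplotype matrix, the two derived objects correspond to simple operations: the split is the partition of the rows by their value at s, and deleting s leaves the reduced matrix from which Tr would be built. The sketch below uses the haplotypes from the "Haplotyping with One Homoplasy" slide and takes s1 as the recurrent site purely for illustration; the slides do not say which site is recurrent in that example, and the function names are hypothetical.

```python
def split_by_site(haplotypes, s):
    """Partition haplotype names by their allele at site s (the split Ts)."""
    zeros = sorted(name for name, h in haplotypes.items() if h[s] == 0)
    ones = sorted(name for name, h in haplotypes.items() if h[s] == 1)
    return zeros, ones


def delete_site(haplotypes, s):
    """Drop column s; a tree Tr would be built from the remaining sites."""
    return {name: h[:s] + h[s + 1:] for name, h in haplotypes.items()}


haps = {"a1": (0, 0, 0), "a2": (0, 1, 0), "b1": (1, 0, 1), "b2": (1, 1, 0)}
print(split_by_site(haps, 0))  # (['a1', 'a2'], ['b1', 'b2'])
print(delete_site(haps, 0))
```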

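From Tr, Ts to T
Find two subtrees Ts1, Ts2 in Tr such that Ts1 and Ts2 together correspond to one side of Ts.
[Figure: tree Tr with leaf groups L and O, the split Ts, and the resulting tree T, in which s occurs twice and the leaves fall into L1, L - L1, O1, O2.]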

1. Pick one side of the partition from Ts
2. Pick the leaves from Tr corresponding to the chosen partition side
3. Check whether the selected leaves fit into two subtrees

1. May need to refine a non-binary vertex before picking a subtree.
[Figure: s2 can pair with r2'.]

Solution

Algorithms and Results
Efficient graph-coloring based method to select the two subtrees (skipped)
Implemented in C++
Simulation on data generated with the program ms
Compared to PHASE (a haplotyping program)
–Accuracy: comparable
–Speed: at least 10x faster
–100x100 data: about 3 seconds
Can identify the homoplasy site with high accuracy: >95% in simulation

Algorithm for R1-IPPH
Split M into ML and MR by cutting between two sites.
[Figure: M, ML, MR.]
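
The cutting step can be sketched as below: each cut position between two adjacent sites gives a left matrix ML and a right matrix MR. The function names and the row-list representation are assumptions for illustration; the example matrix reuses the genotypes a = 020, b = 122 from the homoplasy slide only as sample data.

```python
def split_at_cut(matrix, cut):
    """Return (ML, MR): the first `cut` columns and the remaining columns."""
    left = [row[:cut] for row in matrix]
    right = [row[cut:] for row in matrix]
    return left, right


def all_cuts(matrix):
    """Every way to cut between two adjacent sites, as (cut, ML, MR)."""
    n_sites = len(matrix[0])
    return [(cut, *split_at_cut(matrix, cut)) for cut in range(1, n_sites)]


M = [[0, 2, 0], [1, 2, 2]]
for cut, ML, MR in all_cuts(M):
    print(cut, ML, MR)
# 1 [[0], [1]] [[2, 0], [2, 2]]
# 2 [[0, 2], [1, 2]] [[0], [2]]
```

A matrix with m sites has m - 1 cut positions, so trying them all adds only a linear factor to whatever is done with each (ML, MR) pair.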

PPH Solutions
Build a perfect phylogeny for each of the two partitions.

1-SPR operation
SPR: subtree-prune-regraft operation
The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1

Algorithm for R1-IPPH
The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.
Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time (not in paper).

Conclusions
Contributions
–Assuming a bounded number of PPH solutions:
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event
Open problems
–Haplotyping with more than 1 recombination efficiently
–Remove the assumption that the number of PPH solutions for M-{s} is bounded

Thank you Questions?