An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University.


1 An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees
Yufeng Wu
Dept. of Computer Science & Engineering, University of Connecticut, USA
RECOMB 2013

2 Hybridization Networks
Gene trees: phylogenetic history for individual genes
- Inferred from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Hybridization network: a directed acyclic graph displaying each gene tree
Reticulation event(s): nodes with in-degree two or more
Example sequences:
Gene A  1: TCG  2: TCA  3: CGG  4: CCG
Gene B  1: AGC  2: TGT  3: AAC  4: AGT
[Figure: gene trees TA and TB on taxa 1-4 and a hybridization network displaying both. Labels: "Keep two red edges", "Keep two black edges".]

3 The Minimum Hybridization Network Problem
Given: a set G of K gene trees.
Problem: reconstruct a hybridization network that displays each gene tree using Rmin(G), the minimum number of reticulation events.
NP-complete: even for K = 2
Most current approaches:
- exact methods for the K = 2 case (see Semple et al.)
- impose topological constraints (e.g. galled networks, see Huson et al.)
- or work on small-scale topologies
[Figure: gene trees T1, T2, T3 on taxa 1-4 and a network N with 2 reticulation events. Minimum!]

4 What if K ≥ 3?
The lower and upper bounds approach (Wu, 2010) for Rmin(G), where G is a set of K gene trees:
LB(G) ≤ Rmin(G) ≤ UB(G)
If LB(G) = UB(G), then Rmin(G) = LB(G) = UB(G).
Problem: if LB(G) < UB(G), the exact value of Rmin(G) is not known.
The K ≥ 3 case is much harder than the two-tree case.
This talk: the first exact algorithm for constructing the most parsimonious hybridization network for the K ≥ 3 case.

5 Backward in Time View and Ancestral Configurations
Backward in time:
1. Two lineages coalesce into one lineage.
2. One lineage reticulates into two lineages.
Ancestral configuration (AC): set of lineages in the network that are alive at time t.
Hybridization network = a series of ACs
Search for ACs: guided by the input trees
[Figure: three input trees on taxa 1-5 and a hybridization network with internal lineages a-j, drawn against a time axis (T0 to T5) with coalescence and reticulation events marked. ACs shown: {1,2,3,4,5}, {1,3,4,5,a,b}, {3,4,5,a,c}, {3,5,a,c,d,e}, {3,5,a,d,f}, {5,a,d,g}, {5,g,h}, {5,i}, {j}.]
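The series of ACs listed on this slide can be replayed step by step. The sketch below assumes the ACs appear in backward-time order (the transcript does not say so explicitly) and infers each event from set differences alone: two lineages replaced by one is read as a coalescence, one lineage replaced by two as a reticulation. The function name classify_event is hypothetical.

```python
# ACs copied from the slide; assumed (not stated) to be in backward-time order.
ACS = [
    {"1", "2", "3", "4", "5"},
    {"1", "3", "4", "5", "a", "b"},
    {"3", "4", "5", "a", "c"},
    {"3", "5", "a", "c", "d", "e"},
    {"3", "5", "a", "d", "f"},
    {"5", "a", "d", "g"},
    {"5", "g", "h"},
    {"5", "i"},
    {"j"},
]


def classify_event(before, after):
    """Infer the backward-in-time event that turns one AC into the next."""
    removed = before - after
    added = after - before
    if len(removed) == 2 and len(added) == 1:
        return "coalescence"   # two lineages merge into one
    if len(removed) == 1 and len(added) == 2:
        return "reticulation"  # one lineage splits into two
    raise ValueError(f"not a single event: -{removed} +{added}")


events = [classify_event(a, b) for a, b in zip(ACS, ACS[1:])]
num_reticulations = events.count("reticulation")
num_coalescences = events.count("coalescence")
print(events)
print("reticulations:", num_reticulations, "coalescences:", num_coalescences)
```

Under that ordering assumption, every consecutive pair of ACs differs by exactly one event, which is what "hybridization network = a series of ACs" suggests.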

6 Lineage in AC: Display Input Subtrees
Each lineage in the network displays one or more input subtrees.
- Which reticulation edge to follow?
- A lineage is represented by the set of subtrees it displays.
Progress of displayed subtrees when moving back:
- The further back we move, the larger the displayed input subtrees become.
- When a single lineage displays each complete input tree, we are done.
[Figure: the hybridization network against the time axis T0 to T5 and the three input trees T1, T2, T3 with their subtrees labeled; a lineage is annotated with the set of subtrees it displays.]

7 Search for Optimal ACs
High-level idea: breadth-first style search for optimal ancestral configurations
- Initial AC at level 0: (1),(2),(3),(4),(5)
- Level 1: ACs found by one reticulation from the initial AC, i.e. (1),(1),(2),(3),(4),(5); (1),(2),(2),(3),(4),(5); (1),(2),(3),(3),(4),(5); (1),(2),(3),(4),(4),(5); (1),(2),(3),(4),(5),(5); then ACs found from these by one or more coalescences
- Level 2, ...
- Level k: all ACs reachable from the initial AC with k reticulations (and any number of coalescences)
Stop when reaching a final configuration displaying each complete input tree.
[Figure: input trees T1, T2, T3 with labeled subtrees, and the tree of ACs explored level by level.]
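The level structure described on this slide (coalescences are free, each reticulation moves to the next level) can be sketched generically. The code below is not PIRNc and does not model ACs: level_search and its callback parameters are hypothetical names, and the demo uses a toy integer state space chosen only so the level logic can be checked by hand.

```python
from collections import deque


def level_search(initial, free_moves, costly_moves, is_final, max_level=10):
    """Return the smallest level k at which a final state is reachable.

    free_moves(s) plays the role of coalescences (stay on the same level);
    costly_moves(s) plays the role of reticulations (go to the next level).
    """
    seen = set()          # states already reached at an earlier level
    frontier = {initial}  # states that open the current level
    for level in range(max_level + 1):
        # Close the current level under free moves.
        closed = set(frontier)
        queue = deque(frontier)
        while queue:
            state = queue.popleft()
            if is_final(state):
                return level
            for nxt in free_moves(state):
                if nxt not in closed and nxt not in seen:
                    closed.add(nxt)
                    queue.append(nxt)
        seen |= closed
        # One costly move from any state on this level opens the next level.
        frontier = {t for s in closed for t in costly_moves(s) if t not in seen}
        if not frontier:
            return None
    return None


# Toy state space (an assumption for illustration only): states are integers,
# adding 3 is free, adding 1 costs one level, and the goal is to reach 20.
TARGET = 20
result = level_search(
    0,
    free_moves=lambda x: [x + 3] if x + 3 <= TARGET else [],
    costly_moves=lambda x: [x + 1] if x + 1 <= TARGET else [],
    is_final=lambda x: x == TARGET,
)
print(result)
```

Because every state on level k is explored before any state on level k + 1, the first final state found uses the fewest costly moves, which is the property the AC search relies on to be exact.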

8 Issues in Searching for Optimal ACs
The configuration search algorithm gives an optimal network.
Efficiency: the space of ACs is huge.
- For an AC with n lineages, e.g. n = 30: up to 465 new ACs with one reticulation or coalescence.
- Infeasible for data of even moderate size.
Prune infeasible ACs: sometimes a coalescence leads to an AC that is incompatible with the input trees.
- Key to making the AC search feasible for relatively large data
- Works when Rmin is relatively small
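One way to arrive at the figure of 465 for n = 30 is to count every pair of lineages that could coalesce plus every single lineage that could reticulate. This decomposition is an assumption; the transcript gives only the number, but the count matches. The function name max_successors is hypothetical.

```python
from math import comb


def max_successors(n):
    """Upper bound on ACs reachable from an n-lineage AC in one event.

    Assumed decomposition: any of the C(n, 2) pairs may coalesce,
    and any of the n lineages may reticulate.
    """
    return comb(n, 2) + n


print(max_successors(30))  # 435 pairs + 30 single lineages = 465
```

The bound grows quadratically in n and applies at every step of the search, which is why pruning matters.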

9 Techniques for Pruning ACs
Incompatible AC: some input subtrees cannot be displayed.
- A leaf under a lineage: covered
- Incompatible: some leaf is not covered by any lineage.
- There are stronger rules (see paper).
[Figure: input trees T1 and T2 on taxa 1-4 with example moves (coalesce 1 and 2, coalesce 3 and 4, reticulate 3), each marked compatible or incompatible.]
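The leaf-coverage rule can be illustrated with a minimal sketch. It assumes a simplified representation that is not taken from the paper: for one input tree, each lineage in the AC is given as the set of leaves of that tree lying under it. The function names and the example ACs are hypothetical, and the stronger rules mentioned on the slide are not modeled.

```python
def uncovered_leaves(tree_leaves, lineage_leaf_sets):
    """Leaves of one input tree that lie under no lineage of the AC."""
    covered = set().union(*lineage_leaf_sets)
    return set(tree_leaves) - covered


def passes_coverage_rule(tree_leaves, lineage_leaf_sets):
    """Weak compatibility check: every leaf must be covered by some lineage."""
    return not uncovered_leaves(tree_leaves, lineage_leaf_sets)


LEAVES = {1, 2, 3, 4}

# Hypothetical AC in which every leaf of the tree sits under some lineage.
ac_ok = [{1, 2}, {3}, {4}]
# Hypothetical AC in which leaf 4 is no longer under any lineage.
ac_bad = [{1, 2}, {3}]

print(passes_coverage_rule(LEAVES, ac_ok))
print(passes_coverage_rule(LEAVES, ac_bad), uncovered_leaves(LEAVES, ac_bad))
```

In a full search the check would be run against every input tree, and an AC failing for any tree would be dropped before its successors are generated.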

10 Implementation and Simulation
The algorithm is implemented in a downloadable open-source software tool:
- An exact method, PIRNc: can find the exact Rmin when Rmin is relatively small (say 5 or less).
- A heuristic method for larger data, PIRNch: searches a smaller space of ACs with a greedy approach.
Simulation data: from Wu (2010)
- Simulate a hybridization network N backwards in time for n species.
- Randomly select K trees embedded in N.
Evaluation criteria: compare with the original lower and upper bound approach. Do the bounds give an optimal network?

11 Performance of Exact Method: PIRNc
Only datasets with Rmin ≤ 4 are used. 100 datasets in total.
Number of taxa: fixed to 10.
K: number of gene trees, between 3 and 5
LB: existing lower bound method
UB: existing upper bound method
PIRNc: always finds an optimal solution (if run to the end)
PIRNc better: % of datasets where PIRNc finds the optimal Rmin but the bounds approach does not.
[Table: # of datasets PIRNc is better.]

12 Performance of Heuristic Method: PIRNch
PIRNc becomes slow when Rmin increases.
PIRNch: heuristic for larger data.
Larger data with taxa number 30, 40 or 50. 100 datasets each.
PIRNch < UB: # of datasets, among 100 datasets, where PIRNch < upper bound
PIRNch outperforms the original lower bound/upper bound approach for larger datasets.

13 Acknowledgement
More information available at: http://www.engr.uconn.edu/~ywu
Research supported by US National Science Foundation

