An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees
Yufeng Wu
Dept. of Computer Science & Engineering, University of Connecticut, USA
RECOMB
Hybridization Networks
Gene trees: phylogenetic history for individual genes
- Inferred from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Gene A: 1: TCG, 2: TCA, 3: CGG, 4: CCG
Gene B: 1: AGC, 2: TGT, 3: AAC, 4: AGT
Hybridization network: a directed acyclic graph displaying each gene tree
Reticulation event(s): nodes with in-degree two or more
[Figure: a hybridization network displaying gene trees TA and TB; one tree is obtained by keeping the two red edges, the other by keeping the two black edges.]
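As a minimal sketch of the definition above, the following Python finds the reticulation nodes (in-degree two or more) of a network given as a list of directed edges. The edge list and node names are hypothetical and chosen only for illustration; they are not the network drawn on the slide.

```python
from collections import Counter


def reticulation_nodes(edges):
    """Return the nodes whose in-degree is two or more."""
    indegree = Counter(child for _parent, child in edges)
    return sorted(node for node, degree in indegree.items() if degree >= 2)


# Hypothetical network on taxa 1..4 (each edge points from parent to child).
# Node "h" has two parents, "u" and "v", so it is a reticulation node.
example_edges = [
    ("root", "u"), ("root", "v"),
    ("u", "1"), ("u", "h"),
    ("v", "h"), ("v", "w"),
    ("h", "2"),
    ("w", "3"), ("w", "4"),
]

print(reticulation_nodes(example_edges))
```

Dropping either edge into "h" leaves a tree, which is how such a network can display two different gene trees.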
The Minimum Hybridization Network Problem
Given: a set G of K gene trees.
Problem: reconstruct a hybridization network that displays each gene tree with Rmin(G), the minimum number of reticulation events.
NP-complete, even for K = 2.
Most current approaches:
- are exact methods for the K = 2 case (see Semple, et al.),
- impose topological constraints (e.g. galled networks, see Huson, et al.), or
- work on small-scale topologies.
[Figure: input trees T1, T2 and T3, and a network N with 2 reticulation events, which is the minimum.]
What if K ≥ 3?
The lower and upper bounds approach (Wu, 2010) for Rmin(G), where G is a set of K gene trees:
LB(G) ≤ Rmin(G) ≤ UB(G)
- If LB(G) = UB(G), then Rmin(G) = LB(G) = UB(G).
- Problem: if LB(G) < UB(G), the exact value of Rmin(G) is not known.
The K ≥ 3 case is much harder than the two-tree case.
This talk: the first exact algorithm for constructing the most parsimonious hybridization network for the K ≥ 3 case.
Backward in Time View and Ancestral Configurations
Backward in time:
1. Two lineages coalesce into one lineage.
2. One lineage reticulates into two lineages.
Ancestral configuration (AC): the set of lineages in the network that are alive at time t.
Hybridization network = a series of ACs.
Search for ACs: guided by the input trees.
[Figure: three input trees and a hybridization network with lineages a to j and labels T0 to T5, drawn against a time axis with coalescence and reticulation events marked. Series of ACs shown: {1,2,3,4,5}, {1,3,4,5,a,b}, {3,4,5,a,c}, {3,5,a,c,d,e}, {3,5,a,d,f}, {5,a,d,g}, {5,g,h}, {5,i}, {j}.]
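The two backward-in-time events can be sketched as set operations on an AC. The Python below replays the series of ACs listed on this slide. The event list is an assumption: it is inferred from the differences between consecutive ACs, not taken from the figure itself, and the function names are hypothetical.

```python
def reticulate(ac, lineage, parents):
    """Backward in time, one lineage reticulates into two lineages."""
    if lineage not in ac:
        raise ValueError(f"{lineage} is not alive in {sorted(ac)}")
    return (ac - {lineage}) | frozenset(parents)


def coalesce(ac, pair, parent):
    """Backward in time, two lineages coalesce into one lineage."""
    if not set(pair) <= ac:
        raise ValueError(f"{pair} are not both alive in {sorted(ac)}")
    return (ac - frozenset(pair)) | {parent}


# Events inferred from consecutive ACs on the slide (an assumption).
EVENTS = [
    ("reticulate", "2", ("a", "b")),
    ("coalesce", ("1", "b"), "c"),
    ("reticulate", "4", ("d", "e")),
    ("coalesce", ("c", "e"), "f"),
    ("coalesce", ("3", "f"), "g"),
    ("coalesce", ("a", "d"), "h"),
    ("coalesce", ("g", "h"), "i"),
    ("coalesce", ("5", "i"), "j"),
]


def replay(initial, events):
    """Return the series of ACs produced by applying the events in order."""
    acs = [frozenset(initial)]
    for kind, source, target in events:
        step = reticulate if kind == "reticulate" else coalesce
        acs.append(step(acs[-1], source, target))
    return acs


for ac in replay(["1", "2", "3", "4", "5"], EVENTS):
    print(sorted(ac))
```

Each reticulation adds one lineage and each coalescence removes one, so five leaves with two reticulations need six coalescences to reach the single root lineage.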
Lineage in AC: Display Input Subtrees
- Each lineage in the network displays one or more input subtrees.
- Which reticulation edge to follow? A lineage is represented by the set of subtrees it displays.
- Progress of displayed subtrees when moving back: the further we move backward, the larger the input subtrees obtained.
- Done when a single lineage displays each complete input tree.
[Figure: input trees T1, T2 and T3 with subtrees labeled T0 to T5, and lineages a, b and c annotated with the subtrees they display.]
Search for Optimal ACs
High-level idea: breadth-first style search for optimal ancestral configurations.
- Level 0: the initial AC, (1),(2),(3),(4),(5).
- Level k: all ACs reachable from the initial AC with k reticulations (and any number of coalescences).
- Stop when reaching a final configuration displaying each complete input tree.
[Figure: search tree over ACs for input trees T1, T2 and T3. One reticulation from the initial AC duplicates a lineage, giving (1),(1),(2),(3),(4),(5), (1),(2),(2),(3),(4),(5), (1),(2),(3),(3),(4),(5), (1),(2),(3),(4),(4),(5) and (1),(2),(3),(4),(5),(5); further ACs are then found by one or more coalescences, and the search continues to Level 2 and beyond.]
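The level structure can be sketched as a generic search in which coalescences are free moves and each reticulation moves to the next level. This is only an illustration of the level-by-level idea, not the PIRN implementation: the callbacks `coalesce`, `reticulate` and `is_final` are hypothetical, no pruning is done, and the toy state space (plain integers) merely stands in for real ACs.

```python
def coalescence_closure(start, coalesce):
    """All states reachable from `start` using coalescences only."""
    seen = set(start)
    stack = list(seen)
    while stack:
        ac = stack.pop()
        for nxt in coalesce(ac):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def min_reticulations(initial, coalesce, reticulate, is_final, max_level):
    """Smallest k such that a final state appears at level k, else None."""
    level = coalescence_closure({initial}, coalesce)
    visited = set(level)
    for k in range(max_level + 1):
        if any(is_final(ac) for ac in level):
            return k
        # One reticulation from any state of level k starts level k + 1.
        frontier = {nxt for ac in level for nxt in reticulate(ac)} - visited
        level = coalescence_closure(frontier, coalesce) - visited
        visited |= level
    return None


# Toy stand-in for the AC space (an assumption, for illustration only):
# a free move goes n -> n + 1 unless n % 3 == 2; a costly move always does.
def toy_coalesce(n):
    return [n + 1] if n % 3 != 2 else []


def toy_reticulate(n):
    return [n + 1]


def toy_is_final(n):
    return n == 7


print(min_reticulations(0, toy_coalesce, toy_reticulate, toy_is_final, max_level=5))
```

Because every state at level k is expanded before any state at level k + 1, the first level containing a final state gives the minimum number of costly moves.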
Issues in Searching for Optimal ACs
- The configuration search algorithm gives an optimal network.
- Efficiency: the space of ACs is huge. For an AC with n lineages, e.g. n = 30, there are up to 465 new ACs with one reticulation or coalescence.
- Infeasible without pruning, even for data of moderate size.
- Prune infeasible ACs: sometimes a coalescence leads to an AC that is incompatible with the input trees.
- Pruning is the key to making the AC search feasible for relatively large data.
- Works when Rmin is relatively small.
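One way to arrive at the figure of 465, assuming that any pair of lineages may coalesce and any single lineage may reticulate, is C(30, 2) + 30 = 435 + 30. The function name below is hypothetical and the decomposition is an assumption about how the slide's number was obtained.

```python
from math import comb


def max_successor_acs(n):
    """Upper bound on the ACs reachable in one step from an AC with n lineages.

    Assumes any of the C(n, 2) lineage pairs may coalesce and any of the
    n lineages may reticulate.
    """
    return comb(n, 2) + n


print(max_successor_acs(30))
```

Under this assumption the branching factor is n(n + 1)/2, so it grows quadratically with the number of lineages.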
Techniques for Pruning ACs
- Incompatible AC: some input subtrees cannot be displayed.
- A leaf under a lineage is covered.
- Incompatible: some leaf is not covered by any lineage.
- There are stronger rules (see paper).
[Figure: input trees T1 and T2 with lineages a, b and c, showing ACs marked compatible or incompatible after the steps "Coalesce 1 and 2", "Coalesce 3 and 4" and "Reticulate 3".]
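The basic coverage rule can be sketched as follows, assuming each lineage is represented simply by the set of leaf labels under it. This representation and the function names are hypothetical simplifications; the stronger rules mentioned on the slide are not modeled.

```python
def uncovered_leaves(ac, taxa):
    """Leaves that are not under any lineage of the AC."""
    covered = set().union(*ac)
    return set(taxa) - covered


def is_compatible(ac, taxa):
    """Basic rule only: every leaf must be covered by some lineage."""
    return not uncovered_leaves(ac, taxa)


taxa = {1, 2, 3, 4}
ac_kept = [{1, 2}, {3}, {4}]   # every leaf is covered
ac_pruned = [{1, 2}, {4}]      # leaf 3 is covered by no lineage

print(is_compatible(ac_kept, taxa), is_compatible(ac_pruned, taxa))
```

An AC that fails this check can be discarded immediately, since no sequence of later events can produce a lineage displaying the complete input trees.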
Implementation and Simulation
Simulation data: from Wu (2010)
- Simulate a hybridization network N backwards in time for n species.
- Randomly select K trees embedded in N.
Evaluation criteria:
- Compare with the original lower and upper bound approach: do the bounds give an optimal network?
The algorithm is implemented in a downloadable open-source software tool:
- An exact method, PIRNc: can find the exact Rmin when Rmin is relatively small (say 5 or less).
- A heuristic method for larger data, PIRNch: searches a smaller space of ACs with a greedy approach.
Performance of Exact Method: PIRNc
- Only datasets with Rmin ≤ 4 are used; 100 datasets in total.
- Number of taxa: fixed to 10.
- K: number of gene trees, between 3 and 5.
- LB: existing lower bound method. UB: existing upper bound method.
- PIRNc better: % of datasets on which PIRNc finds the optimal Rmin but the bounds approach does not.
- PIRNc always finds the optimal solution (if run to the end).
[Figure: number of datasets on which PIRNc is better.]
Performance of Heuristic Method: PIRNch
- PIRNc becomes slow when Rmin increases.
- PIRNch: heuristic for larger data.
- Larger data with taxa number: 30, 40 or ...; ... datasets each.
- PIRNch < UB: number of datasets, among 100 datasets, on which PIRNch < upper bound.
- PIRNch outperforms the original lower bound/upper bound approach for larger datasets.
Acknowledgement
More information available at:
Research supported by the US National Science Foundation