New methods for estimating species trees from gene trees

Presentation transcript:

New methods for estimating species trees from gene trees Tandy Warnow March 12, 2012

Phylogeny (evolutionary tree)
[Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

DNA Sequence Evolution
[Figure: sequences evolving down a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA]

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree
[Figure: tree with leaves S1, S2, S4, S3, constructed from the multiple sequence alignment of Phase 1]

Progress on Gene Tree and Alignment Estimation
- Statistical performance of phylogeny estimation methods
- Co-estimation of alignments and trees (SATé)
- “Alignment-free” phylogeny estimation (DACTAL)
- Phylogenetic analysis and alignment of NGS data (SEPP)
- Taxon identification of short reads from the same gene (metagenomic analysis) (TIPP)
Tomorrow’s talk will cover SATé, SEPP, and TIPP.

Single gene vs. multi-gene analyses
Most methods analyze single genes (or other genomic regions). These produce estimated “gene trees”. But species trees are estimated using multiple genes.

Multi-gene analyses
After alignment of each gene dataset:
- Combined analysis: concatenate (“combine”) the alignments for the different genes, and run a phylogeny estimation method on the result
- Supertree: compute a tree on each alignment, and combine the gene trees

Not all genes present in all species
[Figure: example sequence datasets for several genes (the labels “gene 2” and “gene 3” survive) over species S1, S3, S4, S5, S6, S7, S8; each gene covers only a subset of the species]

Two competing approaches
[Figure: data for gene 1, gene 2, ..., gene k across the species. One route is Combined Analysis; the other is to analyze each gene separately and then apply a Supertree Method]
Speaker note: supertree methods take overlapping trees and produce a tree; the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Constructing trees from subtrees
Let T|A denote the induced subtree of T on the leaf set A.
[Figure: a tree T on leaves a, b, c, d, e, f and its induced subtree T|{a,c,d,f}]
Question: given induced subtrees of T for many subsets of taxa, can you produce the tree T?
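A minimal sketch of the induced-subtree operation T|A, under assumptions that are not in the slides: trees are rooted and written as nested Python tuples, leaves are strings, and the function name `restrict` is hypothetical.

```python
def restrict(tree, keep):
    """Return the subtree of `tree` induced by the leaf set `keep` (None if empty).

    Assumed representation: nested tuples with string leaves.
    Nodes left with a single child are suppressed.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]          # suppress a node of degree two
    return tuple(kids)

# Hypothetical example tree on the leaf set used in the slide.
T = ((("a", "b"), "c"), ("f", ("d", "e")))
print(restrict(T, {"a", "c", "d", "f"}))
```

The supertree question runs this operation in reverse: the inputs are several such restrictions, and the goal is a tree on the full leaf set that agrees with all of them.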

Supertree estimation
Challenges:
- Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard)
- Estimated subtrees have error
Advantages:
- Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes)
- Can use different types of data for each gene

Many Supertree Methods
- MRP: Matrix Representation with Parsimony (most commonly used and most accurate)
- Others: weighted MRP, MRF, MRD, Robinson-Foulds Supertrees, Min-Cut, Modified Min-Cut, Semi-strict Supertree, QMC, Q-imputation, SDM, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more ...

Quantifying topological error
[Figure: True Tree and Estimated Tree on leaves a, b, c, d, e, f]
- False negative (FN): b ∈ B(T_true) - B(T_est)
- False positive (FP): b ∈ B(T_est) - B(T_true)
Speaker note: explain why FN is more important than FP.
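The two error types can be sketched as set differences between bipartition sets. The helper names `split` and `error_rates`, the one-letter leaf names, and the example trees are assumptions made for illustration; the rates are normalized by the number of bipartitions in the true and estimated tree respectively.

```python
def split(text):
    """Parse 'ab|cdef' into an unordered pair of leaf sets (one-letter leaf names)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])

def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two sets of non-trivial bipartitions."""
    fn = true_splits - est_splits   # true edges missing from the estimate
    fp = est_splits - true_splits   # estimated edges absent from the true tree
    fn_rate = len(fn) / len(true_splits) if true_splits else 0.0
    fp_rate = len(fp) / len(est_splits) if est_splits else 0.0
    return fn_rate, fp_rate

# Hypothetical example: a fully resolved true tree and a partially resolved estimate.
true_tree = {split("ab|cdef"), split("abc|def"), split("abcf|de")}
est_tree = {split("ab|cdef"), split("abd|cef")}
print(error_rates(true_tree, est_tree))
```

Because the estimate here is not fully resolved, the two rates differ; for two binary trees on the same leaves the FN and FP counts are always equal.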

FN rate of MRP vs. combined analysis
[Figure; x-axis: Scaffold Density (%)]

SuperFine-boosting: improves accuracy of MRP
[Figure; x-axis: Scaffold Density (%)] (Swenson et al., Syst. Biol. 2012)

SuperFine
- First, construct a supertree with low false positives (the Strict Consensus Merger).
- Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP or Quartet Max Cut).

Obtaining a supertree with low FP: the Strict Consensus Merger (SCM)
SCM of two trees:
- Computes the strict consensus on the common leaf set
- Then superimposes the two trees, contracting more edges in the presence of “collisions”

Strict Consensus Merger (SCM)
[Figure: source trees on subsets of the leaves a through j, merged by SCM]
Speaker note: SCM is a supertree method in and of itself. It merges pairs of trees until a single tree is left.

Performance of SCM
- Low false positive (FP) rate (the estimated supertree has few false edges)
- High false negative (FN) rate (the estimated supertree is missing many true edges)

Theoretical results for SCM
- SCM can be computed in polynomial time
- For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem
- All splits in the SCM “appear” in at least one source tree (and project onto each source tree)

Resolving a single polytomy, v, using MRP
Step 1: Reduce each source tree to a tree on leaf set {1,2,...,d}, where d = degree(v)
Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}
Step 3: Replace the star tree at v by tree t

Part 1 of SuperFine
[Figure: source trees on the leaves a through j and their SCM tree]

Part 2 of SuperFine
[Figure: a polytomy in the SCM tree on leaves a through j, with the source-tree leaves relabelled 1 through 6]
Speaker notes: rooting matters here; mention the theorem.

Theorem Given a set of source trees, SCM tree T, and a polytomy in T, after relabelling and reducing, each source tree has at most one leaf with each label.

Step 2: Apply MRP to the collection of reduced source trees
[Figure: reduced source trees on the labels 1 through 6, combined by MRP into a single tree]

Replace polytomy using tree from MRP
[Figure: the tree on labels 1 through 6 substituted for the polytomy, giving a refined tree on leaves a through j]

SuperFine-boosting: improves accuracy of MRP
[Figure; x-axis: Scaffold Density (%)] (Swenson et al., Syst. Biol. 2012)

SuperFine is also much faster
MRP: 8-12 sec. SuperFine: 2-3 sec.
[Figure: three panels; x-axis: Scaffold Density (%)]

Limitations of Supertree Methods
Traditional supertree methods assume that the true gene trees match the true species tree. This is known to be unrealistic in some situations, due to processes such as:
- Deep coalescence (“incomplete lineage sorting”)
- Gene duplication and loss
- Horizontal gene transfer

Multiple populations/species
[Figure: time axis from Past to Present. Courtesy James Degnan]

Gene tree in a species tree Courtesy James Degnan

Deep Coalescence
- Population-level process, also called “Incomplete Lineage Sorting”
- Gene trees can differ from species trees due to short times between speciation events (population size also impacts this probability)
- Causes difficulty in estimating some species trees (such as human-chimp-gorilla)

Phylogeny (evolutionary tree)
[Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

MDC Problem
MDC (minimize deep coalescence) problem: given a set of true gene trees, find the species tree that implies the fewest deep coalescence events.
Posed by Wayne Maddison, Syst Biol 1997.

Counting deep coalescences

Extra Lineages XL(T,t)
- T is the species tree
- t is the gene tree
- XL(T,t): the number of extra lineages, under the best embedding of t into T

Two MDC problems
Score pair of trees:
- Input: rooted binary gene tree t and species tree T
- Output: XL(T,t)
Find best species tree:
- Input: set X of rooted, binary gene trees on set S
- Output: species tree T on S that minimizes XL(T,X) = ∑_t XL(T,t), where t ranges over X

Limitations of methods for MDC
Current methods typically assume input gene trees are correct, binary, rooted trees containing all the taxa.
But:
- Estimated gene trees are usually partially incorrect, are often unrooted, and may not be complete.
- Assuming all gene tree incompatibility is due to deep coalescence is likely problematic.

Minimizing Deep Coalescence (MDC)
- Than and Nakhleh (PLoS Comp Biol 2009): algorithms for MDC which assume all gene trees are correct, rooted, binary trees.
- Yu, Warnow, and Nakhleh (RECOMB 2011 and J Comp Biol 2011) extend T&N 2009 to handle estimated gene trees that are unrooted and have errors.
- Bayzid and Warnow (J Comp Biol, in press) extend T&N 2009 to handle incomplete gene trees.

Search: main results in T&N 2009
Theorem: Let X be a set of k rooted binary gene trees on taxon set S, with |S| = n, and let C be a set of subsets of the taxon set. Then a species tree T that optimizes MDC with Clusters(T) ⊆ C can be found in time that is polynomial in |C|, n, and k.
- Exact MDC: let C be all possible subsets of S
- “Heuristic” MDC: let C be the set of “clusters” of the input gene trees (where a cluster is the set of leaves below a node in a tree)

T&N 2009: B-maximal clusters and k_B(t)
T is a species tree, and t is a gene tree, both rooted and binary.
Definitions:
- B is a cluster of T
- Y is a B-maximal cluster in t if (i) Y is a cluster of t, (ii) Y ⊆ B, and (iii) Y ⊄ Z for any other cluster Z of t such that Z ⊆ B
- k_B(t) is the number of B-maximal clusters in t

Calculating XL(T,t)
Lemma (T&N 2009): Let T be a binary species tree and t be a binary rooted gene tree. Then for an optimal embedding of t into T:
- k_B(t) is the number of lineages on the edge “above” the subtree for B in T
- XL(T,t) = ∑_B [k_B(t) - 1], where B ranges over the clusters of T
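The lemma translates directly into a short scoring routine. This is a sketch, not the T&N 2009 implementation: it assumes rooted trees written as nested tuples with string leaves, that the gene tree and species tree have the same leaf set, and the function names are hypothetical.

```python
def leaves(t):
    """Leaf set of a nested-tuple tree (leaves are strings)."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def clusters(t):
    """All clusters of t: the leaf set below every node."""
    if isinstance(t, str):
        return {leaves(t)}
    return {leaves(t)}.union(*(clusters(c) for c in t))

def k(B, t):
    """Number of B-maximal clusters of gene tree t."""
    if leaves(t) <= B:
        return 1            # this cluster fits inside B and its parent did not
    if isinstance(t, str):
        return 0            # a leaf outside B
    return sum(k(B, c) for c in t)

def extra_lineages(T, t):
    """XL(T,t) as the sum over clusters B of T of k_B(t) - 1."""
    return sum(k(B, t) - 1 for B in clusters(T))

# Hypothetical example: the gene tree joins c with d before a and b.
T = ((("a", "b"), "c"), "d")
t = ((("c", "d"), "a"), "b")
print(extra_lineages(T, t))
```

Singleton clusters and the full leaf set always contribute zero, so only the non-trivial clusters of T affect the score.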

Calculating XL(T,X)
Define Cost_B(t) = k_B(t) - 1, and therefore XL(T,t) = ∑_B Cost_B(t).
Given a set X of gene trees, define
XL(T,X) = ∑_t XL(T,t) = ∑_t ∑_B Cost_B(t) = ∑_B ∑_t Cost_B(t) = ∑_B w(B),
where w(B) = ∑_t Cost_B(t).

Graph Algorithm for MDC
Graph G(X):
- Vertex set: v corresponds to a non-trivial S(v) ⊂ S, where S(v) is the cluster of T below node v
- Edges: (v,w) present iff clusters S(v) and S(w) can co-exist as clusters in a tree
- Vertex weight: Weight(v) = ∑_t Cost_S(v)(t)
Theorem: ∃ T, a binary rooted tree on S such that XL(T,X) = W, iff ∃ an (n-2)-clique in G(X) of weight W, where |S| = n.
Hence, MDC can be solved by finding an (n-2)-clique of minimum total weight in G(X).

T&N algorithm for MDC
Because of the structure of the graph, we can find a min cost max clique (of size n-2) in polynomial time (in the size of the graph), using dynamic programming. But the graph has 2^n vertices!
However, if we constrain the set C of permitted clusters for the species tree, we can find an optimal constrained solution in O(|C|^2 nk) time (the “heuristic” algorithm in T&N 2009).
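A sketch of a cluster-constrained dynamic program in the spirit of the slide, not the T&N 2009 code and not tuned to the stated running time. Assumptions: rooted binary gene trees as nested tuples on a common leaf set, C defaulting to the clusters of the input gene trees, and hypothetical function names. For each permitted cluster A, the best subtree is the cheapest way to split A into two permitted clusters, plus the weight w(A).

```python
from functools import lru_cache

def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def clusters(t):
    if isinstance(t, str):
        return {leaves(t)}
    return {leaves(t)}.union(*(clusters(c) for c in t))

def k(B, t):
    """Number of B-maximal clusters of gene tree t."""
    if leaves(t) <= B:
        return 1
    if isinstance(t, str):
        return 0
    return sum(k(B, c) for c in t)

def mdc_constrained(gene_trees, C=None):
    """Return (cost, tree) minimizing XL(T,X) over trees whose clusters lie in C."""
    S = leaves(gene_trees[0])
    if C is None:  # "heuristic" choice: clusters seen in the gene trees
        C = set().union(*(clusters(t) for t in gene_trees))
    C = set(C) | {S} | {frozenset([x]) for x in S}
    # w(B) = sum over gene trees of Cost_B(t) = k_B(t) - 1
    w = {B: sum(k(B, t) - 1 for t in gene_trees) for B in C}

    @lru_cache(maxsize=None)
    def best(A):
        if len(A) == 1:
            return 0, next(iter(A))
        found = None
        for A1 in C:
            A2 = A - A1
            if A1 < A and A2 in C:          # A splits into two permitted clusters
                left, right = best(A1), best(A2)
                if left is None or right is None:
                    continue
                cost = w[A] + left[0] + right[0]
                if found is None or cost < found[0]:
                    found = (cost, (left[1], right[1]))
        return found                          # None if A cannot be resolved within C

    return best(S)

g1 = ((("a", "b"), "c"), "d")
g3 = ((("a", "c"), "b"), "d")
cost, tree = mdc_constrained([g1, g1, g3])
print(cost, tree)
```

Restricting C to the gene-tree clusters keeps the table small, at the price of missing species trees that need a cluster no gene tree contains.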

Yu, Warnow and Nakhleh (2011) Allows for error in estimated gene trees. RECOMB 2011 and J Comp Biol 2011

Yu, Warnow and Nakhleh (2011)
Modify gene trees to reduce false positive error:
- Unroot trees
- Use bootstrap (or other statistical techniques) to identify the edges that are potentially incorrect
- Contract the low support edges
Result: estimated gene trees that are likely to be unrooted contractions of the true gene tree.
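The contraction step can be sketched as follows. The representation is an assumption made for illustration: an internal node is a pair (support, children), where support belongs to the edge above the node, and a leaf is a string. The tree is kept rooted for simplicity, so the unrooting step is not shown and the support value at the root is ignored. The threshold of 0.75 is an arbitrary example value.

```python
def contract(node, threshold):
    """Contract every internal edge whose support is below `threshold`."""
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        c = contract(child, threshold)
        if not isinstance(c, str) and c[0] < threshold:
            new_children.extend(c[1])    # splice the grandchildren into this node
        else:
            new_children.append(c)
    return (support, tuple(new_children))

# Hypothetical gene tree with bootstrap proportions on its internal edges.
gene_tree = (1.0, ((0.95, ("a", "b")),
                   (0.40, ("c", (0.80, ("d", "e"))))))
print(contract(gene_tree, 0.75))
```

Each contraction removes one bipartition from the tree, so it can only lower the false positive count, at the cost of leaving a polytomy to be resolved later.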

New MDC problem
Input: set X = {t_1, t_2, …, t_k} of incompletely resolved, unrooted gene trees.
Output: set X’ = {t’_1, t’_2, …, t’_k} (such that each t’_i is a resolved, rooted version of t_i, i = 1,2,…,k) and species tree T that minimizes XL(T,X’).
In other words, we treat t_i as a constraint on the true gene tree for gene i.

Search: main theoretical result in T&N 2009
Theorem: Let X be a set of k rooted binary gene trees on taxon set S, and let C be a set of clusters on the taxon set. Then a species tree T that optimizes MDC with Clusters(T) ⊆ C can be found in O(|C|^2 nk) time, where |S| = n.

Search: main theoretical result in YWN 2011
Theorem: Let X be a set of k unrooted and not necessarily binary gene trees on taxon set S, and let C be a set of clusters on the taxon set. Then a species tree T that optimizes MDC with Clusters(T) ⊆ C can be found in O(|C|^2 nk) time, where |S| = n.

Scoring: main theoretical result
Theorem: Let t be an unrooted and not necessarily binary gene tree, and let T be a rooted binary species tree, both on S. Then a rooted refinement t* of t that minimizes XL(T,t*) can be found in O(n^2) time, where |S| = n.
Note: brute force is exponential, even if t is rooted and the maximum degree in t is low.

Simplest case: t is rooted
Input: rooted tree t, not necessarily binary, and binary rooted species tree T
Output: refinement t* of t, minimizing XL(T,t*)
Recall that XL(T,t*) = ∑_B [k_B(t*) - 1]

Refining rooted tree t
Def.: F_B(t) denotes the number of nodes in t that have at least one B-maximal child.
Lemma: If t’ is a binary refinement of t, then F_B(t) ≤ k_B(t’).
Theorem: For all rooted trees t, there exists t*, a binary refinement of t, such that for all clusters B of T, k_B(t*) = F_B(t).
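A small sketch of how F_B(t) can be counted on a rooted, possibly non-binary gene tree. Assumptions: nested tuples with string leaves, hypothetical function names, and one convention that the slide does not state: if the whole gene tree lies inside B, it is counted once.

```python
def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def F(B, t):
    """Number of nodes of t that have at least one B-maximal child."""
    if leaves(t) <= B:
        return 1        # assumed convention: the whole tree inside B counts once

    def count(node):
        # `node` is never contained in B here, so any child inside B is B-maximal.
        if isinstance(node, str):
            return 0
        has_maximal_child = any(leaves(c) <= B for c in node)
        below = sum(count(c) for c in node if not leaves(c) <= B)
        return int(has_maximal_child) + below

    return count(t)

# Hypothetical gene tree with a polytomy at the root.
t = (("a", "c"), "b", "d")
print(F(frozenset("ab"), t), F(frozenset("abc"), t))
```

For the species tree (((a,b),c),d), summing F_B(t) - 1 over its clusters gives the score of the best binary refinement of t, here the one that joins (a,c) with b before d.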

Computing t*
Algorithm: Refine around each high degree node v in t using the subtree of T defined by the LCAs in T of the children of v.
- The order in which you visit the high degree nodes does not impact the output
- Can be computed in O(n^2) time

Proof of optimality
Recall: F_B(t) denotes the number of nodes in t that have at least one B-maximal child.
Theorem: The tree t* produced by the algorithm satisfies k_B(t*) = F_B(t) for every cluster B of T. Hence, t* is optimal.
Proof: Algorithm is locally optimal.

Finding the best species tree, given rooted non-binary trees
- The same basic graph-theoretic approach and DP algorithms work
- Same graph G(X), but redefine Cost_B(t) = F_B(t) - 1 and keep weight(v) = ∑_t Cost_S(v)(t)

General case: t unrooted, non-binary
Input: unrooted, non-binary gene tree t and rooted binary species tree T
Output: rooted, binary tree t* refining t such that XL(T,t*) is minimized
Clearly this is solvable in O(n^3) time. Better O(n^2) algorithm: find root, then refine optimally.

Summary of YWN 2011
- Extends all results from Than and Nakhleh 2009 to partially resolved, unrooted gene trees
- Suggests contraction of low support edges and suppression of root before species tree estimation
- Gives polynomial time DP algorithm for constrained search for species tree (using only clusters from input gene trees)

Related results
- Yang and Warnow (RECOMB-CG 2011 and BMC Bioinformatics 2011) show that the constrained version of the polynomial time DP algorithm in YWN 2011 produces trees of comparable accuracy to BUCKy, a statistically-based method for species tree estimation under ILS.
- Bayzid and Warnow (in press, J Comp Biol) extend T&N 2009 to incomplete gene trees.

Discussion
- SuperFine is a fast method to “boost” the accuracy of supertree methods, and produces highly accurate species trees quickly when no ILS occurs. (Data not shown: SuperFine also gives good results in the presence of ILS!)
- In the presence of ILS, statistically-based methods give the best results, but can only be run on small datasets.
- Acknowledging error in gene trees improves species tree estimation.

Acknowledgments Funding: Microsoft Research New England, National Science Foundation, and the Guggenheim Foundation Collaborators: Luay Nakhleh and Yun Yu (MDC), Shel Swenson, Randy Linder, and Rahul Suri (SuperFine)

Part I: SuperFine Nelesen, Suri, Linder, and Warnow Accepted for publication, subject to revision, Systematic Biology Note: SuperFine is the supertree method used in the DACTAL software (Nelesen et al., submitted)

Step 1: Encode each source tree as a collection of reduced source trees on {1,2,...,d}
[Figure: a source tree on leaves b through j with its leaves relabelled 1 through 6]

Recall Lemma
[Figure: source trees and the SCM tree on leaves a through j]

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Neighbor Joining’s sequence length requirement is exponential!
Atteson: Let T be a General Markov model tree on n leaves. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length at least O(ln n · e^{O(n)}).

Chordal graph algorithms yield phylogeny estimation from polynomial length sequences
Theorem (Warnow et al., SODA 2001): DCM1-NJ is correct with high probability given sequences of length O(ln n · e^{O(ln n)})
[Figure: simulation study from Nakhleh et al. ISMB 2001, comparing NJ and DCM1-NJ; y-axis: Error Rate (0.2 to 0.8); x-axis: No. Taxa (400 to 1600)]

SATé-1 and SATé-2 (“Next” SATé), on 1000 leaf models

DACTAL more accurate than all standard methods, and much faster than SATé
Average results on 3 large RNA datasets (6K to 28K)
- CRW: Comparative RNA database, structural alignments
- 3 datasets with 6,323 to 27,643 sequences
- Reference trees: 75% RAxML bootstrap trees
- DACTAL (shown in red) run for 5 iterations starting from FT(Part)
- SATé-1 fails on the largest dataset
- SATé-2 runs but is not more accurate than DACTAL, and takes longer

Markov Model of Site Evolution
Simplest (Jukes-Cantor):
- The model tree T is binary and has substitution probabilities p(e) on each edge e.
- The state at the root is randomly drawn from {A,C,T,G} (nucleotides).
- If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.
- The evolutionary process is Markovian.
More complex models (such as the General Markov model) are also considered, often with little change to the theory.
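The Jukes-Cantor description above can be sketched as a simulation of one site down a model tree. The representation is an assumption made for illustration: an internal node is a list of (p, subtree) pairs, where p is the substitution probability p(e) on the edge leading to that subtree, and a leaf is a string. The function name and the example probabilities are hypothetical.

```python
import random

def simulate_site(tree, rng, state=None):
    """Simulate one site; return a dict mapping each leaf name to its nucleotide."""
    if state is None:
        state = rng.choice("ACGT")        # root state drawn uniformly at random
    if isinstance(tree, str):
        return {tree: state}
    out = {}
    for p, child in tree:
        child_state = state
        if rng.random() < p:              # the site changes on this edge
            others = [s for s in "ACGT" if s != state]
            child_state = rng.choice(others)   # equal probability for each other state
        out.update(simulate_site(child, rng, child_state))
    return out

# Hypothetical model tree ((S1,S2),(S3,S4)) with made-up edge probabilities.
model = [(0.1, [(0.05, "S1"), (0.05, "S2")]),
         (0.1, [(0.20, "S3"), (0.20, "S4")])]
rng = random.Random(2012)
sites = [simulate_site(model, rng) for _ in range(10)]
print({name: "".join(site[name] for site in sites) for name in sites[0]})
```

Sites are generated independently, and each edge only looks at the state of its parent node, which is the Markov property named in the slide.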


Bipartitions and refinement
B(T) denotes the set of non-trivial bipartitions (splits) of T.
T refines T’ (T’ ≤ T) if B(T’) ⊆ B(T).
[Figure: trees T and T’ on leaves a, b, c, d, e, f]
B(T) = {ab|cdef, abc|def, abcf|de}
B(T’) = {ab|cdef, abc|def}
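A sketch of B(T) and the refinement test, assuming trees are written as nested tuples with string leaves (an arbitrary rooting of the unrooted tree) and using hypothetical function names. The example trees are chosen to reproduce the two bipartition sets listed above.

```python
def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def bipartitions(tree):
    """Set of non-trivial splits A | S-A induced by the internal edges of `tree`."""
    S = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        A = leaves(node)
        if 2 <= len(A) <= len(S) - 2:          # skip trivial splits and the root
            splits.add(frozenset([A, S - A]))
        for c in node:
            visit(c)

    visit(tree)
    return splits

def refines(T, T_prime):
    """True if T refines T_prime, i.e. B(T_prime) is a subset of B(T)."""
    return bipartitions(T_prime) <= bipartitions(T)

T = ((("a", "b"), "c"), "f", ("d", "e"))        # B(T) = {ab|cdef, abc|def, abcf|de}
T_prime = ((("a", "b"), "c"), "f", "d", "e")    # B(T') = {ab|cdef, abc|def}
print(refines(T, T_prime), refines(T_prime, T))
```

Storing each split as an unordered pair makes the result independent of where the nested-tuple representation happens to be rooted.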

Displays and compatibility
- T displays T’ if T’ ≤ T|L(T’)
- T displays a set of trees if it displays every tree in that set
- A set S of trees is compatible if there exists a tree T such that T displays S
- In general, determining whether a set of trees is compatible is NP-hard

Matrix representation with parsimony (MRP)
- First, encode each edge of each source tree as a partial binary character
- Then, analyze this matrix of partial binary characters (the matrix representation) using maximum parsimony (MP)
- If used with exact solutions to MP, MRP is an exact algorithm for Tree Compatibility
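The encoding step can be sketched as below. Assumptions: source trees are nested tuples with string leaves and are treated as rooted, so each internal node (other than the root) gives one character; taxa inside the node's cluster get 1, other taxa of that source tree get 0, and taxa absent from the source tree get ?. The function names are hypothetical, and the parsimony search itself is not shown.

```python
def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def internal_clusters(tree):
    """Clusters of the internal, non-root nodes, in preorder."""
    n = len(leaves(tree))
    found = []

    def visit(node):
        if isinstance(node, str):
            return
        A = leaves(node)
        if 2 <= len(A) < n:
            found.append(A)
        for c in node:
            visit(c)

    visit(tree)
    return found

def mrp_matrix(source_trees):
    """Return {taxon: row of '0'/'1'/'?'} with one column per internal edge."""
    taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    columns = []
    for t in source_trees:
        present = leaves(t)
        for A in internal_clusters(t):
            columns.append({x: ("1" if x in A else "0") if x in present else "?"
                            for x in taxa})
    return {x: "".join(col[x] for col in columns) for x in taxa}

# Two hypothetical source trees with overlapping leaf sets.
matrix = mrp_matrix([(("a", "b"), "c"), (("b", "c"), "d")])
for taxon, row in matrix.items():
    print(taxon, row)
```

The ? entries are what make the characters partial: a source tree says nothing about taxa it does not contain.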

Maximum Parsimony (Hamming distance Steiner Tree)
Input: set S of n aligned sequences of length k
Output: a phylogenetic tree T leaf-labeled by the sequences in S, together with additional sequences of length k labeling the internal nodes of T, such that ∑_{(u,v) ∈ E(T)} H(u,v) is minimized, where H(u,v) is the Hamming distance between the sequences labeling u and v.
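The full problem searches over trees, which is the hard part. For one fixed tree, the minimum total Hamming distance over all internal labelings can be computed site by site with Fitch's algorithm. The sketch below uses that technique, which the slide does not name; it assumes rooted binary trees as nested pairs with string leaves, and the function names and example alignment are hypothetical.

```python
def fitch_site(tree, states):
    """Return (candidate state set, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                       # children can agree: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Minimum sum of Hamming distances over the edges of a fixed tree."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total

alignment = {"S1": "AC", "S2": "AC", "S3": "GC", "S4": "GT"}
print(parsimony_score((("S1", "S2"), ("S3", "S4")), alignment))
print(parsimony_score((("S1", "S3"), ("S2", "S4")), alignment))
```

A maximum parsimony search would call a scoring routine like this on many candidate trees and keep the one with the lowest score.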

Lemma: SCM splits project onto source trees
[Figure: source trees and the SCM tree on leaves a through j]

Finding optimal root
Color all edges of the gene tree in a B-maximal subtree, for some cluster B in T.
Theorem: the optimal rooted refinement of t can be obtained by rooting t at any node that is incident to at least one uncolored edge (and there will be at least one). Furthermore, such a node can be found in O(n^2) time.

Graph algorithm
- For each non-trivial subset B of S, find the best rooted version t’ of each gene tree t, and define Cost_B(t) = F_B(t’) - 1.
- Find an (n-2)-clique of minimum total weight in the new G(X), with weight(v) = ∑_t Cost_S(v)(t).

Main results in Than and Nakhleh, 2009
- Gives polynomial time algorithm to compute XL(T,X), where T is a binary rooted species tree and X is a set of binary rooted gene trees
- Gives exact DP algorithm for finding optimal MDC species tree for input set of binary rooted gene trees
- Gives exact DP polynomial time solution for constructing optimal MDC species tree when all its bipartitions are constrained to come from a user-specified set
- All results require input gene trees be binary, rooted trees
- Analysis assumes input trees are 100% correct

SuperFine: new supertree method
Step 1: Construct a supertree with low false positives (unresolved)
Step 2: Refine the tree to reduce false negatives by resolving each high degree node (“polytomy”) using a “base” supertree method (e.g., MRP or Quartet Max Cut) applied to recoded source trees

Main results of Than and Nakhleh, 2009
- Gives polynomial time algorithm to compute XL(T,X), where T is a binary rooted species tree and X is a set of binary rooted gene trees
- Gives exact (DP) algorithm for finding optimal MDC species tree for input set of binary rooted gene trees, by finding an (n-2)-clique of minimum weight in an exponentially large graph
- Gives exact (DP) polynomial time algorithm for constrained version of MDC problem, in which the species tree bipartitions must come from a user-provided input set
- All results require input gene trees be binary, rooted trees
- Analysis assumes input trees are 100% correct

Scoring a pair of trees
Recall: F_B(t) denotes the number of nodes in t that have at least one B-maximal child.
Corollary: Let t be a rooted gene tree, T a rooted, binary species tree, and t* an optimal refinement of t. Then XL(T,t*) = ∑_B [F_B(t) - 1], as B ranges over the clusters of T.