CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.

Presentation transcript:

CS 598 AGB Supertrees Tandy Warnow

Today’s Material
Supertree construction: given a set of trees on subsets of S (the full set of taxa), construct a tree on the full set S of taxa.
Textbook material: Chapter 5 (Aho, Sagiv, Szymanski, and Ullman) and Chapter

Computing a tree from a set of rooted triplet trees
Constructing a rooted tree from a set of compatible rooted triplet trees. Equivalently, testing compatibility of a set of rooted triplet trees. Recursive algorithm by Aho, Sagiv, Szymanski, and Ullman (Chapter 5.1).

ASSU algorithm
Given a set X of k rooted triplet trees on a set of n species:
– If n = 1, return the tree consisting of that single leaf.
– If n > 1, construct a graph with one vertex per species and an edge (a,b) for each triplet ab|c in X.
– If the graph has a single component, reject (the set is not compatible).
– Otherwise, recurse on each component, using only the triplets whose three leaves all lie in that component, and return the tree formed by making the rooted trees on the components each a subtree off the root.
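
The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `assu`, the encoding of a triplet ab|c as the tuple `(a, b, c)`, and the nested-tuple output format are all assumptions made for the example.

```python
def assu(species, triplets):
    """Return a rooted tree (nested tuples) displaying every triplet, or None.

    species  : iterable of leaf labels (assumed sortable, e.g. strings)
    triplets : iterable of (a, b, c), an assumed encoding of the triplet ab|c
    """
    species = set(species)
    if len(species) == 1:
        return next(iter(species))  # base case: a single leaf

    # Only triplets whose three leaves are all in the current set matter.
    relevant = [t for t in triplets if set(t) <= species]

    # Union-find over the species; each triplet ab|c joins a and b.
    parent = {s: s for s in species}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _c in relevant:
        parent[find(a)] = find(b)

    components = {}
    for s in species:
        components.setdefault(find(s), set()).add(s)

    if len(components) == 1:
        return None  # connected graph: the triplets are incompatible

    subtrees = []
    for comp in sorted(components.values(), key=min):  # sorted for determinism
        sub = assu(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


tree = assu("abcd", [("a", "b", "c"), ("c", "d", "a")])
conflict = assu("abc", [("a", "b", "c"), ("a", "c", "b")])
```

Here `tree` is the nested tuple for ((a,b),(c,d)), and `conflict` is `None` because ab|c and ac|b cannot both be displayed by one tree.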

Why does it work?
If the set X of triplet trees is compatible, then there is a rooted tree T that displays every triplet in X, with at least two subtrees off the root, T1 and T2. Two leaves a and b in different subtrees off the root cannot appear together in a triplet ab|c in X, because their least common ancestor is the root of T. Hence the graph formed for the set of triplet trees has no edge between leaves in different subtrees off the root, so it cannot be connected and must have at least two components. This argument applies recursively to the species set of each component. Hence the algorithm returns a tree that displays all the triplet trees. If the set X of triplet trees is not compatible, it is not hard to show that the algorithm will detect this (proof by induction on the number of taxa).

Compatibility of rooted trees
Suppose the input is a set X of rooted trees (not necessarily triplet trees). Can we use ASSU to determine if X is compatible, and to compute a compatibility supertree for X?
Solution: Yes. Encode each rooted tree in X by its set of rooted triplet trees (or some subset of these that suffices to define each tree in X), and then run ASSU.
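
One way to picture the encoding step is the sketch below, which lists every rooted triplet displayed by a rooted tree. The nested-tuple tree format, the tuple `(a, b, c)` for the triplet ab|c, and the function names are assumptions for illustration; leaf labels are assumed to be sortable non-tuple values such as strings.

```python
from itertools import combinations


def leaves(tree):
    """Leaf labels of a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def rooted_triplets(tree):
    """All triplets ab|c displayed by the tree, as tuples (a, b, c) with a < b."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return
        child_sets = [leaves(child) for child in node]
        outside = all_leaves - set().union(*child_sets)
        # a and b under different children of this node have this node as
        # their least common ancestor; any leaf c outside the node then
        # gives the triplet ab|c.
        for left, right in combinations(child_sets, 2):
            for a in left:
                for b in right:
                    for c in outside:
                        x, y = sorted((a, b))
                        found.add((x, y, c))
        for child in node:
            visit(child)

    visit(tree)
    return found


example = rooted_triplets((("a", "b"), ("c", "d")))
```

For the tree ((a,b),(c,d)), `example` contains ab|c, ab|d, cd|a, and cd|b. Taking the union of these sets over all trees in X gives a triplet set that can be passed to a triplet-based compatibility test.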

Summary so far
Testing compatibility of an arbitrary set of rooted trees (and constructing a compatibility supertree): polynomial time, using ASSU.
Testing compatibility of an arbitrary set of unrooted trees (and constructing a compatibility supertree): NP-complete!
Special cases for testing compatibility of unrooted trees:
– Input has a tree on every four taxa. (Solution: Use the All Quartets Method to test for compatibility.)
– Input trees all have a common species, A. (Solution: Root all the input trees using leaf A, and then run ASSU.)
– Input has all the “short quartets” of a tree. (Solution: Use Dyadic Closure to test for compatibility; see Chapter 13.)

Supertree Methods
Most of the time, the input is a set of unrooted source trees that is incompatible. All the methods described so far only return compatibility supertrees. How can we construct supertrees from incompatible source trees?

Supertree estimation
Challenges:
– Tree compatibility is NP-complete (therefore, even if the subtrees are correct, supertree estimation is hard).
– Estimated subtrees have error.
Advantages:
– Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes).
– Can use different types of data for each source tree.

Many Supertree Methods
MRP, weighted MRP, MRF, MRD, Robinson-Foulds Supertrees, Min-Cut, Modified Min-Cut, Semi-strict Supertree, QMC, Q-imputation, SDM, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more...
MRP: Matrix Representation with Parsimony (most commonly used and most accurate).

Supertree Optimization Problems
– MRP (Matrix Representation with Parsimony)
– MRL (Matrix Representation with Likelihood)
– RFS (Robinson-Foulds Supertree)
– MQDS (Minimum Quartet Distance Supertree)
All of these problems are NP-hard. Some of the methods have good heuristics. It is easy to see that if the input source trees are compatible, then optimal solutions to MRP, RFS, and MQDS are compatibility supertrees.
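
For readers unfamiliar with the "matrix representation" in MRP and MRL, the sketch below builds the matrix under the usual coding: each internal edge of each source tree becomes one binary character, with 1 for taxa on one side of the edge, 0 for the other taxa in that source tree, and ? for taxa absent from it. As a simplifying assumption, the source trees here are rooted nested tuples and the taxa below each internal non-root node are coded 1; the function names are hypothetical.

```python
def leaf_set(tree):
    """Leaf labels of a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree, is_root=True):
    """Yield the leaf set below each internal non-root node."""
    if not isinstance(tree, tuple):
        return
    if not is_root:
        yield leaf_set(tree)
    for child in tree:
        yield from clusters(child, is_root=False)


def mrp_matrix(source_trees):
    """Map each taxon to its row of the matrix, a string over '0', '1', '?'."""
    all_taxa = sorted(set().union(*(leaf_set(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        present = leaf_set(tree)
        for cluster in clusters(tree):  # one character (column) per cluster
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")  # taxon missing from this tree
                elif taxon in cluster:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


matrix = mrp_matrix([(("a", "b"), "c"), (("b", "c"), "d")])
```

With these two source trees the matrix has two columns: the rows are `1?` for a, `11` for b, `01` for c, and `?0` for d. MRP then seeks a maximum parsimony tree for this matrix, and MRL a maximum likelihood tree; neither search is shown here.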

[Figure: FN rate of MRP vs. combined analysis; axis label: Scaffold Density (%).]

Comparison of Supertree methods and Concatenation From Swenson et al., Algorithms for Molecular Biology

Comparison of Supertree Methods Swenson et al., Algorithms for Molecular Biology

SuperFine
SuperFine is a technique for improving the speed and accuracy of supertree methods. The first step computes a “strict consensus merger” (SCM) of the input trees, and the second step refines the SCM using the supertree method. The SCM calculation is very fast. The refinement step is applied to each polytomy (node with degree greater than 3) independently, and is fast when the degree is small.

SuperFine-boosting: improves accuracy of MRP
[Figure: axis label: Scaffold Density (%). From Swenson et al., Syst. Biol. 2012.]

SuperFine
– First, construct a supertree with low false positives: the Strict Consensus Merger.
– Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP or Quartet Max Cut).

Theoretical results for SCM
– SCM can be computed in polynomial time.
– For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem.
– All splits in the SCM “appear” in at least one source tree (and are not contradicted by any source tree).

Comparing Supertree Methods on 1000-taxon datasets Figure 1 from Nguyen, Mirarab, and Warnow, Algorithms for Molecular Biology

Obtaining a supertree with low FP: the Strict Consensus Merger (SCM)
SCM of two trees:
– Computes the strict consensus on the common leaf set.
– Then superimposes the two trees, contracting more edges in the presence of “collisions”.
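
The first step of the merger, the strict consensus on the common leaf set, can be sketched as follows. This covers only that step (not the superimposition or collision handling), and it assumes rooted trees given as nested tuples so that the consensus can be described by shared clusters; the SCM as described here works on source trees that may be unrooted. All names are hypothetical.

```python
def leaf_set(tree):
    """Leaf labels of a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node, including the root."""
    if not isinstance(tree, tuple):
        return set()
    result = {leaf_set(tree)}
    for child in tree:
        result |= clusters(child)
    return result


def strict_consensus_on_common_leaves(t1, t2):
    """Return (common leaves, nontrivial clusters shared by both restrictions)."""
    common = leaf_set(t1) & leaf_set(t2)

    def restricted(tree):
        # Restrict every cluster to the common leaves, dropping trivial ones
        # (single leaves, the empty set, and the full common leaf set).
        return {c & common for c in clusters(tree)
                if 1 < len(c & common) < len(common)}

    return common, restricted(t1) & restricted(t2)


t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "b"), "d"), ("c", "f"))
common, shared = strict_consensus_on_common_leaves(t1, t2)
```

In this example the common leaf set is {a, b, c, d} and the only cluster on which the two restricted trees agree is {a, b}, so the strict consensus on the common leaves is the tree with a and b grouped and c and d attached at the root.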

Strict Consensus Merger (SCM)
[Figure: SCM illustrated on example source trees with leaves a through j.]

Performance of SCM
– Low false positive (FP) rate (the estimated supertree has few false edges).
– High false negative (FN) rate (the estimated supertree is missing many true edges).
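
The two error rates can be computed from the sets of splits (internal edges) of the true and estimated trees. The sketch below assumes each split is represented by a hashable canonical value such as a frozenset of the taxa on one side, and it normalizes each rate by the number of splits in the corresponding tree; other normalizations are possible, and the function name is hypothetical.

```python
def error_rates(true_splits, estimated_splits):
    """Return (FP rate, FN rate) for an estimated tree against the true tree."""
    true_splits = set(true_splits)
    estimated_splits = set(estimated_splits)
    false_positives = estimated_splits - true_splits  # edges not in the true tree
    false_negatives = true_splits - estimated_splits  # true edges that are missing
    fp_rate = len(false_positives) / len(estimated_splits) if estimated_splits else 0.0
    fn_rate = len(false_negatives) / len(true_splits) if true_splits else 0.0
    return fp_rate, fn_rate


true_tree = {frozenset("ab"), frozenset("abc")}
unresolved = {frozenset("ab")}  # a less resolved estimate with one split
fp, fn = error_rates(true_tree, unresolved)
```

Here `fp` is 0.0 and `fn` is 0.5: the unresolved estimate contains no false edges but is missing half of the true edges, which is the pattern described for the SCM.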

Part 2 of SuperFine
Refine the tree to reduce false negatives by resolving each polytomy using a base supertree method (e.g., MRP).

Part 1 of SuperFine
[Figure: computing the SCM of the example source trees with leaves a through j.]

Part 2 of SuperFine
[Figure: the example source trees and their SCM, with the polytomy to be resolved.]

Step 2: Apply MRP to the collection of reduced source trees

Replace polytomy using tree from MRP
[Figure: the polytomy in the example SCM tree is replaced by the tree returned by MRP.]

Resolving a single polytomy, v, using MRP
– Step 1: Reduce each source tree to a tree on leaf set {1,2,...,d}, where d = degree(v).
– Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}.
– Step 3: Replace the star tree at v by tree t.

SuperFine-boosting: improves accuracy of MRP
[Figure: axis label: Scaffold Density (%). From Swenson et al., Syst. Biol. 2012.]

SuperFine is also much faster
MRP: 8-12 sec. SuperFine: 2-3 sec.
[Figure: axis label: Scaffold Density (%).]

Summary (so far)
– Supertree methods are useful for constructing very large species trees from a set of source trees.
– The best-known supertree method is MRP, but there are more accurate methods (e.g., MRL, and perhaps quartet-based methods that try to solve Minimum Quartet Distance Supertree).
– SuperFine is a technique for improving the speed and accuracy of supertree methods.
– CA-ML (concatenation using maximum likelihood) is often more accurate than current supertree methods, but is more computationally intensive.

Limitations of Supertree Methods
Traditional supertree methods assume that the true gene trees match the true species tree. This is known to be unrealistic in some situations, due to processes such as:
– Deep coalescence (“incomplete lineage sorting”)
– Gene duplication and loss
– Horizontal gene transfer

Red gene tree ≠ species tree (green gene tree okay)

Coming up
– Supertree methods based on quartets are also good for species tree estimation in the presence of ILS and/or HGT!
– Supertree methods are useful for divide-and-conquer methods (e.g., DACTAL).