SuperFine, a new supertree method. Shel Swenson, September 17th 2009.

Presentation transcript:

Reconstructing the Tree of Life
Tree of Life challenges:
- millions of species
- lots of missing data
Two possible approaches:
- Combined Analysis
- Supertree Methods

Two competing approaches [Figure: species-by-gene data matrix (gene 1, gene 2, ..., gene k), analyzed as a whole in a Combined Analysis]

Combined Analysis Methods [Figure: three gene alignments, each covering a subset of the taxa S1-S8]
gene 1 (taxa S1, S2, S3, S4, S7, S8): TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC TATAACGGAA
gene 2 (taxa S4, S5, S6, S7): GGTAACCCTC GCTAAACCTC GGTGACCATC GCTAAACCTC
gene 3 (taxa S1, S3, S4, S7, S8): TATTGATACA TCTTGATACC TAGTGATGCA CATTCATACC TAGTGATGCA

Combined Analysis [Figure: the gene 1, gene 2, and gene 3 alignments concatenated into a single matrix over taxa S1-S8, with ? marking every position where a taxon has no sequence for a gene]
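
The concatenation step behind a combined analysis can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's pipeline: the function name `concatenate` and the toy sequences are hypothetical, and each gene alignment is assumed to be a dict mapping taxon name to an aligned sequence of equal length.

```python
def concatenate(genes, taxa, missing="?"):
    """Build a combined matrix: one row per taxon, one block of columns per gene."""
    rows = {}
    for taxon in taxa:
        blocks = []
        for alignment in genes:
            # All sequences in one gene alignment share the same length.
            length = len(next(iter(alignment.values())))
            # A taxon absent from this gene gets a block of missing-data symbols.
            blocks.append(alignment.get(taxon, missing * length))
        rows[taxon] = "".join(blocks)
    return rows


# Toy data (hypothetical, not the sequences shown on the slide).
gene1 = {"S1": "ACGT", "S2": "ACGA"}
gene2 = {"S2": "TT", "S3": "TC"}
combined = concatenate([gene1, gene2], ["S1", "S2", "S3"])
for taxon, row in combined.items():
    print(taxon, row)
```

The more genes a taxon lacks, the more of its row is filled with `?`, which is the missing-data problem the talk refers to.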

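Two competing approaches [Figure: species-by-gene data matrix (gene 1, gene 2, ..., gene k); either analyze all genes together (Combined Analysis) or analyze each gene separately and combine the resulting trees (Supertree Method)]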

Why use supertree methods?
- Missing data
- Large dataset sizes
- Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)
- Unavailable sequence data (only trees)

Many Supertree Methods: MRP, weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more. MRP (Matrix Representation with Parsimony) is the most commonly used and most accurate.

Today’s Outline
- Supertree and combined analysis methods
- Why we need better supertree methods
- SuperFine: a new supertree method that is fast and more accurate than other supertree methods
  – Strict Consensus Merger (SCM)
  – Resolving polytomies
  – Performance of SuperFine (compared to MRP and combined analyses)
  – Applications and future work

Previous Simulation Studies [Figure: taxa-by-gene matrix (gene 1, gene 2, ..., gene k)]
1. Generate Model Tree
2. Generate sequence data
3. Select Subsets
4. Construct Source Trees
5. Apply Supertree Method
6. Compare to Model Tree

What leads to missing data?
- Evolution (gain and loss of genes)
- Dataset selection
- Limited resources (time, money, etc.)

My Simulation Study
1. Generate model trees ( taxa)
2. Simulate gene gain and loss and generate sequences
3. Simulate techniques for gene and taxon selection
   – Clade-based datasets
   – Scaffold dataset
4. Generate source trees and a combined dataset
5. Apply supertree and combined analysis methods
6. Compare each estimated tree to the model tree, and record topological error

Experimental Parameters
- Number of taxa in model tree: 100, 500, and 1000
  – Generate 5, 15, and 25 clade-based datasets, respectively
- Scaffold density: 20%, 50%, 75%, and 100%
- Six super-methods:
  – Combined analysis using ML and MP
  – MRP on ML and MP source trees
  – Weighted MRP on ML and MP source trees
(MRP = Matrix Representation with Parsimony)
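
For readers unfamiliar with MRP, the matrix-representation step can be sketched as follows. This shows only the standard encoding (one column per internal edge of each source tree: 1 on one side, 0 on the other, ? for taxa absent from that tree); the parsimony search that is then run on the matrix is not shown. The function name `mrp_matrix`, the input format (each tree given as its leaf set plus one side of each internal edge), and the toy trees are assumptions made for illustration.

```python
def mrp_matrix(source_trees, taxa):
    """source_trees: list of (leaf_set, [one side of each internal edge])."""
    rows = {t: [] for t in taxa}
    for leaves, edges in source_trees:
        for side in edges:
            for t in taxa:
                if t not in leaves:
                    rows[t].append("?")   # taxon missing from this source tree
                elif t in side:
                    rows[t].append("1")   # taxon on the chosen side of the edge
                else:
                    rows[t].append("0")   # taxon on the other side
    return {t: "".join(chars) for t, chars in rows.items()}


# Two hypothetical four-leaf source trees: ab|cd and cd|ef.
tree1 = ({"a", "b", "c", "d"}, [{"a", "b"}])
tree2 = ({"c", "d", "e", "f"}, [{"c", "d"}])
matrix = mrp_matrix([tree1, tree2], ["a", "b", "c", "d", "e", "f"])
for taxon, row in matrix.items():
    print(taxon, row)
```

For unrooted source trees the choice of which side gets 1 is arbitrary; rooted variants conventionally assign 1 to the clade below the edge.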

Quantifying Topological Error [Figure: a true tree and an estimated tree on leaves A-F]
False positive (FP): an edge in the estimated tree not in the true tree
False negative (FN): an edge in the true tree missing from the estimated tree
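
The FP and FN counts can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below assumes both trees are on the same leaf set and that each tree is given as a list of one side of each internal edge; the helper names and the two example topologies are hypothetical, since the transcript does not preserve the trees drawn on the slide.

```python
def canonical(side, leaves):
    """Represent a bipartition by the side containing a fixed reference leaf."""
    side = frozenset(side)
    ref = min(leaves)
    return side if ref in side else frozenset(leaves) - side


def fp_fn(true_edges, est_edges, leaves):
    """Return (false positives, false negatives) as edge counts."""
    true_set = {canonical(s, leaves) for s in true_edges}
    est_set = {canonical(s, leaves) for s in est_edges}
    fp = len(est_set - true_set)   # edges only in the estimated tree
    fn = len(true_set - est_set)   # edges only in the true tree
    return fp, fn


leaves = {"A", "B", "C", "D", "E", "F"}
true_tree = [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}]   # hypothetical topology
est_tree = [{"A", "B"}, {"A", "B", "D"}, {"C", "E"}]    # hypothetical topology
print(fp_fn(true_tree, est_tree, leaves))
```

Dividing each count by the number of internal edges of the corresponding tree gives the FP and FN rates.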

[Figure: Comparison of MRP-ML and CA-ML; false negative rate vs. scaffold density (%)]

We still need supertree methods!
Combined analysis cannot be used for:
– Datasets that are very large
– Incompatible data types
– Unavailable sequence data

Outline
- Supertree and combined analysis methods
- Why we need better supertree methods
- SuperFine: a new supertree method that is fast and more accurate than other supertree methods
  – Strict Consensus Merger (SCM)
  – Resolving polytomies
  – Performance of SuperFine (compared to MRP and combined analyses)
  – Applications and future work

Methods that Led to SuperFine The Strict Consensus Merger (SCM) (Huson et al. 1999) Quartet MaxCut (QMC) (Snir and Rao 2008)

Strict Consensus Merger (SCM) [Figure: two source trees, on leaf sets {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, merged through their shared leaves a, b, c, d]

Theorem Let S be a collection of source trees and T be an SCM tree on S. Then for every s in S, ∑(T|L(s)) ⊆ ∑(s), where T|L(s) is the induced subtree of T on the leaf set of s.
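
The containment in the theorem can be checked mechanically on small examples. The sketch below assumes that ∑(t) denotes the set of nontrivial bipartitions of t (the transcript does not define the symbol), and that each tree is given as a list of one side of each internal edge. The function names and the example trees are hypothetical.

```python
def restrict(edges, leaves):
    """Nontrivial bipartitions induced on `leaves`, each stored as the side
    containing a fixed reference leaf."""
    leaves = frozenset(leaves)
    ref = min(leaves)
    out = set()
    for side in edges:
        a = frozenset(side) & leaves
        b = leaves - a
        if len(a) >= 2 and len(b) >= 2:      # drop trivial bipartitions
            out.add(a if ref in a else b)
    return out


def satisfies_theorem(T_edges, s_edges, s_leaves):
    """True if every bipartition of T restricted to L(s) is a bipartition of s."""
    return restrict(T_edges, s_leaves) <= restrict(s_edges, s_leaves)


s_leaves = {"a", "b", "c", "d", "e"}
s_edges = [{"a", "b"}, {"a", "b", "c"}]      # source tree: ab|cde and abc|de
T_ok = [{"a", "b"}, {"d", "e", "h"}]         # a tree on a..e plus h, consistent with s
T_bad = [{"a", "c"}]                         # contains ac|..., which s does not have
print(satisfies_theorem(T_ok, s_edges, s_leaves))
print(satisfies_theorem(T_bad, s_edges, s_leaves))
```

In words, the theorem says the SCM tree never contradicts a source tree: restricted to that tree's leaves, it can only be less resolved.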

Corollary Let S be a collection of source trees, let T be an SCM tree on S, and let v be a vertex of T. Let T’ be a subtree of T rooted at a vertex u adjacent to v, such that v is not a vertex of T’. Then for every s in S, one of the following holds:
– L(s) ⊆ L(T’)
– L(s) ∩ L(T’) = ∅
– L(s) ∩ L(T’) | L(s) − L(T’) ∈ ∑(s)

Intuition for the Theorem [Figure: the two source trees, on leaf sets {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, and their merger]

Performance of SCM
- Low false positive (FP) rate (estimated supertree has few false edges)
- High false negative (FN) rate (estimated supertree is missing many true edges)

Methods that Led to SuperFine The Strict Consensus Merger (SCM) (Huson et al. 1999) Quartet MaxCut (QMC) (Snir and Rao 2008)

Quartet MaxCut (QMC) QMC is a heuristic for the following optimization problem: given a collection Q of quartet trees, find a supertree T, with leaf set L(T) = ∪_{q ∈ Q} L(q), that displays the maximum number of quartet trees in Q.

Maximizing # of Quartet Trees Displayed: 12|34, 23|45, 34|56, 45|67 are compatible quartet trees with supertree [Figure: supertree on leaves 1-7]. Adding the quartet 17|23 creates an incompatible set of quartet trees. An “optimal” supertree would be the same as above, because it agrees with 4 out of 5 quartet trees.
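
The count of 4 out of 5 can be reproduced with a small scoring function. The supertree figure is not preserved in the transcript, so the sketch assumes it is the caterpillar tree on leaves 1 to 7 in numeric order; the function names are hypothetical. This only scores a given tree against a set of quartets and is not the QMC heuristic itself.

```python
def displays(tree_edges, quartet):
    """A tree displays ab|cd if some edge separates {a, b} from {c, d}."""
    (a, b), (c, d) = quartet
    for side in tree_edges:
        left = {a, b} <= side and not ({c, d} & side)
        right = {c, d} <= side and not ({a, b} & side)
        if left or right:
            return True
    return False


def score(tree_edges, quartets):
    """Number of quartet trees displayed by the tree."""
    return sum(displays(tree_edges, q) for q in quartets)


# Internal edges of the assumed caterpillar tree, each given as the side containing leaf 1.
caterpillar = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}]
compatible = [((1, 2), (3, 4)), ((2, 3), (4, 5)), ((3, 4), (5, 6)), ((4, 5), (6, 7))]
with_conflict = compatible + [((1, 7), (2, 3))]

print(score(caterpillar, compatible), "of", len(compatible))
print(score(caterpillar, with_conflict), "of", len(with_conflict))
```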

QMC as a Supertree Method Step 1: Encode source trees as a set of quartets Step 2: Apply QMC
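
Step 1 can be illustrated by enumerating the quartet trees a source tree induces. The slide does not say whether all induced quartets or only a sample are used; the sketch below assumes all of them. The function name `quartets_of` and the input format (one side of each internal edge) are hypothetical.

```python
from itertools import combinations


def quartets_of(tree_edges, leaves):
    """All quartet trees ab|cd induced by the tree's internal edges."""
    leaves = set(leaves)
    found = set()
    for side in tree_edges:
        inside = sorted(set(side) & leaves)
        outside = sorted(leaves - set(side))
        # Any two leaves on one side and any two on the other form a quartet.
        for p in combinations(inside, 2):
            for q in combinations(outside, 2):
                found.add(frozenset([frozenset(p), frozenset(q)]))
    return found


# Hypothetical five-leaf source tree with internal edges ab|cde and abc|de.
source_quartets = quartets_of([{"a", "b"}, {"a", "b", "c"}], "abcde")
for q in sorted(sorted("".join(sorted(pair)) for pair in q) for q in source_quartets):
    print("|".join(q))
```

A fully resolved tree on n leaves induces one quartet tree for every set of four leaves, so the encoding grows as n choose 4.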

Idea behind SuperFine: First, construct a supertree with low false positives using SCM (the Strict Consensus Merger). Then, refine the tree to reduce false negatives by resolving each polytomy using QMC (Quartet MaxCut).

Resolving a single polytomy, v
Step 1: Encode each source tree as a collection of quartet trees on {1,2,...,d}, where d = degree(v)
Step 2: Apply Quartet MaxCut (Snir and Rao) to the collection of quartet trees, to produce a tree t on leaf set {1,2,...,d}
Step 3: Replace the star tree at v by tree t
Why?
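
One plausible reading of the relabeling in Step 1 is sketched below; it is an interpretation, not the speaker's implementation. It assumes that removing v splits the leaves of the SCM tree into d groups numbered 1 to d, that each source-tree leaf is replaced by the number of its group, and that a source-tree edge is kept only when it still separates at least two groups from at least two other groups. The function name, the `component_of` mapping, and the example are hypothetical.

```python
def relabel_edges(source_edges, source_leaves, component_of):
    """Map each source-tree bipartition to a bipartition of group labels 1..d."""
    leaves = set(source_leaves)
    kept = []
    for side in source_edges:
        a = {component_of[x] for x in set(side) & leaves}
        b = {component_of[x] for x in leaves - set(side)}
        # Keep the edge only if it separates distinct groups, two or more on each side.
        if len(a) >= 2 and len(b) >= 2 and not (a & b):
            kept.append((frozenset(a), frozenset(b)))
    return kept


# Hypothetical polytomy of degree 4: groups 1 = {a, b}, 2 = {c}, 3 = {d, e}, 4 = {f}.
component_of = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3, "f": 4}

# Source tree ac|df becomes 12|34 on the group labels.
print(relabel_edges([{"a", "c"}], {"a", "c", "d", "f"}, component_of))
# Source tree ab|cd only separates group 1 from the rest, so it gives no quartet.
print(relabel_edges([{"a", "b"}], {"a", "b", "c", "d"}, component_of))
```

Each kept bipartition on the labels then yields quartet trees on {1,2,...,d}, which form the input to Step 2.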

Back to Our Example [Figure: the two source trees, on leaf sets {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, and the SCM tree on leaves a-j containing a polytomy]

Where We Use the Theorem: for every s in S, ∑(T|L(s)) ⊆ ∑(s) [Figure: the SCM tree and the two source trees]

Step 1: Encode each source tree as a collection of quartet trees on {1,2,...,d} [Figure: the two source trees]

Step 2: Apply Quartet MaxCut (QMC) to the collection of quartet trees

Theorem For each source tree, and each polytomy v of degree d, the encoding of each source tree with leaf labels {1,2,...,d} is well-defined and produces no conflicting quartet trees.

Replace polytomy using tree from QMC [Figure: the SCM tree on leaves a-j before and after the polytomy is replaced by the QMC tree]

[Figure: False negative rate vs. scaffold density (%)]

[Figure: False negative rate vs. scaffold density (%)]

[Figure: False positive rate vs. scaffold density (%)]

[Figure: Running time, SuperFine vs. MRP, by scaffold density (%)]

Running Time, SuperFine vs. MRP: MRP 8-12 sec.; SuperFine 2-3 sec. [Figure: running time vs. scaffold density (%)]

Observations
- SuperFine is much more accurate than MRP, with comparable performance only when the scaffold density is 100%
- SuperFine is almost as accurate as CA-ML
- SuperFine is extremely fast

Future Work
- Exploring algorithm design space for SuperFine
  – Different quartet encodings
  – Not using SCM in Step 1
  – Parallel version
  – Post-processing step to minimize Sum-of-FN to source trees
- Using SuperFine to enable phylogeny estimation
  – without an alignment
  – on many-marker combined datasets
- Using SuperFine in conjunction with divide-and-conquer methods to create more accurate phylogenetic methods
- Exploration of impact of source tree collections (in particular the scaffold) on supertree analyses
- Revisiting specific biological supertrees