SuperFine, a new supertree method Shel Swenson September 17th 2009.

1 SuperFine, a new supertree method Shel Swenson September 17th 2009

2 Reconstructing the Tree of Life Tree of Life challenges: - millions of species - lots of missing data Two possible approaches: - Combined Analysis - Supertree Methods

3 Two competing approaches [Figure: species-by-gene data (gene 1, gene 2, ..., gene k) analyzed together by Combined Analysis]

4 Combined Analysis Methods
gene 1: S1 TCTAATGGAA, S2 GCTAAGGGAA, S3 TCTAAGGGAA, S4 TCTAACGGAA, S7 TCTAATGGAC, S8 TATAACGGAA
gene 2: S4 GGTAACCCTC, S5 GCTAAACCTC, S6 GGTGACCATC, S7 GCTAAACCTC
gene 3: S1 TATTGATACA, S3 TCTTGATACC, S4 TAGTGATGCA, S7 CATTCATACC, S8 TAGTGATGCA

5 Combined Analysis
     gene 1      gene 2      gene 3
S1   TCTAATGGAA  ??????????  TATTGATACA
S2   GCTAAGGGAA  ??????????  ??????????
S3   TCTAAGGGAA  ??????????  TCTTGATACC
S4   TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5   ??????????  GCTAAACCTC  ??????????
S6   ??????????  GGTGACCATC  ??????????
S7   TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8   TATAACGGAA  ??????????  TAGTGATGCA
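The concatenation step behind a combined analysis can be sketched as follows. This is a minimal illustration rather than the tooling used in the study: the function name `concatenate` is hypothetical, and the pairing of species labels with sequences is assumed to follow the order in which they appear on the slides.

```python
# Per-gene alignments copied from the slides (species-to-sequence pairing assumed).
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}


def concatenate(genes):
    """Concatenate gene alignments, padding genes a species lacks with '?'."""
    species = sorted(set().union(*genes))  # union of the dictionaries' keys
    matrix = {}
    for sp in species:
        parts = []
        for gene in genes:
            length = len(next(iter(gene.values())))  # alignment width of this gene
            parts.append(gene.get(sp, "?" * length))
        matrix[sp] = "".join(parts)
    return matrix


combined = concatenate([gene1, gene2, gene3])
for sp, row in combined.items():
    print(sp, row)
```

Every species ends up with a row of the same length, and the blocks of `?` make the amount of missing data visible at a glance.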

6 Two competing approaches [Figure: the same species-by-gene data (gene 1, gene 2, ..., gene k); Combined Analysis analyzes the genes together, while a Supertree Method analyzes each gene separately]

7 Why use supertree methods? - Missing data - Large dataset sizes - Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry) - Unavailable sequence data (only trees)

8 Many Supertree Methods: MRP, weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more... MRP = Matrix Representation with Parsimony (most commonly used and most accurate)
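The "matrix representation" half of MRP can be illustrated with a short sketch. It follows the encoding as it is commonly described (one binary character per internal edge of each source tree, with `?` for taxa absent from that tree); the function name `mrp_matrix`, the input format, and the two toy source trees are assumptions made for illustration, and the parsimony search that would follow is not shown.

```python
def mrp_matrix(source_trees):
    """Encode source trees as a 0/1/? character matrix.

    source_trees: list of (leaf_set, clades), where each clade is the set of
    leaves on one side of an internal edge of that source tree.
    """
    all_taxa = sorted(set().union(*(leaves for leaves, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for leaves, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")  # taxon on the clade side of the edge
                else:
                    rows[taxon].append("0")  # taxon on the other side
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two hypothetical four-leaf source trees, each with a single internal edge.
trees = [
    ({"a", "b", "c", "d"}, [{"a", "b"}]),
    ({"c", "d", "e", "f"}, [{"c", "d"}]),
]
matrix = mrp_matrix(trees)
for taxon, row in matrix.items():
    print(taxon, row)
```

A parsimony tree search on the resulting matrix would then produce the supertree.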

9 Today’s Outline Supertree and combined analysis methods Why we need better supertree methods SuperFine: a new supertree method that is fast and more accurate than other supertree methods –Strict Consensus Merger (SCM) –Resolving polytomies –Performance of SuperFine (compared to MRP and combined analyses) –applications and future work

10 Previous Simulation Studies 1. Generate Model Tree 2. Generate sequence data 3. Select Subsets 4. Construct Source Trees 5. Apply Supertree Method 6. Compare to Model Tree [Figure: taxa-by-gene data matrix (gene 1, gene 2, ..., gene k)]

11 What leads to missing data? - Evolution (gain and loss of genes) - Dataset selection - Limited resources (time, money, etc.)

12 My Simulation Study 1. Generate model trees (100-1000 taxa) 2. Simulate gene gain and loss and generate sequences 3. Simulate techniques for gene and taxon selection –Clade-based datasets –Scaffold dataset 4. Generate source trees and a combined dataset 5. Apply supertree and combined analysis methods 6. Compare each estimated tree to the model tree, and record topological error

13 Experimental Parameters Number of taxa in model tree: 100, 500, and 1000 –Generate 5, 15 and 25 clade-based datasets, respectively Scaffold density: 20%, 50%, 75%, and 100% Six super-methods: –Combined analysis using ML and MP –MRP on ML and MP source trees –Weighted MRP on ML and MP source trees (MRP = Matrix Representation with Parsimony)

14 Quantifying Topological Error [Figure: a true tree and an estimated tree on leaves A through F] False positive (FP): An edge in the estimated tree not in the true tree. False negative (FN): An edge in the true tree missing from the estimated tree.
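The FP and FN counts can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below assumes each tree is given as a list of leaf sets, one per internal edge; the two six-leaf trees are hypothetical stand-ins, since the topologies drawn on the slide are not recoverable from the transcript.

```python
def canonical(side, leaves):
    """Represent a bipartition by the side that excludes the smallest leaf."""
    side, leaves = frozenset(side), frozenset(leaves)
    return leaves - side if min(leaves) in side else side


def fp_fn(true_splits, est_splits, leaves):
    """Return (false positives, false negatives) as edge counts."""
    true = {canonical(s, leaves) for s in true_splits}
    est = {canonical(s, leaves) for s in est_splits}
    return len(est - true), len(true - est)


leaves = "ABCDEF"
# Hypothetical trees, each listed by one side of every internal edge.
true_tree = [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}]
est_tree = [{"A", "B"}, {"A", "B", "D"}, {"E", "F"}]

fp, fn = fp_fn(true_tree, est_tree, leaves)
print("FP:", fp, "FN:", fn)
print("FN rate:", fn / len(true_tree))
```

Dividing each count by the number of internal edges in the corresponding tree gives a rate, assuming rates are defined per internal edge.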

15 Comparison of MRP-ML and CA-ML [Plot: False Negative Rate vs. Scaffold Density (%)]

16 We still need supertree methods! Combined analysis cannot be used for: –Datasets that are very large –Incompatible data types –Unavailable sequence data

17 Outline Supertree and combined analysis methods Why we need better supertree methods SuperFine: a new supertree method that is fast and more accurate than other supertree methods –Strict Consensus Merger (SCM) –Resolving polytomies –Performance of SuperFine (compared to MRP and combined analyses) –applications and future work

18 Methods that Led to SuperFine The Strict Consensus Merger (SCM) (Huson et al. 1999) Quartet MaxCut (QMC) (Snir and Rao 2008)

19 Strict Consensus Merger (SCM) [Figure: two source trees, on leaves {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, which overlap on {a, b, c, d} and are merged into one tree on leaves a through j]

20 Theorem Let S be a collection of source trees and T be a SCM tree on S. Then for every s in S, $\Sigma(T|_{L(s)}) \subseteq \Sigma(s)$, where $T|_{L(s)}$ is the induced subtree of T on the leaf set of s.
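The containment in the theorem can be checked mechanically on small examples. The sketch below assumes that ∑(t) denotes the set of nontrivial bipartitions of a tree t and that the relation in the theorem is set containment; the tree `T_splits` is a hypothetical stand-in for an SCM tree, not the output of an actual SCM run.

```python
def restrict(splits, subset):
    """Bipartitions induced on `subset`, keeping only nontrivial ones.

    Each split is given by one side; it is restricted to `subset` and stored
    as the side that excludes the smallest leaf of `subset`.
    """
    subset = frozenset(subset)
    ref = min(subset)
    induced = set()
    for side in splits:
        side = frozenset(side) & subset
        other = subset - side
        if len(side) >= 2 and len(other) >= 2:
            induced.add(other if ref in side else side)
    return induced


def satisfies_theorem(T_splits, s_splits, s_leaves):
    """True if every bipartition of T restricted to L(s) is a bipartition of s."""
    return restrict(T_splits, s_leaves) <= restrict(s_splits, s_leaves)


# Hypothetical source tree s on leaves a..e, listed by one side of each internal edge.
s_leaves = "abcde"
s_splits = [{"a", "b"}, {"d", "e"}]

# Hypothetical tree T on leaves a..g that is less resolved than s on L(s).
T_splits = [{"a", "b"}]
ok = satisfies_theorem(T_splits, s_splits, s_leaves)

# A tree that groups a with c conflicts with s, so the check fails.
bad = satisfies_theorem([{"a", "c"}], s_splits, s_leaves)
print(ok, bad)
```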

21 Corollary Let S be a collection of source trees, T be a SCM tree on S, and let v be a vertex of T. Let T’ be a subtree of T rooted at a vertex u adjacent to v, such that v is not a vertex of T’. Then for every s in S, one of the following holds: –$L(s) \subseteq L(T')$ –$L(s) \cap L(T') = \emptyset$ –$L(s) \cap L(T') \mid L(s) - L(T') \in \Sigma(s)$

22 Intuition for the Theorem [Figure: the two source trees, on leaves {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, and the merged tree from the SCM example]

23 Performance of SCM Low false positive (FP) rate (Estimated supertree has few false edges) High false negative (FN) rate (Estimated supertree is missing many true edges)

24 Methods that Led to SuperFine The Strict Consensus Merger (SCM) (Huson et al. 1999) Quartet MaxCut (QMC) (Snir and Rao 2008)

25 Quartet MaxCut (QMC) QMC is a heuristic for the following optimization problem: Given a collection Q of quartet trees, find a supertree T, with leaf set $L(T) = \bigcup_{q \in Q} L(q)$, that displays the maximum number of quartet trees in Q. [Figure: a tree on leaves 1 through 7 and quartet trees on leaves {1, 2, 4, 5}]

26 Maximizing # of Quartet Trees Displayed 12|34, 23|45, 34|56, 45|67 are compatible quartet trees with supertree [Figure: tree on leaves 1 through 7]. Adding the quartet 17|23 creates an incompatible set of quartet trees. An “optimal” supertree would be the same as above, because it agrees with 4 out of 5 quartet trees.
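The count of 4 out of 5 can be reproduced with a small check. The sketch assumes the supertree on the slide is the caterpillar tree with leaves in the order 1, 2, ..., 7 (the drawing itself is not recoverable from the transcript); a quartet ab|cd counts as displayed when some edge of the tree separates {a, b} from {c, d}.

```python
def displays(splits, leaves, quartet):
    """True if some bipartition separates the two pairs of the quartet."""
    (a, b), (c, d) = quartet
    leaves = set(leaves)
    for side in splits:
        side = set(side)
        other = leaves - side
        if ({a, b} <= side and {c, d} <= other) or \
           ({a, b} <= other and {c, d} <= side):
            return True
    return False


leaves = range(1, 8)
# Assumed caterpillar tree on 1..7, listed by one side of each internal edge.
caterpillar = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}]

quartets = [((1, 2), (3, 4)), ((2, 3), (4, 5)), ((3, 4), (5, 6)),
            ((4, 5), (6, 7)), ((1, 7), (2, 3))]

score = sum(displays(caterpillar, leaves, q) for q in quartets)
print(score, "of", len(quartets), "quartet trees displayed")
```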

27 QMC as a Supertree Method Step 1: Encode source trees as a set of quartets Step 2: Apply QMC
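Step 1 can be sketched as follows, assuming each source tree is given by its bipartitions (one leaf set per internal edge). The function name `induced_quartets` and the five-leaf example tree are hypothetical; the QMC call in Step 2 is not shown because it relies on the external Quartet MaxCut software.

```python
from itertools import combinations


def induced_quartets(leaves, splits):
    """All quartet trees ab|cd induced by a tree given as bipartitions."""
    leaves = set(leaves)
    quartets = set()
    for side in splits:
        side = set(side)
        other = leaves - side
        # Any two leaves on one side and two on the other form a quartet tree.
        for pair1 in combinations(sorted(side), 2):
            for pair2 in combinations(sorted(other), 2):
                quartets.add(frozenset([frozenset(pair1), frozenset(pair2)]))
    return quartets


# Hypothetical source tree ((a,b),c,(d,e)), listed by one side of each internal edge.
quartets = induced_quartets("abcde", [{"a", "b"}, {"a", "b", "c"}])
for q in quartets:
    first, second = (sorted(pair) for pair in q)
    print("".join(first) + "|" + "".join(second))
```

A fully resolved tree on n leaves induces one quartet tree for every four-leaf subset, so the number of quartets grows quickly with tree size.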

28 Idea behind SuperFine First, construct a supertree with low false positives using the Strict Consensus Merger (SCM). Then, refine the tree to reduce false negatives by resolving each polytomy using Quartet MaxCut (QMC).

29 Resolving a single polytomy, v Step 1: Encode each source tree as a collection of quartet trees on {1,2,...,d}, where d=degree(v) Step 2: Apply Quartet MaxCut (Snir and Rao) to the collection of quartet trees, to produce a tree t on leafset {1,2,...,d} Step 3: Replace the star tree at v by tree t Why?

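30 Back to Our Example [Figure: the SCM tree and the two source trees from the SCM example; the subtrees around the polytomy are numbered 1 through 6, and each source-tree leaf is relabeled with the number of the subtree that contains it]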

31 Where We Use the Theorem For every s in S, $\Sigma(T|_{L(s)}) \subseteq \Sigma(s)$ [Figure: the SCM tree and the two source trees with leaves relabeled 1 through 6]

32 Step 1: Encode each source tree as a collection of quartet trees on {1,2,...,d} [Figure: the two relabeled source trees and the resulting quartet trees on leaf sets {1, 2, 3, 4} and {1, 4, 5, 6}]
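One way to picture this encoding step is to relabel each quartet tree of a source tree by the polytomy-side labels and keep only quartets whose four labels are distinct. This is a sketch under stated assumptions, not the talk's implementation: the function name `relabel_quartets`, the `label` map, and the example quartets are all hypothetical.

```python
def relabel_quartets(quartets, label):
    """Map quartets on original leaves to quartets on labels 1..d.

    quartets: iterable of ((a, b), (c, d)) meaning ab|cd.
    label: dict sending each leaf to the number of the subtree around the
    polytomy that contains it.
    """
    encoded = set()
    for (a, b), (c, d) in quartets:
        la, lb, lc, ld = label[a], label[b], label[c], label[d]
        if len({la, lb, lc, ld}) == 4:  # skip quartets that reuse a label
            encoded.add(frozenset([frozenset([la, lb]), frozenset([lc, ld])]))
    return encoded


# Hypothetical source tree ((a,b),c,(d,e)) written as its five quartet trees.
source_quartets = [(("a", "b"), ("c", "d")), (("a", "b"), ("c", "e")),
                   (("a", "b"), ("d", "e")), (("a", "c"), ("d", "e")),
                   (("b", "c"), ("d", "e"))]
# Hypothetical labeling: a and b fall in the same subtree around the polytomy.
label = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 4}

encoded = relabel_quartets(source_quartets, label)
print(encoded)
```

In this example the quartets ac|de and bc|de both map to 12|34, so the source tree contributes a single quartet tree on the labels.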

33 Step 2: Apply Quartet MaxCut (QMC) to the collection of quartet trees [Figure: QMC combines the quartet trees on {1, 2, 3, 4} and {1, 4, 5, 6} into a tree on leaves 1 through 6]

34 Theorem For each source tree, and each polytomy v of degree d, the encoding of each source tree with leaf labels {1,2,...,d} is well-defined and produces no conflicting quartet trees.

35 Replace polytomy using tree from QMC [Figure: the QMC tree on leaves 1 through 6 replaces the polytomy in the SCM tree, giving a more resolved tree on leaves a through j]

36 [Plot: False Negative Rate vs. Scaffold Density (%)]

37 [Plot: False Negative Rate vs. Scaffold Density (%)]

38 [Plot: False Positive Rate vs. Scaffold Density (%)]

39 Running Time, SuperFine vs. MRP [Plot: running time vs. Scaffold Density (%)]

40 Running Time, SuperFine vs. MRP: MRP 8-12 sec., SuperFine 2-3 sec. [Plot: running time vs. Scaffold Density (%)]

41 Observations - SuperFine is much more accurate than MRP, with comparable performance only when the scaffold density is 100% - SuperFine is almost as accurate as CA-ML - SuperFine is extremely fast

42 Future Work - Exploring algorithm design space for SuperFine –Different quartet encodings –Not using SCM in Step 1 –Parallel version –Post-processing step to minimize Sum-of-FN to source trees - Using SuperFine to enable phylogeny estimation –without an alignment –on many-marker combined datasets - Using SuperFine in conjunction with divide-and-conquer methods to create more accurate phylogenetic methods - Exploration of impact of source tree collections (in particular the scaffold) on supertree analyses - Revisiting specific biological supertrees

