The bootstrap, consensus-trees, and super-trees Phylogenetics Workshop, 16-18 August 2006 Barbara Holland.



What is the bootstrap?
As in many other areas where statistical inference is applied, in phylogenetics it is not enough to get a point estimate of the phylogenetic tree. We would also like some measure of confidence in that estimate.
- Is our tree likely to change if we got more data, or if we had used slightly different data?
- How robust is our result to sampling error?
The bootstrap is a useful tool for answering these sorts of questions.

Assessing confidence in trees
In 1985 Felsenstein introduced the idea of the bootstrap to phylogenetics. For each bootstrap sample:
- Create a new alignment by resampling, with replacement, the columns of the observed alignment
- Construct a tree for the ‘bootstrap’ alignment
The bootstrap can be applied to any method that starts from a sequence alignment, e.g., parsimony, likelihood, or clustering methods if the distances are derived from an alignment.
The bootstrap support for each edge is the number of bootstrap trees in which that edge appears, usually reported as a proportion of the bootstrap samples.
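The resampling loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the workshop's own code: the alignment is a dict from taxon name to sequence, and `quartet_parsimony_split` is a hypothetical toy tree builder that only handles exactly four taxa by counting parsimony-informative site patterns. Any real tree-building method could be passed in its place.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Return a bootstrap alignment: columns drawn with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

def quartet_parsimony_split(alignment):
    """Toy tree builder (assumption): exactly four taxa.

    Returns the split supported by the most informative sites,
    or None if no site is informative. Ties are broken arbitrarily.
    """
    a, b, c, d = sorted(alignment)
    counts = Counter()
    for w, x, y, z in zip(*(alignment[t] for t in (a, b, c, d))):
        if w == x and y == z and w != y:
            counts[f"{a}{b}|{c}{d}"] += 1
        elif w == y and x == z and w != x:
            counts[f"{a}{c}|{b}{d}"] += 1
        elif w == z and x == y and w != x:
            counts[f"{a}{d}|{b}{c}"] += 1
    return counts.most_common(1)[0][0] if counts else None

def bootstrap_support(alignment, build_tree, n_reps=100, seed=0):
    """Proportion of bootstrap replicates in which each split appears."""
    rng = random.Random(seed)
    support = Counter()
    for _ in range(n_reps):
        split = build_tree(resample_columns(alignment, rng))
        if split is not None:
            support[split] += 1
    return {split: k / n_reps for split, k in support.items()}

observed = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}
print(bootstrap_support(observed, quartet_parsimony_split))
```

The support values depend on the random seed and the number of replicates, so they should be read as estimates rather than exact quantities.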

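[Figure: alignments of the four taxa a, b, c, d shown on the slide, the tree built from each, and a summary tree on a, b, c, d labelled with bootstrap support 0.75.]

   1         2         3         4         5
a  ATATAAA   ATTTAAA   AAATAAA   ATATAAA   ATTTAAA
b  ATTATAA   ATTATAA   ATTATAA   ATTATAA   ATAATAA
c  TAAAATA   TAAAATA   TAAAATA   TAAAATA   TAAAATA
d  TATAAAT   TAAAAAT   TTTAAAT   TATAAAT   TAAAAAT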

Example where the bootstrap is useful
Simulate data on a four-taxon tree (JC model). Use sequence lengths of 100, 1000, and [...]

Tree             100      1000     [...]
((a,b),(c,d))    5.7%     97%      100%
((a,c),(b,d))    42.8%    <5%      0
((a,d),(b,c))    49.8%    <5%

[Figure: the four-taxon tree on a, b, c, d used for the simulation.]

Example where the bootstrap is not so useful
Simulate data on two four-taxon trees (JC model) in the proportion 55% to 45% and concatenate the sequences. Use total sequence lengths of 100, 1000, and [...]

Tree             100      1000     [...]
((a,b),(c,d))    64%      80%      98%
((a,c),(b,d))    33%      20%      <5
((a,d),(b,c))    3%       0%       <

[Figure: the two generating trees on a, b, c, d, labelled 55% and 45%.]

Consensus trees
Consensus trees attempt to summarise the information contained in a set of trees, where each tree in the set is on the same taxa. Some consensus tree methods are specific to rooted trees.

Why are consensus methods required?
Many phylogenetic methods produce a collection of trees rather than a single best tree.
- Markov chain Monte Carlo (MCMC)
- Bootstrapping
- Equally parsimonious trees
Sometimes separate analyses of different genes also yield a collection of trees.

Terminology: Splits and clades
Each edge in an unrooted tree corresponds to a split or bipartition of the taxa set. Each edge in a rooted tree corresponds to a clade.

Splits
[Figure: unrooted tree on cat, dog, mouse, turtle, parrot.]
- dog, cat | mouse, turtle, parrot
- cat, dog, mouse | turtle, parrot
- cat, dog, mouse, parrot | turtle

Clades
[Figure: rooted tree on dog, cat, mouse, parrot, turtle, shown over three slides.]
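To make the two terms concrete, here is a minimal sketch that lists the clades of a rooted tree and the splits of the unrooted tree obtained by ignoring the root. The nested-tuple tree encoding and the function names `clades` and `splits` are assumptions made for illustration.

```python
def clades(tree):
    """Set of clades (leaf sets below each node) of a rooted tree.

    A tree is a leaf name (str) or a tuple of subtrees (assumed encoding).
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found

def splits(tree):
    """Splits of the unrooted tree obtained by ignoring the root."""
    all_clades = clades(tree)
    taxa = max(all_clades, key=len)  # the root clade holds every taxon
    # Each clade C other than the full set gives the bipartition C | rest.
    return {frozenset([c, taxa - c]) for c in all_clades if c != taxa}

tree = ((("cat", "dog"), "mouse"), ("turtle", "parrot"))
for side_a, side_b in (sorted(s, key=sorted) for s in splits(tree)):
    print(sorted(side_a), "|", sorted(side_b))
```

The two children of a binary root define the same split, so the set comprehension collapses them into one entry.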

Strict Consensus
The strict consensus tree contains only those splits/clades that appear in all trees.
[Figure: input trees on cat, dog, mouse, turtle, parrot and their strict consensus.]

Semi-strict
The semi-strict consensus tree also contains those splits/clades that appear in at least one input tree and don’t conflict with any of the others.
[Figure: input trees on cat, dog, mouse, turtle, parrot and their semi-strict consensus.]

Majority-Rule
The majority-rule consensus tree contains only those splits/clades that appear in more than 50% of the input trees.
[Figure: input trees on cat, dog, mouse, turtle, parrot and their majority-rule consensus.]
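The strict and majority-rule definitions differ only in how often a clade must occur, which the following sketch makes explicit. It assumes rooted trees encoded as nested tuples on the same taxa and returns the clades of the consensus rather than drawing the tree; the function names and the `rule` argument are hypothetical.

```python
from collections import Counter

def clades(tree):
    """Set of clades of a rooted tree given as nested tuples (assumed encoding)."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found

def consensus_clades(trees, rule="majority"):
    """Non-trivial clades kept by the strict or majority-rule consensus."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))  # each clade counted once per tree
    n_trees = len(trees)
    n_taxa = max(len(c) for c in counts)
    if rule == "strict":
        keep = lambda k: k == n_trees
    else:
        keep = lambda k: k > n_trees / 2
    return {c for c, k in counts.items() if keep(k) and 1 < len(c) < n_taxa}

trees = [
    ((("cat", "dog"), "mouse"), ("turtle", "parrot")),
    ((("cat", "mouse"), "dog"), ("turtle", "parrot")),
    ((("cat", "dog"), "mouse"), ("turtle", "parrot")),
]
print(sorted(map(sorted, consensus_clades(trees, "strict"))))
print(sorted(map(sorted, consensus_clades(trees, "majority"))))
```

Clades that occur in more than half of the trees are always pairwise compatible, which is why the majority-rule output can be assembled into a single tree.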

Terminology: 3-taxon statements
A 3-taxon statement says that, of three species, two are more closely related to each other than either is to the third. E.g. the tree below displays the 3-taxon statements
- ((dog,cat),mouse)
- ((dog,mouse),parrot)
- ((mouse,parrot),turtle)
…and others…
[Figure: rooted tree on dog, cat, mouse, parrot, turtle.]
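A rooted tree displays ((a,b),c) exactly when some clade contains a and b but not c, which gives a short way to enumerate all of its 3-taxon statements. The sketch below assumes a nested-tuple tree encoding, and the caterpillar tree used as input is one tree consistent with the statements listed on the slide; both are assumptions.

```python
from itertools import combinations

def clades(tree):
    """Set of clades of a rooted tree given as nested tuples (assumed encoding)."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found

def rooted_triples(tree):
    """All 3-taxon statements ((a,b),c), stored as (frozenset({a, b}), c)."""
    all_clades = clades(tree)
    taxa = max(all_clades, key=len)
    triples = set()
    for clade in all_clades:
        # a and b share this clade, c lies outside it
        for a, b in combinations(sorted(clade), 2):
            for c in taxa - clade:
                triples.add((frozenset([a, b]), c))
    return triples

tree = (((("dog", "cat"), "mouse"), "parrot"), "turtle")
for pair, outsider in sorted(rooted_triples(tree), key=lambda t: (sorted(t[0]), t[1])):
    print(f"(({','.join(sorted(pair))}),{outsider})")
```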

Terminology: Rooted trees, hierarchies, clusters, and partitions
[Figure: rooted tree on the leaves a, b, c, d, e, f.]
Partitions:
- abcd | ef
- a | bcd | ef
- a | b | cd | ef
Hierarchy of clusters: {a}, {b}, {c}, {d}, {e}, {f}, {c,d}, {b,c,d}, {a,b,c,d}, {e,f}

Products of partitions
Given k partitions p_1, p_2, p_3, …, p_k of the same set of taxa, the product of these partitions is the partition where a and b are in the same block if and only if they are in the same block for each p_i.
Example: The product of abc|de and ad|bce is a|bc|d|e
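The definition translates directly into code: label each taxon with the tuple of block indices it occupies across the partitions, and group taxa that share a label. A minimal sketch, assuming each partition is given as a list of sets (the function name `partition_product` is hypothetical):

```python
def partition_product(*partitions):
    """Product of partitions of the same taxon set.

    Two taxa share a block of the product iff they share a block
    in every input partition.
    """
    taxa = set().union(*partitions[0])
    blocks = {}
    for taxon in taxa:
        # Signature: index of the block holding this taxon in each partition.
        key = tuple(
            next(i for i, block in enumerate(p) if taxon in block)
            for p in partitions
        )
        blocks.setdefault(key, set()).add(taxon)
    return {frozenset(block) for block in blocks.values()}

p1 = [{"a", "b", "c"}, {"d", "e"}]      # abc|de
p2 = [{"a", "d"}, {"b", "c", "e"}]      # ad|bce
print(sorted("".join(sorted(b)) for b in partition_product(p1, p2)))
# prints ['a', 'bc', 'd', 'e']
```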

Adams Consensus
The Adams consensus method only applies to rooted trees. It preserves all the 3-taxon statements that are common to all of the input trees. It is a recursive algorithm that looks at the product of the maximal partitions of the input trees, where the maximal partition of a tree is the partition of its leaves into the clades below the children of the root.

AdamsTree algorithm (from Bryant 2003)

Procedure AdamsTree(T_1, …, T_k)
  if T_1 contains only 1 leaf then
    return T_1
  else
    construct the product of the maximal partitions of the input trees
    for each block B in the partition do
      construct AdamsTree(T_1|B, …, T_k|B)
    attach the roots of these trees to a new node v
    return this tree
  end
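The procedure can be turned into a short runnable sketch. This is one possible reading of the pseudocode rather than Bryant's implementation: rooted trees are nested tuples on the same leaf set, `T|B` is computed by a hypothetical `restrict` helper that prunes leaves and suppresses unary nodes, and blocks are visited in sorted order only to make the output deterministic.

```python
def leaves(tree):
    """Leaf set of a rooted tree given as nested tuples (assumed encoding)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """T|B: prune leaves outside keep and suppress nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def adams_tree(trees):
    """Adams consensus of rooted trees on the same leaf set."""
    if isinstance(trees[0], str):          # only one leaf left
        return trees[0]
    # Maximal partition of each tree: leaf sets below the root's children.
    maximal = [[leaves(child) for child in t] for t in trees]
    # Product of the maximal partitions, via block-index signatures.
    blocks = {}
    for taxon in leaves(trees[0]):
        key = tuple(next(i for i, b in enumerate(p) if taxon in b) for p in maximal)
        blocks.setdefault(key, set()).add(taxon)
    # Recurse on each block and attach the results to a new root.
    return tuple(
        adams_tree([restrict(t, block) for t in trees])
        for block in sorted(blocks.values(), key=sorted)
    )

t1 = (("a", ("b", ("c", "d"))), ("e", "f"))
t2 = (("e", ("b", ("c", "d"))), ("a", "f"))
print(adams_tree([t1, t2]))
# prints ('a', ('b', ('c', 'd')), 'e', 'f')
```

The two input trees are chosen to have the maximal partitions abcd | ef and bcde | af, so the root of the result has the children {a}, {b,c,d}, {e}, {f}.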

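Adams consensus example
[Figure: two rooted input trees with leaf orders a b c d e f and e b c d a f.]
Maximal partition of the first tree: abcd | ef
Maximal partition of the second tree: bcde | af
Product of maximal partitions: a | bcd | e | f
The root of the consensus tree has the children {a}, {b,c,d}, {e}, {f}.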

Adams consensus example cont.
Restrict both input trees to b, c, d.
Maximal partition of the first tree: b | cd
Maximal partition of the second tree: b | cd
Product of maximal partitions: b | cd
The node {b,c,d} has the children {b} and {c,d}.

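Adams consensus example cont.
Restrict both input trees to c, d.
Maximal partition of the first tree: c | d
Maximal partition of the second tree: c | d
Product of maximal partitions: c | d
The node {c,d} has the children {c} and {d}, which completes the consensus tree: the root has children {a}, {b,c,d}, {e}, {f}; {b,c,d} splits into {b} and {c,d}; and {c,d} splits into {c} and {d}.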

What about an “Adams”-like method for unrooted trees?
Instead of triples we would need to consider statements about quartets of taxa. If a quartet ((a,b),(c,d)) appeared in all the input trees it should be displayed in the output. Easy enough?

Three requirements (Steel, Dress and Böcker 2000)
1. Relabelling the species at the tips of the trees should yield the same answer, relabelled in the appropriate way
2. The input order of the trees should not matter
3. A quartet that appears in all the input trees should appear in the output tree

No method can satisfy these 3 requirements
Counterexample:
[Figure: two unrooted trees on a, b, c, d, e, f.]

Supertree methods
Supertree methods take a set of trees on overlapping taxa sets and return a tree (or sometimes a ‘fail’ message).
Biological relevance
- Not all genes are present in all species
- Not all genes are easy to sequence for all species
Assembling the Tree of Life
- Computationally infeasible to build a single tree for all taxa at once
- Use a divide and conquer approach
- And then use supertree methods to piece the Tree of Life together

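Concept: Refinement
A tree T refines a tree T’ on the same labels if T’ can be obtained from T by contracting edges.
[Figure: a tree on a, b, c, d, e that refines a less resolved tree on the same labels, followed by two further trees on a, b, c, d, e that are also refinements.]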

Concept: Restriction
[Figure: tree T with leaves a, b, c, d, e, f, g, h.]
The label set is X = {a,b,c,d,e,f,g,h}. We can restrict T to any subset X’ of the labels.

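Concept: Restriction
E.g. the restriction of T to {a,c,e,g}: find the minimal subtree connecting those labels and then suppress the degree-two vertices.
[Figure: T with the subtree on a, c, e, g marked, and the resulting tree on a, c, e, g.]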

Concept: Displaying
A tree T (on label set X) displays a tree T’ (on label set X’, a subset of X) if T restricted to the labels X’ is a refinement of T’.
[Figure: a tree on the labels a to f and two smaller trees that it displays.]

Concept: Displaying
[Figure: a tree on the labels a to f and two smaller trees that it does not display.]
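Restriction and refinement combine into a simple test for displaying. The sketch below handles rooted trees only, which is an assumption (the slide figures may show unrooted trees): trees are nested tuples, and T|X’ refines T’ exactly when every clade of T’ is also a clade of T|X’. The helper names are hypothetical.

```python
def clades(tree):
    """Set of clades of a rooted tree given as nested tuples (assumed encoding)."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found

def restrict(tree, keep):
    """Prune leaves outside keep and suppress nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def displays(big, small):
    """True if rooted tree big displays rooted tree small."""
    small_clades = clades(small)
    labels = max(small_clades, key=len)      # leaf set X' of the small tree
    restricted = restrict(big, labels)
    # big|X' must refine small: every clade of small survives in big|X'.
    return restricted is not None and small_clades <= clades(restricted)

T = ((("a", "b"), "c"), ("d", "e"))
print(displays(T, (("a", "b"), "d")))   # prints True
print(displays(T, (("a", "d"), "b")))   # prints False
print(displays(T, ("a", "b", "d")))     # prints True: an unresolved tree is refined
```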

The BUILD algorithm
Polynomial-time algorithm due to Aho et al. (1981). Takes a set of rooted input trees and either outputs a supertree that displays all of the input trees or returns a fail message.

BUILD algorithm
A recursive algorithm: at each step it constructs a graph associated with the triples displayed by the input trees. Depending on whether this associated graph is connected or disconnected, the algorithm either terminates or subdivides the problem. What is this associated graph?

The associated graph
The nodes of the graph are the complete label set, i.e. all the labels that appear in any of the input trees.
Put an edge between two nodes a and b if there is at least one input tree that displays the rooted triple ((a,b),c) for some c.
If this graph is connected (and has more than one node), stop and report a fail message.
Otherwise call the algorithm again once for each connected component, restricting the input to the labels in that component.
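The recursion can be sketched as follows. This is a minimal illustration, not the algorithm as published: it assumes the input trees have already been broken down into rooted triples `(a, b, c)` meaning ((a,b),c), represents the output as nested tuples, and returns `None` as the fail message. Restricting the input to a component is done by keeping only the triples whose three labels all lie in it.

```python
def build(labels, triples):
    """BUILD sketch: supertree displaying all triples, or None on failure."""
    labels = set(labels)
    if len(labels) == 1:
        return next(iter(labels))
    # Restrict the input: keep triples whose labels are all still present.
    active = [(a, b, c) for a, b, c in triples if {a, b, c} <= labels]
    # Associated graph: an edge a-b for every triple ((a,b),c).
    adjacency = {x: set() for x in labels}
    for a, b, _ in active:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Connected components by depth-first search.
    components, unseen = [], set(labels)
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            for neighbour in adjacency[stack.pop()]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        unseen -= component
        components.append(component)
    if len(components) == 1:
        return None                      # connected graph: fail
    subtrees = []
    for component in sorted(components, key=sorted):
        subtree = build(component, active)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)

print(build("abcd", [("a", "b", "c"), ("c", "d", "a")]))
# prints (('a', 'b'), ('c', 'd'))
print(build("abc", [("a", "b", "c"), ("a", "c", "b")]))
# prints None
```

The second call fails because ((a,b),c) and ((a,c),b) cannot both be displayed by one tree, which makes the associated graph connected.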

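BUILD Example (from Semple and Steel)
[Figure: three rooted input trees with leaf orders a b c e, c b e d, and a b f d, and the associated graph on a, b, c, e, f, d.]
The connected components of the associated graph are {a,b,c,f} and {d,e}.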

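BUILD example continued
Subproblem 1: Restrict input to {a,b,c,f}
[Figure: the restricted input trees with leaf orders a b c, c b, and a b f, and the associated graph on a, b, c, f.]
The connected components are {a,b}, {c}, and {f}.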

BUILD example continued
Subproblems 2 and 3, on {d,e} and {a,b}, are trivial, so the final tree is (((a,b),c,f),(d,e)).
[Figure: the final tree with leaf order a b c f d e.]

What if the trees don’t agree?
If the input trees are not compatible BUILD will return a fail message. It is also of interest to have methods that will return some output even if the input trees cannot all be displayed by a single supertree. Matrix representation with parsimony (MRP) is one such method…

Matrix Representation with Parsimony (MRP)
Supertree method invented independently by Baum and by Ragan (1992).
- Recode the input trees as a character matrix where each edge in each input tree defines a character; taxa absent from that tree are scored as missing (‘?’).
- Do a parsimony analysis of the resulting character matrix.
- Take the strict consensus of the most parsimonious trees.
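The recoding step can be sketched as below; the parsimony search and consensus steps are not shown. Several choices here are assumptions made for illustration: input trees are rooted nested tuples, one character is created per internal clade (scored 1 inside the clade and 0 elsewhere in that tree), and pendant edges are skipped because their characters are parsimony-uninformative.

```python
def internal_clades(tree):
    """Leaf set of a rooted tree and its proper internal clades, in traversal order."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.append(below)
        return below

    taxa = visit(tree)
    return taxa, [c for c in found if c != taxa]

def mrp_matrix(trees):
    """Character matrix: one column per internal clade of each input tree."""
    coded = [internal_clades(t) for t in trees]
    all_taxa = sorted(frozenset().union(*(taxa for taxa, _ in coded)))
    rows = {taxon: [] for taxon in all_taxa}
    for taxa, clade_list in coded:
        for clade in clade_list:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon].append("?")      # taxon absent from this tree
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return {taxon: "".join(states) for taxon, states in rows.items()}

matrix = mrp_matrix([(("a", "b"), "c"), (("b", "c"), "d")])
for taxon, states in matrix.items():
    print(taxon, states)
# prints:
# a 1?
# b 11
# c 01
# d ?0
```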

MRP example
[Figure: three input trees with leaf orders a b c d e f, a b d c e g, and e d g h, and the resulting character matrix with rows a to h; taxa absent from an input tree are scored ‘?’.]

MRP example
The parsimony analysis finds 10 most parsimonious trees.
[Figure: the three input trees and the strict consensus of the 10 most parsimonious trees, on taxa a to h.]