Effects of Rooting on Phylogenic Algorithms Margareta Ackerman Joint work with David Loker and Dan Brown.



Hierarchical Clustering & Phylogeny

Phylogeny Meets Hierarchical Clustering
Phylogeny is an application of hierarchical clustering, so the two are closely related. Unfortunately, there is a disconnect between these fields.

Bridging the Gap
A step towards bridging the gap: we bring techniques from cluster analysis to the study of phylogenetic algorithms. We apply a recent framework for clustering algorithm selection to phylogeny [(Ackerman, Ben-David, and Loker, ‘10), (Ackerman, Ben-David, and Loker, ‘10), (Ackerman & Ben-David, IJCAI ‘11), (Zadeh and Ben-David, ‘09)].

Selecting Phylogenetic Algorithms
Given the same input, different phylogenetic algorithms can produce radically different results. How should a user decide which algorithm to use?

Framework for Selecting Phylogenetic Algorithms
This framework lets a user apply prior knowledge to select an algorithm: identify properties that distinguish the input-output behaviour of different clustering paradigms. The properties should be:
1) Intuitive and “user-friendly”
2) Useful for distinguishing clustering algorithms

Outline
– Rooting Phylogenetic Trees
– Formal Framework
– Properties of Hierarchical Algorithms
– Analysis of Linkage-Based Algorithms
– Analysis of Neighbor Joining
– Conclusions and Future Direction

How to Root Phylogenetic Trees?
A common solution: introduce distant taxa (or elements), called an outgroup, and root the tree where the distant taxa connect with the ingroup.

When Rooting Changes the Ingroup
The addition of an outgroup can CHANGE the topology of the ingroup. [Figure: the ingroup tree after adding outgroup E]

This Happens in Practice!
Empirical studies demonstrate that, with some algorithms, ingroup topology can be disrupted when an outgroup is added [(Holland et al., ‘03), (Shavit et al., ‘07), (Lin et al., ‘02), (Slack et al., ‘03)]. We perform a theoretical analysis of this phenomenon, proving that some algorithms are immune to this problem, while others are highly volatile.

Previous Work
Independently of our work, it was shown that when using BME (balanced minimum evolution), the ingroup topology can change arbitrarily when an outlier is added (Cueto and Matsen, 2010).

Our Contributions
– Linkage-based algorithms (including UPGMA) do not change the ingroup topology when the outgroup is sufficiently far away.
– Using Neighbor Joining, ingroup topology is affected by outgroups even if the outgroup is arbitrarily far away.

Formal Setup
C_i is a cluster in a dendrogram D if there exists a node in D such that C_i is the set of its leaf descendants.

Formal Setup
C = {C_1, …, C_k} is a clustering in a dendrogram D if
– C_i is a cluster in D for all 1 ≤ i ≤ k, and
– the clusters are disjoint.

Formal Setup
A hierarchical clustering algorithm A maps
Input: a data set X with a distance function d, denoted (X,d)
to
Output: a dendrogram of X.
The distance between Y ⊆ X and Z ⊆ X is the length of the minimum edge between them:
d(Y,Z) = min { d(y,z) : y ∈ Y, z ∈ Z }
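The set-to-set distance above can be sketched in a few lines of Python. This is only an illustration: the function name `set_distance` and the example data (points on a line, with absolute difference as the distance) are assumptions, not part of the talk.

```python
from itertools import product

def set_distance(Y, Z, d):
    """Minimum pairwise distance d(y, z) over y in Y and z in Z."""
    return min(d(y, z) for y, z in product(Y, Z))

# Hypothetical example: points on a line, d = absolute difference.
d = lambda a, b: abs(a - b)
print(set_distance({0, 1, 2}, {10, 12}, d))  # 8 (attained by the pair 2 and 10)
```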

Unaffected by an Outgroup
Given a data set (X ∪ O, d) and an algorithm A, X is unaffected by O if A(X, d) is a sub-dendrogram of A(X ∪ O, d). Otherwise, X is affected by O.
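A minimal sketch of the "unaffected" test, under simplifying assumptions that are not from the talk: a dendrogram is represented only by its topology, as nested Python tuples with hashable leaves, and the helper names `clusters` and `is_unaffected` are hypothetical. Under that representation, A(X,d) is a sub-dendrogram of A(X ∪ O,d) when X itself is a cluster of the larger tree and the clusters below it match those of the smaller tree.

```python
def clusters(tree):
    """Return the leaf set of every node in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, tuple):
            s = frozenset().union(*(leaves(child) for child in node))
        else:
            s = frozenset([node])
        found.add(s)
        return s

    leaves(tree)
    return found

def is_unaffected(tree_x, tree_xo):
    """True if tree_x appears unchanged as a subtree of tree_xo (topology only)."""
    cx, cxo = clusters(tree_x), clusters(tree_xo)
    x = max(cx, key=len)  # the root cluster of tree_x, i.e. the ingroup X
    return x in cxo and cx == {c for c in cxo if c <= x}

ingroup = (("a", "b"), "c")
print(is_unaffected(ingroup, ((("a", "b"), "c"), "o")))  # True
print(is_unaffected(ingroup, (("a", ("b", "c")), "o")))  # False: ingroup topology changed
```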

Outgroup Independence
An algorithm A is outgroup-independent if, for any data sets (X, d) (the ingroup) and (O, d’) (the outgroup), X is unaffected by O whenever (X,d) and (O,d’) are sufficiently far apart. That is, A(X,d) is a sub-dendrogram of A(X ∪ O, d*), where d* is a distance function over X ∪ O that extends d and d’ and puts (X,d) and (O,d’) sufficiently far apart.

Outgroup Volatility
An algorithm A is outgroup volatile if, for any data set (X,d) and any constant c, there exists (O,d’) with distance between X and O at least c such that X is affected by O. If this holds with O restricted to a singleton, then A is outlier volatile.

We use the following general result to show that linkage-based algorithms are outgroup-independent.
Theorem: Any hierarchical algorithm A that is 2-rich, outer-consistent, and local is outgroup independent.

Locality
If we select a cluster from the dendrogram and run the algorithm on the data underlying this cluster, we obtain a result that is consistent with the original dendrogram. [Figure: D = A(X,d) and D’ = A(X’,d), with X’ = {x_1, …, x_4}]

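Outer Consistency
Let C be a clustering in A(X,d), and let d’ be an outer-consistent change of d with respect to C; that is, d’ keeps within-cluster distances the same and may only increase between-cluster distances. [Figure: C on dataset (X,d) and C on dataset (X,d’)] If A is outer-consistent, then A(X,d’) will also include the clustering C.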

2-Richness
Given any pair of data sets (X, d) and (X’, d’), there exists a distance function d* over X ∪ X’ such that X and X’ are the children of the root in A(X ∪ X’, d*).

Theorem: Any hierarchical algorithm A that is 2-rich, outer-consistent, and local is outgroup independent.
Proof: We want to show that, given any data sets (X,d) and (O,d’), if they are placed sufficiently far apart, then A(X,d) is a sub-dendrogram of A(X ∪ O, d*).
– First, apply 2-richness: given (X,d) and (O,d’), there exists d’’ over X ∪ O such that X and O are the children of the root of A(X ∪ O, d’’).
– Let d* be any distance function extending d and d’ in which the minimum distance between X and O is at least c, for a threshold c determined by d’’. Then, by outer-consistency, X and O are the children of the root of A(X ∪ O, d*).
– Finally, by locality, A(X,d) is a sub-dendrogram of A(X ∪ O, d*).
Therefore, whenever (X,d) and (O,d’) are sufficiently far apart, X is unaffected by O.

Linkage Based Algorithm
– Create a leaf node for every element of X.
– Repeat the following until a single tree remains:
  – Consider the clusters represented by the remaining root nodes.
  – Merge the closest pair of clusters by assigning them a common parent node.

Examples of Linkage Based Algorithms
The choice of linkage function distinguishes between different linkage-based algorithms. Examples of common linkage functions:
– UPGMA: average between-cluster distance
– Single-linkage: shortest between-cluster distance
– Complete-linkage: maximum between-cluster distance
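The generic procedure and the three linkage functions listed above can be sketched as follows. This is a simple illustration, not the authors' code: the names `LINKAGES` and `linkage_clustering`, the nested-tuple output format, and the example points are all assumptions, and the quadratic pair search is chosen for readability rather than speed.

```python
from itertools import combinations

# Each linkage function reduces the list of between-cluster distances to one number.
LINKAGES = {
    "single": min,                            # shortest between-cluster distance
    "complete": max,                          # maximum between-cluster distance
    "average": lambda ds: sum(ds) / len(ds),  # UPGMA: average between-cluster distance
}

def linkage_clustering(points, d, linkage="average"):
    """Build a dendrogram (nested tuples) by repeatedly merging the closest roots."""
    link = LINKAGES[linkage]
    # One root per element: (leaf set, subtree).
    roots = [(frozenset([p]), p) for p in points]

    def cluster_distance(pair):
        (a, _), (b, _) = pair
        return link([d(x, y) for x in a for y in b])

    while len(roots) > 1:
        # Pick the closest pair of remaining root clusters.
        (a, tree_a), (b, tree_b) = min(combinations(roots, 2), key=cluster_distance)
        roots = [r for r in roots if r[0] not in (a, b)]
        # Give the pair a common parent node.
        roots.append((a | b, (tree_a, tree_b)))
    return roots[0][1]

d = lambda a, b: abs(a - b)  # hypothetical data: points on a line
print(linkage_clustering([0, 1, 5, 7, 20], d, "single"))
# (20, ((0, 1), (5, 7)))
```

On this small example all three linkage functions happen to produce the same tree; they can differ on other inputs.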

Theorem: All linkage-based algorithms are outgroup independent.
Proof: We can show that all linkage-based algorithms are 2-rich, outer-consistent, and local. The result follows by the previous theorem.
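The theorem can be checked on a single numerical instance (which illustrates it but proves nothing). The sketch below assumes average linkage (UPGMA), hypothetical points on a line, and an outgroup placed far from the ingroup; `upgma_clusters` is an illustrative helper that returns the set of clusters in the resulting dendrogram.

```python
from itertools import combinations

def upgma_clusters(points, d):
    """Clusters (as frozensets) of the average-linkage dendrogram of the points."""
    roots = [frozenset([p]) for p in points]
    found = set(roots)
    while len(roots) > 1:
        a, b = min(
            combinations(roots, 2),
            key=lambda ab: sum(d(x, y) for x in ab[0] for y in ab[1])
            / (len(ab[0]) * len(ab[1])),
        )
        roots = [r for r in roots if r not in (a, b)] + [a | b]
        found.add(a | b)
    return found

d = lambda a, b: abs(a - b)
ingroup = [0, 1, 5, 7]
outgroup = [1000, 1003]  # assumed to be "sufficiently far" for this example

alone = upgma_clusters(ingroup, d)
together = upgma_clusters(ingroup + outgroup, d)

# X is unaffected if X is a cluster of the joint tree and the clusters
# inside X are exactly those found when clustering X alone.
x = frozenset(ingroup)
unaffected = x in together and alone == {c for c in together if c <= x}
print(unaffected)  # True
```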

Neighbor Joining
– The most widely used distance-based method for phylogenetic reconstruction
– Works well in practice
– If there is a tree that fits the distance matrix (that is, the distances are additive), it will find it

Theorem: Neighbor joining is outlier volatile.
This remains the case when the distances of the ingroup are additive.

Outgroups can lead to arbitrary dendrograms
Theorem: Given any data set (X,d), there exists a set of outliers O and a distance function d* over X ∪ O extending d, where d*(X,O) can be arbitrarily large, such that NJ(X ∪ O, d*)|X is an arbitrary dendrogram.
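As intuition for why distant taxa can matter to NJ (this is not the argument behind the theorem), recall the standard NJ selection rule: join the pair minimizing Q(i,j) = (n − 2)·d(i,j) − Σ_k d(i,k) − Σ_k d(j,k). The sums run over every taxon, so distances to an outlier enter the choice of which pair to join. The sketch below computes Q for a small distance matrix; the function name `nj_q_matrix` and the distance values are illustrative assumptions.

```python
from itertools import combinations

def nj_q_matrix(taxa, dist):
    """Standard NJ criterion: Q(i,j) = (n-2)*d(i,j) - sum_k d(i,k) - sum_k d(j,k)."""
    n = len(taxa)
    row_sum = {
        i: sum(dist[frozenset((i, k))] for k in taxa if k != i) for i in taxa
    }
    return {
        (i, j): (n - 2) * dist[frozenset((i, j))] - row_sum[i] - row_sum[j]
        for i, j in combinations(taxa, 2)
    }

taxa = ["a", "b", "c", "d", "e"]
pairs = {  # illustrative pairwise distances
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(p): v for p, v in pairs.items()}

q = nj_q_matrix(taxa, dist)
first_join = min(q, key=q.get)       # pair NJ joins first
closest = min(pairs, key=pairs.get)  # pair with the smallest raw distance
print(first_join, q[first_join])     # ('a', 'b') -50
print(closest, q[closest])           # ('d', 'e') -48
```

Here the pair with the smallest raw distance (d, e) is not the pair NJ joins first, because the row sums, which depend on all taxa, shift the criterion.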

Conclusions
– Present a formal framework for analysing the effects of outgroups on the ingroup topology for computationally efficient hierarchical algorithms
– Prove that all linkage-based algorithms, which include UPGMA, are outgroup independent
– Prove that NJ is outgroup volatile
This only addresses rooting; we do not claim that UPGMA is in general better than NJ.

Future Work
– How to choose outgroups for rooting NJ?
– Perform a similar analysis of likelihood methods