Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University.

Slides:



Advertisements
Similar presentations
Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas.
Advertisements

Master Thesis Algorithms. Algorithms – Who? Faculty Lars Arge Gerth Stølting Brodal Gudmund Skovbjerg Frandsen Kristoffer Arnsfelt Hansen Peter Bro Miltersen.
Master Thesis Preparation Algorithms Gerth Stølting Brodal.
Gerth Stølting Brodal University of Aarhus Monday June 9, 2008, IT University of Copenhagen, Denmark International PhD School in Algorithms for Advanced.
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
I/O-Algorithms Lars Arge Fall 2014 September 25, 2014.
. Class 9: Phylogenetic Trees. The Tree of Life Evolution u Many theories of evolution u Basic idea: l speciation events lead to creation of different.
. Intro to Phylogenetic Trees Lecture 5 Sections 7.1, 7.2, in Durbin et al. Chapter 17 in Gusfield Slides by Shlomo Moran. Slight modifications by Benny.
Evolving Graphs & Dynamic Networks : Old questions  New insights Afonso Ferreira S. Bhadra CNRS I3S & INRIA Sophia Antipolis
Summer Bioinformatics Workshop 2008 Comparative Genomics and Phylogenetics Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State.
PLGW01 - September Inferring Phylogenies from LCA distances (back to the basics of distance-based phylogenetic reconstruction) Ilan Gronau Shlomo.
The Saitou&Nei Neighbor Joining Algorithm ©Shlomo Moran & Ilan Gronau.
Lecture 7 – Algorithmic Approaches Justification: Any estimate of a phylogenetic tree has a large variance. Therefore, any tree that we can demonstrate.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Phylogenetic Reconstruction: Distance Matrix Methods Anders Gorm Pedersen Molecular Evolution Group Center for.
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
Phylogeny Tree Reconstruction
. Phylogenetic Trees Lecture 11 Sections 7.1, 7.2, in Durbin et al.
The Evolution Trees From: Computational Biology by R. C. T. Lee S. J. Shyu Department of Computer Science Ming Chuan University.
Bioinformatics Algorithms and Data Structures
The (Supertree) of Life: Procedures, Problems, and Prospects Presented by Usman Roshan.
Course Review COMP171 Spring Hashing / Slide 2 Elementary Data Structures * Linked lists n Types: singular, doubly, circular n Operations: insert,
CISC667, F05, Lec15, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (II) Distance-based methods.
CSE 421 Algorithms Richard Anderson Lecture 4. What does it mean for an algorithm to be efficient?
. Class 9: Phylogenetic Trees. The Tree of Life D’après Ernst Haeckel, 1891.
Phylogeny Tree Reconstruction
. Phylogenetic Trees (2) Lecture 12 Based on: Durbin et al Section 7.3, 7.8, Gusfield: Algorithms on Strings, Trees, and Sequences Section 17.
. Phylogenetic Trees Lecture 11 Sections 7.1, 7.2, in Durbin et al.
Phylogenetic trees Sushmita Roy BMI/CS 576
Algorithm Engineering, September 2013Data Structures, February-March 2010Data Structures, February-March 2006 Cache-Oblivious and Cache-Aware Algorithms,
. Phylogenetic Trees Lecture 11 Sections 7.1, 7.2, in Durbin et al.
九大数理集中講義 Comparison, Analysis, and Control of Biological Networks (7) Partial k-Trees, Color Coding, and Comparison of Graphs Tatsuya Akutsu Bioinformatics.
Time O(log d) Exponential-search(13) Finger Search Searching in a sorted array time O(log n) Binary-search(13)
Descendent Subtrees Comparison of Phylogenetic Trees with Applications to Co-evolutionary Classifications in Bacterial Genome Yaw-Ling Lin 1 Tsan-Sheng.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
OUTLINE Phylogeny UPGMA Neighbor Joining Method Phylogeny Understanding life through time, over long periods of past time, the connections between all.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
What is of interest to calculate ? for open problems Semple and Steel.
Path Minima on Dynamic Weighted Trees Pooya Davoodi Aarhus University Aarhus University, November 17, 2010 Joint work with Gerth Stølting Brodal and S.
Evolutionary tree reconstruction (Chapter 10). Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
Evolutionary tree reconstruction
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Building Phylogenetic Trees.
On the Scalability of Computing Triplet and Quartet Distances Morten Kragelund Holt Jens Johansen Gerth Stølting Brodal 1 Aarhus University.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
1 Alignment Matrix vs. Distance Matrix Sequence a gene of length m nucleotides in n species to generate an… n x m alignment matrix n x n distance matrix.
Graphs A graphs is an abstract representation of a set of objects, called vertices or nodes, where some pairs of the objects are connected by links, called.
Lecture 9COMPSCI.220.FS.T Lower Bound for Sorting Complexity Each algorithm that sorts by comparing only pairs of elements must use at least 
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
SupreFine, a new supertree method Shel Swenson September 17th 2009.
MotivationLocating the k largest subsequences: Main ideasResults Problem definitions Problem instance ( k=5 ) Bibliography
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
Distance-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS 576 Fall 2015.
Trees CSIT 402 Data Structures II 1. 2 Why Do We Need Trees? Lists, Stacks, and Queues are linear relationships Information often contains hierarchical.
CSCE555 Bioinformatics Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Memory Hierarchies [FLPR12] Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. Cache- Oblivious Algorithms. ACM Transactions on Algorithms,
Computing the all-pairs quartet distance on a set of evolutionary trees Martin Stissing 1 Thomas Mailund 1,2 Christian Nørgaard Storm Pedersen 1 Gerth.
Quartet distance between general trees Chris Christiansen Thomas Mailund Christian N.S. Pedersen Martin Randers.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Consistent and Efficient Reconstruction of Latent Tree Models
dij(T) - the length of a path between leaves i and j
Technion – Israel Institute of Technology
Thesis Preparation Algorithms Gerth Stølting Brodal.
Multiple Sequence Alignment (I)
Gene Tree Estimation Through Affinity Propagation
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Presentation transcript:

Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University of Southern Denmark Thomas Mailund, Christian N. S. Pedersen, Andreas Sand Aarhus University, Bioinformatics Research Center Computer Science Institute, Charles University, Prague, Czech Republic, 18 September 2014 Work presented at SODA 2013 and ALENEX 2014

Outline  Evolutionary trees – rooted vs. unrooted, binary vs. arbitrary degree  Tree distances – Robinson-Foulds, triplet, quartet  Results and previous work – triplet, quartet distances  Algorithms – triplet (quartet)  Experimental results (ALENEX 2014)

Evolutionary Tree BonoboChimpanzeeHumanNeanderthalGorillaOrangutan Time Rooted

Unrooted Evolutionary Tree Dominant modern approach to study evolution is from DNA analysis

Constructing Evolutionary Trees – Binary or Arbitrary Degrees ? Sequence data Distance matrix 123···n n Neighbor Joining Saitou, Nei 1987 [ O(n 3 ) Saitou, Nei 1987 ] Refined Buneman Trees Moulton, Steel 1999 [ O(n 3 ) Brodal et al ] Buneman Trees Buneman 1971 [ O(n 3 ) Berry, Bryan 1999 ] ··· n.... Binary trees (despite no evidence in distance data) Arbitrary degrees (strong support for all edges ; few branches) Arbitrary degree (compromise ; good support for all edges)

Data Analysis vs Expert Trees – Binary vs Arbitrary Degrees ? Linguistic expert classification (Aryon Rodrigues) Neighbor Joining on linguistic data Cultural Phylogenetics of the Tupi Language Family in Lowland South America. R. S. Walker, S. Wichmann, T. Mailund, C. J. Atkisson. PLoS One. 7(4), 2012.

Evolutionary Tree Comparison ?? split 1357| T2T T1T1 8 CommonOnly T 1 Only T |246835| | |24848| Robinson-Foulds distance = # non-common splits = = 3 [Day 1985] O(n) time algorithm using 2 x DFS + radix sort D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial mathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.

8 8 Robinson-Foulds Distance (unrooted trees) ?? T1T1 CommonOnly T 1 Only T 2 (none)12567| | | | | | | | | T2T RF-dist(T 1, T 2 ) = = 9 RF-dist(T 1 \{8}, T 2 \{8}) = 0 Robinson-Foulds very sensitive to outliers D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorial mathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.

resolved : ij|kl Quartet Distance (unrooted trees) QuartetT1T1 T2T2 {1,2,3,4}14|23 {1,2,3,5}13|2515|23 {1,2,4,5}14| {1,3,4,5}14| {2,3,4,5}25|3423|45 i j k l i j k l unresolved : ijkl (only non-binary trees) T1T1 T2T2 4 G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34: , 1985.

Triplet Distance (rooted trees) TripletT1T1 T2T2 {1,2,3}2|13 {1,2,4}1|244|12 {1,2,5}1|255|12 {1,3,4}4|13 {1,3,5}5|13 {1,4,5}1|45 {2,3,4}3|244|23 {2,3,5}3|255|23 {2,4,5}5|242|45 {3,4,5}3|45 resolved : k|ij ij k i j k unresolved : ijk (only non-binary trees) T1T T2T2 3 D. E. Critchlow, D. K. Pearl, C. L. Qian: The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3): , 1996.

Rooted Triplet distance Unrooted Quartet distance BinaryO(n 2 ) O(n  log 2 n) O(n  log n) CPQ 1996 SBFPM 2013 [SODA 2013] O(n 3 ) O(n 2 ) O(n  log 2 n) O(n  log n) D 1985 BTKL 2000 BFP 2001 BFP 2003 Arbitrary degrees O(n 2 ) O(n  log n) BDF 2011 [SODA 2013] O(d 9  n  log n) O(n ) O(d  n  log n) SPMBF 2007 NKMP 2011 [SODA 2013] [ALENEX 2014] Computational Results

Distance Computation T 2 ResolvedUnresolved T1T1 Resolved A : Agree C B : Disagree UnresolvedDE ij k i j ki j ki j k ij k jk i ik j ik j Sufficient to compute A and E D + E and C + E unresolved in one tree (For binary trees C, D and E are all zero) ij k jk i O( n ) triplets & quartets O( n ·log n ) triplets & quartets O( n ·log n ) triplets O( d · n ·log n ) quartets

Parameterized Triplet & Quartet Distances B + α·(C + D), 0  α  1 BDF 2011 O(n 2 ) for triplet, NKMP 2011 O(n ) for quartet [SODA 2013/ALENEX 2014] O(n·log n) and O(d·n·log n), respectively T 2 ResolvedUnresolved T1T1 Resolved A : Agree C B : Disagree UnresolvedDE ij k i j ki j ki j k ij k jk i ik j ik j ij k jk i

Counting Unresolved Triplets in One Tree n 1 n 2 n 3 ··· n d v Computable in O(n) time using DFS + dynamic programming n 1 n 2 n 3 ··· n d v Triplet anchored at v Quartet anchored at v Quartets (root tree arbitrary)

Counting Agreeing Triplets (Basic Idea) v 1i j d T1T1 j T2T2 c i w 0

Efficient Computation Limit recolorings in T 1 (and T 2 ) to O(n·log n) v v 1 2 d 0 v 1 0 v 0 v 1 0 v Count T 2 contribution (precondition) Recolor Recurse Recolor & recurse T1T1 Reduce recoloring cost in T 2 from O(n 2 ) to O(n·log 2 n) Reduce recoloring cost in T 2 from O(n·log 2 n) to O(n·log n) T 2 arbitrary height degree H(T 2 ) binary height O(log n)  Contract T 2 and reconstruct H(T 2 ) during recursion

Counting Agreeing Triplets (II) v 1i j d T1T1 0 node in H(T 2 ) = component composition in T 2 + Contribution to agreeing triplets at node in H(T 2 ) i ij i i j j i i C2C2 C1C1

From O(n·log 2 n) to O(n·log n) Update O(1) counters for all colors through node Colored path lengths v 1i j d T1T1 0 nvnv nini w Compressed version of T 2 of size O(n v ) Total cost for updating counters a (1) a (2) l=a (0) a (3) a (4) a (5) T1T1

Counting Quartets... Bottleneck in computing disagreeing resolved-resolved quartets v 1i j d T1T1 0 i j i j G1G1 G2G2 T2T2 double-sum  factor d time  Root T 1 and T 2 arbitrary  Keep up to 7d d + 29 different counters per node in H(T 2 )...

Distance Computation T 2 ResolvedUnresolved T1T1 Resolved A : Agree C B : Disagree UnresolvedDE ij k i j ki j ki j k ij k jk i ik j ik j Sufficient to compute A and E ij k jk i O( n ·log n ) triplets & quartets O( n ·log n ) triplets O( d · n ·log n ) quartets

ALENEX 2014: Implementation (M.Sc. thesis Morten Kragelund Holt and Jens Johansen) Worst-case #counters per node in HDT(T 2 )  First implementation for triplets for arbitrary degree  Space usage  10 KB per node for quartet (binary trees) can handle  1,000,000 leaves  64 bit integers, except 128 bit integers for values > n 3 quartet distance of up to  2,000,000 leaves BinaryArbitrary degree timecounterstimecounters TripletO(n log n)6 4d+2 QuartetO(n log n)40 O(max(d 1, d 2 ) n log n) 2d d + 22 (B, with T 1  T 2 ) O(min(d 1, d 2 ) n log n) 7d d + 29 (B, no swap) d d + 12 (E, no swap)

Experimental Results Quartet Distance – Binary Trees [SODA 2013] MP 2004 NKMP 2011  [ALENEX 2014] are the first O(n  log n) implementations  MP 2004 overhead from working with polynomials

Experimental Results Quartet Distance – High Degree Trees d = 256 NKMP 2011 [SODA 2013] max d = 1024  [ALENEX 2014] are the first n  poly(log n,d) implementation

Experimental Results Triplet Distance – Binary Trees  [ALENEX 2014] are the first O(n  log n) implementation  SBFPM 2013 only binary trees, no contractions [SODA 2013] SBFPM 2013

Experimental Results Triplet Distance – High Degree Trees  [ALENEX 2014] first implementation  Triplet distance appears hardest for binary trees SODA 2013 [SBFPM 2013] [SODA 2013], d = 2 [SODA 2013], d = 256 [SODA 2013], d = 1024

Rooted Triplet distance Unrooted Quartet distance BinaryO(n 2 ) O(n  log 2 n) O(n  log n) CPQ 1996 SBFPM 2013 [SODA 2013] O(n 3 ) O(n 2 ) O(n  log 2 n) O(n  log n) D 1985 BTKL 2000 BFP 2001 BFP 2003 [SODA 2013] Arbitrary degrees O(n 2 ) O(n  log n) BDF 2011 [SODA 2013] O(d 9  n  log n) O(n ) O(d  n  log n) SPMBF 2007 NKMP 2011 [SODA 2013] [ALENEX 2014] Summary d = minimal degree of any node in T 1 and T O(n·log n) ? o(n·log n) ? = fastest implementation for large n

References  On the Scalability of Computing Triplet and Quartet Distances. M.K. Holt, J. Johansen, G.S. Brodal. ALENEX  Algorithms for Computing the Triplet and Quartet Distances for Binary and General Trees. A. Sand, M.K. Holt, J. Johansen, R. Fagerberg, G.S. Brodal, C.N.S. Pedersen, T. Mailund. Biology - Special Issue on Developments in Bioinformatic Algorithms,  A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees. A. Sand, G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, T. Mailund. BMC Bioinformatics  Efficient Algorithms for Computing the Triplet and Quartet Distance Between Trees of Arbitrary Degree. G.S. Brodal, R. Fagerberg, C.N.S. Pedersen, T. Mailund, A. Sand. SODA  A sub-cubic time algorithm for computing the quartet distance between two general trees. J. Nielsen, A. K. Kristensen, T. Mailund, C.N.S. Pedersen. Algorithms in Molecular Biology  Computing the Quartet Distance Between Evolutionary Trees of Bounded Degree. M. Stissing, C.N.S. Pedersen, T. Mailund, G.S. Brodal, R. Fagerberg. APBC  QDist - Quartet Distance between Evolutionary Trees. T. Mailund and C.N. S. Pedersen. Bioinformatics  Computing the Quartet Distance Between Evolutionary Trees in Time O(n log n). G.S. Brodal, R. Fagerberg, C.N.S. Pedersen. Algorithmica  Computing the Quartet Distance Between Evolutionary Trees in Time O(n log 2 n). G.S. Brodal, R. Fagerberg, C.N.S. Pedersen. ISAAC birc.au.dk/software