On the Scalability of Computing Triplet and Quartet Distances Morten Kragelund Holt Jens Johansen Gerth Stølting Brodal 1 Aarhus University.

Slides:



Advertisements
Similar presentations
A General Algorithm for Subtree Similarity-Search The Hebrew University of Jerusalem ICDE 2014, Chicago, USA Sara Cohen, Nerya Or 1.
Advertisements

Gerth Stølting Brodal University of Aarhus Monday June 9, 2008, IT University of Copenhagen, Denmark International PhD School in Algorithms for Advanced.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Comp 122, Spring 2004 Binary Search Trees. btrees - 2 Comp 122, Spring 2004 Binary Trees  Recursive definition 1.An empty tree is a binary tree 2.A node.
AVL Trees1 Part-F2 AVL Trees v z. AVL Trees2 AVL Tree Definition (§ 9.2) AVL trees are balanced. An AVL Tree is a binary search tree such that.
Chapter 4: Trees Part II - AVL Tree
Sorting Comparison-based algorithm review –You should know most of the algorithms –We will concentrate on their analyses –Special emphasis: Heapsort Lower.
One more definition: A binary tree, T, is balanced if T is empty, or if abs ( height (leftsubtree of T) - height ( right subtree of T) )
CSCE 3110 Data Structures & Algorithm Analysis
Computing Triplet and Quartet Distances Between Trees Gerth Stølting Brodal, Morten Kragelund Holt, Jens Johansen Aarhus University Rolf Fagerberg University.
A new crossover technique in Genetic Programming Janet Clegg Intelligent Systems Group Electronics Department.
The (Supertree) of Life: Procedures, Problems, and Prospects Presented by Usman Roshan.
CS 206 Introduction to Computer Science II 10 / 14 / 2009 Instructor: Michael Eckmann.
Rank-Pairing Heaps Bernhard Haeupler, Siddhartha Sen, and Robert Tarjan, ESA
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
CSE 326: Data Structures B-Trees Ben Lerner Summer 2007.
Univ. of TehranAdv. topics in Computer Network1 Advanced topics in Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.
Tirgul 6 B-Trees – Another kind of balanced trees.
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 20: Binary Trees.
CSE 373 Data Structures Lecture 15
David Luebke 1 10/13/2015 CS 332: Algorithms Linear-Time Sorting Algorithms.
CSC 41/513: Intro to Algorithms Linear-Time Sorting Algorithms.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
Copyright Curt Hill Balance in Binary Trees Impact on Performance.
A Study of Balanced Search Trees: Brainstorming a New Balanced Search Tree Anthony Kim, 2005 Computer Systems Research.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Trees CS 105. L9: Trees Slide 2 Definition The Tree Data Structure stores objects (nodes) hierarchically nodes have parent-child relationships operations.
Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood Alexandros Stamatakis LRR TU München Contact:
Data Structure II So Pak Yeung Outline Review  Array  Sorted Array  Linked List Binary Search Tree Heap Hash Table.
M180: Data Structures & Algorithms in Java Trees & Binary Trees Arab Open University 1.
Bahareh Sarrafzadeh 6111 Fall 2009
Week 10 - Friday.  What did we talk about last time?  Graph representations  Adjacency matrix  Adjacency lists  Depth first search.
HeapSort 25 March HeapSort Heaps or priority queues provide a means of sorting: 1.Construct a heap, 2.Add each item to it (maintaining the heap.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
1 An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
Trees CSIT 402 Data Structures II 1. 2 Why Do We Need Trees? Lists, Stacks, and Queues are linear relationships Information often contains hierarchical.
CS6045: Advanced Algorithms Sorting Algorithms. Sorting So Far Insertion sort: –Easy to code –Fast on small inputs (less than ~50 elements) –Fast on nearly-sorted.
Computing the all-pairs quartet distance on a set of evolutionary trees Martin Stissing 1 Thomas Mailund 1,2 Christian Nørgaard Storm Pedersen 1 Gerth.
Quartet distance between general trees Chris Christiansen Thomas Mailund Christian N.S. Pedersen Martin Randers.
CSE 373 Data Structures Lecture 7
Analysis of Algorithms
Binary Search Trees What is a binary search tree?
COMP261 Lecture 23 B Trees.
Chapter 10 Trees © 2006 Pearson Education Inc., Upper Saddle River, NJ. All rights reserved.
CC 215 Data Structures Trees
Lecture 1 (UNIT -4) TREE SUNIL KUMAR CIT-UPES.
B/B+ Trees 4.7.
Integer Programming An integer linear program (ILP) is defined exactly as a linear program except that values of variables in a feasible solution have.
Data Structures: Disjoint Sets, Segment Trees, Fenwick Trees
Backtracking And Branch And Bound
Hashing Exercises.
CSE 373 Data Structures Lecture 7
Cse 373 April 26th – Exam Review.
AVL Trees "The voyage of discovery is not in seeking new landscapes but in having new eyes. " - Marcel Proust.
Binomial Heaps Chapter 19.
Backtracking And Branch And Bound
COP3530- Data Structures B Trees
Trees and Binary Trees.
Hierarchical clustering approaches for high-throughput data
Trees CSE 373 Data Structures.
Linear-Time Sorting Algorithms
Backtracking And Branch And Bound
Trees CSE 373 Data Structures.
Binary Search Trees Comp 122, Spring 2004.
Trees CSE 373 Data Structures.
Presentation transcript:

On the Scalability of Computing Triplet and Quartet Distances Morten Kragelund Holt Jens Johansen Gerth Stølting Brodal 1 Aarhus University

Introduction Trees are used in many branches of science. Phylogenetic trees are especially used in biology and bioinformatics. We want to measure how different two such trees are. 2 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances Athene Noctua Macropus Giganteus Ursus Arctos Sus Scrofa Domesticus Equus Asinus Oryctolagus Cuniculus Panthera Tigris Homo Sapiens

Distances Natural in some cases. Between trees? 3 ? Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Triplets and Quartets Triplets Used in rooted trees. Sub-trees consisting of three leaves. in a tree with n leaves. With 2,000 leaves, 1,331,334,000 triplets. Naïve algorithm runs in at least Ω(n 3 ). Number of disagreeing triplets. Quartets Used in unrooted trees. Sub-trees consisting of four leaves. in a tree with n leaves. With 2,000 leaves, 664,668,499,500 quartets. Naïve algorithm runs in at least Ω(n 4 ). Number of disagreeing quartets. 4

Goal Comparison of two trees (T 1 and T 2 ) with the same set of leaf-labels. – Numerical value of the difference of the two trees. – Number of different triplets (quartets) in the two input trees. A tree has a distance of 0 to itself. 5 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Brodal et al. [SODA13] For binary trees C, D and E are all zero 6 Triplets Quartets d2d2 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Brodal et al. [SODA13] BinaryArbitrary degree TripletsO(n lg n) Up to 4d+2 counters in each HDT node QuartetsO(n lg n) 2d d + 22 counters O(max(d 1, d 2 ) n lg n) 2d d + 22 counters A lot of counters . Is this even feasible? Why the d factor on arbitrary degree quartets? – d 2 counters 7 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Overview Basic idea – Each triplet (quartet) is anchored somewhere in T 1. – Run through T 1, and for each triplet (quartet), check if they are anchored the same way in T 2. The algorithm consists of four parts 1.Coloring 2.Counting 3.Hierarchical Decomposition Tree (HDT) 4.Extraction and contraction 8 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

1. Coloring Consists of two steps 1.Leaf-linkingO(n) 2.Recursive coloringO(n lg n) 9 v Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

2. Counting Using the coloring of T 1 and T 2 we count the number of similar triplets (quartets). No reason to look at all triplets (would be much too slow) – Instead, look at inner nodes. In each inner node, we can keep track of the number of different triplets (quartets), rooted at the given node. Using counting and coloring, the triplet distance can be calculated in O(n 2 ). 10 Resolved Disagreeing Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

3. Hierarchical Decomposition Tree (HDT) Problem: T 2 is unbalanced. Solution: Hierarchical Decomposition Trees. 11 HDT C G I C C CC CCC G GGGG G G I I I II C G GG G Built in linear timeLocally balancedTriplet distance in O(n lg 2 n) Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

4. Extraction and Contraction Ensuring that the HDT is small, we can cut off that lg n factor. If the HDT is too large, remove the irrelevant parts. 12 O(n lg n) Remove lg n factor Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Optimizations 1.[SODA13] hints at constructing HDTs early. Problem: HDTs take up a lot of memory. Solution: Postpone HDT construction. Result:25-50% reduction in memory usage. 4-10% reduction in runtime. 2.Utilizing the standard C++ vector data structure. Problem: Relatively slow (for our needs). Solution: A purpose-built linked list implementation. Result:6-9% reduction in runtime on binary trees. 3.Allocating memory whenever needed. Problem: (Relatively) slow to allocate memory. Solution: Allocation in large blocks. Result:18-25% improvement in the runtime % increase in memory usage on large input. 13 On input with more than 10,000 leaves 25% improvement in runtime45% reduction in memory usage Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Limitations Two primary limitations in our implementation: Integer representation – and are in the order of n 3 and n 4. – With signed 64-bit integers, quartet distance of only 55,000 leaves. – Solution: Signed 128-bit integers for n 4 counters. Quartet distance of up to 2,000,000 leaves. Recursion depth – OS imposed limitation in recursion stack depth. – Input, consisting of a very long chain, will fail. – Windows: Height ~4,000. – Linux: Height ~48,000. – Solution: Purpose built stack implementation*. 14 *Not done in the implementation Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Results: [SODA13] 15 LeavesTime (s) 1, , , ,000,000N/A It works, and it is fast! Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Improvements Why max (d 1, d 2 )? – d-counters given by first input tree – [SODA13]: Calculates 6 out of 9 cases. – [SODA13]: d 1 = 2, d 2 = 1024 is much slower than d 1 = d 2 = min xx x Add 5d d + 7 counters Total 7d d + 29 counters Remove need for swapping O(min(d 1, d 2 ) n lg n) Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Results: Improved 17 LeavesTime (s) 1, , , ,000, Faster in alle cases Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

More improvements 18 A+B is a choiceCount A+E insteadFaster? Triplets Quartets d2d2 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

More improvements To count B 14 cases 92 sums 5d d + 8 counters O(min(d 1, d 2 ) n lg n) To Count E 5 cases 21 sums 1d d + 12 counters O(min(d 1, d 2 ) n lg n) 19

Results: More improvements 20 Fastest in the field LeavesTime (s) 1, , , ,000, Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances

Overview BinaryArbitrary degree Triplets[SODA13]: O(n lg n) Quartets[SODA13]: O(n lg n)[SODA13]: O(max(d 1, d 2 ) n lg n) [ALENEX14]: O(min(d 1, d 2 ) n lg n) Balanced tree, leaves [SODA13]: ~34 seconds[SODA13]: ~7 seconds [SODA13]: ~125 seconds[SODA13]: ~139 seconds [ALENEX14] v1: ~83 seconds [ALENEX14] v1: ~112 seconds [ALENEX14] v2: ~62 seconds [ALENEX14] v2: ~45 seconds 21 Holt, Johansen, Brodal On the Scalability of Computing Triplet and Quartet Distances d 1 = d 2 = 256

Conclusion [SODA13] is both practical and implementable. We have – Performed a thorough study of the alternative choices not studied in [SODA13]. – Theoretically, and practically, found good choices for the parameters. – Shown that [SODA13], and derivatives, successfully scales up to trees with millions of nodes. Open problem – Current algorithm makes heavy use of random accesses, and doesn't scale to external memory. – Current algorithm is single-threaded. Morten Kragelund Holt, Jens Johansen, Gerth Stølting Brodal On the Scalability of Computing Triplet and Quartet Distances 22