CIS786, Lecture 4 Usman Roshan.


Iterated local search: escape local optima by perturbation
[Diagram: local search climbs to a local optimum; a perturbation jumps to a nearby solution, and local search resumes from the output of the perturbation]
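The generic ILS loop the diagram describes can be sketched as follows. This is a minimal illustration on a made-up one-dimensional objective, not the MP tree search itself; all function names here are hypothetical.

```python
import random

def iterated_local_search(init, local_search, perturb, score, iters=20, seed=0):
    """Generic ILS (maximization): climb to a local optimum, perturb,
    climb again, and keep the best solution seen so far."""
    rng = random.Random(seed)
    best = local_search(init)
    current = best
    for _ in range(iters):
        candidate = local_search(perturb(current, rng))
        if score(candidate) >= score(current):
            current = candidate          # accept equal-or-better restarts
        if score(current) > score(best):
            best = current
    return best

# Toy rugged landscape with many local optima (hypothetical objective).
def score(x):
    return -(x % 7) - abs(x - 50) / 10.0

def local_search(x):
    # Simple hill climbing by unit steps.
    while score(x + 1) > score(x):
        x += 1
    while score(x - 1) > score(x):
        x -= 1
    return x

def perturb(x, rng):
    # Random kick large enough to escape the basin of attraction.
    return x + rng.randint(-10, 10)

print(iterated_local_search(0, local_search, perturb, score))
```

Plain hill climbing from 0 gets stuck immediately on this landscape; the perturbation step is what lets the search move between basins, which is the same role the ratchet plays for MP.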

ILS for MP
– We saw that the ratchet improves upon iterative improvement
– We saw that TNT's sophisticated and faster implementation outperforms the ratchet and PAUP* implementations
– But can we do even better?

Disk Covering Methods (DCMs)
DCMs are divide-and-conquer booster methods: they divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree.
DCMs to date:
– DCM1: for improving the statistical performance of distance-based methods
– DCM2: for improving heuristic search for MP and ML
– DCM3: the latest, fastest, and best (in accuracy and optimality) DCM

DCM2 technique for speeding up MP searches
1. Decompose sequences into overlapping subproblems
2. Compute subtrees using a base method
3. Merge subtrees using the Strict Consensus Merge (SCM)
4. Refine to make the tree binary

DCM1 and DCM2 decompositions
– DCM1 decomposition: NJ gets better accuracy on small-diameter subproblems
– DCM2 decomposition: getting a smaller number of smaller subproblems speeds up the solution

Supertree Methods

Strict Consensus Merger

Tree Refinement
[Diagram: example refinements of unresolved trees on leaves a through h]

The big question
Why DCMs? Can DCMs improve upon existing methods such as neighbor joining, PAUP*, or TNT?

Improving sequence length requirements of NJ
Can DCM1 improve upon NJ? We examine this question under simulation.
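For context, neighbor joining repeatedly joins the pair of taxa minimizing the standard criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of row i of the distance matrix. A minimal sketch of the pair-selection step follows; this is the textbook formula, and the example matrix (additive for a four-taxon tree with made-up branch lengths) is illustrative only.

```python
def nj_pick_pair(d):
    """Return the pair (i, j) minimizing the neighbor-joining criterion
    Q(i, j) = (n - 2) * d[i][j] - r_i - r_j,
    where r_i is the sum of row i of the distance matrix."""
    n = len(d)
    r = [sum(row) for row in d]
    best_q, best_pair = float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if q < best_q:
                best_q, best_pair = q, (i, j)
    return best_pair

# Additive matrix for the tree ((A:2,B:3):1,(C:4,D:5)), taxa ordered A, B, C, D.
d = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]
print(nj_pick_pair(d))  # picks (0, 1), i.e. joins the cherry {A, B}
```

On an exactly additive matrix this criterion provably selects a true cherry of the tree; the interesting question, studied in the experiments below, is how long the sequences must be for the estimated distances to be close enough to additive.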

DCM1(NJ)

Computing tree for one threshold

Recall simulation studies

Experimental results
– True tree selection (phase II of DCM1)
– Uniformly random trees
– Birth-death random trees
– Sequence length requirements on birth-death random trees

Comparing tree selection techniques

Error rates on uniform random trees

Error as a function of evolutionary rate
[Plot comparing NJ and DCM1-NJ+MP]

Sequence length requirements as a function of evolutionary rate
[Plot: 100 taxa, 90% accuracy]

400 taxa, 90% accuracy

Sequence length requirements as a function of the number of taxa
[Plot comparing DCM1-NJ+MP and NJ]

Conclusion
DCM1-NJ+MP improves upon NJ in large and divergent settings.
Why did it work? Small subproblems with low evolutionary diameters AND a reliable supertree method → accurate subtrees (on the subsets) → an accurate supertree.

Conclusion

Previously we saw a comparison of DCM components for solving MP:
– DCM2 better than the DCM1 decomposition
– SCM better than MRP (in the DCM context)
– Constrained refinement better than the Inferred Ancestral States technique
– Higher thresholds take longer but can produce better trees

Comparison of DCM components for solving MP

I. Comparison of DCMs (1,322 sequences) Base method is the TNT-ratchet.



I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.

DCM2 decomposition on 500 rbcL genes (Zilla dataset)
[Visualization produced by the graphviz program, which draws the graph according to specified distances. Nodes: species in the dataset. Distances: p-distances (Hamming) between the DNA sequences. Blue: separator; red: subset 1; pink: subset 2.]
1. The separator is very large
2. The subsets are very large
3. The subsets are scattered
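The p-distances used to lay out these visualizations are just normalized Hamming distances over aligned sequences; a minimal sketch:

```python
def p_distance(a, b):
    """Normalized Hamming distance between two aligned sequences:
    the fraction of sites at which they differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    diffs = sum(x != y for x, y in zip(a, b))
    return diffs / len(a)

print(p_distance("ACGTACGT", "ACGTACGA"))  # 1 mismatch in 8 sites -> 0.125
```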

Doesn’t look anything like this

DCM3 decomposition

DCM2
Input: distance matrix d, threshold q, sequences S
Algorithm:
1a. Compute a threshold graph G using q and d
1b. Perform a minimum-weight triangulation of G
2. Find a separator X in G which minimizes the maximum subproblem size, where the subproblems are the connected components of G – X (each unioned with X)
3. Output the subproblems

DCM3
Input: guide tree T on S, sequences S
Algorithm:
1. Compute a short quartet graph G using T; the graph G is provably triangulated (steps 2 and 3 are as in DCM2)

DCM3 advantage: it is faster and produces smaller subproblems than DCM2
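Step 1a of DCM2, the threshold graph, is easy to sketch: connect two taxa whenever their distance is at most the threshold q. This is illustrative code on a made-up matrix, not the lecture's implementation.

```python
def threshold_graph(d, q):
    """DCM-style threshold graph: one vertex per taxon, with an edge
    between taxa i and j whenever their distance is at most q."""
    n = len(d)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if d[i][j] <= q}

# Made-up 4-taxon distance matrix.
d = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.3, 0.6],
    [0.4, 0.3, 0.0, 0.2],
    [0.5, 0.6, 0.2, 0.0],
]
print(sorted(threshold_graph(d, 0.3)))  # [(0, 1), (1, 2), (2, 3)]
```

Raising q adds edges, which tends to enlarge the separator and the subproblems; this is the trade-off behind "higher thresholds take longer but can produce better trees."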

DCM3 decomposition - example

Approximate centroid-edge DCM3 decomposition: example
1. Locate the centroid edge e (O(n) time)
2. Set the closest leaves around e to be the separator (O(n) time)
3. The remaining leaves in the subtrees around e form the subsets (each unioned with the separator)
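The balanced-split idea behind the centroid edge can be sketched as follows. This is an illustrative stand-in that exhaustively picks the tree edge whose removal splits the leaf set most evenly, rather than the linear-time procedure on the slide; the tree encoding and names are hypothetical.

```python
from collections import defaultdict

def centroid_edge(edges, leaves):
    """Return the tree edge whose removal splits the leaf set most evenly
    (minimizes the size of the larger side)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def leaves_on_side(u, v):
        # Count leaves reachable from u without crossing the edge (u, v).
        seen, stack, count = {u, v}, [u], 0
        while stack:
            x = stack.pop()
            if x in leaves:
                count += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return count

    total = len(leaves)
    return min(edges, key=lambda e: max(leaves_on_side(e[0], e[1]),
                                        total - leaves_on_side(e[0], e[1])))

# Unrooted tree ((A,B),(C,D)) with internal nodes u and v.
edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
print(centroid_edge(edges, {"A", "B", "C", "D"}))  # the middle edge ("u", "v")
```

Here the middle edge splits the four leaves 2/2, so it is the centroid edge; pendant edges give lopsided 1/3 splits.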

Time to compute DCM3 decompositions
– An optimal DCM3 decomposition takes O(n³) time to compute, the same as for DCM2
– The centroid-edge DCM3 decomposition can be computed in O(n²) time
– An approximate centroid-edge decomposition can be computed in O(n) time
(From here on we assume we are using the approximate centroid-edge decomposition.)


DCM3 decomposition on 500 rbcL genes (Zilla dataset)
[Visualization produced by the graphviz program, which draws the graph according to specified distances. Nodes: species in the dataset. Distances: p-distances (Hamming) between the DNA sequences. Blue: separator (and subset 1); red: subset 2; pink: subset 3; yellow: subset 4.]
1. The separator is small
2. The subsets are small
3. The subsets are compact

Comparison of DCMs
Dataset: 4,583 Actinobacteria ssu rRNA sequences from the RDP. Base method is the TNT-ratchet.
– DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets
– DCM3 followed by TNT-ratchet doesn't improve over TNT
– Recursive-DCM3 followed by TNT-ratchet doesn't improve over TNT
[Chart: average MP score above optimal, as a percentage of the optimal, vs. hours, for TNT, DCM2, DCM3, and Rec-DCM3]

Local optima are a problem
[Diagram: cost over the space of phylogenetic trees, showing a global optimum and a local optimum]

Local optima are a problem
[Chart: average MP score above optimal, as a percentage of the optimal, vs. hours]

Iterated local search: escape local optima by perturbation
[Diagram: local search climbs to a local optimum; a perturbation jumps to a nearby solution, and local search resumes from the output of the perturbation]

Iterated local search: Recursive-Iterative-DCM3
[Diagram: local search reaches a local optimum; Recursive-DCM3 plays the role of the perturbation, and local search resumes from its output]

Comparison of DCMs for solving MP
Rec-I-DCM3(TNT-ratchet) improves upon unboosted TNT-ratchet.
[Chart: average MP score above optimal, as a percentage of the optimal, vs. hours, for TNT, DCM2, DCM3, Rec-DCM3, and Rec-I-DCM3]

I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.


I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration.

Improving upon TNT
But what happens after 24 hours? We studied boosting TNT-ratchet. Other TNT heuristics are actually better, and improving upon them may not be possible. Can we improve upon the default TNT search?

Improving upon TNT

2000 Eukaryotes rRNA

domain+2-org rRNA

13921 Proteobacteria rRNA

Improving upon TNT
What about better TNT heuristics? Can Rec-I-DCM3 improve upon them? Rec-I-DCM3 improves upon the default TNT search, but we don't know what happens for better TNT heuristics. Therefore, for a large-scale analysis, figure out the best settings of the software (e.g., TNT or PAUP*) on the dataset, and then use it in conjunction with Rec-I-DCM3 with various subset sizes.