CS 394C: Computational Biology Algorithms

CS 394C: Computational Biology Algorithms Tandy Warnow Department of Computer Sciences University of Texas at Austin

DNA Sequence Evolution
[Figure: a tree of DNA sequences evolving from the ancestral sequence AAGACTT along a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today. Sequences shown include TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, and TAGCCCA.]

Molecular Systematics
[Figure: five present-day sequences (AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT) at leaves labeled U, V, W, X, Y, alongside a tree on those five labels.]

Phylogeny estimation methods
Distance-based (Neighbor joining, NQM, and others): mostly statistically consistent and polynomial time
Maximum parsimony and maximum compatibility: NP-hard and not statistically consistent
Maximum likelihood: NP-hard and usually statistically consistent (if solved exactly)
Bayesian methods: statistically consistent if run long enough

Distance-based methods
Theorem: Let $(T, \{\lambda_e\})$ be a Cavender-Farris model tree, with additive matrix $[\lambda(i,j)]$. Let $\epsilon > 0$ be given. The sequence length that suffices for accuracy with probability at least $1 - \epsilon$ of NJ (neighbor joining) and the Naïve Quartet Method is $O(\log n \cdot e^{O(\max \lambda(i,j))})$.
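Additivity is what lets quartet-based methods such as the Naïve Quartet Method recover a four-taxon tree from distances. For four taxa with an additive matrix, the true split is the pairing with the smallest sum of within-pair distances (the four-point condition). A minimal Python sketch of that check follows. The function name `four_point_split` and the toy distances are illustrative assumptions, not taken from the slides.

```python
def four_point_split(dist, i, j, k, l):
    """Return the pairing of {i, j, k, l} with the smallest sum of within-pair distances.

    `dist` maps frozenset({x, y}) to the distance between taxa x and y.
    """
    def d(x, y):
        return dist[frozenset((x, y))]

    pairings = {
        ((i, j), (k, l)): d(i, j) + d(k, l),
        ((i, k), (j, l)): d(i, k) + d(j, l),
        ((i, l), (j, k)): d(i, l) + d(j, k),
    }
    return min(pairings, key=pairings.get)


# Toy additive matrix (assumed for illustration) from the tree ((A,B),(C,D)):
# every pendant edge has length 1 and the internal edge has length 2.
toy = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 2,
    frozenset(("A", "C")): 4, frozenset(("A", "D")): 4,
    frozenset(("B", "C")): 4, frozenset(("B", "D")): 4,
}
split = four_point_split(toy, "A", "B", "C", "D")
print(split)
```

With estimated (non-additive) distances the same rule is applied, and the sequence length bound in the theorem governs how often the estimate is close enough to additive for the rule to return the right split.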

Neighbor joining (although statistically consistent) has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees.
[Figure: NJ error rate (axis from 0 to 0.8) plotted against number of taxa (400 to 1600).]

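Maximum Parsimony
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T leaf-labeled by sequences in S, with additional sequences of length k labeling the internal nodes of T, such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$ is the Hamming distance between the sequences labeling the endpoints of edge $(u,v)$.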

Maximum parsimony (example)
Input: Four sequences ACT, ACA, GTT, GTA
Question: which of the three trees has the best MP score?

Maximum Parsimony
[Figure: the three possible trees on the four sequences ACT, ACA, GTT, GTA.]

Maximum Parsimony
[Figure: the three trees with internal-node labels and edge costs. MP scores shown: 7, 5, and 4. The tree with MP score = 4, which groups ACT with ACA and GTT with GTA, is the optimal MP tree.]

Maximum Parsimony
Finding the optimal MP tree is NP-hard.
Optimal labeling can be computed in polynomial time using Dynamic Programming.
[Figure: a labeled tree on ACT, ACA, GTT, GTA with MP score = 4.]
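The fixed-tree problem can be illustrated with Fitch's algorithm, a standard polynomial-time method for unit-cost parsimony. The slides only say "Dynamic Programming", so the choice of Fitch here is an assumption. The sketch below represents a tree as nested tuples whose leaves are the sequences themselves, and the names `fitch_site` and `mp_score` are hypothetical.

```python
def fitch_site(tree, site):
    """Return (candidate state set, minimum changes) for one site of a rooted binary tree."""
    if isinstance(tree, str):          # a leaf is its own sequence
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:                         # children can agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def leaves(tree):
    """Yield the leaf sequences of a nested-tuple tree."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaves(child)


def mp_score(tree):
    """Sum the per-site Fitch costs over all sites of the alignment."""
    length = len(next(leaves(tree)))
    return sum(fitch_site(tree, site)[1] for site in range(length))


# The three topologies on the four example sequences, rooted on the internal edge.
trees = {
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
}
scores = {name: mp_score(tree) for name, tree in trees.items()}
print(scores)
```

These are minimum scores over all internal labelings of each topology: 4, 5, and 6. The figure in the example reports 7 rather than 6 for one tree, which presumably reflects the particular labeling drawn there; the ranking and the optimal tree are unchanged.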

Solving NP-hard problems exactly is … unlikely
Number of (unrooted) binary trees on n leaves is (2n-5)!!

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860

If each tree on 1000 taxa could be analyzed in 0.001 seconds, examining every tree to find the best one would take more than 10^2846 millennia.
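The count (2n-5)!! can be checked directly, since Python integers have arbitrary precision. This is a minimal sketch; the function name `num_unrooted_binary_trees` is an illustrative choice.

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # multiplies 3 * 5 * ... * (2n - 5)
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))

# For large n, report only the order of magnitude (number of digits minus one).
for n in (20, 100, 1000):
    digits = len(str(num_unrooted_binary_trees(n)))
    print(n, "about 10^%d" % (digits - 1))
```

The growth is faster than exponential in n, which is why exhaustive search over tree space is ruled out even for moderate numbers of taxa.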

Approaches for “solving” MP and ML (and other NP-hard problems in phylogeny)
Hill-climbing heuristics (which can get stuck in local optima)
Randomized algorithms for getting out of local optima
Approximation algorithms for MP (based upon Steiner Tree approximation algorithms); however, the approximation ratio that is needed is probably 1.01 or smaller!
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]
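The local-optimum problem can be seen on a toy landscape. The sketch below is not a phylogenetic search: real heuristics move between tree topologies, whereas here the "search space" is a list of positions whose neighbors are the adjacent indices. The cost values and function names are assumptions made for illustration.

```python
import random


def hill_climb(costs, start):
    """Move to the cheaper neighbor until no neighbor improves; return the final position."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(costs)]
        best = min(neighbors, key=lambda p: costs[p])
        if costs[best] >= costs[pos]:
            return pos                  # no improving move: a local optimum
        pos = best


def random_restarts(costs, tries, seed=0):
    """Run hill climbing from several random starts and keep the best end point."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(costs)) for _ in range(tries)]
    return min((hill_climb(costs, s) for s in starts), key=lambda p: costs[p])


costs = [5, 3, 4, 6, 2, 1, 7]           # local optimum at index 1, global optimum at index 5
stuck = hill_climb(costs, 0)            # starting on the left ends at the local optimum
found = hill_climb(costs, 3)            # starting in the other basin reaches the global optimum
best = random_restarts(costs, tries=50)
print(stuck, found, best)
```

Restarting from random positions is the simplest randomized way out of a local optimum. It helps only when some start lands in the right basin, which becomes less likely as the search space grows.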

Problems with techniques for MP and ML
Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: performance of TNT with time.]

MP and Cavender-Farris
Consider a tree (AB,CD) with two very long branches leading to A and C, and all other branches very short. MP will be statistically inconsistent (and “positively misleading”) on this tree.
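The inconsistency can be checked numerically by computing site-pattern probabilities under the two-state symmetric (Cavender-Farris) model on the tree (AB,CD). On four taxa, MP prefers the split supported by the most parsimony-informative sites, so comparing the three pattern probabilities shows which tree MP converges to as sequence length grows. The change probabilities 0.4 (long branches) and 0.05 (short branches) are assumed values chosen for illustration.

```python
from itertools import product


def pattern_probs(p_long, q_short):
    """Probability that a site supports each split on the model tree (AB,CD).

    Edges to A and C change state with probability p_long; the edges to B and D
    and the internal edge change state with probability q_short.
    """
    change = {"A": p_long, "B": q_short, "C": p_long, "D": q_short}

    def f(child, parent, e):
        return 1 - e if child == parent else e

    support = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    for a, b, c, d in product((0, 1), repeat=4):
        prob = 0.0
        # Sum over the states of the internal nodes u (next to A, B) and v (next to C, D).
        for u, v in product((0, 1), repeat=2):
            prob += (0.5 * f(a, u, change["A"]) * f(b, u, change["B"])
                     * f(v, u, q_short)
                     * f(c, v, change["C"]) * f(d, v, change["D"]))
        if a == b and c == d and a != c:
            support["AB|CD"] += prob
        elif a == c and b == d and a != b:
            support["AC|BD"] += prob
        elif a == d and b == c and a != b:
            support["AD|BC"] += prob
    return support


long_branches = pattern_probs(0.4, 0.05)
all_short = pattern_probs(0.05, 0.05)
print(long_branches)
print(all_short)
```

With these values, sites supporting the wrong split AC|BD have probability about 0.139 while sites supporting the true split AB|CD have probability about 0.038, so more data makes MP more confident in the wrong tree. When all branches are short, the true split has the largest probability.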

Problems with existing phylogeny reconstruction methods
Polynomial time methods (generally based upon distances) have poor accuracy with large diameter datasets.
Heuristics for NP-hard optimization problems take too long (months to reach acceptable local optima).

Warnow et al.: Meta-algorithms for phylogenetics
Basic technique: determine the conditions under which a phylogeny reconstruction method does well (or poorly), and design a divide-and-conquer strategy (specific to the method) to improve its performance.
Warnow et al. developed a class of divide-and-conquer methods, collectively called DCMs (Disk-Covering Methods). These are based upon chordal graph theory to give fast decompositions and provable performance guarantees.

Disk-Covering Method (DCM)

Improving phylogeny reconstruction methods using DCMs
Improving the theoretical convergence rate and performance of polynomial time distance-based methods using DCM1
Speeding up heuristics for NP-hard optimization problems (Maximum Parsimony and Maximum Likelihood) using Rec-I-DCM3

DCM1
Warnow, St. John, and Moret, SODA 2001
[Figure: an exponentially converging method passes through the DCM and SQS phases to give an absolute fast converging method.]
A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree.
The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]
Theorem: DCM1-NJ converges to the true tree from polynomial length sequences
[Figure: error rates of NJ and DCM1-NJ (axis from 0 to 0.8) plotted against number of taxa (400 to 1600).]

Rec-I-DCM3 significantly improves performance (Roshan et al. CSB 2004)
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. Similar improvements obtained for RAxML (maximum likelihood).
[Figure: current best techniques compared with the DCM-boosted version of the best techniques.]

Summary (so far)
Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.
The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.

Summary
NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions.
Many real problems have beautiful and natural combinatorial and graph-theoretic formulations.