CS 581 Tandy Warnow.

Slides:



Advertisements
Similar presentations
Parsimony Small Parsimony and Search Algorithms Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.
. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
Phylogenetic Trees Lecture 4
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
CISC667, F05, Lec14, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I) Maximum Parsimony.
1. 2 Rooting the tree and giving length to branches.
3 -1 Chapter 3 The Greedy Method 3 -2 The greedy method Suppose that a problem can be solved by a sequence of decisions. The greedy method has that each.
Branch lengths Branch lengths (3 characters): A C A A C C A A C A C C Sum of branch lengths = total number of changes.
CSC401 – Analysis of Algorithms Lecture Notes 12 Dynamic Programming
CIS786, Lecture 3 Usman Roshan.
Phylogeny reconstruction BNFO 602 Roshan. Simulation studies.
BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.
. Comput. Genomics, Lecture 5b Character Based Methods for Reconstructing Phylogenetic Trees: Maximum Parsimony Based on presentations by Dan Geiger, Shlomo.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter : Strings and.
Building Phylogenies Parsimony 2.
Building Phylogenies Parsimony 1. Methods Distance-based Parsimony Maximum likelihood.
. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
Maximum Parsimony Input: Set S of n aligned sequences of length k Output: –A phylogenetic tree T leaf-labeled by sequences in S –additional sequences of.
Why Models of Sequence Evolution Matter Number of differences between each pair of taxa vs. genetic distance between those two taxa. The x-axis is a proxy.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
BINF6201/8201 Molecular phylogenetic methods
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Phylogenetics II.
CSC401: Analysis of Algorithms CSC401 – Analysis of Algorithms Chapter Dynamic Programming Objectives: Present the Dynamic Programming paradigm.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
394C: Algorithms for Computational Biology Tandy Warnow Sept 9, 2013.
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
Why do trees?. Phylogeny 101 OTUsoperational taxonomic units: species, populations, individuals Nodes internal (often ancestors) Nodes external (terminal,
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
1 Alignment Matrix vs. Distance Matrix Sequence a gene of length m nucleotides in n species to generate an… n x m alignment matrix n x n distance matrix.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Maximum Parsimony Phenetic (distance based) methods are fast and often accurate but discard data and are not based on explicit character states at each.
Chapter AGB. Today’s material Maximum Parsimony Fixed tree versions (solvable in polynomial time using dynamic programming) Optimal tree search.
1 Minimum Routing Cost Tree Definition –For two nodes u and v on a tree, there is a path between them. –The sum of all edge weights on this path is called.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Distance-based phylogeny estimation
Phylogenetic Trees - Parsimony Tutorial #12
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Statistical tree estimation
Character-Based Phylogeny Reconstruction
Professor Tandy Warnow
BNFO 602 Phylogenetics Usman Roshan.
BNFO 602 Phylogenetics – maximum parsimony
CS 581 Tandy Warnow.
Why Models of Sequence Evolution Matter
Dynamic Programming Merge Sort 1/18/ :45 AM Spring 2007
BNFO 602 Phylogenetics – maximum likelihood
BNFO 602 Phylogenetics Usman Roshan.
CS 581 Tandy Warnow.
Phylogeny.
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Merge Sort 4/28/ :13 AM Dynamic Programming Dynamic Programming.
Algorithms for Inferring the Tree of Life
Dynamic Programming Merge Sort 5/23/2019 6:18 PM Spring 2008
Presentation transcript:

CS 581 Tandy Warnow

Today (Chapters 3-4, 8.8) Constructing unrooted trees from bipartitions Compatible binary characters Constructing trees from compatible binary characters Introducing maximum parsimony Maximum parsimony is not statistically consistent under standard models of sequence evolution!

Maximum Parsimony Problem Definition Solving MP on a fixed tree Finding the best MP tree Parsimony informative (and parsimony uninformative) sites Statistical consistency (or lack thereof) under CFN model

Maximum Parsimony Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T leaf-labeled by sequences in S additional sequences of length k labeling the internal nodes of T such that is minimized, where H(i,j) denotes the Hamming distance between sequences at nodes i and j

Hamming Distance Steiner Tree Problem Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T leaf-labeled by sequences in S additional sequences of length k labeling the internal nodes of T such that is minimized, where H(i,j) denotes the Hamming distance between sequences at nodes i and j

Maximum parsimony (example) Input: Four sequences ACT ACA GTT GTA Question: which of the three trees has the best MP scores?

Maximum Parsimony ACT GTA ACA ACT GTT ACA GTT GTA GTA ACA ACT GTT

Maximum Parsimony ACT GTA ACA ACT GTT GTA ACA ACA 2 1 1 2 GTT 3 2 GTT MP score = 6 MP score = 5 GTA ACA ACA GTA 2 1 1 ACT GTT MP score = 4 Optimal MP tree

MP: computational complexity ACT ACA GTT GTA 1 2 MP score = 4 For four leaves, we can do this by inspection

MP: computational complexity ACT ACA GTT GTA 1 2 MP score = 4 Using dynamic programming, the optimal labeling can be computed in O(r2nk) time r = # states (4 for nucleotides, 20 for AA, etc.) n = # leaves k = # characters (or sequence length)

DP algorithm Dynamic programming algorithms on trees are common – there is a natural ordering on the nodes given by the tree. Example: computing the longest leaf-to-leaf path in a tree can be done in linear time, using dynamic programming (bottom-up).

Two variants of MP Unweighted MP: all substitutions have the same cost Weighted MP: there is a substitution cost matrix that allows different substitutions to have different costs. For example: transversions and transitions can have different costs. Even if symmetric, this complicates the calculation – but not by much.

Fitch’s algorithm for unweighted MP on a fixed tree We process the characters independently. Let c be the character we are examining, and let c(v) be the state of leaf v. Let A(v) denote the set of optimal nucleotides at node v (for an MP solution to the subtree rooted at v). Hence A(v)={c(v)} if v is a leaf.

Fitch’s algorithm for fixed-tree (unweighted) maximum parsimony

Sankoff’s DP algorithm for weighted MP Assume a given rooted binary tree T and a single character. Root tree T at some internal node. Now, for every node v in T and every possible letter x, compute Cost(v,x) := optimal cost of subtree of T rooted at v, given that we label v by x. Base case: easy General case?

DP algorithm (cont.) Cost(v,x) = miny{Cost(v1,y)+cost(x,y)} + miny{Cost(v2,y)+cost(x,y)} where v1 and v2 are the children of v, and y ranges over the possible states (e.g., nucleotides), and cost(x,y) is an arbitrary cost function.

DP algorithm (cont.) We compute Cost(v,x) for every node v and every state x, from the “bottom up”. The optimal cost is minx{Cost(root,x)} We can then pick the best states for each node in a top-down pass. However, here we have to remember that different substitutions have different costs.

MP: solvable in polynomial time if the tree is given ACT ACA GTT GTA 1 2 MP score = 4 Optimal labeling can be computed in O(r2nk) time r = # states (4 for nucleotides, 20 for AA, etc.) n = # leaves k = # characters (or sequence length)

But finding the best tree is NP-hard! ACT ACA GTT GTA 1 2 MP score = 4 Optimal labeling can be computed in O(r2nk) time

Solving NP-hard problems exactly is … unlikely #leaves #trees 4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 1020 100 4.5 x 10190 1000 2.7 x 102900 Number of (unrooted) binary trees on n leaves is (2n-5)!! If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia

Approaches for “solving” MP Hill-climbing heuristics (which can get stuck in local optima) Randomized algorithms for getting out of local optima Approximation algorithms for MP (based upon Steiner Tree approximation algorithms). Phylogenetic trees Cost Global optimum Local optimum

NNI moves

TBR moves

Summary (so far) Maximum Parsimony is an NP-hard optimization problem, but can be solved exactly (using dynamic programming) in polynomial time on a fixed tree. Heuristics for MP are reasonably fast, but apparent convergence can be misleading. And some datasets can take a long time.

Is Maximum Parsimony statistically consistent under CFN? Recall the CFN model of binary sequence evolution: iid site evolution, and each site changes with probability p(e) on edge e, with 0 < p(e) < 0.5. Is MP statistically consistent under this model?

Statistical consistency under CFN We will say that a method M is statistically consistent under the CFN model if: For all CFN model trees (T,Θ) (where Θ denotes the set of substitution probabilities on each of the branches of the tree T), as the number L of sites goes to infinity, the probability that M(S)=T converges to 1, where S is a set of sequences of length L.

Is MP statistically consistent? We will start with 4-leaf CFN trees, so the input to MP is a set of four sequences, A, B, C, D. Note that there are only three possible unrooted trees that MP can return: ((A,B),(C,D)) ((A,C),(B,D)) ((A,D),(B,C))

Analyzing what MP does on four leaves MP has to pick the tree that has the least number of changes among the three possible trees. Consider a single site (i.e., all the sequences have length one). Suppose the site is A=B=C=D=0. Can we distinguish between the three trees?

Analyzing what MP does on four leaves Suppose the site is A=B=C=D=0. Suppose the site is A=B=C=D=1 Suppose the site is A=B=C=0, D=1 Suppose the site is A=B=C=1, D=0 Suppose the site is A=B=D=0, C=1 Suppose the site is A=C=D=0, B=1 Suppose the site is B=C=D=0, A=1

Uninformative Site Patterns Uninformative site patterns are ones that fit every tree equally well. Note that any site that is constant (same value for A,B,C,D) or splits 3/1 is parsimony uninformative. On the other hand, all sites that split 2/2 are parsimony informative!

Parsimony Informative Sites [A=B=0, C=D=1] or [A=B=1, C=D=0] These sites support ((A,B),(C,D)) [A=C=1, B=D=0] or [A=C=0, B=D=1] These sites support ((A,C),(B,D)) [A=D=0,B=C=1] or [A=D=1, B=C=0] These sites support ((A,D),(B,C))

Calculating which tree MP picks When the input has only four sequences, calculating what MP does is easy! Remove the parsimony uninformative sites Let I be the number of sites that support ((A,B),(C,D)) Let J be the number of sites that support ((A,C),(B,D)) Let K be the number of sites that support ((A,D),(B,C)) Whichever tree is supported by the largest number of sites, return that tree. (For example, if I >max{J,K}, then return ((A,B),(C,D).) If there is a tie, return all trees supported by the largest number of sites.

MP on 4 leaves Consider a four-leaf tree CFN model tree ((A,B),(C,D)) with a very high probability of change (close to ½) on the internal edge (separating AB from CD) and very small probabilities of change (close to 0) on the four external edges. What parsimony informative sites have the highest probability? What tree will MP return with probability increasing to 1, as the number of sites increases?

MP on 4 leaves Consider a four-leaf tree CFN model tree ((A,B),(C,D)) with a very high probability of change (close to ½) on the two edges incident with A and B, and very small probabilities of change (close to 0) on all other edges. What parsimony informative sites have the highest probability? What tree will MP return with probability increasing to 1, as the number of sites increases?

MP on 4 leaves Consider a four-leaf tree CFN model tree ((A,B),(C,D)) with a very high probability of change (close to ½) on the two edges incident with A and C, and very small probabilities of change (close to 0) on all other edges. What parsimony informative sites have the highest probability? What tree will MP return with probability increasing to 1, as the number of sites increases?

Summary (updated) Maximum Parsimony (MP) is statistically consistent on some CFN model trees. However, there are some other CFN model trees in which MP is not statistically consistent. Worse, MP is positively misleading on some CFN model trees. This phenomenon is called “long branch attraction”, and the trees for which MP is not consistent are referred to as “Felsenstein Zone trees” (after the paper by Felsenstein). The problem is not limited to 4-leaf trees…