More statistical stuff CS 394C Feb 6, 2012. Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.

Slides:



Advertisements
Similar presentations
Parsimony Small Parsimony and Search Algorithms Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.
Phylogenetic Trees Lecture 4
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
BNFO 602 Phylogenetics Usman Roshan.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
Distance Methods. Distance Estimates attempt to estimate the mean number of changes per site since 2 species (sequences) split from each other Simply.
Phylogenetic Trees Presenter: Michael Tung
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Probabilistic Approaches to Phylogeny Wouter Van Gool & Thomas Jellema.
Probabilistic methods for phylogenetic trees (Part 2)
Building Phylogenies Parsimony 2.
Building Phylogenies Distance-Based Methods. Methods Distance-based Parsimony Maximum likelihood.
Phylogenetic analyses Kirsi Kostamo. The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
BINF6201/8201 Molecular phylogenetic methods
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
BINF6201/8201 Molecular phylogenetic methods
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.
Lecture 10 – Models of DNA Sequence Evolution Correct for multiple substitutions in calculating pairwise genetic distances. Derive transformation probabilities.
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
Bayes estimators for phylogenetic reconstruction Ruriko Yoshida.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood Alexandros Stamatakis LRR TU München Contact:
Why do trees?. Phylogeny 101 OTUsoperational taxonomic units: species, populations, individuals Nodes internal (often ancestors) Nodes external (terminal,
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
1 Alignment Matrix vs. Distance Matrix Sequence a gene of length m nucleotides in n species to generate an… n x m alignment matrix n x n distance matrix.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
Probabilistic methods for phylogenetic tree reconstruction BMI/CS 576 Colin Dewey Fall 2015.
Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Distance-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS 576 Fall 2015.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Chapter AGB. Today’s material Maximum Parsimony Fixed tree versions (solvable in polynomial time using dynamic programming) Optimal tree search.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Distance-based phylogeny estimation
Introduction to Bioinformatics Resources for DNA Barcoding
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Statistical tree estimation
Distance based phylogenetics
Maximum likelihood (ML) method
Recitation 5 2/4/09 ML in Phylogeny
BNFO 602 Phylogenetics Usman Roshan.
CS 581 Tandy Warnow.
CS 581 Tandy Warnow.
The Most General Markov Substitution Model on an Unrooted Tree
Phylogeny.
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Presentation transcript:

More statistical stuff CS 394C Feb 6, 2012

Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are not statistically consistent Maximum Likelihood Phylogenetic Estimation Software

Simplest model of binary character evolution: Cavender-Farris For each edge e, there is a probability p(e) of the property “changing state” (going from 0 to 1, or vice-versa), with 0<p(e)<0.5 (to ensure that CF trees are identifiable). Every position evolves under the same process, independently of the others.

Cavender-Farris pattern probabilities Let x and y denote nodes in the tree, and p xy denote the probability that x and y exhibit different states. Theorem: Let p i be the substitution probability for edge e i, and let x and y be connected by path e 1 e 2 e 3 …e k. Then 1-2p xy = (1-2p 1 )(1-2p 2 )…(1-2p k )

And then take logarithms The theorem gave us: 1-2p xy = (1-2p 1 )(1-2p 2 )…(1-2p k ) If we take logarithms, we obtain ln(1-2p xy ) = ln(1-2p 1 ) + ln(1-2p 2 )+…+ln(1-2p k ) Since these probabilities lie between 0 and 0.5, these logarithms are all negative. So let’s multiply by -1 to get positive numbers.

Branch Lengths Let w(e) = -1/2 ln(1-2p(e)), where p(e) is the probability of change on edge e. By the previous theorem, D xy = -1/2 ln(1-2p xy ) defines an additive matrix. We can estimate D xy using d xy = -1/2 ln(1-2H(x,y)/k), where H(x,y) is the Hamming distance and k is the sequence length This is a statistically consistent distance estimator for the CF model.

CF model, again Instead of defining a CF model tree using substitution probabilities, p(e), we give w(e) = expected number of times the site will change on edge e, under a Poisson random process. In this case, p(e) is the probability of an odd number of changes on edge e. It is not that hard to show that w(e) = -1/2 ln(1-2p(e))

Rates-across-sites Most models allow for sites to vary somewhat, but restrict this variability to “rates” of evolution (so some sites evolve faster, some evolve slower) These rates are used to scale the branch lengths w(e) up or down (they do NOT scale the substitution probabilities) Typically these rates are drawn from some nice distribution (like a gamma distribution), to ensure identifiability of the tree from the data

DNA substitution models Every edge has a substitution probability The model also allows 4x4 substitution matrices on the edges: –Simplest model: Jukes-Cantor (JC) assumes that all substitutions are equiprobable –General Time Reversible (GTR) Model: one 4x4 substitution matrix for all edges –General Markov (GM) model: different 4x4 matrices allowed on each edge

Jukes-Cantor DNA model Character states are A,C,T,G (nucleotides). All substitutions have equal probability. On each edge e, there is a value p(e) indicating the probability of change from one nucleotide to another on the edge, with 0<p(e)<0.75 (to ensure that JC trees are identifiable). The state (nucleotide) at the root is random (all nucleotides occur with equal probability). All the positions in the sequence evolve identically and independently.

Tree Estimation Distance-based methods can be statistically consistent, if (a) statistically consistent distance estimator is used, and (b) the tree estimation technique has some error tolerance Maximum parsimony can be statistically inconsistent UPGMA can be statistically inconsistent

Maximum Parsimony Parsimony-informative sites on 4-leaf trees Parsimony informative sites more generally: at least two “big states” The Felsenstein Zone tree

Computing the probability of the data Given a model tree (with all the parameters set) and character data at the leaves, you can compute the probability of the data. Small trees can be done by hand. Large examples are computationally intensive - but still polynomial time (using an algorithmic trick).

Cavender-Farris model calculations Consider an unrooted tree with topology ((a,b),(c,d)) with p(e)=0.1 for all edges. What is the probability of all leaves having state 0? We show the brute-force technique.

Brute-force calculation Let E and F be the two internal nodes in the tree ((A,B),(C,D)). Then Pr(A=B=C=D=0) = Pr(A=B=C=D=0|E=F=0) + Pr(A=B=C=D=0|E=1, F=0) + Pr(A=B=C=D=0|E=0, F=1) + Pr(A=B=C=D=0|E=F=1) The notation “Pr(X|Y)” denotes the probability of X given Y.

Calculation, cont. Technique: Set one leaf to be the root Set the internal nodes to have some specific assignment of states (e.g., all 1) Compute the probability of that specific pattern Add up all the values you get, across all the ways of assigning states to internal nodes

Calculation, cont. Calculating Pr(A=B=C=D=0|E=F=0) There are 5 edges, and thus no change on any edge. Since p(e)=0.1, then the probability of no change is 0.9. So the probability of this pattern, given that the root is a particular leaf and has state 0, is (0.9) 5. Then we multiply by 0.5 (the probability of the root A having state 0). So the probability is (0.5)x(0.9) 5.

Calculation, cont. Calculating Pr(A=B=C=D=0|E=F=1) There is a change on every edge except the internal edge. Since p(e)=0.1, the probability of no change is 0.9. So the probability of this pattern, given that the root is a particular leaf and has state 0, is (0.1) 4 (0.9). Then we multiply by 0.5 (the probability of the root A having state 0). So the probability is (0.5)x(0.1) 4 x(0.9).

Calculating Pattern Probabilities The brute-force calculation uses exponential time Dynamic Programming makes it possible to do this in polynomial time

Fixed-tree maximum parsimony Input: tree topology T with leaves labelled by sequences of same length Output: optimal assignment of sequences to internal nodes to minimize the parsimony score

Recall DP algorithm for maximum parsimony Cost(v,a) = min cost of the subtree T v, given that we label v by letter a. Questions: How to initialize? How to order the subproblems? Where is the answer? What is the running time?

DP algorithm for calculating CF pattern probabilities Calculate: W(v,a) = Prob(S v |label(v)=a), the probability of the sequences at the leaves of the subtree T v, given label(v) = a. Questions: How to initialize? How to order the subproblems? Where is the answer? What is the running time?

DP algorithm, continued Note: the input is a CF model tree, so substitution probabilities on the edges are part of the input The running time is polynomial You can root the tree on any edge

CF maximum likelihood for fixed tree Input: tree topology T and sequences at the leaves of T Output: substitution probabilities  so that Pr(S|(T,  )) is maximized The DP algorithm does not solve this

Maximum Likelihood Input: sequence data S Output: the model tree (tree T and substitution parameters  ) such that Pr(S|(T,  )) is maximized. NP-hard. Important in practice. Good heuristics! But what does it mean?

Maximum likelihood under Cavender-Farris Given a set S of binary sequences, find the Cavender-Farris model tree (tree topology and edge parameters) that maximizes the probability of producing the input data S. ML, if solved exactly, is statistically consistent under Cavender-Farris (and under the DNA sequence models, and more complex models as well). The problem is that ML is hard to solve.

“Solving ML” Technique 1: compute the probability of the data under each model tree, and return the best solution. Problem: Exponentially many trees on n sequences, and uncountably infinite number of ways of setting the parameters on each of these trees!

“Solving ML” Technique 2: For each of the tree topologies, find the best parameter settings. Problem: Exponentially many trees on n sequences, and calculating the best setting of the parameters on any given tree is hard! Even so, there are hill-climbing heuristics for both of these calculations (finding parameter settings, and finding trees).

Bayesian MCMC analyses Algorithm is a random walk through space of all possible model trees (trees with substitution matrices on edges, etc.). From your current model tree, you perturb the tree topology and numerical parameters to obtain a new model tree. Compute the probability of the data (character states at the leaves) for the new model tree. –If the probability increases, accept the new model tree. –If the probability is lower, then accept with some probability (that depends upon the algorithm design and the new probability). Run for a long time…

Bayesian MCMC estimation After the random walk has been run for a very long time… Gather a random sample of the trees you visit Return: –Statistics about the random sample (e.g., how many trees have a particular bipartition), OR –Consensus tree of the random sample, OR –The tree that is visited most frequently Bayesian methods, if run long enough, are statistically consistent methods (the tree that appears the most often will be the true tree with high probability). MrBayes is standard software for Bayesian analyses in biology.

Summary for CF (and GTR) Maximum Likelihood is statistically consistent if solved exactly Bayesian MCMC methods, if run long enough Distance-based methods (like Neighbor Joining and the Naïve Quartet Method) But not maximum parsimony, not maximum compatibility, and not UPGMA

Software MrBayes is the most popular Bayesian methods (but there are others) RAxML is the most popular software for ML estimation on large datasets, but other software may be almost as accurate and faster (in particular FastTree) Protein sequence data presents additional challenges, due to model selection Issues: running time, memory, and models…R (General Time Reversible) model

Phylogeny estimation statistical issues Is the phylogeny estimation method statistically consistent under the given model? How much data does the method need need to produce a correct tree? Is the method robust to model violations? Is the character evolution model reasonable?