Phylogenetics LLO9 Maximum Likelihood and Its Applications


Prepared by Jaya Seelan Sathiya Seelan, PhD

Previously: the parsimony criterion; bootstrapping; models of molecular evolution; distance methods; model-based distance corrections for distance analyses (e.g., Neighbor-Joining, UPGMA).
Today: the likelihood criterion; more on models of molecular evolution; maximum likelihood analysis; performance of distance, parsimony, and ML analyses in simulation (Huelsenbeck, 1995); Bayesian phylogenetic inference.

Parsimony criterion: The “shortest” tree is optimal. Tree length depends on the step matrix of transformation costs.
Pros: intuitive; analytically tractable; flexible (many different weighting schemes are possible); can combine different kinds of data.
Cons: may be inconsistent in the statistical sense, meaning that as more and more data are accumulated, results can converge on an incorrect solution.
Consider a tree with a short internal branch and asymmetric terminal branches, reflecting unequal rates of evolution. Under parsimony, correct reconstruction of the internal branch requires a character that changed along the internal branch but not on the terminal branches. Such informative characters are unlikely to occur, whereas uninformative or misinformative characters may be common. Parsimony can be “positively misleading” in cases like these, because the number of misinformative characters increases as the number of characters increases. Parsimony therefore makes you more confident in the wrong answer. This tree scenario has been called the “Felsenstein zone”, and the phylogenetic artifact is called long branch attraction. (Swofford et al., p. 427, fig. 8)

Maximum likelihood methods are explicitly model-based.
The likelihood criterion: The tree that maximizes the likelihood of the observed data is optimal. L = P(data | tree, model). Likelihood (L) is the probability of the data (the alignment), given a tree (with topology and branch lengths specified) and a probabilistic model of evolution.
Assumptions (the fine print):
The tree is correct.
The probability that a position has a certain state at time 1 depends only on its state at time 0; knowing that it had some state prior to time 0 is irrelevant. This is called a Markov process.
Data (individual sites) are independent.
A uniform evolutionary process operated across the entire tree (why might this be false? endosymbiosis? loss of function?), i.e., the process of evolution is a homogeneous Markov process.
Examples of models follow.

Two simple models of molecular evolution:
Jukes-Cantor (JC69) one-parameter model: assumes that all transformations between nucleotides occur at the same rate.
Kimura (K80 or K2P) two-parameter model: assumes that transitions and transversions occur at different rates (supported by empirical data).
[Figure: substitution diagrams among A, C, G, and T for JC69 and K2P, with transitions and transversions distinguished.]
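The distinction that K2P relies on can be sketched in a few lines. This is an illustrative example only: the function names and the two short sequences are hypothetical, and the only fact assumed is the standard definition that transitions are purine-to-purine (A and G) or pyrimidine-to-pyrimidine (C and T) changes, while all other changes are transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Return 'identical', 'transition', or 'transversion' for two nucleotides."""
    if x == y:
        return "identical"
    # A change within the purines or within the pyrimidines is a transition.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_differences(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(seq1, seq2):
        kind = classify_substitution(x, y)
        if kind != "identical":
            counts[kind] += 1
    return counts


# Hypothetical aligned sequences, for illustration only.
print(count_differences("ACGTAC", "GCGTTA"))
```

JC69 would treat all three differences in this example alike, whereas K2P gives the single transition a different rate from the two transversions.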

Models of molecular evolution are based on substitution rate matrices (which can be transformed into substitution probability matrices). Models vary in the numbers and kinds of parameters used to determine the elements of the rate matrix.
JC69* rate matrix: 1 parameter (a). [Figure: 4 x 4 rate matrix with “from” states and “to” states labeling the two axes.]
*Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pages 21-132 in H. N. Munro (ed.), Mammalian Protein Metabolism. Academic Press, New York.
© Paul Lewis
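The step from a rate matrix to a probability matrix can be sketched for JC69. The matrix itself did not survive in the transcript, so this sketch assumes the usual form of the model: every off-diagonal rate equals the single parameter a, and each diagonal entry is -3a so that rows sum to zero. Under that assumption the probability matrix P(t) = exp(Qt) has a simple closed form. The function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc69_rate_matrix(a):
    """Assumed JC69 rate matrix: off-diagonals a, diagonals -3a (rows sum to 0)."""
    return [[-3 * a if i == j else a for j in range(4)] for i in range(4)]


def jc69_prob_matrix(a, t):
    """Closed-form P(t) = exp(Qt) for the assumed JC69 rate matrix."""
    decay = math.exp(-4 * a * t)
    same = 0.25 + 0.75 * decay   # probability of ending in the starting state
    diff = 0.25 - 0.25 * decay   # probability of ending in one specific other state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


Q = jc69_rate_matrix(a=0.1)
P = jc69_prob_matrix(a=0.1, t=1.0)
for base, row in zip(BASES, P):
    print(base, [round(p, 4) for p in row])
```

At t = 0 the probability matrix is the identity, and as t grows every entry approaches 1/4, which is the sense in which long branches erase phylogenetic signal.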

Generalized time reversible (GTR*) model: Transformation rates are determined by the mean substitution rate (m), relative rate parameters (a-f), and base frequencies (e.g., pA).
9 parameters: pA, pC, pG, a, b, c, d, e, m.
Example diagonal element of the rate matrix: -m(pAc + pCe + pGf).
Identical to the JC69 model if a = b = c = d = e = f = 1 and all the base frequencies are set to ¼.
*Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20:86-93.
© Paul Lewis
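A GTR rate matrix can be assembled from these ingredients as follows. This is a sketch under stated assumptions, not the slide's own matrix: each off-diagonal rate from x to y is taken to be m times the relative rate for that pair times the frequency of y, and the assignment of letters to pairs (a: A-C, b: A-G, c: A-T, d: C-G, e: C-T, f: G-T) is an assumption chosen so that the T-row diagonal matches the element -m(pAc + pCe + pGf) quoted above. The function name and the example values are hypothetical.

```python
ORDER = "ACGT"


def gtr_rate_matrix(freqs, rates, m=1.0):
    """Build a GTR rate matrix in the order A, C, G, T.

    freqs: dict mapping each base to its frequency (pA, pC, pG, pT).
    rates: dict with relative rates 'a'..'f' (pair assignment is an assumption).
    m: mean substitution rate multiplier.
    """
    pair_rate = {
        ("A", "C"): rates["a"], ("A", "G"): rates["b"], ("A", "T"): rates["c"],
        ("C", "G"): rates["d"], ("C", "T"): rates["e"], ("G", "T"): rates["f"],
    }
    matrix = []
    for x in ORDER:
        row = []
        for y in ORDER:
            if x == y:
                row.append(0.0)  # placeholder, filled in below
            else:
                r = pair_rate[(x, y)] if (x, y) in pair_rate else pair_rate[(y, x)]
                row.append(m * r * freqs[y])
        # Diagonal is minus the sum of the other entries, so each row sums to zero.
        row[ORDER.index(x)] = -sum(row)
        matrix.append(row)
    return matrix


# With all relative rates 1 and equal base frequencies, the matrix has the JC69 form.
equal_freqs = {b: 0.25 for b in ORDER}
unit_rates = {k: 1.0 for k in "abcdef"}
for row in gtr_rate_matrix(equal_freqs, unit_rates):
    print([round(v, 3) for v in row])
```

In the equal-rates, equal-frequencies case every off-diagonal entry is m/4 and every diagonal entry is -3m/4, which is the JC69 pattern noted in the slide.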

The models discussed so far are interconvertible by adding or restricting parameters. Swofford et al. p. 434

Elaborations to GTR (and other nucleotide models):
Modifications to accommodate rate heterogeneity:
Discrete gamma (Γ) model of rate heterogeneity (rates of evolution for each site are distributed according to a “discretized” gamma distribution).
Proportion of invariant (I) sites model (some sites are assumed to be invariant).
Combinations of the above, e.g., the GTR + Γ + I model.
Molecular clock models:
Strict clock models (uniform rates of evolution across the phylogeny).
Relaxed clock models (rates allowed to vary across branches): correlated relaxed clock (rates of adjacent branches are correlated) and uncorrelated relaxed clock (rates of adjacent branches are independent).
Other kinds of models used in phylogenetics: amino acid sequence models (Dayhoff, PAM, WAG, etc.); codon-based nucleotide models; morphological character models.
Various methods exist for choosing models of molecular evolution, aiming for a model that is neither underparameterized (poor fit to the observed data) nor overparameterized (poor predictive power; computationally intensive).
Criteria: likelihood ratio test; Akaike Information Criterion (AIC); Bayesian Information Criterion (BIC).
Programs by David Posada and colleagues: Modeltest (nucleotides) http://darwin.uvigo.es/software/modeltest.html, ProtTest (amino acids) http://darwin.uvigo.es/software/prottest.html
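The three selection criteria can be illustrated with a short calculation. This is a minimal sketch: the log-likelihood values and the alignment length are invented for illustration, the parameter counts are the ones quoted in these slides (1, 2, and 9), and the formulas are the standard definitions AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, and likelihood ratio statistic = 2 (ln L_complex - ln L_simple). It is not a description of how Modeltest or ProtTest work internally.

```python
import math


def aic(log_l, k):
    """Akaike Information Criterion; lower is better."""
    return 2 * k - 2 * log_l


def bic(log_l, k, n_sites):
    """Bayesian Information Criterion; lower is better."""
    return k * math.log(n_sites) - 2 * log_l


def lrt_statistic(log_l_simple, log_l_complex):
    """Likelihood ratio statistic for two nested models."""
    return 2 * (log_l_complex - log_l_simple)


# Hypothetical fits: model name -> (log-likelihood, number of parameters).
fits = {"JC69": (-2510.0, 1), "K2P": (-2480.0, 2), "GTR": (-2476.5, 9)}
n_sites = 1000  # hypothetical alignment length

best_aic = min(fits, key=lambda name: aic(*fits[name]))
best_bic = min(fits, key=lambda name: bic(*fits[name], n_sites))
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
print("LRT, JC69 vs K2P:", lrt_statistic(fits["JC69"][0], fits["K2P"][0]))
```

With these invented numbers GTR has the highest likelihood, but its extra parameters are penalized enough that both criteria prefer K2P, which is the trade-off between underparameterization and overparameterization described above. The likelihood ratio statistic is normally compared against a chi-square distribution and only applies when one model is a special case of the other.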

Calculating the likelihood of a tree: L = P(data | tree, model)
Consider a tree with four species and branch lengths specified, and an aligned dataset.
For each site, there are four observed tip states and sixteen (4 x 4) possible combinations of ancestral states.
Each reconstruction of ancestral states has a probability, based on the transformation probability matrix and the branch lengths.
The likelihood of the tree for one site is the sum of the probabilities of all reconstructions of ancestral states.
The likelihood of the entire dataset given the tree is the product of the likelihoods of the individual sites (equivalently, the log likelihood of the dataset is the sum of the log likelihoods of the sites).
This calculation is repeated on multiple trees, and the tree that provides the highest likelihood score is preferred. The trees can be generated with exhaustive, branch-and-bound, or heuristic searches (often with starting trees obtained from parsimony or distance searches).
(Graur and Li, Fig. 5.19)

Likelihood calculations, continued. The brute force approach would be to calculate L for all 16 combinations of ancestral states and sum them.
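The brute force sum over 16 ancestral-state combinations can be written out directly for a four-taxon tree. This sketch makes several assumptions that are not in the slides: the model is JC69 with branch lengths in expected substitutions per site, the tree is ((A,B),(C,D)) with two internal nodes, equal base frequencies of 1/4 are used at the first internal node, and the alignment and branch lengths are invented.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability that state x becomes y along a branch of length t
    (t in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(tips, branch_lengths):
    """Likelihood of one site on the tree ((A,B),(C,D)).

    tips: observed states (a, b, c, d).
    branch_lengths: (t_a, t_b, t_c, t_d, t_internal).
    """
    a, b, c, d = tips
    t_a, t_b, t_c, t_d, t_int = branch_lengths
    total = 0.0
    # Sum over all 4 x 4 = 16 combinations of states at the two internal nodes.
    for x, y in product(BASES, repeat=2):
        total += (0.25
                  * jc69_prob(x, a, t_a) * jc69_prob(x, b, t_b)
                  * jc69_prob(x, y, t_int)
                  * jc69_prob(y, c, t_c) * jc69_prob(y, d, t_d))
    return total


def log_likelihood(alignment, branch_lengths):
    """Sum of per-site log likelihoods; alignment is one sequence per taxon."""
    return sum(math.log(site_likelihood(site, branch_lengths))
               for site in zip(*alignment))


# Hypothetical four-taxon alignment (taxa A, B, C, D) and branch lengths.
alignment = ["ACGT", "ACGA", "ATGT", "ATGA"]
lengths = (0.1, 0.2, 0.1, 0.3, 0.05)
print(log_likelihood(alignment, lengths))
```

Summing log likelihoods rather than multiplying raw likelihoods avoids numerical underflow on long alignments.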

Pruning algorithm* (same result, much less time): many calculations can be done just once and then reused many times.
*The pruning algorithm was introduced by: Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368-376.
© Paul Lewis
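The reuse that the pruning algorithm exploits can be sketched with a small recursive function. The assumptions here are mine, not the slides': the model is JC69 with branch lengths in expected substitutions per site, a tree is represented as either a tip name or a list of (subtree, branch length) pairs, and the example tree, branch lengths, and tip states are invented. Each internal node computes one vector of four partial likelihoods, which its parent then reuses instead of re-enumerating every ancestral-state combination.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability that state x becomes y along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial_likelihoods(node, tip_states):
    """For each possible state at `node`, the probability of the tip data below it."""
    if isinstance(node, str):
        # Tip: probability 1 for the observed state, 0 for the others.
        return [1.0 if base == tip_states[node] else 0.0 for base in BASES]
    partial = [1.0] * 4
    for child, t in node:
        # Computed once per child, then reused for all four states at this node.
        child_partial = partial_likelihoods(child, tip_states)
        for i, x in enumerate(BASES):
            partial[i] *= sum(jc69_prob(x, y, t) * child_partial[j]
                              for j, y in enumerate(BASES))
    return partial


def site_likelihood(root, tip_states):
    """Weight the root partial likelihoods by equal base frequencies and sum."""
    return sum(0.25 * value for value in partial_likelihoods(root, tip_states))


# Hypothetical tree ((A,B),(C,D)), rooted at the node joining A and B.
tree = [("A", 0.1), ("B", 0.2), ([("C", 0.1), ("D", 0.3)], 0.05)]
states = {"A": "A", "B": "C", "C": "A", "D": "G"}
print(site_likelihood(tree, states))
```

For four taxa the saving is small, but the brute force sum grows as 4 to the power of the number of internal nodes, while the work done here grows only linearly with the number of nodes.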

Recent algorithms have dramatically improved the efficiency of ML calculations, e.g., RAxML by Alexis Stamatakis (http://icwww.epfl.ch/~stamatak/index.htm) and GARLI by Derrick Zwickl (http://www.bio.utexas.edu/faculty/antisense/garli/Garli.html).
Nonetheless, ML calculations are still a lot of work. Are they worth it?
John Huelsenbeck performed a classic study to test the accuracy of parsimony, ML, and distance methods (Huelsenbeck, J. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44: 17-48).
Simulations:
4-taxon trees with varying branch lengths (including the Felsenstein zone), implying change in 1% to 75%; a total of 1296 (parsimony, distance) or 676 (ML) combinations of branch lengths.
Datasets were “evolved” on each tree under three models: Jukes-Cantor, Kimura 2-parameter (with 5:1 transition/transversion bias), and JC with among-site rate heterogeneity.
1000 simulated datasets were produced for each of the 1296 trees, except for ML, with 100 datasets for each of 676 trees.
Phylogenetic analyses were performed for each tree/dataset combination using 26 analytical methods, with 100, 500, 1000, 10000, or infinite sites.

Results: Results for simple parsimony on data generated under Jukes-Cantor model. 95% isocline indicates region within which the correct tree is estimated 95% of the time. Increasing data improves accuracy, up to a point--in the Felsenstein zone, even infinite data do not allow correct reconstruction of the phylogeny. Parsimony is inconsistent.

Results continued: Results for 26 methods on data generated under Jukes-Cantor model. Black: correct tree is recovered 100% of the time. White: correct tree never recovered. Gray: intermediate. White lines: 95% isocline (region within which the correct tree is estimated 95% of the time). Black lines: 33% isocline (equivalent to picking a tree at random).

What does it mean?
For all methods, more data are better, but:
Parsimony is inconsistent, even when character-state weighting is applied.
Model-based distance corrections improve the performance of the minimum evolution distance method.
UPGMA performs poorly, perhaps because of its implicit assumption of an equal rate of evolution.
ML performs best, but still has trouble in the Felsenstein zone (FZ).

ML appears to be an accurate method across a wide range of tree space (and model space). But ML is still computationally intensive. Intuitively, the likelihood of the data, L = P(data | tree, model), is not really what we care about. (After all, we have observed the data.) What we really want to know is the probability of the tree: P(tree | data). Bayes’ theorem allows estimation of the posterior probability of a tree, given a prior probability (marginal probability) of the tree.
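Bayes' theorem for trees, P(tree | data) = P(data | tree) P(tree) / P(data), can be illustrated with a toy calculation over the three possible unrooted four-taxon topologies. This is a deliberately simplified sketch: the log-likelihood values are invented, branch lengths and model parameters are treated as fixed (a full Bayesian analysis integrates over them, typically by MCMC), and the function name is hypothetical.

```python
import math


def posterior_from_log_likelihoods(log_likelihoods, priors):
    """Posterior probability of each tree from its log likelihood and prior."""
    # Subtract the maximum log likelihood before exponentiating to avoid underflow.
    max_ll = max(log_likelihoods.values())
    weights = {tree: math.exp(ll - max_ll) * priors[tree]
               for tree, ll in log_likelihoods.items()}
    total = sum(weights.values())  # proportional to P(data)
    return {tree: w / total for tree, w in weights.items()}


# Hypothetical log likelihoods for the three unrooted four-taxon trees.
log_likelihoods = {"((A,B),(C,D))": -1200.0,
                   "((A,C),(B,D))": -1203.0,
                   "((A,D),(B,C))": -1204.5}
flat_prior = {tree: 1.0 / 3.0 for tree in log_likelihoods}

for tree, p in posterior_from_log_likelihoods(log_likelihoods, flat_prior).items():
    print(tree, round(p, 3))
```

With a flat prior the ranking of trees is the same as under maximum likelihood; what changes is that each tree now carries a probability that can be read as a measure of support.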

Questions: You should now be able to compare and contrast the advantages and disadvantages of parsimony and likelihood analysis. Which one is better?

Maximum Parsimony versus Maximum Likelihood

Maximum Parsimony:
Chooses the tree that requires the smallest number of character-state changes.
Trees are scored against a character dataset, and the tree with the best score is selected.
The score is a measure of the number of evolutionary changes (e.g., A changing to T) that would be required to generate the data given that particular tree.
Requires the fewest evolutionary changes.
The “most parsimonious tree” assumption is that evolution rarely happens.
An optimality criterion under which we choose the shortest tree among all contenders.
A simple method with an easily understood operation. It does not seem to depend on an explicit model of evolution.
Results can converge on an incorrect solution, and branch lengths are underestimated.
Affected by uninformative or misinformative characters and by long branch attraction.
More?

Maximum Likelihood:
Chooses the tree that maximizes the probability of the data.
Character-based method.
Parametric statistical method, in that it employs an explicit model of character evolution.
Depends on the complete specification of the data and a probability model to describe the data.
The substitution model should be optimized to fit the observed data.
More?

Use of maximum likelihood analysis: maximum likelihood analysis using ITS (Seelan et al. 2015).