Rooting Phylogenetic Trees with Non-reversible Substitution Models Von Bing Yap* and Terry Speed § *Statistics and Applied Probability, National University.

Presentation transcript:

Rooting Phylogenetic Trees with Non-reversible Substitution Models Von Bing Yap* and Terry Speed § *Statistics and Applied Probability, National University of Singapore § Statistics, University of California Reference: BMC Evolutionary Biology 5:1 (2005)

Molecular Phylogenetics From alignments to trees. Many methods: parsimony, distance, stochastic models.

Reversible Models Almost all substitution models are reversible: for example, Pr(anc=A, des=C) = Pr(anc=C, des=A). Rooted trees that give the same unrooted tree are indistinguishable.

Stationary Models Character states have the same frequencies everywhere on the tree. Root can be identified (Yang 1994, Huelsenbeck et al. 2001).

Nonstationary Models Yang and Roberts (1995); Galtier and Gouy (1998)

[Diagram: nested classes of substitution models, REVERSIBLE within STATIONARY within NON-STATIONARY]

The Simplest NSTA Model. Parameters: rooted tree topology; θ, the root base frequencies; Q, the rate matrix (calibrated); branch lengths. No relationship is assumed between θ and Q.

Specialisations If θ is the equilibrium distribution of Q, get STA. If in addition, Q satisfies the detailed balance conditions, get REV.
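The two conditions above can be checked numerically. The following is a minimal sketch, assuming θ is a list of four base frequencies and Q is a 4x4 rate matrix whose rows sum to zero; the function names and the example matrices are illustrative and do not come from the authors' software.

```python
def equilibrium_residual(theta, Q):
    """Largest |(theta Q)_j|; zero when theta is the equilibrium distribution of Q."""
    n = len(theta)
    return max(abs(sum(theta[i] * Q[i][j] for i in range(n))) for j in range(n))


def detailed_balance_residual(theta, Q):
    """Largest |theta_i Q_ij - theta_j Q_ji|; zero under detailed balance."""
    n = len(theta)
    return max(abs(theta[i] * Q[i][j] - theta[j] * Q[j][i])
               for i in range(n) for j in range(n))


def classify(theta, Q, tol=1e-9):
    """Return the most specific class (REV, STA or NSTA) that (theta, Q) satisfies."""
    if equilibrium_residual(theta, Q) > tol:
        return "NSTA"
    if detailed_balance_residual(theta, Q) > tol:
        return "STA"
    return "REV"


# Hypothetical examples (states ordered A, C, G, T).
uniform = [0.25, 0.25, 0.25, 0.25]
# Jukes-Cantor-style matrix: all off-diagonal rates equal.
jc = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
# Cyclic matrix A -> C -> G -> T -> A: uniform is its equilibrium, but it is not reversible.
cyclic = [[-1.0 if i == j else (1.0 if j == (i + 1) % 4 else 0.0) for j in range(4)]
          for i in range(4)]

print(classify(uniform, jc))               # REV
print(classify(uniform, cyclic))           # STA
print(classify([0.4, 0.3, 0.2, 0.1], jc))  # NSTA
```

The cyclic example illustrates why stationarity alone does not imply reversibility: the flow A to C is never balanced by a flow C to A.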

Probability of alignment Felsenstein’s algorithm can be used to compute the probability of one site. Multiplying across sites gives probability of alignment.
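As an illustration of the computation described above, here is a small sketch of Felsenstein's pruning algorithm on a rooted tree with root frequencies θ and a general rate matrix Q. The tree encoding (a leaf is a name, an internal node is a list of (child, branch length) pairs), the example rates, and the helper names are all assumptions made for this sketch, not the authors' implementation. The matrix exponential uses a plain scaling-and-squaring Taylor series to keep the example dependency-free.

```python
import math

BASES = "ACGT"


def rate_matrix(off_diag):
    """Copy a 4x4 table of off-diagonal rates and set each diagonal so rows sum to zero."""
    Q = [row[:] for row in off_diag]
    for i in range(4):
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)
    return Q


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(Q, t, squarings=10, terms=12):
    """Approximate exp(Q t): Taylor series on Q t / 2^squarings, then repeated squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def partial(node, site, Q):
    """L[x] = Pr(bases observed below this node | state x at this node)."""
    if isinstance(node, str):  # leaf: indicator of the observed base
        return [1.0 if b == site[node] else 0.0 for b in BASES]
    L = [1.0] * 4
    for child, length in node:
        P = mat_exp(Q, length)
        Lc = partial(child, site, Q)
        for x in range(4):
            L[x] *= sum(P[x][y] * Lc[y] for y in range(4))
    return L


def site_likelihood(tree, site, theta, Q):
    """Probability of one alignment column: average root partials over theta."""
    L = partial(tree, site, Q)
    return sum(theta[x] * L[x] for x in range(4))


def alignment_loglik(tree, alignment, theta, Q):
    """Sum of per-site log probabilities (product across sites on the log scale)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for s in range(n_sites):
        site = {name: seq[s] for name, seq in alignment.items()}
        total += math.log(site_likelihood(tree, site, theta, Q))
    return total


# Hypothetical inputs, for illustration only.
Q_example = rate_matrix([[0, 0.3, 0.5, 0.2], [0.1, 0, 0.2, 0.6],
                         [0.4, 0.1, 0, 0.3], [0.2, 0.5, 0.1, 0]])
theta_example = [0.4, 0.3, 0.2, 0.1]
tree_example = [([("human", 0.1), ("chimp", 0.1)], 0.05), ("gorilla", 0.15)]
alignment_example = {"human": "ACGT", "chimp": "ACGA", "gorilla": "ATGA"}
print(alignment_loglik(tree_example, alignment_example, theta_example, Q_example))
```

Because θ enters only at the root and Q need not be reversible, moving the root generally changes the value returned, which is what makes the root identifiable under such models.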

Tree Inference Fix a rooted tree. Find the most likely parameter values. The maximum likelihood is the support of the tree. Choose tree with highest support.

Site Heterogeneity Codon positions, secondary structure. Deterministic or random relative rates can be accommodated in the model. Two deterministic models: codon position, and codon position + fast/slow.

Two deterministic models. codon: 3 fixed but unknown rates, one per codon position, with weighted average 1. codonsite: two classes of amino acids (fast/slow) are obtained from the CLUSTAL alignment output; coupled with codon positions, this gives 6 unknown rates with weighted average 1.
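The "weighted average 1" constraint can be sketched as a simple rescaling. This example assumes the weights are the numbers of sites in each class, which the slides do not state explicitly; the function name and the numbers are hypothetical.

```python
def normalise_rates(raw_rates, class_sizes):
    """Rescale relative rates so their site-weighted average equals 1.

    Assumption: each class is weighted by the number of sites it contains.
    """
    total_sites = sum(class_sizes)
    mean = sum(r * n for r, n in zip(raw_rates, class_sizes)) / total_sites
    return [r / mean for r in raw_rates]


# Hypothetical raw rates for codon positions 1, 2, 3 in a 900-site gene.
rates = normalise_rates([0.5, 0.2, 2.0], [300, 300, 300])
print(rates)
```

With the constraint in place, branch lengths keep their interpretation as expected substitutions per site averaged over the rate classes.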

Test Data Sets. A: human, chimp, gorilla. B: human, mouse, rat. C: human, chimp, gorilla, orangutan. D: human, chimp, mouse, rat. E: human, mouse, chicken, frog. 13 mitochondrial protein-coding genes.

Method Unrooted tree is assumed known. For each rooted tree consistent with the unrooted tree, its support is the maximum loglikelihood upon finding the MLE of the process parameters and branch lengths.
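The candidate rooted trees can be enumerated by placing the root on each edge of the unrooted tree; a binary unrooted tree with n leaves has 2n - 3 edges, so there are 3 candidates for three taxa and 5 for four. The sketch below assumes the unrooted tree is given as an edge list with hypothetical internal node labels ("x", "u", "v"); it only lists topologies and does not fit any model.

```python
def subtree(adj, node, parent):
    """Nested-tuple form of the subtree at `node`, seen when arriving from `parent`."""
    children = [n for n in adj[node] if n != parent]
    if not children:
        return node  # leaf: just its name
    return tuple(sorted((subtree(adj, c, node) for c in children), key=str))


def rootings(edges):
    """One rooted tree per edge: the root splits that edge into two child subtrees."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return [(subtree(adj, u, v), subtree(adj, v, u)) for u, v in edges]


# Data set A: a three-taxon star around one internal node.
three_taxa = [("x", "human"), ("x", "chimp"), ("x", "gorilla")]
# Data set D: two cherries joined by an internal edge.
four_taxa = [("u", "human"), ("u", "chimp"), ("u", "v"), ("v", "mouse"), ("v", "rat")]

for tree in rootings(three_taxa):
    print(tree)
print(len(rootings(four_taxa)))
```

Each rooted topology returned here would then be scored by its maximised log-likelihood, as the slide describes.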

Method (continued) Three processes: REV, STA, NSTA Three site models: novar (no variations), codon (3 classes), codonsite (6 classes).

Method (continued) Two outcomes (a) number of genes for which the correct rooted tree is the most likely (b) does the model get the right rooted tree when the loglikelihoods are summed over genes?

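Number of successes (number of genes for which the correct rooted tree is the most likely)

                   A  B  C  D  E
novar      NSTA    (not in transcript)
           STA     4  7  3  4  4
codon      NSTA    (not in transcript)
           STA     3  6  2  2  5
codonsite  NSTA    (not in transcript)
           STA     3  5  1  5  6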

Combined genes: Does it get the right tree?

                   A  B  C  D  E
novar      NSTA    N  Y  Y  Y  N
           STA     Y  N  N  Y  N
codon      NSTA    Y  Y  Y  Y  Y
           STA     Y  N  N  Y  Y
codonsite  NSTA    Y  Y  Y  Y  Y
           STA     Y  N  N  N  N

Discussion (1) In general, NSTA fits much better than STA, which fits much better than REV, by the likelihood ratio test criterion. Not only does NSTA get the right tree more often than STA, it is also more discriminative: the best tree has much larger support compared to the other trees.
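For nested models, a likelihood ratio test compares twice the gain in maximised log-likelihood with a reference distribution. The sketch below uses the usual asymptotic chi-square approximation with degrees of freedom equal to the number of extra parameters; the slides do not say which reference distribution was used, so that choice, the function names, and the log-likelihood values are all assumptions for illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df (x >= 0)."""
    h = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))
    # Odd df: erfc term plus a finite sum with half-integer gamma functions.
    tail = math.erfc(math.sqrt(h))
    tail += math.exp(-h) * sum(h ** (j + 0.5) / math.gamma(j + 1.5)
                               for j in range((df - 1) // 2))
    return tail


def likelihood_ratio_test(loglik_null, loglik_alt, extra_params):
    """Return the LRT statistic 2 * (lnL_alt - lnL_null) and its chi-square p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical maximised log-likelihoods for one gene: STA (null) vs NSTA (3 extra parameters).
stat, p = likelihood_ratio_test(-5210.4, -5198.7, 3)
print(stat, p)
```

The same function applies to STA against REV by changing the number of extra parameters.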

Discussion (2) The codon+site model of site variation is very crude, which may explain why it performs worse than the codon model. Better methods of modelling site variation are needed, as is a comparison with a random-rates model such as the discrete gamma.

Discussion (3) The NSTA only has 3 more parameters than STA, and 6 more than REV, so the extra computation is not heavy. Also, since it is possible to identify the root, perhaps NSTA should be used routinely.
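The parameter differences quoted above can be reproduced by a simple count. This is a sketch under assumptions the slides do not spell out: four nucleotide states, REV parameterised as a general time-reversible model (6 exchangeabilities plus 3 free equilibrium frequencies), and one degree of freedom removed from each rate matrix by calibration. Tree topology and branch lengths are common to all three models and are left out.

```latex
\begin{align*}
\text{REV:}\quad  & \underbrace{6}_{\text{exchangeabilities}}
                   + \underbrace{3}_{\text{equilibrium frequencies}}
                   - \underbrace{1}_{\text{calibration}} = 8,\\
\text{STA:}\quad  & \underbrace{12}_{\text{off-diagonal rates of } Q}
                   - \underbrace{1}_{\text{calibration}} = 11,\\
\text{NSTA:}\quad & 11 + \underbrace{3}_{\text{free entries of } \theta} = 14.
\end{align*}
```

Under this count NSTA has 14 - 11 = 3 more parameters than STA and 14 - 8 = 6 more than REV, in agreement with the figures on the slide.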

Discussion (4) Constraint on NSTA: base compositions of sequences that are equally distant from the root are the same. This may not hold. Software freely available upon request.