Bayes estimators for phylogenetic reconstruction Ruriko Yoshida.



Origins of Species
[Figure 19.1, Genomes 3 (© Garland Science 2007)]

Tree (or web) of life
- The history of life is written in the genes.
[Figure: tree of life with branches labeled bacteria, archaea, eukarya]
Mount DW (2001) Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

Gene tree on a Species tree  Maddison WP (1997) Gene trees in species trees. Systematic Biology 46:

Phylogenetic tree reconstruction
- Maximum likelihood estimation (MLE) methods: these methods describe evolution in terms of a discrete-state, continuous-time Markov process.
- The Balanced Minimum Evolution (BME) method: this is a distance-based method and a weighted least-squares method. The Neighbor-Joining (NJ) method is a greedy algorithm for the BME criterion (Steel and Gascuel, 2008).
- Bayesian inference for trees: use Bayes' theorem and MCMC to estimate the posterior distribution rather than obtaining a point estimate.

Closeness
- Tree uncertainty is a pervasive issue in phylogenetics.
- To help cope with tree uncertainty, bootstrapping and Bayesian sampling methods provide a collection of possible trees instead of a single tree estimate.
- Using bootstrapping or Bayesian sampling, one common practice is to identify highly supported tree features (e.g. splits) which occur in almost all the tree samples. Highly supported features are regarded as likely features of the true tree.
- Similarly, in simulation studies it is common to judge reconstruction methods based on how close they get to the true tree.

Bayes estimators
- This leads us to ask whether reconstruction accuracy (i.e. closeness to the true tree) can be improved by attempting to optimize accuracy directly.
- Even though the true tree is unknown, we can still optimize reconstruction accuracy using a Bayesian approach.
- In the Bayesian view, the true tree is a random variable $T$ distributed according to the posterior distribution $P(T \mid D)$, where $D$ is input data such as sequence data.

Bayes estimators
- If $d(\cdot,\cdot)$ measures the distance between trees, and $T'$ is a tree estimate, then the expected distance between $T'$ and the true tree is $E_{T \sim P(T \mid D)}\, d(T, T')$.
- Thus, to maximize reconstruction accuracy, we should choose our tree estimate to be $T^* = \arg\min_{T'} E_{T \sim P(T \mid D)}\, d(T, T')$, where $T^*$ is known as a Bayes estimator.
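The definition above can be approximated from posterior samples. The following is a minimal sketch, assuming the posterior is represented by a finite list of sampled trees and the search is restricted to a finite list of candidate trees; the function names and the toy numeric "trees" are illustrative only and do not come from the slides.

```python
def expected_distance(candidate, samples, d):
    """Average distance from `candidate` to the posterior samples.

    This is the empirical version of E_{T ~ P(T|D)} d(T, T').
    """
    return sum(d(t, candidate) for t in samples) / len(samples)


def empirical_bayes_estimator(candidates, samples, d):
    """Return the candidate that minimizes the empirical expected distance."""
    return min(candidates, key=lambda c: expected_distance(c, samples, d))


# Toy stand-in: "trees" are plain numbers and d is the squared difference.
# A real application would pass tree objects and a tree distance instead.
toy_samples = [1, 1, 2, 10]
toy_candidates = [1, 2, 10]
toy_distance = lambda t, c: (t - c) ** 2

best = empirical_bayes_estimator(toy_candidates, toy_samples, toy_distance)
print(best)
```

Restricting the minimization to an explicit candidate list is a simplification; with real trees the candidate set is far too large to enumerate, which is why a search heuristic is needed.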

Tree Distances
- Many popular distances between trees can be easily expressed as a squared Euclidean distance, after embedding trees in an appropriately chosen vector space.
- Important examples include the Robinson-Foulds (RF) distance (symmetric difference), the quartet distance, and the squared path difference.
- Holder showed that the Bayes estimator under the RF distance is the majority-rule consensus tree (Holder, Systematic Biology 2008).
- We focus on squared Euclidean distances, specifically the squared path difference.
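To illustrate the embedding idea for the Robinson-Foulds case, here is a small sketch, assuming each tree is given as a set of nontrivial splits. The two 5-taxon trees and the convention of storing each split as the side not containing taxon 1 are assumptions made for this example. Each tree becomes a 0/1 indicator vector over splits, and the squared Euclidean distance between the vectors equals the size of the symmetric difference.

```python
def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: size of the symmetric difference of split sets."""
    return len(splits_a ^ splits_b)


def split_indicator(splits, all_splits):
    """Embed a tree as a 0/1 vector with one coordinate per split in `all_splits`."""
    return [1 if s in splits else 0 for s in all_splits]


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical 5-taxon trees; each nontrivial split is stored as the side
# that does not contain taxon 1.
t1_splits = {frozenset({3, 4, 5}), frozenset({4, 5})}  # ((1,2),3,(4,5))
t2_splits = {frozenset({3, 4, 5}), frozenset({3, 4})}  # ((1,2),5,(3,4))

# Fix an ordering of the coordinates so both vectors use the same axes.
all_splits = sorted(t1_splits | t2_splits, key=sorted)
x1 = split_indicator(t1_splits, all_splits)
x2 = split_indicator(t2_splits, all_splits)

print(rf_distance(t1_splits, t2_splits), squared_euclidean(x1, x2))
```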

Path Difference Metrics
- The path difference metric is a topological distance, i.e. it depends only on the topologies of the trees.
- The path difference vector $v_p(T) \in \mathbb{R}^{n(n-1)/2}$, where $n$ is the number of taxa, is the integer vector whose $ij$-th entry, $i < j$, counts the number of edges between leaves $i$ and $j$ in the tree $T$.
- The path difference metric was studied in (Steel and Penny, 1993).
- The squared path difference is $d_p(T', T) = \| v_p(T) - v_p(T') \|^2$.

Example
For the trees $T_1$ and $T_2$, we have:
$v_p(T_1) = (2,3,4,4,3,4,4,3,3,2)$
$v_p(T_2) = (2,4,4,3,4,4,3,2,3,3)$
$d_p(T_1, T_2) = \| v_p(T_1) - v_p(T_2) \|^2 = 6$
[Figure: the trees $T_1$ and $T_2$]
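The example can be checked numerically. The sketch below assumes each tree is given as an edge list and counts edges between leaves by breadth-first search. The figure with the two trees is not available in this transcript, so the topologies ((1,2),3,(4,5)) and ((1,2),5,(3,4)) used here were inferred from the two vectors; the internal node names "a", "b", "c" are arbitrary.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into an adjacency dictionary."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_counts_from(adj, source):
    """Breadth-first search: number of edges from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_difference_vector(edges, leaves):
    """Entries ordered as (1,2), (1,3), ..., (n-1,n) for the given leaf order."""
    adj = build_adjacency(edges)
    dist = {leaf: edge_counts_from(adj, leaf) for leaf in leaves}
    return [dist[i][j] for i, j in combinations(leaves, 2)]


def squared_path_difference(edges_a, edges_b, leaves):
    va = path_difference_vector(edges_a, leaves)
    vb = path_difference_vector(edges_b, leaves)
    return sum((x - y) ** 2 for x, y in zip(va, vb))


leaves = [1, 2, 3, 4, 5]
# Topologies inferred from the vectors in the example (assumption).
t1 = [(1, "a"), (2, "a"), ("a", "b"), (3, "b"), ("b", "c"), (4, "c"), (5, "c")]
t2 = [(1, "a"), (2, "a"), ("a", "b"), (5, "b"), ("b", "c"), (3, "c"), (4, "c")]

print(path_difference_vector(t1, leaves))
print(path_difference_vector(t2, leaves))
print(squared_path_difference(t1, t2, leaves))
```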

Optimization
We want to find the optimal solution $T^*$ such that $T^* = \arg\min_{T'} E_{T \sim P(T \mid D)}\, d_p(T, T')$. Since the number of tree topologies on $n$ taxa grows super-exponentially in $n$ (there are $(2n-5)!!$ unrooted binary topologies), computing the Bayes estimator $T^*$ under a general distance function can be computationally hard (NP-hard).
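A short derivation, not on the slides, suggests why squared Euclidean distances are convenient for this optimization. Writing $\mu$ for the posterior mean of the path difference vector (a symbol introduced only for this sketch), the expected loss splits into a term that does not depend on $T'$ plus the squared distance from $v_p(T')$ to $\mu$:

$$
\begin{aligned}
\mu &= E_{T \sim P(T \mid D)}\, v_p(T), \\
E_{T \sim P(T \mid D)}\, \| v_p(T) - v_p(T') \|^2 &= E_{T \sim P(T \mid D)}\, \| v_p(T) - \mu \|^2 + \| \mu - v_p(T') \|^2, \\
T^* &= \arg\min_{T'} \| v_p(T') - \mu \|^2 .
\end{aligned}
$$

The cross term vanishes because $E_{T \sim P(T \mid D)}[\, v_p(T) - \mu \,] = 0$. Under this reading, the search only needs the mean vector, which could be approximated by averaging $v_p$ over posterior samples; the minimization over tree topologies $T'$ remains the hard part.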

Hill Climbing Techniques
- However, hill climbing techniques such as those used in ML methods often work quite well in practice for tree reconstruction. Hill climbing techniques can similarly be used to find local minima of the empirical expected loss.
- Hill climbing requires a way to move from one tree topology to another. Three types of combinatorial tree moves are often used for this purpose: Nearest Neighbor Interchange (NNI), Subtree-Prune-and-Regraft (SPR), and Tree-Bisection-and-Reconnection (TBR).
- Here we use NNI moves.
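As a rough illustration of the hill climbing idea, here is a generic steepest-descent sketch. It assumes a `neighbors` function that lists the states reachable in one move (for trees, this would be the NNI neighbors) and a `loss` function (for trees, the empirical expected distance). All names are hypothetical, and the toy example uses integers rather than trees.

```python
def hill_climb(start, neighbors, loss, max_steps=1000):
    """Move to the best neighbor until no neighbor improves the loss.

    Returns the final state and its loss; the result is a local minimum
    (or the state reached after `max_steps` moves), not necessarily global.
    """
    current, current_loss = start, loss(start)
    for _ in range(max_steps):
        best = min(neighbors(current), key=loss, default=None)
        if best is None or loss(best) >= current_loss:
            break  # no improving move: local minimum reached
        current, current_loss = best, loss(best)
    return current, current_loss


# Toy stand-in for tree space: states are integers, a move changes the state by 1.
toy_neighbors = lambda x: [x - 1, x + 1]
toy_loss = lambda x: (x - 7) ** 2

result = hill_climb(0, toy_neighbors, toy_loss)
print(result)
```

Because the search only accepts improving moves, the outcome depends on the starting state; in tree space a reasonable starting tree (for example a consensus or NJ tree) would typically be supplied.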

Nearest Neighbor Interchange
Here we assume unrooted trees (if we want a rooted tree, we add an outgroup and treat the result as an unrooted tree with n + 1 taxa). A, B, C, and D are clades.

Nearest Neighbor Interchange
Treating the four clades as leaves, we can think of this as a tree with 4 leaves. Since there are only 3 unrooted binary tree topologies on 4 leaves, an NNI move swaps the current topology for one of the other two.
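The NNI move can be sketched on an adjacency-dictionary representation of an unrooted binary tree. This is an illustrative implementation under stated assumptions: node labels are strings, leaves have degree 1, internal nodes have degree 3, and the labels "u" and "v" for the internal nodes are arbitrary. For every internal edge it produces the two alternative topologies obtained by exchanging one subtree from each side.

```python
def nni_neighbors(adj):
    """Return all trees one NNI move away from `adj`.

    `adj` maps each node label (a string) to the set of its neighbors.
    The input is not modified; each neighbor is a fresh adjacency dict.
    """
    result = []
    for u in adj:
        for v in adj[u]:
            # Only internal edges, and each undirected edge only once.
            if len(adj[u]) != 3 or len(adj[v]) != 3 or not u < v:
                continue
            _, b = sorted(adj[u] - {v})  # subtree of u that will be moved
            for swap in sorted(adj[v] - {u}):  # subtree of v to exchange with b
                new = {node: set(nbrs) for node, nbrs in adj.items()}
                new[u].remove(b)
                new[u].add(swap)
                new[v].remove(swap)
                new[v].add(b)
                new[b].remove(u)
                new[b].add(v)
                new[swap].remove(v)
                new[swap].add(u)
                result.append(new)
    return result


# Four clades A, B, C, D around one internal edge u-v: topology AB|CD.
quartet = {
    "u": {"A", "B", "v"}, "v": {"C", "D", "u"},
    "A": {"u"}, "B": {"u"}, "C": {"v"}, "D": {"v"},
}
for tree in nni_neighbors(quartet):
    print(sorted(tree["u"] - {"v"}), "|", sorted(tree["v"] - {"u"}))
```

A tree with n leaves has n - 3 internal edges, so this function returns 2(n - 3) neighbors, which keeps each hill climbing step cheap compared with SPR or TBR moves.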

Simulations
- For simulated data, we used the first 1000 examples from the data set presented in (Guindon and Gascuel, 2003).
- Trees on 40 taxa were generated according to a Markov process. For each generated tree, 40 homologous sequences (no indels) of length 500 were generated under the Kimura two-parameter (K2P) model, with a transition/transversion ratio of 2.0. Specifically, the Seq-Gen program was used to generate the sequences.
- The data is available from the website

Simulations
- For each set of homologous sequences D in the simulated data, we used the software MrBayes to obtain 15,000 samples from the posterior distribution P(T|D). Specifically, we ran MrBayes under the K2P model for 1,000,000 generations in total, sampled every 50 generations, and discarded the initial 25% of samples as burn-in.
- For each data set, we also computed a consensus tree via MrBayes and an ML tree estimate using the hill climbing software PHYML, as described in the paper. We also computed an NJ tree using the software PHYLIP, with pairwise distances computed by PHYLIP.
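The sample count follows from the run settings. A quick check, assuming exactly one sample is recorded every 50 generations and ignoring any extra sample taken at the starting state (the variable names are illustrative, not MrBayes options):

```python
total_generations = 1_000_000
sample_every = 50          # one sample per 50 generations
burn_in_fraction = 0.25    # initial 25% of samples discarded

total_samples = total_generations // sample_every
retained_samples = int(total_samples * (1 - burn_in_fraction))

print(total_samples, retained_samples)
```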

Simulations

Simulations

Acknowledgments
NIGMS, NSF
Collaborators:
- Peter Huggins (Fall 2010, UK)
- David Haws (Stat, UK)
- Wenbin Li (CS, UK)
- Thomas Friedrich (Stat, UK)
- Jinze Liu (CS, UK)
P. Huggins, W. Li, D. Haws, T. Friedrich, J. Liu, and R. Yoshida. Bayes estimators for phylogenetic reconstruction, accepted conditionally in Systematic Biology. Available at

“Are you done yet?” Thank you for your attention!