Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
An Introduction to Phylogenetic Methods
Probabilistic Modeling of Molecular Evolution Using Excel, AgentSheets, and R Jeff Krause (Shodor)
Lecture 3 Molecular Evolution and Phylogeny. Facts on the molecular basis of life Every life forms is genome based Genomes evolves There are large numbers.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Phylogenetic Trees Understand the history and diversity of life. Systematics. –Study of biological diversity in evolutionary context. –Phylogeny is evolutionary.
Phylogenetic reconstruction
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
BIOE 109 Summer 2009 Lecture 4- Part II Phylogenetic Inference.
UPGMA and FM are distance based methods. UPGMA enforces the Molecular Clock Assumption. FM (Fitch-Margoliash) relieves that restriction, but still enforces.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
Distance Methods. Distance Estimates attempt to estimate the mean number of changes per site since 2 species (sequences) split from each other Simply.
Bioinformatics and Phylogenetic Analysis
Course overview Tuesday lecture –Those not presenting turn in short review of a paper using the method being discussed Thursday computer lab –Turn in short.
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
Similar Sequence Similar Function Charles Yan Spring 2006.
CISC667, F05, Lec16, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (III) Probabilistic methods.
Probabilistic methods for phylogenetic trees (Part 2)
Lecture 13 – Performance of Methods Folks often use the term “reliability” without a very clear definition of what it is. Methods of assessing performance.
Building Phylogenies Parsimony 1. Methods Distance-based Parsimony Maximum likelihood.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Characterizing the Phylogenetic Tree-Search Problem Daniel Money And Simon Whelan ~Anusha Sura.
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Terminology of phylogenetic trees
BINF6201/8201 Molecular phylogenetic methods
Lecture 15 - Hypothesis Testing A. Competing a priori hypotheses - Paired-Sites Tests Null Hypothesis : There is no difference in support for one tree.
Molecular phylogenetics
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
How to Raise the Dead: The Nuts & Bolts of Ancestral Sequence Reconstruction Jeffrey Boucher Theobald Laboratory.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
BINF6201/8201 Molecular phylogenetic methods
Phylogenetics and Coalescence Lab 9 October 24, 2012.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Molecular phylogenetics 4 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Phylogeny GENE why is coalescent theory important for understanding phylogenetics (species trees)? coalescent theory lets us test our assumptions.
A brief introduction to phylogenetics
Lecture 2: Principles of Phylogenetics
Introduction to Phylogenetics
Calculating branch lengths from distances. ABC A B C----- a b c.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Phylogeny and Genome Biology Andrew Jackson Wellcome Trust Sanger Institute Changes: Type program name to start Always Cd to phyml directory before starting.
Molecular Systematics
 Tue Introduction to models (Jarno)  Thu Distance-based methods (Jarno)  Fri ML analyses (Jarno)  Mon Assessing hypotheses.
Why do trees?. Phylogeny 101 OTUsoperational taxonomic units: species, populations, individuals Nodes internal (often ancestors) Nodes external (terminal,
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
Sequence Alignment.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Step 3: Tools Database Searching
Systematics and Phylogenetics Ch. 23.1, 23.2, 23.4, 23.5, and 23.7.
Bayesian statistics named after the Reverend Mr Bayes based on the concept that you can estimate the statistical properties of a system after measuting.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
Building Phylogenies Maximum Likelihood. Methods Distance-based Parsimony Maximum likelihood.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Lecture 15 - Hypothesis Testing
Phylogenetic basis of systematics
Bayesian inference Presented by Amir Hadadi
Goals of Phylogenetic Analysis
Methods of molecular phylogeny
Chapter 19 Molecular Phylogenetics
The Most General Markov Substitution Model on an Unrooted Tree
Molecular data assisted morphological analyses
Presentation transcript:

Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003

Phylogeny Estimation Traditional approaches –Neighbour-joining algorithm –Tree searches with optimality criterion Parsimony Maximum likelihood Bayesian Approaches

Traditional approaches Neighbour-joining algorithm –extremely popular. –relatively fast. –performs well when the divergence between sequences is low. The first step in the algorithm is converting the DNA or protein sequences into a distance matrix that represents the evolutionary distance between sequences H H H H H

Traditional approaches Neighbour-joining algorithm –A serious weakness for distance methods, is that the observed differences between sequences are not accurate reflections of the evolutionary distances between them Multiple substitutions

Traditional approaches Parsimony –In contrast to distance-based approaches, parsimony and ML map the history of gene sequences onto a tree. –In parsimony, the score is simply the minimum number of mutations that could possibly produce the data. –Parsimony has a few obvious disadvantages. The score of a tree is completely determined by the minimum number of mutations among all of the reconstructions of ancestral sequences. Another serious drawback of parsimony arises because it fails to account for the fact that the number of changes is unlikely to be equal on all branches in the tree.

Traditional approaches Maximum likelihood In ML, a hypothesis is judged by how well it predicts the observed data; the tree that has the highest probability of producing the observed sequences is preferred. To use this approach, we must be able to calculate the probability of a data set given a phylogenetic tree. –model of sequence evolution that describes the relative probability of various events. –These probabilities take into account the possibility of unseen events. From many perspectives, ML is the most appealing way to estimate phylogenies. All possible mutational pathways that are compatible with the data are considered and the likelihood function is known to be a consistent and powerful basis for statistical inference in general.

Bayesian phylogenetics Bayesian approaches Bayesian approaches to phylogenetics are relatively new, but they are already generating a great deal of excitement because the primary analysis produces a tree estimate and measures of uncertainty for the groups on the tree. The field of Bayesian statistics is closely allied with ML.

Bayesian vs ML Maximum likelihood vs. Bayesian estimation –Maximum likelihood search for tree that maximizes the chance of seeing the data (P (Data | Tree)) –Bayesian inference search for tree that maximizes the chance of seeing the tree given the data (P (Tree | Data))

The phylogenetic inference process Data collecting, the first step Typically, a few outgroup sequences are included to root the tree. Insertions and deletions obscure which of the sites are homologous. Multiple-sequence alignment is the process of adding gaps to a matrix of data so that the nucleotides in one column of the matrix are related to each other by descent from a common ancestral residue.

The phylogenetic inference process In addition to the data, the scientist must choose a model of sequence evolution. Increasing model complexity improves the fit to the data but also increases variance in estimated parameters. Model selection strategies attempt to find the appropriate level of complexity on the basis of the available data. Model complexity can often lead to computational intractability. model Free parameters GTR 8 TN93 5 HKY85 F84 4 F81 3 K80 1 JC69 0

The phylogenetic inference process

Bootstrapping The bootstrapping approach involves the generation of pseudoreplicate data sets by re- sampling with replacemtent the sites in the original data matrix. When optimality-criterion methods are used, a tree search is performed for each data set, and the resulting tree is added to the final collection of trees.

Markov chain Monte Carlo The Markov chain Monte Carlo (MCMC) methodology is similar to the tree-searching algorithm. From an initial tree, a new tree is proposed. The moves that change and the tree must involve a random choice. The MCMC algorithm also specifies the rules for when to accept or reject a tree.

Markov chain Monte Carlo

heated landscape Markov chain Monte Carlo Metropolis Coupled Markov Chain Monte Carlo

Markov chain Monte Carlo

Bootstrapping and MCMC generate a sample of trees

Note that MCMC yields a much larger sample of trees in the same computational time, because it produces one tree for every proposal cycle versus one tree per tree search in the traditional approach. However, the sample of trees produced by MCMC is highly auto-correlated. As a result, millions of cycles through MCMC are usually required, whereas many fewer (of the order of 1,000) bootstrap replicates are sufficient for most problems.

Bayesian phylogenetics Bayesian approaches Bayesian methods are exciting because they allow complex models of sequence evolution to be implemented. –estimating divergence times –finding the residues that are important to natural selection –detecting recombination points

Comparison of Methods

Conclusions The estimation of phylogenies has become a regular step in the analysis of new gene sequences. Still too early to tell if Bayesian approaches will revolutionize tree estimation in general, but it is already clear that MCMC-based approaches are extending the field by answering previously intractable questions.

References Swofford, D. L., Olsen, G. J., Waddell, P. J. & Hillis, D. M. in Molecular Systematics (eds Hillis, D. M., Moritz, C. & Mable, B. K.) 407–514 (Sinauer Associates, Sunderland, Massachusetts, 1996). An excellent review of parsimony, ML and distance approaches to phylogenetic inference. Goldman, N., Anderson, J. P. & Rodrigo, A. G. Likelihoodbased tests of topologies in phylogenetics. Syst. Biol. 49, 652–670 (2000). A useful taxonomy of the hypothesis-testing approaches for likelihood-based phylogenetics.