FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES Robert Hirt Institute for Cell and Molecular Biosciences, Newcastle University, UK.

Agenda
Remind you that molecular phylogenetics is complex
–The more you know about the compared proteins and the method used, the better
Try to avoid the black box approach as much as possible!
Give an overview of the phylogenetic methods and software used with protein alignments - some practical issues…

From DNA/protein sequences to trees
Modified from Hillis et al., (1993). Methods in Enzymology 224,
[Flowchart: Sequence data -> Align sequences -> Phylogenetic signal? Patterns -> evolutionary processes? -> Choose a method -> Calculate or estimate best fit tree -> Test phylogenetic reliability]
Distance methods: distance calculation (which model?), then a single tree (NJ) or an optimality criterion (LS, ME)
Character based methods: MP (weighting of sites, changes?), ML (model?), MB (model?)

Phylogenies from proteins
Parsimony
Distance matrices
Maximum likelihood
Bayesian methods

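Sequence alignment (characters):
A  G G G C G
B  A G G C T
C  A T T C T
D  A T T T G

Distance matrix:
   A   B   C   D
A  -   d1  d2  d3
B      -   d4  d5
C          -   d6
D              -

Character based methods (MP, ML, BM) work on the alignment; distance methods work on the distance matrix. ML, BM and distance methods use an explicit model of sequence evolution.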

Phylogenetic trees from protein alignments
Distance methods - model for distance estimation
–Simple formula (e.g. Kimura, use of Dij, LogDet)
–Complex models: probability of amino acid changes (Mutational Data Matrices, MDM), site rate heterogeneity
Maximum likelihood and Bayesian methods - MDM based models are used for lnL calculations of sites -> lnL of trees
–Site rate heterogeneity
–Homogeneous versus heterogeneous models
–Estimations of data specific rate matrices (amino acid groupings - GTR like)

Software: an overview
SEAVIEW - alignment editing, parsimony, simple distances, ML
PHYLIPv - distance, MP, and ML methods (and more)
–Some complex protein models: PAM, JTT ± site rate heterogeneity
–Bootstrapping - bootstrap support values
TREE-PUZZLEv5.2 - distance and a ML method
–ML - quartet method
–Complex protein models: JTT, WAG… matrices ± site rate heterogeneity
–From quartets to n-taxa tree - PUZZLE support values
–Some sequence tree statistics - aa frequency and heterogeneity between sequences
PHYML - ML methods (protein and DNA) - in SEAVIEW
MRBAYES - Bayesian
–Complex protein models: JTT, WAG… matrices ± site rate heterogeneity
–Data partitioning
–Posteriors as support values
PHYLOBAYESv2.3 (CAT model)
P4
–All the things you can dream of… almost… ask Peter Foster
–Heterogeneous models among taxa or sites
–Estimation of amino acid rate matrices for grouped categories (6x6 rate matrices can be calculated)

PHYLIP3.62
Protpars: parsimony
Protdist: models for distance calculations:
–PAM1, JTT, Kimura formula (PAM like), others...
–Correction for rate heterogeneity between sites! Removal of invariant sites? (not estimated, see TREE-PUZZLE5.2!)
NJ and LS distance trees (± molecular clock)
Proml: protein ML analysis (no estimation of site rate heterogeneity - see TREE-PUZZLE5.2)
–Coefficient of variation (CV) versus alpha shape parameter: $\mathrm{CV} = 1/\sqrt{\alpha}$
Bootstrapping

Distance methods
A two step approach - two choices!
1) Estimate all pairwise distances
Choose a method (100s) - has an explicit model for sequence evolution
–Simple formula
–Complex models - PAM, JTT, site rate variation
2) Estimate a tree from the distance matrix
Choose a method: with (ME, LS) or without an optimality criterion (NJ)?

Simple and complex models
$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^{2}/5\right)$ (Kimura)
Simple and fast but can be unreliable - underestimates changes, hence distances, which can lead to misleading trees - PHYLIP, CLUSTALX, SEAVIEW
Dij is the fraction of residues that differ between sequences i and j (Dij = 1 - Sij)
dij = ML[P( ), (Γ, pinv), Xij] (bad notation!)
ML is used to estimate the dij based on the sequence alignment and a given model - MDM, gamma shape parameter and pinv - PHYLIP, PUZZLE. Each site is used for the calculation of dij, not just the Dij value.
More realistic complexity in relation to protein evolution and the subtle patterns of amino acid exchange rates…
Note: the values of the different parameters (alpha + pinv) have to be either estimated, or simply chosen (MDM), prior to the dij calculations
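The Kimura formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the two short example sequences are made up here, and skipping gapped columns is an assumption, not necessarily what PHYLIP, CLUSTALX or SEAVIEW do.

```python
import math

def p_distance(seq_i, seq_j):
    """Dij: fraction of differing residues between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption
    made for this sketch).
    """
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_i, seq_j) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def kimura_distance(D):
    """dij = -ln(1 - Dij - Dij^2 / 5), the Kimura correction for proteins."""
    inner = 1.0 - D - (D * D) / 5.0
    if inner <= 0:
        raise ValueError("sequences too divergent for the Kimura formula")
    return -math.log(inner)

# Hypothetical aligned fragments, for illustration only
D = p_distance("YTSLLLSRQA", "YASLLWSRQA")
d = kimura_distance(D)
print(f"Dij = {D:.3f}, dij = {d:.3f}")
```

The corrected distance is always at least as large as Dij, and the formula breaks down for very divergent pairs, where the argument of the logarithm reaches zero.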

1) Choosing/estimating the parameters of a model
1) Mutation Data Matrices: PAM, JTT, WAG…
–What are the properties of the protein alignment (% identity, amino acid frequencies, globular, membrane)?
–Can be corrected for the dataset-specific amino acid frequencies (-F) - in some software only
–Compare ML of different models for a given data and tree -> ModelGenerator and ProtTest are designed for this
2) Alpha and pInv values have to be estimated on a tree
–TREE-PUZZLE can do that. Reasonable trees give similar values…
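One common way to compare the ML of different models on the same data and tree is an information criterion such as AIC, which penalises extra parameters. The sketch below assumes AIC = 2k - 2 lnL; the model names, lnL values and parameter counts are invented for illustration and are not the output of ModelGenerator or ProtTest.

```python
def aic(lnL, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * lnL

def rank_models(fits):
    """Sort model names by AIC, best first.

    fits maps a model name to (lnL, number of free parameters).
    """
    return sorted(fits, key=lambda name: aic(*fits[name]))

# Hypothetical values, for illustration only
fits = {
    "JTT": (-10250.0, 0),
    "WAG": (-10210.0, 0),
    "WAG+G": (-10050.0, 1),
    "WAG+G+I": (-10049.5, 2),
}
for name in rank_models(fits):
    print(name, aic(*fits[name]))
```

With these made-up numbers the extra pinv parameter of WAG+G+I does not improve the likelihood enough to pay for itself, so WAG+G ranks first.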

2) Inferring the phylogenetic trees from the estimated dij
a) Without an optimality criterion
Neighbor-joining (NJ) (PHYLIP, SEAVIEW)
–Different algorithms exist - improvements in computation
–If the dij are additive, or close to it, NJ will find the ME tree
–BIONJ (SEAVIEW), WEIGHBOR, FastME
b) With an optimality criterion
–Least squares (FITCH)
–Minimum evolution (in PAUP - now also PHYLIP)
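Whether a set of dij is additive can be checked with the four-point condition: for every quartet of taxa, the two largest of the three pairwise sums must be equal. The sketch below is a minimal illustration; the function name, the tolerance and the toy matrix (distances read off a small made-up tree) are assumptions.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Four-point condition on a symmetric distance matrix (list of lists).

    For each quartet i, j, k, l the three sums d_ij + d_kl, d_ik + d_jl and
    d_il + d_jk are computed; additivity requires the two largest to be equal.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Path lengths on a hypothetical tree ((A:1,B:2):1,(C:3,D:4))
tree_like = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
print(is_additive(tree_like))
```

Real distance estimates are rarely exactly additive, so in practice one looks at how far the two largest sums are from each other rather than at a strict yes/no answer.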

Fitch Margoliash Method 1968
Seeks to minimise the weighted squared deviation of the tree path length distances from the distance estimates - uses an objective function
$$E = \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} w_{ij} \, \left| d_{ij} - p_{ij} \right|^{\alpha}$$
E = the error of fitting dij to pij
T = number of taxa
wij = the weighting scheme
if α = 2: weighted least squares
dij = F(Xij): pairwise distance estimate - from the data using a specific model (or simply Dij)
pij = length of path between i and j implied on a given tree
dij = pij for additive datasets (all methods will find the right tree)
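The objective function E can be written directly as a double sum over taxon pairs. In this sketch the function name is hypothetical and the default weighting is uniform (wij = 1); passing `weight=lambda dij: 1.0 / dij ** 2` gives the 1/dij² weighting usually associated with Fitch and Margoliash.

```python
def fitting_error(d, p, alpha=2, weight=lambda dij: 1.0):
    """E = sum over i < j of w_ij * |d_ij - p_ij| ** alpha.

    d: estimated pairwise distances, p: path lengths implied by a tree,
    both as symmetric T x T lists of lists.
    """
    T = len(d)
    return sum(
        weight(d[i][j]) * abs(d[i][j] - p[i][j]) ** alpha
        for i in range(T - 1)
        for j in range(i + 1, T)
    )

# Toy example: estimates d against tree path lengths p (values are made up)
d = [[0, 3, 5, 6], [3, 0, 6, 7], [5, 6, 0, 7], [6, 7, 7, 0]]
p = [[0, 4, 5, 6], [4, 0, 6, 7], [5, 6, 0, 7], [6, 7, 7, 0]]
print(fitting_error(d, p))
print(fitting_error(d, p, weight=lambda dij: 1.0 / dij ** 2))
```

A least squares tree search would evaluate this error for each candidate topology, after fitting its branch lengths, and keep the tree with the smallest E.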

Minimum Evolution Method
For each possible alternative tree one can estimate the length of each branch from the estimated pairwise distances between taxa (using the LS method) and then compute the sum (S) of all branch length estimates. The minimum evolution criterion is to choose the tree with the smallest S value
$$S = \sum_{k=1}^{2T-3} V_k$$
with $V_k$ being the length of branch k on a tree

Distance methods
Advantages:
–Can be fast (NJ)
–Some distance methods (LogDet) can be superior to more complex approaches (ML) in some conditions (shown for DNA alignments)
–Distance trees can be used to estimate parameter values for more complex models, which are then used in a ML method
–Provide trees with branch lengths
Disadvantages:
–Can lose information by reducing the sequence alignment into pairwise distances
–Can produce misleading trees (like any method), in particular if distance estimates are not realistic (bad models) or deviate from additivity

Character based methods
Maximum likelihood based methods
–Quartet puzzling method - TREE-PUZZLE
–Standard ML - PHYML, PROML (PHYLIP)
Bayesian based methods
–MrBayes v3.1
–Phylobayes v2.3
–P4 (Peter Foster)


TREE-PUZZLE5.2
Protein maximum likelihood method using “quartet puzzling”
–With various protein rate matrices (JTT, WAG…)
–Can include correction for rate heterogeneity between sites - pinv + gamma shape (can estimate the values)
–Can estimate amino acid frequencies from the data
–Lists the site rate category for each site (2-16)
–Composition statistics
–Molecular clock test
–Can deal with large datasets
Can be used for ML pairwise distance estimates with complex models - used with puzzleboot to perform bootstrapping with PHYLIP

A gamma distribution can be used to model site rate heterogeneity Yang 1996 TREE, 11,
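As a rough illustration of how the shape parameter controls site rate heterogeneity, the sketch below evaluates the density of a gamma distribution constrained to mean 1 (shape alpha, rate alpha), the usual parameterisation for site rates, together with the coefficient of variation 1/sqrt(alpha). The function names are made up for this example.

```python
import math

def gamma_rate_density(r, alpha):
    """Density at rate r > 0 of a gamma distribution with shape alpha and mean 1."""
    return (alpha ** alpha) * r ** (alpha - 1) * math.exp(-alpha * r) / math.gamma(alpha)

def coefficient_of_variation(alpha):
    """CV of the mean-1 gamma distribution: 1 / sqrt(alpha)."""
    return 1.0 / math.sqrt(alpha)

for alpha in (0.25, 1.0, 4.0):
    print(alpha, coefficient_of_variation(alpha), gamma_rate_density(0.1, alpha))
```

Small alpha values (below 1) put most of the density near zero, meaning many nearly invariant sites and a few fast ones; large alpha values concentrate the rates around 1, approaching equal rates across sites.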

TREE-PUZZLE5.2
The quartet ML tree search method proceeds in steps:
1) Parameters (pInv-gamma) are estimated on a NJ n-taxa tree
2) Calculate the ML tree for all possible quartets (4-taxa)
3) Combine quartets in a n-taxa tree (puzzling step)
4) Repeat the puzzling step numerous times (with randomised order of quartet input)
5) Compute a majority rule consensus tree from all n-taxa trees - has the puzzle support values
Puzzle support values are not bootstrap values!
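To give a feel for the size of the quartet step, the sketch below enumerates all quartets for a set of taxa and the three possible unrooted topologies of each quartet. It does not compute any likelihoods and is not TREE-PUZZLE code; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def quartets(taxa):
    """All subsets of four taxa, in a fixed order."""
    return list(combinations(taxa, 4))

def quartet_topologies(a, b, c, d):
    """The three possible unrooted topologies for four taxa, as pairs of pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

taxa = list("ABCDEFGH")
qs = quartets(taxa)
print(len(qs), comb(len(taxa), 4))
print(quartet_topologies(*qs[0]))
```

The number of quartets grows as n(n-1)(n-2)(n-3)/24, so each added taxon increases the work of the quartet ML step considerably.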

TREE-PUZZLE5.2
Models for amino acid changes:
–PAM, JTT, BLOSUM62, mtREV24, WAG (with correction for amino acid frequencies)
–Correction for specific dataset amino acid frequencies
–Discrete gamma model for rate heterogeneity between sites, 4-16 categories -> output gives the rate category for each site. Can be used to partition your data and analyse them separately…
Taxa composition heterogeneity test
Molecular clock test

Combination of categories that contributes the most to the likelihood (computation done without clock assumption assuming quartet-puzzling tree):

TREE-PUZZLE5.2
Can be used to calculate pairwise distances with a broad diversity of models - puzzleboot (Holder & Roger)
Can be used in combination with PHYLIP programs for bootstrapping:
–SEQBOOT
–NJ or LS…
–CONSENSE
But PHYML can do ML bootstrapping in a fair amount of time…

Advantages: –Can handle larger numbers of taxa for maximum likelihood analyses –Implements various models (BLOSUM, JTT, WAG…) and can incorporate a correction for rate heterogeneity (pinv+gamma) –Can estimate for a given tree the gamma shape parameter and the fraction of constant sites and attribute to each site a rate category Disadvantages: –Quartet based tree search - amplification of the long branch attraction artefact within each quartet analysis?

MrBayes 3.1
Bayesian approach
–Iterative process leading to improvement of trees and model parameters, providing the most probable trees (and parameter values)
Complex models for amino acid changes:
–PAM, JTT and WAG (with correction for amino acid frequencies, but you have to type it!?!?!)
–Correction for rate heterogeneity between sites (pinv, discrete gamma, site specific rates)
Powerful parameter space search
–Tree space (tree topologies)
–Shape parameter (alpha shape parameter, pinv)
–Can work with large datasets
–Provides probabilities of support for clades (posterior probabilities)

MrBayes 3.1
MrBayes will produce a population of trees and parameter values, obtained by Metropolis-coupled Markov chain Monte Carlo (mcmcmc). If the chain is working well these will have converged to “probable” values.
In practice we plot the results of an mcmcmc run to determine the region of the chain that converged to probable values. The “burn in” is the initial region of the mcmcmc that is ignored for calculation of the consensus tree.
–Trees and parameter values from the region of equilibrium are used to estimate a consensus tree
–The proportion of trees recovering a given clade corresponds to the posterior for that clade, the probability that this clade exists (for the given data and model)
–The mcmcmc uses the lnL function to compare trees between generations
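The link between sampled trees, burn in and clade posteriors can be sketched as follows. Each sampled tree is represented here simply as a set of clades (each clade a frozenset of taxon names), which is an assumption made for the example and not how MrBayes stores trees; the sample itself is made up.

```python
def clade_posterior(sampled_trees, clade, burnin):
    """Fraction of post-burn-in trees that contain the given clade.

    sampled_trees: list of trees in sampling order, each a set of frozensets.
    burnin: number of sampled trees (not generations) to discard.
    """
    kept = sampled_trees[burnin:]
    if not kept:
        raise ValueError("burn in discards every sampled tree")
    clade = frozenset(clade)
    return sum(clade in tree for tree in kept) / len(kept)

# Hypothetical chain of five sampled trees over taxa A, B, C, D
sample = [
    {frozenset("AB")},
    {frozenset("AC")},
    {frozenset("AB")},
    {frozenset("AB")},
    {frozenset("AC")},
]
print(clade_posterior(sample, "AB", burnin=1))
```

Note that the burn in is counted in sampled trees: with a sampling frequency of 10, discarding 150 samples corresponds to the first 1500 generations.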

MrBayes 3.1
Most methods provide a single tree and parameter values
–Bootstrapping provides a distribution of tree topologies
–Puzzling steps also provide a distribution of tree topologies
Bootstrap values - Puzzle support values - Posterior values???
But it is not too clear how to interpret these different support values - in each case the support values are for a given dataset and method used
Posteriors are typically higher than bootstrap and puzzle support values?!

MrBayes 3.1: some options

MrBayes 3.1.2: an example

#NEXUS
begin data;
  dimensions ntax=8 nChar=500;
  format datatype=protein gap=- missing=?;
  matrix
  Etc…

Block I:
Begin mrbayes;
  log start filename=d.res.nex.log replace;
  prset aamodelpr=fixed(wag);
  lset rates=invgamma Ngammacat=4;
  set autoclose=yes;
  mcmc ngen=5000 printfreq=500 samplefreq=10 nchains=4 savebrlens=yes startingtree=random filename=d.res.nex.out;
  quit;
end;

Block II:
[
Begin mrbayes;
  log start filename=d.res.nex.con.log replace;
  sumt filename=d.res.nex.out.t burnin=150 contype=allcompat;
end;
]

A Bayesian analysis
-Propose a starting tree topology and parameter values (branch lengths, alpha, pinv), calculate lnL
-Change one of these and compare the lnL with the previous proposal
-If the lnL is improved, accept it
-If not, accept it only sometimes
-Do many, many of these…
-Plot the change of lnL in relation to the number of generations run
-Determine the region where the chain converged and calculate the consensus tree for that region -> consensus tree with posteriors for clade support
[Plot: tree lnL against #generations (mcmcmc), with a zoomed-in view]
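The "accept it only sometimes" step can be sketched as a Metropolis acceptance rule. This is a simplified illustration that assumes symmetric proposals and flat priors, so that only the likelihood ratio matters; it is not the MrBayes implementation, and the function name is hypothetical.

```python
import math
import random

def accept_proposal(lnl_current, lnl_proposed, u=None):
    """Metropolis rule on log likelihoods.

    An improved lnL is always accepted; a worse one is accepted with
    probability exp(lnl_proposed - lnl_current). u is a uniform random
    number in [0, 1), drawn here if not supplied.
    """
    if u is None:
        u = random.random()
    if lnl_proposed >= lnl_current:
        return True
    return u < math.exp(lnl_proposed - lnl_current)

# A proposal 5 lnL units worse is accepted in under 1% of the attempts
print(math.exp(-5.0))
print(accept_proposal(-100.0, -105.0, u=0.5))
```

Accepting worse states occasionally is what lets the chain leave local optima and sample trees in proportion to their probability rather than just climbing to a single best tree.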

[Plots: alpha and pinv against #generations (mcmcmc)]
“Burn in” determines the trees to be ignored for consensus tree calculation
-Was the chain run long enough?
-Do we get the same result from an independent chain?

[Figure: consensus tree of taxa A to H; clade values shown include 0.98, 0.99, 0.49, 0.96 and 0.81]
Consensus tree with a burn in of 1500 generations (150 sampled trees), showing posterior values for the different clades - probability for a given clade to be correct (for the given data and method used!!!)

Model choice in protein analyses
1. Rate matrix choice (20x20 matrices)
–LG, WAG, BLOSUM62, etc…
2. Recoding protein datasets
–20x20 --> 6x6 rate matrix (or else)
–Implemented in P4 and PhyloBayes v2.3
3. Models considering amino acid composition heterogeneity (CAT model, NDCH model)

Effect of using different rate matrices on phylogenetics (Keane et al)
[Figure panels: PHYML with MtRev matrix; PHYML with WAG matrix]

-Numerous eukaryotes do not possess mitochondria
-They possess instead hydrogenosomes or mitosomes
-What is the evolutionary origin of these organelles and what are their functions?

Trichomonas NuoF localises in the hydrogenosomes Complex I News and views by Gray (2005). Nature 434,

Amino acid categories - recoding in p4
–Sulfhydryl: C (1)
–Small hydrophilic: S, T, A, P, G (2)
–Acid, amide: D, E, N, Q (3)
–Basic: H, R, K (4)
–Small hydrophobic: M, I, L, V (5)
–Aromatic: F, Y, W (6)

     1    2    3    4    5    6
1    -    x1   x2   x3   x4   x5
2         -    x6   x7   x8   x9
3              -    x10  x11  x12
4                   -    x13  x14
5                        -    x15
6                             -

Recoding into 6 states (1-6) allows the estimation of a GTR like matrix with 14 free parameters
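The six-category recoding listed above amounts to a simple character translation. The sketch below is a stand-alone illustration, not the p4 implementation; the function name and the treatment of gaps (kept as "-") and unknown residues (mapped to "?") are assumptions.

```python
# Category number -> member amino acids, as listed above
CLASSES = {
    "1": "C",      # sulfhydryl
    "2": "STAPG",  # small hydrophilic
    "3": "DENQ",   # acid, amide
    "4": "HRK",    # basic
    "5": "MILV",   # small hydrophobic
    "6": "FYW",    # aromatic
}
AA_TO_CLASS = {aa: code for code, members in CLASSES.items() for aa in members}

def recode(seq, unknown="?"):
    """Recode an amino acid sequence into the six categories (1-6)."""
    return "".join(
        aa if aa == "-" else AA_TO_CLASS.get(aa.upper(), unknown) for aa in seq
    )

print(recode("MKV-LSTAX"))
```

After recoding, substitutions within a category (for example I to L) become invisible, which is the intended effect: only changes between categories remain as signal.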

Why recode amino acids?
Potential advantages:
–Allows a rate matrix specific to the investigated alignment to be generated
–Contributes to mitigating amino acid composition heterogeneity and homoplasy due to frequent changes within categories - equivalent to DNA transversion analyses
Potential disadvantage:
–Loss of potentially useful signal by reducing the alphabet from 20 to 6 letters (or else)

Recoding effect on NuoF phylogeny
[Figure panels: WAG (20x20) +pInv+G; GTR on recoded Dayhoff classes (6x6) +pInv+G; α-proteobacteria labelled]

The CAT model
Most models average amino acid composition over the alignment
A C D E F G H I K L M N P Q R S T V W Y: 20 stationary equilibrium frequencies (averaged from the alignment)
Rate matrix: 190 pairwise relative rates (JTT, WAG, MtREV)
Equilibrium frequencies x rate matrix = Q
From G. Naylor
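How equilibrium frequencies and pairwise relative rates combine into Q can be sketched with a toy example. The sketch assumes the usual time-reversible construction, Q_ij = r_ij * pi_j for i != j with each diagonal entry set so that the row sums to zero; the three-state values are made up, and the normalisation of Q to one expected substitution per unit time is left out.

```python
def build_rate_matrix(exchange, freqs):
    """Build Q from symmetric relative rates and equilibrium frequencies.

    exchange: symmetric n x n list of lists (diagonal ignored).
    freqs: n equilibrium frequencies summing to 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 at this point
    return q

# Toy 3-state example; a protein model would use 20 states
exchange = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
freqs = [0.5, 0.3, 0.2]
for row in build_rate_matrix(exchange, freqs):
    print(row)
```

Because the same set of frequencies multiplies every column, a single Q built this way describes every site with the alignment-wide average composition.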

However, most sites in alignments have a restricted set of residues
Consider this site:
A C D E F G H I K L M N P Q R S T V W Y: 20 stationary equilibrium frequencies averaged over the alignment - poor description of reality (for this site)
A C D E F G H I K L M N P Q R S T V W Y: site specific vector of 20 probabilities - better
From G. Naylor

The CAT model
Not possible to have a separate model tailored to each site (too many parameters) - but possible to assign sites to an “optimal and reasonable” number of “categories” with comparable evolutionary freedom to vary
Can have a model tailored to each category and implement a “mixture” of models
Lartillot (2007) proposed such a mixture model to allow categories of sites associated with different biochemical roles to have different amino acid equilibrium frequencies. Implemented in the Phylobayes software
From G. Naylor

The CAT (mixture) model:
[Figure, panels 1) to 3): amino acid equilibrium frequency profiles for categories (models) 1, 2, 3, ….., K; each profile is a vector of 20 probabilities over A C D E F G H I K L M N P Q R S T V W Y]
Multiply each distribution by rate matrix (WAG, JTT, MtREV etc)
Yields a mixture of distributions that better capture the allowable state-space
From Lartillot 2007
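In a finite mixture, the likelihood of a site is the weighted sum of its likelihoods under each category. The sketch below shows only that generic calculation in log space; it is not the PhyloBayes implementation (which handles the categories through a Dirichlet process within the MCMC), and the function name and example values are assumptions.

```python
import math

def mixture_site_lnl(category_lnls, weights):
    """ln( sum_k w_k * L_k ) for one site, computed stably from the ln L_k."""
    terms = [math.log(w) + lnl for w, lnl in zip(weights, category_lnls) if w > 0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical site: fits category 2 far better than categories 1 and 3
lnls = [-12.0, -4.0, -15.0]
weights = [0.5, 0.3, 0.2]
print(mixture_site_lnl(lnls, weights))
```

A site dominated by one frequency profile gets nearly all of its likelihood from that category, which is how the mixture lets different sites follow different equilibrium frequencies.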

The CAT model in animal phylogenetics
WAG +F+Γ
Experiments with taxa sampling:
- Outgroup
- Ingroup
Only pp < 1 are shown
[Figure: trees showing the Ecdysozoa clade and the Coelomata clade]
Lartillot et al. (2007) BMC Evol Biol 7: S4

The CAT model in animal phylogenetics
CAT model
[Figure: tree showing the Ecdysozoa clade]
Lartillot et al. (2007) BMC Evol Biol 7: S4

CAT model and the mean number of residues per site
Posterior predictive analysis of the mean number of distinct residues observed at each column of the alignment (mean diversity).
Lartillot et al. (2007) BMC Evol Biol 7: S4

Summary
No single program allows thorough phylogenetic analyses of protein alignments
A combination of PHYLIPv3.6, TREE-PUZZLEv5.2, PHYML, MrBAYESv3.1, PHYLOBAYESv2.3 and P4 allows detailed protein phylogenetics
Experimenting with your data and available methods/models can lead to interesting and biologically relevant results (data method)
–Incorporate site rate heterogeneity correction in the model or reduce heterogeneity by data editing (with and without invariant sites?)
–Partitioning of the alignment (variant - various rates, invariant sites, secondary structure, protein domains, CAT model…)
–Amino acid groupings (6 categories - GTR like)
–LogDet for proteins - rare/absent changes? For long alignments?
–DNA based LogDet or the protein alignment…?
Do not take support values as absolute. Any support value is for a given method and data, only!

TM domains
[Figure: membrane protein topology, labelled outside / PM / inside]
TM domains have very specific structural requirements: the AA composition of TM domains is very distinct from that of non-TM domains!
Extracellular and intracellular domains may also have important functional differences --> different functional constraints can lead to different AA composition.
More generally, in any protein, the AA composition of surface exposed residues is typically distinct from that of “internal” residues.

[Figure: membrane protein (outside / PM / inside) with the global alignment (AA) of taxa A, B, C, D split into sub-alignments; partition 1 is analysed with Model 1 and partition 2 with Model 2]

[Figure: membrane protein (outside / PM / inside) with the global alignment (AA) of taxa A, B, C, D]
AA recoding can mitigate compositional differences between domains
–Sulfhydryl: C (1)
–Small hydrophilic: S, T, A, P, G (2)
–Acid, amide: D, E, N, Q (3)
–Basic: H, R, K (4)
–Small hydrophobic: M, I, L, V (5)
–Aromatic: F, Y, W (6)

[Figure: membrane protein (outside / PM / inside) with the global alignment (AA) and the global alignment (DNA) of taxa A, B, C, D]
DNA based models: LogDet? Codon positions 1, 2 or 3?