Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.


Realistic evolutionary models Contents Models with different rates at different sites Models which allow gaps Evaluating different models Break Probabilistic interpretation of Parsimony Maximum Likelihood distances

Unrealistic assumptions 1. Same rate of evolution at each site in the substitution matrix. In reality, the structure of proteins and the base pairing of RNA result in different rates. 2. Ungapped alignments. These discard useful information given by the pattern of deletions and insertions.

Different rates in matrix Maximum likelihood: sites are independent, so for sequences x^j, j = 1…n, the likelihood factorizes into a product over sites

Different rates in matrix (2) Introduce a site-dependent variable r_u that scales the rate of evolution at site u

Different rates in matrix (3) We don’t know r_u, so we use a prior. Yang [1993] suggests a gamma distribution g(r, α, α), with mean = 1 and variance = 1/α

Problem: the number of terms grows exponentially with the number of sequences, so the computation is slow. Solution: approximation - Replace the integral by a discrete sum - Subdivide the domain into m intervals - Let r_k denote the mean of the gamma distribution in the kth interval

Solution Yang [1993] found that m = 3 or 4 already gives a good approximation. This needs only m times as much computation as for non-varying sites
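The discrete approximation can be sketched as follows. This is a minimal illustration, not the method as published: it assumes the m intervals are chosen to have equal probability (the slides do not say how the domain is subdivided) and uses a plain midpoint-rule integration. The names `gamma_pdf` and `discrete_gamma_rates` are hypothetical.

```python
import math


def gamma_pdf(r, alpha):
    """Density of the gamma distribution g(r, alpha, alpha): mean 1, variance 1/alpha."""
    log_density = (alpha * math.log(alpha) + (alpha - 1) * math.log(r)
                   - alpha * r - math.lgamma(alpha))
    return math.exp(log_density)


def discrete_gamma_rates(alpha, m, upper=50.0, steps=200000):
    """Return m rates r_k, each the mean of the gamma density within one interval.

    Assumption: intervals are chosen so that each holds probability 1/m.
    The density is integrated numerically on (0, upper) with the midpoint rule.
    """
    h = upper / steps
    target = 1.0 / m
    rates = []
    mass = 0.0      # probability accumulated in the current interval
    moment = 0.0    # integral of r * g(r) over the current interval
    for i in range(steps):
        r = (i + 0.5) * h
        w = gamma_pdf(r, alpha) * h
        mass += w
        moment += r * w
        # Close the interval once it holds 1/m of the probability;
        # the last interval simply collects the remaining tail.
        if mass >= target and len(rates) < m - 1:
            rates.append(moment / mass)
            mass = 0.0
            moment = 0.0
    rates.append(moment / mass)
    return rates


rates = discrete_gamma_rates(alpha=1.0, m=4)
```

With equal-probability intervals, the site likelihood is then approximated by averaging the likelihood computed at each of the m rates, which is where the factor of m in the computation comes from.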

Evolutionary models with gaps (1) Idea 1: introduce ‘_’ as an extra character of the alphabet of K residues and replace the (K x K) matrix with a (K+1) x (K+1) matrix. Drawback: a gap that follows another gap cannot be given a lower cost, because gaps are now independent

Evolutionary models with gaps (2) Idea 2: Allison, Wallace & Yee [1992] introduce delete and insertion states to ensure affine-type gaps Drawback: computationally intractable

Evolutionary models with gaps (3) Idea 3: Thorne, Kishino & Felsenstein [1992] use fragment substitution to get a degree of biological plausibility Drawback: usable for only two sequences

Finally Find a way to use affine-type gap penalties in a computationally reasonable way Mitchison & Durbin [1995] made a tree HMM which uses a profile HMM architecture, and treats paths through the model as objects that undergo evolutionary change

Assumptions needed again We will use an architecture considerably simpler than that of the profile HMM of Krogh et al [1994]: it has only match and delete states. Match state: M_k. Delete state: D_k. k = position in the model

Tree HMM with gaps (1) Sequence y is ancestor of sequence x Both sequences are aligned to the model, so both follow a prescribed path through the model

Tree HMM with gaps (2) x emits residue x_i at M_k. y emits residue y_j at M_k. The probability of the substitution y_j → x_i is P(x_i | y_j, t)

Tree HMM with gaps (3) What if x takes a different path than y? x: M_k → D_k+1 (= MD), y: M_k → M_k+1 (= MM). The probability is P(MD | MM, t)

Tree HMM with gaps (4) x: D_k+1 → M_k+2 (= DM), y: M_k+1 → M_k+2 (= MM). We assume that the choice between DD and DM is controlled by a mutational process that operates independently of y

Substitution matrix The probabilities of the transitions on the path of x are given by priors: D_k+1 → M_k+2 has probability q_DM

How it works At position k: q_yj P(x_i | y_j, t). Transition k → k+1: q_MM P(MD | MM, t). Transition k+1 → k+2: q_MM q_DM
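Read as a product, the three factors on this slide give the joint probability of the two paths over positions k to k+2. The sketch below only multiplies them out; the function name and all numeric values are hypothetical and chosen purely for illustration.

```python
def joint_path_probability(q_y, p_sub, q_mm, p_md_given_mm, q_dm):
    """Multiply the three factors listed for positions k, k -> k+1 and k+1 -> k+2.

    q_y            prior probability of the ancestral residue y_j
    p_sub          P(x_i | y_j, t), substitution at match state M_k
    q_mm           prior probability of the ancestor's MM transition
    p_md_given_mm  P(MD | MM, t), x deletes where y matches
    q_dm           prior probability of x's DM transition
    """
    at_k = q_y * p_sub
    k_to_k1 = q_mm * p_md_given_mm
    k1_to_k2 = q_mm * q_dm
    return at_k * k_to_k1 * k1_to_k2


# Hypothetical numbers, not taken from the slides.
prob = joint_path_probability(q_y=0.25, p_sub=0.8, q_mm=0.9,
                              p_md_given_mm=0.05, q_dm=0.5)
```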

Another example

Evaluating models: evidence Comparing models is difficult. Compare the probabilities P(D|M_1) and P(D|M_2), each obtained by integrating over all parameters of the model: P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ, with parameters θ and prior probabilities P(θ)

Comparing two models A natural way to compare M_1 and M_2 is to compute the posterior probability of M_1: P(M_1|D) = P(D|M_1) P(M_1) / [P(D|M_1) P(M_1) + P(D|M_2) P(M_2)]
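A toy calculation may make the evidence comparison concrete. The setting is an assumption made for illustration only: the data are k mismatches observed at n aligned sites, M_1 fixes the mismatch probability at 0.5, and M_2 leaves it free with a uniform prior. The function names are hypothetical.

```python
def evidence_fixed(k, n, p=0.5):
    """P(D | M_1): no free parameters, the mismatch probability is fixed at p."""
    return p ** k * (1 - p) ** (n - k)


def evidence_uniform_prior(k, n, steps=100000):
    """P(D | M_2): integrate the likelihood over theta with a uniform prior on (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h  # midpoint rule
        total += theta ** k * (1 - theta) ** (n - k) * h
    return total


def posterior_m1(ev1, ev2, prior_m1=0.5):
    """Posterior probability of M_1 given the two evidences and the prior P(M_1)."""
    numerator = ev1 * prior_m1
    return numerator / (numerator + ev2 * (1 - prior_m1))


ev1 = evidence_fixed(9, 10)
ev2 = evidence_uniform_prior(9, 10)
post = posterior_m1(ev1, ev2)
```

For 9 mismatches in 10 sites the integral for M_2 has the closed form 9!·1!/11! = 1/110, so the numerical result can be checked by hand; with equal priors the posterior of M_1 comes out below one half.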

Parametric bootstrap Let P̂_1 be the maximum likelihood of the data D for the model M_1. Let P̂_2 be the maximum likelihood of the data D for the model M_2. Define Δ = log(P̂_2 / P̂_1)

Parametric bootstrap (2) Simulate datasets D_i with the values of the parameters of M_1 that gave the maximum likelihood for D, and compute Δ_i for each. If Δ exceeds almost all values of Δ_i, then M_2 captures aspects of the data that M_1 does not mimic, and M_1 is rejected
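The procedure can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the data are mismatch counts in two alignment regions, M_1 uses one mismatch probability for both regions, M_2 uses one per region, and Δ is taken to be the difference of the maximised log likelihoods (M_2 minus M_1). The function names are hypothetical.

```python
import math
import random


def max_loglik(k, n):
    """Maximised Bernoulli log likelihood for k mismatches in n sites (p = k/n)."""
    p = k / n
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if k < n:
        ll += (n - k) * math.log(1 - p)
    return ll


def delta(k_a, n_a, k_b, n_b):
    """Log of the ratio of maximum likelihoods, M_2 (two rates) over M_1 (one rate)."""
    ll_m1 = max_loglik(k_a + k_b, n_a + n_b)
    ll_m2 = max_loglik(k_a, n_a) + max_loglik(k_b, n_b)
    return ll_m2 - ll_m1


def parametric_bootstrap(k_a, n_a, k_b, n_b, replicates=500, seed=0):
    """Return the observed delta and the fraction of simulated deltas that reach it."""
    rng = random.Random(seed)
    observed = delta(k_a, n_a, k_b, n_b)
    p_hat = (k_a + k_b) / (n_a + n_b)  # ML parameter of M_1 for the real data
    reached = 0
    for _ in range(replicates):
        # Simulate a dataset D_i under M_1 with its ML parameter.
        sim_a = sum(rng.random() < p_hat for _ in range(n_a))
        sim_b = sum(rng.random() < p_hat for _ in range(n_b))
        if delta(sim_a, n_a, sim_b, n_b) >= observed:
            reached += 1
    return observed, reached / replicates


observed, fraction = parametric_bootstrap(5, 100, 60, 100)
```

When the returned fraction is very small, the observed Δ exceeds almost all simulated Δ_i and M_1 is rejected in the sense of the slide.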

Break

Probabilistic interpretation of various models Lars Hemel

Overview Review of last week’s method Parsimony – Assumptions, Properties Probabilistic interpretation of Parsimony Maximum Likelihood distances – Example: Neighbour joining More probabilistic interpretations – Sankoff & Cedergren – Hein’s affine cost algorithm Conclusion / Questions?

Review Parsimony = Finding a tree which can explain the observed sequences with a minimal number of substitutions

Parsimony Remember the following assumptions: – Sequences are aligned – Alignments do not have gaps – Each site is treated independently. Furthermore, many families of substitution matrices satisfy: – Multiplicativity: Σ_b P(a|b,s) P(b|c,t) = P(a|c,s+t) – Reversibility: P(b|a,t) q_a = P(a|b,t) q_b

Parsimony Basic step: count the minimal number of changes for one site. The total number of substitutions is the sum over all sites. Weighted parsimony uses different ‘weights’ for different substitutions

Probabilistic interpretation of parsimony Given: a set of substitution probabilities P(b|a) in which we neglect the dependence on length t. Calculate substitution costs S(a,b) = -log P(b|a). Felsenstein [1981] showed that, with these substitution costs, the minimal cost at site u for the whole tree T obtained by the weighted parsimony algorithm can be regarded as an approximation to the negative log likelihood
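A small sketch of weighted parsimony with costs S(a,b) = -log P(b|a) may help. It uses the standard dynamic programme over a rooted binary tree; the substitution probabilities (0.9 for no change, the rest shared equally) and the nested-tuple tree encoding are assumptions made for the example, and the function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def substitution_costs(p_same=0.9):
    """S(a, b) = -log P(b | a) for a toy model: P(a|a) = p_same, the rest shared equally."""
    p_diff = (1 - p_same) / (len(ALPHABET) - 1)
    return {(a, b): -math.log(p_same if a == b else p_diff)
            for a in ALPHABET for b in ALPHABET}


def min_costs(node, cost):
    """Return {a: minimal cost of the subtree below node if node carries residue a}."""
    if isinstance(node, str):  # a leaf holds its observed residue
        return {a: (0.0 if a == node else math.inf) for a in ALPHABET}
    left, right = (min_costs(child, cost) for child in node)
    return {a: min(cost[a, b] + left[b] for b in ALPHABET)
               + min(cost[a, b] + right[b] for b in ALPHABET)
            for a in ALPHABET}


def weighted_parsimony(tree, cost):
    """Minimal total cost for one site over all assignments to the internal nodes."""
    return min(min_costs(tree, cost).values())


costs = substitution_costs()
site_cost = weighted_parsimony((("A", "A"), ("C", "C")), costs)
```

For this tree the cheapest labelling keeps five edges unchanged and changes one, so the cost equals minus the log of the probability of that single best labelling; the full likelihood would instead sum over all labellings.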

Probabilistic interpretation of parsimony The performance of tree-building algorithms can be tested by sampling sequences from a probabilistic model on a known tree and then checking how often a given algorithm reconstructs that tree correctly. Sampling is done as follows: – Pick a residue a at the root with probability q_a – Accept a substitution to b along the edge down to node i with probability P(b|a, t_i), and repeat this down to the leaves – Sequences of length N are generated by N independent repetitions of this procedure – Maximum likelihood should reconstruct the correct tree for large N
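The sampling procedure can be sketched as below. The model details are assumptions for illustration: a uniform root distribution, and on each edge a change with probability p to one of the other three residues chosen uniformly. The tree encoding, the edge probabilities in `EXAMPLE_TREE` and the function names are hypothetical.

```python
import random

ALPHABET = "ACGT"

# An internal node is a tuple of (p, child) pairs, where p is the probability of
# a change on the edge leading to that child; a leaf is its name as a string.
EXAMPLE_TREE = (
    (0.05, ((0.3, "1"), (0.1, "2"))),
    (0.05, ((0.3, "3"), (0.1, "4"))),
)


def evolve(residue, p, rng):
    """With probability p replace the residue by a different one, else keep it."""
    if rng.random() < p:
        return rng.choice([b for b in ALPHABET if b != residue])
    return residue


def sample_down(node, residue, rng, site):
    """Propagate a residue from node down to the leaves, recording leaf residues."""
    if isinstance(node, str):
        site[node] = residue
        return
    for p, child in node:
        sample_down(child, evolve(residue, p, rng), rng, site)


def sample_sequences(tree, n_sites, seed=0):
    """Generate leaf sequences of length n_sites by independent repetitions."""
    rng = random.Random(seed)
    columns = {}
    for _ in range(n_sites):
        site = {}
        sample_down(tree, rng.choice(ALPHABET), rng, site)  # uniform root residue
        for leaf, residue in site.items():
            columns.setdefault(leaf, []).append(residue)
    return {leaf: "".join(residues) for leaf, residues in columns.items()}


sequences = sample_sequences(EXAMPLE_TREE, n_sites=50)
```

Feeding such sequences to a tree-building method and comparing the result with the generating tree, for increasing N, gives the kind of performance test the slide describes.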

Probabilistic interpretation of parsimony Suppose we have tree T, with the following edge lengths, and a substitution matrix with p=0.3 for leaves 1,3 and p=0.1 for 2 and

Probabilistic interpretation of parsimony For n leaves there are (2n-5)!! distinct unrooted trees
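The double factorial (2n-5)!! = 1 · 3 · 5 · … · (2n-5) grows very quickly, which a few lines of code make visible. The function name is hypothetical.

```python
def count_unrooted_trees(n):
    """Number of unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 5 + 1, 2):
        count *= k
    return count


for n in (4, 5, 10):
    print(n, count_unrooted_trees(n))
```

Four leaves give 3 trees, which is why the four-leaf example can be checked exhaustively, while ten leaves already give more than two million.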

Probabilistic interpretation of parsimony Parsimony can construct the wrong tree even for large N. [Table: results by sequence length N for Parsimony and Maximum likelihood]

Probabilistic interpretation of parsimony Suppose the following example: a tree with A, A, B, B at leaves 1, 2, 3 and 4

Probabilistic interpretation of parsimony With parsimony the number of substitutions is counted for each tree. [Figure: two trees with leaves A, A, B, B; the left tree needs 2 substitutions, the right tree needs 1] Parsimony constructs the right-hand tree, with 1 substitution, more often than the left-hand tree, with 2
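The substitution counts for the pattern A, A, B, B can be reproduced with a short unweighted parsimony count. This sketch uses Fitch's set-based counting, which is my choice for the illustration; each unrooted four-leaf tree is rooted on its internal edge, which does not change the count. The function name and tree encoding are hypothetical.

```python
def fitch(node, residues):
    """Return (candidate residue set, number of changes) for the subtree at node.

    A leaf is an integer key into `residues`; an internal node is a pair of subtrees.
    """
    if isinstance(node, int):
        return {residues[node]}, 0
    left_set, left_changes = fitch(node[0], residues)
    right_set, right_changes = fitch(node[1], residues)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared residue: one substitution is needed below this node.
    return left_set | right_set, left_changes + right_changes + 1


residues = {1: "A", 2: "A", 3: "B", 4: "B"}
topologies = {
    "12|34": ((1, 2), (3, 4)),
    "13|24": ((1, 3), (2, 4)),
    "14|23": ((1, 4), (2, 3)),
}
changes = {name: fitch(tree, residues)[1] for name, tree in topologies.items()}
```

Only the tree that groups leaves 1 and 2 explains this site with a single substitution; the other two trees need two.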

Maximum Likelihood distances Suppose we have a tree T, edge lengths and sampled sequences at the leaves. We’ll try to compute the distance between two leaf sequences

Maximum Likelihood distances By multiplicativity

By reversibility and multiplicativity

Maximum Likelihood distances

ML distances between leaf sequences are close to additive, given a large amount of data
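As a concrete instance of a maximum likelihood distance, the sketch below uses the Jukes-Cantor model, for which the ML estimate has a closed form. Jukes-Cantor is swapped in here purely for illustration; the slides speak of a general multiplicative, reversible model. The function name is hypothetical.

```python
import math


def jukes_cantor_ml_distance(x, y):
    """ML estimate of the expected substitutions per site between two aligned
    sequences under the Jukes-Cantor model: d = -3/4 * log(1 - 4p/3),
    where p is the observed fraction of mismatching sites."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be aligned and non-empty")
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    if p >= 0.75:
        return math.inf  # saturation: the estimate diverges
    return -0.75 * math.log(1 - 4 * p / 3)


d = jukes_cantor_ml_distance("ACGTACGT", "ACGTACGA")
```

The correction matters because the raw mismatch fraction underestimates the amount of change on long edges, which is what breaks additivity; the corrected distance is always at least as large as the raw fraction.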

Example: Neighbour joining [Figure: tree with leaves i, j, k and m]

Use Maximum Likelihood distances Suppose we have a multiplicative reversible model, we have plenty of data, and the underlying probabilistic model is correct. Then Neighbour joining will construct any tree correctly.
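The key step of neighbour joining, choosing which pair of leaves to join, can be sketched on exactly additive distances. The four-leaf tree and its edge lengths are hypothetical: leaves 0 and 1 are neighbours, leaves 2 and 3 are neighbours, and leaves 0 and 2 sit on long edges. The selection rule used is the usual corrected distance d_ij - (r_i + r_j), with r_i the sum of distances from i divided by n - 2.

```python
from itertools import combinations

# Hypothetical edge lengths for the tree (0,1)|(2,3).
LEAF_EDGES = [1.0, 0.1, 1.0, 0.1]
INTERNAL_EDGE = 0.1


def additive_distances():
    """Path lengths between all pairs of leaves in the tree above."""
    d = {}
    for i, j in combinations(range(4), 2):
        crosses = (i // 2) != (j // 2)  # the path uses the internal edge
        d[i, j] = LEAF_EDGES[i] + LEAF_EDGES[j] + (INTERNAL_EDGE if crosses else 0.0)
    return d


def closest_pair(d, n):
    """Naive choice: the pair with the smallest raw distance."""
    return min(combinations(range(n), 2), key=lambda ij: d[ij])


def nj_pair(d, n):
    """Neighbour-joining choice: minimise d_ij - (r_i + r_j)."""
    def dist(i, j):
        return d[i, j] if i < j else d[j, i]
    r = [sum(dist(i, k) for k in range(n) if k != i) / (n - 2) for i in range(n)]
    return min(combinations(range(n), 2),
               key=lambda ij: d[ij] - r[ij[0]] - r[ij[1]])


distances = additive_distances()
naive = closest_pair(distances, 4)
joined = nj_pair(distances, 4)
```

Here the two closest leaves are 1 and 3, which are not neighbours in the tree, while the corrected criterion selects a true neighbour pair. This is the sense in which additive (or nearly additive) distances let neighbour joining succeed where raw closeness would fail.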

Example: Neighbour joining Neighbour joining using ML distances constructs the correct tree where Parsimony failed

More probabilistic interpretations Sankoff & Cedergren – Simultaneously aligning sequences and finding their phylogeny, using a character substitution model – Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing, as in Allison, Wallace & Yee [1992] – But like the original S&C method, it is not practical for most problems.

More probabilistic interpretations Hein’s affine cost algorithm – Simultaneously aligning sequences and finding their phylogeny, using affine gap penalties – Probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing. – But when using plus instead of max we have to include all the paths, which incurs a cost at the first node above the leaves, at the next node, and so on. So all the speed advantages are gone.

Conclusion Probabilistic interpretations can be better – Compare ML with parsimony. They can also be less useful, because their computational costs get too high – Sankoff & Cedergren. Neighbour joining constructs the correct tree if its assumptions hold. So, the trick is to know your problem and to decide which method is the best. Questions?