CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.

Presentation transcript:

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU)

What is a model?
Mathematical models are:
  – Incomprehensible
  – Useless
  – No fun at all

What is a model?
Model = hypothesis !!!
Hypothesis (as used in most biological research):
  – Precisely stated, but qualitative
  – Allows you to make qualitative predictions
Arithmetic model:
  – Mathematically explicit (parameters)
  – Allows you to make quantitative predictions
…

The Scientific Method
  – Observation of data
  – Model of how system works
  – Prediction(s) about system behavior (simulation)

The maximum likelihood approach I
Starting point:
  You have some observed data and a probabilistic model for how the observed data was produced.
  Having a probabilistic model of a process means you are able to compute the probability of any possible outcome (given a set of specific values for the model parameters).
Example:
  – Data: result of tossing a coin 10 times: 7 heads, 3 tails
  – Model: the coin has probability p for heads, 1-p for tails. The probability of observing h heads among n tosses is: $P(h \mid n, p) = \binom{n}{h} p^{h} (1-p)^{n-h}$
Goal:
  You want to find the best estimate of the (unknown) parameter values based on the observations (here the only parameter is p).

The maximum likelihood approach II
Likelihood (Model) = Probability (Data | Model)
Maximum likelihood: the best estimate is the set of parameter values that gives the highest possible likelihood.

Maximum likelihood: coin tossing example
Data: result of tossing a coin 10 times: 7 heads, 3 tails
Model: the coin has probability p for heads, 1-p for tails.
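The coin example can be worked through numerically. The following is a minimal sketch, not taken from the slides: it evaluates the binomial likelihood for 7 heads in 10 tosses on a grid of candidate values for p and picks the value with the highest likelihood. The function name and the grid resolution are illustrative choices.

```python
from math import comb

def binomial_likelihood(p, heads=7, tosses=10):
    """Probability of `heads` heads in `tosses` tosses when P(heads) = p."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)

# Candidate parameter values: 0.00, 0.01, ..., 1.00 (resolution is arbitrary).
grid = [i / 100 for i in range(101)]

# The maximum likelihood estimate is the candidate with the highest likelihood.
p_hat = max(grid, key=binomial_likelihood)

print("ML estimate of p:", p_hat)
print("Likelihood at the estimate:", binomial_likelihood(p_hat))
```

The estimate coincides with the observed fraction of heads, h/n, which is the analytical maximum of the binomial likelihood.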

Probabilistic modeling applied to phylogeny
Observed data: multiple alignment of sequences
  H.sapiens globin   A G G G A T T C A
  M.musculus globin  A C G G T T T - A
  R.rattus globin    A C G G A T T - A
Probabilistic model parameters (simplest case):
  – Nucleotide frequencies: $\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$
  – Tree topology and branch lengths
  – Nucleotide-nucleotide substitution rates (or substitution probabilities):
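To make "substitution probabilities" concrete, here is a small sketch using the Jukes-Cantor (JC69) model. Which model the original slide displayed is not recoverable from the transcript, so JC69 is an assumption chosen because it is the simplest standard case: all nucleotide frequencies are 1/4 and all substitutions occur at the same rate. The function name `jc69_prob` and the branch length 0.1 are illustrative.

```python
from math import exp

NUCLEOTIDES = "ACGT"

def jc69_prob(x, y, t):
    """P_xy(t) under JC69: probability that nucleotide x has become y
    after a branch of length t (expected substitutions per site)."""
    decay = exp(-4.0 * t / 3.0)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

t = 0.1  # illustrative branch length
for x in NUCLEOTIDES:
    row = [round(jc69_prob(x, y, t), 4) for y in NUCLEOTIDES]
    print(x, row)
```

Each row of the printed matrix sums to 1, and as t grows every entry approaches 1/4, the equilibrium frequency under this model.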

Other models of nucleotide substitution

More elaborate models of evolution
  – Codon-codon substitution rates (64 x 64 matrix of codon substitution rates)
  – Different mutation rates at different sites in the gene (the “gamma-distribution” of mutation rates)
  – Molecular clocks (same mutation rate on all branches of the tree)
  – Different substitution matrices on different branches of the tree (e.g., strong selection on branch leading to specific group of animals)
  – Etc., etc.

Different rates at different sites: the gamma distribution
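The effect of the gamma shape parameter can be illustrated by simulation. This is a rough sketch under a common convention that is assumed here rather than stated in the transcript: site rates are drawn from a gamma distribution with shape alpha and scale 1/alpha, so the mean rate is 1 and the variance is 1/alpha. The function name and the alpha values are hypothetical.

```python
import random
from statistics import mean, variance

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites alignment sites.
    Shape alpha with scale 1/alpha gives mean 1 and variance 1/alpha."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

# Small alpha: strong rate heterogeneity (many near-invariant sites, a few fast ones).
# Large alpha: all sites evolve at nearly the same rate.
for alpha in (0.5, 2.0, 50.0):
    rates = sample_site_rates(alpha, 20000)
    print(f"alpha={alpha:5.1f}  mean={mean(rates):.3f}  variance={variance(rates):.3f}")
```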

Computing the probability of one column in an alignment given tree topology and other parameters
  A C G G A T T C A
  A C G G T T T - A
  A A G G A T T - A
  A G G G T T T - A
Columns in alignment contain homologous nucleotides.
Assume tree topology, branch lengths, and other parameters are given. For now, assume ancestral states were A and A (we’ll get to the full computation on the next slide).
Start computation at any internal or external node.
[Tree figure: nodes labeled A, A, C, C, A, G; branches labeled t_1 to t_5]
$\Pr = \pi_C \, P_{CA}(t_1) \, P_{AC}(t_2) \, P_{AA}(t_3) \, P_{AG}(t_4) \, P_{AA}(t_5)$
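The product on this slide can be evaluated once a substitution model and branch lengths are chosen. The sketch below assumes the Jukes-Cantor model (so the frequency of C is 1/4) and uses made-up branch lengths; neither is specified in the transcript. It follows the slide's formula term by term for the second alignment column (C, C, A, G) with both ancestral states fixed to A.

```python
from math import exp

def jc69_prob(x, y, t):
    """P_xy(t) under the Jukes-Cantor model (assumed here for illustration)."""
    decay = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

PI_C = 0.25  # equilibrium frequency of C under Jukes-Cantor

# Hypothetical branch lengths t1..t5 (expected substitutions per site).
t1, t2, t3, t4, t5 = 0.1, 0.1, 0.05, 0.2, 0.2

# Start at the leaf with C, walk to the first ancestor (A), down to the
# second leaf (C), across the internal branch to the other ancestor (A),
# and out to the remaining leaves (G and A).
column_prob = (
    PI_C
    * jc69_prob("C", "A", t1)
    * jc69_prob("A", "C", t2)
    * jc69_prob("A", "A", t3)
    * jc69_prob("A", "G", t4)
    * jc69_prob("A", "A", t5)
)

print("Probability of column with ancestors A and A:", column_prob)
```

This is only one of the 16 terms needed for the full column probability, since the ancestral states are not actually observed.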

Computing the probability of an entire alignment given tree topology and other parameters
Probability must be summed over all possible combinations of ancestral nucleotides. (Here we have two internal nodes, giving 16 possible combinations.)
Probabilities of individual columns are multiplied to give the overall probability of the alignment, i.e., the likelihood of the model.
Often the log of the probability is used (log likelihood).
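The summation over ancestral states and the product over columns can be sketched as follows. Several things here are assumptions rather than content from the slides: the Jukes-Cantor model, the branch lengths, the reading of the tree as ((seq1, seq2), (seq4, seq3)) with t_3 as the internal branch, and the simplification of skipping columns that contain a gap.

```python
from itertools import product
from math import exp, log

NUCLEOTIDES = "ACGT"
PI = {n: 0.25 for n in NUCLEOTIDES}  # Jukes-Cantor frequencies (assumed model)

def jc69_prob(x, y, t):
    """P_xy(t) under the Jukes-Cantor model."""
    decay = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

ALIGNMENT = ["ACGGATTCA", "ACGGTTT-A", "AAGGATT-A", "AGGGTTT-A"]
BRANCHES = (0.1, 0.1, 0.05, 0.2, 0.2)  # hypothetical t1..t5

def column_likelihood(column, branches):
    """Sum the column probability over all 16 ancestral state pairs (u, v)."""
    s1, s2, s3, s4 = column
    t1, t2, t3, t4, t5 = branches
    total = 0.0
    for u, v in product(NUCLEOTIDES, repeat=2):
        total += (PI[s1] * jc69_prob(s1, u, t1) * jc69_prob(u, s2, t2)
                  * jc69_prob(u, v, t3)
                  * jc69_prob(v, s4, t4) * jc69_prob(v, s3, t5))
    return total

def log_likelihood(alignment, branches):
    """Sum of log column likelihoods; gapped columns are skipped (simplification)."""
    total = 0.0
    for column in zip(*alignment):
        if "-" in column:
            continue
        total += log(column_likelihood(column, branches))
    return total

print("Log likelihood:", log_likelihood(ALIGNMENT, BRANCHES))
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow, which becomes a real concern for alignments with thousands of columns.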

Maximum likelihood phylogeny
Data:
  – sequence alignment
Model parameters:
  – nucleotide frequencies, nucleotide substitution rates, tree topology, branch lengths
Choose random initial values for all parameters, compute likelihood
Change parameter values slightly in a direction so likelihood improves
Repeat until maximum found
Results:
  (1) ML estimate of tree topology
  (2) ML estimate of branch lengths
  (3) ML estimate of other model parameters
  (4) Measure of how well model fits data (likelihood).
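The search procedure above can be illustrated on the smallest possible problem: a single parameter, the branch length separating two sequences. The sketch below is a simplified stand-in for a full tree search, assuming the Jukes-Cantor model and using the H.sapiens and M.musculus rows from the earlier alignment with the gapped column skipped. The step-halving rule and all names are illustrative choices, not the method of any particular phylogeny program.

```python
import random
from math import exp, log

SEQ_A = "AGGGATTCA"  # H.sapiens row from the alignment
SEQ_B = "ACGGTTT-A"  # M.musculus row from the alignment

# Compare only columns without a gap (a simplification).
pairs = [(a, b) for a, b in zip(SEQ_A, SEQ_B) if "-" not in (a, b)]
n_same = sum(a == b for a, b in pairs)
n_diff = len(pairs) - n_same

def log_likelihood(t):
    """Log likelihood of branch length t under Jukes-Cantor (assumed model)."""
    decay = exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return n_same * log(0.25 * p_same) + n_diff * log(0.25 * p_diff)

def hill_climb(start, step=0.1, tol=1e-7, min_t=1e-6):
    """Move t in whichever direction improves the likelihood;
    halve the step when neither direction helps."""
    t, best = start, log_likelihood(start)
    while step > tol:
        for candidate in (t + step, t - step):
            if candidate >= min_t and log_likelihood(candidate) > best:
                t, best = candidate, log_likelihood(candidate)
                break
        else:
            step /= 2.0
    return t, best

start = random.Random(1).uniform(0.01, 2.0)  # random initial value
t_hat, best_ll = hill_climb(start)
print("ML branch length:", t_hat, "log likelihood:", best_ll)
```

For this two-sequence case the result can be checked against the closed-form Jukes-Cantor distance, -3/4 ln(1 - 4/3 d), where d is the observed fraction of differing sites. Real tree searches must also explore topologies, where no such closed form exists and the search can get stuck in local maxima.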