CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.

Slides:



Advertisements
Similar presentations
1 Number of substitutions between two protein- coding genes Dan Graur.
Advertisements

Quick Lesson on dN/dS Neutral Selection Codon Degeneracy Synonymous vs. Non-synonymous dN/dS ratios Why Selection? The Problem.
Lecture (11,12) Parameter Estimation of PDF and Fitting a Distribution Function.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Likelihood Ratio, Wald, and Lagrange Multiplier (Score) Tests
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Molecular Evolution Revised 29/12/06
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
The Simple Regression Model
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Molecular Evolution with an emphasis on substitution rates Gavin JD Smith State Key Laboratory of Emerging Infectious Diseases & Department of Microbiology.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Positive selection A new allele (mutant) confers some increase in the fitness of the organism Selection acts to favour this allele Also called adaptive.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Distance Matrix Methods: Models of Evolution Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
Molecular Evolution, Part 2 Everything you didn’t want to know… and more! Everything you didn’t want to know… and more!
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Consensus Trees Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
The concept of likelihood refers given some data D, a decision must be made about an adequate explanation of the data. In the phylogenetic framework, one.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Distance Matrix Methods Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
Maximum Likelihood Molecular Evolution. Maximum Likelihood The likelihood function is the simultaneous density of the observation, as a function of the.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
Molecular basis of evolution. Goal – to reconstruct the evolutionary history of all organisms in the form of phylogenetic trees. Classical approach: phylogenetic.
Tree Inference Methods
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Computational Biology, Part D Phylogenetic Trees Ramamoorthi Ravi/Robert F. Murphy Copyright  2000, All rights reserved.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
PHYLOGENETICS CONTINUED TESTS BY TUESDAY BECAUSE SOME PROBLEMS WITH SCANTRONS.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BiC BioCentrum-DTU Technical University of Denmark 1/31 Prediction of significant positions in biological sequences.
Comp. Genomics Recitation 3 The statistics of database searching.
Calculating branch lengths from distances. ABC A B C----- a b c.
Confidence intervals and hypothesis testing Petter Mostad
Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.
STA 286 week 131 Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression.
Rooting Phylogenetic Trees with Non-reversible Substitution Models Von Bing Yap* and Terry Speed § *Statistics and Applied Probability, National University.
N=50 s=0.150 replicates s>0 Time till fixation on average: t av = (2/s) ln (2N) generations (also true for mutations with negative “s” ! discuss among.
What is positive selection?
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
1 1 Slide © 2008 Thomson South-Western. All Rights Reserved Chapter 12 Tests of Goodness of Fit and Independence n Goodness of Fit Test: A Multinomial.
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
 List the characteristics of the F distribution.  Conduct a test of hypothesis to determine whether the variances of two populations are equal.  Discuss.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Chapter 4: Basic Estimation Techniques
Chapter 4 Basic Estimation Techniques
Basic Estimation Techniques
Lecture 6B – Optimality Criteria: ML & ME
Neutrality Test First suggested by Kimura (1968) and King and Jukes (1969) Shift to using neutrality as a null hypothesis in positive selection and selection.
Linkage and Linkage Disequilibrium
Likelihood Ratio, Wald, and Lagrange Multiplier (Score) Tests
Basic Estimation Techniques
Molecular basis of evolution.
Lecture 6B – Optimality Criteria: ML & ME
CHAPTER 12 More About Regression
One-way Analysis of Variance
Presentation transcript:

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Model = hypothesis !!!Model = hypothesis !!! Hypothesis (as used in most biological research):Hypothesis (as used in most biological research): –Precisely stated, but qualitative –Allows you to make qualitative predictions Arithmetic model:Arithmetic model: –Mathematically explicit (parameters) –Allows you to make quantitative predictions …

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Modeling: An example

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS y = ax + b Simple 2-parameter model Modeling: An example

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS y = ax + b Predictions based on model Modeling: An example

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Fit, parameter estimation y = ax + b Measure of how well the model fits the data: sum of squared errors (SSE) Best parameter estimates: those that give the smallest SSE (least squares model fitting)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Fit, parameter estimation y = 1.24x Measure of how well the model fits the data: sum of squared errors (SSE) Best parameter estimates: those that give the smallest SSE (least squares model fitting)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The maximum likelihood approach  Likelihood = Probability (Data | Model)  Maximum likelihood: Best estimate is the set of parameter values which gives the highest possible likelihood.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling applied to phylogeny Observed data: multiple alignment of sequencesObserved data: multiple alignment of sequences H.sapiens globinA G G G A T T C A H.sapiens globinA G G G A T T C A M.musculus globinA C G G T T T - A M.musculus globinA C G G T T T - A R.rattus globinA C G G A T T - A R.rattus globinA C G G A T T - A Probabilistic model parameters (simplest case):Probabilistic model parameters (simplest case): –Tree topology and branch lengths –Nucleotide frequencies:  A,  C,  G,  T –Nucleotide-nucleotide substitution rates (or substitution probabilities):

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Computing the probability of one column in an alignment given tree topology and other parameters A C G G A T T C A A C G G T T T - A A A G G A T T - A A G G G T T T - A A A C CA G Columns in alignment contain homologous nucleotides Assume tree topology, branch lengths, and other parameters are given. Assume ancestral states were A and A. Start computation at any internal or external node. Pr =  C P CA (t 1 ) P AC (t 2 ) P AA (t 3 ) P AG (t 4 ) P AA (t 5 ) t4t4 t5t5 t3t3 t1t1 t2t2

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Computing the probability of an entire alignment given tree topology and other parameters Probability must be summed over all possible combinations of ancestral nucleotides. (Here we have two internal nodes giving 16 possible combinations) Probability of individual columns are multiplied to give the overall probability of the alignment, i.e., the likelihood of the model. Often the log of the probability is used (log likelihood)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection: How Do We Choose Between Different Types of Models? Select model with best fit?

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Over-fitting: More parameters always result in a better fit to the data, but not necessarily in a better description y = ax + b 2 parameter model Good description, poor fit y = ax 6 +bx 5 +cx 4 +dx 3 +ex 2 +fx+g 7 parameter model Poor description, good fit

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Selecting the best model: the likelihood ratio test The fit of two alternative models can be compared using the ratio of their likelihoods:The fit of two alternative models can be compared using the ratio of their likelihoods: LR =P(Data  M1) = L,M1 P(Data  M2) L,M2 Note that LR > 1 if model 1 has the highest likelihoodNote that LR > 1 if model 1 has the highest likelihood For nested models it can be shown that if the simplest (“null”) model is true, thenFor nested models it can be shown that if the simplest (“null”) model is true, then Δ = ln(LR 2 ) = 2*ln(LR) = 2* (lnL,M1 - lnL,M2) follows a  2 distribution with degrees of freedom equal to the number of extra parameters in the most complicated model. This makes it possible to perform stringent statistical tests to determine which model (hypothesis) best describes the data

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Asking biological questions in a likelihood ratio testing framework Fit two alternative, nested models to the data.Fit two alternative, nested models to the data. Record optimized likelihood and number of free parameters for each fitted model.Record optimized likelihood and number of free parameters for each fitted model. Test if alternative (parameter-rich) model is significantly better than null-model (i.e., the simplest model), given number of additional parameters (n extra ):Test if alternative (parameter-rich) model is significantly better than null-model (i.e., the simplest model), given number of additional parameters (n extra ): 1. Compute Δ = 2 x (lnL Alternative - lnL Null ) 2. Compare Δ to  2 distribution with n extra degrees of freedom Depending on models compared, different biological questions can be addressed (presence of molecular clock, presence of positive selection, difference in mutation rates among sites or branches, etc.) Depending on models compared, different biological questions can be addressed (presence of molecular clock, presence of positive selection, difference in mutation rates among sites or branches, etc.)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Positive selection I: synonymous and non-synonymous mutations 20 amino acids, 61 codons 20 amino acids, 61 codons  –Most amino acids encoded by more than one codon –Not all mutations lead to a change of the encoded amino acid –”Synonymous mutations” are rarely selected against CTA (Leu) CTC (Leu) CTG (Leu) CTT (Leu) CAA (Gln) CCA (Pro) CGA (Arg) ATA (Ile) GTA (Val) TTA (Leu) 1 non-synonymous nucleotide site 1 synonymous nucleotide site 1/3 synonymous 2/3 nonsynymous nucleotide site

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Positive selection II: non-synonymous and synonymous mutation rates contain information about selective pressure dN: rate of non-synonymous mutations per non-synonymous sitedN: rate of non-synonymous mutations per non-synonymous site dS: rate of synonymous mutations per synonymous sitedS: rate of synonymous mutations per synonymous site Recall: Evolution is a two-step process:Recall: Evolution is a two-step process: (1) Mutation (random) (2) Selection (non-random) Randomly occurring mutations will lead to dN/dS=1.Randomly occurring mutations will lead to dN/dS=1. Significant deviations from this most likely caused by subsequent selection.Significant deviations from this most likely caused by subsequent selection. dN/dS < 1: Higher rate of synonymous mutations: negative (purifying) selectiondN/dS < 1: Higher rate of synonymous mutations: negative (purifying) selection dN/dS > 1: Higher rate of non-synonymous mutations: positive selectiondN/dS > 1: Higher rate of non-synonymous mutations: positive selection

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Exercise: positive selection in HIV? Fit two alternative models to HIV data:Fit two alternative models to HIV data: –M1: two classes of sites with different dN/dS ratios: Class 1 has dN/dS<1Class 1 has dN/dS<1 Class 2 has dN/dS=1Class 2 has dN/dS=1 –M2: three distinct classes with different dN/dS ratios: Class 1 has dN/dS<1Class 1 has dN/dS<1 Class 2 has dN/dS=1Class 2 has dN/dS=1 Class 3 has dN/dS>1Class 3 has dN/dS>1 Use likelihood ratio test to examine if M2 is significantly better than M1,Use likelihood ratio test to examine if M2 is significantly better than M1, If M2 significantly better than M1 AND if some codons belong to the class with dN/dS>1 (the positively selected class) then you have statistical evidence for positive selection.If M2 significantly better than M1 AND if some codons belong to the class with dN/dS>1 (the positively selected class) then you have statistical evidence for positive selection. Most likely reason: immune escape (i.e., sites must be in epitopes)Most likely reason: immune escape (i.e., sites must be in epitopes) : Codons showing dN/dS > 1: likely epitopes