Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.



Likelihood
If we flip a coin and get a head, and we think the coin is unbiased, then the likelihood of observing this head is 0.5. If we think the coin is biased so that we expect to get a head 80% of the time, then the likelihood of observing this datum (a head) is 0.8. The likelihood of an observation depends entirely on the model we assume to underlie it. The datum has not changed; our model has. Therefore, under the new model, the likelihood of observing the datum has changed.
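The coin example can be written out as a tiny sketch. The function name `likelihood_of_flip` is an illustrative assumption, not something from the slides; the two models are the unbiased coin (0.5) and the biased coin (0.8) described above.

```python
def likelihood_of_flip(outcome: str, p_head: float) -> float:
    """Probability of one coin-flip outcome ('H' or 'T') under a model
    in which heads come up with probability p_head."""
    if outcome == "H":
        return p_head
    if outcome == "T":
        return 1.0 - p_head
    raise ValueError("outcome must be 'H' or 'T'")


datum = "H"  # the observation stays the same
fair = likelihood_of_flip(datum, 0.5)    # model 1: unbiased coin
biased = likelihood_of_flip(datum, 0.8)  # model 2: heads 80% of the time
print(fair, biased)
```

The datum is held fixed and only the model parameter changes, which is exactly the sense in which the likelihood is a function of the model.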

Maximum Likelihood (ML)
ML assumes an explicit model of sequence evolution. This is justifiable, since molecular sequence data can be shown to have arisen according to a stochastic process. ML attempts to answer the question: What is the probability that I would observe these data (a multiple sequence alignment) given a particular model of evolution (a tree and a process)?

Likelihood calculations
In molecular phylogenetics, the data are an alignment of sequences. We optimize parameters and branch lengths to get the maximum likelihood. Each site has a likelihood. The total likelihood is the product of the site likelihoods. The maximum likelihood tree is the tree topology that gives the highest (optimized) likelihood under the given model. We use reversible models, so the position of the root does not matter.

What is the probability of observing a G nucleotide?
If we have a DNA sequence of 1 nucleotide in length and the identity of this nucleotide is G, what is the likelihood that we would observe this G? In the same way as the coin-flipping observation, the likelihood of observing this G is dependent on the model of sequence evolution that is thought to underlie the data.
Model 1: frequency of G = 0.4 => likelihood(G) = 0.4
Model 2: frequency of G = 0.1 => likelihood(G) = 0.1
Model 3: frequency of G = 0.25 => likelihood(G) = 0.25

What about longer sequences?
If we consider a gene of length 2 (gene 1: GA), then the probability of observing this gene is the product of the probabilities of observing each character.
Model: frequency of G = 0.4, frequency of A = 0.15, so p(G) = 0.4 and p(A) = 0.15.
Likelihood (GA) = 0.4 x 0.15 = 0.06

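…or even longer sequences?
gene 1: GACTAGCTAGACAGATACGAATTAC
Model: simple base frequency model with p(A)=0.15; p(C)=0.2; p(G)=0.4; p(T)=0.25 (the sum of all probabilities must equal 1).
Likelihood (gene 1) = the product of the base probabilities over all 25 positions = 0.15^10 x 0.2^5 x 0.4^5 x 0.25^5 (gene 1 contains 10 A, 5 C, 5 G and 5 T).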
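As a minimal sketch of this calculation, the product over positions can be computed directly, or as a sum of logarithms, which avoids very small numbers for long sequences. The function names are illustrative assumptions; the sequence and base frequencies are the ones given on the slide.

```python
import math

GENE_1 = "GACTAGCTAGACAGATACGAATTAC"
MODEL = {"A": 0.15, "C": 0.2, "G": 0.4, "T": 0.25}


def sequence_likelihood(seq, base_probs):
    """Product of the per-position base probabilities."""
    likelihood = 1.0
    for base in seq:
        likelihood *= base_probs[base]
    return likelihood


def sequence_log_likelihood(seq, base_probs):
    """Sum of log probabilities; numerically safer for long sequences."""
    return sum(math.log(base_probs[base]) for base in seq)


lik = sequence_likelihood(GENE_1, MODEL)
log_lik = sequence_log_likelihood(GENE_1, MODEL)
print(lik, log_lik)
```

Working with the log-likelihood is the usual choice in practice, because the raw product underflows quickly as the alignment grows.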

Note about models
You might notice that our model of base frequency is not the optimal model for our observed data. If we had used the model p(A)=0.4; p(C)=0.2; p(G)=0.2; p(T)=0.2, the likelihood of observing the gene would be L (gene 1) = 0.4^10 x 0.2^5 x 0.2^5 x 0.2^5 (gene 1 contains 10 A, 5 C, 5 G and 5 T), which is higher than under the first model. The datum has not changed; our model has. Therefore, under the new model, the likelihood of observing the datum has changed.
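The comparison between the two base-frequency models can be sketched as follows. The helper names are assumptions made for illustration. The sketch also relies on a standard result that is not stated on the slide: for this simple base-frequency model, the maximum likelihood estimates are the observed base proportions.

```python
import math
from collections import Counter

GENE_1 = "GACTAGCTAGACAGATACGAATTAC"
MODEL_1 = {"A": 0.15, "C": 0.2, "G": 0.4, "T": 0.25}  # first model
MODEL_2 = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}    # alternative model


def log_likelihood(seq, base_probs):
    """Log of the product of per-position base probabilities."""
    return sum(math.log(base_probs[base]) for base in seq)


def ml_base_frequencies(seq):
    """Observed proportions, which maximize the likelihood for this model."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


ll_1 = log_likelihood(GENE_1, MODEL_1)
ll_2 = log_likelihood(GENE_1, MODEL_2)
ml_freqs = ml_base_frequencies(GENE_1)
print(ll_1, ll_2, ml_freqs)
```

For this gene the observed proportions coincide with the second model, which is why no other base-frequency model can give a higher likelihood for these data.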

Increase in model sophistication
It is no longer enough to invoke a model that covers only base composition; we must also include the mechanism of sequence change and stasis. There are two parts to this model: the tree and the process (the latter is confusingly referred to as the model, although both parts really compose the model).

Different Branch Lengths
For very short branch lengths, the probability of a character staying the same is high and the probability of it changing is low. For longer branch lengths, the probability of character change becomes higher and the probability of staying the same is lower. The previous calculations are based on the assumption that the branch length describes one Certain Evolutionary Distance (CED). If we want to consider a branch length that is twice as long (2 CED), then we can multiply the substitution matrix by itself (matrix²).
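A minimal sketch of the matrix-squaring idea follows. The numbers in `P_1CED` are hypothetical, chosen only so that each row sums to 1; the slides do not give the actual substitution matrix, and the helper names are assumptions.

```python
# Hypothetical substitution matrix for a branch of 1 CED.
# Rows: starting base A, C, G, T. Columns: ending base A, C, G, T.
P_1CED = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]


def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n_ced):
    """Substitution matrix for a branch n_ced times as long (n_ced >= 1)."""
    result = m
    for _ in range(n_ced - 1):
        result = mat_mul(result, m)
    return result


P_2CED = mat_power(P_1CED, 2)    # branch twice as long
P_10CED = mat_power(P_1CED, 10)  # branch ten times as long
print(P_1CED[0][0], P_2CED[0][0], P_10CED[0][0])
```

The diagonal entries shrink as the branch gets longer, matching the statement that the probability of a character staying the same falls with branch length.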

Maximum Likelihood
Two trees, each consisting of a single branch: tree I (A) with v = 0.1 and tree II (C) with v = 1.0. The branch length is v = μt, where μ = mutation rate and t = time.

Jukes-Cantor model
Tree I (A): v = 0.1. Tree II (C): v = 1.0.
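The two single-branch trees can be compared with a short sketch under the Jukes-Cantor model. It uses the standard Jukes-Cantor transition probabilities, with v measured in expected substitutions per site. It assumes that the two sequences shown with this example in the transcript, AACC and CACT, sit at the two ends of the branch; that reading and the function names are assumptions.

```python
import math


def jc_prob(same: bool, v: float) -> float:
    """Jukes-Cantor probability that a site is unchanged (same=True) or has
    become one specific other base (same=False) along a branch of length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def pair_likelihood(seq_a: str, seq_b: str, v: float) -> float:
    """Likelihood of two aligned sequences joined by one branch of length v.
    Each site contributes (equilibrium frequency 1/4) x (transition probability)."""
    likelihood = 1.0
    for x, y in zip(seq_a, seq_b):
        likelihood *= 0.25 * jc_prob(x == y, v)
    return likelihood


lik_tree_1 = pair_likelihood("AACC", "CACT", 0.1)  # tree I, v = 0.1
lik_tree_2 = pair_likelihood("AACC", "CACT", 1.0)  # tree II, v = 1.0
print(lik_tree_1, lik_tree_2)
```

Because two of the four sites differ, the longer branch explains these data better than the very short one, so tree II gets the higher likelihood in this sketch.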

Sequences: AACC, CACT

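Alignment of four sequences, sites 1 to N:
1 C G G A C A C G T T T A C
2 C A G A C A C C T C T A C
3 C G G A T A A G T T A A C
4 C G G A T A G C C T A G C
Site j is the seventh column, where the tips of the tree carry C (sequence 1), C (sequence 2), A (sequence 3) and G (sequence 4); nodes 5 and 6 are the internal nodes. The likelihood of site j sums over every possible assignment of bases to the two internal nodes:
L (j) = p(CCAG, internal nodes A, A) + p(CCAG, internal nodes C, A) + … + p(CCAG, internal nodes T, T)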
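The sum over internal-node states can be sketched directly. Several things here are assumptions, since the tree figure did not survive in the transcript: that tips 1 and 2 attach to node 5 and tips 3 and 4 attach to node 6, that every branch has the same length `v`, and that the process is Jukes-Cantor. The function names are illustrative.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x: str, y: str, v: float) -> float:
    """Jukes-Cantor probability of base x becoming base y over branch length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(tips, v=0.1):
    """Likelihood of one alignment column on the assumed tree ((1,2)5,(3,4)6).
    Sums over all 16 assignments of bases to internal nodes 5 and 6."""
    s1, s2, s3, s4 = tips
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25                      # equilibrium frequency of node 5
                  * jc_prob(x5, s1, v)      # node 5 -> tip 1
                  * jc_prob(x5, s2, v)      # node 5 -> tip 2
                  * jc_prob(x5, x6, v)      # node 5 -> node 6
                  * jc_prob(x6, s3, v)      # node 6 -> tip 3
                  * jc_prob(x6, s4, v))     # node 6 -> tip 4
    return total


lik_j = site_likelihood(("C", "C", "A", "G"))
print(lik_j)
```

A useful sanity check on a sketch like this is that the site likelihoods of all 256 possible tip patterns add up to 1.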

Likelihood of the alignment at various branch lengths

Strengths of ML
ML does not try to make an observation of sequence change and then a correction for superimposed substitutions. There is no need to ‘correct’ for anything; the models take care of superimposed substitutions. Accurate branch lengths. Each site has a likelihood. If the model is correct, we should retrieve the correct tree (if we have long-enough sequences and a sophisticated-enough model). You can use a model that fits the data. ML uses all the data (no selection of sites based on informativeness; all sites are informative). ML can tell you not only about the phylogeny of the sequences, but also about the process of evolution that led to the observations of today’s sequences.

Weaknesses of ML
Can be inconsistent if we use models that are not accurate. The model might not be sophisticated enough. Very computationally intensive. It might not be possible to examine all models (substitution matrices, tree topologies).

Models
You can use models that: deal with different transition/transversion ratios; deal with unequal base composition; deal with heterogeneity of rates across sites; deal with heterogeneity of the substitution process (different rates across lineages, different rates at different parts of the tree). The more free parameters, the better your model fits your data (good). The more free parameters, the higher the variance of the estimate (bad).

Choosing a Model
Don’t assume a model; rather, find a model that fits your data. Models often have “free” parameters. These can be fixed to a reasonable value, or estimated by ML. The more free parameters, the better the fit (higher the likelihood) of the model to the data. (Good!) The more free parameters, the higher the variance, and the less power to discriminate among competing hypotheses. (Bad!) We do not want to over-fit the model to the data.

What is the best way to fit a line (a model) through these points?
How can we tell if adding (or removing) a certain parameter is a good idea? Use statistics. The null hypothesis is that the presence or absence of the parameter makes no difference. In order to assess significance you need a null distribution.

We have some DNA data, and a tree. Evaluate the data with 3 different models (table columns: model, ln likelihood, ∆; rows: JC, K2P, GTR).
1. Evaluations with more complex models have higher likelihoods.
2. The K2P model has 1 more parameter than the JC model.
3. The GTR model has 4 more parameters than the K2P model.
4. Are the extra parameters worth adding?

JC vs K2P: We have generated many true null hypothesis data sets and evaluated them under the JC model and the K2P model. 95% of the differences are under 2. The statistic for our original data set was 91.95, and so it is highly significant. In this case it is worthwhile to add the extra parameter (tRatio).
K2P vs GTR: We have generated many true null hypothesis data sets and evaluated them under the K2P model and the GTR model. The statistic for our original data set was 1.79, and so it is not significant. In this case it is not worthwhile to add the extra parameters.
You can use the χ² approximation to assess significance of adding parameters.
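The χ² approximation can be sketched as a likelihood ratio test. The ln likelihood values below are hypothetical placeholders, since the table values did not survive in the transcript; only the parameter differences (1 and 4) come from the slides. The sketch uses the common convention that the test statistic is twice the difference in ln likelihood, and a closed-form χ² tail that covers only 1 or an even number of degrees of freedom.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable.
    Closed forms only: df == 1, or df even."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        half = x / 2.0
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch handles df == 1 or even df only")


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Return (statistic, p-value) for nested models."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_sf(statistic, extra_params)


# Hypothetical ln likelihoods for illustration only.
LNL = {"JC": -1000.0, "K2P": -990.0, "GTR": -989.0}

stat_1, p_1 = likelihood_ratio_test(LNL["JC"], LNL["K2P"], extra_params=1)
stat_2, p_2 = likelihood_ratio_test(LNL["K2P"], LNL["GTR"], extra_params=4)
print(stat_1, p_1, stat_2, p_2)
```

With these placeholder values the first comparison is significant at the 5% level and the second is not, mirroring the pattern described on the slide; a simulated null distribution, as used there, is the alternative when the χ² approximation is in doubt.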

Bayesian Inference

Maximum likelihood: search for the tree that maximizes the chance of seeing the data, P (Data | Tree).
Bayesian inference: search for the tree that maximizes the chance of seeing the tree given the data, P (Tree | Data).

Bayesian Phylogenetics
Maximize the posterior probability of a tree given the aligned DNA sequences. Two steps:
- Definition of the posterior probabilities of trees (Bayes’ Rule)
- Approximation of the posterior probabilities of trees by Markov chain Monte Carlo (MCMC) methods
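For a handful of candidate trees, Bayes' Rule can be applied directly, as in this minimal sketch. The tree names, likelihoods and priors are hypothetical values chosen for illustration. Real tree spaces are far too large for this direct sum, which is why the approximation step is needed.

```python
def posterior_probabilities(likelihoods, priors):
    """Bayes' Rule: P(Tree | Data) = P(Data | Tree) * P(Tree) / P(Data),
    where P(Data) is the sum of likelihood x prior over all candidate trees."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical likelihoods P(Data | Tree) for three candidate topologies.
likelihoods = {"tree_1": 2e-10, "tree_2": 1e-10, "tree_3": 1e-10}
# Flat prior: every topology equally probable before seeing the data.
priors = {"tree_1": 1 / 3, "tree_2": 1 / 3, "tree_3": 1 / 3}

posteriors = posterior_probabilities(likelihoods, priors)
print(posteriors)
```

With a flat prior the posterior is simply the likelihood rescaled to sum to 1, so the tree with the highest likelihood also has the highest posterior probability.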

Bayesian Inference (figure labels: 90, 10)

Markov Chain Monte Carlo Methods
Posterior probabilities of trees are complex joint probabilities that cannot be calculated analytically. Instead, the posterior probabilities of trees are approximated with Markov Chain Monte Carlo (MCMC) methods that sample trees from their posterior probability distribution.

MCMC
A way of sampling / touring a set of solutions, biased by their likelihood.
1. Make a random solution N1 the current solution.
2. Pick another solution N2.
3. If Likelihood (N1) < Likelihood (N2), replace N1 with N2.
4. Else, with probability Likelihood (N2) / Likelihood (N1), replace N1 with N2.
5. Sample (record) the current solution.
6. Repeat from step 2.
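The sampling steps can be sketched on a toy problem that is much simpler than tree space. Here a "solution" is a coin's probability of heads, and the data are a hypothetical 8 heads in 10 flips. The proposal scheme (a small symmetric random step), the step size and all names are assumptions made for illustration; this is not a phylogenetic sampler.

```python
import math
import random


def log_likelihood(p, heads=8, flips=10):
    """Log-likelihood of the hypothetical data under heads-probability p."""
    if not 0.0 < p < 1.0:
        return float("-inf")  # impossible solution, never accepted
    return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)


def metropolis(n_steps, step_size=0.1, seed=0):
    rng = random.Random(seed)
    current = rng.uniform(0.05, 0.95)            # step 1: random start N1
    current_ll = log_likelihood(current)
    samples = []
    for _ in range(n_steps):                     # step 6: repeat
        proposal = current + rng.uniform(-step_size, step_size)  # step 2: N2
        proposal_ll = log_likelihood(proposal)
        # step 3: always accept a better solution;
        # step 4: otherwise accept with probability L(N2) / L(N1)
        if (proposal_ll >= current_ll
                or rng.random() < math.exp(proposal_ll - current_ll)):
            current, current_ll = proposal, proposal_ll
        samples.append(current)                  # step 5: record current
    return samples


samples = metropolis(20000)
mean_p = sum(samples) / len(samples)
print(mean_p)
```

Solutions are visited in proportion to their likelihood, so a histogram of `samples` approximates the distribution of plausible values rather than returning only the single best one.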

Bayesian Inference