
 14.4. Tue Introduction to models (Jarno)
 16.4. Thu Distance-based methods (Jarno)
 17.4. Fri ML analyses (Jarno)
 20.4. Mon Assessing hypotheses (Jarno)
 Tue Problems with molecular data (Jarno)
 Thu Problems with molecular data (Jarno)
 Phylogenomics
 Fri Search algorithms, visualization, and other computational aspects (Jarno)

Maximum Likelihood
 Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that the proposed model of the evolutionary process, together with the hypothesised history (tree), would give rise to the data we observe.

 The likelihood, L, is the probability, P, of the data (D) given the hypothesis (H):
◦ L = P(D | H)
◦ D = the observed data (the aligned sequences)
◦ H = the tree topology, branch lengths and model of evolution

 In statistical usage, a distinction is made depending on the roles of the outcome or parameter.
 Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time?
 Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair? [Wikipedia, article on likelihood]

 An optimality criterion (as is parsimony)  Given a model and data we can evaluate a tree  We can choose between trees based on the likelihood of a given tree  The tree(s) with the highest likelihood is the best

[Figure: hierarchy of nucleotide substitution models]
Equal base frequencies: JC (single substitution type) -> K2P (2 substitution types) -> K3ST (3 substitution types) -> SYM (6 substitution types)
Variable base frequencies: F81 (single substitution type) -> HKY85 / F84 (2 substitution types) -> TrN (3 substitution types) -> GTR (6 substitution types)

 Maximum Likelihood estimates parameter values of an explicit model from observed data  Likelihood provides ways of evaluating models in terms of their log likelihoods  Different trees can also be evaluated for their fit to the data under a particular model (likelihood ratio tests of two trees after Kishino & Hasegawa)

 Let's toss a coin ten times (n). It lands heads up 4 times (x) and tails up 6 times. What is the probability of a head in a single toss? ◦ Compare: what is the likelihood of the data given the process?
 Naturally p hat = x / n = 4 / 10 = 0.4
 This is also the maximum likelihood estimate of p.
 Let's see why...


 The likelihood function can be maximized analytically or by "brute force" (evaluating it over a grid of values).
 For example, the result for p = 0.4 is:
◦ L = 210 * 0.4^4 * 0.6^6 ≈ 0.2508
◦ logL = log(L) ≈ -1.383
◦ -logL ≈ 1.383
 Analytically, the point where the derivative of the likelihood function is zero, and the second derivative is negative, is the maximum of the function.
 Graphically...

[Figure: the likelihood as a function of p; the peak of the curve is the maximum likelihood, and the value of p at the peak is the maximum likelihood estimator of p.]
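For reference, the analytic solution mentioned above is a standard one-line derivation of the binomial maximum likelihood estimate (using the same x = 4, n = 10):

logL(p) = log C(n, x) + x log(p) + (n - x) log(1 - p)
d logL / dp = x/p - (n - x)/(1 - p) = 0  =>  p hat = x/n = 4/10 = 0.4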

[Figure: likelihood curves for a parameter μ1; a sharply peaked curve corresponds to a precise estimate, a flat curve to an imprecise estimate.]

# Brute-force likelihood curve for the coin example (x heads out of n tosses)
l <- function(x, n) {
  p <- seq(0, 1, 0.01)
  L <- rep(NA, length(p))
  for (i in 1:length(p)) {
    # binomial likelihood: C(n, x) * p^x * (1 - p)^(n - x)
    L[i] <- p[i]^x * (1 - p[i])^(n - x) *
      (factorial(n) / (factorial(x) * factorial(n - x)))
  }
  d <- data.frame(p = p, L = L, logL = log(L))
  return(d)
}
plot(l(4, 10)[, c(1, 3)], ylim = c(-30, 0), type = "l")

# The same curve using the built-in binomial density, on the log scale
l2 <- function(x, n) {
  p <- seq(0, 1, 0.01)
  L <- rep(NA, length(p))
  for (i in 1:length(p)) {
    L[i] <- dbinom(x, size = n, prob = p[i], log = TRUE)
  }
  d <- data.frame(p = p, L = L)
  return(d)
}
plot(l2(4, 10), type = "l")

plot(l2(4, 10), type = "l")

 Why log likelihood?
 L(0.99 | 10, 4) = 210 * 0.99^4 * 0.01^6 ≈ 2.0e-10
 -logL(0.99 | 10, 4) ≈ 22.3
◦ When you multiply very small values together, the result is even smaller, and at some point the precision disappears (a limitation of computer arithmetic)
◦ The same does not happen with log values:
 L = 210 * 0.4^4 * 0.6^6 ≈ 0.2508
 logL = log(210) + 4*log(0.4) + 6*log(0.6) ≈ -1.383
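A minimal R illustration of why logs are needed (the 1,000 sites with likelihood 1e-4 each are an invented example, not values from the slides):

# Multiplying many small site likelihoods underflows to zero,
# but summing their logs is perfectly stable.
p.site <- rep(1e-4, 1000)
prod(p.site)        # 0 (underflow in double precision)
sum(log(p.site))    # -9210.34, no problem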

 DNA sequences can be thought of as four-sided dice.
 Thus, the previous coin example can be straightforwardly generalized to DNA sequences.
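As a small sketch of the analogy (the base counts below are invented for illustration): the maximum likelihood estimates of the four "die face" probabilities are simply the observed proportions, just as p hat = x/n for the coin.

# Hypothetical base counts observed along a sequence
counts <- c(A = 30, C = 20, G = 25, T = 25)
counts / sum(counts)                               # ML estimates of the base frequencies
dmultinom(counts, prob = counts / sum(counts),     # maximized log likelihood under the
          log = TRUE)                              # multinomial ("four-sided die") model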

Maximum likelihood tree reconstruction
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown, under our chosen model?
[Figure: unrooted Tree A for taxa 1-4.]

Maximum likelihood tree reconstruction
 The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model; with two unknown ancestral nodes there are 4 x 4 possibilities.
[Figure: the alignment with site j (states C, C, A, G) highlighted and Tree A with its two internal nodes unknown (each A, C, G or T); the root state is drawn from the stationary base frequencies (stationarity).]

Maximum likelihood tree reconstruction
 The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model.
[Figure: as above, now also showing the substitution matrix, in which every change among A, C, G and T has the same rate α.]

Maximum likelihood tree reconstruction
 The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model.
[Figure: as above, for a different site j with tip states A, A, T, A.]

[Figure: a five-taxon tree with tip states A, C, C, C, G, internal nodes x, y, z and w, and branch lengths t1-t8.]
P(A, C, C, C, G, x, y, z, w | T) = Prob(x) * Prob(y | x, t6) * Prob(A | y, t1) * Prob(C | y, t2) * Prob(z | x, t8) * Prob(C | z, t3) * Prob(w | z, t7) * Prob(C | w, t4) * Prob(G | w, t5)
The t_i are branch lengths (rate x time).
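A small sketch of this calculation in R: the site likelihood is obtained by summing the product above over all 4^4 combinations of the ancestral states x, y, z and w. The Jukes-Cantor transition probabilities and the branch lengths (all 0.1) are assumptions made for the example only; the slides do not specify them here.

# Jukes-Cantor transition probability matrix for a branch of length t
jc.p <- function(t) {
  same <- 0.25 + 0.75 * exp(-4 * t / 3)
  diff <- 0.25 - 0.25 * exp(-4 * t / 3)
  P <- matrix(diff, 4, 4, dimnames = list(c("A","C","G","T"), c("A","C","G","T")))
  diag(P) <- same
  P
}

t.branch <- rep(0.1, 8)            # assumed branch lengths t1..t8
P <- lapply(t.branch, jc.p)        # P[[i]] belongs to branch t_i
tip <- c("A", "C", "C", "C", "G")  # observed states at the five tips
nuc <- c("A", "C", "G", "T")

site.L <- 0
for (x in nuc) for (y in nuc) for (z in nuc) for (w in nuc) {
  site.L <- site.L + 0.25 *        # Prob(x): stationary frequency at the root
    P[[6]][x, y] * P[[1]][y, tip[1]] * P[[2]][y, tip[2]] *
    P[[8]][x, z] * P[[3]][z, tip[3]] *
    P[[7]][z, w] * P[[4]][w, tip[4]] * P[[5]][w, tip[5]]
}
site.L                             # likelihood of this single site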

 Assume a Jukes-Cantor model (all nucleotide frequencies and all substitution rates are equal). Further assume that the branch length is 0.1.
 Then we can generate a so-called P-matrix from the Jukes-Cantor model's Q-matrix (by matrix exponentiation, P = exp(Qt)).
 Its entries are the probabilities of a nucleotide staying the same or changing to a particular other nucleotide along that branch; for a branch length of 0.1, the probability of no change is about 0.906 and the probability of changing to any particular other nucleotide is about 0.031.

 A: acct
 B: gcct
 L = (0.25 * 0.031)^1 * (0.25 * 0.906)^3 ≈ 9.1e-05
 logL = log(L) ≈ -9.3
 For other branch lengths, the P matrix can be multiplied by itself k times; this gives the P matrix for a branch k times as long.
 A branch length can be optimized by finding the length that maximizes the likelihood.
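A sketch of these calculations in R, using the analytic Jukes-Cantor transition probabilities (the two aligned sequences are the ones above; the final optimize() call illustrates the branch-length optimization mentioned in the last bullet):

# Jukes-Cantor transition probabilities for a branch of length t
jc.p <- function(t) {
  same <- 0.25 + 0.75 * exp(-4 * t / 3)     # probability of no change
  diff <- 0.25 - 0.25 * exp(-4 * t / 3)     # probability of one specific change
  P <- matrix(diff, 4, 4, dimnames = list(c("a","c","g","t"), c("a","c","g","t")))
  diag(P) <- same
  P
}

P  <- jc.p(0.1)
s1 <- strsplit("acct", "")[[1]]
s2 <- strsplit("gcct", "")[[1]]
site.L <- 0.25 * P[cbind(s1, s2)]   # pi * P for each alignment column
prod(site.L)                        # L    ~ 9.1e-05
sum(log(site.L))                    # logL ~ -9.3

# Multiplying P(0.1) by itself gives the matrix for a branch twice as long
max(abs(P %*% P - jc.p(0.2)))       # ~ 0, the two matrices agree

# Branch-length optimization: find the t that maximizes the likelihood
negLogL <- function(t) -sum(log(0.25 * jc.p(t)[cbind(s1, s2)]))
optimize(negLogL, interval = c(1e-6, 5))$minimum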

 Depending on the software, each iteration of the tree optimization algorithm has to, for a certain tree topology:
 Calculate the likelihood of the tree topology given the model and the observed data
 Estimate the optimal branch lengths

Maximum likelihood tree reconstruction
 The likelihood of Tree A is the product of the likelihoods at each site
 The likelihood is usually evaluated by summing the log likelihoods at each site (because the individual probabilities are so small) and reported as the log likelihood of the full tree
 The maximum likelihood tree is the one with the highest likelihood (which might not be Tree A, i.e. it could be another tree topology)
◦ Note: highest likelihood (largest value) = largest logL (least negative, i.e. closest to zero) = smallest -logL

Typical assumptions of ML substitution models
 The probability of any change is independent of the prior history of the site (a Markov model)
 Substitution probabilities do not change with time or over the tree (a homogeneous Markov process)
 Change is time-reversible, e.g. the rate of change of A to T is the same as that of T to A

 A model is always a simplification of what happens in nature ◦ Assumes evolution works parsimoniously  A given model will give more weight to certain changes over others  ML – an objective criterion for choosing one weighting scheme over another?

Based largely on slides by Paul Lewis.

 D will stand for Data  H will mean any one of a number of things: ◦ a discrete hypothesis ◦ a distinct model (e.g. JC, HKY, GTR, etc.) ◦ a tree topology ◦ one of an infinite number of continuous model parameter values (e.g. ts:tv rate ratio)

 In ML, we choose the hypothesis that gives the highest (maximized) likelihood to the data  The likelihood is the probability of the data given the hypothesis L = P (D | H).  A Bayesian analysis expresses its results as the probability of the hypothesis given the data. ◦ this may be a more desirable way to express the result

 The posterior probability, [P (H | D)], is the probability of the hypothesis given the observations, or data (D)  The main feature in Bayesian statistics is that it takes into account prior knowledge of the hypothesis

Bayes' theorem:
P(H | D) = P(D | H) * P(H) / P(D)
where P(H | D) is the posterior probability of hypothesis H, P(D | H) is the likelihood of the hypothesis, P(H) is the prior probability of the hypothesis, and P(D) is the probability of the data (a normalizing constant).

 Both ML and Bayesian methods use the likelihood function
◦ In ML, the free parameters are optimized so as to maximize the likelihood
◦ In a Bayesian approach, the free parameters are described by probability distributions, which are sampled

 Data D: 6 heads (out of 10 flips)  H = true underlying proportion of heads (the probability of coming up heads on any single flip)  if H = 0.5, coin is perfectly fair  if H = 1.0, coin always comes up heads (i.e. it is a trick coin)

 Frequentist (F): there exists a true probability H of getting heads, and the null hypothesis is H0: H = 0.5 ◦ Do the data reject the null hypothesis?
 Bayesian (B): what is the range around 0.5 that we are willing to accept as being in the "fair coin" range? ◦ What is the probability that H is in this range? ◦ For the coin-tossing example, we can calculate the probabilities exactly (see the sketch below) ◦ For more complex data, we need to explore the probability space -> MCMC
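A minimal sketch of that exact calculation in R (assumptions: a flat Beta(1, 1) prior on H and 0.45-0.55 as an arbitrary "fair coin" range; with 6 heads in 10 flips the posterior is then Beta(7, 5)):

# Posterior for H after 6 heads in 10 flips, under a flat Beta(1, 1) prior
post.a <- 1 + 6
post.b <- 1 + 4
# Probability that H lies in the "fair coin" range 0.45-0.55
pbeta(0.55, post.a, post.b) - pbeta(0.45, post.a, post.b)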

 Start somewhere ◦ That "somewhere" will have a likelihood (posterior density) associated with it ◦ Not the optimized, maximum likelihood
 Randomly propose a new state ◦ If the new state has a better likelihood, the chain moves there ◦ If it is worse, the chain moves there with a probability given by the ratio of the two densities; otherwise it stays where it is

 The target distribution is the posterior distribution of interest  The proposal distribution is used to decide where to go next; you have much flexibility here, and the choice affects the efficiency of the MCMC algorithm
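A minimal sketch of such an MCMC run for the coin example (assumptions: flat Beta(1, 1) prior, 6 heads in 10 flips, a symmetric uniform proposal of width 0.2, 20,000 generations and a burn-in of 1,000; none of these values come from the slides):

set.seed(1)
n.gen <- 20000
heads <- 6; flips <- 10

# Log posterior density (up to a constant): binomial likelihood x flat prior
log.post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(heads, flips, p, log = TRUE) + dbeta(p, 1, 1, log = TRUE)
}

chain <- numeric(n.gen)
p <- 0.9                                  # start somewhere (far from the peak)
for (g in 1:n.gen) {
  p.new <- p + runif(1, -0.1, 0.1)        # propose a new state (symmetric proposal)
  # accept if better; if worse, accept with probability equal to the density ratio
  if (log(runif(1)) < log.post(p.new) - log.post(p)) p <- p.new
  chain[g] <- p
}

hist(chain[-(1:1000)], breaks = 50,       # discard the burn-in and summarize
     main = "Posterior sample of H")      # the marginal posterior of H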

 Pro: taking big steps helps in jumping from one “island” in the posterior density to another  Con: taking big steps often results in poor mixing  Solution: MCMCMC!

 MC^3 involves running several chains simultaneously (one “cold” and several “heated”)
 The cold chain is the one that counts; the heated chains are “scouts”
 A chain is heated by raising its densities to a power less than 1.0 (values closer to 0.0 are warmer)

Marginal = the distribution of a single parameter obtained by averaging (integrating) over all possible values of all the other parameters

 Record the position of the chain (the "robot") every 100 or 1000 steps (1000 represents more "thinning" than 100)
 This sample will be autocorrelated, but not much so if it is thinned appropriately (autocorrelation can be measured to assess this)
 If using heated chains, only the cold chain is sampled
 The marginal distribution of any parameter can be obtained from this sample

 Start with random tree and arbitrary initial values for branch lengths and model parameters  Each generation consists of one of these (chosen at random): ◦ Propose a new tree (e.g. Larget-Simon move) and either accept or reject the move ◦ Propose (and either accept or reject) a new model parameter value  Every k generations, save tree topology, branch lengths and all model parameters (i.e. sample the chain)  After n generations, summarize sample using histograms, means, credible intervals, etc.

 For topologies: discrete uniform distribution
 For proportions: Beta(a, b) distribution
◦ flat when a = b = 1
◦ peaked around 0.5 when a = b and both are greater than 1
 For base frequencies: Dirichlet(a, b, c, d) distribution
◦ flat when a = b = c = d = 1
◦ all base frequencies close to 0.25 when v = a = b = c = d and v is large (e.g. 300)
 For GTR model relative rates: Dirichlet(a, b, c, d, e, f) distribution

 For other model parameters and branch lengths: Gamma(a, b) distribution
◦ Exponential(λ) equals the Gamma(1, 1/λ) distribution
◦ The mean of Gamma(a, b) is ab (so the mean of an Exponential(10) distribution is 0.1)
◦ The variance of Gamma(a, b) is ab^2 (so the variance of an Exponential(10) distribution is 0.01)
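A short R sketch of these prior shapes (the specific parameter values are arbitrary examples; rdirichlet() is taken from the MCMCpack package, which is an assumption — the slides do not name a package):

# Beta priors on a proportion
curve(dbeta(x, 1, 1), 0, 1, ylab = "density")      # flat
curve(dbeta(x, 5, 5), 0, 1, add = TRUE, lty = 2)   # peaked around 0.5

# Dirichlet prior on base frequencies: large equal parameters keep
# all four frequencies close to 0.25
library(MCMCpack)
colMeans(rdirichlet(10000, c(300, 300, 300, 300)))

# Gamma / Exponential priors: Exponential(10) = Gamma(shape = 1, scale = 1/10)
mean(rexp(1e5, rate = 10))                  # ~ 0.1  (= 1 * 0.1)
var(rgamma(1e5, shape = 1, scale = 0.1))    # ~ 0.01 (= 1 * 0.1^2)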

 Flat (uninformative) priors mean that the posterior probability is directly proportional to the likelihood ◦ The value of H at the peak of the posterior distribution is equal to the MLE of H  Informative priors can have a strong effect on posterior probabilities
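A quick numerical check of that statement for the coin example (again assuming the flat Beta(1, 1) prior and 6 heads in 10 flips, so that the posterior is Beta(7, 5)):

# Mode of the Beta(7, 5) posterior: (a - 1) / (a + b - 2)
(7 - 1) / (7 + 5 - 2)   # 0.6
# MLE of H from the data alone
6 / 10                  # 0.6 -- the same value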

1. Beware arbitrarily truncated priors 2. Branch length priors particularly important 3. Beware high posteriors for very short branch lengths 4. Partition with care (prefer fewer subsets) 5. MCMC run length should depend on number of parameters 6. Calculate how many times parameters were updated 7. Pay attention to parameter estimates 8. Run without data to explore prior 9. Run long and run often! 10. Future: model selection should include effects of priors

Marshall, D.C. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol 59.


 Bayesian methods are here to stay in phylogenetics
 They are able to take into account uncertainty in parameter estimates
 They are able to relax most assumptions, including rate homogeneity among branches ◦ e.g. timing-of-divergence analyses
 They are being heavily developed; new features and algorithms appear regularly