Bayesian methods I: theory
Road map: Bayes' equations; an attempt to explain the equations; Bayesian = priors, likelihood = no priors; Bayesian = integration, maximum likelihood = maximization; how to: grid method; how to: SIR (one variety); how to: MCMC (one variety).
Not even enough to be dangerous. In three lectures I can't fully explain the philosophy, pitfalls, or methods. I'm a practitioner, not a statistician. I can give you a flavor for Bayesian methods. I can remove your fear factor (hopefully not increase it!) when someone shows you a Bayesian analysis. I can show you how to find Bayesian posteriors using the grid, MCMC, and SIR algorithms.
What I want you to learn. I want you to understand the concepts of likelihood, prior, and posterior. The theory and philosophy are hard to grasp and I don't expect everyone to master them. I do expect everyone to understand how to program the SIR and MCMC algorithms and find Bayesian posteriors (it's easier than understanding the theory!).
Key fisheries references. Punt AE & Hilborn R (1997) Fisheries stock assessment and decision analysis: the Bayesian approach. Reviews in Fish Biology and Fisheries 7:35-63. Hilborn R & Mangel M (1997) The ecological detective: confronting models with data. Princeton University Press, Princeton, New Jersey. Chapters 9-10. McAllister MK, Pikitch EK, Punt AE & Hilborn R (1994) A Bayesian approach to stock assessment and harvest decisions using the Sampling/Importance Resampling algorithm. CJFAS 51:
Probability versus likelihood (FISH 458 background). Probability: If I flip a fair coin 10 times, what is the probability of it landing heads up every time? Given the fixed parameter (p = 0.5), what is the probability of different outcomes? Probabilities add up to 1. Likelihood: I flipped a coin 10 times and obtained 10 heads. What is the likelihood that the coin is fair? Given the fixed outcomes (data), what is the likelihood of different parameter values? Likelihoods do not add up to 1. Hypotheses (parameter values) are compared using likelihood values (higher = better).
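The coin example can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper name `binom_prob` and the list of candidate values of p are assumptions chosen for the example.

```python
from math import comb

def binom_prob(k, n, p):
    """Probability of k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability: the parameter is fixed (p = 0.5) and the outcomes vary.
probs = [binom_prob(k, 10, 0.5) for k in range(11)]
p_ten_heads = probs[10]      # probability of 10 heads in 10 flips
total_prob = sum(probs)      # sums to 1 over all possible outcomes

# Likelihood: the data are fixed (10 heads in 10 flips) and the parameter varies.
candidate_p = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]   # illustrative values only
likelihoods = {p: binom_prob(10, 10, p) for p in candidate_p}
total_like = sum(likelihoods.values())         # nothing forces this to equal 1

print(p_ten_heads, total_prob, total_like)
```

The same function is used both ways; only what is held fixed changes. Larger values of p give higher likelihoods for these data, so a fair coin is not the best-supported hypothesis.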
Probability versus likelihood for a normal distribution. Probability: What is the probability that 5 ≤ x ≤ 10 given a normal distribution with µ = 13 and σ = 4? Answer: the area under the curve between 5 and 10, about 0.20. What is the probability that –1000 ≤ x ≤ 1000 given a normal distribution with µ = 13 and σ = 4? Answer: essentially 1. Likelihood: What is the likelihood that µ = 13 and σ = 4 if you observed a value of (a) x = 10 (answer: the likelihood is 0.075, the height of the curve at x = 10) (b) x = 14 (answer: the likelihood is 0.097, the height of the curve at x = 14)? Conclusion: if the observed value was 14, it is more likely that the parameters are µ = 13 and σ = 4, because $L(\mu = 13, \sigma = 4 \mid x = 14) = 0.097$ is higher than $L(\mu = 13, \sigma = 4 \mid x = 10) = 0.075$.
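The areas and heights on this slide can be checked numerically. The sketch below uses only the Python standard library; the function names `normal_pdf` and `normal_cdf` are assumptions for illustration.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x (this is the likelihood)."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

MU, SIGMA = 13, 4

# Probabilities are areas under the curve.
prob_5_10 = normal_cdf(10, MU, SIGMA) - normal_cdf(5, MU, SIGMA)
prob_wide = normal_cdf(1000, MU, SIGMA) - normal_cdf(-1000, MU, SIGMA)

# Likelihoods are heights of the curve at the observed value.
like_10 = normal_pdf(10, MU, SIGMA)
like_14 = normal_pdf(14, MU, SIGMA)

print(round(prob_5_10, 3), round(prob_wide, 3))
print(round(like_10, 3), round(like_14, 3))
```

The two likelihoods round to 0.075 and 0.097, matching the values quoted on the slide.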
458: we've focused on likelihoods. If hypothesis $H_i$ is true, what is the likelihood $L(\text{data} \mid H_i)$ of observing particular types of data? The $H_i$ are different values of the model parameters. Maximum likelihood: which hypothesis $H_i$ would result in the maximum likelihood? AICc: which hypothesis is the best? AICc weight: how much weight should be given to each competing hypothesis?
Some philosophy. What is a 95% confidence interval? Up until now we have answered this in a frequentist philosophy: there is some fixed true value of the parameters that does not move around (e.g. µ = 13). We collect some data and estimate a 95% confidence interval. Meaning: if we were to repeat our experiment an infinite number of times, each time collecting some data and calculating a 95% confidence interval, then the true value of µ would fall within the interval 95% of the time.
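The repeated-experiment meaning of a 95% confidence interval can be checked by simulation. This is a rough sketch under stated assumptions: σ = 4 is treated as known, and the sample size, number of replicates, and random seed are arbitrary illustrative choices that do not come from the slides.

```python
import random
from math import sqrt

random.seed(1)                     # arbitrary seed, for repeatability only
MU, SIGMA = 13.0, 4.0              # fixed "true" parameter values
N, REPS = 20, 10000                # illustrative sample size and replicates

covered = 0
for _ in range(REPS):
    # One "experiment": collect N observations and compute the sample mean.
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    # 95% interval for the mean, with sigma assumed known.
    half_width = 1.96 * SIGMA / sqrt(N)
    if mean - half_width <= MU <= mean + half_width:
        covered += 1

coverage = covered / REPS          # fraction of intervals containing MU
print(coverage)
```

Across many replicates the coverage should be close to 0.95, even though any single interval either contains µ or does not.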
What people think a 95% CI means. Most people think that a particular calculated 95% CI has a 95% probability of including the true mean µ. This is not true. A particular 95% CI either includes µ (100% probability) or does not include µ (0% probability). The concept only makes sense when you think about repeating your experiment an infinite number of times; this is the frequentist paradigm. In reality we only have a single data set and a single calculated 95% CI.
What does p < 0.05 mean? This concept is also frequentist. If the true values of the parameters were µ = 0 and σ = 1 (normal distribution), and I repeated my experiment an infinite number of times, then the mean of my data would be greater than 1.96 or smaller than –1.96 no more than 5% of the time. Q: If I run the experiment once, and obtain a mean of 2.5, is µ = 0? A: the p-value is < 0.05, so we conclude that µ is not 0.
Why do we not trust the detector? We are not frequentists. We are Bayesians. We have some prior knowledge about the sun: it is 4.6 billion years old, so the probability of it exploding on any particular night is less than $1/(365 \times 4.6 \text{ billion}) = 5.9 \times 10^{-13}$. Instinctively we weighed up this very small probability against the much higher probability that the neutrino detector lied.
The Bayesian paradigm. What is the probability that the sun exploded? How can we combine our prior knowledge with the new data and likelihood to estimate the posterior probability that the sun exploded? The parameter values are not viewed as fixed; therefore we can estimate the probability of the parameter being a particular value.
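Bayes' Theorem. What is the posterior probability that hypothesis $H_i$ is correct? This is really what we want to know. $P(H_i \mid \text{data}) = \dfrac{L(\text{data} \mid H_i)\, P(H_i)}{P(\text{data})}$, where $P(H_i \mid \text{data})$ is the posterior, $L(\text{data} \mid H_i)$ is the likelihood, $P(H_i)$ is the prior, and $P(\text{data})$ is some horrible term.
Bayes' Theorem: another form. $P(H_i \mid \text{data}) = \dfrac{L(\text{data} \mid H_i)\, P(H_i)}{\sum_j L(\text{data} \mid H_j)\, P(H_j)}$, where the numerator is likelihood × prior and the denominator is the sum over all hypotheses of {likelihood given the data × prior information}.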
A digression on notation
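Bayes' Theorem (probabilities). What is the posterior probability that hypothesis $H_i$ is correct? $P(H_i \mid \text{data}) = \dfrac{P(\text{data} \mid H_i)\, P(H_i)}{P(\text{data})}$, where $P(H_i \mid \text{data})$ is the posterior, $P(\text{data} \mid H_i)$ is a probability, $P(H_i)$ is the prior, and $P(\text{data})$ is the probability of observing the data.
Bayes' Theorem (likelihoods). $P(H_i \mid \text{data}) = \dfrac{L(\text{data} \mid H_i)\, P(H_i)}{P(\text{data})}$, where $P(H_i \mid \text{data})$ is the posterior, $L(\text{data} \mid H_i)$ is the likelihood, $P(H_i)$ is the prior, and $P(\text{data})$ is the probability of observing the data. The likelihood of $H_i$ given the observed data equals the probability of the data given known values of $H_i$. Therefore $L(\text{data} \mid H_i) = P(\text{data} \mid H_i)$, because this is the same height on the same probability distribution.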
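Two hypotheses: $H_1$ (the sun exploded) and $H_2$ (the sun did not explode). Prior($H_1$) = $5.9 \times 10^{-13}$, Prior($H_2$) = $1 - 5.9 \times 10^{-13}$. $P(\text{YES} \mid H_1) = 35/36 = 0.972$. $P(\text{YES} \mid H_2) = 1/36 = 0.028$. $P(H_1 \mid \text{YES}) = \dfrac{P(\text{YES} \mid H_1)\,\text{Prior}(H_1)}{P(\text{YES} \mid H_1)\,\text{Prior}(H_1) + P(\text{YES} \mid H_2)\,\text{Prior}(H_2)} = 2.1 \times 10^{-11}$ (Bayesian answer: extremely low probability that the sun exploded).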
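The posterior for the exploding-sun example can be reproduced directly from Bayes' theorem. The variable names below are illustrative assumptions; the numbers are the ones given on the slide.

```python
# Priors for the two hypotheses.
prior_h1 = 5.9e-13            # H1: the sun exploded
prior_h2 = 1 - prior_h1       # H2: the sun did not explode

# Probability that the detector says YES under each hypothesis.
like_h1 = 35 / 36             # detector tells the truth
like_h2 = 1 / 36              # detector lies

# P(data): sum over hypotheses of likelihood x prior.
p_yes = like_h1 * prior_h1 + like_h2 * prior_h2

# Posterior probability that the sun exploded, given a YES.
post_h1 = like_h1 * prior_h1 / p_yes
print(post_h1)
```

The result is about 2.1 × 10^-11, in agreement with the value on the slide.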
Updated Bayesian beliefs. The prior probability of the sun exploding was $5.9 \times 10^{-13}$. The posterior probability of the sun exploding was $2.1 \times 10^{-11}$. The Bayesian statistician still believes $H_1$ (the sun exploded) is unlikely, but the probability he assigns to it is higher than it was before the experiment. Given repeated questioning, if the machine keeps on saying YES the Bayesian will eventually believe the sun has exploded.
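Repeated questioning can be sketched by feeding each posterior back in as the next prior. This is a minimal illustration that assumes every YES comes from an independent query of the machine, with the same 35/36 and 1/36 probabilities each time; the function name `update` and the 0.5 threshold are assumptions for the example.

```python
def update(prior_h1, like_h1=35 / 36, like_h2=1 / 36):
    """One Bayesian update of P(sun exploded) after the machine says YES."""
    numerator = like_h1 * prior_h1
    return numerator / (numerator + like_h2 * (1 - prior_h1))

belief = 5.9e-13              # prior probability that the sun exploded
history = [belief]

# Keep asking until the Bayesian thinks an explosion is more likely than not.
while belief < 0.5:
    belief = update(belief)   # yesterday's posterior is today's prior
    history.append(belief)

n_yes = len(history) - 1      # number of consecutive YES answers needed
print(n_yes, history[-1])
```

Under these assumptions each YES multiplies the odds of an explosion by 35, so it takes eight consecutive YES answers for the posterior to pass 0.5.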
Prosecutor's fallacy. Likelihood framework: a murder is committed, DNA found at the scene is compared with a DNA database of people, and a match is found. If that person is innocent, the chance of the match is 1 in 3 million. This is $L(\text{data} \mid H_1)$, the likelihood of a match given that the person is innocent ($H_1$). Maximum likelihood: the matched person is guilty. Bayesian framework: the DNA database includes 10 million people. The prior probability, prior($H_1$), of any one person in the database being innocent is very high: $(10^7 - 1)/10^7$. We would expect 3-4 matches in the database purely by chance, so it is very likely that some poor innocent person will be matched to the DNA evidence! P(data) is the probability of finding a match in the database. Taking the probability of a match for the guilty person as 1, the posterior probability of that person being innocent given the DNA match is $P(H_1 \mid \text{data}) = \dfrac{\frac{1}{3 \times 10^6} \cdot \frac{10^7 - 1}{10^7}}{\frac{1}{3 \times 10^6} \cdot \frac{10^7 - 1}{10^7} + 1 \cdot \frac{1}{10^7}} \approx 0.77$. There is a 77% chance they are innocent! If the DNA were tested only against the 10 people in the neighborhood, and there was a match, the prior probability of innocence would be 9/10, much lower.
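The DNA example can be worked through numerically. The sketch below assumes that exactly one person in the pool is guilty and that the guilty person matches with probability 1; neither assumption is stated explicitly on the slide, but together they reproduce the 77% figure. The function name `posterior_innocent` is hypothetical.

```python
P_MATCH_INNOCENT = 1 / 3_000_000   # chance of a match if the person is innocent
P_MATCH_GUILTY = 1.0               # assumption: the guilty person always matches

def posterior_innocent(n_people):
    """P(innocent | match) when one of n_people is assumed to be guilty."""
    prior_innocent = (n_people - 1) / n_people
    prior_guilty = 1 / n_people
    # P(data): a false match with an innocent person, or a true match.
    p_match = P_MATCH_INNOCENT * prior_innocent + P_MATCH_GUILTY * prior_guilty
    return P_MATCH_INNOCENT * prior_innocent / p_match

post_database = posterior_innocent(10_000_000)   # whole DNA database
post_neighbors = posterior_innocent(10)          # only the 10 neighbors

# Expected number of chance matches among the innocent people in the database.
expected_false_matches = (10_000_000 - 1) * P_MATCH_INNOCENT

print(round(post_database, 2), post_neighbors, round(expected_false_matches, 1))
```

With the 10 million person database the posterior probability of innocence is about 0.77; with only 10 candidates the same match leaves almost no room for innocence, which shows how strongly the prior drives the answer.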
Complications. P(data) is the probability of the data: the probability of there being a match (either the murderer is in the database, or there is a false match with an innocent person). In this case the calculation is "straightforward". For most problems we face it will be very hard or impossible to directly calculate P(data). Instead of calculating P(data), we can circumvent this calculation by finding the posterior $P(H_i \mid \text{data})$ numerically. Algorithms for finding the posterior include grid, MCMC, and SIR. $P(H_i \mid \text{data}) = \dfrac{L(\text{data} \mid H_i)\, P(H_i)}{P(\text{data})}$, where $P(H_i \mid \text{data})$ is the posterior, $L(\text{data} \mid H_i)$ is the likelihood, $P(H_i)$ is the prior, and $P(\text{data})$ is the probability of the data.
What is hypothesis $H_i$? Imagine a grid of parameters r and K, where r = 0, 0.05, 0.1 and K = 1000, 2000, and one further value. There are 3×3 = 9 possible hypotheses to evaluate. Now increase the number of values of r and K that are evaluated. With "infinite" points the "hypotheses" are a continuous range of all possible values of r and K. We want the posterior probability of every value of r and K.
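A 3×3 grid of hypotheses can be sketched as follows. Everything beyond the grid itself is an assumption made for illustration: the third K value (3000), the uniform prior, and the toy likelihood (one year of logistic growth from a hypothetical starting abundance, compared with a single hypothetical observation under normal error) do not come from the slides.

```python
from itertools import product
from math import exp, pi, sqrt

r_values = [0.0, 0.05, 0.1]
K_values = [1000, 2000, 3000]     # 3000 is an assumed third value

# Hypothetical toy data: starting abundance, one observation, and error SD.
N0, N_OBS, SIGMA = 500.0, 520.0, 10.0

def likelihood(r, K):
    """Normal likelihood of N_OBS given one step of logistic growth."""
    predicted = N0 + r * N0 * (1 - N0 / K)
    return exp(-((N_OBS - predicted) ** 2) / (2 * SIGMA**2)) / (SIGMA * sqrt(2 * pi))

# Each (r, K) pair on the grid is one hypothesis H_i.
grid = list(product(r_values, K_values))
prior = {h: 1 / len(grid) for h in grid}          # uniform prior (assumption)

# Likelihood x prior for every hypothesis, then divide by the sum, P(data).
unnormalized = {h: likelihood(*h) * prior[h] for h in grid}
p_data = sum(unnormalized.values())
posterior = {h: u / p_data for h, u in unnormalized.items()}

best = max(posterior, key=posterior.get)
print(len(grid), best, round(posterior[best], 3))
```

On a grid the "horrible term" P(data) is just the sum of likelihood × prior over the grid points, so the posteriors are forced to add up to 1. Using a finer grid moves toward the continuous case described above.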