Bayesian methods, priors and Gaussian processes
John Paul Gosling, Department of Probability and Statistics
An Overview of State-of-the-Art Data Modelling, 24-25 January 2007

Overview: the Bayesian paradigm; Bayesian data modelling; quantifying prior beliefs; data modelling with Gaussian processes.

Bayesian methods: the beginning, the subjectivist philosophy, and an overview of Bayesian techniques.

Subjective probability. Bayesian statistics involves a very different way of thinking about probability in comparison to classical inference. The probability of a proposition is defined as a measure of a person's degree of belief. Wherever there is uncertainty, there is probability. This covers both aleatory and epistemic uncertainty.

Differences with classical inference. To a frequentist, data are repeatable and parameters are not: P(data | parameters). To a Bayesian, the parameters are uncertain and the observed data are not: P(parameters | data).

Bayes's theorem for distributions. In early probability courses, we are taught Bayes's theorem for events. This can be extended to continuous distributions, and in Bayesian statistics we use Bayes's theorem in a particular way.
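In symbols, with pi denoting densities for the parameter theta and f the sampling density of the data x (standard notation, assumed here):

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}
\;\propto\; f(x \mid \theta)\,\pi(\theta).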

Prior to posterior updating. Bayes's theorem is used to update our beliefs: combining the prior with the data, the posterior is proportional to the prior times the likelihood.

Posterior distribution. So, once we have our posterior, we have captured all our beliefs about the parameter of interest. We can use this to do informal inference, e.g. intervals and summary statistics. Formally, to make choices about the parameter, we must couple this with decision theory to calculate the optimal decision.

Sequential updating. Today's posterior is tomorrow's prior: prior beliefs plus data give posterior beliefs, which in turn act as the prior when more data arrive and are updated to new posterior beliefs.

The triplot. A triplot gives a graphical representation of prior to posterior updating, plotting the prior, the likelihood and the posterior on the same axes.

Audience participation: quantification of our prior beliefs. What proportion of people in this room are left-handed? Call this parameter ψ. When I toss this coin, what's the probability of me getting a tail? Call this θ.

A simple example. The archetypal example in probability theory is the outcome of tossing a coin. Each toss of a coin is a Bernoulli trial with the probability of tails given by θ. If we carry out 10 independent trials, we know the number of tails (X) will follow a binomial distribution: X | θ ~ Bi(10, θ).

Our prior distribution. A Beta(2,2) distribution may reflect our beliefs about θ.

Our posterior distribution. If we observe X = 3, the posterior is a Beta(5, 9) distribution, and the triplot shows it pulled from the prior towards the likelihood.
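To make the updating concrete, here is a minimal Python sketch of the triplot for this example, assuming the Beta(2,2) prior and the observation X = 3 described above; the scipy/matplotlib usage is illustrative, not from the slides.

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    theta = np.linspace(0.001, 0.999, 500)
    prior = stats.beta.pdf(theta, 2, 2)              # Beta(2,2) prior density
    likelihood = stats.binom.pmf(3, 10, theta)       # likelihood of X = 3 tails in 10 tosses
    likelihood /= np.trapz(likelihood, theta)        # rescale so the three curves are comparable
    posterior = stats.beta.pdf(theta, 2 + 3, 2 + 7)  # conjugate update gives Beta(5, 9)

    plt.plot(theta, prior, label="prior")
    plt.plot(theta, likelihood, label="likelihood (rescaled)")
    plt.plot(theta, posterior, label="posterior")
    plt.xlabel("theta")
    plt.legend()
    plt.show()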

Our posterior distribution. If we are more convinced, a priori, that θ = 0.5 (a more concentrated prior) and we observe X = 3, the triplot shows a posterior that stays closer to the prior.

Credible intervals. If asked to provide an interval in which there is a 90% chance of θ lying, we can derive this directly from our posterior distribution. Such an interval is called a credible interval. In frequentist statistics, there are confidence intervals that cannot be interpreted in the same way. In our example, using our first prior distribution, we can report a 95% posterior credible interval for θ of (0.14, 0.62).
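A quick check of that interval from the Beta(5, 9) posterior above; this is a central (equal-tailed) interval, which is one of several ways of defining a credible interval.

    from scipy import stats

    low, high = stats.beta.ppf([0.025, 0.975], 5, 9)  # central 95% credible interval for theta
    print(round(low, 2), round(high, 2))              # approximately 0.14 and 0.62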

Basic linear model. Yesterday we saw a lot of this: a linear model fitted by least squares. Instead of trying to find the optimal set of parameters, we express our beliefs about them.

Basic linear model. By selecting appropriate priors for the two parameters, we can derive the posterior analytically: it is a normal inverse-gamma distribution. The mean of our posterior distribution is then a weighted average of the least squares estimate and the prior mean.
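For reference, the least squares solution has the usual form, and under one common conjugate parameterisation the posterior mean is the weighted average described above; here m_0 and V_0 denote the prior mean and scaled prior covariance of the coefficients, notation assumed rather than taken from the slides.

\hat\beta = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y,
\qquad
\mathrm E[\beta \mid y] = \bigl(V_0^{-1} + X^{\mathsf T} X\bigr)^{-1}\bigl(V_0^{-1} m_0 + X^{\mathsf T} X\,\hat\beta\bigr).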

Bayesian model comparison. Suppose we have two plausible models for a set of data, M and N say. We can calculate posterior odds in favour of M by multiplying the prior odds by the Bayes factor.

Bayesian model comparison. The Bayes factor is the ratio of the marginal likelihoods of the data under the two models. A Bayes factor greater than one means that your odds in favour of M increase. Bayes factors naturally help guard against too much model structure.
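In symbols, writing x for the data and B for the Bayes factor:

\frac{P(M \mid x)}{P(N \mid x)} = \frac{P(M)}{P(N)} \times B,
\qquad
B = \frac{p(x \mid M)}{p(x \mid N)}
  = \frac{\int p(x \mid \theta_M, M)\,\pi(\theta_M \mid M)\,\mathrm d\theta_M}
         {\int p(x \mid \theta_N, N)\,\pi(\theta_N \mid N)\,\mathrm d\theta_N}.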

Advantages/Disadvantages. Bayesian methods are often more complex than frequentist methods. There is not much software to give scientists off-the-shelf analyses. Subjectivity: all the inferences are based on somebody's beliefs.

Advantages/Disadvantages. Bayesian statistics offers a framework to deal with all the uncertainty. Bayesians make use of more information – not just the data in their particular experiment. The Bayesian paradigm is very flexible and it is able to tackle problems that frequentist techniques could not. In selecting priors and likelihoods, Bayesians are showing their hands – they can't get away with making arbitrary choices when it comes to inference. …

Summary. The basic principles of Bayesian statistics have been covered. We have seen how we update our beliefs in the light of data. Hopefully, I've convinced you that the Bayesian way is the right way.

Priors: advice on choosing suitable prior distributions and eliciting their parameters.

Importance of priors. As we saw in the previous section, prior beliefs about uncertain parameters are a fundamental part of Bayesian statistics. When we have few data about the parameter of interest, our prior beliefs dominate inference about that parameter. In any application, effort should be made to model our prior beliefs accurately.

Weak prior information. If we accept the subjective nature of Bayesian statistics but are not comfortable using subjective priors, then many have argued that we should try to specify prior distributions that represent no prior information. These prior distributions are called noninformative, reference, ignorance or weak priors. The idea is to have a completely flat prior distribution over all possible values of the parameter. Unfortunately, this can lead to improper distributions being used.

Weak prior information. In our coin tossing example, Beta(1,1), Beta(0.5,0.5) and Beta(0,0) have been recommended as noninformative priors. Beta(0,0) is improper.

Conjugate priors. When we move away from noninformative priors, we might use priors that are in a convenient form; that is, a form where combining them with the likelihood produces a distribution from the same family. In our example, the beta distribution is a conjugate prior for a binomial likelihood.
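The beta-binomial conjugacy in symbols:

\theta \sim \mathrm{Beta}(a, b), \quad X \mid \theta \sim \mathrm{Bi}(n, \theta)
\;\Longrightarrow\;
\theta \mid X = x \sim \mathrm{Beta}(a + x,\; b + n - x),

so the Beta(2,2) prior with x = 3 tails in n = 10 tosses gives the Beta(5, 9) posterior used earlier.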

Informative priors. An informative prior is an accurate representation of our prior beliefs. We are not interested in the prior being part of some conjugate family. An informative prior is essential when we have few or no data for the parameter of interest. Elicitation, in this context, is the process of translating someone's beliefs into a distribution.

Elicitation. It is unrealistic to expect someone to be able to fully specify their beliefs in terms of a probability distribution. Often, they are only able to report a few summaries of the distribution. We usually work with medians, modes and percentiles. Sometimes they are able to report means and variances, but there are more doubts about these values.

Elicitation. Once we have some information about their beliefs, we fit some parametric distribution to them. These distributions almost never fit the judgements precisely. There are nonparametric techniques that can bypass this. Feedback is essential in the elicitation process.

Normal with unknown mean. Several choices of prior for the mean are illustrated: a noninformative prior, a conjugate prior, and a more general proper prior.
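For the first two cases, with data x_1, …, x_n from N(μ, σ²) and σ² assumed known (the general proper prior on the slides could be any elicited distribution, so it is not reproduced here):

\text{Noninformative: } p(\mu) \propto 1
\;\Rightarrow\;
\mu \mid x \sim \mathrm N\!\bigl(\bar x,\; \sigma^2/n\bigr);
\qquad
\text{Conjugate: } \mu \sim \mathrm N(m, v)
\;\Rightarrow\;
\mu \mid x \sim \mathrm N\!\left(\frac{m/v + n\bar x/\sigma^2}{1/v + n/\sigma^2},\; \frac{1}{1/v + n/\sigma^2}\right).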

Structuring prior information. It is possible to structure our prior beliefs in a hierarchical manner: a data model for the observations given the parameters, a first level of prior on the parameters given further quantities, and a second level of prior on those quantities. The parameters of the first-level prior are referred to as the hyperparameter(s).

Structuring prior information. An example of this type of hierarchical structure is a nonparametric regression model, again built from a data model, a first level of prior and a second level of prior. We want to know about μ, so the other parameters must be removed; these are known as nuisance parameters.
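A generic sketch of such a hierarchy, in assumed notation with θ for the parameters and φ for the hyperparameter(s):

\text{Data model: } x \mid \theta \sim f(x \mid \theta);
\qquad
\text{First level of prior: } \theta \mid \phi \sim \pi(\theta \mid \phi);
\qquad
\text{Second level of prior: } \phi \sim \pi(\phi).

Inference about any single component of θ then requires the remaining parameters and φ to be integrated out of the joint posterior.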

Analytical tractability. The more complexity that is built into your prior and likelihood, the more likely it is that you won't be able to derive your posterior analytically. In the 1990s, computational techniques were devised to combat this. Markov chain Monte Carlo (MCMC) techniques allow us to access our posterior distributions even in complex models.
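To illustrate the idea, here is a minimal random-walk Metropolis sampler in Python, targeting the Beta(5, 9) posterior from the coin-tossing example; the proposal step size and iteration count are arbitrary illustrative choices, not anything prescribed on the slides.

    import numpy as np

    def metropolis(log_post, theta0, n_iter=5000, step=0.2, seed=0):
        """Random-walk Metropolis: samples a density known only up to a constant."""
        rng = np.random.default_rng(seed)
        theta, samples = theta0, []
        for _ in range(n_iter):
            proposal = theta + step * rng.standard_normal()
            # Accept with probability min(1, post(proposal) / post(theta))
            if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
                theta = proposal
            samples.append(theta)
        return np.array(samples)

    def log_post(t):
        """Log of the Beta(5, 9) density, up to a constant: 4*log(t) + 8*log(1 - t)."""
        if not 0.0 < t < 1.0:
            return -np.inf
        return 4.0 * np.log(t) + 8.0 * np.log(1.0 - t)

    draws = metropolis(log_post, theta0=0.5)
    print(draws[1000:].mean())  # close to the Beta(5, 9) mean, 5/14 ≈ 0.36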

Sensitivity analysis. It is clear that the elicitation of prior distributions is far from being a precise science. A good Bayesian analysis will check that the conclusions are sufficiently robust to changes in the prior. If they aren't, we need more data or more agreement on the prior structure.

Summary. Prior distributions are an important part of Bayesian statistics. When modelled properly, they are far from being ad hoc, pick-the-easiest-to-use distributions. There are classes of noninformative priors that allow us to represent ignorance.

Gaussian processes: a Bayesian data modelling technique that fully accounts for uncertainty.

Data modelling: a fully probabilistic method. Bayesian statistics offers a framework to account for uncertainty in data modelling. In this section, we'll concentrate on regression using Gaussian processes and the associated Bayesian techniques.

The basic idea. We have y = f(x) or, with observation error, y = f(x) + e; both the function f(.) and the error e are uncertain. In order to proceed, we must elicit our beliefs about these two. The error e can be dealt with as in the previous section.

Gaussian processes. We assume that f(.) follows a Gaussian process a priori; that is, any finite sample of f(x) values will follow a multivariate normal distribution. A process is Gaussian if and only if every finite sample from the process is a vector-valued Gaussian random variable.
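In symbols, with m(.) the mean function and c(.,.) the covariance function (notation assumed here):

f(\cdot) \sim \mathrm{GP}\bigl(m(\cdot),\, c(\cdot, \cdot)\bigr)
\;\Longleftrightarrow\;
\bigl(f(x_1), \dots, f(x_n)\bigr)^{\mathsf T} \sim \mathrm N_n(\mathbf m, C),
\quad \mathbf m_i = m(x_i),\; C_{ij} = c(x_i, x_j),
\;\text{ for every finite set of inputs } x_1, \dots, x_n.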

Gaussian processes. We have prior beliefs about the form of the underlying model. We observe/experiment to get data about the model with which we train our GP. We are left with our posterior beliefs about the model, which can have a 'nice' form.

A simple example. Warning: more audience participation coming up.

A simple example. Imagine we have data about some one-dimensional phenomenon. Also, we'll assume that there is no observational error. We'll start with five data points between 0 and 4. A priori, we believe f(.) is roughly linear and differentiable everywhere.

A simple example (plots of the Gaussian process fit to the five data points).

A simple example with error. Now, we'll start over and put some Gaussian error on the observations. Note: in kriging, this is equivalent to adding a nugget effect.

A simple example with error (plot of the Gaussian process fit with observation error).
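A minimal numpy sketch of this kind of fit, assuming a zero-mean GP with a squared-exponential covariance, made-up data values and hand-picked hyperparameters; none of these choices come from the slides.

    import numpy as np

    def gp_posterior(x_obs, y_obs, x_new, variance=1.0, lengthscale=1.0, nugget=0.01):
        """Posterior mean and variance of a zero-mean GP with squared-exponential covariance."""
        def cov(a, b):
            return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)
        K = cov(x_obs, x_obs) + nugget * np.eye(len(x_obs))   # nugget = observation error variance
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
        K_new = cov(x_new, x_obs)
        mean = K_new @ alpha
        v = np.linalg.solve(L, K_new.T)
        var = variance - np.sum(v ** 2, axis=0)
        return mean, var

    # Five made-up, roughly linear observations between 0 and 4
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.1, 1.2, 1.9, 3.2, 3.9])
    x_grid = np.linspace(0.0, 4.0, 101)
    mean, var = gp_posterior(x, y, x_grid)
    upper, lower = mean + 2 * np.sqrt(var), mean - 2 * np.sqrt(var)  # two-standard-deviation band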

The mean function. Recall that our prior mean for f(x) is given by h(x)ᵀβ, where h(x) is a vector of regression functions evaluated at x and β is a vector of unknown coefficients. The form of the regression functions is dependent on the application.

The mean function. It is common practice to use a constant (bias), linear functions, Gaussian basis functions, trigonometric basis functions, and so on. It is important to capture your beliefs about f(.) in the mean function.

The correlation structure. The correlation function defines how we believe f(.) will deviate nonparametrically from the mean function. In the examples here, I have used a stationary correlation function.
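A common stationary choice in this setting, and an assumed stand-in for the exact form used on the slide, is the Gaussian (squared-exponential) correlation:

c(x, x') = \exp\bigl\{ -(x - x')^{\mathsf T} B\, (x - x') \bigr\},

where B is a diagonal matrix whose entries control the correlation lengths in each input direction.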

Dealing with the model parameters. We have the following hyperparameters: the mean coefficients and the variance can be removed analytically using conjugate priors; the correlation-length parameters are not so easily accounted for…

A 2-D example. Rock porosity somewhere in the US.

A 2-D example. Mean of our posterior beliefs about the underlying model, f(.).

A 2-D example. Mean of our posterior beliefs about the underlying model, f(.), in 3D.

A 2-D example. Our uncertainty about f(.) – two standard deviations.

A 2-D example. Our uncertainty about f(.) looks much better in 3D.

A 2-D example: prediction. The geologists held back two observations at P1 = (0.60, 0.35), z1 = 10.0, and P2 = (0.20, 0.90), z2 = 20.8. Using our posterior distribution for f(.) and e, we get the following 90% credible intervals: z1 | rest of points in (8.7, 12.0) and z2 | rest of points in (21.1, 26.0).

Diagnostics. Cross-validation allows us to check the validity of our GP fit. Two variations are often used: leave-one-out and leave-final-20%-out. Leave-one-out: the hyperparameters are estimated using all the data and are then fixed when prediction is carried out for each omitted point. Leave-final-20%-out (hold-out): the hyperparameters are estimated using the reduced data subset. Cross-validation alone is not enough to justify the GP fit.
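A sketch of the leave-one-out scheme described above, using scikit-learn's GP regressor on hypothetical 2-D data; the slides do not tie the method to any particular implementation, so the library calls and data here are purely illustrative.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ConstantKernel, RBF
    from sklearn.model_selection import LeaveOneOut

    def loo_rmse(X, y):
        # Estimate the hyperparameters once on the full data set, then hold them
        # fixed while predicting each omitted point, as described above.
        full_fit = GaussianProcessRegressor(ConstantKernel() * RBF([1.0, 1.0]), normalize_y=True).fit(X, y)
        fixed_kernel = full_fit.kernel_
        errors = []
        for train, test in LeaveOneOut().split(X):
            gp = GaussianProcessRegressor(kernel=fixed_kernel, optimizer=None, normalize_y=True)
            gp.fit(X[train], y[train])
            errors.append(gp.predict(X[test])[0] - y[test][0])
        return float(np.sqrt(np.mean(np.square(errors))))

    # Hypothetical 2-D inputs (e.g. spatial coordinates) and responses
    rng = np.random.default_rng(1)
    X = rng.uniform(size=(30, 2))
    y = 10.0 + 5.0 * X[:, 0] + rng.normal(scale=0.5, size=30)
    print(loo_rmse(X, y))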

Cross-validation for the 2-D example. Applying leave-one-out cross-validation gives RMSEs for both the constant and linear mean functions; using a linear mean function reduces the RMSE by 2.8%. Applying leave-final-20%-out cross-validation gives a 16.3% difference between the constant and linear mean functions.

Benefits and limitations of GPs. Gaussian processes offer a rich class of models which, when fitted properly, are extremely flexible. They also offer us a framework in which we can account for all of our uncertainty. If there are discontinuities, the method will struggle to provide a good fit. The computation time hinges on the inversion of a square matrix of size n x n, where n is the number of data points.

Extensions. Nonstationarity in the covariance can be modelled by adding extra levels to the variance term or by deforming the input space. Discontinuities can be handled by using piecewise Gaussian process models. The GP model can be applied in a classification setting. There is a lot more research on GPs, and there will probably be a way of using them in your applications.

24-25 January 2007An Overview of State-of-the-Art Data Modelling66 Further details I have set up a section on my website that has a comprehensive list of references for extended information on the topics covered in this presentation. j-p-gosling.staff.shef.ac.uk