Prediction and Missing Data

Summarising Distributions
● Models are often large and complex
● Often only interested in some parameters
  – e.g. not so interested in the intercept
● Need to deal with the other parameters

Marginal Distributions
● A model with two parameters, θ₁ and θ₂
● Suppose we are only interested in θ₁
● Calculate the marginal distribution:
  P(θ₁ | X) = ∫ P(θ₁, θ₂ | X) dθ₂
● Weighted sum of the joint distribution
● In practice, use MCMC: just look at the output for θ₁ and ignore θ₂
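In code, marginalising from MCMC output really is that trivial. A minimal sketch, assuming the sampler's draws are stored with one row per iteration and one column per parameter (the multivariate-normal draws below are just a stand-in for real sampler output):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for MCMC output: column 0 = theta_1, column 1 = theta_2.
rng = np.random.default_rng(1)
draws = rng.multivariate_normal(mean=[0.0, 2.0],
                                cov=[[1.0, 0.6], [0.6, 2.0]],
                                size=5000)

# The marginal posterior of theta_1: keep its column, ignore theta_2.
theta1 = draws[:, 0]

plt.hist(theta1, bins=50, density=True)
plt.xlabel("theta_1")
plt.ylabel("marginal posterior density")
plt.show()
```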

Example: Regression
● For each value of β₀, take the distribution of β₁ and add them together

Marginal Distributions
● Marginal distributions include the uncertainty in the parameters they are marginalised over
● The Bayesian approach: take the uncertainty into account
  – don't know parameter values, only estimate them
● We condition on what we know: the data
  – this is measured
  – the parameters are not

Prediction
● We sometimes want to predict new data
  – e.g. population viability analysis
● We have to use estimated parameter values
  – but these are not certain
● Want to make predictions which include the uncertainty in the parameters
  – natural in the Bayesian approach

Prediction: The Maths
● If we have data, X, and parameters, θ, then the posterior is P(θ | X)
● We want to predict some new data, X_new
● We have a model to do this: P(X | θ)
  – the likelihood
● If we want to make predictions, we should make predictions based on what we know
  – i.e. the data
  – so, we want P(X_new | X)

How Do We Get P(X_new | X)?
● For each value of θ, make a prediction of X_new using P(X | θ)
● We can then take a weighted sum, so that the values of θ that are more likely contribute more to the prediction
  – i.e. take P(X_new | θ) × P(θ | X) and add them up
● Mathematical statement of this:
  P(X_new | X) = ∫ P(X_new | θ) P(θ | X) dθ
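The weighted sum written out as a short derivation (one extra step compared with the slide; the only assumption is that X_new is conditionally independent of X given θ, which is what using the likelihood P(X_new | θ) amounts to):

```latex
P(X_{\mathrm{new}} \mid X)
  = \int P(X_{\mathrm{new}}, \theta \mid X)\,\mathrm{d}\theta
  = \int P(X_{\mathrm{new}} \mid \theta, X)\,P(\theta \mid X)\,\mathrm{d}\theta
  = \int P(X_{\mathrm{new}} \mid \theta)\,P(\theta \mid X)\,\mathrm{d}\theta
```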

In Practice: MCMC
● If we do MCMC, then we draw a lot of values from the posterior
  – each value is equally likely: the parameter regions with higher densities are represented by more values
● So, each prediction based on a value is equally likely
● Therefore, we can take each value from the posterior, and simulate the new data using the likelihood

How long will the next Harry Potter book be?
● Model: P_i ~ N(α + β Y_i, σ²)
● Predict for Y_7 = 2007
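A minimal sketch of the MCMC recipe from the previous slide, applied to this model. The arrays of α, β and σ draws below are made-up stand-ins for real MCMC output (as is the centring on the year 2000), so the numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 4000

# Stand-ins for posterior draws from the model P_i ~ N(alpha + beta*Y_i, sigma^2).
alpha = rng.normal(300.0, 30.0, n_draws)          # hypothetical intercept draws
beta = rng.normal(40.0, 10.0, n_draws)            # hypothetical slope draws
sigma = np.abs(rng.normal(80.0, 10.0, n_draws))   # hypothetical sd draws

y_new = 2007 - 2000   # year for book 7, centred on 2000 for illustration

# One simulated page count per posterior draw: parameter uncertainty
# (alpha, beta, sigma vary across draws) plus sampling noise (the Normal draw).
p_new = rng.normal(alpha + beta * y_new, sigma)

print("posterior predictive mean:", round(p_new.mean()))
print("95% interval:", np.percentile(p_new, [2.5, 97.5]).round())
```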

Posterior Predictive Distribution
● [Plot of the posterior predictive distribution for the length of book 7]

Missing Data
● In many real problems we do not observe all of the data
● It may be unobservable
  – e.g. patients who survive beyond the duration of a medical trial
● The observation may have been lost
  – e.g. someone stood on your tray of samples
● How should we deal with this?

Missing Data Example
● Suppose we do not know when the fourth Harry Potter book was published
  – a missing covariate
● We could drop the data point, but that loses information
  – even worse when there are many covariates, with some data missing in all of them
● We do know the range when it could have been published
  – between books 3 and 5!

Estimation of Missing Data
● The data does tell us something about the missing covariates:
  – the observed covariates
  – the data (through the relationship with the covariates)
● Therefore we can learn about the missing covariates from the observed data
● We can easily formalise this

Missing Data
● If some covariates are not observed, they can be estimated
● Treat them as extra parameters

Inference
● Making the missing data extra parameters means we can get a joint posterior
  – P(σ², α, β, Y_i | data), where i indexes the missing data
● But we are not interested in Y_i
● Being Bayesians, we can integrate it out:
  P(σ², α, β | data) = ∫ P(σ², α, β, Y_i | data) dY_i
● With MCMC: just drop the Y_i estimates

Multiple Imputation
● With MCMC we are repeatedly estimating the parameters
  – drawing them from their posteriors
  – the same applies to the missing data
● Mirror of an older approach to missing data
● Called “multiple imputation”

A Model
● P_i ~ N(α + β Y_i, σ²)
● Assume Y_i is random
  – put a simple prior on it
  – but could use a more complex model
● For us, Y_i ~ U(Y_{i-1}, Y_{i+1})
  – and round to the nearest integer
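A sketch of how such a model could be written in PyMC, with the missing year treated as just another parameter. This is an assumption-laden illustration: the page counts, years, priors and centring below are placeholders of mine, not the lecture's data or code.

```python
import numpy as np
import pymc as pm

# Placeholder data: page counts for books 1-6; book 4's year is "missing".
pages = np.array([223.0, 251.0, 317.0, 636.0, 766.0, 607.0])
year_known = [1997.0, 1998.0, 1999.0, None, 2003.0, 2005.0]

with pm.Model():
    alpha = pm.Normal("alpha", mu=0.0, sigma=1000.0)
    beta = pm.Normal("beta", mu=0.0, sigma=100.0)
    sigma = pm.HalfNormal("sigma", sigma=200.0)

    # The missing covariate gets a simple prior: an integer year
    # between books 3 and 5, as on the slide.
    y4 = pm.DiscreteUniform("y4", lower=1999, upper=2003)

    years = pm.math.stack([year_known[0], year_known[1], year_known[2],
                           y4, year_known[4], year_known[5]])
    mu = alpha + beta * (years - 2000.0)

    pm.Normal("P", mu=mu, sigma=sigma, observed=pages)
    idata = pm.sample(2000, tune=1000, chains=2)

# The draws of y4 are the imputations; summarising alpha and beta while
# ignoring y4 gives their marginal posteriors.
```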

DAG
● [BUGS-style graph: nodes alpha, beta and Y[i] (with its prior) feed into mu[i]; mu[i] and tau (from sigma) determine P[i], inside a plate for(i in 1:N)]

Predicted Year of Publication
● [Plot of the posterior distribution for the missing publication year of book 4]

Prediction: Length of Book 7
● [Plot of the posterior predictive distribution for the length of book 7]

Making it Work
● In this case, the missing data does not affect the point estimates
  – but the uncertainty is higher
● In other cases it can have an effect
  – especially if the data is not missing at random
● With several covariates, removing data points with missing covariates will remove information
● Rather than remove the data points with missing data, estimate the missing data
  – increases precision

When We Need The Missing Data
● Sometimes we need to include the missing data
● e.g. censored data
  – there may be a reason why someone survived the experiment
● We might need to model how the data becomes missing
  – e.g. model that a person survived to the end of the experiment

When Missing Data Helps
● Sample beetles on 200 islands
● For one species, get these abundances:
  – [table of observed counts per island, with a large spike at zero]
● Might be a Poisson distribution, but too many zeroes!

The Model: ZIP
● If the rate of capture on each island was constant, then we would expect a Poisson distribution:
  P(N = n) = e^(−λ) λ^n / n!
● But we have too many zeroes
● One explanation: the species only occurs on some islands
● Model: occurrence is binomially distributed
● If the species occurs, the count follows a Poisson distribution

Zero Inflated Poisson
● We end up with a Zero Inflated Poisson distribution (ZIP)
● Probability:
  P(N = 0) = (1 − p) + p e^(−λ)
  P(N = n) = p e^(−λ) λ^n / n!   for n > 0
● Two parameters: λ and p
● How do we fit this?
  – not a standard distribution
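A direct translation of that probability into a small Python function (zip_pmf is just an illustrative name, not anything from the lecture):

```python
import math

def zip_pmf(n, lam, p):
    """P(N = n) under the zero-inflated Poisson with occurrence
    probability p and Poisson mean lam on occupied islands."""
    poisson = math.exp(-lam) * lam ** n / math.factorial(n)
    if n == 0:
        return (1.0 - p) + p * poisson   # structural zeros + Poisson zeros
    return p * poisson

# Sanity check: probabilities sum to ~1 for illustrative parameter values.
print(sum(zip_pmf(n, lam=2.5, p=0.4) for n in range(60)))
```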

An Indirect Approach
● We do not have to fit the distribution directly
● Instead we can split the distribution into two:
  – I = 1 if the species is present, else I = 0 and N = 0
  – P(I = 1) = p
  – I has a Bernoulli distribution
    ● binomial with 1 trial
● If the species is present, N follows a Poisson distribution
● We augment the data with the unobserved I
  – treat it as missing data
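The augmented version is easy to simulate, which is also a useful way to see where the extra zeros come from. A small sketch with illustrative parameter values (not estimates from the beetle data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_islands, p, lam = 200, 0.4, 2.5   # illustrative values only

# Data augmentation: I is the unobserved presence indicator.
I = rng.binomial(1, p, size=n_islands)          # Bernoulli = binomial, 1 trial
N = np.where(I == 1, rng.poisson(lam, size=n_islands), 0)

# Marginally over I, N follows the ZIP distribution above;
# the extra zeros are the islands where the species is absent (I = 0).
print("proportion of zeros:", np.mean(N == 0))
```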

Data Augmentation
● Data augmentation is a common technique
● Makes estimation easier
● But uses more parameters
● Works because we can integrate out the extra parameters
  – take the marginal distribution

The Punchline
● For the missing data, we simulate to estimate the posterior
● We use the posterior to simulate the predicted data
● We could think about the predicted data as missing data, and use data augmentation
● Conceptually, little difference
  – only that the predictions have no observed data after them
● It's all the same framework!