
1 Bayesian Essentials and Bayesian Regression

2 Distribution Theory 101 Marginal and Conditional Distributions. [figure: joint distribution of (X, Y) on the unit square, labeled "uniform"]

3 Simulating from Joint To draw from the joint: i. draw from the marginal on X; ii. condition on this draw, and draw from the conditional of Y|X.
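A minimal R sketch of this two-step scheme, using a hypothetical joint (not the one in the figure): X ~ Unif(0,1) and Y|X ~ Unif(0,X).

R <- 10000
x <- runif(R)                    # i. draw from the marginal of X
y <- runif(R, min = 0, max = x)  # ii. draw from the conditional of Y | X
draws <- cbind(x, y)             # each row is an iid draw from the joint of (X, Y)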

4 The Goal of Inference Make inferences about unknown quantities using available information. Inference -- make probability statements about unknowns: parameters, functions of parameters, states or latent variables, "future" outcomes, outcomes conditional on an action. Information -- data-based and non-data-based: theories of behavior; "subjective views"; there is an underlying structure; parameters are finite or in some range.

5 Data Aspects of Marketing Problems Ex: Conjoint Survey -- >500 respondents rank, rate, or choose among product configurations; small amount of information per respondent; discrete response variable. Ex: Retail Scanning Data -- very large number of products (>10,000 SKUs); large number of geographical units (markets, zones, stores); limited variation in some marketing mix vars. Must make plausible predictions for decision making!

6 The Likelihood Principle LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns. This is in contrast to modern econometric methods such as GMM, which do not obey the likelihood principle (e.g., regression for a 0-1 binary dependent variable). The LP implies that analysis is done conditional on the data, in contrast to the frequentist approach, where the sampling distribution is determined prior to observing the data. Note: any function proportional to the data density can be called the likelihood.

7 Bayes Theorem p(θ|D) ∝ p(D|θ) p(θ), i.e., Posterior ∝ "Likelihood" × Prior. Modern Bayesian computing -- simulation methods for generating draws from the posterior distribution p(θ|D).

8 Summarizing the Posterior Output from Bayesian inference: a high-dimensional distribution p(θ|D). Summarize this object via simulation: marginal distributions of individual parameters or of functions of θ; don't just compute the posterior mean. Contrast with sampling theory: a point estimate/standard error is a summary of an irrelevant distribution, a bad (normal) summary, and subject to the limitations of asymptotics.

9 Prediction See D, compute the "predictive distribution" of future data y_f: p(y_f | D) = ∫ p(y_f | θ) p(θ | D) dθ.

10 Decision Theory Loss: L(a, θ), where a = action and θ = state of nature. Bayesian decision theory: choose the action that minimizes expected posterior loss, min_a E[L(a, θ)|D] = min_a ∫ L(a, θ) p(θ|D) dθ. The estimation problem is a special case: the action is an estimate θ*, and squared-error loss L(θ*, θ) = (θ* − θ)² gives the posterior mean as the optimal estimator.

11 Is Sampling Theory Useful? Sampling theory: the performance of an estimator across multiple, hypothetical data sets. Sampling experiments: i. sample from p(D|θ); ii. sample from another model; iii. various asymptotic "experiments" -- sequences of data sets. Not useful for inference in applied work (only one data set). Useful to assess the choice of inference procedure. Need we look further than Bayes estimators?

12 Sampling Properties of Bayes Estimators An estimator is admissible if there does not exist another estimator with lower risk for all values of θ. The Bayes estimator minimizes expected (average) risk, which implies it is admissible: the Bayes estimator does the best for every D; therefore, it must work at least as well as any other estimator.

13 Identification If dim(R) >= 1, then we have an "identification" problem: there is a set R of observationally equivalent values of the model parameters, and the likelihood is "flat" or constant over R. Practical implications: the likelihood can have flats or ridges. This is an issue for both the Bayesian (is it?) and the non-Bayesian.

14 Identification Is this a problem? No, I have a proper prior. No, I don't maximize. "Classical" solution: impose enough constraints so that the constrained parameter space is identified. "Bayesian" solution: use a proper prior and recognize that some functions of θ are determined entirely by the prior.

15 Identification The identified parameter is a function of θ (call it τ). Report only on the posterior distribution of this function of θ. Check the prior on τ, which is induced by p(θ).

16 Bayes Inference: Summary Bayesian inference delivers an integrated approach to: Inference -- including "estimation" and "testing"; Prediction -- with a full accounting for uncertainty; Decision -- with likelihood and loss (these are distinct!). Bayesian inference is conditional on available info: the right answer to the right question. Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Theorem). Which Bayes estimator?

17 Bayes/Classical Estimators Does the MLE obey the LP? Yes. Does the theory of the maximum likelihood estimator obey the LP? No! Who cares about an infinite amount of irrelevant data! Is there any relationship? Resort to asymptotics. What is Bayesian asymptotics -- does this even make sense? Investigate the asymptotic behavior of the posterior.

18 Bayes/Classical Estimators Asymptotically, the prior washes out -- it is locally uniform!!! Bayes is consistent unless you have a dogmatic prior.

19 Benefits/Costs of Bayes Inf Benefits -- finite-sample answer to the right question; full accounting for uncertainty; integrated approach to inference/decision. "Costs" -- computational (true any more? < classical!!); the prior (cost or benefit? esp. with many parms -- hierarchical/non-parametric problems).

20 Bayesian Computations Before simulation methods, Bayesians used posterior expectations of various functions as the summary of the posterior: E[h(θ)|D] = ∫ h(θ) p(θ|D) dθ. If p(θ|D) is in a convenient form (e.g., normal), then I might be able to compute this analytically for some h. Via iid simulation, I can approximate it for all h: E[h(θ)|D] ≈ (1/R) Σ_r h(θ_r), with θ_r drawn from p(θ|D).
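A small R sketch of the simulation estimate of E[h(θ)|D], assuming (hypothetically) that the posterior is a known normal so we can draw from it directly:

R <- 100000
theta_draws <- rnorm(R, mean = 1, sd = 0.5)   # pretend these are posterior draws
h <- function(theta) exp(theta)               # any function of interest
mean(h(theta_draws))                          # approximates E[h(theta)|D]
exp(1 + 0.5^2/2)                              # exact value for this normal example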

21 Conjugate Families Models with convenient analytic properties almost invariably come from conjugate families. Why do I care now? - conjugate models are used as building blocks; - they build intuition about the workings of Bayesian inference. Definition: a prior is conjugate to a likelihood if the posterior is in the same class of distributions as the prior. Basically, conjugate priors are like the posterior from some imaginary dataset analyzed with a diffuse prior.

22 Beta-Binomial Model Binomial likelihood for y successes in n trials: p(y|θ) ∝ θ^y (1−θ)^(n−y). Need a prior!

23 Beta Distribution The Beta(a, b) prior: p(θ) ∝ θ^(a−1) (1−θ)^(b−1), 0 < θ < 1, with hyperparameters a and b controlling location and spread.

24 Posterior Beta prior times binomial likelihood gives a Beta posterior: p(θ|y) ∝ θ^(a+y−1) (1−θ)^(b+n−y−1), i.e., θ|y ~ Beta(a+y, b+n−y).

25 Prediction For a single future trial, the predictive probability of a success is E[θ|y] = (a+y)/(a+b+n); more generally, integrate the likelihood of the new data against the Beta(a+y, b+n−y) posterior.
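A short R sketch of the Beta-Binomial updating and prediction just described, with hypothetical data (y = 7 successes in n = 10 trials) and a Beta(1,1) prior:

a <- 1; b <- 1          # Beta(1,1) prior (hypothetical)
n <- 10; y <- 7         # hypothetical data: 7 successes in 10 trials
R <- 10000
theta_draws <- rbeta(R, a + y, b + n - y)   # posterior draws of theta
mean(theta_draws)                           # simulation estimate of P(success on a new trial | data)
(a + y) / (a + b + n)                       # exact posterior mean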

26 Regression Model y = Xβ + ε, ε ~ N(0, σ²I). Is this model complete? For non-experimental data, don't we need a model for the joint distribution of y and x?

27 Regression Model Factor the joint density: p(y, x | β, σ, Ψ) = p(y | x, β, σ) p(x | Ψ). If Ψ is a priori independent of (β, σ), this rules out x = f(β)!!! The problem splits into two separate analyses. Simultaneous systems are not written this way!

28 Conjugate Prior What is the conjugate prior? It comes from the form of the likelihood function (here we condition on X): the quadratic form in β suggests a normal prior. Let's complete the square on β, or rewrite by projecting y on X (the column space of X).

29 Geometry of Regression [figure: y projected onto the column space of X spanned by x1 and x2; the fitted value is the linear combination β1 x1 + β2 x2]

30 Traditional Regression No one ever computes a matrix inverse directly. Two numerically stable methods: (i) the QR decomposition of X; (ii) the Cholesky root of X'X, computing the inverse from the root. Non-Bayesians have to worry about singularity or near-singularity of X'X. We don't! (more later)

31 Cholesky Roots In Bayesian computations, the fundamental matrix operation is the Cholesky root: chol() in R. The Cholesky root is the generalization of the square root to positive definite matrices: A = U'U, where U is upper triangular with positive diagonal elements. As Bayesians with proper priors, we don't ever have to worry about singular matrices! U^-1 is easy to compute by recursively solving TU = I for T: backsolve() in R.
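A quick R illustration of inverting a positive definite matrix via its Cholesky root with chol() and backsolve() (the matrix here is arbitrary):

A <- crossprod(matrix(rnorm(40), ncol = 4))  # an arbitrary 4 x 4 positive definite matrix
U <- chol(A)                                 # A = U'U with U upper triangular
Uinv <- backsolve(U, diag(4))                # solves U %*% Uinv = I
Ainv <- Uinv %*% t(Uinv)                     # A^-1 = U^-1 (U^-1)'
max(abs(Ainv - chol2inv(chol(A))))           # agrees with chol2inv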

32 Cholesky Roots Cholesky roots are also useful for simulating from the multivariate normal distribution. To simulate a matrix of draws from the MVN (each row is a separate draw) in R:
Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)
Y=t(t(Y)+mu)
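A usage sketch of the two lines above with hypothetical mu and Sigma, checking that the simulated moments are close to the targets:

n <- 5000; k <- 2
mu <- c(1, -1)
Sigma <- matrix(c(1, 0.8, 0.8, 2), ncol = 2)
Y <- matrix(rnorm(n*k), ncol = k) %*% chol(Sigma)  # rows have covariance Sigma
Y <- t(t(Y) + mu)                                  # add the mean to each row
colMeans(Y)   # close to mu
var(Y)        # close to Sigma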

33 Regression with R
data.txt -- columns UNIT, Y, X1, X2, with four rows for unit A and five for unit B (numeric values not reproduced here).

df=read.table("data.txt",header=TRUE)

myreg=function(y,X){
#
# purpose: compute lsq regression
#
# arguments:
#   y -- vector of dep var
#   X -- array of indep vars
#
# output:
#   list containing lsq coef and std errors
#
XpXinv=chol2inv(chol(crossprod(X)))
bhat=XpXinv%*%crossprod(X,y)
res=as.vector(y-X%*%bhat)
ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))
se=sqrt(diag(ssq*XpXinv))
list(b=bhat,std_errors=se)
}
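A usage sketch for myreg with simulated data (hypothetical values, since the numeric entries of data.txt are not reproduced above):

set.seed(1)
n <- 100
X <- cbind(1, runif(n), runif(n))                       # intercept plus two regressors
y <- as.vector(X %*% c(2, 1, -1) + rnorm(n, sd = 0.5))
out <- myreg(y, X)
out$b; out$std_errors                                   # compare with summary(lm(y ~ X - 1))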

34 Regression Likelihood p(y | X, β, σ²) ∝ (σ²)^(-n/2) exp{ -(y - Xβ)'(y - Xβ) / (2σ²) }, where (y - Xβ)'(y - Xβ) = νs² + (β - b)'X'X(β - b), with least-squares estimate b = (X'X)^-1 X'y, ν = n - k, and s² = (y - Xb)'(y - Xb)/ν.

35 Regression Likelihood The likelihood then factors as p(y | X, β, σ²) ∝ (σ²)^(-ν/2) exp{ -νs²/(2σ²) } × (σ²)^(-k/2) exp{ -(β - b)'X'X(β - b)/(2σ²) }. The first factor, viewed as a function of σ², is the kernel of an inverted gamma distribution; it can also be related to the inverse of a chi-squared distribution. Note that the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ.

36 Bayesian Regression Prior: beta | sigmasq ~ N(betabar, sigmasq*A^-1) and sigmasq ~ (nu*ssq)/chisq_nu. Inverted Chi-Square: draw from a chi-squared with nu degrees of freedom and divide it into nu*ssq. Interpretation: as if from another dataset. Draw from the prior?
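A short R sketch of drawing from this natural conjugate prior, with hypothetical hyperparameter values:

k <- 2
betabar <- rep(0, k); A <- 0.01 * diag(k)   # prior mean and precision (hypothetical)
nu <- 3; ssq <- 1                           # prior on sigmasq (hypothetical)
R <- 1000
sigmasq <- nu * ssq / rchisq(R, nu)         # sigmasq ~ (nu*ssq)/chisq_nu
beta <- t(sapply(sigmasq, function(s2)
  betabar + sqrt(s2) * backsolve(chol(A), rnorm(k))))  # beta | sigmasq ~ N(betabar, sigmasq*A^-1)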

37 Posterior The posterior is proportional to likelihood × prior. The likelihood and the prior each contribute a quadratic form in beta and an inverted chi-square kernel in sigmasq; combining the two quadratic forms (next slide) puts the posterior back in natural conjugate form.

38 Combining Quadratic Forms (y - X beta)'(y - X beta) + (beta - betabar)'A(beta - betabar) = (beta - btilde)'(X'X + A)(beta - btilde) + (y - X btilde)'(y - X btilde) + (btilde - betabar)'A(btilde - betabar), where btilde = (X'X + A)^-1 (X'y + A betabar). Equivalently, stack W = rbind(X, chol(A)) and z = c(y, chol(A) %*% betabar) and write the sum as (z - W beta)'(z - W beta) -- the form used in the runireg code below.

39 Posterior beta | sigmasq, y, X ~ N(btilde, sigmasq*(X'X + A)^-1) and sigmasq | y, X ~ (nu1*s1sq)/chisq_nu1, where nu1 = nu + n and nu1*s1sq = nu*ssq + (z - W btilde)'(z - W btilde), matching the draws in the runireg code below.

40 IID Simulations Scheme: [y | X, beta, sigmasq] [beta | sigmasq] [sigmasq]. 1) Draw [sigmasq | y, X]; 2) Draw [beta | sigmasq, y, X]; 3) Repeat.

41 IID Simulator, cont. Draw sigmasq from its inverted chi-square posterior, then beta from N(btilde, sigmasq*(X'X + A)^-1); the runireg function below implements these two draws.

42 Bayes Estimator Under squared-error loss, the Bayes estimator is the posterior mean of β. The marginal posterior on β is a multivariate Student t. Who cares?

43 Shrinkage and Conjugate Priors The Bayes estimator is the posterior mean of β: E[β | y, X] = btilde = (X'X + A)^-1 (X'y + A betabar), a precision-weighted combination of the least-squares estimate and the prior mean. This is a "shrinkage" estimator. Is this reasonable?

44 Assessing Prior Hyperparameters These determine prior location and spread for both the coefficients and the error variance. It has become customary to assess a "diffuse" prior: betabar = 0, A = 0.01*I, small nu, and a default value of ssq. This can be problematic; var(y) might be a better choice for ssq.

45 Improper or "Non-informative" Priors Classic "non-informative" prior (improper): p(beta, sigmasq) ∝ 1/sigmasq, i.e., a flat prior on beta. Is this "non-informative"? Of course not -- it says that beta is large with high prior "probability". Is this wise computationally? No, I have to worry about singularity in X'X. Is this a good procedure? No, it is not admissible. Shrinkage is good!

46 runireg
runireg=function(Data,Prior,Mcmc){
#
# purpose:
#   draw from posterior for a univariate regression model with natural conjugate prior
#
# arguments:
#   Data  -- list of data: y, X
#   Prior -- list of prior hyperparameters:
#            betabar, A  (prior mean, prior precision)
#            nu, ssq     (prior on sigmasq)
#   Mcmc  -- list of MCMC parms:
#            R    -- number of draws
#            keep -- thinning parameter
#
# output:
#   list of beta, sigmasq draws
#   beta is a k x 1 vector of coefficients
#
# model:  y = X beta + e,  var(e_i) = sigmasq
# priors: beta | sigmasq ~ N(betabar, sigmasq*A^-1)
#         sigmasq ~ (nu*ssq)/chisq_nu
#

47 runireg
# extract data and prior settings from the argument lists
y=Data$y; X=Data$X; n=length(y); k=ncol(X)
betabar=Prior$betabar; A=Prior$A; nu=Prior$nu; ssq=Prior$ssq
RA=chol(A)
W=rbind(X,RA)
z=c(y,as.vector(RA%*%betabar))
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R ; (W'W)^-1 = IR IR' -- IR is the inverse of the Cholesky root
btilde=crossprod(t(IR))%*%crossprod(W,z)
res=z-W%*%btilde
s=t(res)%*%res
#
# first draw sigmasq
#
sigmasq=as.vector((nu*ssq + s)/rchisq(1,nu+n))
#
# now draw beta given sigmasq
#
beta=btilde + sqrt(sigmasq)*IR%*%rnorm(k)
list(beta=beta,sigmasq=sigmasq)
}
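A usage sketch for the runireg function above with simulated data and illustrative prior settings (hypothetical values; the Mcmc argument is not used by this one-draw version):

set.seed(1)
n <- 100; k <- 2
X <- cbind(1, runif(n))
y <- as.vector(X %*% c(1, 2) + rnorm(n))
Data  <- list(y = y, X = X)
Prior <- list(betabar = rep(0, k), A = 0.01 * diag(k), nu = 3, ssq = var(y))
out <- runireg(Data, Prior, Mcmc = list(R = 1))   # Mcmc settings unused in this one-draw sketch
out$beta; out$sigmasq

Repeating the call gives iid draws from the posterior of (beta, sigmasq).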


50 Multivariate Regression Y = XB + U, where Y is n x m, X is n x k, B is k x m, and the rows of U are iid N(0, Sigma): m-vector errors correlated across the m regressions.

51 Multivariate Regression Likelihood p(Y | X, B, Sigma) ∝ |Sigma|^(-n/2) exp{ -1/2 tr[ Sigma^-1 (Y - XB)'(Y - XB) ] }.

52 Multivariate Regression Likelihood Writing Bhat = (X'X)^-1 X'Y and S = (Y - X Bhat)'(Y - X Bhat), the exponent splits as (Y - XB)'(Y - XB) = S + (B - Bhat)'X'X(B - Bhat), so the likelihood is ∝ |Sigma|^(-n/2) exp{ -1/2 tr[Sigma^-1 S] - 1/2 tr[Sigma^-1 (B - Bhat)'X'X(B - Bhat)] }.

53 Inverted Wishart Distribution The form of the likelihood suggests that the natural conjugate (convenient prior) for Sigma would be of the Inverted Wishart form, denoted Sigma ~ IW(nu, V): nu -- tightness, V -- location; however, as the location increases, the spread also increases. Limitations: i. small nu -- thick tails; ii. only one tightness parm.

54 Wishart Distribution (rwishart) If Sigma ~ IW(nu, V), then Sigma^-1 ~ Wishart(nu, V^-1). A Wishart(nu, V) draw can be formed as the sum of outer products of nu iid N(0, V) vectors; the rwishart function returns a Wishart draw together with its inverse (the corresponding IW draw) and their Cholesky roots.
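For intuition, a minimal R sketch of a direct Wishart(nu, V) draw as a sum of outer products of nu iid N(0, V) vectors (illustrative only -- not the rwishart implementation):

rwishart_sketch <- function(nu, V) {
  m <- nrow(V)
  Z <- matrix(rnorm(nu * m), ncol = m) %*% chol(V)  # nu iid draws from N(0, V), one per row
  crossprod(Z)                                      # W = sum over draws of z z'
}
W <- rwishart_sketch(10, diag(2))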

55 Multivariate Regression Prior and Posterior Prior: Sigma ~ IW(nu, V) and vec(B) | Sigma ~ N(vec(Bbar), Sigma (x) A^-1). Posterior: Sigma | Y, X ~ IW(nu + n, V + S) and vec(B) | Sigma, Y, X ~ N(vec(Btilde), Sigma (x) (X'X + A)^-1), where Btilde = (X'X + A)^-1 (X'Y + A Bbar) and S = (Z - W Btilde)'(Z - W Btilde), as in the rmultireg code below.

56 Drawing from Posterior: rmultireg
rmultireg=function(Y,X,Bbar,A,nu,V){
# note: Y, X, A, Bbar must be matrices!
n=nrow(Y); m=ncol(Y); k=ncol(X)
RA=chol(A)
W=rbind(X,RA)
Z=rbind(Y,RA%*%Bbar)
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R and (W'W)^-1 = IR IR' -- IR is the inverse of the Cholesky root
Btilde=crossprod(t(IR))%*%crossprod(W,Z)
# IR IR' (W'Z) = (X'X+A)^-1 (X'Y + A Bbar)
S=crossprod(Z-W%*%Btilde)
#
# first draw Sigma
#
rwout=rwishart(nu+n,chol2inv(chol(V+S)))
Sigma=rwout$IW
#
# now draw B given Sigma; note vec(B) ~ N(vec(Btilde), Sigma (x) Cov)
#   Cov = (X'X + A)^-1 = IR IR'  and  Sigma = CI CI'
#   therefore cov(vec(B)) = CI CI' (x) IR IR' = (CI (x) IR)(CI (x) IR)'
#   so draw vec(B) = vec(Btilde) + (CI (x) IR) vec(Z_km), where Z_km is a k x m matrix of N(0,1)
#   since vec(ABC) = (C' (x) A) vec(B), this is B = Btilde + IR Z_km CI'
#
B=Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)
list(B=B,Sigma=Sigma)
}
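A usage sketch for rmultireg with simulated data and illustrative prior settings (hypothetical values; rwishart is assumed to come from the bayesm package):

library(bayesm)   # assumed source of the rwishart() call inside rmultireg
set.seed(1)
n <- 100; k <- 2; m <- 3
X <- cbind(1, runif(n))
B <- matrix(c(1, 2, -1, 0.5, 0, 1), nrow = k)   # true k x m coefficient matrix
Y <- X %*% B + matrix(rnorm(n*m), ncol = m)
Bbar <- matrix(0, nrow = k, ncol = m)           # prior mean
A <- 0.01 * diag(k)                             # prior precision
nu <- m + 3; V <- nu * diag(m)                  # IW prior on Sigma
out <- rmultireg(Y, X, Bbar, A, nu, V)
out$B; out$Sigma                                # one posterior draw of (B, Sigma)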

57 Conjugacy is Fragile! SUR (seemingly unrelated regressions): a set of regressions y_i = X_i beta_i + e_i, "related" only via errors that are correlated across equations. BUT there is no joint conjugate prior!!