Hierarchical Models.


Hierarchical Models

Parameter Dependencies & Factorization
The parameters of hierarchical models may depend on one another. Hierarchical models allow the joint posterior to be factored into a chain of dependencies (here a surgical-team parameter ⍵ governs the patient-outcome parameters 𝛳1 and 𝛳2):
p(𝛳1,𝛳2,⍵|D) ∝ p(D|𝛳1,𝛳2,⍵) p(𝛳1,𝛳2,⍵)
p(𝛳1,𝛳2,⍵|D) ∝ p(D|𝛳1,𝛳2,⍵) p(𝛳1|𝛳2,⍵) p(𝛳2|⍵) p(⍵)
p(𝛳1,𝛳2,⍵|D) ∝ p(D|𝛳1,𝛳2,⍵) p(𝛳1|⍵) p(𝛳2|⍵) p(⍵)
The last line expresses conditional independence: if patient 2 already has a good model of the surgical team, they learn nothing new from patient 1. Each outcome informs the higher-level parameter, which in turn constrains all individual parameters.
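To make the middle step explicit, here is the same factorization written out as a short LaTeX derivation: the chain rule always holds, and the conditional-independence assumption 𝛳1 ⊥ 𝛳2 | ⍵ is what collapses it to the final form on the slide.

\begin{align*}
p(\theta_1,\theta_2,\omega \mid D)
  &\propto p(D \mid \theta_1,\theta_2,\omega)\, p(\theta_1,\theta_2,\omega) \\
  &= p(D \mid \theta_1,\theta_2,\omega)\,
     p(\theta_1 \mid \theta_2,\omega)\, p(\theta_2 \mid \omega)\, p(\omega)
     && \text{(chain rule)} \\
  &= p(D \mid \theta_1,\theta_2,\omega)\,
     p(\theta_1 \mid \omega)\, p(\theta_2 \mid \omega)\, p(\omega)
     && \text{(assuming } \theta_1 \perp \theta_2 \mid \omega\text{)}
\end{align*}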

Review: Gibbs Sampling, Beta Dist

Introducing Gibbs
Metropolis works best if the proposal distribution is properly tuned to the posterior. Gibbs sampling is more efficient and well suited to hierarchical models. In Gibbs, parameters are selected one at a time and cycled through (Θ1, Θ2, ..., Θ1, Θ2, ...). The proposal distribution for the selected parameter is its conditional posterior given the current values of all other parameters; e.g., with only two parameters, the proposal for Θ1 is p(Θ1 | Θ2, D). Because this proposal exactly mirrors the posterior for that parameter, the proposed move is always accepted.

Gibbs Sampling: Pros & Cons
How does Gibbs differ from Metropolis? Advantages: no inefficiency from rejected proposals, and no need to tune proposal distributions. (The figure shows the same chain plotted with intermediate steps and with whole steps.) Disadvantages: progress can be stalled by highly correlated parameters; imagine a long, narrow, diagonal hallway. Also, the conditional posterior distributions must be derivable, which is much easier under conditional independence. For this reason, Gibbs sampling pairs well with hierarchical models.
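As a concrete illustration (not from the slides), here is a minimal Gibbs sampler sketch in R for a toy two-parameter posterior, a bivariate normal with correlation rho. Each full conditional is itself normal, so every draw is accepted; setting rho close to 1 shows the "narrow diagonal hallway" slowdown.

# Minimal Gibbs sampler for a toy posterior: (theta1, theta2) bivariate
# normal, mean 0, variance 1, correlation rho.
gibbs_bvn <- function(n_iter = 5000, rho = 0.5) {
  theta <- matrix(NA_real_, nrow = n_iter, ncol = 2)
  t1 <- 0; t2 <- 0                       # starting values
  cond_sd <- sqrt(1 - rho^2)             # sd of each full conditional
  for (i in seq_len(n_iter)) {
    t1 <- rnorm(1, mean = rho * t2, sd = cond_sd)  # draw theta1 | theta2
    t2 <- rnorm(1, mean = rho * t1, sd = cond_sd)  # draw theta2 | theta1
    theta[i, ] <- c(t1, t2)
  }
  theta
}

# With rho = 0.99 the chain creeps along the narrow diagonal hallway;
# compare how far each run moves per iteration.
set.seed(1)
easy <- gibbs_bvn(rho = 0.5)
hard <- gibbs_bvn(rho = 0.99)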

Beta Distribution
All distributions have control knobs. The beta distribution has two: a, b. However, these parameters are not particularly meaningful semantically. Why not re-parameterize?
⍵ = (a − 1) / (a + b − 2)
𝛋 = a + b
Interpretation of the new parameters is more straightforward: ⍵ is the "mode" and 𝛋 is the "concentration". (The figure shows a grid of beta densities for several values of these parameters; take, for example, the middle column.)
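A small sketch of this re-parameterization in R (the function name is my own, not from the slides): it inverts the two formulas above and confirms that the resulting beta density has the requested mode and concentration.

# Convert (mode omega, concentration kappa) to beta shape parameters (a, b).
# Requires kappa > 2 so the density has an interior mode.
beta_ab_from_mode_kappa <- function(omega, kappa) {
  stopifnot(omega > 0, omega < 1, kappa > 2)
  a <- omega * (kappa - 2) + 1
  b <- (1 - omega) * (kappa - 2) + 1
  c(a = a, b = b)
}

# Example: mode 0.75, concentration 12 gives a = 8.5, b = 3.5.
p <- beta_ab_from_mode_kappa(0.75, 12)
(p["a"] - 1) / (p["a"] + p["b"] - 2)   # recovers the mode, 0.75
unname(p["a"] + p["b"])                # recovers the concentration, 12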

One Mint, One Coin

Parameter Chains
The model is a chain: factory parameters (A, B) → ⍵ (the mint's average coin bias) → 𝛳 (the bias of the coin being flipped) → y (flip outcomes).
Coin outcomes y come from the Bernoulli distribution, which has one parameter: 𝛳, the coin's specific bias. The mint's output of coins is modeled with a beta distribution whose mode is ⍵, so coin biases cluster around ⍵. Finally, to describe our mint relative to other factories, we draw ⍵ itself from another beta distribution with parameters A and B.
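A generative sketch of that chain in R (the values A = 2, B = 2 and 𝛋 = 20 are illustrative choices, not from the slides): simulate the mint's ⍵, then one coin's bias 𝛳, then a handful of flips.

set.seed(42)
A <- 2; B <- 2       # illustrative top-level beta parameters for omega
kappa <- 20          # illustrative concentration of coin biases around omega

# Mint level: the average coin bias for this mint.
omega <- rbeta(1, A, B)

# Coin level: this coin's bias, a beta draw whose mode is omega.
theta <- rbeta(1, omega * (kappa - 2) + 1, (1 - omega) * (kappa - 2) + 1)

# Data level: flip outcomes are Bernoulli draws with probability theta.
y <- rbinom(10, size = 1, prob = theta)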

Figure 9.2
Prior: low certainty regarding ⍵, but high certainty about the dependence of 𝛳 on ⍵.
Likelihood: based on the data D = 9 heads, 3 tails. Question: why is the likelihood invariant to ⍵?
Posterior: the distribution of ⍵ has changed, while the strong dependence of 𝛳 on ⍵ persists.

Figure 9.3
Prior: high certainty regarding ⍵, but low certainty about the dependence of 𝛳 on ⍵.
Likelihood: the same data as before, D = 9 heads, 3 tails.
Posterior: high certainty regarding ⍵ remains, and the dependence of 𝛳 on ⍵ is still weak.

The Effect of the Prior
(Figure: prior, likelihood, and posterior panels compared side by side.)

One Mint, Two Coins

Figure 9.5
Prior: low certainty regarding ⍵, and weak dependence of 𝛳 on ⍵ (K = 5).
Likelihood: based on D1 = 3 heads, 12 tails and D2 = 4 heads, 1 tail. Question: why are the 𝛳1 contours more dense?
Posterior: the distribution of ⍵ has changed, while the (weak) dependence of 𝛳 on ⍵ persists.

Figure 9.6
Prior: encodes strong dependence of 𝛳 on ⍵ (K = 75); the posterior will "live in this trough."
Likelihood: based on D1 = 3 heads, 12 tails and D2 = 4 heads, 1 tail.
Posterior: 𝛳2 is peaked around 0.4, far from the 0.8 proportion in its own coin's data. Why? The other coin has more data and therefore a greater effect on ⍵, which in turn pulls 𝛳2.

One (Realistic) Mint, Two Coins

Mint Variance & Gamma
Recall that before, we fixed the beta "distribution width" at 𝜅 = K. Now let 𝜅 itself be drawn from a distribution; we want the prior to allow small values of 𝜅. A natural choice is a gamma distribution, which has two parameters: shape and rate (S𝜅, R𝜅). (The figure shows gamma densities for several (S𝜅, R𝜅) pairs: (0.01, 0.01), (1.56, 0.03), (1, 0.02), (6.3, 0.125).)
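One common way to choose the shape and rate, sketched below in R, is to pick the gamma prior's mode and standard deviation and solve for (S𝜅, R𝜅). The function name and the example values are my own, for illustration.

# Solve for the gamma shape and rate giving a desired mode and sd,
# using mode = (shape - 1)/rate and sd = sqrt(shape)/rate (shape >= 1).
gamma_shape_rate_from_mode_sd <- function(mode, sd) {
  stopifnot(mode > 0, sd > 0)
  rate  <- (mode + sqrt(mode^2 + 4 * sd^2)) / (2 * sd^2)
  shape <- 1 + mode * rate
  c(shape = shape, rate = rate)
}

# Example: a broad prior on kappa with mode 1 and sd 10.
gamma_shape_rate_from_mode_sd(mode = 1, sd = 10)
# shape ~ 1.105, rate ~ 0.105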

Hierarchical Models, in JAGS

Example: Therapeutic Touch

Setup, Data, and Model
General claim: Therapeutic Touch practitioners can sense a body's energy field. Operationalized claim: they should be able to sense which of their hands is near another person's hand, with vision obstructed.
Data: 28 practitioners, each tested for 10 trials.
Hierarchical model: our "coin" model fits perfectly, with a chain from the ability and consistency of the group, to the ability of individual practitioners, to the trial outcomes.

JAGS Prior We set our priors with low certainty, to avoid biasing the final result.

JAGS Code
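The transcript does not reproduce the code itself. Below is a sketch in R with an embedded JAGS model string of what a model with this structure typically looks like; the variable names (y, s, Nsubj, etc.), the synthetic data, and the vague prior settings are my assumptions following the previous slides, not a copy of the presenter's code.

library(rjags)

model_string <- "
model {
  for (i in 1:Ntotal) {
    y[i] ~ dbern(theta[s[i]])             # trial outcome for practitioner s[i]
  }
  for (j in 1:Nsubj) {
    # Each practitioner's ability, centered on the group mode omega.
    theta[j] ~ dbeta(omega*(kappa-2)+1, (1-omega)*(kappa-2)+1)
  }
  omega ~ dbeta(1, 1)                     # vague prior on the group mode
  kappa <- kappaMinusTwo + 2
  kappaMinusTwo ~ dgamma(0.01, 0.01)      # vague prior on the concentration
}
"

# Tiny synthetic example data (NOT the study data): 3 practitioners, 10 trials each.
set.seed(1)
s <- rep(1:3, each = 10)
y <- rbinom(length(s), size = 1, prob = 0.5)
data_list <- list(y = y, s = s, Ntotal = length(y), Nsubj = 3)

jm <- jags.model(textConnection(model_string), data = data_list, n.chains = 3)
samples <- coda.samples(jm, variable.names = c("omega", "kappa", "theta"),
                        n.iter = 5000)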

JAGS Results
If 0.5 ∉ HDI, we might be justified in concluding that bodily presence was detectable. If max(HDI) < 0.5, presence was somehow detected but misinterpreted. The model assumed that all individuals were representative of the same overarching group, so all individuals mutually informed each other's estimates.
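For reference, a minimal sketch of that HDI check in R, assuming `samples` is the coda object from the earlier sketch; the interval is computed from the posterior draws of omega with coda's HPDinterval.

library(coda)

# 95% HDI for the group-level mode omega, from the MCMC draws.
omega_draws <- as.matrix(samples)[, "omega"]
hdi <- HPDinterval(mcmc(omega_draws), prob = 0.95)
hdi

# Decision sketch: is 0.5 outside the interval?
outside <- 0.5 < hdi[1, "lower"] || 0.5 > hdi[1, "upper"]
outside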

Shrinkage
We have seen low-level parameters trying to reconcile two sources of information: the data and the higher-level parameter. Non-hierarchical models, in contrast, need only accommodate the former. This additional constraint makes the posterior distributions narrower overall and pulls individual estimates toward the group-level value. This is desirable: parameter estimates are less affected by random sampling noise.
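A toy numeric sketch of shrinkage in R. It treats the group-level posterior as if it had settled at mode 0.45 with concentration 30 (illustrative values, and a simplification of the full joint model): a practitioner with 7 correct out of 10 gets an estimate pulled from the raw proportion 0.7 toward the group value, with a tighter posterior than a flat-prior analysis gives.

# Raw data for one practitioner: 7 correct out of 10 trials.
z <- 7; N <- 10

# Flat-prior (non-hierarchical) analysis: Beta(1, 1) prior.
flat_a <- 1 + z; flat_b <- 1 + (N - z)

# Hierarchical-style analysis: prior induced by an illustrative group-level
# posterior with mode 0.45 and concentration 30.
omega <- 0.45; kappa <- 30
hier_a <- omega * (kappa - 2) + 1 + z
hier_b <- (1 - omega) * (kappa - 2) + 1 + (N - z)

# Posterior means: the hierarchical estimate is shrunk toward 0.45,
# and its posterior is narrower (larger a + b).
flat_a / (flat_a + flat_b)   # ~0.667
hier_a / (hier_a + hier_b)   # ~0.515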

Higher-Level Models
We can easily construct third-order hierarchies: a chain from the ability and consistency of the group, to the ability and consistency of each position, to the ability of individual players, to the batting outcomes. In baseball this matters: pitchers have a much different batting average from players at other positions.
Neither the position model nor the position-less model is uniquely "correct". Like all models, parameter estimates are meaningful descriptions only in the context of the model structure.
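For completeness, a sketch of how the JAGS model string might gain one more level. Everything here, including the variable names (pos, omegaPos, omegaO), is my own illustrative extension of the earlier two-level sketch, not the presenter's code.

three_level_string <- "
model {
  for (i in 1:Nplayers) {
    z[i] ~ dbin(theta[i], N[i])           # hits out of at-bats for player i
    # Player ability, centered on the mode for that player's position.
    theta[i] ~ dbeta(omegaPos[pos[i]]*(kappaPos[pos[i]]-2)+1,
                     (1-omegaPos[pos[i]])*(kappaPos[pos[i]]-2)+1)
  }
  for (p in 1:Npos) {
    # Position-level mode, centered on the overall (group-level) mode.
    omegaPos[p] ~ dbeta(omegaO*(kappaO-2)+1, (1-omegaO)*(kappaO-2)+1)
    kappaPos[p] <- kappaPosMinusTwo[p] + 2
    kappaPosMinusTwo[p] ~ dgamma(0.01, 0.01)
  }
  omegaO ~ dbeta(1, 1)                    # vague prior on the overall mode
  kappaO <- kappaOMinusTwo + 2
  kappaOMinusTwo ~ dgamma(0.01, 0.01)
}
"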

The End