Reversing Label Switching: An Interactive Talk
Earl Duncan
BRAG, 20 July 2017

Introduction

Given observed data $\mathbf{y} = (y_1, \ldots, y_N)$, the $K$-component mixture model is expressed as

$$\mathbf{Y} \sim p(\mathbf{y} \mid \mathbf{w}, \boldsymbol{\phi}) = \prod_{i=1}^{N} \sum_{k=1}^{K} w_k\, f_k(y_i \mid \boldsymbol{\phi}_k)$$

where $\boldsymbol{\phi}_k$ denotes unknown component-specific parameter(s), and $f_k(\cdot)$ is the $k$th component density with corresponding mixture weight $w_k$, subject to $\sum_{k=1}^{K} w_k = 1$ and $w_k \geq 0$ for $k = 1, \ldots, K$.

Marin, J.-M., K. Mengersen, and C. P. Robert. 2005. "Bayesian modelling and inference on mixtures of distributions." In Handbook of Statistics, edited by C. Rao and D. Dey. New York: Springer-Verlag.
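As a concrete illustration (not part of the original slides), here is a minimal sketch of evaluating this mixture likelihood for Gaussian components; the function name and data are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(y, w, mu, sigma):
    """Log-likelihood of a K-component Gaussian mixture:
    sum_i log( sum_k w_k * N(y_i | mu_k, sigma_k) )."""
    dens = norm.pdf(y[:, None], loc=mu[None, :], scale=sigma[None, :])  # shape (N, K)
    return np.sum(np.log(dens @ w))

# Example with K = 2 components
y = np.array([0.1, 5.2, 4.8, -0.3])
w = np.array([0.5, 0.5])
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 1.0])
print(mixture_log_likelihood(y, w, mu, sigma))
```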

Introduction

A latent allocation variable $Z_i$ is used to identify which component $Y_i$ belongs to:

$$Y_i \mid z_i, \boldsymbol{\phi} \sim f_{z_i}(y_i \mid \boldsymbol{\phi}_{z_i})$$
$$Z_i \mid \mathbf{w} \sim \text{Cat}(w_1, \ldots, w_K)$$

The likelihood is exchangeable, meaning that it is invariant to permutations of the labels identifying the mixture components:

$$p(\mathbf{y} \mid \boldsymbol{\theta}) = p(\mathbf{y} \mid \tau(\boldsymbol{\theta}))$$

for some permutation $\tau$, e.g. $p(\mathbf{y} \mid \theta_1, \theta_2) = p(\mathbf{y} \mid \theta_2, \theta_1)$. If the posterior distribution is also invariant to permutations of the labels, this is known as label switching (LS).
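A quick numeric check of this invariance (an illustration, not from the talk), reusing the hypothetical Gaussian mixture from the previous sketch:

```python
import numpy as np
from scipy.stats import norm

def mix_loglik(y, w, mu, sigma):
    # sum_i log( sum_k w_k * N(y_i | mu_k, sigma_k) )
    dens = norm.pdf(y[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(np.log(dens @ w))

y = np.array([0.1, 5.2, 4.8, -0.3])
w, mu, sigma = np.array([0.7, 0.3]), np.array([0.0, 5.0]), np.array([1.0, 1.0])

tau = np.array([1, 0])  # swap the two component labels (0-based indices here)
print(mix_loglik(y, w, mu, sigma))                 # some value L
print(mix_loglik(y, w[tau], mu[tau], sigma[tau]))  # identical value L
```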

Introduction

Consider the conditions:
1. the prior is (at least partly) exchangeable
2. the sampler is efficient at exploring the posterior hypersurface

If condition 1 holds, the posterior will have (up to) $K!$ symmetric modes. If conditions 1 and 2 both hold, LS will occur (i.e. the symmetric modes will be observed).

[Figure: trace plots showing three cases: no label switching; LS between all 3 groups; LS between groups 1 and 2.]

Introduction

If label switching occurs, the marginal posterior distributions are identical for each component. Therefore, it is impossible to make component-specific inferences!

[Figure: marginal posterior densities for K = 3 and K = 4, with identical marginals across components.]

Introduction

To make sensible inferences, one must first reverse the label switching using a relabelling algorithm:
1. If/when LS occurs, determine the permutations $\tau^{(1)}, \ldots, \tau^{(M)}$ needed to undo the label switching.
2. Apply the permutations to $\boldsymbol{\phi}$ and $\mathbf{w}$, and the inverse permutations to $\mathbf{z}$.

The function $\tau(\cdot)$ can be regarded as a generic permutation function which either permutes or relabels. Let $\tau = (\tau_1, \ldots, \tau_K)$ be a permutation of the index set $\{1, \ldots, K\}$, let $\mathbf{v} = (v_1, \ldots, v_K)$ be an arbitrary $K$-length vector, and let $\mathbf{z} = (z_1, z_2, z_3, \ldots)$ be a vector of arbitrary length (or possibly a scalar) containing only the values $\{1, \ldots, K\}$. Then:

Permute: $\tau(v_1, \ldots, v_K) = (v_{\tau_1}, \ldots, v_{\tau_K})$
Relabel: $\tau(z_1, z_2, z_3, \ldots) = (\tau_{z_1}, \tau_{z_2}, \tau_{z_3}, \ldots)$

A code sketch of these two operations appears below.
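A minimal sketch (not from the talk) of the permute and relabel operations in Python, keeping the slides' 1-based labels:

```python
import numpy as np

def permute(tau, v):
    """Permute a K-length parameter vector: result_k = v_{tau_k}."""
    tau = np.asarray(tau)
    return np.asarray(v)[tau - 1]

def relabel(tau, z):
    """Relabel an allocation vector: each label z_i becomes tau_{z_i}."""
    tau = np.asarray(tau)
    return tau[np.asarray(z) - 1]

def inverse(tau):
    """Inverse permutation, so relabel(inverse(tau), relabel(tau, z)) == z."""
    tau = np.asarray(tau)
    inv = np.empty_like(tau)
    inv[tau - 1] = np.arange(1, len(tau) + 1)
    return inv
```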

Example: determining $\tau$

$\tau^{(m)}$ can be determined from the posterior estimates $\mathbf{z}^{(m)}$ and a reference allocation vector $\mathbf{z}^* = (z_1, \ldots, z_N)^{(m^*)}$.

Exercises

Consider the following cross-tabulation of the reference allocation vector $\mathbf{z}^* = \mathbf{z}^{(m^*)}$ (columns) against $\mathbf{z}^{(7)}$ (rows), with $N = 200$:

                 z*
            1    2    3    4
z^(7)  1    0   90    0    0
       2    0    0    2   14
       3   52    0    1    3
       4    0    2   35    1

Question 1: What should the permutation $\tau^{(7)}$ be to reverse the labels of a component-specific parameter, $\boldsymbol{\theta}^{(7)}$?
Hint: (3, 1, 4, 2) or (2, 4, 1, 3)
Answer: $\tau^{(7)} = (3, 1, 4, 2)$
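One way to automate this step (an illustration; the relabelling algorithms compared later in the talk may differ) is to choose the matching that maximises agreement between the two allocation vectors, e.g. via the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Cross-tabulation from the slide: rows are z^(7) labels, columns are z* labels.
ct = np.array([[ 0, 90,  0,  0],
               [ 0,  0,  2, 14],
               [52,  0,  1,  3],
               [ 0,  2, 35,  1]])

# Maximise total agreement (negate, since linear_sum_assignment minimises).
rows, cols = linear_sum_assignment(-ct)

tau = np.empty(4, dtype=int)
tau[cols] = rows + 1   # 1-based: tau_k is the z^(7) label matched to z* label k
print(tau)             # [3 1 4 2]
```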

Exercises

The second step requires this permutation to be applied to the component-specific parameters and to the labels.

Question 2: If $\mathbf{w}^{(7)} = (0.5, 0.1, 0.3, 0.2)$ and $\mathbf{z}^{(7)} = (3, 4, 2, 2, 3, \ldots)$, what are the resulting estimates after relabelling? Recall $\tau^{(7)} = (3, 1, 4, 2)$.

Hint:
Permuting: $\tau(v_1, \ldots, v_K) = (v_{\tau_1}, \ldots, v_{\tau_K})$
Relabelling: $\tau(z_1, z_2, z_3, \ldots) = (\tau_{z_1}, \tau_{z_2}, \tau_{z_3}, \ldots)$

Answer:
$\mathbf{w}^{(7)} := \tau^{(7)}(0.5, 0.1, 0.3, 0.2) = (0.3, 0.5, 0.2, 0.1)$
$\mathbf{z}^{(7)} := (\tau^{(7)})^{-1}(3, 4, 2, 2, 3, \ldots) = (\tau^{-1}_{z_1}, \tau^{-1}_{z_2}, \tau^{-1}_{z_3}, \tau^{-1}_{z_4}, \tau^{-1}_{z_5}, \ldots) = (\tau^{-1}_3, \tau^{-1}_4, \tau^{-1}_2, \tau^{-1}_2, \tau^{-1}_3, \ldots) = (1, 3, 4, 4, 1, \ldots)$
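These answers can be checked numerically with the 1-based helpers sketched earlier (illustration only):

```python
import numpy as np

tau = np.array([3, 1, 4, 2])
w = np.array([0.5, 0.1, 0.3, 0.2])
z = np.array([3, 4, 2, 2, 3])

inv = np.empty_like(tau)
inv[tau - 1] = np.arange(1, 5)  # inverse permutation: (2, 4, 1, 3)

print(w[tau - 1])  # permute the weights:        [0.3 0.5 0.2 0.1]
print(inv[z - 1])  # relabel z with the inverse: [1 3 4 4 1]
```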

Exercises

Question 3: Why is the inverse permutation used to relabel $\mathbf{z}$?
Hint: Consider drawing values from 3 component densities. Introduce LS, and note how the new values of $\boldsymbol{\theta}$ and $\mathbf{z}$ are recorded.

Answer: Draw values without LS, then with LS:

θ (without LS):  0  10  20
θ (with LS):    10  20   0

⇒ $\tau_{\text{LS}} = (2, 3, 1)$ ⇒ $\tau = \tau_{\text{LS}}^{-1} = (3, 1, 2)$

But how are the values of $\mathbf{z}$ recorded?

Exercises

Answer continued: $\tau_{\text{LS}} = (2, 3, 1)$, so $\tau = (3, 1, 2)$.

θ (without LS):  0  10  20
θ (with LS):    10  20   0

Under LS the sampler draws from the middle component but labels it "1" (2 → 1), draws from the right component but labels it "2" (3 → 2), and draws from the left component but labels it "3" (1 → 3). The allocations are therefore recorded as:

z (without LS): 3 3 1 2 1 …
z (with LS):    2 2 3 1 3 …

Applying the inverse permutation to the with-LS allocations recovers the without-LS labels:

$\tau^{-1}(\mathbf{z}_2) = (\tau^{-1}_{z_1}, \tau^{-1}_{z_2}, \tau^{-1}_{z_3}, \tau^{-1}_{z_4}, \tau^{-1}_{z_5}, \ldots) = (\tau^{-1}_2, \tau^{-1}_2, \tau^{-1}_3, \tau^{-1}_1, \tau^{-1}_3, \ldots) = (3, 3, 1, 2, 1, \ldots)$
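A final numeric check of this example (again an illustration, with the same 1-based conventions as before):

```python
import numpy as np

tau_LS = np.array([2, 3, 1])       # how LS permuted the components
tau = np.empty_like(tau_LS)
tau[tau_LS - 1] = np.arange(1, 4)  # tau = tau_LS^{-1} = (3, 1, 2)

theta = np.array([0, 10, 20])
theta_ls = theta[tau_LS - 1]       # recorded under LS: (10, 20, 0)
print(theta_ls[tau - 1])           # permute with tau: back to (0, 10, 20)

z_ls = np.array([2, 2, 3, 1, 3])
tau_inv = np.empty_like(tau)
tau_inv[tau - 1] = np.arange(1, 4) # tau^{-1} = (2, 3, 1)
print(tau_inv[z_ls - 1])           # relabel with tau^{-1}: (3, 3, 1, 2, 1)
```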

Comparison of Relabelling Algorithms

[Slide content (a comparison table of relabelling algorithms) was not captured in the transcript.]

Questions?

Any questions?