Dirichlet Process Mixtures: A Gentle Tutorial
Graphical Models – 10708
Khalid El-Arini, Carnegie Mellon University
November 6th, 2006

Motivation
We are given a data set, and are told that it was generated from a mixture of Gaussians. Unfortunately, no one has any idea how many Gaussians produced the data.

What to do?
We can guess the number of clusters, run EM for Gaussian mixture models, look at the results, and then try again…
We can do hierarchical agglomerative clustering, and cut the tree at a visually appealing level…
We want to cluster the data in a statistically principled manner, without resorting to hacks.

Review: Dirichlet Distribution
A distribution over possible parameter vectors for a multinomial distribution; it is in fact the conjugate prior for the multinomial. Let θ = (θ_1, …, θ_m) with θ_i ≥ 0 and Σ_i θ_i = 1. We write θ ~ Dirichlet(α_1, …, α_m), with density
p(θ | α) = [Γ(Σ_i α_i) / Π_i Γ(α_i)] Π_i θ_i^(α_i − 1).
The Beta distribution is the special case of a Dirichlet for 2 dimensions. Samples from the distribution lie in the (m − 1)-dimensional simplex. Thus, it is in fact a “distribution over distributions.”
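
A minimal sketch (not from the original slides, assuming NumPy is available) of what Dirichlet draws look like: each sample is a nonnegative vector summing to 1, i.e. a point in the simplex and a valid multinomial parameter vector.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 5.0])      # concentration parameters of a 3-dimensional Dirichlet
theta = rng.dirichlet(alpha, size=4)   # four draws, each a 3-vector on the simplex

print(theta)                           # rows are multinomial parameter vectors
print(theta.sum(axis=1))               # each row sums to 1
```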

Dirichlet Process
A Dirichlet Process is also a distribution over distributions. We write G ~ DP(α, G_0), where G_0 is a base distribution and α is a positive scaling parameter. G has the same support as G_0.

Dirichlet Process
Consider a Gaussian base distribution G_0, and a draw G ~ DP(α, G_0).

Dirichlet Process
G ~ DP(α, G_0). G_0 is continuous, so the probability that any two samples from G_0 are equal is precisely zero. However, G is a discrete distribution, made up of a countably infinite number of point masses [Blackwell]. Therefore, there is always a non-zero probability of two samples from G colliding.

Dirichlet Process
Compare G ~ DP(α_1, G_0) with G ~ DP(α_2, G_0): the value of α determines how close G is to G_0 (larger α yields draws that resemble G_0 more closely).

Sampling from a DP
G ~ DP(α, G_0), and X_n | G ~ G for n = 1, …, N (i.i.d.).
Marginalizing out G introduces dependencies between the X_n variables.
[Graphical model: G → X_n, inside a plate over n = 1, …, N.]

Sampling from a DP
Assume we view these variables in a specific order, and are interested in the behavior of X_n given the previous n − 1 observations. Let there be K unique values x*_1, …, x*_K among the first n − 1 variables, with n_k of them equal to x*_k. Then
P(X_n = x | X_1, …, X_{n−1}) = (1 / (n − 1 + α)) Σ_{k=1..K} n_k δ_{x*_k}(x) + (α / (n − 1 + α)) G_0(x),
the Blackwell–MacQueen Pólya urn scheme.

Sampling from a DP
By the chain rule, the joint P(X_1, …, X_N) is the product of these conditionals, and it factors as P(partition) × P(draws from G_0): it depends only on how the variables are grouped into shared values and on which unique values were drawn, not on the order in which we consider the variables. The X_n are therefore exchangeable, and by de Finetti's theorem (1935) they can be represented as i.i.d. draws given a random distribution — that is, we arrive at a mixture model.

Chinese Restaurant Process
We can rewrite the predictive distribution as follows. Let there be K unique values x*_1, …, x*_K among the first n − 1 variables, with n_k of them equal to x*_k:
P(X_n = x*_k | X_1, …, X_{n−1}) = n_k / (n − 1 + α), for each existing value x*_k,
P(X_n = a new draw from G_0 | X_1, …, X_{n−1}) = α / (n − 1 + α).

Chinese Restaurant Process
Consider a restaurant with infinitely many tables, where the X_n's represent the patrons of the restaurant. From the above conditional probability distribution, we can see that a customer is more likely to sit at a table if there are already many people sitting there. However, with probability proportional to α, the customer will sit at a new table. This is also known as the “clustering effect,” and can be seen in the setting of social clubs. [Aldous]
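
As an illustration (not from the original slides), here is a minimal simulation of the Chinese Restaurant Process: each new customer joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to α.

```python
import numpy as np

def crp(n_customers, alpha, rng):
    tables = []                          # tables[k] = number of customers at table k
    assignments = []                     # table index chosen by each customer
    for n in range(n_customers):
        weights = np.array(tables + [alpha], dtype=float)
        probs = weights / weights.sum()  # normalizer is n + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)             # customer opens a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

rng = np.random.default_rng(0)
assignments, tables = crp(100, alpha=1.0, rng=rng)
print(len(tables), "tables occupied by 100 customers")
```

Larger α produces more tables; the expected number of occupied tables grows roughly as α log n.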

Dirichlet Process Mixture
[Graphical model: α and G_0 generate G; each η_n is drawn from G; each y_n is drawn from a distribution parameterized by η_n, inside a plate over n = 1, …, N.]
G ~ DP(α, G_0) consists of a countably infinite number of point masses. We draw from G N times to get the parameters η_n of the different mixture components.
If the η_n were drawn from, e.g., a Gaussian, no two values would be the same; but since they are drawn from a distribution that was itself drawn from a Dirichlet Process, we expect the η_n to cluster.
The number of unique values among the η_n equals the number of mixture components.

CRP Mixture
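
The following sketch (illustrative only; it assumes univariate Gaussian components with unit variance and a Gaussian base distribution G_0 = N(0, 10)) generates data from a CRP mixture: run the CRP over cluster assignments, draw each new cluster's parameter η* from G_0, then draw each observation from its cluster's component.

```python
import numpy as np

def crp_mixture_sample(n, alpha, rng):
    counts, means, data = [], [], []
    for i in range(n):
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):                              # start a new mixture component
            counts.append(1)
            means.append(rng.normal(0.0, np.sqrt(10.0)))  # eta* ~ G0 = N(0, 10)
        else:
            counts[k] += 1
        data.append(rng.normal(means[k], 1.0))            # y_i ~ N(eta*_k, 1)
    return np.array(data), np.array(means)

rng = np.random.default_rng(1)
y, component_means = crp_mixture_sample(200, alpha=1.0, rng=rng)
print(len(component_means), "components generated 200 observations")
```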

Stick Breaking
So far, we've just mentioned properties of a distribution G drawn from a Dirichlet Process. In 1994, Sethuraman developed a constructive way of forming G, known as “stick breaking.”

Stick Breaking
1. Draw η*_1 from G_0
2. Draw v_1 from Beta(1, α)
3. π_1 = v_1
4. Draw η*_2 from G_0
5. Draw v_2 from Beta(1, α)
6. π_2 = v_2 (1 − v_1)
…
In general, π_k = v_k Π_{j<k} (1 − v_j), and G = Σ_k π_k δ_{η*_k}.
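
Here is a minimal sketch of the stick-breaking construction, truncated at K atoms and assuming a standard Gaussian base distribution G_0 (both the truncation and the choice of G_0 are illustrative assumptions):

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    v = rng.beta(1.0, alpha, size=K)                     # v_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining                                   # pi_k = v_k * prod_{j<k} (1 - v_j)
    eta = rng.normal(0.0, 1.0, size=K)                   # eta*_k ~ G0 = N(0, 1)
    return pi, eta                                       # G is approximately sum_k pi_k * delta(eta*_k)

rng = np.random.default_rng(2)
pi, eta = stick_breaking(alpha=2.0, K=1000, rng=rng)
print(pi.sum())                                          # approaches 1 as K grows
```

Running this with small versus large α shows the earlier point about the scaling parameter: small α concentrates most of the stick on a few atoms, while large α spreads mass over many atoms, so G looks more like G_0.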

Formal Definition
Let α be a positive, real-valued scalar, and let G_0 be a non-atomic probability distribution over support set A. We say G ~ DP(α, G_0) if, for all natural numbers k and all k-partitions {A_1, …, A_k} of A,
(G(A_1), …, G(A_k)) ~ Dirichlet(α G_0(A_1), …, α G_0(A_k)).

Inference in a DPM
EM is generally used for inference in a mixture model, but G is nonparametric, making EM difficult. Two main alternatives:
Markov chain Monte Carlo techniques [Neal 2000]
Variational inference [Blei and Jordan 2006]

Gibbs Sampling [Neal 2000]
Algorithm 1: Define H_i to be the single-observation posterior of η given y_i. We marginalize out G from our model and sample each η_n given everything else. This sampler is SLOW TO CONVERGE, since there is no way to change the η shared by a whole group of observations at once.


Gibbs Sampling [Neal 2000]
Algorithm 2: Reparameterize the model with explicit cluster assignments. Instead of a separate η_n for each observation, each y_n gets a cluster indicator c_n, and each cluster c has its own parameter η_c, with infinitely many potential clusters. [Grenager 2005]
[Graphical models: the original α, G_0 → G → η_n → y_n plate model, and the equivalent model with indicators c_n and cluster parameters η_c in an infinite plate.]

Gibbs Sampling [Neal 2000]
Algorithm 2 (cont.): We sample from the distribution over an individual cluster assignment c_n given y_n and all the other cluster assignments.
1. Initialize cluster assignments c_1, …, c_N.
2. For i = 1, …, N, draw c_i from:
P(c_i = c | c_{−i}, y_i, η) ∝ n_{−i,c} F(y_i; η_c), if c = c_j for some j ≠ i,
P(c_i ≠ c_j for all j ≠ i | c_{−i}, y_i, η) ∝ α ∫ F(y_i; η) dG_0(η), otherwise (a new cluster, whose η is then drawn from H_i),
where n_{−i,c} is the number of j ≠ i with c_j = c and F(·; η) is the component likelihood.
3. For all c, draw η_c | {y_i : c_i = c}.
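
To make the algorithm concrete, here is a hedged sketch of this Gibbs sampler for a simple conjugate case: univariate Gaussian components with known unit variance and base distribution G_0 = N(0, tau2). These modeling choices, and names such as tau2 and n_iters, are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from scipy.stats import norm

def gibbs_dpm(y, alpha=1.0, tau2=10.0, n_iters=50, seed=0):
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    c = np.zeros(n, dtype=int)                  # step 1: all points start in one cluster
    eta = {0: float(y.mean())}                  # eta[k] = mean parameter of cluster k
    for _ in range(n_iters):
        # step 2: resample each assignment c_i given y_i and the other assignments
        for i in range(n):
            counts = {k: int(np.sum(c == k)) for k in eta}
            counts[c[i]] -= 1                   # remove point i from its current cluster
            if counts[c[i]] == 0:
                del counts[c[i]], eta[c[i]]
            labels = list(eta.keys())
            # existing cluster c: proportional to n_{-i,c} * F(y_i; eta_c)
            w = [counts[k] * norm.pdf(y[i], eta[k], 1.0) for k in labels]
            # new cluster: proportional to alpha * integral of F(y_i; eta) dG0(eta)
            w.append(alpha * norm.pdf(y[i], 0.0, np.sqrt(1.0 + tau2)))
            w = np.array(w) / np.sum(w)
            choice = rng.choice(len(w), p=w)
            if choice == len(labels):
                new_k = max(eta) + 1 if eta else 0
                post_var = tau2 / (1.0 + tau2)  # single-observation posterior H_i
                eta[new_k] = rng.normal(post_var * y[i], np.sqrt(post_var))
                c[i] = new_k
            else:
                c[i] = labels[choice]
        # step 3: resample each cluster's eta_c given the y_i assigned to it
        for k in list(eta.keys()):
            yk = y[c == k]
            post_var = tau2 / (1.0 + len(yk) * tau2)
            eta[k] = rng.normal(post_var * yk.sum(), np.sqrt(post_var))
    return c, eta

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-4.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
c, eta = gibbs_dpm(y)
print(len(eta), "clusters after sampling")
```

With non-conjugate base distributions the integral for a new cluster has no closed form, which is what motivates the auxiliary-variable variants in Neal (2000).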

Conclusion
We now have a statistically principled mechanism for solving our original problem.
This was intended as a general and fairly shallow overview of Dirichlet Processes.

Acknowledgments
Many thanks to David Blei. Some material for this presentation was inspired by slides from Teg Grenager and Zoubin Ghahramani.

References
Blei, David M. and Michael I. Jordan. “Variational Inference for Dirichlet Process Mixtures.” Bayesian Analysis 1(1), 2006.
Neal, R. M. “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.” Journal of Computational and Graphical Statistics 9, 2000.
Ghahramani, Zoubin. “Non-parametric Bayesian Methods.” UAI Tutorial, July 2005.
Grenager, Teg. “Chinese Restaurants and Stick Breaking: An Introduction to the Dirichlet Process.”
Blackwell, David and James B. MacQueen. “Ferguson Distributions via Polya Urn Schemes.” The Annals of Statistics 1(2), 1973.
Ferguson, Thomas S. “A Bayesian Analysis of Some Nonparametric Problems.” The Annals of Statistics 1(2), 1973.