Hierarchical Dirichlet Processes

Slides:

Advertisements

Similar presentations

Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei

Advertisements

Topic models Source: Topic models, David Blei, MLSS 09.

Teg Grenager NLP Group Lunch February 24, 2005

Gentle Introduction to Infinite Gaussian Mixture Modeling

Xiaolong Wang and Daniel Khashabi

Markov Chain Sampling Methods for Dirichlet Process Mixture Models R.M. Neal Summarized by Joon Shik Kim (Thu) Computational Models of Intelligence.

Course: Neural Networks, Instructor: Professor L.Behera.

MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Hierarchical Dirichlet Process (HDP)

Ouyang Ruofei Topic Model Latent Dirichlet Allocation Ouyang Ruofei May LDA.

Gibbs Sampling Methods for Stick-Breaking priors Hemant Ishwaran and Lancelot F. James 2001 Presented by Yuting Qi ECE Dept., Duke Univ. 03/03/06.

Bayesian dynamic modeling of latent trait distributions Duke University Machine Learning Group Presented by Kai Ni Jan. 25, 2007 Paper by David B. Dunson,

DEPARTMENT OF ENGINEERING SCIENCE Information, Control, and Vision Engineering Bayesian Nonparametrics via Probabilistic Programming Frank Wood

Nonparametric hidden Markov models Jurgen Van Gael and Zoubin Ghahramani.

HW 4. Nonparametric Bayesian Models Parametric Model Fixed number of parameters that is independent of the data we’re fitting.

A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space Presented by Chunping Wang Machine Learning Group, Duke University.

Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process Chong Wang and David M. Blei NIPS 2009 Discussion led by Chunping Wang.

Variational Inference for Dirichlet Process Mixture Daniel Klein and Soravit Beer Changpinyo October 11, 2011 Applied Bayesian Nonparametrics Special Topics.

Latent Dirichlet Allocation a generative model for text

. PGM: Tirgul 10 Parameter Learning and Priors. 2 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often.

Nonparametric Bayesian Learning

Topic models for corpora and for graphs. Motivation Social graphs seem to have –some aspects of randomness small diameter, giant connected components,..

Hierarchical Bayesian Nonparametrics with Applications Michael I. Jordan University of California, Berkeley Acknowledgments: Emily Fox, Erik Sudderth,

Chapter Two Probability Distributions: Discrete Variables

Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.

Motivation Parametric models can capture a bounded amount of information from the data. Real data is complex and therefore parametric assumptions is wrong.

Correlated Topic Models By Blei and Lafferty (NIPS 2005) Presented by Chunping Wang ECE, Duke University August 4 th, 2006.

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

Hierarchical Dirichelet Processes Y. W. Tech, M. I. Jordan, M. J. Beal & D. M. Blei NIPS 2004 Presented by Yuting Qi ECE Dept., Duke Univ. 08/26/05 Sharing.

Bayesian Hierarchical Clustering Paper by K. Heller and Z. Ghahramani ICML 2005 Presented by HAO-WEI, YEH.

Finding Scientific topics August , Topic Modeling 1.A document as a probabilistic mixture of topics. 2.A topic as a probability distribution.

Inferring structure from data Tom Griffiths Department of Psychology Program in Cognitive Science University of California, Berkeley.

Style & Topic Language Model Adaptation Using HMM-LDA Bo-June (Paul) Hsu, James Glass.

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing Narges Sharif-Razavian and Andreas Zollmann.

The Dirichlet Labeling Process for Functional Data Analysis XuanLong Nguyen & Alan E. Gelfand Duke University Machine Learning Group Presented by Lu Ren.

Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream (UAI 2010) Amr Ahmed and Eric.

Randomized Algorithms for Bayesian Hierarchical Clustering

Latent Dirichlet Allocation D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3: , January Jonathan Huang

Variational Inference for the Indian Buffet Process

Hierarchical Dirichlet Process and Infinite Hidden Markov Model Duke University Machine Learning Group Presented by Kai Ni February 17, 2006 Paper by Y.

1 Dirichlet Process Mixtures A gentle tutorial Graphical Models – Khalid El-Arini Carnegie Mellon University November 6 th, 2006 TexPoint fonts used.

Stick-Breaking Constructions

Storylines from Streaming Text The Infinite Topic Cluster Model Amr Ahmed, Jake Eisenstein, Qirong Ho Alex Smola, Choon Hui Teo, Eric Xing Carnegie Mellon.

1 Clustering in Generalized Linear Mixed Model Using Dirichlet Process Mixtures Ya Xue Xuejun Liao April 1, 2005.

The Infinite Hierarchical Factor Regression Model Piyush Rai and Hal Daume III NIPS 2008 Presented by Bo Chen March 26, 2009.

The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.

Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatic, Yee Whye Teh and Zoubin Ghahramani (ICML 2008) Presented by Lihan.

Bayesian Multi-Population Haplotype Inference via a Hierarchical Dirichlet Process Mixture Duke University Machine Learning Group Presented by Kai Ni August.

Bayesian Density Regression Author: David B. Dunson and Natesh Pillai Presenter: Ya Xue April 28, 2006.

Nonparametric Bayesian Models. HW 4 x x Parametric Model Fixed number of parameters that is independent of the data we’re fitting.

Hierarchical Beta Process and the Indian Buffet Process by R. Thibaux and M. I. Jordan Discussion led by Qi An.

The Nested Dirichlet Process Duke University Machine Learning Group Presented by Kai Ni Nov. 10, 2006 Paper by Abel Rodriguez, David B. Dunson, and Alan.

APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION Amir Harati and Joseph PiconeMarc Sobel Institute for Signal and Information Processing,

Completely Random Measures for Bayesian Nonparametrics Michael I. Jordan University of California, Berkeley Acknowledgments: Emily Fox, Erik Sudderth,

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation Yee W. Teh, David Newman and Max Welling Published on NIPS 2006 Discussion.

Fast search for Dirichlet process mixture models

Bayesian Generalized Product Partition Model

Non-Parametric Models

Omiros Papaspiliopoulos and Gareth O. Roberts

A Non-Parametric Bayesian Method for Inferring Hidden Causes

Kernel Stick-Breaking Process

Collapsed Variational Dirichlet Process Mixture Models

Hierarchical Topic Models and the Nested Chinese Restaurant Process

Generalized Spatial Dirichlet Process Models

Chinese Restaurant Representation Stick-Breaking Construction

Topic models for corpora and for graphs

Michal Rosen-Zvi University of California, Irvine

Topic Models in Text Processing

Rational models of categorization

Presentation transcript:

Hierarchical Dirichlet Processes Presenters: Micah Hodosh, Yizhou Sun 4/7/2010

Content Introduction and Motivation Dirichlet Processes Hierarchical Dirichlet Processes Definition Three Analogs Inference Three Sampling Strategies

Introduction Hierarchical approach to model-based clustering of grouped data Find an unknown number of clusters to capture the structure of each group and allow for sharing among the groups Documents with an arbitrary number of topics which are shared globably across the set of corpora. A Dirichlet Process will be used as a prior mixture components The DP will be extended to a HDP to allow for sharing clusters among related clustering problems 3

Motivation Interested in problems with observations organized into groups Let xji be the ith observation of group j = xj = {xj1, xj2...} xji is exchangeable with any other element of xj For all j,k , xj is exchangeable with xk 4

Motivation Assume each observation is drawn independently for a mixture model Factor θji is the mixture component associated with xji Let F(θji ) be the distribution of xji given θji Let Gj be the prior distribution of θj1, θj2... which are conditionally independent given Gj 5

Content Introduction and Motivation Dirichlet Processes Hierarchical Dirichlet Processes Definition Three Analogs Inference Three Sampling Strategies

The Dirichlet Process Let (Θ , β) be a measureable space, Let G0 be a probability measure on that space Let A = (A1,A2..,Ar) be a finite partition of that space Let α0 be a positive real number G ~ DP( α0, G0) is defined s.t. for all A : 7

Stick Breaking Construction The general idea is that the distribution G will be a weighted average of the distributions of a set of infinite random variables 2 infinite sets of i.i.d random variables ϕk ~ G0 – Samples from the initial probability measure πk' ~ Beta (1, α0) – Defines the weights of these samples 8

Stick Breaking Construction πk' ~ Beta (1, α0) Define πk as 1 π1' 1-π1' (1-π1')π2' ... 9

Stick Breaking Construction πk ~ GEM(α0) These πk define the weight of drawing the value corresponding to ϕk. 10

Polya urn scheme/ CRP Let each θ1, θ2,.. be i.i.d. Random variables distributed according to G Consider the distribution of θi, given θ1,...θi-1, integrating out G: 11

Polya urn scheme Consider a simple urn model representation. Each sample is a ball of a certain color Balls are drawn equiprobably, and when a ball of color x is drawn, both that ball and a new ball of color x is returned to the urn With Probability proportional to α0, a new atom is created from G0, A new ball of a new color is added to the urn 12

Polya urn scheme Let ϕ1 ...ϕK be the distinct values taken on by θ1,...θi-1, If mk is the number of values of θ1,...θi-1, equal to ϕk: 13

Chinese restaurant process: θ2 θ4 ϕ1 ϕ2 ϕ3 ... θ1 θ3 14

Dirichlet Process Mixture Model Dirichlet Process as nonparametric prior on the parameters of a mixture model: 15

Dirichlet Process Mixture Model From the stick breaking representation: θi will be the distribution represented by ϕk with probability πk Let zi be the indicator variable representing which ϕk θi is associated with: 16

Infinite Limit of Finite Mixture Model Consider a multinomial on L mixture components with parameters π = (π1, … πL) Let π have a symmetric Dirichlet prior with hyperparameters (α0/L,....α0/L) If xi is drawn from a mixture component, zi, according to the defined distribution: 17

Infinite Limit of Finite Mixture Model If , then as L approaches ∞: The marginal distribution of x1,x2.... approaches that of a Dirichlet Process Mixture Model 18

Content Introduction and Motivation Dirichlet Processes Hierarchical Dirichlet Processes Definition Three Analogs Inference Three Sampling Strategies

HDP Definition General idea To model grouped data Each group j <=> a Dirichlet process mixture model Hierarchical prior to link these mixture models <=> hierarchical Dirichlet process A hierarchical Dirichlet process is A distribution over a set of random probability measures ( )

HDP Definition (Cont.) Formally, a hierarchical Dirichlet process defines A set of random probability measures , one for each group j A global random probability measure is a distributed as a Dirichlet process are conditional independent given , also follow DP is discrete!

Hierarchical Dirichlet Process Mixture Model Hierarchical Dirichlet process as prior distribution over the factors for grouped data For each group j Each observation corresponds to a factor The factors are i.i.d random. variables distributed as

Some Notices HDP can be extended to more than two levels The base measure H can be drawn from a DP, and so on and so forth A tree can be formed Each node is a DP Children nodes are conditionally independent given their parent, which is a base measure The atoms at a given node are shared among all its descendant nodes

Analog I: The stick-breaking construction Stick-breaking representation of i.e., i.e.,

Equivalent representation using conditional distributions

Analog II: the Chinese restaurant franchise General idea: Allow multiple restaurants to share a common menu, which includes a set of dishes A restaurant has infinite tables, each table has only one dish

Notations : the index of associated with The factor (dish) corresponding to The factors (dishes) drawn from H The dish chosen by table t in restaurant j : the index of associated with

Conditional distributions Integrate out Gj (sampling table for customer) Integrate out G0 (sampling dish for table) Count notation: , number of customers in restaurant j, at table t, eating dish k , number of tables in restaurant j, eating dish k

Analog III: The infinite limit of finite mixture models Two different finite models both yield HDPM Global mixing proportions place a prior for group-specific mixing proportions As L goes infinity

Each group choose a subset of T mixture components As L, T go to infinity

Content Introduction and Motivation Dirichlet Processes Hierarchical Dirichlet Processes Definition Three Analogs Inference Three Sampling Strategies

Introduction to three MCMC schemes Assumption: H is conjugate to F A straightforward Gibbs sampler based on Chinese restaurant franchise An augmented representation involving both the Chinese restaurant franchise and the posterior for G0 A variation to scheme 2 with streamline bookkeeping

Conditional density of data under mixture component k For data , conditional density under component k given all data items except is: For data set , conditional density is similarly defined

Scheme I: Posterior sampling in the Chinese restaurant franchise Sampling t and k Sampling t If is a new t, sampling the k corresponding to it by And

Sampling k Where is all the observations for table t in restaurant j

Scheme II: Posterior sampling with an augmented representation Posterior of G0 given : An explicit construction for G0 is given:

Given a sample of G0, posterior for each group is factorized and sampling in each group can be performed separately Sampling t and k: Almost the same as in Scheme I Except using to replace When a new component knew is instantiated, draw , and set and

Sampling for

Scheme III: Posterior sampling by direct assignment Difference from Scheme I and II: In I and II, data items are first assigned to some table t, and the tables are then assigned to some component k In III, directly assign data items to component via variable , which is equivalent to Tables are collapsed to numbers

Sampling z: Sampling m: Sampling

Comparison of Sampling Schemes In terms of ease of implementation The direct assignment is better In terms of convergence speed Direct assignment changes the component membership of data items one at a time Scheme I and II, component membership of one table will change the membership of multiple data items at the same time, leading to better performance

Applications Hierarchical DP extension of LDA In CRF representation: dishes are topics, customers are the observed words

Applications HDP-HMM

References Yee Whye Teh et. al., Hierarchical Dirichlet Processes, 2006