The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling
Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei
Presented by Eric Wang, 9/16/2011
Introduction Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous topic modeling framework. Incorporating the hierarchical Dirichlet process (HDP) into LDA allows for more flexible topic modeling by estimating the global topic proportions rather than fixing them in advance. A drawback of HDP-LDA is that a topic that is rare globally will also have a low expected proportion within each document. The authors propose a model that allows a globally rare topic to still have large mass within individual documents.
Hierarchical Dirichlet Process The hierarchical Dirichlet process (HDP) is a prior for Bayesian nonparametric mixed-membership modeling of grouped data. Hierarchically, it can be defined as
$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_m \sim \mathrm{DP}(\alpha, G_0),$$
where $m$ indexes the data group. In the HDP, the expectation of the mixing weights in $G_m$ is given by the weights of $G_0$, since $\mathbb{E}[G_m \mid G_0] = G_0$. In practice, the mixing weights of $G_0$ are the global average of the mixture memberships across groups.
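A minimal numerical sketch of this coupling, using a truncated stick-breaking approximation (numpy; the variable names, truncation level, and hyperparameter values are my own choices, not from the paper): group-level proportions are Dirichlet-distributed around the global weights, so their average tracks $G_0$, and a globally rare topic is expected to be rare inside every group.

```python
import numpy as np

def stick_breaking(concentration, K):
    """Truncated GEM stick-breaking weights (K sticks)."""
    v = np.random.beta(1.0, concentration, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

np.random.seed(0)
K = 20
beta_global = stick_breaking(5.0, K)        # top-level DP weights (G_0)

# Each group's weights are Dirichlet-distributed around the global ones,
# so E[theta_m] = beta_global: a globally rare topic is also expected
# to be rare inside every group -- the drawback the FTM removes.
alpha = 10.0
theta = np.random.dirichlet(alpha * beta_global, size=50)
print(np.round(beta_global[:5], 3))
print(np.round(theta.mean(axis=0)[:5], 3))  # close to beta_global
```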
Indian Buffet Process The Indian Buffet Process (IBP) defines a distribution over binary matrices with an infinite number of columns and a finite number of non-zero entries. Hierarchically, it is defined as the $K \to \infty$ limit of the beta-Bernoulli model
$$\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \qquad b_{mk} \sim \mathrm{Bernoulli}(\pi_k),$$
where $m$ and $k$ index the rows and columns of the binary matrix $b$. It can be represented via a stick-breaking construction,
$$\pi_k = \prod_{j=1}^{k} \nu_j, \qquad \nu_j \sim \mathrm{Beta}(\alpha, 1).$$
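A short sketch of the stick-breaking construction (numpy; names, seed, and truncation are mine): the column probabilities $\pi_k$ decay geometrically, so only the first few columns of a sampled matrix are populated.

```python
import numpy as np

def ibp_stick_breaking(alpha, M, K):
    """Sample an M-by-K slice of an IBP-distributed binary matrix
    via the stick-breaking construction."""
    nu = np.random.beta(alpha, 1.0, size=K)
    pi = np.cumprod(nu)                    # pi_k = prod_{j<=k} nu_j
    B = np.random.rand(M, K) < pi          # b_mk ~ Bernoulli(pi_k)
    return B.astype(int), pi

np.random.seed(1)
B, pi = ibp_stick_breaking(alpha=3.0, M=10, K=30)
print(np.round(pi[:6], 3))   # geometrically decaying column probabilities
print(B.sum(axis=0))         # later columns are almost always all-zero
```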
IBP Compound Dirichlet Process Combining the HDP and IBP into a single prior yields an infinite "spike-and-slab" prior, the IBP compound Dirichlet process (ICD). A spike distribution (the IBP) determines which variables are drawn from the slab (the DP). The model assumes the following generative process:
$$\pi_k = \prod_{j=1}^{k} \nu_j, \quad \nu_j \sim \mathrm{Beta}(\alpha, 1); \qquad b_{mk} \sim \mathrm{Bernoulli}(\pi_k); \qquad \phi_k \sim \mathrm{Gamma}(\gamma, 1).$$
IBP Compound Dirichlet Process The atom masses of data group $m$ are Dirichlet distributed as follows:
$$\theta^{(m)} \sim \mathrm{Dirichlet}(b_m \cdot \phi),$$
where $b_m \cdot \phi = (b_{m1}\phi_1, b_{m2}\phi_2, \ldots)$ denotes elementwise multiplication. In this construction, the $\theta^{(m)}$ are the topic proportions for document $m$ and $b_m$ is a binary vector indicating which dictionary elements the document uses.
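A sketch of one draw from a truncated ICD prior (numpy; names, truncation, and hyperparameter values are my own): each group's Dirichlet is supported only on the atoms its IBP spikes select.

```python
import numpy as np

def icd_draw(alpha, gamma, M, K):
    """One draw from a truncated ICD prior: the IBP 'spike' selects
    which atoms each group uses; the gamma 'slab' masses phi set the
    Dirichlet weights over the selected atoms only."""
    pi = np.cumprod(np.random.beta(alpha, 1.0, size=K))  # IBP sticks
    phi = np.random.gamma(gamma, 1.0, size=K)            # slab masses
    B = np.random.rand(M, K) < pi                        # spikes b_mk
    theta = np.zeros((M, K))
    for m in range(M):
        active = B[m]
        if active.any():
            # Dirichlet restricted to group m's active atoms
            theta[m, active] = np.random.dirichlet(phi[active])
    return B.astype(int), phi, theta

np.random.seed(2)
B, phi, theta = icd_draw(alpha=3.0, gamma=1.0, M=5, K=20)
print(np.round(theta, 2))  # each row is supported only where b_m = 1
```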
Focused Topic Models The authors use the ICD to develop the focused topic model (FTM). In this framework, a global distribution over topics is drawn and shared across all documents, as in HDP-LDA. Each document selects a subset of topics from the global menu; the subset is determined by the binary vector $b_m$. Since the binary vector (which topics a document uses) is independent of the topic proportions (how much of a document each used topic occupies), topics that are rare globally can still make up a large proportion of individual documents.
Focused Topic Models The generative process for the FTM is as follows:
1. For each topic $k = 1, 2, \ldots$: sample the stick length $\pi_k = \prod_{j=1}^{k} \nu_j$, $\nu_j \sim \mathrm{Beta}(\alpha, 1)$; sample the topic's relative mass $\phi_k \sim \mathrm{Gamma}(\gamma, 1)$; draw the topic's distribution over words, $\beta_k \sim \mathrm{Dirichlet}(\eta)$.
2. For each document $m$: sample the binary vector $b_{mk} \sim \mathrm{Bernoulli}(\pi_k)$; draw the document length $n^{(m)} \sim \mathrm{NegBin}(\sum_k b_{mk}\phi_k, 1/2)$; sample the topic proportions $\theta^{(m)} \sim \mathrm{Dirichlet}(b_m \cdot \phi)$; for each word $i$, sample a topic $z_{mi} \sim \mathrm{Discrete}(\theta^{(m)})$ and a word $w_{mi} \sim \mathrm{Discrete}(\beta_{z_{mi}})$.
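A forward-sampling sketch of this process under a K-topic truncation (numpy; the names, hyperparameter values, and the guard against empty documents are mine, not from the paper):

```python
import numpy as np

def ftm_generate(alpha=3.0, gamma=1.0, eta=0.1, V=500, M=5, K=20):
    """Forward-sample a tiny corpus from a truncated FTM."""
    pi = np.cumprod(np.random.beta(alpha, 1.0, size=K))   # IBP sticks
    phi = np.random.gamma(gamma, 1.0, size=K)             # topic masses
    beta = np.random.dirichlet(eta * np.ones(V), size=K)  # topics
    docs = []
    for m in range(M):
        b = np.random.rand(K) < pi
        b[0] = True                 # guard: ensure at least one topic
        mass = phi[b].sum()
        n_m = 1 + np.random.negative_binomial(mass, 0.5)  # doc length
        theta = np.zeros(K)
        theta[b] = np.random.dirichlet(phi[b])            # proportions
        z = np.random.choice(K, size=n_m, p=theta)        # topics
        w = np.array([np.random.choice(V, p=beta[k]) for k in z])
        docs.append(w)
    return docs

np.random.seed(3)
docs = ftm_generate()
print([len(d) for d in docs])  # negative-binomial document lengths
```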
Posterior Inference To sample the topic indicator for word $i$ in document $m$,
$$p(z_{mi} = k \mid z^{-mi}, w, b_m, \phi) \propto \left(n^{(m),-i}_{k} + b_{mk}\phi_k\right) \, p(w_{mi} \mid z_{mi} = k, z^{-mi}, w^{-mi}),$$
where the integral over $\theta^{(m)}$ has an analytical form, yielding the Pólya-urn term above, and the word likelihood is the standard collapsed LDA term. This is an important point because it suggests a general framework that can be adapted to other applications.
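A sketch of this collapsed Gibbs update for a single word (numpy; the function name, count-array layout, and the toy demo are my own conventions):

```python
import numpy as np

def resample_z(w, k_old, m, n_mk, n_kw, n_k, b, phi, eta):
    """Collapsed Gibbs update for one topic indicator.

    n_mk: document-topic counts, n_kw: topic-word counts,
    n_k: topic totals.  theta is integrated out, giving the
    (n_mk + b*phi) urn term; beta is integrated out, giving the
    smoothed LDA word term.
    """
    V = n_kw.shape[1]
    # remove the word's current assignment from the counts
    n_mk[m, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
    p = (n_mk[m] + b[m] * phi) * (n_kw[:, w] + eta) / (n_k + V * eta)
    p /= p.sum()
    k_new = np.random.choice(len(p), p=p)
    # record the new assignment
    n_mk[m, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
    return k_new

# tiny demo: 3 topics, 5 word types, one previously assigned word
K, V = 3, 5
n_mk = np.zeros((1, K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
b = np.ones((1, K)); phi = np.ones(K)
n_mk[0, 1] += 1; n_kw[1, 2] += 1; n_k[1] += 1
print(resample_z(w=2, k_old=1, m=0, n_mk=n_mk, n_kw=n_kw,
                 n_k=n_k, b=b, phi=phi, eta=0.1))
```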
Posterior Inference The joint probability of the binary indicators $b_{\cdot k}$ and the total number of words assigned to topic $k$ is available in closed form and is log-differentiable with respect to $\pi_k$ and $\phi_k$. A hybrid Monte Carlo algorithm is used to sample from their posteriors.
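The slides do not show the sampler itself; below is a generic hybrid (Hamiltonian) Monte Carlo step for any log-differentiable target, of the kind applied here to $\pi_k$ and $\phi_k$ (numpy; the step size, leapfrog count, and the standard-normal demo target are my choices, and the FTM's actual target and any reparameterization are omitted):

```python
import numpy as np

def hmc_step(x, logp, grad_logp, eps=0.05, L=20):
    """One hybrid MC step targeting exp(logp), via leapfrog dynamics."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    p0 = np.random.randn(x.size)
    xq, p = x.copy(), p0 + 0.5 * eps * grad_logp(x)   # half momentum step
    for l in range(L):
        xq = xq + eps * p                             # full position step
        if l < L - 1:
            p = p + eps * grad_logp(xq)               # full momentum step
    p = p + 0.5 * eps * grad_logp(xq)                 # final half step
    # Metropolis correction on the joint (position, momentum) energy
    log_alpha = (logp(xq) - 0.5 * p @ p) - (logp(x) - 0.5 * p0 @ p0)
    return xq if np.log(np.random.rand()) < log_alpha else x

# demo: sampling from a standard normal
np.random.seed(4)
xs = [np.array([0.0])]
for _ in range(500):
    xs.append(hmc_step(xs[-1], lambda v: -0.5 * v @ v, lambda v: -v))
print(np.mean([v[0] for v in xs]), np.var([v[0] for v in xs]))
```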
Posterior Inference The topic weights are sampled from their conditional Dirichlet posterior,
$$\theta^{(m)} \mid z, b_m, \phi \sim \mathrm{Dirichlet}(b_m \cdot \phi + n^{(m)}_{\cdot}),$$
and the binary topic indicators are sampled from their Bernoulli conditionals: $b_{mk} = 1$ with probability one whenever topic $k$ is used in document $m$ ($n^{(m)}_k > 0$); otherwise $b_{mk}$ is drawn from a Bernoulli whose odds combine the IBP prior $\pi_k$ with the marginal likelihood of the document's topic assignments. Notice here that if a topic is used, it is automatically considered "active", and additional (unused) topics can be activated.
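A sketch of the indicator update, deriving the Bernoulli odds from the model above with $\theta^{(m)}$ integrated out (numpy/scipy; the function name, array layout, and demo values are mine, and the sketch assumes at least one other topic stays active when $k$ is toggled off):

```python
import numpy as np
from scipy.special import gammaln

def resample_b(m, k, b, n_mk, phi, pi):
    """Resample the spike b[m, k]: a used topic stays active; an unused
    one may be activated or deactivated."""
    if n_mk[m, k] > 0:
        b[m, k] = 1          # the topic is used, so it must be active
        return
    n_m = n_mk[m].sum()
    mass_off = (b[m] * phi).sum() - b[m, k] * phi[k]  # active mass w/o k
    mass_on = mass_off + phi[k]
    # p(z_m | b) under each setting: ratio of Dirichlet normalisers
    ll_on = gammaln(mass_on) - gammaln(mass_on + n_m)
    ll_off = gammaln(mass_off) - gammaln(mass_off + n_m)
    log_odds = np.log(pi[k]) - np.log1p(-pi[k]) + ll_on - ll_off
    b[m, k] = int(np.random.rand() < 1.0 / (1.0 + np.exp(-log_odds)))

# demo: document with 4 words on topic 0, deciding the spike for topic 1
np.random.seed(5)
n_mk = np.array([[4, 0, 0]]); b = np.array([[1, 0, 0]])
phi = np.array([2.0, 1.0, 0.5]); pi = np.array([0.9, 0.4, 0.1])
resample_b(0, 1, b, n_mk, phi, pi)
print(b[0])
```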
Empirical Results The authors considered three different text datasets. All models were run for 1000 iterations, with the first 500 iterations discarded as burn-in.
Empirical Results [Table comparing each model's test perplexity and topic correlation.]
Empirical Results In panel (a), the authors compare the number of topics each word appears in; the FTM has more concentrated topics. In panel (b), they show the number of documents each topic appears in. The plot illustrates that HDP has many topics that appear in only a few documents, while a significant portion of the FTM topics appear in many documents.
Discussion The authors have proposed a novel model, the IBP compound Dirichlet process (ICD), that decouples across-data topic prevalence from within-data topic proportions. The focused topic model (FTM), developed from the ICD, addresses a key shortcoming of HDP-LDA: in HDP-LDA, a topic's global prevalence limits the proportion it can occupy within a document, whereas in the FTM, globally rare topics can still be heavily used within individual documents. The FTM shows improved perplexity relative to the HDP.