Hierarchical Dirichlet Process (HDP)


Hierarchical Dirichlet Process (HDP) Amir Harati

Latent Dirichlet Allocation (LDA) Relation to the topic: LDA is the parametric counterpart of HDP. First motivation: extract efficient features from text data (dimensionality reduction). LDA is a generative probabilistic model of a corpus: "the basic idea is that documents are represented as random mixtures over latent topics; each topic is characterized by a distribution over words."

LDA Graphical model: N is the number of words per document and M is the number of documents. θ is the per-document vector of topic proportions and is drawn from a Dirichlet distribution. As the graphical model shows, in LDA the topic node is sampled once per word, so each document can contain multiple topics, and a document can be summarized by its vector of topic weights. Limitation: the number of topics must somehow be inferred (or fixed) before the algorithm is applied, and this is a difficult problem in itself.
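As a reminder, a minimal sketch of the LDA generative process in the usual notation (α is the Dirichlet prior on the topic proportions, β_k the word distribution of topic k, z_{mn} the topic assignment of word n in document m):

  \theta_m \sim \mathrm{Dir}(\alpha)
  z_{mn} \mid \theta_m \sim \mathrm{Mult}(\theta_m)
  w_{mn} \mid z_{mn}, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_{mn}})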

HDP Suppose the data is divided into J groups (e.g. documents), and within each group we want to find clusters (e.g. topics) that capture the latent structure in the data assigned to that group. The number of clusters within each group is unknown, and moreover we want clusters to be shared (tied) across groups. In the case of a text corpus, each group is a document and each cluster is a topic: we want to infer topics for each document, but topics are shared among the documents of the corpus (or, more generally, of a group of corpora). Comparing this problem with LDA, we see that to leave the number of topics unspecified we should replace the Dirichlet distribution with a Dirichlet process (DP); in other words, we associate a DP with each group. But this causes another problem: with this setting, clusters cannot be shared among groups, because different DPs would have different atoms. The solution is to link these DPs together. A first attempt is to use a common base measure G0. However, this does not solve the problem, because for a smooth (continuous) G0 each DP will have a different set of atoms with probability one. To fix this, G0 should be discrete with broad support; in other words, G0 is itself a draw from another DP. G0 is then an infinite discrete measure whose support includes the support of all the Gj. This construction forces each random measure Gj to place its atoms at the discrete locations determined by G0. Therefore, the HDP ties the Gj by letting them share a base measure and letting that base measure be random.
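A minimal sketch of the resulting two-level hierarchy, in the usual notation (H is the base distribution, γ and α_0 the concentration parameters, x_{ji} the i-th observation in group j):

  G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)
  G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0) \quad \text{for each group } j = 1, \ldots, J
  \theta_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})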

HDP Stick-breaking representation
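A sketch of the stick-breaking construction in the standard notation (φ_k are the shared atoms drawn from H, β the global weights, π_j the group-level weights):

  \beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad \beta_k = \beta'_k \prod_{l=1}^{k-1} (1 - \beta'_l) \qquad (\text{i.e. } \beta \sim \mathrm{GEM}(\gamma))
  \phi_k \sim H, \qquad G_0 = \sum_{k=1}^{\infty} \beta_k \, \delta_{\phi_k}
  \pi_j \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta), \qquad G_j = \sum_{k=1}^{\infty} \pi_{jk} \, \delta_{\phi_k}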

HDP Graphical model of an example HDP mixture model with three groups. Corresponding to each DP node, we also plot a sample draw from that DP obtained using the stick-breaking construction.

HDP Chinese restaurant franchise (CRF): In the CRF, the Chinese restaurant process (CRP) metaphor is extended to a franchise of J restaurants. The coupling among restaurants is achieved via a common menu shared by the franchise. In the CRF, the number of clusters scales doubly logarithmically in the size of each group and logarithmically in the number of groups; in other words, the number of clusters grows slowly with the amount of data.
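To make the metaphor concrete, here is a minimal simulation sketch of the CRF sampling scheme in Python (the helper sample_crf and its parameters are illustrative, not from the slides): customers choose tables within their own restaurant, and each new table orders a dish from the shared menu, which is what ties clusters across groups.

    import random

    def sample_crf(group_sizes, alpha0=1.0, gamma=1.0, seed=0):
        """Simulate the Chinese restaurant franchise (illustrative sketch).

        group_sizes : number of customers (data points) per restaurant (group)
        alpha0      : concentration of each group-level DP (table choice)
        gamma       : concentration of the global DP (dish choice)
        Returns, for each group, the dish (global cluster) index of every customer.
        """
        rng = random.Random(seed)
        m = []                      # m[k] = number of tables serving dish k, over all groups
        dishes_per_group = []
        for n_j in group_sizes:
            tables = []             # tables[t] = customer count at table t in this restaurant
            table_dish = []         # table_dish[t] = dish served at table t
            dishes = []
            for _ in range(n_j):
                # sit at an existing table with prob. proportional to its occupancy,
                # or at a new table with prob. proportional to alpha0
                t = rng.choices(range(len(tables) + 1), weights=tables + [alpha0])[0]
                if t == len(tables):
                    # new table: order a dish (existing dish ~ m[k], new dish ~ gamma)
                    k = rng.choices(range(len(m) + 1), weights=m + [gamma])[0]
                    if k == len(m):
                        m.append(0)
                    m[k] += 1
                    tables.append(0)
                    table_dish.append(k)
                tables[t] += 1
                dishes.append(table_dish[t])
            dishes_per_group.append(dishes)
        return dishes_per_group

    # Example: three groups of 200 customers; dishes (clusters) are shared across groups,
    # and their number grows slowly with the amount of data.
    groups = sample_crf([200, 200, 200])
    print([len(set(d)) for d in groups])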

HDP An instantiation of the CRF representation for the three-group HDP. Each of the three restaurants has customers sitting around tables, and each table is served a dish; the tables correspond to the customers of the Chinese restaurant for the global DP.

HDP Applications: information retrieval; multi-population haplotype phasing; topic modeling (mixture models), i.e. the extension of LDA to the nonparametric case.

Infinite HMM (iHMM) or HDP-HMM An HMM with a countably infinite state space, also known as the HDP-HMM. In a classical HMM, the transition probability π(θ_t, θ_{t+1}) plays the role of a mixing proportion and the emission distribution F(θ_t) plays the role of the mixture component (suppose for now that the emission distribution is not itself a mixture, so that the whole HMM represents a single mixture). The idea is therefore to look at the HMM as a mixture model in which each state represents one component; the HMM is a mixture in which the probability of selecting a cluster is not independent of the current cluster. If we replace the finite mixture model with a DP, we obtain a set of DPs (one for each current state). If these DPs are not tied, the set of states accessible from each state would be disjoint from those accessible from the other states (a branching process instead of a chain process). The solution is to use an HDP instead of a DP.
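A sketch of the HDP-HMM in the common notation (β the global state weights, π_k the transition distribution out of state k, z_t the hidden state at time t, θ_k the emission parameters of state k):

  \beta \sim \mathrm{GEM}(\gamma), \qquad \pi_k \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta) \quad \text{for } k = 1, 2, \ldots
  \theta_k \sim H, \qquad z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad y_t \mid z_t \sim F(\theta_{z_t})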

HDP-HMM Instead of a transition matrix we have a set of transition kernels. The diagram does not show the structure of the HMM: it is not a strict left-to-right model, and each state can move to many other states depending on G_θ.

Sticky HDP-HMM Limitations of the HDP-HMM: It cannot model state persistence; it has a tendency to create redundant states and to switch rapidly among them (in other words, it tends toward overly complicated models). Because of this, in high-dimensional problems the data are fragmented among many redundant states and prediction performance degrades. It is also limited to a unimodal (Gaussian) emission distribution: if both rapid switching and multiple Gaussians per state were allowed in the same algorithm, the uncertainty in the posterior distributions would become very high, so when the underlying process is multimodal it tends to create many redundant states. Solution: the sticky HDP-HMM.

Sticky HDP-HMM The basic idea is simple: augment the HDP with a parameter for a self-transition bias and place a separate prior on this parameter.
Figure: (a) observed sequence, (b) true state sequence, (c) HDP-HMM, (d) sticky HDP-HMM.
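A sketch of the modified transition prior, following the sticky HDP-HMM construction (κ ≥ 0 is the self-transition bias; κ = 0 recovers the ordinary HDP-HMM):

  \pi_k \mid \alpha, \kappa, \beta \sim \mathrm{DP}\!\left(\alpha + \kappa, \; \frac{\alpha \beta + \kappa \, \delta_k}{\alpha + \kappa}\right)

where δ_k denotes a point mass at state k, so extra prior mass is placed on the self-transition of each state.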

Sticky HDP-HMM Another extension is to associate a DP with each state of the HMM; the result can model multimodal emission distributions. Notes: it seems that in its current form each state has a separate mixture model. Another improvement could be obtained by using an HDP to model the emission distributions, so that the mixture components are tied across the different states.
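A sketch of the per-state mixture emission under this extension, in the usual notation (ψ_k are the mixture weights of state k, s_t the mixture-component index at time t):

  \psi_k \sim \mathrm{GEM}(\sigma), \qquad s_t \mid z_t \sim \psi_{z_t}, \qquad \theta_{k,j} \sim H, \qquad y_t \mid z_t, s_t \sim F(\theta_{z_t, s_t})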

Sticky HDP-HMM applications Speaker diarization: segmenting an audio file into speaker-homogeneous regions.

Speaker diarization Notes: it seems that the result (with the current inference algorithm) is sensitive to the starting point.

Other applications Word segmentation: model an utterance with an HDP-HMM. Each latent state corresponds to a word (we have an unbounded number of words) and the observations are phonemes (in either text or speech). Note: it seems this is essentially a method of encoding an n-gram grammar. Trees and grammars: going beyond the chain structure.