Latent Dirichlet Allocation


Latent Dirichlet Allocation Presenter: Hsuan-Sheng Chiu

Reference: D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

Outline: Introduction; Notation and terminology; Latent Dirichlet allocation; Relationship with other latent variable models; Inference and parameter estimation; Discussion.

Introduction We consider the problem of modeling text corpora and other collections of discrete data, with the goal of finding short descriptions of the members of a collection. Significant progress in IR: the tf-idf scheme (Salton and McGill, 1983), Latent Semantic Indexing (LSI, Deerwester et al., 1990), and probabilistic LSI (pLSI, the aspect model, Hofmann, 1999).

Introduction (cont.) Problems with pLSI: Exchangeability: bag of words. Incomplete: it provides no probabilistic model at the level of documents. The number of parameters in the model grows linearly with the size of the corpus, and it is not clear how to assign probability to a document outside of the training data.

Notation and terminology A word is the basic unit of discrete data, from a vocabulary indexed by {1, …, V}. The vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. A document is a sequence of N words, denoted by w = (w1, w2, …, wN). A corpus is a collection of M documents, denoted by D = {w1, w2, …, wM}.

Latent Dirichlet allocation Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. Generative process for each document w in a corpus D: 1. Choose N ~ Poisson(ξ). 2. Choose θ ~ Dir(α). 3. For each of the N words wn, choose a topic zn ~ Multinomial(θ) and then choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn. Here β is a k×V matrix whose element βij = p(w^j = 1 | z^i = 1).
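
As a concrete illustration, here is a minimal sketch of this generative process in Python (with numpy); the values of k, V, α, β, and ξ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative sizes and parameters (not from the paper).
k, V = 3, 10                              # number of topics, vocabulary size
alpha = np.full(k, 0.5)                   # Dirichlet parameter over topic proportions
beta = rng.dirichlet(np.ones(V), size=k)  # k x V matrix, beta[i, j] = p(w^j = 1 | z^i = 1)
xi = 8                                    # Poisson mean for document length

def generate_document():
    """Sample one document (a list of word indices) from the LDA generative process."""
    N = rng.poisson(xi)                   # 1. choose the document length
    theta = rng.dirichlet(alpha)          # 2. choose topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):                    # 3. for each word position
        z = rng.choice(k, p=theta)        #    choose a topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      #    choose a word w_n ~ Multinomial(beta_{z_n})
        words.append(w)
    return words

print(generate_document())
```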

Latent Dirichlet allocation (cont.) Representation of document generation: N ~ Poisson(ξ) gives the document length; θ ~ Dir(α) defines a distribution over the k topics; each topic zn is drawn from θ, and each word wn is drawn from β(zn). [Figure: graphical sketch of the chain θ → zn → wn for n = 1, …, N.]

Latent Dirichlet allocation (cont.) Several simplifying assumptions: 1. The dimensionality k of the Dirichlet distribution (and thus of the topic variable z) is known and fixed. 2. The word probabilities β are a fixed quantity to be estimated. 3. The document length N is independent of all the other data-generating variables θ and z. A k-dimensional Dirichlet random variable θ takes values in the (k-1)-simplex. http://www.answers.com/topic/dirichlet-distribution
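
For reference, the Dirichlet density over the (k-1)-simplex has the standard form:

```latex
p(\theta \mid \alpha) =
  \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}
  \,\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1},
\qquad \theta_i \ge 0,\quad \sum_{i=1}^{k}\theta_i = 1 .
```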

Latent Dirichlet allocation (cont.) Simplex: [Figure: the n-simplexes for n = 2 to 7, from MathWorld, http://mathworld.wolfram.com/Simplex.html]

Latent Dirichlet allocation (cont.) Given the parameters α and β, we can write the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w; the marginal distribution of a document; and the probability of a corpus, as shown below.
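
In the notation above, these three distributions are:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```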

Latent Dirichlet allocation (cont.) There are three levels to the LDA representation: α and β are corpus-level parameters; θd are document-level variables; zdn and wdn are word-level variables. Models of this kind are often referred to as hierarchical models, conditionally independent hierarchical models, or parametric empirical Bayes models.

Latent Dirichlet allocation (cont.) LDA and exchangeability: A finite set of random variables {z1, …, zN} is said to be exchangeable if the joint distribution is invariant to permutation (π is a permutation). An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable. De Finetti's representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter. http://en.wikipedia.org/wiki/De_Finetti's_theorem
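
In symbols, exchangeability and the de Finetti representation applied to the topics and words of a document read:

```latex
p(z_1,\dots,z_N) = p(z_{\pi(1)},\dots,z_{\pi(N)})
  \quad \text{for every permutation } \pi

p(\mathbf{w}, \mathbf{z})
  = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta
```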

Latent Dirichlet allocation (cont.) In LDA, we assume that words are generated by topics (by fixed conditional distributions) and that those topics are infinitely exchangeable within a document.

Latent Dirichlet allocation (cont.) A continuous mixture of unigrams: by marginalizing over the hidden topic variable z, we can understand LDA as a two-level model. Generative process for a document w: 1. Choose θ ~ Dir(α). 2. For each of the N words wn, choose a word wn from p(wn | θ, β). The marginal distribution of a document follows below.
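
The per-word distribution p(wn | θ, β) and the resulting marginal distribution of a document are:

```latex
p(w_n \mid \theta, \beta) = \sum_{z_n} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta)

p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(w_n \mid \theta, \beta) \right) d\theta
```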

Latent Dirichlet allocation (cont.) The distribution on the (V-1)-simplex is attained with only k+kV parameters.

Relationship with other latent variable models Unigram model: the words of every document are drawn independently from a single multinomial distribution. Mixture of unigrams: each document is generated by first choosing a topic z (a discrete distribution with k-1 free parameters) and then generating its N words independently from the conditional multinomial p(w | z). The corresponding document probabilities are shown below.
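
```latex
\text{Unigram:}\qquad p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

\text{Mixture of unigrams:}\qquad
p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
```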

Relationship with other latent variable models (cont.) Probabilistic latent semantic indexing (pLSI) attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from a single topic. In a sense, it does capture the possibility that a document may contain multiple topics, at the cost of kV + kM parameters and hence linear growth in M.
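
In pLSI, a document index d and a word wn are modeled as conditionally independent given an unobserved topic z:

```latex
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
```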

Relationship with other latent variable models (cont.) Problems with pLSI: There is no natural way to use it to assign probability to a previously unseen document. The linear growth in parameters suggests that the model is prone to overfitting, and empirically, overfitting is indeed a serious problem. LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable. The k + kV parameters in a k-topic LDA model do not grow with the size of the training corpus.

Relationship with other latent variable models (cont.) A geometric interpretation: three topics and three words. [Figure: the topic simplex for three topics embedded in the word simplex spanned by three words.]

Relationship with other latent variable models (cont.) The unigram model finds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution. The mixture of unigrams model posits that for each document, one of k points on the word simplex is chosen randomly and all the words of the document are drawn from the corresponding distribution. The pLSI model posits that each word of a training document comes from a randomly chosen topic; the topics are themselves drawn from a document-specific distribution over topics. LDA posits that each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter.

Inference and parameter estimation The key inferential problem is computing the posterior distribution of the hidden variables given a document. Unfortunately, this distribution is intractable to compute in general: the normalizing constant p(w | α, β) cannot be computed exactly because of the coupling between θ and β in the summation over latent topics.
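
Concretely, the posterior and its intractable normalizing constant are:

```latex
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

p(\mathbf{w} \mid \alpha, \beta)
  = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)}
    \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right)
    \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
```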

Inference and parameter estimation (cont.) The basic idea of convexity-based variational inference is to use Jensen's inequality to obtain an adjustable lower bound on the log likelihood. Essentially, one considers a family of lower bounds indexed by a set of variational parameters. A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.

Inference and parameter estimation (cont.) The simplified graphical model is obtained by dropping the edges between θ, z, and w, as well as the w nodes, and endowing the resulting model with free variational parameters γ and φ.

Inference and parameter estimation (cont.) The variational distribution factorizes over θ and the topic assignments; the log likelihood then decomposes into a lower bound plus the KL divergence between the variational posterior and the true posterior, as written below.
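
```latex
q(\theta, \mathbf{z} \mid \gamma, \phi)
  = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
    + \mathrm{KL}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi)\,\big\|\,
        p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)

L(\gamma, \phi; \alpha, \beta)
  = \mathrm{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)]
    - \mathrm{E}_q[\log q(\theta, \mathbf{z})]
```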

Inference and parameter estimation (cont.) Finding a tight lower bound on the log likelihood: maximizing the lower bound with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability.

Inference and parameter estimation (cont.) Expand the lower bound:
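
Using the factorizations of p and q, the bound expands into five expectation terms:

```latex
L(\gamma, \phi; \alpha, \beta)
  = \mathrm{E}_q[\log p(\theta \mid \alpha)]
  + \mathrm{E}_q[\log p(\mathbf{z} \mid \theta)]
  + \mathrm{E}_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)]
  - \mathrm{E}_q[\log q(\theta)]
  - \mathrm{E}_q[\log q(\mathbf{z})]
```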

Inference and parameter estimation (cont.) Each expectation can then be evaluated in closed form; in particular, E_q[log θi | γ] = Ψ(γi) − Ψ(Σj γj), where Ψ is the digamma function.

Inference and parameter estimation (cont.) We obtain the variational parameters by adding Lagrange multipliers for the normalization constraints and setting the derivatives to zero:
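
This yields the coupled update equations, which are iterated until convergence:

```latex
\phi_{ni} \propto \beta_{i w_n}
  \exp\!\left( \Psi(\gamma_i) - \Psi\!\left(\textstyle\sum_{j=1}^{k}\gamma_j\right) \right)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
```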

Inference and parameter estimation (cont.) Maximizing the log likelihood of the data: variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β. Variational EM procedure: 1. (E-step) For each document, find the optimizing values of the variational parameters {γ, φ}. 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β.
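
As a rough sketch of how these updates fit together, the following Python code runs the E-step updates for φ and γ and a closed-form M-step update for β on a toy corpus; α is held fixed here (the paper also updates α via a Newton-Raphson step), and all sizes and hyperparameters are assumed, illustrative values.

```python
import numpy as np
from scipy.special import digamma

def variational_em(docs, k, V, alpha=0.1, n_em_iters=20, n_e_iters=50):
    """Toy variational EM for LDA. docs: list of lists of word indices."""
    rng = np.random.default_rng(0)
    beta = rng.dirichlet(np.ones(V), size=k)              # k x V topic-word probabilities
    for _ in range(n_em_iters):
        beta_new = np.zeros((k, V))
        for words in docs:
            N = len(words)
            phi = np.full((N, k), 1.0 / k)                # q(z_n): N x k responsibilities
            gamma = np.full(k, alpha + N / k)             # q(theta): Dirichlet parameter
            for _ in range(n_e_iters):                    # E-step: iterate the coupled updates
                e_log_theta = digamma(gamma) - digamma(gamma.sum())
                phi = beta[:, words].T * np.exp(e_log_theta)  # phi_ni ∝ beta_{i,w_n} exp(E[log theta_i])
                phi /= phi.sum(axis=1, keepdims=True)
                gamma = alpha + phi.sum(axis=0)           # gamma_i = alpha_i + sum_n phi_ni
            for n, w in enumerate(words):                 # accumulate sufficient statistics for beta
                beta_new[:, w] += phi[n]
        beta = beta_new / beta_new.sum(axis=1, keepdims=True)  # M-step: renormalize each topic
    return beta

# Tiny illustrative corpus of word-index lists (V = 6, k = 2).
docs = [[0, 1, 0, 2], [3, 4, 5, 4], [0, 2, 1], [4, 5, 3, 5]]
print(np.round(variational_em(docs, k=2, V=6), 2))
```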

Inference and parameter estimation (cont.) Smoothed LDA model: to avoid assigning zero probability to words that do not appear in the training corpus, each row of β is itself given an exchangeable Dirichlet prior (with parameter η) and treated as a random variable, and the variational distribution is extended accordingly.

Discussion LDA is a flexible generative probabilistic model for collections of discrete data. Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used for inference and parameter estimation within the LDA framework. LDA is a simple model, and it is readily extended to continuous data or other non-multinomial data.