Latent Dirichlet Allocation: a generative model for text

Latent Dirichlet Allocation: a generative model for text. David M. Blei, Andrew Y. Ng, Michael I. Jordan (2002). Presenter: Ido Abramovich

Overview Motivation; other models; notation and terminology; the latent Dirichlet allocation model; LDA in relation to other models; a geometric interpretation; the problems of estimation; an example.

Motivation What do we want to do with text corpora? Classification, novelty detection, summarization, and similarity/relevance judgments. Given a text corpus or other collection of discrete data, we wish to find a short description of the data while preserving the essential statistical relationships.

Term Frequency – Inverse Document Frequency tf-idf (Salton and McGill, 1983). Each term-frequency count is weighted by an inverse document frequency count. The result is a term-by-document (t × d) matrix – thus reducing each document in the corpus to a fixed-length list of numbers. Provides a basic identification of sets of words that are discriminative for documents in the collection. Widely used in search engines.
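
As a rough illustration of the weighting scheme (a minimal sketch with a made-up toy corpus, not the exact tf-idf variant used in the paper):

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a term-by-document tf-idf matrix from tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            tf[index[w], j] += 1
    df = (tf > 0).sum(axis=1)        # number of documents containing each term
    idf = np.log(len(docs) / df)     # inverse document frequency
    return tf * idf[:, None], vocab

docs = [["topic", "models", "text"], ["text", "retrieval"], ["topic", "topic", "models"]]
X, vocab = tfidf_matrix(docs)
print(vocab)
print(X)
```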

LSI (Deerwester et al., 1990) Latent Semantic Indexing. A classic attempt at solving this problem in information retrieval. Applies a singular value decomposition (SVD) to the term-by-document matrix to obtain reduced document representations. Can capture aspects of synonymy and polysemy. Drawbacks: computing the SVD is slow, and LSI is a non-probabilistic model.
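
A minimal sketch of the LSI reduction via truncated SVD; the matrix X below is a random stand-in for a real tf-idf term-by-document matrix, and the rank k is arbitrary:

```python
import numpy as np

def lsi(X, k):
    """Project documents (columns of X) onto the top-k latent semantic dimensions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]   # k x num_docs document representations

X = np.random.rand(50, 10)   # stand-in for a 50-term x 10-document tf-idf matrix
print(lsi(X, k=3).shape)     # (3, 10)
```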

pLSI (Hofmann, 1999) A generative model that models each word in a document as a sample from a mixture model. Each word is generated from a single topic, and different words in the same document may be generated from different topics. Each document is represented as a list of mixing proportions for the mixture components (topics).
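
A small sketch of how this mixture assembles word probabilities from per-topic word distributions and per-document mixing proportions (the dimensions and random arrays are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M = 1000, 5, 20                              # vocabulary size, topics, documents

p_w_given_z = rng.dirichlet(np.ones(V), size=K)    # K x V, one word distribution per topic
p_z_given_d = rng.dirichlet(np.ones(K), size=M)    # M x K, mixing proportions per document

# p(w | d) = sum_z p(w | z) p(z | d): each document's word distribution is a
# convex combination of the topic word distributions.
p_w_given_d = p_z_given_d @ p_w_given_z            # M x V
assert np.allclose(p_w_given_d.sum(axis=1), 1.0)
```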

Exchangeability A finite set of random variables {z_1, …, z_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N: p(z_1, …, z_N) = p(z_π(1), …, z_π(N)). An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.

Bag-of-words assumption Word order is ignored (“bag-of-words”) – this is exchangeability, not i.i.d. Theorem (De Finetti, 1935) – if (w_1, w_2, …) are infinitely exchangeable, then the joint probability has a representation as a mixture: p(w_1, …, w_N) = ∫ p(θ) (∏_{n=1}^N p(w_n | θ)) dθ, for some random variable θ.

Notation and terminology A word is an item from a vocabulary indexed by {1, …, V}. We represent words using unit-basis vectors: the vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. A document is a sequence of N words denoted by w = (w_1, w_2, …, w_N), where w_n is the nth word in the sequence. A corpus is a collection of M documents denoted by D = {w_1, w_2, …, w_M}.

Latent Dirichlet allocation LDA is a generative probabilistic model of a corpus. The basic idea is that the documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.

LDA – generative process
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
   (a) Choose a topic z_n ∼ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
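
A minimal numpy sketch of this generative process (ξ, α, and β are toy values, not parameters fitted to any corpus):

```python
import numpy as np

rng = np.random.default_rng(1)
V, K = 50, 3                                   # vocabulary size, number of topics
xi = 20                                        # Poisson mean for the document length
alpha = np.full(K, 0.5)                        # Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=K)       # K x V topic-word distributions

def generate_document():
    N = rng.poisson(xi)                        # choose N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)               # choose theta ~ Dir(alpha)
    words, topics = [], []
    for _ in range(N):
        z = rng.choice(K, p=theta)             # choose topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])           # choose word w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return words, topics

print(generate_document())
```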

Dirichlet distribution A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex (θ_i ≥ 0, Σ_{i=1}^k θ_i = 1), and has the following probability density on this simplex: p(θ | α) = (Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i)) θ_1^{α_1−1} ⋯ θ_k^{α_k−1}.
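
A quick numerical illustration of the density formula and of the simplex constraint (a sketch; the α values are arbitrary):

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(theta, alpha):
    """log p(theta | alpha) on the (k-1)-simplex, following the density above."""
    log_norm = lgamma(alpha.sum()) - sum(lgamma(a) for a in alpha)
    return log_norm + np.sum((alpha - 1) * np.log(theta))

alpha = np.array([2.0, 3.0, 4.0])
theta = np.random.default_rng(2).dirichlet(alpha)
print(theta, theta.sum())                 # a point on the 2-simplex, components sum to 1
print(dirichlet_logpdf(theta, alpha))
```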

The graphical model [Figure: plate notation for LDA – α → θ → z_n → w_n ← β, with w_n shaded (observed); the inner plate over the N words is nested inside the outer plate over the M documents.]

The LDA equations Given α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^N p(z_n | θ) p(w_n | z_n, β). Integrating over θ and summing over z gives the marginal distribution of a document (Eq. 3): p(w | α, β) = ∫ p(θ | α) (∏_{n=1}^N Σ_{z_n} p(z_n | θ) p(w_n | z_n, β)) dθ.

LDA and exchangeability We assume that words are generated by topics and that those topics are infinitely exchangeable within a document. By de Finetti’s theorem: p(w, z) = ∫ p(θ) (∏_{n=1}^N p(z_n | θ) p(w_n | z_n)) dθ. Marginalizing out the topic variables z_n recovers Eq. (3) on the previous slide.

Unigram model Every word of every document is drawn independently from a single multinomial over the vocabulary: p(w) = ∏_{n=1}^N p(w_n).

Mixture of unigrams Each document is generated by first choosing a single topic z and then drawing all N words from the corresponding multinomial: p(w) = Σ_z p(z) ∏_{n=1}^N p(w_n | z).

Probabilistic LSI Each word of a document d is drawn from a topic chosen according to document-specific mixing proportions: p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d). Unlike LDA, the mixing proportions p(z | d) are parameters tied to the training documents, so pLSI is not a well-defined generative model for unseen documents.

A geometric interpretation [Figure: the word simplex with the topic simplex embedded inside it, spanned by topic 1, topic 2, and topic 3. Each topic is a point in the word simplex; a document’s word distribution lies in the sub-simplex spanned by the topics.]

Inference We want to compute the posterior distribution of the hidden variables given a document: p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β). Unfortunately, the normalizer p(w | α, β) is intractable to compute in general. Writing Eq. (3) in terms of the model parameters: p(w | α, β) = (Γ(Σ_i α_i) / ∏_i Γ(α_i)) ∫ (∏_{i=1}^k θ_i^{α_i−1}) (∏_{n=1}^N Σ_{i=1}^k ∏_{j=1}^V (θ_i β_ij)^{w_n^j}) dθ, where the coupling between θ and β makes exact inference intractable.

Variational inference The true posterior is approximated by a fully factorized variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^N q(z_n | φ_n), choosing the variational parameters (γ, φ) to minimize the KL divergence to the true posterior. This yields coordinate-ascent updates φ_ni ∝ β_i,w_n exp(E_q[log θ_i | γ]) and γ_i = α_i + Σ_{n=1}^N φ_ni, iterated to convergence for each document.

Parameter estimation Variational EM: (E step) For each document, find the optimizing values of the variational parameters (γ, φ) with α, β fixed. (M step) Maximize the resulting variational lower bound on the log likelihood w.r.t. α and β, using the γ and φ values found in the E step.
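
A per-document E-step might look like the following minimal sketch (toy α, β, and document; the exp(Ψ(Σ_j γ_j)) factor in E_q[log θ_i | γ] is dropped because it cancels when φ is normalized; a full implementation would add the M-step and convergence checks):

```python
import numpy as np
from scipy.special import digamma

def e_step(doc, alpha, beta, n_iter=50):
    """Variational E-step for one document (doc: list of word ids)."""
    K, N = len(alpha), len(doc)
    phi = np.full((N, K), 1.0 / K)                 # q(z_n) initialised uniformly
    gamma = alpha + N / K                          # initial gamma, as in the paper
    for _ in range(n_iter):
        # phi_{n,i} proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)            # gamma_i = alpha_i + sum_n phi_{n,i}
    return gamma, phi

rng = np.random.default_rng(3)
V, K = 50, 3
alpha = np.full(K, 0.1)
beta = rng.dirichlet(np.ones(V), size=K)           # K x V topic-word probabilities
doc = rng.integers(0, V, size=30).tolist()         # toy document of word ids
gamma, phi = e_step(doc, alpha, beta)
print(gamma)                                       # variational Dirichlet parameters
```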

Smoothed LDA Places an exchangeable Dirichlet prior on each row of β (Dirichlet smoothing) to avoid the “zero frequency problem” for words that do not appear in the training corpus. A fuller Bayesian approach; inference and parameter estimation are similar to unsmoothed LDA, with an extra variational Dirichlet over β.
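
To see why the smoothing matters, here is a simplified numeric illustration (made-up counts and η): the Dirichlet parameter η is added to the topic-word counts, so no word ever gets exactly zero probability under a topic:

```python
import numpy as np

eta = 0.01                                   # Dirichlet smoothing parameter on beta
counts = np.array([[10, 0, 3, 0],            # toy word counts assigned to topic 1
                   [ 0, 7, 0, 1]])           # toy word counts assigned to topic 2

beta_mle = counts / counts.sum(axis=1, keepdims=True)                 # zeros for unseen words
beta_smoothed = (counts + eta) / (counts + eta).sum(axis=1, keepdims=True)

print(beta_mle[0])        # contains exact zeros -> "zero frequency problem"
print(beta_smoothed[0])   # every word gets some probability mass
```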

Document modeling Unlabeled data – our goal is density estimation. We compute the perplexity of a held-out test set to evaluate the models – a lower perplexity score indicates better generalization.
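
Perplexity here is the exponentiated negative average per-word log-likelihood. A minimal sketch (the held-out log-likelihoods and document lengths below are made-up placeholders for whatever model is being evaluated):

```python
import numpy as np

def perplexity(log_lik_per_doc, doc_lengths):
    """exp( - sum_d log p(w_d) / sum_d N_d ); lower is better."""
    return np.exp(-np.sum(log_lik_per_doc) / np.sum(doc_lengths))

# Hypothetical held-out log-likelihoods and document lengths.
log_liks = np.array([-1520.3, -987.1, -2204.8])
lengths = np.array([250, 160, 360])
print(perplexity(log_liks, lengths))
```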

Document Modeling – cont. Data used:
- C. elegans community abstracts: 5,225 abstracts, 28,414 unique terms.
- TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms.
- Held-out data: 10%.
- Removed terms: 50 stop words; words appearing only once (AP).

[Figure: results on the nematode (C. elegans) corpus]

[Figure: results on the AP corpus]

Document Modeling – cont. Results: Both pLSI and the mixture of unigrams suffer from overfitting. Mixture – peaked posteriors in the training set; overfitting can be addressed with variational Bayesian smoothing.

Perplexity by number of topics (k):
k   | pLSI        | Mult. Mixt.
2   | 7,052       | 22,266
5   | 17,588      | 2.20 × 10^8
10  | 63,800      | 1.93 × 10^17
20  | 2.52 × 10^5 | 1.20 × 10^22
50  | 5.04 × 10^6 | 4.19 × 10^106
100 | 1.72 × 10^7 | 2.39 × 10^150
200 | 1.31 × 10^7 | 3.51 × 10^264

Document Modeling – cont. Results: Both pLSI and the mixture of unigrams suffer from overfitting. pLSI – overfitting is due to the dimensionality of the p(z|d) parameters: as k gets larger, the chance that a training document will cover all the topics needed for a new document decreases. (Perplexity table as on the previous slide.)

Other uses

Summary LDA is based on the exchangeability assumption. It can be viewed as a dimensionality reduction technique. Exact inference is intractable, but we can approximate it instead. It can be applied to other collections of discrete data – images and their captions, for example.