1
Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng & Michael I. Jordan
Presented by Tilaye Alemu & Anand Ramkissoon
2
Motivation for LDA
In lay terms: document modelling
- text classification
- collaborative filtering
- ...
...in the context of Information Retrieval. The principal focus in this paper is on document classification within a corpus.
3
Structure of this talk
Part 1: Theory
- background
- (some) other approaches
Part 2: Experimental results
- some details of usage
- wider applications
4
LDA: conceptual features
- generative, probabilistic model for collections of discrete data
- 3-level hierarchical Bayesian model built from mixture models
- efficient approximate inference techniques: variational methods
- EM algorithm for empirical Bayes parameter estimation
5
How to classify text documents
Word (term) frequency approaches:
- tf-idf, term-by-document matrix
- discriminative sets of words
- documents reduced to fixed-length lists of numbers
- reveal little statistical structure
Dimensionality reduction techniques:
- Latent Semantic Indexing (via singular value decomposition)
- not generative
A sketch of tf-idf and LSI follows below.
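As a toy illustration (not part of the talk), the sketch below builds a tf-idf-weighted term-by-document matrix and reduces it with a truncated SVD, as LSI does. The count matrix and the rank k are made-up values.

```python
# A minimal sketch of tf-idf weighting and LSI via truncated SVD,
# using only numpy; the toy term-by-document counts are made up.
import numpy as np

# term-by-document count matrix X (V terms x M documents), toy data
X = np.array([[3, 0, 1, 0],
              [1, 2, 0, 0],
              [0, 1, 4, 1],
              [0, 0, 1, 3]], dtype=float)
V, M = X.shape

# tf-idf: term frequency scaled by inverse document frequency
tf = X / X.sum(axis=0, keepdims=True)        # column-normalized counts
df = (X > 0).sum(axis=1)                     # documents containing each term
idf = np.log(M / df)
tfidf = tf * idf[:, None]

# LSI: rank-k SVD of the tf-idf matrix
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T  # each document as a k-vector
print(doc_embeddings)
```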
6
How to classify text documents (cont'd)
probabilistic LSI (pLSI):
- each word is generated by one topic
- each document is generated by a mixture of topics
- a document is represented as a list of mixing proportions for topics
Problems:
- there is no generative model for these mixing proportions
- the number of parameters grows linearly with the size of the corpus
- overfitting
- unclear how to classify documents outside the training set
7
A major simplifying assumption
- A document is a "bag of words"; a corpus is a "bag of documents": order is unimportant
- This is exchangeability
- de Finetti representation theorem: any collection of exchangeable random variables has a representation as a (generally infinite) mixture distribution
8
A note about exchangeability
- Exchangeable does not mean that the random variables are iid
- They are iid when conditioned on an underlying latent parameter of a probability distribution
- Conditionally, the joint distribution is simple and factored
9
Notation
- word: the unit of discrete data, an item from a vocabulary indexed by {1,...,V}; each word is represented as a unit basis V-vector (see the sketch below)
- document: a sequence of N words, w = (w_1,...,w_N)
- corpus: a collection of M documents, D = (w_1,...,w_M)
- Each document is considered a random mixture over latent topics
- Each topic is considered a distribution over words
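A small illustration of the unit-basis representation; the vocabulary size and word indices are made up.

```python
# With vocabulary size V, each word is a unit basis V-vector and a
# document is a sequence of such vectors (toy values throughout).
import numpy as np

V = 5                     # vocabulary size (toy value)
doc = [2, 0, 3]           # a document as word indices w_1..w_N
one_hot = np.eye(V)[doc]  # N x V matrix, one unit basis vector per word
print(one_hot)
```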
10
LDA assumes the following generative process for each document w in the corpus D:
1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dir(α)
3. For each of the N words w_n:
   (a) choose a topic z_n ~ Multinomial(θ)
   (b) choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
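As a rough illustration, here is a minimal Python sketch of this generative process; the corpus size M, topic count k, vocabulary size V, Poisson rate, and the parameters alpha and beta are all made-up toy values.

```python
# A minimal sketch of LDA's generative process with toy parameters.
import numpy as np

rng = np.random.default_rng(0)
M, k, V = 4, 3, 10                        # documents, topics, vocabulary (toy sizes)
alpha = np.full(k, 0.5)                   # Dirichlet parameter alpha (assumed)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word probabilities (assumed)

corpus = []
for _ in range(M):
    N = rng.poisson(8)                    # choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)          # choose topic mixture theta ~ Dir(alpha)
    doc = []
    for _ in range(N):
        z = rng.choice(k, p=theta)        # choose topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      # choose word w_n from p(w | z_n, beta)
        doc.append(w)
    corpus.append(doc)
print(corpus)
```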
11
Probability density for the Dirichlet random variable
A k-dimensional Dirichlet random variable θ lives in the (k-1)-simplex and has density

p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
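As a quick sanity check (not part of the talk), this snippet evaluates the density formula directly and compares it with scipy's implementation at an arbitrary point of the simplex; the values of alpha and theta are made up.

```python
# Check the Dirichlet density formula against scipy's implementation.
import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])   # toy Dirichlet parameter
theta = np.array([0.2, 0.3, 0.5])   # a point on the simplex

# p(theta | alpha) = Gamma(sum a_i) / prod Gamma(a_i) * prod theta_i^(a_i - 1)
density = gamma(alpha.sum()) / gamma(alpha).prod() * np.prod(theta ** (alpha - 1))
print(density, dirichlet.pdf(theta, alpha))   # the two values agree
```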
12
Joint distribution of a topic mixture
Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
13
Marginal distribution of a document
Integrating over θ and summing over z gives the marginal distribution of a document:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta
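This integral has no closed form, but it can be approximated by simple Monte Carlo, which makes its meaning concrete. The sketch below (not from the paper) uses made-up values for alpha, beta, and the document.

```python
# A Monte Carlo estimate of the marginal p(w | alpha, beta): draw
# theta ~ Dir(alpha) and average the product of word probabilities.
import numpy as np

rng = np.random.default_rng(3)
k, V = 3, 10
alpha = np.full(k, 0.5)                    # toy model parameters
beta = rng.dirichlet(np.ones(V), size=k)
doc = [2, 5, 0]                            # a toy document as word indices

thetas = rng.dirichlet(alpha, size=100000)         # samples from the Dirichlet prior
word_probs = thetas @ beta                         # each row: p(w | theta, beta)
estimate = word_probs[:, doc].prod(axis=1).mean()  # E_theta[ prod_n p(w_n | theta) ]
print(estimate)
```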
14
Probability of a corpus
Taking the product of the marginal probabilities of the individual documents:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
15
Marginalize over z: the word distribution
Summing out the topic variable z gives the word distribution for a fixed topic mixture:

p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta) \, p(z \mid \theta)

The generative process can therefore be viewed as first drawing θ, then drawing each word from this continuous mixture of topic-word distributions.
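In code, this marginalization is just a convex combination of the rows of β weighted by θ; the following tiny sketch uses toy values.

```python
# Marginalizing z: given theta (k,) and beta (k, V), the word
# distribution is a theta-weighted mixture of the topic rows.
import numpy as np

rng = np.random.default_rng(2)
k, V = 3, 10
theta = rng.dirichlet(np.full(k, 0.5))      # toy topic mixture
beta = rng.dirichlet(np.ones(V), size=k)    # toy topic-word probabilities

word_dist = theta @ beta   # p(w | theta, beta) = sum_z p(w|z,beta) p(z|theta)
print(word_dist.sum())     # sums to 1
```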
16
A unigram model
Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)
17
Probabilistic Latent Semantic Indexing
pLSI models each word in a document d as a sample from a mixture model whose mixing proportions are conditioned on the document index:

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)
18
Inference from LDA
The key inferential problem is computing the posterior distribution of the hidden variables given a document:

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

This posterior is intractable to compute exactly, because θ and β are coupled in the normalizing constant p(w | α, β).
19
Variational Inference
The basic idea: replace the intractable posterior with a simpler, tractable family of distributions over the latent variables, and choose the member of that family closest to the true posterior.
20
A family of distributions on latent variables

q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

The Dirichlet parameter γ and the multinomial parameters φ_n are the free variational parameters.
21
The update equations
Minimizing the Kullback-Leibler divergence between the variational distribution and the true posterior yields the coordinate updates

\phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\left( \sum_{j=1}^{k} \gamma_j \right) \right)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

where Ψ is the digamma function.
22
Variational Inference Algorithm
1. Initialize φ_ni := 1/k for all i and n, and γ_i := α_i + N/k for all i
2. Alternate the φ and γ updates until γ converges
Each iteration takes time linear in the document length; the alternation is a coordinate ascent on the variational lower bound. A code sketch follows below.
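A minimal sketch of these variational updates for a single document; the model parameters alpha and beta and the document itself are made-up toy values.

```python
# Variational inference updates for one document under fixed alpha, beta.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
k, V = 3, 10
alpha = np.full(k, 0.5)                   # toy model Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=k)  # toy k x V topic-word probabilities
doc = [2, 5, 5, 0, 7]                     # a toy document as word indices
N = len(doc)

phi = np.full((N, k), 1.0 / k)            # initialize phi_ni = 1/k
gamma_ = alpha + N / k                    # initialize gamma_i = alpha_i + N/k

for _ in range(100):
    # phi_ni proportional to beta_{i,w_n} * exp(digamma(gamma_i));
    # the shared -digamma(sum gamma_j) term cancels in the normalization
    phi = beta[:, doc].T * np.exp(digamma(gamma_))
    phi /= phi.sum(axis=1, keepdims=True)   # normalize each row of phi
    new_gamma = alpha + phi.sum(axis=0)     # gamma_i = alpha_i + sum_n phi_ni
    if np.abs(new_gamma - gamma_).max() < 1e-6:
        break
    gamma_ = new_gamma

print(gamma_)                               # variational Dirichlet parameter
```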