1
Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng & Michael I. Jordan
Presented by Tilaye Alemu & Anand Ramkissoon
2
Motivation for LDA
In lay terms: document modelling
- text classification
- collaborative filtering
- ...
...in the context of Information Retrieval. The principal focus in this paper is on document classification within a corpus.
3
Structure of this talk
Part 1: Theory
- background
- (some) other approaches
Part 2: Experimental results
- some details of usage
- wider applications
4
LDA: conceptual features
- generative, probabilistic model for collections of discrete data
- 3-level hierarchical Bayesian model built from mixture models
- efficient approximate inference techniques: variational methods
- EM algorithm for empirical Bayes parameter estimation
5
How to classify text documents
Word (term) frequency approaches:
- tf-idf, term-by-document matrix
- discriminative sets of words
- documents reduced to fixed-length lists of numbers
- reveal little statistical structure
Dimensionality reduction techniques:
- Latent Semantic Indexing (via singular value decomposition)
- not generative
A sketch of tf-idf and LSI follows below.
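As a toy illustration (not part of the talk), the sketch below builds a tf-idf-weighted term-by-document matrix and reduces it with a truncated SVD, as LSI does. The count matrix and the rank k are made-up values.

```python
# A minimal sketch of tf-idf weighting and LSI via truncated SVD,
# using only numpy; the toy term-by-document counts are made up.
import numpy as np

# term-by-document count matrix X (V terms x M documents), toy data
X = np.array([[3, 0, 1, 0],
              [1, 2, 0, 0],
              [0, 1, 4, 1],
              [0, 0, 1, 3]], dtype=float)
V, M = X.shape

# tf-idf: term frequency scaled by inverse document frequency
tf = X / X.sum(axis=0, keepdims=True)        # column-normalized counts
df = (X > 0).sum(axis=1)                     # documents containing each term
idf = np.log(M / df)
tfidf = tf * idf[:, None]

# LSI: rank-k SVD of the tf-idf matrix
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T  # each document as a k-vector
print(doc_embeddings)
```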
6
How to classify text documents (cont'd)
probabilistic LSI (pLSI):
- each word is generated by one topic
- each document is generated by a mixture of topics
- a document is represented as a list of mixing proportions for topics
Problems:
- there is no generative model for these mixing proportions
- the number of parameters grows linearly with the size of the corpus
- overfitting
- unclear how to classify documents outside the training set
7
A major simplifying assumption
- A document is a "bag of words"; a corpus is a "bag of documents": order is unimportant
- This is exchangeability
- de Finetti representation theorem: any collection of exchangeable random variables has a representation as a (generally infinite) mixture distribution
8
A note about exchangeability
- Exchangeable does not mean that the random variables are iid
- They are iid when conditioned on an underlying latent parameter of a probability distribution
- Conditionally, the joint distribution is simple and factored
9
Notation
- word: the unit of discrete data, an item from a vocabulary indexed by {1,...,V}; each word is represented as a unit basis V-vector (see the sketch below)
- document: a sequence of N words, w = (w_1,...,w_N)
- corpus: a collection of M documents, D = (w_1,...,w_M)
- Each document is considered a random mixture over latent topics
- Each topic is considered a distribution over words
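A small illustration of the unit-basis representation; the vocabulary size and word indices are made up.

```python
# With vocabulary size V, each word is a unit basis V-vector and a
# document is a sequence of such vectors (toy values throughout).
import numpy as np

V = 5                     # vocabulary size (toy value)
doc = [2, 0, 3]           # a document as word indices w_1..w_N
one_hot = np.eye(V)[doc]  # N x V matrix, one unit basis vector per word
print(one_hot)
```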
10
LDA assumes the following generative process for each document w in the corpus D:
1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dir(α)
3. For each of the N words w_n:
   (a) choose a topic z_n ~ Multinomial(θ)
   (b) choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
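As a rough illustration, here is a minimal Python sketch of this generative process; the corpus size M, topic count k, vocabulary size V, Poisson rate, and the parameters alpha and beta are all made-up toy values.

```python
# A minimal sketch of LDA's generative process with toy parameters.
import numpy as np

rng = np.random.default_rng(0)
M, k, V = 4, 3, 10                        # documents, topics, vocabulary (toy sizes)
alpha = np.full(k, 0.5)                   # Dirichlet parameter alpha (assumed)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word probabilities (assumed)

corpus = []
for _ in range(M):
    N = rng.poisson(8)                    # choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)          # choose topic mixture theta ~ Dir(alpha)
    doc = []
    for _ in range(N):
        z = rng.choice(k, p=theta)        # choose topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])      # choose word w_n from p(w | z_n, beta)
        doc.append(w)
    corpus.append(doc)
print(corpus)
```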
11
Probability density for the Dirichlet random variable
A k-dimensional Dirichlet random variable θ lives in the (k-1)-simplex and has density

p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
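As a quick sanity check (not part of the talk), this snippet evaluates the density formula directly and compares it with scipy's implementation at an arbitrary point of the simplex; the values of alpha and theta are made up.

```python
# Check the Dirichlet density formula against scipy's implementation.
import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])   # toy Dirichlet parameter
theta = np.array([0.2, 0.3, 0.5])   # a point on the simplex

# p(theta | alpha) = Gamma(sum a_i) / prod Gamma(a_i) * prod theta_i^(a_i - 1)
density = gamma(alpha.sum()) / gamma(alpha).prod() * np.prod(theta ** (alpha - 1))
print(density, dirichlet.pdf(theta, alpha))   # the two values agree
```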
12
Joint distribution of a topic mixture
Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
13
Marginal distribution of a document
Integrating over θ and summing over z gives the marginal distribution of a document:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta
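This integral has no closed form, but it can be approximated by simple Monte Carlo, which makes its meaning concrete. The sketch below (not from the paper) uses made-up values for alpha, beta, and the document.

```python
# A Monte Carlo estimate of the marginal p(w | alpha, beta): draw
# theta ~ Dir(alpha) and average the product of word probabilities.
import numpy as np

rng = np.random.default_rng(3)
k, V = 3, 10
alpha = np.full(k, 0.5)                    # toy model parameters
beta = rng.dirichlet(np.ones(V), size=k)
doc = [2, 5, 0]                            # a toy document as word indices

thetas = rng.dirichlet(alpha, size=100000)         # samples from the Dirichlet prior
word_probs = thetas @ beta                         # each row: p(w | theta, beta)
estimate = word_probs[:, doc].prod(axis=1).mean()  # E_theta[ prod_n p(w_n | theta) ]
print(estimate)
```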
14
Probability of a corpus
Taking the product of the marginal probabilities of the individual documents:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
15
Marginalize over z: the word distribution
Summing out the topic variable z gives the word distribution for a fixed topic mixture:

p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta) \, p(z \mid \theta)

The generative process can therefore be viewed as first drawing θ, then drawing each word from this continuous mixture of topic-word distributions.
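In code, this marginalization is just a convex combination of the rows of β weighted by θ; the following tiny sketch uses toy values.

```python
# Marginalizing z: given theta (k,) and beta (k, V), the word
# distribution is a theta-weighted mixture of the topic rows.
import numpy as np

rng = np.random.default_rng(2)
k, V = 3, 10
theta = rng.dirichlet(np.full(k, 0.5))      # toy topic mixture
beta = rng.dirichlet(np.ones(V), size=k)    # toy topic-word probabilities

word_dist = theta @ beta   # p(w | theta, beta) = sum_z p(w|z,beta) p(z|theta)
print(word_dist.sum())     # sums to 1
```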
16
A unigram model
Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:

p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)
17
Probabilistic Latent Semantic Indexing
pLSI models each word in a document d as a sample from a mixture model whose mixing proportions are conditioned on the document index:

p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)
18
Inference from LDA
The key inferential problem is computing the posterior distribution of the hidden variables given a document:

p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

This posterior is intractable to compute exactly, because θ and β are coupled in the normalizing constant p(w | α, β).
19
Variational Inference
The basic idea: replace the intractable posterior with a simpler, tractable family of distributions over the latent variables, and choose the member of that family closest to the true posterior.
20
A family of distributions on latent variables

q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

The Dirichlet parameter γ and the multinomial parameters φ_n are the free variational parameters.
21
The update equations
Minimizing the Kullback-Leibler divergence between the variational distribution and the true posterior yields the coordinate updates

\phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\left( \sum_{j=1}^{k} \gamma_j \right) \right)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

where Ψ is the digamma function.
22
Variational Inference Algorithm
1. Initialize φ_ni := 1/k for all i and n, and γ_i := α_i + N/k for all i
2. Alternate the φ and γ updates until γ converges
Each iteration takes time linear in the document length; the alternation is a coordinate ascent on the variational lower bound. A code sketch follows below.
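A minimal sketch of these variational updates for a single document; the model parameters alpha and beta and the document itself are made-up toy values.

```python
# Variational inference updates for one document under fixed alpha, beta.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
k, V = 3, 10
alpha = np.full(k, 0.5)                   # toy model Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=k)  # toy k x V topic-word probabilities
doc = [2, 5, 5, 0, 7]                     # a toy document as word indices
N = len(doc)

phi = np.full((N, k), 1.0 / k)            # initialize phi_ni = 1/k
gamma_ = alpha + N / k                    # initialize gamma_i = alpha_i + N/k

for _ in range(100):
    # phi_ni proportional to beta_{i,w_n} * exp(digamma(gamma_i));
    # the shared -digamma(sum gamma_j) term cancels in the normalization
    phi = beta[:, doc].T * np.exp(digamma(gamma_))
    phi /= phi.sum(axis=1, keepdims=True)   # normalize each row of phi
    new_gamma = alpha + phi.sum(axis=0)     # gamma_i = alpha_i + sum_n phi_ni
    if np.abs(new_gamma - gamma_).max() < 1e-6:
        break
    gamma_ = new_gamma

print(gamma_)                               # variational Dirichlet parameter
```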