Latent Dirichlet Allocation


Directed Graphical Models William W. Cohen Machine Learning 10-601

DGMs: The “Burglar Alarm” example. Node ~ random variable. Arcs define the form of the probability distribution: given its parents, each Xi is conditionally independent of its non-descendants, so Pr(X1,…,Xn) = Πi Pr(Xi | parents(Xi)). Nodes: Burglar, Earthquake, Alarm, PhoneCall; Burglar and Earthquake are parents of Alarm, and Alarm is the parent of PhoneCall. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn’t care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!

DGMs: The “Burglar Alarm” example. Generative story (and joint distribution): Pick b ~ Pr(Burglar), a binomial. Pick e ~ Pr(Earthquake), a binomial. Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials. Pick c ~ Pr(PhoneCall | Alarm=a).

DGMs: The “Burglar Alarm” example. You can also compute other quantities, e.g., Pr(Burglar=true | PhoneCall=true), or Pr(Burglar=true | PhoneCall=true, Earthquake=true). (Same generative story as above.)
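
A minimal sketch of this generative story and the query above, done by brute-force enumeration. The CPT numbers below are made up for illustration; they are not from the slides.

```python
import itertools

# Hypothetical CPTs for the burglar-alarm network (numbers invented for illustration).
P_b = {True: 0.001, False: 0.999}                      # Pr(Burglar)
P_e = {True: 0.002, False: 0.998}                      # Pr(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,        # Pr(Alarm=True | Burglar, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
P_c = {True: 0.90, False: 0.05}                        # Pr(PhoneCall=True | Alarm)

def joint(b, e, a, c):
    """Pr(b, e, a, c) = Pr(b) Pr(e) Pr(a | b, e) Pr(c | a)."""
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    pc = P_c[a] if c else 1 - P_c[a]
    return P_b[b] * P_e[e] * pa * pc

# Pr(Burglar=true | PhoneCall=true), summing out Earthquake and Alarm.
num = sum(joint(True, e, a, True) for e, a in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True) for b, e, a in itertools.product([True, False], repeat=3))
print(num / den)
```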

Inference in Bayes nets. General problem: given evidence E1,…,Ek, compute P(X|E1,…,Ek) for any X. Big assumption: the graph is a “polytree”: at most one undirected path between any two nodes X, Y. Notation (from the figure): U1, U2 are the parents of X; Y1, Y2 are the children of X; Z1, Z2 are the other parents of the children.

Inference in Bayes nets: P(X|E). Step 1 is a fancy version of Bayes rule: split the evidence E into E+ (the part connected to X through its parents: “causal support”) and E- (the part connected to X through its children: “evidential support”). Applying Pr(A|B,C) = (1/c) Pr(B|A,C) Pr(A|C) with A = X, B = E-, C = E+, and using the fact that in a polytree E- is independent of E+ given X, gives P(X|E) = (1/c) P(E-|X) P(X|E+).

Inference in Bayes nets: P(X|E+). Condition on X’s parents U1,…,Um and sum them out: P(X|E+) = Σu P(X|U=u) Πj P(Uj | EUj\X). The P(X|U=u) term is a CPT table lookup; each P(Uj | EUj\X) is a recursive call to P(·|E+), using the evidence for Uj that doesn’t go through X. So far: a simple way of propagating requests for “belief due to causal evidence” up the tree; i.e., info on Pr(X|E+) flows down.

Inference in Bayes nets: P(E-|X). Using d-separation plus the polytree assumption, the evidence below X factors over X’s children Y1,…,Yn (with Zj denoting the other parents of Yj): P(E-|X) = Πj Σyj P(E-Yj | yj) Σzj P(yj | X, zj) P(zj | EZj\Yj), where each P(E-Yj | ·) is a recursive call to P(E-|·). So far: a simple way of propagating requests for “belief due to evidential support” down the tree; i.e., info on Pr(E-|X) flows up.

Recap: Message Passing for BP. We reduced P(X|E) to a product of two recursively calculated parts: P(X=x|E+), i.e., the CPT for X combined with the product of “forward” messages from X’s parents; and P(E-|X=x), i.e., a combination of “backward” messages from X’s children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E). This can also be implemented by message passing (belief propagation). Messages are distributions – i.e., vectors.

Recap: Message Passing for BP. Top-level algorithm: Pick one vertex as the “root”. Any node with only one edge is a “leaf”. Pass messages from the leaves to the root, then pass messages from the root to the leaves. Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E).
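
A toy illustration of the P(X|E) ∝ P(X|E+)·P(E-|X) decomposition on a three-node chain U → X → Y with evidence Y=true. The CPT numbers are invented for the example, not taken from the slides.

```python
# Toy chain U -> X -> Y, all binary. Numbers are illustrative assumptions.
P_u = {True: 0.3, False: 0.7}                          # Pr(U)
P_x_given_u = {True: 0.8, False: 0.1}                  # Pr(X=True | U)
P_y_given_x = {True: 0.9, False: 0.2}                  # Pr(Y=True | X)

def p_x(x, u):
    return P_x_given_u[u] if x else 1 - P_x_given_u[u]

# Causal support ("forward" message): Pr(X=x | E+) = sum_u Pr(X=x | u) Pr(u)
causal = {x: sum(p_x(x, u) * P_u[u] for u in (True, False)) for x in (True, False)}

# Evidential support ("backward" message): Pr(E- | X=x) = Pr(Y=True | X=x)
evidential = {x: P_y_given_x[x] for x in (True, False)}

# Combine and normalize: Pr(X=x | Y=True) = (1/c) * causal[x] * evidential[x]
unnorm = {x: causal[x] * evidential[x] for x in (True, False)}
c = sum(unnorm.values())
print({x: unnorm[x] / c for x in (True, False)})
```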

NAÏVE BAYES AS A DGM

Naïve Bayes as a DGM (plate diagram: Y → X, with X inside plates of size Nd and D). For each document d in the corpus (of size D): pick a label yd from Pr(Y). For each word in d (of length Nd): pick a word xid from Pr(X|Y=yd). Example CPTs: Pr(Y=y): onion 0.3, economist 0.7. Pr(X|Y=y), e.g. for y=onion: aardvark 0.034, ai 0.0067, …, economist 0.0000003, …, zymurgy 0.01000.

Naïve Bayes as a DGM. For each document d in the corpus (of size D): pick a label yd from Pr(Y). For each word in d (of length Nd): pick a word xid from Pr(X|Y=yd). Not described: how do we smooth for classes? For multinomials? How many classes are there? ….
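
A short sketch of this generative story in code. The class prior (0.3/0.7) is from the slide’s example table; the per-class word distributions below are made-up stand-ins, since the slide only shows a few entries.

```python
import numpy as np

rng = np.random.default_rng(0)

classes = ["onion", "economist"]
prior_y = [0.3, 0.7]                      # Pr(Y), as in the slide's example table
vocab = ["aardvark", "ai", "economist", "zymurgy"]
# Pr(X | Y=y): one multinomial over the vocabulary per class (values are illustrative).
p_x_given_y = {
    "onion":     [0.4, 0.3, 0.1, 0.2],
    "economist": [0.1, 0.2, 0.5, 0.2],
}

def generate_corpus(D=3, Nd=5):
    corpus = []
    for _ in range(D):                     # for each document d
        y = rng.choice(classes, p=prior_y)                    # pick y_d from Pr(Y)
        words = rng.choice(vocab, size=Nd, p=p_x_given_y[y])  # pick each word from Pr(X|Y=y_d)
        corpus.append((str(y), list(words)))
    return corpus

print(generate_corpus())
```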

Recap: smoothing for a binomial. Estimate θ = P(heads) for a binomial. MLE: maximize Pr(D|θ), giving θ = #heads / (#heads + #tails). MAP: maximize Pr(D|θ)Pr(θ), giving θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails). Smoothing = a prior over the parameter θ.

Smoothing for a binomial as a DGM. Graph: γ = (#imaginary heads, #imaginary tails) → θ → D. MAP for a dataset D with α1 heads and α2 tails: MAP is a Viterbi-style inference: we want the maximum-probability parameter θ under the posterior distribution. Also: inference in even a simple graph can be intractable if the conditional distributions are complicated.

Smoothing for a binomial as a DGM. MAP for a dataset D with α1 heads and α2 tails; now replace the single coin-flip dataset D=d with features F = f1, f2, …. A Dirichlet prior has imaginary data (pseudo-counts) for each of the K classes, generalizing the α1 + α2 imaginary heads and tails. Comment: the conjugate prior for a binomial is a Beta; the conjugate prior for a multinomial is called a Dirichlet.
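
For reference, the conjugacy facts behind this slide, written out (standard results; here α1, α2 are the imaginary-heads/tails pseudo-counts):

$$\theta \sim \mathrm{Beta}(\alpha_1,\alpha_2),\ \ D = (n_H\ \text{heads},\ n_T\ \text{tails}) \;\Rightarrow\; \Pr(\theta \mid D) \propto \theta^{\,n_H+\alpha_1-1}(1-\theta)^{\,n_T+\alpha_2-1},\ \ \text{i.e. } \theta \mid D \sim \mathrm{Beta}(\alpha_1+n_H,\ \alpha_2+n_T)$$

and for a multinomial with a Dirichlet prior, $\theta \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_K)$ with class counts $n_1,\dots,n_K$ gives $\theta \mid D \sim \mathrm{Dir}(\alpha_1+n_1,\dots,\alpha_K+n_K)$: the prior acts like imaginary counts added to the observed ones.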

Recap: Naïve Bayes as a DGM (plate diagram: Y → X, plates Nd and D). Now: let’s turn Bayes up to 11 for naïve Bayes….

A more Bayesian Naïve Bayes. From a Dirichlet α: draw a multinomial π over the K classes. From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary. For each document d=1..D: pick a label yd from π; for each word in d (of length Nd): pick a word xid from γ[yd]. (Plate diagram: α → π → Y → W ← γ ← β, with W in plates Nd and D, and γ in a plate of size K.)

Unsupervised Naïve Bayes. Same generative story, but now the labels Y are unobserved. From a Dirichlet α: draw a multinomial π over the K classes. From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary. For each document d=1..D: pick a (latent) label yd from π; for each word in d (of length Nd): pick a word xid from γ[yd].

Unsupervised Naïve Bayes. The generative story is the same; what we do with it is different. We want both to find values of the parameters γ and π and to find values of Y for each example. EM is a natural algorithm; a sketch follows.
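
One way to make the “EM is a natural algorithm” remark concrete: a minimal EM sketch for unsupervised Naïve Bayes (a mixture of multinomials). This is my own illustrative implementation, not code from the lecture; the names X, K, alpha, beta are hypothetical, with alpha and beta playing the role of the Dirichlet pseudo-counts.

```python
import numpy as np

def em_unsupervised_nb(X, K, iters=50, alpha=1.0, beta=0.01, seed=0):
    """EM for unsupervised Naive Bayes. X: (D, V) word-count matrix; K: # latent classes."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi = np.full(K, 1.0 / K)                      # class prior (the slides' pi)
    gamma = rng.dirichlet(np.ones(V), size=K)     # per-class word distributions (gamma[y])
    for _ in range(iters):
        # E-step: soft assignments r[d, k] = Pr(Y=k | doc d), computed in log space.
        log_r = np.log(pi) + X @ np.log(gamma).T  # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and gamma from the soft counts, with smoothing.
        pi = (r.sum(axis=0) + alpha) / (D + K * alpha)
        counts = r.T @ X + beta                   # (K, V)
        gamma = counts / counts.sum(axis=1, keepdims=True)
    return pi, gamma, r
```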

LDA AS A DGM

The LDA Topic Model

Unsupervised NB vs LDA. Unsupervised NB: one class prior π, and one class label Y per document. LDA: a different class distribution for each document, and one label per word.

Unsupervised NB vs LDA. Unsupervised NB (plates: π, Y, W, γ): one class prior, one Y per document. LDA (plates: θd, Zdi, Wdi, γk): a different class distribution θd for each document d, and one topic Zdi per word Wdi.

LDA. Blei’s motivation: start with the BOW assumption. Assumptions: (1) documents are i.i.d.; (2) within a document, words are i.i.d. (bag of words). For each document d = 1,…,M: generate θd ~ D1(…); for each word position n = 1,…,Nd: generate wn ~ D2( · | θd). Now pick your favorite distributions for D1, D2.

Unsupervised NB vs LDA. Unsupervised NB clusters documents into latent classes. LDA clusters word occurrences into latent classes (topics). The smoothness of βk → the same word suggests the same topic. The smoothness of θd → the same document suggests the same topic.
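
A sketch of the LDA generative story in code, for a toy corpus. The sizes and hyperparameter values (D, Nd, K, V, alpha, eta) are illustrative assumptions.

```python
import numpy as np

def generate_lda_corpus(D=5, Nd=20, K=3, V=100, alpha=0.1, eta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative story."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)        # K topic-word multinomials
    docs, assignments = [], []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))         # per-document topic distribution
        z = rng.choice(K, size=Nd, p=theta)              # one topic per word position
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # each word drawn from its topic
        docs.append(w)
        assignments.append(z)
    return docs, assignments, beta

docs, z, beta = generate_lda_corpus()
print(docs[0][:10], z[0][:10])
```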

LDA’s view of a document Mixed membership model

LDA topics: top words w by Pr(w|Z=k)

Using topics as features: an SVM with 50 features Pr(Z=k|θd) (the 50 topics) vs. an SVM using all words.

Gibbs Sampling for LDA

Latent Dirichlet Allocation. Parameter learning: Variational EM, …, Collapsed Gibbs Sampling. Wait, why is sampling called “learning” here? Here’s the idea….

LDA: Gibbs sampling – works for any directed model! Applicable when the joint distribution is hard to evaluate but the conditional distributions are known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: sample one latent variable from its distribution given the other latent variables and the observed variables.
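
A toy illustration of that claim (not from the slides): Gibbs-sample two binary variables using only their conditionals, and check that the sample frequencies approach the (made-up) joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small made-up joint distribution over two binary variables (A, B).
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])   # joint[a, b] = Pr(A=a, B=b)

def sample_conditional(joint, fixed_axis, fixed_value):
    """Sample the free variable given the fixed one, from the conditional of the joint."""
    row = joint[fixed_value, :] if fixed_axis == 0 else joint[:, fixed_value]
    return rng.choice(2, p=row / row.sum())

a, b = 0, 0                        # arbitrary initialization
counts = np.zeros((2, 2))
for t in range(100_000):
    a = sample_conditional(joint, fixed_axis=1, fixed_value=b)   # a ~ Pr(A | B=b)
    b = sample_conditional(joint, fixed_axis=0, fixed_value=a)   # b ~ Pr(B | A=a)
    counts[a, b] += 1

print(counts / counts.sum())       # should be close to `joint`
```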

Example model (for i = 1, 2): θ ~ Dirichlet(α); Zi ~ θ; Xi ~ Pr(X|Zi); Yi ~ Pr(Y|Xi). I’ll assume we know the parameters for Pr(X|Z) and Pr(Y|X).

Initialize all the hidden variables randomly, then repeat: Pick Z1 ~ Pr(Z|x1,y1,θ). Pick θ ~ Pr(θ|z1,z2,α) (pick from the posterior). Pick Z2 ~ Pr(Z|x2,y2,θ). … In a broad range of cases these will eventually converge to samples from the true joint distribution, so we will have (a sample of) the true θ.

Why does Gibbs sampling work? Basic claim: when you sample x ~ P(X|y1,…,yk), then if y1,…,yk were sampled from the true joint, x will also be sampled from the true joint. So the true joint is a “fixed point”: you tend to stay there if you ever get there. How long does it take to get there? It depends on the structure of the space of samples: how well-connected are the samples by the sampling steps?

Mixing time depends on the eigenvalues of the chain’s transition matrix over the state space!

Latent Dirichlet Allocation. Parameter learning: Variational EM (not covered in 601-B); Collapsed Gibbs Sampling. What is collapsed Gibbs sampling?

Collapsed Gibbs sampling: θ is integrated out and only the Z’s are sampled. Initialize all the Z’s randomly, then repeat: Pick Z1 ~ Pr(Z1|z2,z3,α). Pick Z2 ~ Pr(Z2|z1,z3,α). Pick Z3 ~ Pr(Z3|z1,z2,α). … Converges to samples from the true joint, and then we can estimate Pr(θ | α, sample of Z’s).

Initialize all the Z’s randomly, then pick Z1 ~ Pr(Z1|z2,z3,α). What’s this distribution?

Simpler case (ignore the downstream X’s and Y’s): what is Pr(Z1,…,Zm | α)? It’s called a Dirichlet-multinomial and it looks like this: if there are K values for the Z’s and nk = # Z’s with value k, then it turns out Pr(Z1,…,Zm | α) = [Γ(Σk αk) / Γ(m + Σk αk)] · Πk [Γ(nk + αk) / Γ(αk)] (and Γ(n) = (n−1)! when n is a positive integer, so for integer counts this is a ratio of factorials).

It turns out that sampling from a Dirichlet-multinomial is very easy! Notation: K values for the Z’s; nk = # Z’s with value k; Z = (Z1,…, Zm); Z(-i) = (Z1,…, Zi-1, Zi+1,…, Zm); nk(-i) = # Z’s with value k, excluding Zi.
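
Concretely, the “very easy” sampling distribution is the standard Dirichlet-multinomial conditional (using the notation just defined):

$$\Pr(Z_i = k \mid Z^{(-i)}, \alpha) \;=\; \frac{n_k^{(-i)} + \alpha_k}{(m-1) + \sum_{k'} \alpha_{k'}}$$

so resampling Zi only requires the counts over the other Z’s plus the prior pseudo-counts.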

What about with downstream evidence? α captures the constraints on Zi via θ (from “above”, the “causal” direction); what about the constraints via the β[k]’s (with their prior η), coming from the observed X’s below? As before, Pr(Z) = (1/c) · Pr(Z|E+) · Pr(E-|Z).

Sampling for LDA. Notation: K values for the Zd,i’s; Z(-d,i) = all the Z’s but Zd,i; nw,k = # Zd,i’s with value k paired with Wd,i=w; n*,k = # Zd,i’s with value k; nw,k(-d,i) = nw,k excluding Zd,i; n*,k(-d,i) = n*,k excluding Zd,i; n*,k d,(-i) = n*,k counted only within doc d, excluding Zd,i. The conditional for Zd,i factors as Pr(Z|E+) · Pr(E-|Z): Pr(Z|E+) is (proportional to) the fraction of time Z=k in doc d, and Pr(E-|Z) the fraction of time W=w in topic k.
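
Putting the two factors together gives the usual collapsed Gibbs update for LDA, written here for reference assuming symmetric scalar priors α and η and vocabulary size V (this is what the implementation slides below compute):

$$\Pr(Z_{d,i}=k \mid Z^{(-d,i)}, W, \alpha, \eta)\;\propto\;\bigl(n_{*,k}^{\,d,(-i)} + \alpha\bigr)\cdot\frac{n_{w,k}^{(-d,i)} + \eta}{\,n_{*,k}^{(-d,i)} + V\eta\,},\qquad w = W_{d,i}.$$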

IMPLEMENTING LDA

Way way more detail

More detail

[Figure: sampling a topic z — the per-topic weights for z=1, z=2, z=3, … are stacked to unit height and a random point picks the new topic.]

What gets learned…..

In a math-ier notation: N[d,k] = # words in document d assigned to topic k; N[*,k] = # words assigned to topic k overall; N[*,*] = total number of word tokens; M[w,k] = # times word w is assigned to topic k; V = vocabulary size.

for each document d and word position j in d:
    z[d,j] = k, a random topic
    N[d,k]++
    W[w,k]++ where w = id of the j-th word in d

for each pass t = 1, 2, …:
    for each document d and word position j in d:
        z[d,j] = k, a new topic sampled from Pr(z[d,j]=k | everything else)
        update N, W to reflect the new assignment of z:
            N[d,k]++;  N[d,k']--  where k' is the old z[d,j]
            W[w,k]++;  W[w,k']--  where w is w[d,j]

[Figure: sampling z from the per-topic weights stacked to unit height.] You spend a lot of time sampling: there’s a loop over all topics here in the sampler.
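
A minimal Python sketch of that inner loop, following the pseudocode above. This is my own illustrative implementation, not the course code; the arrays N and W match the slides’ counters, and NK (holding the per-topic totals N[*,k]) is an added name.

```python
import numpy as np

def gibbs_pass(docs, z, N, W, NK, alpha, eta):
    """One collapsed-Gibbs sweep over the corpus.
    docs[d][j] = word id at position j of doc d; z[d][j] = its current topic.
    N[d,k] = # words in doc d assigned to topic k
    W[w,k] = # times word w is assigned to topic k
    NK[k]  = # words assigned to topic k overall (= W[:, k].sum())."""
    K = N.shape[1]
    V = W.shape[0]
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            k_old = z[d][j]
            # Remove the current assignment from the counts.
            N[d, k_old] -= 1; W[w, k_old] -= 1; NK[k_old] -= 1
            # The "loop over all topics": build an unnormalized weight per topic
            # (vectorized here), then normalize to unit height and sample.
            p = (N[d, :] + alpha) * (W[w, :] + eta) / (NK + V * eta)
            p /= p.sum()
            k_new = np.random.choice(K, p=p)
            # Record the new assignment and restore the counts.
            z[d][j] = k_new
            N[d, k_new] += 1; W[w, k_new] += 1; NK[k_new] += 1
```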