Latent Dirichlet Allocation
Directed Graphical Models William W. Cohen Machine Learning 10-601
DGMs: The “Burglar Alarm” example Node ~ random variable Arcs define form of probability distribution: Pr(Xi | X1,…,Xi-1,Xi+1,…,Xn) = Pr(Xi | parents(Xi)) Burglar Earthquake Alarm Phone Call Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn’t care whether your house is currently being burgled While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!
DGMs: The “Burglar Alarm” example Node ~ random variable Arcs define form of probability distribution: Pr(Xi | X1,…,Xi-1,Xi+1,…,Xn) = Pr(Xi | parents(Xi)) Burglar Earthquake Alarm Phone Call Generative story (and joint distribution): Pick b ~ Pr(Burglar), a binomial Pick e ~ Pr(Earthquake), a binomial Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials Pick c ~ Pr(PhoneCall | Alarm)
DGMs: The “Burglar Alarm” example You can also compute other quantities: e.g., Pr(Burglar=true | PhoneCall=true) Pr(Burglar=true | PhoneCall=true, and Earthquake=true) Burglar Earthquake Alarm Phone Call Generative story: Pick b ~ Pr(Burglar), a binomial Pick e ~ Pr(Earthquake), a binomial Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials Pick c ~ Pr(PhoneCall | Alarm)
Inference in Bayes nets General problem: given evidence E1,…,Ek compute P(X|E1,..,Ek) for any X Big assumption: graph is “polytree” <=1 undirected path between any nodes X,Y Notation: U1 U2 X Z1 Z2 Y1 Y2
Inference in Bayes nets: P(X|E) Step 1 – fancy version of Bayes rule 2: X is E- is B, E+ is C (2) Step 3 - E+: causal support E-: evidential support
Inference in Bayes nets: P(X|E+) CPT table lookup Recursive call to P(.|E+) So far: simple way of propagating requests for “belief due to causal evidence” up the tree Evidence for Uj that doesn’t go thru X I.e. info on Pr(X|E+) flows down
Inference in Bayes nets: P(E-|X) simplified d-sep + polytree Recursive call to P(E-|.) So far: simple way of propagating requests for “belief due to evidential support” down the tree I.e. info on Pr(E-|X) flows up
Recap: Message Passing for BP We reduced P(X|E) to product of two recursively calculated parts: P(X=x|E+) i.e., CPT for X and product of “forward” messages from parents P(E-|X=x) i.e., combination of “backward” messages from parents, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E) This can also be implemented by message-passing (belief propagation) Messages are distributions – i.e., vectors
Recap: Message Passing for BP Top-level algorithm Pick one vertex as the “root” Any node with only one edge is a “leaf” Pass messages from the leaves to the root Pass messages from root to the leaves Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E)
NAÏVE BAYES AS A DGM
Naïve Bayes as a DGM for every X Nd D Plate diagram For each document d in the corpus (of size D): Pick a label yd from Pr(Y) For each word in d (of length Nd): Pick a word xid from Pr(X|Y=yd) Pr(Y=y) onion 0.3 economist 0.7 Y for every X Y X Pr(X|y=y) onion aardvark 0.034 ai 0.0067 … economist 0.0000003 …. zymurgy 0.01000 X Nd D Plate diagram
Naïve Bayes as a DGM For each document d in the corpus (of size D): Pick a label yd from Pr(Y) For each word in d (of length Nd): Pick a word xid from Pr(X|Y=yd) Y Not described: how do we smooth for classes? For multinomials? How many classes are there? …. X Nd D Plate diagram
Recap: smoothing for a binomial estimate Θ=P(heads) for a binomial with MLE as: MLE: maximize Pr(D|θ) #heads MAP: maximize Pr(D|θ)Pr(θ) #tails and with MAP as: #imaginary heads #imaginary tails Smoothing = prior over the parameter θ
Smoothing for a binomial as a DGM MAP for dataset D with α1 heads and α2 tails: γ #imaginary heads θ D #imaginary tails MAP is a Viterbi-style inference: want to find max probability paramete θ, to the posterior distribution Also: inference in a simple graph can be intractable if the conditional distributions are complicated
Smoothing for a binomial as a DGM MAP for dataset D with α1 heads and α2 tails: γ #imaginary heads θ replacing D=d with F=f1,f2,… F Dirichlet prior has imaginary data for each of K classes #imaginary tails α1 + α2 Comment: conjugate for binomial is a Beta, conjugate for a multinomial is called a Dirichlet
Recap: Naïve Bayes as a DGM Now: let’s turn Bayes up to 11 for naïve Bayes…. Y X Nd D Plate diagram
A more Bayesian Naïve Bayes From a Dirichlet α: Draw a multinomial π over the K classes From a Dirichlet β For each class y=1…K Draw a multinomial γ[y] over the vocabulary For each document d=1..D: Pick a label yd from π For each word in d (of length Nd): Pick a word xid from γ[yd] α π Y W Nd D β γ K
Unsupervised Naïve Bayes From a Dirichlet α: Draw a multinomial π over the K classes From a Dirichlet β For each class y=1…K Draw a multinomial γ[y] over the vocabulary For each document d=1..D: Pick a label yd from π For each word in d (of length Nd): Pick a word xid from γ[yd] α π Y W Nd D β γ K
Unsupervised Naïve Bayes The generative story is the same What we do is different We want to both find values of γ and π and find values for Y for each example. EM is a natural algorithm α π Y W Nd D β γ K
LDA AS A DGM
The LDA Topic Model
Unsupervised NB vs LDA different class distrib for each doc one class prior Y Nd D π W γ K α β Y Nd D π W γ K α β one Y per doc one Y per word
Unsupervised NB vs LDA different class distrib θ for each doc one class prior Y Nd D θd γk K α β Zdi Y Nd D π W γ K α β one Y per doc one Z per word Wdi
LDA Blei’s motivation: start with BOW assumption Assumptions: 1) documents are i.i.d 2) within a document, words are i.i.d. (bag of words) For each document d = 1,,M Generate d ~ D1(…) For each word n = 1,, Nd generate wn ~ D2( . | θdn) Now pick your favorite distributions for D1, D2 w
Unsupervised NB vs LDA w Y γk α β Zdi Wdi θd Nd D θd γk K α β Zdi Unsupervised NB clusters documents into latent classes LDA clusters word occurrences into latent classes (topics) The smoothness of βk same word suggests same topic The smoothness of θd same document suggests same topic Wdi w
LDA’s view of a document Mixed membership model
LDA topics: top words w by Pr(w|Z=k)
SVM using 50 features: Pr(Z=k|θd) 50 topics vs all words, SVM
Gibbs Sampling for LDA
LDA Latent Dirichlet Allocation Parameter learning: Variational EM … Collapsed Gibbs Sampling Wait, why is sampling called “learning” here? Here’s the idea….
LDA Gibbs sampling – works for any directed model! Applicable when joint distribution is hard to evaluate but conditional distribution is known Sequence of samples comprises a Markov Chain Stationary distribution of the chain is the joint distribution Key capability: estimate distribution of one latent variables given the other latent variables and observed variables.
I’ll assume we know parameters for Pr(X|Z) and Pr(Y|X) α Z1 X1 Y1 θ α Z2 X2 Y2 θ Z X Y 2 I’ll assume we know parameters for Pr(X|Z) and Pr(Y|X)
Z1 X1 Y1 θ α Z2 X2 Y2 . Pick Z1 ~ Pr(Z|x1,y1,θ) Pick θ ~ Pr(z1, z2,α) Initialize all the hidden variables randomly, then…. Z1 X1 Y1 θ α Z2 X2 Y2 Pick Z1 ~ Pr(Z|x1,y1,θ) Pick θ ~ Pr(z1, z2,α) pick from posterior Pick Z2 ~ Pr(Z|x2,y2,θ) . Pick Z1 ~ Pr(Z|x1,y1,θ) Pick θ ~ Pr(z1, z2,α) pick from posterior So we will have (a sample of) the true θ Pick Z2 ~ Pr(Z|x2,y2,θ) in a broad range of cases eventually these will converge to samples from the true joint distribution
Why does Gibbs sampling work? Basic claim: when you sample x~P(X|y1,…,yk) then if y1,…,yk were sampled from the true joint then x will be sampled from the true joint So the true joint is a “fixed point” you tend to stay there if you ever get there How long does it take to get there? depends on the structure of the space of samples: how well-connected are they by the sampling steps?
Mixing time depends on the eigenvalues of the state space!
LDA Latent Dirichlet Allocation Parameter learning: Variational EM Not covered in 601-B Collapsed Gibbs Sampling What is collapsed Gibbs sampling?
Initialize all the Z’s randomly, then…. α Pick Z1 ~ Pr(Z1|z2,z3,α) Pick Z2 ~ Pr(Z2|z1,z3,α) θ Pick Z3 ~ Pr(Z3|z1,z2,α) Pick Z1 ~ Pr(Z1|z2,z3,α) Z1 Z2 Z3 Pick Z2 ~ Pr(Z2|z1,z3,α) Pick Z3 ~ Pr(Z3|z1,z2,α) X1 Y1 X2 Y2 X3 Y3 . Converges to samples from the true joint … and then we can estimate Pr(θ| α, sample of Z’s)
What’s this distribution? Initialize all the Z’s randomly, then…. α Pick Z1 ~ Pr(Z1|z2,z3,α) θ What’s this distribution? Z1 Z2 Z3 X1 Y1 X2 Y2 X3 Y3
α θ Z1 Z2 Z3 Pick Z1 ~ Pr(Z1|z2,z3,α) Initialize all the Z’s randomly, then…. α Pick Z1 ~ Pr(Z1|z2,z3,α) θ Simpler case: What’s this distribution? called a Dirichlet-multinomial and it looks like this: Z1 Z2 Z3 If there are k values for the Z’s and nk = # Z’s with value k, then it turns out: if n is pos int
α θ Z = (Z1,…, Zm) Z(-i) = (Z1,…, Zi-1 , Zi+1 ,…, Zm) Z1 Z2 Z3 Initialize all the Z’s randomly, then…. α Pick Z1 ~ Pr(Z1|z2,z3,α) It turns out that sampling from a Dirichlet-multinomial is very easy! θ Notation: k values for the Z ’s nk = # Z ’s with value k Z = (Z1,…, Zm) Z(-i) = (Z1,…, Zi-1 , Zi+1 ,…, Zm) nk (-i)= # Z ’s with value k excluding Zi Z1 Z2 Z3
β[k] α θ Z1 Z2 Z3 X1 X2 X3 Z1 X1 Z2 X2 β[k] η η What about with downstream evidence? α captures the constraints on Zi via θ (from “above”, “causal” direction) what about via β? Pr(Z) = (1/c) * Pr(Z|E+) Pr(E-|Z) θ Z1 Z2 Z3 X1 X2 X3 Z1 X1 Z2 X2 β[k] η β[k] η
Sampling for LDA Y βk α η Zdi Wdi Z(-d,i) = all the Z’s but Zd,i θd Nd Notation: k values for the Zd,i ’s Z(-d,i) = all the Z’s but Zd,i nw,k = # Zd,i ’s with value k paired with Wd,i=w n*,k = # Zd,i ’s with value k nw,k (-d,i)= nw,k excluding Zi,d n*,k (-d,i)= n*k excluding Zi,d n*,k d,(-i)= n*k from doc d excluding Zi,d Y Nd D θd βk K α η Zdi Wdi Pr(E-|Z) Pr(Z|E+) fraction of time Z=k in doc d fraction of time W=w in topic k
IMPLEMENTING LDA
Way way more detail
More detail
z=1 z=2 random z=3 unit height … …
What gets learned…..
In A Math-ier Notation N[*,k] N[d,k] N[*,*]=V M[w,k]
for each document d and word position j in d z[d,j] = k, a random topic N[d,k]++ W[w,k]++ where w = id of j-th word in d
for each pass t=1,2,…. for each document d and word position j in d z[d,j] = k, a new random topic update N, W to reflect the new assignment of z: N[d,k]++; N[d,k’] - - where k’ is old z[d,j] W[w,k]++; W[w,k’] - - where w is w[d,j]
z=1 z=2 random z=3 unit height … … You spend a lot of time sampling There’s a loop over all topics here in the sampler