1
Latent Dirichlet Allocation
2
Directed Graphical Models
William W. Cohen Machine Learning
3
DGMs: The “Burglar Alarm” example
Node ~ random variable. Arcs define the form of the probability distribution: Pr(Xi | X1,…,Xi-1, Xi+1,…,Xn) = Pr(Xi | parents(Xi)).
(Graph: Burglar → Alarm ← Earthquake, Alarm → Phone Call.)
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn’t care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!
4
DGMs: The “Burglar Alarm” example
Node ~ random variable. Arcs define the form of the probability distribution: Pr(Xi | X1,…,Xi-1, Xi+1,…,Xn) = Pr(Xi | parents(Xi)).
(Graph: Burglar → Alarm ← Earthquake, Alarm → Phone Call.)
Generative story (and joint distribution):
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm=a)
5
DGMs: The “Burglar Alarm” example
You can also compute other quantities, e.g.:
Pr(Burglar=true | PhoneCall=true)
Pr(Burglar=true | PhoneCall=true, Earthquake=true)
(Graph: Burglar → Alarm ← Earthquake, Alarm → Phone Call.)
Generative story:
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm=a)
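To make these computations concrete, here is a minimal Python sketch of inference by enumeration for this network; the CPT numbers are made-up illustration values, not taken from the slides.

# A minimal sketch of inference by enumeration for the burglar-alarm network.
# All CPT numbers below are made-up illustration values, not from the slides.
from itertools import product

P_B = {True: 0.001, False: 0.999}                    # Pr(Burglar)
P_E = {True: 0.002, False: 0.998}                    # Pr(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,      # Pr(Alarm=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_C = {True: 0.90, False: 0.05}                      # Pr(PhoneCall=true | Alarm)

def joint(b, e, a, c):
    """Pr(b, e, a, c) = Pr(b) Pr(e) Pr(a|b,e) Pr(c|a), the DGM factorization."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pc = P_C[a] if c else 1 - P_C[a]
    return P_B[b] * P_E[e] * pa * pc

def posterior_burglar(call=True, earthquake=None):
    """Pr(Burglar=true | PhoneCall=call [, Earthquake=earthquake]) by enumeration."""
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        if earthquake is not None and e != earthquake:
            continue
        p = joint(b, e, a, call)
        den += p
        if b:
            num += p
    return num / den

print(posterior_burglar(call=True))                   # Pr(B=t | C=t)
print(posterior_burglar(call=True, earthquake=True))  # Pr(B=t | C=t, E=t)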
6
Inference in Bayes nets
General problem: given evidence E1,…,Ek, compute P(X | E1,…,Ek) for any X.
Big assumption: the graph is a “polytree”: ≤ 1 undirected path between any two nodes X, Y.
Notation (see figure): parents U1, U2 of X; children Y1, Y2 of X; the children’s other parents Z1, Z2.
7
Inference in Bayes nets: P(X|E)
Step 1: a “fancy” (conditionalized) version of Bayes rule.
Step 2: apply it with A = X, B = E-, C = E+ (equation shown on the slide).
Step 3: E+ is the causal support for X; E- is the evidential support.
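The equations on this slide appear only as images; here is a hedged reconstruction of the decomposition they sketch, applying the conditionalized Bayes rule with A = X, B = E-, C = E+ and then the polytree/d-separation fact that E- is independent of E+ given X:

$$
P(X \mid E) \;=\; P(X \mid E^+, E^-) \;=\; \frac{P(E^- \mid X, E^+)\,P(X \mid E^+)}{P(E^- \mid E^+)} \;=\; \frac{1}{c}\, P(E^- \mid X)\, P(X \mid E^+)
$$

with c a normalizing constant: P(X | E+) is the causal support, P(E- | X) the evidential support.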
8
Inference in Bayes nets: P(X|E+)
The slide’s equation for P(X | E+) combines a CPT table lookup for X with recursive calls to P(· | E+) for X’s parents; each parent Uj contributes the evidence for Uj that doesn’t go thru X.
So far: a simple way of propagating requests for “belief due to causal evidence” up the tree.
I.e., info on Pr(X | E+) flows down.
9
Inference in Bayes nets: P(E-|X) simplified
The slide’s equation for P(E- | X) uses d-separation plus the polytree assumption, with recursive calls to P(E- | ·) for X’s children.
So far: a simple way of propagating requests for “belief due to evidential support” down the tree.
I.e., info on Pr(E- | X) flows up.
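For reference, a hedged reconstruction of the two recursions these slides show as images, in the standard polytree (Pearl) form, with parents U1,…,Um of X, children Y1,…,Yn, and Zjk the other parents of child Yj:

$$
P(x \mid E^+) \;=\; \sum_{u_1,\dots,u_m} P(x \mid u_1,\dots,u_m)\,\prod_i P\big(u_i \mid E^+_{U_i \setminus X}\big)
$$

$$
P(E^- \mid x) \;\propto\; \prod_j \sum_{y_j} P\big(E^-_{Y_j} \mid y_j\big) \sum_{\mathbf z_j} P\big(y_j \mid x, \mathbf z_j\big) \prod_k P\big(z_{jk} \mid E_{Z_{jk} \setminus Y_j}\big)
$$

The first is the CPT lookup for X times recursive calls for the parents; the second combines recursive calls for the children with CPTs and the spouse terms $P(z_{jk} \mid E_{Z_{jk} \setminus Y_j})$.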
10
Recap: Message Passing for BP
We reduced P(X|E) to a product of two recursively calculated parts:
P(X=x | E+), i.e., the CPT for X combined with the product of “forward” messages from X’s parents;
P(E- | X=x), i.e., a combination of “backward” messages from X’s children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E).
This can also be implemented by message passing (belief propagation).
Messages are distributions, i.e., vectors.
11
Recap: Message Passing for BP
Top-level algorithm:
Pick one vertex as the “root”; any node with only one edge is a “leaf”.
Pass messages from the leaves to the root.
Pass messages from the root to the leaves.
Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E).
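As a concrete (if minimal) illustration of the two-pass schedule, here is a Python sketch of message passing on a chain X1 → X2 → X3, the simplest polytree; the prior, transition CPT, and evidence below are made-up values, not from the slides.

# Two-pass message passing on a chain X1 -> X2 -> X3 (the simplest polytree).
# Prior, transition matrix, and evidence are made-up illustration values.
import numpy as np

prior = np.array([0.6, 0.4])              # P(X1)
T = np.array([[0.7, 0.3],                 # T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.2, 0.8]])
n = 3                                     # chain length
evidence = {2: 1}                         # node X3 (index 2) observed in state 1

def local_lambda(t):
    """P(local evidence at node t | X_t = x), as a vector over x."""
    vec = np.ones(2)
    if t in evidence:
        vec = np.zeros(2)
        vec[evidence[t]] = 1.0
    return vec

# Forward pass: pi[t] is proportional to P(X_t | evidence at or above node t)
pi = [None] * n
pi[0] = prior * local_lambda(0)
pi[0] /= pi[0].sum()
for t in range(1, n):
    pi[t] = (pi[t - 1] @ T) * local_lambda(t)
    pi[t] /= pi[t].sum()

# Backward pass: lam[t](x) = P(evidence strictly below node t | X_t = x)
lam = [np.ones(2) for _ in range(n)]
for t in range(n - 2, -1, -1):
    lam[t] = T @ (lam[t + 1] * local_lambda(t + 1))

# Combine the two messages at every node: P(X_t | all evidence) ~ pi[t] * lam[t]
for t in range(n):
    post = pi[t] * lam[t]
    print(f"P(X{t+1} | evidence) =", post / post.sum())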
12
NAÏVE BAYES AS A DGM
13
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
Pick a label yd from Pr(Y)
For each word in d (of length Nd):
Pick a word xid from Pr(X|Y=yd)
(Plate diagram: Y → X, with X repeated “for every X” inside a plate of size Nd, nested in a plate of size D. Example CPTs on the slide: Pr(Y=y) with onion 0.3, economist 0.7; Pr(X|Y=y) with rows aardvark 0.034, ai 0.0067, …, zymurgy, for the classes onion and economist.)
14
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
Pick a label yd from Pr(Y)
For each word in d (of length Nd):
Pick a word xid from Pr(X|Y=yd)
Not described: how do we smooth for classes? For multinomials? How many classes are there? …
(Plate diagram: Y → X, plates of size Nd and D.)
15
Recap: smoothing for a binomial
Estimate θ = P(heads) for a binomial with MLE and with MAP (formulas on the slide; written out below):
MLE: maximize Pr(D|θ); the estimate uses #heads and #tails.
MAP: maximize Pr(D|θ)Pr(θ); the estimate also uses #imaginary heads and #imaginary tails.
Smoothing = prior over the parameter θ.
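Written out (a standard reconstruction, in pseudocount form, of the formulas shown as images on the slide):

$$
\hat\theta_{\text{MLE}} \;=\; \frac{\#\text{heads}}{\#\text{heads} + \#\text{tails}},
\qquad
\hat\theta_{\text{MAP}} \;=\; \frac{\#\text{heads} + \#\text{imaginary heads}}{\#\text{heads} + \#\text{tails} + \#\text{imaginary heads} + \#\text{imaginary tails}}
$$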
16
Smoothing for a binomial as a DGM
MAP for dataset D with α1 heads and α2 tails:
(DGM: γ = (#imaginary heads, #imaginary tails) → θ → D.)
MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ according to the posterior distribution.
Also: inference in a simple graph can be intractable if the conditional distributions are complicated.
17
Smoothing for a binomial as a DGM
MAP for dataset D with α1 heads and α2 tails, replacing D=d with F = f1, f2, …:
(DGM: γ = (#imaginary heads, #imaginary tails) → θ → F.)
A Dirichlet prior has imaginary data for each of the K classes.
Comment: the conjugate prior for a binomial is a Beta; the conjugate prior for a multinomial is called a Dirichlet.
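For reference, the conjugacy fact behind that comment (a standard result, not taken from the slide): if θ ~ Dir(γ1,…,γK) and we observe counts n1,…,nK, then

$$
\theta \mid \text{data} \;\sim\; \mathrm{Dir}(\gamma_1 + n_1, \dots, \gamma_K + n_K),
\qquad
\mathbb{E}[\theta_k \mid \text{data}] \;=\; \frac{n_k + \gamma_k}{\sum_{k'} \big(n_{k'} + \gamma_{k'}\big)},
$$

so the γk act exactly like imaginary counts for each of the K classes.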
18
Recap: Naïve Bayes as a DGM
Now: let’s turn Bayes up to 11 for naïve Bayes….
(Plate diagram: Y → X, plates of size Nd and D.)
19
A more Bayesian Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y = 1…K, draw a multinomial γ[y] over the vocabulary.
For each document d = 1…D:
Pick a label yd from π
For each word in d (of length Nd):
Pick a word xid from γ[yd]
(Plate diagram: α → π → Y → W ← γ ← β, with plates of size Nd, D, and K.)
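A minimal Python sketch of this generative story; the vocabulary, K, D, document lengths, and hyperparameter values are all made up for illustration.

# A minimal sketch of the "more Bayesian" Naive Bayes generative story above.
# Vocabulary, K, D, document lengths, and hyperparameter values are made up.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["aardvark", "ai", "onion", "economist", "zymurgy"]
K, D, V = 2, 4, len(vocab)
alpha = np.full(K, 1.0)          # Dirichlet prior over class proportions pi
beta = np.full(V, 0.1)           # Dirichlet prior over each class's word distribution

pi = rng.dirichlet(alpha)                          # multinomial over the K classes
gamma = [rng.dirichlet(beta) for _ in range(K)]    # one multinomial over the vocab per class

docs = []
for d in range(D):
    y_d = rng.choice(K, p=pi)                      # pick a label y_d from pi
    N_d = rng.integers(5, 10)                      # document length (arbitrary here)
    words = rng.choice(vocab, size=N_d, p=gamma[y_d])  # pick each word from gamma[y_d]
    docs.append((y_d, list(words)))

for y_d, words in docs:
    print(y_d, " ".join(words))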
20
Unsupervised Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y = 1…K, draw a multinomial γ[y] over the vocabulary.
For each document d = 1…D:
Pick a label yd from π
For each word in d (of length Nd):
Pick a word xid from γ[yd]
(Plate diagram: α → π → Y → W ← γ ← β, with plates of size Nd, D, and K.)
21
Unsupervised Naïve Bayes
The generative story is the same; what we do is different.
We want to both find values of γ and π and find values of Y for each example.
EM is a natural algorithm.
(Same plate diagram as before.)
22
LDA AS A DGM
23
The LDA Topic Model
25
Unsupervised NB vs LDA
Unsupervised NB: one class prior, one Y per document.
LDA: a different class distribution for each document, one Y per word.
(Two plate diagrams compared side by side: α, π, Y, W, γ, β with plates Nd, D, K.)
26
Unsupervised NB vs LDA
Unsupervised NB: one class prior π, one Y per document.
LDA: a different class distribution θd for each document, one Z per word (topic Zdi for word Wdi).
(Plate diagrams: NB as before; LDA with α → θd → Zdi → Wdi ← γk, plates Nd, D, K, and prior β on the γk.)
27
LDA (Blei’s motivation): start with the BOW assumption
Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words).
For each document d = 1,…,M:
Generate θd ~ D1(…)
For each word n = 1,…,Nd:
generate wn ~ D2(· | θdn)
Now pick your favorite distributions for D1, D2.
28
Unsupervised NB vs LDA
Unsupervised NB clusters documents into latent classes.
LDA clusters word occurrences into latent classes (topics).
The smoothness of βk: the same word suggests the same topic.
The smoothness of θd: the same document suggests the same topic.
(Plate diagram for LDA: α → θd → Zdi → Wdi ← βk, plates Nd, D, K.)
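For contrast with the unsupervised-NB sketch above, a minimal Python sketch of LDA’s generative story (one θd per document, one topic per word occurrence); again all sizes and hyperparameter values are made up.

# A minimal sketch of LDA's generative story, to contrast with unsupervised NB:
# one topic distribution theta_d per document, one topic z_{d,i} per word.
# Sizes and hyperparameter values are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["aardvark", "ai", "onion", "economist", "zymurgy"]
K, D, V = 2, 4, len(vocab)
alpha = np.full(K, 0.5)                 # prior over per-document topic proportions
eta = np.full(V, 0.1)                   # prior over per-topic word distributions

beta = [rng.dirichlet(eta) for _ in range(K)]   # beta_k: word distribution for topic k

for d in range(D):
    theta_d = rng.dirichlet(alpha)              # topic mixture for this document
    N_d = rng.integers(5, 10)
    words = []
    for i in range(N_d):
        z_di = rng.choice(K, p=theta_d)         # a topic for *this word occurrence*
        words.append(rng.choice(vocab, p=beta[z_di]))
    print(d, " ".join(words))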
29
LDA’s view of a document
Mixed membership model
30
LDA topics: top words w by Pr(w|Z=k)
31
SVM using 50 features: Pr(Z=k|θd)
50 topics vs all words, SVM
32
Gibbs Sampling for LDA
33
LDA (Latent Dirichlet Allocation)
Parameter learning: Variational EM, or Collapsed Gibbs Sampling.
Wait, why is sampling called “learning” here? Here’s the idea….
34
LDA Gibbs sampling – works for any directed model!
Applicable when the joint distribution is hard to evaluate but the conditional distributions are known.
The sequence of samples comprises a Markov chain; the stationary distribution of the chain is the joint distribution.
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
35
I’ll assume we know parameters for Pr(X|Z) and Pr(Y|X)
(Diagram: α → θ → Zi → Xi → Yi for i = 1, 2; equivalently a plate of size 2 over Z, X, Y.)
36
Initialize all the hidden variables randomly, then repeat:
Pick Z1 ~ Pr(Z | x1, y1, θ)
Pick Z2 ~ Pr(Z | x2, y2, θ)
Pick θ ~ Pr(θ | z1, z2, α), i.e., pick from the posterior
In a broad range of cases these will eventually converge to samples from the true joint distribution, so we will have (a sample of) the true θ.
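A minimal Python sketch of this alternating Gibbs loop on a toy version of the model, assuming θ ~ Beta(α, α), Zi ~ Bernoulli(θ), and a made-up known table for Pr(X|Z); the Y layer is omitted because, once Xi is observed, Yi is independent of Zi given Xi, so it does not change Pr(Zi | xi, yi, θ).

# Gibbs sampling for a toy model: theta ~ Beta(a, a), Z_i ~ Bernoulli(theta),
# X_i | Z_i from a known (made-up) noisy channel.  Y is dropped: given X_i,
# Y_i is independent of Z_i, so it cancels out of Pr(Z_i | x_i, y_i, theta).
import numpy as np

rng = np.random.default_rng(0)
a = 1.0                                   # symmetric Beta hyperparameter (the "alpha")
p_x_given_z = np.array([[0.9, 0.1],       # Pr(X=x | Z=z): rows z=0,1, cols x=0,1
                        [0.2, 0.8]])
x_obs = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # observed X_i (made up)
n = len(x_obs)

# Initialize the hidden variables randomly, then alternate.
z = rng.integers(0, 2, size=n)
theta = 0.5
for it in range(200):
    # Pick each Z_i ~ Pr(Z_i | x_i, theta), proportional to Pr(Z_i|theta) Pr(x_i|Z_i)
    for i in range(n):
        w = np.array([1 - theta, theta]) * p_x_given_z[:, x_obs[i]]
        z[i] = rng.choice(2, p=w / w.sum())
    # Pick theta ~ Pr(theta | z, alpha): the Beta posterior ("pick from posterior")
    heads = z.sum()
    theta = rng.beta(a + heads, a + n - heads)

# After burn-in, (z, theta) are approximately samples from the true joint posterior.
print("last sample of theta:", theta)
print("last sample of z:    ", z)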
37
Why does Gibbs sampling work?
Basic claim: when you sample x ~ P(X | y1,…,yk), then if y1,…,yk were sampled from the true joint, x will be sampled from the true joint.
So the true joint is a “fixed point”: you tend to stay there if you ever get there.
How long does it take to get there? It depends on the structure of the space of samples: how well-connected are they by the sampling steps?
39
Mixing time depends on the eigenvalues of the Markov chain’s transition matrix over the state space!
40
LDA (Latent Dirichlet Allocation)
Parameter learning: Variational EM, or Collapsed Gibbs Sampling. (“Not covered in 601-B” callout on the slide.)
What is collapsed Gibbs sampling?
41
Initialize all the Z’s randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Pick Z2 ~ Pr(Z2 | z1, z3, α)
Pick Z3 ~ Pr(Z3 | z1, z2, α)
…and repeat. Note that θ no longer appears in the conditionals: it has been integrated (“collapsed”) out.
This converges to samples from the true joint, and then we can estimate Pr(θ | α, sample of Z’s).
(Diagram: α → θ → Z1, Z2, Z3; each Zi → Xi → Yi.)
42
What’s this distribution?
Initialize all the Z’s randomly, then pick Z1 ~ Pr(Z1 | z2, z3, α). What’s this distribution?
(Same diagram: α → θ → Z1, Z2, Z3; each Zi → Xi → Yi.)
43
Initialize all the Z’s randomly, then pick Z1 ~ Pr(Z1 | z2, z3, α).
Simpler case (no downstream evidence, just α → θ → Z’s): what’s this distribution? It’s called a Dirichlet-multinomial.
If there are K values for the Z’s and nk = # Z’s with value k, then it turns out to look like the formula on the slide, which uses Γ(n) = (n−1)! when n is a positive integer (written out below).
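A hedged reconstruction of the formula the slide shows as an image, i.e., the standard Dirichlet-multinomial marginal (with m = total number of Z’s):

$$
\Pr(Z_1,\dots,Z_m \mid \alpha)
\;=\; \int \Pr(Z_1,\dots,Z_m \mid \theta)\,\Pr(\theta \mid \alpha)\, d\theta
\;=\; \frac{\Gamma\!\big(\sum_k \alpha_k\big)}{\Gamma\!\big(m + \sum_k \alpha_k\big)}
\prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)},
\qquad \Gamma(n) = (n-1)!\ \text{if } n \text{ is a positive integer.}
$$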
44
Initialize all the Z’s randomly, then pick Z1 ~ Pr(Z1 | z2, z3, α).
It turns out that sampling from a Dirichlet-multinomial is very easy!
Notation: K values for the Z’s; nk = # Z’s with value k; Z = (Z1,…, Zm); Z(-i) = (Z1,…, Zi-1, Zi+1,…, Zm); nk(-i) = # Z’s with value k, excluding Zi.
(The resulting conditional is written out below.)
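The resulting conditional, in the slide’s notation (a standard result; the slide’s own equation appears only as an image):

$$
\Pr\!\big(Z_i = k \mid Z^{(-i)}, \alpha\big)
\;=\; \frac{n_k^{(-i)} + \alpha_k}{\,m - 1 + \sum_{k'} \alpha_{k'}\,}
$$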
45
What about with downstream evidence?
α captures the constraints on Zi via θ (from “above”, the “causal” direction); what about via β[k]?
Pr(Z | E) = (1/c) · Pr(Z | E+) · Pr(E- | Z)
(Diagram: α → θ → Z1, Z2, Z3; each Zi → Xi, with word multinomials β[k] and their Dirichlet prior η.)
46
Sampling for LDA
Notation: K values for the Zd,i’s;
Z(-d,i) = all the Z’s but Zd,i;
nw,k = # Zd,i’s with value k paired with Wd,i = w;
n*,k = # Zd,i’s with value k;
nw,k(-d,i) = nw,k excluding Zd,i;
n*,k(-d,i) = n*,k excluding Zd,i;
n*,k d,(-i) = n*,k from doc d, excluding Zd,i.
The update multiplies Pr(Z|E+), the fraction of time Z=k in doc d, by Pr(E-|Z), the fraction of time W=w in topic k.
(Plate diagram: α → θd → Zdi → Wdi ← βk ← η, plates Nd, D, K; the full update is written out below.)
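Putting the two factors together gives the standard collapsed Gibbs update for LDA, a hedged reconstruction of the slide’s equation assuming symmetric scalar priors α and η (V = vocabulary size, w = Wd,i):

$$
\Pr\!\big(Z_{d,i} = k \mid Z^{(-d,i)}, W, \alpha, \eta\big)
\;\propto\;
\big(n_{*,k}^{\,d,(-i)} + \alpha\big)\;
\frac{n_{w,k}^{(-d,i)} + \eta}{\,n_{*,k}^{(-d,i)} + V\eta\,}
$$

The first factor is the Pr(Z|E+) piece (fraction of time Z = k in doc d); the second is the Pr(E-|Z) piece (fraction of time W = w in topic k).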
47
IMPLEMENTING LDA
48
Way way more detail
49
More detail
52
(Figure: segments for z=1, z=2, z=3, … scaled to unit total height; a random draw selects one.)
53
What gets learned…..
54
In A Math-ier Notation: counters N[*,k], N[d,k], N[*,*]=V, M[w,k] (labels from the slide’s equation).
55
for each document d and word position j in d
z[d,j] = k, a random topic
N[d,k]++
W[w,k]++ where w = id of the j-th word in d
56
for each pass t = 1, 2, ….
  for each document d and word position j in d:
    z[d,j] = k, a new random topic
    update N, W to reflect the new assignment of z:
      N[d,k]++; N[d,k']-- where k' is the old z[d,j]
      W[w,k]++; W[w,k']-- where w is w[d,j]
(A runnable sketch of this sampler is given below.)
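A minimal runnable Python sketch of the sampler described in the last two slides, keeping the same counters N[d,k] and W[w,k]; the toy corpus, K, and the hyperparameters α and η are made up for illustration.

# Collapsed Gibbs sweep for LDA with the counters from the slides:
# N[d,k] = # words in doc d assigned topic k, W[w,k] = # times word w gets topic k.
# Toy corpus, K, alpha, and eta are made up.
import numpy as np

rng = np.random.default_rng(0)
docs = [["ai", "ai", "learning", "onion"],
        ["onion", "satire", "satire", "ai"],
        ["learning", "learning", "ai", "satire"]]
vocab = sorted({w for doc in docs for w in doc})
word_id = {w: i for i, w in enumerate(vocab)}
D, V, K = len(docs), len(vocab), 2
alpha, eta = 0.5, 0.1

N = np.zeros((D, K))            # N[d,k]
W = np.zeros((V, K))            # W[w,k]
Wsum = np.zeros(K)              # # words currently assigned to topic k
z = []                          # z[d][j] = topic of the j-th word of doc d

# Initialization: z[d,j] = a random topic; N[d,k]++, W[w,k]++
for d, doc in enumerate(docs):
    z.append([])
    for j, w in enumerate(doc):
        k = rng.integers(K)
        z[d].append(k)
        N[d, k] += 1
        W[word_id[w], k] += 1
        Wsum[k] += 1

# Gibbs passes: resample each z[d,j] from its conditional, updating the counters.
for t in range(100):
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            wid, k_old = word_id[w], z[d][j]
            # remove the current assignment from the counts
            N[d, k_old] -= 1; W[wid, k_old] -= 1; Wsum[k_old] -= 1
            # build the conditional over all K topics (vectorized here;
            # conceptually this is the loop over topics inside the sampler)
            p = (N[d] + alpha) * (W[wid] + eta) / (Wsum + V * eta)
            k_new = rng.choice(K, p=p / p.sum())
            z[d][j] = k_new
            N[d, k_new] += 1; W[wid, k_new] += 1; Wsum[k_new] += 1

print("N[d,k]:\n", N)
print("W[w,k]:\n", W)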
58
(Figure: segments for z=1, z=2, z=3, … scaled to unit total height; a random draw selects one.)
You spend a lot of time sampling: there’s a loop over all topics here in the sampler.