Latent Dirichlet Allocation


1 Latent Dirichlet Allocation

2 Directed Graphical Models
William W. Cohen, Machine Learning

3 DGMs: The “Burglar Alarm” example
Node ~ random variable. Arcs define the form of the probability distribution: Pr(Xi | X1, …, Xi-1) = Pr(Xi | parents(Xi)), for nodes listed in topological order. Nodes: Burglar, Earthquake, Alarm, Phone Call. The story: your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes; the Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

4 DGMs: The “Burglar Alarm” example
Node ~ random variable. Arcs define the form of the probability distribution: Pr(Xi | X1, …, Xi-1) = Pr(Xi | parents(Xi)), for nodes in topological order. Generative story (and joint distribution):
- Pick b ~ Pr(Burglar), a binomial
- Pick e ~ Pr(Earthquake), a binomial
- Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials (one per parent configuration)
- Pick c ~ Pr(PhoneCall | Alarm=a), two binomials
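As a concrete illustration of this generative story, here is a minimal Python sketch. The probability values are assumptions borrowed from the classic textbook alarm example; the slide itself gives no numbers.

```python
import random

# Illustrative CPTs (assumed values, not from the slides).
P_BURGLAR = 0.001
P_EARTHQUAKE = 0.002
P_ALARM = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # Pr(Alarm=T | B, E)
P_CALL = {True: 0.90, False: 0.05}                        # Pr(Call=T | Alarm)

def sample_joint():
    """Draw one (b, e, a, c) tuple by following the arrows of the DGM."""
    b = random.random() < P_BURGLAR
    e = random.random() < P_EARTHQUAKE
    a = random.random() < P_ALARM[(b, e)]
    c = random.random() < P_CALL[a]
    return b, e, a, c
```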

5 DGMs: The “Burglar Alarm” example
You can also compute other quantities, e.g. Pr(Burglar=true | PhoneCall=true), or Pr(Burglar=true | PhoneCall=true, Earthquake=true). Generative story:
- Pick b ~ Pr(Burglar), a binomial
- Pick e ~ Pr(Earthquake), a binomial
- Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
- Pick c ~ Pr(PhoneCall | Alarm=a)
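One way to compute such queries is brute-force enumeration of the joint distribution. The sketch below reuses the hypothetical CPT tables from the previous snippet and is meant only to make the query concrete.

```python
from itertools import product

def p_burglar_given_call(earthquake=None):
    """Pr(Burglar=true | PhoneCall=true), optionally also conditioning on
    Earthquake, by brute-force enumeration of the joint distribution."""
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        if earthquake is not None and e != earthquake:
            continue                               # clamp observed Earthquake
        joint = ((P_BURGLAR if b else 1 - P_BURGLAR) *
                 (P_EARTHQUAKE if e else 1 - P_EARTHQUAKE) *
                 (P_ALARM[(b, e)] if a else 1 - P_ALARM[(b, e)]) *
                 P_CALL[a])                        # PhoneCall observed true
        den += joint
        if b:
            num += joint
    return num / den

print(p_burglar_given_call())                  # Pr(Burglar | call)
print(p_burglar_given_call(earthquake=True))   # Pr(Burglar | call, earthquake)
```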

6 Inference in Bayes nets
General problem: given evidence E1, …, Ek, compute P(X | E1, …, Ek) for any X. Big assumption: the graph is a "polytree", i.e. there is at most one undirected path between any pair of nodes X, Y. Notation (see the diagram): U1, U2 are the parents of X; Y1, Y2 are its children; Z1, Z2 are the children's other parents.

7 Inference in Bayes nets: P(X|E)
Step 1: a fancy version of Bayes rule, P(A | B, C) = α P(B | A, C) P(A | C). Step 2: apply it with A = X, B = E-, C = E+: P(X | E) = P(X | E-, E+) = α P(E- | X, E+) P(X | E+). Step 3: simplify P(E- | X, E+) = P(E- | X) using d-separation. Here E+ is the causal support (evidence reaching X through its parents) and E- is the evidential support (evidence reaching X through its children).

8 Inference in Bayes nets: P(X|E+)
P(X=x | E+) = Σu P(X=x | U=u) Πj P(uj | E+Uj\X): a CPT table lookup for P(X | u), times a recursive call to P(· | E+) for each parent Uj, using only the evidence for Uj that doesn't go through X. So far: a simple way of propagating requests for "belief due to causal evidence" up the tree; i.e., information about Pr(X|E+) flows down.

9 Inference in Bayes nets: P(E-|X) simplified
P(E- | X=x) = Πk P(E-Yk | X=x): by d-separation plus the polytree assumption, the evidence below each child Yk is independent given X, and each factor is a recursive call to P(E- | ·) at the child. So far: a simple way of propagating requests for "belief due to evidential support" down the tree; i.e., information about Pr(E-|X) flows up.

10 Recap: Message Passing for BP
We reduced P(X|E) to a product of two recursively calculated parts:
- P(X=x | E+), i.e., the CPT for X combined with the product of "forward" messages from X's parents
- P(E- | X=x), i.e., a combination of "backward" messages from X's children, their CPTs, and terms P(Z | EZ\Yk) for the children's other parents Z, each a simpler instance of P(X|E)
This can also be implemented by message passing (belief propagation). Messages are distributions, i.e., vectors.

11 Recap: Message Passing for BP
Top-level algorithm:
- Pick one vertex as the "root"; any node with only one edge is a "leaf"
- Pass messages from the leaves to the root
- Pass messages from the root to the leaves
- Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E)
A tiny numeric example follows below.
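To make the two messages concrete, here is a minimal numeric sketch on a three-node chain X1 → X2 → X3 (a trivial polytree) with X3 observed. All CPT values are made up for illustration; this shows only how the forward and backward messages combine at X2, not a full BP implementation.

```python
import numpy as np

p_x1 = np.array([0.6, 0.4])                 # P(X1)
p_x2_given_x1 = np.array([[0.7, 0.3],       # rows: x1, cols: x2
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.9, 0.1],       # rows: x2, cols: x3
                          [0.4, 0.6]])

x3_obs = 1                                  # evidence: X3 = 1

# "Forward" (causal) message into X2: P(X2 | E+) = sum_x1 P(x1) P(X2 | x1)
forward = p_x1 @ p_x2_given_x1

# "Backward" (evidential) message into X2: P(E- | X2) = P(X3 = obs | X2)
backward = p_x3_given_x2[:, x3_obs]

# Combine and normalize: P(X2 | E) ∝ P(X2 | E+) * P(E- | X2)
posterior = forward * backward
posterior /= posterior.sum()
print(posterior)
```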

12 NAÏVE BAYES AS A DGM

13 Naïve Bayes as a DGM
For each document d in the corpus (of size D):
- Pick a label yd from Pr(Y)
- For each word in d (of length Nd): pick a word xid from Pr(X | Y=yd)
(Plate diagram: Y → X, with X inside a plate of size Nd, nested in a plate of size D. Example tables: Pr(Y=y): onion 0.3, economist 0.7; Pr(X | Y=y): one column per class, one row per vocabulary word, e.g. aardvark 0.034, ai 0.0067, …, zymurgy.)
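A minimal sketch of this generative story in Python; the class prior, vocabulary, and word distributions below are toy assumptions, not the numbers from the slide's tables.

```python
import numpy as np

rng = np.random.default_rng(0)

classes = ["onion", "economist"]
prior = np.array([0.3, 0.7])                       # Pr(Y)
vocab = ["aardvark", "ai", "market", "zymurgy"]
word_probs = np.array([[0.10, 0.40, 0.10, 0.40],   # Pr(X | Y=onion)
                       [0.05, 0.15, 0.70, 0.10]])  # Pr(X | Y=economist)

def generate_doc(n_words):
    y = rng.choice(len(classes), p=prior)          # pick a label yd ~ Pr(Y)
    words = rng.choice(vocab, size=n_words, p=word_probs[y])  # xid ~ Pr(X|Y=yd)
    return classes[y], list(words)

print(generate_doc(5))
```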

14 Naïve Bayes as a DGM
For each document d in the corpus (of size D):
- Pick a label yd from Pr(Y)
- For each word in d (of length Nd): pick a word xid from Pr(X | Y=yd)
Not described: how do we smooth for the classes? For the multinomials? How many classes are there? …. (Plate diagram as before.)

15 Recap: smoothing for a binomial
Estimate θ = P(heads) for a binomial.
- MLE: maximize Pr(D|θ), giving θ = #heads / (#heads + #tails)
- MAP: maximize Pr(D|θ)Pr(θ), giving θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)
Smoothing = a prior over the parameter θ.
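Written with an explicit Beta(a, b) prior, these are the standard estimates (added here for reference; in this convention "#imaginary heads" corresponds to a - 1):

```latex
\[
\hat{\theta}_{\mathrm{MLE}} = \frac{n_H}{n_H + n_T},
\qquad
p(\theta \mid D) \propto \theta^{\,n_H + a - 1}(1-\theta)^{\,n_T + b - 1},
\qquad
\hat{\theta}_{\mathrm{MAP}} = \frac{n_H + a - 1}{n_H + n_T + a + b - 2}
\]
```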

16 Smoothing for a binomial as a DGM
MAP for a dataset D with α1 heads and α2 tails. (Diagram: γ = (#imaginary heads, #imaginary tails) → θ → D.) MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ under the posterior distribution. Also: inference even in a simple graph can be intractable if the conditional distributions are complicated.

17 Smoothing for a binomial as a DGM
MAP for a dataset D with α1 heads and α2 tails, now replacing the single outcome D=d with the sequence of flips F = f1, f2, … (Diagram: γ → θ → F, where F contains α1 + α2 flips.) A Dirichlet prior has imaginary data for each of the K classes. Comment: the conjugate prior for a binomial is a Beta; the conjugate prior for a multinomial is called a Dirichlet.
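For reference, the conjugacy facts the slide alludes to, written out in the usual parameterizations (standard results, not spelled out on the slide):

```latex
\[
\theta \sim \mathrm{Beta}(a, b),\ \text{data with } n_H \text{ heads, } n_T \text{ tails}
\;\Rightarrow\;
\theta \mid D \sim \mathrm{Beta}(a + n_H,\; b + n_T)
\]
\[
\boldsymbol{\theta} \sim \mathrm{Dir}(\beta_1, \dots, \beta_K),\ \text{class counts } n_1, \dots, n_K
\;\Rightarrow\;
\boldsymbol{\theta} \mid D \sim \mathrm{Dir}(\beta_1 + n_1, \dots, \beta_K + n_K)
\]
```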

18 Recap: Naïve Bayes as a DGM
Now: let's turn Bayes up to 11 for naïve Bayes…. (Plate diagram: Y → X, plates Nd and D.)

19 A more Bayesian Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y = 1…K, draw a multinomial γ[y] over the vocabulary.
For each document d = 1…D:
- Pick a label yd from π
- For each word in d (of length Nd): pick a word xid from γ[yd]
(Plate diagram: α → π → Y → W ← γ ← β, with W in a plate of size Nd, Y and W in a plate of size D, and γ in a plate of size K.) A sampling sketch follows below.
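A minimal sketch of this fully Bayesian Naive Bayes generative story; the hyperparameters, vocabulary size, and corpus size are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 2, 5, 3                    # classes, vocabulary size, documents
alpha = np.ones(K)                   # Dirichlet prior over class proportions
beta = np.ones(V)                    # Dirichlet prior over each class's word dist.

pi = rng.dirichlet(alpha)            # pi ~ Dir(alpha): multinomial over classes
gamma = rng.dirichlet(beta, size=K)  # gamma[y] ~ Dir(beta): word dist. per class

docs = []
for d in range(D):
    y_d = rng.choice(K, p=pi)                      # label y_d ~ pi
    n_d = rng.integers(4, 8)                       # document length (arbitrary)
    words = rng.choice(V, size=n_d, p=gamma[y_d])  # x_id ~ gamma[y_d]
    docs.append((y_d, words.tolist()))
print(docs)
```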

20 Unsupervised Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y = 1…K, draw a multinomial γ[y] over the vocabulary.
For each document d = 1…D:
- Pick a label yd from π
- For each word in d (of length Nd): pick a word xid from γ[yd]
(Same plate diagram as above.)

21 Unsupervised Naïve Bayes
The generative story is the same; what we do with it is different. We want to both find values of γ and π and find values of Y for each example. EM is a natural algorithm; a minimal sketch follows below. (Plate diagram as before.)
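A minimal EM sketch for unsupervised Naive Bayes on bag-of-words count data; X is a D x V count matrix, and the number of classes, smoothing, and iteration count are assumptions for illustration, not values from the slides.

```python
import numpy as np

def unsupervised_nb_em(X, K=2, n_iter=50, smooth=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi = np.full(K, 1.0 / K)                      # class proportions
    gamma = rng.dirichlet(np.ones(V), size=K)     # per-class word distributions
    for _ in range(n_iter):
        # E-step: responsibilities r[d, k] ∝ pi[k] * prod_w gamma[k, w]^X[d, w]
        log_r = np.log(pi) + X @ np.log(gamma).T  # (D x K), in log space
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and gamma from the soft counts (with smoothing)
        pi = r.mean(axis=0)
        counts = r.T @ X + smooth                 # (K x V) expected word counts
        gamma = counts / counts.sum(axis=1, keepdims=True)
    return pi, gamma, r
```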

22 LDA AS A DGM

23 The LDA Topic Model

24 (figure-only slide)

25 Unsupervised NB vs LDA
(Two plate diagrams, side by side, with nodes α, β, π, γ[1..K], Y, W and plates Nd, D.) Unsupervised NB: one class prior, one Y per document. LDA: a different class distribution for each document, one Y per word.

26 Unsupervised NB vs LDA
(Two plate diagrams.) Unsupervised NB: one class prior π, one Y per document, words W. LDA: a different class distribution θd for each document, one topic Zdi per word Wdi, topic-word distributions γk for k = 1…K, with priors α and β.

27 LDA Blei’s motivation: start with BOW assumption
Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words).
For each document d = 1, …, M:
- Generate θd ~ D1(…)
- For each word n = 1, …, Nd: generate wdn ~ D2(· | θd)
Now pick your favorite distributions for D1, D2.

28 Unsupervised NB vs LDA
(Plate diagrams for both models, as on the previous slides.) Unsupervised NB clusters documents into latent classes. LDA clusters word occurrences into latent classes (topics). The smoothness of βk → the same word suggests the same topic. The smoothness of θd → the same document suggests the same topic. A generative sketch of LDA follows below.
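A minimal sketch of the LDA generative story; the hyperparameters, sizes, and vocabulary are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 3, 10, 5                 # topics, vocabulary size, documents
alpha, eta = 0.5, 0.1              # symmetric Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)   # topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))  # per-document topic proportions
    n_d = rng.integers(8, 15)                   # document length (arbitrary)
    z = rng.choice(K, size=n_d, p=theta_d)      # topic z_di for each position
    w = [rng.choice(V, p=beta[zi]) for zi in z] # word w_di ~ beta[z_di]
    docs.append(w)
print(docs)
```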

29 LDA’s view of a document
Mixed membership model

30 LDA topics: top words w by Pr(w|Z=k)

31 SVM using 50 features: Pr(Z=k|θd)
50 topics vs all words, SVM

32 Gibbs Sampling for LDA

33 LDA: Latent Dirichlet Allocation
Parameter learning: Variational EM, or Collapsed Gibbs Sampling. Wait, why is sampling called "learning" here? Here's the idea….

34 LDA Gibbs sampling – works for any directed model!
Applicable when the joint distribution is hard to evaluate but the conditional distributions are known.
- The sequence of samples comprises a Markov chain.
- The stationary distribution of the chain is the joint distribution.
- Key capability: sample one latent variable from its distribution given the other latent variables and the observed variables.
A toy example follows below.
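A minimal, non-LDA illustration of Gibbs sampling: draw from a standard bivariate normal with correlation rho using its two known conditionals (X | Y=y ~ N(rho*y, 1-rho^2) and symmetrically for Y). The parameter values are arbitrary.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    samples = []
    sd = np.sqrt(1 - rho ** 2)            # conditional standard deviation
    for _ in range(n_samples):
        x = rng.normal(rho * y, sd)       # x ~ P(X | Y=y)
        y = rng.normal(rho * x, sd)       # y ~ P(Y | X=x)
        samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal()
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # ≈ rho once the chain has mixed
```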

35 I’ll assume we know parameters for Pr(X|Z) and Pr(Y|X)
(Diagram: α → θ; θ → Z1 and Z2; Z1 → X1 → Y1; Z2 → X2 → Y2; shown both unrolled and as a plate of size 2 over Z → X → Y.) I'll assume we know the parameters for Pr(X|Z) and Pr(Y|X).

36 Initialize all the hidden variables randomly, then….
- Pick Z1 ~ Pr(Z | x1, y1, θ)
- Pick θ ~ Pr(θ | z1, z2, α)  (pick from the posterior)
- Pick Z2 ~ Pr(Z | x2, y2, θ)
- … and repeat.
So we will have (a sample of) the true θ; in a broad range of cases these will eventually converge to samples from the true joint distribution.

37 Why does Gibbs sampling work?
Basic claim: when you sample x ~ P(X | y1, …, yk), then if y1, …, yk were sampled from the true joint, x will also be sampled from the true joint. So the true joint is a "fixed point": you tend to stay there if you ever get there. How long does it take to get there? It depends on the structure of the space of samples: how well connected are the states by the sampling steps?

38 (figure-only slide)

Mixing time depends on the eigenvalues of the chain's transition matrix over the state space!

40 LDA: Latent Dirichlet Allocation
Parameter learning: Variational EM (not covered in 601-B), or Collapsed Gibbs Sampling. What is collapsed Gibbs sampling?

41 Initialize all the Z’s randomly, then….
Initialize all the Z's randomly, then repeat:
- Pick Z1 ~ Pr(Z1 | z2, z3, α)
- Pick Z2 ~ Pr(Z2 | z1, z3, α)
- Pick Z3 ~ Pr(Z3 | z1, z2, α)
(θ has been integrated out, so each Zi is sampled given the other Z's and α.) This converges to samples from the true joint … and then we can estimate Pr(θ | α, sample of Z's).

42 What’s this distribution?
Initialize all the Z's randomly, then pick Z1 ~ Pr(Z1 | z2, z3, α). What's this distribution? (Diagram as on the previous slides.)

43 What's this distribution? A Dirichlet-multinomial
Initialize all the Z's randomly, then pick Z1 ~ Pr(Z1 | z2, z3, α). Simpler case: what is the distribution Pr(Z1, …, Zm | α) with θ integrated out? It's called a Dirichlet-multinomial, and it looks like this: if there are K values for the Z's and nk = # Z's with value k, then

Pr(Z1, …, Zm | α) = [ Γ(Σk αk) / Πk Γ(αk) ] · [ Πk Γ(nk + αk) / Γ(m + Σk αk) ]

(recall Γ(n) = (n-1)! if n is a positive integer).

44 Sampling from a Dirichlet-multinomial
Initialize all the Z's randomly, then pick Z1 ~ Pr(Z1 | z2, z3, α). It turns out that sampling from a Dirichlet-multinomial is very easy!
Notation: K values for the Z's; nk = # Z's with value k; Z = (Z1, …, Zm); Z(-i) = (Z1, …, Zi-1, Zi+1, …, Zm); nk(-i) = # Z's with value k, excluding Zi.
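With that notation, the conditional the sampler needs is the standard Dirichlet-multinomial result (the formula appears only as an image on the original slide, so this is a reconstruction):

```latex
\[
\Pr\left(Z_i = k \mid Z^{(-i)}, \alpha\right)
  = \frac{n_k^{(-i)} + \alpha_k}{(m - 1) + \sum_{k'} \alpha_{k'}}
\]
```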

45 What about with downstream evidence?
α captures the constraints on Zi via θ (from "above", the "causal" direction); what about via β? Pr(Z) = (1/c) · Pr(Z|E+) · Pr(E-|Z). (Diagram: the α → θ → Zi structure as before, plus η → β[k] → Xi, where Zi selects which β[k] generates Xi.)

46 Sampling for LDA
Notation: K values for the Zd,i's; Z(-d,i) = all the Z's but Zd,i; nw,k = # Zd,i's with value k paired with Wd,i = w; n*,k = # Zd,i's with value k; nw,k(-d,i) = nw,k excluding Zd,i; n*,k(-d,i) = n*,k excluding Zd,i; n*,k d,(-i) = n*,k from doc d, excluding Zd,i.
(Plate diagram: α → θd → Zdi → Wdi ← βk ← η, with Wdi in a plate of size Nd inside a plate of size D, and βk in a plate of size K.) The sampling distribution factors as Pr(Z|E+) · Pr(E-|Z): the first factor tracks the fraction of the time Z=k in doc d, the second the fraction of the time W=w in topic k, each smoothed by its prior.
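Written out in the notation above (the exact expression on the slide is an image, so this is a reconstruction of the standard collapsed Gibbs update with symmetric priors α and η, where V is the vocabulary size):

```latex
\[
\Pr\!\left(Z_{d,i}=k \mid Z^{(-d,i)}, W, \alpha, \eta\right)
\;\propto\;
\underbrace{\left(n_{*,k}^{\,d,(-i)} + \alpha\right)}_{\text{fraction of time } Z=k \text{ in doc } d}
\;\cdot\;
\underbrace{\frac{n_{w,k}^{(-d,i)} + \eta}{n_{*,k}^{(-d,i)} + V\eta}}_{\text{fraction of time } W=w \text{ in topic } k}
\]
```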

47 IMPLEMENTING LDA

48 Way way more detail

49 More detail

50 (figure-only slide)

51 (figure-only slide)

52 (Figure: choosing a new topic z by drawing a random number against a unit-height bar stacked with segments for z=1, z=2, z=3.)

53 What gets learned…..

54 In a math-ier notation
Counters used by the sampler: N[d,k] (words in document d assigned to topic k), N[*,k] (words assigned to topic k over all documents), M[w,k] (times word w is assigned to topic k), N[*,*] = V.

55 for each document d and word position j in d:
    z[d,j] = k, a random topic
    N[d,k]++
    W[w,k]++   where w = id of the j-th word in d

56 for each pass t = 1, 2, ….:
    for each document d and word position j in d:
        z[d,j] = k, a new topic sampled from Pr(z[d,j]=k | all the other z's, the words)
        update N, W to reflect the new assignment of z:
            N[d,k]++;  N[d,k']--   where k' is the old z[d,j]
            W[w,k]++;  W[w,k']--   where w is w[d,j]
A runnable sketch of this sampler follows below.
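A minimal runnable sketch of collapsed Gibbs sampling for LDA, following the N/W counter pseudocode above. Here docs is a list of lists of integer word ids in [0, V); the defaults for K, alpha, eta, and the number of passes are assumptions for illustration.

```python
import numpy as np

def lda_gibbs(docs, V, K=10, alpha=0.1, eta=0.01, n_passes=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    N = np.zeros((D, K))            # N[d, k]: words in doc d assigned topic k
    W = np.zeros((V, K))            # W[w, k]: word w assigned topic k
    Wsum = np.zeros(K)              # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):  # fill the counters from the initialization
        for j, w in enumerate(doc):
            k = z[d][j]
            N[d, k] += 1; W[w, k] += 1; Wsum[k] += 1
    for _ in range(n_passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]     # remove the current assignment
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wsum[k_old] -= 1
                # conditional over topics: (doc-topic) * (word-topic) factors
                p = (N[d] + alpha) * (W[w] + eta) / (Wsum + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new     # record the new assignment
                N[d, k_new] += 1; W[w, k_new] += 1; Wsum[k_new] += 1
    return z, N, W
```

Calling lda_gibbs(docs, V) on integer-coded documents returns the topic assignments z plus the count matrices from slide 54, from which the topic-word and document-topic distributions can be estimated.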

57 (figure-only slide)

58 (Same unit-height sampling figure as slide 52, with segments for z=1, z=2, z=3.) You spend a lot of time sampling: there's a loop over all topics here in the sampler.

