1
Probabilistic models for corpora and graphs
2
Review: some generative models
Multinomial Naïve Bayes
[Plate diagram: class node C and word nodes W1 … WN, repeated over a plate of M documents]
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | φ_{C_d})
3
Review: some generative models
Multinomial Naïve Bayes
[Plate diagram: as before, now with a plate of size K over the per-class word distributions]
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | φ_{C_d})
4
Review: some generative models
Multinomial Naïve Bayes
[Plate diagram: as before, with the per-class word distributions drawn explicitly]
For each class k = 1, …, K:
  Generate φ_k
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | φ_{C_d})
5
Review: some generative models
Multinomial Naïve Bayes
[Plate diagram: as before, with Dirichlet priors on the multinomials]
For each class k = 1, …, K:
  Generate φ_k ~ Dir(β)
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | φ_{C_d})
A Dirichlet is a prior for multinomials, defined by K parameters (α_1, …, α_K), with α_0 = Σ_k α_k.
MAP estimate for P(θ_i | α): (n_i + α_i) / (n + α_0).
Symmetric Dirichlet: all the α_i's are equal.
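Below is a minimal sketch of this smoothed Naive Bayes generative story. The sizes (V, K, M), the hyperparameter value for β, the uniform class prior, and the Poisson document lengths are illustrative assumptions, not values from the slides.

```python
import numpy as np

V, K, M = 1000, 2, 5          # vocabulary size, #classes, #documents (illustrative)
beta = 0.1                    # symmetric Dirichlet hyperparameter for the word multinomials

pi = np.ones(K) / K                                    # class prior (uniform here for simplicity)
phi = np.random.dirichlet(beta * np.ones(V), size=K)   # phi_k ~ Dir(beta): per-class word multinomial

docs, labels = [], []
for d in range(M):
    c = np.random.choice(K, p=pi)                      # C_d ~ Mult(C | pi)
    N_d = np.random.poisson(20) + 1                    # document length (not modeled on the slide)
    words = np.random.choice(V, size=N_d, p=phi[c])    # w_n ~ Mult(W | phi_{C_d})
    docs.append(words)
    labels.append(c)
```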
6
Review – unsupervised Naïve Bayes
Mixture model: unsupervised naïve Bayes model
[Plate diagram: a latent class per document, words W in a plate of size N, documents in a plate of size M, per-class parameters in a plate of size K]
For each class k = 1, …, K:
  Generate φ_k ~ Dir(β)
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | φ_{C_d})
Same generative story – different learning problem…
7
Latent Dirichlet Allocation
[Plate diagram: θ_d per document; z and w per position in a plate of size N, inside the document plate of size M; topic parameters in a plate of size K]
For each document d = 1, …, M:
  Generate θ_d ~ Dir(· | α)
  For each position n = 1, …, N_d:
    Generate z_n ~ Mult(· | θ_d)
    Generate w_n ~ Mult(· | φ_{z_n})
“Mixed membership”: each document gets its own mixture θ_d over topics.
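Below is a minimal sketch of LDA's generative story as stated on this slide. The sizes, the hyperparameter values, the Poisson document lengths, and drawing the topic-word multinomials φ_k from a symmetric Dirichlet are illustrative assumptions.

```python
import numpy as np

V, K, M = 1000, 2, 5          # vocabulary size, #topics, #documents (illustrative)
alpha, beta = 0.5, 0.1        # doc-topic and topic-word Dirichlet hyperparameters

phi = np.random.dirichlet(beta * np.ones(V), size=K)     # topic-word multinomials phi_k

corpus = []
for d in range(M):
    theta_d = np.random.dirichlet(alpha * np.ones(K))    # theta_d ~ Dir(alpha): this document's topic mix
    N_d = np.random.poisson(20) + 1
    doc = []
    for n in range(N_d):
        z = np.random.choice(K, p=theta_d)               # z_n ~ Mult(theta_d)
        doc.append(np.random.choice(V, p=phi[z]))        # w_n ~ Mult(phi_{z_n})
    corpus.append(doc)
```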
8
LDA’s view of a document
9
Review - LDA Gibbs sampling
– Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
– The sequence of samples comprises a Markov chain
– The stationary distribution of the chain is the joint distribution
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
10
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
11
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
[Plate diagrams: one z, w plate per document, of sizes N1, N2, N3; K = 2 topics]
12
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
[Figure: one latent topic variable per token – z11 (happy), z12 (thanksgiving), z21 (thanksgiving), z22 (turkey), z31 (istanbul), z32 (turkey)]
13
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
[Figure: each token's z set to a random topic in {1, 2}]
Step 1: initialize the z's randomly
14
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
Step 1: initialize the z's randomly
Step 2: sample z11 from its full conditional (slow derivation: https://lingpipe.files.wordpress.com/2010/07/lda3.pdf)
15
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
One factor in the conditional for z11:
(#times ‘happy’ in topic t) / (#words in topic t), smoothed by β
16
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
The other factor in the conditional for z11:
(#times topic t in D1) / (#words in D1), smoothed by α
17
D1: “Happy Thanksgiving!” D2: “Thanksgiving turkey” D3: “Istanbul, Turkey” K=2
Step 1: initialize the z's randomly
Step 2: sample z11 from Pr(z11 = k | z12, …, z32, w)
Step 3: sample z12
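Below is a minimal collapsed Gibbs sampler for this toy corpus, assuming K = 2 topics. The word ids, the number of sweeps, and the hyperparameters alpha and beta are illustrative choices; the update uses the two smoothed ratios from the previous slides (the per-document denominator is constant in t, so it is dropped).

```python
import numpy as np

# Pr(z_dn = t | all other z's, words) is proportional to
#   (n_{w|t} + beta) / (n_{.|t} + beta*V) * (n_{t|d} + alpha)
vocab = ["happy", "thanksgiving", "turkey", "istanbul"]
docs = [[0, 1], [1, 2], [3, 2]]                   # D1, D2, D3 as word ids
V, K = len(vocab), 2
alpha, beta = 0.5, 0.1

z = [[np.random.randint(K) for _ in doc] for doc in docs]   # Step 1: random initialization
n_wt = np.zeros((V, K)); n_t = np.zeros(K); n_td = np.zeros((len(docs), K))
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]; n_wt[w, t] += 1; n_t[t] += 1; n_td[d, t] += 1

for sweep in range(100):                          # Steps 2, 3, ...: resample each z_dn in turn
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                           # remove the current assignment from the counts
            n_wt[w, t] -= 1; n_t[t] -= 1; n_td[d, t] -= 1
            p = (n_wt[w] + beta) / (n_t + beta * V) * (n_td[d] + alpha)
            t = np.random.choice(K, p=p / p.sum())
            z[d][i] = t                           # add the new assignment back
            n_wt[w, t] += 1; n_t[t] += 1; n_td[d, t] += 1
```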
19
Review – unsupervised Naïve Bayes
Mixture model: unsupervised naïve Bayes model
[Plate diagram: a latent class per document, words W in a plate of size N, documents in a plate of size M, per-class parameters in a plate of size K]
For each class k = 1, …, K:
  Generate φ_k ~ Dir(β)
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | φ_{C_d})
Question: could you use collapsed Gibbs sampling here? What would Pr(z_i | z_1, …, z_{i-1}, z_{i+1}, …, z_m) look like?
20
Models for corpora and graphs
21
22
Stochastic Block Models: assume 1) nodes within a block z are exchangeable, and 2) edges between blocks z_p, z_q are exchangeable
[Plate diagram: a latent class z_i for each of N nodes; an edge weight a_uv for each of the N² node pairs]
For each node i, pick a latent class z_i
For each node pair (i, j), pick an edge weight a_ij ~ Pr(a | [z_i, z_j]), i.e., from a distribution whose parameters depend only on the block pair [z_i, z_j]
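Below is a minimal sketch of this generative story, with binary edges for simplicity. N, K, the uniform class prior, the Beta prior, and the name B for the K × K block-pair parameter matrix are illustrative assumptions.

```python
import numpy as np

N, K = 30, 2
pi = np.ones(K) / K                          # prior over latent classes (uniform here)
B = np.random.beta(1, 1, size=(K, K))        # B[p, q] = edge probability between blocks p and q

z = np.random.choice(K, size=N, p=pi)        # for each node i, pick a latent class z_i
A = np.zeros((N, N), dtype=int)
for i in range(N):
    for j in range(N):
        if i != j:                           # a_ij ~ Pr(a | B[z_i, z_j])
            A[i, j] = int(np.random.rand() < B[z[i], z[j]])
```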
23
Another mixed membership block model
Pick a multinomial θ over pairs of node topics (k1, k2)
For each node topic k, pick a multinomial ϕ_k over node ids
For each edge e:
  pick z = (k1, k2) ~ θ
  pick node i ~ Pr(· | ϕ_{k1})
  pick node j ~ Pr(· | ϕ_{k2})
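Below is a minimal sketch of this edge-generating story. N, K, the number of edges E, and the Dirichlet hyperparameters are illustrative assumptions.

```python
import numpy as np

N, K, E = 30, 2, 100
theta = np.random.dirichlet(np.ones(K * K)).reshape(K, K)   # multinomial over topic pairs (k1, k2)
phi = np.random.dirichlet(np.ones(N), size=K)               # phi_k: multinomial over node ids

edges = []
for e in range(E):
    pair = np.random.choice(K * K, p=theta.ravel())          # z = (k1, k2) ~ theta
    k1, k2 = divmod(pair, K)
    i = np.random.choice(N, p=phi[k1])                       # endpoint i ~ Pr(. | phi_{k1})
    j = np.random.choice(N, p=phi[k2])                       # endpoint j ~ Pr(. | phi_{k2})
    edges.append((i, j))
```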
24
Speeding up modeling for corpora and graphs
25
Solutions
Parallelize:
– IPM (Newman et al., AD-LDA)
– Parameter server (Qiao et al., HW7)
Speedups:
– Exploit sparsity
– Fast sampling techniques
State-of-the-art methods use both…
26
[Figure: the Gibbs update draws z by dropping a uniform random value onto a stack of unit-height segments, one per topic (z = 1, 2, 3, …)]
This is almost all the run-time, and it's linear in the number of topics… and bigger corpora need more topics….
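Below is a minimal sketch of why that inner step is O(K): a vanilla Gibbs update computes one unnormalized weight per topic and then scans them to find where the uniform draw lands.

```python
import numpy as np

def sample_topic_linear(weights, rng=np.random):
    """weights[t] = unnormalized Pr(z = t | everything else), one entry per topic."""
    u = rng.rand() * weights.sum()     # drop a "unit height" dart onto the stacked segments
    acc = 0.0
    for t, w in enumerate(weights):    # linear scan: cost grows with the number of topics K
        acc += w
        if u <= acc:
            return t
    return len(weights) - 1            # guard against floating-point round-off
```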
27
KDD 09 (Yao, Mimno & McCallum: SparseLDA)
28
[Figure: the stack of per-topic segments (z = t1, t2, t3, …) shown repeatedly, each time sampled with a unit-height uniform draw]
29
z=s+r+q
30
If U < s: look up U on a line segment with tic-marks at α_1β/(βV + n_{·|1}), α_2β/(βV + n_{·|2}), …
If s < U < s + r: look up U on the line segment for r
Only need to check t such that n_{t|d} > 0
31
z = s + r + q
If U < s: look up U on a line segment with tic-marks at α_1β/(βV + n_{·|1}), α_2β/(βV + n_{·|2}), …
If s < U < s + r: look up U on the line segment for r
If s + r < U: look up U on the line segment for q
Only need to check t such that n_{w|t} > 0
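Below is a minimal sketch of this s / r / q bucket decomposition (the SparseLDA trick). It assumes the count arrays are maintained exactly as in the collapsed sampler: n_td_d[t] is n_{t|d} for the current document, n_wt_w[t] is n_{w|t} for the current word, and n_t[t] is the total number of words in topic t. alpha may be a scalar or a length-K vector; the np.searchsorted lookups stand in for the slide's line segments with tic-marks. A real implementation only touches the nonzero entries of r and q; this dense version just shows the decomposition.

```python
import numpy as np

def sample_topic_sparse(n_td_d, n_wt_w, n_t, alpha, beta, V, rng=np.random):
    denom = n_t + beta * V
    s_t = alpha * beta / denom                  # smoothing-only bucket (dense, changes slowly)
    r_t = n_td_d * beta / denom                 # nonzero only for topics with n_{t|d} > 0
    q_t = (n_td_d + alpha) * n_wt_w / denom     # nonzero only for topics with n_{w|t} > 0
    s, r, q = s_t.sum(), r_t.sum(), q_t.sum()
    u = rng.rand() * (s + r + q)
    if u < s:                                   # rare: walk the dense smoothing bucket
        return int(np.searchsorted(np.cumsum(s_t), u))
    elif u < s + r:                             # only topics present in the current document
        return int(np.searchsorted(np.cumsum(r_t), u - s))
    else:                                       # only topics in which word w occurs
        return int(np.searchsorted(np.cumsum(q_t), u - s - r))
```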
32
z = s + r + q
Only need to check t such that n_{w|t} > 0 (for q)
Only need to check t such that n_{t|d} > 0 (for r)
Only need to check the smoothing bucket occasionally (< 10% of the time)
33
z = s + r + q
Need to store n_{w|t} for each word, topic pair …???
Only need to store n_{t|d} for the current d
Only need to store (and maintain) the total words per topic, plus the α's, β, and V
Trick: count up n_{t|d} for d when you start working on d, and update it incrementally
34
z = s + r + q
Need to store n_{w|t} for each word, topic pair …???
1. Precompute, for each t, the coefficient (α_t + n_{t|d}) / (βV + n_{·|t}). Most (>90%) of the time and space is here…
2. Quickly find the t's such that n_{w|t} is large for w
35
Need to store n_{w|t} for each word, topic pair …???
1. Precompute, for each t, the coefficient (α_t + n_{t|d}) / (βV + n_{·|t}). Most (>90%) of the time and space is here…
2. Quickly find the t's such that n_{w|t} is large for w:
– map w to an int array
– no larger than the frequency of w
– no larger than #topics
– encode (t, n) as a bit vector: n in the high-order bits, t in the low-order bits
– keep the ints sorted in descending order
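Below is a minimal sketch of that packed representation: for each word w, keep an int array of (count, topic) pairs with the count in the high-order bits and the topic id in the low-order bits, sorted in descending order so the largest n_{w|t} come first. The bit width TOPIC_BITS and the example counts are assumptions for illustration.

```python
TOPIC_BITS = 10                      # enough for up to 1024 topics (illustrative width)
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(topic, count):
    return (count << TOPIC_BITS) | topic     # count in the high bits, topic in the low bits

def decode(packed):
    return packed & TOPIC_MASK, packed >> TOPIC_BITS   # -> (topic, count)

# e.g. word w occurs 7 times in topic 3 and twice in topic 12:
row_w = sorted([encode(3, 7), encode(12, 2)], reverse=True)
for packed in row_w:                 # iterate topics in decreasing n_{w|t}
    t, n = decode(packed)
```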
37
Outline
LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
  - The Mimno/McCallum decomposition (SparseLDA)
  - Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)
38
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
Basic problem: how can we sample from a biased die (a discrete distribution over K outcomes) quickly?
If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times.
Proof of concept: generate r ~ uniform and look it up in a binary tree (e.g., r in (23/40, 7/10]); this takes the per-sample cost from O(K) to O(log2 K).
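Below is a minimal sketch of that proof of concept: precompute the cumulative distribution once (O(K)), then answer each draw with a binary search (O(log2 K)). The example probabilities are illustrative, not the slide's.

```python
import numpy as np

p = np.array([0.5, 0.2, 0.075, 0.225])
cdf = np.cumsum(p)                       # tic-marks on the unit interval

def sample(rng=np.random):
    r = rng.rand()                       # r ~ Uniform(0, 1)
    return int(np.searchsorted(cdf, r))  # binary search for the segment containing r
```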
39
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
Another idea… simulate throwing a dart with two drawn values:
rx = int(u1 * K)
ry = u2 * p_max
Keep throwing until you hit a stripe.
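Below is a minimal sketch of this rejection "dart" idea: draw a column and a height, and accept only if the dart lands under that column's bar. The example probabilities are illustrative.

```python
import numpy as np

p = np.array([0.5, 0.2, 0.075, 0.225])
K, p_max = len(p), p.max()

def sample_dart(rng=np.random):
    while True:
        rx = int(rng.rand() * K)        # which column the dart falls in
        ry = rng.rand() * p_max         # how high it falls
        if ry < p[rx]:                  # hit a stripe: accept
            return rx                   # else: miss, throw again
```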
40
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle's height to the average probability, not the maximum probability, and cutting and pasting a bit. Mathematically speaking, you can always do this using only two colors in each column of the final alias table, and then the dart never misses!
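Below is a minimal sketch of that construction (the Walker/Vose alias method): every column holds at most two outcomes, so a draw is one uniform column choice plus one biased coin flip, O(1) per sample after O(K) setup. Function and variable names are mine.

```python
import numpy as np

def build_alias(p):
    K = len(p)
    prob = np.array(p, dtype=float) * K           # rescale so the average bar height is 1
    alias = np.zeros(K, dtype=int)
    small = [i for i in range(K) if prob[i] < 1.0]
    large = [i for i in range(K) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                              # top up column s with mass cut from column l
        prob[l] -= (1.0 - prob[s])
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers are full columns (up to round-off)
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, rng=np.random):
    k = rng.randint(len(prob))                    # pick a column uniformly
    return k if rng.rand() < prob[k] else alias[k]   # two colors per column: the dart never misses
```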
41
KDD 2014
Key ideas:
– use a variant of the Mimno/McCallum decomposition
– use alias tables to sample from the dense parts
– since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs
42
KDD 2014
q is the stale, easy-to-draw-from distribution; p is the up-to-date distribution.
Computing the ratios p(i)/q(i) is cheap.
Usually the ratio is close to one; otherwise, the dart missed.
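Below is a minimal sketch of such a Metropolis-Hastings step: propose a topic from the stale, cheap distribution q (e.g., drawn from an alias table) and accept with probability min(1, p(new)·q(old) / (p(old)·q(new))), where p is the current unnormalized Gibbs conditional. The arguments p_fn, q_probs, and sample_q are hypothetical placeholders for those pieces.

```python
import numpy as np

def mh_step(z_old, q_probs, sample_q, p_fn, rng=np.random):
    z_new = sample_q()                               # cheap draw from the stale proposal q
    accept = (p_fn(z_new) * q_probs[z_old]) / (p_fn(z_old) * q_probs[z_new])
    if rng.rand() < min(1.0, accept):                # the ratio is usually close to one
        return z_new
    return z_old                                     # rejected: keep the old topic
```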
43
KDD 2014