Fast sampling for LDA William Cohen
MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS
Called “collapsed Gibbs sampling” since you’ve marginalized away some variables (From: Parameter estimation for text analysis - Gregor Heinrich). The sampling probability for z is a product of two factors: prob this word/term assigned to topic k × prob this doc contains topic k
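A minimal sketch of one collapsed Gibbs update for a single token, assuming symmetric priors `alpha` and `beta` and NumPy count arrays. The names `n_dk`, `n_kw`, and `n_k` are illustrative and do not come from the slides.

```python
import numpy as np

def gibbs_resample_token(d, w, z_old, n_dk, n_kw, n_k, alpha, beta, rng):
    """Resample the topic of one token (word w in doc d), updating counts in place."""
    V = n_kw.shape[1]  # vocabulary size
    # Remove the token's current assignment from all counts.
    n_dk[d, z_old] -= 1
    n_kw[z_old, w] -= 1
    n_k[z_old] -= 1
    # (prob this doc contains topic k) * (prob this word assigned to topic k)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    p = p / p.sum()
    z_new = int(rng.choice(len(p), p=p))
    # Add the token back under its new topic.
    n_dk[d, z_new] += 1
    n_kw[z_new, w] += 1
    n_k[z_new] += 1
    return z_new
```

Each call loops over all K topics to build `p`, which is the cost the later speedups try to avoid.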
More detail
[Figure: unit-height bar divided into segments for z=1, z=2, z=3, …; a random height picks the segment]
SPEEDUP 1 - SPARSITY
KDD 2008
[Figure: unit-height bar divided into segments for z=1, z=2, z=3, …; a random height picks the segment]
Running total of P(z=k|…) or P(z<=k)
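The running-total idea can be sketched as a linear scan over unnormalized topic weights. This is an illustrative helper (the function name and the optional `u` argument are assumptions made for testing), not code from the paper.

```python
import random

def sample_by_running_total(weights, u=None):
    """Draw index k with probability weights[k] / sum(weights)."""
    total = sum(weights)
    if u is None:
        u = random.random()
    target = u * total          # random height on the (unnormalized) bar
    running = 0.0
    for k, wk in enumerate(weights):
        running += wk           # running total, i.e. unnormalized P(z <= k)
        if target < running:
            return k
    return len(weights) - 1     # guard against floating-point round-off
```

The scan stops as soon as the running total passes the target, so putting large weights first shortens the average loop.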
Discussion: where do you spend your time?
– sampling the z’s
– each sampling step involves a loop over all topics
– this seems wasteful: even with many topics, words are often only assigned to a few different topics
  – low-frequency words appear < K times … and there are lots and lots of them!
  – even frequent words are not in every topic
Discussion: what’s the solution? Idea: come up with approximations Z_i to the normalizer Z at each stage i; then you might be able to stop early, which is computationally like having a sparser vector. Want Z_i >= Z
Tricks
– How do you compute and maintain the bound? See the paper.
– What order do you go in?
  – want to pick large P(k)’s first
  – … so we want large P(k|d) and P(k|w)
  – … so we maintain the k’s in sorted order; the order only changes a little bit after each flip, so a bubble-sort pass will fix up the almost-sorted array
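A minimal sketch of the bubble-sort fix-up, assuming topics are kept in a list `order` sorted by descending count and that exactly one count has just changed. The function and variable names are illustrative.

```python
def bubble_fix(order, counts, pos):
    """Restore descending order after counts[order[pos]] changed; return the new position."""
    # Count went up: move the topic toward the front.
    while pos > 0 and counts[order[pos]] > counts[order[pos - 1]]:
        order[pos], order[pos - 1] = order[pos - 1], order[pos]
        pos -= 1
    # Count went down: move the topic toward the back.
    while pos + 1 < len(order) and counts[order[pos]] < counts[order[pos + 1]]:
        order[pos], order[pos + 1] = order[pos + 1], order[pos]
        pos += 1
    return pos
```

Because a single flip changes a count by one, the topic usually moves only a few slots, so the fix-up is far cheaper than a full re-sort.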
Results
SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY
KDD 09
z = s+r+q, where t = topic (k), w = word, d = doc
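For reference, the three buckets can be written out by expanding the collapsed Gibbs weight. This is a reconstruction that assumes the standard SparseLDA split and the slides' count notation (n_{t|d}, n_{w|t}, n_{·|t}); the slide's own formula did not survive extraction.

```latex
\begin{align*}
P(t \mid w, d) &\propto \frac{(\alpha_t + n_{t|d})(\beta + n_{w|t})}{\beta V + n_{\cdot|t}} \\
 &= \frac{\alpha_t \beta}{\beta V + n_{\cdot|t}}
  + \frac{n_{t|d}\,\beta}{\beta V + n_{\cdot|t}}
  + \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_{\cdot|t}} \\
s &= \sum_t \frac{\alpha_t \beta}{\beta V + n_{\cdot|t}}, \qquad
r = \sum_t \frac{n_{t|d}\,\beta}{\beta V + n_{\cdot|t}}, \qquad
q = \sum_t \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_{\cdot|t}}
\end{align*}
```

Only the r terms with n_{t|d} > 0 and the q terms with n_{w|t} > 0 are nonzero, which is where the sparsity comes from.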
z=s+r+q
If U<s: lookup U on line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), …
If s<U<s+r: lookup U on line segment for r. Only need to check t such that n_{t|d} > 0
t = topic (k), w = word, d = doc
z=s+r+q
If U<s: lookup U on line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), …
If s<U<s+r: lookup U on line segment for r
If s+r<U: lookup U on line segment for q. Only need to check t such that n_{w|t} > 0
z=s+r+q
– q: only need to check t such that n_{w|t} > 0
– r: only need to check t such that n_{t|d} > 0
– s: only need to check occasionally (< 10% of the time)
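The bucket lookup can be sketched as follows, assuming sparse dicts for the per-document and per-word counts. All names are illustrative, and for clarity this version recomputes s, r, and q on every call, whereas a real sampler would cache and update them incrementally.

```python
import random

def bucket_masses(alpha, beta, V, n_t, n_td, n_wt):
    """Return (s, r, q); n_td and n_wt are dicts topic -> nonzero count."""
    denom = [beta * V + n for n in n_t]
    s = sum(a * beta / d for a, d in zip(alpha, denom))
    r = sum(c * beta / denom[t] for t, c in n_td.items())
    q = sum((alpha[t] + n_td.get(t, 0)) * c / denom[t] for t, c in n_wt.items())
    return s, r, q

def _walk(pairs, x):
    """Return the first topic whose running total exceeds x (last topic on round-off)."""
    t = None
    for t, mass in pairs:
        x -= mass
        if x < 0:
            break
    return t

def sparse_sample(alpha, beta, V, n_t, n_td, n_wt, u=None):
    """Sample a topic by locating U = u * (s + r + q) in the s, r, or q segment."""
    denom = [beta * V + n for n in n_t]
    s, r, q = bucket_masses(alpha, beta, V, n_t, n_td, n_wt)
    U = (random.random() if u is None else u) * (s + r + q)
    buckets = [
        (s, [(t, alpha[t] * beta / denom[t]) for t in range(len(alpha))]),
        (r, [(t, c * beta / denom[t]) for t, c in n_td.items()]),
        (q, [(t, (alpha[t] + n_td.get(t, 0)) * c / denom[t]) for t, c in n_wt.items()]),
    ]
    for mass, pairs in buckets:
        if U < mass:
            return _walk(pairs, U)
        U -= mass
    return len(alpha) - 1  # only reachable through floating-point round-off
```

Only the s segment touches every topic; the r and q segments iterate over the nonzero counts alone.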
z=s+r+q: what needs to be stored?
– n_{w|t}: need to store for each word, topic pair …???
– n_{t|d}: only need to store for the current d. Trick: count up n_{t|d} for d when you start working on d, and update incrementally
– total words per topic and the α’s, β, V: only need to store (and maintain) these
z=s+r+q
Need to store n_{w|t} for each word, topic pair …??? Most (>90%) of the time and space is here…
1. Precompute, for each t,
2. Quickly find t’s such that n_{w|t} is large for w
Need to store n_{w|t} for each word, topic pair …??? Most (>90%) of the time and space is here…
1. Precompute, for each t,
2. Quickly find t’s such that n_{w|t} is large for w
– map w to an int array
  – no larger than the frequency of w
  – no larger than #topics
– encode (t,n) as a bit vector
  – n in the high-order bits
  – t in the low-order bits
– keep ints sorted in descending order
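A minimal sketch of the (t,n) encoding, assuming at most 2**10 topics; `TOPIC_BITS` and the function names are illustrative choices, not values from the paper.

```python
TOPIC_BITS = 10                      # assumption: fewer than 2**10 topics
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(t, n):
    """Pack count n into the high-order bits and topic t into the low-order bits."""
    return (n << TOPIC_BITS) | t

def decode(x):
    """Unpack an encoded int back into (t, n)."""
    return x & TOPIC_MASK, x >> TOPIC_BITS

def packed_counts(topic_counts):
    """Build one word's array: nonzero (t, n) pairs, sorted by descending count."""
    return sorted((encode(t, n) for t, n in topic_counts if n > 0), reverse=True)
```

Because the count sits in the high-order bits, sorting the plain ints in descending order also sorts the topics by descending count, so the largest n_{w|t} are visited first.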
Outline
– LDA/Gibbs algorithm details
– How to speed it up by parallelizing
– How to speed it up by faster sampling
  – Why sampling is key
  – Some sampling ideas for LDA
    – The Mimno/McCallum decomposition (SparseLDA)
    – Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)
Alias tables. Basic problem: how can we sample from a biased coin quickly? If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r ~ uniform and use a binary tree over the cumulative probabilities (e.g. r in (23/40, 7/10]). A linear scan costs O(K) per sample; the tree lookup costs O(log2 K)
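The proof of concept can be sketched with a sorted array of cumulative sums and a binary search, which gives the same O(log2 K) lookup as a binary tree. The function names are illustrative.

```python
import bisect
import itertools
import random

def build_cdf(weights):
    """Preprocessing: running totals of the (possibly unnormalized) weights, O(K)."""
    return list(itertools.accumulate(weights))

def sample_cdf(cdf, u=None):
    """Draw an index by binary search over the running totals, O(log K)."""
    if u is None:
        u = random.random()
    k = bisect.bisect_right(cdf, u * cdf[-1])
    return min(k, len(cdf) - 1)  # guard against round-off at the top end
```

The preprocessing only pays off if the same `cdf` is reused for many draws, which is why a slowly changing distribution matters.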
Alias tables. Another idea… Simulate the dart with two drawn values: rx = int(u1*K), ry = u2*p_max; keep throwing till you hit a stripe
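A minimal sketch of the dart-throwing idea as rejection sampling, assuming `probs` is a list of K probabilities with at least one positive entry. The names are illustrative.

```python
import random

def dart_sample(probs, rng=random):
    """Throw darts at a K-by-p_max rectangle until one lands inside a stripe."""
    K = len(probs)
    p_max = max(probs)
    while True:
        rx = int(rng.random() * K)   # which column the dart lands in
        ry = rng.random() * p_max    # how high the dart lands
        if ry < probs[rx]:           # hit the stripe: accept this column
            return rx
```

Each throw is O(1), but the expected number of throws is K * p_max, so a peaked distribution wastes many darts.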
Alias tables An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit. You can always do this using only two colors in each column of the final alias table and the dart never misses! mathematically speaking…
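The cut-and-paste construction can be sketched as below. This follows the commonly used Vose-style variant of Walker's method, which is an assumption about the exact procedure; the slide does not spell out the steps.

```python
import random

def build_alias(probs):
    """Build (prob, alias): column k keeps itself with prob[k], else yields alias[k]."""
    K = len(probs)
    total = sum(probs)
    scaled = [p * K / total for p in probs]   # rescale so the average height is 1
    prob = [1.0] * K
    alias = list(range(K))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        lo, hi = small.pop(), large.pop()
        prob[lo] = scaled[lo]                 # column lo keeps its own mass...
        alias[lo] = hi                        # ...and is topped up from column hi
        scaled[hi] -= 1.0 - scaled[lo]
        (small if scaled[hi] < 1.0 else large).append(hi)
    return prob, alias                        # leftover columns keep prob 1.0

def alias_draw(prob, alias, rng=random):
    """O(1) draw: pick a column uniformly, then choose between its two colors."""
    k = int(rng.random() * len(prob))
    return k if rng.random() < prob[k] else alias[k]
```

Building the table is O(K); every draw afterwards costs two uniform random numbers and one comparison.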
KDD 2014: key ideas
– use a variant of the Mimno/McCallum decomposition
– use alias tables to sample from the dense parts
– since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs
KDD 2014
– q is the stale, easy-to-draw-from distribution
– p is the updated distribution
– computing ratios p(i)/q(i) is cheap
– usually the ratio is close to one; else the dart missed
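A minimal sketch of one Metropolis-Hastings step with a stale proposal, assuming an independence proposal (the candidate is drawn from q regardless of the current state) and strictly positive p and q. The function name and the `draw_q` callback are illustrative.

```python
import random

def mh_step(i, p, q, draw_q, rng=random):
    """Propose j ~ q; accept with probability min(1, p[j]*q[i] / (p[i]*q[j]))."""
    j = draw_q()
    accept = min(1.0, (p[j] * q[i]) / (p[i] * q[j]))
    return j if rng.random() < accept else i   # on rejection, stay at i
```

In practice `draw_q` would be an O(1) alias-table draw, and only p[i], p[j], q[i], q[j] need to be evaluated, so the step stays cheap even though p keeps changing.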
KDD 2014
SPEEDUP 3 - ONLINE LDA
Pilfered from… NIPS 2010: Online Learning for LDA, Matthew Hoffman, Francis Bach & David Blei
ASIDE: VARIATIONAL INFERENCE FOR LDA
uses γ uses λ
BACK TO: SPEEDUP 3 - ONLINE LDA
SPEED ONLINE SPARSE LDA
Compute expectations over the z’s any way you want…
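Technical Details
– Variational distribution: q(z_d), the joint over all of document d’s assignments, not a per-token factorization ∏_i q(z_di)!
– Approximate using Gibbs: after sampling for a while, estimate the expectations
– estimate using time and “coherence”: D(w) = # docs containing word w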
Summary of LDA speedup tricks
– Gibbs sampler:
  – O(N*K*T), and K grows with N
  – Need to keep the corpus (and z’s) in memory
– You can parallelize
  – You need to keep only a slice of the corpus
  – But you need to synchronize K multinomials over the vocabulary
  – AllReduce helps
– You can sparsify the sampling and topic-counts
  – Mimno’s trick greatly reduces memory
– You can do the computation on-line
  – Only need to keep K multinomials and one document’s worth of corpus and z’s in memory
– You can combine some of these methods
  – Online sparsified LDA
  – Parallel online sparsified LDA?
SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION
What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors, then run “Approximate Distributed LDA”. Common subtask in parallel versions of: LDA, SGD, …
Introduction
Common pattern:
– do some learning in parallel
– aggregate local changes from each processor to shared parameters
– distribute the new shared parameters back to each processor
– and repeat…
AllReduce is implemented in MPI, and recently in VW code (John Langford) in a Hadoop-compatible scheme
[Figure labels: MAP, ALLREDUCE]
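The pattern can be illustrated with a single-process simulation. This is only a sketch of the semantics (every worker contributes a vector and every worker receives the sum); it is not MPI or VW code, and all function names are hypothetical.

```python
def allreduce_sum(local_vectors):
    """Each worker contributes one vector; each worker gets back the elementwise sum."""
    total = [sum(col) for col in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]

def run_rounds(shared, local_delta, n_workers, n_rounds):
    """Repeat: learn locally, aggregate the changes, redistribute the shared parameters."""
    params = [list(shared) for _ in range(n_workers)]       # every worker's copy
    for _ in range(n_rounds):
        deltas = [local_delta(w, params[w]) for w in range(n_workers)]
        summed = allreduce_sum(deltas)                       # aggregate + distribute
        params = [[p + d for p, d in zip(params[w], summed[w])]
                  for w in range(n_workers)]
    return params
```

After every round all workers hold identical parameters, which is the property the parallel LDA and SGD variants rely on.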
Gory details of VW Hadoop-AllReduce
– Spanning-tree server:
  – Separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
– Worker nodes (“fake” mappers):
  – Input for worker is locally cached
  – Workers all connect to spanning-tree server
  – Workers all execute the same code, which might contain AllReduce calls
  – Workers synchronize whenever they reach an all-reduce
Hadoop AllReduce: don’t wait for duplicate jobs
Second-order method - like Newton’s method
2^24 features; ~100 non-zeros/example; 2.3B examples. Each example is user/page/ad and conjunctions of these, positive if there was a click-thru on the ad
50M examples; explicitly constructed kernel; 11.7M features; 3,300 nonzeros/example. Old method: SVM, 3 days. Reporting time to get to fixed test error