Scaling up LDA (Monday’s lecture)
What if you try to parallelize? Split the document/term matrix randomly, distribute the shards to p processors, and then run "Approximate Distributed LDA" on each shard. Aggregating the local changes and redistributing the shared parameters is a common subtask in parallel versions of LDA, SGD, and many other learners.
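A minimal serial sketch of the AD-LDA idea, assuming a collapsed Gibbs sampler and NumPy count matrices; the function names and the serial merge loop are illustrative, not the actual distributed implementation. Each shard samples against a stale copy of the global topic-word counts, and the local changes are then summed back into the global table.

```python
# Toy sketch of Approximate Distributed LDA (AD-LDA), serial for clarity.
# Assumes z, doc_topic, and global_topic_word were initialized consistently.
import numpy as np

def local_gibbs_pass(docs, z, topic_word, topic_totals, doc_topic, alpha, beta, V):
    """One collapsed-Gibbs sweep over one shard, updating local copies in place."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # remove the current assignment from the counts
            topic_word[k_old, w] -= 1
            topic_totals[k_old] -= 1
            doc_topic[d, k_old] -= 1
            # sample a new topic proportional to p(z = k | everything else)
            p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) / (topic_totals + V * beta)
            k_new = np.random.choice(len(p), p=p / p.sum())
            z[d][i] = k_new
            topic_word[k_new, w] += 1
            topic_totals[k_new] += 1
            doc_topic[d, k_new] += 1

def ad_lda_round(shards, states, global_topic_word, alpha, beta):
    """One AD-LDA synchronization round over p shards (run serially here)."""
    V = global_topic_word.shape[1]
    deltas = []
    for docs, (z, doc_topic) in zip(shards, states):
        local_tw = global_topic_word.copy()          # stale copy per worker
        local_totals = local_tw.sum(axis=1)
        local_gibbs_pass(docs, z, local_tw, local_totals, doc_topic, alpha, beta, V)
        deltas.append(local_tw - global_topic_word)  # what this worker changed
    for delta in deltas:                             # merge all local changes
        global_topic_word += delta
    return global_topic_word
```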
AllReduce
Introduction. A common pattern:
– do some learning in parallel
– aggregate local changes from each processor into shared parameters
– distribute the new shared parameters back to each processor
– and repeat….
In MapReduce the aggregate-and-redistribute step is a map phase followed by some sort of copy; the AllReduce primitive does that step directly. AllReduce is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme.
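A minimal sketch of this learn/aggregate/redistribute loop using mpi4py's Allreduce. The gradient computation is a placeholder; in a real learner it would come from the locally cached data shard, and the script would be launched under mpirun.

```python
# Sketch of the learn / allreduce / repeat pattern with mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def local_gradient(params):
    # Placeholder: a real learner would compute this from its local data slice.
    return np.random.randn(*params.shape)

params = np.zeros(10)
for step in range(100):
    grad = local_gradient(params)                 # do some learning in parallel
    total = np.empty_like(grad)
    comm.Allreduce(grad, total, op=MPI.SUM)       # aggregate local changes
    params -= 0.01 * (total / comm.Get_size())    # every node now holds the same
                                                  # updated shared parameters
```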
Gory details of VW Hadoop-AllReduce
Spanning-tree server:
– a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
Worker nodes ("fake" mappers):
– input for each worker is locally cached
– workers all connect to the spanning-tree server
– workers all execute the same code, which might contain AllReduce calls: workers synchronize whenever they reach an allreduce
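A toy, in-process illustration of the reduce-up/broadcast-down structure of a tree-based allreduce; this mimics the spanning-tree idea only, not VW's actual socket protocol or node layout.

```python
# Toy tree-based allreduce over a binary spanning tree of worker values:
# partial sums flow up toward the root, then the total is broadcast back down.
def tree_allreduce(values):
    n = len(values)
    partial = list(values)
    # reduce phase: each node folds its children's partial sums into its own
    for i in reversed(range(n)):
        left, right = 2 * i + 1, 2 * i + 2
        if left < n:
            partial[i] += partial[left]
        if right < n:
            partial[i] += partial[right]
    # broadcast phase: the root's total is pushed back down to every node
    total = partial[0]
    return [total] * n

print(tree_allreduce([1.0, 2.0, 3.0, 4.0]))  # every worker ends up with 10.0
```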
Hadoop AllReduce: don't wait for duplicate jobs.
Second-order method (like Newton's method).
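To make the "second-order" point concrete, here is a sketch of a single Newton step for L2-regularized logistic regression; it only illustrates what a second-order method uses (the Hessian), and is not the actual optimizer in VW.

```python
# One Newton step for L2-regularized logistic regression (labels y in {0,1}).
import numpy as np

def newton_step(w, X, y, reg=1e-3):
    p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
    grad = X.T @ (p - y) + reg * w            # gradient of the regularized loss
    S = p * (1.0 - p)                         # per-example curvature weights
    H = (X.T * S) @ X + reg * np.eye(len(w))  # Hessian
    return w - np.linalg.solve(H, grad)       # w <- w - H^{-1} grad
```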
2^24 features, ~100 non-zeros/example, 2.3B examples. An example is a user/page/ad triple plus conjunctions of these, labeled positive if there was a click-through on the ad.
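A feature space of size 2^24 is what the hashing trick with 24 hash bits produces; the sketch below is illustrative (hypothetical feature names and hash function, not VW's own hashing).

```python
# Sketch of hashing raw user/page/ad attributes and their conjunctions into a
# fixed 2^24-dimensional sparse feature space.
import hashlib

B = 24  # number of hash bits -> 2**24 feature slots

def hash_feature(name):
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % (1 << B)

def featurize(user, page, ad):
    raw = [f"user={user}", f"page={page}", f"ad={ad}",
           f"user={user}^ad={ad}", f"page={page}^ad={ad}"]  # conjunctions
    return sorted({hash_feature(f) for f in raw})           # sparse indices

print(featurize("u123", "p456", "ad789"))
```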
50M examples, explicitly constructed kernel features: 11.7M features, 3,300 nonzeros/example. Old method: SVM, 3 days (reporting the time to reach a fixed test error).
On-line LDA
Pilfered from… Online Learning for Latent Dirichlet Allocation, Matthew Hoffman, Francis Bach & David Blei, NIPS 2010.
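A condensed sketch of the core update from that paper: fit local variational parameters for a mini-batch, form the estimate lambda_hat of what the topic-word parameters would be if the whole corpus looked like this batch, and blend it in with a decaying step size rho_t = (tau0 + t)^(-kappa). The hyperparameter values and helper names here are illustrative.

```python
# Condensed sketch of online variational Bayes for LDA (Hoffman et al. style).
import numpy as np
from scipy.special import psi  # digamma

def e_step(doc_ids, doc_cts, lam, alpha, iters=50):
    """Local variational inference for one document (unique word ids, counts)."""
    K = lam.shape[0]
    gamma = np.ones(K)
    Elog_beta = psi(lam[:, doc_ids]) - psi(lam.sum(axis=1))[:, None]  # K x n_d
    for _ in range(iters):
        Elog_theta = psi(gamma) - psi(gamma.sum())
        phi = np.exp(Elog_theta[:, None] + Elog_beta)   # K x n_d, unnormalized
        phi /= phi.sum(axis=0, keepdims=True)
        gamma = alpha + phi @ doc_cts
    return gamma, phi

def online_update(minibatch, lam, t, D, alpha=0.1, eta=0.01, tau0=1024, kappa=0.7):
    """One stochastic update of the K x V topic-word parameters lambda."""
    K, V = lam.shape
    lam_hat = np.full((K, V), eta)
    for doc_ids, doc_cts in minibatch:
        _, phi = e_step(doc_ids, doc_cts, lam, alpha)
        lam_hat[:, doc_ids] += (D / len(minibatch)) * phi * doc_cts
    rho = (tau0 + t) ** (-kappa)                       # decaying step size
    return (1 - rho) * lam + rho * lam_hat
```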
Recap of Monday's lecture
Compute expectations over the z's any way you want….
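For example, one way to get those expectations is a short within-document Gibbs chain, as in the "approximate using Gibbs" detail below; the names here (e.g. exp_Elog_beta for the exponentiated expected log topic-word weights) are assumptions made for this sketch.

```python
# Sketch: estimate E[z_i = k] for one document by averaging Gibbs samples,
# holding the (expected) topic-word weights fixed.
import numpy as np

def estimate_z_expectations(doc_words, exp_Elog_beta, alpha, burn_in=5, samples=20):
    """Return an (n_d x K) matrix of estimated P(z_i = k) for one document."""
    K = exp_Elog_beta.shape[0]
    n = len(doc_words)
    z = np.random.randint(K, size=n)            # random initial assignments
    topic_counts = np.bincount(z, minlength=K).astype(float)
    z_probs = np.zeros((n, K))
    for sweep in range(burn_in + samples):
        for i, w in enumerate(doc_words):
            topic_counts[z[i]] -= 1
            p = (topic_counts + alpha) * exp_Elog_beta[:, w]
            z[i] = np.random.choice(K, p=p / p.sum())
            topic_counts[z[i]] += 1
        if sweep >= burn_in:
            z_probs[np.arange(n), z] += 1.0     # accumulate post-burn-in samples
    return z_probs / samples
```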
Technical details
– Variational distribution: q(z_d), not q(θ_d)!
– Approximate it using Gibbs sampling: after sampling for a while, estimate the needed expectations from the samples
– Evaluate using run time and "coherence", where D(w) = # docs containing word w
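A sketch of one common document-co-occurrence coherence score built from D(w); the lecture's exact formula may differ.

```python
# Coherence for one topic's top words, using D(w) = # docs containing w and
# D(w, w') = # docs containing both (one common definition of coherence).
import math
from itertools import combinations

def coherence(top_words, docs_as_sets):
    def D(*words):
        return sum(1 for doc in docs_as_sets if all(w in doc for w in words))
    score = 0.0
    for w_l, w_m in combinations(top_words, 2):   # pairs of top words
        score += math.log((D(w_l, w_m) + 1.0) / D(w_l))
    return score

docs = [{"nba", "game", "score"}, {"game", "season", "team"}, {"election", "vote"}]
print(coherence(["game", "season", "team"], docs))
```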
Summary of LDA speedup tricks
– Gibbs sampler: O(N*K*T), and K grows with N; you need to keep the corpus (and the z's) in memory
– You can parallelize: each worker only needs a slice of the corpus, but you need to synchronize K multinomials over the vocabulary (AllReduce would help?)
– You can sparsify the sampling and the topic counts: Mimno's trick greatly reduces memory
– You can do the computation online: you only need to keep K multinomials and one document's worth of corpus and z's in memory
– You can combine some of these methods: online sparsified LDA; parallel online sparsified LDA?