Scaling up LDA (Monday’s lecture)
What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors, then run "Approximate Distributed LDA". This aggregate-and-redistribute step is a common subtask in parallel versions of LDA, SGD, …
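To make the splitting concrete, here is a minimal single-process sketch of the Approximate Distributed LDA idea: documents are split randomly into p shards, each shard runs a collapsed-Gibbs sweep against its own copy of the topic-word counts, and the copies are merged back into the global counts afterwards. The toy corpus, parameter values, and names are illustrative, not from the lecture; a real implementation would run the shard loop on separate processors.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, alpha, beta = 50, 5, 0.1, 0.01                    # vocab size, topics, priors (toy values)
docs = [rng.integers(V, size=20) for _ in range(40)]    # toy corpus of word ids
z = [rng.integers(K, size=len(d)) for d in docs]        # current topic assignments

nwk = np.zeros((V, K))                                  # topic-word counts
ndk = np.zeros((len(docs), K))                          # document-topic counts
for d, (ws, zs) in enumerate(zip(docs, z)):
    for w, k in zip(ws, zs):
        nwk[w, k] += 1
        ndk[d, k] += 1

def gibbs_sweep(ws, zs, nwk_local, ndk_d):
    """One collapsed-Gibbs sweep over one document, updating the given counts in place."""
    for i in range(len(ws)):
        w, k = ws[i], zs[i]
        nwk_local[w, k] -= 1
        ndk_d[k] -= 1
        p = (ndk_d + alpha) * (nwk_local[w] + beta) / (nwk_local.sum(axis=0) + V * beta)
        k = rng.choice(K, p=p / p.sum())
        zs[i] = k
        nwk_local[w, k] += 1
        ndk_d[k] += 1

P = 4                                                   # number of (simulated) processors
shards = np.array_split(rng.permutation(len(docs)), P)  # random split of the documents

for iteration in range(10):
    local_counts = [nwk.copy() for _ in range(P)]       # each processor works on a copy
    for p_id, shard in enumerate(shards):               # these sweeps would run in parallel
        for d in shard:
            gibbs_sweep(docs[d], z[d], local_counts[p_id], ndk[d])
    # AD-LDA-style merge: the global counts absorb every processor's local changes.
    nwk = nwk + sum(lc - nwk for lc in local_counts)
```

Because each shard samples against stale copies of the other shards' counts, the result only approximates sequential Gibbs sampling, which is where the "Approximate" in AD-LDA comes from.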
AllReduce
Introduction. Common pattern:
– do some learning in parallel
– aggregate local changes from each processor into shared parameters
– distribute the new shared parameters back to each processor
– and repeat…
In MapReduce terms: the parallel learning is the MAP, the aggregation is the REDUCE, and redistributing the parameters is some sort of copy.
Introduction. Common pattern:
– do some learning in parallel
– aggregate local changes from each processor into shared parameters
– distribute the new shared parameters back to each processor
– and repeat…
With AllReduce, the aggregate-and-redistribute step is a single ALLREDUCE after the MAP phase. AllReduce is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme; a sketch of the pattern follows.
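As a sketch of that pattern (not VW's actual code), the loop below uses mpi4py's Allreduce to sum per-worker gradients and apply the same averaged update on every worker; the toy least-squares data and the step size are placeholders.

```python
# Learn / AllReduce / update loop; assumes numpy and mpi4py are installed.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

# Each worker holds its own slice of data (simulated here with a per-rank seed).
rng = np.random.default_rng(comm.Get_rank())
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)

w = np.zeros(10)                        # shared parameters, identical on every worker
for step in range(50):
    # Local learning: gradient of the least-squares loss on this worker's data.
    local_grad = X.T @ (X @ w - y) / len(y)
    total_grad = np.empty_like(local_grad)
    # AllReduce: every worker ends up with the sum of all local gradients.
    comm.Allreduce(local_grad, total_grad, op=MPI.SUM)
    # Every worker applies the same averaged update, so parameters stay in sync.
    w -= 0.1 * (total_grad / n_workers)
```

Run with, e.g., `mpirun -np 4 python allreduce_demo.py`; every rank ends with the same w, with no separate parameter server.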
Gory details of VW Hadoop-AllReduce
Spanning-tree server:
– A separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
Worker nodes ("fake" mappers):
– Input for each worker is locally cached
– Workers all connect to the spanning-tree server
– Workers all execute the same code, which may contain AllReduce calls: workers synchronize whenever they reach an all-reduce (a toy simulation of the tree-structured reduce/broadcast follows)
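The spanning tree is what makes the aggregation scale: each node talks only to its parent and children. Below is a toy, single-process simulation of that reduce-up / broadcast-down structure (not VW's implementation); node 0 plays the role of the root.

```python
def tree_allreduce(values):
    """values[i] is node i's local value; node 0 is the root of an implicit binary tree."""
    n = len(values)
    sums = list(values)
    # Reduce phase: each node adds its children's partial sums, leaves first.
    for node in reversed(range(n)):
        for child in (2 * node + 1, 2 * node + 2):
            if child < n:
                sums[node] += sums[child]
    # Broadcast phase: the root's total is pushed back down, so every node sees it.
    return [sums[0]] * n

print(tree_allreduce([1.0, 2.0, 3.0, 4.0]))   # every node ends up with 10.0
```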
Hadoop AllReduce: don't wait for duplicate (speculatively executed) jobs.
A second-order method, similar to Newton's method.
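For reference, a generic second-order update has the form below (the slides do not spell out the exact variant; the VW/AllReduce work typically uses L-BFGS, which approximates the inverse Hessian rather than forming it explicitly):

```latex
w_{t+1} \;=\; w_t \;-\; H_t^{-1}\,\nabla L(w_t),
\qquad H_t = \nabla^2 L(w_t).
```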
2^24 features, ~100 non-zeros per example, 2.3B examples. An example is a user/page/ad triple and conjunctions of these; it is positive if there was a click-through on the ad.
50M examples, an explicitly constructed kernel, 11.7M features, 3,300 non-zeros per example. The old method (SVM) took 3 days; reported times are the time to reach a fixed test error.
On-line LDA
Pilfered from… Online Learning for Latent Dirichlet Allocation, Matthew Hoffman, Francis Bach & David Blei, NIPS 2010.
[Algorithm annotations: the per-document updates use γ; the global topic updates use λ]
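In Hoffman, Bach & Blei's online LDA, γ are the per-document variational parameters fit in the inner (E-like) step, and λ are the global topic-word parameters updated with a decaying step size; schematically:

```latex
\lambda^{(t)} \;=\; (1-\rho_t)\,\lambda^{(t-1)} \;+\; \rho_t\,\tilde{\lambda}^{(t)},
\qquad \rho_t = (\tau_0 + t)^{-\kappa}, \quad \kappa \in (0.5, 1],
```

where \tilde{\lambda}^{(t)} is the estimate of λ computed from the current document (or mini-batch) using its fitted γ.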
Monday’s lecture
recap
Compute expectations over the z's any way you want….
Technical Details. The variational distribution is q(z_d) over the whole vector of topic assignments for document d, not a fully factorized per-token q(z_dn)! Approximate it using Gibbs sampling: sample for a while, then estimate the needed expectations from the samples. Evaluation uses running time and topic "coherence", where D(w) = # of docs containing word w (the usual coherence formula is sketched below).
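The slides only name "coherence"; the usual definition (Mimno et al., 2011), in terms of the document counts D(·) above and a topic's M most probable words w_1, …, w_M, is:

```latex
C(t) \;=\; \sum_{m=2}^{M} \sum_{l=1}^{m-1}
\log \frac{D(w_m, w_l) + 1}{D(w_l)},
```

where D(w_m, w_l) is the number of documents containing both words.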
Summary of LDA speedup tricks
Gibbs sampler:
– O(N*K*T), and K grows with N
– Need to keep the corpus (and the z's) in memory
You can parallelize:
– You only need to keep a slice of the corpus
– But you need to synchronize K multinomials over the vocabulary
– AllReduce would help?
You can sparsify the sampling and topic counts:
– Mimno's trick greatly reduces memory (see the decomposition sketched after this list)
You can do the computation on-line:
– Only need to keep K multinomials and one document's worth of corpus and z's in memory
You can combine some of these methods:
– Online sparsified LDA
– Parallel online sparsified LDA?
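A sketch of what "Mimno's trick" refers to (the SparseLDA decomposition of Yao, Mimno & McCallum, assuming the standard collapsed-Gibbs conditional): the sampling mass for a token of word w in document d splits into three buckets, two of which are sparse:

```latex
p(z = k \mid \cdot) \;\propto\;
\frac{(\alpha_k + n_{k|d})(\beta + n_{w|k})}{\beta V + n_k}
\;=\;
\underbrace{\frac{\alpha_k \beta}{\beta V + n_k}}_{s:\ \text{smoothing}}
\;+\;
\underbrace{\frac{n_{k|d}\,\beta}{\beta V + n_k}}_{r:\ \text{doc-topic, sparse}}
\;+\;
\underbrace{\frac{(\alpha_k + n_{k|d})\,n_{w|k}}{\beta V + n_k}}_{q:\ \text{topic-word, sparse}}
```

Only topics that actually occur in document d contribute to r, and only topics in which word w occurs contribute to q, so most tokens can be sampled while touching far fewer than K topics.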