1 Scaling up LDA (Monday’s lecture)

2 What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors, then run "Approximate Distributed LDA." This split/aggregate pattern is a common subtask in parallel versions of LDA, SGD, …. (see the sketch below)
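
A minimal sketch of the Approximate Distributed LDA loop described above, under stated assumptions: gibbs_sweep is a hypothetical helper (not from the lecture) that runs one Gibbs pass over a shard against a copy of the global topic-word counts and returns the delta it produced.

    # Sketch of Approximate Distributed LDA (AD-LDA); gibbs_sweep is hypothetical.
    import numpy as np

    def ad_lda(doc_shards, topic_word, iters=100):
        for _ in range(iters):
            # each of the p processors samples against the same stale global counts
            deltas = [gibbs_sweep(shard, topic_word.copy()) for shard in doc_shards]  # run in parallel
            # aggregate everyone's local changes, then redistribute the new counts
            topic_word = topic_word + sum(deltas)
        return topic_word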

3 AllReduce

5 Introduction. Common pattern:
– do some learning in parallel
– aggregate local changes from each processor to shared parameters
– distribute the new shared parameters back to each processor
– and repeat….
In MapReduce terms: MAP does the parallel learning, REDUCE does the aggregation, and "some sort of copy" redistributes the parameters.

6 Introduction. Common pattern:
– do some learning in parallel
– aggregate local changes from each processor to shared parameters
– distribute the new shared parameters back to each processor
– and repeat….
Here the aggregate-and-redistribute step is a single ALLREDUCE after the MAP. AllReduce is implemented in MPI and, recently, in the VW code (John Langford) in a Hadoop-compatible scheme.
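
The same pattern written against a generic allreduce. This is only an illustration: mpi4py stands in for "some allreduce implementation" (VW ships its own spanning-tree allreduce rather than MPI), and learn_on_local_shard is a hypothetical helper returning this node's parameter update.

    # learn / aggregate / redistribute, expressed with an allreduce
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    dim, epochs = 1000, 10
    w = np.zeros(dim)                       # shared parameters, replicated on every node

    for t in range(epochs):
        local = learn_on_local_shard(w)     # do some learning in parallel (hypothetical helper)
        total = np.empty_like(local)
        comm.Allreduce(local, total, op=MPI.SUM)   # aggregate local changes from each node
        w += total / comm.Get_size()               # every node now holds the same new parameters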

13 Gory details of VW Hadoop-AllReduce
Spanning-tree server:
– a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
Worker nodes ("fake" mappers):
– input for each worker is locally cached
– workers all connect to the spanning-tree server
– workers all execute the same code, which may contain AllReduce calls: workers synchronize whenever they reach an allreduce (toy sketch below)
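
A toy, single-process sketch of what the spanning tree buys you: local values are summed up the tree to the root and the total is pushed back down, so every node ends up holding the same aggregate. The tree shape and node values here are made up for illustration.

    # Toy allreduce over a spanning tree: reduce up to the root, broadcast back down.
    tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}   # node -> children (made-up tree)
    value = {n: float(n + 1) for n in tree}              # each worker's local contribution

    def reduce_up(node):
        # sum this node's value with the reduced values of its subtree
        return value[node] + sum(reduce_up(c) for c in tree[node])

    def broadcast_down(node, total):
        value[node] = total
        for c in tree[node]:
            broadcast_down(c, total)

    total = reduce_up(0)        # root gathers the global sum
    broadcast_down(0, total)    # every node now holds the same aggregate
    print(value)                # {0: 15.0, 1: 15.0, 2: 15.0, 3: 15.0, 4: 15.0}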

14 Hadoop AllReduce: don't wait for duplicate jobs.

15 Second-order method - like Newton’s method
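
For concreteness, the generic second-order step on a loss L(w); in practice a quasi-Newton method such as L-BFGS approximates the product H^{-1} \nabla L rather than forming and inverting the Hessian.

    w_{t+1} = w_t - H_t^{-1} \nabla L(w_t), \qquad H_t = \nabla^2 L(w_t)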

16 2^24 features, ≈100 non-zeros/example, 2.3B examples. Each example is a user/page/ad and conjunctions of these, labeled positive if there was a click-through on the ad.

17 50M examples; explicitly constructed kernel → 11.7M features, 3,300 nonzeros/example. Old method: SVM, 3 days. (Reporting the time to reach a fixed test error.)

19 On-line LDA

20 Pilfered from… "Online Learning for Latent Dirichlet Allocation," Matthew Hoffman, Francis Bach & David Blei, NIPS 2010.
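
The core update from that paper, paraphrased here for a mini-batch consisting of a single document t: fit the per-document variational parameters (γ_t, φ_t) as usual, form the λ you would get if all D documents in the corpus looked like document t, and blend it into the running λ with a decaying weight:

    \tilde{\lambda}_{kw} = \eta + D\, n_{tw}\, \phi_{twk}, \qquad
    \lambda \leftarrow (1-\rho_t)\,\lambda + \rho_t\,\tilde{\lambda}, \qquad
    \rho_t = (\tau_0 + t)^{-\kappa}, \;\; \kappa \in (0.5, 1]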

27 (annotations on the algorithm figure: "uses γ", "uses λ")

42 Monday’s lecture

43 recap

44 Compute expectations over the z's any way you want….
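
For instance, here is a minimal sketch (not the lecture's code) of one such way: estimate the expected topic-word counts for a single document by averaging samples from a short within-document Gibbs chain, given a current K×V topic matrix beta.

    # Estimate E_q[n_{kw}] for one document by averaging within-document Gibbs samples.
    import numpy as np

    def expected_counts(doc, beta, alpha=0.1, burn=20, keep=30, rng=np.random.default_rng(0)):
        K, V = beta.shape
        z = rng.integers(K, size=len(doc))            # random initial topic assignments
        n_dk = np.bincount(z, minlength=K).astype(float)
        avg = np.zeros((K, V))
        for sweep in range(burn + keep):
            for i, w in enumerate(doc):
                n_dk[z[i]] -= 1                       # remove token i's current assignment
                p = (alpha + n_dk) * beta[:, w]       # p(z_i = k | rest)
                z[i] = rng.choice(K, p=p / p.sum())
                n_dk[z[i]] += 1
            if sweep >= burn:                         # average the post-burn-in samples
                for i, w in enumerate(doc):
                    avg[z[i], w] += 1.0 / keep
        return avg                                    # ≈ expected topic-word counts for this doc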

46 Technical Details. Variational distribution: q(z_d) over the document's whole vector of topic assignments, not a fully factorized per-token q(z_{di})! Approximate it using Gibbs: after sampling for a while, estimate the needed expectations from the samples. Evaluate using running time and topic "coherence", where D(w) = # docs containing word w (formula below).
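
As a reference point, the topic-coherence score I take this to mean (Mimno et al., 2011) scores a topic t by its top-M word list V^{(t)} = (v_1^{(t)}, …, v_M^{(t)}) using document co-occurrence counts:

    C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}

where D(w) = # docs containing w and D(w, w') = # docs containing both w and w'.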

48 (results plot; annotation: "better")

49 Summary of LDA speedup tricks
Gibbs sampler:
– O(N*K*T), and K grows with N
– need to keep the corpus (and z's) in memory
You can parallelize:
– you need to keep only a slice of the corpus
– but you need to synchronize K multinomials over the vocabulary
– AllReduce would help?
You can sparsify the sampling and topic counts:
– Mimno's trick greatly reduces memory (see the decomposition sketched after this list)
You can do the computation on-line:
– only need to keep the K multinomials and one document's worth of corpus and z's in memory
You can combine some of these methods:
– online sparsified LDA
– parallel online sparsified LDA?
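
For the "sparsify the sampling" bullet: as I understand it, this refers to the SparseLDA decomposition (Yao, Mimno & McCallum), which splits the collapsed-Gibbs sampling mass for token w in document d into three buckets, so per-token work scales with the topics actually present in the document and in the word's counts rather than with K:

    p(z = k \mid \text{rest}) \;\propto\; \frac{(\alpha_k + n_{k|d})(\beta + n_{w|k})}{\beta V + n_k}
    \;=\; \underbrace{\frac{\alpha_k \beta}{\beta V + n_k}}_{s:\ \text{smoothing, cached}}
    \;+\; \underbrace{\frac{n_{k|d}\,\beta}{\beta V + n_k}}_{r:\ \text{sparse in the doc's topics}}
    \;+\; \underbrace{\frac{(\alpha_k + n_{k|d})\,n_{w|k}}{\beta V + n_k}}_{q:\ \text{sparse in the word's topics}}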

