
1 Parallelized variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability
Ramesh Nallapati, William Cohen and John Lafferty
Machine Learning Department, Carnegie Mellon University

2 Latent Dirichlet Allocation (LDA)
A directed graphical model for topic mining from large-scale document collections:
– A completely unsupervised technique
– Extracts semantically coherent multinomial distributions over the vocabulary, called topics
– Represents documents in a lower-dimensional topic space

3 LDA: generative process
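
The slide's figure did not survive the transcript. For reference, the standard LDA generative story (Blei, Ng and Jordan, 2003), which is presumably what the slide depicted, draws each document by sampling topic proportions, then a topic and a word for each position:

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions for document } d \\
z_{dn} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assignment of the $n$-th word} \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{dn}}) && \text{the observed word}
\end{align*}
```

Here α and the topics β are the corpus-level parameters estimated by EM.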

4 LDA: topics

5 LDA: inference
Exact inference is intractable
Several approximate inference techniques are available:
– Stochastic techniques: MCMC sampling
– Numerical techniques: loopy belief propagation, variational inference, expectation propagation

6 LDA: variational inference
The true (intractable) posterior over the latent variables is approximated by a fully factored variational posterior
This yields a lower bound on the true data log-likelihood:
– The gap is exactly the KL-divergence between the variational posterior and the true posterior
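
The slide's formulas were images and are missing from the transcript. A reconstruction from the standard treatment (Blei et al., 2003), using γ for the variational Dirichlet and φ for the per-word topic multinomials:

```latex
q(\theta, z \mid \gamma, \phi) \;=\; q(\theta \mid \gamma)\prod_{n=1}^{N} q(z_n \mid \phi_n)

\log p(w \mid \alpha, \beta)
  \;=\; \mathcal{L}(\gamma, \phi; \alpha, \beta)
  \;+\; \mathrm{KL}\big(\, q(\theta, z \mid \gamma, \phi) \;\big\|\; p(\theta, z \mid w, \alpha, \beta) \,\big)
```

Maximizing the bound L over (γ, φ) is therefore equivalent to minimizing the KL term.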

7 LDA: variational inference: E-step and M-step update equations
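
The update equations on this slide were likewise images. The standard updates from Blei et al. (2003), which lda-c implements and the talk presumably showed (Ψ is the digamma function):

```latex
% E-step (per document d, iterated to convergence):
\phi_{dnk} \;\propto\; \beta_{k,w_{dn}} \exp\big(\Psi(\gamma_{dk})\big),
\qquad
\gamma_{dk} \;=\; \alpha_k + \sum_{n=1}^{N_d} \phi_{dnk}

% M-step (re-estimate the topics from the accumulated phi counts):
\beta_{kv} \;\propto\; \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{dnk}\,\mathbf{1}[w_{dn} = v]
```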

8 LDA: variational inference
The main bottleneck is the E-step
Key insight:
– The variational parameters γ_d and φ_dnk can be computed independently for each document
– The E-step can therefore be parallelized
Two implementations:
– Multi-processor architecture with shared memory
– Distributed architecture with shared disk

9 Parallel implementation
Hardware:
– Linux machine with 4 CPUs; each CPU is an Intel Xeon 2.4GHz processor
– Shared 4GB RAM
– 512 KB cache
Software:
– David Blei's LDA implementation in C
– pthreads used to parallelize the code (see the sketch below)
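
A minimal sketch of splitting the E-step across pthreads, assuming hypothetical types and a per-document helper (corpus_t, lda_model_t, doc_e_step) standing in for the corresponding pieces of lda-c; this illustrates the design, not the authors' actual code. Each thread owns a contiguous block of documents, reads the shared model, and writes only its own documents' variational parameters, so no locks are needed:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Hypothetical stand-ins for lda-c's corpus and model structs. */
typedef struct { int num_docs;   /* per-document data ... */ } corpus_t;
typedef struct { int num_topics; /* log beta, alpha ...   */ } lda_model_t;

/* Hypothetical per-document E-step: updates gamma/phi for doc d and
   returns its contribution to the variational bound. */
static double doc_e_step(const corpus_t *c, const lda_model_t *m, int d)
{
    (void)c; (void)m; (void)d;
    return 0.0; /* stub: the real code runs the phi/gamma fixed point */
}

typedef struct {
    const corpus_t    *corpus;  /* shared, read-only            */
    const lda_model_t *model;   /* shared, read-only            */
    int    start, end;          /* this thread's document range */
    double likelihood;          /* per-thread partial bound     */
} worker_t;

static void *e_step_worker(void *arg)
{
    worker_t *w = arg;
    w->likelihood = 0.0;
    /* Documents are independent given the model, so each thread can
       proceed without locking: it writes only its own docs' params. */
    for (int d = w->start; d < w->end; d++)
        w->likelihood += doc_e_step(w->corpus, w->model, d);
    return NULL;
}

static double parallel_e_step(const corpus_t *corpus, const lda_model_t *model)
{
    pthread_t threads[NUM_THREADS];
    worker_t  args[NUM_THREADS];
    int per = (corpus->num_docs + NUM_THREADS - 1) / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        int end = (t + 1) * per < corpus->num_docs
                ? (t + 1) * per : corpus->num_docs;
        args[t] = (worker_t){ corpus, model, t * per, end, 0.0 };
        pthread_create(&threads[t], NULL, e_step_worker, &args[t]);
    }

    double likelihood = 0.0; /* total variational bound */
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        likelihood += args[t].likelihood;
    }
    return likelihood;
}

int main(void)
{
    corpus_t corpus = { .num_docs = 1000 };
    lda_model_t model = { .num_topics = 50 };
    printf("bound = %f\n", parallel_e_step(&corpus, &model));
    return 0;
}
```

Note that the model itself stays shared and read-only; the discussion slide later attributes the sub-linear speedup to contention on exactly this shared read path.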

10 Parallel implementation

11 Distributed implementation
Hardware:
– Cluster of 96 nodes; each node is a Linux machine with a Transmeta Efficeon 1.2GHz processor
– 1GB RAM and 1MB cache per node
Software:
– David Blei's C code forms the core
– Perl code coordinates the worker nodes
– rsh connections invoke the worker nodes
– Communication through disk (see the M-step merge sketch below)
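
A sketch of what the shared-disk design implies for the M-step, under stated assumptions: each worker writes its partial class-word sufficient statistics to a file, and the master sums the per-node files before re-estimating beta. The `ss.NNN` file names, the flat binary layout, and the fixed K and V are all invented for illustration; the transcript does not describe the authors' Perl/rsh setup or file format.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define K 50       /* number of topics (as in the experiments) */
#define V 100000   /* approximate vocabulary size              */

int main(int argc, char **argv)
{
    int num_nodes = (argc > 1) ? atoi(argv[1]) : 1;
    /* class_word[k*V + v]: expected count of word v under topic k */
    double *class_word = calloc((size_t)K * V, sizeof(double));
    if (!class_word) return 1;

    /* Accumulate each worker's partial sufficient statistics,
       assumed to be written to disk as a flat array of K*V doubles. */
    for (int n = 0; n < num_nodes; n++) {
        char path[64];
        snprintf(path, sizeof path, "ss.%03d", n); /* assumed name */
        FILE *f = fopen(path, "rb");
        if (!f) { perror(path); return 1; }
        double x;
        for (size_t i = 0; i < (size_t)K * V; i++) {
            if (fread(&x, sizeof x, 1, f) != 1) { fclose(f); return 1; }
            class_word[i] += x;
        }
        fclose(f);
    }

    /* M-step: beta_{kv} = class_word[k][v] / sum_v class_word[k][v],
       stored in log space (a common choice for numerical stability). */
    for (int k = 0; k < K; k++) {
        double total = 0.0;
        for (size_t v = 0; v < V; v++) total += class_word[(size_t)k * V + v];
        if (total <= 0) continue;
        for (size_t v = 0; v < V; v++) {
            double b = class_word[(size_t)k * V + v] / total;
            class_word[(size_t)k * V + v] = log(b > 0 ? b : 1e-100);
        }
    }

    /* ... write the new model back to the shared disk ... */
    free(class_word);
    return 0;
}
```

Run after the workers finish, e.g. `./merge 90` for a 90-node job. This merge is the step whose input grows with the number of nodes, which the later discussion identifies as one source of diminishing returns.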

12 Distributed implementation

13 Data
A subset of PubMed consisting of 300K documents
Collection indexed using Lemur:
– Stopwords removed, remaining words stemmed
Vocabulary size: ≈100,000
Generated subcollections of various sizes:
– Vocabulary size remains the same across subcollections

14 Experiments
Studied runtime as a function of:
– the number of threads/nodes
– collection size
Fixed the number of topics at 50
Multiprocessor setting: varied the number of CPUs from 1 to 4
Distributed setting: varied the number of nodes from 1 to 90
LDA initialization on a collection:
– a randomly initialized LDA model was run for 1 EM iteration
– the resulting model was used as the starting point in all experiments
Reported the average runtime per EM iteration

15 Results: Multiprocessor

16 Results: Multiprocessor case, 50,000 documents

17 Discussion
The plot shows the E-step is the main bottleneck
The speedup is not linear:
– A speedup of only 1.85 from 1 to 4 CPUs (50,000 docs)
– Possibly due to conflicts between threads read-accessing the shared model in main memory
– Creating a per-thread copy of the model would avoid this, but results in huge memory requirements
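
To put the 1.85x figure in perspective (a worked computation, not from the slides): parallel efficiency on p CPUs is S(p)/p, and fitting Amdahl's law gives an effective parallel fraction f:

```latex
E(4) = \frac{S(4)}{4} = \frac{1.85}{4} \approx 0.46,
\qquad
S(p) = \frac{1}{(1-f) + f/p}
\;\Rightarrow\;
1.85 = \frac{1}{(1-f) + f/4}
\;\Rightarrow\;
f \approx 0.61
```

That is, the run behaves as if only about 61% of the work were parallelized, consistent with the memory-contention explanation above.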

18 Results: Distributed

19 Results: distributed case, 50,000 documents

20 Discussion
Sub-linear speedups:
– Speedup of 14.5 from 1 to 50 nodes (50,000 docs)
– Speedup tapers off beyond an optimum number of nodes: conflicts in disk reading, and in the M-step the input file size grows with the number of nodes
– The optimum number of nodes increases with collection size, so scaling up the cluster is worthwhile for larger collections
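
The same illustrative computation for the distributed case:

```latex
E(50) = \frac{S(50)}{50} = \frac{14.5}{50} = 0.29,
\qquad
14.5 = \frac{1}{(1-f) + f/50}
\;\Rightarrow\;
f \approx 0.95
```

The effective parallel fraction (about 95%) is much higher than in the shared-memory case, even though per-node efficiency is lower.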

21 Conclusions
The distributed version seems more desirable:
– Scalable
– Cheaper
Future work: further improvements
– Communication using RPCs instead of the shared disk
– Loading only the sparse part of the model corresponding to the sparse document-term matrix during the E-step
– Loading one document at a time in the E-step
– Clustering documents before splitting the data between nodes

