Scaling up LDA
William Cohen
Outline
LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
Review – LDA
Latent Dirichlet Allocation with Gibbs sampling
[Plate diagram: topic assignment z and word w, inside plates over the N words of each of the M documents]
Randomly initialize each z_{m,n}
Repeat for t = 1, …:
  For each doc m, word n:
    Find Pr(z_{m,n} = k | all the other z's)
    Sample z_{m,n} according to that distribution
Way way more detail
More detail
What gets learned…..
In a math-ier notation
N[d,k] = # words in doc d assigned to topic k
W[w,k] = # times word w is assigned to topic k
N[*,k] = total # words assigned to topic k
N[*,*] = V
for each document d and word position j in d:
  z[d,j] = k, a random topic
  N[d,k]++
  W[w,k]++, where w = the id of the j-th word in d
for each pass t = 1, 2, …:
  for each document d and word position j in d:
    z[d,j] = k, a new topic sampled from Pr(z[d,j] = k | all the other z's)
    update N and W to reflect the new assignment of z:
      N[d,k]++;  N[d,k']-- where k' is the old z[d,j]
      W[w,k]++;  W[w,k']-- where w is w[d,j]
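A minimal runnable sketch of one such pass in Python. The count arrays follow the N/W notation above; the corpus format, the Dirichlet smoothing constants alpha and beta, and the vocabulary size V are the usual ingredients of collapsed Gibbs for LDA, written out here as an illustration rather than taken from any particular implementation.

    import numpy as np

    def gibbs_pass(docs, z, N, W, Nk, alpha, beta, V, rng):
        """One Gibbs sweep. docs[d] is a list of word ids, z[d] the current topic
        assignments for doc d; N[d,k] and W[w,k] are the count tables from the
        slides, and Nk[k] = N[*,k] is the total number of words in topic k."""
        K = Nk.shape[0]
        for d, words in enumerate(docs):
            for j, w in enumerate(words):
                k_old = z[d][j]
                # remove the token being resampled from all counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Nk[k_old] -= 1
                # Pr(z = k | everything else), up to a constant;
                # note the loop (vectorized here) over all K topics
                p = (N[d, :] + alpha) * (W[w, :] + beta) / (Nk + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                # add the new assignment back into the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Nk[k_new] += 1
                z[d][j] = k_new

Removing the token's own counts before computing the conditional and adding them back afterwards is the standard collapsed-Gibbs bookkeeping.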
[Figure: sampling z by throwing a uniform random dart at a unit-height line divided into segments for z=1, z=2, z=3, …]
1. You spend a lot of time sampling
2. There's a loop over all topics here in the sampler
JMLR 2009
Observation
How much does the choice of z depend on the other z's in the same document?
– quite a lot
How much does the choice of z depend on the z's elsewhere in the corpus?
– maybe not so much
– it depends on Pr(w|t), but that changes slowly
Can we parallelize Gibbs and still get good results?
Question
Can we parallelize Gibbs sampling?
– formally, no: every choice of z depends on all the other z's
– Gibbs needs to be sequential, just like SGD
What if you try to parallelize?
Split the document/term matrix randomly and distribute it to p processors… then run “Approximate Distributed LDA”
What if you try to parallelize?
D = # docs   W = # word types   K = # topics   N = # words in the corpus
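A sketch of the synchronization step behind AD-LDA, assuming each of the p processors runs ordinary collapsed Gibbs on its own documents against a private copy of the global word-topic counts, and the copies are reconciled after every local sweep (array names are illustrative):

    import numpy as np

    def adlda_sync(W_global, W_locals):
        """Fold each processor's local change to the word-topic counts back into
        the global matrix; every processor then restarts its next sweep from a
        copy of the merged result."""
        delta = sum(Wp - W_global for Wp in W_locals)   # what each processor changed
        return W_global + delta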
Algorithms:
– Distributed variational EM
– Asynchronous LDA (AS-LDA)
– Approximate Distributed LDA (AD-LDA)
– Ensemble versions of LDA: HLDA, DCM-LDA
Implementations:
– Yahoo_LDA (GitHub): not Hadoop; special-purpose communication code for synchronizing the global counts (Alex Smola, Yahoo/CMU)
– Mahout LDA (Andy Schlaikjer, CMU/Twitter)
Outline
LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
[Figure: sampling z by picking a uniform point on a unit-height line with segments for z=1, z=2, z=3, …; the lookup uses a running total of P(z=k|…), i.e. P(z<=k)]
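That lookup is a linear walk over the running total. A sketch, assuming the unnormalized per-topic weights for the current token have already been computed:

    import random

    def sample_by_running_total(weights):
        """Draw index k with probability weights[k]/sum(weights) by walking the
        running total P(z<=k): this is the O(K) loop over all topics."""
        r = random.random() * sum(weights)
        total = 0.0
        for k, w_k in enumerate(weights):
            total += w_k
            if r <= total:
                return k
        return len(weights) - 1   # guard against floating-point round-off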
Discussion…
Where do you spend your time?
– sampling the z's
– each sampling step involves a loop over all topics
– number of topics grows with corpus size
– this seems wasteful
– even with many topics, words are often only assigned to a few different topics
– low-frequency words appear < K times … and there are lots and lots of them!
– even frequent words are not in every topic
Results from Porteous et al.
Results
Outline
LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
  – The Mimno/McCallum decomposition
KDD 09
z = s + r + q
The total sampling mass splits into three buckets: a smoothing-only term s, a document-topic term r, and a topic-word term q.
If U < s: look up U on a line segment with tic-marks at α_1β/(βV + n_{·|1}), α_2β/(βV + n_{·|2}), …
If s < U < s+r: look up U on the line segment for r
– only need to check t such that n_{t|d} > 0
z = s + r + q
If U < s: look up U on a line segment with tic-marks at α_1β/(βV + n_{·|1}), α_2β/(βV + n_{·|2}), …
If s < U < s+r: look up U on the line segment for r
If s+r < U: look up U on the line segment for q
– only need to check t such that n_{w|t} > 0
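A sketch of that three-bucket draw in Python. The s/r/q split follows the decomposition above; the sparse containers, the variable names, and the assumption that the counts already exclude the token being resampled are illustrative choices, not a transcription of the SparseLDA code.

    import random

    def sparse_lda_draw(alpha, beta, V, n_td, n_wt, n_t):
        """One s/r/q-style draw for word w in document d.
        alpha[t]: Dirichlet prior per topic; n_td: dict topic -> count in this doc;
        n_wt: dict topic -> count of this word; n_t[t]: total words in topic t."""
        K = len(alpha)
        denom = [beta * V + n_t[t] for t in range(K)]
        s = sum(alpha[t] * beta / denom[t] for t in range(K))       # smoothing bucket
        r = sum(c * beta / denom[t] for t, c in n_td.items())       # document bucket
        q = sum((alpha[t] + n_td.get(t, 0)) * c / denom[t]
                for t, c in n_wt.items())                           # topic-word bucket
        U = random.random() * (s + r + q)
        if U < s:                                # rare: walk all K smoothing terms
            for t in range(K):
                U -= alpha[t] * beta / denom[t]
                if U <= 0:
                    return t
        elif U < s + r:                          # only topics present in this document
            U -= s
            for t, c in n_td.items():
                U -= c * beta / denom[t]
                if U <= 0:
                    return t
        else:                                    # only topics containing this word
            U -= s + r
            for t, c in n_wt.items():
                U -= c * (alpha[t] + n_td.get(t, 0)) / denom[t]
                if U <= 0:
                    return t
        return K - 1                             # floating-point fallback

s only changes when a per-topic total changes, and r only when the current document's counts change, so both can be maintained incrementally instead of recomputed for every token.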
z = s + r + q
– for q: only need to check t such that n_{w|t} > 0
– for r: only need to check t such that n_{t|d} > 0
– for s: only need to check occasionally (< 10% of the time)
z = s + r + q
Need to store n_{w|t} for each word, topic pair …???
Only need to store n_{t|d} for the current d
Only need to store (and maintain) the total words per topic, the α's, β, and V
Trick: count up n_{t|d} for d when you start working on d, and update it incrementally
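A sketch of that trick, assuming the sampler keeps the current document's assignments in a plain list (names illustrative):

    from collections import Counter

    z_d = [3, 0, 3, 7, 0]      # illustrative: current topic assignments for doc d
    n_td = Counter(z_d)        # build n_{t|d} once when we start working on d
    # ...then, inside the sampler, maintain it incrementally as assignments change:
    k_old, k_new = 3, 7
    n_td[k_old] -= 1
    n_td[k_new] += 1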
z = s + r + q
Need to store n_{w|t} for each word, topic pair …???
1. Precompute, for each t, the coefficient (α_t + n_{t|d}) / (βV + n_{·|t})
Most (> 90%) of the time and space is here…
2. Quickly find the t's such that n_{w|t} is large for w
Need to store n_{w|t} for each word, topic pair …???
1. Precompute, for each t, the coefficient (α_t + n_{t|d}) / (βV + n_{·|t})
Most (> 90%) of the time and space is here…
2. Quickly find the t's such that n_{w|t} is large for w
– map w to an int array
  – no larger than the frequency of w
  – no larger than #topics
– encode (t, n) as a bit vector
  – n in the high-order bits
  – t in the low-order bits
– keep the ints sorted in descending order
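A sketch of that packed encoding in Python (the 10-bit topic field, i.e. room for 1024 topics, is an arbitrary illustrative split, not something the slides specify):

    TOPIC_BITS = 10                        # room for up to 1024 topics (illustrative)
    TOPIC_MASK = (1 << TOPIC_BITS) - 1

    def encode(topic, count):
        """Pack (topic, count) into one int: count in the high-order bits, topic in
        the low-order bits, so sorting packed ints descending sorts by count."""
        return (count << TOPIC_BITS) | topic

    def decode(packed):
        return packed & TOPIC_MASK, packed >> TOPIC_BITS

    def bump(row, topic, delta):
        """row is the packed, descending-sorted int list for one word w; adjust the
        count for `topic` by delta (+1/-1) and restore the ordering."""
        for i, packed in enumerate(row):
            t, n = decode(packed)
            if t == topic:
                row[i] = encode(t, n + delta)
                break
        else:
            row.append(encode(topic, delta))
        row.sort(reverse=True)             # entries that hit count 0 could be dropped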
Outline
LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
  – The Mimno/McCallum decomposition (SparseLDA)
  – Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)
Alias tables
Basic problem: how can we sample quickly from a biased coin (more generally, a discrete distribution over K outcomes)?
If the distribution changes slowly, maybe we can do some preprocessing and then sample from it multiple times.
Proof of concept: generate r ~ uniform and look it up with a binary tree over the cumulative probabilities (e.g. r in (23/40, 7/10] picks one outcome): O(K) for a linear scan vs. O(log2 K) with the tree.
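A proof-of-concept of that preprocessing in Python, using a binary search over precomputed cumulative sums in place of an explicit tree (same O(log2 K) lookup):

    import bisect, random
    from itertools import accumulate

    def build_cdf(weights):
        """One-time O(K) preprocessing: running totals of the (unnormalized) weights."""
        return list(accumulate(weights))

    def sample(cdf):
        """O(log2 K) per draw: binary-search the running totals for a uniform value."""
        r = random.random() * cdf[-1]
        return bisect.bisect_left(cdf, r)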
Alias tables
Another idea… simulate the dart with two drawn values:
– rx = int(u1 * K)
– ry = u2 * p_max
– keep throwing until you hit a stripe
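A sketch of that dart-throwing (rejection) loop, where p is the length-K probability vector and p_max its maximum (names illustrative):

    import random

    def dartboard_sample(p, p_max):
        """Throw darts at a K-wide, p_max-tall board; accept when the dart lands
        under the bar of the column it hit."""
        K = len(p)
        while True:
            rx = random.randrange(K)          # which column the dart lands in
            ry = random.random() * p_max      # how high the dart lands
            if ry < p[rx]:                    # hit a stripe: accept
                return rx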
Alias tables
An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle's height to the average probability, not the maximum probability, and cutting and pasting a bit. You can always do this using only two colors in each column of the final alias table, and then the dart never misses!
Mathematically speaking…
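A sketch of the standard construction in Python (a Walker/Vose-style build; prob and alias hold the two "colors" kept in each column, and all names are illustrative):

    import random

    def build_alias(p):
        """O(K) preprocessing: rescale so the average column height is 1, then pair
        each under-full column with an over-full one so every column holds at most
        two outcomes (itself plus one alias)."""
        K = len(p)
        total = sum(p)
        scaled = [x * K / total for x in p]
        prob, alias = [1.0] * K, list(range(K))
        small = [i for i, x in enumerate(scaled) if x < 1.0]
        large = [i for i, x in enumerate(scaled) if x >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l          # column s: itself, topped up by l
            scaled[l] -= 1.0 - scaled[s]              # l donated the leftover mass
            (small if scaled[l] < 1.0 else large).append(l)
        return prob, alias                            # leftover columns stay full (prob 1)

    def alias_draw(prob, alias):
        """O(1) per draw: pick a column uniformly, then one of its two colors."""
        k = random.randrange(len(prob))
        return k if random.random() < prob[k] else alias[k]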
KDD 2014
Key ideas:
– use a variant of the Mimno/McCallum decomposition
– use alias tables to sample from the dense parts
– since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs
KDD 2014
– q is the stale, easy-to-draw-from distribution
– p is the up-to-date distribution
– computing the ratios p(i)/q(i) is cheap
– usually the ratio is close to one; otherwise “the dart missed”
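A sketch of that Metropolis-Hastings step, assuming p(t) evaluates the current (unnormalized) Gibbs probability of topic t for this token and q(t) returns the stale proposal probability stored alongside the alias table (all names illustrative):

    import random

    def mh_step(t_current, p, q, draw_from_q):
        """Propose a topic from the stale, easy-to-sample distribution q (e.g. an
        alias table built a while ago) and accept or reject against the up-to-date
        target p. Only two evaluations of p and q are needed per step."""
        t_new = draw_from_q()
        ratio = (p(t_new) * q(t_current)) / (p(t_current) * q(t_new))
        if random.random() < min(1.0, ratio):
            return t_new       # usually accepted: the ratio is close to one
        return t_current       # rejected ("the dart missed"): keep the old topic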
KDD 2014