Scaling up LDA William Cohen. First some pictures…

Slides:

Advertisements

Similar presentations

Topic models Source: Topic models, David Blei, MLSS 09.

Advertisements

CSE 1302 Lecture 23 Hashing and Hash Tables Richard Gesick.

Searching for Data Relationship between searching and sorting Simple linear searching Linear searching of sorted data Searching for string or numeric data.

Overview of this week Debugging tips for ML algorithms

Finding Frequent Items in Data Streams Moses CharikarPrinceton Un., Google Inc. Kevin ChenUC Berkeley, Google Inc. Martin Franch-ColtonRutgers Un., Google.

The Linux Kernel: Memory Management

Lirong Xia Approximate inference: Particle filter Tue, April 1, 2014.

Computer Science: A Structured Programming Approach Using C1 8-4 Array Applications In this section we study two array applications: frequency arrays with.

Scaling up LDA William Cohen. Outline LDA/Gibbs algorithm details How to speed it up by parallelizing How to speed it up by faster sampling – Why sampling.

Statistical Topic Modeling part 1

A Joint Model of Text and Aspect Ratings for Sentiment Summarization Ivan Titov (University of Illinois) Ryan McDonald (Google Inc.) ACL 2008.

From last time What’s the real point of using vector spaces?: A user’s query can be viewed as a (very) short document. Query becomes a vector in the same.

D1: Quick Sort. The quick sort is an algorithm that sorts data into a specified order. For a quick sort, select the data item in the middle of the list.

Parallelized variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability Ramesh Nallapati, William Cohen and John.

Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation James Foulds 1, Levi Boyles 1, Christopher DuBois 2 Padhraic Smyth.

Sequence Alignment vs. Database Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.

CS 106 Introduction to Computer Science I 03 / 03 / 2008 Instructor: Michael Eckmann.

British Museum Library, London Picture Courtesy: flickr.

Algorithm Efficiency and Sorting

Elementary Data Structures and Algorithms

Topic models for corpora and for graphs. Motivation Social graphs seem to have –some aspects of randomness small diameter, giant connected components,..

Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.

Lecture 2 We have given O(n 3 ), O(n 2 ), O(nlogn) algorithms for the max sub-range problem. This time, a linear time algorithm! The idea is as follows:

Latent Dirichlet Allocation (LDA) Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University and Arvind Ramanathan of Oak Ridge National.

A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.

1 Data Structures and Algorithms Sorting I Gal A. Kaminka Computer Science Department.

Problem of the Day  Why are manhole covers round?

Fast sampling for LDA William Cohen. MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS.

Introduction to LDA Jinyang Gao. Outline Bayesian Analysis Dirichlet Distribution Evolution of Topic Model Gibbs Sampling Intuition Analysis of Parameter.

Topic Models Presented by Iulian Pruteanu Friday, July 28 th, 2006.

Topic Modeling using Latent Dirichlet Allocation

Testing Random-Number Generators Andy Wang CIS Computer Systems Performance Analysis.

Topic modeling experiments benchmark and simple evaluations 11/15, 11/16.

Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai.

Techniques for Dimensionality Reduction

CS246 Latent Dirichlet Analysis. LSI  LSI uses SVD to find the best rank-K approximation  The result is difficult to interpret especially with negative.

Midterm Exam Review Notes William Cohen 1. General hints in studying Understand what you’ve done and why – There will be questions that test your understanding.

CS 106 Introduction to Computer Science I 03 / 02 / 2007 Instructor: Michael Eckmann.

Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:

Scaling up LDA (Monday’s lecture). What if you try and parallelize? Split document/term matrix randomly and distribute to p processors.. then run “Approximate.

Paper_topic: Parallel Matrix Multiplication using Vertical Data.

Analysis of Social Media MLD , LTI William Cohen

How can we maintain an error bound? Settle for a “per-step” bound What’s the probability of a mistake at each step? Not cumulative, but Equal footing with.

Latent Dirichlet Allocation (LDA)

Modeling Annotated Data (SIGIR 2003) David M. Blei, Michael I. Jordan Univ. of California, Berkeley Presented by ChengXiang Zhai, July 10, 2003.

Review William Cohen. Sessions Randomized algorithms – The “hash trick” – Bloom filters how would you use it? how and why does it work? – Count-min sketch.

Sorting & Searching Geletaw S (MSC, MCITP). Objectives At the end of this session the students should be able to: – Design and implement the following.

Aims: To learn about some simple sorting algorithms. To develop understanding of the importance of efficient algorithms. Objectives: All:Understand how.

PHY 107 – Programming For Science. Today’s Goal  How to use while & do / while loops in code  Know & explain when use of either loop preferable  Understand.

Analysis of Social Media MLD , LTI William Cohen

Large Scale Parallel Supervised Topic-Modeling -implementation plan- Keisuke Kamataki Jun Zhu Eric Xing Sep 27, 2010.

Probabilistic models for corpora and graphs. Review: some generative models Multinomial Naïve Bayes C W1W1 W2W2 W3W3 ….. WNWN  M  For each document.

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation Yee W. Teh, David Newman and Max Welling Published on NIPS 2006 Discussion.

Latent Dirichlet Allocation

Speeding up LDA.

Exam Review Session William Cohen.

Problem Solving: Brute Force Approaches

Sequence comparison: Significance of similarity scores

Latent Dirichlet Analysis

עידן שני ביה"ס למדעי המחשב אוניברסיטת תל-אביב

Stochastic Optimization Maximization for Latent Variable Models

Topic models for corpora and for graphs

Michal Rosen-Zvi University of California, Irvine

CS246: Latent Dirichlet Analysis

Junghoo “John” Cho UCLA

Topic models for corpora and for graphs

10-405: LDA Lecture 2.

Presentation transcript:

Scaling up LDA William Cohen

First some pictures…

LDA in way too much detail William Cohen

Review - LDA Latent Dirichlet Allocation with Gibbs z w  M  N  Randomly initialize each z m,n Repeat for t=1,…. For each doc m, word n Find Pr(z mn =k|other z’s) Sample z mn according to that distr.

Way way more detail

More detail

What gets learned…..

In A Math-ier Notation N[*,k] N[d,k] M[w,k] N[*,*]=V

for each document d and word position j in d z[d,j] = k, a random topic N[d,k]++ W[w,k]++ where w = id of j-th word in d

for each document d and word position j in d z[d,j] = k, a new random topic update N, W to reflect the new assignment of z: N[d,k]++; N[d,k’] - - where k’ is old z[d,j] W[w,k]++; W[w,k’] - - where w is w[d,j] for each pass t=1,2,….

z=1 z=2 z=3 … … … … unit height random

JMLR 2009

Observation How much does the choice of z depend on the other z’s in the same document? – quite a lot How much does the choice of z depend on the other z’s in elsewhere in the corpus? – maybe not so much – depends on Pr(w|t) but that changes slowly Can we parallelize Gibbs and still get good results?

Question Can we parallelize Gibbs sampling? – formally, no: every choice of z depends on all the other z’s – Gibbs needs to be sequential just like SGD

What if you try and parallelize? Split document/term matrix randomly and distribute to p processors.. then run “Approximate Distributed LDA”

What if you try and parallelize? D=#docs W=#word(types) K=#topics N=words in corpus

z=1 z=2 z=3 … … … … unit height random

Running total of P(z=k|…) or P(z<=k)

Discussion…. Where do you spend your time? – sampling the z’s – each sampling step involves a loop over all topics – this seems wasteful even with many topics, words are often only assigned to a few different topics – low frequency words appear < K times … and there are lots and lots of them! – even frequent words are not in every topic

Discussion…. What’s the solution? Idea: come up with approximations to Z at each stage - then you might be able to stop early….. Want Z i >=Z

Tricks How do you compute and maintain the bound? – see the paper What order do you go in? – want to pick large P(k)’s first – … so we want large P(k|d) and P(k|w) – … so we maintain k’s in sorted order which only change a little bit after each flip, so a bubble-sort will fix up the almost-sorted array

Results

KDD 09

z=s+r+q

If U<s: lookup U on line segment with tic-marks at α 1 β/(βV + n.|1 ), α 2 β/(βV + n.|2 ), … If s<U<r: lookup U on line segment for r Only need to check t such that n t|d >0

z=s+r+q If U<s: lookup U on line segment with tic-marks at α 1 β/(βV + n.|1 ), α 2 β/(βV + n.|2 ), … If s<U<s+r: lookup U on line segment for r If s+r<U: lookup U on line segment for q Only need to check t such that n w|t >0

z=s+r+q Only need to check t such that n w|t >0 Only need to check t such that n t|d >0 Only need to check occasionally (< 10% of the time)

z=s+r+q Need to store n w|t for each word, topic pair …??? Only need to store n t|d for current d Only need to store (and maintain) total words per topic and α ’s, β,V Trick; count up n t|d for d when you start working on d and update incrementally

z=s+r+q Need to store n w|t for each word, topic pair …??? 1. Precompute, for each t, Most (>90%) of the time and space is here… 2. Quickly find t’s such that n w|t is large for w

Need to store n w|t for each word, topic pair …??? 1. Precompute, for each t, Most (>90%) of the time and space is here… 2. Quickly find t’s such that n w|t is large for w map w to an int array no larger than frequency w no larger than #topics encode (t,n) as a bit vector n in the high-order bits t in the low-order bits keep ints sorted in descending order