Fast sampling for LDA William Cohen

MORE LDA SPEEDUPS. FIRST: RECAP LDA DETAILS

Called "collapsed Gibbs sampling" since you've marginalized away some variables. (From: Parameter estimation for text analysis, Gregor Heinrich.) The sampling distribution is a product of two factors: the probability this word/term is assigned to topic k, and the probability this doc contains topic k.
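
As a reminder, a reconstruction of the standard collapsed Gibbs update (following Heinrich's notes; all counts exclude the token currently being resampled, V is the vocabulary size):

    P(z_i = k \mid z_{-i}, w) \;\propto\;
      \underbrace{\frac{n_{w_i|k} + \beta}{n_{\cdot|k} + V\beta}}_{\text{prob. this word/term is assigned to topic } k}
      \;\times\;
      \underbrace{\bigl(n_{k|d_i} + \alpha_k\bigr)}_{\text{prob. this doc contains topic } k}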

More detail

[Figure: a unit-height line segment divided into regions for z=1, z=2, z=3, …; a uniform random draw selects a region, i.e., a topic.]

SPEEDUP 1 - SPARSITY

KDD 2008

[Figure: the same unit-height line segment with regions for z=1, z=2, z=3, … and a uniform random draw.]

Sampling uses a running total of P(z=k | …), i.e., the cumulative P(z <= k).

Discussion: where do you spend your time?
– sampling the z's
– each sampling step involves a loop over all topics
– this seems wasteful: even with many topics, words are often only assigned to a few different topics
– low-frequency words appear < K times … and there are lots and lots of them!
– even frequent words are not in every topic

Discussion: what's the solution? Idea: come up with approximations Z_i to the normalizer Z at each stage; then you might be able to stop early, which is computationally like working with a sparser vector. We want Z_i >= Z.

Tricks
How do you compute and maintain the bound?
– see the paper
What order do you go in?
– want to pick large P(k)'s first
– … so we want large P(k|d) and P(k|w)
– … so we maintain the k's in sorted order, which only changes a little bit after each flip, so a bubble-sort will fix up the almost-sorted array (sketched below)
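
A tiny sketch of that fix-up step (a hypothetical helper, not the paper's actual data structure). Because only two topic counts change per flip, the list is almost sorted and the adjacent-swap passes terminate almost immediately:

    def restore_order(topics, score):
        # topics: topic ids, sorted by descending score before the update
        # score:  topic id -> current weight, e.g. proportional to P(k|d)*P(k|w)
        # After one Gibbs flip only two counts change, so the list is nearly
        # sorted; repeated adjacent-swap passes finish in one or two sweeps.
        swapped = True
        while swapped:
            swapped = False
            for i in range(len(topics) - 1):
                if score[topics[i]] < score[topics[i + 1]]:
                    topics[i], topics[i + 1] = topics[i + 1], topics[i]
                    swapped = True
        return topics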

Results

SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY

KDD 2009

Z = s + r + q (notation: t = topic (k), w = word, d = doc)

Z = s + r + q
If U < s: look up U on a line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), …
If s < U < s + r: look up U on the line segment for r; only need to check t such that n_{t|d} > 0.
(Notation: t = topic (k), w = word, d = doc.)

Z = s + r + q
If U < s: look up U on a line segment with tic-marks at α_1 β/(βV + n_{·|1}), α_2 β/(βV + n_{·|2}), …
If s < U < s + r: look up U on the line segment for r.
If s + r < U: look up U on the line segment for q; only need to check t such that n_{w|t} > 0.
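
For reference, a reconstruction of the decomposition behind these buckets (the SparseLDA split of Yao, Mimno & McCallum; notation as in the slides):

    P(z = t \mid \cdot) \;\propto\; \frac{(n_{t|d} + \alpha_t)(n_{w|t} + \beta)}{n_{\cdot|t} + V\beta}
      \;=\; \underbrace{\frac{\alpha_t \beta}{n_{\cdot|t} + V\beta}}_{\text{sums to } s}
      \;+\; \underbrace{\frac{n_{t|d}\,\beta}{n_{\cdot|t} + V\beta}}_{\text{sums to } r}
      \;+\; \underbrace{\frac{(n_{t|d} + \alpha_t)\,n_{w|t}}{n_{\cdot|t} + V\beta}}_{\text{sums to } q}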

Z = s + r + q
– for the q segment: only need to check t such that n_{w|t} > 0
– for the r segment: only need to check t such that n_{t|d} > 0
– for the s segment: only need to check occasionally (< 10% of the time)

Z = s + r + q
Need to store n_{w|t} for each word, topic pair …???
Only need to store n_{t|d} for the current d.
Only need to store (and maintain) the total words per topic and the α's, β, V.
Trick: count up n_{t|d} for d when you start working on d, and update it incrementally.
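
A minimal Python sketch of one draw under this decomposition, assuming the counts are kept in sparse dicts (names and structure are mine, for illustration; the real implementation caches s and r and updates them incrementally instead of recomputing them):

    import random

    def sample_topic(word, n_wt, n_td, n_t, alpha, beta, V):
        # n_wt: dict word -> {topic: count}, sparse word-topic counts
        # n_td: dict topic -> count for the current document, sparse
        # n_t:  list of total word counts per topic
        # alpha: per-topic hyperparameters; beta: symmetric word smoothing
        K = len(n_t)
        denom = [n_t[t] + V * beta for t in range(K)]
        s = sum(alpha[t] * beta / denom[t] for t in range(K))          # smoothing-only bucket
        r = sum(c * beta / denom[t] for t, c in n_td.items())          # doc-topic bucket
        q = sum((n_td.get(t, 0) + alpha[t]) * c / denom[t]
                for t, c in n_wt[word].items())                        # topic-word bucket
        U = random.random() * (s + r + q)
        if U < s:                                   # rare: walk all K topics
            for t in range(K):
                U -= alpha[t] * beta / denom[t]
                if U <= 0:
                    return t
        elif U < s + r:                             # only topics with n_td > 0
            U -= s
            for t, c in n_td.items():
                U -= c * beta / denom[t]
                if U <= 0:
                    return t
        else:                                       # only topics with n_wt > 0
            U -= s + r
            for t, c in n_wt[word].items():
                U -= (n_td.get(t, 0) + alpha[t]) * c / denom[t]
                if U <= 0:
                    return t
        return K - 1                                # guard against floating-point slack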

Z = s + r + q
Need to store n_{w|t} for each word, topic pair …???
1. Precompute, for each t, … (most (>90%) of the time and space is here)
2. Quickly find the t's such that n_{w|t} is large for w

Need to store n_{w|t} for each word, topic pair …???
1. Precompute, for each t, … (most (>90%) of the time and space is here)
2. Quickly find the t's such that n_{w|t} is large for w:
– map w to an int array, no larger than frequency(w) and no larger than #topics
– encode (t, n) as a bit vector: n in the high-order bits, t in the low-order bits (sketched below)
– keep the ints sorted in descending order
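
A small sketch of that packing trick (the bit widths here are my own assumption, not MALLET's actual layout):

    TOPIC_BITS = 10                      # enough for up to 1024 topics (assumed)
    TOPIC_MASK = (1 << TOPIC_BITS) - 1

    def encode(topic, count):
        # count in the high-order bits, topic in the low-order bits,
        # so sorting the packed ints descending sorts by count
        return (count << TOPIC_BITS) | topic

    def decode(packed):
        return packed & TOPIC_MASK, packed >> TOPIC_BITS

    # per-word array, kept sorted descending so the largest n_{w|t} come first
    entries = sorted((encode(t, n) for t, n in [(3, 17), (41, 2), (7, 9)]),
                     reverse=True)
    for packed in entries:
        t, n = decode(packed)            # iterate topics in decreasing-count order
        print(t, n)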

Outline
LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
  – The Mimno/McCallum decomposition (SparseLDA)
  – Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)

Alias tables
Basic problem: how can we sample from a biased coin (a discrete distribution over K outcomes) quickly? If the distribution changes slowly, maybe we can do some preprocessing and then sample from it multiple times.
Proof of concept: generate r ~ Uniform(0,1) and use a binary tree over the cumulative probabilities, e.g. r in (23/40, 7/10]. Preprocessing is O(K); each sample is O(log2 K).
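
A minimal sketch of that proof of concept: O(K) preprocessing into a cumulative table, then O(log2 K) per draw via binary search (names are mine):

    import bisect, itertools, random

    p = [0.1, 0.25, 0.05, 0.4, 0.2]                  # example biased distribution
    cdf = list(itertools.accumulate(p))              # O(K) preprocessing

    def draw():
        r = random.random()                          # r ~ Uniform(0,1)
        # first index whose cumulative probability covers r; O(log K)
        return min(bisect.bisect_left(cdf, r), len(p) - 1)

    samples = [draw() for _ in range(10)]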

Alias tables
Another idea: rejection sampling. Simulate a dart throw with two independently drawn values:
rx = int(u1 * K)
ry = u2 * p_max
Keep throwing until you hit a stripe (i.e., ry falls under the bar for rx).
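
A sketch of the dart-throwing idea; note that each throw uses two independent uniforms:

    import random

    p = [0.1, 0.25, 0.05, 0.4, 0.2]
    K, p_max = len(p), max(p)

    def draw_rejection():
        # throw darts at a K x p_max rectangle until one lands under a bar
        while True:
            rx = int(random.random() * K)        # which column
            ry = random.random() * p_max         # height within the rectangle
            if ry < p[rx]:                       # hit the stripe: accept
                return rx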

Alias tables An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit. You can always do this using only two colors in each column of the final alias table and the dart never misses! mathematically speaking…
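
A compact sketch of the standard construction (Vose's variant of Walker's 1977 method); this is textbook code, not the code from the KDD 2014 paper:

    import random

    def build_alias(p):
        # Build an alias table for probabilities p (summing to 1). O(K).
        K = len(p)
        prob = [x * K for x in p]             # rescale so the average is 1
        alias = [0] * K
        small = [i for i, x in enumerate(prob) if x < 1.0]
        large = [i for i, x in enumerate(prob) if x >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                      # column s's top stripe is "pasted" from l
            prob[l] -= 1.0 - prob[s]          # l donates the missing mass
            (small if prob[l] < 1.0 else large).append(l)
        for i in small + large:               # leftovers are full columns
            prob[i] = 1.0
        return prob, alias

    def draw_alias(prob, alias):
        # O(1) sample: pick a column, then one of its (at most) two colors
        i = int(random.random() * len(prob))
        return i if random.random() < prob[i] else alias[i]

    prob, alias = build_alias([0.1, 0.25, 0.05, 0.4, 0.2])
    sample = draw_alias(prob, alias)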

KDD 2014
Key ideas:
– use a variant of the Mimno/McCallum decomposition
– use alias tables to sample from the dense parts
– since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs

KDD 2014
q is the stale, easy-to-draw-from distribution; p is the updated distribution.
Computing the ratios p(i)/q(i) is cheap, and usually the ratio is close to one (else "the dart missed").
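
As a reminder of the Metropolis-Hastings step being used here (standard independence-proposal MH, not the paper's exact notation): a move from the current topic i to a proposed topic j drawn from q is accepted with probability

    \min\!\left(1,\; \frac{p(j)\,q(i)}{p(i)\,q(j)}\right)

so when p is close to q, almost every proposal is accepted.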

KDD 2014

SPEEDUP 3 - ONLINE LDA

Pilfered from… NIPS 2010: Online Learning for LDA, Matthew Hoffman, Francis Bach & David Blei

ASIDE: VARIATIONAL INFERENCE FOR LDA

[Figure: the batch variational inference algorithm; the per-document updates use γ, the topic updates use λ.]

BACK TO: SPEEDUP 3 - ONLINE LDA
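
A hedged reconstruction of the online update from Hoffman, Bach & Blei (λ is the topic-word variational parameter, λ̃ the estimate computed from the current mini-batch as if it were the whole corpus, and ρ_t a decreasing step size):

    \lambda \;\leftarrow\; (1 - \rho_t)\,\lambda + \rho_t\,\tilde{\lambda},
    \qquad \rho_t = (\tau_0 + t)^{-\kappa}, \quad \kappa \in (0.5, 1]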

SPEED ONLINE SPARSE LDA

Compute expectations over the z's any way you want….

Technical Details
Variational distrib: q(z_d), not q(z_d)!
Approximate using Gibbs: after sampling for a while, estimate: …
Estimate using time and "coherence": D(w) = # docs containing word w

[Results figure ("better" direction annotated).]

Summary of LDA speedup tricks
Gibbs sampler:
– O(N*K*T), and K grows with N
– need to keep the corpus (and z's) in memory
You can parallelize
– you need to keep only a slice of the corpus
– but you need to synchronize K multinomials over the vocabulary
– AllReduce helps
You can sparsify the sampling and topic-counts
– Mimno's trick greatly reduces memory
You can do the computation on-line
– only need to keep K multinomials and one document's worth of corpus and z's in memory
You can combine some of these methods
– online sparsified LDA
– parallel online sparsified LDA?

SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION

What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors … then run "Approximate Distributed LDA". This is a common subtask in parallel versions of: LDA, SGD, ….

Introduction
Common pattern:
– do some learning in parallel
– aggregate local changes from each processor into shared parameters
– distribute the new shared parameters back to each processor
– and repeat…
AllReduce is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme (see the sketch below).
[Figure: MAP vs. ALLREDUCE phases.]
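
A minimal sketch of this pattern using mpi4py (my choice of library for illustration; the lecture's actual system is VW's Hadoop-compatible AllReduce):

    # run with: mpirun -n 4 python allreduce_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # each worker holds local statistics for its slice of the corpus
    local_counts = np.random.rand(10)          # stand-in for local topic counts

    # aggregate local changes from every processor and distribute the result
    # back to all of them in one collective call
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)

    # every worker now has the same global_counts and repeats its local pass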

Gory details of VW Hadoop-AllReduce
Spanning-tree server:
– a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
Worker nodes ("fake" mappers):
– input for each worker is locally cached
– workers all connect to the spanning-tree server
– workers all execute the same code, which might contain AllReduce calls; workers synchronize whenever they reach an AllReduce

Hadoop AllReduce: don't wait for duplicate jobs.

Second-order method - like Newton’s method

2^24 features, ~100 non-zeros/example, 2.3B examples. Each example is a user/page/ad (and conjunctions of these), positive if there was a click-through on the ad.

50M examples; an explicitly constructed kernel gives 11.7M features, 3,300 nonzeros/example. Old method: SVM, 3 days. (Reporting the time to reach a fixed test error.)