Speeding up LDA

The LDA Topic Model

Unsupervised NB vs LDA. [Side-by-side plate diagrams.] Naïve Bayes: one class prior π and one class label Y per document, with the document's Nd words W drawn from that class's word distribution γ. LDA: a different class distribution θd for each document and one topic Zdi per word Wdi, with Dirichlet priors α and β over the K topic distributions γk.

LDA's view of a document: a mixed membership model.

LDA topics: top words w by Pr(w|Z=k)

Parallel LDA

Newman, Asuncion, Smyth & Welling, "Distributed Algorithms for Topic Models," JMLR 2009.

Observation. How much does the choice of z depend on the other z's in the same document? Quite a lot. How much does the choice of z depend on the z's elsewhere in the corpus? Maybe not so much: that dependence comes through Pr(w|t), which changes slowly. So: can we parallelize Gibbs and still get good results?

Question. Can we parallelize Gibbs sampling? Formally, no: every choice of z depends on all the other z's, so Gibbs needs to be sequential, just like SGD.

What if you try and parallelize? Split the document/term matrix randomly and distribute it to p processors… then run "Approximate Distributed LDA".

What if you try and parallelize? Notation: D = #docs, W = #word types, K = #topics, N = #words in the corpus.
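To make the approximation concrete, here is a minimal Python sketch of one synchronous AD-LDA round; local_gibbs_sweep is a hypothetical stand-in for an ordinary collapsed Gibbs sweep over one document shard:

```python
import numpy as np

def ad_lda_round(doc_shards, n_wt_global, local_gibbs_sweep):
    """One synchronous AD-LDA round over p document shards (a sketch).

    n_wt_global: (W, K) global word-topic count matrix.
    local_gibbs_sweep(shard, counts): runs a Gibbs sweep on one shard
        against a stale snapshot of the counts and returns its updated copy.
    """
    updated = []
    for shard in doc_shards:          # in practice: p processors, in parallel
        local = n_wt_global.copy()    # each worker works from a stale snapshot
        updated.append(local_gibbs_sweep(shard, local))
    # Merge by summing each worker's *changes*, not its absolute counts,
    # so concurrent updates to the same (word, topic) cell are all kept.
    delta = sum(c - n_wt_global for c in updated)
    return n_wt_global + delta
```

The merged counts are only approximately what sequential Gibbs would have produced, which is exactly the "approximate" in Approximate Distributed LDA.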

Update c. 2014. Algorithms: distributed variational EM; Asynchronous LDA (AS-LDA); Approximate Distributed LDA (AD-LDA); ensemble versions of LDA (HLDA, DCM-LDA). Implementations (on GitHub): Yahoo_LDA, which is not Hadoop but special-purpose communication code for synchronizing the global counts (Alex Smola, Yahoo → CMU); Mahout LDA (Andy Schlaikjer, CMU → Twitter).

Faster Sampling for LDA

RECAP: way more detail on the LDA Gibbs sampler. [A sequence of figure-only recap slides.]

RECAP: you spend a lot of time sampling. [Figure: a unit-height line segment divided into regions for z=1, z=2, z=3, …; a uniform random draw picks the topic.] There's a loop over all topics here in the sampler.

Yao, Mimno & McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections," KDD 2009 (the SparseLDA sampler).

z = s + r + q, where
s = Σt αtβ / (βV + n.|t) (the smoothing-only mass),
r = Σt nt|d·β / (βV + n.|t) (the document-topic mass),
q = Σt (αt + nt|d)·nw|t / (βV + n.|t) (the topic-word mass).

z = s + r + q. To sample, draw U ~ uniform(0, z):
If U < s: look up U on a line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
If s < U < s+r: look up U on the line segment for r; only need to check t such that nt|d > 0
If s+r < U: look up U on the line segment for q; only need to check t such that nw|t > 0

z = s + r + q, bucket by bucket:
s: only needs checking occasionally (U lands here < 10% of the time)
r: only need to check t such that nt|d > 0
q: only need to check t such that nw|t > 0
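A dense-for-clarity Python sketch of the three-bucket draw (a real SparseLDA sampler iterates only over the nonzero nt|d and nw|t entries; all names here are illustrative):

```python
import numpy as np

def sparse_lda_draw(alpha, beta, V, n_t, n_td, n_wt, rng):
    """One s/r/q bucket draw for the current word (dense for clarity).

    alpha[t]: topic prior; n_t[t] = n_.|t (total words in topic t);
    n_td[t] = n_t|d (topic counts in this doc, sparse in practice);
    n_wt[t] = n_w|t (counts of this word in topic t, sparse in practice).
    """
    denom = beta * V + n_t
    s = alpha * beta / denom            # smoothing-only bucket
    r = n_td * beta / denom             # document-topic bucket (n_td > 0 only)
    q = (alpha + n_td) * n_wt / denom   # topic-word bucket (n_wt > 0 only)
    u = rng.uniform(0.0, s.sum() + r.sum() + q.sum())
    if u < s.sum():
        return int(np.searchsorted(np.cumsum(s), u))
    u -= s.sum()
    if u < r.sum():
        return int(np.searchsorted(np.cumsum(r), u))
    u -= r.sum()
    return int(np.searchsorted(np.cumsum(q), u))
```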

z = s + r + q. What must be stored:
for s: only the total words per topic (and the α's, β, V)
for r: only nt|d for the current d. Trick: count up nt|d when you start working on d, then update it incrementally
for q: nw|t for each word–topic pair …???

z = s + r + q. Most (>90%) of the time and space is in q: we need to store nw|t for each word–topic pair. Two ideas: 1. Precompute, for each t, the coefficient (αt + nt|d)/(βV + n.|t). 2. Quickly find the t's such that nw|t is large for w.

To quickly find the t's such that nw|t is large for w: map w to an int array no larger than the frequency of w and no larger than #topics; encode each (t, n) pair as a bit vector, with n in the high-order bits and t in the low-order bits; keep the ints sorted in descending order, so the topics with the largest counts come first.
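A small Python sketch of that packed encoding; TOPIC_BITS is an assumed width, chosen so every topic id fits in the low-order bits:

```python
TOPIC_BITS = 10                    # enough for K <= 1024 topics (an assumption)
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(topic, count):
    return (count << TOPIC_BITS) | topic

def decode(packed):
    return packed & TOPIC_MASK, packed >> TOPIC_BITS

# Sorting the packed ints in descending order sorts by count first
# (high-order bits), so the topics with the largest n_w|t for a word
# come first and the q bucket can stop scanning early.
row = sorted((encode(t, n) for t, n in [(3, 17), (9, 2), (0, 5)]), reverse=True)
assert [decode(x) for x in row] == [(3, 17), (0, 5), (9, 2)]
```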

Other Fast Samplers for LDA

Alias tables. Basic problem: how can we sample from a multinomial quickly? Naïve sampling is O(K). If the distribution changes slowly, maybe we can do some preprocessing and then sample many times cheaply. Proof of concept: generate r ~ uniform and walk down a binary tree over the cumulative probabilities (e.g., r in (23/40, 7/10] selects one leaf): O(log2 K) per sample. http://www.keithschwarz.com/darts-dice-coins/

Alias tables. Another idea: throw a dart at a K-column rectangle whose height is pmax. Simulate the dart with two drawn values: rx ← int(u1·K) picks a column, ry ← u2·pmax picks a height; keep throwing until you hit a stripe (rejection sampling).

Alias tables An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit. You can always do this using only two colors in each column of the final alias table and the dart never misses! mathematically speaking… http://www.keithschwarz.com/darts-dice-coins/
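For reference, a standard Python sketch of that construction (Vose's alias method, following the darts-dice-coins write-up linked above): O(K) to build, O(1) per sample. It assumes probs sums to 1.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: each column holds at most two outcomes
    ("two colors"), so the dart never misses."""
    K = len(probs)
    scaled = [p * K for p in probs]
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l     # cut-and-paste l's mass into s's column
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                  # leftovers are 1.0 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    i = rng.randrange(len(prob))                         # pick a column
    return i if rng.random() < prob[i] else alias[i]     # pick its color
```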

Alias sampling for LDA. You can sample quickly… but you need to regenerate the alias table after every topic switch. Workaround: sample from an old, stale alias table; update it only periodically; and compensate for the stale parameters by replacing Gibbs sampling with Metropolis-Hastings (MH).

Alias sampling for LDA. The MH sampler: the proposal distribution q is defined by the stale alias sampler, so we can sample from q quickly and easily; the target p is based on the actual counts, and we can evaluate p(i) easily. Here i and j are vectors of topic assignments.
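A per-token Python sketch of that correction, with hypothetical p, q_sample, and q_prob callables standing in for the live counts and the stale alias table (the real samplers work with full assignment vectors, as the slide notes):

```python
def mh_step(z_current, p, q_sample, q_prob, rng):
    """One MH step with an independence proposal from a stale table.

    p(t): unnormalized true probability of topic t from the live counts.
    q_sample(): draws a topic from the stale alias table.
    q_prob(t): the stale proposal probability of topic t.
    """
    z_new = q_sample()
    # Standard MH acceptance for an independence sampler: the correction
    # cancels out whatever bias the stale proposal introduced.
    accept = min(1.0, (p(z_new) * q_prob(z_current)) /
                      (p(z_current) * q_prob(z_new)))
    return z_new if rng.random() < accept else z_current
```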

Yet More Fast Samplers for LDA

Fenwick Tree (1994). Same basic problem: how can we sample from a multinomial quickly, when naïve sampling is O(K)? As before, generate r ~ uniform and walk a binary tree over the cumulative probabilities (e.g., r in (23/40, 7/10]) in O(log2 K); a Fenwick tree additionally supports O(log2 K) updates when the distribution changes. http://www.keithschwarz.com/darts-dice-coins/

Data structures and algorithms LSearch: linear search

Data structures and algorithms. BSearch: binary search over a stored cumulative-probability array.
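A tiny Python sketch of that, assuming a static distribution: precompute the cumulative array once, then each draw is a bisect; the catch is that any count update forces an O(K) rebuild.

```python
import bisect
import random

def make_cdf(probs):
    """Precompute cumulative probabilities once, O(K)."""
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)
    return cdf

def bsearch_sample(cdf, rng=random):
    """Each draw is a binary search, O(log2 K)."""
    return bisect.bisect_left(cdf, rng.random() * cdf[-1])
```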

Data structures and algorithms Alias sampling…..

Data structures and algorithms Fenwick tree

Data structures and algorithms. Fenwick tree + binary search. The per-word sampling distribution splits into two parts: βq, which is dense, changes slowly, and is re-used for each word in a document; and r, which is sparse, with a different one needed for each unique term in the doc. [The combined sampler was given as a formula on the slide.]
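A minimal Python sketch of a Fenwick (binary indexed) tree used this way; class and method names are illustrative:

```python
class FenwickSampler:
    """Fenwick tree over K topic weights: O(log2 K) to change one
    weight and O(log2 K) to invert a uniform draw."""
    def __init__(self, K):
        self.K = K
        self.tree = [0.0] * (K + 1)        # 1-indexed internally

    def add(self, t, delta):               # weight of topic t += delta
        i = t + 1
        while i <= self.K:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        s, i = 0.0, self.K
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def sample(self, u):                   # u ~ uniform(0, total())
        pos, bit = 0, 1 << self.K.bit_length()
        while bit:                         # binary-lift to the smallest
            nxt = pos + bit                # prefix whose sum exceeds u
            if nxt <= self.K and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos                         # 0-indexed topic id
```

Draw a topic with sampler.sample(rng.uniform(0, sampler.total())); when a count changes, call add with the weight delta instead of rebuilding anything.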

Speedup vs std LDA sampler (1024 topics)

Speedup vs std LDA sampler (10k-50k topics)

And Parallelism….

Second idea: you can sample document-by-document or word-by-word… or use a matrix-factorization-style (MF-like) approach to distributing the data.

Multi-core NOMAD method

On Beyond LDA

Network datasets: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer.

Motivation. Social graphs seem to have some aspects of randomness (small diameter, giant connected components, …) and some structure (homophily, maybe a scale-free degree distribution?). How do you model that?

More terms. A "stochastic block model", aka "block-stochastic matrix": draw ni nodes in block i; with probability pij, connect pairs (u, v) where u is in block i and v is in block j. Special, simple case: pii = qi, and pij = s for all i ≠ j. Question: can you fit this model to a graph, i.e., find each pij and the latent node-to-block mapping?
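The fitting question is the hard part; for intuition, here is a tiny Python sketch of just the generative side of the special case above (edge probability qi within block i, s across blocks):

```python
import random

def sample_block_graph(block_sizes, q, s, rng=random):
    """Draw an undirected graph from the simple block model:
    P(edge) = q[i] within block i, s between different blocks."""
    block_of = []
    for i, n_i in enumerate(block_sizes):
        block_of += [i] * n_i              # node -> block assignment
    n = len(block_of)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):          # each unordered pair once
            p = q[block_of[u]] if block_of[u] == block_of[v] else s
            if rng.random() < p:
                edges.append((u, v))
    return edges
```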

[Figure-only slides: block-model fits ("Not?") on the football and books networks, and on an artificial graph.]

Stochastic block models assume that 1) nodes within a block z and 2) edges between blocks zp, zq are exchangeable. [Plate diagram with priors a, b; block assignments zp, zq; edge probabilities apq; plates of size N and N2.]

Another mixed membership block model

Another mixed membership block model. Notation: z = (zi, zj) is a pair of block ids; nz = #pairs z; qz1,i = #links to i from block z1; qz1,. = #outlinks in block z1; δ = indicator for the diagonal; M = #nodes.

Experiments. Balasubramanyan, Lin & Cohen, NIPS workshop, 2010.