10-405: LDA Lecture 2

Recap: The LDA Topic Model

Unsupervised NB vs LDA. Unsupervised Naive Bayes has one class prior and one class label Y per document; LDA has a different class (topic) distribution θd for each document and one topic Zdi per word Wdi. [Plate diagrams for the two models.]

LDA topics: top words w by Pr(w|Z=k)

LDA's view of a document: a mixed membership model.
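To make the mixed membership view concrete, here is a minimal sketch of LDA's generative story; the corpus sizes and hyperparameters are made-up values, and the function is illustrative rather than the lecture's code.

```python
import numpy as np

def generate_lda_corpus(D=100, V=1000, K=10, doc_len=50, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)     # K topic-word distributions (the γk's)
    docs, thetas = [], []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))      # per-document topic proportions θd
        z = rng.choice(K, size=doc_len, p=theta)      # one topic Zdi per word position
        w = np.array([rng.choice(V, p=phi[t]) for t in z])
        docs.append(w)
        thetas.append(theta)
    return docs, thetas, phi
```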

LDA and (Collapsed) Gibbs Sampling. Gibbs sampling works for any directed model! It is applicable when the joint distribution is hard to evaluate but the conditional distributions are known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

Recap: Collapsed Sampling for LDA. Only sample the Z's (θd and γk are integrated out). Pr(Zdi = t | everything else) is proportional to the "fraction" of the time Z = t in doc d times the fraction of the time W = w in topic t; the first factor plays the role of Pr(Z|E+), the second of Pr(E-|Z). This ignores a detail: the counts should not include the Zdi currently being sampled.
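The following is a minimal Python sketch of that update, not the lecture's own code; the count arrays (n_td, n_wt, n_t) and scalar hyperparameters alpha and beta are assumptions about how the state is stored.

```python
import numpy as np

def collapsed_gibbs_pass(docs, z, n_td, n_wt, n_t, alpha, beta, V, rng):
    """One sweep of collapsed Gibbs sampling for LDA (illustrative sketch).
    docs[d]: array of word ids; z[d]: current topic assignments for doc d;
    n_td[d, t], n_wt[w, t], n_t[t]: doc-topic, word-topic, and topic totals."""
    K = n_t.shape[0]
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            t_old = z[d][i]
            # remove the current assignment from the counts before sampling
            n_td[d, t_old] -= 1; n_wt[w, t_old] -= 1; n_t[t_old] -= 1
            # Pr(Z=t | everything else), up to a constant:
            # ("fraction" of doc d in topic t) * (fraction of topic t that is word w)
            p = (n_td[d] + alpha) * (n_wt[w] + beta) / (n_t + beta * V)
            t_new = rng.choice(K, p=p / p.sum())
            z[d][i] = t_new
            n_td[d, t_new] += 1; n_wt[w, t_new] += 1; n_t[t_new] += 1
    return z
```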

Speeding up LDA: (1) parallelize it; (2) speed up sampling.

Parallel LDA

JMLR 2009

Observation. How much does the choice of z depend on the other z's in the same document? Quite a lot. How much does the choice of z depend on the other z's elsewhere in the corpus? Maybe not so much: it depends on Pr(w|t), but that changes slowly. Can we parallelize Gibbs and still get good results? Formally, no: every choice of z depends on all the other z's, so Gibbs needs to be sequential, just like SGD.

What if you try to parallelize? Split the document/term matrix randomly and distribute it to p processors, then run "Approximate Distributed LDA": during each pass, let the local counters diverge. This is iterative parameter mixing.
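A minimal sketch of one such pass, under the assumption that the only shared state is the word-topic count matrix; gibbs_sweep is a hypothetical helper standing in for the local collapsed Gibbs sweep.

```python
import numpy as np

def ad_lda_pass(shards, global_nwt, gibbs_sweep):
    """One pass of Approximate Distributed LDA (illustrative sketch).
    shards[i]: the documents assigned to processor i;
    gibbs_sweep(docs, nwt): hypothetical helper that runs a local Gibbs sweep,
    mutating its own copy of the word-topic counts nwt."""
    p = len(shards)
    local = [global_nwt.copy() for _ in range(p)]   # each processor starts from the shared counts
    for i in range(p):                              # in reality these sweeps run in parallel
        gibbs_sweep(shards[i], local[i])            # local counters diverge during the sweep
    # all-reduce: fold the divergent local updates back into the global counts
    global_nwt += sum(local[i] - global_nwt for i in range(p))
    return global_nwt
```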

What if you try to parallelize? Per-pass All-Reduce cost, where D = #docs, W = #word types, K = #topics, N = #words in the corpus. [Table of per-pass communication costs.]

perplexity – NIPS dataset

To compare runs, match topics by similarity, not by topic id.

Speeding up LDA: (1) parallelize it; (2) speed up sampling.

Recap, in more detail: the sampler is linear in corpus size and in #topics.

Recap: each iteration is linear in corpus size; each resampling step is linear in #topics; most of the time goes to resampling.

Recap: you spend a lot of time sampling. There is a loop over all topics here in the sampler: picture a unit-height line segment divided into a region for each z = 1, 2, 3, …, and a random point drawn uniformly along it.
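Concretely, the naive per-token draw looks something like this sketch: compute K unnormalized weights, pick a uniform point on the segment, and walk until the cumulative sum passes it, which costs O(K) per token.

```python
import numpy as np

def sample_topic_naive(weights, rng):
    """O(K) draw from unnormalized topic weights by walking the line segment."""
    u = rng.random() * weights.sum()   # uniform point on the [0, total) segment
    acc = 0.0
    for t, w in enumerate(weights):    # the loop over all K topics
        acc += w
        if u < acc:
            return t
    return len(weights) - 1            # guard against floating-point round-off
```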

KDD 09

[Diagram: the total probability mass for a draw is split into three buckets of height s, r, and q; the normalizer is s + r + q.]

Draw a random U from uniform[0, s+r+q] (the normalizer is s + r + q). If U < s: look up U on the s line segment, with tick-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …

If s < U < s+r: look up U on the line segment for r. Here you only need to check the topics t such that nt|d > 0.

If s+r < U: look up U on the line segment for q. Here you only need to check the topics t such that nw|t > 0.

Summing up: the s bucket only needs to be checked occasionally (< 10% of the time), the r bucket only involves topics t with nt|d > 0, and the q bucket only topics t with nw|t > 0.
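A sketch of the bucketed draw for one token, with names that are my own; it uses dense length-K vectors for clarity, whereas a real implementation only touches the nonzero entries of r and q.

```python
import numpy as np

def sparse_sample(w, n_td_d, n_wt_w, n_t, alpha, beta, V, rng):
    """Bucketed (SparseLDA-style) sampling for one token (illustrative sketch).
    n_td_d: topic counts for the current doc; n_wt_w: topic counts for word w;
    alpha may be a scalar or a length-K vector."""
    denom = beta * V + n_t                          # length-K vector of topic normalizers
    s_terms = alpha * beta / denom                  # smoothing bucket (hit rarely)
    r_terms = beta * n_td_d / denom                 # doc bucket: nonzero only where n_td_d > 0
    q_terms = (alpha + n_td_d) * n_wt_w / denom     # word bucket: nonzero only where n_wt_w > 0
    s_cum, r_cum, q_cum = np.cumsum(s_terms), np.cumsum(r_terms), np.cumsum(q_terms)
    s, r, q = s_cum[-1], r_cum[-1], q_cum[-1]
    u = rng.random() * (s + r + q)
    if u < s:
        return int(np.searchsorted(s_cum, u))       # walk the s tick-marks
    u -= s
    if u < r:
        return int(np.searchsorted(r_cum, u))       # only topics with n_td_d > 0 matter
    u -= r
    return min(int(np.searchsorted(q_cum, u)), len(q_cum) - 1)
```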

Storage: for s you only need to store (and maintain) the total words per topic, plus the α's, β, and V. For r there is a trick: count up nt|d for doc d when you start working on d and update it incrementally, so you only ever store nt|d for the current doc. For q, though, you need to store nw|t for each word, topic pair …???

So the plan is: 1. maintain, for the current doc d and each topic t, the counts nt|d incrementally; 2. quickly find the topics t such that nw|t is large for word w. Storing nw|t for each word, topic pair is where most (> 90%) of the time and space goes.

Topic distributions are skewed! [Plot: topic proportions learned by LDA on a typical NIPS paper.]

2. Quickly find the topics t such that nw|t is large for word w; this is where most (> 90%) of the time and space is. Here's how Mimno did it: associate each word w with an int array, no larger than the frequency of w and no larger than #topics; encode each (t, n) pair as a bit vector, with n in the high-order bits and t in the low-order bits; keep the ints sorted in descending order.
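A sketch of that packed encoding (the bit widths are assumptions for illustration): because the count sits in the high-order bits, sorting the raw ints in descending order is the same as sorting topics by nw|t.

```python
TOPIC_BITS = 10                      # enough for K <= 1024 topics (an assumption for this sketch)
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(topic, count):
    """Pack (topic, count) into one int: count in the high bits, topic in the low bits."""
    return (count << TOPIC_BITS) | topic

def decode(packed):
    """Recover (topic, count) from a packed int."""
    return packed & TOPIC_MASK, packed >> TOPIC_BITS

def topic_array_for_word(counts_for_w):
    """counts_for_w: dict mapping topic -> nw|t for one word w. The result has one
    int per topic with a nonzero count, sorted so the largest counts come first."""
    return sorted((encode(t, n) for t, n in counts_for_w.items() if n > 0), reverse=True)
```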

NIPS dataset

Other Fast Samplers for LDA

Alias tables. Basic problem: how can we sample from a biased die quickly? Naively this is O(K).

Alias tables. Another idea: simulate the dart with two drawn values, rx ← int(u1*K) and ry ← u2*pmax, and keep throwing until you hit a stripe. http://www.keithschwarz.com/darts-dice-coins/

Alias tables. An even more clever idea: minimize the brown space (where the dart "misses") by sizing the rectangle's height to the average probability rather than the maximum probability, and cutting and pasting a bit. Mathematically speaking, you can always do this using only two colors in each column of the final alias table, and the dart never misses! http://www.keithschwarz.com/darts-dice-coins/
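Below is a sketch of one standard way to do this (Vose's variant of the alias method); the function names are mine. Construction is O(K), and each subsequent draw costs O(1): one column choice plus one coin flip between the column's two colors.

```python
import numpy as np

def build_alias_table(p):
    """Vose's alias method: O(K) construction for O(1) sampling from a discrete dist p."""
    K = len(p)
    prob = np.array(p, dtype=float) * K / np.sum(p)   # scale so the average column height is 1
    alias = np.zeros(K, dtype=int)
    small = [i for i in range(K) if prob[i] < 1.0]
    large = [i for i in range(K) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                                  # fill column s's empty space with color l
        prob[l] -= 1.0 - prob[s]                      # l gave away some of its mass
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias, rng):
    """Two random numbers, no loop over K: pick a column, then one of its two colors."""
    k = rng.integers(len(prob))
    return k if rng.random() < prob[k] else alias[k]
```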

LDA with Alias Sampling [KDD 2014]: sample the Z's with an alias sampler, but don't rebuild the sampler after every flip; instead, correct for the "staleness" with the Metropolis-Hastings algorithm.
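A hedged sketch of that correction, with placeholder callables rather than the paper's data structures: propose() draws from the stale alias table, p_true gives the exact (unnormalized) conditional, and q_stale gives the (unnormalized) stale proposal mass; the usual MH acceptance ratio keeps the chain targeting the exact conditional.

```python
def mh_step(z_curr, p_true, q_stale, propose, rng):
    """One Metropolis-Hastings step using a stale proposal (illustrative sketch).
    propose(): sample a topic from the stale alias table;
    p_true(t), q_stale(t): unnormalized true conditional and stale proposal masses."""
    z_prop = propose()
    accept = min(1.0, (p_true(z_prop) * q_stale(z_curr)) /
                      (p_true(z_curr) * q_stale(z_prop)))
    return z_prop if rng.random() < accept else z_curr
```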

Yet More Fast Samplers for LDA

WWW 2015

Fenwick Tree (1994). Basic problem: how can we sample from a biased die quickly, and also update the weights quickly? Naive sampling is O(K); with a binary tree of partial sums, both operations are O(log2 K). [Diagram: binary search down the tree, e.g. locating r in (23/40, 7/10].] http://www.keithschwarz.com/darts-dice-coins/
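Here is a generic sketch of the idea (a complete binary tree of subtree sums, not necessarily the exact Fenwick/F+ layout the paper uses): updating one weight and drawing a sample are both O(log K).

```python
import numpy as np

class SamplingTree:
    """Complete binary tree over K weights: update and sample are both O(log K)."""
    def __init__(self, weights):
        self.K = len(weights)
        self.tree = np.zeros(2 * self.K)
        self.tree[self.K:] = weights                  # leaves hold the weights
        for i in range(self.K - 1, 0, -1):            # internal nodes hold subtree sums
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, k, new_weight):
        i = self.K + k
        delta = new_weight - self.tree[i]
        while i:                                      # propagate the change up to the root
            self.tree[i] += delta
            i //= 2

    def sample(self, rng):
        u = rng.random() * self.tree[1]               # tree[1] holds the total mass
        i = 1
        while i < self.K:                             # descend: go left unless u passes it
            if u < self.tree[2 * i]:
                i = 2 * i
            else:
                u -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.K
```

For instance, after tree = SamplingTree([3.0, 1.0, 4.0]), tree.sample(rng) returns index 2 with probability 4/8, and tree.update(1, 5.0) changes that to 4/12.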

Data structures and algorithms LSearch: linear search

Data structures and algorithms. BSearch: binary search over stored cumulative probabilities.

Data structures and algorithms. Alias sampling…

Data structures and algorithms F+ tree

Data structures and algorithms. Fenwick tree plus binary search. βq: dense, changes slowly, and is re-used for each word in a document. r: sparse, and a different one is needed for each unique term in the doc. The sampler is: Pr(z=t) ∝ β·qt + rt, where qt = (αt + nt|d)/(βV + n.|t) and rt = nw|t·qt.
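A sketch of one draw from that mixture, with the dense part handled by a log-time tree sampler and the sparse part by a linear scan; sample_q, q_total, r_topics, and r_vals are illustrative names, not the paper's.

```python
import numpy as np

def fplus_sample(sample_q, q_total, beta, r_topics, r_vals, rng):
    """One draw from Pr(t) proportional to beta*q_t + r_t (illustrative sketch).
    sample_q(): O(log K) draw from the dense part q (e.g. the tree sampler above);
    r_topics, r_vals: the sparse nonzero entries of r for the current word."""
    q_mass = beta * q_total
    r_mass = float(np.sum(r_vals))
    u = rng.random() * (q_mass + r_mass)
    if u < q_mass:
        return sample_q()                # dense bucket: descend the tree
    u -= q_mass
    acc = 0.0
    for t, v in zip(r_topics, r_vals):   # sparse bucket: scan the few nonzero r entries
        acc += v
        if u < acc:
            return t
    return r_topics[-1]                  # guard against floating-point round-off
```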

Speedup vs std LDA sampler (1024 topics)

Speedup vs std LDA sampler (10k-50k topics)

WWW 2015. The paper also describes some nice ways to parallelize this operation, similar to the distributed MF algorithm we discussed.

Multi-core NOMAD method

Speeding up LDA: (1) parallelize it; (2) speed up sampling.

Speeding up LDA-like models: parallelize it; speed up sampling; use these tricks for other models…

Network datasets: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer.

How do you model such graphs? "Stochastic block model", aka "block-stochastic matrix": draw ni nodes in block i; with probability pij, connect pairs (u,v) where u is in block i and v is in block j. Special, simple case: pii = qi, and pij = s for all i ≠ j. Question: can you fit this model to a graph, i.e. find each pij and the latent node-to-block mapping?
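As a reminder of the generative side only (fitting is the hard part), here is a minimal sketch; the block sizes and edge probabilities at the bottom are made-up numbers for illustration.

```python
import numpy as np

def sample_sbm(block_sizes, P, seed=0):
    """Sample an undirected graph from a stochastic block model (illustrative sketch).
    block_sizes[i] = n_i nodes in block i; P[i, j] = probability of an edge
    between a node in block i and a node in block j."""
    rng = np.random.default_rng(seed)
    blocks = np.repeat(np.arange(len(block_sizes)), block_sizes)  # node -> block id
    n = len(blocks)
    A = np.zeros((n, n), dtype=int)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < P[blocks[u], blocks[v]]:
                A[u, v] = A[v, u] = 1
    return A, blocks

# the special, simple case from the slide: p_ii = q_i on the diagonal, s off-diagonal
q, s = np.array([0.3, 0.25, 0.2]), 0.02
P = np.full((3, 3), s)
np.fill_diagonal(P, q)
A, blocks = sample_sbm([40, 40, 40], P)
```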

[Example graph: the football network.]

[Example graphs: the books network; an artificial graph.]

[Example: an artificial graph.]

A mixed membership stochastic block model

Stochastic block models: Airoldi, Blei, Fienberg, Xing, JMLR 2008. [Plate diagram of the mixed membership stochastic block model.]

Another mixed membership block model: Parkkinen, Sinkkonen, Gyenge, Kaski, MLG 2009.

Another mixed membership block model. Pick two multinomials over nodes. For each edge in the graph: pick z = (zi, zj), a pair of block ids; pick node i based on Pr(·|zi); pick node j based on Pr(·|zj).
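A minimal generative sketch under one reading of that slide, in which each block id has its own multinomial over nodes; the Dirichlet hyperparameters alpha and gamma and all sizes are illustrative assumptions.

```python
import numpy as np

def sample_edge_block_model(n_nodes, K, n_edges, alpha=1.0, gamma=1.0, seed=0):
    """Sample edges from an edge-centric mixed membership block model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    node_dist = rng.dirichlet(np.full(n_nodes, gamma), size=K)   # one multinomial over nodes per block
    pair_dist = rng.dirichlet(np.full(K * K, alpha))             # distribution over block pairs z = (zi, zj)
    edges = []
    for _ in range(n_edges):
        z = rng.choice(K * K, p=pair_dist)
        zi, zj = divmod(z, K)                                    # pick a pair of block ids
        i = rng.choice(n_nodes, p=node_dist[zi])                 # pick node i based on Pr(.|zi)
        j = rng.choice(n_nodes, p=node_dist[zj])                 # pick node j based on Pr(.|zj)
        edges.append((i, j))
    return edges
```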

Experiments (for next week): Balasubramanyan, Lin, Cohen, NIPS workshop 2010.