
1 10-405: LDA Lecture 2

2 Recap: The LDA Topic Model

3 Unsupervised NB vs LDA
Unsupervised NB: one class prior π and one class label Y per document. LDA: a different class (topic) distribution θd for each document and one topic Zdi per word Wdi. [Plate diagrams for both models: hyperparameters α, β; topic-word distributions γk; K topics, D documents, Nd words per document.]

4 LDA topics: top words w by Pr(w|Z=k)

5 LDA’s view of a document
Mixed membership model

6 LDA and (Collapsed) Gibbs Sampling
Gibbs sampling works for any directed model! It is applicable when the joint distribution is hard to evaluate but the conditional distributions are known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
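A minimal generic sketch of this loop (not from the slides; `sample_conditional` is a placeholder for whatever conditional draw the model defines):

```python
def gibbs_sample(variables, sample_conditional, state, n_iters=1000):
    """Generic Gibbs sampler: repeatedly resample one variable at a time
    from its conditional distribution given all the others.
    sample_conditional(v, state) is assumed to return a draw of variable v
    given the current values of every other variable."""
    samples = []
    for _ in range(n_iters):
        for v in variables:
            state[v] = sample_conditional(v, state)
        # the sequence of states forms a Markov chain whose stationary
        # distribution is the joint distribution over the variables
        samples.append(dict(state))
    return samples
```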

7 Recap: Collapsed Sampling for LDA
Only sample the Z's (θ and γ are collapsed out). Pr(Zdi = t | everything else) ∝ Pr(Z | E+) · Pr(E− | Z), i.e. proportional to the "fraction" of time Z = t in doc d times the fraction of time W = Wdi in topic t. This ignores a detail: the counts should not include the Zdi currently being sampled. [Plate diagram: θd, γk, Zdi, Wdi with hyperparameters α, β.]
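A rough sketch of that collapsed update for one token, assuming count arrays n_td (topics per doc), n_wt (words per topic), and n_t (total words per topic); the array names are illustrative:

```python
import numpy as np

def resample_z(d, w, old_t, n_td, n_wt, n_t, alpha, beta, V):
    """Collapsed Gibbs update for a single token (doc d, word w).
    Counts exclude the token being resampled, as noted on the slide."""
    n_td[d, old_t] -= 1
    n_wt[w, old_t] -= 1
    n_t[old_t] -= 1
    # Pr(z = t | rest) ∝ (n_td + alpha) * (n_wt + beta) / (n_t + beta*V)
    p = (n_td[d] + alpha) * (n_wt[w] + beta) / (n_t + beta * V)
    new_t = np.random.choice(len(p), p=p / p.sum())
    n_td[d, new_t] += 1
    n_wt[w, new_t] += 1
    n_t[new_t] += 1
    return new_t
```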

8 Speeding up LDA
Parallelize it. Speed up sampling.

9 Parallel LDA

10 JMLR 2009

11 Observation
How much does the choice of z depend on the other z's in the same document? Quite a lot. How much does the choice of z depend on the z's elsewhere in the corpus? Maybe not so much: that dependence is through Pr(w|t), which changes slowly. Can we parallelize Gibbs and still get good results? Formally, no: every choice of z depends on all the other z's, so Gibbs needs to be sequential, just like SGD.

12 What if you try and parallelize?
Split the document/term matrix randomly and distribute it to p processors, then run Gibbs sampling on each shard: "Approximate Distributed LDA". Let the local counters diverge during a pass, then combine them. This is iterative parameter mixing.
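A schematic of one AD-LDA pass, assuming a `gibbs_pass(shard, counts)` helper that sweeps a shard against a local copy of the topic-word counts (names illustrative):

```python
import numpy as np

def ad_lda_pass(shards, n_wt, gibbs_pass):
    """One pass of Approximate Distributed LDA (sketch).
    Each processor sweeps its shard with a private copy of the topic-word
    counts; the resulting deltas are summed back in (the per-pass all-reduce)."""
    local_counts = [n_wt.copy() for _ in shards]
    for shard, local in zip(shards, local_counts):  # these sweeps run in parallel
        gibbs_pass(shard, local)                    # local counters diverge here
    # iterative parameter mixing: add every processor's delta to the global counts
    n_wt += sum(local - n_wt for local in local_counts)
    return n_wt
```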

13 What if you try and parallelize?
Per-pass all-reduce cost. (D = #docs, W = #word types, K = #topics, N = #words in the corpus.)

14

15 perplexity – NIPS dataset

16

17 match topics by similarity not topic id

18 Speeding up LDA
Parallelize it. Speed up sampling.

19 RECAP: In more detail, sampling is linear in corpus size and in #topics.

20 RECAP
Each iteration: linear in corpus size. Resampling one z: linear in #topics. Most of the time is spent resampling.

21 RECAP: You spend a lot of time sampling
There's a loop over all topics here in the sampler. [Figure: unit-height line segment with regions for z=1, z=2, z=3; a uniform random draw selects a topic.]

22 KDD 09

23 [Figure: the naive sampler's unit-height line segment with regions for z=1, z=2, z=3; a uniform random draw selects a topic.]

24 [Figure: the per-topic mass is split into three buckets of height s, r, and q, with normalizer z = s + r + q; each bucket is subdivided by topic (z=1, z=2, z=3).]

25 Draw random U from uniform[0, s+r+q]
If U < s: look up U on a line segment with tick-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), … (the bucket of height s; normalizer = s + r + q).

26 If s < U < s+r: look up U on the line segment for r
Only need to check t such that nt|d > 0. (As before, if U < s: look up U on the segment with tick-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …; normalizer z = s + r + q.)

27 If s+r < U: look up U on the line segment for q
Only need to check t such that nw|t > 0. (As before, the U < s and s < U < s+r cases use the segments for s and r; normalizer z = s + r + q.)
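A sketch of the three-bucket draw from slides 25-27, assuming the bucket masses s, r, q and the per-topic terms have already been computed (names illustrative):

```python
import random

def bucketed_draw(s, r, q, s_terms, r_terms, q_terms):
    """s_terms covers all K topics (smoothing-only mass), r_terms only
    topics with n_{t|d} > 0, q_terms only topics with n_{w|t} > 0.
    Each *_terms argument is a list of (topic, mass) pairs."""
    U = random.uniform(0.0, s + r + q)
    if U < s:
        terms = s_terms                 # dense bucket: walk all topics
    elif U < s + r:
        U -= s
        terms = r_terms                 # only topics present in the document
    else:
        U -= s + r
        terms = q_terms                 # only topics in which this word occurs
    for t, mass in terms:               # walk the chosen segment's tick-marks
        if U < mass:
            return t
        U -= mass
    return terms[-1][0]                 # guard against floating-point slop
```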

28 Only occasionally (< 10% of the time) does the draw require checking more than these sparse sets
For the r bucket, only need to check t such that nt|d > 0; for the q bucket, only t such that nw|t > 0. (z = s + r + q)

29 Only need to store (and maintain) the total words per topic, plus the α's, β, and V
Trick: count up nt|d for d when you start working on d and update it incrementally, so only nt|d for the current d needs to be stored. Still need to store nw|t for each (word, topic) pair …??? (z = s + r + q)

30 1. Maintain, for d and each t, …
2. Quickly find the t's such that nw|t is large for w. Most (> 90%) of the time and space is here: we still need to store nw|t for each (word, topic) pair …??? [Figure: the r and q buckets; z = s + r + q.]

31 Topic distributions are skewed!
[Figure: topic proportions learned by LDA on a typical NIPS paper; most of the mass falls on a handful of topics.]

32 Quickly find t's such that nw|t is large for w
1. Precompute, for each t, …
2. Quickly find the t's such that nw|t is large for w. Here's how Mimno did it: associate each w with an int array, no larger than the frequency of w and no larger than #topics; encode (t, n) as a bit vector with the count n in the high-order bits and the topic t in the low-order bits; keep the ints sorted in descending order. Most (> 90%) of the time and space is here: storing nw|t for each (word, topic) pair.
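A small illustration of that (t, n) packing, assuming 16 low-order bits are enough for the topic id (the exact widths are an implementation choice, not given on the slide):

```python
TOPIC_BITS = 16
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(t, n):
    # count n in the high-order bits, topic t in the low-order bits
    return (n << TOPIC_BITS) | t

def decode(x):
    return x & TOPIC_MASK, x >> TOPIC_BITS

# one sorted int array per word type; sorting by the packed value puts the
# largest counts first, so the big n_{w|t} entries are found immediately
word_topics = sorted((encode(t, n) for t, n in [(3, 41), (17, 2), (5, 9)]),
                     reverse=True)
for x in word_topics:
    t, n = decode(x)
    print(t, n)   # topics come out in descending order of count: (3,41), (5,9), (17,2)
```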

33 NIPS dataset

34 Other Fast Samplers for LDA

35 Alias tables Basic problem: how can we sample from a biased die quickly? Naively this is O(K)

36 Alias tables
Another idea: simulate the dart with two drawn values, rx ← int(u1*K) and ry ← u2*pmax, and keep throwing until you hit a stripe.

37 Alias tables
An even more clever idea: minimize the brown space (where the dart "misses") by sizing the rectangle's height to the average probability rather than the maximum probability, and cutting and pasting a bit. You can always do this so that each column of the final alias table uses only two colors, and the dart never misses! (mathematically speaking…)
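One standard way to realize this (the Vose/Walker alias method), as a sketch rather than the exact construction used in the lecture:

```python
import random

def build_alias(probs):
    """Build an alias table: each column keeps at most two 'colors'
    (its own index up to prob[i], and an alias above that)."""
    K = len(probs)
    scaled = [p * K for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [0.0] * K, [0] * K
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l        # column s: itself below, alias l above
        scaled[l] -= 1.0 - scaled[s]            # donate the leftover mass
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                     # remaining columns are a single color
        prob[i] = 1.0
    return prob, alias

def draw_alias(prob, alias):
    """O(1) draw: pick a column uniformly, then one of its two colors."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```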

38 LDA with Alias Sampling
[KDD 2014] Sample the Z's with an alias sampler. Don't rebuild the sampler after every flip; instead, correct for the "staleness" of the proposals with a Metropolis-Hastings step.
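A sketch of that Metropolis-Hastings correction for an independence proposal drawn from a stale distribution; `true_prob` and `stale_prob` are illustrative callables, not the paper's API:

```python
import random

def mh_step(current_t, true_prob, stale_prob, draw_proposal):
    """One MH step with a stale alias proposal.
    true_prob(t): up-to-date (unnormalized) conditional Pr(z = t | rest).
    stale_prob(t): the proposal mass the alias table was built from.
    draw_proposal(): draws a topic from the stale alias table."""
    proposed_t = draw_proposal()
    accept = min(1.0, (true_prob(proposed_t) * stale_prob(current_t)) /
                      (true_prob(current_t) * stale_prob(proposed_t)))
    return proposed_t if random.random() < accept else current_t
```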

39

40 Yet More Fast Samplers for LDA

41 WWW 2015

42 Fenwick Tree (1994)
Basic problem: how can we sample from a biased die quickly… and also update the weights quickly? Naively sampling is O(K); with a binary tree over cumulative weights, both sampling and updates are O(log2 K). [Figure: example tree; a draw r in (23/40, 7/10] descends to the matching leaf.]
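A sketch of the idea: a Fenwick (binary indexed) tree over the topic weights gives O(log K) updates and O(log K) sampling. This is one standard realization, not necessarily the exact variant in the paper:

```python
import random

class FenwickSampler:
    """Fenwick tree over topic weights: update a weight and sample a topic
    from the biased die, both in O(log K)."""
    def __init__(self, K):
        self.K = K
        self.tree = [0.0] * (K + 1)   # 1-indexed partial sums
        self.w = [0.0] * K
        self.total = 0.0

    def update(self, t, new_weight):
        delta = new_weight - self.w[t]
        self.w[t] = new_weight
        self.total += delta
        i = t + 1
        while i <= self.K:            # propagate the change up the tree
            self.tree[i] += delta
            i += i & (-i)

    def sample(self):
        """Descend the tree to the topic whose cumulative range contains r."""
        r = random.uniform(0.0, self.total)
        pos, step = 0, 1 << self.K.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.K and self.tree[nxt] < r:
                pos = nxt
                r -= self.tree[nxt]
            step >>= 1
        return pos                    # 0-indexed topic
```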

43 Data structures and algorithms
LSearch: linear search

44 Data structures and algorithms
BSearch: binary search over stored cumulative probabilities

45 Data structures and algorithms
Alias sampling…..

46 Data structures and algorithms
F+ tree

47 Data structures and algorithms
Fenwick tree + binary search. The βq part is dense, changes slowly, and is re-used for each word in a document; the r part is sparse, and a different one is needed for each unique term in the document. The sampler is: …

48 Speedup vs std LDA sampler (1024 topics)

49 Speedup vs std LDA sampler (10k-50k topics)

50 WWW 2015
The paper also describes some nice ways to parallelize this operation, similar to the distributed MF algorithm we discussed.

51

52 Multi-core NOMAD method

53 Speeding up LDA
Parallelize it. Speed up sampling.

54 Speeding up LDA-like models
Parallelize it. Speed up sampling. Use these tricks for other models…

55 Network Datasets UBMCBlog AGBlog MSPBlog Cora Citeseer

56 How do you model such graphs?
"Stochastic block model", aka "block-stochastic matrix": draw ni nodes in block i; with probability pij, connect pairs (u, v) where u is in block i and v is in block j. Special, simple case: pii = qi, and pij = s for all i ≠ j. Question: can you fit this model to a graph, i.e. find each pij and the latent node-to-block mapping?
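A short generative sketch of this block-stochastic graph (parameter names are illustrative):

```python
import random

def sample_block_graph(block_sizes, p):
    """block_sizes[i] = n_i nodes in block i; p[i][j] = probability of an
    edge between a node in block i and a node in block j."""
    nodes, edges = [], []
    for i, n_i in enumerate(block_sizes):
        nodes += [i] * n_i                        # node -> block assignment
    for u in range(len(nodes)):
        for v in range(u + 1, len(nodes)):
            if random.random() < p[nodes[u]][nodes[v]]:
                edges.append((u, v))
    return nodes, edges

# the special simple case from the slide: p_ii = q_i on the diagonal, p_ij = s off it
q, s = [0.3, 0.25], 0.01
p = [[q[i] if i == j else s for j in range(2)] for i in range(2)]
nodes, edges = sample_block_graph([50, 50], p)
```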

57 Not? football

58 Not? books Artificial graph

59 Artificial graph

60 A mixed membership stochastic block model

61 Stochastic Block models
Airoldi, Blei, Fienberg, Xing, JMLR 2008. [Plate diagram: per-pair membership indicators zp, zq, block-interaction parameters, and observed edges apq; N nodes, N2 pairs.]

62 Another mixed membership block model
Parkkinen, Sinkkonen, Gyenge, Kaski, MLG 2009

63 Another mixed membership block model
Pick two multinomials over nodes. For each edge in the graph: pick z = (zi, zj), a pair of block ids; pick node i based on Pr(·|zi); pick node j based on Pr(·|zj).
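A sketch of that edge-generating process, assuming a distribution over block-id pairs and one multinomial over nodes per block (names illustrative):

```python
import random

def sample_edges(n_edges, block_pair_probs, node_probs):
    """block_pair_probs: dict {(zi, zj): prob} over pairs of block ids.
    node_probs[z]: dict {node: prob}, the multinomial Pr(. | z) over nodes."""
    pairs = list(block_pair_probs)
    pair_w = [block_pair_probs[p] for p in pairs]
    edges = []
    for _ in range(n_edges):
        zi, zj = random.choices(pairs, weights=pair_w)[0]          # pick z = (zi, zj)
        i = random.choices(list(node_probs[zi]),
                           weights=list(node_probs[zi].values()))[0]  # node i ~ Pr(.|zi)
        j = random.choices(list(node_probs[zj]),
                           weights=list(node_probs[zj].values()))[0]  # node j ~ Pr(.|zj)
        edges.append((i, j))
    return edges
```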

64

65 Experiments – for next week
Balasubramanyan, Lin, Cohen, NIPS w/s 2010

