Probabilistic models for corpora and graphs

Review: some generative models – Multinomial Naïve Bayes

[Plate diagram: class label C and words W1 … WN inside a plate of size M, with parameters for the class prior and the word distribution.]

For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | β)

Review: some generative models – Multinomial Naïve Bayes

[Plate diagram: as before, now with one word-distribution parameter per class, in a plate of size K.]

For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | β_{C_d})

Review: some generative models – Multinomial Naïve Bayes

For each class k = 1, …, K:
  Generate β_k
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | β_{C_d})

Review: some generative models – Multinomial Naïve Bayes

For each class k = 1, …, K:
  Generate β_k ~ Dir(η)
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | β_{C_d})

The Dirichlet is a prior for multinomials, defined by K parameters (α_1, …, α_K) with α_0 = α_1 + … + α_K. The MAP estimate for θ_i is (n_i + α_i) / (n + α_0). A symmetric Dirichlet is one where all the α_i's are equal.
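
To make the generative story concrete, here is a minimal NumPy sketch of it; the class and vocabulary sizes, the document lengths, the hyperparameter values, and the Dirichlet draw used to get the class prior are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 20, 5          # classes, vocabulary size, documents (toy sizes)
eta = np.full(V, 0.1)       # Dirichlet prior on each class's word distribution

pi = rng.dirichlet(np.ones(K))                            # a class prior (drawn here just to get a valid multinomial)
beta = np.array([rng.dirichlet(eta) for _ in range(K)])   # beta_k ~ Dir(eta), one word multinomial per class

docs = []
for d in range(M):
    c = rng.choice(K, p=pi)                      # C_d ~ Mult(C | pi)
    N_d = rng.integers(5, 15)                    # document length (arbitrary here)
    words = rng.choice(V, size=N_d, p=beta[c])   # w_n ~ Mult(W | beta_{C_d})
    docs.append((c, words))
```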

Review – unsupervised naïve Bayes (mixture model)

[Plate diagram: the same structure as before, but the class label is now an unobserved latent variable.]

For each class k = 1, …, K:
  Generate β_k ~ Dir(η)
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | β_{C_d})

Same generative story – different learning problem…

Latent Dirichlet Allocation

[Plate diagram: per-document topic proportions θ_d, per-token topic z_n and word w_n, and K topic-word distributions, with plates of size M, N, and K.]

For each document d = 1, …, M:
  Generate θ_d ~ Dir(· | α)
  For each position n = 1, …, N_d:
    Generate z_n ~ Mult(· | θ_d)
    Generate w_n ~ Mult(· | β_{z_n})

“Mixed membership”: each document mixes several topics instead of belonging to a single class.
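
Here is the same kind of minimal sketch for LDA's generative story; the toy sizes and hyperparameter values are again made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 20, 5                 # topics, vocabulary size, documents (toy sizes)
alpha = np.full(K, 0.5)            # Dirichlet prior on per-document topic proportions
eta = np.full(V, 0.1)              # Dirichlet prior on per-topic word distributions

beta = np.array([rng.dirichlet(eta) for _ in range(K)])   # topic-word distributions

docs = []
for d in range(M):
    theta = rng.dirichlet(alpha)             # theta_d ~ Dir(alpha): this document's topic mixture
    N_d = rng.integers(5, 15)
    z = rng.choice(K, size=N_d, p=theta)     # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[t]) for t in z])   # w_n ~ Mult(beta_{z_n})
    docs.append((z, w))
```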

LDA’s view of a document

Review - LDA Gibbs sampling
– Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
– The sequence of samples comprises a Markov chain
– The stationary distribution of the chain is the joint distribution
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

D1: “Happy Thanksgiving!”  D2: “Thanksgiving turkey”  D3: “Istanbul, Turkey”  (K = 2)

D1: “Happy Thanksgiving!”  D2: “Thanksgiving turkey”  D3: “Istanbul, Turkey”  (K = 2)

[Figure: the LDA model unrolled for the three documents, with a (z, w) pair for each of the N1, N2, and N3 tokens, a θ per document, and K = 2 topic-word distributions.]

(Same running example, K = 2.)

[Figure: the model unrolled to the token level, one z per token: z_11 (“happy”), z_12 (“thanksgiving”), z_21 (“thanksgiving”), z_22 (“turkey”), z_31 (“istanbul”), z_32 (“turkey”).]

(Same running example, K = 2.)

Step 1: initialize the z's randomly.

[Figure: each token's z set to a randomly chosen topic, 1 or 2.]

(Same running example, K = 2.)

Step 1: initialize the z's randomly.
Step 2: sample z_11 from Pr(z_11 = k | z_12, …, z_32, w). [The slide walks through the slow derivation of this conditional.]

(Same running example, K = 2.)

One factor of the conditional for z_11 is (#times “happy” appears in topic t) / (#words in topic t), smoothed by β.

(Same running example, K = 2.)

The other factor is (#times topic t appears in D1) / (#words in D1), smoothed by α.

(Same running example, K = 2.)

Step 1: initialize the z's randomly.
Step 2: sample z_11 from Pr(z_11 = k | z_12, …, z_32, w).
Step 3: sample z_12, and so on through all the tokens.
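
Putting the two factors together, the following is a minimal sketch of one sweep of collapsed Gibbs sampling over the toy corpus; the symmetric α and β values and the count bookkeeping layout are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy corpus: token ids for happy, thanksgiving, turkey, istanbul
vocab = ["happy", "thanksgiving", "turkey", "istanbul"]
docs = [[0, 1], [1, 2], [3, 2]]      # D1, D2, D3 as word ids
K, V = 2, len(vocab)
alpha, beta = 0.1, 0.01

# step 1: initialize the z's randomly and build the count tables
z = [rng.integers(K, size=len(doc)) for doc in docs]
n_td = np.zeros((len(docs), K))      # topic counts per document
n_wt = np.zeros((V, K))              # word counts per topic
n_t = np.zeros(K)                    # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_td[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1

# steps 2, 3, ...: resample each z in turn from its full conditional
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_td[d, t] -= 1; n_wt[w, t] -= 1; n_t[t] -= 1   # remove this token's counts
        p = (n_wt[w] + beta) / (n_t + beta * V) * (n_td[d] + alpha)
        t = rng.choice(K, p=p / p.sum())                # draw from the full conditional
        z[d][i] = t
        n_td[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1   # put the counts back
```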

Review – unsupervised naïve Bayes (mixture model)

For each class k = 1, …, K:
  Generate β_k ~ Dir(η)
For each document d = 1, …, M:
  Generate C_d ~ Mult(C | π)
  For each position n = 1, …, N_d:
    Generate w_n ~ Mult(W | β_{C_d})

Question: could you use collapsed Gibbs sampling here? What would Pr(z_i | z_1, …, z_{i-1}, z_{i+1}, …, z_m) look like?

Models for corpora and graphs


Stochastic block models: assume (1) nodes within a block z are exchangeable, and (2) edges between blocks z_p, z_q are exchangeable.

[Plate diagram: a latent class z_i for each of the N nodes and an edge weight a_uv for each of the N² node pairs.]

For each node i, pick a latent class z_i.
For each node pair (i, j), pick an edge weight a by a ~ Pr(a | θ[z_i, z_j]).
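
A minimal sketch of this generative story for a binary graph; the Bernoulli edge model, the Beta and Dirichlet priors, and the toy sizes are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 10, 2                               # nodes, latent blocks (toy sizes)
pi = rng.dirichlet(np.ones(K))             # prior over blocks
theta = rng.beta(1, 1, size=(K, K))        # edge probability for each block pair

z = rng.choice(K, size=N, p=pi)            # for each node i, pick a latent class z_i
A = np.zeros((N, N), dtype=int)
for i in range(N):
    for j in range(N):
        # for each node pair (i, j), draw an edge given the block-pair parameter
        A[i, j] = rng.binomial(1, theta[z[i], z[j]])
```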

Another mixed membership block model:
– pick a multinomial θ over pairs of node topics (k_1, k_2)
– for each node topic k, pick a multinomial ϕ_k over node ids
– for each edge e: pick z = (k_1, k_2), then pick i ~ Pr(· | ϕ_{k_1}) and pick j ~ Pr(· | ϕ_{k_2})
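
A corresponding sketch of this mixed membership block model; the toy sizes, the uniform priors, and the way a topic pair is encoded as a single index are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, E = 10, 2, 30                              # nodes, node topics, edges (toy sizes)
theta = rng.dirichlet(np.ones(K * K))            # multinomial over pairs of node topics
phi = np.array([rng.dirichlet(np.ones(N)) for _ in range(K)])   # per-topic distribution over node ids

edges = []
for _ in range(E):
    pair = rng.choice(K * K, p=theta)            # pick z = (k1, k2), encoded as one index
    k1, k2 = divmod(pair, K)
    i = rng.choice(N, p=phi[k1])                 # pick endpoint i ~ Pr(. | phi_{k1})
    j = rng.choice(N, p=phi[k2])                 # pick endpoint j ~ Pr(. | phi_{k2})
    edges.append((i, j))
```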

Speeding up modeling for corpora and graphs

Solutions
Parallelize:
– IPM (Newman et al., AD-LDA)
– Parameter server (Qiao et al., HW7)
Speedups:
– Exploit sparsity
– Fast sampling techniques
State-of-the-art methods use both…

[Figure: sampling z for one token by drawing a uniform random value on a unit-height line divided into segments, one per topic (z = 1, z = 2, z = 3, …).]

This is almost all the run-time, and it's linear in the number of topics… and bigger corpora need more topics…
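
The per-token draw being described is just a linear scan over the K topic segments, roughly:

```python
import numpy as np

def sample_linear(p, rng):
    """Draw an index proportional to the unnormalized weights p by scanning left to right."""
    u = rng.random() * p.sum()     # a uniform dart thrown at the whole line segment
    acc = 0.0
    for t, w in enumerate(p):
        acc += w
        if u < acc:
            return t               # O(K) work per draw
    return len(p) - 1              # guard against floating-point round-off

rng = np.random.default_rng(0)
print(sample_linear(np.array([0.5, 2.0, 1.0]), rng))
```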

KDD 09 [slide shows the KDD 2009 paper behind the Mimno/McCallum (SparseLDA) decomposition]

[Figure: the unit-height sampling line repeated for several successive draws, with segments for topics t1, t2, t3, ….]

The Mimno/McCallum (SparseLDA) decomposition splits the sampling mass for one token into three buckets, Z = s + r + q, where

  s = Σ_t α_t β / (βV + n_{·|t})                     (smoothing-only bucket)
  r = Σ_t n_{t|d} β / (βV + n_{·|t})                 (document-topic bucket)
  q = Σ_t (α_t + n_{t|d}) n_{w|t} / (βV + n_{·|t})   (topic-word bucket)

Draw U ~ uniform(0, Z), find which bucket U falls in, and then find the topic within that bucket.

If U < s: look up U on a line segment with tick marks at α_1β/(βV + n_{·|1}), α_2β/(βV + n_{·|2}), …
If s < U < s + r: look up U on the line segment for r; only need to check topics t with n_{t|d} > 0.

Z = s + r + q

If U < s: look up U on the line segment with tick marks at α_1β/(βV + n_{·|1}), α_2β/(βV + n_{·|2}), …
If s < U < s + r: look up U on the line segment for r.
If s + r < U: look up U on the line segment for q; only need to check topics t with n_{w|t} > 0.
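
A sketch of the resulting three-bucket draw, assuming the per-topic terms of s, r, and q have already been computed as arrays; the function and argument names here are illustrative, not taken from the paper:

```python
import numpy as np

def _pick(terms, u):
    """Index of the segment containing u on the cumulative line for `terms`."""
    idx = int(np.searchsorted(np.cumsum(terms), u, side="right"))
    return min(idx, len(terms) - 1)            # clamp against floating-point round-off

def sample_sparse_lda(s_terms, r_terms, r_topics, q_terms, q_topics, rng):
    """Draw a topic id from the s/r/q decomposition.

    s_terms: length-K array of smoothing-only terms (dense).
    r_terms/r_topics: terms and topic ids for topics with n_{t|d} > 0 (sparse).
    q_terms/q_topics: terms and topic ids for topics with n_{w|t} > 0 (sparse).
    """
    s, r, q = s_terms.sum(), r_terms.sum(), q_terms.sum()
    u = rng.random() * (s + r + q)
    if u < s:                                  # rare: the smoothing bucket
        return _pick(s_terms, u)
    u -= s
    if u < r:                                  # the document-topic bucket
        return r_topics[_pick(r_terms, u)]
    return q_topics[_pick(q_terms, u - r)]     # the topic-word bucket

rng = np.random.default_rng(0)
print(sample_sparse_lda(np.array([0.01, 0.01, 0.01]),
                        np.array([0.3]), np.array([1]),
                        np.array([0.5, 0.2]), np.array([0, 2]), rng))
```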

Z = s + r + q

For q, only need to check topics t with n_{w|t} > 0.
For r, only need to check topics t with n_{t|d} > 0.
The s bucket only needs to be examined occasionally (U falls in it less than 10% of the time).

Z = s + r + q

Need to store n_{w|t} for each word-topic pair…???
Only need to store n_{t|d} for the current document d.
Only need to store (and maintain) the total words per topic, the α's, β, and V.
Trick: count up n_{t|d} for d when you start working on d, and update it incrementally.

Z = s + r + q

Need to store n_{w|t} for each word-topic pair…???
1. Precompute, for each topic t, the coefficient (α_t + n_{t|d}) / (βV + n_{·|t}). Most (>90%) of the time and space is here…
2. Quickly find the topics t for which n_{w|t} is large for word w.

Need to store n_{w|t} for each word-topic pair…???
1. Precompute, for each topic t, the coefficient above. Most (>90%) of the time and space is here…
2. Quickly find the topics t for which n_{w|t} is large for word w:
  – map w to an int array, no larger than the frequency of w and no larger than #topics
  – encode (t, n) as a bit vector, with n in the high-order bits and t in the low-order bits
  – keep the ints sorted in descending order
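
A sketch of the packing trick; the choice of 10 low-order bits for the topic id (i.e., at most 1024 topics) and the sample counts are assumptions for illustration:

```python
TOPIC_BITS = 10                       # assumes at most 2**10 topics
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(topic, count):
    """Pack (topic, count) into one int: count in the high-order bits, topic in the low-order bits."""
    return (count << TOPIC_BITS) | topic

def decode(packed):
    return packed & TOPIC_MASK, packed >> TOPIC_BITS

# per-word array, kept sorted in descending order so the largest n_{w|t} come first
row = sorted((encode(t, n) for t, n in [(3, 7), (12, 2), (0, 1)]), reverse=True)
for packed in row:
    topic, count = decode(packed)
    print(topic, count)
```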

Outline
– LDA/Gibbs algorithm details
– How to speed it up by parallelizing
– How to speed it up by faster sampling
  – Why sampling is key
  – Some sampling ideas for LDA
    – The Mimno/McCallum decomposition (SparseLDA)
    – Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)

Alias tables

Basic problem: how can we sample from a biased coin quickly?
If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times.
Proof of concept: generate r ~ uniform and walk a binary tree over the cumulative probabilities (e.g., the leaf for r in (23/40, 7/10]). Naive lookup is O(K); the tree lookup is O(log₂ K).
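
The tree lookup can be sketched with a precomputed cumulative-sum array and binary search, which gives the same O(log₂ K) per-draw cost after O(K) preprocessing; the example distribution below is made up:

```python
import bisect
import random
from itertools import accumulate

p = [4/40, 9/40, 8/40, 9/40, 3/40, 7/40]   # an example biased K-sided coin
cum = list(accumulate(p))                  # O(K) preprocessing

def sample(rng=random):
    r = rng.random() * cum[-1]
    return bisect.bisect_right(cum, r)     # O(log K) per draw

print(sample())
```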

Alias tables

Another idea… simulate the dart with two drawn values:
  rx ← int(u1 * K)
  ry ← u2 * p_max
and keep throwing until you hit a stripe.
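
This "dart" is ordinary rejection sampling against a K × p_max bounding box; a sketch, with a made-up example distribution:

```python
import random

def sample_dart(p, rng=random):
    """Rejection-sample an index proportional to p by throwing darts at a K x p_max box."""
    K = len(p)
    p_max = max(p)
    while True:
        rx = int(rng.random() * K)         # which column the dart lands in
        ry = rng.random() * p_max          # how high the dart lands
        if ry < p[rx]:                     # hit the stripe: accept
            return rx                      # otherwise: miss, throw again

print(sample_dart([0.1, 0.5, 0.2, 0.2]))
```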

Alias tables

An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle's height to the average probability rather than the maximum probability, and cutting and pasting a bit. You can always do this using only two “colors” in each column of the final alias table, and the dart never misses! (Mathematically speaking…)
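
A compact sketch of the standard alias-table construction (Vose's method) and the resulting O(1) draw; the example distribution is made up:

```python
import random

def build_alias(p):
    """Build an alias table: each column holds at most two 'colors' (itself and one alias)."""
    K = len(p)
    scaled = [x * K / sum(p) for x in p]          # rescale so the average column height is 1
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l          # cut-and-paste mass from column l into column s
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers are (numerically) exactly 1
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, rng=random):
    i = int(rng.random() * len(prob))                   # pick a column uniformly
    return i if rng.random() < prob[i] else alias[i]    # the dart never misses

prob, alias = build_alias([0.1, 0.5, 0.2, 0.2])
print(sample_alias(prob, alias))
```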

KDD 2014 (Li, Ahmed, Ravi, Smola)

Key ideas:
– use a variant of the Mimno/McCallum decomposition
– use alias tables to sample from the dense parts
– since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs

KDD 2014

q is the stale, easy-to-draw-from distribution; p is the updated distribution. Computing the ratios p(i)/q(i) is cheap, and usually the ratio is close to one (otherwise the dart missed).
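
A sketch of a single Metropolis-Hastings step using a stale proposal distribution q and the current unnormalized distribution p, which is where the p(i)/q(i) ratios come in; the function, the toy distributions, and the proposal draw are illustrative assumptions:

```python
import random

def mh_step(current, p, q, draw_from_q, rng=random):
    """One Metropolis-Hastings step: propose from the stale q, correct toward the fresh p."""
    proposal = draw_from_q()                       # cheap draw, e.g. from an alias table built on q
    accept = min(1.0, (p[proposal] * q[current]) / (p[current] * q[proposal]))
    return proposal if rng.random() < accept else current

# toy example: q was built from slightly out-of-date counts
p = [0.12, 0.48, 0.22, 0.18]
q = [0.10, 0.50, 0.20, 0.20]
state = 0
for _ in range(5):
    state = mh_step(state, p, q, lambda: random.choices(range(4), weights=q)[0])
print(state)
```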

KDD 2014