Generative Topic Models for Community Analysis


1 Generative Topic Models for Community Analysis
Ramesh Nallapati

2 Objectives
Provide an overview of topic models and their learning techniques:
  Mixture models, PLSA, LDA
  EM, variational EM, Gibbs sampling
Convince you that topic models are an attractive framework for community analysis:
  5 definitive papers

3 Outline
Part I: Introduction to Topic Models
  Naive Bayes model
  Mixture models
  Expectation Maximization
  PLSA
  LDA
  Variational EM
  Gibbs sampling
Part II: Topic Models for Community Analysis
  Citation modeling with PLSA
  Citation modeling with LDA
  Author-Topic model
  Author-Topic-Recipient model
  Modeling influence of citations
  Mixed-membership Stochastic Block Model

4 Introduction to Topic Models
Multinomial Naïve Bayes
For each document d = 1,…,M:
  Generate C_d ~ Mult(· | π)
  For each position n = 1,…,N_d:
    Generate w_n ~ Mult(· | β, C_d)
[Plate diagram: class variable C and word nodes w_1,…,w_N, repeated over M documents, with word-distribution parameter β.]
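To make the generative story concrete, here is a minimal Python sketch of this process, assuming a toy vocabulary, class prior, and per-class word distributions that are purely illustrative (none of these values come from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "team", "election", "vote", "market", "stock"]
pi = np.array([0.5, 0.5])                    # class prior (assumed)
beta = np.array([                            # beta[c, w] = P(word w | class c)
    [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],    # a "sports"-like class
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],    # a "politics"-like class
])

def generate_document(n_words=8):
    c = rng.choice(len(pi), p=pi)                        # C_d ~ Mult(. | pi)
    words = rng.choice(vocab, size=n_words, p=beta[c])   # w_n ~ Mult(. | beta, C_d)
    return c, list(words)

for _ in range(3):
    print(generate_document())
```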

5 Introduction to Topic Models
Naïve Bayes Model: compact plate representation
[Plate diagram: the word nodes w_1,…,w_N collapsed into a single plate of size N inside the document plate of size M, with parameter β.]

6 Introduction to Topic Models
Multinomial naïve Bayes: learning
Maximize the log-likelihood of the observed variables w.r.t. the parameters
The log-likelihood is concave in the parameters, so there is a global optimum
Solution: closed form (equation on the slide)
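The closed-form solution the slide refers to is the standard count-based maximum-likelihood estimate; a reconstruction in my notation, where n_{dw} is the count of word w in document d and 1[·] is the indicator function:

```latex
\hat{\pi}_c = \frac{\sum_{d=1}^{M} \mathbb{1}[C_d = c]}{M},
\qquad
\hat{\beta}_{cw} = \frac{\sum_{d:\,C_d = c} n_{dw}}{\sum_{d:\,C_d = c} \sum_{w'} n_{dw'}} .
```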

7 Introduction to Topic Models
Mixture model: unsupervised naïve Bayes model
Joint probability of words and classes, but the classes are not observed, so they must be summed out (equations on the slide)
[Plate diagram: latent class variable, word plate of size N, document plate of size M, parameter β.]
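For reference, a reconstruction of the two quantities the slide points to, in the same notation as above (π is the class prior, β the per-class word distributions); the second expression marginalizes out the unobserved class:

```latex
p(C_d = c,\, w_1,\dots,w_{N_d}) = \pi_c \prod_{n=1}^{N_d} \beta_{c,\,w_n},
\qquad
p(w_1,\dots,w_{N_d}) = \sum_{c=1}^{K} \pi_c \prod_{n=1}^{N_d} \beta_{c,\,w_n} .
```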

8 Introduction to Topic Models
Mixture model: learning
Not a convex problem: no global-optimum guarantee
Solution: Expectation Maximization
  Iterative algorithm
  Finds a local optimum
  Guaranteed to maximize a lower bound on the log-likelihood of the observed data

9 Introduction to Topic Models
Quick summary of EM: the log is a concave function, so by Jensen's inequality log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2). This gives a lower bound on the log-likelihood that is easy to optimize with respect to each variable in turn. [Figure: plot of the log curve illustrating Jensen's inequality, with entropy term H(q).]
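A reconstruction of the Jensen-based bound behind this picture, using an arbitrary distribution q(z) over the hidden variables (the bound is tight when q(z) is the posterior p(z | w, θ)):

```latex
\log p(w \mid \theta)
 = \log \sum_{z} q(z)\, \frac{p(w, z \mid \theta)}{q(z)}
 \;\ge\; \sum_{z} q(z)\, \log \frac{p(w, z \mid \theta)}{q(z)} .
```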

10 Introduction to Topic Models
Mixture model: EM solution. E-step and M-step update equations (shown on the slide).
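A minimal sketch of these E- and M-steps for a mixture of multinomials, assuming the corpus is given as a document-term count matrix X; the smoothing constant, initialization, and toy data are my choices, not the lecture's:

```python
import numpy as np

def em_mixture(X, K, n_iters=50, eps=1e-12, seed=0):
    """EM for a mixture of multinomials (unsupervised naive Bayes)."""
    rng = np.random.default_rng(seed)
    M, V = X.shape
    pi = np.full(K, 1.0 / K)                   # mixing weights pi_c
    beta = rng.dirichlet(np.ones(V), size=K)   # per-component word distributions
    for _ in range(n_iters):
        # E-step: responsibility r[d, c] = P(component c | document d)
        log_r = np.log(pi + eps) + X @ np.log(beta + eps).T
        log_r -= log_r.max(axis=1, keepdims=True)       # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and beta from the expected counts
        pi = r.sum(axis=0) / M
        beta = r.T @ X + eps
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta

# Toy document-term count matrix: 3 documents over a 4-word vocabulary.
X = np.array([[3, 1, 0, 0], [0, 0, 2, 4], [2, 2, 0, 1]], dtype=float)
pi, beta = em_mixture(X, K=2)
print(pi)
```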

11 Introduction to Topic Models

12 Introduction to Topic Models
Probabilistic Latent Semantic Analysis (PLSA)
Select a document d ~ Mult(·)
For each position n = 1,…,N_d:
  generate z_n ~ Mult(· | θ_d)    (θ_d: topic distribution of document d)
  generate w_n ~ Mult(· | φ_{z_n})
[Plate diagram: d → z → w, N positions per document, M documents.]
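A minimal Python sketch of this generative story; the toy topic-word matrix and per-document topic mixtures are illustrative assumptions. Note that PLSA only defines a topic mixture for documents indexed in the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["ball", "score", "vote", "senate"]
phi = np.array([[0.5, 0.5, 0.0, 0.0],     # topic 0: sports-like words
                [0.0, 0.0, 0.5, 0.5]])    # topic 1: politics-like words
theta = np.array([[0.9, 0.1],             # theta_d for two training documents
                  [0.2, 0.8]])

def generate_words(d, n_words=6):
    words = []
    for _ in range(n_words):
        z = rng.choice(phi.shape[0], p=theta[d])    # z_n ~ Mult(. | theta_d)
        words.append(rng.choice(vocab, p=phi[z]))   # w_n ~ Mult(. | phi_{z_n})
    return words

print(generate_words(0), generate_words(1))
```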

13 Introduction to Topic Models
Probabilistic Latent Semantic Analysis: learning
Learning using EM
Not a complete generative model: it only defines topic mixtures θ_d over the training set of documents, so no new document can be generated
Nevertheless, more realistic than the mixture model: documents can discuss multiple topics!

14 Introduction to Topic Models
PLSA topics (TDT-1 corpus)

15 Introduction to Topic Models

16 Introduction to Topic Models
Latent Dirichlet Allocation
For each document d = 1,…,M:
  Generate θ_d ~ Dir(· | α)
  For each position n = 1,…,N_d:
    generate z_n ~ Mult(· | θ_d)
    generate w_n ~ Mult(· | φ_{z_n})
[Plate diagram: α → θ → z → w, N positions per document, M documents.]
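The same kind of sketch for LDA; the only change from the PLSA sketch above is that each document's topic mixture is now drawn from a Dirichlet prior, so unseen documents can be generated too. The hyperparameters and toy topics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "market", "stock"]
phi = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
alpha = np.array([0.5, 0.5])                   # Dirichlet hyperparameter

def generate_document(n_words=8):
    theta = rng.dirichlet(alpha)                    # theta_d ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)         # z_n ~ Mult(. | theta_d)
        words.append(rng.choice(vocab, p=phi[z]))   # w_n ~ Mult(. | phi_{z_n})
    return theta, words

for _ in range(2):
    print(generate_document())
```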

17 Introduction to Topic Models
Latent Dirichlet Allocation
Overcomes the issues with PLSA: can generate new (unseen) documents
Parameter learning:
  Variational EM
    Numerical approximation using lower bounds
    Results in biased solutions
    Convergence has numerical guarantees
  Gibbs sampling
    Stochastic simulation
    Unbiased solutions
    Stochastic convergence

18 Introduction to Topic Models
Variational EM for LDA
Approximate the true posterior by a simpler, factorized distribution
The resulting lower bound is easy to optimize in each parameter separately
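The "simpler distribution" in Blei et al.'s variational treatment is a fully factorized (mean-field) family over each document's hidden variables, with a free Dirichlet parameter γ_d and free per-token multinomial parameters (written ν here to avoid clashing with the topic-word distributions φ above):

```latex
q(\theta_d, z_d \mid \gamma_d, \nu_d)
 = \mathrm{Dir}(\theta_d \mid \gamma_d) \prod_{n=1}^{N_d} \mathrm{Mult}(z_{dn} \mid \nu_{dn}) .
```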

19 Introduction to Topic Models
Gibbs sampling
Applicable when the joint distribution is hard to evaluate or sample from directly, but each conditional distribution is known and easy to sample from
The sequence of samples forms a Markov chain
The stationary distribution of the chain is the joint distribution
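A minimal sketch of collapsed Gibbs sampling for LDA in the style of Griffiths and Steyvers, which resamples each token's topic from its conditional given all other assignments; the hyperparameters, toy corpus, and count-table layout are my illustrative choices:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial topic for every token, plus the count tables the
    # collapsed sampler needs: doc-topic, topic-word, and topic totals.
    z = [[rng.integers(K) for _ in doc] for doc in docs]
    ndk = np.zeros((len(docs), K))
    nkw = np.zeros((K, V))
    nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts ...
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # ... compute p(z_i = k | z_-i, w) from the collapsed counts ...
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                p /= p.sum()
                k = rng.choice(K, p=p)
                # ... and add the resampled assignment back.
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return ndk, nkw

# Toy corpus: documents as lists of word ids over a vocabulary of size V=4.
docs = [[0, 0, 1, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
doc_topic, topic_word = gibbs_lda(docs, V=4, K=2)
print(topic_word)
```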

20 Introduction to Topic Models
LDA topics

21 Introduction to Topic Models
LDA’s view of a document

22 Introduction to Topic Models
Perplexity comparison of various models (lower is better). [Figure: perplexity curves for the unigram model, mixture of unigrams, PLSA, and LDA.]

23 Introduction to Topic Models
Summary
Generative models for exchangeable data
Unsupervised models: automatically discover topics
Well-developed approximate techniques available for inference and learning

24 Outline
Part I: Introduction to Topic Models
  Naive Bayes model
  Mixture models
  Expectation Maximization
  PLSA
  LDA
  Variational EM
  Gibbs sampling
Part II: Topic Models for Community Analysis
  Citation modeling with PLSA
  Citation modeling with LDA
  Author-Topic model
  Author-Topic-Recipient model
  Modeling influence of citations
  Mixed-membership Stochastic Block Model

25 Hyperlink modeling using PLSA

26 Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Select document d ~ Mult(·)
For each position n = 1,…,N_d:
  generate z_n ~ Mult(· | θ_d)
  generate w_n ~ Mult(· | φ_{z_n})
For each citation j = 1,…,L_d:
  generate z_j ~ Mult(· | θ_d)
  generate c_j ~ Mult(· | γ_{z_j})
[Plate diagram: a shared θ_d generates topics for both the word plate (N) and the citation plate (L); γ is the topic-specific citation distribution.]
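A minimal Python sketch of this generative story: one topic mixture θ_d emits both words (through topic-word distributions φ) and citations (through topic-specific citation distributions γ). All sizes and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, C = 2, 4, 3                             # topics, vocabulary size, citable documents
phi = rng.dirichlet(np.ones(V), size=K)       # phi[k]:   P(word | topic k)
gamma = rng.dirichlet(np.ones(C), size=K)     # gamma[k]: P(cited document | topic k)
theta_d = np.array([0.7, 0.3])                # topic mixture of one training document

def generate(n_words=5, n_cites=2):
    words = [rng.choice(V, p=phi[rng.choice(K, p=theta_d)]) for _ in range(n_words)]
    cites = [rng.choice(C, p=gamma[rng.choice(K, p=theta_d)]) for _ in range(n_cites)]
    return words, cites

print(generate())
```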

27 Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
PLSA likelihood vs. the new joint likelihood over both words and citations (equations on the slide). Learning using EM. [Plate diagram as on the previous slide.]

28 Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Heuristic: interpolate the content and hyperlink log-likelihoods with weights λ and (1-λ); 0 ≤ λ ≤ 1 determines the relative importance of content and hyperlinks.
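A hedged reconstruction of the blended objective the slide describes, writing the mixing weight as λ (the symbol, and the assignment of λ to the content term, are my assumptions; the paper may use different notation):

```latex
\mathcal{L}(d) \;=\;
\lambda \sum_{w} n(d,w)\, \log \sum_{z} P(w \mid z)\, P(z \mid d)
\;+\;
(1-\lambda) \sum_{c} n(d,c)\, \log \sum_{z} P(c \mid z)\, P(z \mid d),
\qquad 0 \le \lambda \le 1 .
```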

29 Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Experiments: text classification
Datasets:
  WebKB: 6,000 CS department web pages with hyperlinks; 6 classes (faculty, course, student, staff, etc.)
  Cora: 2,000 machine-learning abstracts with citations; 7 classes (sub-areas of machine learning)
Methodology:
  Learn the model on the complete data and obtain θ_d for each document
  Classify each test document with the label of its nearest neighbor in the training set
  Distance measured as cosine similarity in the topic (θ) space
  Measure performance as a function of λ

30 Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
Classification performance as the content/hyperlink weight varies. [Figure: classification accuracy curves between the content-only and hyperlink-only extremes, for both datasets.]

31 Hyperlink modeling using LDA

32 Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
For each document d = 1,…,M:
  Generate θ_d ~ Dir(· | α)
  For each position n = 1,…,N_d:
    generate z_n ~ Mult(· | θ_d)
    generate w_n ~ Mult(· | φ_{z_n})
  For each citation j = 1,…,L_d:
    generate z_j ~ Mult(· | θ_d)
    generate c_j ~ Mult(· | γ_{z_j})
Learning using variational EM
[Plate diagram: θ_d generates topics for both the word plate (N) and the citation plate (L); γ is the topic-specific citation distribution.]
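Integrating out θ_d, the per-document marginal likelihood implied by this process is as follows (φ and γ denote the topic-word and topic-citation distributions, matching the sketch above; this is my transcription of the generative story, not a formula copied from the paper):

```latex
p(\mathbf{w}_d, \mathbf{c}_d \mid \alpha, \phi, \gamma)
 = \int p(\theta_d \mid \alpha)
   \prod_{n=1}^{N_d} \Big( \sum_{z} \theta_{dz}\, \phi_{z,\,w_n} \Big)
   \prod_{j=1}^{L_d} \Big( \sum_{z'} \theta_{dz'}\, \gamma_{z',\,c_j} \Big)\, d\theta_d .
```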

33 Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

34 Author-Topic Model for Scientific Literature

35 Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
For each author a = 1,…,A:
  Generate θ_a ~ Dir(· | α)
For each topic k = 1,…,K:
  Generate φ_k ~ Dir(· | β)
For each document d = 1,…,M:
  For each position n = 1,…,N_d:
    Generate author x ~ Unif(· | a_d)    (a_d: the set of authors of document d)
    generate z_n ~ Mult(· | θ_x)
    generate w_n ~ Mult(· | φ_{z_n})
[Plate diagram: author set a_d → x → z → w, with author-topic distributions θ (plate of size A) and topic-word distributions φ (plate of size K).]
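A minimal Python sketch of the Author-Topic generative process; the author set, hyperparameters, and toy vocabulary are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["model", "data", "protein", "cell"]
A, K = 2, 2                                   # number of authors and topics
alpha = np.full(K, 0.5)
beta = np.full(len(vocab), 0.5)
theta = rng.dirichlet(alpha, size=A)          # theta_a ~ Dir(alpha), one per author
phi = rng.dirichlet(beta, size=K)             # phi_k ~ Dir(beta), one per topic

def generate_document(authors, n_words=6):
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)                     # x ~ Unif(authors of the document)
        z = rng.choice(K, p=theta[x])               # z_n ~ Mult(. | theta_x)
        words.append(rng.choice(vocab, p=phi[z]))   # w_n ~ Mult(. | phi_{z_n})
    return words

print(generate_document(authors=[0, 1]))
```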

36 Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
Learning: Gibbs sampling
[Plate diagram of the Author-Topic model, as on the previous slide.]

37 Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
Perplexity results

38 Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
Topic-Author visualization

39 Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
Application 1: Author similarity

40 Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
Application 2: Author entropy

41 Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

42 Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Gibbs sampling

43 Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Datasets:
  Enron data: 23,488 messages between 147 users
  McCallum’s personal email: 23,488 (?) messages with 128 authors

44 Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Topic Visualization: Enron set

45 Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Topic Visualization: McCallum’s data

46 Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

47 Modeling Citation Influences

48 Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Copycat model

49 Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Citation influence model

50 Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Citation influence graph for LDA paper

51 Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Words in LDA paper assigned to citations

52 Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Performance evaluation
Data: 22 seed papers and 132 cited papers; users labeled citations on a scale of 1-4
Models considered:
  Citation influence model
  Copycat model
  LDA-JS-divergence (symmetric divergence in topic space)
  LDA-post
  PageRank
  TF-IDF
Evaluation measure: area under the ROC curve

53 Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Results

54 Mixed membership Stochastic Block models [Work In Progress]
A complete generative model for text and citations
Can model the topicality of citations
Topic-specific PageRank
Can also predict citations between unseen documents

55 Summary
Topic modeling is an interesting new framework for community analysis:
  Sound theoretical basis
  Completely unsupervised
  Simultaneous modeling of multiple fields
  Discovers "soft" communities and clusters in terms of topic membership
  Can also be used for predictive purposes

