Generative Topic Models for Community Analysis


Generative Topic Models for Community Analysis Ramesh Nallapati

Objectives
Provide an overview of topic models and their learning techniques: mixture models, PLSA, LDA; EM, variational EM, Gibbs sampling
Convince you that topic models are an attractive framework for community analysis: 5 definitive papers
9/18/2007 10-802: Guest Lecture

Outline
Part I: Introduction to Topic Models
Naïve Bayes model, Mixture Models, Expectation Maximization, PLSA, LDA, Variational EM, Gibbs Sampling
Part II: Topic Models for Community Analysis
Citation modeling with PLSA, Citation modeling with LDA, Author-Topic Model, Author-Topic-Recipient Model, Modeling influence of citations, Mixed-Membership Stochastic Block Model

Introduction to Topic Models
Multinomial Naïve Bayes
For each document d = 1,…,M
  Generate Cd ~ Mult(· | π)
  For each position n = 1,…,Nd
    Generate wn ~ Mult(· | β, Cd)
[Plate diagram: class node C generating words W1…WN, repeated over M documents, with parameters π and β]
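The generative story above can be sketched in a few lines of Python; the class prior π, the per-class word distributions β, and the vocabulary size are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.6, 0.4])               # p(class): one entry per class (assumed)
beta = np.array([[0.7, 0.2, 0.1],       # p(word | class 0) over a 3-word vocabulary
                 [0.1, 0.3, 0.6]])      # p(word | class 1)

def generate_doc(n_words):
    """Draw a class, then draw each word position i.i.d. from that class."""
    c = rng.choice(len(pi), p=pi)                               # C_d ~ Mult(. | pi)
    words = rng.choice(beta.shape[1], size=n_words, p=beta[c])  # w_n ~ Mult(. | beta, C_d)
    return c, words

c, words = generate_doc(20)
```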

Introduction to Topic Models
Naïve Bayes Model: compact representation
[Plate diagrams: the expanded model with word nodes W1…WN, and the equivalent compact form with a single W node inside an N-plate, both over M documents with parameters π and β]

Introduction to Topic Models
Multinomial Naïve Bayes: learning
Maximize the log-likelihood of observed variables w.r.t. the parameters
Convex function: global optimum
Solution: relative frequencies — πc ∝ number of documents with class c; βw|c ∝ number of occurrences of word w in class-c documents
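Since the global optimum is the relative-frequency solution, the MLE can be sketched as simple counting; the labeled toy corpus below is a hypothetical example, not data from the lecture.

```python
from collections import Counter

# Hypothetical toy corpus: (class label, list of word tokens) pairs.
docs = [(0, ["a", "a", "b"]), (0, ["a", "c"]), (1, ["c", "c", "b"])]

# pi_c = fraction of documents with class c (the convex MLE from the slide)
n_docs = len(docs)
pi = {c: sum(1 for lbl, _ in docs if lbl == c) / n_docs for c in {0, 1}}

# beta_{w|c} = relative frequency of word w among all tokens of class-c documents
beta = {}
for c in {0, 1}:
    counts = Counter(w for lbl, ws in docs if lbl == c for w in ws)
    total = sum(counts.values())
    beta[c] = {w: n / total for w, n in counts.items()}
```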

Introduction to Topic Models
Mixture model: unsupervised Naïve Bayes
Joint probability of words and classes: p(Cd, wd) = p(Cd) ∏n p(wn | Cd)
But classes are not visible: the class becomes a hidden variable Z, and we maximize the marginal likelihood p(wd) = Σz p(z) ∏n p(wn | z)
[Plate diagram: hidden class Z generating words W over N positions and M documents, with parameters π and β]

Introduction to Topic Models
Mixture model: learning
Not a convex function: no global optimum solution
Solution: Expectation Maximization
Iterative algorithm; finds a local optimum
Guaranteed to maximize a lower bound on the log-likelihood of the observed data

Introduction to Topic Models
Quick summary of EM:
Log is a concave function: log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2) (Jensen's inequality)
The lower bound is convex! Optimize the lower bound w.r.t. each variable instead
[Figure: the log curve between x1 and x2 with its chord below it; the gap includes the entropy term H(·)]

Introduction to Topic Models
Mixture model: EM solution
E-step: P(z | d) ∝ πz ∏n βwn|z
M-step: πz ∝ Σd P(z | d);  βw|z ∝ Σd P(z | d) · n(d, w)
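One EM iteration for the multinomial mixture can be sketched as follows. The count matrix X and the initial values of π and β are hypothetical; the E-step computes posteriors over the hidden class and the M-step re-estimates the parameters from expected counts.

```python
import numpy as np

X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2]], dtype=float)  # toy doc-by-word counts
pi = np.array([0.5, 0.5])                                     # assumed initial p(z)
beta = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])           # assumed initial p(w | z)

def em_step(X, pi, beta):
    # E-step: posterior over the hidden class, p(z | d) ~ pi_z * prod_w beta_zw^{x_dw}
    log_r = np.log(pi) + X @ np.log(beta).T        # (docs x classes), in log space
    log_r -= log_r.max(axis=1, keepdims=True)      # stabilize before exponentiating
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from expected counts
    pi_new = r.mean(axis=0)
    beta_new = r.T @ X
    beta_new /= beta_new.sum(axis=1, keepdims=True)
    return r, pi_new, beta_new

r, pi, beta = em_step(X, pi, beta)
```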

Introduction to Topic Models

Introduction to Topic Models
Probabilistic Latent Semantic Analysis (PLSA)
Select document d ~ Mult(μ)
For each position n = 1,…,Nd
  generate zn ~ Mult(· | θd)
  generate wn ~ Mult(· | βzn)
[Plate diagram: document d with topic distribution θd, topic zn, and word wn over N positions and M documents]
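A minimal sketch of one PLSA draw, with hypothetical values for μ, θ, and β:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.5, 0.5])                 # distribution over training documents (assumed)
theta = np.array([[0.9, 0.1],             # per-document topic mixtures theta_d
                  [0.2, 0.8]])
beta = np.array([[0.6, 0.3, 0.1],         # per-topic word distributions
                 [0.1, 0.2, 0.7]])

d = rng.choice(2, p=mu)                   # select document d ~ Mult(mu)
z = rng.choice(2, p=theta[d])             # z_n ~ Mult(. | theta_d)
w = rng.choice(3, p=beta[z])              # w_n ~ Mult(. | beta_{z_n})
```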

Introduction to Topic Models
PLSA: learning using EM
Not a complete generative model: has a distribution μ over the training set of documents, so no new document can be generated!
Nevertheless, more realistic than the mixture model: documents can discuss multiple topics!

Introduction to Topic Models PLSA topics (TDT-1 corpus)

Introduction to Topic Models

Introduction to Topic Models
Latent Dirichlet Allocation
For each document d = 1,…,M
  Generate θd ~ Dir(· | α)
  For each position n = 1,…,Nd
    generate zn ~ Mult(· | θd)
    generate wn ~ Mult(· | βzn)
[Plate diagram: α → θ → z → w, over N positions and M documents]
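LDA's generative story differs from PLSA only in drawing θd from a Dirichlet prior; a sketch with assumed values for α, the topics, and the vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([0.5, 0.5, 0.5])          # Dirichlet hyperparameter (K = 3 topics, assumed)
beta = rng.dirichlet(np.ones(10), size=3)  # per-topic word distributions over 10 words

def generate_document(n_words):
    theta = rng.dirichlet(alpha)                              # theta_d ~ Dir(alpha)
    z = rng.choice(len(alpha), size=n_words, p=theta)         # z_n ~ Mult(. | theta_d)
    w = np.array([rng.choice(10, p=beta[zn]) for zn in z])    # w_n ~ Mult(. | beta_{z_n})
    return theta, z, w

theta, z, w = generate_document(50)
```

Unlike PLSA, nothing here indexes into a fixed training set, so the same code generates arbitrarily many new documents.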

Introduction to Topic Models
Latent Dirichlet Allocation
Overcomes the issues with PLSA: can generate any random document
Parameter learning:
Variational EM: numerical approximation using lower bounds; results in biased solutions; convergence has numerical guarantees
Gibbs sampling: stochastic simulation; unbiased solutions; stochastic convergence

Introduction to Topic Models
Variational EM for LDA
Approximate the posterior by a simpler factorized distribution
The resulting lower bound is a convex function in each parameter!

Introduction to Topic Models
Gibbs sampling
Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
The sequence of samples comprises a Markov chain
The stationary distribution of the chain is the joint distribution
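For LDA these conditionals have a simple closed form, giving the collapsed Gibbs sampler. The sketch below uses an assumed toy corpus, K = 2 topics, and symmetric hyperparameters α and η; the symmetric-Dirichlet conditional it samples from is the standard collapsed update, not a formula taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
docs = [[0, 0, 1, 2], [2, 3, 3, 1], [0, 1, 3, 3]]   # token ids per document (assumed)
V, K, alpha, eta = 4, 2, 0.5, 0.5

# Count tables, updated incrementally as assignments change
ndk = np.zeros((len(docs), K))   # topic counts per document
nkw = np.zeros((K, V))           # word counts per topic
nk = np.zeros(K)                 # total tokens per topic
z = [[rng.integers(K) for _ in d] for d in docs]    # random initial assignments
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(100):             # Gibbs sweeps over every token
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]          # remove the current assignment from the counts
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Conditional p(z_n = k | all other z) in the collapsed model
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```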

Introduction to Topic Models LDA topics

Introduction to Topic Models LDA’s view of a document

Introduction to Topic Models
Perplexity comparison of various models (lower is better)
[Figure: perplexity curves for the Unigram, Mixture, PLSA, and LDA models; LDA achieves the lowest perplexity]

Introduction to Topic Models
Summary
Generative models for exchangeable data
Unsupervised: automatically discover topics
Well-developed approximate techniques available for inference and learning

Outline
Part I: Introduction to Topic Models
Naïve Bayes model, Mixture Models, Expectation Maximization, PLSA, LDA, Variational EM, Gibbs Sampling
Part II: Topic Models for Community Analysis
Citation modeling with PLSA, Citation modeling with LDA, Author-Topic Model, Author-Topic-Recipient Model, Modeling influence of citations, Mixed-Membership Stochastic Block Model

Hyperlink modeling using PLSA

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
Select document d ~ Mult(μ)
For each position n = 1,…,Nd
  generate zn ~ Mult(· | θd)
  generate wn ~ Mult(· | βzn)
For each citation j = 1,…,Ld
  generate zj ~ Mult(· | θd)
  generate cj ~ Mult(· | γzj)
[Plate diagram: topics generate both words (N-plate, β) and citations (L-plate, γ) for each of M documents]
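A sketch of the joint word/citation draw: the same topic mixture θd drives both a word distribution β and a citation distribution γ. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_d = np.array([0.7, 0.3])            # topic mixture of document d (assumed)
beta = np.array([[0.5, 0.4, 0.1],         # p(word | topic)
                 [0.1, 0.2, 0.7]])
gamma = np.array([[0.8, 0.2],             # p(cited document | topic)
                  [0.3, 0.7]])

z_w = rng.choice(2, p=theta_d)            # topic for a word position
w = rng.choice(3, p=beta[z_w])            # w ~ Mult(. | beta_z)
z_c = rng.choice(2, p=theta_d)            # topic for a citation position
c = rng.choice(2, p=gamma[z_c])           # c ~ Mult(. | gamma_z)
```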

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
PLSA likelihood: L = ∏d ∏n Σz p(z | d) p(wn | z)
New likelihood: multiply in the citation terms, L = ∏d [ ∏n Σz p(z | d) p(wn | z) ] · [ ∏j Σz p(z | d) p(cj | z) ]
Learning using EM

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
Heuristic: weight the two likelihood terms as α (content) and (1 − α) (hyperlinks)
0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
Experiments: text classification
Datasets:
WebKB: 6000 CS dept. web pages with hyperlinks; 6 classes: faculty, course, student, staff, etc.
Cora: 2000 machine learning abstracts with citations; 7 classes: sub-areas of machine learning
Methodology:
Learn the model on the complete data and obtain θd for each document
Test documents classified into the label of the nearest neighbor in the training set
Distance measured as cosine similarity in the θ space
Measure the performance as a function of α
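The nearest-neighbor step of this methodology can be sketched as cosine similarity in θ-space; the θ vectors and class labels below are hypothetical.

```python
import numpy as np

train_theta = np.array([[0.9, 0.1], [0.2, 0.8]])   # assumed training-document mixtures
train_labels = ["course", "faculty"]
test_theta = np.array([0.85, 0.15])                # assumed test-document mixture

def cosine(a, b):
    """Cosine similarity between two topic-mixture vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Label the test document with the class of its most similar training neighbor
sims = [cosine(test_theta, t) for t in train_theta]
pred = train_labels[int(np.argmax(sims))]
```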

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
Classification performance
[Figure: accuracy as a function of α, spanning the hyperlink-only and content-only extremes on both datasets]

Hyperlink modeling using LDA

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
For each document d = 1,…,M
  Generate θd ~ Dir(· | α)
  For each position n = 1,…,Nd
    generate zn ~ Mult(· | θd)
    generate wn ~ Mult(· | βzn)
  For each citation j = 1,…,Ld
    generate zj ~ Mult(· | θd)
    generate cj ~ Mult(· | γzj)
Learning using variational EM

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

Author-Topic Model for Scientific Literature

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
For each author a = 1,…,A
  Generate θa ~ Dir(· | α)
For each topic k = 1,…,K
  Generate φk ~ Dir(· | β)
For each document d = 1,…,M
  For each position n = 1,…,Nd
    Generate author x ~ Unif(· | ad)
    generate zn ~ Mult(· | θx)
    generate wn ~ Mult(· | φzn)
[Plate diagram: author set ad → x → z → w, with θ over A authors and φ over K topics]
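A sketch of the Author-Topic draw for a single word position: pick an author uniformly from the paper's author set, a topic from that author's mixture, then the word. The author set and the values of θ and φ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
authors_d = [0, 2]                          # authors of document d (assumed)
theta = np.array([[0.8, 0.2],               # per-author topic mixtures theta_a
                  [0.5, 0.5],
                  [0.1, 0.9]])
phi = np.array([[0.6, 0.3, 0.1],            # per-topic word distributions phi_k
                [0.1, 0.3, 0.6]])

x = authors_d[rng.integers(len(authors_d))]  # x ~ Unif(authors of d)
z = rng.choice(2, p=theta[x])                # z ~ Mult(. | theta_x)
w = rng.choice(3, p=phi[z])                  # w ~ Mult(. | phi_z)
```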

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
Learning: Gibbs sampling
[Plate diagram of the model, with author set ad, author x, topic z, word w, and parameters θ and φ]

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
Topic-Author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
Application 1: Author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
Application 2: Author entropy

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Learning: Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Datasets:
Enron email data: 23,488 messages between 147 users
McCallum's personal email: 23,488(?) messages with 128 authors

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Topic Visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]
Topic Visualization: McCallum’s data

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI'05]

Modeling Citation Influences

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Copycat model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Citation influence model

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Citation influence graph for LDA paper

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Words in LDA paper assigned to citations

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Performance evaluation
Data: 22 seed papers and 132 cited papers; users labeled citations on a scale of 1–4
Models considered:
Citation influence model
Copycat model
LDA-JS-divergence (symmetric divergence in topic space)
LDA-post
PageRank
TF-IDF
Evaluation measure: area under the ROC curve
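The evaluation measure can be sketched via the pairwise-ranking form of AUC: the probability that a randomly chosen positively-labeled citation is scored above a randomly chosen negative one. The scores and labels below are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: one of three positives is ranked below the single negative
score = auc([0.9, 0.7, 0.3, 0.2], [1, 1, 0, 1])
```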

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Results

Mixed-Membership Stochastic Block Models [Work in Progress]
A complete generative model for text and citations
Can model the topicality of citations
Topic-specific PageRank
Can also predict citations between unseen documents

Summary
Topic modeling is an interesting new framework for community analysis
Sound theoretical basis; completely unsupervised
Simultaneous modeling of multiple fields
Discovers “soft” communities and clusters in terms of “topic” membership
Can also be used for predictive purposes