Web-Mining Agents Topic Analysis: pLSI and LDA

Web-Mining Agents Topic Analysis: pLSI and LDA Tanya Braun Universität zu Lübeck Institut für Informationssysteme

Recap: Agents
Task/goal: Information retrieval
Environment: Documents
Means: Vector space or probability based retrieval; dimension reduction (vector model); topic models (probability model)
Today: Topic models - Probabilistic LSI (pLSI), Latent Dirichlet Allocation (LDA)
Soon: What agents can take with them

Acknowledgements Pilfered from: Ramesh M. Nallapati Machine Learning applied to Natural Language Processing Thomas J. Watson Research Center, Yorktown Heights, NY USA from his presentation on Generative Topic Models for Community Analysis

Objectives
Cultural literacy for ML: Q: What are "topic models"?
A1: a popular indoor sport for machine learning researchers
A2: a particular way of applying unsupervised learning of Bayes nets to text
Topic models: statistical methods that analyze the words of the original texts to discover the themes that run through them (topics), how those themes are connected to each other, and how they change over time.

Introduction to Topic Models
Multinomial Naïve Bayes:
For each document d = 1, ..., M:
  Generate C_d ~ Mult(· | π)
  For each position n = 1, ..., N_d:
    Generate w_n ~ Mult(· | β, C_d)
[Plate diagram: class variable C and words W_1, ..., W_N, repeated over M documents, with parameters π (class prior) and β (per-class word distributions)]
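As a concrete illustration of this generative story, here is a minimal Python sketch that samples documents from a multinomial Naïve Bayes model. The toy vocabulary, class names, and parameter values are assumptions made up for the example, not taken from the slides.

```python
# Sample documents from a multinomial Naive Bayes generative model (toy example).
import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "game", "vote", "election"]       # toy vocabulary
classes = ["sports", "politics"]

pi = np.array([0.5, 0.5])                          # class prior, C_d ~ Mult(. | pi)
beta = np.array([[0.45, 0.45, 0.05, 0.05],         # p(word | class = sports)
                 [0.05, 0.05, 0.45, 0.45]])        # p(word | class = politics)

M, N = 3, 6                                        # documents, words per document
for d in range(M):
    c = rng.choice(len(classes), p=pi)             # C_d ~ Mult(. | pi)
    words = rng.choice(vocab, size=N, p=beta[c])   # w_n ~ Mult(. | beta, C_d)
    print(classes[c], list(words))
```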

Introduction to Topic Models
Naïve Bayes Model: compact representation
[Plate diagrams: the explicit model with class C and words W_1, ..., W_N per document, and the compact plate representation with a single word node W inside a plate of size N, nested in a plate of size M; parameters π and β]

Introduction to Topic Models
Mixture model: unsupervised naïve Bayes model
Joint probability of words and classes: p(C_d, w_1, ..., w_Nd) = p(C_d | π) ∏_n p(w_n | β, C_d)
But classes are not visible, so we marginalize: p(w_1, ..., w_Nd) = Σ_z p(z | π) ∏_n p(w_n | β, z)
[Plate diagram: hidden class Z, word W in a plate of size N, nested in a plate of size M; parameters π and β]

Introduction to Topic Models
Mixture model: learning
The log-likelihood is not a convex function, so there is no guarantee of finding a global optimum.
Solution: Expectation Maximization (EM)
An iterative algorithm that finds a local optimum.
Guaranteed to maximize a lower bound on the log-likelihood of the observed data.

Introduction to Topic Models
Quick summary of EM: log is a concave function, so by Jensen's inequality log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2).
This gives a lower bound on the log-likelihood that decouples over the hidden variables; we optimize this lower bound w.r.t. each variable instead.
[Figure: plot of the concave log curve between x1 and x2, showing log(0.5·x1 + 0.5·x2) lying above the chord 0.5·log(x1) + 0.5·log(x2)]

Introduction to Topic Models
Mixture model: EM solution
E-step: P(z | d) ∝ P(z) ∏_w P(w | z)^n(d,w)  (posterior over the hidden class of each document d)
M-step: P(z) ∝ Σ_d P(z | d),  P(w | z) ∝ Σ_d n(d, w) P(z | d)  (re-estimate the parameters from expected counts)
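The following is a minimal numpy sketch of this EM procedure for the mixture-of-unigrams model, operating on a document-term count matrix. The function name, random initialization, smoothing constant, and iteration count are illustrative assumptions.

```python
# EM for a mixture of unigrams (unsupervised naive Bayes) on a count matrix.
import numpy as np

def mixture_em(counts, K, n_iter=50, seed=0):
    """counts: (M, V) matrix of word counts n(d, w); K: number of classes."""
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    pz = np.full(K, 1.0 / K)                         # p(z)
    pwz = rng.dirichlet(np.ones(V), size=K)          # p(w | z), shape (K, V)
    for _ in range(n_iter):
        # E-step: p(z | d) proportional to p(z) * prod_w p(w|z)^n(d,w), in log space
        log_resp = np.log(pz) + counts @ np.log(pwz).T        # (M, K)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate p(z) and p(w | z) from expected counts
        pz = resp.sum(axis=0) / M
        pwz = resp.T @ counts + 1e-12                          # (K, V)
        pwz /= pwz.sum(axis=1, keepdims=True)
    return pz, pwz, resp
```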

Mixture of Unigrams (traditional)
Mixture of Unigrams model (this is just Naïve Bayes): for each of M documents, choose a topic z, then choose N words by drawing each one independently from a multinomial conditioned on z.
In the Mixture of Unigrams model, we can only have one topic per document!
[Diagram: topic variable Z_i with words w_i1, ..., w_i4]

Probabilistic Latent Semantic Indexing (pLSI) Model
For each word of document d in the training set: choose a topic z according to a multinomial conditioned on the index d, then generate the word by drawing from a multinomial conditioned on z.
In pLSI, documents can have multiple topics.
[Diagram: document index d with topic variables z_d1, ..., z_d4 and words w_d1, ..., w_d4]

Introduction to Topic Models
[Table: example PLSA topics learned from the TDT-1 corpus]

Introduction to Topic Models
Probabilistic Latent Semantic Analysis model
Learning using EM.
Not a complete generative model: it only has a distribution over the documents of the training set (the document index d is part of the model), so no new document can be generated!
Nevertheless, more realistic than the mixture model: documents can discuss multiple topics!

LSI: Simplistic picture
The "dimensionality" of a corpus is the number of distinct topics represented in it.
If the term-document matrix A has a rank-k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus.
[Diagram: documents clustered around Topic 1, Topic 2, Topic 3]

Cutting the dimensions with the smallest singular values

LSI and pLSI
LSI: find the rank-k approximation A' that minimizes the Frobenius norm of A - A'.
Frobenius norm of A: ||A||_F = sqrt( Σ_i Σ_j |a_ij|² )
pLSI: defines its own objective function to maximize (the likelihood of the training data).
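To make the LSI side concrete, here is a small numpy sketch that forms the rank-k approximation by keeping the k largest singular values, which minimizes the Frobenius error of A - A_k (Eckart-Young theorem). The toy count matrix and function name are assumptions for illustration.

```python
# LSI as truncated SVD of a term-document matrix A.
import numpy as np

def lsi(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A
    doc_coords = np.diag(s[:k]) @ Vt[:k, :]        # documents in the k-dim latent space
    return A_k, doc_coords

A = np.random.default_rng(0).poisson(1.0, size=(8, 5)).astype(float)  # toy term-doc counts
A_2, docs_2d = lsi(A, k=2)
print(np.linalg.norm(A - A_2))                     # Frobenius error of the rank-2 approximation
```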

pLSI – a probabilistic approach
k = number of topics
V = vocabulary size
M = number of documents

pLSI
Assume a multinomial word distribution for each topic, and a distribution over topics z for each document.
Question: How to determine z?

Introduction to Topic Models Probabilistic Latent Semantic Analysis Model d d Select document d ~ Mult() For each position n = 1,, Nd generate zn ~ Mult( ∙ | d) generate wn ~ Mult( ∙ | zn)  Topic distribution z w N M 

Using EM
Likelihood: L = Σ_d Σ_w n(d, w) log Σ_z P(z | d) P(w | z)
E-step: P(z | d, w) ∝ P(z | d) P(w | z)
M-step: P(w | z) ∝ Σ_d n(d, w) P(z | d, w);  P(z | d) ∝ Σ_w n(d, w) P(z | d, w)
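A minimal numpy sketch of this pLSI EM procedure on a document-term count matrix n(d, w); the random initialization, smoothing constant, and all names are illustrative assumptions, and the dense (M, K, V) posterior array is only practical for toy data.

```python
# EM for pLSI: alternate between p(z | d, w) and the model parameters.
import numpy as np

def plsi_em(counts, K, n_iter=100, seed=0):
    """counts: (M, V) matrix with n(d, w); returns p(z | d) and p(w | z)."""
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    p_z_d = rng.dirichlet(np.ones(K), size=M)          # p(z | d), shape (M, K)
    p_w_z = rng.dirichlet(np.ones(V), size=K)          # p(w | z), shape (K, V)
    for _ in range(n_iter):
        # E-step: p(z | d, w) proportional to p(z | d) * p(w | z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]   # (M, K, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from posterior-weighted counts
        nw = counts[:, None, :] * post                 # (M, K, V)
        p_w_z = nw.sum(axis=0)                         # (K, V)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = nw.sum(axis=2)                         # (M, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```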

Relation with LSI
Difference:
LSI: minimizes the Frobenius (L2) norm, which corresponds to an additive Gaussian noise assumption on the counts.
pLSI: maximizes the log-likelihood of the training data, which corresponds to minimizing a cross-entropy / Kullback-Leibler divergence.

pLSI – a generative model
Inference: Markov Chain Monte Carlo, EM

Problems of pLSI
It is not a proper generative model for documents:
A document is generated from a mixture of topics, but the mixture weights are tied to the training documents.
The number of parameters grows linearly with the size of the corpus (each training document has its own topic mixture).
It is difficult to generate a new document.

Introduction to Topic Models
Latent Dirichlet Allocation (LDA)
Overcomes the issues with PLSA: can generate any random document.
Parameter learning:
Variational EM: numerical approximation using lower bounds; results in biased solutions; convergence has numerical guarantees.
Gibbs sampling: stochastic simulation; unbiased solutions; stochastic convergence.
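Of the two learning options, collapsed Gibbs sampling is the easier one to sketch; below is a minimal collapsed Gibbs sampler for LDA in the spirit of Griffiths and Steyvers (2004), cited in the references. The input format (lists of word ids per document), the hyperparameter values, and all names are illustrative assumptions.

```python
# Collapsed Gibbs sampling for LDA: resample each word's topic given all others.
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of documents, each a list of integer word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))                 # topic counts per document
    nkw = np.zeros((K, V))                         # word counts per topic
    nk = np.zeros(K)                               # total words per topic
    z = []                                         # current topic assignment of each word
    for d, doc in enumerate(docs):                 # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove the word from its current topic
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z_i = k | rest) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                        # add the word back under the new topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw                                # smoothed/normalized, these give theta and beta
```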

Dirichlet Distributions
In the LDA model, we would like to say that the topic mixture proportions for each document are drawn from some distribution.
So, we want to put a distribution on multinomials, that is, on k-tuples of non-negative numbers that sum to one.
The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex, which is just a generalization of a triangle to (k-1) dimensions.
Criteria for selecting our prior:
It needs to be defined over the (k-1)-simplex.
Algebraically speaking, we would like it to play nicely with the multinomial distribution.

Dirichlet Distributions
Useful facts:
The Dirichlet distribution is defined over the (k-1)-simplex; that is, it takes k non-negative arguments which sum to one. Consequently, it is a natural distribution to use over multinomial distributions.
In fact, the Dirichlet distribution is the conjugate prior to the multinomial distribution. (This means that if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet!)
The Dirichlet parameter α_i can be thought of as a prior count of the i-th class.
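A small numpy illustration of the "prior count" intuition: draws from a Dirichlet with a larger α_i put more mass on component i, and larger overall α values give less variable multinomials. The parameter values are just illustrative.

```python
# Draws from Dirichlet distributions over the 2-simplex (3 components).
import numpy as np

rng = np.random.default_rng(0)
print(rng.dirichlet([1, 1, 1], size=3))        # uniform prior: spread over the simplex
print(rng.dirichlet([10, 1, 1], size=3))       # biased toward the first component
print(rng.dirichlet([100, 100, 100], size=3))  # concentrated near (1/3, 1/3, 1/3)
```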

The LDA Model For each document, Choose ~Dirichlet() z1 z2 z3 z4 z1 z2 z3 z4 z1 z2 z3 z4 w1 w2 w3 w4 w1 w2 w3 w4 w1 w2 w3 w4 For each document, Choose ~Dirichlet() For each of the N words wn: Choose a topic zn» Multinomial() Choose a word wn from p(wn|zn,), a multinomial probability conditioned on the topic zn. b

The LDA Model
For each document:
  Choose θ ~ Dirichlet(α)
  For each of the N words w_n:
    Choose a topic z_n ~ Multinomial(θ)
    Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
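A minimal Python sketch of this generative process; the values of α, the topic-word distributions β, the number of topics, and the document length are all toy assumptions for illustration.

```python
# Sample documents from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 2, 5, 8                              # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)       # p(w | z), one row per topic

for d in range(3):                             # generate 3 documents
    theta = rng.dirichlet(alpha)               # theta ~ Dirichlet(alpha)
    doc = []
    for n in range(N):
        z_n = rng.choice(K, p=theta)           # z_n ~ Multinomial(theta)
        w_n = rng.choice(V, p=beta[z_n])       # w_n ~ p(w | z_n, beta)
        doc.append(w_n)
    print(doc)
```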

LDA (Latent Dirichlet Allocation)
Document = mixture of topics (as in pLSI), but the mixture weights are drawn from a Dirichlet prior.
When we use a uniform Dirichlet prior, pLSI = LDA.
A word is also generated according to another variable, β: each topic z has its own word distribution p(w | z, β).

Variational Inference
In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.
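In the notation of Blei et al. (2003) (the γ, φ above), the simplified variational distribution factorizes over the hidden variables, and minimizing the KL divergence is equivalent to maximizing a lower bound on the log-likelihood; roughly:

```latex
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n),
\qquad
\log p(w \mid \alpha, \beta) \;\ge\; \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)]
```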

Use of LDA
A widely used topic model; complexity is an issue.
Use in IR: interpolate a topic model with a traditional language model (LM).
Improvements over the traditional LM, but no improvement over the relevance model (Wei and Croft, SIGIR 06).
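As a usage sketch (not the retrieval setup of Wei and Croft), fitting LDA on a toy corpus with scikit-learn could look like the following; the corpus, number of topics, and printed output are illustrative assumptions.

```python
# Fit LDA on a tiny toy corpus with scikit-learn and print top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the team won the game after the final whistle",
        "the election vote was close in several states",
        "voters followed the game between the two parties"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                        # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                  # per-document topic proportions
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):         # top words per topic
    top = comp.argsort()[-5:][::-1]
    print(k, [terms[i] for i in top])
```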

Use of LDA: Social Network Analysis
The "follow" relationship among users often looks unorganized and chaotic: follow relationships are created haphazardly by each individual user and are not controlled by a central entity.
Provide more structure to the follow relationship by "grouping" the users based on their topic interests, and by "labeling" each follow relationship with the identified topic group.

Use of LDA: Social Network Analysis

Perplexity In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample Perplexity of a random variable X may be defined as the perplexity of the distribution over its possible values x. In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts. [Wikipedia]
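For a held-out test collection, perplexity is computed as in Blei et al. (2003): the exponentiated negative average log-likelihood per word,

```latex
\mathrm{perplexity}(D_{\text{test}})
= \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```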

Introduction to Topic Models
[Figure: perplexity comparison of various models on held-out data (Unigram, Mixture of unigrams, PLSA, LDA); lower is better]

References
LSI:
Improving Information Retrieval with Latent Semantic Indexing. Deerwester, S., et al. Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.
Using Linear Algebra for Intelligent Information Retrieval. Michael W. Berry, Susan T. Dumais and Gavin W. O'Brien. UT-CS-94-270, 1994.
pLSI:
Probabilistic Latent Semantic Indexing. Thomas Hofmann. Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
LDA:
Latent Dirichlet Allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003.
Finding Scientific Topics. T. Griffiths and M. Steyvers. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235, 2004.
Hierarchical Topic Models and the Nested Chinese Restaurant Process. D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. In S. Thrun, L. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems (NIPS) 16, Cambridge, MA, 2004. MIT Press.
LDA and Social Network Analysis:
Social-Network Analysis Using Topic Models. Youngchul Cha and Junghoo Cho. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12), 2012.
Also see the Wikipedia articles on LSI, pLSI and LDA.