Latent Dirichlet Allocation (LDA): A review of topic modeling and an application to customer interactions (3/11/2015)
Agenda
1. What is topic modeling? Intro to text mining & pre-processing; natural language processing & topics
2. Introduction to Latent Dirichlet Allocation (LDA): LDA graphical model, the Dirichlet distribution, the generative process, Gibbs sampling, maximum likelihood estimates
3. Application - Customer Incident Routing
4. Demo in R
5. Wrap up
6. Questions
Quick Text Mining Introduction
What is topic modeling? Intro to Text Mining & Pre-Processing
A typical text-mining pipeline: unstructured text (the initial text corpus in natural language) → text data preparation (text and natural language processing; words and grammar parsing) → structured text (a text corpus ready for analysis, with metadata, indices, and a term-document matrix) → text analytics (analyze the structured text).
Source: Adapted from Miller (2005)
What is topic modeling? Text mining and other terms
Corpus: a large and structured set of texts
Stop words: words which are filtered out before or after processing of natural language data (text)
Unstructured text: information that either does not have a pre-defined data model or is not organized in a pre-defined manner
Tokenizing: the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens (see also lexical analysis)
Natural language processing: a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages
Term-document (or document-term) matrix: a mathematical matrix that describes the frequency of terms that occur in a collection of documents
Supervised learning: the machine learning task of inferring a function from labeled training data
Unsupervised learning: the task of finding hidden structure in unlabeled data
Stemming: the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form
Source: Wikipedia
What is topic modeling? Document & information retrieval
The idea is to take unstructured text and index it in a way that lets us tie the structured analytics back to the core information, so we can move, sort, search, process, and categorize by document.
Common IR goals: ad-hoc retrieval, filtering/sorting, browsing
Source: http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
What is topic modeling? Pre-Processing for Topic Modeling
packages <- c('tm', 'NLP', 'SnowballC', 'openNLP', 'openNLPmodels.en', 'RWeka')
The input data for topic models is a document-term matrix. The rows of this matrix correspond to documents and the columns to terms; the number of rows equals the size of the corpus and the number of columns equals the size of the vocabulary.
Mapping a document to its term-frequency vector involves tokenizing the document and then processing the tokens, for example by converting them to lower case, removing punctuation, removing numbers, stemming, removing stop words, and omitting terms shorter than a certain minimum length.
For each term in the collection's vocabulary, an index records the documents in which the term appears (inverted indices or lists).
Steps: build the base corpus; clean the corpus (lower case; remove numbers, punctuation, etc.; stemming; remove stop words); generate the term-document matrix (tokenize, apply word-length and sparsity constraints, etc.). A minimal R sketch of these steps follows below.
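A minimal sketch of this pre-processing pipeline using the tm and SnowballC packages; the toy docs vector is a hypothetical example, and the exact cleaning steps and constraints should be adjusted to the corpus at hand.

library(tm)        # text mining framework: corpus handling and DTM construction
library(SnowballC) # Porter stemmer used by stemDocument()

# Toy corpus (hypothetical example documents)
docs <- c("Topic models find hidden thematic structure in documents.",
          "The 2nd document mentions customers, incidents, and routing!",
          "Routing incidents to the right team saves customers time.")

corpus <- VCorpus(VectorSource(docs))

# Clean the corpus: lower case, remove numbers/punctuation/stop words, stem
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Generate the document-term matrix, keeping only terms of length >= 3
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(3, Inf)))

# Optionally drop very sparse terms before modeling
dtm <- removeSparseTerms(dtm, sparse = 0.99)
inspect(dtm)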
Topic Modeling
Introduction to Latent Dirichlet Allocation (LDA): Probabilistic modeling
1. Treat data as observations that arise from a generative probabilistic process that includes hidden variables: for documents, the hidden variables reflect the thematic structure of the collection.
2. Infer the hidden structure using posterior inference: what are the topics that describe this collection?
3. Situate new data into the estimated model: how does this query or new document fit into the estimated topic structure?
Introduction to Latent Dirichlet Allocation (LDA): Generative Model & The Posterior Distribution
Each document is a random mixture of corpus-wide topics, and each word is drawn from one of those topics. This assumes the topics exist outside of the document collection. Each topic is a distribution over a fixed vocabulary.
Generative process (per document):
First, choose a distribution over topics (drawn from a Dirichlet distribution; in the standard illustration, the yellow, pink, green, and blue topics each receive some probability)
Then, for each word position, draw a topic assignment (a color) from that distribution
Next, look up the topic (distribution over terms) that the assignment points to
Finally, draw the word from that topic's distribution
A small R simulation of this process is sketched below.
Posterior distribution: the conditional distribution of all latent variables given the observations, which in this case are the words of the documents. We only observe the documents, so the goal is to infer the underlying topic structure given the documents being observed:
What are the topics generated under these assumptions?
What are the distributions over terms that define these topics?
For each document, what is the distribution over topics associated with that document?
For each word, which topic generated it?
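As a purely illustrative sketch (not part of the original deck), the generative process can be simulated in a few lines of base R; the vocabulary, K, alpha, and eta values below are arbitrary assumptions.

set.seed(1)

vocab <- c("customer", "incident", "router", "outage", "invoice",
           "payment", "refund", "ticket", "network", "billing")  # toy vocabulary
K     <- 3      # number of topics
D     <- 5      # number of documents
N     <- 20     # words per document (fixed here for simplicity)
alpha <- 0.5    # Dirichlet hyperparameter for per-document topic proportions
eta   <- 0.1    # Dirichlet hyperparameter for topics (distributions over terms)

# Draw from a Dirichlet by normalizing independent Gamma draws
rdirichlet <- function(alpha_vec) {
  g <- rgamma(length(alpha_vec), shape = alpha_vec, rate = 1)
  g / sum(g)
}

# 1. Draw K topics, each a distribution over the vocabulary
beta <- t(replicate(K, rdirichlet(rep(eta, length(vocab)))))

# 2. For each document: draw topic proportions, then draw each word
docs <- lapply(seq_len(D), function(d) {
  theta <- rdirichlet(rep(alpha, K))                    # per-document topic proportions
  z     <- sample(K, N, replace = TRUE, prob = theta)   # per-word topic assignments
  w     <- sapply(z, function(k) sample(vocab, 1, prob = beta[k, ]))  # draw each word
  paste(w, collapse = " ")
})

docs[[1]]  # one simulated "document"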
Introduction to Latent Dirichlet Allocation (LDA): What is LDA?
What is Latent Dirichlet Allocation (LDA)? A generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics. Each observed word originates from a topic that we do not directly observe. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
What is it used for? The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables referred to as topics.
How is it related to text mining and other machine learning techniques? Topic models can be seen as classical text mining or natural language processing tools. They are usually fitted to data structures produced by text mining (such as document-term matrices) by considering the problem of modeling text corpora and other collections of discrete data. One advantage of LDA over related latent variable models is that it provides well-defined inference procedures for previously unseen documents (LSI, by contrast, uses a singular value decomposition).
Introduction to Latent Dirichlet Allocation (LDA): LDA Graphical Model (plate notation)
[Plate-notation example: a node Y with children X1, X2, ..., XN is drawn equivalently as Y pointing to a single node Xn inside a plate labeled N.]
[LDA graphical model: α → θd → Zd,n → Wd,n ← β ← η, with an outer plate D over documents, an inner plate N over the words of each document, and a plate K over topics.]
Legend: α = Dirichlet parameter; θd = per-document topic proportions (d indexes the document replication); Zd,n = per-word topic assignment; Wd,n = observed word (the nth word in the dth document); β = topics (each a distribution over terms); η = topics hyperparameter; D = documents, N = words, K = topics.
The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus.
Graphical model representation of LDA: the boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. Nodes are random variables; edges denote possible dependence; observed variables are shaded; plates denote replicated structure.
Introduction to Latent Dirichlet Allocation (LDA): Topic Matrix
[Figure: the topic matrix β is a K × V matrix; each of the K rows is a topic, i.e., a distribution over the V words in the vocabulary. Zd,n indexes a row (topic) and Wd,n indexes a column (word).]
Once we select Zd,n we know which topic the word comes from; we then look up the Wd,n entry in the Zd,n row of β and read off that word's probability. That is why the observed Wd,n depends on all of the Zd,n and the βs.
Probability of observing a word: $p(w_{d,n} \mid z_{d,n}, \beta_{1:K})$, where Wd,n is the observed word, Zd,n is an index from 1 to K, and β1:K are the topics.
Joint probability of all the hidden and observed variables under this model:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k \mid \eta)\; \prod_{d=1}^{D} p(\theta_d \mid \alpha)\; \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta_{1:K})$$
Each topic βk comes from a Dirichlet over the vocabulary and is drawn independently. For each document, the topic proportions θd are generated using α. Within each document, each word's topic assignment zd,n is drawn from θd, and the word wd,n is observed conditioned on zd,n and the βs.
Introduction to Latent Dirichlet Allocation (LDA): The Dirichlet Distribution
1. The Dirichlet distribution is an exponential family distribution over the simplex, i.e., positive vectors that sum to one.
2. The Dirichlet is conjugate to the multinomial: given a multinomial observation, the posterior distribution of θ is a Dirichlet.
3. The parameter α controls the mean shape and sparsity of θ; α is a k-vector with components αi > 0.
4. The topic proportions are a K-dimensional Dirichlet. The topics are a V-dimensional Dirichlet.
Introduction to Latent Dirichlet Allocation (LDA): Geometric Interpretation of LDA
[Figure: a 3-dimensional Dirichlet over the term simplex for terms A, B, C. The corners correspond to all mass on A, all mass on B, or all mass on C. Topics 1-3 are points on the term simplex; the topic simplex they span contains the documents Doc1-Doc4.]
As we draw random variables from θ, we get distributions over the 3 elements, i.e., points within the simplex. For example, θ ~ Dirichlet(1, 1, 1) (α1 = α2 = α3 = 1) gives the uniform distribution over the simplex. The Dirichlet places a distribution over this space and is parameterized by α; as α increases, the density becomes more peaked.
Introduction to Latent Dirichlet Allocation (LDA): Density Example
Two important pieces of information about a Dirichlet:
The expectation of the posterior (sometimes called m, for mean)
The sum of the alphas, which determines the peakiness of the Dirichlet (sometimes called s, for scaling): if this sum is small, the Dirichlet is more spread out; if it is large, the density peaks sharply at its expectation
[Figure: 2D view of Dirichlet densities for α > 1 and α < 1.]
When α < 1 (s < k), you get sparsity: on the 3-simplex the density has increased probability at the corners. A small R illustration of this effect follows below.
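A quick base-R illustration of the scaling effect; the rdirichlet_sym helper and the chosen alpha values are assumptions made for this sketch, using the standard Gamma-normalization construction of the Dirichlet.

set.seed(42)

# Helper: draw n samples from a symmetric K-dimensional Dirichlet(alpha)
# by normalizing independent Gamma(alpha, 1) draws.
rdirichlet_sym <- function(n, K, alpha) {
  g <- matrix(rgamma(n * K, shape = alpha, rate = 1), nrow = n, ncol = K)
  g / rowSums(g)
}

K <- 10

# alpha > 1: draws are spread out, mass is shared across components
spread <- rdirichlet_sym(5, K, alpha = 10)

# alpha < 1: draws are sparse, most mass piles onto a few components (the corners)
sparse <- rdirichlet_sym(5, K, alpha = 0.1)

round(spread, 2)  # each row sums to 1, values fairly even
round(sparse, 2)  # each row sums to 1, dominated by one or two entries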
Introduction to Latent Dirichlet Allocation (LDA): LDA Inferences
LDA puts posterior topical words together by maximizing the word probabilities while dividing the words among the topics.
In a mixture model, the joint distribution finds clusters of co-occurring words (words in the same topic).
In LDA, a document is penalized for having too many topics (via the hyperparameter). Loosely, this can be thought of as softening the strict definition of “co-occurrence” in a mixture model. This flexibility leads to sets of terms that more tightly co-occur: fewer words are assigned to each topic, so more mass is placed on a smaller set of words.
Likelihood term:
$$\prod_{d=1}^{D} p(\theta_d \mid \alpha)\; \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta_{1:K})$$
[Figure: the K × V topic matrix; each of the K topics is a distribution over the V words in the vocabulary.]
Introduction to Latent Dirichlet Allocation (LDA): Posterior distribution & model estimation for LDA
Approximate posterior inference methods:
1. Gibbs sampling
2. Variational methods (variational Bayesian inference & collapsed variational Bayesian inference)
3. Particle filtering
The Gibbs sampling algorithm is a typical Markov chain Monte Carlo (MCMC) method and was originally proposed for image restoration.
Define a Markov chain whose stationary distribution is the posterior of interest.
Collect independent samples from that stationary distribution; approximate the posterior with them.
The chain is run by iteratively sampling from the conditional distribution of each hidden variable given the observations and the current state of the other hidden variables.
Once a chain has “burned in,” collect samples at a lag to approximate the posterior.
Variational methods are a deterministic alternative to MCMC. For many interesting distributions, the marginal likelihood of the observations is difficult to compute efficiently; the goal is to optimize the variational parameters to make the bound on it as tight as possible.
Summary of the learning algorithm for Gibbs sampling:
Initialize the topic-to-word assignments z randomly from {1, ..., K}.
For each Gibbs sample: “For each word token, the count matrices n^-(a,b) are first decremented by one for the entries that correspond to the current topic assignment.” The count matrices are then updated by incrementing by one at the new topic assignment.
Discard samples during the initial burn-in period.
After the Markov chain has reached its stationary distribution, i.e., the posterior distribution over topic assignments, samples can be taken at a fixed lag (averaging over Gibbs samples is recommended for statistics that are invariant to the ordering of topics).
A minimal collapsed Gibbs sampler is sketched below.
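A compact sketch of a collapsed Gibbs sampler for LDA in base R, intended only to make the decrement/sample/increment loop above concrete; the toy docs list, hyperparameters, and iteration count are assumptions, and burn-in/lag handling is omitted.

# docs: list of integer vectors, each entry a word id in 1..V (toy example)
docs  <- list(c(1, 2, 3, 2), c(3, 4, 5, 4, 4), c(1, 5, 2, 1))
V     <- 5      # vocabulary size
K     <- 2      # number of topics
alpha <- 0.5    # document-topic hyperparameter
eta   <- 0.1    # topic-word hyperparameter
iters <- 200    # Gibbs sweeps (no burn-in/lag bookkeeping in this sketch)

set.seed(7)
D   <- length(docs)
z   <- lapply(docs, function(w) sample(K, length(w), replace = TRUE))  # random init
ndk <- matrix(0, D, K)   # document-topic counts
nkw <- matrix(0, K, V)   # topic-word counts
nk  <- rep(0, K)         # topic totals

for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
  k <- z[[d]][i]; w <- docs[[d]][i]
  ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
}

for (it in seq_len(iters)) {
  for (d in seq_len(D)) {
    for (i in seq_along(docs[[d]])) {
      w <- docs[[d]][i]; k <- z[[d]][i]
      # decrement counts for the current topic assignment
      ndk[d, k] <- ndk[d, k] - 1; nkw[k, w] <- nkw[k, w] - 1; nk[k] <- nk[k] - 1
      # full conditional for this token's topic given all other assignments
      p <- (ndk[d, ] + alpha) * (nkw[, w] + eta) / (nk + V * eta)
      k <- sample(K, 1, prob = p)
      # increment counts at the new topic assignment
      ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
      z[[d]][i] <- k
    }
  }
}

# Point estimates from the final state (last-iteration estimates)
theta <- (ndk + alpha) / rowSums(ndk + alpha)   # per-document topic proportions
beta  <- (nkw + eta)   / rowSums(nkw + eta)     # per-topic word distributions
round(beta, 2)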
Introduction to Latent Dirichlet Allocation (LDA): Maximum likelihood (ML) estimation
Empirical Bayes method for parameter estimation: given a corpus of documents, we want to find parameters α and β that maximize the marginal log-likelihood of the data.
[Example figure: estimated marginal log-likelihoods per number of topics (circles); average likelihoods are connected by lines.]
A short R sketch of comparing log-likelihoods across numbers of topics follows below.
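One way this comparison is often put into practice with the topicmodels package is sketched here; the dtm object is assumed to come from the earlier pre-processing step, the candidate k values are arbitrary, and a held-out set or cross-validation would be preferable in a real analysis.

library(topicmodels)

# dtm: a DocumentTermMatrix built as in the pre-processing step (assumed to exist)
ks <- c(5, 10, 20, 30, 40)

# Fit an LDA model for each candidate number of topics and record the log-likelihood
fits <- lapply(ks, function(k) LDA(dtm, k = k, control = list(seed = 2015)))
lls  <- sapply(fits, logLik)

# Plot log-likelihood against the number of topics, as in the example figure
plot(ks, lls, type = "b", xlab = "Number of topics", ylab = "Log-likelihood")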
Using R & Demo
R implementations of Latent Dirichlet Allocation (LDA): available packages through CRAN
topicmodels: Provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors, and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
lda: This package implements latent Dirichlet allocation (LDA) and related models, including (but not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel (MMSB). Inference for all of these models is implemented via a fast collapsed Gibbs sampler written in C. Its functions take sparsely represented input documents, perform inference, and return point estimates of the latent parameters using the state at the last iteration of Gibbs sampling. Utility functions for reading/writing data typically used in topic models, as well as tools for examining posterior distributions, are also included.
Also very interesting is the post “Finding structure in xkcd comics with Latent Dirichlet Allocation”: http://cpsievert.github.io/projects/615/xkcd/
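A brief usage sketch with the topicmodels package; the dtm object is assumed to exist from the pre-processing step, and k plus the Gibbs control settings are illustrative choices only.

library(topicmodels)

# Fit LDA with collapsed Gibbs sampling on the document-term matrix
k <- 10
lda_fit <- LDA(dtm, k = k, method = "Gibbs",
               control = list(seed = 2015, burnin = 1000, thin = 100, iter = 2000))

terms(lda_fit, 10)               # top 10 terms per topic
topics(lda_fit)[1:5]             # most likely topic for the first 5 documents
head(posterior(lda_fit)$topics)  # per-document topic proportions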
Application - Customer Incident Routing: Incident Routing Solution
Text Mining: What are the important drivers of incidents? Text mining of incident data to find keywords and themes.
Text Clustering: What are the categories of incidents? Text clustering to identify incident groups.
Text Classification: How do we derive insights and map incidents to known solutions? Text classification to tag each incident with a theme cluster.
Incident Tracking.
Incident Text Analytics Process: example incident text analytical process
[Process diagram: incidents from system monitoring and customer reports are stored, then passed through text pattern & cluster modeling (machine learning modeling combined with user-defined supervised parameters). Incidents are grouped by topic into clusters (Cluster 1-6) and mapped to known solutions (Solution 1, Solution 2).]
References
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3, 993-1022.
Blei, D. M. (2009). “Topic Models” lecture, September 1, 2009. http://videolectures.net/mlss09uk_blei_tm/
Ponweiser, M. (2012). “Latent Dirichlet Allocation in R.” Thesis 2, Institute for Statistics and Mathematics, http://statmath.wu.ac.at/
Grün, B., & Hornik, K. “topicmodels: An R Package for Fitting Topic Models.”
Witten, I. H. “Text Mining.” Computer Science, University of Waikato, Hamilton, New Zealand.
Wrap up & Questions