
CS246 Latent Dirichlet Analysis

LSI. LSI uses SVD to find the best rank-K approximation of the term-document matrix. The result is difficult to interpret, especially with negative numbers in the factors. Q: Can we develop a more interpretable method?

Theory of LDA (Model-based Approach). Develop a simplified model of how users write a document based on topics. Fit the model to the existing corpus and “reverse engineer” the topics used in a document. Q: How do we write a document? A: (1) Pick the topic(s); (2) start writing on the topic(s) with related terms.

Two Probability Vectors. For every document d, we assume that the user first picks the topics to write about. P(z|d): the probability of picking topic z when the user writes each word in document d; the document-topic vector of d. We also assume that every topic is associated with each term with a certain probability. P(w|z): the probability of picking the term w when the user writes on topic z; the topic-term vector of z.

Probabilistic Topic Model. There exist T topics. The topic-term vector of each topic is set before any document is written: P(w_j|z_i) is set for every z_i and w_j. Then for every document d, the user decides the topics to write on, i.e., P(z_i|d). For each word in d, the user selects a topic z_i with probability P(z_i|d), then selects a word w_j with probability P(w_j|z_i).

Probabilistic Document Model. [Figure: two topic-term distributions P(w|z), a “money” topic (money, bank, loan) and a “river” topic (river, stream, bank), generate three documents through their document-topic mixtures P(z|d); DOC 1 draws only from the money topic, DOC 2 only from the river topic, and DOC 3 from a mixture, with each word tagged by the topic (1 or 2) that generated it.]

Example: Calculating Probability. z_1 = {w_1: 0.8, w_2: 0.1, w_3: 0.1}, z_2 = {w_1: 0.1, w_2: 0.2, w_3: 0.7}. d’s topics are {z_1: 0.9, z_2: 0.1}, and d contains the terms {w_3 twice, w_1 once, w_2 once}. Q: What is the probability that a user will write such a document?
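A quick sanity check of this example, as a minimal sketch in Python (assuming each word is drawn independently and ignoring the multinomial ordering coefficient):

```python
# Probability of generating d = {w3 x2, w1 x1, w2 x1} under the topic model,
# assuming words are drawn i.i.d. and ignoring the ordering coefficient.
p_w_given_z = {
    "z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
    "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7},
}
p_z_given_d = {"z1": 0.9, "z2": 0.1}
doc = ["w3", "w3", "w1", "w2"]

def p_word(w):
    # P(w|d) = sum_z P(z|d) * P(w|z)
    return sum(p_z_given_d[z] * p_w_given_z[z][w] for z in p_z_given_d)

prob = 1.0
for w in doc:
    prob *= p_word(w)

print(prob)  # 0.16 * 0.16 * 0.73 * 0.11 ≈ 0.00206
```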

Corpus Generation Probability  T: # topics  D: # documents  M: # words per document  Probability of generating the corpus C
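A plausible form of the corpus generation probability, assuming each word w_ij of document d_i carries an explicit topic assignment z_ij (consistent with the pLSI slide below):

$$P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(z_{ij} \mid d_i)\, P(w_{ij} \mid z_{ij})$$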

Generative Model vs Inference (1). [Figure: the generative direction; the topic-term distributions P(w|z) and document-topic mixtures P(z|d) are given and are used to generate the three documents, with every word’s topic assignment (1 or 2) known.]

Generative Model vs Inference (2). [Figure: the inference direction; only the documents are observed, while P(w|z), P(z|d), and the topic assignment of every word are unknown (shown as “?”).]

Probabilistic Latent Semantic Indexing (pLSI). Basic idea: we pick the P(z_j|d_i), P(w_k|z_j), and z_ij values that maximize the corpus generation probability: maximum-likelihood estimation (MLE). More discussion later on how to compute the P(z_j|d_i), P(w_k|z_j), and z_ij values that maximize this probability.

Problem of pLSI. Q: Suppose 1M documents, 1000 topics, a vocabulary of 1M words, and M words per document. How much input data is there, and how many parameters do we have to estimate? Q: Too much freedom. How can we avoid the overfitting problem? A: Add constraints to reduce the degrees of freedom.
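A quick worked count under these numbers (M, the number of words per document, is left symbolic since it is not specified here):

$$\underbrace{D \cdot T}_{P(z_i|d)} \;+\; \underbrace{T \cdot W}_{P(w_j|z_i)} \;=\; 10^6 \cdot 10^3 + 10^3 \cdot 10^6 \;=\; 2 \times 10^9 \ \text{parameters},$$

plus one topic assignment z_ij per word, while the input corpus itself contains only D·M = 10^6·M word occurrences.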

Latent Dirichlet Analysis (LDA). When the term probabilities are selected for each topic, the topic-term probability vector (P(w_1|z_j), …, P(w_W|z_j)) is sampled randomly from a Dirichlet distribution. When users select topics for a document, the document-topic probability vector (P(z_1|d), …, P(z_T|d)) is sampled randomly from a Dirichlet distribution.

What is the Dirichlet Distribution? Multinomial distribution: given the probability p_i of each event e_i, what is the probability that each event e_i occurs α_i times after n trials? We assume the p_i’s are given; the distribution assigns probabilities to the counts α_i. Dirichlet distribution: the “inverse” of the multinomial distribution: we assume the counts α_i are given; the distribution assigns probabilities to the p_i’s.

Dirichlet Distribution. Q: Given α_1, α_2, …, α_k, what are the most likely p_1, p_2, …, p_k values?
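For reference (an addition, not from the slide): for the standard Dirichlet distribution with every α_i > 1, the most likely values are given by the mode,

$$p_i^{*} = \frac{\alpha_i - 1}{\sum_{j=1}^{k} \alpha_j - k}.$$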

Normalized Probability Vector and Simplex. Remember that p_i ≥ 0 and p_1 + … + p_n = 1. When (p_1, …, p_n) satisfies p_1 + … + p_n = 1, the points lie on a “simplex plane”. [Figure: (p_1, p_2, p_3) and their 2-simplex plane.]

Effect of α values. [Figures: Dirichlet densities/samples over the 2-simplex (p_1, p_2, p_3) for several different α settings.]

Minor Correction. The formula above is not the “standard” Dirichlet distribution; the “standard” Dirichlet distribution formula is given below. The non-standard form was used to make the connection to the multinomial distribution clear. From now on, we use the standard formula.
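For reference, the standard Dirichlet density over a probability vector (p_1, …, p_k) with parameters (α_1, …, α_k) is:

$$\mathrm{Dir}(p_1,\dots,p_k \mid \alpha_1,\dots,\alpha_k) \;=\; \frac{\Gamma\!\big(\sum_{i=1}^{k}\alpha_i\big)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \;\prod_{i=1}^{k} p_i^{\alpha_i - 1}$$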

Back to LDA Document Generation Model. For each topic z: pick the word probability vector, the P(w_j|z)’s, by taking a random sample from Dir(β_1, …, β_W). For every document d: the user decides its topic vector, the P(z_i|d)’s, by taking a random sample from Dir(α_1, …, α_T); then for each word in d, the user selects a topic z with probability P(z|d) and a word w with probability P(w|z). Once all is said and done, we have: P(w_j|z), the topic-term vector for each topic; P(z_i|d), the document-topic vector for each document; and the topic assignment of every word in each document.
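A minimal sketch of this generative process in Python with NumPy (the corpus sizes, vocabulary size, and hyperparameter values here are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

T, W, D, N = 2, 5, 16, 16    # topics, vocabulary size, documents, words per doc (assumed)
alpha, beta = 0.5, 0.1       # symmetric Dirichlet hyperparameters (assumed)

# For each topic z: topic-term vector P(w|z) sampled from Dir(beta, ..., beta)
phi = rng.dirichlet([beta] * W, size=T)          # shape (T, W)

docs = []
for d in range(D):
    # For each document d: document-topic vector P(z|d) sampled from Dir(alpha, ..., alpha)
    theta = rng.dirichlet([alpha] * T)           # shape (T,)
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)               # pick a topic with probability P(z|d)
        w = rng.choice(W, p=phi[z])              # pick a word with probability P(w|z)
        words.append(w)
    docs.append(words)

print(docs[0])  # one generated document, as a list of word ids
```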

Symmetric Dirichlet Distribution. In principle, we need to assume two vectors, (α_1, …, α_T) and (β_1, …, β_W), as input parameters. In practice, we often assume all α_i’s are equal to a single α and all β_i’s to a single β, i.e., we use two scalar values α and β rather than two vectors: the symmetric Dirichlet distribution. Q: What is the implication of this assumption?

Effect of α value on Symmetric Dirichlet. Q: What does it mean? How will the sampled document-topic vectors change as α grows? Common choice: α = 50/T, β = 200/W. [Figure: samples from the symmetric Dirichlet over the 2-simplex (p_1, p_2, p_3) for two different α values.]

Plate Notation. [Figure: LDA plate diagram with hyperparameters α and β; a plate over the M documents containing the topic mixture P(z|d), an inner plate over the N words of each document containing the topic z and the observed word w, and a plate over the T topics containing P(w|z).]

LDA as Topic Inference. Given a corpus d_1: w_11, w_12, …, w_1m; …; d_N: w_N1, w_N2, …, w_Nm, find the P(z|d), P(w|z), and z_ij that are most “consistent” with the given corpus. Q: How can we compute such P(z|d), P(w|z), and z_ij? The best method so far is to use a Monte Carlo method together with Gibbs sampling.

Monte Carlo Method (1). A class of methods that compute a number through repeated random sampling of certain event(s). Q: How can we compute π?

Monte Carlo Method (2). 1. Define the domain of possible events. 2. Generate events randomly from the domain using a certain probability distribution. 3. Perform a deterministic computation on the events. 4. Aggregate the results of the individual computations into the final result. Q: How can we take random samples from a particular distribution?
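A minimal sketch of these four steps, estimating π by sampling points in the unit square (the sample count is an arbitrary choice):

```python
import random

random.seed(0)
n = 1_000_000                                     # 1. domain: points (x, y) in the unit square
inside = 0
for _ in range(n):
    x, y = random.random(), random.random()       # 2. sample uniformly from the domain
    if x * x + y * y <= 1.0:                      # 3. deterministic check: inside the quarter circle?
        inside += 1
pi_estimate = 4 * inside / n                      # 4. aggregate: area ratio * 4 ≈ π
print(pi_estimate)                                # ≈ 3.14
```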

Gibbs Sampling. Q: How can we take a random sample x from a distribution f(x)? Q: How can we take a random sample (x, y) from a joint distribution f(x, y)? Gibbs sampling: given the current sample (x_1, …, x_n), pick an axis x_i and draw a new value for x_i from its conditional distribution given the current values of all the other coordinates. In practice, we iterate over the x_i’s sequentially.
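A minimal sketch of Gibbs sampling for a toy target, a standard bivariate normal with correlation ρ (the target distribution and its parameters are illustrative assumptions; each conditional of a bivariate normal is itself normal):

```python
import random
import math

random.seed(0)
rho = 0.8                  # correlation of the illustrative bivariate normal target
x, y = 0.0, 0.0            # arbitrary starting point
samples = []

for _ in range(10_000):
    # Sample x from f(x | y) = N(rho * y, 1 - rho^2)
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    # Sample y from f(y | x) = N(rho * x, 1 - rho^2)
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    samples.append((x, y))

# The empirical correlation of the samples should approach rho
n = len(samples)
mean_x = sum(s[0] for s in samples) / n
mean_y = sum(s[1] for s in samples) / n
cov = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / n
var_x = sum((s[0] - mean_x) ** 2 for s in samples) / n
var_y = sum((s[1] - mean_y) ** 2 for s in samples) / n
print(cov / math.sqrt(var_x * var_y))   # ≈ 0.8
```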

Markov-Chain Monte-Carlo Method (MCMC). Gibbs sampling belongs to the class of Markov-chain sampling methods: the next sample depends only on the current sample. Markov-Chain Monte-Carlo method: generate random events using Markov-chain sampling and apply the Monte Carlo method to compute the result.

Applying MCMC to LDA. Let us apply the Monte Carlo method to estimate the LDA parameters. Q: How can we map the LDA inference problem to random events? We first focus on identifying the topic z_ij for each word w_ij. Event: assignment of the topics z_ij to the w_ij’s; the assignment should be done according to P({z_ij}|C). Q: How do we sample according to P({z_ij}|C)? Q: Can we use Gibbs sampling? How will it work? Q: What is P(z_ij | {z_-ij}, C)?

n_wt: how many times the word w has been assigned to the topic t. n_dt: how many words in the document d have been assigned to the topic t. Q: What is the meaning of each term?
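For reference, the standard collapsed-Gibbs conditional used at this step (a well-known formula; the counts n_wt and n_dt are taken excluding the current assignment of w_ij) has the form:

$$P(z_{ij} = t \mid \{z_{-ij}\}, C) \;\propto\; \frac{n_{wt} + \beta}{\sum_{w'} n_{w't} + W\beta} \;\cdot\; \frac{n_{dt} + \alpha}{\sum_{t'} n_{dt'} + T\alpha}$$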

LDA with Gibbs Sampling. For each word w_ij: for the previous topic t of w_ij, decrease n_wt and n_dt by 1; assign w_ij to a new topic t with probability proportional to the formula above; for the new topic t, increase n_wt and n_dt by 1. Repeat the process many times (at least hundreds of iterations). Once the process is over, we have z_ij for every w_ij, together with the counts n_wt and n_dt.
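A minimal sketch of this collapsed Gibbs sampler in Python (the toy corpus, hyperparameter values, and iteration count are illustrative assumptions, not values from the slides):

```python
import random
from collections import defaultdict

random.seed(0)

T = 2                       # number of topics (assumed)
alpha, beta = 0.5, 0.1      # symmetric Dirichlet hyperparameters (assumed)
docs = [                    # toy corpus: documents as lists of word ids
    [0, 1, 2, 0, 1],        # e.g. "river stream bank river stream"
    [2, 3, 4, 3, 4],        # e.g. "bank money loan money loan"
    [0, 2, 3, 1, 4],        # a mixed document
]
W = 5                       # vocabulary size

# Counts: n_wt (word-topic), n_dt (document-topic), n_t (total words per topic)
n_wt, n_dt, n_t = defaultdict(int), defaultdict(int), defaultdict(int)
z = [[0] * len(doc) for doc in docs]

# Random initial topic assignment for every word
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = random.randrange(T)
        z[d][i] = t
        n_wt[(w, t)] += 1
        n_dt[(d, t)] += 1
        n_t[t] += 1

for _ in range(200):                              # repeat the sweep many times
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current assignment from the counts
            n_wt[(w, t_old)] -= 1
            n_dt[(d, t_old)] -= 1
            n_t[t_old] -= 1
            # Sample a new topic with probability proportional to the collapsed conditional
            weights = [
                (n_wt[(w, t)] + beta) / (n_t[t] + W * beta) * (n_dt[(d, t)] + alpha)
                for t in range(T)
            ]
            t_new = random.choices(range(T), weights=weights)[0]
            z[d][i] = t_new
            n_wt[(w, t_new)] += 1
            n_dt[(d, t_new)] += 1
            n_t[t_new] += 1

# Estimated topic-term vectors P(w|z) from the final counts
for t in range(T):
    print([round((n_wt[(w, t)] + beta) / (n_t[t] + W * beta), 2) for w in range(W)])
```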

Result of LDA (Latent Dirichlet Analysis)  TASA corpus  37,000 text passages from educational materials collected by Touchstone Applied Science Associates  Set T=300 (300 topics)

Inferred Topics

Word Topic Assignments

LDA Algorithm: Simulation. Two topics: River, Money. Five words: “river”, “stream”, “bank”, “money”, “loan”. Generate 16 documents by randomly mixing the two topics and using the LDA model. Model topic-term matrix P(w|z): River topic: river 1/3, stream 1/3, bank 1/3, money 0, loan 0; Money topic: river 0, stream 0, bank 1/3, money 1/3, loan 1/3.

Generated Documents and Initial Topic Assignment before Inference. [Figure: the 16 generated documents with their initial random topic assignments; white dot: “River”, black dot: “Money”.] The first 6 and the last 3 documents are purely from one topic; the others are mixtures.

Topic Assignment After LDA Inference. [Figure: topic assignments after 64 iterations.] The first 6 and the last 3 documents are purely from one topic; the others are mixtures.

Inferred Topic-Term Matrix. Model parameters P(w|z): River topic: river 0.33, stream 0.33, bank 0.33, money 0, loan 0; Money topic: river 0, stream 0, bank 0.33, money 0.33, loan 0.33. Estimated parameters: [table of inferred P(w|z) values]. Not perfect, but very close, especially given the small data size.

SVD vs LDA. Both perform the following decomposition: the [doc × term] matrix is factored as a [doc × topic] matrix times a [topic × term] matrix. SVD views this as matrix approximation. LDA views this as probabilistic inference based on a generative model, where each entry corresponds to a “probability”: better interpretability.

LDA as Soft Classification. Soft vs hard clustering/classification. After LDA, every document is assigned to a small number of topics with some weights; documents are not assigned exclusively to one topic. Soft clustering.

Summary. Probabilistic topic model: generative model of documents. pLSI and overfitting. LDA, MCMC, and probabilistic interpretation. Statistical parameter estimation: multinomial distribution and Dirichlet distribution. Monte Carlo method, Gibbs sampling, and the Markov-chain class of sampling methods.