Probabilistic Topic Model (Lecture for CS410 Intro Text Info Systems) April 11, 2007 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are adapted/taken from Qiaozhu Mei’s presentations
Outline Motivation – Contextual text mining General ideas of probabilistic topic models Probabilistic latent semantic analysis/indexing (PLSA/PLSI) Variants
Context Features of a Document (Weblog Article): author, source, location, time, author's occupation, communities. Compared with other kinds of data, Weblogs have some interesting special characteristics, which make them interesting to exploit for text mining.
What is a Context? Any context feature (metadata), or combination of features, of documents can define a context. Contexts can overlap. The choice of contexts depends on the needs of the mining task. Each context corresponds to a supporting document set. Examples: papers written by Andrew McCallum; papers written in 1998; papers published in WWW, SIGIR, ACL, KDD, or SIGMOD; papers written by Andrew McCallum in 1998.
Text Collections with Context Information News articles: time, publisher Weblogs: time, location, author, age-group, etc. Scientific literature: author, publication time, conference, citations, etc. Customer reviews: product, source, time Emails: sender, receiver, time, thread Query logs: time, IP address, clickthrough Webpages: domain, time, etc. ……
Application Questions (Contextual Text Mining) What were people interested in about language models before and after the year 2000? Which country responded first to the release of the iPod Nano: China, the UK, or Canada? Who cared more about gas prices during Hurricane Katrina: people in Illinois or Washington? What is the common interest of two researchers? Who is more application-oriented? Do people like Dell laptops more than IBM laptops? … … 1: temporal text mining, KDD05 2-3: spatiotemporal text mining, WWW06 4: author-topic analysis 5: comparative text mining, KDD04 6: …
General Idea of Probabilistic Topic Models Modeling a topic/subtopic/theme with a multinomial distribution (unigram LM) Modeling text data with a mixture model involving multinomial distributions A document is “generated” by sampling words from some multinomial distribution Each time, a word may be generated from a different distribution Many variations of how these multinomial distributions are mixed Topic mining = Fitting the probabilistic model to text Answer topic-related questions by computing various kinds of conditional probabilities based on the estimated model (e.g., p(time|topic), p(time|topic, location))
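The generative view above can be sketched in a few lines of code. This is a minimal illustration, not from the lecture: the vocabulary, the two topic distributions, and the mixing weights are all made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and two topic word distributions (unigram LMs).
vocab = ["price", "oil", "gas", "donate", "relief", "help"]
topics = np.array([
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],  # an "oil price" theme
    [0.03, 0.03, 0.04, 0.40, 0.30, 0.20],  # a "donation" theme
])
pi = np.array([0.7, 0.3])  # document-specific mixing weights over topics

def generate_doc(n_words):
    """Generate a document word by word: each word first picks a topic,
    then samples from that topic's multinomial distribution."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(pi), p=pi)             # topic for this word
        w = rng.choice(len(vocab), p=topics[z])   # word from that topic
        words.append(vocab[w])
    return words

doc = generate_doc(10)
```

Note that each word may come from a different topic distribution; the document-level weights `pi` are what topic models estimate from data.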
What is a Theme? A theme is a semantically coherent topic/subtopic, represented with a multinomial distribution of terms over the whole vocabulary, i.e., a unigram language model. E.g., Theme "Oil Price": price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, … Themes sit along a spectrum of semantic resolution (words, entities, concepts, topics/themes, patterns), and the number appearing in a single document varies accordingly: many words, several concepts, perhaps one overall topic.
The Usage of a Theme. Example (Hurricane Katrina text): [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. Each bracketed segment is dominated by a different theme: Theme 1: government 0.3, response 0.2, …; Theme 2: donate 0.1, relief 0.05, help 0.02, …; Theme k: city 0.2, new 0.1, orleans 0.05, …; Background B: is 0.05, the 0.04, a 0.03, … Usage of a theme: summarize topics/subtopics; navigate documents; retrieve documents; segment documents; all other tasks involving unigram language models.
Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99] Mix k multinomial distributions to generate a document Each document has a potentially different set of mixing weights which captures the topic coverage When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model) We may add a background distribution to “attract” background words
PLSA as a Mixture Model. "Generating" word w in doc d in the collection:
p(w|d) = λ_B p(w|θ_B) + (1 − λ_B) Σ_{j=1..k} π_{d,j} p(w|θ_j)
Example themes: Theme 1 (θ_1): warning 0.3, system 0.2, …; Theme 2 (θ_2): aid 0.1, donation 0.05, support 0.02, …; Theme k (θ_k): statistics 0.2, loss 0.1, dead 0.05, …; Background (θ_B): is 0.05, the 0.04, a 0.03, … Parameters: λ_B = noise level (manually set); the mixing weights π_{d,j} and theme distributions θ_j are estimated with Maximum Likelihood.
Parameter Estimation (EM). E-step: compute the probability that word w in doc d was generated from cluster j, or from the background, by applying Bayes' rule:
p(z_{d,w} = j) = π_{d,j} p(w|θ_j) / Σ_{j'} π_{d,j'} p(w|θ_{j'})
p(z_{d,w} = B) = λ_B p(w|θ_B) / (λ_B p(w|θ_B) + (1 − λ_B) Σ_j π_{d,j} p(w|θ_j))
M-step: re-estimate the parameters using the fractional counts. Mixing weights (fractional counts contributing to using cluster j in generating d):
π_{d,j} ∝ Σ_w c(w,d) (1 − p(z_{d,w} = B)) p(z_{d,w} = j)
Cluster LM (fractional counts of generating w from cluster j, summed over all docs, possibly in multiple collections):
p(w|θ_j) ∝ Σ_d c(w,d) (1 − p(z_{d,w} = B)) p(z_{d,w} = j)
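The EM updates on this slide can be sketched as a short NumPy routine. This is a minimal illustrative implementation, not code from the lecture; the function name `plsa_em` and all defaults are my own choices.

```python
import numpy as np

def plsa_em(counts, k, lambda_b=0.9, n_iter=50, seed=0):
    """EM for PLSA with a fixed background model.

    counts: (n_docs, n_words) term-frequency matrix c(w, d).
    Returns (pi, theta): per-doc mixing weights (n_docs, k) and
    theme word distributions (k, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Background LM: collection word frequencies (fixed, never re-estimated).
    theta_b = counts.sum(axis=0) / counts.sum()
    # Random initialization of theme distributions and mixing weights.
    theta = rng.random((k, n_words)); theta /= theta.sum(axis=1, keepdims=True)
    pi = rng.random((n_docs, k)); pi /= pi.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: probability that word w in doc d came from the background.
        mix = pi @ theta  # (n_docs, n_words): sum_j pi_{d,j} p(w|theta_j)
        p_b = lambda_b * theta_b / (lambda_b * theta_b
                                    + (1 - lambda_b) * mix + 1e-12)
        # p(z = j | d, w), normalized over the k themes.
        post = pi[:, :, None] * theta[None, :, :]  # (n_docs, k, n_words)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # Fractional counts attributed to themes (background mass removed).
        frac = counts * (1 - p_b)
        # M-step: re-estimate mixing weights and theme word distributions.
        pi = (post * frac[:, None, :]).sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True) + 1e-12
        theta = (post * frac[:, None, :]).sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
    return pi, theta
```

On a toy corpus with two disjoint word groups, the two recovered themes separate the groups and each document's weights concentrate on its own theme.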
Use the Model for Text Mining. Term clustering: each θ_j naturally serves as a word cluster; the coverage of cluster j in document d is π_{d,j}. Document clustering: we may use a Naive Bayes classifier, or use π_{d,j} to decide which cluster d should be in. Contextual text mining: make the coverage π_{d,j} conditioned on context, e.g., estimate p(θ_j | time), from which we can compute/plot p(time | θ_j).
Adding Prior. Same mixture model as before: "generating" word w in doc d via p(w|d) = λ_B p(w|θ_B) + (1 − λ_B) Σ_j π_{d,j} p(w|θ_j), with the same example themes (Theme 1: warning 0.3, system 0.2, …; Theme 2: aid 0.1, donation 0.05, support 0.02, …; Theme k: statistics 0.2, loss 0.1, dead 0.05, …; Background: is 0.05, the 0.04, a 0.03, …), but now we place a prior on each theme distribution θ_j encoding its most likely words. Parameters: λ_B = noise level (manually set); the π's and θ's are estimated with Maximum A Posteriori estimation instead of plain Maximum Likelihood.
Maximum A Posteriori (MAP) Estimation. With a conjugate prior θ'_j on each theme, the M-step update adds pseudo counts of w from the prior:
p(w|θ_j) = (Σ_d c(w,d) (1 − p(z_{d,w}=B)) p(z_{d,w}=j) + μ p(w|θ'_j)) / (Σ_{w'} Σ_d c(w',d) (1 − p(z_{d,w'}=B)) p(z_{d,w'}=j) + μ)
where μ is the sum of all pseudo counts. What if μ = 0? (MAP reduces to Maximum Likelihood.) What if μ = +∞? (θ_j is fixed to the prior θ'_j.)
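The MAP M-step is a one-line change to the ML update: add μ pseudo counts distributed according to the prior before normalizing. A minimal sketch (the function name `map_update_theta` is my own; it assumes the expected counts from the E-step are already computed):

```python
import numpy as np

def map_update_theta(frac_counts, prior_theta, mu):
    """MAP M-step for theme word distributions with a conjugate prior.

    frac_counts: (k, n_words) expected counts of word w assigned to theme j,
                 produced by the E-step.
    prior_theta: (k, n_words) prior word distributions theta'_j.
    mu:          prior strength = total pseudo counts added per theme.
    """
    numer = frac_counts + mu * prior_theta           # real + pseudo counts
    return numer / numer.sum(axis=1, keepdims=True)  # mu=0: ML; mu->inf: prior
```

Setting μ = 0 recovers the Maximum Likelihood update; a very large μ pins each θ_j to its prior, which answers the two questions on the slide.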
Variants: different contexts; further regularization of the mixing weights, as in Latent Dirichlet Allocation (LDA) [Blei et al. 03], which places a Dirichlet prior on the mixing weights; parameter estimation is then more complex.
Comparative Text Mining (CTM) [Zhai et al. 04]. Problem definition: given a comparable set of text collections (a pool of collections C1, C2, …, Ck), discover & analyze their common and unique properties: common themes, plus C1-specific, C2-specific, …, Ck-specific themes.
Example: Summarize Customer Reviews. Ideal results from comparative text mining of IBM, APPLE, and DELL laptop reviews:
Battery life — common theme; IBM: long, 4-3 hrs; APPLE: medium, 3-2 hrs; DELL: short, 2-1 hrs.
Hard disk — common theme; IBM: large, 80-100 GB; APPLE: small, 5-10 GB; DELL: medium, 20-50 GB.
Speed — common theme; IBM: slow, 100-200 MHz; APPLE: very fast, 3-4 GHz; DELL: moderate, 1-2 GHz.
A More Realistic Setup of CTM. Instead of clean labels, each common theme comes with a common word distribution and collection-specific word distributions:
Battery theme — common: battery 0.129, hours 0.080, life 0.060, …; IBM: long 0.120, 4hours 0.010, 3hours 0.008, …; APPLE: reasonable 0.10, medium 0.08, 2hours 0.002, …; DELL: short 0.05, poor 0.01, 1hours 0.005, …
Disk theme — common: disk 0.015, IDE 0.010, drive 0.005, …; IBM: large 0.100, 80GB 0.050, …; APPLE: small 0.050, 5GB 0.030, …; DELL: medium 0.123, 20GB 0.080, …
Speed theme — common: pentium 0.113, processor 0.050, …; IBM: slow 0.114, 200Mhz 0.080, …; APPLE: fast 0.151, 3Ghz 0.100, …; DELL: moderate 0.116, 1Ghz 0.070, …
Cross-Collection Mixture Models. Explicitly distinguish common and specific themes; fit a mixture model to the text data; estimate parameters using EM. Clusters are designed to be meaningful. For each theme j = 1..k there is one common version θ_j and one collection-specific version θ_{j,i} for each collection Ci (i = 1..m), plus a shared background θ_B.
Details of the Mixture Model. "Generating" word w in doc d in collection Ci:
p(w | d, Ci) = λ_B p(w|θ_B) + (1 − λ_B) Σ_{j=1..k} π_{d,j} [ λ_C p(w|θ_j) + (1 − λ_C) p(w|θ_{j,i}) ]
The background θ_B accounts for noise (common non-informative words); λ_C trades off the common distribution θ_j against the collection-specific distribution θ_{j,i}. Parameters: λ_B = noise level (manually set); λ_C = common-specific tradeoff (manually set); the π's and θ's are estimated with Maximum Likelihood.
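The cross-collection word probability can be sketched directly from the mixture formula. A minimal illustration (the function name `ccmix_word_prob` and the array layout are my own assumptions, not from the lecture):

```python
import numpy as np

def ccmix_word_prob(w, d, i, pi, theta_common, theta_spec, theta_b,
                    lambda_b=0.9, lambda_c=0.5):
    """p(w | d, collection Ci) under the cross-collection mixture model.

    pi:           (n_docs, k)        document-specific theme weights
    theta_common: (k, n_words)       common theme distributions theta_j
    theta_spec:   (m, k, n_words)    collection-specific distributions theta_{j,i}
    theta_b:      (n_words,)         background distribution
    """
    # Per-theme word probability: blend common and collection-specific LMs.
    themes = (lambda_c * theta_common[:, w]
              + (1 - lambda_c) * theta_spec[i, :, w])   # (k,)
    # Mix the document's themes, then blend in the background.
    return lambda_b * theta_b[w] + (1 - lambda_b) * np.dot(pi[d], themes)
```

Because every component is a proper distribution and the weights sum to one, the probabilities over the vocabulary sum to 1 for any document and collection.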
Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles). The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars.
Cluster 1 — common: united 0.042, nations 0.04, …; Iraq-specific: iraq 0.03, weapons 0.024, inspections 0.023, …; Afghan-specific: northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …
Cluster 2 — common: killed 0.035, month 0.032, deaths 0.023, …; Iraq-specific: troops 0.016, hoon 0.015, sanches 0.012, …; Afghan-specific: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …
Comparing Laptop Reviews Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]) These word distributions can be used to segment text and add hyperlinks between documents
What You Should Know The basic idea of PLSA How to estimate parameters How to use the model to do clustering/mining