Probabilistic Topic Model


1 Probabilistic Topic Model
(Lecture for CS410: Intro to Text Info Systems), April 11, 2007
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign
Many slides are adapted/taken from Qiaozhu Mei's presentations

2 Outline
Motivation: contextual text mining
General ideas of probabilistic topic models
Probabilistic latent semantic analysis/indexing (PLSA/PLSI)
Variants

3 Context Features of a Document
Compared with other kinds of data, weblogs have some interesting special characteristics that make them attractive for text mining.
A weblog article comes with context features such as: author, author's occupation, communities, source, location, and time.

4 What is a Context?
Examples of contexts: papers written in 1998; papers written by Andrew McCallum; papers published at a particular venue (WWW, SIGIR, ACL, KDD, SIGMOD); papers from a particular year (1998, 1999, ..., 2005, 2006)
Any context feature (metadata), and any combination of features of documents, can define a context
Contexts can overlap
The choice of contexts depends on the needs of the mining task
Each context corresponds to a supporting set of documents

5 Text Collections with Context Information
News articles: time, publisher
Weblogs: time, location, author, age group, etc.
Scientific literature: author, publication time, conference, citations, etc.
Customer reviews: product, source, time
Emails: sender, receiver, time, thread
Query logs: time, IP address, clickthrough
Webpages: domain, time, etc.

6 Application Questions (Contextual Text Mining)
1. What were people interested in, with respect to language models, before and after the year 2000? (temporal text mining, KDD 05)
2. Which country responded first to the release of the iPod Nano: China, UK, or Canada? (spatiotemporal text mining, WWW 06)
3. Who cared more about gas prices during Hurricane Katrina: people in Illinois or Washington? (spatiotemporal text mining, WWW 06)
4. What is the common interest of two researchers? Who is more application oriented? (author-topic analysis)
5. Do people like Dell laptops more than IBM laptops? (comparative text mining, KDD 04)
6. ...

7 General Idea of Probabilistic Topic Models
Model a topic/subtopic/theme with a multinomial distribution (unigram LM)
Model text data with a mixture model involving multinomial distributions:
- A document is "generated" by sampling words from some multinomial distribution
- Each time, a word may be generated from a different distribution
- There are many variations of how these multinomial distributions are mixed
Topic mining = fitting the probabilistic model to text
Answer topic-related questions by computing various kinds of conditional probabilities based on the estimated model (e.g., p(time|topic), p(time|topic, location))
A minimal generative sketch of this idea follows below.
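To make the generative view concrete, here is a minimal sketch (not from the original slides) that samples a short document from a two-theme mixture of multinomials; the vocabulary, theme distributions, and mixing weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["price", "oil", "gas", "donate", "relief", "help"]

# Two theme word distributions, i.e., unigram language models (hypothetical numbers).
themes = np.array([
    [0.35, 0.30, 0.25, 0.04, 0.03, 0.03],   # an "oil price" theme
    [0.04, 0.03, 0.03, 0.35, 0.30, 0.25],   # a "donation/relief" theme
])

pi_d = np.array([0.7, 0.3])   # document-specific topic coverage

def generate_document(length=10):
    """Sample each word by first picking a theme, then a word from that theme."""
    words = []
    for _ in range(length):
        z = rng.choice(len(pi_d), p=pi_d)          # which theme generates this word
        w = rng.choice(len(vocab), p=themes[z])    # which word the theme emits
        words.append(vocab[w])
    return words

print(generate_document())
```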

8 What is a Theme?
A theme is a semantically coherent topic/subtopic
It is represented with a multinomial distribution of terms over the whole vocabulary, i.e., a unigram language model
Example theme "Oil Price": price, oil, gas, increase, product, fuel, company, ...
Semantic resolution ranges from coarse topics/themes through patterns and concepts down to entities and words; correspondingly, a document may contain one, several, or many themes

9 The Usage of a Theme
Example themes from Hurricane Katrina news text:
Theme 1: government 0.3, response, ...
Theme 2: donate 0.1, relief 0.05, help, ...
Theme k: city 0.2, new orleans, ...
Background B: is 0.05, the, a, ...
These themes tag different spans of an article, e.g.: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]."
Usage of a theme: summarize topics/subtopics; navigate documents; retrieve documents; segment documents; all other tasks involving unigram language models

10 Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
Mix k multinomial distributions to generate a document
Each document has a potentially different set of mixing weights, which captures its topic coverage
When generating the words of a document, each word may be generated using a DIFFERENT multinomial distribution (in contrast with the document clustering model, where, once a multinomial distribution is chosen, all the words in a document are generated from that same model)
We may add a background distribution to "attract" background words

11 PLSA as a Mixture Model
"Generating" word w in doc d in the collection: with probability λB the word is drawn from the background distribution B (e.g., is 0.05, the, a, ...); otherwise, theme j is picked with document-specific probability πd,j and w is drawn from the theme distribution θj (e.g., Theme 1: warning 0.3, system, ...; Theme 2: aid 0.1, donation 0.05, support, ...; Theme k: statistics 0.2, loss, dead, ...)
Equivalently, p(w|d) = λB p(w|B) + (1 - λB) Σj πd,j p(w|θj)
Parameters: λB = noise level (manually set); the π's and θ's are estimated with maximum likelihood
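The mixture probability above translates directly into code. Here is a small sketch (my notation, not from the lecture) that scores a document under the PLSA-with-background model given its parameters.

```python
import numpy as np

def doc_log_likelihood(word_counts, pi_d, themes, background, lambda_B):
    """
    Log-likelihood of one document under the PLSA mixture with a background model.

    word_counts: (V,) counts of each vocabulary word in document d
    pi_d:        (k,) topic coverage pi_{d,j} for d, sums to 1
    themes:      (k, V) theme word distributions p(w | theta_j)
    background:  (V,) background word distribution p(w | B)
    lambda_B:    probability of drawing a word from the background
    """
    # p(w|d) = lambda_B * p(w|B) + (1 - lambda_B) * sum_j pi_{d,j} * p(w|theta_j)
    p_w = lambda_B * background + (1 - lambda_B) * pi_d @ themes
    return np.sum(word_counts * np.log(p_w + 1e-12))
```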

12 Parameter Estimation (EM)
E-step: for word w in doc d, compute the probability that it was generated from cluster (theme) j versus from the background; this is an application of Bayes' rule, and yields fractional counts contributing to (a) using cluster j in generating d and (b) generating w from cluster j
M-step: re-estimate the mixing weights πd,j and the cluster LMs θj from these fractional counts, summing over all docs (in multiple collections)
One possible implementation is sketched below.
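A minimal NumPy sketch of this EM procedure for a single collection, under my own assumptions: `counts` is a document-term count matrix (D x V), the background distribution is fixed (typically collection word frequencies), and the noise level is set manually. Variable names are mine, not the lecture's.

```python
import numpy as np

def plsa_em(counts, k, background, lambda_B, iters=50, seed=0):
    """counts: (D, V) document-term counts; background: (V,) fixed p(w|B)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    themes = rng.dirichlet(np.ones(V), size=k)     # p(w | theta_j), random init
    pi = rng.dirichlet(np.ones(k), size=D)         # pi_{d,j}, random init

    for _ in range(iters):
        # E-step: posterior over hidden variables for each (d, w) pair.
        theme_part = pi[:, :, None] * themes[None, :, :]                 # (D, k, V)
        p_topic = theme_part / (theme_part.sum(1, keepdims=True) + 1e-12)
        mix = (1 - lambda_B) * theme_part.sum(1) + lambda_B * background  # p(w|d)
        p_not_bg = (1 - lambda_B) * theme_part.sum(1) / (mix + 1e-12)     # (D, V)

        # Fractional counts of word w in doc d assigned to theme j.
        frac = counts[:, None, :] * p_not_bg[:, None, :] * p_topic        # (D, k, V)

        # M-step: re-estimate mixing weights and theme word distributions.
        pi = frac.sum(2)
        pi /= pi.sum(1, keepdims=True) + 1e-12
        themes = frac.sum(0)
        themes /= themes.sum(1, keepdims=True) + 1e-12

    return pi, themes
```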

13 Use the Model for Text Mining
Term clustering: each θj naturally serves as a word cluster; πd,j gives the coverage of cluster j in document d
Document clustering: we may use a naive Bayes classifier, or use πd,j to decide which cluster d should be in
Contextual text mining: make πd,j conditioned on context, e.g., p(θj | time), from which we can compute/plot p(time | θj)
(A small helper sketch follows below.)
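A small helper sketch, not from the lecture, showing how the estimated θj and πd,j might be read off for term and document clustering; function and variable names are mine.

```python
import numpy as np

def top_words(themes, vocab, n=5):
    """Top-n words of each theme distribution serve as word-cluster labels."""
    return [[vocab[i] for i in np.argsort(-theme)[:n]] for theme in themes]

def hard_document_clusters(pi):
    """Assign each document to the theme with the largest coverage pi_{d,j}."""
    return np.argmax(pi, axis=1)
```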

14 Adding Prior
The same mixture model as before ("generating" word w in doc d with background B plus themes θ1 ... θk and document-specific weights πd,j), but now we can encode prior knowledge about what a theme's most likely words should be
Parameters: λB = noise level (manually set); the π's and θ's are estimated with maximum likelihood, or with the prior via maximum a posteriori estimation (next slide)

15 Maximum A Posteriori (MAP) Estimation
With a conjugate prior centered on a prior theme distribution θ'j, the M-step simply adds pseudo counts: the count of w assigned to theme j is augmented by μ p(w|θ'j), and the denominator is augmented by μ, the sum of all pseudo counts
What if μ = 0? (We recover the maximum-likelihood estimate.) What if μ = +∞? (The estimate is pinned to the prior.)
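To make the pseudo-count interpretation concrete, here is a sketch (my notation, not the lecture's) of the modified M-step update for one theme's word distribution under such a prior.

```python
import numpy as np

def map_theme_update(frac_counts_j, theta_prior_j, mu):
    """
    frac_counts_j: (V,) fractional counts of words assigned to theme j (from the E-step)
    theta_prior_j: (V,) prior word distribution p(w | theta'_j)
    mu:            prior strength (total number of pseudo counts)

    mu = 0 recovers the maximum-likelihood estimate;
    mu -> infinity pins the estimate to the prior exactly.
    """
    numer = frac_counts_j + mu * theta_prior_j     # real counts + pseudo counts of w
    return numer / (frac_counts_j.sum() + mu)      # denominator adds the total pseudo counts
```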

16 Variants
Different contexts
Further regularize the mixing weights: Latent Dirichlet Allocation (LDA) [Blei et al. 03]; parameter estimation is more complex

17 Comparative Text Mining (CTM) [Zhai et al. 04]
Problem definition: given a comparable set of text collections (a pool of collections C1, C2, ..., Ck), discover and analyze their common themes and the themes specific to each collection (C1-specific, C2-specific, ..., Ck-specific themes)

18 Example: Summarize Customer Reviews
Ideal results from comparative text mining on IBM, APPLE, and DELL laptop reviews (common themes with brand-specific content):
Battery life: IBM: long, 4-3 hrs; APPLE: medium, 3-2 hrs; DELL: short, 2-1 hrs
Hard disk: IBM: large; APPLE: small, 5-10 GB; DELL: medium
Speed: IBM: slow; APPLE: very fast, 3-4 Ghz; DELL: moderate, 1-2 Ghz

19 A More Realistic Setup of CTM
Common word distributions vs. collection-specific word distributions (IBM, APPLE, DELL laptop reviews):
Battery theme: common: battery 0.129, hours 0.080, life 0.060; IBM-specific: long 0.120, 4hours 0.010, 3hours 0.008; APPLE-specific: reasonable 0.10, medium 0.08, 2hours 0.002; DELL-specific: short 0.05, poor 0.01, 1hours 0.005
Disk theme: common: disk 0.015, IDE 0.010, drive 0.005; IBM-specific: large 0.100, 80GB 0.050; APPLE-specific: small 0.050, 5GB, ...; DELL-specific: medium 0.123, 20GB, ...
Speed theme: common: pentium 0.113, processor 0.050; IBM-specific: slow 0.114, 200Mhz 0.080; APPLE-specific: fast 0.151, 3Ghz 0.100; DELL-specific: moderate 0.116, 1Ghz

20 Cross-Collection Mixture Models
Explicitly distinguish common themes and collection-specific themes
Fit a mixture model to the text data; estimate parameters using EM
Clusters are designed to be meaningful
Components: a background model B; for each theme j = 1..k, a common theme θj plus collection-specific versions θj,1, θj,2, ..., θj,m for collections C1, C2, ..., Cm

21 Details of the Mixture Model
Account for noise (common non-informative words) with a background distribution B
"Generating" word w in doc d in collection Ci: with probability λB, draw w from the background B; otherwise pick theme j with document-specific probability πd,j, then with probability λC draw w from the common distribution θj, and with probability 1-λC from the collection-specific distribution θj,i
Parameters: λB = noise level (manually set); λC = common-specific tradeoff (manually set); the π's and θ's are estimated with maximum likelihood
(A sketch of the resulting word probability follows below.)
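As a sketch of the generative story just described, the following function (my notation, hypothetical parameter names) computes the word probabilities for a document in collection Ci under the cross-collection mixture model.

```python
import numpy as np

def word_prob_ccmix(pi_d, common, specific_i, background, lambda_B, lambda_C):
    """
    p(w | d in collection C_i) for all words w under the cross-collection mixture model.

    pi_d:        (k,) theme coverage of document d
    common:      (k, V) common theme distributions p(w | theta_j)
    specific_i:  (k, V) C_i-specific distributions p(w | theta_{j,i})
    background:  (V,) background distribution p(w | B)
    lambda_B:    noise level; lambda_C: common-vs-specific tradeoff (both set manually)
    """
    # Each theme mixes its common and collection-specific word distributions.
    per_theme = lambda_C * common + (1 - lambda_C) * specific_i   # (k, V)
    return lambda_B * background + (1 - lambda_B) * pi_d @ per_theme
```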

22 Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars
Cluster 1: common theme: united, nations; Iraq-specific: n, weapons, inspections; Afghan-specific: northern, alliance, kabul, taleban, aid
Cluster 2: common theme: killed, month, deaths; Iraq-specific: troops, hoon, sanches; Afghan-specific: taleban, rumsfeld, hotel, front
Cluster 3: ...

23 Comparing Laptop Reviews
Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])
These word distributions can be used to segment text and add hyperlinks between documents

24 What You Should Know
The basic idea of PLSA
How to estimate its parameters
How to use the model to do clustering/mining

