Probabilistic Topic Model


Probabilistic Topic Model (Lecture for CS410: Intro to Text Info Systems)
April 11, 2007
ChengXiang Zhai, Department of Computer Science, University of Illinois at Urbana-Champaign
Many slides are adapted or taken from Qiaozhu Mei's presentations.

Outline
Motivation: contextual text mining
General ideas of probabilistic topic models
Probabilistic latent semantic analysis/indexing (PLSA/PLSI)
Variants

Context Features of a Document (Weblog Article)
A document comes with context features (metadata) such as the communities it belongs to, its author, source, location, time, and the author's occupation. Compared with other kinds of data, weblogs have some interesting special characteristics that make them attractive to exploit for text mining.

What is a Context?
Examples of contexts: papers written in 1998; papers written by Andrew McCallum; papers published at WWW, SIGIR, ACL, KDD, or SIGMOD; papers written by Andrew McCallum in 1998.
Any context feature (metadata), or combination of features, of documents can define a context.
Contexts can overlap.
The choice of contexts depends on the needs of the mining task.
Each context corresponds to a supporting document set.

Text Collections with Context Information
News articles: time, publisher
Weblogs: time, location, author, age group, etc.
Scientific literature: author, publication time, conference, citations, etc.
Customer reviews: product, source, time
Emails: sender, receiver, time, thread
Query logs: time, IP address, clickthrough
Webpages: domain, time, etc.

Application Questions (Contextual Text Mining)
1. What were people interested in about language models before and after the year 2000? (temporal text mining, KDD'05)
2. Which country responded first to the release of the iPod Nano: China, the UK, or Canada? (spatiotemporal text mining, WWW'06)
3. Who cared more about gas prices during Hurricane Katrina: people in Illinois or in Washington? (spatiotemporal text mining, WWW'06)
4. What is the common interest of two researchers, and who is more application oriented? (author-topic analysis)
5. Do people like Dell laptops more than IBM laptops? (comparative text mining, KDD'04)

General Idea of Probabilistic Topic Models
Model a topic/subtopic/theme with a multinomial distribution over words (a unigram LM).
Model text data with a mixture model involving several multinomial distributions: a document is "generated" by sampling words from these distributions, and each word may be generated from a different distribution. There are many variations of how the multinomial distributions are mixed.
Topic mining = fitting the probabilistic model to text; topic-related questions are then answered by computing various conditional probabilities based on the estimated model (e.g., p(time | topic), p(time | topic, location)). A toy sketch of the generative view appears below.
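To make the generative view concrete, here is a toy sketch (not from the lecture): the three-word vocabulary, the two theme distributions, and the topic coverage weights are all made up for illustration.

```python
# Toy generative sketch: each word of a document may come from a different
# multinomial (theme) distribution, mixed by document-specific weights.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["price", "oil", "government"]                 # made-up vocabulary
themes = np.array([[0.7, 0.2, 0.1],                    # hypothetical theme 1
                   [0.1, 0.1, 0.8]])                   # hypothetical theme 2
pi_d = np.array([0.6, 0.4])                            # topic coverage of one document

words = []
for _ in range(10):
    j = rng.choice(len(pi_d), p=pi_d)                  # pick a theme for this word
    w = rng.choice(len(vocab), p=themes[j])            # sample the word from that theme
    words.append(vocab[w])
print(" ".join(words))
```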

What is a Theme?
A theme is a semantically coherent topic or subtopic, represented with a multinomial distribution of terms over the whole vocabulary, i.e., a unigram language model.
Example theme "Oil Price": price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, ...
Semantic resolution ranges from coarse topics/themes (a document typically contains only a few) down to finer-grained patterns, concepts, entities, and words (a document contains many).

The Usage of a Theme
Example themes estimated from text about Hurricane Katrina:
Theme 1: government 0.3, response 0.2, ...
Theme 2: donate 0.1, relief 0.05, help 0.02, ...
Theme k: city 0.2, new 0.1, orleans 0.05, ...
Background B: is 0.05, the 0.04, a 0.03, ...
Such themes can tag segments of a passage, e.g.: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]."
Usage of a theme: summarize topics/subtopics, navigate documents, retrieve documents, segment documents, and all other tasks involving unigram language models.

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
Mix k multinomial distributions to generate a document.
Each document has a potentially different set of mixing weights, which captures its topic coverage.
When generating the words of a document, each word may be generated using a DIFFERENT multinomial distribution (in contrast with the document clustering model, where once a multinomial distribution is chosen, all the words of the document are generated from that same model).
We may add a background distribution to "attract" background words.

PLSA as a Mixture Model
"Generating" word w in document d of the collection: with probability λ_B the word is drawn from the background distribution θ_B (e.g., is 0.05, the 0.04, a 0.03, ...); with probability 1 - λ_B a theme j is first chosen according to the document-specific mixing weights π_{d,j}, and w is then drawn from that theme's distribution θ_j (e.g., Theme 1: warning 0.3, system 0.2, ...; Theme 2: aid 0.1, donation 0.05, support 0.02, ...; Theme k: statistics 0.2, loss 0.1, dead 0.05, ...).
Parameters: λ_B is the noise level (set manually); the π's and θ's are estimated by maximum likelihood. The corresponding word-generation probability is given below.
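Reconstructed from the description above (using the slide's symbols, so π_{d,j} are the mixing weights, θ_B the background, and θ_j the theme distributions), the word-generation probability and document log-likelihood are:

```latex
p(w \mid d) = \lambda_B\, p(w \mid \theta_B)
            + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j),
\qquad
\log p(d) = \sum_{w \in V} c(w, d)\, \log p(w \mid d)
```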

Parameter Estimation (EM)
E-step: for each word w in document d, apply Bayes' rule to compute the probability that it was generated from cluster (theme) j rather than from the background; these are fractional counts contributing both to using cluster j in generating d and to generating w from cluster j.
M-step: use the fractional counts to re-estimate the mixing weights π_{d,j} and the cluster language models p(w | θ_j), summing over all documents (or over multiple collections in the cross-collection setting).
A runnable sketch of this EM procedure is given below.
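A minimal numpy sketch of this EM procedure, assuming a document-term count matrix X, a fixed background distribution p_B, and a manually set noise level lambda_B (the variable names are mine, not the course's):

```python
import numpy as np

def plsa_em(X, k, p_B, lambda_B=0.9, n_iter=50, seed=0):
    """PLSA with a fixed background component, estimated by EM.

    X: (n_docs, n_words) term-count matrix; p_B: length-n_words background
    distribution; k: number of themes. Returns (pi, theta)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    theta = rng.dirichlet(np.ones(n_words), size=k)   # p(w | theta_j), shape (k, n_words)
    pi = rng.dirichlet(np.ones(k), size=n_docs)       # pi_{d,j}, shape (n_docs, k)

    for _ in range(n_iter):
        # E-step: posterior that word w in doc d came from theme j (vs. background)
        themed = (1 - lambda_B) * pi[:, :, None] * theta[None, :, :]          # (d, j, w)
        denom = lambda_B * p_B[None, None, :] + themed.sum(axis=1, keepdims=True)
        z = themed / np.maximum(denom, 1e-12)          # fractional assignment p(z = j | d, w)

        # M-step: re-estimate mixing weights and theme language models
        counts = X[:, None, :] * z                     # fractional counts, (d, j, w)
        pi = counts.sum(axis=2)
        pi /= np.maximum(pi.sum(axis=1, keepdims=True), 1e-12)
        theta = counts.sum(axis=0)
        theta /= np.maximum(theta.sum(axis=1, keepdims=True), 1e-12)
    return pi, theta
```

The dense (docs, themes, words) arrays are fine for a toy corpus; a real implementation would iterate over the sparse nonzero counts instead.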

Use the Model for Text Mining
Term clustering: each theme distribution θ_j naturally serves as a word cluster, and the mixing weights give the coverage of cluster j.
Document clustering: we may use a Naive Bayes classifier, or use π_{d,j} directly to decide which cluster d should be in.
Contextual text mining: make π_{d,j} conditioned on context, e.g., p(θ_j | time), from which we can compute or plot p(time | θ_j).
These uses are illustrated in the snippet below.
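Continuing with the (assumed) pi and theta returned by the EM sketch above, these uses are one-liners:

```python
doc_cluster = pi.argmax(axis=1)                        # document clustering: most covered theme per doc
coverage = pi.sum(axis=0) / pi.sum()                   # relative coverage of each theme in the collection
top_word_ids = theta.argsort(axis=1)[:, ::-1][:, :10]  # term clustering: top-10 word ids per theme
```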

Adding Prior
The same mixture model can be combined with a prior on the theme word distributions (a prior θ'_j encoding which words we expect theme j to favor); instead of the plain maximum-likelihood estimate we then take the most likely parameters given both the data and the prior.
As before, λ_B is the manually set noise level; the π's and θ's are now estimated by maximum a posteriori (MAP) estimation rather than maximum likelihood.

Maximum A Posteriori (MAP) Estimation
With a conjugate (Dirichlet) prior whose mean is θ'_j and whose strength is μ, the M-step for a theme simply adds pseudo counts μ p(w | θ'_j) to the fractional counts before normalizing:

p(w | θ_j) = ( c(w, j) + μ p(w | θ'_j) ) / ( Σ_{w'} c(w', j) + μ )

where c(w, j) is the fractional count of w attributed to theme j, and μ is the sum of all pseudo counts.
What if μ = 0? We recover the maximum-likelihood estimate. What if μ → +∞? The estimate is pinned to the prior θ'_j.
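A one-function sketch of this MAP update (the names are assumed; frac_counts would come from the E-step of the EM sketch above):

```python
import numpy as np

def map_update(frac_counts, prior, mu):
    """MAP re-estimate of one theme's word distribution.

    frac_counts: length-|V| fractional counts of words assigned to the theme;
    prior: the prior mean p(w | theta'_j), summing to 1; mu: prior strength.
    mu = 0 gives the ML estimate; mu -> infinity pins the theme to the prior."""
    return (frac_counts + mu * prior) / (frac_counts.sum() + mu)
```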

Variants
Different contexts can be incorporated.
The mixing weights can be further regularized.
Latent Dirichlet Allocation (LDA) [Blei et al. 03] places a Dirichlet prior on the mixing weights; its parameter estimation is more complex.
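As a pointer beyond these slides, one widely used LDA implementation is gensim's LdaModel; a minimal usage sketch on a made-up toy corpus:

```python
from gensim import corpora, models

texts = [["oil", "price", "gas", "increase"],
         ["government", "response", "relief", "donation"]]
dictionary = corpora.Dictionary(texts)                     # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]            # bag-of-words counts
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                                  # top words per topic
```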

Comparative Text Mining (CTM) [Zhai et al. 04]
Problem definition: given a comparable set (a pool) of text collections C1, C2, ..., Ck, discover and analyze their common and unique properties: the common themes shared across collections, and the C1-specific, C2-specific, ..., Ck-specific themes.

Example: Summarize Customer Reviews
Ideal results from comparative text mining over IBM, APPLE, and DELL laptop reviews:

Common Themes | "IBM" specific     | "APPLE" specific    | "DELL" specific
Battery Life  | Long, 4-3 hrs      | Medium, 3-2 hrs     | Short, 2-1 hrs
Hard disk     | Large, 80-100 GB   | Small, 5-10 GB      | Medium, 20-50 GB
Speed         | Slow, 100-200 MHz  | Very fast, 3-4 GHz  | Moderate, 1-2 GHz

A More Realistic Setup of CTM
Each row is a theme, shown as a common word distribution plus collection-specific word distributions (IBM vs. APPLE vs. DELL laptop reviews):

Common Word Distr.                          | "IBM" specific                          | "APPLE" specific                            | "DELL" specific
Battery 0.129, Hours 0.080, Life 0.060, ... | Long 0.120, 4hours 0.010, 3hours 0.008  | Reasonable 0.10, Medium 0.08, 2hours 0.002  | Short 0.05, Poor 0.01, 1hours 0.005, ...
Disk 0.015, IDE 0.010, Drive 0.005          | Large 0.100, 80GB 0.050                 | Small 0.050, 5GB 0.030, ...                 | Medium 0.123, 20GB 0.080, ...
Pentium 0.113, Processor 0.050              | Slow 0.114, 200MHz 0.080                | Fast 0.151, 3GHz 0.100                      | Moderate 0.116, 1GHz 0.070

Cross-Collection Mixture Models
Explicitly distinguish common and collection-specific themes: besides the background B, each theme j has a common version θ_j shared by all collections and specific versions θ_{j,1}, θ_{j,2}, ..., θ_{j,m} for collections C1, C2, ..., Cm.
Fit the mixture model to the text data and estimate the parameters with EM; the clusters are designed to be meaningful for comparison.

Details of the Mixture Model
"Generating" word w in document d of collection Ci: with probability λ_B the word comes from the background distribution θ_B (accounting for noise, i.e., common non-informative words); otherwise a theme j is chosen with document-specific weight π_{d,j}, and the word is then drawn either from the common distribution θ_j (with probability λ_C) or from the collection-specific distribution θ_{j,i} (with probability 1 - λ_C).
Parameters: λ_B is the noise level (set manually), λ_C is the common-vs-specific trade-off (set manually); the π's and θ's are estimated by maximum likelihood.
The full word-generation probability is given below.
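Putting the description above into a formula (again using the slide's symbols, with θ_{j,i} denoting the Ci-specific version of theme j), word generation in the cross-collection mixture model is:

```latex
p(w \mid d, C_i) = \lambda_B\, p(w \mid \theta_B)
  + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}
    \Big[ \lambda_C\, p(w \mid \theta_j) + (1 - \lambda_C)\, p(w \mid \theta_{j,i}) \Big]
```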

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars.

                | Cluster 1                                                           | Cluster 2
Common theme    | united 0.042, nations 0.04, ...                                     | killed 0.035, month 0.032, deaths 0.023
Iraq-specific   | n 0.03, Weapons 0.024, Inspections 0.023                            | troops 0.016, hoon 0.015, sanches 0.012
Afghan-specific | Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02   | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011

Comparing Laptop Reviews
Top words serve as "labels" for the common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).
These word distributions can be used to segment text and add hyperlinks between documents.

What You Should Know
The basic idea of PLSA
How to estimate its parameters
How to use the model to do clustering and mining