Dragon Star Lecture at Beijing University, June 21-30, 2008 (© 2008 ChengXiang Zhai)
Dragon Star Program Course: Information Retrieval (龙星计划课程: 信息检索)
Topic Models for Text Mining
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois at Urbana-Champaign

Text Management Applications
– Access: select information
– Mining: create knowledge
– Organization: add structure/annotations

What Is Text Mining?
"The objective of Text Mining is to exploit information contained in textual documents in various ways, including ... discovery of patterns and trends in data, associations among entities, predictive rules, etc." (Grobelnik et al., 2001)
"Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known." (Hearst, 1999)
(Slide from Rebecca Hwa's "Intro to Text Mining")

Two Different Views of Text Mining
Data Mining View ("shallow mining"): explore patterns in textual data
– Find latent topics
– Find topical trends
– Find outliers and other hidden patterns
Natural Language Processing View ("deep mining"): make inferences based on a partial understanding of natural language text
– Information extraction
– Question answering

Applications of Text Mining
Direct applications: go beyond search to find knowledge
– Question-driven (bioinformatics, business intelligence, etc.): we have specific questions; how can we exploit data mining to answer them?
– Data-driven (WWW, literature, customer reviews, etc.): we have a lot of data; what can we do with it?
Indirect applications
– Assist information access (e.g., discover latent topics to better summarize search results)
– Assist information organization (e.g., discover hidden structures)

Text Mining Methods
Data Mining Style: view text as high-dimensional data
– Frequent pattern finding
– Association analysis
– Outlier detection
Information Retrieval Style: fine-granularity topical analysis
– Topic extraction
– Exploit term weighting and text similarity measures
Natural Language Processing Style: information extraction
– Entity extraction
– Relation extraction
– Sentiment analysis
– Question answering
Machine Learning Style: unsupervised or semi-supervised learning (the topic of this lecture)
– Mixture models
– Dimension reduction

Outline
The basic topic models:
– Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
– Latent Dirichlet Allocation (LDA) [Blei et al. 02]
Extensions:
– Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]
– Other extensions

Basic Topic Model: PLSA

PLSA: Motivation
What did people say in their blog articles about "Hurricane Katrina"?
Query = "Hurricane Katrina"
Results: (the slide shows a screenshot of search results)

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
– Mix k multinomial distributions to generate a document
– Each document has a potentially different set of mixing weights, which captures the topic coverage
– When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (in contrast with the document clustering model, where, once a multinomial distribution is chosen, all the words in a document are generated using the same model)
– We may add a background distribution to "attract" background words

PLSA as a Mixture Model
"Generating" word w in doc d in the collection: the document mixes k topics θ_1, ..., θ_k — e.g., θ_1: warning 0.3, system 0.01, ...; θ_2: aid 0.1, donation 0.05, support 0.02, ...; θ_k: statistics 0.2, loss 0.1, dead 0.05, ... — with document-specific weights π_{d,1}, π_{d,2}, ..., π_{d,k} (the unknowns, shown as "?" on the slide), plus a background distribution θ_B (is 0.05, the 0.04, a 0.03, ...) chosen with probability λ_B.
Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood.
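Written out, the generation process the slide diagrams is (a reconstruction; the slide's formula is an image):

$$p_d(w) = \lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$$

Maximum Likelihood fits the θ's and π's by maximizing $\log p(\mathcal{C}) = \sum_d \sum_w c(w,d)\, \log p_d(w)$, where c(w,d) is the count of word w in document d.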

Special Case: Model-based Feedback
Simple case: there is only one topic. A word is drawn from the topic model P(w|θ_F) with probability 1 − λ (topic words) or from the background model P(w|θ_B) with probability λ (background words); the choice of source is governed by λ = P(source). θ_F is then fit by Maximum Likelihood. What if there are k topics?
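The Maximum Likelihood objective the slide refers to (reconstructed; the formula is an image on the slide) is the standard one from model-based feedback: given the feedback document(s) F,

$$\hat\theta_F = \arg\max_{\theta_F} \sum_{w} c(w, F)\, \log\big[(1-\lambda)\, p(w \mid \theta_F) + \lambda\, p(w \mid \theta_B)\big]$$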

How to Estimate θ_j: the EM Algorithm
Known: background model p(w|θ_B) (the 0.2, a 0.1, we 0.01, to 0.02, ...).
Unknown: topic model p(w|θ_1) = ? for "Text mining" (text = ?, mining = ?, association = ?, word = ?, ...) and topic model p(w|θ_2) = ? for "information retrieval" (information = ?, retrieval = ?, query = ?, document = ?, ...).
Observed: the document(s).
Suppose we knew the identity (source) of each word — then the ML estimator would reduce to simple counting; since we don't, EM infers these identities probabilistically.

How the Algorithm Works
Toy example: two documents d_1, d_2 with counts c(w,d) over the words "aid", "price", "oil"; two topics with mixing weights π_{d,j} (= P(θ_j|d)) and word distributions P(w|θ_j).
– Initialize π_{d,j} and P(w|θ_j) with random values.
– Iteration 1, E-step: split the word counts among topics and background (by computing the z posteriors); word w in doc d contributes c(w,d)·p(z_{d,w}=B) to the background and c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j) to topic j.
– Iteration 1, M-step: re-estimate π_{d,j} and P(w|θ_j) by adding up and normalizing the split word counts.
– Iterations 2, 3, 4, 5, ...: repeat until convergence (see the sketch below).
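As a concrete illustration, here is a minimal runnable sketch of this EM loop in Python. The toy counts and the uniform background model are invented for illustration; the variable names mirror the slide's notation:

```python
import numpy as np

rng = np.random.default_rng(0)

counts = np.array([[5.0, 1.0, 2.0],       # c(w, d1) for aid, price, oil
                   [1.0, 4.0, 6.0]])      # c(w, d2)
n_docs, n_words = counts.shape
k = 2                                      # number of topics
lam_B = 0.5                                # background noise level (manually set)
theta_B = np.full(n_words, 1.0 / n_words)  # known background model p(w|theta_B)

# Random initialization of pi_{d,j} and p(w|theta_j)
pi = rng.random((n_docs, k)); pi /= pi.sum(axis=1, keepdims=True)
theta = rng.random((k, n_words)); theta /= theta.sum(axis=1, keepdims=True)

for it in range(100):
    # E-step: posteriors of the hidden variables z_{d,w}
    topic_mix = pi @ theta                 # sum_j pi_{d,j} p(w|theta_j)
    p_B = lam_B * theta_B / (lam_B * theta_B + (1 - lam_B) * topic_mix)
    p_z = (pi[:, :, None] * theta[None, :, :]) / topic_mix[:, None, :]

    # M-step: pool and normalize the split word counts
    split = counts[:, None, :] * (1 - p_B)[:, None, :] * p_z   # (d, j, w)
    pi = split.sum(axis=2); pi /= pi.sum(axis=1, keepdims=True)
    theta = split.sum(axis=0); theta /= theta.sum(axis=1, keepdims=True)

print(np.round(theta, 3))   # learned p(w|theta_j)
print(np.round(pi, 3))      # learned pi_{d,j}
```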

Parameter Estimation
E-step: compute the probability that word w in doc d was generated from cluster j versus from the background — an application of Bayes' rule to the hidden variables.
M-step: re-estimate the mixing weights π_{d,j} and the cluster language models p(w|θ_j) from the fractional counts contributing to (a) using cluster j in generating d and (b) generating w from cluster j, summing over all docs (in multiple collections; m = 1 if there is one collection).
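The update formulas on the slide are images; the standard PLSA-with-background updates they depict are:

E-step:

$$p(z_{d,w} = j) = \frac{\pi_{d,j}\, p(w\mid\theta_j)}{\sum_{j'} \pi_{d,j'}\, p(w\mid\theta_{j'})}, \qquad p(z_{d,w} = B) = \frac{\lambda_B\, p(w\mid\theta_B)}{\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B) \sum_{j'} \pi_{d,j'}\, p(w\mid\theta_{j'})}$$

M-step:

$$\pi_{d,j} \propto \sum_{w} c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\, p(z_{d,w}=j), \qquad p(w\mid\theta_j) \propto \sum_{d} c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\, p(z_{d,w}=j)$$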

PLSA with Prior Knowledge
There are different ways of choosing aspects (topics):
– Google = Google News + Google Maps + Google Scholar + ...
– Google = Google US + Google France + Google China + ...
Users have some domain knowledge in mind, e.g.:
– We expect to see "retrieval models" as a topic in IR.
– We want to show the aspects of "history" and "statistics" for YouTube.
A flexible way to incorporate such knowledge is as priors of the PLSA model. In Bayesian terms, the prior is your "belief" about the topic distributions.

Adding Prior
Same mixture model as before ("generating" word w in doc d from topics θ_1, ..., θ_k with weights π_{d,j}, plus background θ_B with noise level λ_B, manually set) — but instead of plain Maximum Likelihood, the θ's and π's are now estimated as the most likely values under the posterior given a prior (MAP estimation).

Adding Prior as Pseudo Counts
Same estimation setup as before: known background p(w|θ_B); unknown topic models p(w|θ_1) = ? ("Text mining") and p(w|θ_2) = ? ("information retrieval"); observed document(s). The prior is encoded as a pseudo document of size μ containing, e.g., the words "text" and "mining"; the MAP estimator then treats these pseudo counts just like observed counts.

Maximum A Posteriori (MAP) Estimation
With a conjugate prior, the M-step update for p(w|θ_j) simply adds pseudo counts (reconstructing the slide's formula):

$$p(w \mid \theta_j) = \frac{\sum_d c(w,d)\big(1-p(z_{d,w}=B)\big)\, p(z_{d,w}=j) \; + \; \mu\, p(w \mid \theta'_j)}{\sum_{w'} \sum_d c(w',d)\big(1-p(z_{d,w'}=B)\big)\, p(z_{d,w'}=j) \; + \; \mu}$$

where μ·p(w|θ'_j) is the pseudo count of w from the prior θ', and μ is the sum of all pseudo counts. What if μ = 0? (We recover plain ML.) What if μ = +∞? (The topic is fixed to the prior.)

Basic Topic Model: LDA
(The following slides about LDA are taken from Michael C. Mozer's course lecture.)

LDA: Motivation — shortcomings of pLSI
– "Documents have no generative probabilistic semantics", i.e., a document is just a symbol
– The model has many parameters (linear in the number of documents), so heuristic methods are needed to prevent overfitting
– It cannot generalize to new documents

Unigram Model
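(The slide shows only a formula image; in Blei et al.'s notation, the unigram model generates every word of a document independently from a single multinomial:)

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$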

Mixture of Unigrams
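(Again only a formula image survives; the mixture of unigrams first picks one topic z for the whole document, then draws every word from it:)

$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$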

Topic Model / Probabilistic LSI
d is a localist representation of (trained) documents; LDA (below) instead provides a distributed representation.
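(The pLSI formula the slide depicts: each word of document d can come from a different topic z, with document-specific topic weights p(z|d):)

$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$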

LDA
– Vocabulary of |V| words
– A document is a collection of N words from the vocabulary: w = (w_1, ..., w_N)
– Latent topics: random variable z, with values 1, ..., k
– As in the topic model (pLSI), a document is generated by sampling a topic from a mixture and then sampling a word from that topic. But the topic model assumes a fixed mixture of topics (multinomial distribution) for each document, whereas LDA assumes a random mixture of topics, drawn from a Dirichlet distribution, for each document (sketched below).
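A minimal runnable sketch of LDA's generative process; the vocabulary, α, and β values below are toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["aid", "price", "oil", "warning", "system"]
k, V = 2, len(vocab)
alpha = np.full(k, 0.5)                    # Dirichlet parameter over topics
beta = rng.dirichlet(np.ones(V), size=k)   # p(w|z): one multinomial per topic

def generate_document(n_words: int) -> list[str]:
    theta = rng.dirichlet(alpha)           # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)         # sample a topic for this word
        w = rng.choice(V, p=beta[z])       # sample a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(10))
```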

Generative Model
"Plates" indicate looping structure: the outer plate is replicated for each document, the inner plate for each word; the same conditional distributions apply for each replicate. The document probability follows (reconstructed below).
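The formula shown on the slide is an image; in Blei et al.'s notation, the joint distribution the plate diagram encodes is:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$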

Fancier Version
(The original slide shows only a plate diagram.)

Inference

Inference
In general, the document probability p(w | α, β) is intractable to compute exactly. (In the expanded version below, w_n^j = 1 if w_n is the j-th vocabulary word.)
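The formulas on the slide are images; following Blei et al. (2003), the marginal is:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

Expanded version:

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\big(\sum_i \alpha_i\big)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{|V|} (\theta_i\, \beta_{ij})^{w_n^j} \right) d\theta$$

The coupling between θ and β inside the integral is what makes this intractable.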

Variational Approximation
Compute the log likelihood and introduce Jensen's inequality: log(E[x]) ≥ E[log(x)]. Find a variational distribution q such that the resulting bound is computable:
– q parameterized by γ and φ_n
– Maximize the bound with respect to γ and φ_n to obtain the best approximation to p(w | α, β)
– Leads to a variational EM algorithm
Sampling algorithms (e.g., Gibbs sampling) are also common (see the sketch below).
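As a concrete illustration of the sampling alternative, here is a minimal collapsed Gibbs sampler for LDA in the style of Griffiths & Steyvers (2004) — a sketch with an invented toy corpus and hyperparameters, not the variational method of the original paper:

```python
import numpy as np

rng = np.random.default_rng(2)

docs = [[0, 1, 2, 2], [2, 3, 4, 3], [0, 0, 1, 4]]  # word ids per document
k, V = 2, 5
alpha, eta = 0.5, 0.01

# Count matrices and random initial topic assignments
ndk = np.zeros((len(docs), k))   # doc-topic counts
nkw = np.zeros((k, V))           # topic-word counts
nk = np.zeros(k)                 # total words per topic
z = [[rng.integers(k) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        t = z[d][n]
        ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

for it in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]
            # Remove the current assignment from the counts
            ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
            # Conditional p(z = t | all other assignments)
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            t = rng.choice(k, p=p / p.sum())
            z[d][n] = t
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

print(np.round((nkw + eta) / (nk[:, None] + V * eta), 3))  # estimated p(w|z)
```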

Data Sets
– C. Elegans community abstracts: 5,225 abstracts, 28,414 unique terms
– TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms
– Held-out data: 10%
– Removed terms: 50 stop words, and words appearing once

C. Elegans
(The slide shows a results figure comparing the models on held-out documents.)
Note: a "fold-in" hack is used for pLSI to allow it to handle novel documents; it involves refitting the p(z|d_new) parameters — sort of a cheat.

AP
(The corresponding results figure for the TREC AP corpus.)

Summary: PLSA vs. LDA
– LDA adds a Dirichlet distribution on top of PLSA to regularize the model
– Estimation of LDA is more complicated than PLSA
– LDA is a generative model, while PLSA isn't
– PLSA is more likely to over-fit the data than LDA
Which one to use?
– If you need generalization capacity, LDA
– If you want to mine topics from a collection, PLSA may be better (we want overfitting!)

Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)

A General Introduction to EM
Data: X (observed) + H (hidden). Parameter: θ.
"Incomplete" likelihood: L(θ) = log p(X|θ)
"Complete" likelihood: L_c(θ) = log p(X, H|θ)
EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess θ^(0):
1. E-step: compute the expectation of the complete likelihood (the Q-function)
2. M-step: compute θ^(n) by maximizing the Q-function
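The Q-function on the slide is an image; its standard definition is:

$$Q(\theta; \theta^{(n-1)}) = \mathbb{E}_{H \sim p(H \mid X, \theta^{(n-1)})}\big[\log p(X, H \mid \theta)\big] = \sum_{H} p(H \mid X, \theta^{(n-1)})\, \log p(X, H \mid \theta)$$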

Convergence Guarantee
Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) − L(θ^(n−1)) ≥ 0.
Note that since p(X,H|θ) = p(H|X,θ) p(X|θ), we have L(θ) = L_c(θ) − log p(H|X,θ) (the left-hand side doesn't contain H), so

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log\frac{p(H\mid X,\theta^{(n-1)})}{p(H\mid X,\theta^{(n)})}$$

Taking the expectation w.r.t. p(H|X,θ^(n−1)):

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)};\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\big(p(H\mid X,\theta^{(n-1)})\,\|\,p(H\mid X,\theta^{(n)})\big)$$

The KL-divergence is always non-negative, and EM chooses θ^(n) to maximize Q; therefore, L(θ^(n)) ≥ L(θ^(n−1))!

Another Way of Looking at EM
Around the current guess θ^(n−1), the likelihood decomposes as

$$L(\theta) = L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\big(p(H\mid X,\theta^{(n-1)})\,\|\,p(H\mid X,\theta)\big)$$

so the Q-function yields a lower bound:

$$L(\theta) \geq L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)})$$

E-step = computing the lower bound; M-step = maximizing the lower bound to obtain the next guess.

Why Contextual PLSA?

Motivating Example: Comparing Product Reviews
Unsupervised discovery of common topics and their variations across IBM, APPLE, and DELL laptop reviews:

Common Themes | "IBM" specific   | "APPLE" specific    | "DELL" specific
Battery Life  | Long, 4-3 hrs    | Medium, 3-2 hrs     | Short, 2-1 hrs
Hard disk     | Large, … GB      | Small, 5-10 GB      | Medium, … GB
Speed         | Slow, … Mhz      | Very fast, 3-4 Ghz  | Moderate, 1-2 Ghz

Motivating Example: Comparing News about Similar Topics
Unsupervised discovery of common topics and their variations across the Vietnam War, Afghan War, and Iraq War: common themes such as "United nations" and "Death of people", each with war-specific variations (the table cells do not survive in the transcript).

Motivating Example: Discovering Topical Trends in Literature
Unsupervised discovery of topics and their temporal variations. (The slide plots theme strength over time for themes such as TF-IDF, Retrieval, IR Applications, Language Model, and Text Categorization.)

Motivating Example: Analyzing Spatial Topic Patterns
How do blog writers in different states respond to topics such as "oil price increase during Hurricane Katrina"? Unsupervised discovery of topics and their variations in different locations.

Motivating Example: Sentiment Summary
Unsupervised/semi-supervised discovery of topics and the different sentiments toward the topics.

Research Questions
– Can we model all these problems generally?
– Can we solve these problems with a unified approach?
– How can we bring humans into the loop?

Contextual Text Mining
– Given collections of text with contextual information (meta-data)
– Discover themes/subtopics/topics (interesting word clusters)
– Compute variations of themes over contexts
Applications:
– Summarizing search results
– Federation of text information
– Opinion analysis
– Social network analysis
– Business intelligence
– ...

Context Features of Text (Meta-data)
For a weblog article, for example: author, author's occupation, location, time, communities, source.

Context = Partitioning of Text
Examples of partitions: papers written in 1998; papers in a given venue (WWW, SIGIR, ACL, KDD, SIGMOD); papers written by authors in the US; papers about the Web.

Themes/Topics
Uses of themes:
– Summarize topics/subtopics
– Navigate in a document space
– Retrieve documents
– Segment documents
– ...
Example themes from Hurricane Katrina coverage — θ_1: government 0.3, response 0.2, ...; θ_2: city 0.2, new 0.1, orleans 0.05, ...; θ_k: donate 0.1, relief 0.05, help 0.02, ...; background θ_B: is 0.05, the 0.04, a 0.03, ... These themes tag segments of text such as: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]."

View of Themes: Context-Specific Versions of a Theme
The same themes look different in different contexts:
– Context: Before 1998 (traditional models) — Theme 1 (Retrieval Model): vector, space, TF-IDF, Okapi, LSI, Rocchio, weighting, term, retrieval, ...; Theme 2 (Feedback): feedback, judge, expansion, pseudo, query, ...
– Context: After 1998 (language models) — Theme 1 (Retrieval Model): retrieve, model, relevance, document, query, ...; Theme 2 (Feedback): feedback, language, model, smoothing, query, generation, mixture, estimate, EM, pseudo, ...

Coverage of Themes: Distribution over Themes
Theme coverage can depend on context: the same themes (Background, Oil Price, Government Response, Aid and donation) receive different coverage in, e.g., the Texas context vs. the Louisiana context. Example theme-bearing passages: "Criticism of government response to the hurricane primarily consisted of criticism of its response to ..." (Government Response); "The total shut-in oil production from the Gulf of Mexico ... approximately 24% of the annual production and the shut-in gas production ..." (Oil Price); "Over seventy countries pledged monetary donations or other assistance. ..." (Aid and donation).

General Tasks of Contextual Text Mining
– Theme extraction: extract the globally salient themes — the common information shared over all contexts
– View comparison: compare a theme from different views — analyze the content variation of themes over contexts
– Coverage comparison: compare the theme coverage of different contexts — reveal how closely a theme is associated with a context
– Others: causal analysis, correlation analysis

A General Solution: CPLSA
CPLSA = Contextual Probabilistic Latent Semantic Analysis, an extension of the PLSA model [Hofmann 99] by:
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics
Process of contextual text mining:
– Instantiate CPLSA (context, views, coverage)
– Fit the model to the text data (EM algorithm)
– Compute probabilistic topic patterns

"Generation" Process of CPLSA
Document context: Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+, ...
Themes: government (government 0.3, response 0.2, ...), donation (donate 0.1, relief 0.05, help 0.02, ...), New Orleans (city 0.2, new 0.1, orleans 0.05, ...)
Views: each theme has multiple views (View1, View2, View3)
Theme coverages: e.g., a "Texas" coverage, a "July 2005" coverage, a document-specific coverage, ...
To generate each word of the document: choose a view, choose a coverage, choose a theme θ_i, then draw a word from θ_i — yielding text such as "Criticism of government response to the hurricane primarily consisted of criticism of its response to ...", "The total shut-in oil production from the Gulf of Mexico ... approximately 24% of the annual production and the shut-in gas production ...", "Over seventy countries pledged monetary donations or other assistance. ..."

Probabilistic Model
To generate a document D with context feature set C:
– Choose a view v_i according to the view distribution
– Choose a coverage κ_j according to the coverage distribution
– Choose a theme θ_l according to the coverage κ_j
– Generate a word using the chosen view of that theme
The likelihood of the document collection follows from this process (see below).
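The likelihood formula on the slide is an image; a reconstruction consistent with the generation process above (in the spirit of Mei et al.'s CPLSA formulation) is:

$$\log p(\mathcal{C}) = \sum_{D \in \mathcal{C}} \sum_{w} c(w, D)\, \log \sum_{i} p(v_i \mid D, C) \sum_{j} p(\kappa_j \mid D, C) \sum_{l} \kappa_j(\theta_l)\, p(w \mid \theta_l, v_i)$$

where κ_j(θ_l) is the probability of theme θ_l under coverage κ_j, and p(w | θ_l, v_i) is the view-specific theme distribution.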

Parameter Estimation: EM Algorithm
Interesting patterns obtained from the fitted parameters:
– The theme content variation for each view
– The theme strength variation for each context
A prior from a user can be incorporated using MAP estimation.

Regularization of the Model
Why?
– Generality → high complexity (inefficient, multiple local maxima)
– Real applications have domain constraints/knowledge
Two useful simplifications:
– Fixed coverage: only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)
– Fixed view: only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)
In general:
– Impose priors on model parameters
– Support the whole spectrum from unsupervised to supervised learning

Interpretation of Topics
A multinomial topic model (e.g., term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, ...) is hard to read directly. Labeling pipeline: build a candidate label pool from the collection (context) with an NLP chunker and ngram statistics (e.g., "database system", "clustering algorithm", "r tree", "functional dependency", "iceberg cube", "concurrency control", "index structure"); compute each candidate's relevance score to the topic; re-rank for coverage and discrimination; output a ranked list of labels (e.g., "clustering algorithm; distance measure; ...").

Relevance: the Zero-Order Score
Intuition: prefer phrases covering high-probability words. For a latent topic θ whose high-p(w|θ) words include "clustering", "dimensional", "algorithm", "birch", "shape", ..., the label l_1 = "clustering algorithm" is good, while l_2 = "body shape" is bad, because "body" has low probability under θ.

Relevance: the First-Order Score
Intuition: prefer phrases with a similar context (distribution). Estimate a word distribution P(w|l) for each candidate label l from its context in a reference collection C (e.g., SIGMOD proceedings) and compare it with the topic's P(w|θ): for a topic θ about clustering (clustering, dimension, partition, algorithm, hash, ...), the good label l_1 = "clustering algorithm" satisfies D(θ || l_1) < D(θ || l_2) for the bad label l_2 = "hash join"; the label score Score(l, θ) is based on this comparison.
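The score formulas on these two slides are images; hedged reconstructions in the spirit of Mei, Shen & Zhai's topic-labeling work are:

$$s_0(l, \theta) = \sum_{w \in l} \log p(w \mid \theta) \quad \text{(zero-order)}$$

$$s_1(l, \theta) = \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C) \;\approx\; -D\big(p(\cdot \mid \theta)\,\|\,p(\cdot \mid l)\big) + \text{const} \quad \text{(first-order)}$$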

Sample Results
– Comparative text mining
– Spatiotemporal pattern mining
– Sentiment summary
– Event impact analysis
– Temporal author-topic analysis

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)
– Common theme: united, nations 0.04, ...; killed, month, deaths, ...
– Iraq-specific theme: weapons 0.03, inspections, ...; troops, hoon, sanches, ...
– Afghan-specific theme: northern 0.04, alliance 0.04, kabul 0.03, taleban, aid 0.02, ...; taleban, rumsfeld 0.02, hotel, front, ...
The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars.

Comparing Laptop Reviews
Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]). These word distributions can be used to segment text and add hyperlinks between documents.

Spatiotemporal Patterns in Blog Articles
Query = "Hurricane Katrina"; the slide shows the topics discovered in the results and their spatiotemporal patterns.

Theme Life Cycles for Hurricane Katrina
– "New Orleans" theme: city, orleans, new, louisiana, flood, evacuate, storm, ...
– "Oil Price" theme: price, oil, gas, increase, product, fuel, company, ...
(The slide plots each theme's strength over time.)

Theme Snapshots for Hurricane Katrina
– Week 1: The theme is the strongest along the Gulf of Mexico
– Week 2: The discussion moves towards the north and west
– Week 3: The theme distributes more uniformly over the states
– Week 4: The theme is again strong along the east coast and the Gulf of Mexico
– Week 5: The theme fades out in most states

Theme Life Cycles: KDD
Global theme life cycles in KDD abstracts:
– gene, expressions, probability, microarray, ...
– marketing, customer, model, business, ...
– rules, association, support, ...

Theme Evolution Graph: KDD
Themes in KDD abstracts evolve and split over time (1999 → ...); example theme snapshots along the graph:
– SVM, criteria, classification, linear, ...
– decision, tree, classifier, class, Bayes, ...
– classification, text, unlabeled, document, labeled, learning, ...
– information, web, social, retrieval, distance, networks, ...
– web, classification, features 0.006, topic, ...
– mixture, random, cluster, clustering, variables, ...
– topic, mixture, LDA, semantic, ...

Blog Sentiment Summary (query = "Da Vinci Code")
Facet 1: Movie
– Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard. Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ... Anybody is interested in it?"
– Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role."
– Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."
Facet 2: Book
– Neutral: "I'm reading 'Da Vinci Code' now. ..."
– Positive: "I remembered when i first read the book, I finished the book in two days. Awesome book." / "... So still a good book to past time."
– Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

Results: Sentiment Dynamics
– Facet "the book 'The Da Vinci Code'": bursts during the movie release, positive > negative
– Facet "religious beliefs": bursts during the movie release, negative > positive

Event Impact Analysis: IR Research
Base theme in SIGIR papers: retrieval models (term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, ...). Two events are analyzed: the start of the TREC conferences (1992) and the publication of the paper "A language modeling approach to information retrieval" (1998). Around these events the theme's content shifts among variants such as: {vector, concept, extend, model, space, boolean, function, feedback, ...}, {xml, model, collect, judgment, rank, subtopic, ...}, {probabilist, model, logic, ir, boolean, algebra, estimate, weight, ...}, and {model, language, estimate, parameter, distribution, probable, smooth, markov, likelihood, ...}.

Temporal-Author-Topic Analysis
Global theme: frequent patterns. Theme variants over time for two authors (Author A = Jiawei Han, Author B = Rakesh Agrawal): {pattern, frequent, frequent-pattern, sequential, pattern-growth, constraint, push, ...}, {project, itemset, intertransaction, support, associate, frequent, closet, prefixspan, ...}, {research, next, transaction, panel, technical, article, revolution, innovate, ...}, {close, pattern, sequential, min_support, threshold, top-k, fp-tree, ...}, {index, graph, web, gspan, substructure, gindex, bide, xml, ...}.

Modeling Topical Communities (Mei et al. 08)
Community 1: Information Retrieval; Community 2: Data Mining; Community 3: Machine Learning.

Other Extensions (LDA Extensions)
Many extensions of LDA, mostly by David Blei, Andrew McCallum, and their co-authors. Some examples:
– Hierarchical topic models [Blei et al. 03]
– Modeling annotated data [Blei & Jordan 03]
– Dynamic topic models [Blei & Lafferty 06]
– Pachinko allocation [Li & McCallum 06]
Also, some context-specific extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]

Future Research Directions
Topic models for text mining:
– Evaluation of topic models
– Improve the efficiency of estimation and inference
– Incorporate linguistic knowledge
– Applications in new domains and for new tasks
Text mining in general:
– Combination of NLP-style and DM-style mining algorithms
– Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)
– Interactive mining: incorporate user constraints, support iterative mining, and design and implement mining languages

What You Should Know
– How PLSA works
– How the EM algorithm works in general
– How contextual PLSA can be used to perform many quite different text mining tasks

Roadmap
This lecture: topic models for text mining
Next lecture: next-generation search engines