Context Analysis in Text Mining and Search

Presentation transcript:

Context Analysis in Text Mining and Search Qiaozhu Mei Department of Computer Science University of Illinois at Urbana-Champaign http://sifaka.cs.uiuc.edu/~qmei2, qmei2@uiuc.edu Joint work with ChengXiang Zhai 2008 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Motivating Example: Personalized Search. The query “MSR” could mean: Metropolis Street Racer; Magnetic Stripe Reader; Molten Salt Reactor; Mars Sample Return; …; Mountain Safety Research. The user is actually looking for Microsoft Research…

Motivating Example: Comparing Product Reviews
Input: IBM Laptop Reviews; APPLE Laptop Reviews; DELL Laptop Reviews
Common Themes   “IBM” specific      “APPLE” specific     “DELL” specific
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very Fast, 3-4 GHz   Moderate, 1-2 GHz
Unsupervised discovery of common topics and their variations

Motivating Example: Discovering Topical Trends in Literature. [Figure: temporal theme analysis of SIGIR topics (axes: Time, Topic Strength), and a sample theme evolutionary graph spanning 1980, 1990, 1998, 2003 with themes TF-IDF Retrieval, Language Model, IR Applications, Text Categorization.] Unsupervised discovery of topics and their temporal variations.

Motivating Example: Analyzing Spatial Topic Patterns. How do bloggers in different states respond to topics such as “oil price increase during Hurricane Katrina”? Unsupervised discovery of topics and their variations in different locations.

Motivating Example: Summarizing Sentiments
Query: Dell Laptops
Topic-sentiment summary
Facet 1 (Price)
  positive: “it is the best site and they show Dell coupon code as early as possible”
  negative: “Even though Dell's price is cheaper, we still don't want it.”
  neutral: “mac pro vs. dell precision: a price comparis..”; “DELL is trading at $24.66”
Facet 2 (Battery)
  positive: “One thing I really like about this Dell battery is the Express Charge feature.”
  negative: “my Dell battery sucks”; “Stupid Dell laptop battery”
  neutral: “i still want a free battery from dell..”
……
[Figure: topic-sentiment dynamics (Topic = Price); strength of Positive, Negative, and Neutral sentiment over time.]
Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

Motivating Example: Analyzing Topics on a Social Network. [Figure: a network of researchers, e.g. Gerard Salton and Bruce Croft, with the publications of each; communities labeled Information retrieval, Machine learning, Data mining.] Unsupervised discovery of topics and correlated research communities.

Research Questions. What do these problems have in common? Can we model all these problems generally? Can we solve these problems with a unified approach? How can we bring humans into the loop?

Rest of Talk. Background: Language Models in Text Mining and Retrieval; Definition of context; General methodology to model context; Models, example applications, results; Conclusion and Discussion.

Generative Models of Text. Text as observations: words; tags; links, etc. Use a unified probabilistic model to explain the appearance (generation) of observations. Documents are generated by sampling every observation from such a generative model. Different generation assumptions → different models: Document Language Models; Probabilistic Topic Models (PLSA, LDA, etc.); Hidden Markov Models; …

Multinomial Language Models. A multinomial distribution of words as a text representation, e.g.: retrieval 0.2; information 0.15; model 0.08; query 0.07; language 0.06; feedback 0.03; …… Known as a topic model when there are k of them in text: e.g., semi-supervised learning; boosting; spectral clustering, etc.

Language Models in Information Retrieval (e.g., KL-Div. Method)
Document d (a text mining paper), Doc Language Model (LM) θd : p(w|d): text 4/100 = 0.04; mining 3/100 = 0.03; clustering 1/100 = 0.01; …; data = 0; computing = 0; …
Smoothed Doc LM θd' : p(w|d’): text = 0.039; mining = 0.028; clustering = 0.01; …; data = 0.001; computing = 0.0005; …
Query q (“data mining”), Query Language Model θq : p(w|q): data 1/2 = 0.5; mining 1/2 = 0.5
Updated query model p(w|q’) (?): data = 0.4; mining = 0.4; clustering = 0.1; …
Documents are ranked by a similarity function between the query language model and the smoothed document language model.
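
The ranking idea on this slide can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: it assumes Dirichlet-prior smoothing (the slide does not say which smoothing method is used), and the collection model, the documents, and the value of `mu` are made-up toy values.

```python
import math
from collections import Counter

def doc_lm(tokens, collection_lm, mu=10.0):
    """Dirichlet-smoothed document LM: p(w|d') = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: (counts[w] + mu * p_c) / (n + mu) for w, p_c in collection_lm.items()}

def kl_divergence(p, q):
    """D(p || q), summed over the words with p(w) > 0; q must be non-zero there."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

# Toy collection model and documents (assumed values, for illustration only).
collection_lm = {"text": 0.2, "mining": 0.2, "clustering": 0.2, "data": 0.2, "computing": 0.2}
docs = {
    "text_mining_paper": ["text"] * 4 + ["mining"] * 3 + ["clustering"],
    "computing_paper": ["computing"] * 6 + ["text"] * 2,
}
query_lm = {"data": 0.5, "mining": 0.5}  # p(w|q) for the query "data mining"

# Rank documents by negative KL divergence between query LM and smoothed doc LM.
scores = {name: -kl_divergence(query_lm, doc_lm(toks, collection_lm))
          for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Smoothing is what keeps the divergence finite here: without it, "data" would have probability 0 in both documents and the score would be undefined.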

Probabilistic Topic Models for Text Mining. Text Collections → Probabilistic Topic Modeling → Topic models (multinomial distributions), e.g. {term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …} and {web 0.21, search 0.10, link 0.08, graph 0.05, …}. Models: PLSA [Hofmann 99]; LDA [Blei et al. 03]; Author-Topic [Steyvers et al. 04]; CPLSA [Mei & Zhai 06]; Pachinko allocation [Li & McCallum 06]; CTM [Blei et al. 06]; … Applications: subtopic discovery; opinion comparison; summarization; topical pattern analysis; passage segmentation; …

Importance of Context. Science in the year 2000 and science in the year 1500: are we still working on the same topics? For a computer scientist and a gardener: does “tree, root, prune” mean the same? “Football” means soccer in Europe. What about in the US? Context affects topics!

Context Features of Text (Meta-data). Example: a Weblog article (“Compared with other kinds of data, Weblogs have some interesting special characteristics, which make them interesting to exploit for text mining.”) annotated with its context features: communities; author; source; location; time; author’s occupation.

Context = Partitioning of Text. [Figure: a collection of papers laid out by year (1998, 1999, ……, 2005, 2006) and by venue (WWW, SIGIR, ACL, KDD, SIGMOD). Example partitions: papers written in 1998; papers about the Web; papers written by authors in the US.]

Rich Context Information in Text. News articles: time, publisher, etc. Blogs: time, location, author, … Scientific literature: author, publication year, conference, citations, … Query logs: time, IP address, user, clicks, … Customer reviews: product, source, time, sentiments, … Emails: sender, receiver, time, thread, … Web pages: domain, time, click rate, etc. More? Entity-relations, social networks, ……

Categories of Context. Some partitions of text are explicit → explicit context: time; location; author; conference; user; IP; etc. (similar to metadata). Some partitions are implicit → implicit context: sentiments; missions; goals; intents. Some partitions are at document level; some are at a finer granularity: context of a word; an entity; a pattern; a query, etc.; sentences; sliding windows; adjacent words; etc.

Context Analysis. Use context to infer semantics: annotating frequent patterns; labeling of topic models. Use context to provide targeted service: personalized search; intent-based search; etc. Compare contextual patterns of topics: evolutionary topic patterns; spatiotemporal topic patterns; topic-sentiment patterns; etc. Use context to help other tasks: social network analysis; impact summarization; etc.

General Methodology to Model Context. Context → Generative Model. Observations in the same context are generated with a unified model. Observations in different contexts are generated with different models. Observations in similar contexts are generated with similar models. Text is generated with a mixture of such generative models. For each case: example task; model; sample results.

Model a unique context with a unified model (Generation)

Probabilistic Latent Semantic Analysis (Hofmann ’99). Documents about “Hurricane Katrina”. Topics θ1…k with word distributions P(w|θj): government {government 0.3, response 0.2, …}; donation {donate 0.1, relief 0.05, help 0.02, …}; New Orleans {city 0.2, new 0.1, orleans 0.05, …}. A document d: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …” Generation: choose a topic according to πd : P(θi|d), then draw a word from θi (e.g. government, response, donate, help, aid, Orleans, new). [Plate diagram: variables Wd,n, Zd,n, πd, θk; plates N, D, K.]
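
The two-step generation above (choose a topic from πd, draw a word from θi) is usually fitted with EM. The following is a minimal sketch of that procedure on a toy term-count matrix; the function names, the toy counts, and the small constant added to denominators are assumptions for illustration, not code from the talk.

```python
import numpy as np

def plsa(counts, k, iters=50, seed=0):
    """Fit PLSA by EM. counts: (D, V) term counts. Returns pi (D, k) and theta (k, V)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    pi = rng.random((D, k))
    pi /= pi.sum(axis=1, keepdims=True)        # pi[d, j] = P(theta_j | d)
    theta = rng.random((k, V))
    theta /= theta.sum(axis=1, keepdims=True)  # theta[j, w] = P(w | theta_j)
    for _ in range(iters):
        # E-step: posterior over topics for each (document, word) pair.
        post = pi[:, :, None] * theta[None, :, :]          # shape (D, k, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * post               # shape (D, k, V)
        pi = expected.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True) + 1e-12
        theta = expected.sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
    return pi, theta

def log_likelihood(counts, pi, theta):
    """Sum over d, w of c(w, d) * log sum_j P(theta_j | d) P(w | theta_j)."""
    return float((counts * np.log(pi @ theta + 1e-12)).sum())

# Toy corpus: two documents use words 0-2, two use words 3-5 (made-up counts).
counts = np.array([[5, 4, 1, 0, 0, 0],
                   [4, 5, 0, 1, 0, 0],
                   [0, 0, 1, 5, 4, 1],
                   [0, 1, 0, 4, 5, 2]], dtype=float)
pi, theta = plsa(counts, k=2)
```

Note that `pi` has one row per training document, which is the source of the "parameters grow with the training set" concern raised on the next slide.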

Latent Dirichlet Allocation (Blei ’03). PLSA: no natural way to assign probability to an unseen document. The number of parameters grows linearly with the size of the training set → overfits data. Not a fully generative model. LDA solves these problems, but we still need to infer p(topic|d) and p(w|topic). Parameter estimation uses Gibbs sampling or variational inference.
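
As one illustration of the Gibbs sampling route mentioned above, here is a compact collapsed Gibbs sampler for LDA. It is a sketch under stated assumptions: symmetric hyperparameters `alpha` and `beta` with arbitrary values, documents given as lists of word ids, and a toy corpus invented for the example. It is not the estimation code behind any result in this talk.

```python
import numpy as np

def lda_gibbs(docs, V, k, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), k))  # topic counts per document
    nkw = np.zeros((k, V))          # word counts per topic
    nk = np.zeros(k)                # total word count per topic
    z = []
    for d, doc in enumerate(docs):  # random initial topic assignments
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                # Remove the current assignment, then resample it.
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(k, p=p / p.sum())
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    doc_topic = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # p(topic | d)
    topic_word = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)    # p(w | topic)
    return doc_topic, topic_word

# Toy corpus of word ids (made up): two documents per vocabulary half.
docs = [[0, 1, 0, 1, 2], [1, 0, 0, 2], [3, 4, 5, 4], [5, 3, 4, 3, 4]]
doc_topic, topic_word = lda_gibbs(docs, V=6, k=2)
```

The two returned matrices are point estimates of exactly the quantities the slide says must be inferred, p(topic|d) and p(w|topic).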

Example: Topics in Science (D. Blei 05)

Label a Multinomial Topic Model. A good label is: semantically close (relevance); understandable (phrases?); of high coverage inside the topic; discriminative across topics. Example topic (Mei and Zhai 06: a topic in SIGIR): term 0.1599; relevance 0.0752; weight 0.0660; feedback 0.0372; independence 0.0311; model 0.0310; frequent 0.0233; probabilistic 0.0188; document 0.0173; … Example labels: Retrieval models; iPod Nano; じょうほうけんさく; Pseudo-feedback; Information Retrieval.

Automatic Labeling of Topics
Input: a collection (e.g., SIGIR) and a topic: term 0.16; relevance 0.07; weight 0.07; feedback 0.04; independence 0.03; model 0.03; …
Step 1, candidate label pool (NLP chunker; n-gram statistics): information retrieval, retrieval model, index structure, relevance feedback, …
Step 2, relevance score: information retrieval 0.26; retrieval models 0.19; IR models 0.17; pseudo feedback 0.06; ……
Step 3, discrimination (other topics shown: filtering 0.21, collaborative 0.15, …; trec 0.18, evaluation 0.10, …): information retriev. 0.26 0.01; retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; ……
Step 4, coverage: retrieval models 0.20; IR models 0.18 0.02; pseudo feedback 0.09; ……; information retrieval 0.01

Label Relevance: Context Comparison. Intuition: expect the label with similar context (distribution). Topic θ, P(w|θ), over the words: clustering, dimension, partition, algorithm, hash, … Good label (l1) “clustering algorithm”, p(w | clustering algorithm), over the words: clustering, hash, dimension, algorithm, partition, … l2: “hash join”, p(w | hash join), over the words: clustering, hash, dimension, key, algorithm, … Context of “hash join” in the collection: “key …hash join … code …hash table …search …hash join… map key…hash …algorithm…key …hash…key table…join…” Score(l, θ) = D(θ || l)
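
A small sketch of the context comparison: each candidate label is represented by the word distribution of its context, and labels are ordered by their divergence from the topic, with lower divergence read as a closer match (this reading follows the slide's intuition and is an assumption of the sketch). All probabilities and the additive smoothing constant `eps` below are invented for illustration; the slide shows only bar charts without values.

```python
import math

def kl(p, q, eps=1e-6):
    """D(p || q) with q additively smoothed over the joint vocabulary."""
    vocab = set(p) | set(q)
    norm = 1.0 + eps * len(vocab)
    return sum(pw * math.log(pw / ((q.get(w, 0.0) + eps) / norm))
               for w, pw in p.items() if pw > 0)

# Hypothetical topic distribution P(w | theta).
topic = {"clustering": 0.3, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.2, "hash": 0.1}

# Hypothetical context distributions p(w | label) for two candidate labels.
label_context = {
    "clustering algorithm": {"clustering": 0.35, "dimension": 0.15, "partition": 0.15,
                             "algorithm": 0.25, "hash": 0.10},
    "hash join": {"clustering": 0.02, "hash": 0.45, "dimension": 0.03,
                  "key": 0.35, "algorithm": 0.15},
}

divergence = {label: kl(topic, ctx) for label, ctx in label_context.items()}
best_label = min(divergence, key=divergence.get)
```

With these toy numbers “clustering algorithm” is preferred, because “hash join” puts almost no mass on words such as “partition” that the topic uses heavily.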

Results: Sample Topic Labels north 0.02 case 0.01 trial 0.01 iran 0.01 documents 0.01 walsh 0.009 reagan 0.009 charges 0.007 the, of, a, and, to, data, > 0.02 … clustering 0.02 time 0.01 clusters 0.01 databases 0.01 large 0.01 performance 0.01 quality 0.005 iran contra … clustering algorithm clustering structure … tree 0.09 trees 0.08 spatial 0.08 b 0.05 r 0.04 disk 0.02 array 0.01 cache 0.01 r tree b tree … large data, data quality, high data, data application, … indexing methods 2008 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Model different contexts with different models (Discrimination, Comparison)

Example: Finding Evolutionary Patterns of Topics. Content Variations over Contexts. KDD, timeline T: 1999, 2000, 2001, 2002, 2003, 2004. Themes: {web 0.009, classification 0.007, features 0.006, topic 0.005, …}; {SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …}; {mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …}; {topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …}; {decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …}; {classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …}; {information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …}; …

Example: Finding Evolutionary Patterns of Topics (II). Strength Variations over Contexts. [Figure from (Mei ’05).]

View of Topics: Context-Specific Version of Views. One context → one view. A document selects from a mix of views. Context 1: 1998 ~ 2006 (e.g. After “Language Modeling”). Context 2: 1977 ~ 1998 (i.e. Before “Language Modeling”). Topic 1: Retrieval Model; Topic 2: Feedback. Words shown in the views: language model smoothing query generation feedback mixture estimate EM pseudo vector Rocchio weighting feedback term space TF-IDF Okapi LSI retrieval; retrieve model relevance document query feedback judge expansion pseudo query.

Coverage of Topics: Distribution over Topics. A coverage of topics: a (strength) distribution over the topics. One context → one coverage. A document selects from a mix of multiple coverages. Example document: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …” [Figure: coverage over the topics Oil Price, Government Response, Aid and donation, and Background, shown for Context: Texas and for Context: Louisiana.]

A General Solution: CPLSA. CPLSA = Contextual Probabilistic Latent Semantic Analysis. An extension of the PLSA model ([Hofmann 99]) by: introducing context variables; modeling views of topics; modeling coverage variations of topics. Process of contextual text mining: instantiate CPLSA (context, views, coverage); fit the model to text data (EM algorithm); compare a topic from different views; compute strength dynamics of topics from coverages; compute other probabilistic topic patterns.

The “Generation” Process. Context of Document: Time = July 2005; Location = Texas; Author = Eric Brill; Occup. = Sociologist; Age = 45+; … Choose a view: View1, View2, View3 (contexts shown: Texas, July 2005, sociologist). Choose a coverage: topic coverages for Texas, July 2005, sociologist, ……, and the document. Choose a theme from the topics: government {government 0.3, response 0.2, …}; donation {donate 0.1, relief 0.05, help 0.02, …}; New Orleans {city 0.2, new 0.1, orleans 0.05, …}. Draw a word from θi: government, response, donate, help, aid, Orleans, new. Document: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”
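
The four choices on this slide (view, coverage, theme, word) can be simulated directly. The sketch below is illustrative only: the vocabulary, the two views, the two coverages, and every probability are assumed toy values, and it makes all four choices independently for each word.

```python
import numpy as np

vocab = ["government", "response", "donate", "relief", "city", "orleans"]

# Hypothetical view and coverage distributions for one document and its context.
view_dist = np.array([0.7, 0.3])       # views: 0 = "Texas", 1 = "sociologist"
coverage_dist = np.array([0.5, 0.5])   # coverages: 0 = "Texas", 1 = "July 2005"

# coverages[j, l]: probability of theme l under coverage j
# (themes: 0 = government, 1 = donation, 2 = New Orleans).
coverages = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.2, 0.6]])

# views[i, l, w]: word distribution of theme l as seen from view i.
views = np.array([
    [[0.50, 0.30, 0.05, 0.05, 0.05, 0.05],
     [0.05, 0.05, 0.50, 0.30, 0.05, 0.05],
     [0.05, 0.05, 0.05, 0.05, 0.40, 0.40]],
    [[0.30, 0.50, 0.05, 0.05, 0.05, 0.05],
     [0.05, 0.05, 0.30, 0.50, 0.05, 0.05],
     [0.05, 0.05, 0.05, 0.05, 0.50, 0.30]],
])

def generate_document(length, rng):
    """Sample words by choosing a view, a coverage, a theme, then a word."""
    words = []
    for _ in range(length):
        i = rng.choice(len(view_dist), p=view_dist)          # choose a view
        j = rng.choice(len(coverage_dist), p=coverage_dist)  # choose a coverage
        l = rng.choice(coverages.shape[1], p=coverages[j])   # choose a theme
        w = rng.choice(len(vocab), p=views[i, l])            # draw a word
        words.append(vocab[w])
    return words

document = generate_document(20, np.random.default_rng(0))
```

Fitting CPLSA runs this process in reverse: given the words and the context, EM estimates the views and coverages that most plausibly produced them.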

An Intuitive Example. Two topics: web search; machine learning. Coverage: I am writing a WWW paper → I will cover more about “web search” instead of “machine learning”. View: But of course I have my own taste. I am from a search engine company, so when I write about “web search”, I will focus on “search engine” and “online advertisements”…

The Probabilistic Model. A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will (1) choose a view vi according to the view distribution; (2) choose a coverage κj according to the coverage distribution; (3) choose a theme according to the coverage κj; (4) generate a word using the chosen theme. The likelihood of the document collection is: [equation not preserved in the transcript]
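
The likelihood formula did not survive in this transcript. A hedged reconstruction that follows the four steps above is given here; the notation is an assumption made for this sketch: 𝒟 is the collection of (document, context) pairs, V the vocabulary, c(w, D) the count of word w in D, n, m, and k the numbers of views, coverages, and themes, and θ_il the word distribution of theme l under view v_i.

```latex
\log p(\mathcal{D}) =
  \sum_{(D,C) \in \mathcal{D}} \; \sum_{w \in V} c(w, D)\,
  \log \left(
    \sum_{i=1}^{n} p(v_i \mid D, C)
    \sum_{j=1}^{m} p(\kappa_j \mid D, C)
    \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il})
  \right)
```

Each factor maps to one step: p(v_i | D, C) is the view choice, p(κ_j | D, C) the coverage choice, p(l | κ_j) the theme choice under that coverage, and p(w | θ_il) the word draw.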

Example results: Query Log Analysis. Context = days of the week. Queries & clicks: more queries/clicks on weekdays. Search difficulty: more difficult to predict on weekends.

Query Log Analysis. Context = type of query. Business queries: clear day-week pattern; weekdays more frequent than weekends. Consumer queries: no clear day-week pattern; weekends are comparable to, or even more frequent than, weekdays.

Bursting Topics in SIGMOD: Context = Time (Years)

Spatiotemporal Text Mining: Context = Time & Location. Theme: Government Response in Hurricane Katrina. Week 1: The theme is the strongest along the Gulf of Mexico. Week 2: The discussion moves towards the north and west. Week 3: The theme distributes more uniformly over the states. Week 4: The theme is again strong along the east coast and the Gulf of Mexico. Week 5: The theme fades out in most states.

Faceted Opinions Context = Sentiments Neutral Positive Negative Topic 1: Movie ... Ron Howards selection of Tom Hanks to play Robert Langdon. Tom Hanks stars in the movie,who can be mad at that? But the movie might get delayed, and even killed off if he loses. Directed by: Ron Howard Writing credits: Akiva Goldsman ... Tom Hanks, who is my favorite movie star act the leading role. protesting ... will lose your faith by ... watching the movie. After watching the movie I went online and some research on ... Anybody is interested in it? ... so sick of people making such a big deal about a FICTION book and movie. Topic 2: Book I remembered when i first read the book, I finished the book in two days. Awesome book. I’m reading “Da Vinci Code” now. … So still a good book to past time. This controversy book cause lots conflict in west society. Can click on the cells to get to the original articles.. 2008 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Sentiment Dynamics. Context = Time & Sentiments. “the da vinci code”. Facet: the book “the da vinci code” (bursts during the movie, Pos > Neg). Facet: the impact on religious beliefs (bursts during the movie, Neg > Pos).

Event Impact Analysis: IR Research xml 0.0678 email 0.0197 model 0.0191 collect 0.0187 judgment 0.0102 rank 0.0097 subtopic 0.0079 … vector 0.0514 concept 0.0298 extend 0.0297 model 0.0291 space 0.0236 boolean 0.0151 function 0.0123 feedback 0.0077 … Publication of the paper “A language modeling approach to information retrieval” Starting of the TREC conferences year 1992 term 0.1599 relevance 0.0752 weight 0.0660 feedback 0.0372 independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … Theme: retrieval models SIGIR papers 1998 model 0.1687 language 0.0753 estimate 0.0520 parameter 0.0281 distribution 0.0268 probable 0.0205 smooth 0.0198 markov 0.0137 likelihood 0.0059 … probabilist 0.0778 model 0.0432 logic 0.0404 ir 0.0338 boolean 0.0281 algebra 0.0200 estimate 0.0119 weight 0.0111 … 2008 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Model similar contexts with similar models (Smoothing, Regularization)

Personalization with Backoff. Ambiguous query: MSG (Madison Square Garden or Monosodium Glutamate). Disambiguate based on the user’s prior clicks. We don’t have enough data for everyone! Back off to classes of users. Proof of concept: classes defined by IP addresses. Better: market segmentation (demographics); collaborative filtering (other users who click like me).

Context = IP
156.111.188.243: Full personalization: every context has a different model: sparse data!
156.111.188.*, 156.111.*.*, 156.*.*.*: Personalization with backoff: similar contexts have similar models
*.*.*.*: No personalization: all contexts share the same model

Backing Off by IP. λs are estimated with EM and cross-validation (CV). A little bit of personalization is better than too much (sparse data) or too little (missed opportunity). λ4: weight for the first 4 bytes of the IP; λ3: weight for the first 3 bytes of the IP; λ2: weight for the first 2 bytes of the IP; ……
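
The backoff idea can be sketched as a linear interpolation of click models estimated at each IP-prefix level. Everything concrete below is an assumption for illustration: the click log, the target labels, and the fixed λ values (the slide says the real weights are estimated with EM and cross-validation, which this sketch does not do).

```python
from collections import Counter, defaultdict

def prefixes(ip):
    """Backoff classes for an IP, from the full address down to '*.*.*.*'."""
    parts = ip.split(".")
    return [".".join(parts[:n] + ["*"] * (4 - n)) for n in range(4, -1, -1)]

def train(log):
    """Count clicked targets for every backoff class of every logged IP."""
    counts = defaultdict(Counter)
    for ip, target in log:
        for prefix in prefixes(ip):
            counts[prefix][target] += 1
    return counts

def p_backoff(target, ip, counts, lambdas):
    """Interpolate per-level estimates; lambdas are ordered lambda4 ... lambda0."""
    total = 0.0
    for lam, prefix in zip(lambdas, prefixes(ip)):
        n = sum(counts[prefix].values())
        if n:  # levels with no data contribute nothing
            total += lam * counts[prefix][target] / n
    return total

# Hypothetical click log for the ambiguous query "MSG": (IP, clicked meaning).
log = [
    ("156.111.188.243", "madison square garden"),
    ("156.111.188.243", "madison square garden"),
    ("156.111.188.7", "madison square garden"),
    ("156.111.20.5", "monosodium glutamate"),
    ("10.0.0.1", "monosodium glutamate"),
    ("10.0.0.2", "monosodium glutamate"),
]
counts = train(log)
lambdas = [0.3, 0.3, 0.2, 0.1, 0.1]  # assumed weights, not estimated values

new_ip = "156.111.188.99"  # never seen, but its /24 neighbours were
p_garden = p_backoff("madison square garden", new_ip, counts, lambdas)
p_glutamate = p_backoff("monosodium glutamate", new_ip, counts, lambdas)
```

Even though `new_ip` has no history of its own, the neighbouring addresses tilt the estimate towards Madison Square Garden, which is the point of backing off instead of personalizing fully or not at all.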

Social Network as Correlated Contexts. Linked contexts are similar to each other. [Figure: a network whose nodes carry paper titles such as “Optimization of Relevance Feedback Weights”, “Parallel Architecture in IR ...”, “Predicting query performance …”, “A Language Modeling Approach to Information Retrieval ...”.]

Social Network Context for Topic Modeling. Context = author. E.g. coauthor network: coauthors = similar contexts. Intuition: I work on similar topics to my neighbors. Smoothed topic distributions over contexts.

Topic Modeling with Network Regularization (NetPLSA). Basic assumption (e.g., co-author graph): related authors work on similar topics. The objective combines a PLSA likelihood term (which involves the topic distribution of a document) with a graph harmonic regularizer (a generalization of [Zhu ’03]) that penalizes the difference of topic distribution on neighbor vertices, scaled by the importance (weight) of an edge. A parameter controls the tradeoff between topic and smoothness.
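
The regularizer described above can be written down in a few lines. This is a sketch under assumptions: the exact form (half the weighted squared difference of topic distributions over edges), the sign convention of the combined objective, and the name `lam` for the tradeoff parameter are reconstructions, and the toy graph and numbers are invented.

```python
import numpy as np

def harmonic_regularizer(doc_topic, edges):
    """0.5 * sum over edges (u, v, weight) of weight * ||p(.|u) - p(.|v)||^2."""
    return 0.5 * sum(weight * float(np.sum((doc_topic[u] - doc_topic[v]) ** 2))
                     for u, v, weight in edges)

def regularized_objective(log_lik, doc_topic, edges, lam=0.5):
    """Quantity to minimise: -(1 - lam) * log-likelihood + lam * regularizer."""
    return -(1.0 - lam) * log_lik + lam * harmonic_regularizer(doc_topic, edges)

# Toy example: three vertices (e.g. authors), two topics, two weighted edges.
doc_topic = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.1, 0.9]])
edges = [(0, 1, 1.0), (1, 2, 2.0)]

penalty = harmonic_regularizer(doc_topic, edges)
smooth_penalty = harmonic_regularizer(np.full((3, 2), 0.5), edges)
```

Vertices 1 and 2 are strongly linked yet have very different topic distributions, so that edge dominates the penalty; identical distributions on all vertices give a penalty of zero.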

Topical Communities with PLSA
Topic 1 (?)       Topic 2 (?)      Topic 3 (?)     Topic 4 (?)
term 0.02         peer 0.02        visual 0.02     interface 0.02
question 0.02     patterns 0.01    analog 0.02     towards 0.02
protein 0.01      mining 0.01      neurons 0.02    browsing 0.02
training 0.01     clusters 0.01    vlsi 0.01       xml 0.01
weighting 0.01    stream 0.01      motion 0.01     generation 0.01
multiple 0.01     frequent 0.01    chip 0.01       design 0.01
recognition 0.01  e 0.01           natural 0.01    engine 0.01
relations 0.01    page 0.01        cortex 0.01     service 0.01
library 0.01      gene 0.01        spike 0.01      social 0.01
Noisy community assignment

Topical Communities with NetPLSA
Information Retrieval  Data mining        Machine learning   Web
retrieval 0.13         mining 0.11        neural 0.06        web 0.05
information 0.05       data 0.06          learning 0.02      services 0.03
document 0.03          discovery 0.03     networks 0.02      semantic 0.03
query 0.03             databases 0.02     recognition 0.02   services 0.03
text 0.03              rules 0.02         analog 0.01        peer 0.02
search 0.03            association 0.02   vlsi 0.01          ontologies 0.02
evaluation 0.02        patterns 0.02      neurons 0.01       rdf 0.02
user 0.02              frequent 0.01      gaussian 0.01      management 0.01
relevance 0.02         streams 0.01       network 0.01       ontology 0.01
Coherent community assignment

Smoothed Topic Map. Map a topic on the network (e.g., using p(θ|a)). Topic: “information retrieval”. [Figure: PLSA vs. NetPLSA maps; legend: Core contributors, Intermediate, Irrelevant.]

Smoothed Topic Map: The Windy States. Blog articles: “weather”. US states network. Topic: “windy”. [Figure: maps for PLSA, NetPLSA, and a real reference.]

Related Work. Specific contextual text mining problems: multi-collection comparative mining (e.g., [Zhai et al. 04]); temporal theme pattern (e.g., [Mei et al. 05], [Blei et al. 06], [Wang et al. 06]); spatiotemporal theme analysis (e.g., [Mei et al. 06], [Wang et al. 07]); author-topic analysis (e.g., [Steyvers et al. 04], [Zhou et al. 06]); … Probabilistic topic models: probabilistic latent semantic analysis (PLSA) (e.g., [Hofmann 99]); latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03]); many extensions (e.g., [Blei et al. 05], [Li and McCallum 06]). 2007 © ChengXiang Zhai LLNL, Aug 15, 2007

Conclusions. Context analysis in text mining and search. General methodology to model context in text: a unified generative model for observations in the same context; different models for different contexts; similar models for similar contexts. Generation → discrimination → smoothing. Many applications.

Discussion: Context in Search. Not all contexts are useful, e.g. personalized search vs. search by time of day. How can we know which contexts are more useful? Many contexts are useful, e.g., personalized search; task-based search; localized search. How can we combine them? Can we do better than market segmentation? Back off to users who search like me: Collaborative Search. But who searches like you?

References
CPLSA: Q. Mei, C. Zhai, A Mixture Model for Contextual Text Mining, In Proceedings of KDD'06.
NetPLSA: Q. Mei, D. Cai, D. Zhang, C. Zhai, Topic Modeling with Network Regularization, In Proceedings of WWW'08.
Labeling: Q. Mei, X. Shen, C. Zhai, Automatic Labeling of Multinomial Topic Models, In Proceedings of KDD'07.
Personalization: Q. Mei, K. Church, Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? In Proceedings of WSDM'08.
Applications:
Q. Mei, C. Zhai, Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining, In Proceedings of KDD'05.
Q. Mei, C. Liu, H. Su, C. Zhai, A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs, In Proceedings of WWW'06.
Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai, Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, In Proceedings of WWW'07.

The End. Thank You!

Experiments. Bibliography data and coauthor networks: DBLP (text = titles; network = coauthors); four conferences (expect 4 topics): SIGIR, KDD, NIPS, WWW. Blog articles and geographic network: blogs from spaces.live.com containing topical words, e.g. “weather”; network: US states (adjacent states).

Coherent Topical Communities
Semantics of community: “machine learning (NIPS)”
  PLSA: visual 0.02; analog 0.02; neurons 0.02; vlsi 0.01; motion 0.01; chip 0.01; natural 0.01; cortex 0.01; spike 0.01
  NetPLSA: neural 0.06; learning 0.02; networks 0.02; recognition 0.02; analog 0.01; vlsi 0.01; neurons 0.01; gaussian 0.01; network 0.01
Semantics of community: “Data Mining (KDD)”
  PLSA: peer 0.02; patterns 0.01; mining 0.01; clusters 0.01; stream 0.01; frequent 0.01; e 0.01; page 0.01; gene 0.01
  NetPLSA: mining 0.11; data 0.06; discovery 0.03; databases 0.02; rules 0.02; association 0.02; patterns 0.02; frequent 0.01; streams 0.01