2009 © Qiaozhu Mei University of Illinois at Urbana-Champaign Towards Contextual Text Mining Qiaozhu Mei University of Illinois at Urbana-Champaign.

Presentation transcript:

Towards Contextual Text Mining
Qiaozhu Mei
University of Illinois at Urbana-Champaign

Knowledge Discovery from Text
[Figure: Text Mining System]

Overload of Text Content

Content Type | Amount / day
Published content | 3-4G
Professional web content | ~2G
User generated content | 8-10G
Private text content | ~3T

Source: Ramakrishnan and Tomkins 2007

Challenge of Mining Text
[Figure: volumes of several text sources; source labels not recoverable: ~750k /day, ~3M /day, ~150k /day, 1M, 10B, 6M, ~100B]
Where to Start? Where to Go? Gold?

Context - "Situation of Text"
Context types: Author, Time, Source, Author's occupation, Language, Social Network, Location, Sentiment
Example message: Check Lap Kok, HK; self designer, publisher, editor …; 3:53 AM Jan 28th; from Ping.fm

Rich Context Information
[Figure: context statistics of several text sources; source labels not recoverable]
102M blogs; 100M users; > 1M groups; 8M contributors; 100+ languages; 73 years; ~400k authors; ~4k sources; ~1B queries (per hour?); ~1B users; ~3M msgs /day; ~5M users; 5M users; 500M URLs

Text + Context = ?
Text + Context = Guidance ("I Have A Guide!")

Query Log + User = Personalized Search
Query: MSR
Wikipedia definitions: Modern System Research; Medical simulation; Montessori School of Raleigh; Mountain Safety Research; MSR Racing; Metropolis Street Racer; Molten salt reactor; Mars sample return; Magnetic Stripe Reader
How much can personalization help?
If you know me, you should give me Microsoft Research…

Customer Reviews + Brand = Comparative Product Summary
Can we compare products?
Inputs: IBM Laptop Reviews, APPLE Laptop Reviews, DELL Laptop Reviews

Common Themes | IBM | APPLE | DELL
Battery Life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs
Hard disk | Large, GB | Small, 5-10 GB | Medium, GB
Speed | Slow, MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz

Scientific Literature + Time = Topic Trends
What's hot in literature?
[Figure: Hot Topics in SIGMOD]

Blogs + Time & Location = Spatiotemporal Topic Diffusion
How does discussion spread?
[Figure: maps of topic spread, including a panel "One Week Later"]

Blogs + Sentiment = Faceted Opinion Summary
What is good and what is bad?
Example: The Da Vinci Code
- "Tom Hanks, who is my favorite movie star act the leading role."
- "protesting... will lose your faith by watching the movie."
- "a good book to past time."
- "... so sick of people making such a big deal about a fiction book"

Publications + Social Network = Topical Community
Who works together on what?
[Figure: Coauthor Network with communities Information retrieval, Machine learning, Data mining]

A General Solution for All?
- Query log + User = Personalized Search
- Scientific Literature + Time = Topic Trends
- Review + Brand = Comparative Opinion
- Blog + Time & Location = Spatiotemporal Topic Diffusion
- Blog + Sentiment = Faceted Opinion Summary
- Publications + Social Network = Topical Community
- …
Text + Context = Contextual Text Mining

Roadmap
- Generative Model of Text
- Integrating Contexts in Text Models
  - Modeling Simple Context
  - Modeling Implicit Context
  - Modeling Complex Context
- Applications of Contextual Text Mining

Generative Model of Text
Text: the.. movie.. harry.. potter is.. based.. on.. j..k..rowling
Model (word distribution): the, is, harry, potter, movie, plot, time, rowling
Generation: model → text
Inference, Estimation: text → model

Text as a Mixture of Topics
Topic (Theme) = the subject of a discourse
Document: "Using machine learning for web search"
K topics, each a word distribution:
- Web Search: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, …
- Machine Learning: learning 0.18, model 0.14, training 0.10, kernel 0.09, inference 0.07, …
- Data Mining: mining 0.21, data 0.13, pattern 0.10, clustering 0.05, network 0.04, …
- Database: …

Probabilistic Topic Models (Hofmann '99, Blei et al. '03, …)
- Topic 1 (Apple iPod): ipod, nano, music, download, apple; e.g., ipod 0.15
- Topic 2 (Harry Potter): movie, harry, potter, actress, music; e.g., harry 0.09
Document: "I downloaded the music of the movie harry potter to my ipod nano"

Parameter Estimation
- Maximum Likelihood Estimation (MLE): parameter estimation using the EM algorithm
  - Gibbs sampling, variational inference, expectation propagation
- EM loop: guess the affiliation of each word (which topic generated it), then estimate the params from the resulting pseudo-counts, and repeat
Topics to estimate (word probabilities unknown, shown as "?"): ipod, nano, music, download, apple; movie, harry, potter, actress, music
Document: "I downloaded the music of the movie harry potter to my ipod nano"
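The "guess the affiliation / estimate the params" loop can be sketched as EM for a PLSA-style mixture. This is only an illustration under stated assumptions: the function names `plsa_em` and `log_likelihood`, the random initialization, the number of iterations, and the three toy documents (built from words on the slide) are all hypothetical and not taken from the talk.

```python
import math
import random
from collections import Counter

def plsa_em(docs, num_topics, num_iters=30, seed=0):
    """EM for a PLSA-style topic mixture over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]  # c(w, d)

    def random_dist(keys):
        raw = {k: rng.random() + 0.1 for k in keys}
        total = sum(raw.values())
        return {k: v / total for k, v in raw.items()}

    p_w_z = [random_dist(vocab) for _ in range(num_topics)]  # P(w | z)
    p_z_d = [random_dist(range(num_topics)) for _ in docs]   # P(z | d)

    for _ in range(num_iters):
        word_pseudo = [dict.fromkeys(vocab, 0.0) for _ in range(num_topics)]
        topic_pseudo = [[0.0] * num_topics for _ in docs]
        for d, doc_counts in enumerate(counts):
            for w, c in doc_counts.items():
                # E-step ("guess the affiliation"): P(z | d, w)
                scores = [p_z_d[d][z] * p_w_z[z][w] for z in range(num_topics)]
                total = sum(scores)
                for z in range(num_topics):
                    frac = c * scores[z] / total  # fractional pseudo-count
                    word_pseudo[z][w] += frac
                    topic_pseudo[d][z] += frac
        # M-step ("estimate the params"): renormalize the pseudo-counts
        for z in range(num_topics):
            total = sum(word_pseudo[z].values())
            p_w_z[z] = {w: v / total for w, v in word_pseudo[z].items()}
        for d in range(len(docs)):
            total = sum(topic_pseudo[d])
            p_z_d[d] = {z: v / total for z, v in enumerate(topic_pseudo[d])}
    return p_z_d, p_w_z

def log_likelihood(docs, p_z_d, p_w_z):
    """Sum over tokens of log sum_z P(z | d) P(w | z)."""
    total = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            total += math.log(sum(p_z_d[d][z] * p_w_z[z][w] for z in p_z_d[d]))
    return total

# Hypothetical toy corpus.
docs = [
    "ipod nano music download apple".split(),
    "movie harry potter actress music".split(),
    "i downloaded the music of the movie harry potter to my ipod nano".split(),
]
p_z_d, p_w_z = plsa_em(docs, num_topics=2)
```

EM never decreases the data likelihood, so comparing `log_likelihood` after 1 and after 30 iterations (same seed) is a quick sanity check on the implementation.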

How Context Affects Topics
- Topics in science literature: 16th century vs. 21st century
- When do a computer scientist and a gardener write about "tree, root, prune"?
- In Europe, "football" appears a lot in a soccer report. What about in the US?
Text is generated according to the context!
"Context of Situation" - B. Malinowski 1923

Existing Work
- PLSA (Hofmann '99), LDA (Blei et al. '03), CTM (Blei et al. '06), PAM (Li and McCallum '06)
  - Don't incorporate contexts
- Author: Author-topic model (Steyvers et al. '04)
- Time: Topic-over-time (Wang et al. '06), Dynamic Topic model (Blei et al. '06)
Can we capture the context in a general way?

Contextualized Models
- Generation: How to select contexts? How to model context structure?
- Inference: How to reveal contextual patterns?
Example contexts: Location = US; Location = China; Source = official; Sentiment = +; Year = 1998; Year = 2008
Context-specific word distributions: P(w|M, Year = 1998), e.g., book, harry, potter, rowling; P(w|M, Year = 2008), e.g., movie, harry, potter, director

Roadmap: Modeling Simple Context
Simple contexts: Author, Time, Source, Author's occupation, Language, Location

Simple Contextual Topic Model (Mei and Zhai KDD'06)
Contextual topic patterns:
- Context 1 (2004): Topic 1 (Apple iPod): ipod, mini, 4gb; Topic 2 (Harry Potter): harry, prisoner, azkaban
- Context 2 (2007): Topic 1 (Apple iPod): ipod, iphone, nano; Topic 2 (Harry Potter): potter, order, phoenix
Document: "I downloaded the music of the movie harry potter to my iphone"

Example: Topic Life Cycles (Mei and Zhai KDD'05)
Context = Time
Contextual Topic Pattern → P(z|time)
[Figure: Hot Topics in SIGMOD]

Example: Spatiotemporal Theme Pattern (Mei et al. WWW'06)
Topic: Government Response in Hurricane Katrina
Context = Time & Location
Contextual Topic Pattern → P(z|time, location)
[Figure: maps of topic strength, with Hurricane Rita marked]

Example: Event Impact Analysis (Mei and Zhai KDD'06)
Topic: retrieval models (SIGIR)
Context = Event
Contextual Pattern → P(w|z, event)
Events: Starting of TREC (1992); [Ponte and Croft 98] (1998)
Topic word distribution: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, probabilistic 0.02, document 0.02, …
Event-specific word distributions (slide labels, in order: Traditional Models; Evaluation & Applications; Probabilistic Models; Language Models):
- vector 0.05, concept 0.03, model 0.03, space 0.02, boolean 0.02, function 0.01, …
- xml, model 0.02, collect 0.02, judgment 0.01, rank 0.01, …
- probabilist 0.08, model 0.04, logic 0.04, boolean 0.03, algebra 0.02, weight 0.01, …
- model 0.17, language 0.08, estimate 0.05, parameter 0.03, distribution 0.03, smooth 0.02, likelihood 0.01, …

Instantiation: Personalized Search (Mei and Church WSDM'08)

Personalization with Backoff
- Ambiguous query: MSR
  - Microsoft Research
  - Mountain Safety Research
- Disambiguate based on user's prior clicks
- We don't have enough data for everyone!
  - Backoff to classes of users
- Proof of Concept:
  - Context = Classes of Users defined by IP address

Personalized Search as Contextual Text Mining
- Text: query (click) logs (IP, Query, URL)
- Context → Users (IP), groups of users
- Text Model: P(URL | Query)
- Contextual Model: P(URL | Query, User)
- Goal: Estimate a better P(URL | Query, User)
[Figure: IP prefixes from the full address down to 156.*.*.* and *.*.*.*]

Evaluation Metric: Entropy (H)
- Difficulty of encoding information (a distribution)
  - Size of search space; difficulty of a task
- Powerful tool for sizing challenges and opportunities
  - How hard is web search?
  - How much does personalization help?
- Predict future → Cross Entropy H(Future|History)

Difficulty of Queries
- Easy queries (low H(URL|Q)):
  - google, yahoo, myspace, ebay, …
- Hard queries (high H(URL|Q)):
  - dictionary, yellow pages, movies, "what is may day?"
Hard query, "MSR" (high entropy): msrgear.com, msracing.com, research....com, msrwheels.com, msr.com, msr.org, msrdev.com, …
Easy query, "Google" (low entropy, ~ 0): google.com, google.cn, maps.google …

How Hard Is Search?

Variables | Entropy (H)
Query | 21.1
URL | 22.1
IP | 22.1
Query, URL | 23.9
Query, IP | 26.0
IP, URL | 27.1
All Three | 27.2

- Traditional Search
  - H(URL | Query) = 2.8 (= 23.9 - 21.1)
- Personalized Search (IP)
  - H(URL | Query, IP) = 1.2 (= 27.2 - 26.0)
Personalization cuts H in half!
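The conditional entropies on this slide follow from the chain rule, H(Y | X) = H(X, Y) - H(X). A minimal sketch, assuming a toy click log: the four (query, URL) observations and the helper names `entropy` and `conditional_entropy` are hypothetical, while the last two lines only redo the slide's own arithmetic.

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy (bits) of the empirical distribution in `counter`."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def conditional_entropy(pairs):
    """H(Y | X) = H(X, Y) - H(X) for a list of (x, y) observations."""
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    return entropy(joint) - entropy(marginal)

# Hypothetical toy click log of (query, url) pairs.
clicks = [
    ("google", "google.com"), ("google", "google.com"),
    ("msr", "msr.com"), ("msr", "msrgear.com"),
]
h_url_given_query = conditional_entropy(clicks)

# The slide's numbers, recomputed from its joint entropies.
h_traditional = round(23.9 - 21.1, 1)   # H(URL | Query)
h_personalized = round(27.2 - 26.0, 1)  # H(URL | Query, IP)
```

In the toy log the "google" clicks are fully predictable and the "msr" clicks are a coin flip, so the average uncertainty about the URL given the query is half a bit.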

Context = First k bytes of IP
[Figure: IP prefixes from the full address down to 156.*.*.* and *.*.*.*]
- Full personalization: every user has a different model; sparse data!
- No personalization: all users share the same model; missed opportunity
- Personalization with backoff: smooth by similar users
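One way to read "smooth by similar users" is as an interpolation of click models conditioned on shorter and shorter IP prefixes. The sketch below is an assumption-laden illustration, not the talk's estimator: the class `BackoffClickModel`, the fixed uniform weights, the renormalization over levels that have data, and the example IPs and URL labels are all hypothetical.

```python
from collections import Counter, defaultdict

def ip_prefix(ip, k):
    """Keep the first k bytes of a dotted IPv4 string, wildcard the rest."""
    parts = ip.split(".")
    return ".".join(parts[:k] + ["*"] * (4 - k))

class BackoffClickModel:
    def __init__(self, weights):
        # weights[k]: hypothetical interpolation weight of the model that
        # conditions on the first k bytes of the IP (k = 0 means all users).
        assert len(weights) == 5 and abs(sum(weights) - 1.0) < 1e-9
        self.weights = weights
        self.counts = defaultdict(Counter)  # (prefix, query) -> url counts

    def observe(self, ip, query, url):
        for k in range(5):
            self.counts[(ip_prefix(ip, k), query)][url] += 1

    def prob(self, url, query, ip):
        """Interpolated P(url | query, ip) over the prefix levels with data."""
        num, used = 0.0, 0.0
        for k, weight in enumerate(self.weights):
            urls = self.counts.get((ip_prefix(ip, k), query))
            if urls:  # skip levels with no history for this prefix
                num += weight * urls[url] / sum(urls.values())
                used += weight
        return num / used if used else 0.0

# Hypothetical click history (documentation-range IPs, made-up URL labels).
model = BackoffClickModel([0.2, 0.2, 0.2, 0.2, 0.2])
model.observe("192.0.2.10", "msr", "url_a")
model.observe("192.0.2.10", "msr", "url_a")
model.observe("198.51.100.7", "msr", "url_b")

p_known_user = model.prob("url_a", "msr", "192.0.2.10")
p_neighbor = model.prob("url_a", "msr", "192.0.2.99")
p_stranger = model.prob("url_a", "msr", "203.0.113.5")
```

A user seen before gets the most personalized estimate, a user sharing three bytes with a known user borrows from that neighborhood, and a completely new IP falls back to the model shared by all users.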

Context → Market Segmentation
- Can we do better than IP address?
- Potential Context Variables
  - ID, QueryType, Click, Intent, …
  - Demographics (Age, Gender, Income, …)
  - Time of day & Day of Week

Roadmap: Modeling Implicit Context
Implicit contexts: Sentiment

Implicit Context of Text
Implicit contexts: Sentiments, Intents, Impact, Trust
Need to infer these situations/conditions from the data (with prior knowledge)

Modeling Implicit Context
- Topics: Topic 1 (Apple iPod), Topic 2 (Harry Potter)
- Sentiment models: Positive (good, like, perfect); Negative (hate, awful, disgust)
  - From training data or user guidance; added as prior
- Topic words by sentiment:
  - Apple iPod, positive: color, size, quality; negative: price, scratch, problem
  - Harry Potter, positive: actress, music, visual; negative: director, accent, plot
Document: "I like the song of the movie on my perfect ipod but hate the accent"

Example: Faceted Opinion Summarization (Mei et al. WWW'07)
Context = Sentiment
Example: The Da Vinci Code
- Topic 1: Movie
  - "Tom Hanks, who is my favorite movie star act the leading role."
  - "Protesting.. you will lose your faith by watching the movie."
- Topic 2: Book
  - "a good book to past time."
  - "... so sick of people making such a big deal about a fiction book"

Roadmap: Modeling Complex Context
Complex contexts: Social Network

Complex Context of Text
Structures of contexts:
- Find novel contextual patterns
- Regularize contextual models
- Alleviate data sparseness

Modeling Complex Context
- Context A: Topic 1: ipod, nano, 4gb; Topic 2: harry, potter, actor
- Context B: Topic 1: ipod, nano, 8gb; Topic 2: harry, potter, actress
Context structure: contexts A and B are closely related
Intuition: Model(A) and Model(B) should be similar
- Users in the same building issue similar queries
- Collaborating researchers work on similar things
- Topics in SIGMOD are like topics in VLDB

Graph-based Regularization
- Structure of contexts → a graph
- Intuition: Model(u) and Model(v) should be similar for neighboring nodes u and v
- MLE alone gives unsmoothed surfaces; the regularized model gives smoothed surface(s) on top of the graph
[Figure: model values over graph nodes u and v, with their projection on a plane]

Instantiation: Topical Community Extraction (Mei et al. WWW'08)

Social Network Analysis
- Generation, evolution, e.g., [Leskovec 05]
- Community extraction, e.g., [Kleinberg 00]
- Diffusion, e.g., [Gruhl 04]; [Backstrom 06]
- Search, e.g., [Adamic 05]
- Ranking, e.g., [Brin and Page 98]; [Kleinberg 98]
Usually don't model topics in text
Figure credits: Kleinberg and Backstrom 2006, New York Times; Jeong et al., Nature 411

Topical Community Analysis
- Example topics: "physicist, physics, scientist, theory, gravitation …"; "writer, novel, best-sell, book, language, film …"
- Topics in text help community extraction
- Computer Science Literature = Information Retrieval + Data Mining + Machine Learning, …
- Text + Network → topical communities

Topical Community Extraction as Contextual Text Mining
- Text: scientific publications
- Context → Authors
- Text Model: Topic Model
- Contextual Model: Topic Model + Author
- Context Structure: Social Network (coauthorship)
- Goal: Assign authors to topical communities using P(z|author), regularized using the social network

Topic Modeling with Network Regularization
- Text model: data likelihood
  - Intuition 1: Know my research topics from my publications
- Graph regularization: graph harmonic regularizer (a generalization of [Zhu '03]), measuring smoothness between neighbors
  - Intuition 2: I work on similar topics with my coauthors
- A tradeoff parameter balances MLE and smoothness
[Formulas and model parameters not recoverable from the transcript]
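The formulas on this slide did not survive extraction, so the following is only a sketch of the general shape the labels describe: a data-likelihood term traded off against a graph harmonic regularizer. The exact weighting (`lam`, the factor 0.5, squared differences of P(z|.) across edges), the function names, and all toy values are assumptions for illustration.

```python
import math

def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over d, w of c(w, d) * log sum_z P(z | d) P(w | z)."""
    total = 0.0
    for d, doc_counts in counts.items():
        for w, c in doc_counts.items():
            p_w_d = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
            total += c * math.log(p_w_d)
    return total

def harmonic_regularizer(edges, p_z_d):
    """Half the weighted squared difference of P(z | .) across each edge."""
    total = 0.0
    for (u, v), weight in edges.items():
        total += weight * sum((pu - pv) ** 2 for pu, pv in zip(p_z_d[u], p_z_d[v]))
    return 0.5 * total

def regularized_objective(counts, edges, p_z_d, p_w_z, lam):
    """Quantity to minimize: lam trades off likelihood against smoothness."""
    return (-(1.0 - lam) * log_likelihood(counts, p_z_d, p_w_z)
            + lam * harmonic_regularizer(edges, p_z_d))

# Hypothetical toy data: two coauthors, two topics, two words.
counts = {"a": {"retrieval": 2, "mining": 1}, "b": {"mining": 3}}
edges = {("a", "b"): 1.0}
p_w_z = [{"retrieval": 0.7, "mining": 0.3}, {"retrieval": 0.2, "mining": 0.8}]
p_z_d = {"a": [0.9, 0.1], "b": [0.2, 0.8]}

penalty = harmonic_regularizer(edges, p_z_d)
objective = regularized_objective(counts, edges, p_z_d, p_w_z, lam=0.5)
```

With `lam = 0` the objective reduces to plain maximum likelihood; larger values pull the topic distributions of connected authors toward each other.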

Topics & Communities without Network Regularization
Four Conferences: SIGIR, KDD, NIPS, WWW

Topic 1 | Topic 2 | Topic 3 | Topic 4
term 0.02 | peer 0.02 | visual 0.02 | interface 0.02
question 0.02 | patterns 0.01 | analog 0.02 | towards 0.02
protein 0.01 | mining 0.01 | neurons 0.02 | browsing 0.02
training 0.01 | clusters 0.01 | vlsi 0.01 | xml 0.01
weighting 0.01 | stream 0.01 | motion 0.01 | generation 0.01
multiple 0.01 | frequent 0.01 | chip 0.01 | design 0.01
recognition 0.01 | e 0.01 | natural 0.01 | engine 0.01
relations 0.01 | page 0.01 | cortex 0.01 | service 0.01
library 0.01 | gene 0.01 | spike 0.01 | social
? | ? | ? | ?

Fuzzy topics; noisy community assignment

Topics & Communities with Network Regularization

Topic 1 | Topic 2 | Topic 3 | Topic 4
retrieval 0.13 | mining 0.11 | neural 0.06 | web 0.05
information 0.05 | data 0.06 | learning 0.02 | services 0.03
document 0.03 | discovery 0.03 | networks 0.02 | semantic 0.03
query 0.03 | databases 0.02 | recognition 0.02 | services 0.03
text 0.03 | rules 0.02 | analog 0.01 | peer 0.02
search 0.03 | association 0.02 | vlsi 0.01 | ontologies 0.02
evaluation 0.02 | patterns 0.02 | neurons 0.01 | rdf 0.02
user 0.02 | frequent 0.01 | gaussian 0.01 | management 0.01
relevance 0.02 | streams 0.01 | network 0.01 | ontology 0.01
Information Retrieval | Data mining | Machine learning | Web

Clear topics; coherent community assignment

Topic Modeling and SNA Improve Each Other
[Table: methods PLSA (text only), NetPLSA, NCut (network only); columns Cut Edge Weights, Ratio Cut / Norm. Cut, Community Size (Community 1 to Community 4); values not recoverable. For the cut measures, the smaller the better.]
- NCut: spectral clustering with normalized cut (Shi et al. '00)
- Network regularization helps extract coherent communities (ensures tight connection of authors)
- Topic modeling helps balance communities (text implicitly bridges authors)
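The cut-based columns of this comparison can be illustrated with the textbook definitions of cut weight, ratio cut, and normalized cut. The transcript does not show how the talk normalized its numbers, so the definitions below, the function name `cut_metrics`, and the two-triangle toy graph are assumptions.

```python
def cut_metrics(edges, assignment):
    """edges: {(u, v): weight}, undirected; assignment: {node: community}."""
    communities = set(assignment.values())
    cut = {c: 0.0 for c in communities}     # edge weight leaving each community
    volume = {c: 0.0 for c in communities}  # summed degree of each community
    size = {c: 0 for c in communities}
    for node, c in assignment.items():
        size[c] += 1
    for (u, v), w in edges.items():
        cu, cv = assignment[u], assignment[v]
        volume[cu] += w
        volume[cv] += w
        if cu != cv:
            cut[cu] += w
            cut[cv] += w
    cut_weight = sum(cut.values()) / 2.0  # each cut edge was counted twice
    ratio_cut = sum(cut[c] / size[c] for c in communities)
    norm_cut = sum(cut[c] / volume[c] for c in communities if volume[c] > 0)
    return cut_weight, ratio_cut, norm_cut

# Hypothetical coauthor graph: two triangles joined by a single edge.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
assignment = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
cut_weight, ratio_cut, norm_cut = cut_metrics(edges, assignment)
```

Lower values mean fewer coauthor links are severed relative to community size, which matches the slide's "the smaller the better" note.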

Summary of My Talk
- Text + Context = Contextual Text Mining
  - A new paradigm of text mining
- General methodology for contextual text mining
  - Generative models of text (e.g., Topic Models)
  - Contextualized models with simple context, implicit context, complex context
- Applications of contextual text mining

Take Away Message
Text + Context = [image not recoverable]

A Roadmap of My Work
Areas: Information Retrieval & Web Search; Text Mining; Application to Bioinfo.
[Figure: publications grouped by theme; the pairing of venues with themes is not fully recoverable]
- Themes: Annotating frequent patterns; Contextual Topic Models; Labeling topic models; Impact-based summarization; Query suggestion using hitting time; Poisson language models; Bio. literature mining; Graph-based smoothing
- Venues, in slide order: KDD 06a; KDD 05, KDD 06b, WWW 06, WWW 07, WWW 08; KDD 07; SIGIR 07, CIKM 08, ACL 08; PSB 06, IP&M 07, KDD 08; SIGIR 08, WSDM 08

A Roadmap to the Future
Text Information Management: Information Retrieval & Web Search; Text Mining
- Theoretical Framework: computational challenge; structure of contexts
- Task Support Systems: web users, scientists, business users
- Applications: integrative analysis of heterogeneous data (web 2.0 data, science data, information networks)
- Interdisciplinary: bioinformatics, health informatics, business informatics

Thanks!

Predict the Future
Cross Entropy: H(future | history)
- No personalization
- Complete personalization: an IP in the future might not be seen in the history
- Personalization with backoff: at least the first k bytes of the IP are seen in history
[Figure annotations: "Knows at least two bytes"; "Knows every byte - enough data"]
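Cross entropy H(future | history) scores how well a model trained on past clicks predicts later ones. A minimal sketch without personalization, assuming add-alpha smoothing so that unseen clicks keep non-zero probability; the smoothing scheme, the `vocab_size` parameter, the function names, and the toy logs are hypothetical choices, not the talk's setup.

```python
import math
from collections import Counter, defaultdict

def train(history):
    """Count clicked URLs per query from (query, url) pairs."""
    model = defaultdict(Counter)
    for query, url in history:
        model[query][url] += 1
    return model

def cross_entropy(model, future, vocab_size, alpha=1.0):
    """Average -log2 P(url | query) of future clicks under the history model.

    Add-alpha smoothing over `vocab_size` candidate URLs is an assumption
    made here so that unseen (query, url) pairs do not cost infinite bits.
    """
    bits = 0.0
    for query, url in future:
        urls = model.get(query, Counter())
        p = (urls[url] + alpha) / (sum(urls.values()) + alpha * vocab_size)
        bits -= math.log2(p)
    return bits / len(future)

# Hypothetical logs: three past clicks, then one repeated and one new query.
history = [("google", "google.com")] * 3
model = train(history)
h_seen = cross_entropy(model, [("google", "google.com")], vocab_size=2)
h_unseen = cross_entropy(model, [("msr", "msr.com")], vocab_size=2)
```

A click already common in the history costs few bits, while a query never seen before costs a full bit under this two-URL toy vocabulary; backing off to broader user classes is a way to lower that second cost.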

Results: Sentiment Dynamics (Mei et al. WWW'07)
Context = Sentiment + Time
- Facet: the book (Pos > Neg)
- Facet: the impact on religious beliefs (Neg > Pos)

Example: Context = Hours-of-Day
Harder queries at TV time

Example: Context = Days-of-Week
Business days vs. weekends: more clicks and easier queries