Published by Andrea Flowers. Modified over 9 years ago.
1
Towards Contextual Text Mining
Qiaozhu Mei (qmei2@uiuc.edu)
University of Illinois at Urbana-Champaign
2009 © Qiaozhu Mei, University of Illinois at Urbana-Champaign
2
Knowledge Discovery from Text
Text Mining System
3
Overload of Text Content
Content type                Amount / day
Published content           3-4G
Professional web content    ~2G
User generated content      8-10G
Private text content        ~3T
Source: Ramakrishnan and Tomkins 2007
4
Challenge of Mining Text
[Figure: text sources annotated with volumes ~750k/day, ~3M/day, ~150k/day, 1M, 10B, 6M, ~100B]
Where to start? Where to go? Gold?
5
Context - “Situation of Text”
Context dimensions: Author, Time, Source, Author’s occupation, Language, Social Network, Location, Sentiment
Example values from a post:
- Location: Check Lap Kok, HK
- Author’s occupation: self designer, publisher, editor …
- Time: 3:53 AM Jan 28th
- Source: From Ping.fm
6
Rich Context Information
[Figure: text collections annotated with context statistics: 102M blogs; 100M users; > 1M groups; 8M contributors; 100+ languages; 73 years; ~400k authors; ~4k sources; ~1B queries (per hour?); ~1B users; ~3M msgs/day; ~5M users; 5M users; 500M URLs]
7
Text + Context = ?
Context = Guidance
“I have a guide!”
8
Query Log + User = Personalized Search
Query: MSR
- Modern System Research
- Medical simulation
- Montessori School of Raleigh
- Mountain Safety Research
- MSR Racing
Wikipedia definitions:
- Metropolis Street Racer
- Molten salt reactor
- Mars sample return
- Magnetic Stripe Reader
How much can personalization help?
If you know me, you should give me Microsoft Research…
9
Customer Reviews + Brand = Comparative Product Summary
Can we compare products?
Sources: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews
Common themes   IBM                 APPLE                DELL
Battery life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very fast, 3-4 GHz   Moderate, 1-2 GHz
10
Scientific Literature + Time = Topic Trends
What’s hot in literature?
[Figure: Hot Topics in SIGMOD]
11
Blogs + Time & Location = Spatiotemporal Topic Diffusion
How does discussion spread?
[Figure: topic distribution by location, and the same map one week later]
12
Blogs + Sentiment = Faceted Opinion Summary
What is good and what is bad?
Example: The Da Vinci Code
- “Tom Hanks, who is my favorite movie star act the leading role.”
- “protesting... will lose your faith by watching the movie.”
- “a good book to past time.”
- “... so sick of people making such a big deal about a fiction book”
13
Publications + Social Network = Topical Community
Who works together on what?
[Figure: coauthor network with communities labeled Information retrieval, Machine learning, Data mining]
14
Text + Context = Contextual Text Mining
- Query log + User = Personalized Search
- Scientific Literature + Time = Topic Trends
- Review + Brand = Comparative Opinion
- Blog + Time & Location = Spatiotemporal Topic Diffusion
- Blog + Sentiment = Faceted Opinion Summary
- Publications + Social Network = Topical Community
- …..
A general solution for all?
15
Roadmap
- Generative Model of Text
- Integrating Contexts in Text Models
  - Modeling Simple Context
  - Modeling Implicit Context
  - Modeling Complex Context
- Applications of Contextual Text Mining
16
Generative Model of Text
Text: the.. movie.. harry.. potter is.. based.. on.. j..k..rowling
[Figure: a word distribution over the, is, harry, potter, movie, plot, time, rowling, with probabilities 0.1, 0.07, 0.05, 0.04, 0.02, 0.01]
- Generation: the model emits words (the, harry, potter, movie, harry, is, ...)
- Inference, Estimation: the model is recovered from observed text
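The two directions on this slide can be sketched with the simplest generative model of text, a unigram language model. This is only an illustration under stated assumptions: the function names `estimate_unigram` and `generate` are hypothetical, the estimate is a plain relative-frequency (maximum likelihood) estimate, and the training text is the example sentence from the slide.

```python
import random
from collections import Counter

def estimate_unigram(tokens):
    """Estimation: maximum-likelihood word probabilities from observed text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def generate(model, n, seed=0):
    """Generation: draw n words independently from the word distribution."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

# Example sentence from the slide, tokenized by whitespace.
text = "the movie harry potter is based on j k rowling".split()
model = estimate_unigram(text)   # inference / estimation
sample = generate(model, 5)      # generation
```

Each of the ten distinct tokens receives probability 0.1 here, because every word occurs once; a larger corpus would give the skewed distribution pictured on the slide.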
17
Text as a Mixture of Topics
Topic (Theme) = the subject of a discourse
Example document: “Using machine learning for web search”
K topics:
- Web Search: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, ……
- Machine Learning: learning 0.18, model 0.14, training 0.10, kernel 0.09, inference 0.07, ……
- Data Mining: mining 0.21, data 0.13, pattern 0.10, clustering 0.05, network 0.04, ……
- Database: …
18
Probabilistic Topic Models (Hofmann ’99, Blei et al. ’03, …)
- Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
- Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02
Document: “I downloaded the music of the movie harry potter to my ipod nano”
Each word is drawn from one of the topics, e.g. “ipod” from Topic 1 (0.15) and “harry” from Topic 2 (0.09).
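A minimal sketch of how a mixture of topics assigns probability to a word: the word probability in a document is the topic-weighted sum of the per-topic word probabilities. The topic tables are the ones on the slide; the 50/50 mixing weights and the names `doc_mix` and `word_prob` are assumptions made only for illustration.

```python
# Topic-word probabilities copied from the slide (truncated lists).
topics = {
    "apple_ipod": {"ipod": 0.15, "nano": 0.08, "music": 0.05,
                   "download": 0.02, "apple": 0.01},
    "harry_potter": {"movie": 0.10, "harry": 0.09, "potter": 0.05,
                     "actress": 0.04, "music": 0.02},
}

# Assumed topic proportions for the example document (not from the slide).
doc_mix = {"apple_ipod": 0.5, "harry_potter": 0.5}

def word_prob(word, doc_mix, topics):
    """P(w | d) = sum over topics z of P(z | d) * P(w | z)."""
    return sum(p_z * topics[z].get(word, 0.0) for z, p_z in doc_mix.items())

p_music = word_prob("music", doc_mix, topics)  # 0.5*0.05 + 0.5*0.02
p_ipod = word_prob("ipod", doc_mix, topics)    # 0.5*0.15 + 0.5*0.0
```

The word “music” appears in both topics, so both contribute to its probability; “ipod” is explained by the first topic alone.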
19
Parameter Estimation
- Maximum Likelihood Estimation (MLE)
- Parameter estimation using the EM algorithm
  - Alternatives: Gibbs sampling, variational inference, expectation propagation
Setting: the document “I downloaded the music of the movie harry potter to my ipod nano” is observed, but the topic word probabilities are unknown (shown as ?????????? on the slide).
EM loop:
- Guess the affiliation: assign each word occurrence to topics
- Estimate the params: re-estimate the word probabilities from the resulting pseudo-counts
- Repeat
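The “guess the affiliation, estimate the params” loop can be sketched as EM for a PLSA-style topic model. This is a simplified illustration, not the exact procedure from the talk: the function name `plsa`, the random initialization, the fixed iteration count, and the toy documents are all assumptions.

```python
import random

def plsa(docs, k, iters=50, seed=0):
    """EM for a PLSA-style model. docs is a list of token lists.
    Returns (p_w_z, p_z_d): per-topic word distributions and
    per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [{w: d.count(w) for w in set(d)} for d in docs]

    def normalize(weights):
        total = sum(weights.values()) or 1.0  # guard against an empty topic
        return {key: value / total for key, value in weights.items()}

    # Random starting point for both sets of parameters.
    p_w_z = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    p_z_d = [normalize({z: rng.random() + 0.1 for z in range(k)}) for _ in docs]

    for _ in range(iters):
        new_wz = [{w: 0.0 for w in vocab} for _ in range(k)]
        new_zd = [{z: 0.0 for z in range(k)} for _ in docs]
        for d, doc_counts in enumerate(counts):
            for w, c in doc_counts.items():
                # E-step: guess the affiliation of word w in document d.
                post = normalize({z: p_z_d[d][z] * p_w_z[z][w] for z in range(k)})
                # Accumulate pseudo-counts weighted by the posterior.
                for z in range(k):
                    new_wz[z][w] += c * post[z]
                    new_zd[d][z] += c * post[z]
        # M-step: re-estimate the parameters from the pseudo-counts.
        p_w_z = [normalize(t) for t in new_wz]
        p_z_d = [normalize(t) for t in new_zd]
    return p_w_z, p_z_d

# Toy corpus (hypothetical), loosely following the slide's two themes.
docs = [
    "ipod nano music download apple ipod".split(),
    "apple ipod nano music".split(),
    "movie harry potter actress music".split(),
    "harry potter movie harry".split(),
]
p_w_z, p_z_d = plsa(docs, k=2)
```

EM only finds a local maximum of the likelihood, so different seeds can give different topics; in practice one would run several restarts and keep the best.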
20
How Context Affects Topics
- Topics in science literature: 16th century vs. 21st century
- When do a computer scientist and a gardener write about “tree, root, prune”?
- In Europe, “football” appears a lot in a soccer report. What about in the US?
Text is generated according to the context!
“Context of Situation” - B. Malinowski 1923
21
Existing Work
- PLSA (Hofmann ‘99), LDA (Blei et al. ‘03), CTM (Blei et al. ‘06), PAM (Li and McCallum ‘06): don’t incorporate contexts
- Author: Author-topic model (Steyvers et al. ‘04)
- Time: Topics-over-time (Wang et al. ‘06), Dynamic topic model (Blei et al. ‘06)
Can we capture the context in a general way?
22
Contextualized Models
Example contexts: Year = 1998, Year = 2008, Location = US, Location = China, Source = official, Sentiment = +
- P(w|M, Year = 1998): book 0.15, harry 0.10, potter 0.08, rowling 0.05
- P(w|M, Year = 2008): movie 0.18, harry 0.09, potter 0.08, director 0.04
Generation:
- How to select contexts?
- How to model context structure?
Inference:
- How to reveal contextual patterns?
23
Roadmap: Modeling Simple Context
Simple contexts: Author, Time, Source, Author’s occupation, Language, Location
24
Simple Contextual Topic Model (Mei and Zhai KDD’06)
Document: “I downloaded the music of the movie harry potter to my iphone”
Contextual topic patterns:
- Context 1 (2004), Topic 1 (Apple iPod): ipod, mini, 4gb
- Context 1 (2004), Topic 2 (Harry Potter): harry, prisoner, azkaban
- Context 2 (2007), Topic 1 (Apple iPod): ipod, iphone, nano
- Context 2 (2007), Topic 2 (Harry Potter): potter, order, phoenix
25
Example: Topic Life Cycles (Mei and Zhai KDD’05)
Context = Time
Contextual topic pattern: P(z|time)
[Figure: Hot Topics in SIGMOD]
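One simple way to picture a contextual pattern such as P(z|time) is to aggregate per-document topic proportions by time stamp. The sketch below does only that; it is a naive averaging illustration, not the estimation procedure of the cited paper, and the function name and toy inputs are hypothetical.

```python
from collections import defaultdict

def topic_strength_over_time(doc_topics, doc_years):
    """Approximate P(z | year) by summing each document's topic
    proportions within a year and normalizing per year."""
    totals = defaultdict(lambda: defaultdict(float))
    for mix, year in zip(doc_topics, doc_years):
        for topic, weight in mix.items():
            totals[year][topic] += weight
    return {
        year: {t: v / sum(by_topic.values()) for t, v in by_topic.items()}
        for year, by_topic in totals.items()
    }

# Hypothetical per-document topic proportions and publication years.
doc_topics = [{"xml": 1.0}, {"xml": 0.5, "streams": 0.5}, {"streams": 1.0}]
doc_years = [2004, 2004, 2007]
trend = topic_strength_over_time(doc_topics, doc_years)
```

Plotting `trend[year][topic]` against the year gives a life-cycle curve of the kind shown for SIGMOD topics.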
26
Example: Spatiotemporal Theme Pattern (Mei et al. WWW’06)
Context = Time & Location
Contextual topic pattern: P(z|time, location)
Topic: Government Response in Hurricane Katrina
[Figure: topic distribution by location over time, with Hurricane Rita marked]
27
Example: Event Impact Analysis (Mei and Zhai KDD’06)
Context = Event
Contextual pattern: P(w|z, event)
Collection: SIGIR
Topic: retrieval models
- term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, probabilistic 0.02, document 0.02, …
Events: Starting of TREC (1992); [Ponte and Croft 98] (1998)
Sub-topics around the events:
- Traditional Models: vector 0.05, concept 0.03, model 0.03, space 0.02, boolean 0.02, function 0.01, …
- Evaluation & Applications: xml 0.07, email 0.02, model 0.02, collect 0.02, judgment 0.01, rank 0.01, …
- Probabilistic Models: probabilist 0.08, model 0.04, logic 0.04, boolean 0.03, algebra 0.02, weight 0.01, …
- Language Models: model 0.17, language 0.08, estimate 0.05, parameter 0.03, distribution 0.03, smooth 0.02, likelihood 0.01, …
28
Instantiation: Personalized Search (Mei and Church WSDM’08)
29
Personalization with Backoff
- Ambiguous query: MSR
  - Microsoft Research
  - Mountain Safety Research
- Disambiguate based on user’s prior clicks
- We don’t have enough data for everyone!
  - Back off to classes of users
- Proof of concept:
  - Context = classes of users defined by IP address
30
Personalized Search as Contextual Text Mining
- Text: query (click) logs of (IP, Query, URL)
- Text model: P(URL | Query)
- Context: users (IP), groups of users
- Contextual model: P(URL | Query, User)
- Goal: estimate a better P(URL | Query, User)
User classes by IP prefix: 156.111.188.243, 156.111.188.*, 156.111.*.*, 156.*.*.*, *.*.*.*
31
Evaluation Metric: Entropy (H)
- Difficulty of encoding information (a distribution)
  - Size of search space; difficulty of a task
- Powerful tool for sizing challenges and opportunities
  - How hard is web search?
  - How much does personalization help?
- Predict the future: cross entropy H(Future|History)
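For reference, entropy and cross entropy over discrete distributions can be computed as below. This is a small sketch in bits (log base 2); the function names, the probability floor used for unseen events, and the example distributions are assumptions for illustration, not values from the talk.

```python
import math

def entropy(dist):
    """H(p) = -sum p(x) log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def cross_entropy(future, history, floor=1e-9):
    """H(future | history) = -sum p_future(x) log2 p_history(x).
    Events unseen in history get a small floor probability (an assumed,
    deliberately crude stand-in for proper smoothing)."""
    return -sum(
        p * math.log2(max(history.get(x, 0.0), floor))
        for x, p in future.items() if p > 0
    )

# Hypothetical click distributions over four URLs.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
peaked = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}

h_uniform = entropy(uniform)             # 2 bits: a hard, ambiguous query
h_peaked = entropy(peaked)               # lower: an easier query
h_cross = cross_entropy(peaked, uniform)
```

Cross entropy is never smaller than the entropy of the future distribution, so the gap measures how much is lost by predicting the future from an imperfect history model.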
32
Difficulty of Queries
- Easy queries (low H(URL|Q)): google, yahoo, myspace, ebay, …
- Hard queries (high H(URL|Q)): dictionary, yellow pages, movies, “what is may day?”
Hard query: “MSR” (high entropy)
- msrgear.com 0.12, msracing.com 0.10, research....com 0.09, msrwheels.com 0.08, msr.com 0.07, msr.org 0.06, msrdev.com 0.05, …
Easy query: “Google” (low entropy)
- google.com 0.80, google.cn 0.10, maps.google 0.08, … ~0
33
How Hard Is Search?
Variable(s)    Entropy (H)
Query          21.1
URL            22.1
IP             22.1
Query, URL     23.9
Query, IP      26.0
IP, URL        27.1
All three      27.2
- Traditional search: H(URL | Query) = 2.8 (= 23.9 - 21.1)
- Personalized search (IP): H(URL | Query, IP) = 1.2 (= 27.2 - 26.0)
Personalization cuts H in half!
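The two conditional entropies on this slide follow from the chain rule, H(Y | X) = H(X, Y) - H(X), applied to the joint entropies in the table:

```latex
\begin{aligned}
H(\mathrm{URL} \mid \mathrm{Query})
  &= H(\mathrm{Query}, \mathrm{URL}) - H(\mathrm{Query})
   = 23.9 - 21.1 = 2.8 \\
H(\mathrm{URL} \mid \mathrm{Query}, \mathrm{IP})
  &= H(\mathrm{Query}, \mathrm{IP}, \mathrm{URL}) - H(\mathrm{Query}, \mathrm{IP})
   = 27.2 - 26.0 = 1.2
\end{aligned}
```

The ratio 1.2 / 2.8 is about 0.43, which is the basis for the claim that personalization cuts the entropy roughly in half.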
34
Context = First k bytes of IP
Backoff levels: 156.111.188.243, 156.111.188.*, 156.111.*.*, 156.*.*.*, *.*.*.*
- Full personalization: every user has a different model. Sparse data!
- No personalization: all users share the same model. Missed opportunity.
- Personalization with backoff: smooth by similar users
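A rough sketch of personalization with backoff over IP prefixes: click counts are kept at every prefix level, and P(URL | Query, IP) is a weighted mix of the relative frequencies at the levels that have data. The fixed interpolation weights, the function names, and the log entries (including the example URLs and IP addresses) are assumptions for illustration; the talk does not specify how the levels are weighted.

```python
from collections import Counter, defaultdict

def prefixes(ip):
    """Backoff levels from the full IP down to the wildcard,
    e.g. 156.111.188.243, 156.111.188.*, ..., *.*.*.*"""
    parts = ip.split(".")
    return [".".join(parts[:k] + ["*"] * (4 - k)) for k in range(4, -1, -1)]

def train(logs):
    """logs: iterable of (ip, query, url) click records."""
    counts = defaultdict(Counter)
    for ip, query, url in logs:
        for prefix in prefixes(ip):
            counts[(prefix, query)][url] += 1
    return counts

def prob(counts, url, query, ip, lambdas=(0.4, 0.25, 0.15, 0.1, 0.1)):
    """Interpolate relative click frequencies across backoff levels.
    Levels with no data are skipped and the weights renormalized.
    The lambda values are assumed, not taken from the talk."""
    num = den = 0.0
    for lam, prefix in zip(lambdas, prefixes(ip)):
        clicks = counts.get((prefix, query))
        if not clicks:
            continue
        num += lam * clicks[url] / sum(clicks.values())
        den += lam
    return num / den if den else 0.0

# Hypothetical click log.
logs = [
    ("156.111.188.243", "msr", "research.example.com"),
    ("156.111.188.7", "msr", "research.example.com"),
    ("24.5.6.7", "msr", "msrgear.com"),
    ("24.5.6.8", "msr", "msrgear.com"),
    ("24.9.9.9", "msr", "msrgear.com"),
]
counts = train(logs)
# A new user in the 156.111.188.* block, never seen before.
p_personal = prob(counts, "research.example.com", "msr", "156.111.188.99")
p_global = counts[("*.*.*.*", "msr")]["research.example.com"] / 5
```

The unseen user inherits the click behavior of neighbors sharing an IP prefix, so the estimate for the research site rises well above the global click rate.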
35
Context Market Segmentation
Can we do better than IP address?
Potential context variables:
- ID, QueryType, Click, Intent, …
- Demographics (Age, Gender, Income, …)
- Time of day & day of week
36
Roadmap: Modeling Implicit Context
Implicit contexts: Sentiment
37
Implicit Context of Text
Implicit contexts: Sentiments, Intents, Impact, Trust
Need to infer these situations/conditions from the data (with prior knowledge)
38
Modeling Implicit Context
Document: “I like the song of the movie on my perfect ipod but hate the accent”
Sentiment models (from training data or user guidance, added as prior):
- Positive: good 0.10, like 0.05, perfect 0.02
- Negative: hate 0.21, awful 0.03, disgust 0.01
Topics under each sentiment:
- Topic 1 (Apple iPod), positive: color, size, quality
- Topic 1 (Apple iPod), negative: price, scratch, problem
- Topic 2 (Harry Potter), positive: actress, music, visual
- Topic 2 (Harry Potter), negative: director, accent, plot
39
Example: Faceted Opinion Summarization (Mei et al. WWW’07)
Context = Sentiment
Collection: The Da Vinci Code
Topic 1: Movie
- “Tom Hanks, who is my favorite movie star act the leading role.”
- “Protesting.. you will lose your faith by watching the movie.”
Topic 2: Book
- “a good book to past time.”
- “... so sick of people making such a big deal about a fiction book”
40
Roadmap: Modeling Complex Context
Complex contexts: Social Network
41
Complex Context of Text
Structures of contexts help to:
- Find novel contextual patterns
- Regularize contextual models
- Alleviate data sparseness
42
Modeling Complex Context
Context structure: contexts A and B are closely related
- Context A: Topic 1: ipod, nano, 4gb; Topic 2: harry, potter, actor
- Context B: Topic 1: ipod, nano, 8gb; Topic 2: harry, potter, actress
Intuition: Model(A) and Model(B) should be similar
- Users in the same building issue similar queries
- Collaborating researchers work on similar things
- Topics in SIGMOD are like topics in VLDB
43
Graph-based Regularization
- Structure of contexts: a graph with nodes u, v
- Intuition: Model(u) and Model(v) should be similar
- Regularized model = smoothed surface(s) on top of the graph, as opposed to the unsmoothed MLE
[Figure: model values over the graph nodes, with a projection on a plane]
44
Instantiation: Topical Community Extraction (Mei et al. WWW’08)
45
Social Network Analysis
- Generation, evolution: e.g., [Leskovec 05]
- Community extraction: e.g., [Kleinberg 00]
- Diffusion: [Gruhl 04]; [Backstrom 06]
- Search: e.g., [Adamic 05]
- Ranking: e.g., [Brin and Page 98]; [Kleinberg 98]
Usually don’t model topics in text
Figure sources: Kleinberg and Backstrom 2006, New York Times; Jeong et al. 2001, Nature 411
46
Topical Community Analysis
Text + Network = topical communities
- Example community: physicist, physics, scientist, theory, gravitation …
- Example community: writer, novel, best-sell, book, language, film…
Topics in text help community extraction
Computer Science Literature = Information Retrieval + Data Mining + Machine Learning, …
47
Topical Community Extraction as Contextual Text Mining
- Text: scientific publications
- Text model: topic model
- Context: authors
- Contextual model: topic model + author
- Context structure: social network (coauthorship)
- Goal: assign authors to topical communities using P(z|author), regularized using the social network
48
Topic Modeling with Network Regularization
- Text model (data likelihood). Intuition 1: know my research topics from my publications
- Graph regularization (graph harmonic regularizer, a generalization of [Zhu ’03]): smoothness of the model parameters between neighbors. Intuition 2: I work on similar topics with my coauthors
- A tradeoff parameter balances MLE against smoothness
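The slide's formula did not survive extraction, so the sketch below shows one plausible form of such an objective rather than the talk's exact equation: a weighted sum of the negative data log-likelihood and a graph harmonic penalty on the squared differences of P(z|author) across coauthor edges. The function names, the tradeoff parameter `lam`, and the toy network are assumptions.

```python
def harmonic_regularizer(edges, p_z):
    """0.5 * sum over edges (u, v, w) of w * sum_z (P(z|u) - P(z|v))^2.
    Small values mean neighboring authors have similar topic mixtures."""
    return 0.5 * sum(
        w * sum((a - b) ** 2 for a, b in zip(p_z[u], p_z[v]))
        for u, v, w in edges
    )

def regularized_objective(log_likelihood, edges, p_z, lam):
    """Assumed objective to minimize:
    -(1 - lam) * log-likelihood + lam * graph penalty.
    lam = 0 recovers plain MLE; larger lam enforces more smoothness."""
    return -(1 - lam) * log_likelihood + lam * harmonic_regularizer(edges, p_z)

# Hypothetical coauthor network: (author, author, number of joint papers).
edges = [("ann", "bob", 2.0), ("bob", "cat", 1.0)]
# Hypothetical P(z | author) over two topics.
p_z = {"ann": [1.0, 0.0], "bob": [0.0, 1.0], "cat": [0.0, 1.0]}

penalty = harmonic_regularizer(edges, p_z)
objective = regularized_objective(-10.0, edges, p_z, lam=0.5)
```

Only the ann-bob edge contributes here, because bob and cat already share the same topic mixture; frequent coauthors with different mixtures are penalized most.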
49
Topics & Communities without Network Regularization
Four conferences: SIGIR, KDD, NIPS, WWW
Topic 1            Topic 2          Topic 3         Topic 4
term 0.02          peer 0.02        visual 0.02     interface 0.02
question 0.02      patterns 0.01    analog 0.02     towards 0.02
protein 0.01       mining 0.01      neurons 0.02    browsing 0.02
training 0.01      clusters 0.01    vlsi 0.01       xml 0.01
weighting 0.01     stream 0.01      motion 0.01     generation 0.01
multiple 0.01      frequent 0.01    chip 0.01       design 0.01
recognition 0.01   e 0.01           natural 0.01    engine 0.01
relations 0.01     page 0.01        cortex 0.01     service 0.01
library 0.01       gene 0.01        spike 0.01      social 0.01
Topic labels: ?, ?, ?, ?
Result: fuzzy topics, noisy community assignment
50
Topics & Communities with Network Regularization
Topic 1             Topic 2            Topic 3            Topic 4
retrieval 0.13      mining 0.11        neural 0.06        web 0.05
information 0.05    data 0.06          learning 0.02      services 0.03
document 0.03       discovery 0.03     networks 0.02      semantic 0.03
query 0.03          databases 0.02     recognition 0.02   services 0.03
text 0.03           rules 0.02         analog 0.01        peer 0.02
search 0.03         association 0.02   vlsi 0.01          ontologies 0.02
evaluation 0.02     patterns 0.02      neurons 0.01       rdf 0.02
user 0.02           frequent 0.01      gaussian 0.01      management 0.01
relevance 0.02      streams 0.01       network 0.01       ontology 0.01
Topic labels: Information Retrieval, Data mining, Machine learning, Web
Result: clear topics, coherent community assignment
51
Topic Modeling and SNA Improve Each Other
Method                   Cut edge weights   Ratio cut / Norm. cut   Community 1   Community 2   Community 3   Community 4
PLSA (text only)         4831               2.14 / 1.25             2280          2178          2326          2257
NetPLSA                  662                0.29 / 0.13             2636          1989          3069          1347
NCut (network only)      855                0.23 / 0.12             2699          6323          8             11
Columns Community 1 to 4 give community sizes. For the cut measures, the smaller the better.
- NCut: spectral clustering with normalized cut (Shi et al. ’00)
- Network regularization helps extract coherent communities (ensures tight connection of authors)
- Topic modeling helps balance communities (text implicitly bridges authors)
52
Summary of My Talk
- Text + Context = Contextual Text Mining
  - A new paradigm of text mining
- General methodology for contextual text mining
  - Generative models of text (e.g., topic models)
  - Contextualized models with simple context, implicit context, complex context
- Applications of contextual text mining
53
Take Away Message
Text + Context = [figure]
54
A Roadmap of My Work
Areas: Information Retrieval & Web Search; Text Mining
[Figure: publications arranged by area. Labels in reading order:]
- KDD 06a; Annotating frequent patterns
- KDD 05, KDD 06b, WWW 06, WWW 07, WWW 08; Contextual Topic Models
- KDD 07; Labeling topic models
- SIGIR 07, CIKM 08, ACL 08; Impact-based summarization, Query suggestion using hitting time, Poisson language models
- PSB 06, IP&M 07, KDD 08; Application to Bioinfo., Bio. literature mining
- SIGIR 08, WSDM 08; Graph-based smoothing
55
A Roadmap to the Future
Text Information Management: Information Retrieval & Web Search; Text Mining
- Theoretical framework: computational challenge; structure of contexts
- Task support systems: web users, scientists, business users
- Applications: integrative analysis of heterogeneous data (web 2.0 data, science data, information networks)
- Interdisciplinary: bioinformatics, health informatics, business informatics
56
Thanks!
57
Predict the Future
Cross entropy: H(future | history)
An IP in the future might not be seen in the history.
[Figure: cross entropy when at least the first k bytes of the IP are seen in history, for k = 4, 3, 2, 1, 0]
- No personalization
- Personalization with backoff
- Complete personalization
- Knows at least two bytes
- Knows every byte: enough data
58
Results: Sentiment Dynamics (Mei et al. WWW’07)
Context = Sentiment + Time
- Facet: the book (Pos > Neg)
- Facet: the impact on religious beliefs (Neg > Pos)
59
Example: Context = Hours-of-Day
Harder queries at TV time
60
Example: Context = Days-of-Week
Business days vs. weekends: more clicks and easier queries