Published by Madeleine Patrick. Modified over 9 years ago.
Comparative Text Mining

Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai
DAIS: The Database and Information Systems Laboratory at the University of Illinois at Urbana-Champaign
Large Scale Information Management

Poster panels: Cross-Collection Text Mining, Cross-Collection Text Mining (II), Temporal Text Mining, Temporal Text Mining (II), Spatiotemporal Text Mining, Spatiotemporal Text Mining (II)

Motivation

Many applications involve a comparative analysis of several text collections. Existing work in text mining has conceptually focused on a single collection of text and is therefore inadequate for comparative text analysis. We aim to develop methods for comparing multiple collections of text and performing comparative text mining.

Example: comparing IBM, APPLE, and DELL laptop reviews

Common themes | "IBM" specific | "APPLE" specific | "DELL" specific
Speed | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz
Hard disk | Large, 80-100 GB | Small, 5-10 GB | Medium, 20-50 GB
Battery life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs

Cross-Collection Text Mining

A mixture model for cross-collection comparative text mining.

Goal: Extract common themes and specific themes from comparable collections.
Applications: Opinion extraction, business intelligence, news summarization, etc.

"Generating" word w in doc d in collection $C_i$ (model diagram):
- With probability $\lambda_B$ the word is drawn from the background model $\theta_B$; with probability $1 - \lambda_B$ it is drawn from one of the k themes.
- Theme j (j = 1, …, k) is selected with the document-specific weight $\pi_{d,j}$.
- Within theme j, the word is drawn from the common theme model $\theta_j$ with probability $\lambda_C$, or from the model specific to collection $C_i$, $\theta_{j,i}$, with probability $1 - \lambda_C$.
- Each theme j thus has one common model $\theta_j$ and m collection-specific models $\theta_{j,1}, \ldots, \theta_{j,m}$, one for each collection $C_1, \ldots, C_m$.

Sample results compare news articles about the Iraq war and the Afghan war.

Reference: C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. KDD 2004.

Temporal Text Mining

Goal: Extract evolutionary theme patterns from a time-labeled collection.
Applications: News summarization, literature analysis, opinion monitoring, etc.
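To make the cross-collection generative process above concrete, here is a minimal Python sketch of the word probability it implies: a word comes from the background model with probability lambda_B, and otherwise from theme j (document weight pi_{d,j}), split between the common and the collection-specific model by lambda_C. The function name `word_prob`, the dictionary layout, and every number in the toy data are illustrative assumptions, not values from the poster.

```python
from typing import Dict, List

WordDist = Dict[str, float]  # word -> probability


def word_prob(
    word: str,
    background: WordDist,      # p(w | theta_B)
    common: List[WordDist],    # p(w | theta_j), one entry per theme j
    specific: List[WordDist],  # p(w | theta_{j,i}) for the document's collection C_i
    pi_d: List[float],         # document-specific theme weights pi_{d,j}
    lambda_b: float,           # weight of the background model
    lambda_c: float,           # weight of the common model inside a theme
) -> float:
    """Probability of `word` in document d of collection C_i (sketch)."""
    theme_part = sum(
        pi_j * (lambda_c * common_j.get(word, 0.0)
                + (1.0 - lambda_c) * specific_j.get(word, 0.0))
        for pi_j, common_j, specific_j in zip(pi_d, common, specific)
    )
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * theme_part


# Toy data (hypothetical): three words, two themes, one collection.
vocab = ["the", "battery", "fast"]
background = {"the": 0.8, "battery": 0.1, "fast": 0.1}
common = [
    {"the": 0.1, "battery": 0.6, "fast": 0.3},
    {"the": 0.2, "battery": 0.2, "fast": 0.6},
]
specific = [
    {"the": 0.1, "battery": 0.8, "fast": 0.1},
    {"the": 0.1, "battery": 0.1, "fast": 0.8},
]
pi_d = [0.5, 0.5]

# Because every component is a proper distribution, the mixture is too.
total = sum(
    word_prob(w, background, common, specific, pi_d, lambda_b=0.5, lambda_c=0.5)
    for w in vocab
)
```

In practice the component distributions and weights would be estimated from the collections rather than set by hand; the sketch only evaluates the mixture for given parameters.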
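Theme evolution graph and threads of the tsunami data set

Themes shown in the graph:
- Immediate reports
- Statistics of death and loss
- Personal experience of survivors
- Statistics of further impact
- Aid from local areas
- Aid from the world
- Donations from countries
- Specific events of aid
- …
- Lessons from tsunami
- Research inspired

[Figure: documents (Doc1, Doc3, Doc..) ordered along a time axis; labels: theme spans, evolutionary transitions, theme evolution thread.]

"Generating" word w in doc d in the collection (model diagram):
- With probability $\lambda_B$ the word is drawn from the background model $\theta_B$; with probability $1 - \lambda_B$ it is drawn from one of the themes $\theta_1, \theta_2, \ldots, \theta_k$.
- Theme j is selected with the document-specific weight $\pi_{d,j}$.
- Example theme word distributions: (warning 0.3, system 0.2, ..); (aid 0.1, donation 0.05, support 0.02, ..); (statistics 0.2, loss 0.1, dead 0.05, ..).
- Example background word distribution: (is 0.05, the 0.04, a 0.03, ..).

Evolutionary transitions

Transitions between themes in different time spans (t1 … t2 on the time axis T) are identified by theme similarity. Example themes from the diagram:
- microarray 0.2, gene 0.1, protein 0.05
- web 0.3, classification 0.1, topic 0.1
- information 0.2, topic 0.1, classification 0.1, text 0.05

[Figure: theme life cycles of KDD abstracts; example themes: (gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …) and (rules 0.0142, association 0.0064, support 0.0053, …).]

[Figure: theme life cycles from the CNN news data set.]

Reference: Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text -- An Exploration of Temporal Text Mining. KDD 2005.

Spatiotemporal Text Mining

Goal: Model the spatiotemporal theme patterns from a collection of text.
- Model the mixture of topics: common themes.
- Spatiotemporal content analysis: theme life cycles, theme coverage snapshots.
Applications: Weblog mining, search result summarization, opinion tracking, business intelligence, etc.

Spatiotemporal model (diagram): a word w in document d, written at time t and location l, is generated as follows.
- With probability $\lambda_B$ the word is drawn from the background model, $P(w \mid \theta_B)$; with probability $1 - \lambda_B$ it is drawn from one of the themes $\theta_1, \ldots, \theta_i, \ldots, \theta_k$, $P(w \mid \theta_i)$.
- The theme is chosen according to the spatiotemporal context, $P(\theta_i \mid t, l)$, with probability $\lambda_{TL}$, or according to the document, $P(\theta_i \mid d)$, with probability $1 - \lambda_{TL}$.

From the fitted model, the poster computes theme life cycles and theme snapshots.

Reference: Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW 2006.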
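The poster's formulas for theme life cycles and theme snapshots are not legible in this transcript. As a hedged illustration only, the sketch below derives both from P(theme | t, l) and a context prior P(t, l) using Bayes' rule: the life cycle is taken to be P(t | theme, l), and the snapshot is taken to be the joint P(theme, l | t). These two definitions, the function names, and all toy numbers are assumptions made for this example and may differ from the authors' exact formulation.

```python
from typing import Dict, Tuple

Context = Tuple[str, str]  # (time, location)


def theme_life_cycle(
    theme: str,
    location: str,
    p_theme_given_tl: Dict[Context, Dict[str, float]],  # P(theme | t, l)
    p_tl: Dict[Context, float],                         # P(t, l)
) -> Dict[str, float]:
    """P(t | theme, l): strength of a theme over time at one location."""
    weights = {
        t: p_theme_given_tl[(t, l)][theme] * p_tl[(t, l)]
        for (t, l) in p_tl
        if l == location
    }
    total = sum(weights.values())
    if total == 0:
        return {t: 0.0 for t in weights}
    return {t: v / total for t, v in weights.items()}


def theme_snapshot(
    time: str,
    p_theme_given_tl: Dict[Context, Dict[str, float]],
    p_tl: Dict[Context, float],
) -> Dict[Tuple[str, str], float]:
    """P(theme, l | t): coverage of themes over locations at one time."""
    p_time = sum(p for (t, _), p in p_tl.items() if t == time)  # P(t)
    snapshot: Dict[Tuple[str, str], float] = {}
    if p_time == 0:
        return snapshot
    for (t, l), p in p_tl.items():
        if t != time:
            continue
        for theme, p_theme in p_theme_given_tl[(t, l)].items():
            snapshot[(theme, l)] = p_theme * p / p_time
    return snapshot


# Toy data (hypothetical): two weeks, two states, two themes.
p_tl = {
    ("week1", "Louisiana"): 0.4, ("week1", "Texas"): 0.1,
    ("week2", "Louisiana"): 0.2, ("week2", "Texas"): 0.3,
}
p_theme_given_tl = {
    ("week1", "Louisiana"): {"aid": 0.25, "storm": 0.75},
    ("week1", "Texas"): {"aid": 0.5, "storm": 0.5},
    ("week2", "Louisiana"): {"aid": 0.75, "storm": 0.25},
    ("week2", "Texas"): {"aid": 0.5, "storm": 0.5},
}

life_cycle = theme_life_cycle("aid", "Louisiana", p_theme_given_tl, p_tl)
snapshot = theme_snapshot("week1", p_theme_given_tl, p_tl)
```

Plotting `life_cycle` against time gives a curve of the kind described as "rising" or "dropping", and plotting one theme's entries of `snapshot` over a map gives a per-week coverage picture.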
Sample results (news articles about the Iraq war and the Afghan war)

Theme | Cluster 1 | Cluster 2 | Cluster 3
Common theme | united 0.042, nations 0.04, … | killed 0.035, month 0.032, deaths 0.023, … | …
Iraq theme | n 0.03, weapons 0.024, inspections 0.023, … | troops 0.016, hoon 0.015, sanches 0.012, … | …
Afghan theme | northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … | …

- The common theme indicates that the "United Nations" is involved in both wars.
- Collection-specific themes indicate different roles of the "United Nations" in the two wars.

Sample results (theme life cycles; plot labels: Dropping, Rising)

- The first 2 weeks are mostly about "aid from the world".
- The next 2 weeks are mostly about "personal experience".

Sample results (Weblog data about "Hurricane Katrina", 5 weeks, U.S.)

- Week 1: The theme is the strongest along the Gulf of Mexico.
- Week 2: The discussion moves towards the northern and western states.
- Week 3: The theme is distributed more uniformly over the states.
- Week 4: The theme is again strong along the east coast and the Gulf of Mexico.
- Week 5: The theme fades out in most states.