Qiaozhu Mei†, Chao Liu†, Hang Su‡, and ChengXiang Zhai†

Presentation transcript:

A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs Qiaozhu Mei†, Chao Liu†, Hang Su‡, and ChengXiang Zhai† †: University of Illinois at Urbana-Champaign ‡: Vanderbilt University

Weblog as an emerging new data… Searching for “hurricane katrina” returns 1 million blog articles. … …

An Example of a Weblog Article. [Figure: a weblog article annotated with its location info, time stamp, and blog contents.] One of the search results is shown here. It has some metadata such as time and location.

Characteristics of Weblogs. A weblog article is interlinked with others, forming communities; is an immediate response to events; has mixed topics; is associated with a time and a location; and is highly personal, with opinions. Compared with other kinds of data, weblogs have some special characteristics that make them interesting to exploit for text mining.

Existing Work on Weblog Analysis. Interlinking and community analysis: identifying communities; monitoring the evolution and bursting of communities; e.g., [Kumar et al. 2003] (figure axes: # of communities, # of nodes in communities). Content analysis: blog-level topic analysis; information diffusion through blogspace; using topic bursting to predict sales spikes; e.g., [Gruhl et al. 2005] (figure: blog mentions and sales rank). There are generally two lines of existing work on blog analysis and mining. The first focuses on the interlinking structure and community analysis. The second focuses on content analysis and temporal analysis.

How to Perform Spatiotemporal Theme Mining? No existing work has addressed the problem of spatiotemporal theme mining, which is defined as follows. Given a collection of weblog articles about a topic, with time and location information: discover multiple themes (i.e., subtopics) being discussed in these articles; for a given location, discover how each theme evolves over time (generate a theme life cycle); for a given time, reveal how each theme spreads over locations (generate a theme snapshot); compare theme life cycles in different locations; compare theme snapshots in different time periods; …

Spatiotemporal Theme Patterns. Theme life cycles: the strength over time of the discussion about “Release of iPod Nano” in articles about “iPod Nano”, plotted per location (United States, China, Canada). A theme snapshot: the discussion about “Government Response” in articles about Hurricane Katrina across locations, for 09/20/05 – 09/26/05. This is an illustration of the two most interesting spatiotemporal theme patterns, the theme snapshot and the theme life cycle. They reveal different theme patterns, help answer different questions about the content, and are useful for different purposes.

Applications of Spatiotemporal Theme Mining. Spatiotemporal theme mining has many potential applications. It helps answer questions like: Which country responded first to the release of iPod Nano: China, UK, or Canada? Do people in different states (e.g., Illinois vs. Texas) respond differently or similarly to the increase in gas prices during Hurricane Katrina? It is potentially useful for summarizing search results, monitoring public opinions, business intelligence, …

Challenges in Spatiotemporal Theme Mining. However, it is not trivial to do. How to represent a theme? How to model the themes in a collection? How to model their dependency on time and location? How to compute the theme life cycles and theme snapshots? All of these must be done in an unsupervised way.

Our Solution: Use a Probabilistic Spatiotemporal Theme Model. Each theme is represented as a multinomial distribution over the vocabulary (a language model). Consider the collection as a sample from a mixture of these theme models. Fit the model to the data and estimate the parameters. Spatiotemporal theme patterns can then be computed from the estimated model parameters.

Probabilistic Spatiotemporal Theme Model. For a document d with time t and location l: choose a theme θ_i, then draw a word from θ_i. Example themes: Theme θ_1: price 0.3, oil 0.2, ...; Theme θ_2: donate 0.1, relief 0.05, help 0.02, ...; Theme θ_k: city 0.2, new 0.1, orleans 0.05, ...; Background θ_B: is 0.05, the 0.04, a 0.03, ... Probability of choosing theme θ_i = λ_TL p(θ_i|t,l) + (1 - λ_TL) p(θ_i|d), where λ_TL is the weight on the spatiotemporal theme distribution. This slide shows the general idea of the model. The words drawn from different distributions are mixed together to “generate” a document.

The “Generation” Process. A document d of location l and time t is generated, word by word, as follows. First, decide whether to use the background theme θ_B: with probability λ_B, we use the background theme and draw a word w from p(w|θ_B). If the background theme is not used, we decide how to choose a topic theme: with probability λ_TL, we sample a theme using the “shared spatiotemporal distribution” p(θ|t,l); with probability 1 - λ_TL, we sample a theme using p(θ|d). Then draw a word w from the selected theme distribution p(w|θ_i). Parameters: {p(w|θ_B), p(w|θ_i), p(θ|t,l), p(θ|d)} (will be estimated); λ_B = background noise and λ_TL = weight on spatiotemporal modeling (will be manually set). The two lambda parameters are set manually because they are not designed to best fit the data, but rather to give the user some control over the mining process.
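The word-by-word generation process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_word`, the argument names, and the toy distributions are hypothetical and do not come from the slides; `lambda_b` and `lambda_tl` stand for the two manually set lambda weights.

```python
import random

def sample_word(rng, p_bg, themes, p_theme_doc, p_theme_tl, lambda_b, lambda_tl):
    """Draw one word following the three decisions in the generation process."""
    # Decision 1: with probability lambda_b, use the background theme.
    if rng.random() < lambda_b:
        dist = p_bg
    else:
        # Decision 2: with probability lambda_tl, select the theme from the
        # shared spatiotemporal distribution; otherwise from the document's own.
        weights = p_theme_tl if rng.random() < lambda_tl else p_theme_doc
        # Decision 3: sample a theme index, then use that theme's word distribution.
        i = rng.choices(range(len(themes)), weights=weights)[0]
        dist = themes[i]
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words])[0]

# Toy (made-up) distributions, for illustration only.
background = {"the": 0.6, "a": 0.4}
toy_themes = [{"oil": 0.5, "price": 0.5}, {"donate": 0.5, "relief": 0.5}]
rng = random.Random(0)
document = [sample_word(rng, background, toy_themes, [0.2, 0.8], [0.9, 0.1], 0.3, 0.5)
            for _ in range(10)]
```

Raising `lambda_tl` makes documents sharing a time and location look more alike in their theme mix, which is the control over the mining process that the slide attributes to the manually set weights.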

The Likelihood Function. The slide annotates the parts of the likelihood function: the count of word w in document d; generating w using the background theme; generating w using a topic theme; choosing a topic theme according to the spatiotemporal context; choosing a topic theme according to the document. This slide explains the different parts of the likelihood function.
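The formula itself did not survive in the transcript. A log-likelihood consistent with the generation process and with the annotated parts listed on this slide can be written as below. This is a reconstruction from the surrounding description, not a copy of the slide; the symbols C (the collection), V (the vocabulary), c(w,d) (the count of word w in document d), and t_d, l_d (the time and location of d) are notation assumed here.

```latex
\log p(\mathcal{C}) = \sum_{d \in \mathcal{C}} \sum_{w \in V} c(w,d)\,
\log \Big[ \lambda_B \, p(w \mid \theta_B)
  + (1 - \lambda_B) \sum_{j=1}^{k} p(w \mid \theta_j)
    \big( (1 - \lambda_{TL}) \, p(\theta_j \mid d)
        + \lambda_{TL} \, p(\theta_j \mid t_d, l_d) \big) \Big]
```

Each term maps to one annotation: c(w,d) is the word count, the first summand generates w from the background theme, p(w | θ_j) generates w from a topic theme, and the two inner terms choose that theme according to the document or the spatiotemporal context.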

Parameter Estimation. Use the maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm. p(w|θ_B) is set to the collection word probability. [E step and M step formulas shown on the slide.] The EM formulas can be skipped: EM is an iterative algorithm that finds a local maximum of the likelihood. We do multiple trials to find a good local maximum.
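Since the E step and M step formulas are not preserved in the transcript, the following is a minimal sketch of how EM could be applied to the mixture described in the generation process. It is derived from that description rather than taken from the slides, and every name in it (`em_fit`, `normalise`, the argument names) is hypothetical. Documents are word-count dictionaries, and each document has a (time, location) context.

```python
import math
import random

def normalise(values, fallback):
    """Scale values to sum to 1; keep the old estimate if nothing was observed."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(fallback)

def em_fit(docs, contexts, vocab, k, lambda_b, lambda_tl, iters=20, seed=0):
    rng = random.Random(seed)
    rand = lambda n: normalise([rng.random() + 0.1 for _ in range(n)], [])
    # Background model fixed to the collection word probability, as on the slide.
    total = sum(sum(d.values()) for d in docs)
    p_bg = {w: sum(d.get(w, 0) for d in docs) / total for w in vocab}
    p_w = [dict(zip(vocab, rand(len(vocab)))) for _ in range(k)]  # p(w | theme j)
    p_d = [rand(k) for _ in docs]                                 # p(theme j | d)
    p_tl = {c: rand(k) for c in sorted(set(contexts))}            # p(theme j | t, l)
    history = []
    for _ in range(iters):
        n_w = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        n_d = [[0.0] * k for _ in docs]
        n_tl = {c: [0.0] * k for c in p_tl}
        loglik = 0.0
        for di, (doc, ctx) in enumerate(zip(docs, contexts)):
            for w, count in doc.items():
                # E step: weight of each (theme, selection route) pair for this word.
                via_doc = [(1 - lambda_b) * (1 - lambda_tl) * p_d[di][j] * p_w[j][w]
                           for j in range(k)]
                via_tl = [(1 - lambda_b) * lambda_tl * p_tl[ctx][j] * p_w[j][w]
                          for j in range(k)]
                norm = lambda_b * p_bg[w] + sum(via_doc) + sum(via_tl)
                loglik += count * math.log(norm)
                # Accumulate expected counts for the M step.
                for j in range(k):
                    n_d[di][j] += count * via_doc[j] / norm
                    n_tl[ctx][j] += count * via_tl[j] / norm
                    n_w[j][w] += count * (via_doc[j] + via_tl[j]) / norm
        history.append(loglik)
        # M step: renormalise the expected counts into probabilities.
        p_d = [normalise(row, p_d[di]) for di, row in enumerate(n_d)]
        p_tl = {c: normalise(row, p_tl[c]) for c, row in n_tl.items()}
        p_w = [dict(zip(vocab, normalise([n_w[j][w] for w in vocab],
                                         [p_w[j][w] for w in vocab])))
               for j in range(k)]
    return p_w, p_d, p_tl, history
```

Calling `em_fit` with several different `seed` values and keeping the run whose final entry in `history` is largest corresponds to the multiple trials mentioned in the notes.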

Probabilistic Analysis of Spatiotemporal Themes. Once the parameters are estimated, we can easily perform probabilistic analysis of spatiotemporal themes: computing theme life cycles given a location, and computing theme snapshots given a time. The two theme patterns can be computed from the learned parameters, and they can be visualized in different ways.
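The slide's formulas for the two patterns are not preserved in the transcript. One plausible way to obtain them from the learned p(θ|t,l) is Bayes' rule, sketched below. Treat this as an assumption-laden illustration: the function names, the prior `p_ctx` over (time, location) pairs, and the toy numbers are all hypothetical, and the slides do not confirm this exact form.

```python
def life_cycle(p_tl, p_ctx, theme, location):
    """Distribution over time for one theme at one location: p(t | theme, l)."""
    scores = {t: row[theme] * p_ctx[(t, l)]
              for (t, l), row in p_tl.items() if l == location}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

def snapshot(p_tl, p_ctx, time):
    """Joint distribution over (theme, location) at one time: p(theme, l | t)."""
    scores = {(j, l): p * p_ctx[(t, l)]
              for (t, l), row in p_tl.items() if t == time
              for j, p in enumerate(row)}
    total = sum(scores.values())
    return {key: s / total for key, s in scores.items()}

# Toy (made-up) learned parameters: p(theme | time, location) for two themes.
toy_p_tl = {("week1", "TX"): [0.8, 0.2],
            ("week2", "TX"): [0.4, 0.6],
            ("week1", "LA"): [0.5, 0.5]}
# Assumed prior over (time, location), e.g. each context's share of the text.
toy_p_ctx = {("week1", "TX"): 0.25, ("week2", "TX"): 0.25, ("week1", "LA"): 0.5}

texas_cycle = life_cycle(toy_p_tl, toy_p_ctx, theme=0, location="TX")
week1_snapshot = snapshot(toy_p_tl, toy_p_ctx, time="week1")
```

Plotting `life_cycle` against time gives a curve per location, and shading a map by the `snapshot` values for one theme gives a map-style view, matching the two kinds of figures described in the results slides.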

Experiments and Results. Three time-stamped data sets of weblogs, each about one event (broad topic). Location information is extracted from author profiles. On each data set, we extract a set of salient themes and their life cycles / theme snapshots.
Data Set    # docs   Time Span (2005)   Query
Katrina     9377     08/16 - 10/04      Hurricane Katrina
Rita        1754     08/16 - 10/04      Hurricane Rita
iPod Nano   1720     09/02 - 10/26
Weblog articles are downloaded from search results from blogsearch.google.com, with topical queries like “hurricane katrina”, “ipod nano”, etc.

Theme Life Cycles for Hurricane Katrina. Theme “Oil Price”: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, … Theme “New Orleans”: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
The upper figure shows the life cycles of different themes in Texas. The red line is a theme whose top-probability words are price, oil, gas, increase, etc., from which we know that it is about “oil price”. The blue one, on the other hand, is about events that happened in the city of New Orleans. Both themes were getting hot during the first two weeks and became weaker around mid-September. The theme “New Orleans” got strong again around the last week of September, while the other theme dropped monotonically.
The bottom figure shows the life cycles of the same theme, “New Orleans”, in different states. We observe that this theme reaches its highest probability first in Florida and Louisiana, followed by Washington and then Texas. During early September, this theme drops significantly in Louisiana while remaining strong in other states. We suppose this is because of the evacuation in Louisiana. Surprisingly, around late September, a re-arising pattern can be observed in most states, and it is most significant in Louisiana. Since this is the time period in which Hurricane Rita arrived, we guess that Hurricane Rita had an impact on the discussion of Hurricane Katrina. This is reasonable, since people are likely to mention the two hurricanes together or make comparisons. We can find more clues for this hypothesis in the Hurricane Rita data set.

Theme Snapshots for Hurricane Katrina. Week 1: the theme is the strongest along the Gulf of Mexico. Week 2: the discussion moves towards the north and west. Week 3: the theme is distributed more uniformly over the states. Week 4: the theme is again strong along the east coast and the Gulf of Mexico. Week 5: the theme fades out in most states.
This slide shows the snapshots of the theme “Government Response” over the first five weeks of Hurricane Katrina. The darker the color, the hotter the discussion about this theme. We observe that in the first week of Hurricane Katrina, the theme “Government Response” is the strongest in the southeast states, especially those along the Gulf of Mexico. In week 2, the theme is spreading towards the northern and western states, as the northern states are getting darker. In week 3, the theme is distributed even more uniformly, which means that it is spreading all over the states. However, in week 4, we observe that the theme converges to the east states and the southeast coast again. Interestingly, this week happens to overlap with the first week of Hurricane Rita, which may have raised public concern about government response again in those areas. In week 5, the theme becomes weak in most inland states, and most of the remaining discussions are along the coasts.
Another interesting observation is that this theme is originally very strong in Louisiana (the state to the right of Texas), but weakens dramatically in Louisiana during weeks 2 and 3, and becomes strong again from the fourth week. Interestingly, weeks 2 and 3 are consistent with the time of the evacuation in Louisiana.

Theme Life Cycles for Hurricane Rita. Curves shown: Hurricane Katrina: Government Response; Hurricane Rita: Government Response; Hurricane Rita: Storms. A theme in Hurricane Katrina is inspired again by Hurricane Rita. This figure compares theme life cycles in Hurricane Katrina and Hurricane Rita. The red line and the purple line correspond to two themes in Hurricane Rita, which are about “Government Response” and “Storms” respectively. They both got strong rapidly when Hurricane Rita arrived, around the last two weeks of September, and dropped at the end of September. The blue line is the theme about “Government Response” in the Hurricane Katrina data set. We can see that after it dropped around the beginning of September, it rose again during the two weeks of Hurricane Rita. This gives further support to our guess that the discussions about the two events are correlated.

Theme Snapshots for Hurricane Rita. Both Hurricane Katrina and Hurricane Rita have the theme “Oil Price”. The spatiotemporal patterns of this theme in the same time period are similar. This is a comparison of the theme snapshots during the first two weeks of Hurricane Rita and the last two weeks of Hurricane Katrina. The theme snapshots over the first two weeks of Hurricane Rita show that the discussion of Hurricane Rita did not spread as significantly as it did during the first two weeks of Hurricane Katrina, which we have shown in the previous slides. Instead, the spatial patterns are similar to those of the last two weeks of Hurricane Katrina, which cover roughly the same time period. During the first week of Hurricane Rita, we observe that the theme “Oil Price” is already widespread over the States. In the following week, the topic does not spread further; instead, it converges back to the states strongly affected by the hurricane. Comparable patterns for the same theme can be found during the last two weeks of Hurricane Katrina. This further implies that the two comparable events have an interacting impact on the public concerns about them.

Theme Life Cycles for iPod Nano. Curves shown: United States, China, Canada, United Kingdom. Theme “Release of Nano”: ipod 0.2875, nano 0.1646, apple 0.0813, september 0.0510, mini 0.0442, screen 0.0242, new 0.0200, … This figure shows the life cycles of the theme “Release of Nano” in the iPod Nano data set. From the figure, we see that the United States is indeed the first country where the theme reaches the top of its life cycle, followed by Canada, China, and the United Kingdom at almost the same time, nearly a week after the US. The theme in China shows a sharp rise and drop, which indicates that most discussions there took place within a short time period. The life cycles in Canada and the United Kingdom both have two peaks.

Contributions and Future Work. Defined a new problem: spatiotemporal text mining. Proposed a general mixture model for the mining task. Proposed methods for computing two spatiotemporal patterns: theme life cycles and theme snapshots. Applied the approach to weblog mining, with interesting results. Future work: capture content dependency between adjacent time stamps and locations; study granularity selection in spatiotemporal text mining.

Thank You!