Published by Φωτινή Αποστολίδης. Modified over 6 years ago.
1
A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs
Qiaozhu Mei†, Chao Liu†, Hang Su‡, and ChengXiang Zhai† †: University of Illinois at Urbana-Champaign ‡: Vanderbilt University
2
Weblog as an emerging new data…
Searching for “hurricane katrina” returns 1 million blog articles. … …
3
An Example of Weblog Article
Figure labels: location info, time stamp, blog contents.
One of the search results is shown here. It carries metadata such as time and location.
4
Characteristics of Weblogs
Weblog article:
Highly personal, with opinions
With mixed topics
Associated with time and location
Immediate response to events
Interlinking and forming communities
Compared with other kinds of data, weblogs have distinctive characteristics that make them interesting to exploit for text mining.
5
Existing Work on Weblog Analysis
Interlinking and community analysis
  Identifying communities
  Monitoring the evolution and bursting of communities
  E.g., [Kumar et al. 2003] (figure labels: # of communities, # of nodes in communities)
Content analysis
  Blog-level topic analysis
  Information diffusion through blogspace
  Using topic bursting to predict sales spikes
  E.g., [Gruhl et al. 2005] (figure labels: blog mentions, sales rank)
There are generally two lines of existing work on blog analysis and mining. The first focuses on interlinking structure and community analysis. The second focuses on content analysis and temporal analysis.
6
How to Perform Spatiotemporal Theme Mining?
Given a collection of weblog articles about a topic, with time and location information:
Discover multiple themes (i.e., subtopics) being discussed in these articles
For a given location, discover how each theme evolves over time (generate a theme life cycle)
For a given time, reveal how each theme spreads over locations (generate a theme snapshot)
Compare theme life cycles in different locations
Compare theme snapshots in different time periods
…
No existing work has addressed the problem of spatiotemporal theme mining, which is defined as …
7
Spatiotemporal Theme Patterns
Theme life cycles: discussion about “Release of iPod Nano” in articles about “iPod Nano”
A theme snapshot: discussion about “Government Response” in articles about Hurricane Katrina
Figure labels: strength, time, locations; United States, China, Canada; 09/20/05 – 09/26/05.
This slide illustrates the two most interesting spatiotemporal theme patterns: the theme snapshot and the theme life cycle. They reveal different aspects of the content, help answer different questions, and are useful for different purposes.
8
Applications of Spatiotemporal Theme Mining
Help answer questions like:
  Which country responded first to the release of iPod Nano? China, UK, or Canada?
  Do people in different states (e.g., Illinois vs. Texas) respond differently or similarly to the increase in gas prices during Hurricane Katrina?
Potentially useful for:
  Summarizing search results
  Monitoring public opinions
  Business intelligence
  …
Such spatiotemporal theme mining has many potential applications.
9
Challenges in Spatiotemporal Theme Mining
How to represent a theme?
How to model the themes in a collection?
How to model their dependency on time and location?
How to compute the theme life cycles and theme snapshots?
All of these must be done in an unsupervised way…
However, this is not trivial to do.
10
Our Solution: Use a Probabilistic Spatiotemporal Theme Model
Each theme is represented as a multinomial distribution over the vocabulary (a language model)
Consider the collection as a sample from a mixture of these theme models
Fit the model to the data and estimate the parameters
Spatiotemporal theme patterns can then be computed from the estimated model parameters
11
Probabilistic Spatiotemporal Theme Model
Choose a theme θ_i, then draw a word from θ_i.
Theme θ_1: price 0.3, oil, …
Theme θ_2: donate 0.1, relief 0.05, help, …
…
Theme θ_k: city 0.2, new, orleans, …
Background θ_B: is 0.05, the, a, …
Document d, with time = t and location = l.
Probability of choosing theme θ_i = λ_TL p(θ_i|t,l) + (1 - λ_TL) p(θ_i|d)
λ_TL = weight on the spatiotemporal theme distribution
This slide shows the general idea of the model. The words drawn from the different distributions are mixed together to “generate” a document.
12
The “Generation” Process
A document d with location l and time t is generated, word by word, as follows:
First, decide whether to use the background theme θ_B
  With probability λ_B, use the background theme and draw a word w from p(w|θ_B)
If the background theme is not used, decide how to choose a topic theme
  With probability λ_TL, sample a theme θ_i from the shared spatiotemporal distribution p(θ|t,l)
  With probability 1 - λ_TL, sample a theme θ_i from p(θ|d)
Draw a word w from the selected theme distribution p(w|θ_i)
Parameters {p(w|θ_B), p(w|θ_i), p(θ|t,l), p(θ|d)} will be estimated
λ_B = background noise; λ_TL = weight on spatiotemporal modeling (both set manually)
The two λ parameters are set manually because they are not meant to best fit the data, but rather to give the user some control over the mining process.
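The generation process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `generate_document`, the argument names, and the dictionary/list layout of the distributions are hypothetical and do not come from the slides.

```python
import random

def generate_document(n_words, p_w_bg, p_w_theme, p_theme_tl, p_theme_d,
                      lambda_b, lambda_tl, rng=None):
    """Sample n_words words following the generation process on this slide.

    p_w_bg:     dict, word -> p(w | theta_B)
    p_w_theme:  list of dicts, one per theme, word -> p(w | theta_i)
    p_theme_tl: list, p(theta_i | t, l) for the document's time and location
    p_theme_d:  list, p(theta_i | d) for this document
    (All names and data layouts here are illustrative assumptions.)
    """
    rng = rng or random.Random()

    def draw(dist):
        words = list(dist)
        return rng.choices(words, weights=[dist[w] for w in words])[0]

    doc = []
    for _ in range(n_words):
        # With probability lambda_b, use the background theme.
        if rng.random() < lambda_b:
            doc.append(draw(p_w_bg))
            continue
        # Otherwise pick which distribution the theme is sampled from.
        theme_dist = p_theme_tl if rng.random() < lambda_tl else p_theme_d
        i = rng.choices(range(len(theme_dist)), weights=theme_dist)[0]
        # Draw the word from the selected theme.
        doc.append(draw(p_w_theme[i]))
    return doc
```

Setting `lambda_b` close to 1 makes almost every word come from the background theme, which is one way to see why the slides treat the two λ values as user controls rather than fitted parameters.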
13
The Likelihood Function
Terms of the likelihood function:
Count of word w in document d
Generating w using the background theme
Generating w using a topic theme
Choosing a topic theme according to the spatiotemporal context
Choosing a topic theme according to the document
This slide explains the different parts of the likelihood function.
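The formula itself is not preserved in this transcript. A log-likelihood consistent with the generation process on the previous slide would look roughly as follows; the symbols c(w,d) for the word count, t_d and l_d for the time and location of document d, V for the vocabulary, C for the collection, and k for the number of themes are notational assumptions made here.

```latex
\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w,d)\,
  \log\Big[ \lambda_B\, p(w \mid \theta_B)
  + (1-\lambda_B) \sum_{j=1}^{k} p(w \mid \theta_j)
    \big( \lambda_{TL}\, p(\theta_j \mid t_d, l_d)
        + (1-\lambda_{TL})\, p(\theta_j \mid d) \big) \Big]
```

Each labeled part maps to one factor: c(w,d) is the count of word w in document d, the λ_B term generates w from the background theme, p(w|θ_j) generates w from a topic theme, and the two terms in the inner parentheses choose the theme from the spatiotemporal context or from the document.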
14
Parameter Estimation
Use the maximum likelihood estimator
Use the Expectation-Maximization (EM) algorithm
p(w|θ_B) is set to the collection word probability
E step
M step
The EM formulas can be skipped. EM is an iterative algorithm that finds a local maximum of the likelihood. We run multiple trials to find a good local maximum.
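Since the E-step and M-step formulas are not preserved here, the following is a minimal EM sketch derived from the generation process described earlier, not the authors' exact update equations. The function name `fit_theme_model`, the data layout, the default iteration count, and the tiny `eps` guard are all assumptions made for illustration.

```python
import random

def fit_theme_model(docs, contexts, k, lambda_b, lambda_tl, n_iter=30, seed=0):
    """Hypothetical EM sketch for the mixture model described in the slides.

    docs:     list of dicts, word -> count c(w, d)
    contexts: list of (time, location) labels, one per document
    Returns (p_w, p_d, p_tl): p(w|theta_j), p(theta_j|d), p(theta_j|t,l).
    """
    rng = random.Random(seed)  # change the seed to run another trial
    eps = 1e-12                # numerical guard only, not part of the model
    vocab = sorted({w for doc in docs for w in doc})
    total = sum(sum(doc.values()) for doc in docs)
    # p(w|theta_B) is fixed to the collection word probability.
    p_bg = {w: sum(doc.get(w, 0) for doc in docs) / total for w in vocab}

    def normalize(xs):
        s = sum(xs)
        return [x / s for x in xs]

    def random_dist(n):
        return normalize([rng.random() + 0.1 for _ in range(n)])

    p_w = [dict(zip(vocab, random_dist(len(vocab)))) for _ in range(k)]
    p_d = [random_dist(k) for _ in docs]
    p_tl = {c: random_dist(k) for c in dict.fromkeys(contexts)}

    for _ in range(n_iter):
        acc_w = [dict.fromkeys(vocab, eps) for _ in range(k)]
        acc_d = [[eps] * k for _ in docs]
        acc_tl = {c: [eps] * k for c in p_tl}
        for d, doc in enumerate(docs):
            ctx = contexts[d]
            # Probability of choosing theme j for this document.
            mix = [lambda_tl * p_tl[ctx][j] + (1 - lambda_tl) * p_d[d][j]
                   for j in range(k)]
            for w, count in doc.items():
                # E step: posterior over the hidden choices for word w in d.
                topic = [p_w[j][w] * mix[j] for j in range(k)]
                s = sum(topic)
                p_topic = (1 - lambda_b) * s / (lambda_b * p_bg[w] + (1 - lambda_b) * s)
                for j in range(k):
                    n = count * p_topic * topic[j] / s  # expected count for theme j
                    from_tl = lambda_tl * p_tl[ctx][j] / mix[j]
                    acc_w[j][w] += n
                    acc_tl[ctx][j] += n * from_tl
                    acc_d[d][j] += n * (1 - from_tl)
        # M step: renormalize the expected counts.
        p_w = [dict(zip(vocab, normalize([acc_w[j][w] for w in vocab])))
               for j in range(k)]
        p_d = [normalize(row) for row in acc_d]
        p_tl = {c: normalize(row) for c, row in acc_tl.items()}
    return p_w, p_d, p_tl
```

Because EM only reaches a local maximum, one would typically call this with several different `seed` values and keep the run with the highest likelihood, in line with the multiple trials mentioned on the slide.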
15
Probabilistic Analysis of Spatiotemporal Themes
Once the parameters are estimated, we can easily perform probabilistic analysis of spatiotemporal themes:
Computing theme life cycles given a location
Computing theme snapshots given a time
Both theme patterns are computed from the learned parameters, and they can be visualized in different ways.
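The slide does not give the formulas, so the sketch below shows one natural way to compute both patterns with Bayes' rule from the learned p(θ|t,l). It assumes a prior p(t,l) over time-location pairs is available (for example, the share of words posted in each pair); that prior, the function names, and the dictionary layout are assumptions, not details taken from the slides.

```python
def theme_life_cycle(p_theme_tl, p_context, theme, location):
    """p(t | theme, location): how a theme's strength is spread over time
    at one location, obtained from p(theme | t, l) * p(t, l) by Bayes' rule.

    p_theme_tl: dict, (time, location) -> list of p(theme_j | t, l)
    p_context:  dict, (time, location) -> p(t, l)   (assumed to be given)
    """
    scores = {t: p_theme_tl[(t, l)][theme] * p_context[(t, l)]
              for (t, l) in p_theme_tl if l == location}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def theme_snapshot(p_theme_tl, p_context, theme, time):
    """p(l | theme, time): how a theme's strength is spread over locations
    in one time period, computed the same way."""
    scores = {l: p_theme_tl[(t, l)][theme] * p_context[(t, l)]
              for (t, l) in p_theme_tl if t == time}
    z = sum(scores.values())
    return {l: s / z for l, s in scores.items()}
```

Plotting the life cycle over time gives one curve per location, and coloring a map by the snapshot values gives one map per time period.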
16
Experiments and Results
Three time-stamped data sets of weblogs, each about one event (broad topic)
Location information is extracted from author profiles
On each data set, we extract a set of salient themes and their life cycles / theme snapshots

Data Set    # docs   Time Span (2005)   Query
Katrina     9377     08/16 - 10/04      Hurricane Katrina
Rita        1754     08/ /04            Hurricane Rita
iPod Nano   1720     09/ /26

Weblog articles were downloaded from the search results of blogsearch.google.com, using topical queries such as “hurricane katrina” and “ipod nano”.
17
Theme Life Cycles for Hurricane Katrina
Oil Price: price, oil, gas, increase, product, fuel, company, …
New Orleans: city, orleans, new, louisiana, flood, evacuate, storm, …
The upper figure shows the life cycles of different themes in Texas. The red line is a theme whose top-probability words are price, oil, gas, increase, etc., so we know it is about “oil price”. The blue line, on the other hand, covers events in the city of New Orleans. Both themes grew stronger during the first two weeks and weakened around mid-September. The New Orleans theme became strong again around the last week of September, while the other theme dropped monotonically.
The bottom figure shows the life cycles of the same theme, “New Orleans”, in different states. The theme reaches its highest probability first in Florida and Louisiana, followed by Washington and then Texas. During early September, it drops significantly in Louisiana while remaining strong in other states; we suppose this is because of the evacuation in Louisiana. Surprisingly, around late September the theme rises again in most states, most markedly in Louisiana. Since this is when Hurricane Rita arrived, we guess that Hurricane Rita affected the discussion of Hurricane Katrina. This is reasonable, since people are likely to mention the two hurricanes together or compare them. The Hurricane Rita data set offers more clues for this hypothesis.
18
Theme Snapshots for Hurricane Katrina
Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the north and west
Week 3: The theme distributes more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
Week 5: The theme fades out in most states
This slide shows snapshots of the theme “Government Response” over the first five weeks of Hurricane Katrina. The darker the color, the hotter the discussion of this theme. In the first week, the theme is strongest in the southeastern states, especially those along the Gulf of Mexico. In week 2, the theme spreads towards the northern and western states, as the northern states get darker. In week 3, the theme is distributed even more uniformly, meaning it has spread across the states. In week 4, however, the theme converges on the eastern states and the southeast coast again. Interestingly, this week overlaps with the first week of Hurricane Rita, which may have raised public concern about government response again in those areas. In week 5, the theme becomes weak in most inland states, and most of the remaining discussion is along the coasts.
Another interesting observation is that the theme is originally very strong in Louisiana (the state to the right of Texas), weakens dramatically there during weeks 2 and 3, and becomes strong again from the fourth week. Weeks 2 and 3 coincide with the evacuation in Louisiana.
19
Theme life cycles for Hurricane Rita
Figure curves: Hurricane Katrina: Government Response; Hurricane Rita: Government Response; Hurricane Rita: Storms
A theme in Hurricane Katrina is revived by Hurricane Rita
This figure compares theme life cycles in Hurricane Katrina and Hurricane Rita. The red line and the purple line correspond to two themes in Hurricane Rita, about “Government Response” and “Storms” respectively. Both grew strong rapidly when Hurricane Rita arrived, around the last two weeks of September, and dropped at the end of September. The blue line is the “Government Response” theme in the Hurricane Katrina data set. After dropping around the beginning of September, it rose again during the two weeks of Hurricane Rita. This gives further support to our guess that the discussions of the two events are correlated.
20
Theme Snapshots for Hurricane Rita
Both Hurricane Katrina and Hurricane Rita have the theme “Oil Price”
The spatiotemporal patterns of this theme in the same time period are similar
This slide compares the theme snapshots for the first two weeks of Hurricane Rita with those for the last two weeks of Hurricane Katrina. The snapshots for the first two weeks of Hurricane Rita show that the discussion of Hurricane Rita did not spread as significantly as the discussion of Hurricane Katrina did in its first two weeks, shown in the earlier slides. Instead, the spatial patterns resemble those of the last two weeks of Hurricane Katrina, which cover roughly the same time period. During the first week of Hurricane Rita, the theme “Oil Price” is already widespread over the states. In the following week, the theme does not spread further; instead, it converges back to the states strongly affected by the hurricane. Comparable patterns for the same theme appear during the last two weeks of Hurricane Katrina. This further suggests that the two comparable events interact in their impact on public concern.
21
Theme Life Cycles for iPod Nano
Release of Nano: ipod, nano, apple, september, mini, screen, new, …
Figure curves: United States, China, Canada, United Kingdom
This figure shows the life cycles of the theme “Release of Nano” in the iPod Nano data set. The United States is indeed the first country where the theme reaches the top of its life cycle, followed by Canada, China, and the United Kingdom at almost the same time, nearly a week after the US. The theme in China rises and drops sharply, which indicates that most discussion there took place within a short time period. The life cycles in Canada and the United Kingdom both have two peaks.
22
Contributions and Future Work
Defined a new problem: spatiotemporal text mining
Proposed a general mixture model for the mining task
Proposed methods for computing two spatiotemporal patterns: theme life cycles and theme snapshots
Applied it to weblog mining with interesting results
Future work:
  Capture content dependency between adjacent time stamps and locations
  Study granularity selection in spatiotemporal text mining
23
Thank You!