1
Analysis of Social Media: Trend Analysis
Yi-Chia Wang, LTI 2nd year Master student
2
Introduction
Document streams arrive continuously over time: email, news articles, search engine query logs, …
Existing ways to identify topics in document streams: topic detection and tracking, text mining, visualization, …
Is there a better organizing principle for the enormous archives of document streams?
Temporal information in document streams.
Trausan-Matu et al., 2007
Oct-30 Analysis of Social Media 2007
3
“Burst of activity”
Topics appear, grow in intensity for a period of time, and then fade away.
Bursts correspond to points at which the intensity of message arrivals increases sharply.
Problems with naive identification of bursts:
  It easily identifies large numbers of short bursts.
  It fragments a long burst into many smaller ones.
Goal: identify bursts only when they have sufficient intensity.
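The two problems with naive identification can be seen in a small sketch. The detector below is a hypothetical baseline, not a method from the slides: it flags every maximal run of gaps shorter than a fixed threshold as a burst. The gap values and the threshold are made up for illustration.

```python
def naive_bursts(gaps, threshold):
    """Flag each maximal run of gaps below `threshold` as a burst.

    Returns (start, end) index pairs, with `end` inclusive.
    This is an illustrative baseline, not the automaton model.
    """
    bursts, start = [], None
    for t, x in enumerate(gaps):
        if x < threshold and start is None:
            start = t                      # a run of short gaps begins
        elif x >= threshold and start is not None:
            bursts.append((start, t - 1))  # the run ends at the previous gap
            start = None
    if start is not None:
        bursts.append((start, len(gaps) - 1))
    return bursts


# Made-up gaps: one isolated short gap, then a busy period (indices 3..9)
# interrupted twice by a slightly longer gap of 0.6.
gaps = [1.0, 0.1, 1.0, 0.1, 0.1, 0.6, 0.1, 0.1, 0.6, 0.1, 1.0]
print(naive_bursts(gaps, threshold=0.5))
# -> [(1, 1), (3, 4), (6, 7), (9, 9)]
```

The isolated short gap at index 1 is reported as a burst, and the single busy period is split into three pieces, which is the behavior the goal above is meant to avoid.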
4
Bursty and Hierarchical Structure in Streams
Jon Kleinberg, Department of Computer Science, Cornell University. SIGKDD ’02.
My advisor is Carolyn Rose. This work is funded by PSLC. The project name is TagHelper.
5
Two-state Automaton (A) Model
Idea: periods of lower message intensity interleave with periods of higher message intensity.
The automaton A begins in state q0.
Before each message is emitted, A changes state with probability p and stays in its current state with probability 1-p.
In state q0, messages are emitted at a slow rate; in state q1, messages are emitted at a faster rate.
[Figure: message intensity over time, alternating between states q0 and q1; transitions labeled p and 1-p.]
States correspond to higher and higher message intensities.
State transitions signal bursts.
Emissions determine the arrival times of the next messages.
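As a minimal sketch of the generative process described on this slide, the simulation below flips state with probability p before each message and then draws the gap to the next message at the current state's rate. The exponential gap distribution, the function name, and the parameter values are assumptions chosen for illustration.

```python
import random


def simulate_two_state(n, p, rate_low, rate_high, seed=0):
    """Generate n (state, gap) pairs from a two-state automaton.

    Assumptions for this sketch: the automaton starts in state 0 (q0),
    switches state with probability p before each message, and gaps are
    exponentially distributed with the rate of the current state.
    """
    rng = random.Random(seed)
    state = 0
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:
            state = 1 - state              # state transition
        rate = rate_high if state == 1 else rate_low
        states.append(state)
        gaps.append(rng.expovariate(rate)) # time until the next message
    return states, gaps


states, gaps = simulate_two_state(n=2000, p=0.1, rate_low=1.0, rate_high=10.0)
```

Gaps drawn while the automaton is in q1 are on average much shorter than those drawn in q0, so runs of q1 show up as dense clusters of messages.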
6
Exponential Distribution
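Goal: model the message emission rate by modeling the time gap x between messages i and i+1.
The gap is modeled by an exponential distribution with density $f(x) = \alpha e^{-\alpha x}$, where the parameter $\alpha$ is the rate of message arrivals; the expected gap is $1/\alpha$.
[Figure: exponential distribution density. Source: Wikipedia]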
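A short sketch of why the exponential model separates the two states, assuming the standard density f(x) = rate * exp(-rate * x). The negative log-density is the per-gap quantity a cost function would charge; the rates and gap lengths below are illustrative values.

```python
import math


def exp_density(x, rate):
    """Exponential density f(x) = rate * exp(-rate * x) for a gap x >= 0."""
    return rate * math.exp(-rate * x)


def neg_log_density(x, rate):
    """-ln f(x) = rate * x - ln(rate); lower means the gap fits the rate better."""
    return rate * x - math.log(rate)


slow, fast = 1.0, 10.0           # illustrative rates for a slow and a fast state
short_gap, long_gap = 0.05, 2.0  # illustrative gap lengths

# A short gap is cheaper under the fast rate, a long gap under the slow rate.
print(neg_log_density(short_gap, fast) < neg_log_density(short_gap, slow))  # True
print(neg_log_density(long_gap, slow) < neg_log_density(long_gap, fast))    # True
```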
7
Two-state Automaton (A) Model
Formally, given:
  $n + 1$ messages with specified arrival times
  $x = (x_1, x_2, \ldots, x_n)$: the inter-arrival gaps
We want to determine the conditional probability of a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$ given x.
Given a set of messages, one can find a likely state sequence based on the model.
8
Two-state Automaton (A) Model
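Find a state sequence q that maximizes the probability. Equivalently, minimize the following cost function:
$$c(q \mid x) = b \ln\frac{1-p}{p} + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$
Here b is the number of state transitions in q and $f_{i_t}$ is the gap density of the state occupied at step t.
The second term favors state sequences that conform well to the sequence x of gap values.
The first term favors sequences with a small number of state transitions.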
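The minimization can be sketched with a Viterbi-style dynamic program. The sketch assumes the cost has the form given in Kleinberg's paper: a charge of ln((1-p)/p) for every state transition plus -ln f(x_t) for each gap under an exponential density, with the automaton starting in q0. Function names and the example values are illustrative.

```python
import math


def two_state_cost(states, gaps, p, rates):
    """Cost of a state sequence: transition charges plus -ln f of each gap."""
    transition_cost = math.log((1 - p) / p)
    cost, prev = 0.0, 0                    # the automaton starts in state 0
    for s, x in zip(states, gaps):
        if s != prev:
            cost += transition_cost
        cost += rates[s] * x - math.log(rates[s])
        prev = s
    return cost


def best_two_state_sequence(gaps, p, rates):
    """Return a minimum-cost state sequence (list of 0/1) for the gaps."""
    if not gaps:
        return []
    transition_cost = math.log((1 - p) / p)
    best = [0.0, math.inf]                 # cost so far of ending in state 0 / 1
    back = []                              # back[t][s] = best predecessor of s at t
    for x in gaps:
        new_best, pointers = [], []
        for s in (0, 1):
            stay = best[s]
            switch = best[1 - s] + transition_cost
            pointers.append(s if stay <= switch else 1 - s)
            emit = rates[s] * x - math.log(rates[s])
            new_best.append(min(stay, switch) + emit)
        best = new_best
        back.append(pointers)
    state = 0 if best[0] <= best[1] else 1
    path = [state]
    for pointers in reversed(back[1:]):    # walk the predecessors backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


gaps = [1.0] * 5 + [0.05] * 10 + [1.0] * 5
print(best_two_state_sequence(gaps, p=0.1, rates=(1.0, 10.0)))
```

With these values the ten short gaps are assigned to q1, while a single short gap surrounded by long ones stays in q0, because the saving on one gap does not pay for two transitions.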
9
Infinite-state Automata Model
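Cost function, following the paper: states $q_0, q_1, q_2, \ldots$, where state $q_i$ emits at rate $\alpha_i = (n/T)\, s^i$ for n gaps over total time T and a scaling parameter $s > 1$.
$$c(q \mid x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t), \quad i_0 = 0$$
Moving up from state $q_i$ to a higher state $q_j$ costs $\tau(i, j) = (j - i)\,\gamma \ln n$; moving down costs nothing.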
10
Computing a minimum-cost state sequence
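THEOREM: If q* is an optimal state sequence in the finite automaton $A^k_{s,\gamma}$, then it is also an optimal state sequence in the infinite-state automaton $A^*_{s,\gamma}$, where $k = \lceil 1 + \log_s T + \log_s \delta(x)^{-1} \rceil$ and $\delta(x)$ is the smallest gap.
Dynamic programming is used to search for an optimal state sequence.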
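A sketch of the dynamic program over a finite number of states. Everything numeric here is an assumption taken from Kleinberg's paper rather than from the slide text: state i has rate (n/T) * s**i, moving up from state i to j costs (j - i) * gamma * ln(n), moving down is free, and k = ceil(1 + log_s T + log_s(1/min gap)) states suffice. The function name and the example gaps are illustrative.

```python
import math


def optimal_states(gaps, s=2.0, gamma=1.0):
    """Minimum-cost state sequence over k states (all gaps must be > 0)."""
    n = len(gaps)
    if n == 0:
        return []
    total = sum(gaps)
    # Assumed bound on the number of states that need to be considered.
    k = math.ceil(1 + math.log(total, s) + math.log(1 / min(gaps), s))
    rates = [(n / total) * s ** i for i in range(k)]
    step_up = gamma * math.log(n)          # cost of climbing one level

    best = [0.0] + [math.inf] * (k - 1)    # the automaton starts in state 0
    back = []
    for x in gaps:
        new_best, pointers = [], []
        for j in range(k):
            # Cost of arriving in state j from each possible previous state i.
            costs = [best[i] + (step_up * (j - i) if j > i else 0.0)
                     for i in range(k)]
            i_min = min(range(k), key=costs.__getitem__)
            new_best.append(costs[i_min] + rates[j] * x - math.log(rates[j]))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    state = min(range(k), key=best.__getitem__)
    path = [state]
    for pointers in reversed(back[1:]):    # follow predecessors backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


gaps = [10.0] * 20 + [0.1] * 30 + [10.0] * 20
states = optimal_states(gaps)
```

The running time is O(n k^2), which is small in practice because k grows only logarithmically with the time span and the smallest gap.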
11
Bursts exhibit a natural nested structure
A burst of intensity j is a maximal interval over which the state sequence stays in a state of index j or higher.
Bursts can also be represented as a tree, in which each burst is a node.
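The definition translates directly into code. The sketch below lists every burst of intensity j >= 1 in a state sequence as a (level, start, end) triple with an inclusive end index; the function name and the example sequence are made up for illustration.

```python
def extract_bursts(states):
    """Return (level, start, end) for every maximal run with state >= level."""
    bursts = []
    for level in range(1, max(states, default=0) + 1):
        start = None
        for t, s in enumerate(states):
            if s >= level and start is None:
                start = t                          # burst of this level begins
            elif s < level and start is not None:
                bursts.append((level, start, t - 1))
                start = None
        if start is not None:                      # burst runs to the end
            bursts.append((level, start, len(states) - 1))
    return bursts


states = [0, 1, 1, 2, 2, 1, 3, 1, 0, 1, 0]
print(extract_bursts(states))
# -> [(1, 1, 7), (1, 9, 9), (2, 3, 4), (2, 6, 6), (3, 6, 6)]
```

Every burst of level j + 1 lies inside exactly one burst of level j, which is what makes the tree representation possible: the enclosing lower-level burst is the parent node.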
12
Experiments
The model applies to many datasets of an analogous flavor:
  Titles of conference papers
  U.S. Presidential State of the Union Addresses
  Web clickstreams
13
Dataset
Question: does the appearance of messages containing particular words exhibit a burst in the vicinity of significant times such as deadlines?
Data: the author’s own collection of email, June 9, 1997 – August 23, 2001; 34344 messages (41.7 MB).
The analysis focuses on the response set.
14
Results for the Word - ITR
ITR is the name of a large NSF program.
The author wrote two proposals for it, one small and one large.
The figure shows the resulting bursts for the optimal state sequence.
The intervals are annotated with the first and last dates of the messages.
The first subtree splits further into two subtrees.
For the second subtree, there is no burst since the author did not continue the submission.
15
Results for the Word - prelim
Prelim is the term used at Cornell for non-final exams.
The author taught courses in 4 of the 8 semesters covered by the collection of email, and each of these courses had 2 prelims.
For the first of these courses, there was a special course account.
Each of the remaining 3 courses corresponds to a long burst and 2 shorter, more intense bursts for the particular prelims.
The 2 structures suggest how a large folder of email might naturally be divided into a hierarchical set of sub-folders around certain key events, based only on the rate of message arrivals.
16
Titles of Conference Papers
Goal: extract bursts in term usage from the titles of conference papers over the past several decades.
Problem: conference papers arrive in discrete batches every half year or year, so there are no inter-arrival gaps between messages.
Modified automaton model:
  It generates batched arrivals.
  Each state has an expected fraction of relevant documents.
  A burst is identified when the fraction of relevant documents increases.
17
Titles of Conference Papers
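Cost function for each arrival batch, where batch t contains $d_t$ documents of which $r_t$ are relevant and state $q_i$ has expected relevant fraction $p_i$:
$$\sigma(i, r_t, d_t) = -\ln\left[\binom{d_t}{r_t}\, p_i^{r_t} (1 - p_i)^{d_t - r_t}\right]$$
The weight of the burst $[t_1, t_2]$ is the improvement in cost from using state q1 rather than state q0:
$$\sum_{t=t_1}^{t_2} \bigl(\sigma(0, r_t, d_t) - \sigma(1, r_t, d_t)\bigr)$$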
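A sketch of the batched cost and the burst weight, assuming the binomial form from Kleinberg's paper: the cost of a batch with r relevant documents out of d in a state with expected fraction p is -ln[C(d, r) p^r (1-p)^(d-r)]. The batch counts, the choice p1 = 2 * p0, and the function names are illustrative assumptions.

```python
import math


def batch_cost(p, relevant, total):
    """-ln of the binomial probability of `relevant` hits in `total` documents."""
    log_binom = (math.lgamma(total + 1)
                 - math.lgamma(relevant + 1)
                 - math.lgamma(total - relevant + 1))
    return -(log_binom
             + relevant * math.log(p)
             + (total - relevant) * math.log(1 - p))


def burst_weight(batches, p0, p1, start, end):
    """Cost saved on batches start..end (inclusive) by using q1 instead of q0."""
    return sum(batch_cost(p0, r, d) - batch_cost(p1, r, d)
               for r, d in batches[start:end + 1])


# Made-up (relevant, total) counts for five conference years.
batches = [(2, 100), (3, 100), (20, 100), (25, 100), (2, 100)]
p0 = sum(r for r, _ in batches) / sum(d for _, d in batches)  # overall fraction
p1 = 2 * p0                                                   # assumed scaling

print(burst_weight(batches, p0, p1, 2, 3) > 0)  # True: years 2-3 look bursty
print(burst_weight(batches, p0, p1, 0, 1) > 0)  # False: years 0-1 do not
```

A positive weight means the elevated state explains the interval better than the base state, and larger weights indicate stronger bursts, which is how bursts can be ranked.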
18
SIGMOD & VLDB
Each word in the paper titles is considered.
The 30 bursts of highest weight are shown.
For bursts with no ending date, the interval extends to the most recent conference.
These bursty words are different from a list of common words.
The bursts pick up trends in language use.
19
STOC & FOCS
The 30 bursts of highest weight are shown.
They reflect particular titling conventions that were in fashion for certain periods, such as “How to construct random functions” …
20
U.S. Presidential State of the Union Addresses
Kleinberg, SIGKDD ’02
21
Web usage data – clickstreams
Settings:
  80 undergraduate students
  Two and a half months in Spring 2000
  For every URL w, all bursts in the stream of visits to w are determined.
  The focus is on high-weight bursts, as well as those that involve at least 10 distinct users.
Results: high-ranked bursts involve the URLs of the online class reading assignments, centered on intervals shortly before and during the weekly sessions at which they were discussed.
22
Conclusions
Streams are modeled using an infinite-state automaton.
State transitions correspond to bursts.
First story detection: the single message on which the associated state transition occurred.
The model offers a means of structuring the information that arises from our patterns of interacting and communicating.
Document streams have a strong temporal character.
In many domains, we are accumulating detailed records of our own communication and behavior.