Hashtags as Milestones in Time
Identifying the hashtags for meaningful events using Twitter search logs and Wikipedia data
Stewart Whiting, University of Glasgow
Omar Alonso, Microsoft/Bing
Time Aware Information Access Workshop, SIGIR Oregon,
(Work done while on internship at Microsoft)
Outline
1. Hashtags as milestones in time
2. Introduction
   1. Why milestones?
   2. Why hashtags? Can they be useful as milestones?
3. Motivation
4. Approach
   1. Data preparation
   2. Approach steps
5. Constructing a timeline – examples
6. Preliminary conclusions
Abstract: Hashtags as milestones in time
What we want to do:
Identify event-based hashtags for timeline creation
  – Currently using historic/past data
Filter out junk
Find the most temporally significant hashtags
  – Use multiple signals: Twitter search logs + related Wikipedia article popularity
We are not doing topic detection/tracking!
Why?
A good way to express (anchor) a topic on a timeline…
Help users make sense of/navigate temporal information
#what?
Introduction
Hashtags are used by authors to explicitly denote the relevant topic(s) in a message
  – “Great passing, great game #euro2012”
Used by both authors and searchers
  – To broadcast and consume a specific topic
  – Especially useful in short-text retrieval, where bag-of-words and language-modelling approaches are challenging
Reflect mainstream events (or memes!) in real time
  – See the trending topics right now
Timelines are very good for displaying events
  – But you need to express each event as a meaningful marker, or milestone!
Introduction to the data
Two crowds of people
  – Authors/searchers on Twitter
  – Editors/browsers on Wikipedia
Correlation between signals from the two crowds
  – People search for what is happening
  – People edit Wikipedia with what is happening
  – Two very distinctive signals!
Twitter hashtag signals (in search logs)
But plenty of memes too…
  – #20PeopleWhoIWantToMeet
  – #PresentingInTheBatCave
  – #whiteppldoitbutblackppldont
Wikipedia signals
Whitney Houston
  – TV appearances
  – Her death in February 2012
Events were reflected by discussion with hashtags on Twitter, e.g.
  – #ripwhitney
  – #bgtwhitney (BGT = Britain’s Got Talent)
Motivation
Both signals have large coverage
  – Celebrities, news, weather, people, science, movies, etc.
Two robust signals coming from large crowds
  – Difficult for individuals to influence (spam?)
  – Less reliant on single-signal analysis (e.g. wavelets or burst detection)
Discard memes by looking for associated Wikipedia articles
Meaningful milestones in timelines provide strong features for navigating temporal content
  – Alonso et al. (2010), Matthews et al. (2010), From et al. (2003)
Data Preparation – Hashtag Data
Extracted from Bing Social and IE8 query logs
Provides hashtag use, aggregated per day
(Proprietary, but could be extracted from other sources)
Hashtags are mostly a mix of unigrams and bigrams!
We also want the individual words in the hashtag
Need to use a word breaker…
  – We used Microsoft Web N-Gram Services
  – Breaks #crosstownshootout into ‘cross town shootout’ and #basketballwivesla into ‘basketball wives la’
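To illustrate what a word breaker does, here is a minimal sketch of dictionary-based segmentation using dynamic programming over unigram log-probabilities. It is not the Microsoft Web N-Gram Services API: the `VOCAB` table, its counts, and the `segment` function are hypothetical stand-ins for a much larger n-gram model.

```python
import math
from functools import lru_cache
from typing import List

# Hypothetical toy vocabulary with made-up counts; a real word breaker
# would score candidates with a large web-scale n-gram model instead.
VOCAB = {
    "cross": 50, "town": 60, "shootout": 10, "shoot": 30, "out": 90,
    "basketball": 40, "basket": 20, "ball": 45, "wives": 15, "la": 25,
}
TOTAL = sum(VOCAB.values())


def segment(hashtag: str) -> List[str]:
    """Split a hashtag into the most probable sequence of known words."""
    text = hashtag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(start: int):
        # Returns (log-probability, words) for the best split of text[start:],
        # or (None, ()) when no split into known words exists.
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in VOCAB:
                tail_score, tail_words = best(end)
                if tail_score is not None:
                    score = math.log(VOCAB[word] / TOTAL) + tail_score
                    candidates.append((score, (word,) + tail_words))
        if not candidates:
            return None, ()
        return max(candidates)

    score, words = best(0)
    # Fall back to the unsplit text when the vocabulary cannot cover it.
    return list(words) if score is not None else [text]


print(segment("#crosstownshootout"))
print(segment("#basketballwivesla"))
```

With this toy vocabulary the two hashtags from the slide split into 'cross town shootout' and 'basketball wives la'. The separated words are what later get used as query terms.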
Data Preparation – Wikipedia Data
Created a Lucene index using the Wikipedia Extraction (WEX) data
Wikipedia article viewing popularity statistics
  – Dump available for each hour since Dec 2007
  – Published in near real time, for the past hour (on the hour)
  – Huge number of data points!
  – So we sampled 8am/8pm each day
  – Transformed into a daily aggregated time series (therefore comparable with hashtag signals)
  – Smoothed with exponential smoothing (alpha = 0.2)
  – Over 2 billion data points!
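The sampling and smoothing steps can be sketched as follows. The slides give the sampling hours (8am/8pm) and alpha = 0.2, but not how the two samples are combined into a daily value; summing them is an assumption here, as are the function names and the example view counts.

```python
from collections import defaultdict
from datetime import datetime


def daily_series(hourly_views):
    """Collapse (timestamp, views) pairs into one value per day.

    Only the 08:00 and 20:00 samples are kept. Summing them is an
    assumption; the slides do not say how the two are aggregated.
    """
    totals = defaultdict(int)
    for timestamp, views in hourly_views:
        if timestamp.hour in (8, 20):
            totals[timestamp.date()] += views
    return [totals[day] for day in sorted(totals)]


def exponential_smoothing(values, alpha=0.2):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(float(value))  # seed with the first observation
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up hourly page-view samples for a single article.
samples = [
    (datetime(2012, 2, 11, 8), 100),
    (datetime(2012, 2, 11, 13), 999),  # ignored: not an 8am/8pm sample
    (datetime(2012, 2, 11, 20), 300),
    (datetime(2012, 2, 12, 8), 50),
]
print(exponential_smoothing(daily_series(samples)))
```

A small alpha such as 0.2 weights history heavily, so one-off spikes are damped while sustained rises still show through.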
Approach Outline
1. For each hashtag from the logs, use the word breaker service to extract the hashtag terms.
2. Use the separated terms to query the Wikipedia index – this maps each hashtag to a set of possibly associated articles.
3. For each article/hashtag pair, prepare same-length, comparable time series of popularity:
   1. Frequency of the hashtag over time
   2. Popularity of the article over time
The Pearson correlation coefficient is then computed for each pair.
  – Measures the association between the temporality of the hashtag occurrence and the Wikipedia article popularity.
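The correlation step can be sketched in a few lines. The Pearson formula is standard, but `best_article`, the example series, and the treatment of flat series as zero correlation are illustrative assumptions, not details taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a flat series; treating it as
        # "no association" is an assumption made for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_article(hashtag_series, candidate_articles):
    """Return (title, r) for the candidate article that correlates best."""
    scored = [
        (pearson(hashtag_series, series), title)
        for title, series in candidate_articles.items()
    ]
    r, title = max(scored)
    return title, r


# Made-up daily series: hashtag frequency and article page views.
ripwhitney = [0, 1, 9, 4, 1]
candidates = {
    "Whitney Houston": [10, 12, 95, 40, 15],
    "Euro 2012": [30, 28, 29, 31, 30],
}
print(best_article(ripwhitney, candidates))
```

A hashtag whose candidate articles all correlate weakly would be a natural one to discard as a meme; the cut-off for "weakly" is not given in the slides and would need tuning.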
Example Correlations
Constructing a Timeline
Conclusions
Early work, but correlating the signals does yield high-profile temporal events
  – Hashtags can therefore be used to anchor events on a timeline
Occasional spurious correlations (need better hashtag frequency data to improve this)
  – Correlation does not imply causation!
Future work…
  – Automatic construction of timelines
  – Improving correlation quality – examine time windows
  – Designing an evaluation framework to assess overall timeline quality