Hashtags as Milestones in Time Identifying the hashtags for meaningful events using Twitter search logs and Wikipedia data Stewart Whiting University of.

Presentation transcript:

Hashtags as Milestones in Time
Identifying the hashtags for meaningful events using Twitter search logs and Wikipedia data
Stewart Whiting, University of Glasgow
Omar Alonso, Microsoft/Bing
Time Aware Information Access Workshop, SIGIR, Oregon
(Work done while on internship at Microsoft)

Alright… Outline
1. Hashtags as milestones in time
2. Introduction
   1. Why milestones?
   2. Why hashtags? Can they be useful as milestones?
3. Motivation
4. Approach
   1. Data preparation
   2. Approach steps
5. Constructing a timeline – examples
6. Preliminary conclusions

Abstract: Hashtags as milestones in time
What we want to do:
Identify event-based hashtags for timeline creation
– Currently using historic/past data
Filter out junk
Find the most temporally significant hashtags
– Use multiple signals: Twitter search logs + related Wikipedia article popularity
We are not doing topic detection/tracking!
Why?
A good way to express (anchor) a topic on a timeline…
Help users make sense of/navigate temporal information
#what?

Introduction
Hashtags are used by authors to explicitly denote the relevant topic(s) in a message
– “Great passing, great game #euro2012”
Used by authors and searchers
– To broadcast and consume a specific topic
– Especially useful in short-text retrieval, where bag-of-words/language modelling approaches are challenging
Reflect mainstream events (or memes!) in real time
– See trending topics right now
Timelines are very good for displaying events
– But you need to express the events as a meaningful marker, or milestone!

Introduction to the data
Two crowds of people
– Authors/searchers on Twitter
– Editors/browsers on Wikipedia
Correlation between signals from the two crowds
– People search for what is happening
– People edit Wikipedia with what is happening
– Two very distinctive signals!

Twitter hashtag signals (in search logs)
But plenty of memes too…
– #20PeopleWhoIWantToMeet
– #PresentingInTheBatCave
– #whiteppldoitbutblackppldont

Wikipedia signals
Whitney Houston
TV appearances
Her death in February 2012
Events were reflected by discussion with hashtags on Twitter, e.g.
– #ripwhitney
– #bgtwhitney (BGT = Britain’s Got Talent)

Motivation
Both signals have large coverage
– Celebrities, news, weather, people, science, movies etc.
Two robust signals coming from large crowds
– Difficult for individuals to influence (spam?)
– Not so reliant on single-signal analysis (e.g. wavelets or burst detection)
Discard memes by looking for associated Wikipedia articles.
Meaningful milestones in timelines provide strong features to navigate temporal content
– Alonso et al. (2010), Matthews et al. (2010), From et al. (2003)

Data Preparation – Hashtag Data
Extracted from Bing Social and IE8 query logs
Provides hashtag use, aggregated per day
(Proprietary, but could be extracted from other sources)
Hashtags are mostly a mix of unigrams and bigrams!
We also want the words in the hashtag
Need to use a word breaker…
– We used Microsoft Web N-Gram Services
– Breaks #crosstownshootout into ‘cross town shootout’ and #basketballwivesla into ‘basketball wives la’
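The word-breaking step can be illustrated with a minimal sketch. It is not the Microsoft Web N-Gram Services API: it assumes a tiny, made-up unigram count table (`UNIGRAM_COUNTS`) and a hypothetical helper `break_hashtag` that picks the most probable segmentation by dynamic programming. A real word breaker would rely on web-scale n-gram statistics.

```python
import math

# Hypothetical toy unigram counts, for illustration only.
UNIGRAM_COUNTS = {
    "cross": 50, "town": 80, "shootout": 10, "basketball": 40,
    "wives": 20, "la": 30, "euro": 25, "rip": 15, "whitney": 12,
}
TOTAL = sum(UNIGRAM_COUNTS.values())


def word_log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        # Assumption: unknown strings are penalised more the longer they are.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def break_hashtag(hashtag, max_word_len=20):
    """Split a hashtag into its most probable sequence of words."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            score = best[start][0] + word_log_prob(word)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1]


print(break_hashtag("#crosstownshootout"))
print(break_hashtag("#basketballwivesla"))
```

The resulting terms are what would then be used to query the Wikipedia index.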

Data Preparation – Wikipedia Data
Created a Lucene index using the Wikipedia Extraction (WEX) data.
Wikipedia article viewing popularity statistics
– Dump available for each hour since Dec 2007
– Published near real-time, for the past hour (on the hour)
– Huge number of data points!
– So we sampled 8am/8pm each day
– Transformed into a daily aggregated time-series (therefore comparable with hashtag signals)
– Smoothed with exponential smoothing (alpha = 0.2)
– Over 2 billion data points!
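As a rough sketch of the aggregation and smoothing described above: the slides do not say how the two daily samples are combined, so summing them is an assumption, and the function names `daily_totals` and `exponential_smoothing` are hypothetical. Only the value alpha = 0.2 comes from the slides.

```python
from collections import defaultdict


def daily_totals(samples):
    """Aggregate (day, views) samples into one value per day.

    Assumption: the samples for a day (e.g. 8am and 8pm) are summed.
    Returns a list of (day, total) pairs sorted by day.
    """
    totals = defaultdict(int)
    for day, views in samples:
        totals[day] += views
    return sorted(totals.items())


def exponential_smoothing(series, alpha=0.2):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    for value in series:
        if not smoothed:
            smoothed.append(float(value))  # seed with the first observation
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up page-view samples for one article.
samples = [("2012-02-10", 4), ("2012-02-10", 6),
           ("2012-02-11", 12), ("2012-02-11", 8),
           ("2012-02-12", 20), ("2012-02-12", 10)]
days, views = zip(*daily_totals(samples))
print(days)
print(exponential_smoothing(views, alpha=0.2))
```

A small alpha such as 0.2 weights history heavily, so single-day spikes are damped rather than removed.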

Approach Outline
1. For each hashtag from the logs, use the word breaker service to extract the hashtag terms.
2. Use the separated terms to query the Wikipedia index – this maps each hashtag to a set of possibly associated articles.
3. For each article/hashtag pair, prepare same-length, comparable time-series of popularity
   1. Frequency of hashtag over time
   2. Popularity of article over time
The Pearson correlation coefficient is then computed for each pair.
– Measures the association between the temporality of the hashtag occurrence and the Wikipedia article popularity.
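The correlation step can be sketched as follows, assuming both series are already aligned to the same days. The function names and the `min_r` cut-off are hypothetical; the slides do not state a threshold. A hashtag with no candidate articles (a likely meme) simply yields an empty result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a flat series as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def rank_articles(hashtag_series, article_series_by_title, min_r=0.5):
    """Return (title, r) pairs with r >= min_r, strongest correlation first."""
    scored = [(title, pearson(hashtag_series, series))
              for title, series in article_series_by_title.items()]
    kept = [(title, r) for title, r in scored if r >= min_r]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up daily series: hashtag frequency and candidate article popularity.
hashtag = [1, 2, 10, 3, 1]
candidates = {"Article A": [5, 6, 20, 8, 5], "Article B": [9, 9, 9, 9, 9]}
print(rank_articles(hashtag, candidates))
```

Because Pearson correlation only measures linear association over the chosen window, spurious matches remain possible.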

Example Correlations

Constructing a Timeline

Conclusions
Early work, but correlating the signals does yield high-profile temporal events
– Hashtags can therefore be used to anchor events on a timeline
Occasional spurious correlation (need better hashtag frequency data to improve this)
– Correlation does not imply causation!
Future work…
– Automatic construction of timelines
– Improving correlation quality – examine time windows
– Designing an evaluation framework to assess overall timeline quality