Georgiana Ifrim, Bichen Shi, Igor Brigadir Insight Centre for Data Analytics University College Dublin Event Detection in Twitter using Aggressive Filtering.

Presentation transcript:

Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering
Georgiana Ifrim, Bichen Shi, Igor Brigadir
Insight Centre for Data Analytics, University College Dublin
SNOW Data Challenge, April 8, 2014

Outline
- Background
- Method Proposed
- Method Details
- Results
- Future Work

Background
- Social media outlets (e.g., Twitter) play an increasing role in the cycle of news production
- Journalists use Twitter for news selection and presentation
- Twitter: an endless, real-time, global stream of news
- Large scale and very noisy (redundant, messy content)
- Challenge: extract (close to real-time) newsworthy topics/events/stories from the Twitter stream, in a format usable by news professionals (e.g., topic-timestamp, topic-headline, topic-tags, tweet-ids, photo-urls)

Challenge From this: #Obama #follow #followme #followforfollow #followme #follower #followers #alwaysfollowback #followbackalways #teamfollowback I VOTED !!! #OBAMA #TeaamObama !!!! Om 12u zou de eerste uitslag binnen zijn. Nu nog steeds niks. Dit trek ik niet. Wekker over 4u en we kijken dan wel. #obama #forward My President is Black ★★★★★ ▄▄▄▄▄▄▄▄▄▄ ★★★★★ ▄▄▄▄▄▄▄▄▄▄ ★★★★★ ▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ #Obama2012 #Retweet voted at I love you! #Forward Romney Romney Romney Romney Romney Romney Romney Romney!!!!!!!!!!!!!!!!!!!!! #RomneyRyan2012

Challenge
To this:
- Obama wins Vermont
- Romney wins Kentucky
- Bernie Sanders wins Senate seat in Vermont
- Romney wins Indiana

Method Proposed
1. Aggressive Data Filtering (to remove noise, to scale)
2. Hierarchical Clustering of Tweets + Dendrogram Cutting (to obtain clusters without needing to know the number of clusters a priori)
3. Ranking of Clusters (to favor news-like topics)
4. Extracting Topic-Headlines (usable information)
5. Re-clustering Topic-Headlines (to remove topic fragmentation)
6. Extracting Final Topics (as presented to the user)

Method Details: Software
- Collecting Twitter streams: SNOW Challenge code (based on the Twitter4J API)
- All other development: Python 2.7 + libraries: scipy, numpy, sklearn, nltk, json
- Tweet-NLP: CMUTweetTagger (trained on tweets, entity detection)
- Efficient clustering: fastcluster (C++ library with interfaces to Python and R)

Data Collection
US Presidential Elections 2012
- Collected from tweet ids ( , 23:30 to :51)
- 1,084,200 raw tweets, text (English + non-English, 252 MByte)
Syria, Ukraine, Bitcoin 2014
- Collected from keywords + user ids ( , 17:30 to , 18:15)
- 1,088,593 raw tweets, JSON (English + non-English, 4.3 GByte)
- 943,175 English tweets, JSON (3.8 GByte)
- 943,175 tweets, text (subset of fields extracted from the JSON object, 240 MByte)
- Replace re-tweet text with original tweet text

Data Pre-processing
Tweet filtering
- Clean tweet text. Remove: URLs, user mentions, hashtags, punctuation, digits
- Tokenize the remaining text
- Rebuild tweet by appending: user mentions + hashtags (#) + text tokens
- Remove tweets based on structure (remove if too many # or too few text tokens)
Term filtering
- Keep only bi-grams + tri-grams occurring in at least a percentage of tweets in the time window (e.g., min(10, n_tweets_in_window * ))
Tweet-term matrix (binary)
- Remove out-of-vocabulary tweets and very short tweets (with fewer than 5 tokens)
- Retains about 20% of the original raw tweet stream (in each time window)
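The filtering steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the regular expressions, the function names, and the limits `max_tags` and `min_tokens` are assumptions made for the example, and the document-frequency cutoff is passed in as a plain `min_df` count because the slide does not give the exact fraction.

```python
import re
from collections import Counter

# Hypothetical patterns; the slides do not say how each element is detected.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # strips punctuation and digits


def clean_tweet(text, max_tags=2, min_tokens=4):
    """Return the rebuilt token list, or None if the tweet is filtered out.

    max_tags and min_tokens are illustrative values, not taken from the slides.
    """
    text = URL_RE.sub(" ", text.lower())
    mentions = MENTION_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    text = HASHTAG_RE.sub(" ", MENTION_RE.sub(" ", text))
    tokens = NON_ALPHA_RE.sub(" ", text).split()
    # Structure-based filter: too many mentions/hashtags or too little text.
    if len(mentions) > max_tags or len(hashtags) > max_tags:
        return None
    if len(tokens) < min_tokens:
        return None
    # Rebuild the tweet: user mentions + hashtags + text tokens.
    return mentions + hashtags + tokens


def ngrams(tokens, sizes=(2, 3)):
    """Set of bi-grams and tri-grams of a token list (binary presence)."""
    return {
        " ".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    }


def build_vocabulary(tweets, min_df):
    """Keep only n-grams that occur in at least min_df tweets of the window."""
    df = Counter()
    for tokens in tweets:
        df.update(ngrams(tokens))
    return {term for term, count in df.items() if count >= min_df}
```

Tweets whose n-grams all fall outside the resulting vocabulary would then be dropped before the binary tweet-term matrix is built.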

Hierarchical Tweet Clustering
Computing tweet pairwise distance
- Scale and normalize the tweet-term matrix
- Cosine as distance metric (Euclidean gives similar results); sklearn + scipy
Computing hierarchical clustering => dendrogram
- fastcluster C++ library (interface to R and Python)
Dendrogram cutting
- Cut at 0.5 distance threshold (better libraries based on the topology of the dendrogram are available in R, e.g., Dynamic Tree Cut: only specify the minimum number of examples in each cluster)
- One cluster = one potential topic
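A minimal sketch of the clustering and dendrogram-cutting step, using scipy rather than fastcluster so that it runs without extra dependencies (fastcluster documents its `linkage` function as a drop-in replacement for scipy's). The slide does not name the linkage criterion, so `method="average"` is an assumption, as is the function name `cluster_tweets`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_tweets(X, threshold=0.5, method="average"):
    """Cluster rows of a binary tweet-term matrix and cut the dendrogram.

    X: 2-D array, one row per tweet; every row must be non-zero, which the
       out-of-vocabulary filtering is meant to guarantee.
    threshold: cosine distance at which the dendrogram is cut (0.5 on the slide).
    method: linkage criterion; "average" is an assumption for this sketch.
    """
    X = np.asarray(X, dtype=float)
    # L2-normalize rows; redundant for cosine, but mirrors the slide's
    # "scale and normalize" step.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = linkage(X, method=method, metric="cosine")
    # One flat label per tweet; tweets sharing a label form a candidate topic.
    return fcluster(Z, t=threshold, criterion="distance")
```

Cutting by distance means the number of clusters does not have to be chosen in advance; it follows from the threshold.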

Hierarchical Tweet Clustering: Ranking clusters
- Retain only clusters with at least 10 tweets (size constraint)
- Score each cluster:
  - Compute cluster centroid (vector of terms)
  - Get maximum term score (over all centroid terms)
  - Term score: entity_score * burstiness_score
  - Assign the highest term score as cluster score
  - Normalize cluster score by cluster size
- Entity score = 2.5 (identify entity terms with CMUTweetTagger)
- Burstiness score = df-idf_t, with t=4 (prior work on BNgram)
- Interesting extensions to cluster score: article_score, tweet_importance based on trustworthiness or clout of users issuing the tweet
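The scoring rule can be illustrated with a short sketch. The slide names df-idf_t but does not spell it out; the formula below, (df_now + 1) / (log(mean of the previous t window frequencies + 1) + 1), is the form usually given in the BNgram work and should be treated as an assumption here. The data layout (each tweet as a set of terms, `df_now` and `df_past` dictionaries, an `entities` set) and the function names are likewise hypothetical.

```python
import math


def df_idf(df_now, df_history):
    """Burstiness of a term: current document frequency, discounted by its
    average frequency over the previous len(df_history) time windows.
    Assumed form of df-idf_t; not confirmed by the slides."""
    mean_past = sum(df_history) / len(df_history)
    return (df_now + 1) / (math.log(mean_past + 1) + 1)


def cluster_score(tweets, df_now, df_past, entities, t=4, entity_boost=2.5):
    """Score one cluster.

    tweets:   list of sets of terms, one set per tweet in the cluster
    df_now:   term -> document frequency in the current window
    df_past:  term -> list of document frequencies in the previous t windows
    entities: set of terms tagged as entities (e.g. by CMUTweetTagger)
    """
    # With a binary matrix, the centroid's non-zero terms are the union
    # of the terms of the cluster's tweets.
    centroid_terms = set().union(*tweets)
    best = max(
        (entity_boost if term in entities else 1.0)
        * df_idf(df_now.get(term, 0), df_past.get(term, [0] * t))
        for term in centroid_terms
    )
    # Normalize the highest term score by cluster size.
    return best / len(tweets)
```

Usage: clusters with fewer than 10 tweets would be discarded first, and the remaining ones sorted by `cluster_score` in descending order.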

Hierarchical Tweet (Re)Clustering
Selecting topic-headlines
- Take top-20 ranked clusters as potential topics
- Select first (time-wise) tweet in each cluster as topic-headline
Re-cluster headlines
- Hierarchical clustering of headlines
- Score headline-clusters using the max-score headline
- Rank headline-clusters, take top-10
Final topics
- Select first (published) headline in each cluster, present raw tweet (less URL) to user
- Gather all distinct keywords of headlines in the headline-cluster to create topic-tags
- Tweet ids for topic: the ids of the corresponding headlines. If headlines do not cluster, only one tweet id
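A rough sketch of headline selection and final topic assembly, leaving out the intermediate re-clustering of headlines. The dictionary keys (`score`, `tweets`, `time`, `text`, `keywords`, `id`) and the function names are hypothetical, chosen only to make the example self-contained.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # illustrative URL pattern


def select_headlines(clusters, top_k=20):
    """Take the top_k clusters by score and return the earliest tweet of each,
    annotated with its cluster's score."""
    ranked = sorted(clusters, key=lambda c: c["score"], reverse=True)[:top_k]
    headlines = []
    for cluster in ranked:
        first = min(cluster["tweets"], key=lambda tw: tw["time"])
        headlines.append(dict(first, score=cluster["score"]))
    return headlines


def build_topic(headline_cluster):
    """Turn one cluster of headlines into a final topic record."""
    first = min(headline_cluster, key=lambda h: h["time"])
    return {
        # Raw text of the first published headline, with URLs removed.
        "headline": URL_RE.sub("", first["text"]).strip(),
        # Distinct keywords of all headlines in the cluster become the tags.
        "tags": sorted({kw for h in headline_cluster for kw in h["keywords"]}),
        "tweet_ids": [h["id"] for h in headline_cluster],
        # A headline-cluster is scored by its best headline.
        "score": max(h["score"] for h in headline_cluster),
    }
```

A headline that does not cluster with any other forms a single-element list, so its topic carries exactly one tweet id.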

Results
Top-10 topics, first time window in US stream ( :00 – 00:10)
1. WASHINGTON (AP) - Obama wins Vermont; Romney wins Kentucky. #Election
2. Not a shocker NBC reporting #Romney wins Indiana & Kentucky #Obama wins Vermont
3. Sky News projection: Romney wins Kentucky. #election
4. AP RACE CALL: Democrat Peter Shumlin wins governor race in Vermont. #Election
5. CNN Virginia exit poll: Obama 49%, Romney 49% #election
6. Mitt Romney Losing in Massachusetts a state that he governed. Why vote for him when his own people don't want him? #Obama
7. Twitter is gonna be live and popping when Obama wins! #Obama
8. INDIANA RESULTS: Romney projected winner #election
9. If Obama wins I'm going to celebrate... If Romney wins I'm going to watch Sesame Street one last time #Obama
10. #election2012 important that Romney won INdependents in Virginia by 11 pts. With parties about even, winning Inds is key

Results
Top-10 topics, first time window in Syria stream ( :00 – 18:15)
1. The new, full Godzilla trailer has roared online
2. At half-time Borussia Dortmund lead Zenit St Petersburg
3. Ukraine Currency Hits Record Low Amid Uncertainty: Ukrainian currency, the hryvnia, hits all-time low against
4. Ooh, my back! Why workers' aches pains are hurting the UK economy
5. Uganda: how campaigners are preparing to counter the anti-gay bill
6. JPost photographer snaps what must be the most inadvertantly hilarious political picture of the decade
7. Fans gather outside Ghostbusters firehouse in N.Y.C. to pay tribute to Harold Ramis
8. Man survives a shooting because the Bible in his top pocket stopped two bullets
9. Ukraine's toppling craze reaches even legendary Russian commander, who fought Napoleon
10. Newcastle City Hall. Impressive booking first from bottom on the left...

Discussion
Parameter choices
- Filtering parameters depend on window size (number of tweets in window)
Unigrams vs N-grams (N>1)
- Bi-grams + N-grams good for content + scalability
Cluster ranking
- (Normalized) df-idf_t seems a good choice, but the cluster score may benefit from using tweet importance (based on user importance)
Topic precision (~80%, based on googling topic-headlines)
- On average about 8-9 out of 10 headlines are published news
Efficiency
- System takes about 0.5 min per 15 min slot (scales well for larger time slots)

Conclusion
- Encouraging results in using the Twitter stream as a news aggregator (truly global)
- Both sides now: media outlets (CNN, BBC, Reuters, AP) and regular people post updates on (breaking) stories
- We need a good topic benchmark to refine techniques (e.g., a comprehensive set of ground truth topics)

Future Work
Improve retrieval of newsworthy stories
- E.g., ‘This is what happens when you put two pit bulls in a photo booth’ vs ‘Ukraine currency hits record low amid uncertainty’
- May depend on the type of stories we are after (BBC vs Sun)
- Tweet/user importance filtering may help
- News streamed in the same time frame may help (vocabulary selection)
Fragmentation due to breaking news stories
- Same story discussed from different angles:
  - Lee Rigby murders: Michael Adebolajo given whole-life jail term
  - Lee Rigby murder sentence expected shortly. Pictured: the scene outside the Old Bailey in London
  - Judge Mr Justice Sweeney says behaviour of Lee #Rigby's killers was "sickening and pitiless"
- Combination of tweet and term clustering may help (e.g., cluster headlines in term rather than tweet space)

Thank You! Open source code:
