Alvin CHAN, Kay CHEUNG, Alex YING. Relationship between Twitter Events and Real-life
Introduction: Twitter is your window to the world. Every day… 500 million users, 500 million tweets.
Real-life Twitter
Introduction. User Engagement: platform to share real-world events; little understanding of people's engagement in real-world events. Event Detection: primary source of news content; hard to spot useful information among so many tweets.
Introduction. User Engagement: predict the (i) presence and (ii) degree of the user’s engagement; 643 real-world events. Event Detection: aggressive data preprocessing; hierarchical clustering of tweets; time-dependent n-gram; cluster ranking; headlines re-clustering.
Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering
Methodology – Data Source 1. US Presidential Elections in Nov 2012, 23: Nov 2012, 06:30 2. Ukraine, Syria and the Bitcoin in Feb 2014, 17:30 – 26 Feb 2014, 18:15.
Methodology – Data Pre-processing and filtering. Aggressive filtering: remove tweets; remove vocabulary.
Methodology – Data Pre-processing and filtering. Removal of URLs, user mentions, hashtags, digits and punctuation; tokenization by white space; removal of stop words.
Methodology – Data Pre-processing and filtering. Structure-based filtering: remove tweets with > 2 user mentions, > 2 hashtags, or < 4 text tokens.
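The normalization and structure-based filtering steps can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names, the abbreviated stop-word list, the lowercasing step, and the choice to count text tokens after stop-word removal are all assumptions made for illustration.

```python
import re

# Hypothetical, abbreviated stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "for", "on"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # digits, punctuation, other symbols


def tokenize(tweet):
    """Strip URLs, mentions, hashtags, digits and punctuation, then split on whitespace."""
    text = tweet.lower()  # lowercasing is an assumption, not stated on the slides
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, NON_LETTER_RE):
        text = pattern.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def keep_tweet(tweet, max_mentions=2, max_hashtags=2, min_tokens=4):
    """Structure-based filter: drop tweets with too many mentions or hashtags, or too few tokens."""
    if len(MENTION_RE.findall(tweet)) > max_mentions:
        return False
    if len(HASHTAG_RE.findall(tweet)) > max_hashtags:
        return False
    return len(tokenize(tweet)) >= min_tokens
```

The thresholds are keyword arguments so that the tweet-length cutoff can be varied, as in the results analysis.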
Methodology – Data Pre-processing and filtering. Vocabulary filtering: bi-grams and tri-grams.
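One way to picture a bi-gram/tri-gram vocabulary and the resulting tweet-term matrix is the sketch below. It is illustrative only: the helper names, the binary matrix entries, and the `min_df` document-frequency cutoff are hypothetical and are not taken from the slides.

```python
from collections import Counter


def ngrams(tokens, sizes=(2, 3)):
    """Return the bi-grams and tri-grams of a token list as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(tokenized_tweets, min_df=2):
    """Keep n-grams that occur in at least min_df tweets (min_df is a hypothetical cutoff)."""
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(ngrams(tokens)))  # count each n-gram once per tweet
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


def tweet_term_matrix(tokenized_tweets, vocabulary):
    """Binary tweet-term matrix: rows are tweets, columns are vocabulary terms."""
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized_tweets:
        row = [0] * len(vocabulary)
        for term in ngrams(tokens):
            if term in index:
                row[index[term]] = 1
        matrix.append(row)
    return matrix
```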
Hierarchical clustering – Step 1. Computing hierarchical clustering with the fastcluster library in Python: scale and normalize the tweet-term matrix; compute tweet pairwise distances.
Hierarchical clustering – Step 2. Cutting the dendrogram at a 0.5 distance threshold. Higher threshold: different topics in the same cluster. Lower threshold: same topic in lots of different clusters, i.e. topic fragmentation.
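The effect of the distance threshold can be illustrated with a naive pure-Python agglomerative clustering. This sketch does not use fastcluster and is far slower than it; cosine distance and average linkage are assumptions, since the slides name only the library and the 0.5 cut. With average linkage, stopping the merges once the closest pair of clusters is farther apart than the threshold gives the same flat clusters as cutting the dendrogram at that height.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; an all-zero row is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cluster_tweets(rows, threshold=0.5):
    """Naive average-linkage clustering of tweet-term rows, stopped at `threshold`."""
    dist = [[cosine_distance(u, v) for v in rows] for u in rows]
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # every remaining pair is farther apart than the cut
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Raising `threshold` lets unrelated tweets merge into one cluster; lowering it leaves near-duplicate tweets in separate clusters, which is the fragmentation trade-off described on the slide.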
Hierarchical clustering – Step 3
Hierarchical clustering – Step 4. Selecting topic headlines by the clusters’ size; re-clustering headlines to avoid topic fragmentation; for each selected topic, select the headline with the earliest publication time.
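A minimal sketch of the headline selection follows, assuming each tweet is stored as a (timestamp, text) pair and each cluster as a list of tweet indices. The function name and the `top_k` parameter are hypothetical, and the headline re-clustering step is left out for brevity.

```python
def select_headlines(clusters, tweets, top_k=2):
    """Pick the top_k largest clusters and return each one's earliest tweet text.

    clusters: list of lists of tweet indices
    tweets:   list of (timestamp, text) pairs; timestamps only need to be comparable
    """
    ranked = sorted(clusters, key=len, reverse=True)[:top_k]
    headlines = []
    for members in ranked:
        earliest = min(members, key=lambda i: tweets[i][0])
        headlines.append(tweets[earliest][1])
    return headlines
```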
Results Analysis: Tweet Length and Structure. Tweet length at least: 5 | 3. Tweet-term matrix: 3,258 | 3,777. Terms: 588.
Results Analysis: Unigrams vs Bi-grams/Tri-grams. Vocabulary: bi-grams and tri-grams | uni-grams. Tweet-term matrix terms: 588 | 482.
Results Analysis: Topic Precision (Stream 1). Accuracy: 100%
Ground Truth | Detected Topic Headline
Obama wins Vermont | WASHINGTON (AP) - Obama wins Vermont; Romney wins Kentucky. #Election2012
Romney wins Indiana | Not a shocker NBC reporting #Romney wins Indiana & Kentucky #Obama wins Vermont
Romney wins Kentucky | Sky News projection: Romney wins Kentucky. #election2012
Results Analysis: Topic Precision (Stream 2). Googled the first 100 detected topics; 80% of detected topics are published as news.
Implications. Advantages: simplicity and efficiency, runs in less than an hour; strong filtering of tweets and terms seems to lead to efficient and clean results, overcoming the heavy noise aspect of Twitter content. Limitation: topic fragmentation, where topics get repeated across several clusters.
Predicting User Engagement on Twitter with Real-World Events
5 questions to address. Does a person post tweets about an event because they are interested in the topic pertaining to that event? Are they instead engaged because their friends are also posting tweets about it? Perhaps they are just a very active user of Twitter? Is their engagement a reflection of the fact that this is a local event? How and to what extent do the different topics of events affect the degree of a user’s engagement?
Dataset. Events: 2.7 billion English tweets, processed with an automated event detection algorithm; 7,468 real-world event clusters; annotators read sample tweets from each event cluster; geolocations inferred for 643 event clusters. Twitter users: selected based on the 643 events; location predicted by a location inference algorithm; all tweets posted by each user in the most recent 6 months preceding their first engagement with any of the 643 events.
The Statistical Model - Dependent Variables. Presence of engagement: existence of at least one tweet that references a particular event on Twitter; binary measure (1: engaged; 0: not engaged). Degree of engagement: number of tweets that a user posts regarding the event; continuous measure.
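The two dependent variables can be computed from a user's tweets as in the sketch below. The input representation (one event identifier, or `None`, per tweet) and the function name are hypothetical, chosen only to make the definitions concrete.

```python
def engagement_measures(tweet_event_ids, event_id):
    """Return (presence, degree) of a user's engagement with one event.

    tweet_event_ids: for each of the user's tweets, the event it references, or None
    """
    degree = sum(1 for e in tweet_event_ids if e == event_id)  # number of tweets on the event
    presence = 1 if degree >= 1 else 0                         # 1: engaged; 0: not engaged
    return presence, degree
```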
The Statistical Model - Predictor Variables. 17 variables of 5 major types: Twitter activities; tweet content; Twitter user types; geolocation; social network structure.
Result – Prediction of Presence Standardize the measures ➜ logistic regression ➜ Predict user’s engagement
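The standardize-then-logistic-regression pipeline can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the study's model: the fitting method (batch gradient descent), the learning rate, the epoch count, and all function names are choices made here for a self-contained example.

```python
import math


def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant columns
    return [(x - mean) / std for x in column]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_presence(w, b, x):
    """Return 1 (engaged) if the modelled probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0
```

Because the predictors are standardized first, the fitted weights are on a common scale, which makes it easier to compare how strongly each variable is associated with engagement.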
Prediction of Presence. Twitter activity: total tweets posted by a user prior to her event engagement ✔; lower directed / higher broadcast communication ✔; ratio of hashtags used ✔; ratio of retweets ✔. Tweet content: topical interest ✘. Twitter user type: Informer ✔; Meformer ✘.
Prediction of Presence. Geolocation ✘. Social network: number of news friends ✔; number of friends/followers and neighbors ✘.
Result – Prediction of Degree. Linear regression; participation levels in past ➜ participation levels in final; most significant predictor. Number of posts from the user’s friends prior to the user’s engagement ✔; user’s network size ✘.
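For reference, the closed-form least-squares fit behind a linear regression with a single predictor looks like the sketch below. It is a simplification: the study regresses on many predictor variables at once, and the function name here is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (single predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # assumes x is not constant
    return slope, mean_y - slope * mean_x
```

Here `x` could be, for example, the number of posts from a user's friends before the user's engagement, and `y` the user's degree of engagement.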
Prediction of Degree w.r.t. Different Topics. Linear regression again; allow only 1 label for a given event.
Prediction of Degree w.r.t. Different Topics. Topical interest: Politics, Business and Sports events ✔✔; Entertainment ✔. Following news friends vs. friends: news friends ➜ Politics & Business, Technology & Science, Sports, Entertainment; friends ➜ Local, Odd. Geolocation: Sports, Local ✔.
Answers to the 5 questions. Does a person post tweets about an event because they are interested in the topic pertaining to that event? Yes: the correlation between tweet content and the user’s engagement is more significant for events in specific topics.
Answers to the 5 questions Are they instead engaged because their friends are also posting tweets about it? Yes, conditioned on the type of event (local events and odd news)
Answers to the 5 questions Perhaps they are just a very active user of Twitter? Yes, more active users are more likely to be interested and engaged in a new event
Answers to the 5 questions Is their engagement a reflection of the fact that this is a local event? Depends on the kind of event, yes for Sports and Local events
Answers to the 5 questions. How and to what extent do the different topics of events affect the degree of a user’s engagement? Politics & Business, Technology & Science, and Sports events depend more on the content of past tweets; Local and Odd events depend more on the user’s social network.
Implications. Limitations: each event is allotted to a single category only; did not consider people’s personality; did not consider that there exist different kinds of target users.
Conclusion. User Engagement: users’ prior activities and social network structure are good predictors for presence and degree; content of tweets and geographic proximity provide additional predictive power. Event Detection: many topics are published as news; users can trace the news back to its original tweet.
THANK YOU