Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions
Kate Starbird, University of Colorado Boulder, ATLAS Institute
Leysia Palen, University of Colorado Boulder, Computer Science
Grace Muzny, University of Washington, Computer Science
Social Media & Mass Disruption Events
Sociologists of disaster: After a disaster event, people will converge on the scene to, among other things, offer help
Spontaneous volunteers
Mass Disruption = Mass Convergence
Opportunities for Digital Convergence: Citizen Reporting
Challenges of Digital Convergence: Volume, Noise, Misinformation & Disinformation
Crowd Work!
Signal
Noise?
Starbird, K., Palen, L., Hughes, A. L., & Vieweg, S. (2010). Chatter on The Red: What Hazards Threat Reveals about the Social Life of Microblogged Information. CSCW 2010.
Signal
‣ Original Information: first hand info; new info coming in to the space for the first time
‣ Derivative Behavior: re-sourced info, reposts, links/URLs, network connections
Signal
‣ Derivative Behavior can be used to find Original Information
‣ Following the crowd's recommendations: Collaborative Filtering = Crowd Work
Learning from the Crowd: A Collaborative Filter for Identifying Locals
‣ Background
  Why Identify Locals?
  Empirical Study on Crowd Work during Egypt Protests
‣ Test Machine Learning Solution for Identifying Locals
  Event: Occupy Wall Street in NYC
  Data Collection & Analysis
  Findings
‣ Discussion
  Leveraging Crowd Work
  From Empirical Work to Computational Solutions
Why Help Identify Locals? ‣ Citizen Reporting: first hand info can contribute to situational awareness ‣ Info not already in the larger information space ‣ Digital volunteers often work to identify and create lists of on-the-ground Twitterers
Why Help Identify Locals?
‣ Crisis events vs. protest events
‣ Tunisia Protests: activists tweeting from the ground were a valuable source of info for journalists (Lotan et al., 2011)
‣ Egypt Protests: protestors on the ground were actively fostering solidarity from the remote crowd (Starbird and Palen, 2012)
Why Help Identify Locals?
‣ Occupy Wall Street (OWS) Protests: protestors on the ground wanted to publicize their numbers, foster solidarity with the crowd, and solicit help:
  "We could really use a generator down here at Zuccotii Park. Can anyone help? #occupyWallStreet #takewallst #Sept17"
‣ OWS Protests: remote supporters aggregated and published lists of those on the ground:
  "@brettchamberlin #TakeWallStreet #OurWallStreet #OccupyWallStreet We have 20 livetweeters for this list. Are there #takewallstreet #OurWallStreet #needsoftheoccupiers"
Empirical Study of Crowd Work during Political Protests
Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Collected #egypt #jan25 tweets: 2,229,129 tweets from 338,895 Twitterers
‣ Identified most-RTed Twitterers
‣ Determined location for a sample
‣ Crowd may work to identify on-the-ground Twitterers
‣ Identified several recommendation and user behavior features that had significant relationships to being "on the ground":
  ‣ More times retweeted = more likely to be on the ground
  ‣ More unique retweets = more likely to be on the ground
  ‣ More followers at beginning of event = less likely to be on the ground
(Some of these features are not available in tweet metadata; they were identified through qualitative analysis, then calculated and evaluated through quantitative analysis.)
Goal: Test Viability of a Machine Learning Solution to Identify Locals using Crowd Recommendation Behavior
‣ Moving from Empirical Work to a Computational Solution
Event: Occupy Wall Street Protests September 15-21, 2011 NYC site - Zuccotti Park
Data Collection and Sampling
‣ 270,508 Tweets: Search API, Streaming API

  Search Terms                     Search API Window                  Streaming API Window
  #occupywallstreet #dayofrage     Sept 15 1pm - Sept 17 11am         Sept 17 11am - Sept 20 6:45pm
  #takewallstreet #sep17 #sept17   Sept 15 1pm - Sept 17 1:45pm       Sept 17 1:45pm - Sept 20 6:45pm
  #ourwallstreet                   Sept 18 9:38am - Sept 18 10:05am   Sept 18 10:05am - Sept 20 6:45pm

‣ 53,296 Total Twitterers
‣ 23,847 Twitterers sent >= 2 tweets, allowing us to capture profile change
‣ Tweets from Streaming API contain Twitter profile information
‣ 10% sample of Twitterers, using a tweet-based sampling strategy
Sampling Users from Heavy-Tailed Distributions
‣ Twitterer-based vs. tweet-based sampling strategies
[Figure: distribution of the number of Occupy-related tweets per number of Twitterers, with the two sampling strategies compared against it]
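The tweet-based sampling strategy can be sketched as follows. This is an illustrative stdlib sketch, not the authors' code; the function name, toy data, and 10% fraction mirroring the slide are assumptions. The idea is to draw tweets uniformly at random and keep their authors, so high-volume Twitterers in the heavy tail are more likely to enter the sample than under uniform user sampling.

```python
import random

def tweet_based_sample(user_tweet_counts, fraction=0.10, seed=42):
    """Sample Twitterers with probability proportional to tweet volume."""
    rng = random.Random(seed)
    # one pool entry per tweet, labeled with its author
    pool = [user for user, n in user_tweet_counts.items() for _ in range(n)]
    target = max(1, int(len(user_tweet_counts) * fraction))
    sampled = set()
    while len(sampled) < target:
        sampled.add(rng.choice(pool))
    return sampled

# toy data: 5 heavy tweeters (50 tweets each) and 95 who tweeted once
counts = {"user%d" % i: (50 if i < 5 else 1) for i in range(100)}
sample = tweet_based_sample(counts, fraction=0.10)  # 10 distinct Twitterers
```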
Location Coding
‣ 2385 Twitterers
‣ Identify those tweeting from "the ground" in NYC OWS protests

  Location                                            Total # of Twitterers
  Total                                               2385
  Ground & Tweeting Ground Info (Group A)             106 (4.5%)
  Not Ground or Not Tweeting Ground Info (Group B)    2270
  Unknown (Excluded)                                  9
Filtering with a Support Vector Machine
Support Vector Machines (SVMs)
‣ Supervised machine learning algorithm: works with labeled training data to learn how to classify new data
‣ labeled training data → trained SVM; trained SVM + unlabeled data → labeled data
‣ SVMs accept real-valued features and have been shown to work well on high-dimensional, noisy data (Schölkopf, 2004)
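The prediction step of the pipeline above can be illustrated with a linear decision function. This is a generic sketch of how a trained linear SVM labels new data, not the study's implementation; the weights and inputs below are made up.

```python
def svm_predict(w, b, x):
    """Apply a trained linear SVM's decision function (illustrative sketch).

    Training (not shown) chooses weights w and bias b from the labeled
    training data; a new, unlabeled feature vector x is then classified
    by the sign of the decision value w.x + b.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```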
Features
‣ Feature selection is important to prevent over-fitting and to produce an accurate classifier

Recommendation Features:
  Follower growth
  Follower growth as % of initial #
  Follower growth / Friend growth
  Listed growth
  Listed growth as % of initial #
  Times retweeted (log)
  Times RTed (log) / Initial followers (log)
  Times RTed (log) / # of tweets (log)
  # unique tweets RTed / # of tweets

Flat Profile Features:
  Statuses count (log)
  Initial followers count (log)
  Initial friends count (log)
  # of RTs as % of stream
  # of tweets for event (log)
  Description changed during event
  Location changed during event
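A few of the recommendation features above can be computed like this. The dict field names and the example values are assumptions for illustration, not the authors' schema; the log scaling mirrors the "(log)" features on the slide.

```python
import math

def recommendation_features(user):
    """Compute a sample of the recommendation features (sketch)."""
    follower_growth = user["followers_end"] - user["followers_start"]
    friend_growth = user["friends_end"] - user["friends_start"]
    return {
        "follower_growth": follower_growth,
        "follower_growth_pct": follower_growth / max(user["followers_start"], 1),
        "follower_per_friend_growth": follower_growth / max(friend_growth, 1),
        # log scaling tames the heavy-tailed retweet distribution
        "times_retweeted_log": math.log(user["times_retweeted"] + 1),
        "unique_rts_per_tweet": user["unique_tweets_retweeted"] / max(user["n_tweets"], 1),
    }

feats = recommendation_features({
    "followers_start": 100, "followers_end": 150,
    "friends_start": 10, "friends_end": 20,
    "times_retweeted": 40, "unique_tweets_retweeted": 5, "n_tweets": 10,
})
```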
10-Fold Cross Validation
‣ Technique for splitting up labeled data for use in training and validating the classifier
‣ The labeled data is split into 10 folds; each fold serves once as validation data while the remaining folds are training data
‣ Folds are stratified to preserve the class balance:

            Not On-ground   On-ground
  Per fold  227             10 or 11
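Stratified 10-fold splitting can be sketched in a few lines. This is a stdlib illustration under the class sizes from the location coding (2270 vs. 106), not the authors' implementation; with these sizes every fold ends up with 227 majority-class Twitterers and 10 or 11 on-ground Twitterers, matching the table above.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k stratified folds (sketch).

    Each fold keeps roughly the same class proportions as the full data
    set, so every validation fold contains some on-ground Twitterers.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)  # deal indices round-robin per class
    return folds

# class sizes from the location coding: 2270 Not On-ground, 106 On-ground
labels = [0] * 2270 + [1] * 106
folds = stratified_folds(labels, k=10)
```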
A Majority Class Classifier
‣ A naive approach yields a high overall accuracy %
  Not On-ground Tweeters: 99.9% correctly classified
  On-ground Tweeters: 4.6% correctly classified
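The arithmetic behind a pure majority-class baseline is worth making explicit. Using the class sizes from the location coding (an illustration, not a result from the study): labeling everyone "Not On-ground" scores about 95.5% overall accuracy while finding no one on the ground.

```python
# Majority-class baseline (sketch): label every Twitterer "Not On-ground".
not_ground, on_ground = 2270, 106
overall_accuracy = not_ground / (not_ground + on_ground)  # about 0.955
on_ground_recall = 0 / on_ground  # "On-ground" is never predicted
```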
Unbalanced Data Sets
‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification
‣ We want a better separating line: Asymmetric Soft Margins
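Asymmetric soft margins can be sketched as a class-weighted hinge loss: errors on the rare "On-ground" class cost more than errors on the majority class. The weights below mirror the inverse class frequencies (2270 / 106 is roughly 21.4) and are an assumption, not the values used in the study.

```python
def weighted_hinge_loss(margins, labels, c_pos=21.4, c_neg=1.0):
    """Class-weighted hinge loss (sketch of the asymmetric-margin idea)."""
    total = 0.0
    for margin, label in zip(margins, labels):
        cost = c_pos if label == 1 else c_neg  # penalize rare-class errors more
        total += cost * max(0.0, 1.0 - margin)  # hinge loss on the signed margin
    return total
```

In scikit-learn, for example, the same idea is exposed through the `class_weight` parameter of `sklearn.svm.SVC`.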
Findings

            Not On-ground   On-ground   Overall
  Accuracy  77.6%           67.9%       77.2%

‣ The classifier tripled the signal to noise ratio: 4.7% on-ground Twitterers in the original data set vs. 14.2% in the data filtered through the SVM
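The "tripled" claim follows directly from the two percentages reported on the slide:

```python
# signal-to-noise improvement from the reported shares of on-ground Twitterers
before = 0.047  # in the original data set
after = 0.142   # among Twitterers the SVM labels "On-ground"
improvement = after / before  # roughly a threefold increase
```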
Discussion
‣ Tripling the ratio of signal to noise matters in a world of digital volunteers
‣ Crowd work: recommendation features can help identify locals
  This study isolated recommendation and user behavior features to demonstrate the efficacy of including those strategies
  Ideal: combine textual features with recommendation ones
‣ Darker side: demonstrates the power of the crowd to "give away" the identities of participants in political protests
‣ Demonstrates the value of using empirical work to inform computational solutions
Project EPIC, University of Colorado, Boulder
US National Science Foundation Grants IIS & IIS
NSF Graduate Fellowship
Thank you.