Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions
Kate Starbird, University of Colorado Boulder, ATLAS Institute
Leysia Palen, University of Colorado Boulder, Computer Science
Grace Muzny, University of Washington, Computer Science

Social Media & Mass Disruption Events
‣ Sociologists of disaster: after a disaster event, people will converge on the scene to, among other things, offer help
‣ Spontaneous volunteers
‣ Mass Disruption = Mass Convergence

Opportunities for Digital Convergence
‣ Citizen Reporting

Challenges of Digital Convergence
‣ Volume
‣ Noise
‣ Misinformation & Disinformation

Crowd Work!

Signal and Noise?
Source: Starbird, K., Palen, L., Hughes, A.L., & Vieweg, S. (2010). Chatter on The Red: What Hazards Threat Reveals about the Social Life of Microblogged Information. CSCW 2010
‣ Original Information (signal): first-hand info; new info coming in to the space for the first time
‣ Derivative Behavior: re-sourced info, reposts, links/URLs, network connections
‣ Use Derivative Behavior to find Original Information
‣ Derivative behaviors on Twitter: RT, @mention, follow
[Diagram labels on the final build: Collaborative Filtering, Crowd Work]

Learning from the Crowd: A Collaborative Filter for Identifying Locals
‣ Background
  ‣ Why Identify Locals?
  ‣ Empirical Study on Crowd Work during Egypt Protests
‣ Test Machine Learning Solution for Identifying Locals
  ‣ Event - Occupy Wall Street in NYC
  ‣ Data Collection & Analysis
  ‣ Findings
‣ Discussion
  ‣ Leveraging Crowd Work
  ‣ From Empirical Work to Computational Solutions

Why Help Identify Locals?
‣ Citizen Reporting: first-hand info can contribute to situational awareness
‣ Info not already in the larger information space
‣ Digital volunteers often work to identify and create lists of on-the-ground Twitterers
‣ Crisis events vs. protest events
‣ Tunisia Protests - activists tweeting from the ground were a valuable source of info for journalists (Lohan, 2011)
‣ Egypt Protests - protestors on the ground were actively fostering solidarity from the remote crowd (Starbird and Palen, 2012)
‣ Occupy Wall Street (OWS) Protests: Protestors on the ground wanted to publicize their numbers, foster solidarity with the crowd, and solicit assistance
  @jeffrae: We could really use a generator down here at Zuccotii Park. Can anyone help? #occupyWallStreet #takewallst #Sept17
‣ OWS Protests: Remote supporters aggregated and published lists of those on the ground
  @CassProphet: Follow on-scene @AACina @Jeffrae @DhaniBagels @Korgasm_ @brettchamberlin #TakeWallStreet #OurWallStreet #OccupyWallStreet #yeswecamp
  @djjohnso: We have 20 livetweeters for this list. Are there others? @djjohnso/occupywallstreetlive #takewallstreet #OurWallStreet #needsoftheoccupiers

Empirical Study of Crowd Work during Political Protests

Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Collected #egypt #jan25 tweets: 2,229,129 tweets, 338,895 Twitterers
‣ Identified most-RTed Twitterers
‣ Determined location for sample
‣ Crowd may work to identify on-the-ground Twitterers
‣ Identified several recommendation and user behavior features that had significant relationships to being “on the ground”
  ‣ More times retweeted = more likely to be on the ground
  ‣ More unique retweets = more likely to be on the ground
  ‣ More followers at beginning of event = less likely to be on the ground
‣ Callout (shown twice on the slide): Feature not available in tweet metadata. Identified through qualitative analysis, then calculated and evaluated through quantitative analysis.

Goal: Test Viability of a Machine Learning Solution to Identify Locals using Crowd Recommendation Behavior
‣ Move from Empirical Work to Computational Solution

Event: Occupy Wall Street Protests
‣ September 15-21, 2011
‣ NYC site - Zuccotti Park

Data Collection and Sampling
‣ 270,508 Tweets - Search API, Streaming API

Search Terms | Search API Window | Streaming API Window
#occupywallstreet #dayofrage | Sept 15 1pm - Sept 17 11am | Sept 17 11am - Sept 20 6:45pm
#takewallstreet #sep17 #sept17 | Sept 15 1pm - Sept 17 1:45pm | Sept 17 1:45pm - Sept 20 6:45pm
#ourwallstreet | Sept 18 9:38am - Sept 18 10:05am | Sept 18 10:05am - Sept 20 6:45pm

‣ 53,296 Total Twitterers
‣ 23,847 Twitterers sent >= 2 tweets
  ‣ allowing us to capture profile change
‣ Tweets from Streaming API contain Twitter profile information
‣ 10% sample - 2385 Twitterers
  ‣ using a tweet-based sampling strategy

Sampling Users from Heavy-Tailed Distributions
[Figure: distribution of Twitterers by tweet volume (axes: Number of Occupy-related tweets, Number of Twitterers), shown for a Twitterer-based Sampling Strategy and a Tweet-based Sampling Strategy]
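The slides name the two sampling strategies but do not spell out how each was implemented. The sketch below is one plausible reading, not the authors' procedure: twitterer-based sampling draws users uniformly, while tweet-based sampling draws tweets uniformly and keeps their authors, so prolific users are more likely to be picked. The toy data, function names, and sample size are all hypothetical.

```python
import random

def twitterer_based_sample(tweet_authors, k, rng):
    """Pick k distinct users uniformly at random from the set of users."""
    users = sorted(set(tweet_authors))
    return rng.sample(users, k)

def tweet_based_sample(tweet_authors, k, rng):
    """Visit tweets in random order and keep each new author until k users are found."""
    order = list(range(len(tweet_authors)))
    rng.shuffle(order)
    chosen, seen = [], set()
    for i in order:
        author = tweet_authors[i]
        if author not in seen:
            seen.add(author)
            chosen.append(author)
            if len(chosen) == k:
                break
    return chosen

# Toy heavy-tailed data (hypothetical): 5 prolific users with 200 tweets each,
# 995 light users with 2 tweets each. One list entry per tweet, holding its author.
tweet_authors = [f"heavy{u}" for u in range(5) for _ in range(200)]
tweet_authors += [f"light{u}" for u in range(995) for _ in range(2)]

rng = random.Random(0)
by_user = twitterer_based_sample(tweet_authors, 100, rng)
by_tweet = tweet_based_sample(tweet_authors, 100, rng)

# Count how many of the prolific users each strategy captured.
heavy_by_user = sum(name.startswith("heavy") for name in by_user)
heavy_by_tweet = sum(name.startswith("heavy") for name in by_tweet)
print("prolific users in twitterer-based sample:", heavy_by_user)
print("prolific users in tweet-based sample:", heavy_by_tweet)
```

With a heavy-tailed distribution, a uniform draw over users mostly returns low-volume accounts, whereas a draw over tweets favors the high-volume accounts in the tail.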

Location Coding
‣ 2385 Twitterers
‣ Identify those tweeting from “the ground” in NYC OWS protests

Location | Total # of Twitterers
Ground & Tweeting Ground Info (Group A) | 106
Not Ground or Not Tweeting Ground Info (Group B) | 2270
Unknown – Excluded | 9
Total | 2385

‣ Callout: 4.5% (Group A share of the 2376 coded Twitterers)

Filtering with a Support Vector Machine

Support Vector Machines (SVMs)
‣ Supervised machine learning algorithm: works with labeled training data to learn how to classify new data
‣ SVMs accept real-valued features and have been shown to work well on high-dimensional, noisy data (Schölkopf B., 2004)
[Diagram: labeled training data -> trained SVM; unlabeled data -> trained SVM -> labeled data]

Features
‣ Feature selection is important to prevent over-fitting and to produce an accurate classifier

Recommendation Features
‣ Follower growth
‣ Follower growth as % of initial #
‣ Follower growth / Friend growth
‣ Listed growth
‣ Listed growth as % of initial #
‣ Times retweeted (log)
‣ Times RTed (log) / Initial followers (log)
‣ Times RTed (log) / # of tweets (log)
‣ # unique tweets RTed / # of tweets

Flat Profile Features
‣ Statuses count (log)
‣ Initial followers count (log)
‣ Initial friends count (log)
‣ # of RTs as % of stream
‣ # of tweets for event (log)
‣ Description changed during event
‣ Location changed during event
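A minimal sketch of how the recommendation features listed above might be computed for one Twitterer. The record layout and field names are hypothetical, and the slides do not say which logarithm was used or how zero counts were handled; this sketch assumes log(1 + x) and returns 0.0 when a denominator is zero.

```python
import math

def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator else 0.0

def recommendation_features(user):
    """Build the recommendation feature values from a per-user record.

    `user` is a hypothetical dict holding profile counts at the first and
    last tweet seen during the event, plus retweet tallies for the event.
    """
    follower_growth = user["followers_end"] - user["followers_start"]
    friend_growth = user["friends_end"] - user["friends_start"]
    listed_growth = user["listed_end"] - user["listed_start"]
    rt_log = math.log1p(user["times_retweeted"])  # assumed log(1 + x)
    return {
        "follower_growth": follower_growth,
        "follower_growth_pct": safe_ratio(follower_growth, user["followers_start"]),
        "follower_vs_friend_growth": safe_ratio(follower_growth, friend_growth),
        "listed_growth": listed_growth,
        "listed_growth_pct": safe_ratio(listed_growth, user["listed_start"]),
        "times_rt_log": rt_log,
        "rt_log_per_followers_log": safe_ratio(rt_log, math.log1p(user["followers_start"])),
        "rt_log_per_tweets_log": safe_ratio(rt_log, math.log1p(user["n_tweets"])),
        "unique_rted_per_tweet": safe_ratio(user["unique_tweets_retweeted"], user["n_tweets"]),
    }

# Made-up example record, for illustration only.
example_user = {
    "followers_start": 100, "followers_end": 150,
    "friends_start": 50, "friends_end": 60,
    "listed_start": 2, "listed_end": 5,
    "times_retweeted": 40, "unique_tweets_retweeted": 10, "n_tweets": 20,
}
features = recommendation_features(example_user)
for name, value in features.items():
    print(name, value)
```

The growth features need at least two observations of a profile, which is consistent with restricting the sample to Twitterers who sent two or more tweets.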

10-Fold Cross Validation
‣ Technique for splitting up labeled data for use in training and validating the classifier
[Diagram: the data is divided into folds; one fold serves as validation data and the rest as training data, etc...]
‣ Folds are stratified

 | Not On-ground | On-ground
Total | 2270 | 106
Per fold | 227 | 10 or 11
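As an illustration of the stratified split, the sketch below deals each class round-robin into ten folds, which reproduces the per-fold counts on the slide (227 not on-ground, 10 or 11 on-ground). The function names and the shuffle seed are assumptions; the slides do not say how the folds were actually assigned.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign item indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's items round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

def train_validation_splits(folds):
    """Yield (training, validation) index lists, holding out one fold at a time."""
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# 0 = not on-ground (Group B), 1 = on-ground (Group A), using the counts from the slides.
labels = [0] * 2270 + [1] * 106
folds = stratified_folds(labels)

for number, fold in enumerate(folds):
    on_ground = sum(labels[idx] for idx in fold)
    print(number, len(fold) - on_ground, on_ground)
```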

A Majority Class Classifier
‣ A naive approach yields high overall accuracy - 95.6%
[Chart: correctly vs. incorrectly classified, for Not On-ground Tweeters and On-ground Tweeters; values shown: 99.9%, 4.6%]
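A quick arithmetic check of why high overall accuracy is misleading here. It assumes the two chart values are per-class accuracies (99.9% for not on-ground, 4.6% for on-ground), which the transcript does not state explicitly; under that reading they combine to the 95.6% overall figure. A classifier that always answers "not on-ground" would score about 95.5% on the same counts.

```python
n_not_ground = 2270   # Group B from the location coding
n_ground = 106        # Group A from the location coding
n_total = n_not_ground + n_ground

# Pure majority-class baseline: always predict "not on-ground".
majority_accuracy = n_not_ground / n_total
majority_balanced_accuracy = (1.0 + 0.0) / 2  # mean of the two per-class accuracies

# Assumed reading of the chart values as per-class accuracies.
acc_not_ground = 0.999
acc_ground = 0.046
overall_from_chart = (acc_not_ground * n_not_ground + acc_ground * n_ground) / n_total

print(f"majority baseline accuracy: {majority_accuracy:.1%}")
print(f"majority baseline balanced accuracy: {majority_balanced_accuracy:.1%}")
print(f"overall accuracy implied by chart values: {overall_from_chart:.1%}")
```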

Unbalanced Data Sets
‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification
‣ We want a better line...
‣ Asymmetric Soft Margins
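For reference, the textbook form of a soft-margin SVM with asymmetric (class-dependent) penalties is shown below. The slides do not give the formulation, kernel, or penalty values that were used, so the symbols here are assumptions: y_i = +1 marks on-ground Twitterers, y_i = -1 marks the rest, and C_+ and C_- are the misclassification costs for the two classes.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2}
    + C_{+}\sum_{i:\,y_i=+1}\xi_i
    + C_{-}\sum_{i:\,y_i=-1}\xi_i \\
\text{s.t.}\quad
  & y_i\,(w^{\top}x_i + b) \ge 1-\xi_i,
    \qquad \xi_i \ge 0 \quad \text{for all } i .
\end{aligned}
```

Setting C_+ larger than C_- makes errors on the rare on-ground class more expensive, which moves the separating line away from pure majority classification. One common heuristic, not taken from the slides, is to weight classes inversely to their frequency, which for these counts would give C_+ / C_- = 2270 / 106, roughly 21.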

Findings
‣ The classifier tripled the signal to noise ratio

 | Not On-ground | On-ground | Overall
Accuracies | 77.6% | 67.9% | 77.2%

[Chart: correctly vs. incorrectly classified; Not On-ground Tweeters 77.6% and On-ground Tweeters 67.9% correctly classified]
‣ Signal to noise: 4.7% in the original data set, 14.2% in the data filtered through the SVM
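The reported figures can be cross-checked from the class counts and per-class accuracies. This sketch assumes "signal to noise" means the number of on-ground Twitterers divided by the number of not on-ground Twitterers, a reading that reproduces both the 4.7% and 14.2% values; variable names are illustrative.

```python
n_not_ground = 2270      # noise: Group B
n_ground = 106           # signal: Group A
acc_not_ground = 0.776   # share of not on-ground Twitterers correctly rejected
acc_ground = 0.679       # share of on-ground Twitterers correctly kept

# Ratio in the original data set.
original_ratio = n_ground / n_not_ground

# What passes through the filter: true positives and false positives.
kept_signal = acc_ground * n_ground
kept_noise = (1 - acc_not_ground) * n_not_ground
filtered_ratio = kept_signal / kept_noise

improvement = filtered_ratio / original_ratio
overall_accuracy = (acc_not_ground * n_not_ground + kept_signal) / (n_not_ground + n_ground)

print(f"original signal to noise: {original_ratio:.1%}")
print(f"filtered signal to noise: {filtered_ratio:.1%}")
print(f"improvement factor: {improvement:.2f}")
print(f"overall accuracy: {overall_accuracy:.1%}")
```

The filter keeps roughly 72 of the 106 on-ground Twitterers while letting through about 508 of the 2270 others, which is where the threefold gain comes from.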

Discussion
‣ Tripling the ratio of signal to noise - in world of digital volunteers
‣ Crowd work - recommendation features can help identify locals
  ‣ this study isolated recommendation and user behavior to demonstrate efficacy of including those strategies
  ‣ ideal: combine textual features w/ recommendation ones
‣ Darker side: demonstrates power of the crowd to “give away” the identities of participants in political protests
‣ Demonstrates value of using empirical work to inform computational solutions

Project EPIC, University of Colorado, Boulder
US National Science Foundation Grants IIS-0546315 & IIS-0910586
NSF Graduate Fellowship
Thank you.
