Published by Shavonne Owen. Modified over 9 years ago.
1
Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions
Kate Starbird, University of Colorado Boulder, ATLAS Institute
Leysia Palen, University of Colorado Boulder, Computer Science
Grace Muzny, University of Washington, Computer Science
2
Social Media & Mass Disruption Events
4
Sociologists of disaster: After a disaster event, people will converge on the scene to, among other things, offer help
5
Spontaneous volunteers
6
Mass Disruption = Mass Convergence Social Media & Mass Disruption Events
7
Opportunities for Digital Convergence
8
Citizen Reporting
13
Opportunities for Digital Convergence: Citizen Reporting
Challenges of Digital Convergence: Volume, Noise, Misinformation & Disinformation
Crowd Work!
14
Signal
16
Noise? Signal
21
Signal
Original Information: first-hand info; new info coming into the space for the first time
Derivative Behavior: re-sourced info, reposts, links/URLs, network connections
Source: Starbird, K., Palen, L., Hughes, A.L., & Vieweg, S. (2010). Chatter on The Red: What Hazards Threat Reveals about the Social Life of Microblogged Information. CSCW 2010.
22
Signal: use Derivative Behavior to find Original Information
Original Information: first-hand info; new info coming into the space for the first time
Derivative Behavior: re-sourced info, reposts, links/URLs, network connections
23
Signal: use Derivative Behavior (RT @mention, follow, @mention) to find Original Information
27
Signal: use Derivative Behavior (RT @mention, follow, @mention) to find Original Information
Collaborative Filtering
Crowd Work
28
Learning from the Crowd: A Collaborative Filter for Identifying Locals
29
Learning from the Crowd: A Collaborative Filter for Identifying Locals
‣ Background
  Why Identify Locals?
  Empirical Study on Crowd Work during Egypt Protests
‣ Test Machine Learning Solution for Identifying Locals
  Event - Occupy Wall Street in NYC
  Data Collection & Analysis
  Findings
‣ Discussion
  Leveraging Crowd Work
  From Empirical Work to Computational Solutions
30
Why Help Identify Locals?
‣ Citizen Reporting: first-hand info can contribute to situational awareness
‣ Info not already in the larger information space
‣ Digital volunteers often work to identify and create lists of on-the-ground Twitterers
32
Why Help Identify Locals?
‣ Crisis events vs. protest events
‣ Tunisia Protests - activists tweeting from the ground were a valuable source of info for journalists (Lohan, 2011)
‣ Egypt Protests - protestors on the ground were actively fostering solidarity from the remote crowd (Starbird and Palen, 2012)
34
Why Help Identify Locals?
‣ Occupy Wall Street (OWS) Protests: Protestors on the ground wanted to publicize their numbers, foster solidarity with the crowd, and solicit assistance
  @jeffrae: We could really use a generator down here at Zuccotii Park. Can anyone help? #occupyWallStreet #takewallst #Sept17
‣ OWS Protests: Remote supporters aggregated and published lists of those on the ground
  @CassProphet: Follow on-scene @AACina @Jeffrae @DhaniBagels @Korgasm_ @brettchamberlin #TakeWallStreet #OurWallStreet #OccupyWallStreet #yeswecamp
  @djjohnso: We have 20 livetweeters for this list. Are there others? @djjohnso/occupywallstreetlive #takewallstreet #OurWallStreet #needsoftheoccupiers
35
Empirical Study of Crowd Work during Political Protests
36
Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Collected #egypt #jan25 tweets: 2,229,129 tweets, 338,895 Twitterers
‣ Identified most-RTed Twitterers
‣ Determined location for sample
42
Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Crowd may work to identify on-the-ground Twitterers
‣ Identified several recommendation and user behavior features that had significant relationships to being “on the ground”
‣ More times retweeted = more likely to be on the ground
‣ More unique retweets = more likely to be on the ground
‣ More followers at beginning of event = less likely to be on the ground
Callout (appears twice on the slide): Feature not available in tweet metadata. Identified through qualitative analysis, then calculated and evaluated through quantitative analysis.
43
Goal: Test Viability of a Machine Learning Solution to Identify Locals using Crowd Recommendation Behavior
44
Move from Empirical Work to Computational Solution Goal: Test Viability of a Machine Learning Solution to Identify Locals using Crowd Recommendation Behavior
45
Event: Occupy Wall Street Protests September 15-21, 2011 NYC site - Zuccotti Park
47
Data Collection and Sampling
‣ 270,508 Tweets - Search API, Streaming API
Search Terms | Search API Window | Streaming API Window
#occupywallstreet #dayofrage | Sept 15 1pm - Sept 17 11am | Sept 17 11am - Sept 20 6:45pm
#takewallstreet #sep17 #sept17 | Sept 15 1pm - Sept 17 1:45pm | Sept 17 1:45pm - Sept 20 6:45pm
#ourwallstreet | Sept 18 9:38am - Sept 18 10:05am | Sept 18 10:05am - Sept 20 6:45pm
51
Data Collection and Sampling
‣ 270,508 Tweets - Search API, Streaming API
‣ 53,296 Total Twitterers
‣ 23,847 Twitterers sent >= 2 tweets
  ‣ allowing us to capture profile change
  ‣ Tweets from Streaming API contain Twitter profile information
‣ 10% sample - 2385 Twitterers
  ‣ using a tweet-based sampling strategy
52
Sampling Users from Heavy-Tailed Distributions
[Figure: Number of Twitterers vs. Number of Occupy-related tweets]
53
Twitterer-based Sampling Strategy
[Figure: Number of Twitterers vs. Number of Occupy-related tweets]
54
Sampling Users from Heavy-Tailed Distributions: Tweet-based Sampling Strategy
[Figure: Number of Twitterers vs. Number of Occupy-related tweets]
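The contrast between the two sampling strategies can be sketched in Python. The slides do not spell out the procedure, so this assumes that tweet-based sampling means drawing tweets uniformly at random and keeping each drawn tweet's author, which favors high-volume accounts, while Twitterer-based sampling draws accounts uniformly. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def twitterer_based_sample(authors, k, rng):
    """Draw k distinct Twitterers uniformly, ignoring how much each tweeted."""
    return set(rng.sample(sorted(set(authors)), k))


def tweet_based_sample(authors, k, rng):
    """Draw tweets uniformly at random and keep each tweet's author
    until k distinct Twitterers are collected (assumed procedure)."""
    order = list(authors)
    rng.shuffle(order)
    sample = set()
    for author in order:
        sample.add(author)
        if len(sample) == k:
            break
    return sample


# Toy heavy-tailed stream: one list entry per tweet, named by its author.
# 5 heavy users with 40 tweets each, 95 light users with 2 tweets each.
authors = [f"heavy{i}" for i in range(5) for _ in range(40)]
authors += [f"light{i}" for i in range(95) for _ in range(2)]
tweet_counts = Counter(authors)

rng = random.Random(0)
by_user = twitterer_based_sample(authors, 10, rng)
by_tweet = tweet_based_sample(authors, 10, rng)

# Average tweet volume of the sampled accounts under each strategy.
mean_user = sum(tweet_counts[a] for a in by_user) / len(by_user)
mean_tweet = sum(tweet_counts[a] for a in by_tweet) / len(by_tweet)
print(mean_user, mean_tweet)
```

In the toy stream half of all tweets come from the five heavy users, so a tweet-based draw is far more likely to reach them than a uniform draw over accounts.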
59
Location Coding
‣ 2385 Twitterers
‣ Identify those tweeting from “the ground” in NYC OWS protests
Location | Total # of Twitterers
Total | 2385
Ground & Tweeting Ground Info (Group A) | 106 (4.5%)
Not Ground or Not Tweeting Ground Info (Group B) | 2270
Unknown – Excluded | 9
60
Filtering with a Support Vector Machine
65
Support Vector Machines (SVMs)
‣ Supervised machine learning algorithm: works with labeled training data to learn how to classify new data
SVMs accept real-valued features and have been shown to work well on high-dimensional, noisy data (Schölkopf B., 2004)
[Diagram: labeled training data → trained SVM; unlabeled data → trained SVM → labeled data]
68
Features
‣ Feature selection is important to prevent over-fitting and to produce an accurate classifier
Recommendation Features
‣ Follower growth
‣ Follower growth as % of initial #
‣ Follower growth / Friend growth
‣ Listed growth
‣ Listed growth as % of initial #
‣ Times retweeted (log)
‣ Times RTed (log) / Initial followers (log)
‣ Times RTed (log) / # of tweets (log)
‣ # unique tweets RTed / # of tweets
Flat Profile Features
‣ Statuses count (log)
‣ Initial followers count (log)
‣ Initial friends count (log)
‣ # of RTs as % of stream
‣ # of tweets for event (log)
‣ Description changed during event
‣ Location changed during event
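A few of the recommendation features could be computed roughly as follows. This is a sketch under stated assumptions, not the authors' code: the slides do not say how zero counts were handled, so log(1 + x) and a zero-denominator guard are used here, and the function and key names are hypothetical. Only a subset of the listed features is shown.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def recommendation_features(initial_followers, final_followers,
                            initial_friends, final_friends,
                            times_retweeted, unique_tweets_retweeted,
                            n_tweets):
    """Compute a subset of the slide's recommendation features for one account."""
    follower_growth = final_followers - initial_followers
    friend_growth = final_friends - initial_friends
    # log(1 + x) is an assumption made here so that zero counts stay defined.
    log_rt = math.log1p(times_retweeted)
    return {
        "follower_growth": follower_growth,
        "follower_growth_pct": safe_ratio(follower_growth, initial_followers),
        "follower_vs_friend_growth": safe_ratio(follower_growth, friend_growth),
        "times_rt_log": log_rt,
        "rt_log_per_follower_log": safe_ratio(log_rt, math.log1p(initial_followers)),
        "rt_log_per_tweet_log": safe_ratio(log_rt, math.log1p(n_tweets)),
        "unique_rt_per_tweet": safe_ratio(unique_tweets_retweeted, n_tweets),
    }


# Made-up counts for a single account over the event window.
example = recommendation_features(
    initial_followers=100, final_followers=150,
    initial_friends=80, final_friends=90,
    times_retweeted=9, unique_tweets_retweeted=4, n_tweets=8)
print(example)
```

The growth features need the profile at two points in time, which is why the sample was restricted to Twitterers who sent at least two tweets.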
77
10-Fold Cross Validation
‣ Technique for splitting up labeled data for use in training and validating the classifier
[Diagram: data split into validation data and training data, with a different validation slice in each fold, etc.]
Stratified folds:
 | Not On-ground | On-ground
Total | 2270 | 106
Per fold | 227 | 10 or 11
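The per-fold counts in the table follow from dealing each class separately across the ten folds. A minimal pure-Python sketch of a stratified split, using the class sizes from the slide (the function name is hypothetical, and this is not the tooling the authors used):

```python
import random


def stratified_folds(labels, n_folds, rng):
    """Return n_folds lists of indices, each keeping the class proportions
    of `labels` as closely as possible."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Class sizes from the slide: 2270 not on-ground (0), 106 on-ground (1).
labels = [0] * 2270 + [1] * 106
folds = stratified_folds(labels, 10, random.Random(0))

for k, validation in enumerate(folds):
    held_out = set(validation)
    # Everything outside the held-out fold is used for training.
    training = [i for i in range(len(labels)) if i not in held_out]
    on_ground = sum(labels[i] for i in validation)
    print(k, len(training), len(validation) - on_ground, on_ground)
```

Each fold serves once as validation data while the other nine are used for training, so every labeled Twitterer is classified exactly once by a model that never saw it.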
81
A Majority Class Classifier
‣ A naive approach yields high overall accuracy - 95.6%
Correctly classified: 99.9% of Not On-ground Tweeters, 4.6% of On-ground Tweeters
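The 95.6% figure can be checked by weighting the per-class accuracies by class size. This worked example assumes that 99.9% and 4.6% are the shares correctly classified among not on-ground and on-ground Twitterers respectively, and that the 9 accounts with unknown location are excluded.

```python
# Class sizes from the location-coding table (unknowns excluded).
n_not_ground = 2270
n_on_ground = 106

# Per-class accuracies reported for the naive classifier.
acc_not_ground = 0.999
acc_on_ground = 0.046

# Overall accuracy is the class-size-weighted mean of the two.
correct = acc_not_ground * n_not_ground + acc_on_ground * n_on_ground
overall = correct / (n_not_ground + n_on_ground)

# Number of on-ground Twitterers the naive classifier actually finds.
found_on_ground = acc_on_ground * n_on_ground

print(f"overall accuracy: {overall:.1%}")
print(f"on-ground Twitterers found: {found_on_ground:.1f} of {n_on_ground}")
```

The overall number looks strong only because the majority class dominates the average; roughly 5 of the 106 on-ground Twitterers are recovered.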
82
Unbalanced Data Sets ‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification
84
Unbalanced Data Sets ‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification We want a better line...
86
Unbalanced Data Sets
‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification
Asymmetric Soft Margins
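The slides name asymmetric soft margins but give no formula or cost values. One common formulation charges a different misclassification cost per class, minimizing 0.5·||w||² + C₊·Σ(hinge loss over positives) + C₋·Σ(hinge loss over negatives). The sketch below assumes that formulation with a linear kernel and plain subgradient descent; the costs, step size, and toy data are illustrative, not the authors' settings.

```python
def decision(w, b, x):
    """Signed score of x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def objective(w, b, X, y, c_pos, c_neg):
    """0.5 * ||w||^2 plus class-weighted hinge loss; labels are +1 / -1."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        cost = c_pos if yi > 0 else c_neg
        loss += cost * max(0.0, 1.0 - yi * decision(w, b, xi))
    return reg + loss


def train(X, y, c_pos, c_neg, steps=500, lr=0.001):
    """Subgradient descent; returns the best (w, b) seen, including the start."""
    w, b = [0.0] * len(X[0]), 0.0
    best = (objective(w, b, X, y, c_pos, c_neg), w, b)
    for _ in range(steps):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularizer is w
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1.0:  # point violates the margin
                cost = c_pos if yi > 0 else c_neg
                grad_w = [g - cost * yi * xj for g, xj in zip(grad_w, xi)]
                grad_b -= cost * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
        value = objective(w, b, X, y, c_pos, c_neg)
        if value < best[0]:
            best = (value, w, b)
    return best[1], best[2]


# Toy unbalanced 1-D data: six negatives, two positives, with some overlap.
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [1.2], [3.0]]
y = [-1, -1, -1, -1, -1, -1, 1, 1]

w_sym, b_sym = train(X, y, c_pos=1.0, c_neg=1.0)    # symmetric margins
w_asym, b_asym = train(X, y, c_pos=3.0, c_neg=1.0)  # minority errors cost 3x
print(w_sym, b_sym)
print(w_asym, b_asym)
```

Raising the cost on the minority class makes it more expensive to leave on-ground examples on the wrong side of the margin, which is the compensation the slide calls for.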
87
Findings
 | Not On-ground | On-ground | Overall
Accuracies | 77.6% | 67.9% | 77.2%
89
Findings
Correctly classified: 77.6% of Not On-ground Tweeters, 67.9% of On-ground Tweeters
92
Findings
‣ The classifier tripled the signal to noise ratio
 | Not On-ground | On-ground | Overall
Accuracies | 77.6% | 67.9% | 77.2%
Signal to noise ratio: 4.7% in the original data set, 14.2% in the data filtered through the SVM
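The reported ratios can be reproduced from the class sizes and per-class accuracies. This worked example assumes that signal to noise means the ratio of on-ground to not on-ground Twitterers, before filtering and among the accounts the SVM labels as on-ground.

```python
# Class sizes from the location-coding table (unknowns excluded).
n_not_ground = 2270
n_on_ground = 106

# Per-class accuracies reported for the SVM with asymmetric soft margins.
acc_not_ground = 0.776
acc_on_ground = 0.679

# Before filtering: every sampled Twitterer is a candidate.
ratio_before = n_on_ground / n_not_ground

# After filtering: only accounts the classifier labels on-ground remain.
true_positives = acc_on_ground * n_on_ground
false_positives = (1 - acc_not_ground) * n_not_ground
ratio_after = true_positives / false_positives

# Overall accuracy is the class-size-weighted mean of the two accuracies.
correct = acc_not_ground * n_not_ground + true_positives
overall = correct / (n_not_ground + n_on_ground)

print(f"signal to noise before: {ratio_before:.1%}")
print(f"signal to noise after:  {ratio_after:.1%}")
print(f"improvement factor:     {ratio_after / ratio_before:.2f}")
print(f"overall accuracy:       {overall:.1%}")
```

Under that reading the figures are mutually consistent: about 72 true on-ground accounts remain against about 508 false positives, roughly three times the original ratio.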
93
Discussion
98
Discussion
‣ Tripling the ratio of signal to noise - in world of digital volunteers
‣ Crowd work - recommendation features can help identify locals
  this study isolated recommendation and user behavior to demonstrate efficacy of including those strategies
  ideal: combine textual features w/ recommendation ones
‣ Darker side: demonstrates power of the crowd to “give away” the identities of participants in political protests
‣ Demonstrates value of using empirical work to inform computational solutions
99
Project EPIC, University of Colorado, Boulder
US National Science Foundation Grants IIS-0546315 & IIS-0910586
NSF Graduate Fellowship
Thank you.