
1 Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions Kate Starbird University of Colorado Boulder, ATLAS Institute Leysia Palen University of Colorado Boulder, Computer Science Grace Muzny University of Washington, Computer Science

2 Social Media & Mass Disruption Events

3 [image-only slide]

4 Sociologists of disaster: After a disaster event, people will converge on the scene to, among other things, offer help

5 Spontaneous volunteers

6 Mass Disruption = Mass Convergence Social Media & Mass Disruption Events

7 Opportunities for Digital Convergence

8 Citizen Reporting

9-13 Opportunities for Digital Convergence: Citizen Reporting
Challenges of Digital Convergence: Volume, Noise, Misinformation & Disinformation
Crowd Work!

14 Signal

15 [image-only slide]

16 Signal - Noise?

17 [image-only slide]

18-21 Signal (Starbird, K., Palen, L., Hughes, A.L., & Vieweg, S. (2010). Chatter on the Red: What Hazards Threat Reveals about the Social Life of Microblogged Information. CSCW 2010)
‣ Original Information: first-hand info; new info coming into the space for the first time
‣ Derivative Behavior: re-sourced info, reposts, links/URLs, network connections

22-27 Signal
‣ Use Derivative Behavior (re-sourced info: RTs, @mentions, follows) to find Original Information (first-hand info, new to the space)
‣ Collaborative Filtering - Crowd Work

28 Learning from the Crowd: A Collaborative Filter for Identifying Locals

29 Learning from the Crowd: A Collaborative Filter for Identifying Locals
‣ Background: Why Identify Locals? Empirical Study of Crowd Work during the Egypt Protests
‣ Test of a Machine Learning Solution for Identifying Locals: Event - Occupy Wall Street in NYC; Data Collection & Analysis; Findings
‣ Discussion: Leveraging Crowd Work; From Empirical Work to Computational Solutions

30 Why Help Identify Locals? ‣ Citizen Reporting: first-hand info can contribute to situational awareness ‣ Info not already in the larger information space ‣ Digital volunteers often work to identify and create lists of on-the-ground Twitterers

31-32 Why Help Identify Locals? ‣ Crisis events vs. protest events ‣ Tunisia Protests - activists tweeting from the ground were a valuable source of info for journalists (Lotan, 2011) ‣ Egypt Protests - protestors on the ground were actively fostering solidarity from the remote crowd (Starbird and Palen, 2012)

33-34 Why Help Identify Locals?
‣ Occupy Wall Street (OWS) Protests: protestors on the ground wanted to publicize their numbers, foster solidarity with the crowd, and solicit assistance
@jeffrae: We could really use a generator down here at Zuccotii Park. Can anyone help? #occupyWallStreet #takewallst #Sept17
‣ OWS Protests: remote supporters aggregated and published lists of those on the ground
@CassProphet: Follow on-scene @AACina @Jeffrae @DhaniBagels @Korgasm_ @brettchamberlin #TakeWallStreet #OurWallStreet #OccupyWallStreet #yeswecamp
@djjohnso: We have 20 livetweeters for this list. Are there others? @djjohnso/occupywallstreetlive #takewallstreet #OurWallStreet #needsoftheoccupiers

35 Empirical Study of Crowd Work during Political Protests

36 Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution ‣ Collected #egypt #jan25 tweets: 2,229,129 tweets from 338,895 Twitterers ‣ Identified most-RTed Twitterers ‣ Determined location for a sample

37-42 Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Crowd may work to identify on-the-ground Twitterers
‣ Identified several recommendation and user behavior features that had significant relationships to being "on the ground":
‣ More times retweeted = more likely to be on the ground
‣ More unique retweets = more likely to be on the ground
‣ More followers at beginning of event = less likely to be on the ground
[callout: Feature not available in tweet metadata. Identified through qualitative analysis, then calculated and evaluated through quantitative analysis.]

43-44 Goal: Test Viability of a Machine Learning Solution to Identify Locals using Crowd Recommendation Behavior - moving from empirical work to a computational solution

45 Event: Occupy Wall Street Protests, September 15-21, 2011; NYC site - Zuccotti Park

46-51 Data Collection and Sampling
‣ 270,508 Tweets - Search API, Streaming API (a collection-filtering sketch follows)

Search Terms | Search API Window | Streaming API Window
#occupywallstreet #dayofrage | Sept 15 1pm - Sept 17 11am | Sept 17 11am - Sept 20 6:45pm
#takewallstreet #sep17 #sept17 | Sept 15 1pm - Sept 17 1:45pm | Sept 17 1:45pm - Sept 20 6:45pm
#ourwallstreet | Sept 18 9:38am - Sept 18 10:05am | Sept 18 10:05am - Sept 20 6:45pm

‣ 53,296 Total Twitterers
‣ 23,847 Twitterers sent >= 2 tweets, allowing us to capture profile change (tweets from the Streaming API contain Twitter profile information)
‣ 10% sample - 2385 Twitterers, using a tweet-based sampling strategy
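Since these slides describe collecting and then filtering tweets by event hashtag, a minimal offline sketch may help. It assumes a hypothetical archive ows_tweets.jsonl holding one Streaming-API-style tweet object per line; the text and user.screen_name fields match Twitter's v1.1 payloads, but the file name and helper are illustrative, not the authors' pipeline.

```python
import json

# Hashtags used to collect the OWS dataset (from the table above).
EVENT_TAGS = {"#occupywallstreet", "#dayofrage", "#takewallstreet",
              "#sep17", "#sept17", "#ourwallstreet"}

def is_event_tweet(tweet: dict) -> bool:
    """True if the tweet text contains any of the event hashtags."""
    tokens = tweet.get("text", "").lower().split()
    return any(tag in tokens for tag in EVENT_TAGS)

# Hypothetical archive: one JSON tweet object per line.
with open("ows_tweets.jsonl") as f:
    tweets = [json.loads(line) for line in f]

event_tweets = [t for t in tweets if is_event_tweet(t)]
authors = {t["user"]["screen_name"] for t in event_tweets}
print(len(event_tweets), "tweets from", len(authors), "Twitterers")
```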

52-54 Sampling Users from Heavy-Tailed Distributions
[charts: Number of Twitterers vs. Number of Occupy-related tweets, contrasting a Twitterer-based sampling strategy with a tweet-based sampling strategy] (a sampling sketch follows)
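The difference between the two strategies in the charts above can be made concrete in a few lines. This is a sketch under stated assumptions - the slides do not give the exact procedure, and tweets_by_user and all_tweets are hypothetical structures built from the archive sketched earlier:

```python
import random

def twitterer_based_sample(tweets_by_user: dict, k: int) -> list:
    # Uniform over users: dominated by the heavy tail of one-tweet users.
    return random.sample(list(tweets_by_user), k)

def tweet_based_sample(all_tweets: list, k: int) -> list:
    # Draw tweets uniformly and keep their authors, so a user's chance of
    # selection grows with the number of event tweets they sent.
    authors, seen = [], set()
    for tweet in random.sample(all_tweets, len(all_tweets)):
        name = tweet["user"]["screen_name"]
        if name not in seen:
            seen.add(name)
            authors.append(name)
        if len(authors) == k:
            break
    return authors
```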

56-59 Location Coding
‣ 2385 Twitterers
‣ Identify those tweeting from "the ground" at the NYC OWS protests

Location | Total # of Twitterers
Total | 2385
Ground & Tweeting Ground Info (Group A) | 106 (4.5%)
Not Ground or Not Tweeting Ground Info (Group B) | 2270
Unknown - Excluded | 9

60 Filtering with a Support Vector Machine

61-65 Support Vector Machines (SVMs)
‣ Supervised machine learning algorithm - works with labeled training data to learn how to classify new data
[diagram: labeled training data → trained SVM; unlabeled data → trained SVM → labeled data]
‣ SVMs accept real-valued features and have been shown to work well on high-dimensional, noisy data (Schölkopf, 2004)
(a minimal training sketch follows)
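To make the train-then-classify flow in the diagram concrete, here is a minimal scikit-learn sketch. The toy data and the RBF kernel are assumptions for illustration, not the configuration reported in the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: one row of real-valued features per Twitterer,
# with ~5% positive labels to mimic the rarity of the on-ground class.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = (rng.random(500) < 0.05).astype(int)

clf = SVC(kernel="rbf")         # kernel choice is illustrative
clf.fit(X, y)                   # learn from labeled training data
labels = clf.predict(X[:10])    # classify "new" (here, reused) data
```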

66-68 Features
‣ Feature selection is important to prevent over-fitting and to produce an accurate classifier (a computation sketch for a few of these follows the lists)

Recommendation Features:
‣ Follower growth
‣ Follower growth as % of initial #
‣ Follower growth / friend growth
‣ Listed growth
‣ Listed growth as % of initial #
‣ Times retweeted (log)
‣ Times RTed (log) / initial followers (log)
‣ Times RTed (log) / # of tweets (log)
‣ # unique tweets RTed / # of tweets

Flat Profile Features:
‣ Statuses count (log)
‣ Initial followers count (log)
‣ Initial friends count (log)
‣ # of RTs as % of stream
‣ # of tweets for event (log)
‣ Description changed during event
‣ Location changed during event
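Several of the recommendation features above can be derived from two profile snapshots plus retweet counts. A sketch with hypothetical field names; the log(x+1) form and the division-by-zero guards are my choices, not necessarily the paper's exact formulas:

```python
import math

def recommendation_features(start: dict, end: dict, times_rted: int,
                            unique_rted: int, n_tweets: int) -> dict:
    """Compute a subset of the slide's recommendation features from
    profile snapshots taken at the start and end of the event window."""
    follower_growth = end["followers"] - start["followers"]
    friend_growth = end["friends"] - start["friends"]
    return {
        "follower_growth": follower_growth,
        "follower_growth_pct": follower_growth / max(start["followers"], 1),
        "follower_per_friend_growth": follower_growth / max(friend_growth, 1),
        "listed_growth": end["listed"] - start["listed"],
        "times_rted_log": math.log(times_rted + 1),
        "rted_log_per_tweets_log": math.log(times_rted + 1)
                                   / max(math.log(n_tweets + 1), 1e-9),
        "unique_rted_per_tweet": unique_rted / max(n_tweets, 1),
    }
```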

69-77 10-Fold Cross Validation
‣ Technique for splitting up labeled data for use in training and validating the classifier
[diagram: the labeled data is split into ten folds; each fold serves once as the validation data while the remaining nine are the training data]
‣ Folds are stratified:

 | Not On-ground | On-ground
Total | 2270 | 106
Per fold | 227 | 10 or 11

(a scikit-learn sketch follows)
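A sketch of the stratified 10-fold procedure with scikit-learn, reusing the toy X, y from the SVM sketch above; the shuffle and random_state settings are illustrative:

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Stratification preserves the class ratio in every fold, so each of the
# ten validation folds holds some on-ground Twitterers, never zero.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in skf.split(X, y):    # X, y as in the SVM sketch
    model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
print("mean accuracy:", sum(fold_scores) / len(fold_scores))
```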

78-81 A Majority Class Classifier
‣ A naive approach yields high overall accuracy - 95.6%
[chart: 99.9% of Not On-ground Tweeters are classified correctly, but only 4.6% of On-ground Tweeters]
(a one-line baseline sketch follows)
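The naive baseline is one line with scikit-learn's DummyClassifier - a sketch (with the toy X, y from earlier) of why overall accuracy misleads on unbalanced data:

```python
from sklearn.dummy import DummyClassifier

# Always predict the majority class ("not on the ground"): overall
# accuracy looks excellent, yet no on-ground Twitterers are found.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print("overall accuracy:", baseline.score(X, y))
print("on-ground found:", int(baseline.predict(X).sum()))   # 0
```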

82-86 Unbalanced Data Sets
‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification
[diagram: the default decision boundary hugs the minority class - "We want a better line..."]
‣ Solution: Asymmetric Soft Margins (a sketch of per-class penalties follows)
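One standard way to realize asymmetric soft margins is a per-class misclassification penalty. In scikit-learn this is SVC's class_weight parameter, which scales the cost C separately for each class; whether the authors used this exact mechanism is not stated on the slide, so treat the following as a sketch:

```python
from sklearn.svm import SVC

# Penalize errors on the rare on-ground class (label 1) more heavily,
# pushing the decision boundary away from the minority points.
clf = SVC(kernel="rbf", class_weight={0: 1, 1: 20})   # 20x is illustrative
clf.fit(X, y)   # X, y as in the earlier sketches

# class_weight="balanced" derives the weights from class frequencies instead.
```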

87-92 Findings

 | Not On-ground | On-ground | Overall
Accuracies | 77.6% | 67.9% | 77.2%

[chart: 77.6% of Not On-ground Tweeters and 67.9% of On-ground Tweeters classified correctly]
‣ The classifier tripled the signal-to-noise ratio: on-ground Twitterers made up 4.7% of the original data set and 14.2% of the data filtered through the SVM

93-98 Discussion
‣ Tripling the ratio of signal to noise matters in a world of digital volunteers
‣ Crowd work: recommendation features can help identify locals - this study isolated recommendation and user behavior features to demonstrate the efficacy of including those strategies; ideally, textual features would be combined with recommendation ones
‣ Darker side: demonstrates the power of the crowd to "give away" the identities of participants in political protests
‣ Demonstrates the value of using empirical work to inform computational solutions

99 Project EPIC, University of Colorado Boulder. US National Science Foundation Grants IIS-0546315 & IIS-0910586; NSF Graduate Fellowship. Thank you.

