Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions. Kate Starbird, University of Colorado Boulder.


Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions. Kate Starbird (University of Colorado Boulder, ATLAS Institute), Leysia Palen (University of Colorado Boulder, Computer Science), Grace Muzny (University of Washington, Computer Science)

Social Media & Mass Disruption Events

Sociologists of disaster: After a disaster event, people will converge on the scene to, among other things, offer help

Spontaneous volunteers

Mass Disruption = Mass Convergence Social Media & Mass Disruption Events

Opportunities for Digital Convergence: Citizen Reporting

Challenges of Digital Convergence: Volume, Noise, Misinformation & Disinformation

Crowd Work!

Signal

Noise? Signal

Starbird, K., Palen, L., Hughes, A.L., & Vieweg, S. (2010). Chatter on The Red: What Hazards Threat Reveals about the Social Life of Microblogged Information. CSCW 2010

Signal: Original Information (first-hand info: new info coming into the space for the first time) vs. Derivative Behavior (re-sourced info, reposts, links/URLs, network connections). Derivative behavior can be used to find original information.

Follow: use derivative behavior (Collaborative Filtering, Crowd Work) to find original information: Signal

Learning from the Crowd: A Collaborative Filter for Identifying Locals

‣ Background: Why Identify Locals? Empirical Study on Crowd Work during Egypt Protests
‣ Test Machine Learning Solution for Identifying Locals: Event (Occupy Wall Street in NYC), Data Collection & Analysis, Findings
‣ Discussion: Leveraging Crowd Work, From Empirical Work to Computational Solutions

Why Help Identify Locals?
‣ Citizen Reporting: first-hand info can contribute to situational awareness
‣ Info not already in the larger information space
‣ Digital volunteers often work to identify and create lists of on-the-ground Twitterers

Why Help Identify Locals?
‣ Crisis events vs. protest events
‣ Tunisia Protests: activists tweeting from the ground were a valuable source of info for journalists (Lotan et al., 2011)
‣ Egypt Protests: protestors on the ground were actively fostering solidarity from the remote crowd (Starbird and Palen, 2012)

Why Help Identify Locals?
‣ Occupy Wall Street (OWS) Protests: protestors on the ground wanted to publicize their numbers, foster solidarity with the crowd, and solicit resources: "We could really use a generator down here at Zuccotii Park. Can anyone help? #occupyWallStreet #takewallst #Sept17"
‣ OWS Protests: remote supporters aggregated and published lists of those on the ground: "@brettchamberlin #TakeWallStreet #OurWallStreet #OccupyWallStreet We have 20 livetweeters for this list. Are there" #takewallstreet #OurWallStreet #needsoftheoccupiers

Empirical Study of Crowd Work during Political Protests

Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Collected #egypt #jan25 tweets: 2,229,129 tweets from 338,895 Twitterers
‣ Identified most-RTed Twitterers
‣ Determined location for a sample

Learning from the Crowd: Empirical Study of Crowd Work during 2011 Egypt Revolution
‣ Crowd may work to identify on-the-ground Twitterers
‣ Identified several recommendation and user behavior features that had significant relationships to being "on the ground":
‣ More times retweeted = more likely to be on the ground
‣ More unique retweets = more likely to be on the ground
‣ More followers at beginning of event = less likely to be on the ground
(These features are not available in tweet metadata: they were identified through qualitative analysis, then calculated and evaluated through quantitative analysis.)

Move from Empirical Work to Computational Solution
Goal: Test Viability of a Machine Learning Solution to Identify Locals using Crowd Recommendation Behavior

Event: Occupy Wall Street Protests September 15-21, 2011 NYC site - Zuccotti Park

Data Collection and Sampling
‣ 270,508 Tweets: Search API, Streaming API

Search Terms | Search API Window | Streaming API Window
#occupywallstreet #dayofrage | Sept 15 1pm - Sept 17 11am | Sept 17 11am - Sept 20 6:45pm
#takewallstreet #sep17 #sept17 | Sept 15 1pm - Sept 17 1:45pm | Sept 17 1:45pm - Sept 20 6:45pm
#ourwallstreet | Sept 18 9:38am - Sept 18 10:05am | Sept 18 10:05am - Sept 20 6:45pm

‣ 53,296 Total Twitterers
‣ 23,847 Twitterers sent >= 2 tweets, allowing us to capture profile change (tweets from the Streaming API contain Twitter profile information)
‣ 10% sample of Twitterers, using a tweet-based sampling strategy
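A quick arithmetic check (my own, not from the slides) connects the 10% sample rate to the 2385 Twitterers that get coded for location:

```python
eligible = 23847      # Twitterers who sent >= 2 tweets
sample_rate = 0.10    # 10% tweet-based sample

# 10% of the eligible users works out to the 2385 coded for location
sample_size = round(eligible * sample_rate)
print(sample_size)  # 2385
```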

Sampling Users from Heavy-Tailed Distributions (figures plot number of Twitterers against number of Occupy-related tweets sent)

Twitterer-based Sampling Strategy

Tweet-based Sampling Strategy
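The tweet-based strategy can be sketched as follows (a hypothetical illustration, not the authors' code): sample tweets uniformly at random and keep their authors, so each Twitterer is drawn with probability proportional to how many tweets they sent, favoring the active users in the heavy tail.

```python
import random

random.seed(42)

# Toy stand-in for the collected tweets: (user, tweet_id) pairs with a heavy
# tail -- one very active user, many users who tweeted once.
tweets = [("heavy_user", i) for i in range(50)] + \
         [(f"light_user_{i}", i) for i in range(50)]

def tweet_based_sample(tweets, k):
    """Sample k tweets uniformly; return the (deduplicated) authors.
    Users are selected with probability proportional to their tweet count."""
    sampled = random.sample(tweets, k)
    return {user for user, _ in sampled}

users = tweet_based_sample(tweets, 10)
```

A Twitterer-based strategy would instead sample from the set of unique users, under-representing the most active (and most informative) accounts.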


Location Coding
‣ 2385 Twitterers
‣ Identify those tweeting from "the ground" in NYC OWS protests

Location | Total # of Twitterers
Total | 2385
Ground & Tweeting Ground Info (Group A) | 106 (4.5%)
Not Ground or Not Tweeting Ground Info (Group B) | 2270
Unknown (Excluded) | 9

Filtering with a Support Vector Machine

Support Vector Machines (SVMs)
‣ Supervised machine learning algorithm: works with labeled training data to learn how to classify new data (labeled training data trains the SVM; the trained SVM then labels unlabeled data)
‣ SVMs accept real-valued features and have been shown to work well on high-dimensional, noisy data (Schölkopf, 2004)
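The train-then-classify loop described above looks roughly like this in scikit-learn (an assumed stand-in; the paper does not name an implementation, and the features here are random placeholders for the real user features):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder real-valued user features (e.g., follower growth, log RT counts);
# label 1 = on-the-ground, 0 = not.
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)

clf = SVC(kernel="rbf")            # learn from labeled training data
clf.fit(X_train, y_train)

X_new = rng.normal(size=(10, 5))   # unlabeled data in...
labels = clf.predict(X_new)        # ...labeled data out
```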

Features
‣ Feature selection is important to prevent over-fitting and to produce an accurate classifier

Recommendation Features:
‣ Follower growth
‣ Follower growth as % of initial #
‣ Follower growth / Friend growth
‣ Listed growth
‣ Listed growth as % of initial #
‣ Times retweeted (log)
‣ Times RTed (log) / Initial followers (log)
‣ Times RTed (log) / # of tweets (log)
‣ # unique tweets RTed / # of tweets

Flat Profile Features:
‣ Statuses count (log)
‣ Initial followers count (log)
‣ Initial friends count (log)
‣ # of RTs as % of stream
‣ # of tweets for event (log)
‣ Description changed during event
‣ Location changed during event
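Several of the recommendation features above can be derived from two profile snapshots (start and end of the event) plus retweet counts. A sketch with illustrative field names (the dict keys and smoothing choices are my assumptions, not the paper's):

```python
import math

def recommendation_features(initial, final, times_rted, unique_tweets_rted, n_tweets):
    """Compute a few of the recommendation features from profile snapshots.
    'initial'/'final' are dicts with 'followers', 'friends', 'listed' counts."""
    log = lambda x: math.log(x + 1)   # +1 smoothing to handle zeros (an assumption)
    return {
        "follower_growth": final["followers"] - initial["followers"],
        "follower_growth_pct": (final["followers"] - initial["followers"])
                               / max(initial["followers"], 1),
        "listed_growth": final["listed"] - initial["listed"],
        "times_rted_log": log(times_rted),
        "rted_per_follower": log(times_rted) / max(log(initial["followers"]), 1e-9),
        "unique_rted_ratio": unique_tweets_rted / max(n_tweets, 1),
    }

feats = recommendation_features(
    {"followers": 100, "friends": 80, "listed": 2},
    {"followers": 150, "friends": 90, "listed": 5},
    times_rted=20, unique_tweets_rted=8, n_tweets=40,
)
```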

10-Fold Cross Validation
‣ Technique for splitting up labeled data for use in training and validating the classifier: the labeled data is split into ten folds, and each fold takes a turn as validation data while the remaining folds serve as training data
‣ Folds are stratified; per fold: 227 Not On-ground, 10 or 11 On-ground
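Stratification keeps each fold at the full data set's ~4.5% on-ground rate, which is where the 227 / "10 or 11" per-fold counts come from. A sketch using scikit-learn (my illustration, not the authors' code):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels mirroring the paper's coded data: 2270 not-on-ground, 106 on-ground
y = np.array([0] * 2270 + [1] * 106)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold preserves the overall class balance
    fold_counts.append((int(np.sum(y[val_idx] == 0)),
                        int(np.sum(y[val_idx] == 1))))
```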

A Majority Class Classifier
‣ A naive approach yields high overall accuracy
(chart: correctly vs. incorrectly classified, per class: 99.9% of Not On-ground Tweeters vs. 4.6% of On-ground Tweeters classified correctly)
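The trap is visible directly from the label counts: always predicting "not on-ground" scores about 95.5% overall while finding zero locals. A quick check (my arithmetic, using the location-coding counts above):

```python
not_ground, on_ground = 2270, 106   # counts from the location coding
total = not_ground + on_ground

majority_accuracy = not_ground / total   # always predict "not on-ground"
on_ground_recall = 0 / on_ground         # ...and never find a single local

print(f"{majority_accuracy:.1%}")  # 95.5% overall, yet useless for the task
```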

Unbalanced Data Sets
‣ Without compensation for the unbalanced nature of the data, the classifier tends towards majority classification
‣ We want a better line... Asymmetric Soft Margins
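Asymmetric soft margins penalize mistakes on the rare on-ground class more heavily than mistakes on the majority class. In scikit-learn this corresponds to per-class C weights via `class_weight` (a sketch under that assumption, not the authors' actual configuration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Imbalanced toy data: 500 majority (0) vs. 25 minority (1), overlapping clusters
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(1.5, 1.0, size=(25, 2))])
y = np.array([0] * 500 + [1] * 25)

plain = SVC().fit(X, y)                            # symmetric margins
weighted = SVC(class_weight="balanced").fit(X, y)  # asymmetric soft margins

# The weighted SVM recovers more of the minority (on-ground) class
plain_recall = plain.predict(X[y == 1]).mean()
weighted_recall = weighted.predict(X[y == 1]).mean()
```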

Findings

Accuracies: Not On-ground 77.6%, On-ground 67.9%, Overall 77.2%

‣ The classifier tripled the signal-to-noise ratio: 4.7% in the original data set vs. 14.2% in the data filtered through the SVM
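The reported tripling can be reconstructed from the per-class accuracies and label counts (my arithmetic, using the paper's figures):

```python
on_ground, not_on_ground = 106, 2270   # labeled counts from location coding
recall_on = 0.679                      # on-ground accuracy
spec_not = 0.776                       # not-on-ground accuracy

snr_before = on_ground / not_on_ground        # ~4.7% signal to noise
tp = on_ground * recall_on                    # locals kept by the filter
fp = not_on_ground * (1 - spec_not)           # noise that slips through
snr_after = tp / fp                           # ~14.2%

print(f"{snr_before:.1%} -> {snr_after:.1%}, a {snr_after / snr_before:.1f}x gain")
```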

Discussion
‣ Tripling the ratio of signal to noise matters in a world of digital volunteers
‣ Crowd work: recommendation features can help identify locals. This study isolated recommendation and user behavior features to demonstrate the efficacy of including those strategies; ideally, textual features would be combined with recommendation ones
‣ Darker side: demonstrates the power of the crowd to "give away" the identities of participants in political protests
‣ Demonstrates the value of using empirical work to inform computational solutions

Project EPIC University of Colorado, Boulder US National Science Foundation Grants IIS & IIS NSF Graduate Fellowship Thank you.