Rewarding Crowdsourced Workers
Panos Ipeirotis, New York University and Google
Joint work with: Jing Wang, Foster Provost, Josh Attenberg, and Victor Sheng
"A Computer Scientist in a Business School"

Example: Build a Web Page Classifier
We need a large number of labeled sites for training, so we get people to look at sites and label them as: G (general audience), PG (parental guidance), R (restricted), or X (porn).
Cost/speed statistics:
- Undergrad intern: 200 websites/hr at $15/hr (~7.5 cents per label)
- Mechanical Turk: 2,500 websites/hr at $12/hr (~0.5 cents per label)

Challenges
- We do not know the true category for the objects; it is available only after (costly) manual inspection
- We do not know the quality of the workers
- We want to label objects with their true categories
- We want (need?) to know the quality of the workers

Expectation Maximization Estimation
An iterative process to estimate worker error rates:
1. Initialize the correct label for each object (e.g., using majority vote)
2. Estimate error rates for the workers (using the current correct labels)
3. Estimate the correct labels (using the error rates, weighting worker votes according to quality)
4. Go to Step 2 and iterate until convergence
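For concreteness, here is a minimal Python sketch of this iteration, in the spirit of Dawid & Skene (1979). It is an illustration under assumptions of my own (integer class labels, a uniform class prior, ad-hoc smoothing), not the authors' implementation:

    import numpy as np

    def em_worker_quality(votes, n_classes, n_iter=50):
        """votes[obj] = {worker: label}; labels are ints in 0..n_classes-1.
        Returns soft labels per object and a confusion matrix per worker."""
        workers = {w for v in votes.values() for w in v}
        # Step 1: initialize each object's label distribution via majority vote
        p = {o: np.bincount(list(v.values()), minlength=n_classes) / len(v)
             for o, v in votes.items()}
        for _ in range(n_iter):
            # Step 2: estimate worker error rates from the current soft labels
            conf = {w: np.full((n_classes, n_classes), 1e-6) for w in workers}
            for o, v in votes.items():
                for w, lab in v.items():
                    conf[w][:, lab] += p[o]   # rows: true class, cols: assigned
            for w in workers:
                conf[w] /= conf[w].sum(axis=1, keepdims=True)
            # Step 3: re-estimate labels, weighting votes by worker quality
            for o, v in votes.items():
                logp = sum(np.log(conf[w][:, lab]) for w, lab in v.items())
                p[o] = np.exp(logp - logp.max())
                p[o] /= p[o].sum()
        return p, conf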

Challenge: From Confusion Matrices to Quality Scores
How can we check whether a worker is a spammer using the confusion matrix? (Hint: the error rate is not enough.)

Confusion matrix for a spammer worker:
P[X -> X] = 0.847%    P[X -> G] = 99.153%
P[G -> X] = 0.053%    P[G -> G] = 99.947%

Confusion matrix for a good worker:
P[X -> X] = 99.847%   P[X -> G] = 0.153%
P[G -> X] = 4.053%    P[G -> G] = 95.947%

Challenge 1: Spammers are lazy and smart!
Confusion matrix for a spammer:
P[X -> X] = 0%     P[X -> G] = 100%
P[G -> X] = 0%     P[G -> G] = 100%

Confusion matrix for a good worker:
P[X -> X] = 80%    P[X -> G] = 20%
P[G -> X] = 20%    P[G -> G] = 80%

Spammers figure out how to fly under the radar. In reality, we have 85% G sites and 15% X sites:
Error rate of spammer = 0% * 85% + 100% * 15% = 15%
Error rate of good worker = 20% * 85% + 20% * 15% = 20%
False negatives: spam workers pass as legitimate.
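A quick check of this arithmetic (a throwaway sketch; the dictionaries just restate the numbers above):

    # P(wrong label | true class), weighted by the class priors
    priors = {"G": 0.85, "X": 0.15}
    wrong_rate = {"spammer": {"G": 0.00, "X": 1.00},   # labels everything G
                  "good":    {"G": 0.20, "X": 0.20}}
    for who, w in wrong_rate.items():
        print(who, sum(priors[c] * w[c] for c in priors))
    # spammer 0.15, good 0.20 -- the spammer's error rate looks *better*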

Challenge 2: Humans are biased!
Error rates for a legitimate (but biased) employee:
P[G -> G] = 20.0%   P[G -> P] = 80.0%   P[G -> R] = 0.0%     P[G -> X] = 0.0%
P[P -> G] = 0.0%    P[P -> P] = 0.0%    P[P -> R] = 100.0%   P[P -> X] = 0.0%
P[R -> G] = 0.0%    P[R -> P] = 0.0%    P[R -> R] = 100.0%   P[R -> X] = 0.0%
P[X -> G] = 0.0%    P[X -> P] = 0.0%    P[X -> R] = 0.0%     P[X -> X] = 100.0%

With 85% G sites, 5% P sites, 5% R sites, and 5% X sites:
Error rate of spammer (labels everything G) = 0% * 85% + 100% * 15% = 15%
Error rate of biased worker = 80% * 85% + 100% * 5% = 73%
False positives: legitimate workers appear to be spammers.
(Important note: bias is not just a matter of ordered classes.)

Solution: Fix bias first, compute the error rate afterwards
Given the biased employee's error rates from the previous slide:
- When the biased worker says G, it is 100% G
- When the biased worker says P, it is 100% G
- When the biased worker says R, it is 50% P, 50% R
- When the biased worker says X, it is 100% X
There is a small ambiguity for R-rated votes, but other than that the labels are perfectly informative!

Solution: Fix bias first, compute the error rate afterwards
Error rates for a spammer (labels everything G):
P[G -> G] = 100.0%   P[G -> P] = 0.0%   P[G -> R] = 0.0%   P[G -> X] = 0.0%
P[P -> G] = 100.0%   P[P -> P] = 0.0%   P[P -> R] = 0.0%   P[P -> X] = 0.0%
P[R -> G] = 100.0%   P[R -> P] = 0.0%   P[R -> R] = 0.0%   P[R -> X] = 0.0%
P[X -> G] = 100.0%   P[X -> P] = 0.0%   P[X -> R] = 0.0%   P[X -> X] = 0.0%

- When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
- When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
- When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
- When the spammer says X, it is 25% G, 25% P, 25% R, 25% X
(Note: this assumes equal priors.)
The results are highly ambiguous: no information is provided!
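The "fix bias first" step is just Bayes' rule applied to the confusion matrix. A minimal sketch (the helper name and dictionary layout are mine), reproducing the biased worker's soft labels under the 85/5/5/5 priors from two slides back:

    def soft_label(assigned, confusion, priors):
        """P(true class | worker said `assigned`), via Bayes' rule.
        confusion[true][assigned] = P(assigned | true)."""
        joint = {t: priors[t] * confusion[t].get(assigned, 0.0) for t in priors}
        z = sum(joint.values())
        return {t: joint[t] / z for t in joint}

    priors = {"G": 0.85, "P": 0.05, "R": 0.05, "X": 0.05}
    biased = {"G": {"G": 0.2, "P": 0.8}, "P": {"R": 1.0},
              "R": {"R": 1.0}, "X": {"X": 1.0}}
    print(soft_label("P", biased, priors))  # G gets probability 1.0: a 'P' vote means G
    print(soft_label("R", biased, priors))  # P and R at 0.5 each, as on the slide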

Expected Misclassification Cost
(Assume a misclassification cost of 1; the solution generalizes to arbitrary misclassification costs across categories.)
High cost: probability spread across classes. Low cost: probability mass concentrated in one class.

Assigned Label       | Corresponding Soft Label       | Label Cost
Spammer says G       | 25% G, 25% P, 25% R, 25% X     | 0.75
Good worker says P   | ~100% P                        | ~0.0
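One way to make "cost of a soft label" concrete (this particular definition is my assumption, though it reproduces the 0.75 above): draw two labels independently from the soft label and charge the cost of confusing one for the other.

    def expected_cost(soft, cost=lambda a, b: 0.0 if a == b else 1.0):
        """Expected misclassification cost of a soft label (assumed definition)."""
        return sum(soft[a] * soft[b] * cost(a, b) for a in soft for b in soft)

    spammer = dict.fromkeys("GPRX", 0.25)
    print(expected_cost(spammer))       # 0.75, matching the slide
    print(expected_cost({"P": 1.0}))    # 0.0 for a fully concentrated label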

Quality Score: a scalar measure of quality
A spammer is a worker who assigns labels randomly, regardless of what the true class is. The quality score is useful for ranking workers:
- Unaffected by systematic biases
- Scalar, so there is no need to examine confusion matrices
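To turn expected cost into such a scalar, one natural normalization (my assumption, consistent with the slide's description of a spammer as a random labeler) is to compare each worker's expected cost against that of the random spammer:

    def quality_score(worker_cost, spammer_cost):
        # 1.0 for a perfect worker, 0.0 for a random spammer
        return 1.0 - worker_cost / spammer_cost

    print(quality_score(0.0, 0.75))    # 1.0: perfect worker
    print(quality_score(0.75, 0.75))   # 0.0: indistinguishable from a spammer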

Quality Score Challenges
Thresholding has the wrong incentive structure:
- Decent (but still useful) workers remain unused
- If you are above the threshold, there is no need to improve
Uncertainty: the quality score is not really a fixed number
- Fluctuations in payment are puzzling for workers
- It is best to have only increases in payment
Question: How should we pay workers?

Two Types of Workers
Divide workers into two groups:
- Qualified workers: their quality meets the target quality level
- Unqualified workers: their quality fails to meet the target quality level

A Simple Pricing Model for Qualified Workers
- p: the price paid to all qualified workers
- f_W(w): the probability density function (pdf) of the worker reservation wage
- F_W(w): the cumulative distribution function (cdf) of the worker reservation wage
- R: the fixed price paid by the external client for each qualified object

Example
f_W(w): LogNormal(3,1), selling price R = 50
[Figure: worker wage distribution f_W(w) and revenue as a function of the salary]
Optimal worker salary = 21
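A sketch of how this optimum can be found, assuming (my reading of the model) that the platform earns R per qualified object, pays p, and a worker participates iff her reservation wage is at most p, so expected profit is proportional to (R - p) * F_W(p):

    import math

    def lognorm_cdf(w, mu=3.0, sigma=1.0):
        """F_W(w) for LogNormal(3,1): fraction of workers whose wage is <= w."""
        return 0.5 * (1.0 + math.erf((math.log(w) - mu) / (sigma * math.sqrt(2.0))))

    def profit(p, R=50.0):
        return (R - p) * lognorm_cdf(p)   # margin times participation rate

    best = max((p / 10.0 for p in range(1, 500)), key=profit)
    print(best)   # lands near 21, the slide's optimal worker salary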

The Value of Unqualified Workers
[Figure: classification cost vs. number of workers]
Binary classification (1:1 class balance), given the worker's confusion matrix, and accepting a classification cost of at most 0.1: we need ~9 such workers to achieve the required quality.
Value of each worker: 1/9
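The slide's confusion matrix is not reproduced here, but the "~9 workers" figure is easy to illustrate. Assume (hypothetically) a symmetric worker who is correct 70% of the time, balanced classes, 0/1 cost, and majority voting:

    from math import comb

    def majority_error(n, q):
        """P(majority of n independent workers is wrong), each correct w.p. q, n odd."""
        return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n // 2 + 1))

    n = 1
    while majority_error(n, 0.70) > 0.1:
        n += 2                  # odd panels avoid ties
    print(n)                    # 9: nine such workers push expected cost below 0.1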

A Pricing Model for Workers
- p: the price paid to all qualified workers
- f_W(w): the pdf of the worker reservation wage
- F_W(w): the cdf of the worker reservation wage
- Adjust for the presence of unqualified workers: each unqualified worker counts as 1/k of a qualified one
- R: the fixed price paid by the external client for each qualified object

Optimal Prices
[Figure: optimal prices — p* for qualified workers, scaled down (p*/3, p*/9) for workers who count as 1/3 or 1/9 of a qualified worker]

Quality Score Challenges (recap)
Thresholding has the wrong incentive structure:
- Decent (but still useful) workers remain unused
- If you are above the threshold, there is no need to improve
Uncertainty: the quality score is not really a fixed number
- Fluctuations in payment are puzzling for workers
- It is best to have only increases in payment
Question: How should we pay workers?

Bayesian Estimates for Uncertainty
Worker A (few observations):
P[0 -> 0] = Beta(2,1)     P[0 -> 1] = Beta(1,2)
P[1 -> 0] = Beta(1,2)     P[1 -> 1] = Beta(2,1)
Worker B (many observations):
P[0 -> 0] = Beta(101,1)   P[0 -> 1] = Beta(1,101)
P[1 -> 0] = Beta(1,101)   P[1 -> 1] = Beta(101,1)
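These Beta posteriors quantify how uncertain each error-rate estimate still is. A sketch using SciPy (the choice of a 95% interval is mine):

    from scipy.stats import beta

    # Worker A has ~1 observation per cell; Worker B has ~100.
    for name, (a, b) in {"A": (2, 1), "B": (101, 1)}.items():
        lo, hi = beta.interval(0.95, a, b)   # 95% credible interval for P[0 -> 0]
        print(f"Worker {name}: mean={a / (a + b):.3f}, interval=({lo:.3f}, {hi:.3f})")
    # Worker A's estimate is wide open (~0.16 to ~0.99); Worker B's is pinned near 1.0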

Real-Time Payment and Reimbursement
Example of the piece-rate payment of a worker:
[Figure, built up over several slides: piece-rate payment (in cents) on the y-axis vs. number of tasks (10, 20, 30, 40, ..., infinity) on the x-axis, showing the fair-payment level, the worker's piece-rate payment, the potential bonus, and payment/reimbursement adjustments after each batch of tasks]

Synthetic Experiment Setup
- N = 10,000 tasks
- R = 200 cents
- Required classification cost <= 0.01
Labeling process (workers):
- Arrival frequency: 10 workers arrive every 600 seconds
- Submission speed: 30 seconds per task
The evaluation criterion is unit-time profit (total profit divided by total elapsed time).

(Synthetic) Experimental Setup
[Figure: scatter plots of worker confusion-matrix quality vs. reservation wage]
Two settings: quality level and reservation wage independently distributed, and quality level and reservation wage positively correlated.

Experimental Results
[Figure: average profit per second]

Workers reacting to bad rewards/scores
Score-based feedback leads to strange interactions:
- The angry, has-been-burnt-too-many-times worker: "F*** YOU! I am doing everything correctly and you know it! Stop trying to reject me with your stupid scores!"
- The overachiever worker: "What am I doing wrong?? My score is 92% and I want to have 100%."

An unexpected connection...
National Academy of Sciences, Frontiers of Science conference, Dec 2010:
"Your workers behave like my mice!"

"Your workers behave like my mice!"
"Eh?"

"Your workers want to use only their motor skills, not their cognitive skills."

The Biology Fundamentals
- Brain functions are biologically expensive (20% of total energy consumption in humans)
- Motor skills are more energy-efficient than cognitive skills (e.g., walking)
- The brain tends to delegate easy tasks to the part of the neural system that handles motor skills

An unexpected connection at the NAS Frontiers of Science conference
"Your workers want to use only their motor skills, not their cognitive skills."
"Makes sense."

An unexpected connection at the NAS Frontiers of Science conference
"And here is how I train my mice to behave..."

The Mice Experiment
- Cognitive task: solve the maze, find the pellet
- Motor task: push the lever three times, a pellet drops

How to Train the Mice?
Confuse the motor skills! Reward cognition!
"I should try this the moment I get back to my room."

Punishing Workers' Motor Skills
Punish bad answers with frustration of the motor skills (e.g., add delays between tasks):
- "Loading image, please wait..."
- "Image did not load, press here to reload"
- "404 error. Return the HIT and accept again"
Make this probabilistic to keep the feedback implicit, as in the sketch below.
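As a rough illustration of "probabilistic, implicit feedback" (the probabilities, delay length, and function name here are invented for the sketch, not taken from the talk):

    import random
    import time

    SNAGS = ["Loading image, please wait...",
             "Image did not load, press here to reload",
             "404 error. Return the HIT and accept again"]

    def maybe_frustrate(p_bad_answer, p_snag=0.5, delay_s=5.0):
        """After a likely-bad answer, sometimes inject a motor-skill snag.
        Randomness keeps the feedback implicit rather than an explicit score."""
        if random.random() < p_bad_answer * p_snag:
            print(random.choice(SNAGS))
            time.sleep(delay_s)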

Rewarding (?) Cognitive Effort
Reward good answers by rewarding the cognitive part of the brain:
- Introduce variety
- Introduce novelty
- Give new tasks fast
- Show score improvements faster (but not the opposite)
- Show optimistic score estimates


Experiments
- Web page classification
- Image tagging & URL collection

Experimental Summary (I)
Spammer workers quickly abandon:
- No need to display scores, or to ban
- Low-quality submissions dropped from ~60% to ~3%
- Half-life of low-quality workers dropped from 100+ HITs to fewer than 5
Good workers are unaffected:
- No significant effect on the participation of workers with good performance
- Lifetime of participants unaffected
- Longer response times (even after removing the intervention delays; that was puzzling)

Experimental Summary (II)
Remember, the scheme was designed for training the mice...
15%-20% of the spammers start submitting good work! ????

Two key questions
- Why were response times slower for some good workers?
- Why did some low-quality workers start working well? ????

System 1: Automatic actions
System 2: Intelligent actions

System 1 Tasks

System 2 Tasks

[Diagram: managing workers across the two systems]
- Worker in System 1 (automatic) performing well? Leave them be.
- Not performing well? Disrupt, and engage System 2.
- Worker in System 2 (intelligent) performing well? Check if System 1 can handle the task, and remove the System 2 stimuli.
- Not performing well? Hell / slow-ban them out.
A hedged reading of this policy as a small state machine is sketched below.
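This sketch is my own interpretation of the closing diagram (state and function names are invented):

    from enum import Enum

    class Mode(Enum):
        SYSTEM1 = "automatic"     # worker on autopilot
        SYSTEM2 = "intelligent"   # worker cognitively engaged
        BANNED = "slow-banned"

    def next_mode(mode, performing_well):
        if mode is Mode.SYSTEM1:
            # good autopilot work is fine; bad work gets disrupted into System 2
            return Mode.SYSTEM1 if performing_well else Mode.SYSTEM2
        if mode is Mode.SYSTEM2:
            # good engaged work: see if System 1 can take over again; bad: slow ban
            return Mode.SYSTEM1 if performing_well else Mode.BANNED
        return Mode.BANNED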

Thanks! Q & A?