Crowdsourcing using Mechanical Turk: Quality Management and Scalability. Panos Ipeirotis – New York University.


“A Computer Scientist in a Business School” – Panos Ipeirotis, Introduction. New York University, Stern School of Business.


Example: Build an "Adult Web Site" Classifier. Need a large number of hand-labeled sites. Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn). Cost/speed statistics: undergrad intern: 200 websites/hr, cost: $15/hr.

Amazon Mechanical Turk: Paid Crowdsourcing

Example: Build an "Adult Web Site" Classifier. Need a large number of hand-labeled sites. Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn). Cost/speed statistics: undergrad intern: 200 websites/hr, cost: $15/hr; MTurk: 2500 websites/hr, cost: $12/hr.

Bad news: Spammers! Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Improve Data Quality through Repeated Labeling. Get multiple, redundant labels from multiple workers and pick the final label by majority vote. The probability of correctness increases with the number of workers and with the quality of the workers: a single worker is 70% correct, while a majority vote of 11 workers is 93% correct.
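This redundancy effect is easy to verify with a short calculation. Below is a minimal sketch, assuming binary labels and independent workers of equal accuracy (the multi-class case on the slide behaves similarly):

```python
# Probability that a majority vote over n independent workers is correct,
# assuming each worker is correct with probability p and n is odd (no ties).
from math import comb

def majority_correct(p, n):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(round(majority_correct(0.70, 1), 2))   # 0.70  -> single worker
print(round(majority_correct(0.70, 11), 2))  # ~0.92 -> close to the slide's 93%
print(round(majority_correct(0.80, 5), 2))   # ~0.94 -> once spammers are removed (later slide)
```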

But Majority Voting is Expensive. Single-vote statistics: MTurk: 2500 websites/hr, cost: $12/hr; undergrad: 200 websites/hr, cost: $15/hr. 11-vote statistics: MTurk: 227 websites/hr, cost: $12/hr; undergrad: 200 websites/hr, cost: $15/hr.
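The throughput drop follows directly from the redundancy: with k votes per site, the hourly rate divides by k while the hourly cost stays the same, so cost per labeled site grows linearly in k. A quick sketch of the arithmetic, using the slide's figures:

```python
def effective_throughput(single_vote_rate, votes_per_item):
    """Items fully labeled per hour when each item needs `votes_per_item` votes."""
    return single_vote_rate / votes_per_item

mturk_rate, mturk_cost_per_hr = 2500, 12.0
print(round(effective_throughput(mturk_rate, 11)))                         # ~227 sites/hr, as on the slide
print(round(mturk_cost_per_hr / effective_throughput(mturk_rate, 11), 3))  # ~$0.053 per labeled site
print(round(mturk_cost_per_hr / mturk_rate, 4))                            # ~$0.0048 per site with one vote
```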

Using redundant votes, we can infer worker quality. Look at our spammer friend ATAMRO447HWJQ together with the other 9 workers: ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer. We can compute error rates for each worker. Error rates for ATAMRO447HWJQ: P[X → X] = 9.847%, P[X → G] = 90.153%, P[G → X] = 0.053%, P[G → G] = 99.947%.
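A minimal sketch of the idea (my own simplification, not the exact algorithm from the talk, which uses an EM approach in the spirit of Dawid and Skene): approximate each worker's confusion matrix by comparing their answers against the majority vote of the other workers on the same item.

```python
from collections import Counter, defaultdict

def estimate_error_rates(labels, classes):
    """labels: {item: {worker: given_label}}.
    Returns {worker: {(true, given): rate}}, where `true` is approximated by the
    majority vote of the other workers on the same item."""
    counts = defaultdict(Counter)
    for votes in labels.values():
        for worker, given in votes.items():
            others = [l for w, l in votes.items() if w != worker]
            if not others:
                continue
            inferred_true = Counter(others).most_common(1)[0][0]
            counts[worker][(inferred_true, given)] += 1
    rates = {}
    for worker, c in counts.items():
        rates[worker] = {}
        for t in classes:
            total = sum(c[(t, g)] for g in classes)
            for g in classes:
                rates[worker][(t, g)] = c[(t, g)] / total if total else 0.0
    return rates

# Toy run: a worker who always answers G stands out immediately.
data = {
    "site1": {"spammer": "G", "w1": "X", "w2": "X", "w3": "X"},
    "site2": {"spammer": "G", "w1": "G", "w2": "G", "w3": "G"},
}
print(estimate_error_rates(data, ["G", "X"])["spammer"])
# {('G', 'G'): 1.0, ('G', 'X'): 0.0, ('X', 'G'): 1.0, ('X', 'X'): 0.0}
```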

Rejecting Spammers, and the Benefits. Random answers have an error rate of 50%; the average error rate for ATAMRO447HWJQ is 45.2% (P[X → X] = 9.847%, P[X → G] = 90.153%, P[G → X] = 0.053%, P[G → G] = 99.947%). Action: REJECT and BLOCK. Results: over time you block all spammers; spammers learn to avoid your HITs; you can decrease redundancy, as the quality of the remaining workers is higher.
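A minimal sketch of the rejection rule: compare a worker's prior-weighted average error rate to that of random guessing. The equal class priors and the 0.8 cutoff below are my own illustrative assumptions, which is why the result is 45.1% rather than the slide's 45.2%.

```python
def average_error_rate(error_rates, priors):
    """error_rates: {(true, given): rate}; priors: {class: P(class)}."""
    return sum(priors[t] * r for (t, g), r in error_rates.items() if t != g)

atamro = {("X", "X"): 0.09847, ("X", "G"): 0.90153,
          ("G", "X"): 0.00053, ("G", "G"): 0.99947}
priors = {"X": 0.5, "G": 0.5}      # assumed equal class priors, for illustration
avg = average_error_rate(atamro, priors)
random_guessing = 0.5              # two classes
print(f"average error {avg:.1%} -> spammer: {avg > 0.8 * random_guessing}")
```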

After rejecting spammers, quality goes up. Spam keeps quality down; without spam, workers are of higher quality, so we need less redundancy for the same quality and get the same quality of results at lower cost. With spam: 1 worker is 70% correct, 11 workers are 93% correct. Without spam: 1 worker is 80% correct, 5 workers are 94% correct.

Correcting biases. Classifying sites as G, PG, R, X: sometimes workers are careful but biased, classifying G → P and P → R. The average error rate for ATLJIK76YH1TF is too high. Is she a spammer? Error rates for the CEO of AdSafe:
P[G → G] = 20.0%   P[G → P] = 80.0%   P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%    P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%    P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%    P[X → R] = 0.0%     P[X → X] = 100.0%

Correcting biases. For ATLJIK76YH1TF, we simply need to "reverse the errors" (technical details omitted) and separate error from bias. True error rate: ~9%. Error rates for worker ATLJIK76YH1TF:
P[G → G] = 20.0%   P[G → P] = 80.0%   P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%    P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%    P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%    P[X → R] = 0.0%     P[X → X] = 100.0%
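A minimal sketch of what "reversing the errors" amounts to, under the simplifying assumptions of known class priors and a single reported label (the actual method combines the labels of several workers): apply Bayes' rule to the worker's confusion matrix to recover a posterior over the true class.

```python
CLASSES = ["G", "P", "R", "X"]

def posterior_over_true(given, confusion, priors):
    """confusion[true][given] = P(worker reports `given` | true class)."""
    unnorm = {t: priors[t] * confusion[t][given] for t in CLASSES}
    z = sum(unnorm.values())
    return {t: v / z for t, v in unnorm.items()} if z else unnorm

# The biased worker's error rates from the slide (G mostly reported as P, P reported as R).
biased = {
    "G": {"G": 0.20, "P": 0.80, "R": 0.00, "X": 0.00},
    "P": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
    "R": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
    "X": {"G": 0.00, "P": 0.00, "R": 0.00, "X": 1.00},
}
priors = {c: 0.25 for c in CLASSES}                # assumed uniform priors
print(posterior_over_true("P", biased, priors))    # a reported "P" maps back to a true "G"
```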

Too much theory? Demo and open-source implementation available at:  Input: labels from Mechanical Turk; cost of incorrect labelings (e.g., X → G is costlier than G → X). Output: corrected labels; worker error rates; a ranking of workers according to their quality. Beta version, more improvements to come! Suggestions and collaborations welcome!
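Why the cost matrix matters as an input: here is a minimal sketch (my own illustration, not the tool's code) of cost-sensitive label correction, where the corrected label minimizes expected misclassification cost rather than simply taking the most probable class.

```python
def min_cost_label(posterior, cost):
    """posterior: {true_class: prob}; cost[true][assigned] = misclassification cost."""
    return min(posterior,
               key=lambda assigned: sum(posterior[t] * cost[t][assigned] for t in posterior))

posterior = {"G": 0.7, "X": 0.3}            # illustrative posterior over the true label
cost = {"G": {"G": 0, "X": 1},              # calling a G site X: mildly costly
        "X": {"G": 10, "X": 0}}             # calling an X site G: very costly
print(min_cost_label(posterior, cost))      # "X", even though G is the more likely class
```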

Scaling Crowdsourcing: Use Machine Learning. Human labor is expensive, even when paying cents, so we need to scale crowdsourcing. Basic idea: build a machine learning model from the existing crowdsourced answers and use it instead of humans. (Diagram: data from existing crowdsourced answers → automatic model, built through machine learning; new case → automatic answer.)
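A minimal sketch of that basic idea, assuming scikit-learn and a toy text-classification setup (a real adult-site classifier would use far more data and richer features):

```python
# Train a classifier on labels already collected from the crowd, then use it
# to answer new cases instead of paying workers again.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

crowd_texts  = ["family recipes and crafts", "explicit adult content xxx",
                "kids games and cartoons",  "adult videos explicit"]
crowd_labels = ["G", "X", "G", "X"]          # aggregated crowdsourced answers

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(crowd_texts, crowd_labels)

print(model.predict(["cartoons and games for kids"]))   # automatic answer, no human needed
```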

Tradeoffs for Automatic Models: Effect of Noise. Getting more data improves model accuracy; improving data quality improves classification. Example case: porn or not? (Plot: learning curves for data quality of 50%, 60%, 80%, and 100%.)
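An illustrative simulation of that tradeoff (my own sketch, not the experiment behind the slide): flip a fraction of the training labels to mimic a given data quality and watch the test accuracy of a simple model degrade.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # clean ground truth
X_test, y_test = X[1000:], y[1000:]

for quality in [0.5, 0.6, 0.8, 1.0]:
    y_train = y[:1000].copy()
    flip = rng.random(1000) > quality                # label is wrong with prob (1 - quality)
    y_train[flip] = 1 - y_train[flip]
    acc = LogisticRegression().fit(X[:1000], y_train).score(X_test, y_test)
    print(f"data quality {quality:.0%}: test accuracy {acc:.2f}")
```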

Scaling Crowdsourcing: Iterative Training. Use the machine when it is confident, humans otherwise. Retrain with the new human input → improve the model → reduce the need for humans. (Diagram: new case → automatic model, trained on data from existing crowdsourced answers; if confident → automatic answer; if not confident → get human(s) to answer.)
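A minimal sketch of this routing loop (function and parameter names are my own; it reuses the scikit-learn `model` pipeline from the earlier sketch and assumes `ask_humans` posts the case to the crowd and returns a label):

```python
CONFIDENCE_THRESHOLD = 0.9       # assumed value; tune per application

def answer(case, model, training_data, ask_humans):
    """Answer automatically when confident; otherwise ask humans and retrain."""
    proba = model.predict_proba([case])[0]
    if proba.max() >= CONFIDENCE_THRESHOLD:
        return model.classes_[proba.argmax()]           # automatic answer
    label = ask_humans(case)                            # fall back to the crowd
    training_data.append((case, label))
    texts, labels = zip(*training_data)
    model.fit(list(texts), list(labels))                # retrain: fewer human calls over time
    return label
```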

Scaling Crowdsourcing: Iterative Training, with Noise. Use the machine when it is confident, humans otherwise, and ask as many humans as necessary to ensure quality. (Diagram: new case → automatic model, trained on data from existing crowdsourced answers; if confident → automatic answer; if not confident for quality → get human(s) to answer, repeatedly, until confident about quality.)
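A minimal sketch of "ask as many humans as necessary" (my own simplification: independent workers with a known, shared accuracy and a fixed confidence threshold): keep collecting votes and updating a posterior over the label until one class is sufficiently likely.

```python
def label_until_confident(get_vote, classes, worker_accuracy=0.8,
                          threshold=0.99, max_votes=15):
    """Keep asking workers for votes until one class is sufficiently likely."""
    posterior = {c: 1.0 / len(classes) for c in classes}        # uniform prior
    best = max(posterior, key=posterior.get)
    for n in range(1, max_votes + 1):
        vote = get_vote()                                        # one more human label
        for c in classes:
            lik = worker_accuracy if c == vote else (1 - worker_accuracy) / (len(classes) - 1)
            posterior[c] *= lik
        z = sum(posterior.values())
        posterior = {c: p / z for c, p in posterior.items()}
        best = max(posterior, key=posterior.get)
        if posterior[best] >= threshold:
            break
    return best, posterior[best], n

# Toy run: workers who all answer "X" make the label confident after a few votes.
print(label_until_confident(lambda: "X", ["G", "PG", "R", "X"]))
```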

Thank you! Questions? “A Computer Scientist in a Business School”