Quality Management on Amazon Mechanical Turk
Panos Ipeirotis, Foster Provost, Jing Wang (New York University)

Example: Build an Adult Web Site Classifier
Need a large number of hand-labeled sites. Get people to look at sites and classify them as: G (general audience) or X (restricted/porn).
Cost/speed statistics:
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2,500 websites/hr, cost: $12/hr

Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Using redundant votes, we can infer worker quality
Look at our spammer friend ATAMRO447HWJQ together with the 9 other workers who labeled the same sites. Using this redundancy, we can compute error rates for each worker.

Repeated Labeling, EM, and Confusion Matrices
Use the Dawid & Skene (1979) algorithm to estimate worker quality and the correct labels:
– Initialize the error rate for each worker
– Estimate the most likely label for each object (e.g., majority vote)
– Using the labels, revise the error rates for the workers
– Using the error rates, recompute the labels (weighted combination, according to quality)
– Iterate until convergence
Our friend ATAMRO447HWJQ marked all sites as G. Seems like a spammer…
Error rates for ATAMRO447HWJQ:
P[X→X]=0.847%   P[X→G]=99.153%
P[G→X]=0.053%   P[G→G]=99.947%
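The iterative procedure above is essentially the Dawid & Skene EM algorithm. A minimal sketch in Python, not the authors' implementation: majority-vote initialization, add-one smoothing, and a uniform class prior are simplifying assumptions, and the worker names and toy data are illustrative.

```python
from collections import defaultdict

def dawid_skene(labels, classes=("G", "X"), iters=20):
    """Simplified Dawid & Skene EM: estimate per-worker confusion
    matrices and per-object class posteriors from redundant labels.
    `labels` is a list of (worker, object, label) tuples."""
    objs = {o for _, o, _ in labels}
    # Initialization: majority vote gives each object a soft label.
    post = {}
    for o in objs:
        votes = defaultdict(int)
        for _, obj, l in labels:
            if obj == o:
                votes[l] += 1
        total = sum(votes.values())
        post[o] = {c: votes[c] / total for c in classes}
    for _ in range(iters):
        # M-step: re-estimate each worker's confusion matrix
        # conf[w][true][given], with add-one smoothing.
        conf = defaultdict(lambda: {t: {g: 1.0 for g in classes}
                                    for t in classes})
        for w, o, l in labels:
            for t in classes:
                conf[w][t][l] += post[o][t]
        for w in conf:
            for t in classes:
                z = sum(conf[w][t].values())
                for g in classes:
                    conf[w][t][g] /= z
        # E-step: recompute object posteriors from the worker matrices
        # (uniform class prior, for simplicity).
        for o in objs:
            p = {c: 1.0 for c in classes}
            for w, obj, l in labels:
                if obj == o:
                    for c in classes:
                        p[c] *= conf[w][c][l]
            z = sum(p.values())
            post[o] = {c: p[c] / z for c in classes}
    return post, conf

# Toy data: three reliable workers plus an always-G "spammer".
votes = [("w1", "o1", "G"), ("w2", "o1", "G"), ("w3", "o1", "G"),
         ("spam", "o1", "G"),
         ("w1", "o2", "X"), ("w2", "o2", "X"), ("w3", "o2", "X"),
         ("spam", "o2", "G")]
post, conf = dawid_skene(votes)
```

After convergence, `post` carries the corrected soft labels and `conf["spam"]` exposes the spammer's tendency to answer G even when the true class is X.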

Challenge: From Confusion Matrices to Spam Scores
The Dawid & Skene algorithm generates confusion matrices for the workers. How can we tell whether a worker is a spammer from the confusion matrix?

Challenge 1: Spammers put everything in the majority class
Confusion matrix for spammer:
P[X→X]=0%   P[X→G]=100%
P[G→X]=0%   P[G→G]=100%
Confusion matrix for good worker:
P[X→X]=80%   P[X→G]=20%
P[G→X]=20%   P[G→G]=80%
In reality, we have 85% G sites and 15% X sites.
Error rate of spammer = 0% · 85% + 100% · 15% = 15%
Error rate of good worker = 20% · 85% + 20% · 15% = 20%
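The arithmetic above can be written as one line: weight each true class's prior probability by the chance the worker answers anything else. A small sketch (the function name and dictionary layout are illustrative):

```python
def expected_error(conf, prior):
    """Expected error rate of a worker: sum over true classes t of
    P(t) times the probability the worker answers something != t."""
    return sum(prior[t] * sum(p for g, p in conf[t].items() if g != t)
               for t in conf)

prior = {"G": 0.85, "X": 0.15}            # class mix from the slide
spammer = {"G": {"G": 1.0, "X": 0.0},     # always answers G
           "X": {"G": 1.0, "X": 0.0}}
good = {"G": {"G": 0.8, "X": 0.2},        # 80% correct either way
        "X": {"G": 0.2, "X": 0.8}}
expected_error(spammer, prior)  # ≈ 0.15: the spammer "beats"...
expected_error(good, prior)     # ≈ 0.20: ...the genuinely good worker
```

This is exactly the pathology of Challenge 1: raw error rate ranks the spammer above the honest worker.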

Challenge 2: Some legitimate workers are biased
Error rates for worker ATLJIK76YH1TF:
P[G→G]=20.0%   P[G→P]=80.0%   P[G→R]=0.0%    P[G→X]=0.0%
P[P→G]=0.0%    P[P→P]=0.0%    P[P→R]=100.0%  P[P→X]=0.0%
P[R→G]=0.0%    P[R→P]=0.0%    P[R→R]=100.0%  P[R→X]=0.0%
P[X→G]=0.0%    P[X→P]=0.0%    P[X→R]=0.0%    P[X→X]=100.0%
In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites.
Error rate of spammer (all in G) = 0% · 85% + 100% · 15% = 15%
Error rate of biased worker = 80% · 85% + 100% · 5% = 73%

Solution: Reverse the errors first, then compute the error rate
When the biased worker says G, it is 100% G
When the biased worker says P, it is 100% G
When the biased worker says R, it is 50% P, 50% R
When the biased worker says X, it is 100% X
Small ambiguity for the R votes, but other than that, fine!
Error rates for biased worker ATLJIK76YH1TF:
P[G→G]=20.0%   P[G→P]=80.0%   P[G→R]=0.0%    P[G→X]=0.0%
P[P→G]=0.0%    P[P→P]=0.0%    P[P→R]=100.0%  P[P→X]=0.0%
P[R→G]=0.0%    P[R→P]=0.0%    P[R→R]=100.0%  P[R→X]=0.0%
P[X→G]=0.0%    P[X→P]=0.0%    P[X→R]=0.0%    P[X→X]=100.0%

Solution: Reverse the errors first, then compute the error rate
When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
When the spammer says X, it is 25% G, 25% P, 25% R, 25% X
The results are highly ambiguous. No information provided!
Error rates for spammer ATAMRO447HWJQ:
P[G→G]=100.0%  P[G→P]=0.0%  P[G→R]=0.0%  P[G→X]=0.0%
P[P→G]=100.0%  P[P→P]=0.0%  P[P→R]=0.0%  P[P→X]=0.0%
P[R→G]=100.0%  P[R→P]=0.0%  P[R→R]=0.0%  P[R→X]=0.0%
P[X→G]=100.0%  P[X→P]=0.0%  P[X→R]=0.0%  P[X→X]=0.0%
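The "reversal" is Bayes' rule applied to a single worker's vote: the posterior over true classes is proportional to the prior times the worker's confusion-matrix entry. A sketch, assuming a uniform prior over the four classes, which is what the 25% figures on these slides imply (function name and dictionary layout are illustrative):

```python
def reverse_label(conf, prior, said):
    """P(true class | worker said `said`), by Bayes' rule:
    posterior[t] is proportional to prior[t] * P[t -> said]."""
    p = {t: prior[t] * conf[t].get(said, 0.0) for t in prior}
    z = sum(p.values())
    # A worker whose votes carry no likelihood mass leaves the prior.
    return {t: v / z for t, v in p.items()} if z else dict(prior)

CLASSES = ("G", "P", "R", "X")
uniform = {c: 0.25 for c in CLASSES}

# The biased worker from the slide: G -> P 80% of the time, P -> R always.
biased = {"G": {"G": 0.2, "P": 0.8}, "P": {"R": 1.0},
          "R": {"R": 1.0}, "X": {"X": 1.0}}
# The spammer: every true class gets labeled G.
spammer = {c: {"G": 1.0} for c in CLASSES}

reverse_label(biased, uniform, "P")   # all mass on G: a P vote means G
reverse_label(biased, uniform, "R")   # 50% P, 50% R: small ambiguity
reverse_label(spammer, uniform, "G")  # 25% each: no information
```

The biased worker's votes reverse into near-certain posteriors, while the spammer's reverse into the prior itself, which is exactly why reversing before scoring separates the two.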

Computing the Cost of Soft Labels
Posterior soft labels with the probability mass concentrated in a single class are good. Posterior soft labels with the probability mass spread across classes are bad.
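One way to make this concrete is the expected misclassification cost of a soft label: assign the class that minimizes expected cost, and report that minimum. Concentrated mass costs near zero; spread mass is expensive. A minimal sketch with a 0/1 cost matrix (the names and the cost values are illustrative, not the tool's configuration):

```python
def soft_label_cost(soft, costs):
    """Expected cost of a posterior soft label: pick the assigned
    class c minimizing sum over true classes t of
    soft[t] * costs[t][c], where costs[true][assigned] is the
    penalty for that particular mistake."""
    return min(sum(soft[t] * costs[t][c] for t in soft) for c in costs)

zero_one = {"G": {"G": 0, "X": 1},   # symmetric 0/1 costs
            "X": {"G": 1, "X": 0}}
soft_label_cost({"G": 1.0, "X": 0.0}, zero_one)  # 0.0: mass in one class
soft_label_cost({"G": 0.5, "X": 0.5}, zero_one)  # 0.5: spread mass is costly
```

An asymmetric matrix (e.g., a larger penalty in `costs["X"]["G"]`) would capture the point made later that letting porn through as G is costlier than the reverse.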

Experimental Results
500 web pages in G, P, R, X; 100 workers per page; 339 workers in total. Lots of noise!
95% accuracy with majority vote.
Naïve spam detection: 1% of labels dropped, accuracy 95%.
Our spam detection: 30% of labels dropped, accuracy 99.8%.

Code and Demo Available
Demo and open-source implementation available at:
Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X→G costlier than G→X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality

Overflow Slides

Label Quality vs #labels per example

Worker quality estimation error vs #labels per worker