Crowdsourcing using Mechanical Turk: Quality Management and Scalability
Panos Ipeirotis, New York University
Joint work with Jing Wang, Foster Provost, and Victor Sheng


A few of the tasks in the past
Detect pages that discuss swine flu
– A pharmaceutical firm had a drug "treating" (off-label) swine flu
– The FDA prohibited the pharmaceutical company from displaying the drug's ad on pages about swine flu
– Two days to build and go live
A big fast-food chain does not want its ad to appear:
– On pages that discuss the brand (99% negative sentiment)
– On pages discussing obesity
– Three days to build and go live

Need to build models fast
Using Mechanical Turk:
To find URLs for a given topic (hate speech, gambling, alcohol abuse, guns, bombs, celebrity gossip, etc.)
To classify URLs into appropriate categories

Example: Build an "Adult Web Site" Classifier
Need a large number of hand-labeled sites
Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)
Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2,500 websites/hr, cost: $12/hr


Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)

Redundant votes, infer quality
Look at our spammer friend ATAMRO447HWJQ together with the other nine workers
Using redundancy, we can compute error rates for each worker

Iterative process to estimate worker error rates
Algorithm of Dawid & Skene (1979) [and many recent variations on the same theme]:
1. Initialize the "correct" label for each object (e.g., using majority vote)
2. Estimate error rates for the workers (using the "correct" labels)
3. Estimate the "correct" labels (using the error rates; weight worker votes according to quality)
4. Go to Step 2 and iterate until convergence
Our friend ATAMRO447HWJQ marked almost all sites as G. Seems like a spammer…
Error rates for ATAMRO447HWJQ:
P[G → G]=99.947%   P[G → X]=0.053%
P[X → G]=99.153%   P[X → X]=0.847%
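A minimal Python sketch of this iteration for binary labels may help make it concrete. The votes format, the smoothing constant, and the fixed iteration count are illustrative assumptions, not the original implementation:

import numpy as np
from collections import Counter

def dawid_skene(votes, n_iter=50):
    """votes: dict mapping worker -> {object: label in {0, 1}}."""
    objects = sorted({o for wv in votes.values() for o in wv})
    # Step 1: initialize soft "correct" labels by majority vote.
    labels = {}
    for o in objects:
        c = Counter(wv[o] for wv in votes.values() if o in wv)
        total = sum(c.values())
        labels[o] = np.array([c.get(0, 0) / total, c.get(1, 0) / total])
    for _ in range(n_iter):
        # Step 2: estimate each worker's confusion matrix from current labels.
        conf = {}
        for w, wv in votes.items():
            m = np.full((2, 2), 1e-6)     # small smoothing avoids zero rows
            for o, given in wv.items():
                m[:, given] += labels[o]  # credit soft mass to (true, given)
            conf[w] = m / m.sum(axis=1, keepdims=True)
        # Class priors from the current soft labels.
        prior = sum(labels.values())
        prior = prior / prior.sum()
        # Step 3: re-estimate labels, weighting votes by worker quality.
        for o in objects:
            post = prior.copy()
            for w, wv in votes.items():
                if o in wv:
                    post = post * conf[w][:, wv[o]]
            labels[o] = post / post.sum()
    return labels, conf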

Challenge: From Confusion Matrices to Quality Scores
The algorithm generates "confusion matrices" for the workers
How to check if a worker is a spammer using the confusion matrix? (hint: the error rate alone is not enough)
Error rates for ATAMRO447HWJQ:
P[X → X]=0.847%   P[X → G]=99.153%
P[G → X]=0.053%   P[G → G]=99.947%

Challenge 1: Spammers are lazy and smart!
Confusion matrix for a spammer:
P[X → X]=0%    P[X → G]=100%
P[G → X]=0%    P[G → G]=100%
Confusion matrix for a good worker:
P[X → X]=80%   P[X → G]=20%
P[G → X]=20%   P[G → G]=80%
Spammers figure out how to fly under the radar…
In reality, we have 85% G sites and 15% X sites:
Error rate of the spammer = 0% * 85% + 100% * 15% = 15%
Error rate of the good worker = 20% * 85% + 20% * 15% = 20%
False negatives: spam workers pass as legitimate
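A quick sanity check of the arithmetic above: the plain error rate is the prior-weighted sum of the off-diagonal confusion-matrix entries, which is exactly why it ranks the spammer above the careful worker. A small sketch, with the numbers mirroring the slide:

priors = {"G": 0.85, "X": 0.15}
# Confusion entries as P[true -> given].
spammer = {("X", "X"): 0.0, ("X", "G"): 1.0, ("G", "X"): 0.0, ("G", "G"): 1.0}
good    = {("X", "X"): 0.8, ("X", "G"): 0.2, ("G", "X"): 0.2, ("G", "G"): 0.8}

def error_rate(conf, priors):
    # Prior-weighted probability of assigning the wrong class.
    return sum(priors[t] * p for (t, g), p in conf.items() if t != g)

print(error_rate(spammer, priors))  # 0.15 -> the spammer "beats"...
print(error_rate(good, priors))     # 0.20 -> ...the legitimate worker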

Challenge 2: Humans are biased!
Error rates for the CEO of AdSafe:
P[G → G]=20.0%   P[G → P]=80.0%   P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%    P[P → P]=0.0%    P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%    P[R → P]=0.0%    P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%    P[X → P]=0.0%    P[X → R]=0.0%    P[X → X]=100.0%
In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites
Error rate of the spammer (marking everything G) = 0% * 85% + 100% * 15% = 15%
Error rate of the biased worker = 80% * 85% + 100% * 5% = 73%
False positives: legitimate workers appear to be spammers

Solution: Reverse errors first, compute error rate afterwards
Error rates for the CEO of AdSafe:
P[G → G]=20.0%   P[G → P]=80.0%   P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%    P[P → P]=0.0%    P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%    P[R → P]=0.0%    P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%    P[X → P]=0.0%    P[X → R]=0.0%    P[X → X]=100.0%
When the biased worker says G, it is 100% G
When the biased worker says P, it is 100% G
When the biased worker says R, it is 50% P, 50% R
When the biased worker says X, it is 100% X
Small ambiguity for the "R-rated" votes, but other than that, fine!

Solution: Reverse errors first, compute error rate afterwards
Error rates for spammer ATAMRO447HWJQ:
P[G → G]=100.0%  P[G → P]=0.0%  P[G → R]=0.0%  P[G → X]=0.0%
P[P → G]=100.0%  P[P → P]=0.0%  P[P → R]=0.0%  P[P → X]=0.0%
P[R → G]=100.0%  P[R → P]=0.0%  P[R → R]=0.0%  P[R → X]=0.0%
P[X → G]=100.0%  P[X → P]=0.0%  P[X → R]=0.0%  P[X → X]=0.0%
When the spammer says G, it is 25% G, 25% P, 25% R, 25% X
When the spammer says P, it is 25% G, 25% P, 25% R, 25% X
When the spammer says R, it is 25% G, 25% P, 25% R, 25% X
When the spammer says X, it is 25% G, 25% P, 25% R, 25% X
[note: assumes equal priors]
The results are highly ambiguous. No information provided!
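A sketch of the "reverse the errors" step: Bayes' rule converts a worker's vote into a posterior ("soft") label over the true classes, given that worker's confusion matrix and the class priors. The array layout below is an assumption (rows are true classes, columns are assigned classes):

import numpy as np

classes = ["G", "P", "R", "X"]

def soft_label(conf, vote, priors):
    j = classes.index(vote)
    post = priors * conf[:, j]   # P(true = i) * P[i -> vote]
    return post / post.sum()     # P(true = i | worker voted `vote`)

priors = np.full(4, 0.25)                        # equal priors, as in the slide
spammer = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))  # every class gets labeled G
print(soft_label(spammer, "G", priors))          # [0.25 0.25 0.25 0.25]: no information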

Quality Scores
Cost of a "soft" label ⟨p_1, …, p_k⟩, with cost c_ij for labeling an i as a j:
Cost = Σ_i Σ_j p_i · p_j · c_ij
Cost(spammer): replace p_i with Prior_i, i.e., Cost(spammer) = Σ_i Σ_j Prior_i · Prior_j · c_ij
High cost when the "corrected" labels have their probability spread across classes
Low cost when the "corrected" labels have their probability mass in one class
Quality = 1 – Cost(worker) / Cost(spammer)
Spammer: Cost = 0.75, Quality = 0%
Good worker: Cost ≈ 0.02, Quality = 97.3%
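The score can be computed directly from the formula above; a small sketch, where the 0/1 cost matrix matches the running G/P/R/X example and other cost matrices plug in the same way:

import numpy as np

def soft_label_cost(p, costs):
    # Expected cost of soft label p: sum_i sum_j p_i * p_j * c_ij
    return float(p @ costs @ p)

def quality_score(worker_soft_labels, priors, costs):
    worker_cost = np.mean([soft_label_cost(p, costs) for p in worker_soft_labels])
    spammer_cost = soft_label_cost(priors, costs)   # replace p_i with Prior_i
    return 1.0 - worker_cost / spammer_cost

costs = 1.0 - np.eye(4)   # 0/1 misclassification costs for G, P, R, X
priors = np.full(4, 0.25)
print(soft_label_cost(priors, costs))          # 0.75: the spammer cost quoted above
print(quality_score([priors], priors, costs))  # 0.0: a spammer has zero quality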

Gold Testing?
(Simulation plots comparing gold testing across settings: 3, 5, and 10 labels per example; 2 categories with a 50/50 split; labeler quality range 0.55:0.05:…; … labelers.)

Empirical Results and Observations
500 web pages in G, P, R, X, manually labeled; 100 workers per page; 339 workers in total
Lots of noise on MTurk. With 100 votes per page:
– 95% accuracy with majority voting (only! -- note the massive amount of redundancy)
– 99.8% accuracy after modeling worker quality
Blocking based on error rate: only 1% of labels dropped
Blocking based on quality scores: 30% of labels dropped
– Majority vote gives the same results after spammer removal

Practical Deployment Lessons
Lesson 1: After kicking out the spammers, majority vote is indistinguishable from more advanced methods (but it is good to know the noise level, as we will see)
Lesson 2: Gold testing is useful for spam detection but not for much else (it wastes resources on workers who do not stick around)
Lesson 3: Slight benefit from having levels of workers
– Level 2: Super-duper (80% and above) [not that many, however]
– Level 1: Good (60% and above)
– Level 0: Spammers and untrusted

Example: Build an "Adult Web Site" Classifier
Get people to look at sites and classify them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)
But we are not going to label the whole Internet…
– Expensive
– Slow

Quality and Classification Performance
Noisy labels lead to degraded task performance
As labeling quality increases, classification quality increases
(Learning curves shown for single-labeler qualities of 50%, 60%, 80%, and 100%; single-labeler quality is the probability of correctly assigning a binary label.)

Majority Voting and Label Quality
Ask multiple labelers; keep the majority label as the "true" label
Quality is the probability of the integrated label being correct
P is the probability of an individual labeler being correct
(Curves shown for P = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.)
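The curves summarized above follow from the binomial distribution. A small sketch, assuming independent labelers and random tie-breaking:

from math import comb

def majority_quality(p, n):
    # Probability that the majority of n labelers (each correct w.p. p) is correct.
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:                       # even n: a tie is broken by coin flip
        k = n // 2
        q += 0.5 * comb(n, k) * p**k * (1 - p)**(n - k)
    return q

print(majority_quality(0.7, 1))   # 0.70
print(majority_quality(0.7, 11))  # ~0.92: repeated labeling helps when P > 0.5
print(majority_quality(0.4, 11))  # ~0.25: and hurts when P < 0.5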

Tradeoffs: More data or better data?
Get more examples → improve classification
Get more labels → improve label quality → improve classification
(Learning curves shown for labeler qualities of 50%, 60%, 80%, and 100%.) [KDD 2008]

Basic Results
With high-quality labelers (80% and above): one worker per case (more data is better)
With low-quality labelers (~60%): multiple workers per case (to improve quality)
(See also the earlier presentation by Kumar and Lease)

Selective Repeated-Labeling
We do not need to label everything the same way
Key observation: we have additional information to guide the selection of data for repeated labeling, namely the current multiset of labels
Example: {+,-,+,-,-,+} vs. {+,+,+,+,+,+}

Natural Candidate: Entropy
Entropy is a natural measure of label uncertainty:
E({+,+,+,+,+,+}) = 0
E({+,-,+,-,-,+}) = 1
Strategy: get more labels for high-entropy label multisets

What Not to Do: Use Entropy
Entropy improves results at first, but hurts in the long run

Why not Entropy?
In the presence of noise, entropy stays high even with many labels
Entropy is scale invariant: {3+, 2-} has the same entropy as {600+, 400-}, i.e., ≈ 0.971
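A short sketch makes both points concrete (the ≈ 0.971 figure above comes from this computation):

from math import log2

def label_entropy(pos, neg):
    # Entropy of the empirical label distribution of a multiset.
    n = pos + neg
    return -sum(p * log2(p) for p in (pos / n, neg / n) if p > 0)

print(label_entropy(6, 0))      # 0.0    -> {+,+,+,+,+,+}
print(label_entropy(3, 3))      # 1.0    -> {+,-,+,-,-,+}
print(label_entropy(3, 2))      # ~0.971
print(label_entropy(600, 400))  # ~0.971 -> identical, despite 200x the evidence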

Compute Uncertainty per Example
Let's first estimate how difficult each example is to label
– What is the probability of the example receiving the correct label?
Observe the +'s and -'s, and estimate the probability that a worker assigns a + or a -
This is the quality of the labelers for that example
We are estimating the parameter of a Bernoulli, using Bayesian estimation

Labeler Quality for an Example
(Posterior over the example's labeler quality; true quality = 0.7, unknown to the algorithm.)
5 labels (3+, 2-): entropy ≈ 0.97
10 labels (7+, 3-): entropy ≈ 0.88
20 labels (14+, 6-): entropy ≈ 0.88
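A sketch of the Bayesian step behind these plots, assuming a uniform Beta prior on the probability of a + label (the choice of prior is an assumption):

from scipy.stats import beta

for pos, neg in [(3, 2), (7, 3), (14, 6)]:
    post = beta(1 + pos, 1 + neg)   # posterior after pos +'s and neg -'s
    print(f"({pos}+, {neg}-): mean = {post.mean():.2f}, sd = {post.std():.2f}")
# (3+, 2-):  mean = 0.57, sd = 0.17
# (7+, 3-):  mean = 0.67, sd = 0.13
# (14+, 6-): mean = 0.68, sd = 0.10

Note how the posterior sharpens as labels accumulate, even though the entropy of the label multiset barely moves; that is exactly the information entropy misses.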

Computing Label Uncertainty
Once we know the (stochastic) quality of the workers, we can estimate Pr(+ | pos, neg)
Allocate labels to the cases with Pr(+ | pos, neg) close to 0.5
Estimate the prior Pr(+) by iterating
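A minimal sketch of that posterior, simplified by assuming a single known quality q shared by the workers on this example (the full method integrates over the Beta posterior sketched earlier):

def prob_positive(pos, neg, q=0.7, prior_pos=0.5):
    # Pr(+ | pos, neg) for labelers that are correct with probability q.
    like_pos = prior_pos * q**pos * (1 - q)**neg        # if the true label is +
    like_neg = (1 - prior_pos) * (1 - q)**pos * q**neg  # if the true label is -
    return like_pos / (like_pos + like_neg)

print(prob_positive(3, 2))   # ~0.70  -> still uncertain: allocate another label
print(prob_positive(14, 6))  # ~0.999 -> settled: spend labels elsewhere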

Another Strategy: Model Uncertainty (MU)
Learning models of the data provides an alternative source of information about label certainty
Model uncertainty: get more labels for the instances that cause model uncertainty
Intuition?
– For modeling: why improve training-data quality where the model is already certain?
– For data quality: low-certainty "regions" may be due to incorrect labeling of the corresponding instances
(Diagram: models and examples in a self-healing process.) [KDD 2008]

Adult content classification

Too much theory?
An open-source implementation is available at:
Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
Beta version, more improvements to come!
Suggestions and collaborations welcome!

Thanks! Q & A?