Get Another Label? Improving Data Quality and Machine Learning Using Multiple, Noisy Labelers Panos Ipeirotis New York University Joint work with Jing Wang, Foster Provost, and Victor Sheng

6 Outsourcing machine learning preprocessing
Traditionally, modeling teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing.
Now we can outsource preprocessing tasks such as labeling, feature extraction, and verifying information extraction – using Mechanical Turk, oDesk, etc.
– Quality may be lower than expert labeling (much?)
– But low costs can allow massive scale

Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites.
Get people to look at sites and classify them as:
– G (general audience)
– PG (parental guidance)
– R (restricted)
– X (porn)

Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites.
Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn).
Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr

10 Noisy labels can be problematic
Many tasks rely on high-quality labels for objects:
– webpage classification for safe advertising
– learning predictive models
– searching for relevant information
– finding duplicate database records
– image recognition/labeling
– song categorization
– sentiment analysis
Noisy labels can lead to degraded task performance.

11 Quality and Classification Performance
Labeling quality increases → classification quality increases.
[Figure: learning curves for single-labeler quality P = 50%, 60%, 80%, 100%, where P is the probability of assigning a binary label correctly.]

12 Solutions
Get better labelers
– Often beyond our control or too expensive
Get more labelers per item
– Our focus

13 Majority Voting and Label Quality
Ask multiple labelers, keep the majority label as the “true” label.
Quality is the probability of the integrated label being correct; P is the probability of an individual labeler being correct.
[Figure: label quality vs. number of labelers for P = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.]
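As a quick check on the figure's takeaway, here is a minimal sketch (mine, not from the slides) that computes majority-vote quality assuming independent labelers of identical quality p; `majority_vote_quality` is an illustrative name.

```python
# Minimal sketch, assuming independent labelers with identical accuracy p.
from math import comb

def majority_vote_quality(p: float, n: int) -> float:
    """P(majority of n labelers is correct), breaking ties with a fair coin."""
    q = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        q += 0.5 * comb(n, n // 2) * (p * (1 - p)) ** (n // 2)
    return q

# Quality rises with more labels when p > 0.5, stays flat at p = 0.5,
# and degrades when p < 0.5 -- the qualitative picture on this slide.
for p in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(p, [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)])
```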

14 Tradeoffs for Modeling
Get more examples → improve classification.
Get more labels → improve label quality → improve classification.
[Figure: learning curves for labeling quality P = 0.5, 0.6, 0.8, 1.0.]

15 Basic Labeling Strategies
Single Labeling
– Get as many data points as possible, one label each
Round-robin Repeated Labeling
– Repeatedly label data points, giving the next label to the one with the fewest labels so far (see the sketch below)
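For concreteness, a tiny illustrative sketch of the two selection rules (function names are mine, not from the paper):

```python
# Illustrative sketch of the two basic strategies as selection rules.
def next_example_single(unlabeled_pool: list):
    """Single labeling: spend each unit of budget on a brand-new example."""
    return unlabeled_pool.pop()

def next_example_round_robin(label_counts: dict):
    """Round-robin repeated labeling: relabel the example with the fewest labels so far."""
    return min(label_counts, key=label_counts.get)
```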

16 Round Robin vs. Single: Low Noise
(p = 0.8 labeling quality, 50 examples)
[Figure: learning curves for FRR (50 examples) vs. single labeling.]
With low noise and few examples, more (single-labeled) examples is better.

17 Round Robin vs. Single: High Noise
(p = 0.6 labeling quality, 100 examples)
[Figure: learning curves for FRR (100 examples) vs. single labeling (SL).]
With high noise and many examples, repeated labeling is better.

18 Selective Repeated-Labeling
We have seen so far:
– With enough examples and noisy labels, getting multiple labels is better than single-labeling
– When acquiring the unlabeled part (the example itself) also has a cost, the benefit is magnified
Can we do better than the basic strategies?
Key observation: we have additional information to guide the selection of data for repeated labeling – the current multiset of labels.
Example: {+,-,+,-,-,+} vs. {+,+,+,+,+,+}

19 Natural Candidate: Entropy
Entropy is a natural measure of label uncertainty:
– E({+,+,+,+,+,+}) = 0
– E({+,-,+,-,-,+}) = 1
Strategy: get more labels for high-entropy label multisets.

20 What Not to Do: Use Entropy
Entropy-based selection improves label quality at first, but hurts in the long run.

Why not Entropy?
In the presence of noise, entropy will be high even with many labels.
Entropy is scale invariant: {3+, 2-} has the same entropy as {600+, 400-}, i.e., H(0.6) ≈ 0.97 in both cases.
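A small sketch of the entropy measure and of the scale-invariance problem just described (`label_entropy` is an illustrative name):

```python
from math import log2

def label_entropy(pos: int, neg: int) -> float:
    """Entropy of the empirical distribution of a {+, -} label multiset."""
    n = pos + neg
    return -sum((c / n) * log2(c / n) for c in (pos, neg) if c)

print(label_entropy(6, 0))      # {+,+,+,+,+,+}   -> 0.0
print(label_entropy(3, 3))      # {+,-,+,-,-,+}   -> 1.0
print(label_entropy(3, 2))      # {3+, 2-}        -> ~0.97
print(label_entropy(600, 400))  # {600+, 400-}    -> ~0.97 (same: scale invariant)
```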

22 Estimating Label Uncertainty (LU)
Observe +'s and -'s and compute Pr{+|obs} and Pr{-|obs}.
Use a Bayesian estimate of the Bernoulli parameter: Beta prior + update.
Uncertainty = tail of the Beta posterior (the shaded area S_LU in the slide's Beta probability density plot).
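A minimal sketch of the LU score, assuming a uniform Beta(1, 1) prior and a 0.5 threshold; the paper's exact prior and scoring choices may differ.

```python
from scipy.stats import beta

def label_uncertainty(pos: int, neg: int, a: float = 1.0, b: float = 1.0) -> float:
    """Tail mass of the Beta posterior on the minority side of 0.5."""
    tail = beta(a + pos, b + neg).cdf(0.5)
    return min(tail, 1.0 - tail)

print(label_uncertainty(3, 2))    # few labels, close vote  -> large tail
print(label_uncertainty(14, 6))   # same ratio, more labels -> much smaller tail
```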

Label Uncertainty (p = 0.7)
5 labels (3+, 2-): entropy ≈ 0.97. [Figure: Beta posterior with its tail CDF below 0.5 shaded.]

Label Uncertainty (p as on the previous slide)
10 labels (7+, 3-): entropy ≈ 0.88. [Figure: Beta posterior; the tail CDF below 0.5 shrinks as labels accumulate.]

Label Uncertainty (p as above)
20 labels (14+, 6-): entropy ≈ 0.88. [Figure: Beta posterior; with the same ratio but more labels, the tail CDF below 0.5 is much smaller.]

26 Label Uncertainty vs. Round Robin
[Figure: learning curves; similar results across a dozen data sets.]
Remember: GRR is already better than single labeling.

Why is LU a hack?

28 More sophisticated label uncertainty
Observe +'s and -'s and compute Pr{+|obs} and Pr{-|obs}.
Estimated Beta distribution = quality of labelers for the example.
Estimate the prior Pr(+) by iterating.

31 More sophisticated LU improves labeling quality under class imbalance and fixes some pesky LU learning-curve glitches.
Both techniques perform essentially optimally with balanced classes.

32 Another strategy: Model Uncertainty (MU)
Learning models of the data provides an alternative source of information about label certainty (a random forest for the results to come).
Model uncertainty: get more labels for instances that cause model uncertainty.
Intuition?
– for modeling: why improve training-data quality where the model is already certain?
– for data quality: low-certainty “regions” may be due to incorrect labeling of the corresponding instances
[Diagram: models trained on examples feed back into labeling – a self-healing process, cf. Brodley et al., 1999.]
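One way to turn this into a score, sketched below with scikit-learn for a binary task; the scoring function is my illustration, and the paper's exact MU definition may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def model_uncertainty_scores(X_train, y_train, X_pool):
    """Train a random forest on the current (noisy) labels and score pool
    examples by how far the predicted probability is from a confident 0/1."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    p_pos = model.predict_proba(X_pool)[:, 1]
    return 0.5 - np.abs(p_pos - 0.5)   # 0 = certain, 0.5 = maximally uncertain
```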

33 Yet another strategy: Label & Model Uncertainty (LMU)
Combine label and model uncertainty (LMU): avoid examples where either strategy is certain.
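A sketch of the combination; the geometric mean below is one natural choice (and what I recall from the KDD-08 paper), but treat the exact form as an assumption.

```python
import numpy as np

def lmu_score(s_lu: np.ndarray, s_mu: np.ndarray) -> np.ndarray:
    """High only when BOTH the label-based and the model-based scores are uncertain."""
    return np.sqrt(s_lu * s_mu)
```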

Label Quality
[Figure: labeling quality vs. number of labels (waveform, p = 0.6), comparing UNF (uniform/round robin), MU (model uncertainty), LU (label uncertainty), and LMU (label + model uncertainty).]
Model Uncertainty alone also improves quality.

35 Model Quality
[Figure: model-quality learning curves for Label & Model Uncertainty.]
Across 12 domains, LMU is always better than GRR. LMU is statistically significantly better than LU and MU.

Adult content classification 36

Example: Build an “Adult Web Site” Classifier
Need a large number of hand-labeled sites.
Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn).
Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr

Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Redundant votes, infer quality
Look at our spammer friend ATAMRO447HWJQ together with 9 other workers.
Using redundancy, we can compute error rates for each worker.

Repeated Labeling, EM, and Confusion Matrices (Dawid & Skene, 1979)
Iterative process to estimate worker error rates:
1. Initialize the “correct” label for each object (e.g., use majority vote)
2. Estimate error rates for workers (using the “correct” labels)
3. Estimate the “correct” labels (using the error rates, weighting worker votes according to quality)
4. Go to Step 2 and iterate until convergence
Our friend ATAMRO447HWJQ marked almost all sites as G. Seems like a spammer…
Error rates for ATAMRO447HWJQ:
– P[G → G] = 99.947%, P[G → X] = 0.053%
– P[X → G] = 99.153%, P[X → X] = 0.847%
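The four steps above, as a compact illustrative Python sketch of the Dawid & Skene iteration (variable names are mine; `labels[obj]` maps worker id to reported class):

```python
import numpy as np
from collections import Counter

def dawid_skene(labels: dict, classes: list, n_iter: int = 50):
    """labels: {object_id: {worker_id: reported_class}}.
    Returns (soft 'correct' labels, per-worker confusion matrices)."""
    objects = list(labels)
    workers = sorted({w for votes in labels.values() for w in votes})
    cidx = {c: k for k, c in enumerate(classes)}
    K = len(classes)

    # Step 1: initialize "correct" labels with a (soft) majority vote.
    T = np.zeros((len(objects), K))
    for i, obj in enumerate(objects):
        for c, n in Counter(labels[obj].values()).items():
            T[i, cidx[c]] = n
        T[i] /= T[i].sum()

    for _ in range(n_iter):
        # Step 2: estimate each worker's confusion matrix from the current labels.
        conf = {w: np.full((K, K), 1e-9) for w in workers}   # smoothed counts
        for i, obj in enumerate(objects):
            for w, c in labels[obj].items():
                conf[w][:, cidx[c]] += T[i]
        for w in workers:
            conf[w] /= conf[w].sum(axis=1, keepdims=True)
        prior = T.mean(axis=0)

        # Step 3: re-estimate "correct" labels, weighting votes by worker quality.
        for i, obj in enumerate(objects):
            log_post = np.log(prior)
            for w, c in labels[obj].items():
                log_post += np.log(conf[w][:, cidx[c]])
            post = np.exp(log_post - log_post.max())
            T[i] = post / post.sum()
        # Step 4: iterate (a fixed number of rounds here instead of a convergence test).
    return T, conf
```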

Challenge: From Confusion Matrices to Quality Scores
The algorithm generates a “confusion matrix” for each worker.
How do we check whether a worker is a spammer using the confusion matrix? (hint: the error rate is not enough)
Error rates for ATAMRO447HWJQ:
– P[X → X] = 0.847%, P[X → G] = 99.153%
– P[G → X] = 0.053%, P[G → G] = 99.947%

Challenge 1: Spammers are lazy and smart!
Confusion matrix for a spammer:
– P[X → X] = 0%, P[X → G] = 100%
– P[G → X] = 0%, P[G → G] = 100%
Confusion matrix for a good worker:
– P[X → X] = 80%, P[X → G] = 20%
– P[G → X] = 20%, P[G → G] = 80%
Spammers figure out how to fly under the radar…
In reality, we have 85% G sites and 15% X sites:
– Error rate of spammer = 0% × 85% + 100% × 15% = 15%
– Error rate of good worker = 20% × 85% + 20% × 15% = 20%
False negatives: spam workers pass as legitimate.
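The arithmetic above as a tiny script (class names and numbers taken from the slide; the function name is illustrative):

```python
def overall_error_rate(confusion: dict, priors: dict) -> float:
    """confusion[i][j] = P(report j | true class i); priors[i] = P(true class i)."""
    return sum(priors[i] * (1.0 - confusion[i][i]) for i in priors)

priors  = {"G": 0.85, "X": 0.15}
spammer = {"G": {"G": 1.0, "X": 0.0}, "X": {"G": 1.0, "X": 0.0}}
good    = {"G": {"G": 0.8, "X": 0.2}, "X": {"G": 0.2, "X": 0.8}}
print(overall_error_rate(spammer, priors))  # 0.15 -- the spammer looks *better*...
print(overall_error_rate(good, priors))     # 0.20 -- ...than the honest worker
```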

Challenge 2: Humans are biased!
Error rates for the CEO of AdSafe:
– P[G → G] = 20.0%, P[G → P] = 80.0%, P[G → R] = 0.0%, P[G → X] = 0.0%
– P[P → G] = 0.0%, P[P → P] = 0.0%, P[P → R] = 100.0%, P[P → X] = 0.0%
– P[R → G] = 0.0%, P[R → P] = 0.0%, P[R → R] = 100.0%, P[R → X] = 0.0%
– P[X → G] = 0.0%, P[X → P] = 0.0%, P[X → R] = 0.0%, P[X → X] = 100.0%
In reality, we have 85% G sites, 5% P sites, 5% R sites, 5% X sites:
– Error rate of spammer (all in G) = 0% × 85% + 100% × 15% = 15%
– Error rate of biased worker = 80% × 85% + 100% × 5% = 73%
False positives: legitimate workers appear to be spammers.

Solution: Reverse errors first, compute error rate afterwards
Error rates for the biased worker:
– P[G → G] = 20.0%, P[G → P] = 80.0%, P[G → R] = 0.0%, P[G → X] = 0.0%
– P[P → G] = 0.0%, P[P → P] = 0.0%, P[P → R] = 100.0%, P[P → X] = 0.0%
– P[R → G] = 0.0%, P[R → P] = 0.0%, P[R → R] = 100.0%, P[R → X] = 0.0%
– P[X → G] = 0.0%, P[X → P] = 0.0%, P[X → R] = 0.0%, P[X → X] = 100.0%
When the biased worker says G, it is 100% G.
When the biased worker says P, it is 100% G.
When the biased worker says R, it is 50% P, 50% R.
When the biased worker says X, it is 100% X.
Small ambiguity for “R-rated” votes, but other than that, fine! (See the sketch below.)
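A sketch of the “reverse the errors” step using Bayes' rule with the class priors and the worker's confusion matrix; numbers come from the slide, `posterior_given_report` is an illustrative name.

```python
def posterior_given_report(confusion: dict, priors: dict, reported: str) -> dict:
    """P(true = i | worker reported `reported`) ∝ P(reported | i) * P(i)."""
    scores = {i: priors[i] * confusion[i][reported] for i in priors}
    z = sum(scores.values())
    return {i: s / z for i, s in scores.items()}

priors = {"G": 0.85, "P": 0.05, "R": 0.05, "X": 0.05}
biased = {  # the biased worker's confusion matrix from the slide
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
print(posterior_given_report(biased, priors, "P"))  # -> 100% G
print(posterior_given_report(biased, priors, "R"))  # -> 50% P, 50% R
```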

Solution: Reverse errors first, compute error rate afterwards
Error rates for spammer ATAMRO447HWJQ:
– P[G → G] = 100.0%, P[G → P] = 0.0%, P[G → R] = 0.0%, P[G → X] = 0.0%
– P[P → G] = 100.0%, P[P → P] = 0.0%, P[P → R] = 0.0%, P[P → X] = 0.0%
– P[R → G] = 100.0%, P[R → P] = 0.0%, P[R → R] = 0.0%, P[R → X] = 0.0%
– P[X → G] = 100.0%, P[X → P] = 0.0%, P[X → R] = 0.0%, P[X → X] = 0.0%
When the spammer says G, it is 25% G, 25% P, 25% R, 25% X.
When the spammer says P, it is 25% G, 25% P, 25% R, 25% X.
When the spammer says R, it is 25% G, 25% P, 25% R, 25% X.
When the spammer says X, it is 25% G, 25% P, 25% R, 25% X.
[Note: assumes equal priors]
The results are highly ambiguous. No information provided!

Computing Quality Scores
Expected cost of a “soft” label p = (p_1, …, p_K), with cost c_ij for labeling an object of class i as j:
Cost(p) = Σ_i Σ_j p_i · p_j · c_ij
– “Soft” labels with probability mass in a single class are good (low cost).
– “Soft” labels with probability mass spread across classes are bad (high cost).
Cost(spammer) = the same expression with p_i replaced by the class prior Prior_i.
Quality(worker) = 1 – Cost(worker) / Cost(spammer)
(See the sketch below.)
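A sketch of the score under these definitions, using 0/1 misclassification costs in the example (function names are mine):

```python
def soft_label_cost(p: dict, cost: dict) -> float:
    """Expected cost of a soft label p: sum_i sum_j p_i * p_j * c_ij."""
    return sum(p[i] * p[j] * cost[i][j] for i in p for j in p)

def quality_score(worker_soft_labels: list, priors: dict, cost: dict) -> float:
    worker_cost = sum(soft_label_cost(p, cost) for p in worker_soft_labels) / len(worker_soft_labels)
    spammer_cost = soft_label_cost(priors, cost)   # spammer reproduces the priors
    return 1.0 - worker_cost / spammer_cost

# Example with 0/1 misclassification costs (cost[i][j] = 1 if i != j).
classes = ["G", "X"]
cost = {i: {j: 0.0 if i == j else 1.0 for j in classes} for i in classes}
priors = {"G": 0.85, "X": 0.15}
print(quality_score([{"G": 1.0, "X": 0.0}], priors, cost))    # confident label  -> 1.0
print(quality_score([{"G": 0.85, "X": 0.15}], priors, cost))  # prior-like label -> 0.0
```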

Experimental Results
– 500 web pages in G, P, R, X
– 100 workers per page (just to evaluate the effect of more labels); 339 workers in total
– Lots of noise! Only 95% accuracy with majority vote
– Dropped all workers with a quality score of 50% or below
– Error rate: 1% of labels dropped
– Quality scores: 30% of labels dropped, accuracy 99.8%
Note the massive amount of redundancy and the very conservative spam rejection.

Too much theory?
Open source implementation available (URL on the original slide).
Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
Beta version, more improvements to come! Suggestions and collaborations welcome!

49 Workers reacting to quality scores
Score-based feedback leads to strange interactions:
The angry, has-been-burnt-too-many-times worker: “F*** YOU! I am doing everything correctly and you know it! Stop trying to reject me with your stupid ‘scores’!”
The overachiever: “What am I doing wrong?? My score is 92% and I want to have 100%”

An unexpected connection at the NAS “Frontiers of Science” conf. 50 Your spammers behave like my mice!

An unexpected connection at the NAS “Frontiers of Science” conf. 51 Your spammers behave like my mice! Eh?

An unexpected connection at the NAS “Frontiers of Science” conf. 52 Your spammers want to engage their brain only for motor skills and not for cognitive skills Yeah, makes sense…

An unexpected connection at the NAS “Frontiers of Science” conf. 53 And here is how I train my mice to behave… I should try this the moment that I get back to my room

54 Implicit Feedback with Frustration
Punish bad answers with frustration (e.g., delays between tasks):
– “Loading image, please wait…”
– “Image did not load, press here to reload”
– “Network error. Return the HIT and accept again”
Reward good answers by giving the next task immediately, with no problems.
→ Make this probabilistic to keep the feedback implicit (see the sketch below).
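A deliberately simple sketch of the probabilistic version; all thresholds and delays below are made-up numbers, not from the talk.

```python
import random
import time

def maybe_frustrate(recent_quality: float, p_when_bad: float = 0.6) -> None:
    """With some probability, delay a low-quality worker's next task."""
    if recent_quality < 0.5 and random.random() < p_when_bad:
        time.sleep(random.uniform(5, 20))   # shown as "Loading image, please wait..."
```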

55 First result
Spammer workers quickly abandon the task; good workers keep labeling.
Bad: spammer bots are unaffected. How do you frustrate a bot?
– Give it a CAPTCHA

56 Second result (more impressive)
Remember, Don was training the mice…
15% of the spammers start submitting good work! Putting in cognitive effort becomes more beneficial…
Key trick: learn to test workers on the fly
– Model performance using the Dirichlet compound multinomial distribution (details offline)
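For reference, a standalone sketch of the Dirichlet compound multinomial (Polya) log-likelihood of a worker's vector of answer counts; how the talk's model uses it is not specified here, so treat the surrounding setup as an assumption.

```python
import numpy as np
from scipy.special import gammaln

def dcm_loglik(counts: np.ndarray, alpha: np.ndarray) -> float:
    """log P(counts | alpha) under the Dirichlet compound multinomial."""
    n, a = counts.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(counts + 1).sum()
            + gammaln(a) - gammaln(n + a)
            + gammaln(counts + alpha).sum() - gammaln(alpha).sum())

# A worker whose answer counts match a concentrated alpha scores higher than
# one whose answers look uniform relative to that alpha.
print(dcm_loglik(np.array([9.0, 1.0]), np.array([9.0, 1.0])))
print(dcm_loglik(np.array([5.0, 5.0]), np.array([9.0, 1.0])))
```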

Thanks! Q & A?

Why does Model Uncertainty (MU) work? MU score distributions for correctly labeled (blue) and incorrectly labeled (purple) cases

59 Why does Model Uncertainty (MU) work?
[Diagram: models ↔ examples; the “self-healing” MU process vs. “active learning” MU.]

60 What if different labelers have different qualities?
(Sometimes) the quality of multiple noisy labelers is better than the quality of the best labeler in the set.
[Figure: here, 3 labelers with qualities p−d, p, p+d.]

Soft Labeling vs. Majority Voting
[Figure: learning curves for MV (majority voting) vs. ME (soft labeling).]

62 Related topic: Estimating (and using) labeler quality
– For multilabeled data: Dawid & Skene 1979; Raykar et al. JMLR 2010; Donmez et al. KDD'09
– For single-labeled data with variable-noise labelers: Donmez & Carbonell 2008; Dekel & Shamir 2009a,b
– To eliminate/down-weight poor labelers: Dekel & Shamir; Donmez et al.; Raykar et al. (implicitly)
– To correct labeler biases: Ipeirotis et al. HCOMP-10
– Example-conditional labeler performance: Yan et al. 2010a,b
– Using a learned model to find bad labelers/labels: Brodley & Friedl 1999; Dekel & Shamir; us (I'll discuss)