Crowdsourcing using Mechanical Turk: Quality Management and Scalability. Panos Ipeirotis, New York University.

Introduction: Panos Ipeirotis, New York University, Stern School of Business. "A Computer Scientist in a Business School". http://behind-the-enemy-lines.blogspot.com/ Email: panos@nyu.edu

Example: Build an "Adult Web Site" Classifier. We need a large number of hand-labeled sites, so we get people to look at sites and classify them as G (general audience), PG (parental guidance), R (restricted), or X (porn). Cost/speed statistics: an undergrad intern labels 200 websites/hr at a cost of $15/hr.

Amazon Mechanical Turk: Paid Crowdsourcing

Example: Build an "Adult Web Site" Classifier. We need a large number of hand-labeled sites, so we get people to look at sites and classify them as G (general audience), PG (parental guidance), R (restricted), or X (porn). Cost/speed statistics: an undergrad intern labels 200 websites/hr at a cost of $15/hr; MTurk labels 2500 websites/hr at a cost of $12/hr.

Bad news: spammers! Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Improve Data Quality through Repeated Labeling. Get multiple, redundant labels from multiple workers and pick the label by majority vote: 1 worker gives 70% correct labels, 11 workers give 93% correct labels. The probability of correctness increases with the number of workers and with the quality of the workers.
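As a back-of-the-envelope illustration of why redundancy helps, here is a minimal sketch (not from the slides) that assumes independent workers of equal accuracy p and computes the probability that a majority vote over n of them is correct:

```python
# Minimal sketch (assumption: independent workers, each correct with probability p).
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n workers is correct (n odd, to avoid ties)."""
    assert n % 2 == 1, "use an odd number of workers to avoid ties"
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(round(majority_vote_accuracy(0.70, 1), 2))   # 0.7  (single worker)
print(round(majority_vote_accuracy(0.70, 11), 2))  # ~0.92 under these simplifying assumptions
```

The slides' 93% figure for 11 workers is in the same ballpark; the exact number depends on the actual worker accuracies, which are not all equal in practice.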

Using redundant votes, we can infer worker quality. Look at our spammer friend ATAMRO447HWJQ together with 9 other workers: we can compute error rates for each worker. Error rates for ATAMRO447HWJQ: P[X → X] = 0.847%, P[X → G] = 99.153%, P[G → X] = 0.053%, P[G → G] = 99.947%. Our "friend" ATAMRO447HWJQ marked almost every site as G. Obviously a spammer…

Rejecting Spammers, and the Benefits. Random answers have an error rate of 50%; the average error rate for ATAMRO447HWJQ is 49.6% (P[X → X] = 0.847%, P[X → G] = 99.153%, P[G → X] = 0.053%, P[G → G] = 99.947%). Action: reject and block. Results: over time you block all spammers, spammers learn to avoid your HITs, and you can decrease redundancy because the quality of the remaining workers is higher.
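For concreteness, here is a minimal sketch of one simple way to obtain per-worker error rates like the ones above. It treats the majority vote as a stand-in for the true label and assumes binary G/X classes; the actual system behind these slides uses a more careful EM-style estimation (see the demo on the next slide).

```python
# Minimal sketch (assumptions: binary G/X labels; majority vote used as a
# proxy for the true label; the real tool estimates this jointly via EM).
from collections import Counter, defaultdict

def estimate_error_rates(labels):
    """labels: iterable of (worker_id, item_id, label) with label in {'G', 'X'}."""
    votes_by_item = defaultdict(list)
    for worker, item, label in labels:
        votes_by_item[item].append(label)
    # Majority vote as the (rough) reference answer for each item.
    reference = {item: Counter(v).most_common(1)[0][0] for item, v in votes_by_item.items()}

    confusion = defaultdict(Counter)  # worker -> Counter over (reference, given) pairs
    for worker, item, label in labels:
        confusion[worker][(reference[item], label)] += 1

    error_rates = {}
    for worker, counts in confusion.items():
        per_class_errors = []
        for true in ("G", "X"):
            total = counts[(true, "G")] + counts[(true, "X")]
            if total:
                per_class_errors.append(1 - counts[(true, true)] / total)
        error_rates[worker] = sum(per_class_errors) / len(per_class_errors)
    return error_rates

# Workers whose average error rate is close to 0.5 answer essentially at random
# (like ATAMRO447HWJQ's 49.6%) and are candidates for rejecting and blocking.
```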

Too much theory? A demo and open-source implementation are available at http://qmturk.appspot.com. Input: labels from Mechanical Turk, some "gold" data (optional), and the cost of incorrect labelings (e.g., X → G costlier than G → X). Output: corrected labels, worker error rates, and a ranking of workers according to their quality.

How to handle free-form answers? Q: "My task does not have discrete answers…" A: Break it into two HITs: a "Create" HIT (e.g., transcribe a caption, collect URLs) and a "Vote" HIT (is the result correct or not?). The Vote HIT controls the quality of the Create HIT, and redundancy controls the quality of the Vote HIT. Catch: if the creations are very good, voters just vote "yes" without checking. Solution: add some random noise (e.g., deliberately misspell some creations) so that inattentive voters can be detected.

But what if my free-form answer is not just right or wrong (e.g., "describe this image")? Use an iterative Create / Improve / Compare workflow: a Create HIT (e.g., describe the image), an Improve HIT (e.g., improve the description), and a Compare HIT that votes on which version is better. TurKit toolkit: http://groups.csail.mit.edu/uid/turkit/

Example output of the iterative workflow (worker-written descriptions, shown verbatim):
version 1: A parial view of a pocket calculator together with some coins and a pen.
version 2: A view of personal items a calculator, and some gold and copper coins, and a round tip pen, these are all pocket and wallet sized item used for business, writting, calculating prices or solving math problems and purchasing items.
version 3: A close-up photograph of the following items: A CASIO multi-function calculator. A ball point pen, uncapped. Various coins, apparently European, both copper and gold. Seems to be a theme illustration for a brochure or document cover treating finance, probably personal finance.
version 4: …Various British coins; two of £1 value, three of 20p value and one of 1p value. …
version 8: "A close-up photograph of the following items: A CASIO multi-function, solar powered scientific calculator. A blue ball point pen with a blue rubber grip and the tip extended. Six British coins; two of £1 value, three of 20p value and one of 1p value. Seems to be a theme illustration for a brochure or document cover treating finance - probably personal finance."
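A minimal sketch of the iterative Create / Improve / Compare loop behind this example, in the spirit of TurKit; the three HIT helpers below are hypothetical stand-ins for posting HITs and collecting worker responses, not a real API:

```python
# Minimal sketch of iterative improvement with voting (hypothetical HIT helpers).

def create_hit(image_url: str) -> str:
    """Hypothetical: post a Create HIT and return one worker's description."""
    raise NotImplementedError

def improve_hit(image_url: str, current: str) -> str:
    """Hypothetical: post an Improve HIT and return an improved description."""
    raise NotImplementedError

def compare_hit(image_url: str, a: str, b: str) -> str:
    """Hypothetical: post a Compare HIT; several workers vote, return the winner."""
    raise NotImplementedError

def iterative_description(image_url: str, iterations: int = 8) -> str:
    best = create_hit(image_url)
    for _ in range(iterations):
        candidate = improve_hit(image_url, best)
        # Keep the candidate only if the voters prefer it over the current best.
        if compare_hit(image_url, candidate, best) == candidate:
            best = candidate
    return best
```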

Future: break a big task into simple ones and build a workflow. Running experiment: crowdsource big tasks (e.g., a tourist guide). "My Boss is a Robot" (mybossisarobot.com), with Nikki Kittur (Carnegie Mellon) and Jim Giles (New Scientist). Steps: identify sights worth checking out (one tip per worker); vote and rank; write brief tips for each monument (one tip per worker); aggregate the tips into a meaningful summary; iterate to improve…

Thank you! Questions? "A Computer Scientist in a Business School" http://behind-the-enemy-lines.blogspot.com/ Email: panos@nyu.edu

Correcting biases. Classifying sites as G, PG (P), R, X: sometimes workers are careful but biased. This worker classifies G → P and P → R, so her average error rate looks too high. Error rates for the CEO of a company that detects offensive content (and a parent):
P[G → G]=20.0%  P[G → P]=80.0%   P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%   P[P → P]=0.0%    P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%   P[R → P]=0.0%    P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%   P[X → P]=0.0%    P[X → R]=0.0%    P[X → X]=100.0%
Is she a spammer?

Correcting biases. Error rates for worker ATLJIK76YH1TF:
P[G → G]=20.0%  P[G → P]=80.0%   P[G → R]=0.0%    P[G → X]=0.0%
P[P → G]=0.0%   P[P → P]=0.0%    P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%   P[R → P]=0.0%    P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%   P[X → P]=0.0%    P[X → R]=0.0%    P[X → X]=100.0%
For ATLJIK76YH1TF we simply need to "reverse the errors" (technical details omitted) and separate error from bias: the true error rate is ~9%.
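One way to "reverse the errors", sketched under the assumption that the worker's confusion matrix and a prior over the true classes are already known (the method behind these slides estimates them jointly from the data; details omitted): Bayes' rule turns each reported label into a posterior over true labels, so a biased-but-careful worker like ATLJIK76YH1TF still contributes information, while a spammer's posterior stays close to the prior.

```python
# Minimal sketch (assumption: known confusion matrix and class prior; not the
# slides' exact algorithm, which estimates both from the data).

CLASSES = ("G", "P", "R", "X")

def reverse_errors(confusion, prior, reported):
    """Posterior over the true class given one worker's reported label.

    confusion[true][reported] = P(worker reports `reported` | true class)
    prior[true]               = P(true class)
    """
    joint = {t: prior[t] * confusion[t][reported] for t in CLASSES}
    total = sum(joint.values())
    return {t: (joint[t] / total if total else prior[t]) for t in CLASSES}

# Example: ATLJIK76YH1TF reports "R" for both true-P and true-R sites, so a
# reported "R" yields a posterior split between P and R according to the
# prior, instead of being counted as a 100% error.
```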

Scaling Crowdsourcing: Use Machine Learning. Human labor is expensive, even when paying cents, so we need to scale crowdsourcing. Basic idea: build a machine learning model from the existing crowdsourced answers and use it instead of humans. Flow: data from existing crowdsourced answers → automatic model (through machine learning); new case → model → automatic answer.
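A minimal sketch of this basic idea, assuming scikit-learn and a list of (page_text, label) pairs collected from the quality-corrected crowdsourced answers; the actual features and learner behind the slides are not specified here:

```python
# Minimal sketch (assumptions: scikit-learn available; crowd_data holds
# quality-corrected (page_text, label) pairs with labels in {'G','PG','R','X'}).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_automatic_model(crowd_data):
    texts, labels = zip(*crowd_data)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)  # learn from the existing crowdsourced answers
    return model

# For a new case: model.predict([new_page_text]) gives the automatic answer.
```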

Tradeoffs for Automatic Models: Effect of Noise. Get more data → improve model accuracy. Improve data quality → improve classification. Example case: porn or not? [Chart: learning curves for data quality = 100%, 80%, 60%, and 50%.] Key point: the gradient of accuracy over the examples × quality space is complicated for various reasons, one of which is that labeling different examples brings different benefits.

Scaling Crowdsourcing: Iterative Training. Use the machine when it is confident, humans otherwise. Retrain with the new human input → improve the model → reduce the need for humans. Flow: new case → automatic model (trained on existing crowdsourced answers); if confident → automatic answer; if not confident → get human(s) to answer and add their answers to the training data.
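A minimal sketch of the confidence-based routing, assuming a scikit-learn-style model with predict_proba and a hypothetical ask_workers() helper that posts a HIT and aggregates redundant answers; the threshold is an example value:

```python
# Minimal sketch (assumptions: model exposes predict_proba/classes_ as in
# scikit-learn; ask_workers() is a hypothetical crowdsourcing helper).

CONFIDENCE_THRESHOLD = 0.9  # example cutoff; tuned in practice

def answer(model, case_text, training_data, ask_workers):
    probs = model.predict_proba([case_text])[0]
    if probs.max() >= CONFIDENCE_THRESHOLD:
        return model.classes_[probs.argmax()]       # confident: machine answers
    label = ask_workers(case_text)                  # not confident: humans answer
    training_data.append((case_text, label))        # keep the new human label
    texts, labels = zip(*training_data)
    model.fit(texts, labels)                        # retrain (periodically in practice)
    return label
```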

Scaling Crowdsourcing: Iterative Training, with Noise. Use the machine when it is confident, humans otherwise, and ask as many humans as necessary to ensure quality. Flow: new case → automatic model (trained on existing crowdsourced answers); if confident → automatic answer; if not confident → get human(s) to answer, keep asking until confident in the quality of the human answer, then feed that answer back into the training data.
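Finally, a minimal sketch of "ask as many humans as necessary to ensure quality", under simplifying assumptions that are not in the slides: binary labels, a known average worker accuracy, and a hypothetical get_one_label() helper that requests one more worker's vote. It keeps updating a Bayesian posterior until the aggregated answer is confident enough:

```python
# Minimal sketch (assumptions: binary labels, known worker accuracy,
# hypothetical get_one_label() that fetches one more worker's vote).

def label_until_confident(get_one_label, worker_accuracy=0.7,
                          prior_pos=0.5, target=0.95, max_workers=15):
    p_pos = prior_pos
    for _ in range(max_workers):
        if max(p_pos, 1 - p_pos) >= target:
            break  # confident enough in the aggregated human answer
        vote = get_one_label()  # 'pos' or 'neg'
        like_pos = worker_accuracy if vote == "pos" else 1 - worker_accuracy
        like_neg = (1 - worker_accuracy) if vote == "pos" else worker_accuracy
        # Bayesian update of P(true label is positive) given the new vote.
        p_pos = p_pos * like_pos / (p_pos * like_pos + (1 - p_pos) * like_neg)
    return ("pos" if p_pos >= 0.5 else "neg"), max(p_pos, 1 - p_pos)
```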