CrowdFlow: Integrating Machine Learning with Mechanical Turk for Speed-Cost-Quality Flexibility
Alex Quinn, Ben Bederson, Tom Yeh, Jimmy Lin
Human Computation
Things HUMANS can do: Translation, Photo tagging, Face recognition, Human detection
Things COMPUTERS can do: Speech recognition, Text analysis, Planning
Example: Human detection
Trade-off space
(Chart: axes Quality vs. Speed/Affordability, positioning Computers, Human Workers (traditional), and Human Computation.)
Man-Computer Symbiosis
(Diagram: pipelines combining humans and computers, each annotated with speed, cost, and quality: automation with human post-correction, supervised machine learning, and CrowdFlow.)
Mechanical Turk
Human Detection – Starting point
Human Detection – Task
Human Detection – Results
(Chart: Quality (60%–90%) vs. Speed/Affordability.)
119 images took 3 hrs 50 mins and cost $2.38.
Human Detection – Scenarios
(Chart: Quality (60%–90%) vs. Speed/Affordability.)
1,000 photos at 72% accuracy would take 12 hrs 20 mins and cost $…
(119 images took 3 hrs 50 mins and cost $2.38.)
Vision: Richer model
(Workflow diagram: input with computer results flows through worker roles (Validator, Appraiser, Fixer) and decisions (Correct, Incorrect, Fix, Start over) to produce output.)
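To make the routing concrete, here is a minimal sketch of how such a validate/appraise/fix pipeline could be wired together. The role names come from the diagram, but the control flow shown (validate first, then appraise to choose between fixing and starting over) is an assumption for illustration, not the authors' specification.

```python
# Hypothetical sketch of the "richer model" pipeline from the diagram.
# The routing logic (validate -> appraise -> fix/redo) is an assumption;
# the slide only names the roles, not the exact control flow.

def richer_model(item, machine_answer, validate, appraise, fix, redo):
    """Route one item with a machine-generated answer through crowd workers.

    validate(item, answer) -> bool           # Validator: is the answer correct?
    appraise(item, answer) -> "fix"|"redo"   # Appraiser: repairable or not?
    fix(item, answer) -> answer              # Fixer: repair a near-miss answer
    redo(item) -> answer                     # Worker: start over from scratch
    """
    answer = machine_answer
    while not validate(item, answer):        # Correct -> output
        if appraise(item, answer) == "fix":  # Incorrect -> appraise
            answer = fix(item, answer)       # small error: fix it
        else:
            answer = redo(item)              # hopeless: start over
    return answer
```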
Lessons Learned
Design for overall needs/constraints.
Practical advice:
- Pay consistently and reasonably.
- Reject only work that is definitely cheating.
- Build in fair cheating deterrence from the start.
- Keep instructions short, but always clear.
Contact: Alex Quinn
Cheating
Earlier naïve experiment: 2,000 reviews classified by 3 Turkers each.
91% of the submitted work was cheating, produced by just 9 bad Turkers.
Cheating Deterrence
- Mix in task instances with known answers and keep track of each worker's accuracy (see the sketch below).
- Warning after 10 HITs of <70% accuracy; block after 20 HITs of <70% accuracy.
- Thresholds are problem-specific.
Other mechanisms:
- Approve payment only after inspection.
- Filter workers based on approval record.
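A minimal sketch of this gold-standard deterrence scheme, using the warning and blocking thresholds from the slide. The bookkeeping details (per-worker counters, hook names, when accuracy is sampled) are assumptions for illustration, not the CrowdFlow implementation.

```python
# Gold-standard cheating deterrence, per the slide's thresholds:
# warn after 10 HITs below 70% accuracy, block after 20.

from collections import defaultdict

ACCURACY_FLOOR = 0.70
WARN_AFTER, BLOCK_AFTER = 10, 20

class WorkerMonitor:
    def __init__(self):
        self.gold_right = defaultdict(int)   # correct answers on gold tasks
        self.gold_seen = defaultdict(int)    # gold tasks answered
        self.low_hits = defaultdict(int)     # HITs done while below the floor

    def record_gold(self, worker, was_correct):
        """Record one answer on a task instance with a known answer."""
        self.gold_seen[worker] += 1
        self.gold_right[worker] += int(was_correct)

    def accuracy(self, worker):
        seen = self.gold_seen[worker]
        return self.gold_right[worker] / seen if seen else 1.0

    def on_hit_completed(self, worker):
        """Return 'ok', 'warn', or 'block' after a worker finishes a HIT."""
        if self.accuracy(worker) < ACCURACY_FLOOR:
            self.low_hits[worker] += 1
        if self.low_hits[worker] >= BLOCK_AFTER:
            return "block"
        if self.low_hits[worker] >= WARN_AFTER:
            return "warn"
        return "ok"
```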
Ideal Pricing
- Pay proportional to Turker effort; choose a reasonable hourly rate.
- Example: confirming a correct answer takes 10 seconds; fixing an incorrect answer takes 60 seconds; answering from scratch takes 50 seconds.
- If machine accuracy < 80%, bypass the machine results.
- Need to adjust for human accuracy! (See the sketch below.)
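A worked sketch of this effort-proportional pricing, using the slide's per-action times. The $6/hour rate and the naive expected-time model are assumptions; as the slide notes, the real calculation must also adjust for human accuracy.

```python
# Effort-proportional pricing with the slide's per-action times.
# The $6/hour rate and the expected-time model below are assumptions.

HOURLY_RATE = 6.00                         # assumed hourly rate, $/hr
T_CONFIRM, T_FIX, T_SCRATCH = 10, 60, 50   # seconds, from the slide

def price(seconds):
    """Pay proportional to effort at the chosen hourly rate."""
    return HOURLY_RATE * seconds / 3600

def expected_cost_with_machine(machine_accuracy):
    """Expected pay per item when workers verify/fix machine output.

    Naive model: confirming takes T_CONFIRM when the machine is right,
    fixing takes T_FIX when it is wrong. Ignores human error, which the
    slide says must also be accounted for.
    """
    t = machine_accuracy * T_CONFIRM + (1 - machine_accuracy) * T_FIX
    return price(t)

# Under this naive model, answering from scratch (T_SCRATCH) only beats
# verify-and-fix at very low machine accuracy; the slide's 80% bypass
# threshold presumably comes from the authors' fuller model that also
# adjusts for human accuracy.
for acc in (0.9, 0.8, 0.2):
    print(f"acc={acc:.0%}: with machine ${expected_cost_with_machine(acc):.4f},"
          f" from scratch ${price(T_SCRATCH):.4f}")
```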
Sentiment Polarity – Example 1
“Skim each movie review and decide whether it is positive or negative....”
○ positive  ○ negative
Sentiment Polarity – Results
- 1,083 movie reviews grouped into 361 HITs.
- Cost: $18.05 total, about 1.7¢ per movie review (5¢ per HIT).
- Time: 8 hours 7 minutes total, about 27 seconds per movie review.
- Human accuracy: 90%; machine accuracy: 83.5%.
Sentiment Polarity – Scenarios
- Given: 100,000 movie reviews; cost constraint: $1,000.
- Expect: humans do 66,714; machines do the rest.
- 78% combined accuracy; 18 days, 17 hours, 40 minutes.
(A simplified version of this projection is sketched below.)
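The projection can be sketched as a budget-constrained split between human and machine labeling. The per-review cost, time, and accuracies below come from the results slide, but the linear split-and-mix model is a simplifying assumption: it does not reproduce the slide's exact figures, which come from CrowdFlow's fuller speed-cost-quality model (including its pricing and accuracy adjustments).

```python
# Naive scenario projection: spend a fixed budget on human labels and let
# the machine handle the remainder. Constants are from the results slide;
# the split/accuracy model is a simplifying assumption, so the output only
# approximates the slide's figures.

TOTAL = 100_000                      # movie reviews
BUDGET = 1_000.00                    # dollars
COST_PER_REVIEW = 0.05 / 3           # 5 cents per 3-review HIT
SECS_PER_REVIEW = 27                 # from the results slide
HUMAN_ACC, MACHINE_ACC = 0.90, 0.835

n_human = min(TOTAL, int(BUDGET / COST_PER_REVIEW))
n_machine = TOTAL - n_human
combined_acc = (n_human * HUMAN_ACC + n_machine * MACHINE_ACC) / TOTAL
total_secs = n_human * SECS_PER_REVIEW   # assumes sequential throughput

days, rem = divmod(total_secs, 86_400)
hours, rem = divmod(rem, 3_600)
print(f"humans: {n_human}, machines: {n_machine}")
print(f"combined accuracy: {combined_acc:.1%}")
print(f"elapsed: {days} days, {hours} hours, {rem // 60} minutes")
```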
Review: MonoTrans
(Chart: Quality vs. Affordability, comparing Machine Translation, Professional Bilingual Human Participation, Amateur Bilingual Human Participation, and Monolingual Human Participation.)