Published by Sheryl Dalton. Modified over 9 years ago.
1
Spam? No, thanks! Panos Ipeirotis – New York University. ProPublica, Apr 1st 2010 (Disclaimer: No jokes included)
2
Panos Ipeirotis - Introduction. New York University, Stern School of Business. “A Computer Scientist in a Business School” http://behind-the-enemy-lines.blogspot.com/ Email: panos@nyu.edu
3
Example: Build an Adult Web Site Classifier
Need a large number of hand-labeled sites. Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn).
Cost/Speed Statistics: Undergrad intern: 200 websites/hr, cost: $15/hr. MTurk: 2500 websites/hr, cost: $12/hr.
4
Bad news: Spammers! Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).
5
Improve Data Quality through Repeated Labeling
Get multiple, redundant labels using multiple workers. Pick the correct label based on majority vote. Probability of correctness increases with number of workers. Probability of correctness increases with quality of workers.
1 worker: 70% correct. 11 workers: 93% correct.
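The effect of redundancy can be sketched with a binomial model. This is a minimal illustration, assuming a binary right/wrong decision, an odd number of workers, and workers who err independently with the same accuracy; none of these assumptions is stated on the slide, and the function name is made up for this example. Under them, 11 workers at 70% accuracy come out at roughly 92% correct, in the same range as the 93% quoted above.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n voters is correct.

    Assumes a binary task, odd n, and independent voters who are
    each correct with probability p (illustrative assumptions).
    """
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    for workers in (1, 3, 5, 11):
        print(workers, round(majority_correct(0.7, workers), 3))
```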
6
But Majority Voting is Expensive
Single Vote Statistics: MTurk: 2500 websites/hr, cost: $12/hr. Undergrad: 200 websites/hr, cost: $15/hr.
11-vote Statistics: MTurk: 227 websites/hr, cost: $12/hr. Undergrad: 200 websites/hr, cost: $15/hr.
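The 11-vote figure follows from dividing the single-vote throughput by the number of votes per site. The short sketch below redoes that arithmetic and adds a cost-per-site comparison; the function names are hypothetical and the cost-per-site view is an addition for illustration, not a figure from the slides.

```python
def sites_per_hour(single_vote_rate: float, votes_per_site: int) -> float:
    """Effective throughput when every site needs several votes."""
    return single_vote_rate / votes_per_site


def cost_per_site(hourly_cost: float, rate: float) -> float:
    """Dollars spent per labeled site at a given throughput."""
    return hourly_cost / rate


if __name__ == "__main__":
    mturk_11 = sites_per_hour(2500, 11)  # about 227 sites/hr
    print(f"MTurk, 11 votes: {mturk_11:.0f} sites/hr, "
          f"${cost_per_site(12, mturk_11):.4f} per site")
    print(f"MTurk, 1 vote: ${cost_per_site(12, 2500):.4f} per site")
    print(f"Undergrad: ${cost_per_site(15, 200):.4f} per site")
```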
7
Using redundant votes, we can infer worker quality
Look at our spammer friend ATAMRO447HWJQ together with 9 other workers. Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer… We can compute error rates for each worker.
Error rates for ATAMRO447HWJQ:
P[X → X]=9.847%  P[X → G]=90.153%
P[G → X]=0.053%  P[G → G]=99.947%
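One simple way to obtain per-worker error rates from redundant votes is to treat the majority label of each site as a stand-in for the truth and count how each worker's labels deviate from it. The sketch below does that on made-up toy data. It is a simplification for illustration and not necessarily the estimation procedure used in the talk; the function name, data layout, and worker IDs are assumptions.

```python
from collections import Counter, defaultdict


def estimate_error_rates(labels):
    """labels: iterable of (worker, site, label) triples.

    Returns {worker: {(consensus_label, assigned_label): rate}},
    using the majority label per site as a proxy for the true class.
    Ties are broken arbitrarily by Counter.most_common.
    """
    labels = list(labels)
    votes = defaultdict(Counter)
    for worker, site, label in labels:
        votes[site][label] += 1
    consensus = {s: c.most_common(1)[0][0] for s, c in votes.items()}

    pair_counts = defaultdict(Counter)  # worker -> (true, assigned) counts
    true_counts = defaultdict(Counter)  # worker -> true-class counts
    for worker, site, label in labels:
        true = consensus[site]
        pair_counts[worker][(true, label)] += 1
        true_counts[worker][true] += 1

    return {
        w: {pair: n / true_counts[w][pair[0]] for pair, n in c.items()}
        for w, c in pair_counts.items()
    }


if __name__ == "__main__":
    toy = [  # hypothetical votes: "spam" answers G for everything
        ("w1", "s1", "X"), ("w2", "s1", "X"), ("spam", "s1", "G"),
        ("w1", "s2", "G"), ("w2", "s2", "G"), ("spam", "s2", "G"),
    ]
    for worker, rates in estimate_error_rates(toy).items():
        print(worker, rates)
```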
8
Rejecting spammers and Benefits
Random answers error rate = 50%. Average error rate for ATAMRO447HWJQ: 45.2%
P[X → X]=9.847%  P[X → G]=90.153%
P[G → X]=0.053%  P[G → G]=99.947%
Action: REJECT and BLOCK
Results: Over time you block all spammers. Spammers learn to avoid your HITs. You can decrease redundancy, as quality of workers is higher.
9
After rejecting spammers, quality goes up
Spam keeps quality down. Without spam, workers are of higher quality. Need less redundancy for same quality. Same quality of results for lower cost.
With spam: 1 worker 70% correct, 11 workers 93% correct.
Without spam: 1 worker 80% correct, 5 workers 94% correct.
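The "less redundancy for same quality" claim can be checked with a small search for the fewest votes that reach a quality target. This is a rough sketch under the same kind of simplifying assumptions as a plain binomial model (binary decision, independent workers of equal accuracy); the 92% target and the function names are chosen here for illustration only. Under those assumptions, 70%-accurate workers need 11 votes to reach the target while 80%-accurate workers need 5, consistent with the pattern on the slide.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters,
    each correct with probability p, picks the right answer."""
    need = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


def min_workers(p: float, target: float, max_n: int = 101):
    """Smallest odd number of voters whose majority vote reaches
    the target accuracy, or None if max_n is not enough."""
    for n in range(1, max_n + 1, 2):
        if majority_correct(p, n) >= target:
            return n
    return None


if __name__ == "__main__":
    print("with spam (p=0.7):", min_workers(0.7, 0.92))
    print("without spam (p=0.8):", min_workers(0.8, 0.92))
```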
10
Correcting biases
Classifying sites as G, PG, R, X. Sometimes workers are careful but biased. Classifies G → P and P → R. Average error rate for ATLJIK76YH1TF: 45.0%. Is ATLJIK76YH1TF a spammer?
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%  P[G → X]=0.0%
P[P → G]=0.0%  P[P → P]=0.0%  P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%  P[R → P]=0.0%  P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%  P[X → P]=0.0%  P[X → R]=0.0%  P[X → X]=100.0%
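The 45.0% figure is consistent with averaging the per-class error over the four classes with equal weight. The sketch below reproduces it from the matrix on this slide; the equal weighting and the function name are assumptions made for illustration.

```python
# Error-rate matrix for ATLJIK76YH1TF, copied from the slide:
# CONFUSION[true_class][assigned_class] = probability
CONFUSION = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}


def average_error_rate(confusion) -> float:
    """Mean over true classes of the probability of a wrong label,
    weighting every class equally (an assumption)."""
    errors = [1.0 - row[true] for true, row in confusion.items()]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    print(f"{average_error_rate(CONFUSION):.1%}")
```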
11
Correcting biases
For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error-rate (technical details omitted). Non-recoverable error-rate for ATLJIK76YH1TF: 9%
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%  P[G → X]=0.0%
P[P → G]=0.0%  P[P → P]=0.0%  P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%  P[R → P]=0.0%  P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%  P[X → P]=0.0%  P[X → R]=0.0%  P[X → X]=100.0%
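The intuition behind "recoverable" errors can be illustrated with Bayes' rule: if a worker's bias is systematic, the label they assign still tells us a lot about the true class. The sketch below inverts the matrix from this slide under an assumed uniform prior over the four classes. It does not attempt to reproduce the 9% figure, whose derivation the slide omits; the function name and the prior are assumptions.

```python
# Error-rate matrix for ATLJIK76YH1TF, copied from the slide:
# CONFUSION[true_class][assigned_class] = probability
CONFUSION = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
UNIFORM_PRIOR = {c: 0.25 for c in CONFUSION}  # assumed, not from the talk


def posterior_true_class(confusion, priors, assigned):
    """P(true class | worker assigned this label) via Bayes' rule.

    Returns None if the worker never assigns the label at all.
    """
    joint = {t: priors[t] * confusion[t][assigned] for t in confusion}
    total = sum(joint.values())
    if total == 0:
        return None
    return {t: v / total for t, v in joint.items() if v > 0}


if __name__ == "__main__":
    for label in ("G", "P", "R", "X"):
        print(label, "->",
              posterior_true_class(CONFUSION, UNIFORM_PRIOR, label))
```

Under these assumptions a "P" from this worker points to a true class of G, so that error can be undone; an "R" remains ambiguous between P and R, which is the kind of error that cannot be fully recovered.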
12
Too much theory? Open source implementation available at: http://code.google.com/p/get-another-label/
Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
Alpha version, more improvements to come! Suggestions and collaborations welcomed!
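To see why misclassification costs matter as an input, consider picking the label that minimizes expected cost rather than the most likely one. The sketch below is a generic illustration of that idea; it is not the get-another-label API or file format, and the function name, the cost values, and the example probabilities are all made up.

```python
def min_cost_label(posterior, cost):
    """Pick the label with the lowest expected misclassification cost.

    posterior: {true_class: probability}
    cost: {(true_class, assigned_class): cost}; missing pairs cost 0.
    """
    def expected_cost(assigned):
        return sum(
            p * cost.get((true, assigned), 0.0)
            for true, p in posterior.items()
            if true != assigned
        )

    # sorted() keeps the result deterministic when costs tie
    return min(sorted(posterior), key=expected_cost)


if __name__ == "__main__":
    belief = {"G": 0.6, "X": 0.4}  # hypothetical belief about one site
    asymmetric = {("X", "G"): 10.0, ("G", "X"): 1.0}  # X → G is costlier
    symmetric = {("X", "G"): 1.0, ("G", "X"): 1.0}
    print(min_cost_label(belief, asymmetric))
    print(min_cost_label(belief, symmetric))
```

With the asymmetric costs, calling the site X is the cheaper mistake even though G is more probable; with symmetric costs the more probable label wins.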
13
Thank you! Questions? “A Computer Scientist in a Business School” http://behind-the-enemy-lines.blogspot.com/ Email: panos@nyu.edu