Spam? No, thanks!
Panos Ipeirotis – New York University
ProPublica, Apr 1st 2010
(Disclaimer: No jokes included)

Panos Ipeirotis - Introduction
• New York University, Stern School of Business
“A Computer Scientist in a Business School” http://behind-the-enemy-lines.blogspot.com/
Email: panos@nyu.edu

Example: Build an Adult Web Site Classifier
• Need a large number of hand-labeled sites
• Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn)
Cost/Speed Statistics
• Undergrad intern: 200 websites/hr, cost: $15/hr
• MTurk: 2500 websites/hr, cost: $12/hr

Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)

Improve Data Quality through Repeated Labeling
• Get multiple, redundant labels using multiple workers
• Pick the correct label based on majority vote
• Probability of correctness increases with number of workers
• Probability of correctness increases with quality of workers
Example: 1 worker: 70% correct; 11 workers: 93% correct
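The effect of redundancy can be sketched with a simple binomial model. This is an illustration only: it assumes binary labels, independent workers who are all equally accurate, and an odd number of votes; the function name is made up for this example. Under those assumptions it gives figures roughly in line with the ones quoted on the slides.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent workers is correct.

    Assumes binary labels, each worker correct with probability p,
    and an odd n so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    # Sum the binomial probabilities of having more than n/2 correct votes.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    print(f"{majority_vote_accuracy(0.7, 1):.3f}")   # 0.700
    print(f"{majority_vote_accuracy(0.7, 11):.3f}")  # 0.922
    print(f"{majority_vote_accuracy(0.8, 5):.3f}")   # 0.942
```

The model also shows why worker quality matters: raising individual accuracy from 70% to 80% reaches a comparable majority accuracy with 5 votes instead of 11.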

But Majority Voting is Expensive
Single Vote Statistics
• MTurk: 2500 websites/hr, cost: $12/hr
• Undergrad: 200 websites/hr, cost: $15/hr
11-vote Statistics
• MTurk: 227 websites/hr, cost: $12/hr
• Undergrad: 200 websites/hr, cost: $15/hr
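The 11-vote throughput follows directly from dividing the raw labeling rate by the number of votes per site. A small sketch of that arithmetic, using the rates and hourly costs from the slide (the helper names are invented for this example):

```python
def labeled_sites_per_hour(raw_labels_per_hour: float, votes_per_site: int) -> float:
    """Sites fully labeled per hour when each site needs several votes."""
    return raw_labels_per_hour / votes_per_site


def cost_per_site(hourly_cost: float, sites_per_hour: float) -> float:
    """Dollar cost of one labeled site."""
    return hourly_cost / sites_per_hour


if __name__ == "__main__":
    mturk_11 = labeled_sites_per_hour(2500, 11)
    print(f"MTurk, 11 votes: {mturk_11:.0f} sites/hr")                     # 227 sites/hr
    print(f"MTurk, 11 votes: ${cost_per_site(12, mturk_11):.4f} per site")  # $0.0528
    print(f"MTurk, 1 vote:   ${cost_per_site(12, 2500):.4f} per site")      # $0.0048
    print(f"Undergrad:       ${cost_per_site(15, 200):.4f} per site")       # $0.0750
```

With 11 votes per site, most of the single-vote price advantage of MTurk disappears, which is why reducing redundancy matters.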

Using redundant votes, we can infer worker quality
• Look at our spammer friend ATAMRO447HWJQ together with 9 other workers
• Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…
• We can compute error rates for each worker
Error rates for ATAMRO447HWJQ
• P[X → X]=9.847%  P[X → G]=90.153%
• P[G → X]=0.053%  P[G → G]=99.947%
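One simple way to get per-worker error rates from redundant votes is to treat the majority label of each site as a stand-in for the true label and tabulate what each worker answered against it. The sketch below does only that; it is a simplification for illustration, not necessarily the estimation procedure behind the numbers on the slide. The data layout, worker names, and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(votes):
    """votes: iterable of (worker, item, label). Returns {item: majority label}.

    Ties are broken arbitrarily, so an odd number of votes per item is safer.
    """
    by_item = defaultdict(Counter)
    for _worker, item, label in votes:
        by_item[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in by_item.items()}


def worker_error_rates(votes):
    """Returns {worker: {(reference_label, assigned_label): rate}}.

    The reference label is the majority vote, which includes the worker's own vote.
    """
    reference = majority_labels(votes)
    pair_counts = defaultdict(Counter)  # worker -> (reference, assigned) -> count
    ref_counts = defaultdict(Counter)   # worker -> reference -> count
    for worker, item, label in votes:
        ref = reference[item]
        pair_counts[worker][(ref, label)] += 1
        ref_counts[worker][ref] += 1
    return {
        worker: {
            (ref, assigned): n / ref_counts[worker][ref]
            for (ref, assigned), n in pairs.items()
        }
        for worker, pairs in pair_counts.items()
    }


if __name__ == "__main__":
    votes = [
        ("w1", "site1", "X"), ("w2", "site1", "X"), ("spam", "site1", "G"),
        ("w1", "site2", "G"), ("w2", "site2", "G"), ("spam", "site2", "G"),
    ]
    for worker, rates in worker_error_rates(votes).items():
        print(worker, rates)
```

In this toy data the worker who answers G for everything ends up with P[X → G] = 100% and P[G → G] = 100%, the same pattern as ATAMRO447HWJQ.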

Rejecting spammers and Benefits
• Random answers error rate = 50%
• Average error rate for ATAMRO447HWJQ: 45.2%
  P[X → X]=9.847%  P[X → G]=90.153%
  P[G → X]=0.053%  P[G → G]=99.947%
Action: REJECT and BLOCK
Results:
• Over time you block all spammers
• Spammers learn to avoid your HITs
• You can decrease redundancy, as quality of workers is higher

After rejecting spammers, quality goes up
• Spam keeps quality down
• Without spam, workers are of higher quality
• Need less redundancy for same quality
• Same quality of results for lower cost
With spam: 1 worker: 70% correct; 11 workers: 93% correct
Without spam: 1 worker: 80% correct; 5 workers: 94% correct

Correcting biases
• Classifying sites as G, PG, R, X
• Sometimes workers are careful but biased
• This worker classifies G → P and P → R (P stands for PG)
• Average error rate for ATLJIK76YH1TF: 45.0%
Is ATLJIK76YH1TF a spammer?
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%  P[G → X]=0.0%
P[P → G]=0.0%  P[P → P]=0.0%  P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%  P[R → P]=0.0%  P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%  P[X → P]=0.0%  P[X → R]=0.0%  P[X → X]=100.0%
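The 45.0% figure matches a plain unweighted average of the per-class error rates in the table above. A minimal sketch of that calculation, assuming every true class counts equally (the dictionary layout and function name are illustrative):

```python
# Error rates for ATLJIK76YH1TF from the slide: confusion[true][assigned].
CONFUSION = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}


def average_error_rate(confusion):
    """Unweighted mean over true classes of P[assigned != true]."""
    per_class_error = [1.0 - row[true] for true, row in confusion.items()]
    return sum(per_class_error) / len(per_class_error)


if __name__ == "__main__":
    print(f"{average_error_rate(CONFUSION):.1%}")  # 45.0%
```

Judged by this number alone, the worker looks almost as bad as someone answering at random, even though the mistakes are highly systematic.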

Correcting biases
• For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error-rate (technical details omitted)
• Non-recoverable error-rate for ATLJIK76YH1TF: 9%
Error Rates for Worker: ATLJIK76YH1TF
P[G → G]=20.0%  P[G → P]=80.0%  P[G → R]=0.0%  P[G → X]=0.0%
P[P → G]=0.0%  P[P → P]=0.0%  P[P → R]=100.0%  P[P → X]=0.0%
P[R → G]=0.0%  P[R → P]=0.0%  P[R → R]=100.0%  P[R → X]=0.0%
P[X → G]=0.0%  P[X → P]=0.0%  P[X → R]=0.0%  P[X → X]=100.0%
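The slides omit the details, so the following is only one plausible reading of a “non-recoverable” error rate, not necessarily the computation behind the 9% figure. The idea sketched here: use Bayes' rule to turn each label the worker assigns into a probability distribution over true labels, then measure how much uncertainty remains. The uniform class priors and 0/1 misclassification costs are assumptions made for this example, and the function name is hypothetical.

```python
CONFUSION = {  # confusion[true][assigned], from the slide
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
UNIFORM_PRIORS = {label: 0.25 for label in CONFUSION}  # assumption, not from the slides


def residual_error(confusion, priors):
    """Expected 0/1 cost that remains after undoing the worker's bias."""
    labels = list(confusion)
    total = 0.0
    for assigned in labels:
        # Joint probability of (true label, this assigned label).
        joint = {t: priors[t] * confusion[t][assigned] for t in labels}
        p_assigned = sum(joint.values())
        if p_assigned == 0.0:
            continue  # the worker never gives this label
        posterior = [joint[t] / p_assigned for t in labels]
        # Expected 0/1 cost of a soft label: chance two independent draws disagree.
        soft_label_cost = 1.0 - sum(p * p for p in posterior)
        total += p_assigned * soft_label_cost
    return total


if __name__ == "__main__":
    print(f"{residual_error(CONFUSION, UNIFORM_PRIORS):.2f}")  # 0.25
```

Under these assumptions the only ambiguity left is the R answer, which could mean either P or R, so the residual is 25%: far below the 45% raw error rate, though not the 9% on the slide, which presumably reflects priors or costs not shown here. A worker who answers G for everything gets no such discount, because that answer carries no information about the true label.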

Too much theory?
Open source implementation available at: http://code.google.com/p/get-another-label/
• Input:
  – Labels from Mechanical Turk
  – Cost of incorrect labelings (e.g., X → G costlier than G → X)
• Output:
  – Corrected labels
  – Worker error rates
  – Ranking of workers according to their quality
• Alpha version, more improvements to come!
• Suggestions and collaborations welcomed!
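To illustrate why the cost of incorrect labelings is an input, here is a small sketch of picking the label with the lowest expected cost when mistakes are not equally expensive. It does not use the get-another-label code or its file formats; the cost values, the probabilities, and the function name are invented for the example.

```python
def min_expected_cost_label(posterior, cost):
    """Pick the label whose expected misclassification cost is lowest.

    posterior: {true_label: probability}
    cost: cost[true_label][assigned_label], zero on the diagonal
    """
    def expected_cost(assigned):
        return sum(posterior[true] * cost[true][assigned] for true in posterior)

    return min(posterior, key=expected_cost)


if __name__ == "__main__":
    # Hypothetical costs: calling a porn site "general" is 10x worse than the reverse.
    asymmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 10, "X": 0}}
    symmetric = {"G": {"G": 0, "X": 1}, "X": {"G": 1, "X": 0}}
    belief = {"G": 0.8, "X": 0.2}  # hypothetical probabilities after combining votes

    print(min_expected_cost_label(belief, symmetric))   # G
    print(min_expected_cost_label(belief, asymmetric))  # X
```

With symmetric costs the more probable label wins; with the asymmetric costs a 20% chance of X is already enough to label the site X, since 0.2 × 10 exceeds 0.8 × 1.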

Thank you! Questions?
“A Computer Scientist in a Business School” http://behind-the-enemy-lines.blogspot.com/
Email: panos@nyu.edu
