How Crowdsourcable is Your Task?
Carsten Eickhoff, Arjen P. de Vries
WSDM 2011 Workshop on Crowdsourcing for Search and Data Mining (CSDM 2011), Hong Kong, China, February 9–12, 2011.

Outline
- The Crowdsourcing Boom
- Crowdsourcing, a Tale of Great Romance
- A Journey to the Dark Side of Crowdsourcing
- Is all Lost?
- Conclusions

The Crowdsourcing Boom
- Billions of judgements are being crowdsourced each year
- CrowdFlower – Judgement volume doubled ( )
- Significant numbers of research publications rely on crowdsourcing to create scientific resources
- ...but is it actually reliable?

Crowdsourcing – A Tale of Great Romance
Summer 2008
- How do I quickly get a large number of judgements?
- Task: Message grouping for discourse understanding
- Crowdsourcing produced very reliable results

Crowdsourcing – A Tale of Great Romance
Fall 2008
- Crowdsourcing has become a standard data source
- The excitement wears off

Crowdsourcing – A Tale of Great Romance
A dark and cold day in late autumn 2009
- You need judgements for yet another experiment
- You get cheated!
- Again and again...

A Journey to the Dark Side
- Task-based overview
- What is it that malicious workers do?
- Do we have remedies?

A Journey to the Dark Side
Task: Closed class questions
- Possible cheat: uniform answering (all yes/no)
- Possible cheat: arbitrary answers
- Remedy: Good gold standard data helps
- Pitfall: Cheaters who think about the task at hand can cause a lot of trouble (e.g. relevance judgements)

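The two closed-class cheats and the gold-standard remedy can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' procedure: the function names, the `min_items` and `min_accuracy` thresholds, and the representation of a worker's answers as a dict from question id to label are all hypothetical.

```python
def gold_accuracy(answers, gold):
    """Fraction of gold questions the worker answered correctly.

    answers: dict question_id -> label given by the worker
    gold:    dict question_id -> known correct label
    Returns None if the worker saw no gold questions.
    """
    scored = [q for q in gold if q in answers]
    if not scored:
        return None
    return sum(answers[q] == gold[q] for q in scored) / len(scored)


def is_uniform(answers, min_items=5):
    """True if the worker gave one and the same label to every item.

    min_items (hypothetical threshold) avoids flagging very short sessions.
    """
    if len(answers) < min_items:
        return False
    return len(set(answers.values())) == 1


def flag_worker(answers, gold, min_accuracy=0.7):
    """Return a list of reasons to inspect this worker (empty if none)."""
    reasons = []
    if is_uniform(answers):
        reasons.append("uniform")
    acc = gold_accuracy(answers, gold)
    if acc is not None and acc < min_accuracy:
        reasons.append("low_gold_accuracy")
    return reasons
```

A check like this catches uniform and arbitrary answering only. As the pitfall on the slide notes, a cheater who thinks about the task may pass the gold questions, so such a filter is a first line of defence rather than a guarantee.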
A Journey to the Dark Side
Task: Open class questions
- Possible cheat (1): Copy and paste standard text
- Possible cheat (2): Copy and paste domain-specific text
- Remedy: (1) is easy to detect; (2) is problematic

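One way cheat (1) might be detected is by looking for the same text pasted by one worker into several different questions. The sketch below assumes submissions arrive as `(worker_id, question_id, text)` tuples; that layout, the function names, and the `min_repeats` threshold are hypothetical choices, not something the talk specifies.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse whitespace so trivial edits do not hide a paste."""
    return re.sub(r"\s+", " ", text.strip().lower())


def repeated_answers(submissions, min_repeats=3):
    """Find workers who reuse the same free-text answer across questions.

    submissions: iterable of (worker_id, question_id, text)
    Returns {worker_id: {normalized_text: number_of_distinct_questions}}
    for every text used on at least min_repeats distinct questions.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for worker, question, text in submissions:
        seen[worker][normalize(text)].add(question)

    flagged = {}
    for worker, texts in seen.items():
        reused = {t: len(qs) for t, qs in texts.items() if len(qs) >= min_repeats}
        if reused:
            flagged[worker] = reused
    return flagged
```

This only addresses cheat (1). Domain-specific text copied from elsewhere differs from question to question, so exact-match counting would not catch cheat (2), which is consistent with the slide calling it problematic.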
A Journey to the Dark Side
Task: Internal quality control
- Possible cheat: artificially boost your own confidence
- Possible cheat: even worse, do so in a network
- Remedy: We need a better confidence measure than prior acceptance rate
- Pitfall: Due to the large scale of HITs it is hard to find a reliable confidence measure

A Journey to the Dark Side
Task: External quality control
- Setup: redirect workers to your own site and let them do the HITs there
- Possible cheat: make up confirmation token
- Possible cheat: re-use genuine token
- Possible cheat: claim that you did not get a token
- Remedy: all of the above are easy to detect

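The talk does not say how the confirmation tokens were generated or checked. As a sketch under stated assumptions, one common approach is to derive the token from the worker id and HIT id with a keyed hash (HMAC), so that made-up tokens fail verification, tokens passed on to another worker fail as well, and repeated redemptions are visible. `SECRET_KEY`, `TokenChecker`, and the 16-character truncation are hypothetical choices for illustration.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical; keep it off the HIT page


def make_token(worker_id, hit_id, key=SECRET_KEY):
    """Derive a token that is only valid for this worker on this HIT."""
    msg = f"{worker_id}:{hit_id}".encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()[:16]


class TokenChecker:
    """Tracks issued and redeemed tokens on the requester's own site."""

    def __init__(self, key=SECRET_KEY):
        self.key = key
        self.issued = set()    # (worker_id, hit_id) pairs that finished the task
        self.redeemed = set()  # pairs whose token was already accepted

    def issue(self, worker_id, hit_id):
        """Call when a worker completes the task on the external site."""
        self.issued.add((worker_id, hit_id))
        return make_token(worker_id, hit_id, self.key)

    def was_issued(self, worker_id, hit_id):
        """Answers the 'I never got a token' claim from the site's own log."""
        return (worker_id, hit_id) in self.issued

    def check(self, worker_id, hit_id, token):
        """Return 'ok', 'invalid' (made up or borrowed) or 'reused'."""
        expected = make_token(worker_id, hit_id, self.key)
        if not hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8")):
            return "invalid"
        if (worker_id, hit_id) in self.redeemed:
            return "reused"
        self.redeemed.add((worker_id, hit_id))
        return "ok"
```

Binding the token to the worker id is what makes a borrowed genuine token fail, and the `issued` log lets the requester check a missing-token claim against what the site actually recorded.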
Is all Lost?
- Posterior detection and filtering of cheaters works reliably
- But we waste resources (money, time, nerves...)
- Can we discourage cheaters from doing our HIT in the first place?

Is all Lost?
Which HIT types do cheaters like?
- The Summer 2008 HIT hardly attracted any cheaters
- The one in Autumn was swamped by them
- The Summer task required a lot of creativity whereas the Autumn one was a straightforward relevance judgement

Is all Lost?
- Hypothesis: “If the HIT conveys the impression of requiring creativity, cheaters are less likely to take it.”
- 2 HIT types:
  - Suitability for children
  - Standard relevance judgements

Task/Interface Design

Crowd Filtering

Conclusion
- The share of malicious workers can be significantly reduced by making your task:
  - Innovative
  - Creative
  - Non-repetitive
- Crowd Filtering can help to reduce the share of malicious workers at the cost of higher completion time.
- Previous acceptance rate is not a robust predictor of worker reliability.

Thank You!
Questions, Remarks, Concerns?