“Crowdsourcing”: the latest buzzword. Powered by the people. High-quality data at no cost to the end user. Better than outsourcing.
Example: “How are you?” (English) → “आप कैसे हो?” (Hindi). Two routes to a translation: Statistical Machine Translation (SMT) and crowdsourcing.
Parallel data: an SMT system learns phrasal translation correspondences from parallel data.
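To make "learning correspondences from parallel data" concrete, here is a minimal sketch. It uses IBM Model 1, a word-level alignment model trained with EM, as a simplified stand-in for the phrasal correspondences a full SMT system would extract. The three-sentence toy corpus and the names `train_ibm_model1` and `best_translation` are assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) from tokenized sentence pairs with EM."""
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Start from a uniform distribution over the target vocabulary.
    t_prob = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: re-normalize the expected counts into probabilities.
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]
    return t_prob

def best_translation(t_prob, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {t: p for (t, s), p in t_prob.items() if s == source_word}
    return max(candidates, key=candidates.get)

# Hypothetical toy parallel corpus (English -> Hindi), already tokenized.
toy_corpus = [
    ("red book".split(), "लाल किताब".split()),
    ("red house".split(), "लाल घर".split()),
    ("new house".split(), "नया घर".split()),
]

model = train_ibm_model1(toy_corpus)
learned = {word: best_translation(model, word) for word in ["red", "book", "house", "new"]}
```

Even with three sentence pairs, the repeated words ("red", "house") let EM work out which Hindi word goes with which English word. A real system needs far more data because most words and phrases occur rarely.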
About 1 million sentences are needed to build a good SMT system. Human translation is very costly, often unaffordable. Judicial domain: translation is very important to expedite cases. Is crowdsourcing the solution?
Groups of 4. Each group is to collect 5000 translations using crowdsourcing. Source language: English; target language: Hindi.
A good, user-friendly interface for translation. How to attract the crowd? Facebook, Orkut, etc.? Quality checks and spam detection!
साम (saam): explain, appeal to their logic, win them over by dialogue. दाम (daam): pay and acquire; each group will be provided ₹1000. दंड (dand): penalty in the form of marks for not meeting deadlines or targets. भेद (bhed): divide and rule; you are pitted against your classmates, with conflicts of interest on social networks.
Kinds of bad submissions: a perfectly valid Hindi sentence that has no relation to the source sentence; complete junk; syntactic or grammatical errors; output copied from Google Translate.
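Some of these cases can be caught with cheap heuristics before any human review. The sketch below is one possible starting point, not a complete spam detector. The function names, the thresholds, and the `mt_output` parameter (a machine translation of the same source that the group has obtained separately) are all assumptions for illustration. Unrelated but valid Hindi and grammatical errors are only weakly covered, by a length-ratio check.

```python
import string

def devanagari_ratio(text):
    """Fraction of non-space, non-punctuation, non-digit characters in the Devanagari block."""
    chars = [ch for ch in text
             if not ch.isspace() and ch not in string.punctuation and not ch.isdigit()]
    if not chars:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in chars) / len(chars)

def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())

def flag_submission(source, translation, mt_output=None,
                    min_ratio=0.8, max_len_ratio=3.0):
    """Return a list of warning flags for one crowd translation (thresholds are guesses)."""
    flags = []
    # Complete junk: hardly any Devanagari characters in a supposedly Hindi sentence.
    if devanagari_ratio(translation) < min_ratio:
        flags.append("junk")
    # Weak proxy for an unrelated sentence: very different token counts.
    src_len, tgt_len = len(source.split()), len(translation.split())
    if src_len and tgt_len:
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_len_ratio:
            flags.append("length_mismatch")
    # Copied machine translation: identical to the MT output supplied by the caller.
    if mt_output is not None and normalize(translation) == normalize(mt_output):
        flags.append("machine_translated")
    return flags

print(flag_submission("How are you?", "आप कैसे हो ?"))
print(flag_submission("How are you?", "asdf qwerty"))
print(flag_submission("How are you?", "आप कैसे हैं?", mt_output="आप  कैसे हैं?"))
```

A submission that returns no flags is not necessarily correct; it has only passed the cheap checks and may still need comparison against trusted translations or a second contributor's vote.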
Gold data (i.e. correct translations) is available with us. Crowd data will be compared with the gold data. Wrong translations incur a penalty (your spam detection is not working well!). The first group to submit 5000 correct sentences gets bonus points. Each group will provide a detailed account of its expenditure.
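The slides do not say how crowd data is compared with gold data. As one hedged possibility, the sketch below scores each submission by token-level similarity to the gold translation using Python's standard `difflib.SequenceMatcher` and accepts it above a threshold. The function names, the 0.8 threshold, and the sample sentences are assumptions for illustration. A sentence can have several valid translations, so a real evaluation would likely need more than one reference or a softer metric.

```python
from difflib import SequenceMatcher

def similarity(candidate, reference):
    """Token-level similarity in [0, 1] between a crowd translation and the gold one."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()

def score_group(crowd, gold, threshold=0.8):
    """Split sentence ids into accepted and rejected by similarity to the gold data."""
    accepted, rejected = [], []
    for sent_id, translation in crowd.items():
        reference = gold.get(sent_id)
        if reference is not None and similarity(translation, reference) >= threshold:
            accepted.append(sent_id)
        else:
            rejected.append(sent_id)
    return accepted, rejected

# Hypothetical gold and crowd data keyed by sentence id.
gold = {1: "आप कैसे हो", 2: "यह मेरा घर है"}
crowd = {1: "आप  कैसे हो", 2: "मुझे चाय पसंद है"}

accepted, rejected = score_group(crowd, gold)
print("accepted:", accepted, "rejected:", rejected)
```

Here sentence 1 matches the gold translation exactly once whitespace is ignored, while sentence 2 shares only one of four tokens with its reference and is rejected. The rejected count is what a penalty for wrong translations would be based on.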
No false promises: “Translate 1 sentence and win an SUV!!” Stick to your promises: “Promise a free t-shirt, give a free t-shirt!!” Avoid monetary transactions; give goodies instead.