Shared Task Proposal, FIRE 2012
Monojit Choudhury
Microsoft Research Lab India
Song Lyrics
Reviews and Forums
Facebook and Twitter
And a lot more
Many languages that use a non-Roman script:
Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
Persian
Indian subcontinental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)
Code Mixing
Transliteration
Errors, Contraction
Mono-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni suhanee
Results: Only Roman transliterated documents
Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
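The spelling-variation challenge can be illustrated with a rough normalization key. The sketch below is only an assumption about one possible approach, not the method proposed in the task: the function name `skeleton` and its rules (map w to v, drop an h that follows a consonant or ends the word, drop non-initial vowels, collapse repeated letters) are hypothetical and tuned only to the example on this slide.

```python
import re

def skeleton(word: str) -> str:
    """Hypothetical normalization key for Roman-transliterated Hindi words.

    The rules are illustrative assumptions, not a published algorithm.
    """
    w = word.lower().replace("w", "v")          # hawa / hava
    w = re.sub(r"(?<=[^aeiouy])h", "", w)       # th -> t, ch -> c (h after a consonant)
    w = re.sub(r"h$", "", w)                    # yeh -> ye (word-final h)
    if not w:
        return w
    head, tail = w[0], w[1:]
    tail = re.sub(r"[aeiouy]", "", tail)        # drop vowels after the first letter
    return re.sub(r"(.)\1+", r"\1", head + tail)  # collapse repeated letters

def query_key(query: str) -> list[str]:
    """Normalize every word of a query."""
    return [skeleton(tok) for tok in query.split()]

original = "thandee hava yeh chandni suhanee"
variant = "tandee hawa ye chandny soohaany"
print(query_key(original))
print(query_key(variant))
print(query_key(original) == query_key(variant))
```

A key this aggressive will also merge unrelated words, so in practice it would more likely serve for candidate generation than for final matching.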
Cross-script and Multi-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
Results: Documents in Roman transliteration or in native script
Challenge: Transliteration
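A multi-script system first has to know which script a query is written in. A minimal sketch, assuming that the first word of a character's Unicode name is a good enough proxy for its script (true for LATIN and DEVANAGARI, not guaranteed for every script); the function name `scripts_in` is hypothetical.

```python
import unicodedata

def scripts_in(text: str) -> set[str]:
    """Return the set of scripts used by the letters and marks in `text`.

    Assumption: the first word of the Unicode character name
    (e.g. 'LATIN', 'DEVANAGARI') identifies the script.
    """
    found = set()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] not in "LM":
            continue
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split()[0])
    return found

print(scripts_in("thandee hava yeh chandni"))
print(scripts_in("ठंडी हवा ये चाँदनी"))
print(scripts_in("thandee hava ये चाँदनी"))
```

A query that returns more than one script would fall under the multi-script setting, while a purely Roman query against native-script documents would fall under the cross-script setting.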
Cross-script and Cross-lingual IR
Query: death of mareech and subahoo
Documents: Hindi (transliterated and Devanagari) and English documents
Mono-script Monolingual IR: transliterated query in Roman; transliterated documents in Roman
Cross-script Monolingual IR: transliterated query in Roman; transliterated documents in native script
Multi-script Monolingual IR: query in Roman or native script; documents in Roman and native scripts
Language identification of transliterated queries, documents, code-mixed text
kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML
Transliteration
Forward: കഴിക്കാന് → kazhikkan
Backward: kazhikkan → കഴിക്കാന്
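The word-level labeling in the example can be sketched as a toy lexicon lookup with a token-accuracy check. Everything here is an illustrative assumption: the three-word `ENGLISH_WORDS` set, the function names, and the rule that any token outside the English list is Malayalam (ML). A real system would need a large lexicon or a trained classifier.

```python
# Hypothetical, tiny English word list used only for this illustration.
ENGLISH_WORDS = {"split", "pea", "soup"}

def tag_tokens(text, english_words=ENGLISH_WORDS, other="ML"):
    """Tag each whitespace-separated token as 'EN' or the fallback language."""
    return [(tok, "EN" if tok.lower() in english_words else other)
            for tok in text.split()]

def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

sentence = "kooda kazhikkan oru urgan split pea soup undaki"
gold = "ML ML ML ML EN EN EN ML".split()

tagged = tag_tokens(sentence)
predicted = [tag for _, tag in tagged]
print(tagged)
print(token_accuracy(predicted, gold))
```

The lookup succeeds on this sentence only because the English words were hand-picked; misspelled or contracted English tokens would be mislabeled, which is part of what makes the task hard.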
word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics
More data is under preparation from Facebook on a mixture of various languages.
Looking for partners to extend!
Currently we have 500 queries and relevance-judged query-URL pairs for Bollywood song lyrics
Looking for partners to extend it to other (Indian) languages
Other domains?
Thank you!
Lexicons
Pronunciation lexicons
G2P (grapheme-to-phoneme) for some languages
Stemmers and morphological analyzers
Anything else?
We have built Multi-script Bollywood Song Search and are working on transliteration and code-mixing.
These are just some initial ideas that came up from our experiences.
If you are interested, please let me know.