Published by Frederica Morris. Modified over 9 years ago.
1
Shared Task Proposal, FIRE 2012
Monojit Choudhury, Microsoft Research Lab India
2
Song Lyrics
3
Reviews and Forums
4
Facebook and Twitter
5
And a lot more
6
Many languages that use non-Roman script:
  Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
  Persian
  Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
  Thai, Vietnamese
  Cyrillic (Russian, Ukrainian)
  Chinese, Japanese, Korean (rare)
7
Code Mixing Transliteration Errors, Contraction
8
Mono-script Monolingual IR in transliterated space
  Query: thandee hava yeh chandni suhanee
  Results: Only Roman transliterated documents
  Challenge: Spelling variations (tandee hawa ye chandny soohaany)
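The spelling-variation challenge on this slide can be illustrated with a small normalization sketch. The rewrite rules below are illustrative assumptions chosen to cover the slide's example query; they are not the method proposed in the talk, and a real system would likely need learned or language-specific rules.

```python
import re

def normalize(word: str) -> str:
    """Map a Roman-transliterated word to a crude spelling-invariant key.

    The rules are hypothetical and tuned only to the slide's example.
    """
    w = word.lower()
    w = w.replace("w", "v")                # treat w and v as the same sound
    w = re.sub(r"e{2,}", "i", w)           # long "ee" becomes "i"
    w = re.sub(r"o{2,}", "u", w)           # long "oo" becomes "u"
    w = re.sub(r"a{2,}", "a", w)           # long "aa" becomes "a"
    w = w.replace("y", "i")                # treat y and i as the same sound
    w = re.sub(r"(?<=[^aeiou])h", "", w)   # drop h that follows a consonant
    w = re.sub(r"h$", "", w)               # drop a word-final h
    return w

def query_key(text: str) -> tuple:
    """Normalize every whitespace-separated token of a query or document."""
    return tuple(normalize(tok) for tok in text.split())

query = "thandee hava yeh chandni suhanee"
variant = "tandee hawa ye chandny soohaany"
print(query_key(query) == query_key(variant))
```

Indexing documents by such keys instead of raw tokens lets the two spellings retrieve the same documents, at the cost of occasionally merging words that are genuinely different.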
9
Cross-script and Multi-script Monolingual IR in transliterated space
  Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
  Results: Documents in Roman transliteration or in native script
  Challenge: Transliteration
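Before a multi-script system can transliterate a query, it has to know which script the query is written in. A minimal sketch of that step, assuming it is enough to take the majority script of the letters as reported by Python's standard `unicodedata` module (the function name `detect_script` is hypothetical):

```python
import unicodedata
from collections import Counter

def detect_script(text: str) -> str:
    """Return the dominant script name (e.g. 'LATIN', 'DEVANAGARI') of a string."""
    counts = Counter()
    for ch in text:
        # Count letters (L*) and combining marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                # The first word of a Unicode character name is its script for these ranges.
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(detect_script("thandee hava yeh chandni"))
print(detect_script("ठंडी हवा ये चाँदनी"))
```

A query detected as Devanagari could then be searched directly against native-script documents and transliterated for Roman ones, and the reverse for a Roman query.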
10
Cross-script and Cross-lingual IR
  Query: death of mareech and subahoo
  Documents: Hindi (transliterated and Devanagari) and English documents
11
Mono-script Monolingual IR
  Transliterated query in Roman
  Transliterated documents in Roman
Cross-script Monolingual IR
  Transliterated query in Roman
  Transliterated documents in native script
Multi-script Monolingual IR
  Query in Roman or native script
  Documents in Roman and native scripts
12
Language identification of transliterated queries, documents, code-mixed text
  kooda  kazhikkan  oru  urgan  split  pea  soup  undaki
  ML     ML         ML   ML     EN     EN   EN    ML
Transliteration
  Forward: കഴിക്കാന് → kazhikkan
  Backward: kazhikkan → കഴിക്കാന്
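Both subtasks on this slide can be sketched as simple lookups. The example below assumes a toy English lexicon and a single transliteration pair taken from the slide; `ENGLISH_WORDS`, `PAIRS`, and `tag_tokens` are hypothetical names, and a real system would need much larger resources and a statistical model for unseen words.

```python
from collections import defaultdict

# Hypothetical toy lexicon: only the English words in the slide's example.
ENGLISH_WORDS = {"split", "pea", "soup"}

def tag_tokens(text, english=ENGLISH_WORDS):
    """Label each token EN if it is in the English lexicon, else ML (Malayalam)."""
    return [(tok, "EN" if tok.lower() in english else "ML") for tok in text.split()]

# (native, Roman) word pairs; only the pair shown on the slide.
PAIRS = [("കഴിക്കാന്", "kazhikkan")]

# Forward transliteration: native script to Roman.
forward = {native: roman for native, roman in PAIRS}

# Backward transliteration: Roman to native script. One Roman spelling
# may correspond to several native words, so keep a list of candidates.
backward = defaultdict(list)
for native, roman in PAIRS:
    backward[roman].append(native)

for token, tag in tag_tokens("kooda kazhikkan oru urgan split pea soup undaki"):
    print(token, tag)
print(forward[PAIRS[0][0]], backward["kazhikkan"])
```

The lexicon fallback labels every unknown token ML, which only works here because the sentence mixes exactly two languages.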
13
20,000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
35,000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics
More data under preparation from Facebook on mixtures of various languages
Looking for partners to extend!
14
Currently we have 500 query and URL pairs with relevance judgments for Bollywood song lyrics
Looking for partners to extend it to other (Indian) languages
Other domains?
15
Thank you! monojitc@microsoft.com
16
Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers
Anything else?
17
We have built Multi-script Bollywood Song Search and are working on transliteration and code-mixing
These are just some initial ideas that came up from our experiences
If you are interested, please let me know