Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India.


1 Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India

2 Song Lyrics

3 Reviews and Forums

4 Facebook and Twitter

5 And a lot more

6 Many languages that use non-Roman scripts:
- Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
- Persian
- Languages of the Indian subcontinent (Indian languages, Dzongkha, Nepali, Sinhala)
- Thai, Vietnamese
- Cyrillic (Russian, Ukrainian)
- Chinese, Japanese, Korean (rare)

7 Code Mixing Transliteration Errors, Contraction

8 Mono-script Monolingual IR in transliterated space
- Query: thandee hava yeh chandni suhanee
- Results: Only Roman transliterated documents
- Challenge: Spelling variations (e.g. tandee hawa ye chandny soohaany)
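To make the spelling-variation challenge concrete, here is a minimal sketch of one possible way to match the two spellings of the query shown on this slide. It is not the method proposed in the presentation: `translit_key` and `query_key` are hypothetical helpers, and the consonant-skeleton rules are a crude heuristic chosen only to cover this one example.

```python
import re

def translit_key(word: str) -> str:
    """Reduce a Romanized word to a crude consonant skeleton.

    Illustrative heuristic only (an assumption, not from the slides):
    it will merge many unrelated words in a real collection.
    """
    w = word.lower()
    if not w:
        return ""
    w = re.sub(r"(.)\1+", r"\1", w)         # collapse repeats: ee -> e, oo -> o
    w = w.replace("w", "v")                 # treat w and v alike: hawa ~ hava
    w = re.sub(r"([bcdgjkpt])h", r"\1", w)  # drop aspiration marker: th -> t
    w = re.sub(r"(?<=[aeiou])h$", "", w)    # drop final h after a vowel: yeh ~ ye
    # Keep the first character, strip vowels (and y) from the rest.
    return w[0] + re.sub(r"[aeiouy]", "", w[1:])

def query_key(text: str) -> tuple:
    """Apply translit_key to every whitespace-separated token."""
    return tuple(translit_key(tok) for tok in text.split())

if __name__ == "__main__":
    a = "thandee hava yeh chandni suhanee"
    b = "tandee hawa ye chandny soohaany"
    print(query_key(a))
    print(query_key(a) == query_key(b))
```

Indexing documents under such keys trades precision for recall; a real system would more likely learn the variation patterns from transliteration pairs than hard-code them.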

9 Cross-script and Multi-script Monolingual IR in transliterated space
- Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
- Results: Documents both in Roman transliteration and in native script
- Challenge: Transliteration

10 Cross-script and Cross-lingual IR
- Query: death of mareech and subahoo
- Document: Hindi (Transliterated and Devanagari) and English documents

11 Mono-script Monolingual IR
- Transliterated query in Roman
- Transliterated documents in Roman
Cross-script Monolingual IR
- Transliterated query in Roman
- Transliterated documents in native script
Multi-script Monolingual IR
- Query in Roman or native script
- Documents in Roman and native scripts
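The multi-script setting implies that a system must first recognise which script a query is written in. A minimal sketch of that step follows, assuming it is enough to read the script from each character's Unicode name; `scripts_in` is a hypothetical helper, not part of any tool described in the slides.

```python
import unicodedata

def scripts_in(text: str) -> set:
    """Return the set of scripts used by letters and combining marks in text.

    Assumption: the first word of a character's Unicode name
    (e.g. "LATIN", "DEVANAGARI") is a good enough script label here.
    """
    found = set()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information;
        # spaces, digits and punctuation are skipped.
        if unicodedata.category(ch)[0] in "LM":
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

if __name__ == "__main__":
    print(scripts_in("thandee hava yeh chandni"))
    print(scripts_in("ठंडी हवा ये चाँदनी"))
```

A query that yields only `LATIN` could then be sent to a transliterated-text index, and one that yields `DEVANAGARI` to a native-script index, with mixed queries going to both.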

12
- Language identification of transliterated queries, documents, code-mixed text
  kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML
- Transliteration
  - Forward: കഴിക്കാൻ → kazhikkan
  - Backward: kazhikkan → കഴിക്കാൻ
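The word-level labelling on this slide can be sketched as a simple lexicon lookup. This is only an illustration under strong assumptions: `ENGLISH_WORDS` is a toy, hypothetical lexicon, `tag_tokens` is a hypothetical helper, and every token not found in the lexicon is assumed to be Malayalam (ML). A realistic tagger would need context and much larger resources.

```python
# Toy English lexicon (hypothetical; a real one would hold many thousands of words).
ENGLISH_WORDS = {"split", "pea", "soup"}

def tag_tokens(text, english_lexicon=ENGLISH_WORDS, default_tag="ML"):
    """Tag each whitespace-separated token as EN if it is in the lexicon,
    otherwise with default_tag (assumed here to be Malayalam)."""
    return [
        (tok, "EN" if tok.lower() in english_lexicon else default_tag)
        for tok in text.split()
    ]

if __name__ == "__main__":
    sentence = "kooda kazhikkan oru urgan split pea soup undaki"
    for token, tag in tag_tokens(sentence):
        print(f"{token}\t{tag}")
```

Lookup alone breaks down on words that are valid in both languages and on creative Roman spellings, which is part of what makes this a worthwhile shared-task problem.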

13
- 20,000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
- 35,000 unique Hindi-Roman word pairs obtained by aligning Bollywood song lyrics
- More data under preparation from Facebook, covering mixtures of various languages
- Looking for partners to extend!

14
- Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics
- Looking for partners to extend it to other (Indian) languages
- Other domains?

15 Thank you! monojitc@microsoft.com

16
- Lexicons
- Pronunciation lexicons
- G2P for some languages
- Stemmers and morphological analyzers
- Anything else?

17
- We have built a Multi-script Bollywood Song Search and are working on transliteration and code-mixing
- These are just some initial ideas that came out of our experience
- If you are interested, please let me know

