Shared Task Proposal, FIRE 2012
Monojit Choudhury, Microsoft Research Lab India

Song Lyrics

Reviews and Forums

Facebook and Twitter

And a lot more

• Many languages that use non-Roman script
  • Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
  • Persian
  • Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
  • Thai, Vietnamese
  • Cyrillic (Russian, Ukrainian)
  • Chinese, Japanese, Korean (rare)

Code Mixing Transliteration Errors, Contraction

Mono-script Monolingual IR in transliterated space
• Query: thandee hava yeh chandni suhanee
• Results: Only Roman transliterated documents
• Challenge: Spelling variations
  • tandee hawa ye chandny soohaany
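One way to picture the spelling-variation challenge is a normalizer that maps different Roman spellings of the same word to a shared key before indexing and querying. The sketch below is only an illustration: the rule list `RULES` and the function names are hypothetical, hand-picked to cover the example query above, and far from a complete model of Hindi romanization.

```python
import re

# Hypothetical rewrite rules, applied in order. They are hand-picked to cover
# the variants in the example query and are NOT a full romanization model.
RULES = [
    (r"w", "v"),                # hawa -> hava
    (r"ee", "i"),               # thandee -> thandi
    (r"oo", "u"),               # soohaany -> suhaany
    (r"aa", "a"),               # suhaany -> suhany
    (r"(?<=[^aeiou])y$", "i"),  # word-final y after a consonant: chandny -> chandni
    (r"(?<=[tdkgpb])h", ""),    # drop the aspiration marker: thandi -> tandi
    (r"(?<=[aeiou])h$", ""),    # drop a final h after a vowel: yeh -> ye
]


def normalize_word(word):
    """Map one Roman-transliterated word to a canonical key."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


def normalize_query(text):
    """Normalize every whitespace-separated word of a query."""
    return [normalize_word(w) for w in text.split()]


query = normalize_query("thandee hava yeh chandni suhanee")
variant = normalize_query("tandee hawa ye chandny soohaany")
print(query == variant)
```

With these rules both spellings of the query reduce to the same key sequence, so a Roman-only index keyed on the normalized forms would match either one. A real system would more likely learn such equivalences from data than hard-code them.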

Cross-script and Multi-script Monolingual IR in transliterated space
• Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
• Results: Documents in Roman transliteration or in native script
• Challenge: Transliteration

Cross-script and Cross-lingual IR
• Query: death of mareech and subahoo
• Documents: Hindi (transliterated and Devanagari) and English documents

Mono-script Monolingual IR
• Transliterated query in Roman
• Transliterated documents in Roman
Cross-script Monolingual IR
• Transliterated query in Roman
• Documents in native script
Multi-script Monolingual IR
• Query in Roman or native script
• Documents in Roman and native scripts
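The three setups differ only in which scripts the query and the document collection use, so they can be told apart by inspecting Unicode character names. The following is a minimal sketch under that assumption: `scripts_in` and `retrieval_setup` are hypothetical helper names, and taking the first word of a character's Unicode name as its script is a rough heuristic, not a formal script property lookup.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script labels found in text.

    The label is the first word of each character's Unicode name
    (e.g. "LATIN", "DEVANAGARI"). This is a heuristic.
    """
    found = set()
    for ch in text:
        # Consider letters and combining marks; skip spaces, digits, punctuation.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def retrieval_setup(query, doc_scripts):
    """Classify a query/collection pair into one of the three setups."""
    query_scripts = scripts_in(query)
    doc_scripts = set(doc_scripts)
    native_docs = doc_scripts - {"LATIN"}
    if "LATIN" in doc_scripts and native_docs:
        return "multi-script"   # documents in Roman and native scripts
    if query_scripts == {"LATIN"} and doc_scripts == {"LATIN"}:
        return "mono-script"    # Roman query, Roman documents
    if query_scripts == {"LATIN"} and native_docs:
        return "cross-script"   # Roman query, native-script documents
    return "unclassified"       # combination not covered by the three setups


print(retrieval_setup("thandee hava yeh chandni", {"LATIN"}))
print(retrieval_setup("thandee hava yeh chandni", {"DEVANAGARI"}))
print(retrieval_setup("ठंडी हवा ये चाँदनी", {"LATIN", "DEVANAGARI"}))
```

The collection's scripts are passed in as a set here for brevity; in practice they could be computed by running `scripts_in` over the documents themselves.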

• Language identification of transliterated queries, documents, code-mixed text (ML = Malayalam, EN = English)
  kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML
• Transliteration
  • Forward: കഴിക്കാൻ → kazhikkan
  • Backward: kazhikkan → കഴിക്കാൻ
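Word-level language identification on the code-mixed example can be sketched with a plain lexicon lookup. Everything here is an assumption made for illustration: `ENGLISH_WORDS` is a toy lexicon containing only the English words of the example, and falling back to ML for every unknown word is a simplification that a real tagger would replace with a trained classifier.

```python
# Hypothetical toy lexicon: only the English words from the example sentence.
ENGLISH_WORDS = {"split", "pea", "soup"}


def tag_languages(sentence, english_words=ENGLISH_WORDS, default="ML"):
    """Tag each word EN if it is in the English lexicon, else the default tag."""
    tagged = []
    for word in sentence.split():
        tag = "EN" if word.lower() in english_words else default
        tagged.append((word, tag))
    return tagged


for word, tag in tag_languages("kooda kazhikkan oru urgan split pea soup undaki"):
    print(f"{word}/{tag}")
```

A lookup like this breaks down on words that are valid in both languages or absent from the lexicon, which is part of what makes the subtask non-trivial.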

• word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
• unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics
• More data under preparation from Facebook on mixtures of various languages
• Looking for partners to extend!

• Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics
• Looking for partners to extend it to other (Indian) languages
• Other domains?

Thank you!

• Lexicons
• Pronunciation lexicons
• G2P for some languages
• Stemmers and morphological analyzers
• Anything else?

• We have built a Multi-script Bollywood Song Search and are working on transliteration and code-mixing
• These are just some initial ideas that came out of our experience
• If you are interested, please let me know