CURRENT STATUS OF ILMT: A perspective of translation from Marathi to Hindi


Architecture

Example Flow

Modules Extensively Improved
- Morph Analyzer
- Lexical Transfer

Marathi MA changes
- The morph analyzer was modified to resolve the issues found while testing its output.
- The resources were updated by adding new roots to the lexicon and by creating several new SRRs.
- The updated resources now cover all the words in the Marathi wordnet.
- TAM labels were revised.
- Methods were developed for handling Taddhitas (i.e. words derived from nouns, adjectives and adverbs) and compounds, but they are not integrated into the ILMT pipeline.
- Current accuracy is 95% on ILMT data.
- The stand-alone morphological analyzer also reports the derivational process.

Marathi Compounding
- In linguistics, a compound word is a lexeme that consists of more than one stem. Compounds are a kind of multiword expression (MWE).
- Their properties are easier to predict than those of MWEs in general.
- Example: मामामामी {mamamami} {uncle-aunt (maternal)} (a noun).
- Most Marathi compounds have only 2 stems; 3-stem cases are rare.
- भाऊबहीण {bhaubahin} {brother-sister} has a Hindi equivalent भाई - बेहेन {bhai-behen}.
- Individual components are directly translated.
- This is an advantage for close languages like Marathi and Hindi.

Problem Definition
- Given a word containing two components (and hence roots) a and b, inflected and appended with suffixes, identify each one and provide linguistic information and the category of the compound word:
- Field 1 :
- Field 2 : ;…;.
- fs af means 'feature structure in abbreviated form'.
- CGNPTAM means 'grammatical category, gender, number, person, tense, aspect and modality'.
- Fincat: grammatical category of the resultant word.
- If there are no features, give only the root words with a short description.
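The splitting step of this problem can be sketched as follows. This is a minimal illustration only: it assumes a toy root lexicon and ignores inflection and suffix handling entirely. The names `ROOTS` and `split_compound` and the category labels are hypothetical and are not part of the ILMT morph analyzer.

```python
from typing import Optional, Tuple

# Hypothetical toy lexicon: root -> grammatical category (illustrative only).
ROOTS = {
    "मामा": "n",
    "मामी": "n",
    "भाऊ": "n",
    "बहीण": "n",
}


def split_compound(word: str) -> Optional[Tuple[str, str]]:
    """Return (a, b) if word is the concatenation of two known roots, else None."""
    # Try every split point; accept the first one where both halves are roots.
    for i in range(1, len(word)):
        a, b = word[:i], word[i:]
        if a in ROOTS and b in ROOTS:
            return a, b
    return None
```

For instance, `split_compound("मामामामी")` yields the two components मामा and मामी. A real analyzer would also have to strip suffixes and undo inflection before the lexicon lookup.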

Taxonomy of Compounds
Compound words
  Words with both components meaningful
    No duplication: मामामामी {mamamami} {uncle-aunt (maternal)}
    Partial reduplication: गरमागरम {garmagaram} {very hot}
  Words with only one component meaningful
    First component meaningful: अर्धामुर्धा {aardha murdha} {halfway}
    Second component meaningful: शिसपेन्सिल {sispencil} {penpencil}
  Negation words: अयोग्य {ayogya} {inappropriate}
  Reduplication words: तुकडेतुकडे {tukdetukde} {in pieces}
  Sandhi words: अत्यूर्जा {atyurja} {too much energy}
  Echo words: रस्तोरस्ती {rastorasti} {every road}
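One category in this taxonomy, full reduplication, can be recognised from the surface form alone. The sketch below is a hedged illustration, not the method used in the system; `is_full_reduplication` is a hypothetical helper, and it only checks whether the two halves of the string are identical.

```python
def is_full_reduplication(word: str) -> bool:
    """Surface check: True if the word is some non-empty string repeated twice."""
    half, remainder = divmod(len(word), 2)
    return remainder == 0 and half > 0 and word[:half] == word[half:]
```

Under this check तुकडेतुकडे counts as reduplication, while मामामामी (two distinct roots) and गरमागरम (partial reduplication) do not. Because it looks only at the string, it would need a lexicon to avoid false positives on short words whose halves happen to match.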

Results
Columns: No., Type, Input word count, Split count, Analysed count, Percentage correctly analysed
1 Both roots distinct and meaningful %
2 Partial Reduplication %
3 Only first root meaningful %
4 Only second root meaningful %
5 Negation words 12 100%
6 Reduplication words %
7 Echo words 54 100%
8 Sandhi words %

Marathi Synset Linkage
- Total number of synsets for which words were cross-linked: 18,000
- These links are now reflected in the bilingual dictionary used for lexical transfer.
- Total Marathi synsets :
- Total unique words :
- Total linked synsets : 23,967

Corpus Statistics
- Tourism: size = 240,000 words
- Health: size = 255,000 words
- General: size = 30.7 million words (news domain)
POS corpus annotated (tagged and cross-checked)
- General domain: 263,037 words
- Tourism domain: 136,640 words
- Health domain: 44,202 words (Set 1)
- Health domain: 21,345 words (Set 2)

Lexical Transfer Module changes
- The dictionary currently has 316 Akhyata pairs, 68 Kridanta pairs, and 40 entries for irregular mappings.
- A number of bugs involving the transfer of the base forms of verbs have been eliminated.
- Bugs that caused sudden system crashes due to improper coding have been eliminated.
- The lexical transfer module now selects the first synset in sequence corresponding to the given word.
- Transfer of ordinals, conjunctions, etc. has also been included.
- The features of the NER module are now being properly utilized for the transliteration of the necessary named entities.
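The first-synset selection described above can be sketched roughly as below. The dictionary layout, the entries, the fallback behaviour, and the names `BILINGUAL` and `transfer` are all assumptions made for illustration; they do not describe the actual ILMT data format.

```python
from typing import Dict, List

# Hypothetical bilingual dictionary: Marathi root -> ordered list of synsets,
# each synset being a list of Hindi equivalents (toy entry, illustrative only).
BILINGUAL: Dict[str, List[List[str]]] = {
    "माकड": [["बंदर"], ["वानर"]],
}


def transfer(word: str) -> str:
    """Pick the first Hindi word of the first synset listed for the Marathi word."""
    synsets = BILINGUAL.get(word)
    if not synsets or not synsets[0]:
        # Assumption: unknown words are passed through unchanged.
        return word
    return synsets[0][0]
```

Selecting the first synset is only as good as the ordering of synsets in the dictionary, which is why synset ordering matters for comprehensibility.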

Current Status
- Results by CDAC Pune
- For Health:
  - Comprehensibility/Adequacy: 81%
  - Fluency: 53%
- For Tourism:
  - Comprehensibility/Adequacy: 78%
  - Fluency: 52%

Evaluation Method
- S5: number of score-5 sentences
- S4: number of score-4 sentences
- S3: number of score-3 sentences
- N: total number of sentences
Score =
- Score 5: Correct translation
- Score 4: Understandable with minor errors
- Score 3: Understandable with major errors
- Score 2: Not understandable
- Score 1: Nonsense translation
Linguists give each sentence a score out of 5 without foreknowledge of its meaning. The score reflects the subjective quality of the sentence.
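The exact weighting of S5, S4 and S3 is not given here, so the sketch below leaves the weights as explicit parameters. It assumes only that the score is some weighted sum of the three counts divided by N; the function name `evaluation_score` and the parameters `w5`, `w4`, `w3` are hypothetical.

```python
def evaluation_score(s5: int, s4: int, s3: int, n: int,
                     w5: float, w4: float, w3: float) -> float:
    """Weighted share of sentences scored 5, 4 or 3 out of n sentences.

    The weights are caller-supplied assumptions, not values from the slides.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of sentences")
    return (w5 * s5 + w4 * s4 + w3 * s3) / n
```

With all three weights set to 1, the result is simply the fraction of sentences that received a score of 3 or higher, i.e. those judged at least understandable.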

Examples (fluency)
- Extremely fluent
  कॉंग्रेसने बापूजींचा " इव्हेंट ' केला.
  Congress made an event about Baapuji
  कॉंग्रेस ने बापूजी के इव्हेंट किया ।
- Moderately/syntactically fluent
  आइन्स्टाइन एकदा म्हणाला होता, की नव्या युगातील तरुणाईला बापूजी म्हणजे एखाद्या चमत्कारासारखे वाटतील.
  Einstein once said that the youth in the new age will feel that Baapuji is like a miracle
  आइन्स्टाइन कभी कहा है, कि नया युग के तरुणाई को बापूजी अर्थात् एकाध चमत्कारसारखे बाँटएंगे ।
- Poor fluency
  कारण मुळात ती बापूजींची राहिलेली नाही.
  Because it basically did not remain of Baapuji
  क्योंकि आदि में वह बापूजी के बचलेली नाह ।

Examples (meaning transfer)
- Exact meaning transfer
  येथे काही जातींची माकडे आणि कांगारू दिसतात.
  Here we/one (can) see a few species of monkeys and kangaroos
  यहाँ कुछ प्रकारों के बंदर और कंगारू दिखते हैं ।
- Medium-level meaning transfer
  कारण ग्रंथ खरेच एक महत्त्वाचे ठिकाण आणि ऊर्जा केंद्रही असते.
  Because a book truly is an important place and a source of power
  क्योंकि ग्रंथ सचही एक महत्व के स्थान और ऊर्जा केंद्रभी रहता है ।
- Complete distortion
  स्वाभाविकच गांधींचा शोध - पुनर्शोधही चालूच राहिला.
  Naturally, Gandhiji's search-research (of self and the world) continued
  स्वाभाविकही गाँधियों के खोज - पुनर्शोध चलएंगे च राहा ।

Another example of high fluency
  पहिल्या टप्प्यासाठी पाचऐवजी सहा रुपये, तर जलद बससाठी सात रुपये भाडे प्रस्तावित आहे.
  Pahilya tappyasathi paachaiwaji saha rupaye, tar jalad bussathi saat rupaye bhaade prastavit ahe.
  For the first stage, a fare of six rupees instead of five, and for fast buses seven rupees, has been proposed.
  पहले फ़ासले के लिए पाँच के बदले छ रुपये, तो द्रुत बस के लिए सात रुपये किराया प्रस्तावित रह ।

DEMO
- ce/admin/login.php
- Complete replication of the offline dashboard tool

Current pain points
- Fluency depends on proper translation of suffixes, case markers and function words.
- Marathi has 2 kinds of verb suffixes: Kridantas (non-finite) and Akhyatas (finite).
- The verb chunk label determines which dictionary to look into for suffix translation.
- Poor chunking therefore leads to poor fluency.
- There are many mistakes in suffix transfer.
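The dependence of suffix translation on the chunk label can be illustrated with a small sketch. Everything here is an assumption for illustration: the chunk labels `VGF` (finite) and `VGNF` (non-finite), the two toy suffix entries, and the names `AKHYATA`, `KRIDANTA` and `translate_suffix` are not taken from the ILMT dictionaries.

```python
from typing import Dict, Optional

# Toy suffix dictionaries (illustrative entries only, not ILMT data).
AKHYATA: Dict[str, str] = {"तो": "ता है"}   # finite verb suffixes
KRIDANTA: Dict[str, str] = {"ून": "कर"}     # non-finite verb suffixes

# Assumed mapping from verb chunk label to the dictionary to consult.
DICT_BY_LABEL: Dict[str, Dict[str, str]] = {
    "VGF": AKHYATA,
    "VGNF": KRIDANTA,
}


def translate_suffix(chunk_label: str, suffix: str) -> Optional[str]:
    """Look the suffix up in the dictionary chosen by the chunk label."""
    table = DICT_BY_LABEL.get(chunk_label)
    if table is None:
        return None
    return table.get(suffix)
```

In this sketch a finite suffix looked up under the wrong label finds no translation (`translate_suffix("VGNF", "तो")` returns `None`), which mirrors how a chunking error propagates into suffix transfer.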

Current pain points
- Synsets in the wordnet are not ordered by first sense.
- First-sense WSD is therefore not applicable to words that the current WSD engine does not disambiguate.
- This affects comprehensibility.

Action plan for Lexical Transfer Module
- Split the current transfer module into two parts: one for lexical transfer and the other for grammar transfer.
- Look into statistical mechanisms for grammar transfer as well as lexical transfer to improve accuracy.
- Include mechanisms to handle the double Vibhaktis reported by the Vibhakti Computation module.

Action plan for MA
- Improve the accuracy of the system further by adding new roots and SRR rules.
- Revise the FSM rules for Kridantas to eliminate some glaring mistakes.
- Create more rules to handle more Taddhitas and compounds, and integrate them into the ILMT pipeline.
- Use other fields in the morph analyzer's output, e.g. a flag to indicate the emphatic marker.
- Update the morph analyzer to handle the double feature structure of genitive forms.

Other steps
- Develop a simple parser for Marathi.
- Improve the chunker.
- Continue linking more Marathi synsets and complete the linkage of the current 37,617 Hindi synsets.
- Evaluate on randomly selected web documents (about per week) and improve the outputs immediately.