CURRENT STATUS OF ILMT A perspective of translation from Marathi to Hindi.


1 CURRENT STATUS OF ILMT A perspective of translation from Marathi to Hindi

2 Architecture

3 Example Flow

4 Modules Extensively Improved  Morph Analyzer  Lexical Transfer

5 Marathi MA changes  The morph analyzer was modified to resolve issues found while testing its output.  The resources were updated by adding 16,000 new roots to the lexicon and by creating several new SRRs.  The lexicon now covers all the words in the Marathi wordnet.  TAM labels were revised.  Methods were developed for handling Taddhitas (i.e. words derived from nouns, adjectives and adverbs) and compounds, but they are not yet integrated into the ILMT pipeline.  Current accuracy is 95% on ILMT data.  The stand-alone morphological analyzer also reports the derivational process.

6 Marathi Compounding  In linguistics, a compound word is a lexeme that consists of more than one stem. Compounds are a kind of MWE (multiword expression).  Their properties are easier to predict than those of other MWEs.  मामामामी {mamamami} {uncle-aunt (maternal)} (a noun).  Marathi compounds mostly have two stems; three-stem cases are rare.  भाऊबहीण {bhaubahin} {brother-sister} has a Hindi equivalent भाई - बेहेन {bhai-behen}.  Individual components are directly translated.  This is an advantage for close languages like Marathi and Hindi.
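The component-wise translation described on this slide can be sketched roughly as follows. This is only an illustration: the dictionary holds just the two roots from the slide's example (with the Hindi forms spelled as on the slide), and `translate_compound` is a hypothetical helper, not part of the ILMT code.

```python
# Toy Marathi -> Hindi root dictionary. The entries come from the slide's
# example only; this is NOT the actual ILMT bilingual dictionary.
ROOT_DICT = {
    "भाऊ": "भाई",
    "बहीण": "बेहेन",
}


def translate_compound(components, root_dict, joiner="-"):
    """Translate each component independently and join the results.

    Returns None if any component is missing from the dictionary, so the
    caller can fall back to some other strategy.
    """
    translated = []
    for part in components:
        if part not in root_dict:
            return None
        translated.append(root_dict[part])
    return joiner.join(translated)


if __name__ == "__main__":
    # Components are assumed to have been split already.
    print(translate_compound(["भाऊ", "बहीण"], ROOT_DICT))
```

The sketch assumes the compound has already been split into its roots; how that split is found is a separate problem.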

7 Problem Definition  Given a word containing two components (and hence roots) a and b, inflected and appended with suffixes, identify each component and provide its linguistic information and the category of the compound word:  Field 1 :  Field 2 : ;…;.  fs af means ‘feature structure in abbreviated form’.  CGNPTAM means ‘grammatical category, gender, number, person, tense, aspect and modality’.  Fincat: grammatical category of the resultant word.  If there are no features, give only the root words with a short description.
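A minimal sketch of the splitting step, assuming a plain set of known roots and ignoring inflection and suffixes entirely (which the real task has to handle). The function name and the tiny lexicon are hypothetical; the slides do not describe the actual splitting algorithm.

```python
def split_compound(word, lexicon):
    """Return every (a, b) split of `word` in which both parts are known roots.

    The word is cut at every code-point boundary. Cuts that fall inside a
    Devanagari syllable produce fragments that are simply not in the lexicon,
    so they are rejected without special handling.
    """
    splits = []
    for i in range(1, len(word)):
        a, b = word[:i], word[i:]
        if a in lexicon and b in lexicon:
            splits.append((a, b))
    return splits


# Toy lexicon built from the examples on the slides (an assumption, not the
# 16,000-root ILMT lexicon).
LEXICON = {"भाऊ", "बहीण", "मामा", "मामी"}

if __name__ == "__main__":
    for compound in ("भाऊबहीण", "मामामामी"):
        print(compound, split_compound(compound, LEXICON))
```

Returning all candidate splits rather than the first one leaves room for a later step to choose among them using the morphological features of each part.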

8 Taxonomy of Compounds
Compound words
  Words with both components meaningful
    No duplication: मामामामी {mamamami} {uncle-aunt (maternal)}
    Partial reduplication: गरमागरम {garmagaram} {very hot}
  Words with only one component meaningful
    First component meaningful: अर्धामुर्धा {aardha murdha} {halfway}
    Second component meaningful: शिसपेन्सिल {sispencil} {penpencil}
  Negation words: अयोग्य {ayogya} {inappropriate}
  Reduplication words: तुकडेतुकडे {tukdetukde} {in pieces}
  Sandhi words: अत्यूर्जा {atyurja} {too much energy}
  Echo words: रस्तोरस्ती {rastorasti} {every road}
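Part of this taxonomy can be expressed as a simple decision procedure over an already split pair of components. The sketch below covers only the categories that can be decided by lexicon lookup and string equality; partial reduplication, negation, sandhi and echo words would need additional rules that the slides do not spell out. The function and the lexicon are hypothetical.

```python
def classify_compound(a, b, lexicon):
    """Assign a coarse taxonomy label to the component pair (a, b)."""
    if a == b:
        return "reduplication"
    a_known = a in lexicon
    b_known = b in lexicon
    if a_known and b_known:
        return "both components meaningful"
    if a_known:
        return "first component meaningful"
    if b_known:
        return "second component meaningful"
    return "unknown"


# Toy lexicon taken from the slide examples (an assumption).
LEXICON = {"मामा", "मामी", "अर्धा", "तुकडे"}

if __name__ == "__main__":
    print(classify_compound("तुकडे", "तुकडे", LEXICON))
    print(classify_compound("मामा", "मामी", LEXICON))
    print(classify_compound("अर्धा", "मुर्धा", LEXICON))
```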

9 Results
No. | Type | Input word count | Split count | Analysed count | Percentage correctly analysed
1 | Both roots distinct and meaningful | 234 | 231 | 225 | 96%
2 | Partial Reduplication | 25 | - | - | 100%
3 | Only first root meaningful | 35 | - | 30 | 85.7%
4 | Only second root meaningful | 11 | 8 | 8 | 72.7%
5 | Negation words | 12 | - | - | 100%
6 | Reduplication words | 59 | - | - | 100%
7 | Echo words | 54 | - | - | 100%
8 | Sandhi words | 31 | - | 28 | 90.3%
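The reported percentages appear to be the analysed count divided by the input word count. The check below recomputes them for the rows where both numbers are legible, reading the first number in each row as the input count and the last as the analysed count; that reading is an inference from the figures, not something the slide states.

```python
# (type, input word count, analysed count, percentage as reported)
ROWS = [
    ("Both roots distinct and meaningful", 234, 225, "96%"),
    ("Only first root meaningful", 35, 30, "85.7%"),
    ("Only second root meaningful", 11, 8, "72.7%"),
    ("Sandhi words", 31, 28, "90.3%"),
]


def percent_correct(input_count, analysed_count):
    """Share of input words that were correctly analysed, in percent."""
    return 100.0 * analysed_count / input_count


if __name__ == "__main__":
    for name, n_in, n_ok, reported in ROWS:
        computed = percent_correct(n_in, n_ok)
        print(f"{name}: {computed:.1f}% (reported {reported})")
```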

10 Marathi Synset Linkage  Total number of synsets for which words were Cross-linked: 18,000  Now reflected in the bilingual dictionary used for lexical transfer  Total Marathi Synsets : 26557  Total unique words : 36394  Total linked Synsets : 23967

11 Corpus Statistics
Tourism: size = 240,000 words
Health: size = 255,000 words
General: size = 30.7 million words (news domain)
POS corpus annotated (tagged and cross-checked)
General domain: 2,63,037 words
Tourism domain: 1,36,640 words
Health domain: 44,202 words (Set 1)
Health domain: 21,345 words (Set 2)

12 Lexical Transfer Module changes  The dictionary currently has 316 Akhyata pairs, 68 Kridanta pairs, and 40 entries for irregular mappings.  A number of bugs involving the transfer of the base forms of verbs have been eliminated.  Bugs that caused sudden system crashes due to improper coding have been eliminated.  The lexical transfer module now selects the first synset in sequence corresponding to the given word.  Transfer of ordinals, conjunctions etc. has also been included.  The features of the NER module are now being properly utilized for the transliteration of the necessary named entities.
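The "first synset in sequence" policy can be illustrated with a small lookup. The data structure, the synset identifiers and the second sense listed for ग्रंथ are invented for illustration; only the Marathi to Hindi pairs माकड/बंदर and ग्रंथ/ग्रंथ are taken from the translation examples in this deck.

```python
# Hypothetical bilingual dictionary: Marathi word -> ordered list of synsets,
# each synset being (synset_id, [Hindi words]). IDs and ordering are made up.
BILINGUAL = {
    "माकड": [("syn-001", ["बंदर"])],
    "ग्रंथ": [("syn-002", ["ग्रंथ", "पुस्तक"]), ("syn-003", ["किताब"])],
}


def transfer_first_synset(word, bilingual):
    """Return the first Hindi word of the first synset listed for `word`.

    Words without an entry are returned unchanged, so a later stage (for
    example transliteration) can deal with them. Each synset is assumed to
    contain at least one Hindi word.
    """
    synsets = bilingual.get(word)
    if not synsets:
        return word
    _synset_id, hindi_words = synsets[0]
    return hindi_words[0]


if __name__ == "__main__":
    for w in ("माकड", "ग्रंथ", "कांगारू"):
        print(w, "->", transfer_first_synset(w, BILINGUAL))
```

Because the choice depends purely on list position, the output quality of this policy is only as good as the ordering of synsets in the dictionary.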

13 Current Status  Results by CDAC Pune  For Health:  Comprehensibility/Adequacy : 81%  Fluency : 53%  For Tourism:  Comprehensibility/Adequacy : 78%  Fluency : 52%

14 Evaluation Method
S5: number of sentences with score 5, S4: number of sentences with score 4, S3: number of sentences with score 3, N: total number of sentences
Score =
Score 5: Correct translation
Score 4: Understandable with minor errors
Score 3: Understandable with major errors
Score 2: Not understandable
Score 1: Nonsense translation
Linguists give each sentence a score out of 5 without foreknowledge of its meaning. The score reflects the subjective quality of the sentence.
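The formula after "Score =" is not preserved on this slide. Since it is defined in terms of S5, S4, S3 and N, one plausible shape is a weighted sum of the counts of sentences scored 3 or above, divided by N. The sketch below implements that general shape with the weights passed in by the caller; the weights and the sentence counts used in the example are placeholders, not the values used in the actual evaluation.

```python
def weighted_score(counts, weights):
    """Weighted share of sentences, given per-score counts and weights.

    counts  -- {score: number of sentences with that score}, scores 1..5
    weights -- {score: weight}; scores without a weight contribute 0
    """
    total = sum(counts.values())  # N, the total number of sentences
    if total == 0:
        raise ValueError("no sentences were scored")
    weighted = sum(weights.get(score, 0.0) * n for score, n in counts.items())
    return weighted / total


if __name__ == "__main__":
    # Made-up counts for 100 sentences.
    counts = {5: 40, 4: 25, 3: 15, 2: 12, 1: 8}
    # Placeholder weights: every sentence scored 3 or above counts fully.
    equal_weights = {5: 1.0, 4: 1.0, 3: 1.0}
    print(weighted_score(counts, equal_weights))
```

With equal weights the function reduces to the fraction of sentences that are at least understandable; lowering the weights for scores 4 and 3 would penalise sentences with errors.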

15 Examples  Extremely Fluent  कॉंग्रेसने बापूजींचा " इव्हेंट ' केला.  Congress made an event about Baapuji  कॉंग्रेस ने बापूजी के इव्हेंट किया ।  Moderately/Syntactically Fluent  आइन्स्टाइन एकदा म्हणाला होता, की नव्या युगातील तरुणाईला बापूजी म्हणजे एखाद्या चमत्कारासारखे वाटतील.  Einstein once said that the youth in the new age will feel that Baapuji is like a miracle  आइन्स्टाइन कभी कहा है, कि नया युग के तरुणाई को बापूजी अर्थात् एकाध चमत्कारसारखे बाँटएंगे ।  Poor Fluency  कारण मुळात ती बापूजींची राहिलेली नाही.  Because it basically did not remain of Baapuji  क्योंकि आदि में वह बापूजी के बचलेली नाह ।

16 Examples  Exact meaning transfer  येथे काही जातींची माकडे आणि कांगारू दिसतात.  Here we/one (can) see a few species of monkeys and kangaroos  यहाँ कुछ प्रकारों के बंदर और कंगारू दिखते हैं ।  Medium level meaning transfer  कारण ग्रंथ खरेच एक महत्त्वाचे ठिकाण आणि ऊर्जा केंद्रही असते.  Because a book truly is an important place and a source of power  क्योंकि ग्रंथ सचही एक महत्व के स्थान और ऊर्जा केंद्रभी रहता है ।  Complete distortion  स्वाभाविकच गांधींचा शोध - पुनर्शोधही चालूच राहिला.  Naturally, Gandhiji’s search-research (of self and the world) continued  स्वाभाविकही गाँधियों के खोज - पुनर्शोध चलएंगे च राहा ।

17 Another example of high fluency  पहिल्या टप्प्यासाठी पाचऐवजी सहा रुपये, तर जलद बससाठी सात रुपये भाडे प्रस्तावित आहे.  Pahilya tappyasathi paachaiwaji saha rupaye, tar jalad bussathi saat rupaye bhaade prastavit ahe.  For the first few steps, six instead of five rupees and for fast buses seven rupees have been proposed.  पहले फ़ासले के लिए पाँच के बदले छ रुपये, तो द्रुत बस के लिए सात रुपये किराया प्रस्तावित रह ।

18 DEMO  http://www.cfilt.iitb.ac.in/~ilmt/ilmtinterface/admin/login.php  Complete replication of the offline dashboard tool

19 Current pain points  Fluency depends on the proper translation of suffixes, case markers and function words.  Marathi has 2 kinds of verb suffixes: Kridantas (non-finite) and Akhyatas (finite).  The verb chunk label determines which dictionary to look into for suffix translation.  Poor chunking leads to poor fluency.  There are many mistakes in suffix transfer.
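The dependency between chunk labels and suffix dictionaries can be sketched as a two-level lookup. The chunk labels are assumed here to be VGF (finite verb group) and VGNF (non-finite verb group); the slide does not name the labels, and the suffix entries are placeholder strings rather than real Marathi or Hindi suffixes.

```python
# Placeholder suffix tables; real entries would be Marathi -> Hindi suffix pairs.
AKHYATA_DICT = {"MR_FINITE_SUFFIX": "HI_FINITE_SUFFIX"}
KRIDANTA_DICT = {"MR_NONFINITE_SUFFIX": "HI_NONFINITE_SUFFIX"}

# Assumed mapping from verb chunk label to the dictionary consulted.
SUFFIX_DICTS = {
    "VGF": AKHYATA_DICT,    # finite verb chunk -> Akhyata pairs
    "VGNF": KRIDANTA_DICT,  # non-finite verb chunk -> Kridanta pairs
}


def transfer_suffix(chunk_label, suffix, suffix_dicts=SUFFIX_DICTS):
    """Look up the target suffix in the dictionary chosen by the chunk label.

    Returns None when the label selects no dictionary or the suffix is not
    listed in the selected dictionary.
    """
    table = suffix_dicts.get(chunk_label)
    if table is None:
        return None
    return table.get(suffix)


if __name__ == "__main__":
    # Correct chunk label: the suffix is found.
    print(transfer_suffix("VGF", "MR_FINITE_SUFFIX"))
    # Wrong chunk label for the same suffix: the lookup fails.
    print(transfer_suffix("VGNF", "MR_FINITE_SUFFIX"))
```

The second call shows why a chunking error propagates: the right suffix is looked up in the wrong dictionary and no translation is found.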

20 Current pain points  Synsets in the wordnet are not ordered with the most common sense first.  First-sense WSD is therefore not applicable to words that the current WSD engine does not disambiguate.  This affects comprehensibility.

21 Action plan for Lexical Transfer Module  Splitting the current transfer module into two parts: one for lexical transfer and the other for grammar transfer.  Looking into statistical mechanisms for grammar transfer as well as lexical transfer to improve the accuracy.  Including mechanisms to handle the double Vibhaktis reported by the Vibhakti Computation module.

22 Action plan for MA  Improving the accuracy of the system further by adding new roots and SRR rules.  Revising the FSM rules for Kridantas to eliminate some glaring mistakes.  Creating more rules to handle more Taddhitas and compounds, and integrating them into the ILMT pipeline.  Using other fields in the morph analyzer’s output, e.g. a flag to indicate an emphatic marker.  Updating the morph analyzer to handle the double feature structure of genitive forms.

23 Other steps  Developing a simple parser for Marathi.  Improving the chunker.  Continuing to link more Marathi synsets and completing the linkage of the current 37,617 Hindi synsets.  Evaluating on randomly selected web documents (about 20-40 per week) and improving the outputs immediately.

