Published by Tyrone Townsend. Modified over 9 years ago.
1
CURRENT STATUS OF ILMT
A perspective on translation from Marathi to Hindi
2
Architecture
3
Example Flow
4
Modules Extensively Improved
Morph Analyzer
Lexical Transfer
5
Marathi MA changes
The morphological analyzer (MA) was modified to resolve issues found while testing its output.
The resources were updated by adding 16,000 new roots to the lexicon and by creating several new SRRs (suffix replacement rules). The lexicon now covers all the words in the Marathi WordNet.
TAM labels were revised.
Methods were developed for handling Taddhitas (i.e. words derived from nouns, adjectives and adverbs) and compounds, but they are not yet integrated into the ILMT pipeline.
Current accuracy is 95% on ILMT data.
The stand-alone morphological analyzer also reports the derivational process.
6
Marathi Compounding
In linguistics, a compound word is a lexeme that consists of more than one stem.
Compounds are a kind of multiword expression (MWE), but their properties are easier to predict than those of MWEs in general.
Example: मामामामी {mamamami} {uncle-aunt (maternal)} (a noun).
Most Marathi compounds have only two stems; three-stem cases are rare.
भाऊबहीण {bhaubahin} {brother-sister} has the Hindi equivalent भाई - बेहेन {bhai-behen}.
The individual components are translated directly, which is an advantage for close languages like Marathi and Hindi.
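The component-wise translation described on this slide can be sketched as follows. This is a minimal illustration and not the ILMT implementation: the function names `split_compound` and `transfer_compound`, the toy lexicon and the toy Marathi-to-Hindi dictionary are all assumptions, filled only with the examples given on the slide. The sketch handles uninflected two-stem compounds only.

```python
from typing import Optional, Set, Tuple

# Toy lexicon of Marathi stems (assumption: only the slide's examples).
LEXICON: Set[str] = {"मामा", "मामी", "भाऊ", "बहीण"}

# Toy Marathi-to-Hindi stem dictionary (assumption: taken from the slide's example).
MR_TO_HI = {"भाऊ": "भाई", "बहीण": "बेहेन"}


def split_compound(word: str, lexicon: Set[str] = LEXICON) -> Optional[Tuple[str, str]]:
    """Return the first (left, right) split in which both parts are known stems."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None  # no two-stem analysis found


def transfer_compound(word: str) -> Optional[str]:
    """Translate a compound by translating each component independently."""
    parts = split_compound(word)
    if parts is None or any(p not in MR_TO_HI for p in parts):
        return None
    return " - ".join(MR_TO_HI[p] for p in parts)
```

With these toy resources, `split_compound("मामामामी")` yields the two stems मामा and मामी, and `transfer_compound("भाऊबहीण")` joins the Hindi equivalents of the two stems. A real analyzer would also have to strip inflection and suffixes before the lexicon lookup.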
7
Problem Definition
Given a word containing two components (and hence two roots) a and b, inflected and appended with suffixes, identify each component and provide its linguistic information and the category of the compound word:
Field 1 : Field 2 : ;…;
"fs af" means 'feature structure in abbreviated form'.
CGNPTAM means 'grammatical category, gender, number, person, tense, aspect and modality'.
Fincat: grammatical category of the resultant word.
If there are no features, give only the root words with a short description.
8
Taxonomy of Compounds
Compound words
    Words with both components meaningful
        No duplication: मामामामी {mamamami} {uncle-aunt (maternal)}
        Partial reduplication: गरमागरम {garmagaram} {very hot}
    Words with only one component meaningful
        First component meaningful: अर्धामुर्धा {aardha murdha} {halfway}
        Second component meaningful: शिसपेन्सिल {sispencil} {penpencil}
    Negation words: अयोग्य {ayogya} {inappropriate}
    Reduplication words: तुकडेतुकडे {tukdetukde} {in pieces}
    Sandhi words: अत्यूर्जा {atyurja} {too much energy}
    Echo words: रस्तोरस्ती {rastorasti} {every road}
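As a small illustration of how one class in this taxonomy might be detected, the sketch below checks for full reduplication, i.e. a word that is a single stem repeated twice. The function name `is_full_reduplication` is an assumption, and the check is purely string-based; the other classes (negation, sandhi, echo words and so on) would need a lexicon or dedicated rules that are not shown here.

```python
def is_full_reduplication(word: str) -> bool:
    """True if the word is one stem repeated exactly twice with nothing in between."""
    half, remainder = divmod(len(word), 2)
    # An odd number of code points cannot be an exact repetition.
    return remainder == 0 and half > 0 and word[:half] == word[half:]
```

For example, तुकडेतुकडे passes the check, while मामामामी (two different stems) and गरमागरम (partial reduplication with a linking vowel) do not.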
9
Results
No.  Type                                Input word count  Split count  Analysed count  Percentage correctly analysed
1    Both roots distinct and meaningful  234               231          225             96%
2    Partial reduplication               25                25           25              100%
3    Only first root meaningful          35                -            30              85.7%
4    Only second root meaningful         11                8            8               72.7%
5    Negation words                      12                12           12              100%
6    Reduplication words                 59                59           59              100%
7    Echo words                          54                54           54              100%
8    Sandhi words                        31                -            28              90.3%
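The percentage column appears to be the analysed count divided by the input word count; that reading is an inference from the figures, not something the slide states. A minimal sketch for re-deriving it (the function name `percent_correct` is an assumption):

```python
def percent_correct(input_count: int, analysed_count: int) -> float:
    """Percentage of input words that were correctly analysed, to one decimal place."""
    if input_count <= 0:
        raise ValueError("input_count must be positive")
    return round(100.0 * analysed_count / input_count, 1)
```

Under that assumption, 30 of 35 gives 85.7, 8 of 11 gives 72.7, 28 of 31 gives 90.3, and 225 of 234 gives 96.2, which the slide reports rounded as 96%.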
10
Marathi Synset Linkage
Total number of synsets for which words were cross-linked: 18,000
These links are now reflected in the bilingual dictionary used for lexical transfer.
Total Marathi synsets: 26,557
Total unique words: 36,394
Total linked synsets: 23,967
11
Corpus Statistics
Tourism: size = 240,000 words
Health: size = 255,000 words
General: size = 30.7 million words (news domain)
POS corpus annotated (tagged and cross-checked):
General domain: 2,63,037 words
Tourism domain: 1,36,640 words
Health domain: 44,202 words (Set 1)
Health domain: 21,345 words (Set 2)
12
Lexical Transfer Module changes
The dictionary currently has 316 Akhyata pairs, 68 Kridanta pairs, and 40 entries for irregular mappings.
A number of bugs involving the transfer of the base forms of verbs have been eliminated.
Bugs that caused sudden system crashes due to improper coding have been eliminated.
The lexical transfer module now selects the first synset in sequence corresponding to the given word.
Transfer of ordinals, conjunctions, etc. has also been included.
The features of the NER module are now properly used for the transliteration of the necessary named entities.
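The "first synset in sequence" selection can be sketched as below. This is an illustration under stated assumptions, not the module's actual code: the dictionary layout (each Marathi word mapped to an ordered list of synsets, each synset a list of Hindi words), the function name `first_synset_transfer`, and the placeholder second sense are all hypothetical. Only the pair माकड / बंदर is taken from the translation examples in this deck.

```python
from typing import Dict, List, Optional

# Hypothetical bilingual dictionary: Marathi word -> ordered list of Hindi synsets.
# The second synset is a placeholder, not real dictionary content.
BILINGUAL: Dict[str, List[List[str]]] = {
    "माकड": [["बंदर"], ["<second-sense word>"]],
}


def first_synset_transfer(word: str, dictionary: Dict[str, List[List[str]]]) -> Optional[str]:
    """Return the first word of the first listed synset, or None if the word is unknown."""
    synsets = dictionary.get(word)
    if not synsets or not synsets[0]:
        return None
    return synsets[0][0]
```

Because the choice depends only on list order, the quality of this strategy depends on how the synsets are ordered in the underlying resource.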
13
Current Status
Results by CDAC Pune
For Health: Comprehensibility/Adequacy: 81%, Fluency: 53%
For Tourism: Comprehensibility/Adequacy: 78%, Fluency: 52%
14
Evaluation Method
S5: number of score-5 sentences, S4: number of score-4 sentences, S3: number of score-3 sentences, N: total number of sentences
Score =
Score 5: Correct translation
Score 4: Understandable with minor errors
Score 3: Understandable with major errors
Score 2: Not understandable
Score 1: Nonsense translation
Linguists give each sentence a score out of 5 without foreknowledge of its meaning.
The score reflects the subjective quality of the sentence.
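The slide's formula is not reproduced in this text; only its inputs (S5, S4, S3 and N) are. The sketch below therefore assumes a weighted-sum form and leaves the weights as explicit parameters. Both the form and any weight values a caller passes in are assumptions, not the evaluators' actual definition, and the function name `weighted_score` is hypothetical.

```python
def weighted_score(s5: int, s4: int, s3: int, n: int,
                   w5: float, w4: float, w3: float) -> float:
    """Weighted percentage of sentences scored 5, 4 or 3 out of n sentences.

    The weights w5, w4, w3 are caller-supplied assumptions.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if s5 + s4 + s3 > n:
        raise ValueError("scored sentences cannot exceed the total")
    return 100.0 * (w5 * s5 + w4 * s4 + w3 * s3) / n
```

For instance, with purely illustrative weights of 1.0, 0.5 and 0.0, two score-5 and two score-4 sentences out of four give `weighted_score(2, 2, 0, 4, 1.0, 0.5, 0.0)`, i.e. 75.0.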
15
Examples
Extremely Fluent
Source (Marathi): कॉंग्रेसने बापूजींचा " इव्हेंट ' केला.
Gloss (English): Congress made an event about Baapuji
Output (Hindi): कॉंग्रेस ने बापूजी के इव्हेंट किया ।
Moderately/Syntactically Fluent
Source (Marathi): आइन्स्टाइन एकदा म्हणाला होता, की नव्या युगातील तरुणाईला बापूजी म्हणजे एखाद्या चमत्कारासारखे वाटतील.
Gloss (English): Einstein once said that the youth in the new age will feel that Baapuji is like a miracle
Output (Hindi): आइन्स्टाइन कभी कहा है, कि नया युग के तरुणाई को बापूजी अर्थात् एकाध चमत्कारसारखे बाँटएंगे ।
Poor Fluency
Source (Marathi): कारण मुळात ती बापूजींची राहिलेली नाही.
Gloss (English): Because it basically did not remain of Baapuji
Output (Hindi): क्योंकि आदि में वह बापूजी के बचलेली नाह ।
16
Examples
Exact meaning transfer
Source (Marathi): येथे काही जातींची माकडे आणि कांगारू दिसतात.
Gloss (English): Here we/one (can) see a few species of monkeys and kangaroos
Output (Hindi): यहाँ कुछ प्रकारों के बंदर और कंगारू दिखते हैं ।
Medium level meaning transfer
Source (Marathi): कारण ग्रंथ खरेच एक महत्त्वाचे ठिकाण आणि ऊर्जा केंद्रही असते.
Gloss (English): Because a book truly is an important place and a source of power
Output (Hindi): क्योंकि ग्रंथ सचही एक महत्व के स्थान और ऊर्जा केंद्रभी रहता है ।
Complete distortion
Source (Marathi): स्वाभाविकच गांधींचा शोध - पुनर्शोधही चालूच राहिला.
Gloss (English): Naturally, Gandhiji’s search-research (of self and the world) continued
Output (Hindi): स्वाभाविकही गाँधियों के खोज - पुनर्शोध चलएंगे च राहा ।
17
Another example of high fluency
Source (Marathi): पहिल्या टप्प्यासाठी पाचऐवजी सहा रुपये, तर जलद बससाठी सात रुपये भाडे प्रस्तावित आहे.
Transliteration: Pahilya tappyasathi paachaiwaji saha rupaye, tar jalad bussathi saat rupaye bhaade prastavit ahe.
Gloss (English): For the first few steps, six instead of five rupees and for fast buses seven rupees have been proposed.
Output (Hindi): पहले फ़ासले के लिए पाँच के बदले छ रुपये, तो द्रुत बस के लिए सात रुपये किराया प्रस्तावित रह ।
18
DEMO
http://www.cfilt.iitb.ac.in/~ilmt/ilmtinterface/admin/login.php
Complete replication of the offline dashboard tool
19
Current pain points
Fluency depends on the proper translation of suffixes, case markers and function words.
Marathi has two kinds of verb suffixes: Kridantas (non-finite) and Akhyatas (finite).
The verb chunk label determines which dictionary is consulted for suffix translation.
Poor chunking therefore leads to poor fluency.
There are many mistakes in suffix transfer.
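The dependence of suffix transfer on the chunk label can be sketched as a simple dispatch. Everything named here is an assumption for illustration: the chunk labels `VGF` (finite verb group) and `VGNF` (non-finite verb group), the function name `transfer_suffix`, and the placeholder suffix entries, which stand in for the real Akhyata and Kridanta pairs.

```python
from typing import Dict, Optional

# Placeholder suffix pairs; the real dictionaries hold Marathi-to-Hindi suffix mappings.
AKHYATA_PAIRS: Dict[str, str] = {"mr_finite_suffix": "hi_finite_suffix"}
KRIDANTA_PAIRS: Dict[str, str] = {"mr_nonfinite_suffix": "hi_nonfinite_suffix"}

# Assumed chunk labels: finite verb groups use Akhyatas, non-finite ones use Kridantas.
DICTIONARY_BY_CHUNK: Dict[str, Dict[str, str]] = {
    "VGF": AKHYATA_PAIRS,
    "VGNF": KRIDANTA_PAIRS,
}


def transfer_suffix(chunk_label: str, suffix: str) -> Optional[str]:
    """Look up a suffix in the dictionary selected by the chunk label."""
    table = DICTIONARY_BY_CHUNK.get(chunk_label)
    if table is None:
        return None  # not a verb chunk this sketch knows about
    return table.get(suffix)  # None when the suffix is absent from that dictionary
```

In this sketch, a finite suffix looked up under a wrongly assigned `VGNF` label finds nothing, which illustrates how a chunking error can turn into a suffix transfer error.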
20
Current pain points
Synsets in the WordNet are not ordered by first sense.
First-sense WSD is therefore not applicable to words that the current WSD engine does not disambiguate.
This affects comprehensibility.
21
Action plan for Lexical Transfer Module
Split the current transfer module into two parts: one for lexical transfer and the other for grammar transfer.
Look into statistical mechanisms for grammar transfer as well as lexical transfer to improve accuracy.
Include mechanisms to handle the double Vibhaktis reported by the Vibhakti Computation module.
22
Action plan for MA
Improve the accuracy of the system further by adding new roots and SRR rules.
Revise the FSM rules for Kridantas to eliminate some glaring mistakes.
Create more rules to handle more Taddhitas and compounds, and integrate them into the ILMT pipeline.
Use other fields in the morph analyzer's output, e.g. a flag to indicate an emphatic marker.
Update the morph analyzer to handle the double feature structure of genitive forms.
23
Other steps
Develop a simple parser for Marathi.
Improve the chunker.
Continue linking more Marathi synsets and complete the linkage of the current 37,617 Hindi synsets.
Evaluate on randomly selected web documents (about 20-40 per week) and improve the outputs immediately.