Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar, Anupam Basu (CSE, IIT Kharagpur): A Hybrid Approach for Bengali to Hindi Machine Translation.


1 A Hybrid Approach for Bengali to Hindi Machine Translation
Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar, Anupam Basu
CSE, IIT Kharagpur

2 Contents
• Abstract and Motivation
• Rule Based and Statistical Machine Translation
• Hybrid System
• System Architecture
• Phrase table enhancement using lexical resources
• Suffix, Infix and Pattern based postprocessing
• Experiments with Example Sentence
• Evaluation
• Conclusion
• References

3 Abstract and Previous work
• MT translates a text from one natural language (such as Bengali) to another (such as Hindi)
  – Meaning must be preserved
• Current MT software allows for customization by domain
  – Improving output by limiting scope
• History:
  – 1946: A. D. Booth proposed using digital computers for translation of natural languages.
  – 1954: The Georgetown experiment involved MT of more than 60 Russian sentences into English. It was claimed that MT would be a solved problem within 3-5 years.
  – 1966: The ALPAC report concluded that ten years of research had failed to fulfill expectations.
• Translation challenges:
  – Decoding the meaning of the source text
  – Re-encoding the meaning in the target language

4 Rule Based and Statistical MT
• Statistical MT
  – Uses a statistical model built from bilingual corpora
  – Provides good quality when large and qualified corpora are available
  – Poor for other domains
  – Fluent and cheaper
  – Bengali-Hindi: 2 month, 2 person effort; BLEU score 0.1745
• Rule Based MT
  – Relies on countless built-in linguistic rules and a dictionary
  – Good out-of-domain quality and is predictable
  – Lack of fluency; long and costly to build
  – Bengali-Hindi: 2 year, 5 person effort; BLEU score 0.0424

5 Hybrid System
• There is a clear need for a third approach through which users would get
  – Better translation quality and high performance (as with rule based MT)
  – Less investment in cost and time (as with statistical MT)
• Bengali-Hindi: BLEU score 0.2318

6 System Architecture

7 Feeding dictionary into SMT
• Lexical entries from the transfer based system (tourism domain) are used to increase word alignments in the SMT system (news domain)
• The dictionary is from another domain
• The dictionary contains only words, not phrases
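The slides do not say how the dictionary entries reach the aligner. One common approach, shown here only as a sketch under that assumption, is to append each dictionary pair as an extra one-word sentence pair to the parallel training corpus before word alignment. The function name `augment_parallel_corpus` and the toy data are illustrative; the word pairs are read off the example sentence later in the deck, and the multi-word entry is hypothetical.

```python
def augment_parallel_corpus(src_sentences, tgt_sentences, dictionary):
    """Return copies of the corpora with single-word dictionary pairs appended.

    Assumption: the aligner treats every appended pair as a one-word
    sentence pair, which adds evidence for that word alignment.
    """
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target corpora must be sentence-aligned")
    src_out, tgt_out = list(src_sentences), list(tgt_sentences)
    for bn_word, hi_word in dictionary:
        bn_word, hi_word = bn_word.strip(), hi_word.strip()
        # The dictionary is said to contain only words, so skip anything else.
        if not bn_word or not hi_word or " " in bn_word or " " in hi_word:
            continue
        src_out.append(bn_word)
        tgt_out.append(hi_word)
    return src_out, tgt_out


# Toy data in Roman transliteration (illustrative only).
bengali = ["ekaTA likhita nathipatra dena"]
hindi = ["eka likhita dalila de.N"]
toy_dictionary = [
    ("nathipatra", "dalila"),
    ("ekaTA", "eka"),
    ("skula bADi", "skula"),  # hypothetical multi-word entry, filtered out
]
aug_bn, aug_hi = augment_parallel_corpus(bengali, hindi, toy_dictionary)
```

Because the dictionary comes from a different domain (tourism) than the training corpus (news), the added pairs may not always help; the evaluation slide reports a small gain with 30K entries and a drop with 50K.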

8 Postprocessing by suffix list
• Suffix list (1000 suffixes)
• Monolingual corpora of the same size for the source and target languages (500K words each)
• Some suffixes occur more than 1000 times in the Bengali corpus and zero times in the Hindi corpus
• Some other suffixes occur more than 5000 times in the Bengali corpus, and more than 99% of their occurrences in the combined corpus are in the Bengali corpus

9 Suffix list

Sl. No.  Suffix   Occurrences in Bengali corpus
1        Ya       15461
2        echhe    2899
3        ao       2053
4        chhila   2001
5        oYA      1607
6        eo       1550
7        bhAbe    1426
8        Yechhe   1426
9        chhi     1165
10       Yera     1093
11       ilena    1026
12       achhe    1004

Sl. No.  Suffix   Occurrences in Bengali corpus   Occurrences in Hindi corpus
1        era      29426                           262
2        ei       9773                            3
3        Ye       9263                            19
4        iYe      5549                            15
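The two selection criteria from the previous slide can be sketched as a small filter over per-suffix corpus counts. This is an illustration rather than the authors' code: the function name `select_bengali_suffixes` is made up, the counts are read from the suffix list slide as best they can be recovered, and the last two entries are hypothetical counter-examples.

```python
def select_bengali_suffixes(bn_counts, hi_counts):
    """Keep suffixes that look specific to Bengali.

    Rule 1: more than 1000 occurrences in Bengali and none in Hindi.
    Rule 2: more than 5000 occurrences in Bengali, and more than 99% of
            all occurrences (Bengali + Hindi) are in the Bengali corpus.
    """
    selected = []
    for suffix, bn in bn_counts.items():
        hi = hi_counts.get(suffix, 0)
        absent_in_hindi = bn > 1000 and hi == 0
        dominant_in_bengali = bn > 5000 and bn / (bn + hi) > 0.99
        if absent_in_hindi or dominant_in_bengali:
            selected.append(suffix)
    return selected


bn_counts = {
    "Ya": 15461,   # from the suffix list slide
    "era": 29426,  # from the suffix list slide
    "ei": 9773,    # from the suffix list slide
    "xx": 6000,    # hypothetical: frequent in both languages
    "yy": 500,     # hypothetical: too rare in Bengali
}
hi_counts = {"era": 262, "ei": 3, "xx": 900}

bengali_only = select_bengali_suffixes(bn_counts, hi_counts)
```

With these counts, "era" passes the second rule because 29426 of its 29688 combined occurrences (about 99.1%) are in Bengali, while the hypothetical "xx" fails at roughly 87%.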

10 Infix based postprocessing
• Multiple suffixes can be attached to a word, and they are stacked: chhelegulike = chhele + guli + ke
• An infix in Bengali is translated to an infix in Hindi

Sl. No.  Infix
1        dera
2        gulo + guli
3        na
4        iYechha + iYechhe + iYechhi + iYechho
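A minimal sketch of how a stacked form such as chhelegulike could be split and its infix replaced. The infix list is taken from the slide. The final suffixes "ke" and "te" and the Bengali-to-Hindi infix mapping (guli and dera to o.N) are assumptions inferred from the example sentence in this deck, not a documented rule set. The function names are illustrative.

```python
INFIXES = ["dera", "gulo", "guli", "na",
           "iYechha", "iYechhe", "iYechhi", "iYechho"]
FINAL_SUFFIXES = ["ke", "te"]                   # assumption
INFIX_TO_HINDI = {"guli": "o.N", "dera": "o.N"}  # inferred from the example


def split_stacked(word, infixes=INFIXES, finals=FINAL_SUFFIXES):
    """Split word into (stem, infix, final suffix), or return None."""
    for final in sorted(finals, key=len, reverse=True):
        if not word.endswith(final):
            continue
        body = word[:-len(final)]
        # Try longer infixes first so "iYechhe" wins over shorter matches.
        for infix in sorted(infixes, key=len, reverse=True):
            if body.endswith(infix) and len(body) > len(infix):
                return body[:-len(infix)], infix, final
    return None


def translate_infix(word):
    """Replace a known Bengali infix by its Hindi counterpart, if any."""
    parts = split_stacked(word)
    if parts is None or parts[1] not in INFIX_TO_HINDI:
        return word  # leave the word unchanged
    stem, infix, final = parts
    return stem + INFIX_TO_HINDI[infix] + final


parts = split_stacked("chhelegulike")
hindi_like = translate_infix("skulagulike")
```

The final suffix is left untouched here, which matches the intermediate forms skulao.Nke and mAtApitAo.Nke that appear in the example slides before the pattern based step.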

11 Pattern based postprocessing
• After suffix and infix based postprocessing, the output is inspected further to find remaining error patterns
• Example: a “te” or “ke” suffix preceded by 5 or more Roman (transliteration) characters is very rare in Hindi, so such words are likely untranslated Bengali forms
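The "te"/"ke" pattern can be expressed as a regular expression over the Roman transliteration. The sketch below flags such words and, for "ke" only, rewrites the suffix as a separate "ko", which is the change visible in the example sentence of this deck. The character class, the function names and the "ke" to "ko" rewrite are assumptions for illustration; the slides give no rewrite for "te", so it is only flagged.

```python
import re

# Five or more transliteration characters followed by a final "te" or "ke".
SUSPECT = re.compile(r"^([A-Za-z.]{5,})(te|ke)$")


def find_suspect_words(sentence):
    """Return the words that match the rare-in-Hindi pattern."""
    return [w for w in sentence.split() if SUSPECT.match(w.rstrip(".,"))]


def rewrite_ke(sentence):
    """Rewrite a flagged final "ke" as a separate word "ko"."""
    out = []
    for w in sentence.split():
        core = w.rstrip(".,")
        tail = w[len(core):]  # trailing punctuation, if any
        m = SUSPECT.match(core)
        if m and m.group(2) == "ke":
            out.append(m.group(1) + " ko" + tail)
        else:
            out.append(w)
    return " ".join(out)


before = ("hama sabhI skulao.Nke sApha tarahA se batA diYA hai kI "
          "mAtApitAo.Nke eka likhita dalila de.N.")
flagged = find_suspect_words(before)
after = rewrite_ke(before)
```

Short Hindi words that end in "ke" or "te" are not touched, because the pattern requires at least five characters before the suffix.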

12 Experiment
Resources:
• Training corpus (12K sentences) of EMILLE-CIIL
• Development corpus (1K sentences) of EMILLE-CIIL
• Test corpus (100 sentences) of EMILLE-CIIL
• Suffix list: 1000 Bengali linguistic suffixes
• Dictionary: 15,000 parallel synsets of ILMT-DIT
• Gazetteer list: 50K parallel names of ILMT-DIT
• Monolingual corpus: 500K words each from the source language (SL) and target language (TL)
Systems: GIZA++; Moses; MERT; Pharaoh.

13-17 Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told all the schools to give the parents a written document.
SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N.
After suffix based postprocessing: hama sabhI skulagulike sApha tarahA se batA diYA hai kI mAtApitAderake eka likhita dalila de.N.
After infix based postprocessing: hama sabhI skulao.Nke sApha tarahA se batA diYA hai kI mAtApitAo.Nke eka likhita dalila de.N.
After pattern based postprocessing (final output): hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.

18 Evaluation

Experiment                     BLEU     NIST
SMT baseline                   0.1745   4.2072
30K dictionary                 0.1759   4.2267
50K dictionary                 0.1712   4.1631
Suffix based postprocessing    0.1933   4.5062
Infix based postprocessing     0.2128   4.6865
Pattern based postprocessing   0.2275   4.8405

19 BLEU
• Automatic, inexpensive, quick and language independent evaluation metric
• The closer a machine translation output is to a professional human reference translation, the higher the BLEU score
• A source word can be translated with different word choices
• The candidate translation selects one of them, which may not match the word choice of the reference translation
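To make the scoring idea concrete, here is a simplified sentence-level BLEU with a single reference: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a teaching sketch, not the evaluation script used for the reported scores. BLEU is normally computed over a whole test corpus, and this version applies no smoothing. The Hindi word "dastAveja" in the usage example is a hypothetical alternative word choice.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate and one reference sentence."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


reference = ("hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI "
             "mAtApitAo.N ko eka likhita dalila de.N.")
same_choice = sentence_bleu(reference, reference)
other_choice = sentence_bleu(reference.replace("dalila", "dastAveja"), reference)
```

A single different word choice lowers the score even when the meaning is unchanged, which is the weakness described in the last two bullets.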

20 BLEU Cont.
• Candidate translation + reference translation → BLEU: 0.2275
• Candidate translation + reference translation + monolingual concept dictionary → Modified BLEU: 0.2318
• The BLEU score improves when concepts rather than words are compared
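The slide does not spell out how the concept dictionary enters the score. One plausible reading, sketched here as an assumption, is to map every word to a concept identifier before counting matches, so that synonyms count as the same token. The example uses clipped unigram precision only; the concept identifier `C_DOCUMENT`, the word "dastAveja" and the function names are hypothetical.

```python
from collections import Counter


def to_concepts(tokens, concept_of):
    """Replace each word by its concept id; unknown words stay as they are."""
    return [concept_of.get(t, t) for t in tokens]


def clipped_unigram_precision(candidate_tokens, reference_tokens):
    """Share of candidate tokens that also occur in the reference (clipped)."""
    if not candidate_tokens:
        return 0.0
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    matched = sum(min(c, ref[t]) for t, c in cand.items())
    return matched / len(candidate_tokens)


# Hypothetical monolingual concept dictionary: two words, one concept.
concept_of = {"dalila": "C_DOCUMENT", "dastAveja": "C_DOCUMENT"}

candidate = "eka likhita dalila".split()
reference = "eka likhita dastAveja".split()

word_level = clipped_unigram_precision(candidate, reference)
concept_level = clipped_unigram_precision(
    to_concepts(candidate, concept_of), to_concepts(reference, concept_of)
)
```

At word level only two of the three candidate tokens match; after mapping to concepts all three match, which illustrates why a concept based comparison can raise the score without any change to the translation.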

21 Conclusion
• The postprocessing targets inflected words that remain unchanged after translation
• Words that are wrongly translated are not considered
• A morphological analyzer/generator may be useful
• Adding the dictionary decreases the fluency level


23 Thank You

