A Hybrid Approach for Bengali to Hindi Machine Translation
Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar, Anupam Basu
CSE, IIT Kharagpur
Contents
– Abstract and Motivation
– Rule Based and Statistical Machine Translation
– Hybrid System
– System Architecture
– Phrase table enhancement using lexical resources
– Suffix, Infix and Pattern based postprocessing
– Experiments with Example Sentence
– Evaluation
– Conclusion
– References
Abstract and Previous Work
MT translates a text from one natural language (such as Bengali) to another (such as Hindi)
– The meaning must be preserved
Current MT software allows for customization by domain
– Improving output by limiting the scope
History:
1946: A. D. Booth proposed using digital computers for the translation of natural languages.
1954: The Georgetown experiment involved MT of 60 Russian sentences into English. It was claimed that within 3-5 years MT would be a solved problem.
1966: The ALPAC report concluded that ten years of research had failed to fulfill expectations.
Translation challenges:
– Decoding the meaning of the source text
– Re-encoding that meaning in the target language
Rule Based and Statistical MT
Statistical MT
– Uses statistical models with bilingual corpora
– Provides good quality when large, high-quality corpora are available
– Poor quality for other domains
– Fluent and cheaper
– Bengali-Hindi: 2-month, 2-person effort – BLEU score
Rule Based MT
– Relies on a large number of built-in linguistic rules and dictionaries
– Good out-of-domain quality; predictable behavior
– Lacks fluency; long and costly to develop
– Bengali-Hindi: 2-year, 5-person effort – BLEU score
Hybrid System
There is a clear need for a third approach through which users would get:
– Better translation quality and high performance (Rule based)
– Less investment in cost and time (Statistical)
Bengali-Hindi: BLEU score
System Architecture
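As a reading aid, the following is a minimal Python sketch of the pipeline described in this deck: SMT decoding with the dictionary-enhanced phrase table, followed by suffix, infix and pattern based postprocessing. All function names are placeholders, not the authors' code.

```python
# Minimal sketch of the hybrid pipeline described in these slides.
# smt_decode() stands in for the phrase-based decoder run with the
# dictionary-enhanced phrase table; the three postprocessing steps are
# the ones defined on the following slides. Names are illustrative.

def smt_decode(bengali_sentence: str) -> str:
    """Placeholder for the phrase-based SMT decoder (e.g. Moses)."""
    raise NotImplementedError

def suffix_postprocess(text: str) -> str:
    """Suffix-list based fixes (see 'Postprocessing by suffix list')."""
    return text

def infix_postprocess(text: str) -> str:
    """Stacked-suffix (infix) fixes (see 'Infix based postprocessing')."""
    return text

def pattern_postprocess(text: str) -> str:
    """Error-pattern based fixes (see 'Pattern based postprocessing')."""
    return text

def translate(bengali_sentence: str) -> str:
    hindi = smt_decode(bengali_sentence)
    for step in (suffix_postprocess, infix_postprocess, pattern_postprocess):
        hindi = step(hindi)
    return hindi
```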
Feeding the dictionary into SMT
Lexical entries from the Transfer Based system (tourism domain) are used to increase word alignments in the SMT system (news domain)
– The dictionary is from another domain
– The dictionary contains only words, not phrases
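A minimal sketch of this step, assuming the dictionary is a tab-separated Bengali-Hindi word list and the training corpus is one sentence per line: the entries are appended to the parallel data as extra one-line sentence pairs so that the word aligner (GIZA++) and the phrase extractor see each dictionary pair as an additional alignment. The file names and format are illustrative, not the authors' setup.

```python
# Hypothetical sketch: append Bengali-Hindi dictionary entries to the
# parallel training files as tiny one-word "sentence" pairs. The
# tab-separated dictionary format and file names are assumptions.

def append_dictionary(dict_path: str, src_corpus: str, tgt_corpus: str) -> None:
    with open(dict_path, encoding="utf-8") as d, \
         open(src_corpus, "a", encoding="utf-8") as src, \
         open(tgt_corpus, "a", encoding="utf-8") as tgt:
        for line in d:
            parts = line.strip().split("\t")    # one entry per line: bn<TAB>hi
            if len(parts) != 2:
                continue
            bengali, hindi = parts
            src.write(bengali + "\n")           # each entry becomes an extra
            tgt.write(hindi + "\n")             # parallel sentence pair

# Illustrative usage:
# append_dictionary("bn_hi_dictionary.tsv", "train.bn", "train.hi")
```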
Postprocessing by suffix list
– Suffix list (1000 entries)
– Monolingual corpora of the same size for the source and target languages (500K words each)
– Some suffixes occur more than 1000 times in the Bengali corpus and zero times in the Hindi corpus
– Some other suffixes occur more than 5000 times in the Bengali corpus, and more than 99% of their total occurrences in the combined corpus are in the Bengali corpus
(a code sketch of this selection follows the suffix table below)
Suffix list
Sl. No. | Suffix | Occurrences in Bengali corpus
1 | Ya |
2 | echhe | 2899
3 | ao | 2053
4 | chhila | 2001
5 | oYA | 1607
6 | eo | 1550
7 | bhAbe | 1426
8 | Yechhe | 1426
9 | chhi |
10 | Yera |
11 | ilena |
12 | achhe | 1004
Sl. No. | Suffix | Occurrences in Bengali corpus | Occurrences in Hindi corpus
1 | era | |
2 | ei | |
3 | Ye | |
4 | iYe | |
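A minimal sketch of the suffix selection criteria from the previous slide, assuming whitespace-tokenized monolingual corpora and a plain-text suffix list; the thresholds are the ones stated above, everything else is illustrative.

```python
# Hypothetical sketch of the suffix selection described above:
# keep a suffix if it ends >1000 Bengali tokens and 0 Hindi tokens,
# or if it ends >5000 Bengali tokens and >99% of its combined
# occurrences are in the Bengali corpus. File formats are assumptions.

def count_suffix(suffix: str, tokens: list) -> int:
    """Number of corpus tokens ending with the given suffix."""
    return sum(1 for w in tokens if w.endswith(suffix))

def select_suffixes(suffix_list, bengali_tokens, hindi_tokens):
    selected = []
    for suf in suffix_list:
        bn = count_suffix(suf, bengali_tokens)
        hi = count_suffix(suf, hindi_tokens)
        only_bengali = bn > 1000 and hi == 0
        mostly_bengali = bn > 5000 and bn / (bn + hi) > 0.99
        if only_bengali or mostly_bengali:
            selected.append((suf, bn, hi))
    return selected

# Illustrative usage:
# bengali_tokens = open("mono.bn", encoding="utf-8").read().split()
# hindi_tokens = open("mono.hi", encoding="utf-8").read().split()
# suffixes = [s.strip() for s in open("suffix_list.txt", encoding="utf-8")]
# print(select_suffixes(suffixes, bengali_tokens, hindi_tokens))
```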
Infix based postprocessing
– Multiple suffixes can be attached to a word, and they are stacked: chhelegulike = chhele + guli + ke
– A suffix that sits inside such a stack acts as an infix; an infix in Bengali is translated to the corresponding infix in Hindi
Infix list (Sl. No., infix):
1. dera
2. gulo / guli
3. na
4. iYechha / iYechhe / iYechhi / iYechho
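A minimal sketch of this step, assuming a small Bengali-to-Hindi infix mapping: the guli/gulo/dera to o.N replacements follow the worked example later in the deck (skulagulike becomes skulao.Nke), while the trailing case-suffix list and the code structure are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of infix-based postprocessing: when a known
# Bengali infix sits between a stem and a trailing case suffix, replace
# it with its Hindi counterpart. The guli/gulo/dera -> o.N mapping
# follows the worked example (skulagulike -> skulao.Nke); everything
# else here is illustrative.

INFIX_MAP = {
    "guli": "o.N",
    "gulo": "o.N",
    "dera": "o.N",
}

CASE_SUFFIXES = ("ke", "te", "ra")   # illustrative trailing case suffixes

def replace_infixes(word: str) -> str:
    for infix, hindi in INFIX_MAP.items():
        for case in CASE_SUFFIXES:
            tail = infix + case
            if word.endswith(tail) and len(word) > len(tail):
                return word[: -len(tail)] + hindi + case
    return word

def infix_postprocess(sentence: str) -> str:
    return " ".join(replace_infixes(w) for w in sentence.split())

# Example from the deck:
# infix_postprocess("hama sabhI skulagulike ... mAtApitAderake ...")
# -> "hama sabhI skulao.Nke ... mAtApitAo.Nke ..."
```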
Pattern based postprocessing
After suffix and infix based postprocessing, the output is further inspected to find some error patterns
– For example, the suffixes "te" or "ke" preceded by 5 or more characters (in the roman transliteration) are very rare in Hindi
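A minimal sketch of one such pattern rule. Based on the worked example (skulao.Nke becomes skulao.N ko), the detected "ke" is assumed to be split off and rewritten as the postposition "ko"; the "te" rewrite and the regex itself are illustrative guesses, not the authors' rule set.

```python
# Hypothetical sketch of a pattern-based rule: a token ending in "ke"
# (or "te") with 5 or more preceding characters is flagged as unlikely
# Hindi; following the worked example (skulao.Nke -> skulao.N ko), the
# "ke" case marker is detached and rewritten as the postposition "ko".
# The exact rewrites are assumptions for illustration.

import re

# 5 or more non-space characters, then "ke" or "te" at the end of a token
PATTERN = re.compile(r"\b(\S{5,})(ke|te)\b")

REWRITE = {"ke": "ko", "te": "me.N"}   # "me.N" is an illustrative guess

def pattern_postprocess(sentence: str) -> str:
    def fix(match: re.Match) -> str:
        stem, suffix = match.group(1), match.group(2)
        return stem + " " + REWRITE[suffix]
    return PATTERN.sub(fix, sentence)

# Example from the deck:
# pattern_postprocess("hama sabhI skulao.Nke sApha tarahA se ...")
# -> "hama sabhI skulao.N ko sApha tarahA se ..."
```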
Experiment
Resources:
– Training corpus (12K sentences) from EMILLE-CIIL
– Development corpus (1K sentences) from EMILLE-CIIL
– Test corpus (100 sentences) from EMILLE-CIIL
– Suffix list: 1000 Bengali linguistic suffixes
– Dictionary: 15,000 parallel synsets from ILMT-DIT
– Gazetteer list: 50K parallel names from ILMT-DIT
– Monolingual corpus: 500K words each for the source and target languages
Tools: GIZA++, Moses, MERT, Pharaoh
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N.
Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
After suffix based postprocessing: hama sabhI skulagulike sApha tarahA se batA diYA hai kI mAtApitAderake eka likhita dalila de.N.
Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
After infix based postprocessing: hama sabhI skulao.Nke sApha tarahA se batA diYA hai kI mAtApitAo.Nke eka likhita dalila de.N.
Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
After pattern based postprocessing (final postprocessing output): hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Evaluation
Experiments (scored with BLEU and NIST):
– SMT baseline
– K dictionary
– K dictionary
– Suffix based postprocessing
– Infix based postprocessing
– Pattern based postprocessing
BLEU
– An automatic, inexpensive, quick and language independent evaluation method
– The closer a machine translation output is to a professional human reference translation, the better its BLEU score
– A source word can be translated into different word choices; the candidate translation selects one of them, which may not match the word choice in the reference translation
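As a reading aid, a minimal self-contained sketch of sentence-level BLEU (modified n-gram precision with a brevity penalty, after Papineni et al.); this illustrates the generic metric with a single reference and no smoothing, not the authors' evaluation script.

```python
# Minimal sketch of sentence-level BLEU: geometric mean of modified
# 1..4-gram precisions times a brevity penalty. Single reference,
# no smoothing; shown only to make the metric concrete.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_avg)

# Illustrative usage:
# print(bleu("the cat is on the mat".split(), "the cat sat on the mat".split()))
```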
BLEU (contd.)
Modified BLEU: the candidate translation and the reference translation are compared through a monolingual concept dictionary
– Improves the BLEU score by considering concepts rather than surface words
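A minimal sketch of this idea, assuming the monolingual concept dictionary maps Hindi words to concept (synset) identifiers. Only the modified unigram precision is shown for brevity; in the full metric the same concept mapping would be applied before the BLEU computation sketched above. The dictionary entries are illustrative assumptions, not the authors' resource.

```python
# Hypothetical sketch of the modified-BLEU idea: map candidate and
# reference tokens to concept identifiers via a monolingual concept
# dictionary, then measure n-gram overlap over concepts instead of
# surface words. The tiny dictionary below is purely illustrative.

from collections import Counter

CONCEPT_DICT = {
    # synonymous Hindi words share a concept id (illustrative entries)
    "dalila": "C_DOCUMENT",
    "dastAveja": "C_DOCUMENT",
    "sApha": "C_CLEAR",
    "spashTa": "C_CLEAR",
}

def to_concepts(tokens):
    """Replace each word by its concept id; unknown words map to themselves."""
    return [CONCEPT_DICT.get(w, w) for w in tokens]

def concept_unigram_precision(candidate, reference):
    cand = Counter(to_concepts(candidate))
    ref = Counter(to_concepts(reference))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

# Different word choices for the same concept now count as matches:
# concept_unigram_precision("eka likhita dalila".split(),
#                           "eka likhita dastAveja".split())   # -> 1.0
```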
Conclusion
– The postprocessing targets inflected words which remain unchanged (untranslated) after translation
– Words which are wrongly translated are not handled
– A morphological analyzer/generator may be useful
– Incorporating the dictionary decreases the fluency level
References
– W. S. Bennett, J. Slocum. The LRC Machine Translation System. Computational Linguistics, 11(2-3).
– P. F. Brown, S. D. Pietra, V. J. D. Pietra, R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).
– A. Eisele, C. Federmann, H. Uszkoreit, H. Saint-Amand, M. Kay, M. Jellinghaus, S. Hunsicker, T. Herrmann, Y. Chen. Hybrid Machine Translation Architectures within and beyond the EuroMatrix project. In Proceedings of the European Machine Translation Conference.
– Ethnologue: Languages of the World, 16th edition. Edited by M. Paul Lewis.
– P. Isabelle, C. Goutte, M. Simard. Domain adaptation of MT systems through automatic post-editing. In Proceedings of MT Summit XI, Copenhagen, Denmark.
– P. Koehn, F. J. Och, D. Marcu. Statistical phrase-based translation. In Proceedings of NAACL-HLT, Edmonton, Canada.
– P. Koehn. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of the Association for Machine Translation in the Americas (AMTA-2004).
– F. J. Och, H. Ney. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the ACL.
– F. J. Och, H. Ney. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4).
– K. Papineni, S. Roukos, T. Ward, W. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the ACL, Philadelphia.
– A. Ushioda. Phrase Alignment for Integration of SMT and RBMT Resources. In MT Summit XI Workshop on Patent Translation.
– H. Wu, H. Wang. Improving Statistical Word Alignment with a Rule-Based Machine Translation System. In Proceedings of COLING.
Thank You