A Hybrid Approach for Bengali to Hindi Machine Translation
Sanjay Chatterji, Devshri Roy, Sudeshna Sarkar, Anupam Basu
CSE, IIT Kharagpur
Contents
– Abstract and Motivation
– Rule Based and Statistical Machine Translation
– Hybrid System
– System Architecture
– Phrase table enhancement using lexical resources
– Suffix, Infix and Pattern based postprocessing
– Experiments with Example Sentence
– Evaluation
– Conclusion
– References
Abstract and Previous Work
MT translates a text from one natural language (such as Bengali) to another (such as Hindi)
– The meaning must be preserved
Current MT software allows for customization by domain
– Improving output by limiting the scope
History:
1946: A. D. Booth proposed using digital computers for the translation of natural languages.
1954: The Georgetown experiment involved MT of 60 Russian sentences into English. It was claimed that within 3-5 years MT would be a solved problem.
1966: The ALPAC report concluded that ten years of research had failed to fulfill expectations.
Translation challenges:
– Decoding the meaning of the source text
– Re-encoding that meaning in the target language
Rule Based and Statistical MT
Statistical MT
– Uses statistical models with bilingual corpora
– Provides good quality when large, high-quality corpora are available
– Poor quality for other domains
– Fluent and cheaper
– Bengali-Hindi: 2-month, 2-person effort – BLEU score
Rule Based MT
– Relies on a large number of built-in linguistic rules and dictionaries
– Good out-of-domain quality; predictable behavior
– Lacks fluency; long and costly to develop
– Bengali-Hindi: 2-year, 5-person effort – BLEU score
Hybrid System
There is a clear need for a third approach through which users would get:
– Better translation quality and high performance (Rule based)
– Less investment in cost and time (Statistical)
Bengali-Hindi: BLEU score
System Architecture
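As a reading aid, the following is a minimal Python sketch of the pipeline described in this deck: SMT decoding with the dictionary-enhanced phrase table, followed by suffix, infix and pattern based postprocessing. All function names are placeholders, not the authors' code.

```python
# Minimal sketch of the hybrid pipeline described in these slides.
# smt_decode() stands in for the phrase-based decoder run with the
# dictionary-enhanced phrase table; the three postprocessing steps are
# the ones defined on the following slides. Names are illustrative.

def smt_decode(bengali_sentence: str) -> str:
    """Placeholder for the phrase-based SMT decoder (e.g. Moses)."""
    raise NotImplementedError

def suffix_postprocess(text: str) -> str:
    """Suffix-list based fixes (see 'Postprocessing by suffix list')."""
    return text

def infix_postprocess(text: str) -> str:
    """Stacked-suffix (infix) fixes (see 'Infix based postprocessing')."""
    return text

def pattern_postprocess(text: str) -> str:
    """Error-pattern based fixes (see 'Pattern based postprocessing')."""
    return text

def translate(bengali_sentence: str) -> str:
    hindi = smt_decode(bengali_sentence)
    for step in (suffix_postprocess, infix_postprocess, pattern_postprocess):
        hindi = step(hindi)
    return hindi
```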
Feeding the dictionary into SMT
Lexical entries from the Transfer Based system (tourism domain) are used to increase word alignments in the SMT system (news domain)
– The dictionary is from another domain
– The dictionary contains only words, not phrases
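A minimal sketch of this step, assuming the dictionary is a tab-separated Bengali-Hindi word list and the training corpus is one sentence per line: the entries are appended to the parallel data as extra one-line sentence pairs so that the word aligner (GIZA++) and the phrase extractor see each dictionary pair as an additional alignment. The file names and format are illustrative, not the authors' setup.

```python
# Hypothetical sketch: append Bengali-Hindi dictionary entries to the
# parallel training files as tiny one-word "sentence" pairs. The
# tab-separated dictionary format and file names are assumptions.

def append_dictionary(dict_path: str, src_corpus: str, tgt_corpus: str) -> None:
    with open(dict_path, encoding="utf-8") as d, \
         open(src_corpus, "a", encoding="utf-8") as src, \
         open(tgt_corpus, "a", encoding="utf-8") as tgt:
        for line in d:
            parts = line.strip().split("\t")    # one entry per line: bn<TAB>hi
            if len(parts) != 2:
                continue
            bengali, hindi = parts
            src.write(bengali + "\n")           # each entry becomes an extra
            tgt.write(hindi + "\n")             # parallel sentence pair

# Illustrative usage:
# append_dictionary("bn_hi_dictionary.tsv", "train.bn", "train.hi")
```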
Postprocessing by suffix list
– Suffix list (1000 entries)
– Monolingual corpora of the same size for the source and target languages (500K words each)
– Some suffixes occur more than 1000 times in the Bengali corpus and zero times in the Hindi corpus
– Some other suffixes occur more than 5000 times in the Bengali corpus, and more than 99% of their total occurrences in the combined corpus are in the Bengali corpus
(a code sketch of this selection follows the suffix table below)
Suffix list
Sl. No. | Suffix | Occurrences in Bengali corpus
1 | Ya |
2 | echhe | 2899
3 | ao | 2053
4 | chhila | 2001
5 | oYA | 1607
6 | eo | 1550
7 | bhAbe | 1426
8 | Yechhe | 1426
9 | chhi |
10 | Yera |
11 | ilena |
12 | achhe | 1004
Sl. No. | Suffix | Occurrences in Bengali corpus | Occurrences in Hindi corpus
1 | era | |
2 | ei | |
3 | Ye | |
4 | iYe | |
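A minimal sketch of the suffix selection criteria from the previous slide, assuming whitespace-tokenized monolingual corpora and a plain-text suffix list; the thresholds are the ones stated above, everything else is illustrative.

```python
# Hypothetical sketch of the suffix selection described above:
# keep a suffix if it ends >1000 Bengali tokens and 0 Hindi tokens,
# or if it ends >5000 Bengali tokens and >99% of its combined
# occurrences are in the Bengali corpus. File formats are assumptions.

def count_suffix(suffix: str, tokens: list) -> int:
    """Number of corpus tokens ending with the given suffix."""
    return sum(1 for w in tokens if w.endswith(suffix))

def select_suffixes(suffix_list, bengali_tokens, hindi_tokens):
    selected = []
    for suf in suffix_list:
        bn = count_suffix(suf, bengali_tokens)
        hi = count_suffix(suf, hindi_tokens)
        only_bengali = bn > 1000 and hi == 0
        mostly_bengali = bn > 5000 and bn / (bn + hi) > 0.99
        if only_bengali or mostly_bengali:
            selected.append((suf, bn, hi))
    return selected

# Illustrative usage:
# bengali_tokens = open("mono.bn", encoding="utf-8").read().split()
# hindi_tokens = open("mono.hi", encoding="utf-8").read().split()
# suffixes = [s.strip() for s in open("suffix_list.txt", encoding="utf-8")]
# print(select_suffixes(suffixes, bengali_tokens, hindi_tokens))
```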
Infix based postprocessing
– Multiple suffixes can be attached to a word, and they are stacked: chhelegulike = chhele + guli + ke
– A suffix that sits inside such a stack acts as an infix; an infix in Bengali is translated to the corresponding infix in Hindi
Infix list (Sl. No., infix):
1. dera
2. gulo / guli
3. na
4. iYechha / iYechhe / iYechhi / iYechho
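A minimal sketch of this step, assuming a small Bengali-to-Hindi infix mapping: the guli/gulo/dera to o.N replacements follow the worked example later in the deck (skulagulike becomes skulao.Nke), while the trailing case-suffix list and the code structure are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of infix-based postprocessing: when a known
# Bengali infix sits between a stem and a trailing case suffix, replace
# it with its Hindi counterpart. The guli/gulo/dera -> o.N mapping
# follows the worked example (skulagulike -> skulao.Nke); everything
# else here is illustrative.

INFIX_MAP = {
    "guli": "o.N",
    "gulo": "o.N",
    "dera": "o.N",
}

CASE_SUFFIXES = ("ke", "te", "ra")   # illustrative trailing case suffixes

def replace_infixes(word: str) -> str:
    for infix, hindi in INFIX_MAP.items():
        for case in CASE_SUFFIXES:
            tail = infix + case
            if word.endswith(tail) and len(word) > len(tail):
                return word[: -len(tail)] + hindi + case
    return word

def infix_postprocess(sentence: str) -> str:
    return " ".join(replace_infixes(w) for w in sentence.split())

# Example from the deck:
# infix_postprocess("hama sabhI skulagulike ... mAtApitAderake ...")
# -> "hama sabhI skulao.Nke ... mAtApitAo.Nke ..."
```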
Pattern based postprocessing
After suffix and infix based postprocessing, the output is further inspected to find some error patterns
– For example, the suffixes "te" or "ke" preceded by 5 or more characters (in the roman transliteration) are very rare in Hindi
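A minimal sketch of one such pattern rule. Based on the worked example (skulao.Nke becomes skulao.N ko), the detected "ke" is assumed to be split off and rewritten as the postposition "ko"; the "te" rewrite and the regex itself are illustrative guesses, not the authors' rule set.

```python
# Hypothetical sketch of a pattern-based rule: a token ending in "ke"
# (or "te") with 5 or more preceding characters is flagged as unlikely
# Hindi; following the worked example (skulao.Nke -> skulao.N ko), the
# "ke" case marker is detached and rewritten as the postposition "ko".
# The exact rewrites are assumptions for illustration.

import re

# 5 or more non-space characters, then "ke" or "te" at the end of a token
PATTERN = re.compile(r"\b(\S{5,})(ke|te)\b")

REWRITE = {"ke": "ko", "te": "me.N"}   # "me.N" is an illustrative guess

def pattern_postprocess(sentence: str) -> str:
    def fix(match: re.Match) -> str:
        stem, suffix = match.group(1), match.group(2)
        return stem + " " + REWRITE[suffix]
    return PATTERN.sub(fix, sentence)

# Example from the deck:
# pattern_postprocess("hama sabhI skulao.Nke sApha tarahA se ...")
# -> "hama sabhI skulao.N ko sApha tarahA se ..."
```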
Experiment
Resources:
– Training corpus (12K sentences) from EMILLE-CIIL
– Development corpus (1K sentences) from EMILLE-CIIL
– Test corpus (100 sentences) from EMILLE-CIIL
– Suffix list: 1000 Bengali linguistic suffixes
– Dictionary: 15,000 parallel synsets from ILMT-DIT
– Gazetteer list: 50K parallel names from ILMT-DIT
– Monolingual corpus: 500K words each for the source and target languages
Tools: GIZA++, Moses, MERT, Pharaoh
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
SMT (with enhanced phrase table) output: hama sabhI skulagulike pariskArabhAbe batA diYA hai kI mAtApitAderake eka likhita dalila de.N.
Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
After suffix based postprocessing: hama sabhI skulagulike sApha tarahA se batA diYA hai kI mAtApitAderake eka likhita dalila de.N.
Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
After infix based postprocessing: hama sabhI skulao.Nke sApha tarahA se batA diYA hai kI mAtApitAo.Nke eka likhita dalila de.N.
Postprocessing output: hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Example Sentence
Bengali: AmarA saba skulagulike pariskArabhAbe bale diYechhi ye mAtApitAderake ekaTA likhita nathipatra dena.
English: We have clearly told every school to give the parents a written document.
After pattern based postprocessing (final postprocessing output): hama sabhI skulao.N ko sApha tarahA se batA diYA hai kI mAtApitAo.N ko eka likhita dalila de.N.
Evaluation
Experiments (scored with BLEU and NIST):
– SMT baseline
– K dictionary
– K dictionary
– Suffix based postprocessing
– Infix based postprocessing
– Pattern based postprocessing
BLEU
– An automatic, inexpensive, quick and language independent evaluation method
– The closer a machine translation output is to a professional human reference translation, the better its BLEU score
– A source word can be translated into different word choices; the candidate translation selects one of them, which may not match the word choice in the reference translation
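As a reading aid, a minimal self-contained sketch of sentence-level BLEU (modified n-gram precision with a brevity penalty, after Papineni et al.); this illustrates the generic metric with a single reference and no smoothing, not the authors' evaluation script.

```python
# Minimal sketch of sentence-level BLEU: geometric mean of modified
# 1..4-gram precisions times a brevity penalty. Single reference,
# no smoothing; shown only to make the metric concrete.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_avg)

# Illustrative usage:
# print(bleu("the cat is on the mat".split(), "the cat sat on the mat".split()))
```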
BLEU (contd.)
Modified BLEU: the candidate translation and the reference translation are compared through a monolingual concept dictionary
– Improves the BLEU score by considering concepts rather than surface words
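A minimal sketch of this idea, assuming the monolingual concept dictionary maps Hindi words to concept (synset) identifiers. Only the modified unigram precision is shown for brevity; in the full metric the same concept mapping would be applied before the BLEU computation sketched above. The dictionary entries are illustrative assumptions, not the authors' resource.

```python
# Hypothetical sketch of the modified-BLEU idea: map candidate and
# reference tokens to concept identifiers via a monolingual concept
# dictionary, then measure n-gram overlap over concepts instead of
# surface words. The tiny dictionary below is purely illustrative.

from collections import Counter

CONCEPT_DICT = {
    # synonymous Hindi words share a concept id (illustrative entries)
    "dalila": "C_DOCUMENT",
    "dastAveja": "C_DOCUMENT",
    "sApha": "C_CLEAR",
    "spashTa": "C_CLEAR",
}

def to_concepts(tokens):
    """Replace each word by its concept id; unknown words map to themselves."""
    return [CONCEPT_DICT.get(w, w) for w in tokens]

def concept_unigram_precision(candidate, reference):
    cand = Counter(to_concepts(candidate))
    ref = Counter(to_concepts(reference))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

# Different word choices for the same concept now count as matches:
# concept_unigram_precision("eka likhita dalila".split(),
#                           "eka likhita dastAveja".split())   # -> 1.0
```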
Conclusion
– The postprocessing targets inflected words which remain unchanged (untranslated) after translation
– Words which are wrongly translated are not handled
– A morphological analyzer/generator may be useful
– Incorporating the dictionary decreases the fluency level
References
– W. S. Bennett, J. Slocum. The LRC Machine Translation System. Computational Linguistics, 11(2-3).
– P. F. Brown, S. D. Pietra, V. J. D. Pietra, R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).
– A. Eisele, C. Federmann, H. Uszkoreit, H. Saint-Amand, M. Kay, M. Jellinghaus, S. Hunsicker, T. Herrmann, Y. Chen. Hybrid Machine Translation Architectures within and beyond the EuroMatrix project. In Proceedings of the European Machine Translation Conference.
– Ethnologue: Languages of the World, 16th edition. Edited by M. Paul Lewis.
– P. Isabelle, C. Goutte, M. Simard. Domain adaptation of MT systems through automatic post-editing. In Proceedings of MT Summit XI, Copenhagen, Denmark.
– P. Koehn, F. J. Och, D. Marcu. Statistical phrase-based translation. In Proceedings of NAACL-HLT, Edmonton, Canada.
– P. Koehn. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of the Association for Machine Translation in the Americas (AMTA-2004).
– F. J. Och, H. Ney. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the ACL.
– F. J. Och, H. Ney. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4).
– K. Papineni, S. Roukos, T. Ward, W. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the ACL, Philadelphia.
– A. Ushioda. Phrase Alignment for Integration of SMT and RBMT Resources. In MT Summit XI Workshop on Patent Translation.
– H. Wu, H. Wang. Improving Statistical Word Alignment with a Rule-Based Machine Translation System. In Proceedings of COLING.
Thank You