Does Syntactic Knowledge Help English-Hindi SMT?
Avinesh PVS, K. Taraka Rama, Karthik Gali

Statistical Machine Translation
- MOSES (Koehn et al., 2007)
  - Started for European language pairs
  - Now being used for linguistically distant pairs
    - English-Arabic
    - English-Chinese
- Surging interest in English-Hindi SMT
  - Simple Syntactic and Morphological Processing (R. Ananthakrishnan et al., 2008)
  - Global Lexical Selection and Sentence Reconstruction (S. Venkatapathy and S. Bangalore, 2006)
- Evaluated using BLEU score

Observed Problems with English-Hindi SMT
- Low BLEU scores recorded for small data sets
- Linguistically distant languages
- Morphological differences leading to data sparseness
- Problem of unknown words
- Reordering problems
- Lack of huge language models
- Quality of the reference translation
- BLEU score not directly proportional to the quality of the translation

Our Approach: Proposed and Tested Solutions
- Morphological differences leading to data sparseness
  - Use stemming and dictionary-based techniques
- Problem of unknown words
  - NER techniques and transliteration
- Reordering issues
  - Prior reordering of the English side using transfer rules
- Lack of domain-specific huge language models
  - Use a huge monolingual corpus for generating language models
- Remove erroneous phrase translations from the phrase table

Data Sets
- EILMT parallel corpus
  - Training Set: 7000 sentences
  - Development Set: 500 sentences
  - Test Set: 500 sentences
- IIIT-TIDES data set
  - Training Set: 50,000 sentences
  - Development Set: 1000 sentences
  - Test Set: 1000 sentences

Reordering: Transfer Grammar
- Transfer rules learned using the dependency tree of the English sentence
  - Libin Shen's parser used for parsing the English side
  - POS tags from both sides
- Example rule
  - IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2
- The English side of the training and test corpora is reordered

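Applying a transfer rule to the English side can be sketched as below. This is a minimal illustration, not the system's actual code. It assumes a reading of the rule notation that the slides do not spell out: `&` marks the head node, `~k` marks the k-th child, and each label stands for the words of that node's whole subtree. The function name `apply_rule` and the sample words are hypothetical, and the rule is written with an ASCII tilde.

```python
from typing import Dict, List


def apply_rule(rule: str, constituents: Dict[str, List[str]]) -> List[str]:
    """Reorder the words of a head and its children according to a rule.

    rule:         e.g. "IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2"
    constituents: maps each label on the left-hand side to the words it covers
    """
    lhs_text, rhs_text = rule.split("==>")
    lhs = lhs_text.strip().split("_")
    rhs = rhs_text.strip().split("_")

    # A reordering rule only permutes labels; it never adds or drops one.
    if sorted(lhs) != sorted(rhs):
        raise ValueError("both sides of a rule must contain the same labels")
    if set(constituents) != set(lhs):
        raise ValueError("constituents do not match the rule's left-hand side")

    # Emit the words in the order given by the right-hand side.
    reordered: List[str] = []
    for label in rhs:
        reordered.extend(constituents[label])
    return reordered


if __name__ == "__main__":
    rule = "IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2"
    words = {"IN~1": ["in"], "NN&": ["Delhi"], "VB~2": ["lives"]}
    print(" ".join(apply_rule(rule, words)))
```

Keying the constituents by label keeps the sketch independent of how the parser represents its trees; a real system would walk the dependency tree and apply matching rules at every node.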
Reordering: Learning Rules
- Word alignment using GIZA++
- For each node
  - Consider child nodes
  - Check relative positions of node and child nodes in source
  - Check relative positions of projections of node and child nodes in target
  - Combine this information to form a transfer rule

Reordering: Example
[Figure: source node w2//NN with w1//IN and w3//VBZ, aligned to the words of the target sentence]
Learnt rule: IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2

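The rule-learning step can be sketched for a single node as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the head is labelled with `&`, children are numbered `~1`, `~2`, ... in source order, every node is assumed to have exactly one aligned target position, and the function name `learn_rule` is hypothetical. Tags are given already coarsened (VB rather than VBZ), as in the learnt rule above.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, str]  # (source position, POS tag)


def learn_rule(head: Node, children: List[Node], alignment: Dict[int, int]) -> str:
    """Build a transfer rule from one head, its children and a word alignment.

    alignment maps a source position to the target position it projects to
    (for example, taken from GIZA++ output).
    """
    # Label the head with "&" and the children with "~k" in source order.
    labels = {head[0]: head[1] + "&"}
    for k, (position, tag) in enumerate(sorted(children), start=1):
        labels[position] = f"{tag}~{k}"

    # Left-hand side: relative order of the nodes in the source sentence.
    source_order = sorted(labels)
    # Right-hand side: relative order of their projections in the target.
    target_order = sorted(labels, key=lambda position: alignment[position])

    lhs = "_".join(labels[p] for p in source_order)
    rhs = "_".join(labels[p] for p in target_order)
    return f"{lhs} ==> {rhs}"


if __name__ == "__main__":
    # w1//IN, w2//NN (head), w3//VB; the target puts w2 before w1.
    rule = learn_rule(
        head=(2, "NN"),
        children=[(1, "IN"), (3, "VB")],
        alignment={1: 2, 2: 1, 3: 3},
    )
    print(rule)  # IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2
```

Unaligned or multiply aligned words would need an extra policy (skip the node, or pick one projection), which this sketch leaves out.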
Reordering: Simple Syntactic Rules
- Movement of the verb in a sentence
- 30% (~2100) of the sentences are compound or complex sentences
- Based on POS tags
  - No deep syntactic information (like parsing)
- Handcrafted rules
  - Tag the corpus with Complex and Compound Sentence tags
  - The words and, but, because, or mark the conjunctions
  - Move the verbs on both sides of the connective (e.g. and, but) to the end of the conjuncts

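The verb-movement rule can be sketched on POS-tagged input as below. This is a rough illustration rather than the handcrafted rules themselves: it assumes Penn Treebank style tags (verbs start with `VB`, the sentence-final period is tagged `.`), it assumes the sentence has already been identified as compound or complex, and it treats every occurrence of a connective as a clause boundary. The function name is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

CONNECTIVES = {"and", "but", "because", "or"}


def move_verbs_to_end(tagged: List[Token]) -> List[Token]:
    """Move the verbs of each conjunct to the end of that conjunct."""
    output: List[Token] = []
    clause: List[Token] = []

    def flush() -> None:
        # Keep non-verbs in place, then append the verbs in their original order.
        non_verbs = [tok for tok in clause if not tok[1].startswith("VB")]
        verbs = [tok for tok in clause if tok[1].startswith("VB")]
        output.extend(non_verbs + verbs)
        clause.clear()

    for word, tag in tagged:
        if word.lower() in CONNECTIVES or tag == ".":
            flush()                      # close the conjunct before the boundary
            output.append((word, tag))   # the boundary token itself stays put
        else:
            clause.append((word, tag))
    flush()
    return output


if __name__ == "__main__":
    sentence = [
        ("He", "PRP"), ("ate", "VBD"), ("rice", "NN"), ("and", "CC"),
        ("she", "PRP"), ("drank", "VBD"), ("tea", "NN"), (".", "."),
    ]
    print(" ".join(word for word, _ in move_verbs_to_end(sentence)))
```

Because the sketch splits on every connective, a phrase-level "and" (as in "bread and butter") would also be treated as a boundary; restricting the rule to sentences tagged as compound or complex limits, but does not remove, that risk.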
Handling Unknown Words
- Dictionary
  - Extract the root from the word on the English side
  - Generate TAM (tense, aspect, modality) information
  - Translate the root into Hindi using the dictionary
  - Map the TAM information to Hindi TAM
  - Generate the Hindi word using root and TAM information
- Transliteration
  - Transliterate the NE
  - Tool currently not available

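The dictionary pipeline above can be sketched end to end with toy data. Everything in this block is an assumption made for illustration: the suffix-stripping analyser, the two-entry root dictionary, the TAM labels and the romanized Hindi forms (masculine singular only) are stand-ins, not the resources used in the experiments.

```python
from typing import Optional, Tuple

# Toy resources, hypothetical and deliberately tiny.
SUFFIX_TO_TAM = [("ing", "PROG"), ("ed", "PAST"), ("s", "PRES")]
ROOT_DICT = {"walk": "chal", "play": "khel"}          # English root -> Hindi root
HINDI_TAM = {                                         # English TAM -> Hindi marking
    "PROG": " rahaa hai",
    "PAST": "aa",
    "PRES": "taa hai",
    "BASE": "naa",
}


def analyse(word: str) -> Tuple[str, str]:
    """Steps 1-2: split an English word into (root, TAM label) by suffix stripping."""
    for suffix, tam in SUFFIX_TO_TAM:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)], tam
    return word, "BASE"


def translate_unknown(word: str) -> Optional[str]:
    """Steps 3-5: look up the root, map the TAM, and generate the Hindi form."""
    root, tam = analyse(word.lower())
    hindi_root = ROOT_DICT.get(root)
    if hindi_root is None:
        return None  # not in the dictionary; leave it to another component
    return hindi_root + HINDI_TAM[tam]


if __name__ == "__main__":
    for w in ["walked", "playing", "walks", "bus"]:
        print(w, "->", translate_unknown(w))
```

Returning `None` for words the dictionary does not cover keeps the fallback decision (for example, transliteration of named entities) outside this function.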
Experiments with Phrase Table
- Observed that end-of-sentence (EOS) markers such as (.) are aligned wrongly
- Remove the phrase translations with (.)s aligned to words
  - Found 10,000 of them (~2.5% of the total phrase table)
- Train and tune the system with EOS markers removed and stored elsewhere
- Add the EOS markers after running the decoder
- Evaluate the system
  - Observed that the BLEU score increases
  - The quality of the translation improves

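Both phrase-table steps can be sketched in a few lines. This is a simplified sketch, not the procedure actually used: it assumes the Moses text phrase-table layout with fields separated by `|||`, it only catches entries where one whole side is a lone EOS marker and the other side is not, and it assumes the Hindi side ends sentences with either `.` or the Devanagari danda. All function names are hypothetical.

```python
from typing import List, Tuple

EOS_SOURCE = {"."}
EOS_TARGET = {".", "\u0964"}  # "\u0964" is the Devanagari danda


def is_bad_eos_entry(line: str) -> bool:
    """True if an EOS marker on one side is paired with a non-marker on the other."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 2:
        return False  # not a phrase-table line; leave it alone
    source, target = fields[0], fields[1]
    return (source in EOS_SOURCE) != (target in EOS_TARGET)


def filter_phrase_table(lines: List[str]) -> List[str]:
    """Drop the entries flagged by is_bad_eos_entry, keep everything else."""
    return [line for line in lines if not is_bad_eos_entry(line)]


def strip_eos(sentence: str) -> Tuple[str, bool]:
    """Remove a trailing EOS token before training or decoding; remember if it was there."""
    tokens = sentence.split()
    if tokens and tokens[-1] in EOS_SOURCE:
        return " ".join(tokens[:-1]), True
    return " ".join(tokens), False


def restore_eos(translation: str, had_eos: bool, marker: str = "\u0964") -> str:
    """Re-attach the EOS marker to the decoder output."""
    return f"{translation} {marker}" if had_eos else translation


if __name__ == "__main__":
    table = [
        ". ||| \u0964 ||| 0.5 0.5",
        ". ||| hai ||| 0.1 0.1",
        "house ||| ghar ||| 0.7 0.6",
    ]
    print(len(filter_phrase_table(table)))
    text, had_eos = strip_eos("He came .")
    print(restore_eos("vah aayaa", had_eos))
```

Entries in which a period is misaligned inside a longer phrase would need the word-alignment field of the phrase table, which this sketch does not inspect.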
Effect of Language Models
- Experimented with 7K tourism corpus
- No conclusion can be drawn
  - Hugely varying results
  - Cannot explain the variation in the results

Results - I: Without Tuning
System
- Phrase Removal
- Additional Language Model 7K Hindi corpus
- Dictionary
- Reordering with TG
- Baseline

Results - II: With Tuning
System
- Phrase Removal
- Additional Language Model 7K Hindi corpus: -
- Dictionary: -
- Reordering with TG
- Baseline

Conclusions & Future Work
- Incorporate syntactic phrases into SMT
- Long-distance reordering models are required
  - Current systems handle local reordering
- Robust TG rules for reordering the source side
- Build huge language models

References
- P. Koehn and H. Hoang. Factored translation models. In Proc. of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP/CoNLL).
- P. Koehn, F.J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics, Morristown, NJ, USA.
- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 2.
- S. Kumar and W. Byrne. Minimum Bayes-Risk Decoding for Statistical Machine Translation. Johns Hopkins University, Center for Language and Speech Processing (CLSP), Baltimore, MD.

References (continued)
- F.J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- F.J. Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1, pages 160–167. Association for Computational Linguistics, Morristown, NJ, USA.
- K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. BLEU: a method for automatic evaluation of MT. Research Report, Computer Science RC22176 (W ), IBM Research Division, T.J. Watson Research Center, 17.
- L. Shen. Statistical LTAG Parsing. Ph.D. thesis, University of Pennsylvania.
- A. Stolcke. SRILM - an extensible language modeling toolkit. International Conference on Spoken Language Processing, Denver, Colorado. SRI, Tech. Rep.
- S. Venkatapathy and S. Bangalore. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In SSST.