Does Syntactic Knowledge Help English-Hindi SMT? Avinesh PVS, K. Taraka Rama, Karthik Gali



Statistical Machine Translation
MOSES (Koehn et al., 2007)
- Started for European language pairs
- Now being used for linguistically distant pairs
  - English-Arabic
  - English-Chinese
- Surging interest in English-Hindi SMT
  - Simple Syntactic and Morphological Processing (R. Ananthakrishnan et al., 2008)
  - Global Lexical Selection and Sentence Reconstruction (S. Venkatapathy and S. Bangalore, 2006)
- Evaluated using BLEU score

Observed Problems with English-Hindi SMT
Low BLEU scores recorded for small data sets
- Linguistically distant languages
- Morphological differences leading to data sparseness
- Problem of unknown words
- Reordering problems
- Lack of huge language models
- Quality of the reference translation
- BLEU score not directly proportional to the quality of the translation

Our Approach
Proposed and Tested Solutions
- Morphological differences leading to data sparseness: use stemming and dictionary-based techniques
- Problem of unknown words: NER techniques and transliteration
- Reordering issues: prior reordering of the English side using transfer rules
- Lack of domain-specific huge language models: use a huge monolingual corpus to generate language models
- Remove erroneous phrase translations from the phrase table

Data Sets
EILMT parallel corpus
- Training Set: 7000 sentences
- Development Set: 500 sentences
- Test Set: 500 sentences
IIIT-TIDES data set
- Training Set: 50,000 sentences
- Development Set: 1000 sentences
- Test Set: 1000 sentences

Reordering: Transfer Grammar
Transfer rules learned using
- Dependency tree of the English sentence (Libin's parser used for parsing the English side)
- POS tags from both sides
Example rule
- IN˜1_NN&_VB˜2 ==> NN&_IN˜1_VB˜2
The English side of the training and test corpora is reordered

Reordering: Learning Rules
Word alignment using GIZA++
For each node
- Consider its child nodes
- Check the relative positions of the node and its child nodes in the source
- Check the relative positions of the projections of the node and its child nodes in the target
- Combine this information to form a transfer rule
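The rule-learning steps for a single node can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of the leftmost aligned target position as a word's projection, and the rule string format (head marked with `&`, children numbered in source order, written with an ASCII tilde) are assumptions chosen to mimic the example rule on the slides.

```python
from typing import Dict, List, Optional


def project(src_idx: int, alignment: Dict[int, List[int]]) -> Optional[int]:
    """Leftmost target position aligned to a source word, or None if unaligned."""
    targets = alignment.get(src_idx)
    return min(targets) if targets else None


def learn_rule(head: int, children: List[int], tags: List[str],
               alignment: Dict[int, List[int]]) -> Optional[str]:
    """Build one transfer rule for a node and its child nodes.

    head, children: source word indices from the dependency tree.
    tags: POS tag per source word.
    alignment: source index -> aligned target indices (e.g. from GIZA++).
    """
    nodes = sorted([head] + children)            # relative order in the source
    projections = {i: project(i, alignment) for i in nodes}
    if any(p is None for p in projections.values()):
        return None                              # assumption: skip unaligned nodes

    # Label the head with '&' and number the children in source order.
    labels = {}
    child_no = 0
    for i in nodes:
        if i == head:
            labels[i] = tags[i] + "&"
        else:
            child_no += 1
            labels[i] = f"{tags[i]}~{child_no}"

    # Relative order of the projections in the target (stable for ties).
    target_order = sorted(nodes, key=lambda i: projections[i])

    lhs = "_".join(labels[i] for i in nodes)
    rhs = "_".join(labels[i] for i in target_order)
    return f"{lhs} ==> {rhs}"


# Hypothetical three-word fragment: the NN head projects before the IN child.
rule = learn_rule(head=1, children=[0, 2], tags=["IN", "NN", "VB"],
                  alignment={0: [1], 1: [0], 2: [2]})
print(rule)
```

Taking the leftmost aligned position is only one way to project a word; a node aligned to several target words could equally be projected by its average or rightmost position.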

Reordering: Example
[Figure: source words w2//NN, w1//IN, w3//VBZ linked by word alignment to the target sentence words (t); labels: Source node, Alignment, Target Sentence]
Learnt Rule: IN˜1_NN&_VB˜2 ==> NN&_IN˜1_VB˜2

Reordering: Simple Syntactic Rules
Movement of the verb in a sentence
- 30% (~2100) of the sentences are compound or complex sentences
- Based on POS tags
- No deep syntactic information (like parsing)
- Handcrafted rules
  - Tag the corpus with Complex and Compound Sentence tags
  - The words and, but, because, or mark the conjunctions
  - Move the verbs on both sides of the connective (e.g. and, but) to the end of the conjuncts
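A POS-based verb movement rule of this kind might look like the sketch below. It is an illustration under stated assumptions, not the handcrafted rule set used in the experiments: the connective list is taken from the slide, verbs are assumed to carry Penn Treebank style tags starting with `VB`, and every verb in a conjunct is moved to the end of that conjunct.

```python
from typing import List, Tuple

# Connectives named on the slide; treated here as conjunct boundaries.
CONNECTIVES = {"and", "but", "because", "or"}

Tagged = Tuple[str, str]  # (word, POS tag)


def move_verbs_to_end(tagged: List[Tagged]) -> List[Tagged]:
    """Move the verbs of each conjunct to the end of that conjunct."""
    out: List[Tagged] = []
    conjunct: List[Tagged] = []

    def flush() -> None:
        # Keep non-verbs in place, then append the verbs in their original order.
        non_verbs = [t for t in conjunct if not t[1].startswith("VB")]
        verbs = [t for t in conjunct if t[1].startswith("VB")]
        out.extend(non_verbs + verbs)
        conjunct.clear()

    for word, tag in tagged:
        if word.lower() in CONNECTIVES:
            flush()                    # close the conjunct left of the connective
            out.append((word, tag))
        else:
            conjunct.append((word, tag))
    flush()                            # close the final conjunct
    return out


# Hypothetical tagged input sentence.
sentence = [("Ram", "NNP"), ("ate", "VBD"), ("rice", "NN"), ("and", "CC"),
            ("Sita", "NNP"), ("drank", "VBD"), ("milk", "NN")]
print(" ".join(w for w, _ in move_verbs_to_end(sentence)))
```

Because the sketch splits on every connective, it would also split a noun coordination such as "bread and butter"; restricting it to sentences already tagged as compound or complex, as the slide describes, limits that risk.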

Handling Unknown Words
Dictionary
- Extract the root from the word on the English side
- Generate TAM (tense, aspect, modality) information
- Translate the root into Hindi using the dictionary
- Map the TAM information to Hindi TAM
- Generate the Hindi word using the root and TAM information
Transliteration
- Transliterate the NE
- Tool currently not available

Experiments with Phrase Table
Observed that end-of-sentence (EOS) markers such as (.) are aligned wrongly
- Remove the phrase translations with (.)s aligned to words
- Found 10,000 of them (~2.5% of the total phrase table)
- Train and tune the system with EOS markers removed and stored elsewhere
- Add the EOS markers after running the decoder
- Evaluate the system
- Observed that the BLEU score increases
- The quality of the translation improves
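The phrase-table cleanup and the EOS handling around the decoder can be sketched roughly as below. Everything here is an assumption for illustration: the entries are assumed to use the `source ||| target ||| scores` field layout, a marker appearing on only one side of an entry is used as a stand-in for "aligned to a word", and the helper names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

EOS_MARKERS: Set[str] = {"."}  # assumption: extend with other markers as needed


def is_bad_entry(line: str, markers: Set[str] = EOS_MARKERS) -> bool:
    """True if an EOS marker occurs on exactly one side of a phrase pair."""
    fields = [f.strip() for f in line.split("|||")]
    src, tgt = fields[0].split(), fields[1].split()
    src_has = any(tok in markers for tok in src)
    tgt_has = any(tok in markers for tok in tgt)
    return src_has != tgt_has


def filter_phrase_table(lines: Iterable[str]) -> List[str]:
    """Drop the suspicious entries and keep the rest unchanged."""
    return [ln for ln in lines if not is_bad_entry(ln)]


def strip_eos(sentence: str, markers: Set[str] = EOS_MARKERS) -> Tuple[str, str]:
    """Remove a trailing EOS marker and return (sentence, removed marker)."""
    tokens = sentence.split()
    if tokens and tokens[-1] in markers:
        return " ".join(tokens[:-1]), tokens[-1]
    return " ".join(tokens), ""


def restore_eos(translation: str, marker: str) -> str:
    """Re-attach a stored EOS marker after decoding."""
    return f"{translation} {marker}".strip()


# Placeholder entries (x, y stand for target words).
table = ["a . ||| x y ||| 0.5", "a . ||| x . ||| 0.5", "a b ||| x y ||| 0.5"]
print(filter_phrase_table(table))

body, marker = strip_eos("this is a test .")
print(restore_eos(body, marker))
```

If the phrase table carries word-alignment information for each entry, checking the alignment points of the marker directly would match the described criterion more closely than this one-sided heuristic.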

Effect of Language Models
Experimented with 7K tourism corpus
- No conclusion can be drawn
- Hugely varying results
- Cannot explain the variation in the results

Results - I: Without Tuning
[Table: scores per system; the values are not preserved in the transcript. Systems listed: Phrase Removal; Additional Language Model (7K Hindi corpus); Dictionary; Reordering with TG; Baseline]

Results - II: With Tuning
[Table: scores per system; the values are not preserved in the transcript. Systems listed: Phrase Removal; Additional Language Model (7K Hindi corpus); Dictionary; Reordering with TG; Baseline]

Conclusions & Future Work
- Incorporate syntactic phrases into SMT
- Long-distance reordering models are required
  - Current systems handle local reordering
- Robust TG rules for reordering the source side
- Build huge language models

References
- P. Koehn and H. Hoang. Factored translation models. In Proc. of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP/CoNLL).
- P. Koehn, F.J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics, Morristown, NJ, USA.
- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 2.
- S. Kumar and W. Byrne. Minimum Bayes-Risk Decoding for Statistical Machine Translation. Johns Hopkins University, Center for Language and Speech Processing (CLSP), Baltimore, MD.

References (continued)
- F.J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- F.J. Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 160–167. Association for Computational Linguistics, Morristown, NJ, USA.
- K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. BLEU: a method for automatic evaluation of MT. Research Report, Computer Science RC22176 (W ), IBM Research Division, T.J. Watson Research Center, 17.
- L. Shen. Statistical LTAG Parsing. Ph.D. thesis, University of Pennsylvania.
- A. Stolcke. SRILM: an extensible language modeling toolkit. In International Conference on Spoken Language Processing, Denver, Colorado. SRI, Tech. Rep.
- S. Venkatapathy and S. Bangalore. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In SSST.