Build MT systems with Moses
MT Marathon in the Americas 2017
Hieu Hoang / Jeremy Gwinnup

Outline
- Pull the MTMA17-lab docker image
  - Pre-compiled Moses and mgiza
  - Contains small training/tuning/test corpora
- Run each step of training
  - Create an MT system
- Run the Experiment Management System (EMS)
  - Runs all steps with one command
- Install Moses and mgiza on your laptop

Start
- Install Docker: https://www.docker.com/community-edition
- Pull the mtma17-lab docker image
- Follow the instructions in the handout: http://statmt.org/~s0565741/download/mtma16/
- Run the commands
- We will be creating an Arabic-to-English translation system
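A minimal sketch of the Docker steps; the image name below is hypothetical, the handout gives the real one:

    # Pull the lab image (image name is an assumption; use the one in the handout)
    docker pull mtma17/lab

    # Start an interactive shell inside the container
    docker run -it mtma17/lab bash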

Data Arabic – Buckwalter encoding (’Romanized’) Datasets AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq Datasets Train 5,000 parallel sentences 71,286 monolingual sentences just in English Tune 50 parallel sentences Test 48 parallel sentences

SMT Pipeline
- Preprocessing: clean, tokenize, lowercase
- Alignment
- Create LM
- Phrase extraction
- Tuning
- Decoding
- Postprocessing: recasing, detokenization
- Scoring: BLEU score

Each part of the pipeline is critical to producing a good MT system. We could show you how to do each part, but that would take a week and you would lose the will to live! However, it is not necessary to know the mechanics of each and every part to get started. For those who don't need to know, or who know but just want it to work consistently, there is a system (EMS) that wraps up the whole pipeline.
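The preprocessing steps map onto standard Moses scripts. A minimal sketch for the English side, assuming raw text in data/Train/Train_data.en (the exact paths for this lab are in the handout):

    # Tokenize (split punctuation from words); -l selects the language rules
    $MOSES_DIR/scripts/tokenizer/tokenizer.perl -l en \
        < data/Train/Train_data.en > data/Train/Train_data.tok.en

    # Lowercase the tokenized text
    $MOSES_DIR/scripts/tokenizer/lowercase.perl \
        < data/Train/Train_data.tok.en > data/Train/Train_data.lc.en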

Clean data

    $MOSES_DIR/scripts/training/clean-corpus-n.perl \
        data/Train/Train_data ar en data/Train/Train_data.clean 1 100

- Deletes sentence pairs containing a sentence outside the 1-100 word range
- Deletes sentence pairs where the length ratio exceeds 9

Language Model

    nice $MOSES_DIR/bin/lmplz --order 3 \
        --text $HOME/$WORK/data/LM/LM_data+Train_data.en \
        --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm

- Creates the LM
- Maximum n-gram size = 3
- Uses KenLM
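Not shown on the slide, but a common follow-up: KenLM's build_binary (also in the Moses bin directory) converts the ARPA file to a binary format that loads much faster. A sketch; the output filename is an assumption:

    # Binarize the ARPA LM for faster loading (output name is illustrative)
    $MOSES_DIR/bin/build_binary \
        $HOME/$WORK/work/LM/LM_data+Train_data.en.lm \
        $HOME/$WORK/work/LM/LM_data+Train_data.en.binlm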

Language Model
Target text: "the cow jumped over the moon"

By the chain rule:

    p(the cow jumped over the moon) = p(the)
        * p(cow | the)
        * p(jumped | the cow)
        * p(over | the cow jumped)
        * p(the | the cow jumped over)
        * p(moon | the cow jumped over the)

File work/LM/LM_data+Train_data.en.lm (ARPA format):

    \data\
    ngram 1=139572
    ngram 2=1061731
    ngram 3=2239731

    \1-grams:
    -6.0734353   <unk>     0
     0           <s>       -0.91558355
    -1.6365006   </s>      0
    -5.2046447   Nicosia   -0.11571049
    ...

    \2-grams:
    -2.1021864   (AFP) </s>   0
    -1.4692371   - </s>       0

    \3-grams:
    -0.16613887  <s> (AFP) </s>
    -1.4355018   18/02 (AFP) </s>
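To sanity-check the model, you can score a sentence with KenLM's query tool, which is built alongside lmplz; it prints a per-word log10 probability and a total:

    # Score a sentence with the trained LM
    echo "the cow jumped over the moon" | \
        $MOSES_DIR/bin/query $HOME/$WORK/work/LM/LM_data+Train_data.en.lm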

Word Alignment and Phrase Extraction
- Run GIZA++ for word alignment
- Extract translation rules (phrases) from the word-aligned parallel corpus
- Create the phrase table
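In practice both steps are driven by one wrapper script, train-model.perl. A sketch, assuming the mgiza binaries live in $MGIZA_DIR and the LM path matches the earlier step:

    # Word alignment + phrase extraction + phrase table in one command
    # (the :8 in -lm selects KenLM as the LM implementation)
    $MOSES_DIR/scripts/training/train-model.perl \
        -root-dir $HOME/$WORK/work \
        -corpus data/Train/Train_data.clean -f ar -e en \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        -lm 0:3:$HOME/$WORK/work/LM/LM_data+Train_data.en.lm:8 \
        -external-bin-dir $MGIZA_DIR -mgiza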

Word Alignment
- Training data: data/Train/Train_data.clean.[ar/en]
- Word alignments: work/model/aligned.grow-diag-final-and
- E.g. (each i-j pair aligns source word i to target word j):

    0-0 0-1 4-1 0-2 1-2 2-2 3-2 0-3 0-4 0-5 7-6 8-7
    AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
    Saddam Hussein's Half-Brother Refuses to Return to Iraq

Phrase-Table
- One translation rule per line: source ||| target ||| scores ||| word alignment ||| counts

    ! ! ! . . ||| People pass by houses ||| 0.2 5.34133e-10 0.166667 4.38429e-14 ||| 0-1 ||| 5 6 1

- The scores include p(source|target) and p(target|source)
- 360,000 translation rules in this system
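To inspect the table, you can pull out the rules for a given source phrase; this assumes the default output location work/model/phrase-table.gz:

    # Show translation rules whose source side starts with "yrfD" ("refuses")
    zcat work/model/phrase-table.gz | grep '^yrfD ' | head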

Tuning
- The model score is a log-linear combination of feature functions:

    \log p(e \mid f) = \sum_{i=1}^{n} \lambda_i h_i(e, f)

- Iterative process:

    do
        decode the tuning set
        adjust the weights (λ_i)
    until the weights converge

- moses.ini after tuning:

    [weight]
    LexicalReordering0= 0.0979471 0.0260167 0.0749775 0.0402326 0.0269783 0.011694
    Distortion0= 0.0877464
    LM0= 0.111063
    WordPenalty0= -0.214965
    PhrasePenalty0= 0.0397249
    TranslationModel0= 0.0743573 0.0981889 0.0624994 0.0336091
    UnknownWordPenalty0= 1
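The standard tuning driver is mert-moses.pl. A sketch, with the tuning-set filenames assumed (the lab handout gives the real ones):

    # Tune feature weights on the 50-sentence tuning set (MERT)
    $MOSES_DIR/scripts/training/mert-moses.pl \
        data/Tune/Tune_data.ar data/Tune/Tune_data.en \
        $MOSES_DIR/bin/moses work/model/moses.ini \
        --mertdir $MOSES_DIR/bin \
        --working-dir work/tuning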

Evaluation
- Decode the test set
- Calculate the BLEU score
  - Compares the output with a reference translation
  - Precision metric: percentage of correct 1-grams, 2-grams, 3-grams and 4-grams
  - Geometric mean of the n-gram precisions
  - Brevity penalty for output that is shorter than the reference
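A sketch of both steps, with the test-set filenames assumed:

    # Decode the test set with the tuned weights
    $MOSES_DIR/bin/moses -f work/tuning/moses.ini \
        < data/Test/Test_data.ar > work/Test_data.out

    # Score against the reference translation
    $MOSES_DIR/scripts/generic/multi-bleu.perl data/Test/Test_data.en \
        < work/Test_data.out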

BLEU score
Example scorer output:

    BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)

- 23.02: the overall score
- 60.0/30.3/17.2/9.5: unigram, bigram, trigram and 4-gram match percentages
- BP: brevity penalty
- hyp_len / ref_len: output length / reference length

Experiment Management System (EMS)
- Driven by a config file, which specifies:
  - Where to find the Moses scripts and executables
  - External programs: Giza/mgiza/cdec etc., POS taggers, parsers etc.
  - Training, tuning and test data
  - Parameters, e.g. recasing/truecasing, phrase-based/hiero
  - Number of cores / grid-engine jobs to use
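Moses ships a sample configuration at scripts/ems/example/config.toy that can be adapted. A sketch of a typical invocation; the config filename here is an assumption:

    # Run the whole pipeline described in the config file
    $MOSES_DIR/scripts/ems/experiment.perl -config config.ems -exec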

EMS
- Advantages:
  - Consistent: reduces mistakes and makes debugging easier
  - Runs processes in parallel
  - Runs multiple experiments simultaneously
- Disadvantages:
  - Sometimes buggy
  - Doesn't do everything; occasionally you need to run some steps manually

Install Moses http://www.statmt.org/moses/
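A sketch of a from-source install, assuming g++, Boost and CMake are already present (see the website for full, platform-specific instructions):

    # Build the Moses decoder
    git clone https://github.com/moses-smt/mosesdecoder.git
    cd mosesdecoder
    ./bjam -j4

    # Build mgiza for word alignment (see the mgiza README for install details)
    git clone https://github.com/moses-smt/mgiza.git
    cd mgiza/mgizapp
    cmake . && make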