Build MT systems with Moses
MT Marathon in the Americas 2017
Hieu Hoang / Jeremy Gwinnup
Outline
- Pull the MTMA17-lab docker image
  - Pre-compiled Moses and mgiza
  - Contains small training/tuning/test corpora
- Run each step of training
  - Create an MT system
- Run the Experiment Management System (EMS)
  - Run all steps with 1 command
- Install Moses and mgiza on your laptop
Start
- Install Docker: https://www.docker.com/community-edition
- Pull the mtma17-lab docker image
- Follow the instructions in the handout: http://statmt.org/~s0565741/download/mtma16/
- Run the commands, creating an Arabic-to-English translation system
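As a rough sketch, pulling the image and opening a shell inside it looks like the following; the exact repository path comes from the handout, so USERNAME/mtma17-lab below is a stand-in:

# Pull the lab image (replace USERNAME/mtma17-lab with the path given in the handout)
docker pull USERNAME/mtma17-lab

# Start an interactive shell inside the container
docker run -it USERNAME/mtma17-lab /bin/bash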
Data
- Arabic: Buckwalter encoding ("Romanized")
  e.g. AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
- Datasets
  - Train: 5,000 parallel sentences; 71,286 monolingual sentences (English only)
  - Tune: 50 parallel sentences
  - Test: 48 parallel sentences
SMT Pipeline
- Preprocessing: clean, tokenize, lowercase
- Alignment
- Create LM
- Phrase extraction
- Tuning
- Decoding
- Postprocessing: recasing, detokenizing
- Scoring: BLEU score

Each part of the MT pipeline is critical to producing a good MT system. We could show you how to do each part, but that would take a week, and you would lose the will to live! However, it is not necessary to know the mechanics of each and every part to get started. For those who don't need to know, or who know but just want it to work consistently, we provide a system which wraps up the pipeline.
Clean data
$MOSES_DIR/scripts/training/clean-corpus-n.perl data/Train/Train_data ar en data/Train/Train_data.clean 1 100
- Deletes sentences over 100 words long (the final two arguments are the minimum and maximum sentence lengths)
- Deletes sentence pairs where the length ratio > 9
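The clean step assumes tokenized input. The lab data is already prepared, but if you were starting from raw text, tokenization and lowercasing would come first; a sketch using the standard Moses scripts (file paths are assumptions following the lab layout):

# Tokenize each side of the parallel corpus
$MOSES_DIR/scripts/tokenizer/tokenizer.perl -l ar < data/Train/Train_data.ar > Train_data.tok.ar
$MOSES_DIR/scripts/tokenizer/tokenizer.perl -l en < data/Train/Train_data.en > Train_data.tok.en

# Lowercase the tokenized English side
$MOSES_DIR/scripts/tokenizer/lowercase.perl < Train_data.tok.en > Train_data.lc.en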
Language Model
nice $MOSES_DIR/bin/lmplz --order 3 --text $HOME/$WORK/data/LM/LM_data+Train_data.en --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm
- Creates the LM (maximum n-gram size = 3)
- Uses KenLM
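The ARPA file produced by lmplz can optionally be compiled into KenLM's binary format, which loads much faster; a sketch (output filename is an assumption):

# Compile the ARPA LM into KenLM's binary format
$MOSES_DIR/bin/build_binary $HOME/$WORK/work/LM/LM_data+Train_data.en.lm $HOME/$WORK/work/LM/LM_data+Train_data.en.binlm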
Language Model
Target text: the cow jumped over the moon

p(the cow jumped over the moon) = p(the) * p(cow | the) * p(jumped | the cow) * p(over | the cow jumped) * p(the | the cow jumped over) * p(moon | the cow jumped over the)

File work/LM/LM_data+Train_data.en.lm (ARPA format: each entry is a log10 probability, the n-gram, and an optional backoff weight):

\data\
ngram 1=139572
ngram 2=1061731
ngram 3=2239731

\1-grams:
-6.0734353  <unk>  0
0  <s>  -0.91558355
-1.6365006  </s>  0
-5.2046447  Nicosia  -0.11571049
...

\2-grams:
-2.1021864  (AFP) </s>  0
-1.4692371  - </s>  0

\3-grams:
-0.16613887  <s> (AFP) </s>
-1.4355018  18/02 (AFP) </s>
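To see these probabilities applied in practice, KenLM's query tool scores a sentence word by word; a sketch (LM path as above):

# Score a sentence with the trained LM; a log10 probability is printed for
# each word given its context, followed by the sentence total
echo "the cow jumped over the moon" | $MOSES_DIR/bin/query $HOME/$WORK/work/LM/LM_data+Train_data.en.lm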
SMT Pipeline (recap): next steps are word alignment and phrase extraction.
Word Alignment and Phrase Extraction
- Run GIZA++ (word alignment)
- Extract translation rules (phrases) from the word-aligned parallel corpus
- Create phrase-tables (a typical single-command invocation is sketched below)
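All of these steps are driven by one Moses script; a sketch of a typical invocation (the directory layout and the mgiza location, $MGIZA_DIR, are assumptions):

# Word-align with mgiza, extract phrases, and build the phrase/reordering tables
$MOSES_DIR/scripts/training/train-model.perl \
  -root-dir work -corpus data/Train/Train_data.clean -f ar -e en \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:$HOME/$WORK/work/LM/LM_data+Train_data.en.lm:8 \
  -external-bin-dir $MGIZA_DIR -mgiza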
Word Alignment
- Training data: data/Train/Train_data.clean.[en/ar]
- Word alignment: work/model/aligned.grow-diag-final-and
  e.g. 0-0 0-1 4-1 0-2 1-2 2-2 3-2 0-3 0-4 0-5 7-6 8-7
  (each pair i-j links source word i to target word j)
  Source: AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
  Target: Saddam Hussein's Half-Brother Refuses to Return to Iraq
Phrase-Table
~360,000 translation rules, one per line, with fields separated by |||:
source ||| target ||| scores ||| alignment ||| counts
Example entry:
! ! ! . . ||| People pass by houses ||| 0.2 5.34133e-10 0.166667 4.38429e-14 ||| 0-1 ||| 5 6 1
The scores include the phrase translation probabilities p(s|t) (first) and p(t|s) (third); the other two are lexical weights.
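The phrase-table is a gzipped text file under the training directory, so entries can be inspected directly; a sketch (path assumes the train-model.perl layout above):

# Look at the first few translation rules
zcat work/model/phrase-table.gz | head -n 3

# Find rules whose source side starts with a given Arabic token
zcat work/model/phrase-table.gz | grep '^AlErAq ' | head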
SMT Pipeline (recap): next step is tuning.
Tuning
The model score is a weighted sum of feature functions:
log p(e|f) = Σ_{i=1..n} λ_i · h_i(e, f)
Iterative process:
do
  - Decode the tuning set
  - Adjust the weights (λ_i)
until the weights converge
Moses.ini after tuning:
[weight]
LexicalReordering0= 0.0979471 0.0260167 0.0749775 0.0402326 0.0269783 0.011694
Distortion0= 0.0877464
LM0= 0.111063
WordPenalty0= -0.214965
PhrasePenalty0= 0.0397249
TranslationModel0= 0.0743573 0.0981889 0.0624994 0.0336091
UnknownWordPenalty0= 1
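Tuning is usually launched with the mert-moses.pl wrapper, which repeatedly decodes the tuning set and re-estimates the weights; a sketch (tuning-file paths are assumptions based on the lab layout):

# Run MERT against the tuning set
$MOSES_DIR/scripts/training/mert-moses.pl \
  data/Tune/Tune_data.ar data/Tune/Tune_data.en \
  $MOSES_DIR/bin/moses work/model/moses.ini \
  --mertdir $MOSES_DIR/bin --working-dir work/tuning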
SMT Pipeline (recap): next step is evaluation (BLEU score).
Evaluation
- Decode the test set
- Calculate the BLEU score
  - Compare output with the reference translation
  - Percentage of correct 1-grams, 2-grams, 3-grams, 4-grams
  - Precision metric: geometric mean of the n-gram precisions
  - Brevity penalty
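A sketch of decoding and scoring (test-file paths are assumptions; multi-bleu.perl is the standard Moses scorer, and it prints a line like the one shown on the next slide):

# Translate the test set with the tuned configuration
$MOSES_DIR/bin/moses -f work/tuning/moses.ini < data/Test/Test_data.ar > Test_data.out

# Score against the reference translation
$MOSES_DIR/scripts/generic/multi-bleu.perl data/Test/Test_data.en < Test_data.out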
BLEU score
BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)
- 23.02: overall score
- 60.0/30.3/17.2/9.5: unigram/bigram/trigram/4-gram matches
- BP: brevity penalty
- hyp_len: output length
- ref_len: reference length
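The overall score can be reproduced (up to rounding) from the printed precisions and lengths using the standard BLEU formula:

BLEU = BP \cdot \exp\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right), \qquad BP = \min\left(1,\; e^{1 - r/c}\right)

where p_n are the n-gram precisions, r = ref_len and c = hyp_len. Plugging in: BP = exp(1 - 1277/1260) ≈ 0.987, and the geometric mean of 0.600, 0.303, 0.172, 0.095 is ≈ 0.233, giving 0.987 × 0.233 ≈ 0.230, i.e. BLEU ≈ 23.0, matching the reported 23.02 up to rounding of the printed precisions.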
Experiment Management System (EMS)
Config file specifies:
- Where to find Moses scripts and executables
- External programs
  - Giza/mgiza/cdec etc.
  - POS taggers, parsers etc.
- Training, tuning, test data
- Parameters, e.g. recasing/truecasing, phrase-based/hiero
- Number of cores/grid engine jobs to use
EMS
Advantages:
- Consistent
- Reduces mistakes
- Easier debugging
- Runs processes in parallel
- Run multiple experiments simultaneously
Disadvantages:
- Sometimes buggy
- Doesn't do everything; occasionally need to run some steps manually
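Once the config file is written, one command runs the whole pipeline; a sketch (config.basic here refers to the example config shipped with Moses under scripts/ems/example):

# Run every step of the experiment described in the config file
$MOSES_DIR/scripts/ems/experiment.perl -config config.basic -exec

Without -exec, the script only prints the steps it would run, which makes for a useful dry-run check before committing to a full training run.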
Install Moses
http://www.statmt.org/moses/