1
Build MT systems with Moses
MT Marathon Americas 2016
Hieu Hoang
2
Outline
Log onto the Edinburgh server
- Pre-compiled Moses and mgiza
- Contains small training/tuning/test corpora
1. Run each step of training: create an MT system
2. Run the Experiment Management System (EMS): run all steps with 1 command
3. Install Moses and mgiza on your laptop
3
Start
ssh guest@odin.inf.ed.ac.uk   Password: welcome123
Follow the instructions in the handout: http://statmt.org/~s0565741/download/mtma16/
Run the commands to create an Arabic-to-English translation system
4
Data
Arabic – Buckwalter encoding (’Romanized’):
  AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
Datasets
- Train: 5,000 parallel sentences, plus 71,286 English-only monolingual sentences
- Tune: 50 parallel sentences
- Test: 48 parallel sentences
5
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
6
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
7
Clean data
$MOSES_DIR/scripts/training/clean-corpus-n.perl data/Train/Train_data ar en data/Train/Train_data.clean 1 100
- Deletes sentences over 100 words long
- Deletes sentence pairs where the length ratio > 9
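A quick sanity check after cleaning (not on the slide) is to compare line counts; the two sides of the cleaned corpus must have the same number of lines:

  # Line counts before and after cleaning; the .clean.ar and .clean.en counts must match.
  wc -l data/Train/Train_data.ar data/Train/Train_data.en
  wc -l data/Train/Train_data.clean.ar data/Train/Train_data.clean.en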
8
Language Model
nice $MOSES_DIR/bin/lmplz --order 3 --text $HOME/$WORK/data/LM/LM_data+Train_data.en --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm
- Creates the LM (maximum n-gram order = 3)
- Uses KenLM
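Optionally (not one of the tutorial steps), the ARPA file can be compiled into KenLM's binary format so the decoder loads it faster; build_binary ships with Moses, and the output file name here is just an assumption:

  # Binarize the ARPA LM for faster loading.
  $MOSES_DIR/bin/build_binary $HOME/$WORK/work/LM/LM_data+Train_data.en.lm $HOME/$WORK/work/LM/LM_data+Train_data.en.binlm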
9
Language Model
File: work/LM/LM_data+Train_data.en.lm (ARPA format)

\data\
ngram 1=139572
ngram 2=1061731
ngram 3=2239731

\1-grams:
-6.0734353  <unk>  0
0  <s>  -0.91558355
-1.6365006  </s>  0
-5.2046447  Nicosia  -0.11571049
….
\2-grams:
-2.1021864  (AFP)  0
-1.4692371  -  0
….
\3-grams:
-0.16613887  (AFP)  -1.4355018
18/02 (AFP)
….

Target text: the cow jumped over the moon
p(the cow jumped over the moon) = p(the) * p(cow | the) * p(jumped | the cow) * p(over | the cow jumped) * p(the | the cow jumped over) * p(moon | the cow jumped over the)
With a 3-gram model each factor is approximated using at most the two preceding words, e.g. p(over | the cow jumped) ≈ p(over | cow jumped).
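To see these probabilities applied to a sentence, the LM can be queried with KenLM's query tool, which is built alongside Moses (a sketch; the binary and LM paths follow the steps above):

  # Prints, per word, the n-gram order used and its log10 probability,
  # plus the total log probability and perplexity of the sentence.
  echo "the cow jumped over the moon" | $MOSES_DIR/bin/query work/LM/LM_data+Train_data.en.lm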
10
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
11
Word Alignment and Phrase-Extraction
- Run GIZA++ for word alignment
- Extract translation rules (phrases) from the word-aligned parallel corpus
- Create the phrase-table
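In Moses these steps are wrapped by the train-model.perl script; a minimal invocation might look like the sketch below (the exact flags used in the handout, and the mgiza location in $MGIZA_DIR, are assumptions):

  $MOSES_DIR/scripts/training/train-model.perl \
      -root-dir $HOME/$WORK/work \
      -corpus data/Train/Train_data.clean -f ar -e en \
      -alignment grow-diag-final-and \
      -reordering msd-bidirectional-fe \
      -lm 0:3:$HOME/$WORK/work/LM/LM_data+Train_data.en.lm:8 \
      -mgiza -external-bin-dir $MGIZA_DIR
  # Runs word alignment and phrase extraction, then writes the phrase table,
  # reordering table and moses.ini under work/model/.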
12
Word Alignment
Training data: data/Train/Train_data.clean.[en/ar]
Word alignment: work/model/aligned.grow-diag-final-and
E.g. 0-0 0-1 4-1 0-2 1-2 2-2 3-2 0-3 0-4 0-5 7-6 8-7
  AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
  Saddam Hussein's Half-Brother Refuses to Return to Iraq
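To line a sentence pair up with its alignment points by hand (a sketch using the files named above):

  # First source sentence, first target sentence, and the first alignment line;
  # each pair i-j links source word i to target word j (0-based indices).
  head -1 data/Train/Train_data.clean.ar
  head -1 data/Train/Train_data.clean.en
  head -1 work/model/aligned.grow-diag-final-and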
13
Phrase-Table
! ! !.. ||| People pass by houses ||| 0.2 5.34133e-10 0.166667 4.38429e-14 ||| 0-1 ||| 5 6 1 |||
Fields: source ||| target ||| scores p(s|t), lex(s|t), p(t|s), lex(t|s) ||| word alignment ||| counts
360,000 translation rules
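The table is written as a gzipped text file and can be inspected directly (a sketch; phrase-table.gz is the default Moses file name under work/model/):

  # Show a few phrase pairs whose source side starts with a given Arabic token.
  zcat work/model/phrase-table.gz | grep '^AlErAq ' | head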
14
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
15
Tuning
Tuning searches for the feature weights that maximise BLEU on the tuning set; the result is a [weight] section like this one:
[weight]
LexicalReordering0= 0.0979471 0.0260167 0.0749775 0.0402326 0.0269783 0.011694
Distortion0= 0.0877464
LM0= 0.111063
WordPenalty0= -0.214965
PhrasePenalty0= 0.0397249
TranslationModel0= 0.0743573 0.0981889 0.0624994 0.0336091
UnknownWordPenalty0= 1
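Tuning is usually run with Moses' mert-moses.pl wrapper, which repeatedly decodes the tuning set and re-estimates the weights; a minimal call could look like this (a sketch: the tuning file names and working directory are assumptions based on the data layout above):

  $MOSES_DIR/scripts/training/mert-moses.pl \
      data/Tune/Tune_data.ar data/Tune/Tune_data.en \
      $MOSES_DIR/bin/moses work/model/moses.ini \
      --mertdir $MOSES_DIR/bin \
      --working-dir $HOME/$WORK/work/tuning
  # The tuned [weight] values end up in work/tuning/moses.ini.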
16
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Evaluation (BLEU score)
17
Evaluation
1. Decode the test set
2. Calculate the BLEU score
- Compares the output with the reference translation
- Percentage of correct 1-grams, 2-grams, 3-grams, 4-grams
- Precision metric: geometric mean of the n-gram precisions
- Brevity penalty for output that is too short
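These two steps correspond roughly to the following commands (a sketch; the test file names are assumptions, and any truecasing/detokenization is omitted):

  # Decode the test set with the tuned configuration, then score against the reference.
  $MOSES_DIR/bin/moses -f work/tuning/moses.ini < data/Test/Test_data.ar > work/test.out
  $MOSES_DIR/scripts/generic/multi-bleu.perl data/Test/Test_data.en < work/test.out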
18
BLEU score
BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)
- 23.02: overall score
- 60.0/30.3/17.2/9.5: unigram / bigram / 3-gram / 4-gram matches (precision)
- BP: brevity penalty
- hyp_len / ref_len: output length / reference length
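The overall score is the brevity penalty times the geometric mean of the four n-gram precisions, which can be checked from the numbers above (a sanity check, not part of the slides; small rounding differences are expected):

  awk 'BEGIN {
    bp = 0.987; p1 = 0.600; p2 = 0.303; p3 = 0.172; p4 = 0.095
    printf "BLEU ~= %.2f\n", 100 * bp * exp((log(p1)+log(p2)+log(p3)+log(p4)) / 4)
  }'
  # Prints roughly 23.0, matching the reported score of 23.02.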
19
Experiment Management System (EMS)
The config file specifies:
- Where to find Moses scripts and executables
- External programs: GIZA++/mgiza/cdec etc., POS taggers, parsers etc.
- Training, tuning, and test data
- Parameters, e.g. recasing/truecasing, phrase-based/hiero
- Number of cores / grid-engine jobs to use
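Once the config file is filled in, the whole pipeline is launched with a single command; a typical call (config.toy, the example config shipped in scripts/ems/example, is used here as an assumption) is:

  # Without -exec, experiment.perl just prints the planned steps (a dry run); -exec runs them.
  $MOSES_DIR/scripts/ems/experiment.perl -config config.toy
  $MOSES_DIR/scripts/ems/experiment.perl -config config.toy -exec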
20
EMS
Advantages
- Consistent: reduces mistakes, easier debugging
- Runs processes in parallel
- Runs multiple experiments simultaneously
Disadvantages
- Sometimes buggy
- Doesn't do everything; occasionally you need to run some steps manually
21
Install Moses http://www.statmt.org/moses/
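The website above has the full instructions; on a Linux machine with g++, Boost and cmake available, a from-source build looks roughly like this (a sketch, not the only supported route):

  # Moses decoder (also builds the KenLM tools such as lmplz and query)
  git clone https://github.com/moses-smt/mosesdecoder.git
  cd mosesdecoder && ./bjam -j4 && cd ..

  # mgiza, the multi-threaded word aligner used in this tutorial
  git clone https://github.com/moses-smt/mgiza.git
  cd mgiza/mgizapp && cmake . && make
  # Point train-model.perl's -external-bin-dir at the directory containing the built mgiza binaries.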