1
Build MT systems with Moses
MT Marathon Americas 2016
Hieu Hoang
2
Outline
Log onto the Edinburgh server
- Pre-compiled Moses and mgiza
- Contains small training/tuning/test corpora
1. Run each step of training: create an MT system
2. Run the Experiment Management System (EMS): run all steps with 1 command
3. Install Moses and mgiza on your laptop
3
Start
ssh guest@odin.inf.ed.ac.uk   Password: welcome123
Follow the instructions in the handout: http://statmt.org/~s0565741/download/mtma16/
Run the commands to create an Arabic-to-English translation system
4
Data
Arabic – Buckwalter encoding (’Romanized’):
  AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
Datasets
- Train: 5,000 parallel sentences, plus 71,286 English-only monolingual sentences
- Tune: 50 parallel sentences
- Test: 48 parallel sentences
5
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
6
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
7
Clean data
$MOSES_DIR/scripts/training/clean-corpus-n.perl data/Train/Train_data ar en data/Train/Train_data.clean 1 100
- Deletes sentences over 100 words long
- Deletes sentence pairs where the length ratio > 9
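A quick sanity check after cleaning (not on the slide) is to compare line counts; the two sides of the cleaned corpus must have the same number of lines:

  # Line counts before and after cleaning; the .clean.ar and .clean.en counts must match.
  wc -l data/Train/Train_data.ar data/Train/Train_data.en
  wc -l data/Train/Train_data.clean.ar data/Train/Train_data.clean.en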
8
Language Model
nice $MOSES_DIR/bin/lmplz --order 3 --text $HOME/$WORK/data/LM/LM_data+Train_data.en --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm
- Creates the LM (maximum n-gram order = 3)
- Uses KenLM
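Optionally (not one of the tutorial steps), the ARPA file can be compiled into KenLM's binary format so the decoder loads it faster; build_binary ships with Moses, and the output file name here is just an assumption:

  # Binarize the ARPA LM for faster loading.
  $MOSES_DIR/bin/build_binary $HOME/$WORK/work/LM/LM_data+Train_data.en.lm $HOME/$WORK/work/LM/LM_data+Train_data.en.binlm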
9
Language Model
File: work/LM/LM_data+Train_data.en.lm (ARPA format)

\data\
ngram 1=139572
ngram 2=1061731
ngram 3=2239731

\1-grams:
-6.0734353  <unk>  0
0  <s>  -0.91558355
-1.6365006  </s>  0
-5.2046447  Nicosia  -0.11571049
….
\2-grams:
-2.1021864  (AFP)  0
-1.4692371  -  0
….
\3-grams:
-0.16613887  (AFP)  -1.4355018
18/02 (AFP)
….

Target text: the cow jumped over the moon
p(the cow jumped over the moon) = p(the) * p(cow | the) * p(jumped | the cow) * p(over | the cow jumped) * p(the | the cow jumped over) * p(moon | the cow jumped over the)
With a 3-gram model each factor is approximated using at most the two preceding words, e.g. p(over | the cow jumped) ≈ p(over | cow jumped).
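To see these probabilities applied to a sentence, the LM can be queried with KenLM's query tool, which is built alongside Moses (a sketch; the binary and LM paths follow the steps above):

  # Prints, per word, the n-gram order used and its log10 probability,
  # plus the total log probability and perplexity of the sentence.
  echo "the cow jumped over the moon" | $MOSES_DIR/bin/query work/LM/LM_data+Train_data.en.lm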
10
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
11
Word Alignment and Phrase-Extraction
- Run GIZA++ for word alignment
- Extract translation rules (phrases) from the word-aligned parallel corpus
- Create the phrase-table
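In Moses these steps are wrapped by the train-model.perl script; a minimal invocation might look like the sketch below (the exact flags used in the handout, and the mgiza location in $MGIZA_DIR, are assumptions):

  $MOSES_DIR/scripts/training/train-model.perl \
      -root-dir $HOME/$WORK/work \
      -corpus data/Train/Train_data.clean -f ar -e en \
      -alignment grow-diag-final-and \
      -reordering msd-bidirectional-fe \
      -lm 0:3:$HOME/$WORK/work/LM/LM_data+Train_data.en.lm:8 \
      -mgiza -external-bin-dir $MGIZA_DIR
  # Runs word alignment and phrase extraction, then writes the phrase table,
  # reordering table and moses.ini under work/model/.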
12
Word Alignment
Training data: data/Train/Train_data.clean.[en/ar]
Word alignment: work/model/aligned.grow-diag-final-and
E.g. 0-0 0-1 4-1 0-2 1-2 2-2 3-2 0-3 0-4 0-5 7-6 8-7
  AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
  Saddam Hussein's Half-Brother Refuses to Return to Iraq
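To line a sentence pair up with its alignment points by hand (a sketch using the files named above):

  # First source sentence, first target sentence, and the first alignment line;
  # each pair i-j links source word i to target word j (0-based indices).
  head -1 data/Train/Train_data.clean.ar
  head -1 data/Train/Train_data.clean.en
  head -1 work/model/aligned.grow-diag-final-and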
13
Phrase-Table
! ! !.. ||| People pass by houses ||| 0.2 5.34133e-10 0.166667 4.38429e-14 ||| 0-1 ||| 5 6 1 |||
Fields: source ||| target ||| scores p(s|t), lex(s|t), p(t|s), lex(t|s) ||| word alignment ||| counts
360,000 translation rules
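The table is written as a gzipped text file and can be inspected directly (a sketch; phrase-table.gz is the default Moses file name under work/model/):

  # Show a few phrase pairs whose source side starts with a given Arabic token.
  zcat work/model/phrase-table.gz | grep '^AlErAq ' | head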
14
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Scoring (BLEU score)
15
Tuning
Tuning searches for the feature weights that maximise BLEU on the tuning set; the result is a [weight] section like this one:
[weight]
LexicalReordering0= 0.0979471 0.0260167 0.0749775 0.0402326 0.0269783 0.011694
Distortion0= 0.0877464
LM0= 0.111063
WordPenalty0= -0.214965
PhrasePenalty0= 0.0397249
TranslationModel0= 0.0743573 0.0981889 0.0624994 0.0336091
UnknownWordPenalty0= 1
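Tuning is usually run with Moses' mert-moses.pl wrapper, which repeatedly decodes the tuning set and re-estimates the weights; a minimal call could look like this (a sketch: the tuning file names and working directory are assumptions based on the data layout above):

  $MOSES_DIR/scripts/training/mert-moses.pl \
      data/Tune/Tune_data.ar data/Tune/Tune_data.en \
      $MOSES_DIR/bin/moses work/model/moses.ini \
      --mertdir $MOSES_DIR/bin \
      --working-dir $HOME/$WORK/work/tuning
  # The tuned [weight] values end up in work/tuning/moses.ini.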
16
SMT Pipeline (diagram): Preprocessing (clean, tokenize, lowercase) → Create LM / Alignment → Phrase extraction → Tuning → Decoding → Postprocessing (recasing, detokenize) → Evaluation (BLEU score)
17
Evaluation
1. Decode the test set
2. Calculate the BLEU score
- Compares the output with the reference translation
- Percentage of correct 1-grams, 2-grams, 3-grams, 4-grams
- Precision metric: geometric mean of the n-gram precisions
- Brevity penalty for output that is too short
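These two steps correspond roughly to the following commands (a sketch; the test file names are assumptions, and any truecasing/detokenization is omitted):

  # Decode the test set with the tuned configuration, then score against the reference.
  $MOSES_DIR/bin/moses -f work/tuning/moses.ini < data/Test/Test_data.ar > work/test.out
  $MOSES_DIR/scripts/generic/multi-bleu.perl data/Test/Test_data.en < work/test.out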
18
BLEU score
BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)
- 23.02: overall score
- 60.0/30.3/17.2/9.5: unigram / bigram / 3-gram / 4-gram matches (precision)
- BP: brevity penalty
- hyp_len / ref_len: output length / reference length
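The overall score is the brevity penalty times the geometric mean of the four n-gram precisions, which can be checked from the numbers above (a sanity check, not part of the slides; small rounding differences are expected):

  awk 'BEGIN {
    bp = 0.987; p1 = 0.600; p2 = 0.303; p3 = 0.172; p4 = 0.095
    printf "BLEU ~= %.2f\n", 100 * bp * exp((log(p1)+log(p2)+log(p3)+log(p4)) / 4)
  }'
  # Prints roughly 23.0, matching the reported score of 23.02.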
19
Experiment Management System (EMS)
The config file specifies:
- Where to find Moses scripts and executables
- External programs: GIZA++/mgiza/cdec etc., POS taggers, parsers etc.
- Training, tuning, and test data
- Parameters, e.g. recasing/truecasing, phrase-based/hiero
- Number of cores / grid-engine jobs to use
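Once the config file is filled in, the whole pipeline is launched with a single command; a typical call (config.toy, the example config shipped in scripts/ems/example, is used here as an assumption) is:

  # Without -exec, experiment.perl just prints the planned steps (a dry run); -exec runs them.
  $MOSES_DIR/scripts/ems/experiment.perl -config config.toy
  $MOSES_DIR/scripts/ems/experiment.perl -config config.toy -exec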
20
EMS
Advantages
- Consistent: reduces mistakes, easier debugging
- Runs processes in parallel
- Runs multiple experiments simultaneously
Disadvantages
- Sometimes buggy
- Doesn't do everything; occasionally you need to run some steps manually
21
Install Moses http://www.statmt.org/moses/
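The website above has the full instructions; on a Linux machine with g++, Boost and cmake available, a from-source build looks roughly like this (a sketch, not the only supported route):

  # Moses decoder (also builds the KenLM tools such as lmplz and query)
  git clone https://github.com/moses-smt/mosesdecoder.git
  cd mosesdecoder && ./bjam -j4 && cd ..

  # mgiza, the multi-threaded word aligner used in this tutorial
  git clone https://github.com/moses-smt/mgiza.git
  cd mgiza/mgizapp && cmake . && make
  # Point train-model.perl's -external-bin-dir at the directory containing the built mgiza binaries.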