Build MT systems with Moses

1 Build MT systems with Moses
MT Marathon in the Americas 2017 Hieu Hoang / Jeremy Gwinnup

2 Outline
Pull MTMA17-lab docker image
- Pre-compiled Moses and mgiza
- Contains small training/tuning/test corpora
Run each step of training
- Create MT system
Run Experiment Management System (EMS)
- Run all steps with 1 command
Install Moses and mgiza on your laptop

3 Start
Install Docker: https://www.docker.com/community-edition
Pull mtma17-lab docker image
Follow the instructions in the handout
- Run commands
- Creating Arabic-to-English translation system

4 Data
Arabic – Buckwalter encoding ('Romanized')
- AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
Datasets
- Train: 5,000 parallel sentences, plus 71,286 monolingual sentences (English only)
- Tune: 50 parallel sentences
- Test: 48 parallel sentences

5 SMT Pipeline
Preprocessing - clean - tokenize - lowercase
Alignment
Create LM
Phrase extraction
Tuning
Decoding
Postprocessing - recasing - detokenizer
Scoring - BLEU score
Notes: Each part of the MT pipeline is critical to producing a good MT system. We could show you how to do each part, but that would take a week and you would lose the will to live! However, it is not necessary to know the mechanics of each and every part to start. For those who don't need to know, or who know but just want it to work consistently, we provide a system which wraps up the pipeline.

7 Clean data
$MOSES_DIR/scripts/training/clean-corpus-n.perl data/Train/Train_data ar en data/Train/Train_data.clean 1 100
Delete sentences over 100 words long
Delete sentence pairs where the length ratio > 9
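The two filters on this slide can be sketched in a few lines of Python. This is only an illustration of the rules as stated (length between 1 and 100 words, length ratio at most 9), not the actual logic of clean-corpus-n.perl; the function name keep_pair and its parameters are made up for this sketch.

```python
def keep_pair(src, tgt, min_len=1, max_len=100, max_ratio=9.0):
    """Return True if a sentence pair passes the length and ratio filters.

    Hypothetical helper: it mirrors the rules on the slide, not the
    exact behaviour of the Moses cleaning script.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # An empty side can never be kept (and would break the ratio below).
    if min(n_src, n_tgt) == 0:
        return False
    # Drop pairs where either side is too short or too long.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Drop pairs whose lengths are badly mismatched.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


pairs = [("a b c", "x y"), ("", "x"), ("w " * 10, "x")]
kept = [p for p in pairs if keep_pair(*p)]
```

Only the first pair survives: the second has an empty side and the third has a 10:1 length ratio.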

8 Language Model
nice $MOSES_DIR/bin/lmplz --order 3 --text $HOME/$WORK/data/LM/LM_data+Train_data.en --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm
Create LM
- maximum ngram size = 3
- Uses KenLM

9 Language Model
Target text: the cow jumped over the moon
p(the cow jumped over the moon) = p(the) * p(cow | the) * p(jumped | the cow) * p(over | the cow jumped) * p(the | the cow jumped over) * p(moon | the cow jumped over the)
File work/LM/LM_data+Train_data.en.lm
\data\
ngram 1=139572
ngram 2=
ngram 3=
\1-grams: <unk> 0 <s> </s> 0 Nicosia ….
\2-grams: (AFP) </s> 0 </s> 0
\3-grams: <s> (AFP) </s> /02 (AFP) </s>
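The chain rule on this slide conditions each word on its full history. An n-gram model of order 3, as built on the previous slide, keeps only the previous two words as context. A minimal sketch of that truncation follows; the function name ngram_factors is an assumption for illustration, and sentence boundary markers (<s>, </s>) are left out to keep it short.

```python
def ngram_factors(sentence, order=3):
    """List the (word, context) pairs an order-n model multiplies together.

    Each context is truncated to at most order - 1 preceding words.
    """
    words = sentence.split()
    factors = []
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - (order - 1)):i])
        factors.append((word, context))
    return factors


for word, context in ngram_factors("the cow jumped over the moon"):
    print(f"p({word} | {' '.join(context)})" if context else f"p({word})")
```

With order 3, the last factor becomes p(moon | over the) rather than p(moon | the cow jumped over the).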

11 Word Alignment and Phrase-Extraction
Run Giza++: word alignment
Extract translation rules (phrases) from the word-aligned parallel corpus
Create phrase-tables

12 Word Alignment
Training data: data/Train/Train_data.clean.[en/ar]
Word alignment: work/model/aligned.grow-diag-final-and
E.g.
AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
Saddam Hussein&apos;s Half-Brother Refuses to Return to Iraq
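To inspect an alignment file, it helps to turn each line into word pairs. The sketch below assumes the usual Moses layout of one line per sentence pair, made of space-separated "i-j" points with zero-based source and target positions. The toy sentences and the alignment string are invented for illustration and are not taken from the lab data.

```python
def parse_alignment(line):
    """Turn '0-0 1-2' into [(0, 0), (1, 2)] (source index, target index)."""
    return [tuple(int(i) for i in point.split("-")) for point in line.split()]


def aligned_words(src, tgt, alignment_line):
    """Pair up the source and target words that each alignment point links."""
    s, t = src.split(), tgt.split()
    return [(s[i], t[j]) for i, j in parse_alignment(alignment_line)]


# Toy example: source word "b" is linked to two target words.
links = aligned_words("a b c", "x y z", "0-0 1-1 1-2")
```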

13 Phrase-Table
! ! ! . . ||| People pass by houses ||| e e-14 ||| 0-1 ||| |||
Labels: source, target, p(s|t), p(t|s)
360,000 translation rules
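Phrase-table entries are plain text with fields separated by "|||", so they are easy to read programmatically. The sketch below assumes the field order source, target, scores, alignment; the sample line and its score values are made up, and parse_phrase_table_line is a hypothetical helper name.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table entry into its fields.

    Assumed layout: source ||| target ||| scores ||| alignment ||| ...
    """
    fields = [field.strip() for field in line.split("|||")]
    return {
        "source": fields[0].split(),
        "target": fields[1].split(),
        "scores": [float(x) for x in fields[2].split()],
        "alignment": fields[3] if len(fields) > 3 else "",
    }


# Invented entry, for illustration only.
entry = parse_phrase_table_line("a b ||| x y ||| 0.5 0.25 0.5 0.125 ||| 0-0 1-1 ||| 2 2 2")
```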

15 Tuning
$\log p(e \mid f) = \sum_{i=1}^{n} \lambda_i h_i(e, f)$
Iterative process
do
  Decode tuning set
  Adjust weights ($\lambda_i$)
until weights converge
Moses.ini after tuning
[weight]
LexicalReordering0=
Distortion0=
LM0=
WordPenalty0=
PhrasePenalty0=
TranslationModel0=
UnknownWordPenalty0= 1
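The model score is a weighted sum of feature values, and tuning only changes the weights. A minimal sketch of how a set of weights ranks candidate translations is shown below. The weight and feature values are invented, and each feature is simplified to a single number, whereas a real moses.ini can list several values per feature.

```python
def model_score(weights, features):
    """Log-linear score: sum of weight * feature value over all features."""
    return sum(weights[name] * features[name] for name in weights)


def best_hypothesis(weights, hypotheses):
    """Pick the candidate translation with the highest model score."""
    return max(hypotheses, key=lambda h: model_score(weights, h["features"]))


# Hypothetical weights and feature values, for illustration only.
weights = {"LM0": 0.5, "WordPenalty0": -1.0}
hypotheses = [
    {"text": "candidate one", "features": {"LM0": -10.0, "WordPenalty0": -3.0}},
    {"text": "candidate two", "features": {"LM0": -12.0, "WordPenalty0": -2.0}},
]
best = best_hypothesis(weights, hypotheses)
```

Changing the weights can change which candidate wins, which is why the tuning loop decodes again after every weight update.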

17 Evaluation
Decode test set
Calculate BLEU score
- Compare output with reference translation
- Percentage of correct 1-grams, 2-grams, 3-grams, 4-grams
- Precision metric
- Geometric mean
- Brevity penalty
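The "percentage of correct n-grams" can be sketched as a clipped n-gram precision: an output n-gram only counts as correct as many times as it appears in the reference. The code below handles a single sentence and a single reference; a real BLEU scorer sums these counts over the whole test set. The function names are assumptions for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clip each match at the number of times the n-gram is in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


p1 = ngram_precision("the cow jumped over the moon", "the cow leaped over the moon", 1)
p2 = ngram_precision("the cow jumped over the moon", "the cow leaped over the moon", 2)
```

Here 5 of the 6 unigrams and 3 of the 5 bigrams in the output are found in the reference.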

18 BLEU score
BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)
- 23.02: score
- 60.0/30.3/17.2/9.5: unigram, bigram, trigram and 4-gram matches (%)
- BP: brevity penalty
- hyp_len: output length
- ref_len: reference length
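The numbers on this slide fit together as follows: the score is the geometric mean of the four n-gram precisions, multiplied by the brevity penalty. The sketch below uses the standard BLEU definition with the values printed above. Because those precisions are rounded to one decimal place, the result lands close to the reported 23.02 rather than exactly on it.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Penalise output that is shorter than the reference."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu(precisions, hyp_len, ref_len):
    """Brevity penalty times the geometric mean of the precisions, on a 0-100 scale."""
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


# Values copied from the slide, with percentages written as fractions.
score = bleu([0.600, 0.303, 0.172, 0.095], hyp_len=1260, ref_len=1277)
```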

19 Experiment Management System (EMS)
Config file
- Where to find Moses scripts and executables
- External programs: Giza/mgiza/cdec etc., POS tagger, parsers etc.
- Training, tuning, test data
- Parameters, e.g. recasing/truecasing, phrase-based/hiero
- Number of cores/grid engine jobs to use

20 EMS
Advantages
- Consistent
- Reduce mistakes
- Easier debugging
- Run processes in parallel
- Run multiple experiments simultaneously
Disadvantages
- Sometimes buggy
- Doesn't do everything
- Occasionally need to run some steps manually

21 Install Moses

