1
Build MT systems with Moses
MT Marathon in the Americas 2017 Hieu Hoang / Jeremy Gwinnup
2
Outline
- Pull MTMA17-lab docker image
  - Pre-compiled Moses and mgiza
  - Contains small training/tuning/test corpora
- Run each step of training
  - Create MT system
- Run Experiment Management System (EMS)
  - Run all steps with 1 command
- Install Moses and mgiza on your laptop
3
Start
- Install Docker: https://www.docker.com/community-edition
- Pull mtma17-lab docker image
- Follow the instructions in the handout
  - Run commands
  - Create an Arabic-to-English translation system
4
Data
- Arabic: Buckwalter encoding ("Romanized")
  AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
- Datasets
  - Train: 5,000 parallel sentences, plus 71,286 monolingual sentences in English only
  - Tune: 50 parallel sentences
  - Test: 48 parallel sentences
5
SMT Pipeline
- Preprocessing
  - clean
  - tokenize
  - lowercase
- Alignment
- Create LM
- Phrase extraction
- Tuning
- Decoding
- Postprocessing
  - recasing
  - detokenizer
- Scoring
  - BLEU score
Notes: Each part of the MT pipeline is critical to producing a good MT system. We could show you how to do each part by hand, but it would take a week and you would lose the will to live! However, it is not necessary to know the mechanics of each and every part to get started. For those who don't need to know, or who know but just want it to work consistently, we provide a system that wraps up the pipeline.
7
Clean data
$MOSES_DIR/scripts/training/clean-corpus-n.perl data/Train/Train_data ar en data/Train/Train_data.clean 1 100
- Delete sentence pairs where a sentence is over 100 words long
- Delete sentence pairs where the length ratio > 9
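The two filtering rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code of clean-corpus-n.perl itself; the function name keep_pair and the toy sentence pairs are assumptions made for this example, and the real script may apply further checks.

```python
def keep_pair(src, tgt, min_len=1, max_len=100, max_ratio=9.0):
    """Return True if a sentence pair passes the length and ratio checks.

    Hypothetical helper for illustration; it mirrors the '1 100' arguments
    and the ratio limit of 9 mentioned on the slide.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Drop pairs where either side is empty/too short or too long.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Drop pairs where one side is more than max_ratio times longer than the other.
    if n_src > max_ratio * n_tgt or n_tgt > max_ratio * n_src:
        return False
    return True


# Toy data, invented for this sketch.
pairs = [
    ("yrfD AlEwdp IlY AlErAq", "refuses to return to Iraq"),  # kept
    ("AlErAq", " ".join(["word"] * 12)),                      # ratio 12 > 9
    (" ".join(["w"] * 101), " ".join(["w"] * 101)),           # over 100 words
    ("", "empty source side"),                                # shorter than 1 word
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept), "of", len(pairs), "pairs kept")
```

Note that a pair is dropped as a whole: if either side fails a check, both the Arabic and the English sentence are removed so the corpus stays parallel.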
8
Language Model
nice $MOSES_DIR/bin/lmplz --order 3 --text $HOME/$WORK/data/LM/LM_data+Train_data.en --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm
- Create LM
- Maximum n-gram size = 3
- Uses KenLM
9
Language Model
Target text: the cow jumped over the moon
p(the cow jumped over the moon) = p(the) * p(cow | the) * p(jumped | the cow) * p(over | the cow jumped) * p(the | the cow jumped over) * p(moon | the cow jumped over the)
File work/LM/LM_data+Train_data.en.lm
\data\
ngram 1=139572
ngram 2=
ngram 3=
\1-grams:
<unk> 0 <s> </s> 0 Nicosia
….
\2-grams:
(AFP) </s> 0 </s> 0
\3-grams:
<s> (AFP) </s> /02 (AFP) </s>
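The factorization above conditions each word on its full history. An n-gram LM of order 3 keeps only the previous two words of that history. The sketch below lists the factors for both cases; the function name ngram_factors is an assumption for this example, and sentence-boundary symbols (<s>, </s>) are left out, as on the slide.

```python
def ngram_factors(sentence, order=None):
    """List the conditional probabilities in the chain-rule factorization.

    order=None keeps the full history; order=3 truncates the history to
    the previous two words, as a trigram LM does. Hypothetical helper.
    """
    words = sentence.split()
    factors = []
    for i, word in enumerate(words):
        history = words[:i]
        if order is not None:
            # Keep at most (order - 1) words of context.
            history = history[-(order - 1):] if order > 1 else []
        if history:
            factors.append(f"p({word} | {' '.join(history)})")
        else:
            factors.append(f"p({word})")
    return factors


full = ngram_factors("the cow jumped over the moon")
trigram = ngram_factors("the cow jumped over the moon", order=3)
print(" * ".join(full))
print(" * ".join(trigram))
```

With order=3, the last factor becomes p(moon | over the) instead of p(moon | the cow jumped over the), which is what makes the model estimable from a finite corpus.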
11
Word Alignment and Phrase-Extraction
- Run Giza++
  - Word alignment
- Extract translation rules (phrases)
  - From word-aligned parallel corpus
- Create phrase-tables
12
Word Alignment
- Training data: data/Train/Train_data.clean.[en/ar]
- Word alignment: work/model/aligned.grow-diag-final-and
- E.g.
  AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
  Saddam Hussein's Half-Brother Refuses to Return to Iraq
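Each line of the alignment file holds space-separated "i-j" pairs of 0-based word positions, linking source word i to target word j. The sketch below reads one such line for the example sentence pair. The alignment string is hand-written for illustration and is an assumption; it is not the output Giza++ produced for this corpus, and parse_alignment is a hypothetical helper name.

```python
def parse_alignment(line):
    """Turn '0-2 1-2 ...' into a list of (source_index, target_index) tuples."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


src = "AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq".split()
tgt = "Saddam Hussein's Half-Brother Refuses to Return to Iraq".split()

# Hypothetical alignment, written by hand for this example only.
alignment_line = "0-2 1-2 2-2 3-0 4-1 5-3 6-5 7-6 8-7"

links = [(src[i], tgt[j]) for i, j in parse_alignment(alignment_line)]
for s, t in links:
    print(f"{s} -> {t}")
```

In this made-up alignment, three source words link to the single target word "Half-Brother", and the target word "to" at position 4 has no link at all; both situations are common in real word alignments.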
13
Phrase-Table
! ! ! . . ||| People pass by houses ||| e e-14 ||| 0-1 ||| |||
- Columns labelled on the slide: source, target, p(s|t), p(t|s)
- 360,000 translation rules
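A phrase-table entry is a single line whose fields are separated by "|||". As a rough sketch, such a line can be split as shown below. The example line and all of its numbers are invented for illustration (they are not taken from the lab's phrase-table), and parse_phrase_table_line is a hypothetical helper name.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line into its '|||'-separated fields."""
    fields = [field.strip() for field in line.split("|||")]
    return {
        "source": fields[0],
        "target": fields[1],
        # Translation scores, including the p(s|t) and p(t|s) estimates.
        "scores": [float(x) for x in fields[2].split()],
        # Word alignment inside the phrase pair, e.g. '0-0'.
        "alignment": fields[3],
    }


# Invented entry with made-up scores, for illustration only.
line = "AlErAq ||| Iraq ||| 0.8 0.7 0.9 0.6 ||| 0-0 ||| 10 9 8 ||| |||"
rule = parse_phrase_table_line(line)
print(rule["source"], "->", rule["target"], rule["scores"])
```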
15
Tuning
$\log p(e \mid f) = \sum_{i=1}^{n} \lambda_i h_i(e, f)$
Iterative process
do
  Decode tuning set
  Adjust weights ($\lambda_i$)
until weights converge
Moses.ini after tuning
[weight]
LexicalReordering0=
Distortion0=
LM0=
WordPenalty0=
PhrasePenalty0=
TranslationModel0=
UnknownWordPenalty0= 1
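The weighted sum in the model formula can be illustrated with a small sketch. The weights and feature values below are invented for this example; they are not the values tuning produced, and each feature is simplified to a single number (in a real moses.ini some features carry several weights). The function name model_score is likewise hypothetical.

```python
def model_score(weights, features):
    """Log-linear model score: the sum over i of lambda_i * h_i(e, f)."""
    return sum(weights[name] * features[name] for name in weights)


# Hypothetical feature values h_i for two candidate translations.
cand_a = {"LM0": -10.0, "TranslationModel0": -4.0, "WordPenalty0": -5.0}
cand_b = {"LM0": -8.0, "TranslationModel0": -6.0, "WordPenalty0": -6.0}
candidates = {"a": cand_a, "b": cand_b}

# Two hypothetical weight settings (the lambda_i values).
weights_1 = {"LM0": 0.5, "TranslationModel0": 0.2, "WordPenalty0": -1.0}
weights_2 = {"LM0": 0.1, "TranslationModel0": 1.0, "WordPenalty0": 0.0}

best_1 = max(candidates, key=lambda c: model_score(weights_1, candidates[c]))
best_2 = max(candidates, key=lambda c: model_score(weights_2, candidates[c]))
print(best_1, best_2)
```

The two weight settings rank the candidates differently. Tuning exploits exactly this: each iteration decodes the tuning set, then moves the weights toward values whose top-ranked candidates score better against the reference translations.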
17
Evaluation
- Decode test set
- Calculate BLEU score
  - Compare output with reference translation
  - Percentage of correct 1-grams, 2-grams, 3-grams, 4-grams
    - Precision metric
    - Geometric mean
  - Brevity penalty
18
BLEU score
BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)
- 23.02: score
- 60.0 / 30.3 / 17.2 / 9.5: unigram / bigram / trigram / 4-gram matches (precision, in percent)
- BP: brevity penalty
- hyp_len: output length
- ref_len: reference length
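The score line can be checked by hand: BLEU is the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The sketch below recomputes it from the numbers printed above, assuming the standard brevity penalty exp(1 - ref_len / hyp_len) for output shorter than the reference. The function name bleu_from_stats is hypothetical, and this is not the code of the Moses scoring script.

```python
import math


def bleu_from_stats(precisions_pct, hyp_len, ref_len):
    """Recompute BLEU (in percent) from n-gram precisions and lengths."""
    # Brevity penalty: 1 if the output is longer than the reference,
    # otherwise exp(1 - ref_len / hyp_len).
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * bp * math.exp(log_mean), bp


score, bp = bleu_from_stats([60.0, 30.3, 17.2, 9.5], hyp_len=1260, ref_len=1277)
print(f"BLEU = {score:.2f}, BP = {bp:.3f}")
```

The result agrees with the reported 23.02 to within a few hundredths; the small difference comes from using precisions that were already rounded to one decimal place.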
19
Experiment Management System (EMS)
Config file
- Where to find
  - Moses scripts and executables
  - External programs
    - Giza/mgiza/cdec etc.
    - POS tagger, parsers etc.
  - Training, tuning, test data
- Parameters, e.g.
  - recasing/truecasing
  - phrase-based/hiero
  - Number of cores/grid engine jobs to use
20
EMS
- Advantages
  - Consistent
    - Reduce mistakes
    - Easier debugging
  - Run processes in parallel
  - Run multiple experiments simultaneously
- Disadvantages
  - Sometimes buggy
  - Doesn't do everything
    - Occasionally need to run some steps manually
21
Install Moses