Build MT systems with Moses
MT Marathon in the Americas 2017
Hieu Hoang / Jeremy Gwinnup
Outline
- Pull the MTMA17-lab docker image
  - Pre-compiled Moses and mgiza
  - Contains small training/tuning/test corpora
- Run each step of training
  - Create an MT system
- Run the Experiment Management System (EMS)
  - Run all steps with 1 command
- Install Moses and mgiza on your laptop
Start
- Install Docker: https://www.docker.com/community-edition
- Pull the mtma17-lab docker image
- Follow the instructions in the handout: http://statmt.org/~s0565741/download/mtma16/
- Run the commands, creating an Arabic-to-English translation system
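As a rough sketch, pulling the image and opening a shell inside it looks like the following; the exact repository path comes from the handout, so USERNAME/mtma17-lab below is a stand-in:

# Pull the lab image (replace USERNAME/mtma17-lab with the path given in the handout)
docker pull USERNAME/mtma17-lab

# Start an interactive shell inside the container
docker run -it USERNAME/mtma17-lab /bin/bash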
Data
- Arabic: Buckwalter encoding ("Romanized")
  e.g. AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
- Datasets
  - Train: 5,000 parallel sentences; 71,286 monolingual sentences (English only)
  - Tune: 50 parallel sentences
  - Test: 48 parallel sentences
SMT Pipeline
- Preprocessing: clean, tokenize, lowercase
- Alignment
- Create LM
- Phrase extraction
- Tuning
- Decoding
- Postprocessing: recasing, detokenizing
- Scoring: BLEU score

Each part of the MT pipeline is critical to producing a good MT system. We could show you how to do each part, but that would take a week, and you would lose the will to live! However, it is not necessary to know the mechanics of each and every part to get started. For those who don't need to know, or who know but just want it to work consistently, we provide a system which wraps up the pipeline.
Clean data
$MOSES_DIR/scripts/training/clean-corpus-n.perl data/Train/Train_data ar en data/Train/Train_data.clean 1 100
- Deletes sentences over 100 words long (the final two arguments are the minimum and maximum sentence lengths)
- Deletes sentence pairs where the length ratio > 9
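The clean step assumes tokenized input. The lab data is already prepared, but if you were starting from raw text, tokenization and lowercasing would come first; a sketch using the standard Moses scripts (file paths are assumptions following the lab layout):

# Tokenize each side of the parallel corpus
$MOSES_DIR/scripts/tokenizer/tokenizer.perl -l ar < data/Train/Train_data.ar > Train_data.tok.ar
$MOSES_DIR/scripts/tokenizer/tokenizer.perl -l en < data/Train/Train_data.en > Train_data.tok.en

# Lowercase the tokenized English side
$MOSES_DIR/scripts/tokenizer/lowercase.perl < Train_data.tok.en > Train_data.lc.en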
Language Model
nice $MOSES_DIR/bin/lmplz --order 3 --text $HOME/$WORK/data/LM/LM_data+Train_data.en --arpa $HOME/$WORK/work/LM/LM_data+Train_data.en.lm
- Creates the LM (maximum n-gram size = 3)
- Uses KenLM
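The ARPA file produced by lmplz can optionally be compiled into KenLM's binary format, which loads much faster; a sketch (output filename is an assumption):

# Compile the ARPA LM into KenLM's binary format
$MOSES_DIR/bin/build_binary $HOME/$WORK/work/LM/LM_data+Train_data.en.lm $HOME/$WORK/work/LM/LM_data+Train_data.en.binlm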
Language Model
Target text: the cow jumped over the moon

p(the cow jumped over the moon) = p(the) * p(cow | the) * p(jumped | the cow) * p(over | the cow jumped) * p(the | the cow jumped over) * p(moon | the cow jumped over the)

File work/LM/LM_data+Train_data.en.lm (ARPA format: each entry is a log10 probability, the n-gram, and an optional backoff weight):

\data\
ngram 1=139572
ngram 2=1061731
ngram 3=2239731

\1-grams:
-6.0734353  <unk>  0
0  <s>  -0.91558355
-1.6365006  </s>  0
-5.2046447  Nicosia  -0.11571049
...

\2-grams:
-2.1021864  (AFP) </s>  0
-1.4692371  - </s>  0

\3-grams:
-0.16613887  <s> (AFP) </s>
-1.4355018  18/02 (AFP) </s>
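To see these probabilities applied in practice, KenLM's query tool scores a sentence word by word; a sketch (LM path as above):

# Score a sentence with the trained LM; a log10 probability is printed for
# each word given its context, followed by the sentence total
echo "the cow jumped over the moon" | $MOSES_DIR/bin/query $HOME/$WORK/work/LM/LM_data+Train_data.en.lm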
SMT Pipeline (recap): next steps are word alignment and phrase extraction.
Word Alignment and Phrase Extraction
- Run GIZA++ (word alignment)
- Extract translation rules (phrases) from the word-aligned parallel corpus
- Create phrase-tables (a typical single-command invocation is sketched below)
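All of these steps are driven by one Moses script; a sketch of a typical invocation (the directory layout and the mgiza location, $MGIZA_DIR, are assumptions):

# Word-align with mgiza, extract phrases, and build the phrase/reordering tables
$MOSES_DIR/scripts/training/train-model.perl \
  -root-dir work -corpus data/Train/Train_data.clean -f ar -e en \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:$HOME/$WORK/work/LM/LM_data+Train_data.en.lm:8 \
  -external-bin-dir $MGIZA_DIR -mgiza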
Word Alignment
- Training data: data/Train/Train_data.clean.[en/ar]
- Word alignment: work/model/aligned.grow-diag-final-and
  e.g. 0-0 0-1 4-1 0-2 1-2 2-2 3-2 0-3 0-4 0-5 7-6 8-7
  (each pair i-j links source word i to target word j)
  Source: AlOx gyr Alcqyq lSdAm Hsyn yrfD AlEwdp IlY AlErAq
  Target: Saddam Hussein's Half-Brother Refuses to Return to Iraq
Phrase-Table
~360,000 translation rules, one per line, with fields separated by |||:
source ||| target ||| scores ||| alignment ||| counts
Example entry:
! ! ! . . ||| People pass by houses ||| 0.2 5.34133e-10 0.166667 4.38429e-14 ||| 0-1 ||| 5 6 1
The scores include the phrase translation probabilities p(s|t) (first) and p(t|s) (third); the other two are lexical weights.
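The phrase-table is a gzipped text file under the training directory, so entries can be inspected directly; a sketch (path assumes the train-model.perl layout above):

# Look at the first few translation rules
zcat work/model/phrase-table.gz | head -n 3

# Find rules whose source side starts with a given Arabic token
zcat work/model/phrase-table.gz | grep '^AlErAq ' | head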
SMT Pipeline (recap): next step is tuning.
Tuning
The model score is a weighted sum of feature functions:
log p(e|f) = Σ_{i=1..n} λ_i · h_i(e, f)
Iterative process:
do
  - Decode the tuning set
  - Adjust the weights (λ_i)
until the weights converge
Moses.ini after tuning:
[weight]
LexicalReordering0= 0.0979471 0.0260167 0.0749775 0.0402326 0.0269783 0.011694
Distortion0= 0.0877464
LM0= 0.111063
WordPenalty0= -0.214965
PhrasePenalty0= 0.0397249
TranslationModel0= 0.0743573 0.0981889 0.0624994 0.0336091
UnknownWordPenalty0= 1
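Tuning is usually launched with the mert-moses.pl wrapper, which repeatedly decodes the tuning set and re-estimates the weights; a sketch (tuning-file paths are assumptions based on the lab layout):

# Run MERT against the tuning set
$MOSES_DIR/scripts/training/mert-moses.pl \
  data/Tune/Tune_data.ar data/Tune/Tune_data.en \
  $MOSES_DIR/bin/moses work/model/moses.ini \
  --mertdir $MOSES_DIR/bin --working-dir work/tuning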
SMT Pipeline (recap): next step is evaluation (BLEU score).
Evaluation
- Decode the test set
- Calculate the BLEU score
  - Compare output with the reference translation
  - Percentage of correct 1-grams, 2-grams, 3-grams, 4-grams
  - Precision metric: geometric mean of the n-gram precisions
  - Brevity penalty
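A sketch of decoding and scoring (test-file paths are assumptions; multi-bleu.perl is the standard Moses scorer, and it prints a line like the one shown on the next slide):

# Translate the test set with the tuned configuration
$MOSES_DIR/bin/moses -f work/tuning/moses.ini < data/Test/Test_data.ar > Test_data.out

# Score against the reference translation
$MOSES_DIR/scripts/generic/multi-bleu.perl data/Test/Test_data.en < Test_data.out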
BLEU score
BLEU = 23.02, 60.0/30.3/17.2/9.5 (BP=0.987, ratio=0.987, hyp_len=1260, ref_len=1277)
- 23.02: overall score
- 60.0/30.3/17.2/9.5: unigram/bigram/trigram/4-gram matches
- BP: brevity penalty
- hyp_len: output length
- ref_len: reference length
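The overall score can be reproduced (up to rounding) from the printed precisions and lengths using the standard BLEU formula:

BLEU = BP \cdot \exp\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right), \qquad BP = \min\left(1,\; e^{1 - r/c}\right)

where p_n are the n-gram precisions, r = ref_len and c = hyp_len. Plugging in: BP = exp(1 - 1277/1260) ≈ 0.987, and the geometric mean of 0.600, 0.303, 0.172, 0.095 is ≈ 0.233, giving 0.987 × 0.233 ≈ 0.230, i.e. BLEU ≈ 23.0, matching the reported 23.02 up to rounding of the printed precisions.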
Experiment Management System (EMS)
Config file specifies:
- Where to find Moses scripts and executables
- External programs
  - Giza/mgiza/cdec etc.
  - POS taggers, parsers etc.
- Training, tuning, test data
- Parameters, e.g. recasing/truecasing, phrase-based/hiero
- Number of cores/grid engine jobs to use
EMS
Advantages:
- Consistent
- Reduces mistakes
- Easier debugging
- Runs processes in parallel
- Run multiple experiments simultaneously
Disadvantages:
- Sometimes buggy
- Doesn't do everything; occasionally need to run some steps manually
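Once the config file is written, one command runs the whole pipeline; a sketch (config.basic here refers to the example config shipped with Moses under scripts/ems/example):

# Run every step of the experiment described in the config file
$MOSES_DIR/scripts/ems/experiment.perl -config config.basic -exec

Without -exec, the script only prints the steps it would run, which makes for a useful dry-run check before committing to a full training run.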
Install Moses
http://www.statmt.org/moses/