Morphological Analysis for Phrase-Based Statistical Machine Translation
LUONG Minh Thang
Supervisor: Dr. KAN Min-Yen
National University of Singapore, Web IR / NLP Group (WING)



Modern Machine Translation (MT)
State-of-the-art systems perform phrase-to-phrase translation with data-intensive techniques, but still they:
– treat words as unrelated entities
– don’t understand the internal structure of words
We investigate the incorporation of word-structure knowledge (morphology) and adopt a language-independent approach.

Issues we address
Morphologically-aware system:
– Out-of-vocabulary problem: we may have seen “car” before but not “cars”, yet “cars” is just two morphemes, “car” + “s”
– Derive word structure from only raw data: a language-general approach
Translation into highly inflected languages:
– English–Finnish case study: auto “car”; auto/si “your car”; auto/i/si “your cars”; auto/i/ssa/si “in your cars”; auto/i/ssa/si/ko “in your cars?”
– Understand the characteristics → suggestion of a self-correcting model

What have others done?
Most prior work addresses the translation direction from highly to less inflected languages:
– Arabic–English, German–English, Finnish–English
Only a few works touch the reverse direction, which is considered more challenging:
– English–Turkish: (El-Kahlout & Oflazer, 2007)
– English–Russian, English–Arabic: (Toutanova et al., 2008)
→ these employ feature-rich approaches using abundant annotated data & language-specific tools.
We also look at the reverse direction, English–Finnish, but stick to our language-general approach!

Agenda
– Baseline statistical MT → terminology
– Our morphologically-aware SMT system → baseline + morphological layers
– Finnish study, morphological aspects → suggestion of a self-correcting model
– Experiments & results

Baseline statistical MT (SMT) – overview
We construct our baseline using Moses (Koehn et al., 2007), a state-of-the-art open-source SMT toolkit:
– Training (monolingual/parallel train data) → translation model, language model, reordering model
– Decoding (test data, source language) → output translation (target language)
– Evaluating → BLEU score

Baseline statistical MT – Terminology
Example (reordering effect):
Source: Maria no daba una bofetada a la bruja verde
Target: NULL Mary did not slap the green witch
– Parallel data: pairs of sentences in both languages (implies an alignment correspondence)
– Monolingual data: text from one language only
– Distortion limit parameter: controls reordering, i.e. how far a translated word may move from its source position → we test the effect of this parameter later
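The distortion limit above can be sketched as in standard phrase-based decoders such as Moses: the distortion of a jump is the distance between the end of the previously translated source phrase and the start of the current one. The function names here are illustrative, not from the thesis.

```python
def distortion_cost(prev_end, cur_start):
    """Standard phrase-based distortion: |start_i - end_{i-1} - 1|.
    Zero when translation proceeds monotonically through the source."""
    return abs(cur_start - prev_end - 1)

def within_limit(prev_end, cur_start, limit):
    """A jump is allowed if its distortion is within the limit
    (a negative limit means 'unlimited')."""
    return limit < 0 or distortion_cost(prev_end, cur_start) <= limit

# Monotone continuation: previous phrase ended at source position 2,
# next phrase starts at 3 -> cost 0. Jumping back to position 0
# after covering up to position 4 -> cost 5.
print(distortion_cost(2, 3))  # 0
print(distortion_cost(4, 0))  # 5
```

With a distortion limit of 6 the second jump is still allowed; with a limit of 4 it would be pruned, which is why tuning this parameter matters for reordering-heavy language pairs.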

Automatic evaluation in SMT
Human judgment is expensive & labor-intensive → evaluate automatically using reference translation(s):
Input: Mary did not slap the green witch
Output: Maria daba una bofetada a verde bruja
Ref: Maria no daba una bofetada a la bruja verde
The baseline SMT system’s output is scored against the reference (BLEU score).

Automatic evaluation in SMT – BLEU score
BLEU = length_ratio * exp((log p1 + … + log p4) / 4), where pn is the n-gram precision
Match unigrams, bigrams, trigrams, and up to N-grams against the reference:
Ref: Maria no daba una bofetada a la bruja verde
Output: Maria daba una bofetada a verde bruja
– p1 (unigram matches) = 7
– p2 (bigram matches) = 4
– p3 (trigram matches) = 2
– p4 (4-gram matches) = 1
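As a rough sketch of how these n-gram precisions combine into a score, the Python below (illustrative, not the thesis code) computes clipped n-gram precisions and a BLEU-style score with a brevity penalty. The add-one smoothing on zero match counts is an assumption for the sentence-level case, and the match counts it produces may differ slightly from the hand counts on the slide, depending on how matches are counted.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(output, ref, n):
    """Clipped n-gram counts: each output n-gram is matched at most
    as many times as it occurs in the reference."""
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    total = max(sum(out_counts.values()), 1)
    return matched, total

def bleu(output, ref, max_n=4):
    """Sentence-level BLEU: geometric mean of n-gram precisions
    times a brevity penalty for outputs shorter than the reference."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        matched, total = modified_precision(output, ref, n)
        log_prec += math.log(max(matched, 1) / total)  # crude zero smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(output)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

out = "Maria daba una bofetada a verde bruja".split()
ref = "Maria no daba una bofetada a la bruja verde".split()
print(modified_precision(out, ref, 1))  # (7, 7): every output unigram is in the reference
print(round(bleu(out, ref), 3))         # 0.355
```

The brevity penalty is what keeps a system from gaming precision by emitting very short outputs; here the 7-token output is penalized against the 9-token reference.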

Baseline SMT – shortcomings?
– Only deals with language pairs of a similar morphological level
– Suffers from the data sparseness problem in highly inflected languages
[Table: type and token counts for English vs. Finnish, statistics from the 714K-sentence corpus; the numbers were lost in transcription]
Type: the number of different words (vocabulary size); Token: the total number of words

Why are highly inflected languages hard?
Huge vocabulary size:
– the Finnish vocabulary is ~6 times the English vocabulary
Prefixes/suffixes can be freely concatenated to form new words:
– Finnish: oppositio/kansa/n/edusta/ja (opposition/people/of/represent/-ative) = opposition member of parliament
– Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar/laştır/ama/dık/lar/ımız/dan/mış/sınız/casına) = “(behaving) as if you are among those whom we could not cause to become civilized”, and this is a single word!
→ We make our system morphologically aware to address these issues.

Agenda
– Baseline statistical MT → terminology
– Our morphologically-aware SMT system → baseline + morphological layers
– Finnish study, morphological aspects → suggestion of a self-correcting model
– Experiments & results

Morpheme pre- & post-processing modules
– Morpheme pre-processing: cars → car + s
– Morpheme post-processing: auto + t → autot
Pipeline: parallel train data → translation & reordering model training; monolingual train data → language model training; test data → decoding → final translation

Incorporating morphological layers
Our morphologically-aware SMT wraps the baseline pipeline in morphological layers:
– Parallel train data → morpheme pre-processing → translation & reordering model training
– Monolingual train data → morpheme pre-processing → language model training
– Test data → morpheme pre-processing → decoding → morpheme post-processing → final translation E

Preprocessing – morpheme segmentation (MS)
We perform MS to address the data sparseness problem:
– “cars” might not appear in the training data, but “car” and “s” do
(Oflazer, 2007) and (Toutanova, 2008) also perform MS, but use morphological analyzers that:
– are customized for a specific language
– utilize richly annotated data
We use an unsupervised morpheme segmentation tool, Morfessor, which requires only unannotated monolingual data.

Morpheme segmentation – Morfessor
Morfessor segments words in an unsupervised manner:
straightforwardness → straight/STM + forward/STM + ness/SUF
3 tags: PRE (prefix), STM (stem), SUF (suffix)
[Table: type counts and PRE/STM/SUF counts for English vs. Finnish, statistics from the 714K-sentence corpus; the numbers were lost in transcription]
→ reduces the data sparseness problem

Post-processing – morpheme concatenation
The output after decoding is a sequence of morphemes:
Pitäkää/STM+ mme/SUF se/STM omassa/STM täsmällis/STM+ essä/STM tehtävä/STM+ ssä/SUF+ ä/SUF+ n/SUF
How do we put them back into words?
– During translation, keep the tag info and the “+” sign (to indicate word-internal morphemes)
– Use word structure: WORD = (PRE* STM SUF*)+
Result: Pitäkäämme se omassa täsmällisessä tehtävässään
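The concatenation rule WORD = (PRE* STM SUF*)+ amounts to gluing each morpheme onto the next whenever its token carries a trailing “+”. A minimal sketch (the function name and any input-format details beyond the slide are assumptions):

```python
def concatenate_morphemes(tokens):
    """Rejoin decoder output ("morpheme/TAG" tokens, where a trailing "+"
    marks a word-internal boundary) into surface words."""
    words, current = [], ""
    for token in tokens:
        joined = token.endswith("+")                  # "+" => next morpheme glues on
        morph = token.rstrip("+").rsplit("/", 1)[0]   # strip the marker and the tag
        current += morph
        if not joined:                                # word boundary reached
            words.append(current)
            current = ""
    if current:                                       # trailing unfinished word
        words.append(current)
    return " ".join(words)

tokens = ("Pitäkää/STM+ mme/SUF se/STM omassa/STM täsmällis/STM+ essä/STM "
          "tehtävä/STM+ ssä/SUF+ ä/SUF+ n/SUF").split()
print(concatenate_morphemes(tokens))
# Pitäkäämme se omassa täsmällisessä tehtävässään
```

Note that only the “+” markers drive the rejoining here; the PRE/STM/SUF tags are carried along so that the word-structure pattern can also be validated, which this sketch omits.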

Agenda
– Baseline statistical MT → terminology
– Our morphologically-aware SMT system → baseline + morphological layers
– Finnish study, morphological aspects → suggestion of a self-correcting model
– Experiments & results

Finnish study – two distinct characteristics
More case endings than usual Indo-European languages:
– They normally correspond to prepositions or postpositions
– E.g.: auto/sta “out of the car”, auto/on “into the car”
Endings where Indo-European languages have function words:
– Finnish possessive suffixes = English possessive pronouns
– E.g.: auto/ni “my car”, auto/mme “our car”

Structure of nominals – a word followed by many suffixes

Category         Suffix    Function                 Sample             Translation
Number (2)       -i-/-t    plural                   auto/t             cars
Case (15)        -n        genitive (possession)    auto/n             of the car
                 -ssa      inessive (inside)        auto/ssa           in the car
Possessive (6)   -ni       1st person               auto/ni            my car
                 -si       2nd person               auto/si            your car
Particle (6)     -kin      too, also                auto/i/ssa/si/kin  in your cars too
                 -ko       interrogative            auto/ssa/si/ko     in your car?

Structure: Nominal + number + case + possessive + particle

Structure of finite verb forms – Finnish suffixes ~ English function words

Category              Suffix       Function             Sample        Translation
Passive (2)           -ta/tt/tta   unspecified person   sano/ta/an    one says
                      -an/en/in    personal ending
Tense (2) / Mood (4)  -i-          past                 sano/i/n      I said
                      -isi-        conditional          sano/isi/n    I would say
Personal ending (6)   -n           1st person           sano/n        I say
                      -t           2nd person           sano/t        you say
Particle (6)          -kin         too, also            sano/i/n/kin  I said also
                      -ko          interrogative        sano/i/n/ko   did I say?

Structure: Verb stem + passive + tense/mood + personal ending + particle

Potential challenges of highly inflected languages to the system
A word might be followed by several suffixes → the system might get the stem right but miss a suffix.
Correct translation: my cars → auto/i/ni (i: plural, ni: my)
System output: …… my/STM car/STM+ s/SUF …… → …… auto/STM+ i/SUF ……
Intuition: use “my” and “s” on the source side to help. How can the system self-correct this suffix to i/ni?

Preliminary self-correcting model
Suffixes in a highly inflected language ~ function words in a less inflected language
→ besides prefixes & suffixes, make use of the source-side function words
Model it as a sequence labeling task in which the labels are suffixes:
– my/STM car/STM+ s/SUF → auto/STM+ i/SUF
– At position t (Stem_t = “auto”): features include the neighboring stems (Stem_{t-1}, Stem_{t+1}), the surrounding suffix labels (Suffix_{t-1}, Suffix_t, Suffix_{t+1}), and the source-side cues func = “my”, suf = “s”
– Predict the correct suffix: i/ni
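Since the self-correcting model is only suggested at this stage, the sketch below merely illustrates the kind of feature extraction such a sequence labeler might use. All names, the feature set, and the alignment representation are hypothetical, not from the thesis.

```python
def suffix_features(t, stems, prev_suffixes, src_funcs, src_sufs):
    """Feature dict for predicting the suffix label at target position t:
    target stems in a window, the previous suffix label, and aligned
    source-side function words / suffixes (illustrative feature set)."""
    feats = {
        "stem": stems[t],
        "prev_stem": stems[t - 1] if t > 0 else "<s>",
        "next_stem": stems[t + 1] if t + 1 < len(stems) else "</s>",
        "prev_suffix": prev_suffixes[t - 1] if t > 0 else "<s>",
        "src_func": src_funcs.get(t, "NONE"),  # aligned source function word
        "src_suf": src_sufs.get(t, "NONE"),    # aligned source suffix
    }
    return feats

# "my cars" -> "auto/i/ni": target position 0 holds stem "auto",
# aligned to the source function word "my" and the source suffix "s".
feats = suffix_features(0, ["auto"], ["?"], {0: "my"}, {0: "s"})
print(feats["src_func"], feats["src_suf"])  # my s
```

A CRF or maximum-entropy tagger trained on such features could then relabel the decoder's suffix i as i/ni, which is exactly the correction the slide calls for.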

Agenda
– Baseline statistical MT → terminology
– Our morphologically-aware SMT system → baseline + morphological layers
– Finnish study, morphological aspects → suggestion of a self-correcting model
– Experiments & results

Datasets from the European Parliament corpus
Four datasets of various sizes, with training sets of 5K, 10K, 36K, and 61K sentence pairs
[Table: train/dev/test sizes per dataset; the dev and test figures were garbled in transcription]
– Selection: first pick a keyword for each dataset, then extract all sentences containing the keyword and its morphological variants
– Modest in size compared to the 714K sentences of the full corpus. We chose this setup to:
– reduce running time
– simulate the real situation of scarce resources

Experiments – out-of-vocabulary (OOV) rates
OOV rate = number of untranslated words / total words
Reduction rate = (baseline OOV rate - our OOV rate) / baseline OOV rate
[Table: OOV rates of the baseline SMT vs. our SMT on Datasets 1–4 (5K, 10K, 36K, 61K); most figures were lost in transcription]
Reduction rates range from 10.33% to 34.74%; the effect is highest when data is limited.
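The two formulas above can be sketched directly; the function names and the toy vocabulary below are illustrative, not from the experiments.

```python
def oov_rate(test_tokens, train_vocab):
    """Fraction of test tokens never seen in training; in a word-based
    phrase table these come out untranslated."""
    unseen = sum(1 for w in test_tokens if w not in train_vocab)
    return unseen / len(test_tokens)

def reduction_rate(baseline_oov, our_oov):
    """Relative OOV reduction of the morpheme system over the baseline."""
    return (baseline_oov - our_oov) / baseline_oov

# Toy example: inflected forms "autot" and "cars" are OOV at the word
# level even though their stems were seen in training.
train_vocab = {"auto", "car", "talo"}
test = ["auto", "autot", "car", "cars"]
print(oov_rate(test, train_vocab))                 # 0.5
print(round(reduction_rate(0.20, 0.13), 2))        # 0.35, i.e. a 35% reduction
```

Segmenting "autot" into "auto + t" removes this kind of OOV, which is why the reduction is largest on the smallest datasets, where unseen inflected forms dominate.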

Overall results with BLEU score
We use the BLEU metric, judged at two levels:
– word level: the unit in the N-gram is a word
– morpheme level: the unit in the N-gram is a morpheme
[Table: word- and morpheme-level BLEU for the baseline SMT vs. our SMT on Datasets 1–4; the scores were lost in transcription]
– Word BLEU: our SMT is as competitive as the baseline SMT
– Morpheme BLEU: our SMT shows better morpheme coverage

Overall results – distortion limit tuning
The distortion limit controls reordering and has an influential effect on performance (Virpioja, 2007).
[Table: word- and morpheme-level BLEU at distortion limits 6, 9, and unlimited; the scores were lost in transcription]
– The baseline SMT is best at limit 6; our SMT is best at limit 9
– Our SMT is better in both word and morpheme BLEU

Error analysis
We are interested in how often the system gets the stem right but not the suffixes.
[Table: per dataset, the number of stems with correct vs. incorrect suffixes and the incorrect-suffix ratio; the figures were lost in transcription]
→ There is a real need for the self-correcting model.

Even further analysis – new results after the thesis!
Our datasets are specialized on their keywords, so results are more conclusive if we look at translations of phrases containing the dataset keywords.
[Table: for each dataset keyword (success, environment, report, european), the number of source phrases with the keyword, the number of correct stems, and the number of correct suffixes, for the baseline SMT vs. our SMT; the counts were lost in transcription]
Conclusion: our SMT performs better on both tasks, getting the stems and the suffixes right.

References
Koehn, P., et al. Moses: open source toolkit for statistical machine translation.
Oflazer & Durgar El-Kahlout. Exploring different representational units in English-to-Turkish statistical machine translation.
Virpioja, S., et al. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner.
Toutanova et al. Applying morphology generation models to machine translation.

Q & A
Thank you!

Baseline statistical MT (backup)
– Translation model training (parallel train data): EM algorithm, symmetrized word alignments (GIZA++ tool) → phrase tables
– Language model training (target train data): N-gram extraction (SRILM) → language model
– Tuning (development data): P(E|F) ∝ exp(∑_i λ_i f_i(E, F)) → learn the λ_i by minimum error rate training → λ*_i
– Decoding (test data F): E = argmax_E ∑_i λ*_i f_i(E, F) (beam search, Moses toolkit) → final translation E

Standard SMT system – translation model
Translation model training on the parallel train data learns how to translate a source phrase into a target phrase.
Output: a phrase table, e.g.
car industry in europe ||| euroopan autoteollisuus
car industry in the ||| autoteollisuuden
car industry in ||| autoteollisuuden
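A phrase table in this ||| separated format can be read with a few lines of Python. This sketch (names are illustrative) parses the entries shown above and also tolerates the trailing score fields that real Moses phrase tables append after a further ||| separator.

```python
def parse_phrase_table(lines):
    """Parse phrase-table lines of the form
    'source phrase ||| target phrase [ ||| scores ]'
    into a dict: source phrase -> list of (target phrase, scores)."""
    table = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        src, tgt = fields[0], fields[1]
        scores = [float(s) for s in fields[2].split()] if len(fields) > 2 else []
        table.setdefault(src, []).append((tgt, scores))
    return table

lines = [
    "car industry in europe ||| euroopan autoteollisuus",
    "car industry in the ||| autoteollisuuden",
    "car industry in ||| autoteollisuuden",
]
table = parse_phrase_table(lines)
print(table["car industry in europe"])  # [('euroopan autoteollisuus', [])]
```

During decoding, the decoder looks up every source span in such a table and combines the candidate target phrases with the language model and reordering scores.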

Standard SMT system – language model
Language model training on the target train data learns constraints on which sequences of words can go together.
Output: an N-gram table, e.g.
commission 's argument
commission 's arguments
commission 's assertion
commission 's assessment
(the N-gram counts were lost in transcription)
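As a sketch of what language model training does, here is a maximum-likelihood bigram model over toy target-side data. This is illustrative only: SRILM's models are higher-order and smoothed, and the sentences below are not from the corpus.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: P(w2 | w1) = c(w1 w2) / c(w1),
    with <s>/</s> boundary markers and no smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])          # history counts
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w1, w2: (bigrams[(w1, w2)] / unigrams[w1]
                           if unigrams[w1] else 0.0)

prob = train_bigram_lm([
    "the commission 's argument",
    "the commission 's assessment",
])
print(prob("commission", "'s"))  # 1.0: "'s" always follows "commission"
print(prob("'s", "argument"))    # 0.5
```

The decoder uses such probabilities to prefer fluent target sequences; in the morpheme-level system the same training runs over morpheme sequences instead of words.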

Standard SMT system – tuning
Tuning on parallel development data determines the weights for combining the different models (e.g. translation model, language model):
P(E|F) ∝ exp(∑_i λ_i f_i(E, F)) → learn the λ_i

Standard SMT system – decoding
Decoding uses the phrase table from the translation model, the N-gram table from the language model, and the combination weights from tuning.
For each input sentence F, it generates a set of candidate translations and picks the highest-scoring one as the final translation E.