Slide 1: A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
Minh-Thang Luong, Preslav Nakov & Min-Yen Kan
EMNLP 2010, Massachusetts, USA
Slide 2: Highly Inflected Languages Are Hard for MT
- Highly inflected languages (Arabic, Basque, Turkish, Russian, Hungarian, etc.) make extensive use of affixes, yielding a huge number of distinct word forms.
- A Finnish example, used in the Finnish Air Force: lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas = lento+ kone+ suihku+ turbiini+ moottori+ apu+ mekaanikko+ ali+ upseeri+ oppilas (flight machine shower turbine engine assistance mechanic under officer student), i.e., "technical warrant officer trainee specialized in aircraft jet engines".
- Highly inflected languages are hard for MT, so analysis at the morpheme level is sensible.
Slide 3: Morphological Analysis Helps
- Morpheme representation alleviates data sparseness: Arabic-English (Lee, 2004), Czech-English (Goldwater & McClosky, 2005), Finnish-English (Yang & Kirchhoff, 2006).
- Word representation captures context better for large corpora: Arabic-English (Sadat & Habash, 2006), Finnish-English (de Gispert et al., 2009).
- Our approach: the basic unit of translation is the morpheme, but word boundaries are respected at all stages.
- Finnish example:
  auto+ si = your car
  auto+ i+ si = your car+ s
  auto+ i+ ssa+ si = in your car+ s
  auto+ i+ ssa+ si+ ko = in your car+ s ?
Slide 4: Translation into Morphologically Rich Languages
- A challenging translation direction, with recent interest in morphologically rich target languages:
  Arabic (Badr et al., 2008) and Turkish (Oflazer and El-Kahlout, 2007): enhance performance only for small bi-texts
  Greek (Avramidis and Koehn, 2008) and Russian (Toutanova et al., 2008): rely heavily on language-specific tools
- We are interested in an unsupervised approach and experiment with translation on a large dataset.
Slide 5: Methodology
- Morphological analysis: unsupervised
- Morphological enhancements: respect word boundaries
- Translation model enrichment: merge phrase tables (PTs)
- Pipeline over the standard phrase-based workflow: words -> 1) morphological analysis -> morphemes -> alignment training -> 2) phrase extraction -> phrase pairs -> PT scoring with 5) translation model enrichment -> phrase table -> 3) MERT tuning -> 4) decoding.
Slide 6: Morphological Analysis (Pipeline Step 1: Morphological Segmentation)
- We use Morfessor (Creutz and Lagus, 2007), an unsupervised morphological analyzer.
- It segments words into morphemes tagged as prefix, stem, or suffix (PRE, STM, SUF): un/PRE+ care/STM+ ful/SUF+ ly/SUF
- The "+" sign is used to enforce word boundary constraints later.
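For concreteness, here is a minimal sketch (not Morfessor itself) of the marker convention on this slide: given a word's tagged morphemes from a Morfessor-style analyzer, every morpheme except the word-final one receives a trailing "+", so word boundaries can be recovered later. The function name and the input format are illustrative assumptions.

```python
def mark_word(tagged_morphemes):
    """tagged_morphemes: list of (surface, tag) pairs for ONE word.
    Every morpheme except the word-final one gets a trailing '+',
    so that word boundaries can be recovered later."""
    tokens = []
    for i, (surface, tag) in enumerate(tagged_morphemes):
        plus = "+" if i < len(tagged_morphemes) - 1 else ""
        tokens.append(f"{surface}/{tag}{plus}")
    return " ".join(tokens)

print(mark_word([("un", "PRE"), ("care", "STM"), ("ful", "SUF"), ("ly", "SUF")]))
# -> un/PRE+ care/STM+ ful/SUF+ ly/SUF
```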
Slide 7: Word Boundary-aware Phrase Extraction (Training)
- Typical SMT uses a maximum phrase length of n=7 words.
- Problem: morpheme phrases of length n can span fewer than n words and may span words only partially. This is severe for morphologically rich languages.
- Solution: require morpheme phrases to span n words or less and to cover a sequence of whole words, which avoids proposing non-words.
- Example:
  SRC = the/STM new/STM , un/PRE+ democratic/STM immigration/STM policy/STM
  TGT = uusi/STM , epä/PRE+ demokraat/STM+ t/SUF+ i/SUF+ s/SUF+ en/SUF maahanmuutto/PRE+ politiikan/STM
  (uusi = new, epädemokraattisen = undemocratic, maahanmuuttopolitiikan = immigration policy)
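A minimal sketch, under the marker convention above, of how the word-boundary constraint might be checked for a candidate morpheme span. This is illustrative only, not the authors' Moses modification; the function name and token format are assumptions.

```python
def spans_whole_words(sentence, start, end, max_words=7):
    """sentence: list of marked morpheme tokens; the candidate span is [start, end)."""
    # Must begin a word: the preceding token, if any, has to close its word.
    if start > 0 and sentence[start - 1].endswith("+"):
        return False
    # Must end a word: the last token of the span must not continue a word.
    if sentence[end - 1].endswith("+"):
        return False
    # Count the whole words covered (each word ends at a token without '+').
    words = sum(1 for tok in sentence[start:end] if not tok.endswith("+"))
    return words <= max_words

tgt = ["uusi/STM", ",", "epä/PRE+", "demokraat/STM+", "t/SUF+",
       "i/SUF+", "s/SUF+", "en/SUF", "maahanmuutto/PRE+", "politiikan/STM"]
print(spans_whole_words(tgt, 2, 8))   # epädemokraattisen: one whole word -> True
print(spans_whole_words(tgt, 2, 5))   # cuts that word in the middle      -> False
```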
Slide 8: Morpheme MERT Optimizing Word BLEU (Tuning)
- Why? The BLEU brevity penalty is influenced by sentence length, and the same number of words can span very different numbers of morphemes (e.g., the 6-morpheme Finnish word epä/PRE+ demokraat/STM+ t/SUF+ i/SUF+ s/SUF+ en/SUF). This leads to a suboptimal weight for the word-penalty feature function.
- Solution: in each MERT iteration, decode at the morpheme level, convert the morpheme translation into a word sequence, compute word BLEU, then convert back to morphemes.
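A small sketch of the morpheme-to-word conversion this tuning step relies on: "+"-marked morphemes are glued back into surface words before word BLEU is computed. The helper name is hypothetical; only the "+" convention comes from the slides.

```python
def morphemes_to_words(tokens):
    """Glue '+'-marked morphemes back into surface words (tags are dropped)."""
    words, current = [], []
    for tok in tokens:
        surface = tok.split("/")[0].rstrip("+")   # strip tag and boundary marker
        current.append(surface)
        if not tok.endswith("+"):                 # a token without '+' closes the word
            words.append("".join(current))
            current = []
    if current:                                   # tolerate a dangling '+' at sentence end
        words.append("".join(current))
    return words

print(morphemes_to_words(
    ["epä/PRE+", "demokraat/STM+", "t/SUF+", "i/SUF+", "s/SUF+", "en/SUF"]))
# -> ['epädemokraattisen']
```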
Slides 9-13: Decoding with Twin Language Models (Decoding)
- A morpheme language model (LM) alleviates data sparseness, but its phrases span fewer words.
- We therefore introduce a second LM at the word level: a separate feature in the log-linear model, implemented in the Moses decoder as a word-level "view" on the morpheme-level hypotheses.
- Example hypothesis: uusi/STM , epä/PRE+ demokraat/STM+ t/SUF+ i/SUF+ s/SUF+ en/SUF maahanmuutto/PRE+ politiikan/STM
  The morpheme LM scores n-grams that cross word boundaries (e.g., "s/SUF+ en/SUF maahanmuutto/PRE+", "en/SUF maahanmuutto/PRE+ politiikan/STM"), while the word LM scores the completed words: uusi , epädemokraattisen maahanmuuttopolitiikan.
- This is (1) different from scoring with two word-level LMs and (2) superior to n-best rescoring.
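A rough sketch, not the actual Moses feature implementation, of what the word-level "view" amounts to: the morpheme LM scores the morpheme sequence, the word LM is queried only for words the hypothesis has already completed, and the two enter the log-linear score as separate weighted features. All names, the LM interfaces, and the toy weights below are assumptions.

```python
def words_so_far(morphemes):
    """Completed words in a partial hypothesis; an unfinished final word is ignored."""
    words, cur = [], []
    for tok in morphemes:
        cur.append(tok.split("/")[0].rstrip("+"))
        if not tok.endswith("+"):
            words.append("".join(cur))
            cur = []
    return words

def twin_lm_score(hyp, morpheme_lm, word_lm, w_m=1.0, w_w=1.0):
    """hyp: marked morphemes of the hypothesis so far.
    morpheme_lm / word_lm: callables returning a log-probability for a token list
    (stand-ins for the real LM feature functions)."""
    return w_m * morpheme_lm(hyp) + w_w * word_lm(words_so_far(hyp))

# toy stand-in LMs with uniform per-token log-probabilities
morph_lm = lambda toks: -1.0 * len(toks)
word_lm = lambda toks: -2.0 * len(toks)
hyp = ["uusi/STM", ",", "epä/PRE+", "demokraat/STM+", "t/SUF+",
       "i/SUF+", "s/SUF+", "en/SUF", "maahanmuutto/PRE+"]
print(twin_lm_score(hyp, morph_lm, word_lm))
# the word LM sees only ['uusi', ',', 'epädemokraattisen']; 'maahanmuutto+' is still open
```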
Slide 14: Building Twin Translation Models (Enriching)
- From the same parallel corpus we generate two translation models: GIZA++ is run both on words and on morphemes (after morphological segmentation), giving a word alignment and a morpheme alignment; morpheme phrase extraction then yields a morpheme phrase table PT_m, while the word-derived table PT_w is converted into the morpheme representation (PT_w->m) via morphological segmentation. The two tables are merged (PT merging) before decoding.
- Example: the word-level pair "problems ||| vaikeuksista" becomes "problem/STM+ s/SUF ||| vaikeu/STM+ ksi/SUF+ sta/SUF" in PT_w->m, while PT_m contains e.g. "problem/STM+ s/SUF ||| ongelma/STM+ t/SUF".
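A hypothetical sketch of producing a PT_w->m entry: a phrase pair extracted from the word-level alignment is re-expressed in the same tagged-morpheme representation used by the morpheme system. The SEGMENTATION lookup here is a toy stand-in for the Morfessor segmentation.

```python
SEGMENTATION = {
    "problems":     [("problem", "STM"), ("s", "SUF")],
    "vaikeuksista": [("vaikeu", "STM"), ("ksi", "SUF"), ("sta", "SUF")],
}

def to_morpheme_phrase(words, segment=SEGMENTATION):
    """Re-express a word-level phrase in the marked morpheme representation."""
    tokens = []
    for w in words:
        morphs = segment.get(w, [(w, "STM")])        # unsegmented words pass through
        for i, (surface, tag) in enumerate(morphs):
            plus = "+" if i < len(morphs) - 1 else ""
            tokens.append(f"{surface}/{tag}{plus}")
    return " ".join(tokens)

src, tgt = ["problems"], ["vaikeuksista"]
print(to_morpheme_phrase(src), "|||", to_morpheme_phrase(tgt))
# -> problem/STM+ s/SUF ||| vaikeu/STM+ ksi/SUF+ sta/SUF
```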
Slide 15: Phrase Table (PT) Merging (Enriching)
- Add-feature methods (e.g., Chen et al., 2009) are heuristic-driven.
- Interpolation-based methods (e.g., Wu & Wang, 2007) linearly interpolate the phrase and lexicalized translation probabilities, but are designed for two PTs that originate from different sources.
- Each entry carries phrase translation probabilities, lexicalized translation probabilities, and a phrase penalty, e.g.:
  problem/STM+ s/SUF ||| vaikeu/STM+ ksi/SUF+ sta/SUF ||| 0.07 0.11 0.01 0.01 2.7
  problem/STM+ s/SUF ||| ongelma/STM+ t/SUF ||| 0.37 0.60 0.11 0.14 2.7
- The add-feature approach appends indicator features marking which table an entry came from, e.g.:
  problem/STM+ s/SUF ||| vaikeu/STM+ ksi/SUF+ sta/SUF ||| 0.07 0.11 0.01 0.01 2.7 2.7 1
  problem/STM+ s/SUF ||| ongelma/STM+ t/SUF ||| 0.37 0.60 0.11 0.14 2.7 1 2.7
- We take into account the fact that our twin translation models are of equal quality.
Slide 16: Our Method - Phrase Translation Probabilities (Enriching)
- Preserve the normalized maximum-likelihood estimates (Koehn et al., 2003): use the raw counts of both models, i.e., the number of times each phrase pair was extracted from the training dataset by PT_m and by PT_w->m, and normalize over the pooled counts:
  p(t|s) = ( c_m(s,t) + c_w->m(s,t) ) / ( c_m(s) + c_w->m(s) ), where c(s) = sum_t c(s,t).
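A sketch of the pooled maximum-likelihood estimate described above, assuming the raw extraction counts of both tables are available; function and variable names (and the toy counts) are illustrative.

```python
from collections import Counter

def merged_phrase_probs(counts_m, counts_wm):
    """counts_*: Counter mapping (src_phrase, tgt_phrase) -> extraction count."""
    pooled = counts_m + counts_wm                      # add the raw counts
    src_totals = Counter()
    for (src, _tgt), c in pooled.items():
        src_totals[src] += c
    return {pair: c / src_totals[pair[0]] for pair, c in pooled.items()}

counts_m = Counter({("problem+ s", "ongelma+ t"): 8,
                    ("problem+ s", "vaikeu+ ksi+ sta"): 2})
counts_wm = Counter({("problem+ s", "vaikeu+ ksi+ sta"): 6})
for pair, p in merged_phrase_probs(counts_m, counts_wm).items():
    print(pair, round(p, 2))
# both translations of "problem+ s" end up with probability 8/16 = 0.5
```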
Slide 17: Our Method - Lexicalized Translation Probabilities (Enriching)
- Use linear interpolation of the two tables' lexicalized scores.
- What happens if a phrase pair belongs to only one PT? Previous methods interpolate with 0, which might cause some good phrases to be penalized.
- Our method induces all scores before interpolation: the lexical model of one PT is used to score the phrase pairs of the other. For example, the morpheme lexical model (PT_m) scores (problem/STM+ s/SUF | ongelma/STM+ t/SUF), while the word lexical model (PT_w) scores the corresponding word forms (problems | ongelmat).
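A sketch of the lexicalized-score merging step: every pair receives a score from both tables before linear interpolation, and a score missing from one table is induced with that table's lexical model rather than treated as 0. The equal interpolation weight reflects the previous slide's assumption that the twin models are of equal quality; the probabilities and the `induce` stand-ins are toy values.

```python
def merged_lex_probs(table_m, table_wm, induce_m, induce_wm, alpha=0.5):
    """table_*: dict mapping (src, tgt) -> lexicalized translation probability.
    induce_*: callables scoring a pair that is missing from that table."""
    merged = {}
    for pair in set(table_m) | set(table_wm):
        p_m = table_m[pair] if pair in table_m else induce_m(pair)
        p_wm = table_wm[pair] if pair in table_wm else induce_wm(pair)
        merged[pair] = alpha * p_m + (1 - alpha) * p_wm   # equal-weight interpolation
    return merged

table_m = {("problem+ s", "ongelma+ t"): 0.60,
           ("problem+ s", "vaikeu+ ksi+ sta"): 0.11}
table_wm = {("problem+ s", "vaikeu+ ksi+ sta"): 0.14,
            ("problem+ s", "ongelma+ sta"): 0.20}
induce = lambda pair: 0.05   # toy stand-in for scoring with the other lexical model
for pair, p in sorted(merged_lex_probs(table_m, table_wm, induce, induce).items()):
    print(pair, round(p, 3))
```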
Slide 18: Dataset & Settings (Experiments)
- Dataset: the past shared task WPT05 (en/fi), 714K sentence pairs, split into subsets T1, T2, T3, and T4 of sizes 40K, 80K, 160K, and 320K.
- Standard phrase-based SMT settings: Moses, IBM Model 4, case-insensitive BLEU.
Slide 19: MT Baseline Systems (Experiments)
- w-system: word level; m-system: morpheme level; m-BLEU: morpheme version of BLEU.

            w-system            m-system
  Corpus    BLEU    m-BLEU      BLEU    m-BLEU
  T1        11.56   45.57       11.07   49.15
  T2        12.95   48.63       12.68   53.78
  T3        13.64   50.30       13.32   54.40
  T4        14.20   50.85       13.57   54.70
  Full      14.58   53.05       14.08   55.26

- Either the m-system does not perform as well as the w-system, or BLEU is not capable of measuring morpheme-level improvements.
Slide 20: Morphological Enhancements - Individual (Experiments)
- phr: boundary-aware phrase extraction; tune: MERT tuning with word BLEU; lm: decoding with twin LMs. Gains are relative to the m-system.

  System     T1 (40K)        Full (714K)
  w-system   11.56           14.58
  m-system   11.07           14.08
  m+phr      11.44 (+0.37)   14.43 (+0.35)
  m+tune     11.73 (+0.66)   14.55 (+0.47)
  m+lm       11.58 (+0.51)   14.53 (+0.45)

- Each individual enhancement yields improvements for both small and large corpora.
Slide 21: Morphological Enhancements - Combined (Experiments)
- phr: boundary-aware phrase extraction; tune: MERT tuning with word BLEU; lm: decoding with twin LMs. Gains are relative to the m-system.

  System          T1 (40K)        Full (714K)
  w-system        11.56           14.58
  m-system        11.07           14.08
  m+phr+lm        11.77 (+0.70)   14.58 (+0.50)
  m+phr+lm+tune   11.90 (+0.83)   14.39 (+0.31)

- The combined morphological enhancements are on par with the w-system and yield sizable improvements over the m-system.
Slide 22: Translation Model Enrichment (Experiments)
- Merging methods: add-1 (one extra feature), add-2 (two extra features), interpolation (linear interpolation), and ourMethod. Gains are relative to the m-system.

  Merging method   Full (714K)
  m-system         14.08
  w-system         14.58
  add-1            14.25 (+0.17)
  add-2            13.89 (-0.19)
  interpolation    14.63 (+0.55)
  ourMethod        14.82 (+0.74)

- Our method outperforms the w-system baseline.
Slide 23: Result Significance (Analysis)

  System      BLEU
  m-system    14.08
  w-system    14.58
  ourSystem   14.82 (+0.74)

- Absolute improvement of 0.74 BLEU over the m-system, a non-trivial relative improvement of 5.6%.
- Outperforms the w-system by 0.24 points (1.56% relative).
- Statistically significant with p < 0.01 (Collins' sign test).
Slide 24: Translation Proximity Match (Analysis)
- Hypothesis: our approach yields translations close to the reference wordforms but is unjustly penalized by BLEU.
- We automatically extract phrase triples (src, out, ref), using src as the pivot, and compare the translation output (out) to the reference (ref) with the longest common subsequence ratio, which captures high character-level similarity between out and ref, e.g. (economic and social, taloudellisia ja sosiaalisia, taloudellisten ja sosiaalisten).
- Of 16,262 triples, 6,758 match the references exactly; the remaining triples are close wordforms.
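A sketch of the character-level proximity measure: the longest common subsequence ratio between the output and reference phrases. Normalizing by the longer string is one common choice and an assumption here, as are the function names.

```python
def lcs_len(a, b):
    # classic dynamic program for the longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_ratio(out, ref):
    return lcs_len(out, ref) / max(len(out), len(ref))

out = "taloudellisia ja sosiaalisia"
ref = "taloudellisten ja sosiaalisten"
print(round(lcs_ratio(out, ref), 2))   # close wordforms score high despite differing endings
```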
Slide 25: Human Evaluation (Analysis)
- 4 native Finnish speakers judged 50 random test sentences, following the WMT'09 evaluation: judges were given the source sentence, its reference translation, and the outputs of the m-system, w-system, and ourSystem shown in random order, and were asked for three pairwise judgments. Each column pair below counts how often the first system was preferred over the second, and vice versa.

            our vs. m     our vs. w     w vs. m
  Judge 1   25    18      19    12      21    19
  Judge 2   24    16      19    15      25    14
  Judge 3   27†   12      17    11      27†   15
  Judge 4   25    20      26†   12      22    22
  Total     101‡  66      81‡   50      95†   70

- The judges consistently preferred (1) ourSystem to the m-system, (2) ourSystem to the w-system, and (3) the w-system to the m-system.
Slide 26: Sample Translations (Analysis)

Example 1 (Rank: our > m ≥ w)
  src: we were very constructive and we negotiated until the last minute of these talks in the hague.
  ref: olimme erittain rakentavia ja neuvottelimme haagissa viime hetkeen saakka.
  our: olemme olleet hyvin rakentavia ja olemme neuvotelleet viime hetkeen saakka naiden neuvottelujen haagissa.
  w:   olemme olleet hyvin rakentavia ja olemme neuvotelleet viime tippaan niin naiden neuvottelujen haagissa.
  m:   olimme erittain rakentavan ja neuvottelimme viime hetkeen saakka naiden neuvotteluiden haagissa.
  Slide callouts: wrong case, match reference, confusing meaning.

Example 2 (Rank: our > w ≥ m)
  src: it would be a very dangerous situation if the europeans were to become logistically reliant on russia.
  ref: olisi erittäin vaarallinen tilanne, jos eurooppalaiset tulisivat logistisesti riippuvaisiksi venäjästä.
  our: olisi erittäin vaarallinen tilanne, jos eurooppalaiset tulee logistisesti riippuvaisia venäjän.
  w:   se olisi erittäin vaarallinen tilanne, jos eurooppalaisten tulisi logistically riippuvaisia venäjän.
  m:   se olisi hyvin vaarallinen tilanne, jos eurooppalaiset haluavat tulla logistisesti riippuvaisia venäjän.
  Slide callouts: match reference, wrong case, OOV ("logistically" left untranslated by the w-system), change the meaning.

- ourSystem consistently outperforms the m-system and w-system, and seems to blend well translations from both baselines!
Slide 27: Conclusion
- Two key challenges: (1) bring the performance of morpheme-based systems to a level rivaling the standard word-based ones, and (2) incorporate morphological analysis directly into the translation process.
- We have built a preliminary framework toward the second challenge.
- Future work: extend the morpheme-level framework and tackle the second challenge.
- Thank you!
Slide 28: Analysis - Translation Model Comparison
- PT_m+phr vs. PT_m: PT_m+phr is about half the size of PT_m, and 95.07% of its phrase pairs are also in PT_m, so boundary-aware phrase extraction selects good phrase pairs from PT_m to retain in PT_m+phr.
- PT_m+phr vs. PT_w->m: comparable in size (22.5M and 28.9M pairs), but the overlap is only 47.67% of PT_m+phr, so enriching the translation model with PT_w->m helps improve coverage.

  Phrase table (Full, 714K)   Size
  PT_m                        43.5M
  PT_w->m                     28.9M
  PT_m+phr                    22.5M
  PT_m+phr ∩ PT_m             21.4M
  PT_m+phr ∩ PT_w->m          10.7M