1
1 Machine Learning approaches for dealing with Limited Bilingual Data in SMT. Gholamreza Haffari, Simon Fraser University. MT Summit, August 2009.
2
2 Acknowledgments. Special thanks to Anoop Sarkar. Some slides are adapted or used from Chris Callison Burch, Trevor Cohn, and Dragos Stefan Munteanu.
3
3 Statistical Machine Translation. Translate from a source language F to a target language E by computer using a statistical model M_F→E. M_F→E is a standard log-linear model.
4
4 Log-Linear Models. Feature functions f_i and weights w_i. At test time, the best output t* for a given s is chosen by t* = arg max_t Σ_i w_i · f_i(t, s).
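The decision rule above can be made concrete with a small sketch. Everything below is illustrative: the candidate list, feature functions, and weights are made-up stand-ins, not the features of an actual phrase-based decoder, which searches a far larger space of translations.

```python
def score(t, s, features, weights):
    """Weighted feature sum: sum_i w_i * f_i(t, s)."""
    return sum(w * f(t, s) for f, w in zip(features, weights))

def decode(candidates, s, features, weights):
    """t* = argmax_t sum_i w_i * f_i(t, s), over an explicit candidate list."""
    return max(candidates, key=lambda t: score(t, s, features, weights))

# Hypothetical toy features: a length-difference penalty and a crude overlap count.
features = [
    lambda t, s: -abs(len(t.split()) - len(s.split())),
    lambda t, s: len(set(t.split()) & set(s.split())),
]
weights = [0.3, 1.0]

print(decode(["it is good", "it is good ok ok"], "es bueno", features, weights))
# -> "it is good" (closer in length to the source)
```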
5
5 Phrase-based SMT. M_F→E is composed of two main components: the language model f_lm, which takes care of the fluency of the generated translation, and the phrase table f_pt, which takes care of covering the content of the source sentence in the generated translation. A huge bitext is needed to learn a high-quality phrase dictionary.
6
6 Bilingual Parallel Data (figure): source text aligned with target text.
7
7 This Talk What if we don’t have large bilingual text to learn a good phrase table?
8
8 Motivations Low-density Language pairs Population speaking the language is small / Limited online resources Adapting to a new style/domain/topic Overcome training and testing mismatch
9
9 Available Resources Small bilingual parallel corpora Large amounts of monolingual data Comparable corpora Small translation dictionary Multilingual parallel corpora which includes multiple source languages but not the target language
10
10 The Map. Starting from a small source-target bitext and an MT system, the available resources and techniques are: a large comparable source-target text, exploited by parallel sentence extraction and bilingual dictionary induction; a large source monotext, exploited by semi-supervised / active learning; a source-another language bitext, exploited by paraphrasing; and source-another, another-target, source-target bitexts, exploited by triangulation / co-training.
11
11 Learning Problems (I). Supervised learning: given a sample of object-label pairs (x_i, y_i), find the predictive relationship between objects and labels. Unsupervised learning: given a sample consisting of only objects, look for interesting structures in the data, and group similar objects.
12
12 Learning Problems (II). Now consider training data consisting of labeled data (object-label pairs (x_i, y_i)) and unlabeled data (objects x_j). This leads to the following learning scenarios. Semi-supervised learning: find the best mapping from objects to labels, benefiting from the unlabeled data. Transductive learning: find the labels of the unlabeled data. Active learning: find the mapping while actively querying an oracle for the labels of unlabeled data.
13
13 The Big Picture (figure): labeled data {(x_i, y_i)} (bitext) and unlabeled data {x_j} (monotext); train a model M on the labeled data, use it to select from the unlabeled data, and loop (self-training).
14
14 Mining More Bilingual Parallel Data. Comparable corpora are texts which are not parallel in the strict sense but convey overlapping information, e.g. Wikipedia pages or news agencies such as BBC and CNN. From comparable corpora, we can extract sentence pairs which are (approximately) translations of each other.
15
15 Extracting Parallel Sentences (Munteanu & Marcu, 2005) (figure): from un-matched documents to parallel sentences.
16
16 Article Selection (Munteanu & Marcu, 2005). Select the n most relevant target-language documents for each source-language document using an information retrieval (IR) system: translate each source-language article into a target-language query using the bilingual dictionary.
17
17 Candidate Sentence Pair Selection (Munteanu & Marcu, 2005). Consider all of the sentence pairs from the source-language article and the relevant target-language articles. Filter out a sentence pair if the ratio of the sentence lengths is more than 2, or if at least half of the words in each sentence do not have a translation in the other sentence.
18
18 Parallel Sentence Selection (Munteanu & Marcu, 2005). Each candidate sentence pair (s,t) is classified into c_0 = 'parallel' or c_1 = 'not parallel' according to a log-linear model, P(c | s,t) ∝ exp(Σ_i w_i f_i(c, s, t)). The weights are learned during a training phase using labeled training data.
19
19 Model Features & Training Data (Munteanu & Marcu, 2005). The features of the log-linear classifier include: the lengths of the sentences, as well as their ratio; the percentage of words on one side that do not have a translation on the other side / are not connected by alignment links. Training data can be prepared from a parallel corpus containing K sentence pairs. This gives K positive and K² − K negative examples (which can be filtered further using the previous heuristics).
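A rough sketch of how such training examples might be generated from a parallel corpus of K sentence pairs, following the recipe on this slide: aligned pairs become positives, the K² − K cross pairs become negatives, and both are pre-filtered with the length-ratio and word-coverage heuristics. The dictionary format (word → set of translations, assumed to cover both directions) and the thresholds are assumptions for illustration.

```python
def coverage(src_words, tgt_words, dictionary):
    """Fraction of src_words having at least one translation among tgt_words."""
    if not src_words:
        return 0.0
    covered = sum(1 for w in src_words if dictionary.get(w, set()) & set(tgt_words))
    return covered / len(src_words)

def keep_pair(src, tgt, dictionary, max_ratio=2.0, min_cov=0.5):
    """Length-ratio and word-coverage filters from the slides."""
    s, t = src.split(), tgt.split()
    ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
    return (ratio <= max_ratio
            and coverage(s, t, dictionary) >= min_cov
            and coverage(t, s, dictionary) >= min_cov)

def make_examples(parallel_pairs, dictionary):
    """K positives from aligned pairs, up to K^2 - K negatives from cross pairs."""
    examples = []  # (source_sentence, target_sentence, label); 1 = 'parallel'
    for i, (src, tgt) in enumerate(parallel_pairs):
        if keep_pair(src, tgt, dictionary):
            examples.append((src, tgt, 1))
        for j, (_, other_tgt) in enumerate(parallel_pairs):
            if i != j and keep_pair(src, other_tgt, dictionary):
                examples.append((src, other_tgt, 0))
    return examples
```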
20
20 Improvement in SMT (Arabic to English) (Munteanu & Marcu, 2005) (figure): BLEU for the initial out-of-domain parallel corpus, the initial + extracted corpus, and the initial + human-translated data.
21
21 Outline Introduction Semi-supervised Learning for SMT Background (EM, Self-training, Co-Training) SSL for Alignments / Phrases / Sentences Active Learning for SMT Single-language pair Multiple Language Pairs
22
22 Inductive vs. Transductive. Transductive: produce labels only for the available unlabeled data; the output of the method is not a classifier. It's like writing answers for a take-home exam! Inductive: not only produce labels for the unlabeled data, but also produce a classifier. It's like preparing to write answers for an in-class exam!
23
23 Self-Training (Yarowsky 1995) (figure). Iteration 0: a model trained by supervised learning. At each iteration, choose instances labeled with high confidence and add them to the pool of current labeled training data; repeat.
24
24 EM (Dempster et al 1977). Use EM to maximize the joint log-likelihood of labeled and unlabeled data, Σ_i log p(x_i, y_i) + Σ_j log p(x_j): the log-likelihood of the labeled data plus the log-likelihood of the unlabeled data.
25
25 EM (figure). Iteration 0: a model trained by supervised learning. At each iteration, clone new weighted labeled instances from the unlabeled instances using the (probabilistic) model, with weights w_i^+ and w_i^-; repeat (Yarowsky 1995).
26
26 Co-Training (Blum & Mitchell 1998). Instances contain two sufficient sets of features, i.e. an instance is x = (x1, x2). Each set of features is called a view. The two views are independent given the label, P(x1, x2 | y) = P(x1 | y) P(x2 | y), and the two views are consistent: the target functions defined on each view agree on the labels.
27
27 Co-Training (figure). C1: a classifier trained on view 1; C2: a classifier trained on view 2. At each iteration, allow C1 to label some instances and allow C2 to label some instances, then add the self-labeled instances to the pool of training data.
28
28 Outline Introduction Semi-supervised Learning for SMT Background (EM, Self-training, Co-Training) SSL for Alignments / Phrases / Sentences Active Learning for SMT Single-language pair Multiple Language Pairs
29
29 Word Alignment & Translation Quality. (Fraser & Marcu 2006a) presented an SSL method for learning a better word alignment, using a small set of sentence pairs annotated with word alignments (~100) and a big unannotated set (~2-3 million). They showed that improvement in the word alignment led to improvement in BLEU. The same conclusion was reached later in (Ganchev et al 2008) for other translation tasks.
30
30 Word Alignment Model. Consider a log-linear model for word alignment whose feature functions are sub-models used in IBM Model 4, such as the translation probability t(f|e) and the fertility probabilities n(φ|e), the number of words generated by e, etc.
31
31 SS-Word Alignment. (Fraser & Marcu 2006a) tuned the word alignment model parameters λ on the small labeled data in a discriminative fashion: with the current λ, generate the n-best list of alignments, then adjust λ so that the best alignment stands out, i.e. the one which maximizes a modified F-measure (a MERT-style algorithm). Then use λ to find the word alignments of the big unlabeled data, and estimate the feature functions' parameters based on these best (Viterbi) alignments: one iteration of the EM algorithm. Repeat the above two steps.
32
32 Outline Introduction Semi-supervised Learning for SMT Background (EM, Self-training, Co-Training) SSL for Alignments / Phrases / Sentences Active Learning for SMT Single-language pair Multiple Language Pairs
33
33 Paraphrasing If a word is unseen then SMT will not be able to translate it Keep/omit/transliterate source word or use regular expression to translate it (dates, …) If a phrase is unseen, but its individual words are, then SMT will be less likely to produce a correct translation The idea: Use paraphrases in the source language to replace unknown words/phrases Paraphrases are alternative ways of conveying the same information (Callison Burch, 2007)
34
34 Coverage Problem in SMT Percentage of Test Item Types vs Corpus Size (Callison Burch, 2007)
35
35 Behavior on Unseen Data A system trained on 10,000 sentences (~200,000 words) may translate: Es positivo llegar a un acuerdo sobre los procedimientos, pero debemos encargarnos de que este sistema no sea susceptible de ser usado como arma política. as It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon. Since the translations of encargarnos and usado were not learned, they are either reproduced in the translation, or omitted entirely (Callison Burch, 2007)
36
36 Substituting Paraphrases then Translating It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon. encargarnos? usado? (Callison Burch, 2007)
37
37 Substituting Paraphrases then Translating It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon. encargarnos? garantizar velar procurar Asegurarnos usado? utilizado empleado uso utiliza (Callison Burch, 2007)
38
38 Substituting Paraphrases then Translating It is good reach an agreement on procedures, but we must guarantee that this system is not susceptible to be used as political weapon. encargarnos? garantizar velar procurar Asegurarnos guarantee, ensure, guaranteed, assure, provided ensure, ensuring, safeguard, making sure ensure that, try to, ensure, endeavour to ensure, secure, make certain usado? utilizado empleado uso utiliza used, use, spent, utilized used, spent, employee use, used, usage used, uses, used, being used (Callison Burch, 2007)
39
39 Learning paraphrases (I) From monolingual parallel corpora Multiple source sentences which are conveying the same information Extract paraphrases seen in the same context in the aligned source sentences Emma burst into tears and he tried to comfort her, saying things to make her smile. Emma cried, and he tried to console her, adorning his words with puns. (Callison Burch, 2007)
40
40 Learning paraphrases (I) From monolingual parallel corpora Multiple source sentences which are conveying the same information Extract paraphrases seen in the same context in the aligned source sentences burst into tears = cried comfort= console Emma burst into tears and he tried to comfort her, saying things to make her smile. Emma cried, and he tried to console her, adorning his words with puns. (Callison Burch, 2007)
41
41 Learning paraphrases (I) From monolingual parallel corpora Multiple source sentences which are conveying the same information Extract paraphrases seen in the same context in the aligned source sentences Problems with this approach Monolingual parallel corpora are relatively uncommon Limits what paraphrases we can generate, e.g. limited number of paraphrases (Callison Burch, 2007)
42
42 Learning paraphrases (I) From monolingual source corpora For each unknown phrase x, build a distributional profile DP x which shows the co-occurrences of the surrounding words with x Select the top-k phrases which have the most similar distributional profile with DP x Is position important when building the profile? Should we simply count words, or use TF/IDF, or …? Which vector similarity measure should be used? Needs smart tricks to make it scalable (Marton et al 2009)
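A minimal sketch of the distributional-profile idea, assuming one particular set of answers to the design questions the slide raises: a fixed context window, raw co-occurrence counts, and cosine similarity. None of these choices are prescribed by (Marton et al 2009); they are placeholders, and the scalability tricks are omitted.

```python
from collections import Counter
import math

def profile(phrase, corpus_sentences, window=3):
    """Co-occurrence counts of words around each occurrence of `phrase`."""
    prof, p = Counter(), phrase.split()
    for sent in corpus_sentences:
        w = sent.split()
        for i in range(len(w) - len(p) + 1):
            if w[i:i + len(p)] == p:
                prof.update(w[max(0, i - window):i] + w[i + len(p):i + len(p) + window])
    return prof

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a if k in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def top_paraphrases(unknown_phrase, candidates, corpus, k=3):
    """Rank candidate paraphrases by similarity of their profiles to DP_x."""
    dp_x = profile(unknown_phrase, corpus)
    scored = [(cosine(dp_x, profile(c, corpus)), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:k]]
```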
43
43 Learning paraphrases (II). From bilingual parallel corpora; however, we no longer have access to identical contexts. Adopt techniques from phrase-based SMT: use aligned foreign-language phrases as the pivot (Callison Burch, 2007)
44
44 Paraphrase Probability. Generate multiple paraphrases for a given phrase, and give them probabilities so they can be ranked. Define the paraphrase probability by pivoting through the translation model: p(s2 | s1) = Σ_f p(s2 | f) p(f | s1), where f ranges over the aligned foreign phrases.
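A small sketch of pivot-based paraphrase scoring under the marginalization above, p(s2 | s1) = Σ_f p(s2 | f) p(f | s1). The toy phrase-table fragments, their keys, and the numbers are invented for illustration.

```python
from collections import defaultdict

def paraphrase_prob(s1, p_f_given_e, p_e_given_f):
    """Return a dict mapping candidate paraphrases s2 to p(s2 | s1)."""
    probs = defaultdict(float)
    for f, p_f in p_f_given_e.get(s1, {}).items():        # p(f | s1)
        for s2, p_s2 in p_e_given_f.get(f, {}).items():    # p(s2 | f)
            if s2 != s1:
                probs[s2] += p_s2 * p_f
    return dict(probs)

# Hypothetical phrase-table fragments (pivot phrases and probabilities are made up).
p_f_given_e = {"usado": {"pivot_used": 0.7, "pivot_spent": 0.3}}
p_e_given_f = {"pivot_used": {"usado": 0.6, "utilizado": 0.4},
               "pivot_spent": {"usado": 0.5, "empleado": 0.5}}

print(paraphrase_prob("usado", p_f_given_e, p_e_given_f))
# -> {'utilizado': 0.28, 'empleado': 0.15}
```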
45
45 Refined Paraphrase Probability. Using multiple bilingual corpora, e.g. English-Spanish, English-German, ...: C is the set of bilingual corpora and ω_c is the weight of corpus c (e.g. we may put more weight on larger corpora), and the pivot probabilities are combined across the corpora with these weights. Taking word sense into account: in a paraphrase, replace each word with its word_sense item.
46
46 Plugging Paraphrases into the SMT Model. For each unseen phrase s1 with a paraphrase s2 that has a translation t, we expand the phrase table by adding the new entry (t, s1). Add a new feature function to the SMT log-linear model to take the paraphrase probabilities into account: f(t, s1) = p(s2 | s1) if the phrase table entry (t, s1) was generated from (t, s2), and 1 otherwise.
47
47 Results of Paraphrasing (Callison Burch, 2007)
48
48 Improvement in Coverage (Callison Burch, 2007)
49
49 Triangulation We can find additional data by focusing on: Multi-parallel corpora Collection of bitexts with some common language(s) (Cohn & Lapata, 2007)
52
52 Phrase-Level Triangulation Triangulation (Kay, 1997) Translate source phrase into an intermediate language phrase Translate this intermediate phrase into the target phrase Example: Translating a hot potato into French (Cohn & Lapata, 2007)
53
53 A Generative Model for Triangulation (Cohn & Lapata, 2007). Marginalize out the intermediate phrases: the generative model for p(s|t) is p(s|t) = Σ_i p(s|i,t) p(i|t).
54
54 A Generative Model for Triangulation (Cohn & Lapata, 2007). Marginalize out the intermediate phrases, p(s|t) ≈ Σ_i p(s|i) p(i|t), under the conditional independence assumption that i fully represents the information in t needed to translate s. This extends trivially to many intermediate languages. p(s|i) and p(i|t) are estimated using phrase frequencies.
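A minimal sketch of this marginalization, assuming the two phrase tables are available as nested dictionaries of probabilities; the toy French/Spanish/English entries and numbers are invented.

```python
from collections import defaultdict

def triangulate(p_s_given_i, p_i_given_t):
    """Build p(s | t) = sum_i p(s | i) * p(i | t) from the two component tables."""
    p_s_given_t = defaultdict(lambda: defaultdict(float))
    for t, inter_probs in p_i_given_t.items():
        for i, p_it in inter_probs.items():
            for s, p_si in p_s_given_i.get(i, {}).items():
                p_s_given_t[t][s] += p_si * p_it
    return p_s_given_t

# Toy example: French source, English target, Spanish intermediate (made-up numbers).
p_s_given_i = {"patata caliente": {"patate chaude": 0.8, "pomme de terre chaude": 0.2}}
p_i_given_t = {"hot potato": {"patata caliente": 0.9}}
table = triangulate(p_s_given_i, p_i_given_t)
print(dict(table["hot potato"]))  # {'patate chaude': 0.72, 'pomme de terre chaude': 0.18}
```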
55
55 A Generative Model for Triangulation Marginalize out the intermediate phrases: Conditional independence may be violated Translation model is estimated from noisy alignments Missing contexts, i, in p(s|i) Fewer large or rare phrases can be translated (Cohn & Lapata, 2007)
56
56 Plugging Triangulated Phrases into the Model. Use a mixture model of the phrase pair probabilities from the training set (standard) and the newly learned phrase pairs from triangulation, λ · standard + (1 − λ) · triang, or add the triangulated probabilities as a new feature in the log-linear model.
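A sketch of the interpolation step, assuming both tables are stored as {target phrase: {source phrase: probability}}; the weight λ = 0.9 is an arbitrary placeholder that would normally be tuned.

```python
def interpolate(p_std, p_triang, lam=0.9):
    """Mix two phrase tables: lam * p_std(s|t) + (1 - lam) * p_triang(s|t)."""
    mixed = {}
    for t in set(p_std) | set(p_triang):
        sources = set(p_std.get(t, {})) | set(p_triang.get(t, {}))
        mixed[t] = {s: lam * p_std.get(t, {}).get(s, 0.0)
                       + (1 - lam) * p_triang.get(t, {}).get(s, 0.0)
                    for s in sources}
    return mixed
```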
57
57 Coverage Benefit
58
58 For any Language Pair? 10k bilingual sentences, interpolated with 3 intermediate languages (Cohn & Lapata, 2007)
59
59 Larger Corpora For French to English with Spanish as the intermediate language using different sizes for bitext(s) triang: only triangulated phrases interp: mixture model of the two phrase tables (Cohn & Lapata, 2007)
60
60 What Languages are best for triangulation? 10K bilingual sentences, translating from French to English (Cohn & Lapata, 2007)
61
61 How many languages are required? 10K bilingual sentences, translating from French to English, ordered by language family (Cohn & Lapata, 2007)
62
62 Paraphrasing vs Triangulation Paraphrasing Uses bilingual projection to translate to and from a source phrase It is employed to improve the source side coverage Triangulation Generalizes the paraphrasing method to any translation pathway linking the source and target Improves both source and target coverage (Cohn & Lapata, 2007)
63
63 Bilingual Lexicon Induction (Rapp, 1999). The goal is to induce a larger bilingual dictionary; it can be used, for example, to augment the phrase table / parallel text. Suppose we have access to a small bilingual dictionary plus large monolingual texts. Build a distributional profile of a word using the monolingual source text; map the profile using seed rules (the initial bilingual dictionary) to the target-language vocabulary space; select the top-k target-language words with the most similar distributional profiles.
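A compact sketch of this induction recipe, under assumed choices (context window, raw counts, cosine similarity) and toy inputs; it only illustrates the profile-mapping step, not the engineering needed for large vocabularies.

```python
from collections import Counter
import math

def context_profile(word, sentences, window=3):
    """Bag of context words around each occurrence of `word`."""
    prof = Counter()
    for sent in sentences:
        w = sent.split()
        for i, tok in enumerate(w):
            if tok == word:
                prof.update(w[max(0, i - window):i] + w[i + 1:i + 1 + window])
    return prof

def map_profile(src_profile, seed_dict):
    """Translate the context words of a source profile with the seed dictionary."""
    mapped = Counter()
    for src_word, count in src_profile.items():
        for tgt_word in seed_dict.get(src_word, []):
            mapped[tgt_word] += count
    return mapped

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a if k in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def induce_translations(src_word, src_sents, tgt_sents, seed_dict, tgt_vocab, k=3):
    """Top-k target words whose profiles best match the mapped source profile."""
    mapped = map_profile(context_profile(src_word, src_sents), seed_dict)
    scored = [(cosine(mapped, context_profile(t, tgt_sents)), t) for t in tgt_vocab]
    return [t for _, t in sorted(scored, reverse=True)[:k]]
```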
64
64 Context-based Rapp Model ( Garera et al 2009)
65
65 Dependency Context Usually words in a fixed-size window are used to represent the context (Garera et al 2009) uses the latent structure in the dependency parse tree to represent the context ( Garera et al 2009)
66
66 Dependency Context Usually words in a fixed-size window are used to represent the context (Garera et al 2009) uses the latent structure in the dependency parse tree to represent the context Dynamic context size Accounts for reordering ( Garera et al 2009)
67
67 Bilingual Lexicon Induction (more references). (Koehn & Knight 2002) takes into account orthographic features in addition to the context. (Haghighi et al 2008) devise a generative model which generates the (feature vectors of) related words in the source and target languages; each word is represented by a feature vector containing both contextual and orthographic features. (Mann & Yarowsky 2001) and (Schafer & Yarowsky 2002) use a bridge language to induce a bilingual lexicon.
68
68 Bilingual Phrase Induction (non-comparable corpora) Non-comparable corpora contain “... disparate, very nonparallel bilingual documents that could either be on the same topic (on-topic) or not” (Fung & Cheung 2004) The goal is to extract parallel sub-sentential fragments, as opposed to extracting parallel sentences Assume we have a lexical dictionary P(t | s): the probability the source word s translates into target word t Using some heuristics, specify the candidate sentence pairs ( Munteanu & Marcu 2006)
69 The Signal Processing Approach (animated figure, slides 69-76): a source sentence is compared with a candidate target sentence; the lexicon probabilities P(t|s) give each word a "signal" indicating whether it has a translation on the other side, and each word's signal is smoothed by taking the average of the "signals" from its neighbors.
77 Bilingual Phrase Induction (non-comparable corpora) (Munteanu & Marcu 2006). Retain "positive fragments", i.e. those fragments for which the corresponding filtered signal values are positive. Repeat the procedure in the other direction (target to source) to obtain the fragments for the source, and consider the resulting two text chunks as parallel. The signal filtering function is simple; more advanced filters might work better.
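A rough sketch of the signal-filtering idea as described on these slides: each source word gets a positive signal if it has a likely translation in the candidate target sentence under the lexicon P(t|s), the signal is smoothed by averaging over its neighbours, and maximal positive spans are retained as fragments. The probability threshold and window size are assumptions.

```python
def raw_signal(src_words, tgt_words, lexicon, min_prob=0.1):
    """+1 if a source word has a likely translation in the target sentence, else -1."""
    sig = []
    for s in src_words:
        translations = lexicon.get(s, {})
        has_match = any(translations.get(t, 0.0) >= min_prob for t in tgt_words)
        sig.append(1.0 if has_match else -1.0)
    return sig

def smooth(signal, window=2):
    """Average each value with its neighbours (a simple filtering function)."""
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - window), min(len(signal), i + window + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def positive_fragments(src_words, filtered):
    """Maximal runs of words whose filtered signal is positive."""
    frags, current = [], []
    for w, v in zip(src_words, filtered):
        if v > 0:
            current.append(w)
        elif current:
            frags.append(" ".join(current)); current = []
    if current:
        frags.append(" ".join(current))
    return frags
```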
78
78 The Effect of Parallel Fragments for SMT (Munteanu & Marcu 2006) (figure): compares the sentence-extraction method explained at the beginning of the talk with the fragment-extraction method just explained.
79
79 Outline Introduction Semi-supervised Learning for SMT Background (EM, Self-training, Co-Training) SSL for Alignments / Phrases / Sentences Active Learning for SMT Single-language pair Multiple Language Pairs
81
81 Self-Training for SMT (Ueffing et al 2007a) (figure): train the log-linear model M_F→E on the bilingual text; decode the monolingual text; select high-quality sentence pairs from the translated text; re-train the SMT model.
82
82 Scoring & Selecting Sentence Pairs. Scoring: use the normalized decoder score, or a confidence estimation method (Ueffing & Ney 2007). Selecting: importance sampling, keeping those whose score is above a threshold, or keeping all sentence pairs.
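A sketch of these selection options, assuming each machine-translated pair comes with a length-normalized decoder score; the threshold, sample size, and use of exp(score) as an importance weight are illustrative choices, not the exact ones from the papers.

```python
import math
import random

def normalized_score(decoder_score, target_length):
    """Length-normalized decoder score for a translated sentence."""
    return decoder_score / max(1, target_length)

def select(pairs_with_scores, method="threshold", threshold=-2.0, sample_size=100):
    """pairs_with_scores: list of ((src, hyp), score) produced by the decoder."""
    if method == "keep_all":
        return [pair for pair, _ in pairs_with_scores]
    if method == "threshold":
        return [pair for pair, score in pairs_with_scores if score >= threshold]
    if method == "importance":
        # Sample pairs with probability proportional to exp(score).
        weights = [math.exp(score) for _, score in pairs_with_scores]
        return random.choices([pair for pair, _ in pairs_with_scores],
                              weights=weights, k=sample_size)
    raise ValueError(f"unknown selection method: {method}")
```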
83
83 Confidence Estimation. A log-linear combination of: word posterior probabilities (the chance of seeing a word in a particular position in the translations), phrase posterior probabilities, and the language model score. The weights are tuned to minimize the classification error rate; translations having a WER above a threshold are considered incorrect.
85
85 Re-Training the SMT Model (I) (Ueffing et al 2007a). Simply add the newly selected sentence pairs to the initial bitext and fully re-train the phrase table, or use a mixture model of the phrase pair probabilities from the training set combined with those from the newly selected sentence pairs: λ · (initial phrase table) + (1 − λ) · (new phrase table).
86
86 Re-training the SMT Model (II). Use the new sentence pairs to train an additional phrase table and use it as a new feature function in the SMT log-linear model: Phrase Table 1 trained on sentences for which we have the true translations, Phrase Table 2 trained on sentences with their generated translations.
87
87 Results (Chinese to English, Transductive), using the additional phrase table.
Selection | Scoring | BLEU% | WER% | PER%
Baseline | — | 27.9 ±0.7 | 67.2 ±0.6 | 44.0 ±0.5
Keep all | — | 28.1 | 66.5 | 44.2
Importance sampling | Norm. score | 28.7 | 66.1 | 43.6
Importance sampling | Confidence | 28.4 | 65.8 | 43.2
Threshold | Norm. score | 28.3 | 66.1 | 43.5
Threshold | Confidence | 29.3 | 65.6 | 43.2
BLEU: higher is better. WER (word error rate) and PER (position-independent WER): lower is better. Bold: best result; italic: significantly better.
88
88 Results (Chinese to English, Inductive), using importance sampling and the additional phrase table. Eval-04 (4 refs.):
System | BLEU% | WER% | PER%
Baseline | 31.8 ±0.7 | 66.8 ±0.7 | 41.5 ±0.5
Add Chinese data, Iter 1 | 32.8 | 65.7 | 40.9
Add Chinese data, Iter 4 | 32.6 | 65.8 | 40.9
Add Chinese data, Iter 10 | 32.5 | 66.1 | 41.2
BLEU: higher is better. WER and PER: lower is better. Bold: best result; italic: significantly better.
89
89 Why does it work (I) (Ueffing et al 2008). Self-training reinforces the parts of the phrase translation model which are relevant for the test corpus, and hence obtains a more focused probability distribution. Example: initially the source phrase "A B" has two phrase-table translations with probability 0.5 each; when decoding the monotext, "c d" is chosen because the language model picks it according to signals from the context; after retraining, the probabilities shift (e.g. to 0.2 and 0.8), which helps resolve the ambiguity of translating "A B" in other parts of the text.
90
90 Why does it work (II) (Ueffing et al 2008). Self-training composes new phrases. For example, if the original parallel corpus contains the phrases "A B" and "C D E", and the additional source data contains "A B C D E" (translated as "a b c d e"), possible new phrases include "A B C", "B C D E", etc.
91
91 Analysis New phrases are used rarely, hence most of the benefit comes from focused probability distributions
92
92 Co-training for SMT Source sentence is a view onto the translation Existing translations of a source sentence can be used as additional views on the translation (Callison Burch, 2003)
93
93 Co-Training for SMT (Callison Burch, 2003)
94
94 Co-Training for SMT (Callison Burch, 2003) Having initial bitexts, train SMT models from source languages to the target language
95
95 Co-Training for SMT (Callison Burch, 2003) Translate a multilingual parallel sentence in the source languages using the trained SMT models
96
96 Co-Training for SMT (Callison Burch, 2003) Choose the best generated translation
97
97 Co-Training for SMT (Callison Burch, 2003) Add the new sentence pairs to the bitexts and re-train the SMT models
98
98 Results of Co-Training 20k initial labeled sentences, 60k unlabeled parallel sentences in 5 languages, select 10k pseudo-labeled sentences in each iteration (Callison Burch, 2003)
99
99 Coaching Suppose we have no German-English bitext There is a French-English bitext There is a French-German bitext Train a French to English translation model Translate the French to English and align the generated translations with German
100
100 Results of Coaching Coaching of German to English by a French to English translation model (Callison Burch, 2003)
101
101 Results of Coaching Coaching of German to English by multiple translation models (Callison Burch, 2003)
102
102 Outline Introduction Semi-supervised Learning for SMT Background (EM, Self-training, Co-Training) SSL for Alignments / Phrases / Sentences Active Learning for SMT Single-language pair Multiple Language Pairs
103
103 Shortage of Bilingual Data: A Solution Suppose we are given a large monolingual text in the source language F Pay a human expert and ask him/her to translate these sentences into the target language E This way, we will have a bigger bilingual text But our budget is limited ! We cannot afford to translate all monolingual sentences
104
104 A Better Solution Choose a subset of monolingual sentences for which: if we had the translation, the SMT performance would increase the most Only ask the human expert for the translation of these highly informative sentences This is the goal of Active Learning
105
105 Active Learning for SMT (Haffari et al 2009) (figure): train the log-linear model M_F→E on the bilingual text; decode the monolingual text; select informative sentences and have a human translate them; add the new sentence pairs and re-train the SMT models.
107
107 Sentence Selection Strategies Baselines: Randomly choose sentences from the pool of monolingual sentences Choose longer sentences from the monolingual corpus Other methods Decoder’s confidence for the translations (Kato & Barnard, 2007) Reverse model Utility of the translation units (Haffari et al 2009)
108
108 Decoder's Confidence (Haffari et al 2009). Sentences for which the model is not confident about their translations are selected first; hopefully the highly confident translations are good ones. Normalize the confidence score by the sentence length.
109
109 Reverse Model (Haffari et al 2009). Translate a sentence with M_E→F and translate the result back with the reverse model M_F→E; comparing the original sentence with the final sentence tells us something about the value of the sentence. Example: "I will let you know about the issue later" → "Je vais vous faire plus tard sur la question" → "I will later on the question".
111
111 Utility of the Translation Units. Phrases are the basic units of translation in phrase-based SMT. (Figure: the phrases of the example sentence "I will let you know about the issue later" with their counts in the monolingual and bilingual texts.) The more frequent a phrase is in the monolingual text, the more important it is; the more frequent a phrase is in the bilingual text, the less important it is.
112
112 Generative Models for Phrases. (Figure: the phrase counts in the monolingual and bilingual texts are normalized into probability distributions, P_m and P_b.)
114
114 Sentence Selection: Probability Ratio Score (Haffari et al 2009). For a monolingual sentence S, consider the bag of its phrases; the score of S depends on the probability ratios P_m(x) / P_b(x) of its phrases x. The phrase probability ratio captures our intuition about the utility of the translation units.
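A small sketch of the probability-ratio idea, assuming the two phrase distributions are estimated by simple relative frequency and the per-phrase ratios are combined with a geometric mean; the combination rule, smoothing floor, and toy counts are assumptions, not the exact scoring functions of the paper.

```python
from collections import Counter
import math

def phrase_distribution(phrase_counts):
    """Relative-frequency estimate of a phrase distribution from counts."""
    total = sum(phrase_counts.values())
    return {p: c / total for p, c in phrase_counts.items()}

def ratio_score(sentence_phrases, p_mono, p_bi, floor=1e-6):
    """Geometric mean of P_m(x) / P_b(x) over the sentence's bag of phrases."""
    log_sum = 0.0
    for x in sentence_phrases:
        log_sum += math.log(p_mono.get(x, floor) / p_bi.get(x, floor))
    return math.exp(log_sum / max(1, len(sentence_phrases)))

p_m = phrase_distribution(Counter({"let you know": 6, "the issue": 8, "later": 3}))
p_b = phrase_distribution(Counter({"let you know": 1, "the issue": 2, "later": 7}))
print(ratio_score(["let you know", "the issue", "later"], p_m, p_b))
```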
115
115 Extensions of the Score. Instead of using phrases, we may use n-grams; alternative scores based on the same probability ratios can also be used (Haffari et al 2009)
116
116 Sentence Segmentation How to prepare the bag of phrases for a sentence S? For the bilingual text, we have the segmentation from the training phase of the SMT model For the monolingual text, we run the SMT model to produce the top-n translations and segmentations What about OOV fragments in the sentences of the monolingual text? (Haffari & Sarkar 2009)
117
117 OOV Fragments: An Example. In "i will go to school on friday", the OOV fragment is "go to school on friday"; different segmentations give different OOV phrases, e.g. "go to | school | on friday", "go to school | on friday", "go | to school | on friday", which can be long (Haffari & Sarkar 2009b)
118
118 Counting OOV Phrases. Fix an OOV fragment x; put a uniform distribution over all possible segmentations of x; use the expected count of OOV phrases under this uniform distribution. See (Haffari & Sarkar 2009b) for how to compute these expectations efficiently.
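A brute-force sketch of the expected-count computation: it enumerates all 2^(n-1) segmentations of a fragment, which is only feasible for short fragments; (Haffari & Sarkar 2009b) describe how to compute the same expectations efficiently, which this sketch does not attempt.

```python
from collections import Counter
from itertools import product

def expected_oov_phrase_counts(fragment_words):
    """Expected count of each phrase under a uniform distribution over segmentations."""
    n = len(fragment_words)
    counts = Counter()
    segmentations = list(product([0, 1], repeat=n - 1))  # boundary bits between words
    for boundaries in segmentations:
        start = 0
        for i, b in enumerate(list(boundaries) + [1]):  # always close after last word
            if b:
                counts[" ".join(fragment_words[start:i + 1])] += 1
                start = i + 1
    total = len(segmentations)
    return {phrase: c / total for phrase, c in counts.items()}

print(expected_oov_phrase_counts("go to school on friday".split()))
```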
120
120 Re-training the SMT Models. We use two phrase tables in each SMT model M_Fi→E: Phrase Table 1 trained on sentences for which we have the true translations, and Phrase Table 2 trained on sentences with their generated translations (self-training).
121
121 Experimental Setup. Dataset (French-English): bitext 5K sentences, monotext 20K sentences, test 2K sentences. We select 200 sentences from the monolingual sentence set in each of 25 iterations. We use Portage from NRC as the underlying SMT system (Ueffing et al, 2007b)
122
122 The Simulated AL Setting (figure): learning curves comparing utility-of-phrases, random, and decoder's-confidence sentence selection (higher is better).
123
123 The Simulated AL Setting (figure, continued): further learning curves (higher is better).
124
124 Outline Introduction Semi-supervised Learning for SMT Background (EM, Self-training, Co-Training) SSL for Alignments / Phrases / Sentences Active Learning for SMT Single-language pair Multiple Language Pairs
125
125 Multiple Language-Pair AL-SMT. Add a new language E (English) to a multilingual parallel corpus with F1 (German), F2 (French), F3 (Spanish), ..., in order to build high-quality SMT systems from the existing languages to the new language, using AL to improve translation quality.
126
126 AL-SMT: Multilingual Setting (figure): train the log-linear models M_Fi→E on the multilingual bitext (F1, F2, ... paired with E); decode the monolingual text into E1, E2, ...; select informative sentences, have a human translate them, and re-train the SMT models.
127
127 Selecting Multilingual Sents. (I) (Reichart et al, 2008). Alternate method: choose informative sentences based on a specific F_i in each AL iteration (the figure shows per-language rank lists for F1, F2, F3).
128
128 Selecting Multilingual Sents. (II) (Reichart et al, 2008). Combined method: sort sentences by the sum of their ranks in all lists; e.g. ranks (2, 3, 2) give a combined rank of 7 = 2+3+2, ranks (35, 19, 17) give 71 = 35+19+17, and ranks (1, 2, 3) give 6 = 1+2+3.
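A tiny sketch of the combined-rank computation, assuming each candidate sentence already has a rank under every language F_i (lower rank = more informative).

```python
def combined_rank(per_language_ranks):
    """per_language_ranks: {sentence_id: [rank_in_F1, rank_in_F2, ...]}."""
    totals = {sid: sum(ranks) for sid, ranks in per_language_ranks.items()}
    return sorted(totals, key=totals.get)  # most informative (lowest total) first

ranks = {"sent_a": [2, 3, 2], "sent_b": [35, 19, 17], "sent_c": [1, 2, 3]}
print(combined_rank(ranks))  # -> ['sent_c', 'sent_a', 'sent_b']
```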
129
129 Selecting Multilingual Sents. (III). Disagreement method: score sentences by the pairwise BLEU scores of the generated translations E1, E2, E3, ..., or by the sum of BLEU scores against a consensus translation.
131
131 Re-training the SMT Models (I). We use two phrase tables in each SMT model M_Fi→E: Phrase Table 1 trained on sentences for which we have the true translations, and Phrase Table 2 trained on sentences with their generated translations E_i (self-training).
132
132 Re-training the SMT Models (II). For Phrase Table 2, we can instead use the consensus translation of E1, E2, E3, ... (co-training).
133
133 Experimental Setup. We want to add English to a multilingual parallel corpus containing Germanic languages from EuroParl (German, Dutch, Danish, Swedish). Sizes of dataset and selected sentences: initially there are 5k multilingual sentences parallel to English sentences, and 20k parallel sentences in the multilingual corpora; 10 AL iterations, selecting 500 sentences in each iteration. We use Portage from NRC as the underlying SMT system (Ueffing et al, 2007b)
134
134 Self-training vs Co-training (Germanic languages to English): the Co-Training mode outperforms the Self-Training mode (BLEU 20.20 vs 19.75 in the figure).
135
135 Germanic Languages to English.
Method | Self-Training (WER / PER / BLEU) | Co-Training (WER / PER / BLEU)
Combined Rank | 41.0 / 30.2 / 19.9 | 40.1 / 30.1 / 20.2
Alternate | 40.2 / 30.0 / 20.0 | 40.0 / 29.6 / 20.3
Random | 41.6 / 31.0 / 19.4 | 40.5 / 30.7 / 20.2
BLEU: higher is better. WER (word error rate) and PER (position-independent WER): lower is better. Bold: best result; italic: significantly better.
136
136 Conclusion. The map revisited: starting from a small source-target bitext and an MT system, a large comparable source-target text is exploited by parallel sentence extraction and bilingual dictionary induction; a large source monotext by semi-supervised / active learning; a source-another language bitext by paraphrasing; and source-another, another-target, source-target bitexts by triangulation / co-training.
137
137 Finish
138
138 References (Blum & Mitchell 1998) A. Blum and T. Mitchell, “Combining Labeled and Unlabeled Data with Co-Training”, COLT. (Callison Burch 2007) C. Callison Burch, “Paraphrasing and Translation”, PhD thesis, University of Edinburgh. (Callison Burch 2003) C. Callison Burch, “Co-Training for Statistical Machine Translation”, Master’s thesis, University of Edinburgh. (Cohn & Lapata 2007) T. Cohn and M. Lapata, “Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora”, ACL. (Dempster et al 1977) A. P. Dempster, N. M. Laird, D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society. Series B. (Fraser & Marcu 2006a) A. Fraser and D. Marcu, “Semi-Supervised Training for Statistical Word Alignment”, ACL.
139
139 References (Fraser & Marcu 2006b) A. Fraser and D. Marcu, “Measuring Word Alignment Quality for Statistical Machine Translation”, Technical Report ISI-TR-616, ISI/University of Southern California. (Fung & Cheung 2004) P. Fung and P. Cheung, “Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM”, EMNLP. (Garera et al 2009) N. Garera, C. Callison-Burch and D. Yarowsky, “Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences”, CoNLL. (Haffari et al 2009) G. Haffari, M. Roy, A. Sarkar, “Active Learning for Statistical Phrase-based Machine Translation”, NAACL. (Haffari & Sarkar 2009) G. Haffari and A. Sarkar, “Active Learning for Multilingual Statistical Machine Translation”, ACL-IJCNLP. (Haghighi et al 2008) A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein, “Learning bilingual lexicons from monolingual corpora”, ACL.
140
140 References (Ganchev et al 2008) K. Ganchev, J. Graca and B. Taskar, “Better Alignments = Better Translations?”, ACL. (Koehn & Knight 2002) P. Koehn and K. Knight, “Learning a translation lexicon from monolingual corpora”, ACL Workshop on Unsupervised Lexical Acquisition. (Mann & Yarowsky 2001) G. Mann and D. Yarowsky, “Multi-path translation lexicon induction via bridge languages”, NAACL. (Munteanu & Marcu 2006) D. Munteanu and D. Marcu, “Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora”, COLING-ACL. (Marton et al 2009) Y. Marton, C. Callison-Burch and P. Resnik, “Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases”, EMNLP. (Munteanu & Marcu, 2005) D. Munteanu and D. Marcu, “Improving Machine Translation Performance by Exploiting Non-parallel Corpora”, Computational Linguistics, 31(4).
141
141 References (Rapp 1999) R. Rapp, “Automatic identification of word translations from unrelated English and German corpora”, ACL. (Reichart et al 2008) R. Reichart, K. Tomanek, U. Hahn and A. Rappoport, “Multi-Task Active Learning for Linguistic Annotations”, ACL. (Schafer & Yarowsky 2002) C. Schafer and D. Yarowsky, “Inducing translation lexicons via diverse similarity measures and bridge languages”, CoNLL. (Ueffing & Ney 2007) N. Ueffing and H. Ney, “Word-Level Confidence Estimation for Machine Translation”, Computational Linguistics. (Ueffing et al 2007a) N. Ueffing, G.R. Haffari, A. Sarkar, “Transductive Learning for Statistical Machine Translation”, ACL. (Ueffing et al 2007b) N. Ueffing, M. Simard, S. Larkin, and J. H. Johnson, “NRC’s Portage system for WMT 2007”, ACL Workshop on SMT.
142
142 References (Ueffing et al 2008) N. Ueffing, G.R. Haffari, A. Sarkar, “Semi-supervised model adaptation for statistical machine translation”, Machine Translation Journal. (Yarowsky 1995) D. Yarowsky, “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods”, ACL.