Improving SMT Performance by Training Data Selection and Optimization
Yajuan Lü, Jin Huang and Qun Liu, EMNLP 2007
Presented by Mei Yang, May 12, 2008
Goal and Approach
- Translation model adaptation/optimization: relevant data is better than more data
  - Select test-related training data with information retrieval (IR)
  - Optimize the distribution of the training data
- Two settings: offline model optimization and online model optimization
Select Relevant Data
- Query: a test sentence
- Document: a source-language training sentence
- Information retrieval uses TF-IDF term weighting
  - Both the query and the document are represented as term vectors
  - The similarity score is the cosine of the two vectors (see the sketch after this list)
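A minimal sketch of this retrieval step, assuming whitespace-tokenized sentences and standard tf * log(N/df) weighting (the paper does not spell out its exact TF-IDF variant); the function name retrieve_top_n is illustrative:

```python
import math
from collections import Counter

def retrieve_top_n(query, train_src, n=100):
    """Rank training source sentences by TF-IDF cosine similarity to the
    query (test) sentence and return the indices of the top-n matches.
    Sentences are lists of tokens; idf comes from the training collection."""
    N = len(train_src)
    df = Counter()
    for sent in train_src:
        df.update(set(sent))

    def vec(tokens):
        tf = Counter(tokens)
        # Terms unseen in the training collection get zero weight.
        return {t: tf[t] * math.log(N / df[t]) for t in tf if df[t]}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(query)
    scores = [(cosine(q, vec(sent)), i) for i, sent in enumerate(train_src)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]
```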
Offline Model Optimization
- For each sentence in the devset and the testset, retrieve the top-N similar sentences from the original training set T
- All retrieved sentence pairs form the adaptive training set D, with or without duplicated instances
- Train the adaptive model on D
- Train the optimized model by adding D into T
  - Addresses the data-sparseness issue of the adaptive model
  - In practice, adding D into T can be done by adjusting the count of each training sentence pair in T (Fig 1); see the sketch after this list
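A minimal sketch of the count-adjustment trick, assuming T is a list of (src, tgt) pairs and retrieved_ids holds the (possibly repeated) indices retrieved for all dev/test sentences; the name optimized_counts is illustrative:

```python
from collections import Counter

def optimized_counts(train_pairs, retrieved_ids):
    """Combine the original training set T with the adaptive set D by
    bumping per-pair counts instead of physically duplicating data.
    Each pair in T starts with count 1; every retrieval adds 1,
    which mirrors 'adding D into T'."""
    counts = Counter({i: 1 for i in range(len(train_pairs))})
    for i in retrieved_ids:
        counts[i] += 1
    # Downstream alignment/phrase-extraction would then weight each
    # pair by counts[i] rather than enumerating duplicate copies.
    return counts
```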
Online Model Optimization
- Train a general model on the entire data and several candidate sub-models on prepared subsets, which can be obtained by:
  - dividing the data according to its origin (this paper)
  - a clustering method
  - IR with a small amount of domain-specific data
- Translate with a log-linear model whose weights are optimized online:
  - Given a test sentence, retrieve the top-N similar sentences from the original training set T
  - Determine the model weights from the proportions of retrieved sentences used to train each sub-model (see the sketch after this list)
  - Four different weighting schemes are proposed (page 346)
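A minimal sketch of the simplest proportion-based weighting (the paper proposes four variants on page 346; this shows only the basic idea). Here subset_of maps a training-sentence index to the id of the subset it came from; both names are illustrative:

```python
def online_weights(retrieved_ids, subset_of, num_submodels):
    """Set sub-model weights for one test sentence: each sub-model's
    weight is the fraction of the top-N retrieved sentences that came
    from that sub-model's training subset."""
    counts = [0] * num_submodels
    for i in retrieved_ids:
        counts[subset_of[i]] += 1
    total = sum(counts) or 1  # guard against an empty retrieval
    return [c / total for c in counts]
```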
Experiment: Data
- Chinese-to-English translation task
- Training set: 200K sentence pairs randomly selected from each of three corpora (FBIS, HK_Hansards, and HK_News), 600K sentence pairs in total
- Devset and testset
  - Offline: NIST02 as devset and NIST05 as testset (both with 4 references)
  - Online: 500 randomly selected sentence pairs from each of the three corpora plus 500 sentence pairs from NIST05, 2K sentence pairs in total as testset (1 reference)
Experiment #1: Offline Optimization
- N = 100, 200, 500, 1000, 2000
- Adaptive models (Table 3)
  - Comparable BLEU scores with much smaller model sizes
  - Duplicated data achieved better results than distinct data
  - When N is large, performance starts to drop
- Optimized models (Table 4)
  - Significant improvement over the baseline and adaptive models
Experiment #2: Online Optimization
- Preliminary results with N = 500
- No significant difference observed among the different weighting schemes
- Small improvement over the baseline model
Future Work
- More sophisticated similarity measures for information retrieval
- An optimization algorithm for online weighting of sub-models
- Introducing language model optimization into the system