Improving SMT Performance by Training Data Selection and Optimization
Yajuan Lü, Jin Huang and Qun Liu, EMNLP 2007
Presented by Mei Yang, May 12, 2008
Goal and Approach
- Translation model adaptation/optimization: relevant data is better than more data
  - select test-related training data with IR
  - optimize the distribution of the training data
- Offline model optimization
- Online model optimization
Select Relevant Data
- Query: a test sentence
- Document: a source-language training sentence
- Information retrieval uses TF-IDF terms
  - both the query and the document are represented as vectors
  - the similarity score is the cosine of the two vectors
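The retrieval step can be sketched as follows. This is a minimal, stdlib-only illustration, not the paper's implementation: the function names and the `log(N / df)` IDF formula are assumptions, since the slides only state that a TF-IDF term and cosine similarity are used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for tokenized documents.
    IDF here is log(N / df) -- a common choice, assumed for this sketch."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)                   # raw term frequency
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_n(query_tokens, train_docs, n):
    """Treat the test sentence as the query and each source-language
    training sentence as a document; return indices of the top-N hits."""
    vecs, idf = tfidf_vectors(train_docs)
    tf = Counter(query_tokens)
    qvec = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    ranked = sorted(range(len(train_docs)),
                    key=lambda i: cosine(qvec, vecs[i]), reverse=True)
    return ranked[:n]
```

For example, a query sharing a rare term with one training sentence ranks that sentence first, since shared rare terms dominate the TF-IDF dot product.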
Offline Model Optimization
- For each sentence in the devset and the testset, retrieve the top-N most similar sentences from the original training set T
- All retrieved sentence pairs form the adaptive training set D, with or without duplicated instances
- Train the adaptive model on D
- Train the optimized model by adding D to T
  - addresses the data-sparseness issue of the adaptive model
  - in practice, adding D to T amounts to adjusting the count of each training sentence pair in T (Fig. 1)
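The count-adjustment trick in the last bullet can be sketched like this. The function name and data layout are hypothetical; the idea, per the slide, is that concatenating D with T is equivalent to incrementing per-pair counts in T, once per retrieval hit (duplicates kept) or once overall (distinct).

```python
from collections import Counter

def optimized_counts(num_train_pairs, retrieved_per_sentence, keep_duplicates=True):
    """Return a count for each training pair in T after 'adding D to T'.

    num_train_pairs: size of the original training set T
    retrieved_per_sentence: for each dev/test sentence, the list of
        indices of its top-N retrieved training pairs
    keep_duplicates: True -> every retrieval hit adds one count (D with
        duplicates); False -> each pair is counted at most once extra (distinct D)
    """
    counts = Counter({i: 1 for i in range(num_train_pairs)})  # each pair in T once
    already_added = set()
    for hits in retrieved_per_sentence:
        for i in hits:
            if keep_duplicates or i not in already_added:
                counts[i] += 1
                already_added.add(i)
    return counts
```

A pair retrieved by many test sentences thus gets a proportionally larger count, which shifts the training distribution toward test-relevant data without duplicating the corpus on disk.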
Online Model Optimization
- Train a general model on the entire data and several candidate sub-models on prepared subsets, which can be obtained by
  - dividing the data according to its origin (in this paper)
  - a clustering method
  - IR with a small amount of domain-specific data
- Translate with a log-linear model whose weights are optimized online
  - given a test sentence, retrieve the top-N most similar sentences from the original training set T
  - set the model weights according to the proportions of retrieved sentences used to train each sub-model
  - four different weighting schemes are proposed (p. 346)
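The proportion-based weighting can be sketched as below. This shows only the simplest plausible scheme (raw retrieval proportions); the paper proposes four schemes on p. 346, and this function is an assumption, not one of them verbatim.

```python
from collections import Counter

def submodel_weights(retrieved_origins, submodels):
    """Weight each sub-model by the fraction of the top-N retrieved
    training sentences that came from its training subset.

    retrieved_origins: origin label (e.g. corpus name) of each of the
        top-N sentences retrieved for the current test sentence
    submodels: the labels of the candidate sub-models
    """
    n = len(retrieved_origins)
    hits = Counter(retrieved_origins)
    return {m: hits.get(m, 0) / n for m in submodels}
```

At translation time these per-sentence weights would scale the sub-model feature scores inside the log-linear combination, so a test sentence whose neighbors mostly come from one corpus leans on that corpus's sub-model.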
Experiment: Data
- Chinese-to-English translation task
- 200K sentence pairs randomly selected from each of three corpora (FBIS, HK_Hansards, and HK_News), giving a 600K-pair training set
- Devset and testset
  - Offline: NIST02 as devset and NIST05 as testset (both with 4 references)
  - Online: 500 randomly selected sentence pairs from each of the three corpora plus 500 sentence pairs from NIST05, giving a 2K-pair testset (1 reference)
Experiment #1: Offline Optimization
- N = 100, 200, 500, 1000, 2000
- Adaptive models (Table 3)
  - comparable BLEU scores with much smaller model sizes
  - duplicated data achieved better results than distinct data
  - when N is large, performance starts to drop
- Optimized models (Table 4)
  - significant improvement over both the baseline and the adaptive models
Experiment #2: Online Optimization
- Preliminary results with N = 500
- No significant difference observed among the weighting schemes
- Small improvement over the baseline model
Future Work
- More sophisticated similarity measures for information retrieval
- An optimization algorithm for online weighting of sub-models
- Introducing language-model optimization into the system