An Empirical Study on Language Model Adaptation
Jianfeng Gao, Hisami Suzuki (Microsoft Research)
Wei Yuan (Shanghai Jiao Tong University)
Presented by Patty Liu
2 Outline
- Introduction
- The Language Model and the Task of IME
- Related Work
- LM Adaptation Methods
- Experimental Results
- Discussion
- Conclusion and Future Work
3 Introduction
Language model (LM) adaptation attempts to adjust the parameters of an LM so that it performs well on a particular domain of data.
We focus on the cross-domain LM adaptation paradigm: adapting an LM trained on one domain (the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.
The LM adaptation methods investigated here fall into two categories:
(1) Maximum a posteriori (MAP): linear interpolation
(2) Discriminative training: boosting, perceptron, minimum sample risk (MSR)
4 The Language Model and the Task of IME
IME (Input Method Editor): the user first inputs a phonetic string $A$, which the software then converts into the appropriate word string $W$.
Unlike speech recognition, there is no acoustic ambiguity in IME, since the phonetic string is provided directly by the user.
Moreover, we can assume a unique mapping from $W$ to $A$ in IME, that is, $P(A \mid W) = 1$. Conversion therefore reduces to choosing, among the candidate word strings $GEN(A)$ for $A$, the one with the highest LM probability: $W^* = \arg\max_{W \in GEN(A)} P(W)$.
From the perspective of LM adaptation, IME faces the same problem that speech recognition does: the quality of the model depends heavily on the similarity between the training data and the test data.
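5 Related Work (1/3)
I. Measuring Domain Similarity
$$H(L, M) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w_1 \ldots w_n} p(w_1 \ldots w_n) \log M(w_1 \ldots w_n)$$
- $L$: a language
- $p$: the true underlying probability distribution of $L$
- $M$: another distribution (e.g., an SLM) which attempts to model $L$
- $H(L, M)$: the cross entropy of $L$ with respect to $M$
- $w_1 \ldots w_n$: a word string in $L$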
6 Related Work (2/3)
In reality, the true underlying distribution $p$ of the language $L$ is never known and the corpus size is never infinite.
We therefore assume that $L$ is an ergodic and stationary process, and approximate the cross entropy of $L$ with respect to a model $M$ by calculating it for a sufficiently large $n$ instead of taking the limit:
$$H(L, M) \approx -\frac{1}{n} \log M(w_1 \ldots w_n)$$
The cross entropy takes into account both the similarity between the two distributions (given by the KL divergence) and the entropy of the corpus in question, since $H(p, M) = H(p) + D_{KL}(p \,\|\, M)$.
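The approximation above can be sketched in a few lines. This is a toy illustration only: the add-alpha unigram model, the two tiny corpora, and the names `train_unigram` and `cross_entropy` are assumptions made for the example, and the "self entropy" here is measured on the same text the model was trained on, which is a simplification.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (toy stand-in for an SLM)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def cross_entropy(model, tokens):
    """Per-word cross entropy in bits: -(1/n) * sum of log2 M(w) over the text."""
    return -sum(math.log2(model[w]) for w in tokens) / len(tokens)

# Hypothetical toy corpora standing in for a background and an adaptation domain.
background = "the market fell the market rose".split()
adaptation = "the hero fell the hero rose the hero wept".split()
vocab = set(background) | set(adaptation)

m_background = train_unigram(background, vocab)
m_adaptation = train_unigram(adaptation, vocab)

h_cross = cross_entropy(m_background, adaptation)  # background model scored on adaptation text
h_self = cross_entropy(m_adaptation, adaptation)   # adaptation model scored on its own text
print(f"cross entropy: {h_cross:.3f} bits/word, self entropy: {h_self:.3f} bits/word")
```

The gap between the two numbers reflects how poorly the background model fits the adaptation text; the measure is not symmetric, so swapping the roles of the corpora generally gives a different value.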
7 Related Work (3/3)
II. LM Adaptation Methods
- MAP: adjusts the parameters of the background model so as to maximize the likelihood of the adaptation data
- Discriminative training methods: use the adaptation data to directly minimize the errors that the background model makes on it
These techniques have been applied successfully to language modeling in non-adaptation as well as adaptation scenarios for speech recognition.
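8 LM Adaptation Methods ─LI
I. The Linear Interpolation Method
$$P(w_i \mid h) = \lambda P_B(w_i \mid h) + (1 - \lambda) P_A(w_i \mid h)$$
- $P_B$: the probability of the background model
- $P_A$: the probability of the adaptation model
- $h$: the history, which corresponds to the two preceding words
- $\lambda$: the interpolation weight. For simplicity, we chose a single $\lambda$ for all histories and tuned it on held-out data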
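A minimal sketch of tuning a single interpolation weight on held-out data. It assumes the two component probabilities for each held-out word are already available as pairs; the numbers, the grid, and the function names are made up for illustration.

```python
import math

def interpolate(p_background, p_adaptation, lam):
    """Linear interpolation: lam * P_B(w|h) + (1 - lam) * P_A(w|h)."""
    return lam * p_background + (1.0 - lam) * p_adaptation

def heldout_log_likelihood(pairs, lam):
    """Log-likelihood of held-out words, given (P_B, P_A) for each word in context."""
    return sum(math.log(interpolate(p_b, p_a, lam)) for p_b, p_a in pairs)

def tune_lambda(pairs, grid):
    """Pick the single weight on the grid that maximizes held-out log-likelihood."""
    return max(grid, key=lambda lam: heldout_log_likelihood(pairs, lam))

# Hypothetical (P_B(w|h), P_A(w|h)) values for four held-out words.
heldout_pairs = [(0.20, 0.05), (0.01, 0.30), (0.10, 0.12), (0.02, 0.25)]
grid = [i / 10 for i in range(1, 10)]
best_lam = tune_lambda(heldout_pairs, grid)
print("selected lambda:", best_lam)
```

A coarse grid is used here only to keep the example short; any one-dimensional optimizer would do, since the held-out log-likelihood is concave in the weight.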
9 LM Adaptation Methods - Problem Definition Of Discriminative Training Methods (1/3)
II. Discriminative Training Methods
◎ Problem Definition
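10 LM Adaptation Methods - Problem Definition Of Discriminative Training Methods (2/3)
$$W^*(A, \lambda) = \arg\max_{W \in GEN(A)} Score(W, \lambda), \quad Score(W, \lambda) = \sum_{d=0}^{D} \lambda_d f_d(W)$$
Here $GEN(A)$ is the set of candidate word strings for the phonetic string $A$, the $f_d$ are features of a candidate (such as n-gram counts), and the $\lambda_d$ are their weights.
This linear model views IME as a ranking problem, where the model gives a ranking score, not a probability. We therefore do not evaluate LMs obtained by discriminative training via perplexity.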
11 LM Adaptation Methods - Problem Definition Of Discriminative Training Methods (3/3)
$$SR(\lambda) = \sum_{i} Er(W_i^R, W^*(A_i, \lambda))$$
- $W^R$: the reference transcript
- $Er(\cdot)$: an error function, which is an edit distance function in this case
- $SR(\cdot)$: the sample risk, the sum of error counts over the training samples
Discriminative training methods strive to minimize $SR(\cdot)$ by optimizing the model parameters $\lambda$. However, $SR(\cdot)$ cannot be optimized easily, since $Er(\cdot)$ is a piecewise constant (or step) function of $\lambda$ whose gradient is either zero or undefined.
Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting and perceptron algorithms approximate $SR(\cdot)$ by loss functions that are suitable for optimization, while MSR uses a simple heuristic training procedure to minimize $SR(\cdot)$ directly.
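A small sketch of computing the sample risk for a linear ranking model. The data layout (each candidate is a string plus a feature vector) and all values are assumptions made for the example, with character strings standing in for conversions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (the error function Er)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def score(features, weights):
    """Linear ranking score: sum over d of lambda_d * f_d."""
    return sum(w * f for w, f in zip(weights, features))

def sample_risk(samples, weights):
    """Sum of errors of the top-scoring candidate over all training samples."""
    total = 0
    for reference, candidates in samples:
        best, _ = max(candidates, key=lambda c: score(c[1], weights))
        total += edit_distance(reference, best)
    return total

# Hypothetical samples: (reference, [(candidate, [base score, one n-gram count]), ...]).
samples = [
    ("abc", [("abc", [-2.0, 1]), ("abd", [-1.5, 0])]),
    ("xy",  [("xy",  [-1.0, 0]), ("xz",  [-0.8, 1])]),
]
print(sample_risk(samples, [1.0, 0.0]), sample_risk(samples, [1.0, 1.0]))
```

Changing the second weight only alters the risk when it flips which candidate ranks first, which is why the sample risk is a step function of the weights.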
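12 LM Adaptation Methods─ The Boosting Algorithm (1/2)
(i) The Boosting Algorithm
- Margin: $M(W^R, W) = Score(W^R, \lambda) - Score(W, \lambda)$
- A ranking error: an incorrect candidate conversion $W$ gets a higher score than the correct conversion $W^R$
$$RLoss(\lambda) = \sum_{i} \sum_{W \in GEN(A_i),\, W \neq W_i^R} I[M(W_i^R, W)]$$
where $I[\pi] = 1$ if $\pi \le 0$, and 0 otherwise
- Optimizing RLoss is NP-complete, so the algorithm optimizes its upper bound, ExpLoss:
$$ExpLoss(\lambda) = \sum_{i} \sum_{W \in GEN(A_i),\, W \neq W_i^R} \exp(-M(W_i^R, W))$$
- ExpLoss is convex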
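The relation between the two losses can be checked numerically. The sketch below assumes the margins of all (reference, incorrect candidate) pairs have already been computed; the margin values are invented for illustration.

```python
import math

def rloss(margins):
    """Ranking loss: number of pairs whose margin is zero or negative."""
    return sum(1 for m in margins if m <= 0)

def exploss(margins):
    """Exponential loss: sum of exp(-margin), an upper bound on the ranking loss."""
    return sum(math.exp(-m) for m in margins)

# Hypothetical margins Score(W^R) - Score(W) for four incorrect candidates.
margins = [2.0, 0.5, -1.0, 0.0]
print("RLoss:", rloss(margins), "ExpLoss:", round(exploss(margins), 3))
```

Each term satisfies exp(-m) >= 1 whenever m <= 0 and exp(-m) > 0 otherwise, so ExpLoss can never fall below RLoss; unlike RLoss, it is smooth and convex in the weights.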
13 LM Adaptation Methods─ The Boosting Algorithm (2/2)
$$\delta_d = \frac{1}{2} \log \frac{C_d^+ + \epsilon Z}{C_d^- + \epsilon Z}$$
- $C_d^+$: a value increasing exponentially with the sum of the margins of $(W^R, W)$ pairs over the set where $f_d$ is seen in $W^R$ but not in $W$
- $C_d^-$: the value related to the sum of margins over the set where $f_d$ is seen in $W$ but not in $W^R$
- $\epsilon$: a smoothing factor (whose value is optimized on held-out data)
- $Z$: a normalization constant
14 LM Adaptation Methods─ The Perceptron Algorithm (1/2)
(ii) The Perceptron Algorithm
- delta rule
- stochastic approximation
15 LM Adaptation Methods ─ The Perceptron Algorithm (2/2)
Averaged perceptron algorithm
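A minimal sketch of an averaged perceptron for candidate ranking, using the standard perceptron update (move the weights toward the reference features and away from the current top candidate). The slide equations are not reproduced here, so the exact update, the learning rate `eta`, the initial weights, and the toy feature vectors are all assumptions of this example.

```python
def score(weights, features):
    """Linear ranking score of one candidate."""
    return sum(w * f for w, f in zip(weights, features))

def averaged_perceptron(samples, init_weights, epochs=3, eta=1.0):
    """Train with per-sample updates and return the average of all weight vectors seen."""
    weights = list(init_weights)
    totals = [0.0] * len(weights)
    steps = 0
    for _ in range(epochs):
        for ref, candidates in samples:
            best = max(candidates, key=lambda f: score(weights, f))
            if best != ref:
                # Standard perceptron step: reward reference features, penalize the wrong winner.
                weights = [w + eta * (r - b) for w, r, b in zip(weights, ref, best)]
            totals = [t + w for t, w in zip(totals, weights)]
            steps += 1
    return [t / steps for t in totals]

# Hypothetical samples: (reference features, candidate features), where
# features = [base log-probability, count of n-gram A, count of n-gram B].
samples = [
    ([-3.0, 1, 0], [[-3.0, 1, 0], [-2.0, 0, 1]]),
    ([-4.0, 1, 0], [[-4.0, 1, 0], [-3.5, 0, 0]]),
]
avg = averaged_perceptron(samples, init_weights=[1.0, 0.0, 0.0])
print("averaged weights:", avg)
```

Averaging over every step, rather than keeping only the final weights, damps the effect of late updates on individual noisy samples.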
16 LM Adaptation Methods─ MSR(1/7)
(iii) The Minimum Sample Risk Method
Conceptually, MSR operates like any multidimensional function optimization approach:
- The first direction (i.e., feature) is selected and SR is minimized along that direction using a line search, that is, by adjusting the parameter of the selected feature while keeping all other parameters fixed.
- From there, SR is minimized along the second direction, and so on.
- The procedure cycles through the whole set of directions as many times as necessary, until SR stops decreasing.
17 LM Adaptation Methods ─ MSR(2/7)
This simple method works properly under two assumptions:
- First, there exists an implementation of line search that efficiently optimizes the function along one direction.
- Second, the number of candidate features is not too large, and they are not highly correlated.
However, neither assumption holds in our case:
- First, Er(.) in the sample risk is a step function of λ, and thus cannot be optimized directly by regular gradient-based procedures; a grid search has to be used instead. Simple grid search has its own problems: a coarse grid could miss the optimal solution, whereas a fine-grained grid would make the algorithm very slow.
- Second, in the case of LM, there are millions of candidate features, some of which are highly correlated with each other.
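18 LM Adaptation Methods ─ MSR(3/7)
◎ Active candidate of a group
$$Score(W, \lambda) = \lambda_d f_d(W) + \sum_{d' \neq d} \lambda_{d'} f_{d'}(W)$$
- $W \in GEN(A)$: a candidate word string
Since in our case $f_d(W)$ takes integer values ($f_d(W)$ is the count of a particular n-gram in $W$), we can group the candidates using $f_d(W)$ so that candidates in each group have the same value of $f_d(W)$.
In each group, we define the candidate with the highest value of $\sum_{d' \neq d} \lambda_{d'} f_{d'}(W)$ as the active candidate of the group, because no matter what value $\lambda_d$ takes, only this candidate could be selected according to:
$$W^*(A, \lambda) = \arg\max_{W \in GEN(A)} Score(W, \lambda)$$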
19 LM Adaptation Methods ─ MSR(4/7)
◎ Grid Line Search
By finding the active candidates, we can reduce the candidate set $GEN(A)$ to a much smaller list of active candidates. We can find a set of intervals for $\lambda_d$, within each of which a particular active candidate will be selected as $W^*$.
As a result, for each training sample, we obtain a sequence of intervals and their corresponding $Er(\cdot)$ values. The optimal value $\lambda_d^*$ can then be found by traversing the sequence and taking the midpoint of the interval with the lowest $Er(\cdot)$ value.
By merging the interval sequences of all training samples in the training set, we obtain a global sequence of intervals as well as their corresponding sample risk. We can then find the optimal value $\lambda_d^*$ as well as the minimal sample risk by traversing the global interval sequence.
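The interval-based line search can be sketched as follows. Each candidate is reduced to a triple (score from all other features, value of the selected feature, error count), which is an assumed data layout for this example. For brevity the sketch cuts the search range at every pairwise tie point between active candidates, a superset of the true interval boundaries, rather than tracing the exact sequence of winners; the values are invented.

```python
from itertools import combinations

def active_candidates(candidates):
    """For each value f of the selected feature, keep only the candidate with the highest base score."""
    best = {}
    for base, f, err in candidates:
        if f not in best or base > best[f][0]:
            best[f] = (base, f, err)
    return list(best.values())

def breakpoints(active):
    """Weights at which two active candidates tie: b1 + lam*f1 == b2 + lam*f2."""
    # Active candidates have distinct f values, so the denominator is never zero.
    return [(b2 - b1) / (f1 - f2)
            for (b1, f1, _), (b2, f2, _) in combinations(active, 2)]

def error_at(active, lam):
    """Error count of the candidate selected at weight lam."""
    return max(active, key=lambda c: c[0] + lam * c[1])[2]

def line_search(samples, lo, hi):
    """Return (best weight, minimal sample risk) for one feature over the range [lo, hi]."""
    active = [active_candidates(c) for c in samples]
    cuts = sorted({lo, hi, *(x for a in active for x in breakpoints(a) if lo < x < hi)})
    best_lam, best_sr = None, None
    for left, right in zip(cuts, cuts[1:]):
        mid = (left + right) / 2  # the sample risk is constant inside each interval
        sr = sum(error_at(a, mid) for a in active)
        if best_sr is None or sr < best_sr:
            best_lam, best_sr = mid, sr
    return best_lam, best_sr

# Hypothetical candidates per training sample: (base score, feature count, error count).
samples = [
    [(-2.0, 1, 0), (-1.5, 0, 1), (-3.0, 1, 2)],
    [(-1.0, 0, 0), (-2.0, 1, 1)],
]
print(line_search(samples, 0.0, 2.0))
```

Because the sample risk cannot change between consecutive cut points, evaluating one midpoint per interval is enough to find the exact minimum over the range, with no dependence on grid resolution.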
20 LM Adaptation Methods ─ MSR(5/7)
◎ Feature Subset Selection
Reducing the number of features is essential for two reasons: to reduce computational complexity and to ensure the generalization property of the linear model.
- Effectiveness of a feature $f_i$: $E(f_i)$
- The cross-correlation coefficient between two features $f_i$ and $f_j$: $C(i, j)$
21 LM Adaptation Methods ─ MSR(6/7)
22 LM Adaptation Methods ─ MSR(7/7)
- $D$: the number of all candidate features
- $K$: the number of features in the resulting model, $K \ll D$
According to the feature selection method:
- step 1: the effectiveness $E(\cdot)$ is estimated for each of the $D$ candidate features
- step 4: $O(K \times D)$ estimates of the cross-correlation coefficient $C(\cdot)$ are required
Therefore, we only estimate the value of $C(\cdot)$ between each of the selected features and each of the top $N$ remaining features with the highest value of $E(\cdot)$. This reduces the number of estimates of $C(\cdot)$ to $O(K \times N)$.
23 Experimental Results (1/3)
I. Data
The data used in our experiments stem from five distinct sources of text. Different sizes of each adaptation training set were also used, to show how the amount of adaptation training data affects the performance of the various adaptation methods.

Corpus    Source type
Nikkei    newspaper
Yomiuri   newspaper
TuneUp    balanced corpus (newspaper and other sources)
Encarta   encyclopedia
Shincho   novels
24 Experimental Results (2/3)
II. Computing Domain Characteristics
(i) The similarity between two domains: cross entropy
- not symmetric
- self entropy (the diversity of the corpus) increases in the following order: Nikkei (N) → Yomiuri (Y) → Encarta (E) → TuneUp (T) → Shincho (S)
25 Experimental Results (3/3)
III. Results of LM Adaptation
We trained our baseline trigram model on our background (Nikkei) corpus.
26 Discussion (1/6)
I. Domain Similarity and CER
The more similar the adaptation domain is to the background domain, the better (lower) the CER.
27 Discussion (2/6)
II. Domain Similarity and the Robustness of Adaptation Methods
- The discriminative methods outperform LI in most cases.
- The performance of LI is greatly influenced by domain similarity.
- Such a limitation is not observed with the discriminative methods.
28 Discussion (3/6)
III. Adaptation Data Size and CER Reduction
[Figure: x-axis: self entropy; y-axis: the improvement in CER reduction]
There is a positive correlation between the diversity of the adaptation corpus and the benefit of having more training data available.
An intuitive explanation: the less diverse the adaptation data, the fewer distinct training examples it contributes to discriminative training.
29 Discussion (4/6)
IV. Domain Characteristics and Error Ratios
The error ratio (ER) metric measures the side effects of a new model:
$$ER = \frac{|E_A|}{|E_B|}$$
- $|E_A|$: the number of errors found only in the new (adaptation) model
- $|E_B|$: the number of errors corrected by the new model
- $ER = 0$ if the adapted model introduces no new errors
- $ER < 1$ if the adapted model makes CER improvements
- $ER = 1$ if the CER improvement is zero (i.e., the adapted model makes as many new mistakes as it corrects old mistakes)
- $ER > 1$ when the adapted model has worse CER performance than the baseline model
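A small sketch of the error ratio, assuming errors are identified by position so that the two models' error sets can be compared directly. The set representation and the choice to return `None` when nothing was corrected are assumptions of this example.

```python
def error_ratio(baseline_errors, adapted_errors):
    """ER = (errors only the adapted model makes) / (baseline errors the adapted model fixes)."""
    new_errors = adapted_errors - baseline_errors  # found only in the adapted model
    corrected = baseline_errors - adapted_errors   # corrected by the adapted model
    if not corrected:
        return None  # ratio undefined when nothing was corrected (assumed convention)
    return len(new_errors) / len(corrected)

# Hypothetical error positions in a test set.
baseline = {1, 2, 3, 4}
adapted = {3, 4, 5}
print(error_ratio(baseline, adapted))
```

Here the adapted model fixes two baseline errors and introduces one new error, so the ratio falls below 1, the regime in which CER improves.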
30 Discussion (5/6)
RER: relative error rate reduction, i.e., the CER difference between the background and adapted models, in %.
A discriminative method (in this case MSR) is superior to linear interpolation, not only in terms of CER reduction but also in having fewer side effects.
31 Discussion (6/6)
Although the boosting and perceptron algorithms have the same CER for Yomiuri and TuneUp in Table III, the perceptron is better in terms of ER. This may be due to the exponential loss function used in the boosting algorithm, which is less robust against noisy data.
Corpus diversity: the less stylistically diverse a corpus is, the more consistent it is within its domain.
32 Conclusion and Future Work
Conclusion:
(1) Cross-domain similarity (cross entropy) correlates with the CER of all models.
(2) Diversity (self entropy) correlates with the utility of more adaptation training data for discriminative training methods.
Future Work: an online learning scenario