An Empirical Study on Language Model Adaptation. Jianfeng Gao, Hisami Suzuki (Microsoft Research); Wei Yuan (Shanghai Jiao Tong University). Presented by Patty Liu.

2 Outline
Introduction
The Language Model and the Task of IME
Related Work
LM Adaptation Methods
Experimental Results
Discussion
Conclusion and Future Work

3 Introduction Language model adaptation attempts to adjust the parameters of an LM so that it will perform well on a particular domain of data. In particular, we focus on the so-called cross-domain LM adaptation paradigm, that is, adapting an LM trained on one domain (the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available. The LM adaptation methods investigated here fall into two categories: (1) maximum a posteriori (MAP) estimation: linear interpolation; (2) discriminative training: boosting, perceptron, and minimum sample risk.

4 The Language Model and the Task of IME IME (Input Method Editor): users first input phonetic strings, which are then converted into appropriate word strings by software. Unlike speech recognition, there is no acoustic ambiguity in IME, since the phonetic string is provided directly by the user. Moreover, we can assume a unique mapping from the word string W to the phonetic string A in IME, so that P(A|W) = 1 for the A that corresponds to W. From the perspective of LM adaptation, IME faces the same problem that speech recognition does: the quality of the model depends heavily on the similarity between the training data and the test data.
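Under that unique-mapping assumption, the usual noisy-channel decision rule collapses to picking the most probable word string among the candidates consistent with the input phonetics. The derivation below is a standard reconstruction, not copied from the slide, so the exact notation is an assumption:

```latex
W^{*} \;=\; \arg\max_{W} P(W \mid A)
      \;=\; \arg\max_{W} P(W)\, P(A \mid W)
      \;=\; \arg\max_{W \,:\, W \mapsto A} P(W)
```

since P(A | W) = 1 exactly when A is the reading of W, and 0 otherwise.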

5 Related Work (1/3) I. Measuring Domain Similarity
L: a language
p: the true underlying probability distribution of L
m: another distribution (e.g., an SLM) which attempts to model p
W = w_1 … w_n: a word string in L
The cross entropy of p with respect to m is H(p, m) = − lim_{n→∞} (1/n) Σ_W p(W) log m(W), where the sum is over all word strings W of length n in L.

6 Related Work (2/3) However, in reality, the underlying distribution p is never known and the corpus size is never infinite. We therefore make the assumption that L is an ergodic and stationary process, and approximate the cross entropy by calculating it for a sufficiently large n instead of taking the limit, i.e., H(p, m) ≈ −(1/n) log m(w_1 … w_n) for a sufficiently long text sample from the domain. The cross entropy takes into account both the similarity between two distributions (given by the KL divergence) and the entropy of the corpus in question.
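As a concrete illustration of that approximation, the sketch below (hypothetical code, not from the paper) estimates per-word cross entropy in bits for a test corpus under a toy add-one-smoothed unigram model standing in for the SLM m; a lower value means the test domain is better predicted by, i.e. more similar to, the training domain.

```python
import math
from collections import Counter

def cross_entropy(train_tokens, test_tokens):
    """Approximate H(p, m) as -(1/n) * sum_i log2 m(w_i) over a large test sample,
    using an add-one-smoothed unigram model m as a stand-in for an SLM."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = sum(counts.values())

    def prob(w):  # add-one smoothing keeps every test word inside the model
        return (counts[w] + 1) / (total + len(vocab))

    log_prob = sum(math.log2(prob(w)) for w in test_tokens)
    return -log_prob / len(test_tokens)  # bits per word

# Lower cross entropy => the test domain is better predicted by (more similar to) the training domain.
background = "the stock market rose sharply on monday".split()
adaptation = "the market fell slightly on tuesday".split()
print(cross_entropy(background, adaptation))
```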

7 Related Work (3/3) II. LM Adaptation Methods
MAP methods adjust the parameters of the background model so as to maximize the likelihood of the adaptation data.
Discriminative training methods use the adaptation data to directly minimize the errors made on it by the background model.
These techniques have been applied successfully to language modeling in non-adaptation as well as adaptation scenarios for speech recognition.

8 LM Adaptation Methods ─ LI I. The Linear Interpolation Method
P(w|h) = λ P_B(w|h) + (1 − λ) P_A(w|h), where
P_B(w|h): the probability of the background model
P_A(w|h): the probability of the adaptation model
h: the history, which corresponds to the two preceding words in a trigram model
λ: the interpolation weight; for simplicity, we chose a single λ for all histories and tuned it on held-out data.
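A minimal sketch of the interpolation and of tuning the single λ on held-out data by grid search over perplexity (hypothetical code; `p_background` and `p_adapt` stand in for the real backoff trigram models):

```python
import math

def interpolate(p_background, p_adapt, lam):
    """Interpolated conditional model: P(w|h) = lam * Pb(w|h) + (1 - lam) * Pa(w|h)."""
    return lambda w, h: lam * p_background(w, h) + (1 - lam) * p_adapt(w, h)

def tune_lambda(p_background, p_adapt, heldout, grid=None):
    """Pick the single lambda minimizing held-out perplexity; heldout is a list of (history, word) pairs."""
    grid = grid or [i / 20 for i in range(1, 20)]
    best_lam, best_ppl = None, float("inf")
    for lam in grid:
        p = interpolate(p_background, p_adapt, lam)
        log_prob = sum(math.log(p(w, h)) for h, w in heldout)
        ppl = math.exp(-log_prob / len(heldout))
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam
```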

9 LM Adaptation Methods - Problem Definition of Discriminative Training Methods (1/3) II. Discriminative Training Methods ◎ Problem Definition Each candidate conversion W of a phonetic string A is represented by a feature vector f(W, A), whose components include the log probability assigned to W by the background trigram model and the counts of particular n-grams in W. The model scores each candidate linearly, Score(W, A; λ) = λ · f(W, A) = Σ_d λ_d f_d(W, A), and the conversion result is the highest-scoring candidate in the candidate list GEN(A): W* = argmax_{W ∈ GEN(A)} Score(W, A; λ).

10 LM Adaptation Methods - Problem Definition of Discriminative Training Methods (2/3) This formulation views IME as a ranking problem, where the model assigns ranking scores, not probabilities. We therefore do not evaluate the LMs obtained using discriminative training via perplexity.

11 LM Adaptation Methods - Problem Definition of Discriminative Training Methods (3/3)
W_R: the reference transcript
Er(W_R, W): an error function, in this case an edit distance function
SR(λ) = Σ_i Er(W_R^i, W*(A_i; λ)): the sample risk, i.e., the sum of error counts over the training samples
Discriminative training methods strive to minimize the sample risk SR(λ) by optimizing the model parameters λ. However, SR(λ) cannot be optimized easily, since Er(·) is a piecewise constant (or step) function of λ and its gradient is undefined. Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting and perceptron algorithms approximate SR(λ) by loss functions that are suitable for optimization, while MSR uses a simple heuristic training procedure to minimize SR(λ) directly.
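To make the ranking setup concrete, here is a hypothetical sketch of the linear scoring rule, the decoder, and the sample risk it induces; character edit distance plays the role of Er(·), and each training sample is assumed to carry a reference string plus a candidate list with precomputed feature dictionaries.

```python
def score(features, weights):
    """Linear ranking score: lambda . f(W, A)."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def decode(candidates, weights):
    """Return the highest-scoring candidate; the model ranks, it does not assign probabilities."""
    return max(candidates, key=lambda c: score(c["features"], weights))

def edit_distance(a, b):
    """Character-level edit distance, standing in for the error function Er(W_R, W)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))      # substitution
    return d[len(a)][len(b)]

def sample_risk(samples, weights):
    """SR(lambda): total error count of the top-ranked candidates over the training samples."""
    return sum(edit_distance(s["reference"], decode(s["candidates"], weights)["text"])
               for s in samples)
```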

12 LM Adaptation Methods ─ The Boosting Algorithm (1/2) (i) The Boosting Algorithm
Margin of a pair: M(W_R, W) = Score(W_R, A; λ) − Score(W, A; λ).
A ranking error: an incorrect candidate conversion W scores at least as high as the correct conversion W_R, i.e., M(W_R, W) ≤ 0.
RLoss counts the ranking errors over all training samples: RLoss(λ) = Σ_i Σ_{W ∈ GEN(A_i)} 1[M(W_R^i, W) ≤ 0], where 1[π] = 1 if π holds, and 0 otherwise.
Optimizing RLoss directly is NP-complete, so the algorithm optimizes its upper bound, ExpLoss(λ) = Σ_i Σ_{W ∈ GEN(A_i)} exp(−M(W_R^i, W)).
ExpLoss is convex, so it has no local minima.

13 LM Adaptation Methods ─ The Boosting Algorithm (2/2) At each iteration, the algorithm selects one feature f_k and updates its weight by δ_k = ½ log((C_k^+ + εZ) / (C_k^- + εZ)), where
C_k^+: a value that depends exponentially on the margins of the pairs in the set where f_k is seen in W_R but not in W
C_k^-: the corresponding value over the set where f_k is seen in W but not in W_R
ε: a smoothing factor (whose value is optimized on held-out data)
Z: a normalization constant.
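The sketch below (hypothetical code, reduced to a single greedy step) illustrates the quantities above: the margin, ExpLoss, and a smoothed update of one feature weight of the form δ = ½ log((C⁺ + εZ)/(C⁻ + εZ)). The exact bookkeeping and feature-selection loop of the paper's algorithm may differ; here `pairs` is assumed to be a list of (reference-features, candidate-features) dictionaries.

```python
import math

def margin(ref_feats, cand_feats, weights):
    """Margin of a pair: Score(W_R) - Score(W) under the current weights."""
    def _score(f):
        return sum(weights.get(k, 0.0) * v for k, v in f.items())
    return _score(ref_feats) - _score(cand_feats)

def exp_loss(pairs, weights):
    """ExpLoss: convex upper bound on the number of ranking errors."""
    return sum(math.exp(-margin(r, c, weights)) for r, c in pairs)

def boosting_step(pairs, weights, feature, eps=1e-3):
    """Update one feature weight by delta = 0.5 * log((C_plus + eps*Z) / (C_minus + eps*Z)).
    C_plus accumulates exp(-margin) over pairs where the feature appears in the reference
    but not in the incorrect candidate; C_minus covers the opposite case; Z normalizes."""
    z = exp_loss(pairs, weights)
    c_plus = sum(math.exp(-margin(r, c, weights))
                 for r, c in pairs if r.get(feature, 0) > c.get(feature, 0))
    c_minus = sum(math.exp(-margin(r, c, weights))
                  for r, c in pairs if r.get(feature, 0) < c.get(feature, 0))
    delta = 0.5 * math.log((c_plus + eps * z) / (c_minus + eps * z))
    weights[feature] = weights.get(feature, 0.0) + delta
    return weights
```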

14 LM Adaptation Methods ─ The Perceptron Algorithm (1/2) (ii) The Perceptron Algorithm
Delta rule: for each training sample, if the current model ranks an incorrect candidate W* above the reference W_R, each weight is moved toward the reference and away from the incorrect candidate, λ_d ← λ_d + η (f_d(W_R) − f_d(W*)), where η is the learning rate.
Stochastic approximation: the update is applied sample by sample, so the procedure is a stochastic approximation to gradient descent on the training loss.

15 LM Adaptation Methods ─ The Perceptron Algorithm (2/2) Averaged perceptron algorithm: instead of the final weight vector, the output is the average of the weight vectors obtained after each update, which makes the model more robust on unseen data.
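A compact sketch of averaged-perceptron training for this ranking task, assuming each sample provides the reference conversion and an n-best candidate list, both with feature dictionaries (hypothetical code, not the paper's implementation):

```python
from collections import defaultdict

def averaged_perceptron(samples, epochs=5, eta=1.0):
    """Perceptron training for ranking: if the top-scoring candidate is not the reference,
    move the weights toward the reference features and away from the candidate's.
    The returned model is the average of the weight vectors over all steps."""
    weights, totals, steps = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for sample in samples:
            best = max(sample["candidates"],
                       key=lambda c: sum(weights.get(k, 0.0) * v for k, v in c["features"].items()))
            if best["text"] != sample["reference"]["text"]:
                ref = sample["reference"]["features"]
                for k in set(ref) | set(best["features"]):
                    weights[k] += eta * (ref.get(k, 0.0) - best["features"].get(k, 0.0))
            for k, v in weights.items():   # accumulate for averaging after every sample
                totals[k] += v
            steps += 1
    return {k: v / steps for k, v in totals.items()}
```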

16 LM Adaptation Methods ─ MSR (1/7) (iii) The Minimum Sample Risk Method Conceptually, MSR operates like any multidimensional function optimization approach:
- The first direction (i.e., feature) is selected and SR is minimized along that direction using a line search, that is, by adjusting the parameter of the selected feature while keeping all other parameters fixed.
- Then, from there, SR is minimized along the second direction, and so on.
- The whole set of directions is cycled through as many times as necessary, until SR stops decreasing.

17 LM Adaptation Methods ─ MSR (2/7) This simple method can work properly under two assumptions.
- First, there exists an implementation of line search that efficiently optimizes the function along one direction.
- Second, the number of candidate features is not too large, and they are not highly correlated.
However, neither of the assumptions holds in our case.
- First of all, Er(·) in the sample risk is a step function of λ, and thus cannot be optimized directly by regular gradient-based procedures; a grid search has to be used instead. However, there are problems with a simple grid search: using a coarse grid could miss the optimal solution, whereas using a fine-grained grid would lead to a very slow algorithm.
- Second, in the case of LM, there are millions of candidate features, some of which are highly correlated with each other.

18 LM Adaptation Methods ─ MSR (3/7) ◎ Active candidate of a group: Consider the line search along one feature dimension f_d for a single training sample. The score of a candidate word string W can be written as Score(W, λ) = G(W) + λ_d f_d(W), where G(W) collects the contributions of all other features, which stay fixed during the search. Since in our case f_d takes integer values (it is the count of a particular n-gram in W), we can group the candidates by f_d so that candidates in each group have the same value of f_d. In each group, we define the candidate with the highest value of G(W) as the active candidate of the group, because no matter what value λ_d takes, only this candidate could be selected according to the decision rule W* = argmax_W Score(W, λ).

19 LM Adaptation Methods ─ MSR (4/7) ◎ Grid Line Search By finding the active candidates, we can reduce GEN(A) to a much smaller list of active candidates. We can find a set of intervals for λ_d, within each of which a particular active candidate will be selected as W*. As a result, for each training sample, we obtain a sequence of intervals and their corresponding error values. The optimal value of λ_d for that sample can then be found by traversing the sequence and taking the midpoint of the interval with the lowest error value. By merging the sequences of intervals of all training samples in the training set, we obtain a global sequence of intervals together with their corresponding sample risk. We can then find the optimal λ_d as well as the minimal sample risk by traversing the global interval sequence.
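The idea can be made concrete with a small hypothetical sketch of an exact line search along one feature dimension d: with the remaining weights fixed, each candidate's score is a line G(W) + λ_d·f_d(W) in λ_d, so the top-ranked candidate can only change at pairwise crossing points; evaluating the sample risk at the midpoint of every resulting interval recovers the minimum within the search range. The paper's interval-merging procedure is more efficient than this brute-force variant, and the sample/candidate layout below is an assumption.

```python
def line_search(samples, d, weights, lo=-10.0, hi=10.0):
    """Exact line search for weight lambda_d on [lo, hi], all other weights fixed.
    Each sample is a list of candidates: {"features": dict, "errors": Er against the reference}."""
    def g(cand):   # G(W): score contribution of every feature except d
        return sum(weights.get(k, 0.0) * v for k, v in cand["features"].items() if k != d)

    def fd(cand):  # f_d(W): value of the feature being searched
        return cand["features"].get(d, 0.0)

    # The top candidate of a sample can only change where two score lines cross.
    breakpoints = {lo, hi}
    for cands in samples:
        for i in range(len(cands)):
            for j in range(i + 1, len(cands)):
                if fd(cands[i]) != fd(cands[j]):
                    x = (g(cands[j]) - g(cands[i])) / (fd(cands[i]) - fd(cands[j]))
                    if lo < x < hi:
                        breakpoints.add(x)

    def risk(lam):  # SR(lambda) restricted to this one-dimensional slice
        return sum(max(cands, key=lambda c: g(c) + lam * fd(c))["errors"] for cands in samples)

    # Evaluate the sample risk at the midpoint of every interval between breakpoints.
    bps = sorted(breakpoints)
    midpoints = [(a + b) / 2 for a, b in zip(bps, bps[1:])]
    return min(midpoints, key=risk)
```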

20 LM Adaptation Methods ─ MSR (5/7) ◎ Feature Subset Selection Reducing the number of features is essential for two reasons: to reduce computational complexity and to ensure the generalization property of the linear model.
Effectiveness of a feature: the reduction in sample risk obtained by performing a line search along that feature alone.
The cross-correlation coefficient between two features f_k and f_l, computed over the training samples, is used to avoid selecting features that are highly correlated with features already in the model.

21 LM Adaptation Methods ─ MSR(6/7)

22 LM Adaptation Methods ─ MSR (7/7)
D: the number of all candidate features
K: the number of features in the resulting model, K ≪ D
According to the feature selection method:
- step 1: a line search is performed for each of the D candidate features;
- step 4: estimates of the cross-correlation coefficient are required between selected and remaining features.
Therefore, we only estimate the value of the coefficient between each of the selected features and each of the top N remaining features with the highest value of effectiveness. This reduces the number of coefficient estimates to K · N.
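A simplified sketch of this feature-selection shortcut (hypothetical code): rank candidate features by effectiveness, restrict correlation checks to the top-N most effective candidates, and skip any feature too strongly correlated with one already selected. The Pearson form of the cross-correlation coefficient and the 0.9 threshold are assumptions, not taken from the paper.

```python
import math

def correlation(xs, ys):
    """Pearson cross-correlation coefficient between two feature-value vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(effectiveness, feature_values, k, top_n=1000, max_corr=0.9):
    """Greedy selection: consider only the top_n candidates by effectiveness (sample-risk
    reduction), take them in order, and skip any feature that is too strongly correlated
    with a feature already selected."""
    ranked = sorted(effectiveness, key=effectiveness.get, reverse=True)[:top_n]
    selected = []
    for f in ranked:
        if len(selected) == k:
            break
        if all(abs(correlation(feature_values[f], feature_values[s])) < max_corr for s in selected):
            selected.append(f)
    return selected
```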

23 Experimental Results (1/3) I. Data The data used in our experiments stems from five distinct sources of text:
- Nikkei: newspaper
- Yomiuri: newspaper
- TuneUp: balanced corpus (newspaper and other sources)
- Encarta: encyclopedia
- Shincho: novels
Different sizes of adaptation training data were also used, to show how the amount of adaptation data affects the performance of the various adaptation methods.

24 Experimental Results (2/3) II. Computing Domain Characteristics (i) The similarity between two domains is measured by cross entropy.
- Cross entropy is not symmetric.
- Self entropy (the diversity of the corpus) increases in the following order: N → Y → E → T → S.

25 Experimental Results (3/3) III. Results of LM Adaptation We trained our baseline trigram model on our background (Nikkei) corpus.

26 Discussion (1/6) I. Domain Similarity and CER The more similar the adaptation domain is to the background domain, the better the CER results.

27 Discussion (2/6) II. Domain Similarity and the Robustness of Adaptation Methods The discriminative methods outperform LI in most cases. The performance of LI is greatly influenced by domain similarity. Such a limitation is not observed with the discriminative methods.

28 Discussion (3/6) III. Adaptation Data Size and CER Reduction (Figure: X-axis: self entropy; Y-axis: the improvement in CER reduction.) The figure shows a positive correlation between the diversity of the adaptation corpus and the benefit of having more training data available. An intuitive explanation: the less diverse the adaptation data, the fewer distinct training examples are available for discriminative training.

29 Discussion (4/6) IV. Domain Characteristics and Error Ratios The error ratio (ER) metric measures the side effects of a new model: it is the ratio of the number of errors found only in the new (adapted) model to the number of errors corrected by the new model.
- ER = 0 if the adapted model introduces no new errors;
- ER < 1 if the adapted model makes CER improvements;
- ER = 1 if the CER improvement is zero (i.e., the adapted model makes as many new mistakes as it corrects old mistakes);
- ER > 1 when the adapted model has worse CER performance than the baseline model.
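A tiny hypothetical sketch of the ER computation, assuming we can identify the set of erroneous characters of the baseline and of the adapted model on the same test set:

```python
def error_ratio(baseline_errors, adapted_errors):
    """ER = (# errors found only in the adapted model) / (# baseline errors it corrected)."""
    new_errors = len(adapted_errors - baseline_errors)
    fixed_errors = len(baseline_errors - adapted_errors)
    if fixed_errors == 0:
        return 0.0 if new_errors == 0 else float("inf")
    return new_errors / fixed_errors

# ER == 0: no new errors; ER < 1: net CER improvement;
# ER == 1: no net change; ER > 1: worse than the baseline.
```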

30 Discussion (5/6) RER: relative error rate reduction, i.e., the CER difference between the background and adapted models in % A discriminative method (in this case MSR) is superior to linear interpolation, not only in terms of CER reduction but also in having fewer side effects.

31 Discussion (6/6) Although the boosting and perceptron algorithms have the same CER for Yomiuri and TuneUp (Table III), the perceptron is better in terms of ER. This may be due to the use of an exponential loss function in the boosting algorithm, which is less robust against noisy data. Corpus diversity: the less stylistically diverse a corpus is, the more consistent it is within its domain.

32 Conclusion and Future Work Conclusion: (1) cross-domain similarity (cross entropy) correlates with the CER of all models; (2) diversity (self entropy) correlates with the utility of additional adaptation training data for discriminative training methods. Future work: an online learning scenario.