DISCRIMINATIVE TRAINING OF LANGUAGE MODELS FOR SPEECH RECOGNITION Hong-Kwang Jeff Kuo, Eric Fosler-Lussier, Hui Jiang, Chin-Hui Lee ICASSP 2002 Min-Hsuan Lai Department of Computer Science & Information Engineering National Taiwan Normal University

2 Outline: Introduction; Discriminative training; Issues in language model adjustments; Experimental setup; Results; Discussion and conclusion

3 Introduction Language models are important for guiding the speech recognizer, particularly in compensating for mistakes in acoustic decoding. However, what matters most for accurate decoding is not necessarily the maximum likelihood, but rather the best separation of the correct string from the competing, acoustically confusable hypotheses.

4 Introduction In this paper, they propose a model-based approach for discriminative training of language models. The language model is trained using the generalized probabilistic descent (GPD) algorithm to minimize the string error rate. The motivation is to adjust the language model so that it overcomes acoustic confusion and achieves the minimum recognition error rate.

5 Discriminative Training They describe how the parameters of an n-gram language model, originally trained using the conventional maximum likelihood criterion, are adjusted to achieve minimum sentence error by improving the separation of the correct word sequence from competing word sequence hypotheses. The formulation considers an observation sequence X_i representing the speech signal and candidate word sequences W.

6 Discriminative Training They define a discriminant function that is a weighted combination of the acoustic model score and the language model score; the combination weight is the inverse of the language model weight. A common strategy for a speech recognizer is to select the word sequence W_1 with the largest value of this function.
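As a sketch in standard GPD/MCE notation (the symbols are assumed here for illustration: Λ for the acoustic model, Γ for the language model, and α for the inverse of the language model weight), the discriminant function and the decoding rule can be written as:

\[
g(X_i, W; \Lambda, \Gamma) = \alpha \log P_{\Lambda}(X_i \mid W) + \log P_{\Gamma}(W),
\qquad
W_1 = \arg\max_{W} \, g(X_i, W; \Lambda, \Gamma).
\]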

7 Discriminative Training They can also recursively define a list of N-best word sequence hypotheses, with W_k denoting the k-th best hypothesized word sequence. Let W_0 be the known correct word sequence. The misclassification function is then defined as the difference between an anti-discriminant function over the competing hypotheses and the discriminant score of the correct sequence W_0.
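Under the same assumed notation, the N-best recursion and the misclassification function d can be sketched as follows (G is the anti-discriminant function defined on the next slide):

\[
W_k = \arg\max_{W \notin \{W_1, \ldots, W_{k-1}\}} g(X_i, W; \Lambda, \Gamma), \quad k = 2, \ldots, N,
\qquad
d(X_i) = -\,g(X_i, W_0; \Lambda, \Gamma) + G(X_i).
\]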

8 Discriminative Training The anti-discriminant function representing the competitors is defined as a smoothed combination of the competing discriminant scores, controlled by a positive parameter that determines how the different hypotheses are weighted. In the limit as this parameter goes to infinity, the anti-discriminant function is dominated by the largest competing discriminant function.
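A sketch of this anti-discriminant function and its limiting behaviour, with η denoting the assumed positive weighting parameter:

\[
G(X_i) = \frac{1}{\eta} \log \left[ \frac{1}{N} \sum_{k=1}^{N} \exp\big( \eta\, g(X_i, W_k; \Lambda, \Gamma) \big) \right],
\qquad
\lim_{\eta \to \infty} G(X_i) = \max_{k = 1, \ldots, N} g(X_i, W_k; \Lambda, \Gamma).
\]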

9 Discriminative Training To formulate an error function appropriate for gradient descent optimization, a smooth, differentiable 0-1 function such as the sigmoid is chosen as the class loss function; two constants control the slope and the shift of the sigmoid, respectively.
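A sketch of the sigmoid class loss, with γ and θ as the assumed slope and shift constants:

\[
\ell\big(d(X_i)\big) = \frac{1}{1 + \exp\big(-\gamma\, d(X_i) + \theta\big)}.
\]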

10 Discriminative Training Using the GPD algorithm, the parameters of the language model can be adjusted iteratively, with a suitable step size, using a gradient-descent update that minimizes the recognition error. For simplicity, they focus here on training only the language model discriminatively, keeping the acoustic model constant.
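A sketch of the GPD update, with ε_t as the assumed step size at iteration t and Γ_t the language model parameters at that iteration:

\[
\Gamma_{t+1} = \Gamma_t - \epsilon_t \, \nabla_{\Gamma}\, \ell\big(d(X_i); \Gamma\big) \Big|_{\Gamma = \Gamma_t}.
\]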

11 Discriminative Training They therefore only need to calculate the gradient of the loss with respect to the language model parameters, i.e. the log n-gram probabilities. By the chain rule, this gradient is the product of two terms: the slope associated with the sigmoid class-loss function and the derivative of the misclassification function with respect to each log n-gram probability.
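A sketch of this chain-rule factorization and of the sigmoid slope, writing \log P_{\Gamma}(w_v \mid w_u) for an assumed log-bigram parameter:

\[
\frac{\partial \ell}{\partial \log P_{\Gamma}(w_v \mid w_u)}
= \frac{\partial \ell}{\partial d(X_i)} \cdot
  \frac{\partial d(X_i)}{\partial \log P_{\Gamma}(w_v \mid w_u)},
\qquad
\frac{\partial \ell}{\partial d(X_i)}
= \gamma\, \ell\big(d(X_i)\big)\,\Big( 1 - \ell\big(d(X_i)\big) \Big).
\]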

12 Discriminative Training Using the definition of the misclassification function and after working out the mathematics, they obtain a derivative built from bigram counts: the number of times a bigram appears in the correct word sequence, set against a weighted combination of the number of times it appears in each competing hypothesis. These equations hold for any n-gram, not just bigrams.
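A sketch of the resulting derivative, where N(w_u w_v \mid W) denotes the number of times the bigram w_u w_v appears in word sequence W and κ_k is an assumed normalized weight on the k-th competing hypothesis:

\[
\frac{\partial d(X_i)}{\partial \log P_{\Gamma}(w_v \mid w_u)}
= -\,N(w_u w_v \mid W_0) + \sum_{k=1}^{N} \kappa_k\, N(w_u w_v \mid W_k),
\qquad
\kappa_k = \frac{\exp\big( \eta\, g(X_i, W_k; \Lambda, \Gamma) \big)}
                {\sum_{j=1}^{N} \exp\big( \eta\, g(X_i, W_j; \Lambda, \Gamma) \big)}.
\]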

13 Issues in language model adjustments In the previous section, they showed that the probabilities associated with bigrams are adjusted during discriminative training. If a particular bigram does not exist in the language model, its probability is computed by backing off to the unigram language model: the bigram probability is the product of the back-off weight of the history word and the unigram probability of the predicted word.
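A sketch of the back-off computation, with β(w_u) as the assumed back-off weight of the history word w_u and P(w_v) the unigram probability of the predicted word w_v:

\[
P(w_v \mid w_u) = \beta(w_u)\, P(w_v).
\]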

14 Issues in language model adjustments There are different choices for handling such backed-off bigrams: 1. keep the back-off weight constant while adjusting the unigram probability, keep the unigram probability constant while adjusting the back-off weight, or adjust both; 2. create a new bigram with the backed-off probability and adjust this new bigram probability. A related issue arises in class-based language models, where the bigram probability is given by the product of a class bigram probability and the probability of the word's membership in its class. In this case, the adjustment to the bigram probability can be assigned either to the class bigram or to the membership probability of the word in its class.
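A sketch of the class-based bigram decomposition mentioned above, with C(w) as the assumed class of word w:

\[
P(w_v \mid w_u) = P\big(w_v \mid C(w_v)\big)\, P\big(C(w_v) \mid C(w_u)\big).
\]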

15 Issues in language model adjustments Another interesting issue is whether certain bigram probabilities should be adjusted if they do not have the same contexts. As an example, suppose the correct string is “A B C D” and the competing string is “A X Y D.” Potential bigrams for adjustment include P(B|A), P(X|A), P(C|B), P(Y|X), P(D|C), and P(D|Y). Such an analysis has the potential to improve the overall word error rate. However, in this paper they focus on the sentence error rate and make the simplifying assumption that all of the divergent bigrams listed above should be adjusted.

16 Experimental setup The data used for the experiments were collected as part of the DARPA Communicator Project. The initial language model was built from transcriptions of calls collected by the University of Colorado prior to the NIST data collection; this set contained about 8.5K sentences. Their initial experiments were on a set of data collected by Bell Labs, consisting of 1395 sentences, which included NIST subjects as well as some data from other subjects.

17 Experimental setup The baseline language model has about 900 unigrams and 41K bigrams.

18 Results Out of the 1395 sentences, 1077 had no unknown words and at least one competing hypothesis. These 1077 sentences were first used to discriminatively train the baseline language model. After ten iterations, the final language model was tested on the entire set of 1395 sentences by decoding with the speech recognizer.

19 Results Compared to the baseline language model, the word error rate was reduced from 19.7% to 17.8%, a relative improvement of 10%. The sentence error rate went from 30.9% to 27.0%, a relative improvement of 13%. Interestingly, the bigram perplexity increased from 34 to 35, despite the reduction in word error rate; this observation is consistent with the argument that perplexity and word error rate are not perfectly correlated.
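For reference, the quoted relative improvements follow directly from the absolute error rates:

\[
\frac{19.7 - 17.8}{19.7} \approx 0.10 \;(\text{word error rate}),
\qquad
\frac{30.9 - 27.0}{30.9} \approx 0.13 \;(\text{sentence error rate}).
\]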

20 Results Recall that the misclassification function is more negative when the separation between the correct and competing hypotheses is larger. The histogram of misclassification values shown on this slide is shifted to the left after discriminative training, demonstrating that this separation is indeed increased. This increase in separation can improve the robustness of the language model.

21 Results The next experiment is a fairer evaluation involving round-robin training and testing. The set of sentences was divided into four roughly equal subsets; three subsets were used for discriminative training and the fourth for testing the resulting model.

22 Results In an entirely different experiment, they used a new class-based language model that they had later built with more data (about 9K sentences collected by other sites during the NIST data collection) and additional semantic classes. The baseline class-based bigram language model had a word error rate of 15.9% on the 1395-sentence test set, with a sentence error rate of 25.4%. Preliminary results with discriminative training gave a relative improvement of about 3%: the word error rate dropped to 15.5% and the sentence error rate to 24.7%.

23 Discussion and conclusion They showed that, after working out the equations for GPD training, the gradient simplifies into terms involving weighted counts of n-gram sequences that appear exclusively in either the correct or the competing sentences. From their preliminary experiments on a spontaneous speech database associated with a flight reservation dialogue system, they showed modest improvements of 4-6% in word and sentence error rates.

24 Discussion and conclusion One problem is that after the language model has been changed, certain hypotheses may emerge which were not captured in the N-best list.