Discriminative Training Approaches for Continuous Speech Recognition Berlin Chen, Jen-Wei Kuo, Shih-Hung Liu Speech Lab Graduate Institute of Computer Science & Information Engineering National Taiwan Normal University

NTNU Speech Lab 2 Outline Statistical Speech Recognition Overall Expected Risk Decision/Estimation Maximum Likelihood (ML) Estimation Maximum Mutual Information (MMI) Estimation Minimum Phone Error (MPE) Estimation Preliminary Experimental Results Conclusions

NTNU Speech Lab 3 Statistical Speech Recognition (1/2) The task of speech recognition is to determine the identity of a given observation sequence by assigning a recognized word sequence to it The decision is to find the identity with maximum a posteriori (MAP) probability –The so-called Bayes decision (or minimum-error-rate) rule, shown below
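The decision rule referenced on this slide is the standard MAP rule; a reconstruction in the usual notation, with $X$ the observation sequence and $W$ a candidate word sequence:

$$
W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{p(X \mid W)\,P(W)}{p(X)} = \arg\max_{W} p(X \mid W)\,P(W)
$$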

NTNU Speech Lab 4 Statistical Speech Recognition (2/2) The probabilities needed for the Bayes decision rule are not given in practice and have to be estimated from a training dataset Moreover, the true families of distributions to which the language model P(W) (typically assumed multinomial) and the acoustic model p(X|W) (typically assumed Gaussian) belong are not known –A certain parametric representation of these distributions is needed –In this presentation, the language model is assumed to be given in advance, while the acoustic model needs to be estimated; HMMs (hidden Markov models) are widely adopted for acoustic modeling

NTNU Speech Lab 5 Expected Risk Let a finite set of possible word sequences be given for an observation utterance –Assume that the true word sequence is also in this set Consider the action of classifying a given observation sequence to a particular word sequence –Let the loss function measure the cost incurred when we take such an action while some (possibly different) word sequence in the set is the true one Therefore, the (expected) risk for a specific action is defined as shown below
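The expected-risk expression referred to here can be written out as follows (a reconstruction; $\mathcal{W}_X$ denotes the candidate set, $\alpha(X \to W_i)$ the action of recognizing $X$ as $W_i$, and $\ell(W_i, W_j)$ the loss when the true sequence is $W_j$):

$$
R\big(\alpha(X \to W_i) \mid X\big) = \sum_{W_j \in \mathcal{W}_X} \ell(W_i, W_j)\, P(W_j \mid X)
$$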

NTNU Speech Lab 6 Decoding: Minimum Expected Risk (1/2) In speech recognition, we can take the action with the minimum (expected) risk If the zero-one loss function is adopted (string-level error) –Then the risk simplifies as shown below
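A reconstruction of the simplification alluded to on this slide, using the notation introduced above:

$$
\ell(W_i, W_j) = \begin{cases} 0, & W_i = W_j \\ 1, & W_i \neq W_j \end{cases}
\quad\Longrightarrow\quad
R\big(\alpha(X \to W_i) \mid X\big) = \sum_{W_j \neq W_i} P(W_j \mid X) = 1 - P(W_i \mid X)
$$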

NTNU Speech Lab 7 Decoding: Minimum Expected Risk (2/2) –Thus, minimizing the risk amounts to selecting the word sequence with maximum posterior probability (MAP decoding) The string-editing (Levenshtein) distance can also be used as the loss function –It takes individual word errors into consideration –E.g., Minimum Bayes Risk (MBR) search/decoding [V. Goel et al. 2004]

NTNU Speech Lab 8 Training: Minimum Overall Expected Risk (1/2) In training, we should minimize the overall (expected) risk of the actions taken on the training utterances –Each training utterance has a known true word sequence –The integral defining the overall risk extends over the whole observation sequence space However, when only a limited number of training observation sequences are available, the overall risk can be approximated by a sum over the training set, as sketched below
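A reconstruction of the approximation referred to here, with $\{(X_r, W_r)\}_{r=1}^{R}$ the training utterances and their true transcriptions and $\Lambda$ the acoustic model parameters:

$$
\mathcal{R}(\Lambda) \approx \sum_{r=1}^{R} \sum_{W \in \mathcal{W}_{X_r}} \ell(W, W_r)\, P_{\Lambda}(W \mid X_r)
$$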

NTNU Speech Lab 9 Training: Minimum Overall Expected Risk (2/2) Under a uniform-distribution assumption –The overall risk can be further expressed as a sum over the training utterances If the zero-one loss function is adopted –Then minimizing the overall risk amounts to maximizing the posterior probabilities of the true word sequences

NTNU Speech Lab 10 Training: Maximum Likelihood (1/3) The objective function of Maximum Likelihood (ML) estimation can be obtained if Jensen's inequality is further applied –Finding a new parameter set that minimizes the overall expected risk is then equivalent to finding one that maximizes the overall log likelihood of all training utterances (minimizing an upper bound on the risk corresponds to maximizing a lower bound on the objective)
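The resulting ML objective, written out in the notation used above (a standard form, reconstructed since the formula is not in the transcript):

$$
F_{\mathrm{ML}}(\Lambda) = \sum_{r=1}^{R} \log p_{\Lambda}(X_r \mid W_r)
$$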

NTNU Speech Lab 11 Training: Maximum Likelihood (2/3) The objective function can be maximized by adjusting the parameter set with the EM algorithm and a specific auxiliary function (the Baum-Welch algorithm for HMMs) –E.g., the update formulas for Gaussians shown below
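The Gaussian update formulas referenced here are the standard Baum-Welch re-estimates (reconstructed; $\gamma_{jm}(t)$ is the posterior occupancy of mixture component $m$ of state $j$ at time $t$, and $x_t$ the observation vector):

$$
\hat{\mu}_{jm} = \frac{\sum_{t} \gamma_{jm}(t)\, x_t}{\sum_{t} \gamma_{jm}(t)}, \qquad
\hat{\Sigma}_{jm} = \frac{\sum_{t} \gamma_{jm}(t)\,(x_t - \hat{\mu}_{jm})(x_t - \hat{\mu}_{jm})^{\top}}{\sum_{t} \gamma_{jm}(t)}
$$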

NTNU Speech Lab 12 Training: Maximum Likelihood (3/3) In contrast, discriminative training approaches attempt to optimize the correctness of the model set by formulating an objective function that in some way penalizes parameter settings that are liable to confuse correct and incorrect answers

NTNU Speech Lab 13 Training: Maximum Mutual Information (1/4) The objective function can be defined as the sum of the pointwise mutual information between each training utterance and its associated true word sequence –This is a kind of rational function Maximum mutual information (MMI) estimation tries to find a new parameter set that maximizes this objective function
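A reconstruction of the MMI objective in its usual form (acoustic and language model scaling factors are often included in practice but are omitted here for simplicity):

$$
F_{\mathrm{MMI}}(\Lambda) = \sum_{r=1}^{R} \log \frac{p_{\Lambda}(X_r \mid W_r)\, P(W_r)}{\sum_{W} p_{\Lambda}(X_r \mid W)\, P(W)}
$$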

NTNU Speech Lab 14 Training: Maximum Mutual Information (2/4) An alternative derivation follows from the overall expected risk criterion with the zero-one loss function –Minimizing the risk is then equivalent to maximizing the overall log posterior probability of the true word sequences of the training utterances, which is exactly the MMI objective

NTNU Speech Lab 15 Training: Maximum Mutual Information (3/4) When we maximize the MMIE objective function –Not only can the probability of the true word sequence (the numerator, as in the MLE objective function) be increased, but the probabilities of the other possible word sequences (the denominator) can also be decreased –Thus, MMIE attempts to make the correct hypothesis more probable, while at the same time making incorrect hypotheses less probable

NTNU Speech Lab 16 Training: Maximum Mutual Information (4/4) The objective functions used in discriminative training, such as that of MMI, are often rational functions –The original Baum-Welch algorithm is not feasible –Gradient descent and the extended Baum-Welch (EB) algorithm are two applicable approaches for such a function optimization problem Gradient descent may require a large number of iterations to obtain a local optimum, while the Baum-Welch algorithm has been extended (EB) to handle the optimization of rational functions –MMI training has update formulas similar to those of MPE (Minimum Phone Error) training, to be introduced later

NTNU Speech Lab 17 Training: Minimum Phone Error (1/4) The objective function of Minimum Phone Error (MPE) is also derived from the overall expected risk criterion –Replace the loss function with the so-called accuracy function (cf. slide 9) MPE tries to maximize the expected (phone or word) accuracy over all possible word sequences (generated by the recognizer) for the training utterances
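A reconstruction of the MPE objective in the form given by Povey (2004), with $A(W, W_r)$ the raw phone accuracy of hypothesis $W$ against the reference $W_r$ and $\kappa$ an acoustic scaling factor:

$$
F_{\mathrm{MPE}}(\Lambda) = \sum_{r=1}^{R} \frac{\sum_{W} p_{\Lambda}(X_r \mid W)^{\kappa}\, P(W)\, A(W, W_r)}{\sum_{W'} p_{\Lambda}(X_r \mid W')^{\kappa}\, P(W')}
$$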

NTNU Speech Lab 18 Training: Minimum Phone Error (2/4) Like MMI, MPE also aims to ensure that the training data is correctly recognized –In MPE each phone error in a training utterance can contribute at most one unit to the objective function, rather than an unlimited quantity as in MMI (zero-one word-string error) The MPE objective function is therefore less sensitive to portions of the training data that are poorly transcribed A (word) lattice structure can be used to approximate the set of all possible word sequences of each training utterance –Training statistics can be efficiently computed via such a structure

NTNU Speech Lab 19 Training: Minimum Phone Error (3/4) The objective function of MPE has the “latent variable” problem, so it cannot be directly optimized –Moreover, it is a rational function, and the Baum-Welch (EM) algorithm cannot be directly applied, because no strong-sense auxiliary function is available (a strong-sense auxiliary function is one whose increase guarantees an increase of the original objective function)

NTNU Speech Lab 20 Training: Minimum Phone Error (3/4) A weak-sense auxiliary function is employed instead for the MPE objective function; it has an easily found (local) optimum, and it only satisfies the weaker condition that its differential equals that of the objective function at the current parameter values

NTNU Speech Lab 21 Training: Minimum Phone Error (4/4) A smoothing function can be further incorporated into the weak-sense auxiliary function to constrain the optimization process (and thus improve convergence) –Because the smoothing function has its optimum at the current parameter values, the local differential is not affected and the result is still a weak-sense auxiliary function

NTNU Speech Lab 22 MPE: Auxiliary Function (1/2) The weak-sense auxiliary function for MPE model updating can be defined arc by arc –A scalar value (a constant) is calculated for each phone arc q, and it can be either positive or negative (because of the accuracy function) –The auxiliary function can thus be decomposed into arcs with positive contributions (the so-called numerator) and arcs with negative contributions (the so-called denominator) These two sets of statistics could in principle be compressed by saving only their difference (though this interacts with I-smoothing, introduced later); the arc-level terms still have the “latent variable” problem

NTNU Speech Lab 23 MPE: Auxiliary Function (2/2) The auxiliary function can be modified by substituting, for each arc, the normal (strong-sense) auxiliary function used in ML estimation –The smoothing term is not added yet at this point The key quantity (statistic) required in MPE training is the differential of the objective function with respect to the arc log likelihood, which can be termed the MPE arc occupancy

NTNU Speech Lab 24 MPE: Statistics Accumulation (1/2) The objective function can be expressed in terms of the likelihood of a specific phone arc q The differential with respect to the arc log likelihood can then be expressed in terms of forward-backward quantities computed on the lattice

NTNU Speech Lab 25 MPE: Statistics Accumulation (2/2) The quantities involved are: the average accuracy of the sentences passing through arc q, the likelihood (occupancy) of arc q, and the average accuracy of all the sentences in the word graph
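In Povey's (2004) notation, the statistic that these three quantities combine into is usually written as

$$
\gamma_q^{\mathrm{MPE}} = \gamma_q \left( c(q) - c^{\,r}_{\mathrm{avg}} \right)
$$

where $\gamma_q$ is the occupation probability of arc $q$, $c(q)$ is the average accuracy of the sentences passing through arc $q$, and $c^{\,r}_{\mathrm{avg}}$ is the average accuracy of all sentences in the word graph of utterance $r$.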

NTNU Speech Lab 26 MPE: Accuracy Function (1/4) These quantities can be calculated approximately using the word graph and the Forward-Backward algorithm Note that the exact accuracy function is expressed as the sum of phone-level accuracies over all phones –However, such an accuracy is obtained by a full alignment between the true word sequence and all possible word sequences, which is computationally expensive

NTNU Speech Lab 27 MPE: Accuracy Function (2/4) An approximated phone accuracy is therefore defined in terms of the proportion of a reference phone that is overlapped in time by a hypothesized arc 1. Assume the true word sequence has no pronunciation variation 2. The phone accuracy can then be obtained by a simple local search 3. Context-independent phones can be used for the accuracy calculation
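The approximation given by Povey (2004), which this slide appears to follow, is reproduced here for reference; $e(q, z)$ denotes the proportion of the length of reference phone $z$ that is overlapped in time by hypothesis arc $q$:

$$
\mathrm{PhoneAcc}(q) = \max_{z} \begin{cases} -1 + 2\,e(q, z), & \text{if $z$ and $q$ are the same phone} \\ -1 + e(q, z), & \text{otherwise} \end{cases}
$$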

NTNU Speech Lab 28 MPE: Accuracy Function (3/4) Forward-Backward algorithm for statistics calculation – Forward pass (loop structure translated from Chinese; arc-level update formulas not shown):
for each phone arc q with start time 0
    …
end
for t = 1 to T-1
    for each phone arc q with start time t
        for each phone arc r with end time t-1 that can connect to q
            …
        end
        for each phone arc r with end time t-1 that can connect to q
            …
        end
    end
end

NTNU Speech Lab 29 MPE: Accuracy Function (4/4) Backward pass (loop structure translated from Chinese; arc-level update formulas not shown):
for each phone arc q with end time T-1
    …
end
for t = T-2 down to 0
    for each phone arc q with end time t
        for each phone arc r with start time t+1 that can connect to q
            …
        end
        for each phone arc r with start time t+1 that can connect to q
            …
        end
    end
end
Finally, for each phone arc q
    …
end
A code sketch of these forward-backward recursions follows.
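The following is a minimal Python sketch of the recursions outlined on the two slides above, under assumed lattice data structures (the Arc fields, the function name, and the use of linear-domain likelihoods are illustrative simplifications, not the authors' implementation):

```python
# Minimal sketch: lattice forward-backward for MPE statistics.
# For every phone arc q it computes the arc posterior gamma_q, the average
# accuracy c(q) of paths through q, the overall average accuracy c_avg,
# and the MPE arc occupancy gamma_q^MPE = gamma_q * (c(q) - c_avg).

import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Arc:
    loglik: float                                    # scaled acoustic + LM log likelihood of this arc
    acc: float                                       # approximate phone accuracy PhoneAcc(q)
    preds: List[int] = field(default_factory=list)   # indices of predecessor arcs
    succs: List[int] = field(default_factory=list)   # indices of successor arcs

def mpe_lattice_stats(arcs, topo_order):
    """topo_order: arc indices sorted by start time (a valid topological order)."""
    n = len(arcs)
    alpha = [0.0] * n          # forward likelihood (linear domain for brevity; real code works in log)
    alpha_acc = [0.0] * n      # average accuracy of partial paths ending with arc q (inclusive)
    beta = [0.0] * n           # backward likelihood
    beta_acc = [0.0] * n       # average accuracy of partial paths following arc q (exclusive)

    # Forward pass over arcs sorted by start time
    for q in topo_order:
        a, lik = arcs[q], math.exp(arcs[q].loglik)
        if not a.preds:                                        # lattice-initial arc
            alpha[q], alpha_acc[q] = lik, a.acc
        else:
            tot = sum(alpha[r] for r in a.preds)
            alpha[q] = tot * lik
            alpha_acc[q] = a.acc + sum(alpha[r] * alpha_acc[r] for r in a.preds) / tot

    # Backward pass in reverse topological order
    for q in reversed(topo_order):
        a = arcs[q]
        if not a.succs:                                        # lattice-final arc
            beta[q], beta_acc[q] = 1.0, 0.0
        else:
            beta[q] = sum(math.exp(arcs[s].loglik) * beta[s] for s in a.succs)
            beta_acc[q] = sum(math.exp(arcs[s].loglik) * beta[s] *
                              (beta_acc[s] + arcs[s].acc) for s in a.succs) / beta[q]

    # Combine: arc posteriors, per-arc average accuracy, overall average accuracy
    finals = [q for q in range(n) if not arcs[q].succs]
    total = sum(alpha[q] for q in finals)
    c_avg = sum(alpha[q] * alpha_acc[q] for q in finals) / total
    gamma = [alpha[q] * beta[q] / total for q in range(n)]
    c = [alpha_acc[q] + beta_acc[q] for q in range(n)]         # average accuracy of sentences through q
    gamma_mpe = [gamma[q] * (c[q] - c_avg) for q in range(n)]  # MPE arc occupancy (cf. slide 25)
    return gamma, c, c_avg, gamma_mpe
```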

NTNU Speech Lab 30 MPE: Smoothing Function The smoothing function can be defined for each Gaussian –The old model parameters are used here as the hyper-parameters of the smoothing term –It has its maximum value at the old (current) parameter values
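One common form, following Povey (2004), defines the smoothing term as an expected log likelihood under the old parameters, scaled by a per-Gaussian constant $D_{jm}$ (a reconstruction, since the formula is not in the transcript):

$$
G_{\mathrm{sm}}(\Lambda, \Lambda^{\mathrm{old}}) = \sum_{j,m} D_{jm} \int \mathcal{N}\!\left(x; \mu^{\mathrm{old}}_{jm}, \Sigma^{\mathrm{old}}_{jm}\right) \log \mathcal{N}\!\left(x; \mu_{jm}, \Sigma_{jm}\right) dx
$$

which is indeed maximized at $\mu_{jm} = \mu^{\mathrm{old}}_{jm}$ and $\Sigma_{jm} = \Sigma^{\mathrm{old}}_{jm}$.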

NTNU Speech Lab 31 MPE: Final Auxiliary Function (Diagram of the construction, labeled: weak-sense auxiliary function; strong-sense auxiliary function; smoothing function; the resulting combined function is still a weak-sense auxiliary function)

NTNU Speech Lab 32 MPE: Model Update using EB (1/2) Based on the final auxiliary function, the Extended Baum-Welch (EB) training algorithm yields the update formulas below (for Gaussians with diagonal covariance matrices)
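The standard MPE/EB update formulas, reconstructed from Povey (2004); $\theta^{\mathrm{num}}_{jm}(\cdot)$ and $\theta^{\mathrm{den}}_{jm}(\cdot)$ are the numerator and denominator data statistics, $\gamma^{\mathrm{num}}_{jm}$ and $\gamma^{\mathrm{den}}_{jm}$ the corresponding occupancies, $D_{jm}$ the smoothing constant, and all squares are element-wise for diagonal covariances:

$$
\hat{\mu}_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(X) - \theta^{\mathrm{den}}_{jm}(X) + D_{jm}\,\mu_{jm}}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}}
$$

$$
\hat{\sigma}^{2}_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(X^{2}) - \theta^{\mathrm{den}}_{jm}(X^{2}) + D_{jm}\left(\sigma^{2}_{jm} + \mu^{2}_{jm}\right)}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm}} - \hat{\mu}^{2}_{jm}
$$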

NTNU Speech Lab 33 MPE: Model Update using EB (2/2) Two sets of statistics (numerator and denominator) are accumulated separately; a sketch of how they feed the EB update follows
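A minimal Python sketch of how accumulated numerator/denominator statistics plug into the EB update from the previous slide (the GaussStats container, the choice of D via the constant E, and the variance floor are illustrative assumptions, not the authors' settings):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussStats:
    occ: float           # occupancy accumulated from positive (num) or negative (den) MPE arc weights
    x: np.ndarray        # first-order statistics  (sum of weight * x_t)
    x2: np.ndarray       # second-order statistics (sum of weight * x_t**2)

def eb_update(num: GaussStats, den: GaussStats,
              mu: np.ndarray, var: np.ndarray, E: float = 2.0):
    """Extended Baum-Welch update for one diagonal-covariance Gaussian (mu, var)."""
    # Common heuristic: D proportional to the denominator occupancy; a full
    # implementation would also enforce positive variances as on slide 35.
    D = E * den.occ
    denom = num.occ - den.occ + D
    new_mu = (num.x - den.x + D * mu) / denom
    new_var = (num.x2 - den.x2 + D * (var + mu ** 2)) / denom - new_mu ** 2
    return new_mu, np.maximum(new_var, 1e-6)   # floor variances for numerical safety
```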

NTNU Speech Lab 34 MPE: More About Smoothing Function (1/2) The mean and variance update formulas rely on the proper setting of the per-Gaussian smoothing constant –If it is too large, the step size is small and convergence is slow –If it is too small, the algorithm may become unstable –It also needs to be large enough to keep all updated variances positive (Figure: curves labeled A, B, C illustrating these cases)

NTNU Speech Lab 35 MPE: More About Smoothing Function (2/2) Previous work [Povey 2004] used a value of the smoothing constant that was twice the minimum positive value needed to ensure that all variance updates were positive

NTNU Speech Lab 36 MPE: I-Smoothing I-smoothing increases the weight of the numerator counts depending on the amount of data available for each Gaussian This is done by scaling the numerator terms in the update formulas by a factor involving a constant tau –tau can be set empirically This emphasizes the positive contributions (arcs with higher accuracy)
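A reconstruction of the scaling referred to here, following Povey (2004): the numerator occupancy and data statistics are multiplied by

$$
\frac{\gamma^{\mathrm{num}}_{jm} + \tau}{\gamma^{\mathrm{num}}_{jm}}
$$

so that the numerator counts effectively become $\gamma^{\mathrm{num}}_{jm} + \tau$, with $\tau$ a constant set empirically.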

NTNU Speech Lab 37 Preliminary Experimental Results (1/4) Experiments were conducted on the MATBN (TV broadcast news) corpus (field reporters) –Training: 34,672 utterances (25.5 hrs) –Testing: 292 utterances (1.45 hrs, outside testing) –Metric: Chinese character error rate (CER) Results table (CER values not shown in the transcript): rows MFCC(CMS), MFCC(CN), HLDA+MLLT(CN); columns ML_itr10, MPE_itr10, ML_itr150, MPE_itr10; annotated relative improvements of 12.83% and 11.51% for MPE over ML

NTNU Speech Lab 38 Preliminary Experimental Results (2/4) Another experiment was conducted on the MATBN (TV broadcast news) corpus (field reporters) –Discriminative feature: HLDA+MLLT(CN) –Discriminative training: MPE –Total of 34,672 utterances (24.5 hrs), with 10-fold cross validation A relative improvement of 26.2% was finally achieved (HLDA-MLLT(CN) + MPE)

NTNU Speech Lab 39 Preliminary Experimental Results (3/4) Conducted on radio broadcast news recorded from several radio stations located in Taipei –MFCC(CMS) / LDA(CN) / HLDA-MLLT(CN) features –Unsupervised training –Metric: Chinese character error rate (CER)

CER (%)              MFCC     MFCC+MPE   LDA      LDA+MPE   HLDA-MLLT   HLDA-MLLT+MPE
Original 4 Hrs       22.56    -          20.04    -         20.12       -
+5 Hrs  (Thr=0.9)    18.05    -          16.43    14.87     16.44       14.78
+21 Hrs (Thr=0.8)    16.90    15.34      15.22    14.11     15.64       14.22
+33 Hrs (Thr=0.7)    17.32    -          15.49    14.23     15.37       14.28
+48 Hrs (Thr=0.6)    17.29    -          15.54    14.32     15.42       -
+54 Hrs (Thr=0.5)    17.30    -          15.54    14.31     15.46       -

(The +60 Hrs (Thr=0.4) row is only partially recoverable; 15.36% is the sole surviving value. Relative improvements annotated on the slide: 9.32%, 16.51%, 15.86%, 9.94%)

NTNU Speech Lab 40 Preliminary Experimental Results (4/4) Conducted on microphone speech recorded for spoken dialogues –Free syllable decoding with MFCC_CMS features –About 8 hours of training data were used The syllable error rate was reduced from 27.79% to 22.74% –An 18.17% relative improvement

NTNU Speech Lab 41 Conclusions MPE/MWE (or MMI) based discriminative training approaches have shown their effectiveness in Chinese continuous speech recognition –Joint training of feature transformations, acoustic models and language models –Unsupervised training –More in-depth investigation and analysis are needed –Exploration of variant accuracy/error functions

NTNU Speech Lab 42 References
D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition," Ph.D. dissertation, Peterhouse, University of Cambridge, July 2004.
K. Vertanen, "An Overview of Discriminative Training for Speech Recognition."
V. Goel, S. Kumar, W. Byrne, "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition," IEEE Transactions on Speech and Audio Processing, May 2004.
J.-W. Kuo, "An Initial Study on Minimum Phone Error Discriminative Learning of Acoustic Models for Mandarin Large Vocabulary Continuous Speech Recognition," Master's thesis, NTNU, 2005.
J.-W. Kuo, B. Chen, "Minimum Word Error Based Discriminative Training of Language Models," Eurospeech 2005.

Thank You !

NTNU Speech Lab 44 Minimum Phone Error Estimation (Figure: an example lattice showing the approximate accuracy values of hypothesized phone arcs against the reference "pat"; the hypothesis arcs are "p", "a", "t", "s", "t", "a")

NTNU Speech Lab 45 Extended Baum-Welch Algorithm A general re-estimation formula of the Extended Baum-Welch algorithm for rational objective functions is shown below
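The general EB re-estimation formula for discrete parameters, as commonly stated in the extended Baum-Welch literature and used in Povey (2004), is reproduced here (a reconstruction; $F$ is the rational objective function and $C$ a sufficiently large constant):

$$
\hat{\theta}_{ij} = \frac{\theta_{ij}\left( \dfrac{\partial F}{\partial \theta_{ij}} + C \right)}{\sum_{j'} \theta_{ij'}\left( \dfrac{\partial F}{\partial \theta_{ij'}} + C \right)}
$$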

NTNU Speech Lab 46 MPE: Shortcomings The standard MPE objective function discourages insertion errors more than deletion errors The dynamic range of the raw phone accuracy is typically quite narrow, which makes MPE occupancies considerably lower than MMI occupancies

NTNU Speech Lab 47 Other Issues The major computational load of discriminative training comes from: –Decoding the training data –Collecting statistics for parameter estimation Discriminative training techniques often perform very well on the training data, but may fail to generalize well to unseen data This generalization failure may result from the numerator and denominator being dominated by a very small number of paths, or from modeling training-data characteristics that are not significant to the task