Boosting Training Scheme for Acoustic Modeling
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Reference Papers
[ICASSP 03] Improving The Performance of An LVCSR System Through Ensembles of Acoustic Models
[EUROSPEECH 03] Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
[ICSLP 04] A Frame Level Boosting Training Scheme for Acoustic Modeling
[ICSLP 04] Optimizing Boosting with Discriminative Criteria
[ICSLP 04] Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training
[EUROSPEECH 05] Investigations on Ensemble Based Semi-Supervised Acoustic Model Training
[ICSLP 06] Investigations of Issues for Using Multiple Acoustic Models to Improve Continuous Speech Recognition
Improving The Performance of An LVCSR System Through Ensembles of Acoustic Models
ICASSP 2003
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Introduction
An ensemble of classifiers is a collection of single classifiers
–used to select a hypothesis by a vote of its components (plurality, majority, or weighted voting)
Bagging and Boosting are the two most successful algorithms for constructing ensembles
This paper describes work on applying ensembles of acoustic models to the problem of LVCSR
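To make the voting idea concrete, here is a minimal sketch; the hypothesis strings and the function name are invented for illustration (weighted voting is sketched later, under the model-importance discussion):

```python
from collections import Counter

def plurality_vote(hypotheses):
    """Pick the hypothesis proposed by the largest number of component models."""
    return Counter(hypotheses).most_common(1)[0][0]

# Example: three acoustic models decode the same utterance.
hyps = ["show me flights", "show me flights", "show my flights"]
print(plurality_vote(hyps))  # -> "show me flights"
```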
Bagging vs. Boosting
Bagging
–In each round, bagging randomly selects a number of examples from the original training set and trains a new single classifier on the selected subset
–The final classifier is built by choosing the hypothesis on which the single classifiers best agree
Boosting
–In boosting, the single classifiers are trained iteratively, such that hard-to-classify examples receive increasing emphasis
–A parameter measuring each classifier's importance is determined according to its classification accuracy
–The final hypothesis is the weighted majority vote of the single classifiers
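The two training loops can be contrasted in a short sketch. This is generic Bagging/AdaBoost over abstract examples, not the paper's utterance-level variant; `train` and `is_error` are assumed callables supplied by the user:

```python
import math
import random

def bagging(examples, train, n_rounds):
    """Bagging: each round trains a classifier on a bootstrap sample."""
    return [train([random.choice(examples) for _ in examples])
            for _ in range(n_rounds)]

def boosting(examples, train, is_error, n_rounds):
    """AdaBoost-style boosting: misclassified examples gain weight, and
    each classifier gets an importance parameter from its accuracy."""
    n = len(examples)
    weights = [1.0 / n] * n
    models, alphas = [], []
    for _ in range(n_rounds):
        model = train(examples, weights)
        errs = [is_error(model, ex) for ex in examples]  # 1 = wrong, 0 = right
        eps = sum(w for w, e in zip(weights, errs) if e)
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)          # guard the log
        alpha = 0.5 * math.log((1.0 - eps) / eps)        # model importance
        # Raise the weight of misclassified examples, lower the rest.
        weights = [w * math.exp(alpha if e else -alpha)
                   for w, e in zip(weights, errs)]
        z = sum(weights)
        weights = [w / z for w in weights]
        models.append(model)
        alphas.append(alpha)
    return models, alphas
```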
Algorithms
The first algorithm is based on the intuition that an incorrectly recognized utterance should receive more attention in training
–If the weight of an utterance is 2.6, we first add two copies of the utterance to the new training set, and then add a third copy with probability 0.6 (see the sketch below)
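A minimal sketch of this floor-plus-fraction duplication, directly following the 2.6 example (the function name is illustrative):

```python
import math
import random

def resample_by_weight(utterances, weights):
    """Duplicate each utterance floor(w) times, plus one extra copy with
    probability equal to the fractional part of w. A weight of 2.6 thus
    yields two copies always and a third copy 60% of the time."""
    new_set = []
    for utt, w in zip(utterances, weights):
        whole = int(math.floor(w))
        new_set.extend([utt] * whole)
        if random.random() < w - whole:
            new_set.append(utt)
    return new_set
```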
Algorithms
The exponential growth of the training set across rounds is a severe problem for Algorithm 1
Algorithm 2 is proposed to address this problem
Algorithms
Algorithms 1 and 2 make no attempt to measure how important a model is relative to the others
–A good model should play a more important role than a bad one
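One way to realize this is a weighted vote in which each model's ballot is scaled by a weight derived from its accuracy. The log-odds weighting below is an illustrative AdaBoost-style choice, not necessarily the paper's rule:

```python
import math

def weighted_vote(hypotheses, accuracies):
    """Weighted majority vote: each model's hypothesis counts in proportion
    to log(acc / (1 - acc)), so more accurate models dominate.
    Assumes 0 < acc < 1 for every model."""
    scores = {}
    for hyp, acc in zip(hypotheses, accuracies):
        weight = math.log(acc / (1.0 - acc))
        scores[hyp] = scores.get(hyp, 0.0) + weight
    return max(scores, key=scores.get)
```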
Experiments
Corpus: CMU Communicator system
Experimental results: (results table shown in the paper, not transcribed here)
A Frame Level Boosting Training Scheme for Acoustic Modeling
ICSLP 2004
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Introduction
In the current Boosting algorithm, the utterance is the basic unit for acoustic model training
Our analysis shows two notable weaknesses in this setting:
–First, the objective function of the current Boosting algorithm is designed to minimize utterance error instead of word error
–Second, in the current algorithm, an utterance is treated as a single unit for resampling
This paper proposes a frame level Boosting training scheme for acoustic modeling to address these two problems
Frame Level Boosting Training Scheme
The metric used in Boosting training is the frame level conditional probability of the (word level) label given the frame
The objective function is built on a pseudo loss for each frame t, which describes the degree of confusion of this frame for recognition
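The exact formulas were not transcribed in these notes; the sketch below assumes an AdaBoost.M2-style pseudo loss over frame posteriors, and should be read as a stand-in rather than the paper's verbatim definition:

```python
def frame_pseudo_loss(posteriors, correct_word):
    """AdaBoost.M2-style pseudo loss for one frame (an assumed form):
    close to 0 when the correct word dominates the frame posterior,
    close to 1 when a competing word does. `posteriors` maps each
    candidate word to p(word | frame)."""
    p_correct = posteriors[correct_word]
    p_best_wrong = max(p for w, p in posteriors.items() if w != correct_word)
    return 0.5 * (1.0 - p_correct + p_best_wrong)
```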
Frame Level Boosting Training Scheme
Training scheme:
–How to resample the frame level training data?
–Duplicate each frame a number of times given by its weight, and concatenate the copies to create a new utterance for acoustic model training (see the sketch below)
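A minimal sketch of frame-level resampling, assuming the same floor-plus-fraction duplication trick as the utterance-level scheme; the key difference is that the copies stay contiguous inside the utterance, so the frame sequence remains usable for HMM training:

```python
import math
import random

def expand_utterance(frames, frame_weights):
    """Build a new utterance in which frame t is repeated in proportion
    to its weight, keeping the copies adjacent to preserve the temporal
    structure of the original frame sequence."""
    new_utt = []
    for frame, w in zip(frames, frame_weights):
        whole = int(math.floor(w))
        copies = whole + (1 if random.random() < w - whole else 0)
        new_utt.extend([frame] * copies)
    return new_utt
```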
Experiments
Corpus: CMU Communicator system
Experimental results: (results table shown in the paper, not transcribed here)
Discussions
Some speculations on this outcome:
–1. The mismatch between training criterion and target still exists in principle in the new scheme
–2. The frame based resampling method depends exclusively on the weights calculated during training
–3. Forced alignment is used to determine the correct word for each frame
–4. A new training scheme considering both utterance and frame level recognition errors may be more suitable for accurate modeling