Published by Noah Murphy. Modified over 9 years ago.
1
An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training
Mr. Yik-Cheung Tam, Dr. Brian Mak
2
Outline
Motivation
Overview of MCE training
Problem using N-best hypotheses
Alternative: 1-nearest hypothesis (What? Why? How?)
Evaluation
Conclusion
3
MCE Overview
The MCE loss function applies l(.), a soft 0-1 error-counting function (sigmoid), to a distance measure d(X).
G(X), the competing-hypothesis term of the distance measure, may be computed using the N-best hypotheses.
Gradient descent is used to obtain a better estimate of the model parameters.
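The equations on this slide are not preserved in the transcript. The sketch below uses the form commonly seen in the MCE literature and is only an assumption about what the slide showed: d(X) = G(X) - g(X), with G(X) a soft maximum over the competitors' log scores, and l(d) = 1 / (1 + exp(-gamma * d + offset)). The function names and the smoothing parameter `eta` are hypothetical.

```python
import math

def misclassification_distance(correct_score, competitor_scores, eta=1.0):
    """Assumed form: d(X) = G(X) - g(X).

    g(X) is the log score of the correct string; G(X) is a soft maximum
    (log-mean-exp scaled by eta) over the competing hypotheses' log scores.
    """
    n = len(competitor_scores)
    m = max(competitor_scores)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(eta * (s - m)) for s in competitor_scores) / n
    g_competing = m + math.log(mean_exp) / eta
    return g_competing - correct_score

def sigmoid_loss(d, gamma=0.1, offset=0.0):
    """Soft 0-1 error count; gamma controls the slope of the sigmoid."""
    return 1.0 / (1.0 + math.exp(-gamma * d + offset))

# One competitor scoring 10 log-units above the correct string.
d = misclassification_distance(-100.0, [-90.0])
print(d, sigmoid_loss(d, gamma=0.1))
```

With a single competitor, G(X) reduces to that competitor's score, which is the 1-best (or 1-nearest) case discussed on the following slides.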
4
Problem Using N-best Hypotheses
When d(X) gets large enough, it falls out of the steep trainable region of the sigmoid.
[Figure: sigmoid curve with the trainable region marked.]
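To see why a large d(X) stops contributing to training, note that the derivative of a sigmoid with slope parameter gamma is gamma * l * (1 - l), which vanishes as l approaches 1. The values below are illustrative only and are not taken from the experiments.

```python
import math

def sigmoid(d, gamma):
    """Soft error count with zero offset."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

def sigmoid_slope(d, gamma):
    """Derivative of the sigmoid with respect to d: gamma * l * (1 - l)."""
    l = sigmoid(d, gamma)
    return gamma * l * (1.0 - l)

# The gradient is largest at d = 0 and shrinks rapidly as d grows.
for d in (0.0, 5.0, 20.0, 50.0):
    print(d, sigmoid(d, 0.5), sigmoid_slope(d, 0.5))
```

Because gradient descent scales each update by this slope, an utterance far to the right of the curve barely moves the parameters.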
5
What is the 1-nearest Hypothesis?
d(1-nearest) <= d(1-best)
The idea can be generalized to N-nearest hypotheses.
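The definition itself is not preserved on this slide. A minimal sketch, assuming the 1-nearest hypothesis is the competitor whose score is closest to that of the correct string, while the 1-best is the highest-scoring competitor. The data layout and function names are hypothetical.

```python
def one_best(competitors):
    """Highest-scoring competing hypothesis; competitors are (label, log_score) pairs."""
    return max(competitors, key=lambda c: c[1])

def one_nearest(correct_score, competitors):
    """Competitor whose score is closest to the correct string's score (assumed definition)."""
    return min(competitors, key=lambda c: abs(c[1] - correct_score))

correct_score = -100.0
competitors = [("a", -60.0), ("b", -95.0), ("c", -130.0)]

best = one_best(competitors)
nearest = one_nearest(correct_score, competitors)

# Distance d = (competitor score) - (correct score).
d_best = best[1] - correct_score
d_nearest = nearest[1] - correct_score
print(best, d_best, nearest, d_nearest)
```

Under this assumed definition the inequality d(1-nearest) <= d(1-best) holds by construction: the 1-best has the largest d of any competitor, and when the utterance is already recognized correctly the two selections coincide.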
6
Using 1-nearest Hypothesis
Keep the training data inside the steep trainable region.
[Figure: sigmoid curve with the trainable region marked.]
7
How to Find the 1-nearest Hypothesis?
Method 1 (exact approach): stack-based N-best decoder.
  Drawback: N may be very large, which causes a memory problem, so the size of N needs to be limited.
Method 2 (approximate approach): modify the Viterbi algorithm with a special pruning scheme.
8
Approximated 1-nearest Hypothesis
Notation:
  V(t+1, j): accumulated score at time t+1 and state j.
  Beam(t+1): beam width applied at time t+1.
The pruning scheme also uses the transition probability from state i to j, the observation probability at time t+1 and state j, and the accumulated score of the Viterbi path of the correct string at time t+1.
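The pruning rule itself is not preserved in this transcript. The sketch below shows the standard Viterbi update in the log domain, followed by one plausible reading of the scheme: keep only the states whose accumulated score lies within Beam(t+1) of the correct string's Viterbi score. That pruning rule, the symbols a_ij and b_j, and all function names are assumptions, not the authors' stated method.

```python
def viterbi_step(prev, log_trans, log_obs):
    """One standard Viterbi update in the log domain.

    prev:      {i: V(t, i)}
    log_trans: {(i, j): log a_ij}          (a_ij: assumed symbol)
    log_obs:   {j: log b_j(x_{t+1})}       (b_j: assumed symbol)
    Returns    {j: V(t+1, j)} with V(t+1, j) = max_i [V(t, i) + log a_ij] + log b_j.
    """
    new = {}
    for (i, j), log_a in log_trans.items():
        if i not in prev or j not in log_obs:
            continue
        candidate = prev[i] + log_a + log_obs[j]
        if j not in new or candidate > new[j]:
            new[j] = candidate
    return new

def prune_near_correct(scores, correct_score, beam):
    """ASSUMED rule: keep states scoring within `beam` of the correct path's score."""
    return {j: v for j, v in scores.items() if abs(v - correct_score) <= beam}

prev = {0: -1.0, 1: -5.0}
log_trans = {(0, 0): -0.5, (0, 1): -1.0, (1, 1): -0.2}
log_obs = {0: -2.0, 1: -3.0}

scores = viterbi_step(prev, log_trans, log_obs)
kept = prune_near_correct(scores, correct_score=-4.8, beam=1.0)
print(scores, kept)
```

In contrast to ordinary beam pruning, which keeps paths near the best score, a rule of this kind would keep paths near the correct string's score, so the surviving hypothesis stays close to the correct one.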
9
Approximated 1-nearest Hypothesis (cont.)
There exists some “nearest” path in the search space (the shaded area in the figure).
10
System Evaluation
11
Corpus: Aurora
Noisy connected digits derived from TIDIGITS.
Multi-condition training (train on noisy conditions):
  {subway, babble, car, exhibition} x {clean, 20, 15, 10, 5 dB} (5 noise levels)
  8440 training utterances.
Testing (test on matched noisy conditions):
  Same as above, with additional samples at 0 and -5 dB (7 noise levels)
  28,028 testing utterances.
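The condition counts implied by this slide can be checked with a few lines of arithmetic. The per-condition figures assume the utterances are split evenly across conditions, which the slide does not state.

```python
noise_types = ["subway", "babble", "car", "exhibition"]
train_levels = ["clean", 20, 15, 10, 5]   # SNR in dB, plus clean
test_levels = train_levels + [0, -5]      # two extra SNR levels for testing

train_conditions = len(noise_types) * len(train_levels)
test_conditions = len(noise_types) * len(test_levels)

# Assumption: utterances are distributed evenly over the conditions.
train_per_condition = 8440 / train_conditions
test_per_condition = 28028 / test_conditions

print(train_conditions, train_per_condition)
print(test_conditions, test_per_condition)
```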
12
System Configuration
Standard 39-dimensional MFCC features (cep + Δ + ΔΔ).
11 whole-word digit HMMs (0-9, oh): 16 states, 3 Gaussians per state.
3-state silence HMM, 6 Gaussians per state.
1-state short-pause HMM tied to the 2nd state of the silence model.
Baum-Welch training to obtain the initial HMMs.
Corrective MCE training on the HMM parameters.
13
System Configuration (cont.)
Compare 3 kinds of competing hypotheses:
  1-best hypothesis
  Exact 1-nearest hypothesis
  Approx. 1-nearest hypothesis
Sigmoid parameters:
  Various γ (controls the slope of the sigmoid)
  Offset = 0
14
Experiment I: Effect of Sigmoid Slope
Learning rate = 0.05, with different γ:
  0.1 (best test performance)
  0.5 (steeper)
  0.02, 0.004 (flatter)
Results:
  Baseline: 12.71%
  1-best: 11.01%
  Approx. 1-nearest: 10.71%
  Exact 1-nearest: 10.45%
15
Effective Amount of Training Data
A training sample with soft error < 0.95 is defined to be “effective”.
The 1-nearest approach has more effective training data when the sigmoid slope is relatively steep.
Effective training data: 1-best (40%), approx. 1-nearest (51%), exact 1-nearest (67%).
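The "effective" criterion can be sketched as a simple count over the training set. The distances below are made-up illustrative values, not data from the experiments, and the function names are hypothetical.

```python
import math

def soft_error(d, gamma, offset=0.0):
    """Sigmoid soft error count for misclassification distance d."""
    return 1.0 / (1.0 + math.exp(-gamma * d + offset))

def effective_fraction(distances, gamma, threshold=0.95):
    """Fraction of samples whose soft error is below the threshold."""
    effective = sum(1 for d in distances if soft_error(d, gamma) < threshold)
    return effective / len(distances)

# Hypothetical misclassification distances for five utterances.
distances = [-10.0, 0.0, 10.0, 100.0, 400.0]

steep = effective_fraction(distances, gamma=0.1)
flat = effective_fraction(distances, gamma=0.004)
print(steep, flat)
```

A flatter sigmoid (smaller gamma) makes more samples effective, at the price of a smaller gradient per sample. That trade-off is what the compensation experiments on the following slides address.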
16
Experiment II: Compensation With More Training Iterations
With 100% effective training data (γ = 0.004, learning rate = 0.05), apply more training iterations.
Result: slow improvement compared to the best case (exact 1-nearest with γ = 0.1).
17
Experiment II: Compensation Using a Larger Learning Rate
Use a larger learning rate (0.05 -> 1.25).
Fix γ = 0.004 (100% effective training data).
Result: the 1-nearest approach is better than the 1-best approach after compensation.

System              Before compensation   After compensation
Baseline            12.71%
1-best              12.07%                11.55%
Approx. 1-nearest   12.27%                10.70%
Exact 1-nearest     12.16%                10.79%
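One way to read the table is as a relative error reduction over the baseline. The short sketch below computes it from the figures on this slide; the helper name is hypothetical.

```python
def relative_reduction(baseline_wer, wer):
    """Relative WER reduction over the baseline, in percent."""
    return 100.0 * (baseline_wer - wer) / baseline_wer

baseline = 12.71
after_compensation = {
    "1-best": 11.55,
    "Approx. 1-nearest": 10.70,
    "Exact 1-nearest": 10.79,
}

reductions = {
    system: relative_reduction(baseline, wer)
    for system, wer in after_compensation.items()
}

for system, reduction in reductions.items():
    print(f"{system}: {reduction:.2f}% relative reduction")
```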
18
Using a Larger Learning Rate (cont.)
Training performance: MCE loss versus number of training iterations.
[Figure: curves for 1-best, approx. 1-nearest, and exact 1-nearest.]
19
Using a Larger Learning Rate (cont.)
Test performance: WER versus number of training iterations.
[Figure: curves for approx. 1-nearest (10.70%), exact 1-nearest (10.79%), and 1-best (11.55%).]
20
Conclusion
1-best and 1-nearest methods were compared in MCE training:
  Effect of the sigmoid slope.
  Compensation when using a flat sigmoid.
The 1-nearest method is better than the 1-best approach.
  More trainable data are available in the 1-nearest approach.
Approx. and exact 1-nearest methods yield comparable performance.
21
Questions and Answers