Learning Long-Term Temporal Features in LVCSR Using Neural Networks
Barry Chen, Qifeng Zhu, Nelson Morgan
International Computer Science Institute (ICSI), Berkeley, CA, USA, MLMI 2004
Presenter: 張志豪 Date: 2005/02/24
2 References
–Hynek Hermansky, “TRAPS - Classifiers of Temporal Patterns”, ICSLP, 1998
–Hynek Hermansky, “Data-Derived Nonlinear Mapping For Feature Extraction in HMM”, ICASSP, 1999
–Barry Chen, “Learning Discriminative Temporal Patterns in Speech : Development of Novel Traps-Link Classifiers”, Eurospeech, 2003
–Hemant Misra, “New Entropy Based Combination Rules in HMM/ANN Multi-Stream ASR”, ICASSP, 2003
3 Outline
Introduction
–TRAPS
–HATS
MLP Architectures
–One Stage Architectures
–Two Stage Architectures
Experiment
–Frame accuracy
–Word accuracy
–Combine long-term and short-term
4 Introduction Hynek Hermansky’s group pioneered a method to capture long-term ( ms) information for phonetic classification using multi-layer perceptrons (MLPs). They developed an MLP architecture called TRAPS, which stands for “TempoRAl PatternS”. TRAPS perform about as well as more conventional ASR systems using short-term features, and improve word error rates when used in combination with these short-term features. This paper works on improving the TRAPS architecture in the context of TIMIT phoneme recognition. It studies Hidden Activation TRAPS (HATS), which differ from TRAPS in that HATS use the hidden activations of the critical band MLPs, instead of their outputs, as inputs to the “merger” MLP.
5 Log-Critical Band Energies (LCBE) (1/2) Conventional Feature Extraction
6 Log-Critical Band Energies (LCBE) (2/2) TRAPS/HATS Feature Extraction
7 MLP Architectures
One Stage Approach (unconstrained)
–15 Bands x 51 Frames
Two Stage Approach (constrained)
–Linear Approach: PCA40, LDA40
–Non-Linear Approach: TRAPS, HATS
8 One Stage Approach (1/3) The paper uses LCBEs calculated every 10 ms on 8 kHz sampled speech, which gives a total of 15 Bark-scale-spaced LCBEs per frame. These are mean and variance normalized per utterance. 51 frames of all 15 bands of LCBEs are used as inputs to an MLP. These inputs are built by stacking the 25 frames before and the 25 frames after the current frame onto the current frame, and the target phoneme comes from the current frame. The network is trained with output targets that are “1.0” for the class associated with the current frame, and “0” for all others. The MLPs are trained on 46 phoneme targets, and consist of a single hidden layer with sigmoidal nonlinearity and an output layer with softmax nonlinearity. Baseline system : “15 Bands x 51 Frames”, unconstrained
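The input construction described on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the LCBE matrix is random stand-in data, and padding the utterance edges by repeating the first and last frame is an assumption (the slide does not say how edges are handled).

```python
import numpy as np

def normalize_utterance(lcbe):
    """Mean and variance normalize each band over one utterance.

    lcbe has shape (frames, bands).
    """
    mean = lcbe.mean(axis=0, keepdims=True)
    std = lcbe.std(axis=0, keepdims=True)
    return (lcbe - mean) / np.maximum(std, 1e-8)

def stack_context(lcbe, context=25):
    """Stack `context` frames on each side of every frame.

    Returns an array of shape (frames, (2 * context + 1) * bands).
    Edge frames are padded by repetition (an assumption).
    """
    padded = np.pad(lcbe, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    windows = [padded[t:t + width].reshape(-1) for t in range(lcbe.shape[0])]
    return np.stack(windows)

rng = np.random.default_rng(0)
utterance = rng.normal(size=(200, 15))      # stand-in for 200 frames x 15 LCBEs
normalized = normalize_utterance(utterance)
inputs = stack_context(normalized)          # one 51 x 15 = 765-dim vector per frame
print(inputs.shape)
```

Each row of `inputs` is the MLP input for one frame; the training target for that row would be the phoneme label of the center frame.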
9 One Stage Approach (2/3)
Softmax nonlinearity
–If you want the outputs of a network to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one. The purpose of the softmax activation function is to enforce these constraints on the outputs. Let the net input to each output unit be q_i, i=1,...,c, where c is the number of categories. Then the softmax output p_i is: $p_i = \frac{\exp(q_i)}{\sum_{j=1}^{c} \exp(q_j)}$
Sigmoidal nonlinearity
–The logistic sigmoid maps a net input x to a value between zero and one: $\sigma(x) = \frac{1}{1 + e^{-x}}$
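A minimal NumPy sketch of the two nonlinearities named on this slide. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the softmax result; the example input values are arbitrary.

```python
import numpy as np

def softmax(q):
    """Softmax over the last axis: outputs lie in (0, 1) and sum to one."""
    shifted = q - np.max(q, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    """Logistic sigmoid: squashes each input independently into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

q = np.array([2.0, 1.0, 0.1])   # net inputs to three output units
p = softmax(q)
print(p, p.sum())
print(sigmoid(np.array([-2.0, 0.0, 2.0])))
```

Unlike the sigmoid, the softmax couples the outputs: raising one net input lowers every other output.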
10 One Stage Approach (3/3)
11 Two Stage Approach They developed an MLP architecture called TRAPS, which stands for “TempoRAl PatternS”. The TRAPS system consists of two stages of MLPs. 1. Critical band MLPs learn phone posterior probabilities conditioned on the input, which is a set of consecutive frames of LCBEs from one band, or LCBE trajectory. 2. A “merger” MLP merges the outputs of these individual critical band MLPs, resulting in overall phone posterior probabilities. Correlations among individual frames of LCBEs from different frequency bands are not directly modeled; instead, correlations among long-term LCBE trajectories from different frequency bands are modeled.
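The two-stage structure can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, the hidden layer sizes (`BAND_HIDDEN`, `MERGER_HIDDEN`) are hypothetical, and all function names are invented for this example. The `use_hidden` switch shows the one difference between TRAPS (merger sees band posteriors) and HATS (merger sees band hidden activations).

```python
import numpy as np

N_BANDS, N_FRAMES, N_PHONES = 15, 51, 46
BAND_HIDDEN, MERGER_HIDDEN = 40, 100   # hypothetical layer sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

def init_mlp(rng, n_in, n_hidden, n_out):
    """Random (untrained) single-hidden-layer MLP parameters."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.1, size=(n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def mlp_forward(params, x):
    """Return (hidden activations, softmax outputs) for one input vector."""
    hidden = sigmoid(x @ params["W1"] + params["b1"])
    return hidden, softmax(hidden @ params["W2"] + params["b2"])

def two_stage_forward(band_params, merger_params, window, use_hidden):
    """window: (N_FRAMES, N_BANDS) LCBEs centered on the current frame."""
    merger_inputs = []
    for b in range(N_BANDS):
        hidden, posteriors = mlp_forward(band_params[b], window[:, b])
        # HATS passes hidden activations on; TRAPS passes band posteriors on.
        merger_inputs.append(hidden if use_hidden else posteriors)
    _, phone_posteriors = mlp_forward(merger_params, np.concatenate(merger_inputs))
    return phone_posteriors

rng = np.random.default_rng(0)
bands = [init_mlp(rng, N_FRAMES, BAND_HIDDEN, N_PHONES) for _ in range(N_BANDS)]
traps_merger = init_mlp(rng, N_BANDS * N_PHONES, MERGER_HIDDEN, N_PHONES)
hats_merger = init_mlp(rng, N_BANDS * BAND_HIDDEN, MERGER_HIDDEN, N_PHONES)

window = rng.normal(size=(N_FRAMES, N_BANDS))
traps_out = two_stage_forward(bands, traps_merger, window, use_hidden=False)
hats_out = two_stage_forward(bands, hats_merger, window, use_hidden=True)
print(traps_out.shape, hats_out.shape)
```

Note that each critical band MLP only ever sees one band's trajectory, which is what makes the approach "constrained" compared with the one-stage network.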
12 Linear Approaches (1/2) Feature
–The paper calculates a PCA transform on successive 51-frame LCBE trajectories for each of the 15 individual bands, resulting in a 51 x 51 transform matrix for each of the 15 bands.
–This transform is then used to orthogonalize the temporal trajectory in each band, retaining only the top 40 features per band.
–Finally, these transformed features are used as input to an MLP.
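The per-band PCA steps above can be sketched in NumPy as follows. This is a minimal illustration on random stand-in data with hypothetical function names; estimating the PCA from a covariance eigendecomposition is one common way to do it and is an assumption about the details.

```python
import numpy as np

def fit_band_pca(trajectories, n_keep=40):
    """Fit PCA on one band's 51-frame trajectories.

    trajectories: (n_examples, 51). Returns (mean, basis) where basis holds
    the top n_keep columns of the 51 x 51 transform.
    """
    mean = trajectories.mean(axis=0)
    cov = np.cov(trajectories - mean, rowvar=False)   # (51, 51)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_keep]        # largest variance first
    return mean, eigvecs[:, order]

def project_bands(windows, transforms):
    """windows: (n_examples, 51, 15). Returns (n_examples, 15 * n_keep)."""
    parts = [(windows[:, :, b] - mean) @ basis
             for b, (mean, basis) in enumerate(transforms)]
    return np.concatenate(parts, axis=1)

rng = np.random.default_rng(0)
windows = rng.normal(size=(500, 51, 15))   # stand-in LCBE windows
transforms = [fit_band_pca(windows[:, :, b]) for b in range(15)]
features = project_bands(windows, transforms)
print(features.shape)   # 15 bands x 40 features = 600 inputs to the MLP
```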
13 Linear Approaches (2/2)
14 Non-Linear Approaches (1/2)
15 Non-Linear Approaches (2/2)
16 Experimental Setup Training : ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular. Testing : 2001 Hub-5 Evaluation Set (Eval2001) –A large vocabulary conversational telephone speech test set –2,255,609 frames and 62,890 words Back-end recognizer : SRI’s Decipher System, with first-pass decoding using a bigram language model and within-word triphone acoustic models.
17 Frame Accuracy (1/2) Classification is deemed correct when the highest output of the MLP corresponds to the correct phoneme label. For reference, the comparison includes a conventional intermediate temporal context MLP that uses 9 frames of per-side normalized (mean, variance, vocal tract length) PLP plus deltas and double deltas as inputs (PLP 9 Frames). Result –With the exception of the TRAPS system, all of the two-stage systems do better than this. –HATS Before Sigmoid and TRAPS Before Softmax perform comparably at 65.80% and 65.85% respectively, while the PCA and LDA approaches perform similarly at 65.50% and 65.52% respectively.
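The frame accuracy criterion stated on this slide amounts to an argmax comparison. A small NumPy sketch with made-up posteriors and labels (three classes instead of 46, purely for illustration):

```python
import numpy as np

def frame_accuracy(posteriors, labels):
    """Fraction of frames whose highest MLP output matches the label.

    posteriors: (frames, classes); labels: (frames,) integer class indices.
    """
    predicted = np.argmax(posteriors, axis=1)
    return float(np.mean(predicted == labels))

posteriors = np.array([
    [0.7, 0.2, 0.1],   # predicts class 0
    [0.1, 0.6, 0.3],   # predicts class 1
    [0.3, 0.3, 0.4],   # predicts class 2
    [0.5, 0.4, 0.1],   # predicts class 0
])
labels = np.array([0, 1, 1, 0])
accuracy = frame_accuracy(posteriors, labels)   # 3 of 4 frames correct
print(accuracy)
```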
18 Frame Accuracy (2/2)
19 Word Error Rates (1/2) System (46D) –The experiment takes the log of the outputs from the MLPs and then decorrelates the features via PCA. –Mean and variance normalization is then applied to these transformed outputs. Result –HATS always ranks first when compared to all other long temporal systems, achieving a 7.29% relative improvement over the baseline. –TRAPS doesn’t provide an improvement over the baseline, but all of the other approaches do. The final softmax nonlinearity in the critical band MLPs in TRAPS is the only difference between it and TRAPS Before Softmax, so including this nonlinearity during recognition causes performance degradation. It is likely that the softmax’s output normalization is obscuring useful information that the second stage MLP needs.
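The feature post-processing listed under "System (46D)" (log, PCA decorrelation, mean and variance normalization) can be sketched as below. This is a simplified illustration, not the authors' pipeline: the posteriors are random stand-ins, the PCA and the normalization statistics are both estimated on the same data for brevity (in practice they would come from training data and from each utterance or speaker side), and the probability floor is an assumption to avoid log(0).

```python
import numpy as np

def posteriors_to_features(posteriors, n_keep=None, floor=1e-10):
    """Turn MLP posteriors (frames, classes) into recognizer features."""
    log_post = np.log(np.maximum(posteriors, floor))       # take the log
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                  # ascending order
    order = np.argsort(eigvals)[::-1]                       # largest first
    if n_keep is not None:
        order = order[:n_keep]
    projected = centered @ eigvecs[:, order]                # PCA decorrelation
    std = projected.std(axis=0)
    return (projected - projected.mean(axis=0)) / np.maximum(std, 1e-8)

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(46), size=1000)   # stand-in for MLP outputs
features = posteriors_to_features(posteriors)        # 46-dimensional features
print(features.shape)
```

Passing `n_keep` truncates the output to the leading principal components instead of keeping all 46.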
20 Word Error Rates (2/2)
21 Combine Long-Term with Short-Term (1/3) SRI’s EARS Rich Transcription 2003 front-end features (short-term) (39 D) –Baseline HLDA(PLP+3d) Feature 1. 12th order PLP plus the first three orders of deltas, 2. mean, variance, and vocal tract length normalized, 3. transformed by heteroscedastic linear discriminant analysis (HLDA), keeping the top 39 features.
22 Combine Long-Term with Short-Term (2/3) Methods (64 D) –Append the top 25 dimensions after PCA on each of the temporal features to the 39-dimensional baseline HLDA(PLP+3d) features, giving 39 + 25 = 64 dimensions. –PLP 9 Frames –Combine the HATS and PLP 9 Frames systems using an inverse entropy weighting method, take the log followed by PCA to 25 dimensions, and append the result to the HLDA(PLP+3d) features to get the “Inv Entropy Combo HATS+PLP 9 Frames” features. Result –HATS improves WER by 3.23%. –PLP 9 Frames gives the same improvement as HATS. –Combining PLP 9 Frames with HATS improves WER by 8.60%.
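A sketch of combining two posterior streams by inverse entropy weighting. It assumes the basic form of the rule, in which each stream's per-frame weight is proportional to the reciprocal of the entropy of its posterior distribution; the exact variant used in the paper may differ (for example by thresholding), and the function names and toy posteriors here are hypothetical.

```python
import numpy as np

def entropy(p, floor=1e-10):
    """Per-frame entropy of posteriors with shape (frames, classes)."""
    p = np.maximum(p, floor)
    return -(p * np.log(p)).sum(axis=1)

def inverse_entropy_combine(streams):
    """Weight each stream per frame by 1 / entropy, normalized over streams."""
    inv = np.stack([1.0 / np.maximum(entropy(p), 1e-10) for p in streams])
    weights = inv / inv.sum(axis=0, keepdims=True)      # (streams, frames)
    return sum(w[:, None] * p for w, p in zip(weights, streams))

# Toy posteriors over three classes for two frames.
hats = np.array([[0.90, 0.05, 0.05],    # confident (low entropy)
                 [0.40, 0.30, 0.30]])   # uncertain
plp = np.array([[0.34, 0.33, 0.33],     # uncertain
                [0.10, 0.80, 0.10]])    # confident
combined = inverse_entropy_combine([hats, plp])
print(combined)
```

In each frame the more confident stream dominates, and because the weights sum to one the combined rows remain valid probability distributions. The combined posteriors would then go through the log and PCA steps before being appended to the short-term features.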
23 Combine Long-Term with Short-Term (3/3)
24 Conclusions Including the final softmax of the critical band MLPs in TRAPS during recognition causes performance degradation. Inverse entropy weighting is a promising research direction. Combining long-term with short-term information improves WER.