Published by Kevin Hardy. Modified over 9 years ago.
Speech Communication Lab, State University of New York at Binghamton

Dimensionality Reduction Methods for HMM Phonetic Recognition

Hongbing Hu, Stephen A. Zahorian
Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
hongbing.hu@binghamton.edu, zahorian@binghamton.edu

Introduction

Accurate Automatic Speech Recognition (ASR)
  Highly discriminative features
    » Incorporate nonlinear frequency scales and time dependency
    » Low dimensionality feature spaces
  Efficient recognition models (HMMs & Neural Networks)

Neural Network Based Dimensionality Reduction
  Neural Networks (NNs) are used to represent complex data while preserving the variability and discriminability of the original data
  Combined with an HMM recognizer to form a hybrid NN/HMM recognition model

NLDA Reduction Overview

Nonlinear Discriminant Analysis (NLDA)
  A multilayer neural network performs a nonlinear feature transformation of the input speech features
  Phone models for the transformed features are built using HMMs, with each state modeled by a GMM (Gaussian Mixture Model)
  PCA performs a Karhunen-Loeve (KL) transform to reduce the correlation of the network outputs

[Figure labels: Dimensionality Reduction; Reduction & Decorrelation; Dimensionality Reduction; Decorrelation]

NLDA1 Method

Dimensionality Reduced Features
  Obtained at the output layer of the neural network
  Feature dimensionality is further reduced by PCA
Node Nonlinearity (Activation Function)
  In the feature transformation, a linear function is used for the output layer and a sigmoid nonlinear function for the other layers
  In the NLDA1 training, all layers are nonlinear

[Figure labels: Original Features; PCA; Network Outputs; Dimensionality Reduced Features]

NLDA2 Method

Reduced Features
  Use the outputs of the network's middle hidden layer
  The reduced dimensionality is determined by the number of middle nodes, giving flexibility in reduced feature dimensionality
  The linear PCA is used only for feature decorrelation
Nonlinearity
  All nonlinear layers are used in both the feature transformation and network training

[Figure labels: PCA; Dimensionality Reduced Outputs; Dimensionality Reduced Features]

Training Neural Networks

Phone Level Targets
  Each NN output corresponds to a specific phone
  Straightforward to implement using a phonetically labeled training database
  But why should the NN output be forced to the same value for the entire phone?
State Level Targets
  Each NN output corresponds to a single state of a phone HMM
  But how should state boundaries be determined?
    o Estimate using a percentage of total length
    o Use an initial training iteration, then Viterbi alignment
  For 3-state models, train using "Don't cares"
    o For the 1st portion, the target is "1" for state 1 and "Don't care" for states 2 and 3
    o For the 2nd portion, the target is "1" for state 2 and "Don't care" for states 1 and 3
    o For the 3rd portion, the target is "1" for state 3 and "Don't care" for states 1 and 2

Experimental Setup

TIMIT Database ('SI' and 'SX' only)
  48 phoneme set mapped down from 62 phoneme set
  Training Data: 3696 sentences (460 speakers)
  Testing Data: 1344 sentences (168 speakers)
DCTC/DCSC Features
  A total of 78 features (13 DCTCs x 6 DCSCs) were computed
  10 ms frames with 2 ms frame spacing, and 8 ms block spacing with 1 s block length
HMM
  3-state left-to-right Markov models with no skip
  48 monophone HMMs were created using HTK (ver 3.4)
  Language model: phone bigram information of the training data
Neural Networks in NLDA
  3 hidden layers: 500-36-500 nodes
  Input layer: 78 nodes corresponding to the feature dimensionality
  Output layer: 48 nodes for the phoneme targets, or 144 nodes for the state level targets

Experimental Results

Control Experiment
  Compare the original DCTC/DCSC features with the PCA and LDA reduced features
  Use various numbers of mixtures in HMMs
  [Figure: Accuracies using the original, PCA and LDA reduced features (20 & 36 dimensions)]
  The original 78-dimensional features yield the highest accuracy of 73.2% using 64-mix HMMs
NLDA Experiment
  Evaluate NLDA1 and NLDA2 with or without PCA
  48-dimensional phoneme level targets used
  The features are reduced to 36 dimensions
  [Figure: Accuracies using the NLDA1 and NLDA2 reduced features, and the reduced features without the PCA processing]
  The middle layer outputs of a network result in more effective features in a reduced space
  The accuracies improved by about 2% with PCA
State Level Target Exp 1
  Compare the state level targets with and without "Don't cares"
  144-dimensional state level targets used
  State boundaries obtained using the fixed state length method (3 states: 1:4:1)
  Network training: 8 x 10^6 weight updates
  [Figure: Accuracies using the state level targets with and without "Don't cares"]
  The state level targets with "Don't cares" result in higher accuracies
  The NLDA2 reduced features achieved a substantial improvement versus the original features
State Level Target Exp 2
  Use a fixed length ratio and the Viterbi alignment for the state targets
  State level targets with "Don't cares" used
  Targets obtained using a fixed length ratio (3 states: 1:4:1) and the Viterbi alignment
  Network training: 4 x 10^7 weight updates
  [Figure: Accuracies; "(R)" indicates a fixed length ratio and "(A)" the Viterbi forced alignment]

Literature Comparison

Recognition accuracy based on TIMIT

  Feature     Recognizer   Acc. (%)   Study
  MFCC        HMM          68.5       Somervuo (2003)
  PLP         MLP-GMM      71.5       Ketabdar et al. (2008)
  LPC         HMM-MLP      74.6       Pinto et al. (2008)
  MFCC        Tandem NN    78.5       Schwarz et al. (2006)
  DCTC/DCSC   HMM          73.9       Zahorian et al. (2009)
  DCTC/DCSC   NN-HMM       74.9       This study

Conclusions

Very high recognition accuracies were obtained using the outputs of the network middle layer, as in NLDA2
The NLDA methods are able to produce a low-dimensional, effective representation of speech features
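The state level "Don't care" targets described under Training Neural Networks, with boundaries from the fixed 1:4:1 length ratio, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function names `state_targets` and `masked_squared_error`, the output ordering (three consecutive outputs per phone), the target of 0 for the states of all other phones, and the use of a mask to drop "Don't care" outputs from a squared-error criterion are all hypothetical choices.

```python
import numpy as np

def state_targets(n_frames, phone_idx, n_phones=48, ratio=(1, 4, 1)):
    """Build targets and a "Don't care" mask for one phone segment.

    Assumed layout (not from the poster): output phone_idx * 3 + s is
    state s of phone phone_idx, giving 48 * 3 = 144 outputs.
    """
    n_states = len(ratio)
    n_out = n_phones * n_states
    # Split the segment into portions proportional to the ratio (1:4:1).
    bounds = np.round(np.cumsum(ratio) / np.sum(ratio) * n_frames).astype(int)
    state_of_frame = np.searchsorted(bounds, np.arange(n_frames), side="right")
    state_of_frame = np.minimum(state_of_frame, n_states - 1)

    targets = np.zeros((n_frames, n_out))
    mask = np.ones((n_frames, n_out))       # 1 = contributes to the error
    base = phone_idx * n_states
    rows = np.arange(n_frames)
    mask[:, base:base + n_states] = 0.0     # states of this phone: "Don't care"...
    mask[rows, base + state_of_frame] = 1.0 # ...except the active state
    targets[rows, base + state_of_frame] = 1.0
    return targets, mask

def masked_squared_error(outputs, targets, mask):
    """Mean squared error over the outputs that are not "Don't care"."""
    return np.sum(mask * (outputs - targets) ** 2) / np.sum(mask)

# Example: a 12-frame segment of (hypothetical) phone index 5.
targets, mask = state_targets(12, phone_idx=5)
```

With 12 frames the 1:4:1 ratio assigns 2, 8 and 2 frames to the three states. For each frame, the two inactive states of the same phone are masked out, so the network output there is free to take any value.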
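The PCA (Karhunen-Loeve) step applied to the network outputs can be illustrated with a short NumPy sketch. It assumes a plain eigendecomposition of the sample covariance; the function names `fit_pca` and `apply_pca` are hypothetical, and the random matrix is only a stand-in for real network outputs. Keeping 36 of 48 components mirrors the NLDA1 setting, while keeping all components corresponds to the decorrelation-only use in NLDA2.

```python
import numpy as np

def fit_pca(X, n_components=None):
    """Estimate the mean and the leading principal directions of X (frames x dims)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending variance
    basis = eigvecs[:, order[:n_components]]    # None keeps every component
    return mean, basis

def apply_pca(X, mean, basis):
    """Project frames onto the principal directions (decorrelates the features)."""
    return (X - mean) @ basis

# Stand-in data: 500 frames of correlated 48-dimensional "network outputs".
rng = np.random.default_rng(0)
outputs = rng.standard_normal((500, 48)) @ rng.standard_normal((48, 48))

# NLDA1-style use: reduce 48 output-layer values to 36 dimensions.
mean, basis = fit_pca(outputs, n_components=36)
reduced = apply_pca(outputs, mean, basis)

# NLDA2-style use: keep the dimensionality and only decorrelate.
mid = outputs[:, :36]                           # stand-in for 36 middle-layer outputs
mean2, basis2 = fit_pca(mid)
decorrelated = apply_pca(mid, mean2, basis2)
```

In both cases the covariance matrix of the projected features is diagonal up to numerical precision, which suits HMM states modeled with diagonal-covariance Gaussian mixtures, a common choice that the poster does not state explicitly.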