Speaker Adaptation for Vowel Classification
Xiao Li
Electrical Engineering Dept.
Outline
- Introduction
- Background on statistical classifiers
- Proposed adaptation strategies
- Experiments and results
- Conclusion
Application: the "Vocal Joystick" (VJ)
- Human-computer interaction for people with motor impairments
- Acoustic parameters: energy, pitch, vowel quality, discrete sounds
- Vowel classification
  - Vowels: /ae/ (bat), /aa/ (bought), /uh/ (boot), /iy/ (beat)
  - The four vowels control motion direction: /ae/, /aa/, /uh/, /iy/
Features
- Formants
  - Peaks in the spectrum
  - Low dimension (F1, F2, F3, F4 + dynamics)
  - Hard to estimate
- Mel-frequency cepstral coefficients (MFCC)
  - Cosine transform of the log spectrum
  - High dimension (26 including deltas)
  - Easy to compute
- Our choice: MFCCs
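As a concrete illustration of the feature pipeline, here is a minimal sketch of 26-dimensional MFCC+delta extraction. The slides do not name a toolkit; librosa, the 16 kHz sampling rate, and the 13+13 static/delta split are our assumptions.

```python
# Minimal MFCC+delta extraction sketch (librosa assumed, not from the slides).
import librosa
import numpy as np

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)             # mono audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients
    delta = librosa.feature.delta(mfcc)                  # first-order dynamics
    feats = np.vstack([mfcc, delta])                     # 26 dims, as on the slide
    return feats.T                                       # shape: (frames, 26)
```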
User-Independent vs. User-Dependent
- User-independent models
  - Not optimized for a specific speaker
  - Easy to collect a large training set
- User-dependent models
  - Optimized for a specific speaker
  - Difficult to collect a large training set
Adaptation
- What is adaptation?
  - Adapting user-independent models to a specific user, using a small set of user-dependent data
- Adaptation methodology for vowel classification
  1. Train speaker-independent vowel models
  2. Ask a speaker to articulate a few seconds of vowels for each class
  3. Adapt the classifier on this small amount of speaker-dependent data
Outline
- Introduction
- Background on statistical classifiers
- Proposed adaptation strategies
- Experiments and results
- Conclusion
Gaussian Mixture Models (GMM)
- Generative models
- Training objective: maximum likelihood over the training samples O_{1:T}, optimized with EM
- Classification: compute the likelihood score of each class and choose the one with the highest likelihood
- Limitations
  - A class model is trained using only the data from that class
  - Constraints on the form of the discriminant functions
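The per-class GMM classifier can be sketched in a few lines. This uses scikit-learn's GaussianMixture as a stand-in for the actual training code; the 16-component setting follows the experiment slides, while the diagonal covariances, function names, and data layout are illustrative assumptions.

```python
# One GMM per vowel class, trained by EM; classify by highest log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_class, n_mix=16):
    """features_by_class: dict mapping class label -> (frames, 26) array."""
    return {c: GaussianMixture(n_components=n_mix, covariance_type='diag').fit(X)
            for c, X in features_by_class.items()}

def classify(gmms, frames):
    # Sum per-frame log-likelihoods over the utterance O_{1:T} for each class.
    scores = {c: g.score_samples(frames).sum() for c, g in gmms.items()}
    return max(scores, key=scores.get)
```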
Neural Networks (NN)
- Three-layer perceptron
  - # input nodes: feature dimension x window size
  - # hidden nodes: chosen empirically
  - # output nodes: number of classes
- Training objective: minimum relative entropy to the targets y_k
- Classification: compare the output values
- Advantages
  - Discriminative training
  - Nonlinearity
  - Features taken from multiple frames
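A minimal sketch of this network, using the sizes from the slides (26-dim features, 7-frame window, 50 hidden units). PyTorch, the sigmoid hidden activation, and the optimizer settings are our assumptions; note that minimum relative entropy to 1-of-K targets y_k reduces to cross-entropy.

```python
# Three-layer perceptron: input = feature dim x window size, 50 hidden units,
# one output node per vowel class.
import torch
import torch.nn as nn

n_classes = 4
net = nn.Sequential(
    nn.Linear(26 * 7, 50),    # input nodes = feature dim x window size
    nn.Sigmoid(),             # nonlinear hidden layer
    nn.Linear(50, n_classes)  # output nodes = number of classes
)
loss_fn = nn.CrossEntropyLoss()  # relative entropy to 1-of-K targets y_k
opt = torch.optim.SGD(net.parameters(), lr=0.01)

def train_step(x, y):         # x: (batch, 182), y: (batch,) class indices
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
    return loss.item()
```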
NN-SVM Hybrid Classifier
- Idea: replace the hidden-to-output layer of the NN with linear-kernel SVMs
- Training objective: maximum margin, with a theoretical guarantee on the test-error bound
- Classification: compare the output values of the binary classifiers
- Advantages
  - Compared to a pure NN: the last layer is trained to an optimal solution
  - Compared to a pure SVM: efficiently handles features from multiple frames; no need to choose a kernel
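A sketch of the hybrid, continuing the PyTorch network above: freeze the input-to-hidden mapping and fit linear-kernel SVMs on the hidden activations. scikit-learn's SVC and the exact binary decomposition are our assumptions.

```python
# NN-SVM hybrid sketch: keep the trained net's nonlinear mapping, replace the
# hidden-to-output layer with linear-kernel SVMs (SVC retains its support
# vectors, which the adaptation step later reuses).
import torch
from sklearn.svm import SVC

def hidden_activations(net, x):
    # Output of the sigmoid hidden layer, with gradients disabled.
    with torch.no_grad():
        return net[1](net[0](x)).numpy()

def train_hybrid(net, x_train, y_train):
    h = hidden_activations(net, x_train)
    svm = SVC(kernel='linear')   # maximum-margin linear classifiers
    svm.fit(h, y_train)          # binary SVMs combined for multiclass
    return svm
```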
Outline
- Introduction
- Background on statistical classifiers
- Proposed adaptation strategies
- Experiments and results
- Conclusion
MLLR for GMM Adaptation
- Maximum Likelihood Linear Regression
- Apply a linear transformation to the Gaussian means
- The same transformation is shared by all Gaussians in a class's mixture
- The covariance matrices can be adapted in a similar fashion, but this is less effective
MLLR Formulas
- Objective: maximum likelihood of the adaptation samples O_{1:T}
- The adapted mean is \hat{\mu}_m = W \xi_m, where \xi_m = [1, \mu_m^T]^T is the extended mean vector
- Setting the first-order derivative of the log-likelihood to zero gives a linear equation, which is solved for the transform W
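For diagonal-covariance Gaussians, the vanishing-derivative condition decouples into one linear system per row of W, which is the standard MLLR mean-transform solution. A numpy sketch under that assumption (variable names and data layout are illustrative; gamma holds E-step occupation probabilities):

```python
# MLLR mean-transform sketch for diagonal-covariance Gaussians.
import numpy as np

def estimate_mllr(O, gamma, means, variances):
    """O: (T, d) adaptation frames; gamma: (T, M); means, variances: (M, d)."""
    T, d = O.shape
    M = means.shape[0]
    xi = np.hstack([np.ones((M, 1)), means])   # extended means [1, mu]
    occ = gamma.sum(axis=0)                    # (M,) total occupancy per Gaussian
    W = np.zeros((d, d + 1))
    for i in range(d):
        inv_var = 1.0 / variances[:, i]
        # G_i = sum_m (occ_m / sigma^2_{m,i}) xi_m xi_m^T
        G = (xi * (occ * inv_var)[:, None]).T @ xi
        # k_i = sum_m (1 / sigma^2_{m,i}) (sum_t gamma_{t,m} o_{t,i}) xi_m
        k = xi.T @ (inv_var * (gamma * O[:, [i]]).sum(axis=0))
        W[i] = np.linalg.solve(G, k)           # one linear system per row
    return W                                   # adapted mean: W @ [1, mu_m]
```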
NN Adaptation
- Idea: fix the nonlinear mapping and adapt only the last (linear) layer
- Adaptation objective: minimum relative entropy
- Start gradient descent from the original weights
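Continuing the PyTorch sketch above, last-layer adaptation amounts to freezing the hidden layer and running gradient descent on the output weights, initialized at the speaker-independent values. The learning rate and batching are placeholders.

```python
# Last-layer NN adaptation sketch: fix the nonlinear input-to-hidden mapping,
# then gradient-descend on the hidden-to-output weights only.
for p in net[0].parameters():
    p.requires_grad_(False)                    # freeze the nonlinear mapping

opt_adapt = torch.optim.SGD(net[2].parameters(), lr=0.01)  # last layer only

def adapt_step(x_adapt, y_adapt):
    # One gradient step on a small batch of speaker-dependent data.
    opt_adapt.zero_grad()
    loss = loss_fn(net(x_adapt), y_adapt)      # same relative-entropy objective
    loss.backward()
    opt_adapt.step()
    return loss.item()
```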
NN-SVM Classifier Adaptation
- Idea: again, fix the nonlinear mapping and adapt only the last layer
- Adaptation objective: maximum margin
- Adaptation procedure
  1. Keep the support vectors of the training data
  2. Combine these support vectors with the adaptation data
  3. Retrain the linear-kernel SVMs of the last layer
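A sketch of that procedure, continuing the hybrid above. h_train and h_adapt are hidden-layer activations from the fixed nonlinear mapping; the scikit-learn API is our assumption.

```python
# NN-SVM adaptation sketch: keep only the training support vectors, pool them
# with the speaker's adaptation data, and retrain the linear-kernel SVMs.
import numpy as np
from sklearn.svm import SVC

def adapt_nnsvm(svm_old, h_train, y_train, h_adapt, y_adapt):
    sv = svm_old.support_                            # support-vector indices
    h_pool = np.vstack([h_train[sv], h_adapt])       # SVs + adaptation data
    y_pool = np.concatenate([y_train[sv], y_adapt])
    return SVC(kernel='linear').fit(h_pool, y_pool)  # maximum-margin retrain
```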
Outline
- Introduction
- Background on statistical classifiers
- Proposed adaptation strategies
- Experiments and results
- Conclusion
Database
- Pure vowel recordings with varying duration, energy, and pitch
  - Duration: long, short
  - Energy: loud, normal, quiet
  - Pitch: rising, level, falling
- Statistics
  - Training set: 10 speakers
  - Test set: 5 speakers
  - 4, 8, or 9 vowel classes
  - 18 utterances (2000 samples) for each vowel and each speaker
Adaptation and Evaluation Setup
- 6-fold cross-validation for each speaker
  - The 18 utterances are divided into 6 subsets
  - We adapt on each subset and evaluate on the rest
  - This yields 6 accuracy scores per vowel, from which we compute the mean and standard deviation
- Results are averaged over the 5 test speakers
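The protocol in code form, as a hedged sketch: the fold split and the adapt_fn/eval_fn hooks are placeholders for the adaptation and scoring routines above.

```python
# 6-fold adaptation/evaluation sketch: adapt on one subset of utterances,
# test on the remaining five, and report mean and deviation over the folds.
import numpy as np

def six_fold_scores(utterances, adapt_fn, eval_fn, n_folds=6):
    folds = np.array_split(np.arange(len(utterances)), n_folds)
    scores = []
    for fold in folds:
        adapt_set = [utterances[i] for i in fold]
        test_set = [u for i, u in enumerate(utterances) if i not in fold]
        model = adapt_fn(adapt_set)              # adapt on the small subset
        scores.append(eval_fn(model, test_set))  # accuracy on the rest
    return np.mean(scores), np.std(scores)
```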
Speaker-Independent Classifiers

  % Accuracy                  4-class       8-class       9-class
  GMM (16 mixtures)           85.13±0.67    55.88±0.64    51.21±0.54
  NN (window=7, hidden=50)    89.19±0.65    60.05±0.72    53.75±0.61
  NN-SVM                      89.89±0.55    --            --

- The individual scores vary considerably across speakers
- With a 1-frame input window, the NN performs similarly to the GMM
Adapted Classifiers

  % Accuracy (speaker-independent → adapted)
                              4-class                    8-class                    9-class
  MLLR for GMM                85.13±0.67 → 90.73±0.82    55.88±0.64 → 67.52±1.27    51.21±0.54 → 62.94±1.37
  Gradient descent for NN     89.19±0.65 → 91.85±1.30    60.05±0.72 → 74.33±1.41    53.75±0.61 → 71.06±1.62
  Maximum margin for NN-SVM   89.89±0.55 → 94.70±0.30    --                         --
Conclusion
- For speaker-independent models, the NN classifier (with multiple-frame input) works well
- After speaker adaptation, the NN classifier is effective, and the NN-SVM achieves the best performance so far