1
Speaker Adaptation for Vowel Classification
Xiao Li Electrical Engineering Dept.
2
Outline
  Introduction
  Background on statistical classifiers
  Proposed adaptation strategies
  Experiments and results
  Conclusion
3
Application
  “Vocal Joystick” (VJ)
    Human-computer interaction for people with motor impairments
    Acoustic parameters – energy, pitch, vowel quality, discrete sound
  Vowel classification
    Vowels: /ae/ (bat); /aa/ (bought); /uh/ (boot); /iy/ (beat)
    Control motion direction
  [Figure: motion directions labeled /ae/, /aa/, /uh/, /iy/]
4
Features
  Formants
    Peaks in spectrum
    Low dimension (F1, F2, F3, F4 + dynamics)
    Hard to estimate
  Mel-frequency cepstral coefficients (MFCC)
    Cosine transform of log spectrum
    High dimension (26 including deltas)
    Easy to compute
  Our choice – MFCCs
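The MFCC pipeline named on this slide (log spectrum followed by a cosine transform) can be sketched in NumPy as below. The frame length, hop, FFT size, filter count, and the split of the 26 dimensions into 13 static coefficients plus 13 deltas are all assumptions for illustration; the slides state only the total dimension.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge of the triangle
            fbank[i, k] = (right - k) / (right - center)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Static cepstra: framing, window, power spectrum, mel filters, log, DCT-II."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    n = np.arange(n_filters)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_mel @ dct_basis.T

def add_deltas(ceps):
    """Append first-order differences (assumed form of the 'deltas')."""
    padded = np.pad(ceps, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0
    return np.hstack([ceps, delta])

# Hypothetical input: one second of a 220 Hz tone standing in for a vowel.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 220.0 * t)
features = add_deltas(mfcc(signal, sample_rate))
print(features.shape)
```

Each row of `features` is one frame; with 13 static coefficients and their deltas, the feature dimension matches the 26 quoted on the slide.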
5
User-Independent vs. User-Dependent
  User-independent models
    NOT optimized for a specific speaker
    Easy to get a large train set
  User-dependent models
    Optimized for a specific speaker
    Difficult to get a large train set
6
Adaptation
  What is adaptation?
    Adapting user-independent models to a specific user, using a small set of user-dependent data
  Adaptation methodology for vowel classification
    Train speaker-independent vowel models
    Ask a speaker to articulate a few seconds of vowels for each class
    Adapt the classifier on this small amount of speaker-dependent data
8
Gaussian mixture models (GMM)
  Generative models
  Training objective – maximum likelihood (EM)
    For training samples O1:T
  Classification
    Compute the likelihood score for each class, and choose the class with the highest likelihood
  Limitations
    A class model is trained using only the data in this class
    Constraints on the discriminant functions
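The classification rule above (score every class model, pick the highest likelihood) can be sketched as follows. The sketch assumes diagonal-covariance GMMs that are already trained, and sums frame log-likelihoods over a segment; the two toy class models and their parameters are hypothetical, not taken from the slides.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of frames X (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = X[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_gauss = -0.5 * np.sum(diff ** 2 / variances
                              + np.log(2.0 * np.pi * variances), axis=2)
    a = np.log(weights) + log_gauss                               # (T, M)
    a_max = a.max(axis=1, keepdims=True)                          # log-sum-exp trick
    frame_ll = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return frame_ll.sum()

def classify(X, class_models):
    """Return the label whose GMM gives the highest likelihood for X."""
    scores = {label: gmm_log_likelihood(X, *params)
              for label, params in class_models.items()}
    return max(scores, key=scores.get)

# Hypothetical two-component models for two vowel classes in a 2-D feature space.
weights = np.array([0.5, 0.5])
variances = np.ones((2, 2))
class_models = {
    "/iy/": (weights, np.array([[0.0, 0.0], [1.0, 1.0]]), variances),
    "/aa/": (weights, np.array([[5.0, 5.0], [6.0, 6.0]]), variances),
}
rng = np.random.default_rng(0)
X = rng.normal(loc=5.5, scale=0.5, size=(20, 2))   # frames near the /aa/ model
predicted = classify(X, class_models)
```

Because each class model is scored independently, nothing in this rule uses data from competing classes, which is the limitation the slide points out.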
9
Neural Networks (NN)
  Three-layer perceptrons
    # input nodes – feature dimension x window size
    # hidden nodes – empirically chosen
    # output nodes – # of classes
  Training objective
    Minimum relative entropy between the targets yk and the network outputs
  Classification
    Compare the output values
  Advantages
    Discriminative training
    Nonlinearity
    Features taken from multiple frames
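A minimal sketch of the network described above: frames are stacked over a window to form the input, passed through one hidden layer, and scored with a softmax output. The tanh hidden activation, the softmax output, and the random weights are assumptions; the slides give only the layer sizes and the relative-entropy objective.

```python
import numpy as np

def stack_window(features, window=7):
    """Concatenate each frame with its neighbors: (T, D) -> (T, D * window)."""
    half = window // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)] for i in range(window)])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    """Return hidden activations and class posteriors."""
    hidden = np.tanh(x @ W1 + b1)          # assumed nonlinearity
    return hidden, softmax(hidden @ W2 + b2)

def relative_entropy(targets, outputs):
    """Mean KL divergence; for one-hot targets this equals cross-entropy."""
    return -np.mean(np.sum(targets * np.log(outputs + 1e-12), axis=1))

# Sizes follow the slides: 26-dim features, window 7, 50 hidden nodes, 4 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 26))
x = stack_window(features, window=7)                 # 26 x 7 = 182 input nodes
W1, b1 = rng.normal(scale=0.1, size=(182, 50)), np.zeros(50)
W2, b2 = rng.normal(scale=0.1, size=(50, 4)), np.zeros(4)
hidden, outputs = mlp_forward(x, W1, b1, W2, b2)
predicted = outputs.argmax(axis=1)                   # compare the output values
```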
10
NN-SVM Hybrid Classifier
  Idea – replace the hidden-to-output layer of the NN by linear-kernel SVMs
  Training objective
    Maximum margin, with a theoretical bound on the test error
  Classification
    Compare the output values of the binary classifiers
  Advantages
    Compared to pure NN: optimal solution in the last layer
    Compared to pure SVM: efficiently handles features from multiple frames; no need to choose a kernel
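The classification step of the hybrid can be sketched as below: the trained input-to-hidden layer is kept as a fixed feature map, and one linear SVM per class scores the hidden activations. The one-vs-rest arrangement, the tanh activation, and the toy weights are assumptions made for this illustration.

```python
import numpy as np

def hybrid_predict(x, W1, b1, svm_W, svm_b):
    """Classify with NN hidden features followed by linear SVM scores.

    svm_W has one column per binary (one-vs-rest) SVM; the class whose
    SVM produces the largest output value wins.
    """
    hidden = np.tanh(x @ W1 + b1)          # fixed nonlinear mapping from the NN
    scores = hidden @ svm_W + svm_b        # linear-kernel SVM decision values
    return scores.argmax(axis=1)

# Hypothetical toy weights: 2 inputs, 2 hidden nodes, 2 classes.
W1, b1 = np.eye(2), np.zeros(2)
svm_W = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])
svm_b = np.zeros(2)
x = np.array([[2.0, 0.0],
              [0.0, 2.0]])
predicted = hybrid_predict(x, W1, b1, svm_W, svm_b)
```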
12
MLLR for GMM Adaptation
  Maximum Likelihood Linear Regression
    Apply a linear transformation on the Gaussian mean
    Same transformation for the mixture of Gaussians in the same class
    The covariance matrix can be adapted in a similar fashion, but this is less effective
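A minimal sketch of mean-only MLLR for one class, assuming diagonal covariances and the usual row-by-row closed-form solution with component posteriors taken from the unadapted model. The function names, the toy three-component model, and the simulated speaker offset are hypothetical; the sketch also assumes at least D + 1 components with distinct means so that each row's linear system is solvable.

```python
import numpy as np

def component_posteriors(X, weights, means, variances):
    """gamma[t, m]: posterior of component m for frame t (diagonal Gaussians)."""
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * np.sum(diff ** 2 / variances
                              + np.log(2.0 * np.pi * variances), axis=2)
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def estimate_mllr_mean_transform(X, weights, means, variances):
    """Estimate W (D, D+1) such that the adapted mean is W @ [1, mu]."""
    D = X.shape[1]
    M = means.shape[0]
    gamma = component_posteriors(X, weights, means, variances)
    xi = np.hstack([np.ones((M, 1)), means])       # extended means, (M, D+1)
    occupancy = gamma.sum(axis=0)                  # sum_t gamma_m(t)
    first_order = gamma.T @ X                      # sum_t gamma_m(t) o_t, (M, D)
    W = np.zeros((D, D + 1))
    for i in range(D):                             # one linear system per row
        inv_var = 1.0 / variances[:, i]
        G = (xi * (occupancy * inv_var)[:, None]).T @ xi
        k = xi.T @ (first_order[:, i] * inv_var)
        W[i] = np.linalg.solve(G, k)
    return W

def apply_mean_transform(W, means):
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T

# Hypothetical speaker-independent model and a speaker whose means are offset.
rng = np.random.default_rng(0)
weights = np.full(3, 1.0 / 3.0)
means = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
variances = np.ones((3, 2))
shift = np.array([1.0, -1.0])
X = np.vstack([rng.normal(m + shift, 1.0, size=(200, 2)) for m in means])
W = estimate_mllr_mean_transform(X, weights, means, variances)
adapted_means = apply_mean_transform(W, means)
```

All components of the class share the single matrix `W`, so a few seconds of adaptation data can move every Gaussian mean at once.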
13
MLLR Formulas
  Objective – maximum likelihood
    For adaptation samples O1:T
  First-order derivative vanishes
  The transform W is obtained by solving a linear equation
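For reference, the standard MLLR mean-adaptation formulation that matches these bullet points is sketched below. The symbols $\xi_m$ (extended mean), $c_m$ (mixture weight), and $\gamma_m(t)$ (posterior of component $m$ at frame $t$) are assumed notation, not taken from the slides.

```latex
% Adapted mean of component m, with extended mean vector \xi_m
\hat{\mu}_m = W \xi_m, \qquad \xi_m = \begin{bmatrix} 1 \\ \mu_m \end{bmatrix}

% Objective: maximum likelihood on the adaptation samples O_{1:T}
\hat{W} = \arg\max_{W} \sum_{t=1}^{T} \log \sum_{m} c_m \,
          \mathcal{N}\!\left(o_t ;\, W \xi_m ,\, \Sigma_m\right)

% Setting the first-order derivative of the EM auxiliary function to zero
\sum_{t=1}^{T} \sum_{m} \gamma_m(t)\, \Sigma_m^{-1}\, o_t\, \xi_m^{\top}
  = \sum_{t=1}^{T} \sum_{m} \gamma_m(t)\, \Sigma_m^{-1}\, W\, \xi_m \xi_m^{\top}

% With diagonal covariances, row i of W solves a linear system
G_i\, w_i^{\top} = k_i, \qquad
G_i = \sum_{m} \frac{\sum_{t} \gamma_m(t)}{\sigma_{m,i}^{2}}\, \xi_m \xi_m^{\top}, \qquad
k_i = \sum_{m} \frac{\sum_{t} \gamma_m(t)\, o_{t,i}}{\sigma_{m,i}^{2}}\, \xi_m
```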
14
NN Adaptation
  Idea – fix the nonlinear mapping and adapt the last layer (linear classifier)
  Adaptation objective – minimum relative entropy
  Start from the original weights
  Gradient descent formulas
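The adaptation idea above can be sketched as plain batch gradient descent on the hidden-to-output weights, with the hidden activations treated as fixed inputs. A softmax output layer, the learning rate, the step count, and the random toy data are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def relative_entropy(targets, outputs):
    """Mean KL divergence; equals cross-entropy for one-hot targets."""
    return -np.mean(np.sum(targets * np.log(outputs + 1e-12), axis=1))

def adapt_last_layer(hidden, targets, W2, b2, lr=0.05, n_steps=200):
    """Gradient descent on the last layer only, starting from the original weights."""
    W2, b2 = W2.copy(), b2.copy()
    for _ in range(n_steps):
        outputs = softmax(hidden @ W2 + b2)
        err = (outputs - targets) / len(hidden)   # gradient w.r.t. pre-softmax scores
        W2 -= lr * hidden.T @ err
        b2 -= lr * err.sum(axis=0)
    return W2, b2

# Hypothetical adaptation data: hidden activations of 60 frames, 4 vowel classes.
rng = np.random.default_rng(0)
hidden = np.tanh(rng.normal(size=(60, 10)))
labels = rng.integers(0, 4, size=60)
targets = np.eye(4)[labels]
W2 = rng.normal(scale=0.1, size=(10, 4))          # stands in for the original weights
b2 = np.zeros(4)

loss_before = relative_entropy(targets, softmax(hidden @ W2 + b2))
W2_new, b2_new = adapt_last_layer(hidden, targets, W2, b2)
loss_after = relative_entropy(targets, softmax(hidden @ W2_new + b2_new))
```

Because the input-to-hidden weights stay fixed, only a small number of parameters are re-estimated from the adaptation data.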
15
NN-SVM Classifier Adaptation
  Idea – again fix the nonlinear mapping and adapt the last layer
  Adaptation objective – maximum margin
  Adaptation procedure
    Keep the support vectors of the training data
    Combine these support vectors with the adaptation data
    Retrain the linear-kernel SVMs for the last layer
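The three-step procedure above can be sketched for a single binary SVM as follows. A simple subgradient solver for the regularized hinge loss stands in for a real SVM trainer, "support vectors" are taken to be the training points on or inside the margin, and the 2-D toy data play the role of hidden-layer features; all of these are assumptions for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.05, n_epochs=2000):
    """Subgradient descent on lam/2 * |w|^2 + mean hinge loss; y in {-1, +1}.

    The bias is folded into w as a constant feature. The best iterate is kept.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    best_w, best_obj = w.copy(), np.inf
    for _ in range(n_epochs):
        margins = y * (Xb @ w)
        obj = 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - margins).mean()
        if obj < best_obj:
            best_w, best_obj = w.copy(), obj
        viol = margins < 1.0
        grad = lam * w - (y[viol][:, None] * Xb[viol]).sum(axis=0) / len(Xb)
        w = w - lr * grad
    return best_w

def decision(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def adapt_svm(w_train, X_train, y_train, X_adapt, y_adapt):
    """Keep training support vectors, add adaptation data, retrain."""
    sv = y_train * decision(w_train, X_train) <= 1.0 + 1e-6
    X_new = np.vstack([X_train[sv], X_adapt])
    y_new = np.concatenate([y_train[sv], y_adapt])
    return train_linear_svm(X_new, y_new)

def accuracy(w, X, y):
    return np.mean(np.sign(decision(w, X)) == y)

rng = np.random.default_rng(0)

def two_class_data(center_neg, center_pos, n):
    X = np.vstack([rng.normal(center_neg, 0.5, size=(n, 2)),
                   rng.normal(center_pos, 0.5, size=(n, 2))])
    return X, np.concatenate([-np.ones(n), np.ones(n)])

# Hypothetical data: the new speaker's classes are shifted relative to training.
X_train, y_train = two_class_data([-2.0, 0.0], [2.0, 0.0], 100)
X_adapt, y_adapt = two_class_data([0.5, 0.0], [3.5, 0.0], 15)
X_test, y_test = two_class_data([0.5, 0.0], [3.5, 0.0], 200)

w_si = train_linear_svm(X_train, y_train)
w_adapted = adapt_svm(w_si, X_train, y_train, X_adapt, y_adapt)
acc_si = accuracy(w_si, X_test, y_test)
acc_adapted = accuracy(w_adapted, X_test, y_test)
```

Keeping only the support vectors summarizes the training set compactly, so retraining on them plus a few seconds of adaptation data stays cheap.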
17
Database
  Pure vowel recordings with different energy and pitch
    Duration – long, short
    Energy – loud, normal, quiet
    Pitch – rising, level, falling
  Statistics
    Train set speakers
    Test set – 5 speakers
    4 or 8 or 9 vowel classes
    18 utterances (2000 samples) for each vowel and each speaker
18
Adaptation and Evaluation Set
  6-fold cross-validation for each speaker
    18 utterances are divided into 6 subsets
    We adapt on each subset and evaluate on the rest
    We get 6 accuracy scores for each vowel, and compute the mean and deviation
  Average over 5 speakers
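The protocol above reverses the usual cross-validation roles: the small subset is used for adaptation and the remaining utterances for evaluation. A minimal sketch, assuming contiguous subsets and the sample standard deviation (the slides do not say how utterances are grouped or which deviation is used); `adapt_and_score` is a hypothetical callback that adapts a classifier and returns its accuracy.

```python
import statistics

def adaptation_splits(utterances, n_folds=6):
    """Yield (adapt, evaluate) pairs: adapt on one subset, evaluate on the rest."""
    fold_size = len(utterances) // n_folds
    for k in range(n_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        adapt = utterances[start:stop]
        evaluate = utterances[:start] + utterances[stop:]
        yield adapt, evaluate

def cross_validated_accuracy(utterances, adapt_and_score, n_folds=6):
    """Return mean and standard deviation of the per-fold accuracy scores."""
    scores = [adapt_and_score(adapt, evaluate)
              for adapt, evaluate in adaptation_splits(utterances, n_folds)]
    return statistics.mean(scores), statistics.stdev(scores)

# 18 utterances per vowel and speaker, identified here by index only.
utterances = list(range(18))
splits = list(adaptation_splits(utterances))
```

With 18 utterances and 6 subsets, each fold adapts on 3 utterances and evaluates on the other 15.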
19
Speaker-Independent Classifiers
  % Accuracy                      4-class       8-class       9-class
  GMM (mixture # = 16)            85.13±0.67    55.88±0.64    51.21±0.54
  NN (window = 7, hidden = 50)    89.19±0.65    60.05±0.72    53.75±0.61
  NN-SVM                          89.89±0.55    --            --
  The individual scores for different speakers vary a lot
  If NN window = 1, the performance is similar to GMM
20
Adapted Classifiers
  % Accuracy                                  4-class       8-class       9-class
  MLLR for GMM                 unadapted      85.13±0.67    55.88±0.64    51.21±0.54
                               adapted        90.73±0.82    67.52±1.27    62.94±1.37
  Gradient descent for NN      unadapted      89.19±0.65    60.05±0.72    53.75±0.61
                               adapted        91.85±1.30    74.33±1.41    71.06±1.62
  Maximum margin for NN-SVM    unadapted      89.89±0.55    --            --
                               adapted        94.70±0.30    --            --
21
Conclusion
  For speaker-independent models, the NN classifier (with multiple-frame input) works well
  For speaker-adapted models, the NN classifier is effective, and NN-SVM so far gets the best performance