Speaker Identification: Speaker Modeling Techniques and Feature Sets
Alex Park
SLS Affiliates Meeting, December 16, 2002
Advisor: T.J. Hazen

Overview
  Modeling Techniques
    Baseline GMM (ASR Independent)
    Speaker Adaptive (ASR Dependent)
    Score Combination
    Multiple Utterance Results in Mercury
  Feature set experiments
    Comparison of Formants and F0 vs. MFCCs in TIMIT
  Current and Future work

Modeling - Baseline (GMM) Scoring
  Training
    Input waveforms for speaker “i” split into fixed-length frames
    Feature vectors computed from each frame of speech
    GMMs trained from set of feature vectors
    One GMM per speaker
  Testing
    Input feature vectors scored against each speaker GMM
    Frame scores for each speaker summed over entire utterance
    Highest total score is hypothesized speaker
[Figure: training utterance mapped into the feature space to give the GMM for speaker “i”; test utterance frames x1, x2 scored against that GMM and summed to give the score for speaker “i”]
-Feature vectors are composed of MFCC measurements from points in the frame
-Data for all speech is lumped into a single probability model for that speaker
-At test time, the score for a particular speaker is found by scoring feature vectors from the test segment against the GMM for that speaker

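The test-time procedure above (score each frame against every speaker GMM, sum over the utterance, pick the highest total) can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes frame scores are log-likelihoods under diagonal-covariance GMMs, and the model parameters and test frames are made-up toy values.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_frame_score(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # log-sum-exp keeps the mixture sum numerically stable
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def utterance_score(frames, gmm):
    """Frame scores summed over the entire utterance."""
    return sum(gmm_frame_score(x, gmm) for x in frames)

def identify(frames, speaker_gmms):
    """Return (hypothesized speaker, total score per speaker)."""
    totals = {spk: utterance_score(frames, g) for spk, g in speaker_gmms.items()}
    return max(totals, key=totals.get), totals

# Toy 2-D models with two mixture components each (hypothetical values).
speaker_gmms = {
    "i": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "j": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}
test_frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
best, totals = identify(test_frames, speaker_gmms)
```

Here the test frames lie near the components of speaker “i”, so that speaker receives the highest total score.
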
Modeling – Speaker Adaptive Scoring
  Training
    Build speaker-dependent (SD) recognition models for each speaker
  Testing
    Get best hypothesis from recognizer using speaker-independent (SI) models
    Rescore hypothesis with SD models
    Compute total speaker adapted score by interpolating SD score with SI score
[Figure: test utterance “fifty-five” decoded by SUMMIT (SI models) into a phone sequence with SI scores x1 ... x9; the SD models for speaker “i” rescore the same phones to give SD scores y1 ... y9; the two sets of scores are interpolated with weight λi to give the speaker adapted score for speaker “i”]
-Interpolation factor is proportional to how much data there is for a particular phone model for speaker i
-Speaker adapted score is used rather than the speaker dependent score so that scores are more robust for phones that have little speaker dependent training data

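A rough sketch of the interpolation step, under stated assumptions: the slide only says the interpolation factor grows with the amount of SD data for a phone, so the form n / (n + tau) used here, the constant `tau`, and the choice to interpolate the per-phone scores directly are all illustrative guesses. The phone labels, scores, and counts are toy values.

```python
def interpolation_weight(n_frames, tau=5.0):
    """Weight placed on the SD score for one phone model.

    Assumed form n / (n + tau): 0 with no SD data, approaching 1 as the
    amount of SD training data grows. tau is a hypothetical constant.
    """
    return n_frames / (n_frames + tau)

def speaker_adapted_score(si_scores, sd_scores, phones, sd_counts, tau=5.0):
    """Interpolate SD and SI scores phone by phone and sum over the hypothesis."""
    total = 0.0
    for x, y, phone in zip(si_scores, sd_scores, phones):
        lam = interpolation_weight(sd_counts.get(phone, 0), tau)
        total += lam * y + (1.0 - lam) * x  # little SD data -> lean on the SI score
    return total

# Toy hypothesis with three phones (all numbers hypothetical).
phones = ["f", "ih", "v"]
si_scores = [-2.0, -3.0, -1.0]   # x: scores from the SI models
sd_scores = [-1.0, -1.0, -5.0]   # y: scores from speaker i's SD models
sd_counts = {"f": 95, "ih": 5}   # speaker i has no training data for "v"
score = speaker_adapted_score(si_scores, sd_scores, phones, sd_counts)
```

With no SD data for "v", its weight is 0 and the poorly trained SD score is ignored in favor of the SI score, which is the robustness argument made in the notes.
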
Modeling - Score Combination and Specifics
  Speaker ID done in two passes
    Initial n-best list computed using GMM speaker models
    N-best list rescored using second stage models
  N-best pruning reduces computation for refined models
  Refined models can use recognition results
  Multiple speaker ID techniques can be combined in 2nd stage
    e.g., Multigrained + Speaker Adapted
[Figure: the test utterance feeds the 1st stage, where ASR produces a word hypothesis (“fifty”) and GMM SID produces a speaker N-best list (1. speaker “i”, 2. speaker “j”, ...); in the 2nd stage, refined speaker ID (Classifier1, Classifier2, Classifier3) produces a rescored N-best list (1. speaker “k”, 2. speaker “j”, ...)]
-Overview of what is actually done in the system
-GMM speaker ID and speech recognizer both run in parallel, fairly quickly
-Two reasons for the two-pass design:
  1. Later models need the recognition output of SUMMIT
  2. N-best list pruning reduces the space of speakers that the more refined models need to search
-An added advantage is that it is easy to combine the scores of multiple 2nd stage scoring techniques

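The two-pass flow can be sketched as below. This is only an illustration: the slide does not say how second-stage scores are combined, so the weighted sum, the equal weights, and the lookup-table "classifiers" are assumptions, and all scores are toy values.

```python
def first_stage_nbest(gmm_scores, n):
    """Keep the n speakers with the highest first-stage (GMM) scores."""
    ranked = sorted(gmm_scores, key=gmm_scores.get, reverse=True)
    return ranked[:n]

def second_stage_rescore(nbest, classifiers, weights):
    """Rescore only the N-best speakers with a weighted sum of refined scores."""
    combined = {
        spk: sum(w * clf(spk) for clf, w in zip(classifiers, weights))
        for spk in nbest
    }
    reranked = sorted(combined, key=combined.get, reverse=True)
    return reranked, combined

# First-stage GMM scores for four enrolled speakers (hypothetical).
gmm_scores = {"i": -10.0, "j": -11.0, "k": -12.0, "m": -30.0}
nbest = first_stage_nbest(gmm_scores, n=3)

# Stand-ins for refined models, which in the real system would use the
# recognizer's word hypothesis; here they are plain score tables.
multigrained = {"i": -9.0, "j": -8.5, "k": -7.0}
speaker_adapted = {"i": -9.5, "j": -8.0, "k": -6.5}
classifiers = [lambda s: multigrained[s], lambda s: speaker_adapted[s]]

reranked, combined = second_stage_rescore(nbest, classifiers, [0.5, 0.5])
```

Speaker "m" is pruned after the first stage, so the refined classifiers never have to score it, and the rescored list can reorder the survivors (here "k" moves to the top).
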
Modeling – Single Utterance Results
  Compared modeling techniques on single utterances in YOHO and Mercury
  For YOHO, a standard speaker verification corpus, identification error rates were very low
    GMM (0.83%), Speaker Adaptive (0.31%)
  For Mercury, an in-house corpus, identification error rates were much higher
    GMM (22.4%), Speaker Adaptive (27.8%)
    Higher error rates likely due to effects of varied recording conditions and spontaneous speech
  Found that combination of classifiers lowered error rates in both domains
    YOHO: GMM + Phonetically Structured GMM (0.25%)
    Mercury: GMM + Phonetically Structured GMM (18.3%)

Modeling - Multiple Utterances Results
  On multiple utterances, speaker adaptive scoring achieves lower error rates than next best individual method
  Relative error rate reductions of 28%, 39%, and 53% on 3, 5, and 10 utterances compared to baseline
[Figure: multiple-utterance results; labeled values 11.6%, 5.5%, 14.3%, 10.3%, 13.1%, 7.4%]
-In YOHO, best reported results on the identification task are around 0.7% error rate
-Most techniques beat that error rate, but the significance of these numbers has to be examined
-Large difference in error rates between YOHO and Mercury attributed to the relative difficulty of the two tasks (refer to earlier)
-Speaker adapted method not performing as well in Mercury can be due to word recognition errors, which it is more prone to than the other methods
-Score combination lowers error rates

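For reference, a relative error rate reduction is the drop in error divided by the baseline error. The short sketch below applies that definition to the single-utterance figures quoted earlier in the talk (baseline GMM versus the combined classifiers); pairing those particular numbers is an illustrative choice, not a result reported on this slide.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error rate reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Single-utterance identification error rates (%) from the earlier slide.
yoho = relative_reduction(0.83, 0.25)      # about 69.9% relative reduction
mercury = relative_reduction(22.4, 18.3)   # about 18.3% relative reduction
```
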
Feature Set – Outline and Experiments
  Compared performance of non-MFCC features in mismatched conditions
  Used formants and F0, computed offline using ESPS
  Global speaker GMMs trained using formants and F0 values and trajectories in voiced regions
  Evaluation performed using closed set speaker ID task on TIMIT and NTIMIT
  Mismatched conditions used to evaluate robustness of feature extraction in telephone conditions

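One way to picture "values and trajectories in voiced regions" is sketched below. Two assumptions are made that the slide does not confirm: unvoiced frames are marked by an F0 of 0 in the tracker output, and a trajectory is approximated by the frame-to-frame difference within a voiced region. The F0 track is a toy example.

```python
def voiced_regions(f0_track):
    """Split a per-frame F0 track into runs of consecutive voiced frame indices.

    Assumes unvoiced frames are marked with F0 == 0.
    """
    regions, current = [], []
    for idx, f0 in enumerate(f0_track):
        if f0 > 0:
            current.append(idx)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions

def value_and_trajectory_features(track, f0_track):
    """Pair each voiced frame's value with its first difference.

    Differences never cross a region boundary; the first frame of each
    voiced region gets a delta of 0.0.
    """
    feats = []
    for region in voiced_regions(f0_track):
        prev = None
        for idx in region:
            value = track[idx]
            delta = 0.0 if prev is None else value - prev
            feats.append((value, delta))
            prev = value
    return feats

# Toy F0 track in Hz; zeros mark unvoiced frames (hypothetical values).
f0 = [0.0, 120.0, 122.0, 125.0, 0.0, 0.0, 110.0, 108.0]
feats = value_and_trajectory_features(f0, f0)
```

The same function can be applied to a formant track by passing it as `track` while still using the F0 track to decide which frames are voiced.
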
Feature Set – Results in Mismatched Conditions
  Compared performance of non-MFCC features in mismatched conditions
  Identification Accuracy (%)
    Trained on TIMIT, tested on TIMIT
      Baseline (MFCC)   100.0
      F1, F2, F3, F4     64.6
      F1, F2             24.7
      F1                 10.7
      F2                  9.2
      F3                 18.8
      F4                 26.8
      F0                 45.5
    Trained on TIMIT, tested on NTIMIT (seven values in slide order; one of the eight entries is absent, so the column alignment is not recoverable)
      1.2  4.8  9.9  9.0  6.0  3.0  39.1
  MFCCs not well estimated in mismatched conditions
    53% accuracy when trained and tested on NTIMIT
  F3 and F4 perform better in matched conditions, but have greater degradation in accuracy than F1 and F2
    F3 and F4 performance degradations likely due to band-limiting in NTIMIT
  F0 has best individual performance with the least degradation

Extensions and Future Work
  Explored additional scoring strategies for verification
  Currently incorporating speaker recognition into existing applications
    Using speaker verification with Orion (Hazen)
    Combining speaker ID with face ID on iPAQ for Oxygen (Hazen and Weinstein)
  Use phone-specific models for formant and F0 features
  Incorporate duration as an additional feature for speaker adaptive approach