Slide 1: Speaker identification and verification using EigenVoices
O. Thyes, R. Kuhn, P. Nguyen, and J.-C. Junqua, ICSLP 2000
Presented by 王瑞璋 (Nick Wang), Philips Research East Asia-Taipei
Speech Processing Laboratory, NTU, 25 October 2000
Slide 2: Speaker identification and verification
– Speaker identification: to identify the speaker, as one of the clients, from speech input
– Speaker verification: to verify that the speaker is the claimed client, from speech input
– Problem definition: the amount of available data per speaker is limited
  – 60 seconds ==> enough to train a GMM
  – 5 seconds ==> not enough to train a GMM, but enough to estimate EigenVoices coefficients
– Aim: to incorporate EigenVoices into GMM speaker modeling
Slide 3: When GMM meets EigenVoices
– GMM: one Gaussian mixture p.d.f. per client
  – e.g. 32 multivariate Gaussian p.d.f.s in a GMM, given acoustic feature vectors of 26 components (13 + 13)
  – model size: 32 x 26 = 832 variables
– EigenVoices: principal axes of the GMM parameter supervectors (see the sketch below)
  – reduce the dimensionality of the GMM model by PCA, LDA, or MLES
  – eliminate the effect of estimation error (noise) by removing the axes with lower variation, keeping the signal ==> subspace selection with SNR > threshold
  – or a fixed dimension for the EigenVoices space: 20 to 70 EigenVoices (the higher-variation axes)
  – a speaker's location in EigenVoices space ==> reconstruct an adapted GMM
  – model size: 20 to 70 variables
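A minimal sketch of the supervector-plus-PCA idea, assuming scikit-learn's GaussianMixture and PCA; the dummy corpus, the mean-only supervectors, and all names (n_mix, feat_dim, n_eigenvoices) are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch: per-speaker GMM mean-supervectors, then PCA to get eigenvoices.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

n_mix, feat_dim = 32, 26       # 32 Gaussians x 26-dim features = 832 variables
n_eigenvoices = 20             # keep only the highest-variation axes

def supervector(features):
    """Train a GMM on one speaker's frames; stack its means into one vector."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type='diag')
    gmm.fit(features)                      # features: (n_frames, feat_dim)
    return gmm.means_.reshape(-1)          # (n_mix * feat_dim,) = (832,)

# One supervector per extra-training speaker (random frames as placeholder data).
train_features = [np.random.randn(2000, feat_dim) for _ in range(100)]
supervectors = np.stack([supervector(f) for f in train_features])

# The eigenvoices are the principal axes of the supervector cloud.
pca = PCA(n_components=n_eigenvoices)
pca.fit(supervectors)
eigenvoices = pca.components_              # (20, 832)
mean_voice = pca.mean_                     # (832,)
```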
Slide 4: When GMM meets EigenVoices
– Benefits of the principal axes
  – robust & fast: keeping the higher-variation axes produces less estimation error and shows improvement immediately
  – an explicit and compact representation of the speaker distribution (vs. MAP or MLLR)
  – more applications: e.g. SPID, telephony, embedded systems, ...
– Corpora
  – extra training data (to train the SI model and/or the EigenVoices): large amounts of data from a large and diverse set of speakers
  – client data (to train the client models): small amounts of data per speaker
  – test set: small amounts of data per speaker (from clients or impostors)
Slide 5: When GMM meets EigenVoices
Training procedure (a simplified sketch follows this list):
– train a GMM for each speaker in the extra training data (large amounts of data per speaker)
– train the EigenVoices (principal axes of the GMMs) using PCA, LDA, or MLES on the model-parameter supervectors
– apply environmental adaptation to all EigenVoices, using MLLR on the pooled client data
– apply MLED to estimate the eigen-coefficients for each client (small amounts of data per speaker)
– compose each client's model from the EigenVoices and that client's coefficients
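The paper estimates the eigen-coefficients with MLED, an EM procedure over the GMM posteriors; the sketch below substitutes a plain least-squares projection as a hypothetical stand-in (valid only because PCA axes are orthonormal), and reuses the eigenvoices and mean_voice arrays from the previous sketch:

```python
# Sketch: client coefficients by projection (MLED stand-in), then model
# composition as mean voice + weighted sum of eigenvoices.
import numpy as np

n_mix, feat_dim = 32, 26                   # same illustrative sizes as above

def client_coefficients(client_sv, eigenvoices, mean_voice):
    """Coordinates of the client in eigenvoice space: 20-70 numbers."""
    return eigenvoices @ (client_sv - mean_voice)     # (n_eigenvoices,)

def compose_client_means(coeffs, eigenvoices, mean_voice):
    """Reconstruct the adapted GMM means from the eigen-coefficients."""
    sv = mean_voice + coeffs @ eigenvoices            # back to (832,)
    return sv.reshape(n_mix, feat_dim)                # 32 adapted Gaussian means
```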
Slide 6: Speaker identification/verification
– Measurements (a code sketch of these rules follows)
  – eigenDistance decoding: eigenDist(test, client), the test speaker's distance from the client speaker in eigenspace
  – eigenGMM decoding: eigenGMM_client(test), the test speaker's likelihood under the client's eigen-adapted GMM
– Speaker identification
  – decision(test) = argmin_client eigenDist(test, client)
  – or decision(test) = argmax_client eigenGMM_client(test)
– Speaker verification
  – decision(test, claim) = accept if eigenDist(test, claim) < threshold, otherwise reject
  – or decision(test, claim) = accept if eigenGMM_claim(test) > threshold, otherwise reject
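A minimal sketch of the two decoding rules, assuming Euclidean distance for eigenDist and scikit-learn GMMs for the likelihood scores; all function and parameter names are illustrative:

```python
# Sketch: eigenDistance and eigenGMM decoding for identification/verification.
import numpy as np

def identify_by_distance(test_coeffs, client_coeffs):
    """argmin over clients of the distance in eigenvoice space."""
    dists = {c: np.linalg.norm(test_coeffs - v) for c, v in client_coeffs.items()}
    return min(dists, key=dists.get)

def identify_by_likelihood(test_frames, client_gmms):
    """argmax over clients of the eigen-adapted GMM log-likelihood."""
    scores = {c: g.score(test_frames) for c, g in client_gmms.items()}
    return max(scores, key=scores.get)

def verify_by_distance(test_coeffs, claim_coeffs, threshold):
    """Accept iff the test speaker lies close enough to the claimed client."""
    return np.linalg.norm(test_coeffs - claim_coeffs) < threshold
```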
Slide 7: Experiments
– Setup: corpora
  – TIMIT: mismatched extra training data; 630 speakers x 10 sentences
  – YOHO: extra training, client, and test data; 82 speakers x 96 sentences
– Results for abundant (360 sec) enrollment data in SPID
  – 82 clients, 360 seconds of enrollment speech each; 5 seconds of test speech
  – GMM baseline: 98.8% correct identification
  – no eigenGMM model beats the GMM under the constraint of at most 71 EigenVoices, since the enrollment data is sufficient and the model is restricted to 71 of 832 axes
  – the best eigenGMM result is 98.0%: LDA EigenVoices, 71 axes (the maximum), eigenGMM decoding
Slide 8: Experiments
– Results for sparse (10 sec) enrollment data in SPID
Slide 9: Experiments
– Results for sparse (10 sec) enrollment data in speaker verification
  – an SI impostor model for eigenGMM decoding
  – 40 EigenVoices on 64-Gaussian GMM supervectors, over 72 speakers
  – EigenVoices helps (configurations shown: LDA-EigenVoices, eigenDistance decoding)
Slide 10: Experiments
– Results for matched/mismatched extra training data in SPID
  – MLLR adaptation helps to resolve the environment mismatch
  – TIMIT is not suitable for LDA-EigenVoices because of: only 10 sentences per speaker; more allophonic variability
Slide 11: Conclusions
– EigenVoices provides a confined subspace.
  – For abundant client data, it is worse than a conventional GMM because of the loss of degrees of freedom.
  – For sparse client data, it performs better than a conventional GMM.
  – For eigenDistance speaker verification, there is no need for an impostor model to normalize for utterance-likelihood dependencies: the eigenspace itself implicitly normalizes for utterance likelihood, since two utterances with very different likelihoods may map to the same point in the eigenspace.
– Environment mismatch hurts the client models, even with MLLR adaptation applied.
– LDA for EigenVoices generation will not work if there are few utterances per speaker, or if there is strong allophonic variability.
Slide 12: Comments & my future work
– Since EigenVoices is a confinement, can we enlarge the speaker models before applying it?
  – GMM: makes no use of fine speech structure
  – LVCSR (segmentation => adaptation => SA score difference from the SI score): using speech-structure information hurt speaker-recognition performance
  – Sequential Non-Parametric (SNP) or DTW distances: SNP+GMM works best overall
– Try EigenMLLR speaker recognition: 1/1000 of the memory requirement of EigenVoices
– Split the test data into several fragments, each one very small
  – eigenspace decoding
  – joint decision