Dimension-Decoupled Gaussian Mixture Model for Short Utterance Speaker Recognition
Thilo Stadelmann, Bernd Freisleben, Ralph Ewerth
University of Marburg, Germany
International Conference on Pattern Recognition, Istanbul, Turkey, 24 August 2010
Content
1. Introduction
2. Related Work
3. Idea and Justification
4. Implementation
5. Results
6. Conclusions
Introduction: Why speaker recognition?
• Speaker recognition is useful for:
  – User authentication (e.g., telephone services)
  – Video indexing by person (e.g., movies)
  – Preprocessing for automatic speech recognition (e.g., speaker adaptation)
• These scenarios have in common:
  – Additional training and testing data is unavailable… E.g., movies: typical speaker turn duration of 1-2 s
  – …or costly. E.g., access control and speaker adaptation: the user has to provide enrollment data, but just wants to proceed with his/her actual purpose
Introduction: And why on short utterances?
• But typical speaker recognition systems need:
  – 30-100 s of speech on average for training (evaluation: 10 s)
  – 7-10 s as a minimum for training in specialized forensic software
• => Furui ["40 Years of…", 2009]: "The most pressing issues […] for speaker recognition are rooted in […] insufficient data."
Related Work: How is this dealt with normally?
• Use of additional data, assumptions or modalities:
  – Anchor models, phonetic structure [Merlin et al., 1999]
  – Speech content, word dependencies [Larcher et al., 2008]
  – Video in multimodal data streams [Larcher et al., 2008]
  – Subspace models, confidence intervals [Vogt et al., 2008/2009]
• These are combined with the typical Gaussian Mixture Model (GMM) approach
Idea and Justification: How is this dealt with here?
• The typical approach to speaker recognition is to use a statistical voice model
• If it is possible to find a similar model formulation using fewer parameters…
  => less data is necessary for reliable estimates
  => side effect: improved runtime with a compact model
• The typical (almost omnipresent) statistical voice model is the GMM
  => optimize the GMM to use fewer parameters
Idea and Justification: Idea
Observations:
• Some dimensions (e.g., 0, 1, 4, 7) are multimodal/skewed => they need a Gaussian mixture to be modeled accurately
• Some dimensions (6, 11, 13, 18) look Gaussian themselves => why spend the parameters of 31 more mixtures on them?
[Figure: per-dimension plot of a 32-mixture GMM modeling 19-dimensional MFCCs; upper blue curve: joint density]
Idea and Justification: Justification
• Idea: model each dimension individually with the optimal number of mixtures for its marginal density
• Promising:
  – LPCCs are similar to MFCCs in the (non-)Gaussianity of individual dimensions
  – LSPs are Gaussian-like in every dimension
  – Pitch is quite non-Gaussian
  – Combinations are common => the method is generally applicable
• Permissible:
  – The standard GMM already treats dimensions as decorrelated via diagonal covariance
  – Chances are that additionally treating them as independent doesn't miss important information for speaker recognition
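The difference between the two model formulations can be written out as follows. This is a sketch in assumed notation (D feature dimensions, M mixtures in the standard GMM, M_d mixtures in dimension d of the decoupled model); the symbols are not taken from the slides.

```latex
% Standard GMM with diagonal covariances: one shared mixture over all D dimensions
p(\mathbf{x}) = \sum_{m=1}^{M} w_m \prod_{d=1}^{D}
    \mathcal{N}\!\left(x_d;\, \mu_{m,d},\, \sigma_{m,d}^{2}\right)

% Dimension-decoupled model: an independent 1-D mixture per dimension,
% each with its own number of mixtures M_d
p_{\mathrm{DD}}(\mathbf{x}) = \prod_{d=1}^{D} \sum_{m=1}^{M_d} w_{d,m}\,
    \mathcal{N}\!\left(x_d;\, \mu_{d,m},\, \sigma_{d,m}^{2}\right)
```

In the first form every dimension pays for all M mixtures; in the second, a dimension whose marginal density looks Gaussian can use M_d = 1.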
Implementation: How is it put into existence?
• Wrapper around an existing GMM implementation:
  – Build a single GMM per dimension of the feature set
  – Optimize the number of mixtures in each dimension via the Bayesian Information Criterion (BIC)
  – Apply an orthogonal transform prior to training/testing to further decorrelate the data
• => The Dimension-Decoupled GMM (DD-GMM) is a tuple (#mixtures, GMM) per dimension plus a transformation matrix
• => easy to integrate with existing GMM implementations
• => combinable with other short utterance schemes from related work
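The wrapper idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1-D EM routine stands in for the "existing GMM implementation", the BIC form (`-2 * log-likelihood + parameters * log(n)`) and the cap `max_k=4` are assumed, all function names are hypothetical, and the orthogonal transform step is omitted.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_logpdf(x, gmm):
    weights, means, variances = gmm
    p = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return math.log(max(p, 1e-300))  # guard against log(0) for far outliers

def fit_gmm_1d(xs, k, iters=50):
    """Plain EM for a k-mixture 1-D GMM (stand-in for an existing GMM library)."""
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-12)
    floor = 1e-3 * grand_var  # variance floor keeps components from collapsing
    weights = [1.0 / k] * k
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]  # quantile init
    variances = [grand_var] * k
    for _ in range(iters):
        resp = []  # E-step: responsibility of each mixture for each sample
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(p), 1e-300)
            resp.append([pj / total for pj in p])
        for j in range(k):  # M-step: re-estimate weight, mean, variance
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, floor)
    return weights, means, variances

def bic(xs, gmm):
    k = len(gmm[0])
    loglik = sum(mixture_logpdf(x, gmm) for x in xs)
    return -2.0 * loglik + (3 * k - 1) * math.log(len(xs))  # lower is better

def train_dd_gmm(vectors, max_k=4):
    """Return one (#mixtures, GMM) tuple per feature dimension."""
    model = []
    for d in range(len(vectors[0])):
        column = [v[d] for v in vectors]
        candidates = [fit_gmm_1d(column, k) for k in range(1, max_k + 1)]
        best = min(candidates, key=lambda g: bic(column, g))
        model.append((len(best[0]), best))
    return model

def score(vector, model):
    # Independence assumption: the log-likelihood is a sum over dimensions.
    return sum(mixture_logpdf(x, gmm) for x, (_, gmm) in zip(vector, model))

# Synthetic demo data: dimension 0 is bimodal, dimension 1 is a single Gaussian.
rng = random.Random(1)
data = [[rng.gauss(rng.choice([-3.0, 3.0]), 0.5), rng.gauss(0.0, 1.0)] for _ in range(400)]
dd_gmm = train_dd_gmm(data)
mixtures_per_dim = [k for k, _ in dd_gmm]
```

On this synthetic data the bimodal dimension should receive at least two mixtures, while the Gaussian-looking one can get by with fewer. In a real system `fit_gmm_1d` would be replaced by a call into the existing GMM code, which is what keeps the wrapper small.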
Results: Speaker recognition performance
• Up to 45% removal: nearly no difference
• >50% removal: DD-GMM is 7.56% (avg.) better than the best competitor with the same amount of data
• >50% removal: DD-GMM is as good as the best competitor with 4.17% (avg.) less data
• The effect is stronger when only the training data is reduced
• The effect is still visible when only the test data is reduced
[Figure: speaker identification rate on TIMIT vs. % of training/test data removed (100%: ca. 20/5 s)]
• Competitors in the 630-speaker identification experiment:
  – BIC-GMM: GMM with the number of multimodal mixtures optimized via BIC
  – 32-GMM: multimodal GMM with always 32 mixtures
  – DD-GMM: new dimension-decoupled GMM
Results: Evolution of parameter count
• DD-GMM uses 90.98% (avg.) fewer parameters than BIC-GMM
• Best in literature [Liu and He, 1999]: 75%, without a better identification rate
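To see where a reduction of this size can come from, the parameter counts of the two model shapes can be compared directly. The sketch below is illustrative only: the counting convention (one weight, mean and variance per mixture and dimension, transformation matrix not counted) is an assumption, the per-dimension mixture counts are hypothetical, and the baseline is a fixed 32-mixture GMM over 19 dimensions rather than the BIC-GMM used in the reported 90.98% figure.

```python
def full_gmm_params(mixtures, dims):
    # Diagonal-covariance GMM: per mixture, dims means + dims variances + 1 weight.
    return mixtures * (2 * dims + 1)

def dd_gmm_params(mixtures_per_dim):
    # Decoupled model: per 1-D mixture, 1 mean + 1 variance + 1 weight.
    return sum(3 * k for k in mixtures_per_dim)

# Hypothetical mixture counts for 19 MFCC dimensions (not from the slides):
# more mixtures for the multimodal dimensions 0, 1, 4, 7 and a single
# Gaussian for the Gaussian-looking dimensions 6, 11, 13, 18.
hypothetical_counts = [4, 3, 2, 2, 4, 2, 1, 3, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1]

baseline = full_gmm_params(32, 19)
decoupled = dd_gmm_params(hypothetical_counts)
reduction = 1 - decoupled / baseline

print(f"32-GMM: {baseline} parameters, DD-GMM: {decoupled} parameters, "
      f"reduction: {reduction:.1%}")
```

The point of the comparison is structural: the standard model multiplies the mixture count by the dimensionality, while the decoupled model only sums per-dimension counts.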
Results: Run time
• DD-GMM training time: 2.5 times longer than 32-GMM (best), but still 3.5 times faster than real time
• DD-GMM test time: 2.1 times faster than BIC-GMM (best), that is, 54.5% of real time
• The test phase is more relevant in practice (it occurs more frequently)
Conclusions: What remains
• DD-GMM gives more reliable speaker recognition results when data is lacking
• DD-GMM is computationally more efficient when data is plentiful
• DD-GMM performs speaker recognition where standard GMM approaches are no longer usable
  – >80% identification rate with <5.5/1.3 s of training/test data
• DD-GMM is easy to integrate with other systems
  – The wrapper comprises effectively 80 lines of code around an existing GMM
  – The approach is supplemental to other short utterance schemes
• Future work:
  – Apply and test on other features and data sets beyond speaker recognition