1 Robust HMM classification schemes for speaker recognition using integral decode Marie Roch Florida International University
2 Who am I?
3 Speaker Recognition Verification Identification Text Dependent Text Independent Types of speaker recognition
4 Speaker Recognition Why is it hard? Minimal training data Background noise Transducer mismatch Channel distortions People’s voices change over time and under stress Performance
5 Feature Extraction Extract speech Spectral analysis Cepstrum: Cepstral means removal
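As a minimal sketch of the cepstral means removal step named on this slide: the per-coefficient mean over the utterance is subtracted from every frame, which removes a stationary channel offset in the cepstral domain. The function name and the list-of-lists frame layout are assumptions for illustration, not taken from the talk.

```python
def cepstral_mean_removal(frames):
    """Subtract the utterance-level mean of each cepstral coefficient.

    frames is a list of equal-length lists, one list of cepstral
    coefficients per analysis frame (an assumed layout).
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient across all frames of the utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # Remove that mean from every frame.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

After this step each coefficient has zero mean over the utterance, so a constant additive bias in the cepstra no longer affects the scores.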
6 Hidden Markov Models Statistical pattern recognition State dependent modeling –Distribution/state –Radial basis functions common State sequence unobservable
7 HMM Efficient decoders: Training –EM algorithm –Convergence to local maxima guaranteed
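The decoder formula on this slide did not survive extraction. As one standard example of an efficient HMM decoder, the sketch below computes the log likelihood of an observation sequence with the forward algorithm in the log domain. It assumes the per-frame emission log densities have already been evaluated; all names are illustrative, and the talk may have used a different decoder such as Viterbi.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def forward_log_likelihood(log_init, log_trans, log_emit):
    """Log likelihood of a sequence under an HMM via the forward recursion.

    log_init[j]     : log P(first state is j)
    log_trans[i][j] : log P(next state is j | current state is i)
    log_emit[t][j]  : log p(o_t | state j), precomputed per frame
    """
    n_states = len(log_init)
    alpha = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n_states)])
            + log_emit[t][j]
            for j in range(n_states)
        ]
    # Sum over the final state to get the sequence likelihood.
    return logsumexp(alpha)
```

The recursion costs on the order of (number of states squared) times (number of frames), instead of enumerating every state sequence.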
8 Recognition Model for each speaker Maximum a posteriori (MAP) decision rule Arg Max Features Models Scores
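The arg max step on this slide can be sketched as follows: each speaker model produces a log score for the feature sequence, an optional log prior is added, and the speaker with the highest total is chosen. The function name and the dictionary interface are assumptions for illustration.

```python
def identify_speaker(log_likelihoods, log_priors=None):
    """Return the speaker whose model gives the highest log score.

    log_likelihoods maps a speaker name to log p(features | speaker model).
    log_priors optionally maps a speaker name to log P(speaker); when it is
    omitted, all speakers are treated as equally likely.
    """
    if not log_likelihoods:
        raise ValueError("need at least one speaker model")
    scores = {}
    for name, log_lik in log_likelihoods.items():
        log_prior = 0.0 if log_priors is None else log_priors[name]
        scores[name] = log_lik + log_prior
    # Arg max over the per-speaker scores.
    return max(scores, key=scores.get)
```

With equal priors the rule reduces to picking the model with the highest likelihood.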
9 The MAP decision rule Optimal decision rule provided we have accurate distribution parameters & observations. Problem: –Corruption of feature vectors. –Distribution known to be inaccurate.
10 A case of mistaken identity
11 Integral decode Goal: Include uncorrupted observation ô_t. Problem: ô_t is unobservable. Determine a local neighborhood about o_t and use a priori information to weight the likelihood:
12 Integral decode issues Problems approximating the integral –High frame rate * number of models –Non-trivial dimensionality Selection of the neighborhood
13 Approximating the integral Monte Carlo impractical Use simplified cubature technique:
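The cubature formula from this slide was not preserved. The sketch below is a generic stand-in rather than the rule used in the talk: it approximates a uniformly weighted average of a diagonal-covariance Gaussian likelihood over a neighborhood by evaluating the density at the observed vector and at one point on either side along each axis. The density model, the uniform weights, the point placement, and all names are assumptions.

```python
import math

def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def neighborhood_likelihood(obs, radius, mean, var):
    """Average the density over 2*D + 1 points around obs.

    radius[d] is the half-width of the neighborhood along dimension d
    (assumed axis-aligned). Every point receives the same weight.
    """
    points = [list(obs)]
    for d in range(len(obs)):
        for sign in (-1.0, 1.0):
            shifted = list(obs)
            shifted[d] += sign * radius[d]
            points.append(shifted)
    return sum(diag_gaussian_pdf(p, mean, var) for p in points) / len(points)
```

The point count grows linearly with the feature dimension, which is the practical appeal of such simplified rules over a full grid or Monte Carlo sampling at every frame and for every model.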
14 Neighborhood choice Choosing an appropriate neighborhood: –Upper bound difference neighborhoods [Merhav and Lee 93] –Error source modeling
15 Upper bound difference neighborhoods Arbitrary signal pairs with a few general conditions. PSD Cepstra
16 Taking the upper bound Asymptotic difference between cepstral parameters:
17 Error source modeling Multiple error sources Simplifying assumption of one normal distribution with zero mean Use time series analysis to estimate the noise Trend
18 Error Source Modeling Estimate variance from detrended signal
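A minimal sketch of estimating a noise variance from a detrended signal. The talk does not say which trend model was used, so a least-squares straight line is assumed here purely for illustration; the function name is also hypothetical.

```python
def detrended_variance(series):
    """Variance of the residual after removing a least-squares linear trend.

    series is one feature coefficient sampled at consecutive frames.
    The linear trend is an assumed model, not the one from the talk.
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least three samples")
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [y - (intercept + slope * t) for t, y in enumerate(series)]
    # Two parameters were fitted, so divide by n - 2.
    return sum(r * r for r in residuals) / (n - 2)
```

For example, `detrended_variance([1.0, 0.0, 1.0, 4.0])` removes the line `y = t` and returns the variance of the remaining `[1, -1, -1, 1]` pattern.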
19 Error source modeling Problem: – is infinite Solution: –Most of the points are outliers –Set percentage of distribution beyond which points are culled.
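One way to read the culling step, sketched under the zero-mean normal assumption stated earlier: choose the fraction of the distribution to keep and convert it to a finite half-width, so that everything beyond it is discarded. The function name and the default coverage of 0.95 are hypothetical choices, not values from the talk.

```python
from statistics import NormalDist

def neighborhood_radius(sigma, coverage=0.95):
    """Half-width r such that a zero-mean normal with standard deviation
    sigma places the given fraction of its mass inside [-r, +r]."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    # The central interval leaves (1 - coverage) / 2 in each tail.
    return NormalDist(mu=0.0, sigma=sigma).inv_cdf(0.5 + coverage / 2.0)
```

With `sigma=1.0` and `coverage=0.95` this gives the familiar value of about 1.96 standard deviations.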
20 Complexity of integration Expensive Ways to reduce/cope –Implemented Top K processing Principal Components Analysis –Possible Gaussian Selection Sub-band Models SIMD or MIMD parallelism
21 Top K Processing 1 second 3 seconds 5 seconds
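The slides do not spell out how Top K processing was applied. One plausible reading, sketched here as an assumption, is to rank all speakers with an inexpensive score and run the expensive decode only for the K best candidates. The function and parameter names are hypothetical.

```python
def top_k_rescore(cheap_scores, expensive_score, k):
    """Rescore only the k best candidates from a cheap first pass.

    cheap_scores maps a speaker name to an inexpensive log score.
    expensive_score is a callable taking a speaker name and returning
    the costly log score (for instance an integral decode).
    """
    if k < 1 or not cheap_scores:
        raise ValueError("need k >= 1 and at least one candidate")
    # Keep the k highest-scoring speakers from the cheap pass.
    shortlist = sorted(cheap_scores, key=cheap_scores.get, reverse=True)[:k]
    rescored = {name: expensive_score(name) for name in shortlist}
    best = max(rescored, key=rescored.get)
    return best, rescored
```

The expensive scorer is called k times instead of once per enrolled speaker, at the risk of discarding the true speaker if the cheap pass ranks that speaker below the cutoff.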
22 Principal Component Analysis Choose P most important directions
23 Principal Component Analysis Integrate using new basis set for step function
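Choosing the P most important directions can be sketched with a covariance matrix and power iteration with deflation. This is a standard-library illustration only, with assumed names; a production system would more likely call a linear algebra library, and the talk does not state how the decomposition was computed.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of feature vectors given as equal-length lists."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def principal_directions(cov, p, iterations=200):
    """Return up to p (eigenvalue, unit direction) pairs, largest first."""
    dim = len(cov)
    work = [row[:] for row in cov]
    found = []
    for _component in range(p):
        # Fixed, slightly asymmetric start vector (an arbitrary choice).
        v = [1.0 + 0.1 * d for d in range(dim)]
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(dim) for j in range(dim))
        if eigenvalue < 1e-12:
            break  # no variance left to explain
        found.append((eigenvalue, v))
        # Deflate: remove the component just found from the matrix.
        for i in range(dim):
            for j in range(dim):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return found
```

Integrating along only the P returned directions, instead of every original feature axis, reduces the number of points at which each likelihood must be evaluated.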
24 Speech Corpus King-92 –Used San Diego subset 26 male speakers Long distance telephone speech Quiet room environment 5 sessions recorded one week apart –Sessions 1-3 used for training –Sessions 4-5 partitioned into test segments
25 Baseline performance
26 Integral decode performance 1 second 3 seconds 5 seconds
27 Integral decode with other conditions Performance on –high quality speech –transducer mismatch
28 Future work Extensions to the integral decode –Automatic parameter selection –Gaussian selection –distributed computation Efficient multiple class preclassifiers
30 Optimal/utterance hyperparameters – 5 seconds KingNB26 KingWB51 SpidreF18XDR SpidreM27XDR
31 95% Confidence Intervals Caveat: –Per speaker means –Large granularity
32 Pattern Recognition Long term statistics [Bricker et al 71, Markel et al 77] Vector Quantization [Soong et al 87] HMM [Rosenberg et al 90, Tishby 91, Matsui & Furui 92, Reynolds et al 95] Connectionist frameworks Feed forward [Oglesby & Mason 90] Learning vector quantization [He et al 99]
33 Pattern Recognition Contd. Hybrid/Modified HMMs Min Classification Error discriminant [Liu et al 95] Tree structured neural classifiers [Liou & Mammone 95] Trajectory modeling [Russell et al 85, Liu et al 95, Ostendorf et al 96, He et al 99] Sub-band recognition [Besacier & Bonastre 97]