1 Speaker Verification via Kernel Methods
Institute of Information Science, Academia Sinica, Taiwan
Speaker: Yi-Hsiang Chao
Advisor: Hsin-Min Wang
2 OUTLINE
Current Methods for Speaker Verification
Proposed Methods for Speaker Verification
Kernel Methods for Speaker Verification
Experiments
Conclusions
3 What is speaker verification?
Goal: to determine whether a speaker is who he or she claims to be.
Speaker verification is a hypothesis testing problem. Given an input utterance U, two hypotheses are considered:
H0: U is from the target speaker (the null hypothesis).
H1: U is not from the target speaker (the alternative hypothesis).
Mathematically, H0 and H1 can be represented by parametric models, denoted $\lambda$ and $\bar{\lambda}$, respectively; $\bar{\lambda}$ is often called an anti-model.
The Likelihood Ratio (LR) test:
$L(U) = \frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda})} \;\; \begin{cases} \ge \theta & \text{accept } H_0 \\ < \theta & \text{accept } H_1 \end{cases}$  (1)
4 Current Methods for Speaker Verification
$\bar{\lambda}$ is usually ill-defined, since H1 does not involve any specific speaker and thus lacks explicit data for modeling.
Many approaches have been proposed to characterize H1.
One simple approach is to train a single speaker-independent model $\Omega$, called the world model or the Universal Background Model (UBM) [D. A. Reynolds, et al., 2000]:
$L_1(U) = \frac{p(U \mid \lambda)}{p(U \mid \Omega)}$
The training data are collected from a large number of speakers, generally unrelated to the clients.
5 Current Methods for Speaker Verification
Instead of using a single model, an alternative is to train a set of cohort models $\{\lambda_1, \lambda_2, \ldots, \lambda_B\}$. This gives the following possibilities for computing the LR:
Picking the likelihood of the most competitive model [A. Higgins, et al., 1991]:
$L_2(U) = \frac{p(U \mid \lambda)}{\max_{1 \le i \le B} p(U \mid \lambda_i)}$
Averaging the likelihoods of the B cohort models arithmetically [D. A. Reynolds, 1995]:
$L_3(U) = \frac{p(U \mid \lambda)}{\frac{1}{B}\sum_{i=1}^{B} p(U \mid \lambda_i)}$
Averaging the likelihoods of the B cohort models geometrically [C. S. Liu, et al., 1996]:
$L_4(U) = \frac{p(U \mid \lambda)}{\big(\prod_{i=1}^{B} p(U \mid \lambda_i)\big)^{1/B}}$
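A minimal sketch, assuming per-utterance log-likelihoods are already available, of how L2, L3, and L4 can be computed stably in the log domain (the arithmetic mean of likelihoods uses log-sum-exp):

```python
import numpy as np
from scipy.special import logsumexp

def cohort_log_lrs(client_loglik, cohort_logliks):
    # client_loglik: log p(U | lambda); cohort_logliks: log p(U | lambda_i), i = 1..B.
    cohort_logliks = np.asarray(cohort_logliks)
    B = len(cohort_logliks)
    log_l2 = client_loglik - cohort_logliks.max()                      # most competitive model
    log_l3 = client_loglik - (logsumexp(cohort_logliks) - np.log(B))   # arithmetic mean of likelihoods
    log_l4 = client_loglik - cohort_logliks.mean()                     # geometric mean of likelihoods
    return log_l2, log_l3, log_l4
```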
6 Current Methods for Speaker Verification
Selection of the cohort set. Two cohort selection methods [D. A. Reynolds, 1995] are used:
One selects the B speakers closest to each client (suitable for L2, L3, and L4).
The other selects the B/2 speakers closest to, plus the B/2 speakers farthest from, each client (suitable for L3).
The selection is based on the speaker distance measure [D. A. Reynolds, 1995]:
$d(\lambda_i, \lambda_j) = \log \frac{p(U_i \mid \lambda_i)}{p(U_i \mid \lambda_j)} + \log \frac{p(U_j \mid \lambda_j)}{p(U_j \mid \lambda_i)}$
where $\lambda_i$ and $\lambda_j$ are speaker models trained using the i-th speaker's and the j-th speaker's training utterances, respectively.
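A hedged sketch of the two selection schemes using the distance above; it assumes each speaker model exposes a `.score()` method returning the average frame log-likelihood (as in the earlier GMM sketch), which only scales the distance. The helper names and data structures are ours.

```python
def speaker_distance(gmm_i, gmm_j, utt_i, utt_j):
    # d(lambda_i, lambda_j) = log[p(U_i|lambda_i)/p(U_i|lambda_j)]
    #                       + log[p(U_j|lambda_j)/p(U_j|lambda_i)]
    return (gmm_i.score(utt_i) - gmm_j.score(utt_i)) + (gmm_j.score(utt_j) - gmm_i.score(utt_j))

def select_cohort(client_id, models, utts, B=20, mix_closest_farthest=False):
    # models / utts: dicts mapping speaker id -> trained GMM / training features.
    others = [s for s in models if s != client_id]
    dist = {s: speaker_distance(models[client_id], models[s], utts[client_id], utts[s])
            for s in others}
    ranked = sorted(others, key=lambda s: dist[s])    # ascending: closest speakers first
    if mix_closest_farthest:
        return ranked[:B // 2] + ranked[-(B // 2):]   # B/2 closest + B/2 farthest
    return ranked[:B]                                  # B closest speakers
```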
7 The Null Hypothesis Characterization
The client model $\lambda$ is represented by a Gaussian Mixture Model (GMM):
$p(\mathbf{o}_t \mid \lambda) = \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, where $\mathbf{o}_t$ is a feature vector of U.
$\lambda$ can be trained via the ML criterion using the Expectation-Maximization (EM) algorithm.
$\lambda$ can also be derived from the UBM using MAP adaptation (the adapted GMM).
The adapted GMM together with the L1 measure is what we term the GMM-UBM system [D. A. Reynolds, et al., 2000]. GMM-UBM is currently the state-of-the-art approach.
This method is appropriate for the Text-Independent (TI) task. Advantage: it covers unseen data.
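A minimal sketch of mean-only MAP adaptation in the general style of Reynolds et al. (2000); the relevance factor of 16 and the decision to adapt only the means are typical choices, not values stated in these slides, and `ubm` is assumed to be a fitted scikit-learn GaussianMixture.

```python
import numpy as np

def map_adapt_means(ubm, features, relevance=16.0):
    # Adapt the UBM means toward a client's enrollment data; weights and
    # covariances are kept from the UBM in this simplified sketch.
    post = ubm.predict_proba(features)                # (n_frames, K) posteriors Pr(k | o_t)
    n_k = post.sum(axis=0)                            # soft frame counts per mixture
    e_k = post.T @ features / np.maximum(n_k, 1e-10)[:, None]   # first-order statistics E_k[o]
    alpha = n_k / (n_k + relevance)                   # data-dependent adaptation coefficients
    return alpha[:, None] * e_k + (1.0 - alpha)[:, None] * ubm.means_
```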
8 Proposed Methods for Speaker Verification
Motivation: none of the LR measures developed so far has proved to be consistently superior to the others across tasks and applications.
We propose two perspectives in an attempt to better characterize the ill-defined alternative hypothesis:
Perspective 1: optimal combination of the existing LRs.
Perspective 2: design of a novel alternative hypothesis characterization.
9 Perspective 1: The Proposed Combined LR (ICPR 2006)
The pros and cons of the different LR measures motivate us to combine them into a unified framework, by virtue of the complementary information that each LR can contribute.
Given N different LR measures $L_i(U)$, i = 1, 2, ..., N, we define a combined LR measure by
$f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$  (2)
where $\mathbf{x} = [L_1(U), L_2(U), \ldots, L_N(U)]^{T}$ is an N × 1 vector in the space $R^N$, $\mathbf{w} = [w_1, w_2, \ldots, w_N]^{T}$ is an N × 1 weight vector, and b is a bias.
10 Linear Discriminant Classifier
$f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$ forms a so-called linear discriminant classifier.
This classifier translates the goal of solving an LR measure into the optimization of w and b, such that the utterances of clients and impostors can be separated.
To realize this classifier, three distinct data sets are needed:
One for generating each client's model.
One for generating each client's anti-models.
One for optimizing w and b.
11 Linear Discriminant Classifier
The bias b plays the same role as the decision threshold $\theta$ of the LR defined in Eq. (1); it can be determined through a trade-off between false acceptance and false rejection.
The main goal here is therefore to find w. f(x) can be solved via linear discriminant training algorithms, such as:
Fisher's Linear Discriminant (FLD).
Linear Support Vector Machine (linear SVM).
Perceptron.
12 Linear Discriminant Classifier
Using Fisher's Linear Discriminant (FLD)
Suppose the i-th class has $n_i$ data samples $\{\mathbf{x}_1^{(i)}, \ldots, \mathbf{x}_{n_i}^{(i)}\}$, i = 1, 2. The goal of FLD is to seek a direction w such that the following Fisher's criterion function J(w) is maximized:
$J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_b \mathbf{w}}{\mathbf{w}^{T} S_w \mathbf{w}}$
where $S_b$ and $S_w$ are, respectively, the between-class and within-class scatter matrices, defined as
$S_b = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^{T}$
$S_w = \sum_{i=1,2} \sum_{j=1}^{n_i} (\mathbf{x}_j^{(i)} - \mathbf{m}_i)(\mathbf{x}_j^{(i)} - \mathbf{m}_i)^{T}$
where $\mathbf{m}_i$ is the mean vector of the i-th class.
13 Linear Discriminant Classifier
Using Fisher's Linear Discriminant (FLD)
The solution for w that maximizes Fisher's criterion J(w) is the leading eigenvector of $S_w^{-1} S_b$.
For the two-class case, w can be calculated directly as
$\mathbf{w} = S_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$  (3)
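A small numpy sketch of Eq. (3); the ridge term added for numerical stability is our choice, not part of the slides.

```python
import numpy as np

def fld_direction(X1, X2, ridge=1e-6):
    # Two-class FLD, Eq. (3): w = S_w^{-1} (m1 - m2).
    # X1, X2: (n_i, N) arrays of characteristic vectors for the two classes.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.zeros((X1.shape[1], X1.shape[1]))
    for X, m in ((X1, m1), (X2, m2)):
        D = X - m
        Sw += D.T @ D
    return np.linalg.solve(Sw + ridge * np.eye(Sw.shape[0]), m1 - m2)

# Score an utterance with f(x) = w @ x + b, with b tuned on held-out data.
```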
14 Analysis of the Alternative Hypothesis
The LR approaches that have been proposed to characterize H1 can be collectively expressed in the following general form:
$L(U) = \frac{p(U \mid \lambda)}{F\big(p(U \mid \lambda_1), p(U \mid \lambda_2), \ldots, p(U \mid \lambda_N)\big)}$  (4)
where F(·) is some function of the likelihood values of a set of so-called background models $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$.
For example, F(·) can be the average function for L3(U), the maximum for L2(U), or the geometric mean for L4(U), and the background model set can be obtained from a cohort.
A special case arises when F(·) is an identity function and N = 1. In this instance, a single background model is used, as in L1(U).
15 Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP 2006)
We redesign the function F(·) as a weighted geometric mean:
$F\big(p(U \mid \lambda_1), \ldots, p(U \mid \lambda_N)\big) = \prod_{i=1}^{N} p(U \mid \lambda_i)^{w_i}$  (5)
where $\mathbf{w} = [w_1, w_2, \ldots, w_N]^{T}$ is an N × 1 vector and $w_i$ is the weight of the likelihood $p(U \mid \lambda_i)$, i = 1, 2, ..., N.
This function gives the N background models different weights according to their individual contributions to the alternative hypothesis.
It is clear that Eq. (5) is equivalent to a geometric mean function when $w_i = 1/N$ for all i.
It is also clear that Eq. (5) reduces to a maximum function when $w_i = 1$ for the most competitive background model and $w_i = 0$ otherwise.
16 Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP 2006)
By substituting Eq. (5) into Eq. (4), taking the logarithm, and letting $\sum_{i=1}^{N} w_i = 1$, we obtain
$\log L(U) = \mathbf{w}^{T}\mathbf{x} \;\; \begin{cases} \ge \theta & \text{accept } H_0 \\ < \theta & \text{accept } H_1 \end{cases}$  (6)
where $\mathbf{w} = [w_1, w_2, \ldots, w_N]^{T}$ is an N × 1 weight vector and x is an N × 1 vector in the space $R^N$, expressed by
$\mathbf{x} = \Big[\log \frac{p(U \mid \lambda)}{p(U \mid \lambda_1)}, \, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_2)}, \ldots, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_N)}\Big]^{T}$  (7)
17 Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP 2006)
The implicit idea in Eq. (7) is that the speech utterance U can be represented by a characteristic vector x.
If we replace the threshold $\theta$ in Eq. (6) with a bias b, the equation can be rewritten as
$f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$  (8)
Analogous to the combined LR method in Eq. (2), f(x) in Eq. (8) again forms a linear discriminant classifier, which can be solved via linear discriminant training algorithms such as FLD.
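A short sketch of how a characteristic vector per Eq. (7) could be assembled, assuming each model exposes a `.score()` method returning the average frame log-likelihood as in the earlier sketches:

```python
import numpy as np

def characteristic_vector(utterance, client_gmm, background_gmms):
    # Eq. (7): x_i = log p(U | lambda) - log p(U | lambda_i), for N background models.
    client_ll = client_gmm.score(utterance)
    return np.array([client_ll - bg.score(utterance) for bg in background_gmms])

# The verification score is then f(x) = w @ x + b, with w trained by FLD, SVM, or KFD.
```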
18 Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP 2006)
Relation to Perspective 1: the combined LR measure.
If the anti-models $\bar{\lambda}_1, \ldots, \bar{\lambda}_N$ are used instead of the background models in the characteristic vector x defined in Eq. (7), we obtain
$\mathbf{x} = \Big[\log \frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_1)}, \ldots, \log \frac{p(U \mid \lambda)}{p(U \mid \bar{\lambda}_N)}\Big]^{T}$
i.e., a vector of the N (log) LR measures, so f(x) forms a linear combination of N different LR measures, which is the same form as the combined LR measure.
19 Kernel Methods for Speaker Verification
f(x) can be solved via linear discriminant training algorithms. However, such methods assume that the observed data of the different classes are linearly separable, which is not the case in most practical situations involving nonlinearly separable data.
From this point of view, we hope that data from different classes, which are not linearly separable in the original input space $R^N$, can be separated linearly in some implicit, higher-dimensional (possibly infinite-dimensional) feature space F via a nonlinear mapping $\Phi$.
Let $\Phi(\mathbf{x})$ denote the vector obtained by mapping x from $R^N$ to F. Then f(x) can be re-defined as
$f(\mathbf{x}) = \mathbf{w}^{T}\Phi(\mathbf{x}) + b$  (9)
which constitutes a linear discriminant classifier in F.
20 Kernel Methods for Speaker Verification
In practice, it is difficult to determine what kind of mapping $\Phi$ would be applicable, so the computation of $\Phi(\mathbf{x})$ can be infeasible.
We therefore propose using the kernel method: the idea is to characterize the relationship between data samples in F, instead of computing $\Phi(\mathbf{x})$ directly.
This is achieved by introducing a kernel function
$k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x})^{T}\Phi(\mathbf{y})$  (10)
which is the inner product of the two vectors $\Phi(\mathbf{x})$ and $\Phi(\mathbf{y})$ in F.
21 Kernel Methods for Speaker Verification
The kernel function k(·,·) must be symmetric, positive definite, and conform to Mercer's condition. For example:
The dot product kernel: $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^{T}\mathbf{y}$
The d-th degree polynomial kernel: $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^{T}\mathbf{y} + 1)^{d}$
The Radial Basis Function (RBF) kernel: $k(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^{2}}{2\sigma^{2}}\right)$, where σ is a tunable parameter.
Existing kernel-based classification techniques can be applied to implement f(x), such as:
Support Vector Machine (SVM).
Kernel Fisher Discriminant (KFD).
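A small numerical sketch of the RBF and polynomial kernels above, computed as full kernel matrices over numpy arrays of row vectors; the polynomial form with the "+1" offset is one common convention and an assumption here.

```python
import numpy as np

def rbf_kernel_matrix(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of X and Y.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma ** 2))

def poly_kernel_matrix(X, Y, d=2):
    # A d-th degree polynomial kernel in the common form (x^T y + 1)^d.
    return (X @ Y.T + 1.0) ** d
```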
22 Kernel Methods for Speaker Verification
Support Vector Machine (SVM)
Techniques based on SVMs have been successfully applied to many classification and regression tasks.
Conventional LR: if the probabilities were perfectly estimated (which is usually not the case), the Bayes decision rule would be optimal. In theory, however, a better solution is to use a discriminant framework [V. N. Vapnik, 1995].
[S. Bengio, et al., 2001] observed that the probability estimates are not perfect and proposed the adjusted score
$a_1 \log p(U \mid \lambda) - a_2 \log p(U \mid \bar{\lambda}) + b$
where $a_1$, $a_2$, and b are adjustable parameters estimated using an SVM.
23 Kernel Methods for Speaker Verification
Support Vector Machine (SVM)
[S. Bengio, et al., 2001] combined the two scores obtained from the GMM and the UBM with an SVM.
Comparison with our approach:
[S. Bengio, et al., 2001] used only one simple background model, the UBM, to characterize the alternative hypothesis.
Our approach integrates multiple background models into the alternative hypothesis characterization in a more effective and robust way.
24 Kernel Methods for Speaker Verification
Support Vector Machine (SVM)
The goal of SVM is to seek a separating hyperplane in the feature space F that maximizes the margin between the classes.
[Figure: two separating hyperplanes (a) and (b) with their support vectors and margins; the classifier in (b) has a greater separation distance than (a).]
25 Kernel Methods for Speaker Verification
Support Vector Machine (SVM)
Following the theory of SVM, w can be expressed as
$\mathbf{w} = \sum_{j=1}^{l} \alpha_j y_j \Phi(\mathbf{x}_j)$
which yields
$f(\mathbf{x}) = \sum_{j=1}^{l} \alpha_j y_j k(\mathbf{x}_j, \mathbf{x}) + b$
where each training sample $\mathbf{x}_j$ belongs to one of the two classes identified by the label $y_j \in \{-1, 1\}$, j = 1, 2, ..., l.
26 Kernel Methods for Speaker Verification
Support Vector Machine (SVM)
Let $\boldsymbol{\alpha}^{T} = [\alpha_1, \alpha_2, \ldots, \alpha_l]$. Our goal now changes from finding w to finding $\boldsymbol{\alpha}$.
We can find the coefficients $\alpha_j$ by maximizing the objective function
$Q(\boldsymbol{\alpha}) = \sum_{j=1}^{l} \alpha_j - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$
subject to the constraints
$\sum_{j=1}^{l} \alpha_j y_j = 0$ and $0 \le \alpha_j \le C$, j = 1, 2, ..., l,
where C is a penalty parameter. The above optimization problem can be solved using quadratic programming techniques.
27 Kernel Methods for Speaker Verification
Support Vector Machine (SVM)
Note that most $\alpha_j$ are equal to zero; the training samples with non-zero $\alpha_j$ are called support vectors. These few support vectors are the key to deciding the optimal margin between the classes in the SVM.
An SVM with a dot product kernel function, i.e., $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^{T}\mathbf{y}$, is known as a linear SVM.
[Figure: separating hyperplane with support vectors and optimal margin.]
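A minimal sketch of the SVM classifier stage using an off-the-shelf solver (scikit-learn); the kernel choice, C, and gamma values are placeholders, not the settings used in this work.

```python
import numpy as np
from sklearn.svm import SVC

def train_svm(X_client, X_impostor, C=1.0, gamma=0.5):
    # X_client / X_impostor: characteristic vectors (Eq. (7)) from the evaluation data.
    # An RBF-kernel SVM solves the dual problem above; gamma = 1 / (2 sigma^2).
    X = np.vstack([X_client, X_impostor])
    y = np.concatenate([np.ones(len(X_client)), -np.ones(len(X_impostor))])
    clf = SVC(C=C, kernel="rbf", gamma=gamma)
    clf.fit(X, y)
    return clf

# clf.decision_function(x.reshape(1, -1)) plays the role of f(x); accept if it
# exceeds the operating threshold chosen on the evaluation set.
```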
28 Kernel Methods for Speaker Verification
Kernel Fisher Discriminant (KFD)
Alternatively, f(x) can be solved with KFD. In fact, the purpose of KFD is to apply FLD in the feature space F.
We again need to maximize Fisher's criterion
$J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_b^{\Phi} \mathbf{w}}{\mathbf{w}^{T} S_w^{\Phi} \mathbf{w}}$
where $S_b^{\Phi}$ and $S_w^{\Phi}$ are, respectively, the between-class and within-class scatter matrices in F, i.e.,
$S_b^{\Phi} = (\mathbf{m}_1^{\Phi} - \mathbf{m}_2^{\Phi})(\mathbf{m}_1^{\Phi} - \mathbf{m}_2^{\Phi})^{T}$, $\quad S_w^{\Phi} = \sum_{i=1,2} \sum_{j=1}^{n_i} \big(\Phi(\mathbf{x}_j^{(i)}) - \mathbf{m}_i^{\Phi}\big)\big(\Phi(\mathbf{x}_j^{(i)}) - \mathbf{m}_i^{\Phi}\big)^{T}$
where $\mathbf{m}_i^{\Phi}$ is the mean vector of the i-th class in F.
29 Kernel Methods for Speaker Verification
Kernel Fisher Discriminant (KFD)
Let $\{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$ denote all $l = n_1 + n_2$ training samples and let $K_i$ be the $l \times n_i$ kernel matrix between all training samples and those of the i-th class.
According to the theory of reproducing kernels, the solution of w must lie in the span of all training samples mapped into F, so w can be expressed as
$\mathbf{w} = \sum_{j=1}^{l} \alpha_j \Phi(\mathbf{x}_j)$
Accordingly, J(w) can be re-written in terms of $\boldsymbol{\alpha}$. Let $\boldsymbol{\alpha}^{T} = [\alpha_1, \alpha_2, \ldots, \alpha_l]$; our goal therefore changes from finding w to finding $\boldsymbol{\alpha}$, which maximizes
$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{T} M \boldsymbol{\alpha}}{\boldsymbol{\alpha}^{T} N \boldsymbol{\alpha}}$
30 Kernel Methods for Speaker Verification
Kernel Fisher Discriminant (KFD)
where
$M = (\mathbf{M}_1 - \mathbf{M}_2)(\mathbf{M}_1 - \mathbf{M}_2)^{T}$, with $(\mathbf{M}_i)_j = \frac{1}{n_i}\sum_{k=1}^{n_i} k(\mathbf{x}_j, \mathbf{x}_k^{(i)})$, and
$N = \sum_{i=1,2} K_i (I_{n_i} - \mathbf{1}_{n_i}) K_i^{T}$
Here $I_{n_i}$ is an $n_i \times n_i$ identity matrix, and $\mathbf{1}_{n_i}$ is an $n_i \times n_i$ matrix with all entries $1/n_i$.
The solution for $\boldsymbol{\alpha}$ is analogous to the FLD solution in Eq. (3):
$\boldsymbol{\alpha} = N^{-1}(\mathbf{M}_1 - \mathbf{M}_2)$
which is also the leading eigenvector of $N^{-1}M$.
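A rough numpy sketch of the KFD solution following the formulas above; the small ridge added to N for invertibility is a common regularization and our assumption, not something stated on the slide.

```python
import numpy as np

def kfd_alpha(K, n1, reg=1e-3):
    # K: (l, l) kernel matrix over all training samples, the first n1 rows/columns
    # belonging to class 1 and the rest to class 2. Returns alpha = N^{-1}(M1 - M2).
    l = K.shape[0]
    class_indices = [np.arange(0, n1), np.arange(n1, l)]
    M_vecs, N = [], np.zeros((l, l))
    for ids in class_indices:
        n_i = len(ids)
        K_i = K[:, ids]                               # l x n_i block
        M_vecs.append(K_i.mean(axis=1))               # (M_i)_j = mean_k k(x_j, x_k^(i))
        center = np.eye(n_i) - np.full((n_i, n_i), 1.0 / n_i)
        N += K_i @ center @ K_i.T
    return np.linalg.solve(N + reg * np.eye(l), M_vecs[0] - M_vecs[1])

def kfd_score(K_test_train, alpha):
    # f(x) = sum_j alpha_j k(x_j, x); the bias b acts as the decision threshold.
    return K_test_train @ alpha
```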
31 Experiments: Formation of the Characteristic Vector
In our methods, we use B + 1 background models, consisting of B cohort models plus one world model $\Omega$, to form the characteristic vector x.
Two cohort selection methods are used in the experiments:
B closest speakers.
B/2 closest speakers + B/2 farthest speakers.
These yield the following two (B+1) × 1 characteristic vectors:
$\mathbf{x}_1 = \Big[\log \frac{p(U \mid \lambda)}{p(U \mid \Omega)}, \, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_{c,1})}, \ldots, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_{c,B})}\Big]^{T}$
$\mathbf{x}_2 = \Big[\log \frac{p(U \mid \lambda)}{p(U \mid \Omega)}, \, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_{c,1})}, \ldots, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_{c,B/2})}, \, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_{f,1})}, \ldots, \log \frac{p(U \mid \lambda)}{p(U \mid \lambda_{f,B/2})}\Big]^{T}$
where $\lambda_{c,i}$ and $\lambda_{f,i}$ are, respectively, the i-th closest model and the i-th farthest model with respect to the client model.
32 Experiments: Detection Cost Function (DCF)
The NIST Detection Cost Function (DCF) reflects performance at a single operating point on the DET curve. The DCF is defined as
$C_{DET} = C_{Miss} \, P_{Miss} \, P_{Target} + C_{FalseAlarm} \, P_{FalseAlarm} \, (1 - P_{Target})$
where $P_{Miss}$ and $P_{FalseAlarm}$ are the miss probability and the false-alarm probability, $C_{Miss}$ and $C_{FalseAlarm}$ are the respective relative costs of the detection errors, and $P_{Target}$ is the a priori probability of the specific target speaker.
A special case of the DCF is known as the Half Total Error Rate (HTER), where $C_{Miss}$ and $C_{FalseAlarm}$ are both equal to 1 and $P_{Target} = 0.5$, i.e.,
$HTER = \frac{P_{Miss} + P_{FalseAlarm}}{2}$
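A short sketch of how the DCF and HTER above can be evaluated from verification scores; the default cost and prior values shown are common NIST settings given only for illustration, since the slides do not state which values were used.

```python
import numpy as np

def detection_costs(scores_target, scores_impostor, threshold,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    # P_miss / P_fa at one operating point, plus the DCF and HTER defined above.
    p_miss = np.mean(np.asarray(scores_target) < threshold)
    p_fa = np.mean(np.asarray(scores_impostor) >= threshold)
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    hter = 0.5 * (p_miss + p_fa)
    return p_miss, p_fa, dcf, hter
```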
33 Experiments: XM2VTSDB
The XM2VTSDB database is divided into three subsets:
"Training" subset: to build each individual client's model and anti-models.
"Evaluation" subset: to estimate $\boldsymbol{\alpha}$, w, and b.
"Test" subset: for the performance evaluation.
Each subject's speech content consists of three sentences:
1. "0 1 2 3 4 5 6 7 8 9".
2. "5 0 6 9 2 8 1 3 7 4".
3. "Joe took father's green shoe bench out".
34 Experimental Results (ICPR 2006): XM2VTSDB
For Perspective 1: the proposed combined LR.
Figure 1. Baselines vs. the combined LRs: DET curves for the "Test" subset.
Further analysis of the results via the Equal Error Rate (EER) showed that KFD achieved a 13.2% relative improvement (EER = 4.6%), compared with 5.3% for L3(U).
35 Experimental Results (submitted to ISCSLP 2006): XM2VTSDB
For Perspective 2: the novel alternative hypothesis characterization.
A 30.68% relative improvement was achieved by KFD_w_20c, compared with L3_10c_10f, the best baseline system.
36 Experimental Results (submitted to ISCSLP 2006): XM2VTSDB
For Perspective 2: the proposed novel alternative hypothesis characterization.
Figure 2. Best baselines vs. our proposed LRs: DET curves for the "Test" subset.
37 Evaluation on the ISCSLP2006-SRE Database
For Perspective 2: the proposed novel alternative hypothesis characterization, in the text-independent speaker verification task.
We observe that KFD_w_50c_50f achieved a 34.08% relative improvement over GMM-UBM.
38 Evaluation on the ISCSLP2006-SRE Database
We participated in the text-independent speaker verification task of the ISCSLP2006 Speaker Recognition Evaluation (SRE) plan.
The evaluation results are given as follows.
39 Conclusions
We have reviewed current LR systems for speaker verification.
We have presented two proposed LR systems:
The combined LR system.
The new LR system with the novel alternative hypothesis characterization.
Both proposed LR systems can be formulated as a linear or non-linear discriminant classifier.
Non-linear classifiers can be implemented using kernel methods:
Kernel Fisher Discriminant (KFD).
Support Vector Machine (SVM).
Experiments conducted on two speaker verification tasks, the XM2VTSDB task and the ISCSLP2006-SRE task, demonstrated the superiority of our methods over conventional approaches.
40 THANK YOU!
Institute of Information Science, Academia Sinica, Taiwan