Linear Discriminant Feature Extraction for Speech Recognition Hung-Shin Lee Master Student Spoken Language Processing Lab National Taiwan Normal University.

Presentation transcript:

Linear Discriminant Feature Extraction for Speech Recognition Hung-Shin Lee Master Student Spoken Language Processing Lab National Taiwan Normal University 2008/08/14

Arborescence
Linear Discriminant Feature Extraction for Speech Recognition
– Linear Discriminant Analysis
  – Formulations: Total Variation (trace), Generalized Variance (determinant)
  – Optimizations
– Error-Rate Related Limitations
  – Overemphasis: (Li, 2000), (Liang, 2007)
  – Distance Measure: (Loog, 2001), (Loog, 2004)
  – Empirical Error: WLDA (Lee, 2008a), DE-LDA (Lee, 2008b)
– Alternative Formulations
  – Homoscedasticity: (Campbell, 1984), HDA (Saon, 2000), PLDA (Sakai, 2007), HLDA (Kumar, 1998)

Outline
– History (till 1998): LDA and Speech Recognition
– Statistical Facts
– Linear Discriminant Analysis (LDA): Formulations, Optimizations
– Some Remarks: Trace & Determinant, Solvability, Decorrelation, Geometry
– Error-Rate Related Limitations: Overemphasis, Distance Measure, Empirical Error
– Alternative Formulations
– Conclusion

History (till 1998) - LDA and Speech Recognition
– Hunt first used LDA for separating syllables.
– Brown verified that LDA is superior to PCA for a DHMM classifier.
– Brown captured contextual information by applying LDA to an augmented feature vector.
– 1989~: Researchers applied LDA to continuous HMM speech recognition systems and reported improved performance on small-vocabulary tasks.
– 1990~: On large-vocabulary phoneme-based systems, LDA led to mixed results.
– 1992~: Haeb-Umbach: defining sub-phone units (states) as the classes to be discriminated in the LDA transform proved most effective for a continuous mixture-density based speech recognizer.

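Statistical Facts (1)
– Class mean vector (sample): $m_i = \frac{1}{n_i} \sum_{x \in C_i} x$
– Class covariance (sample): $\Sigma_i = \frac{1}{n_i} \sum_{x \in C_i} (x - m_i)(x - m_i)^T$
– Total mean vector (sample): $m = \frac{1}{N} \sum_{x} x = \sum_{i=1}^{C} \frac{n_i}{N} m_i$
Here $n_i$ is the number of samples in class $C_i$ and $N = \sum_i n_i$. Note that here the sample covariance is a biased estimate of the population covariance.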

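Statistical Facts (2)
– Within-class scatter: $S_W = \sum_{i=1}^{C} \frac{n_i}{N} \Sigma_i$
– Between-class scatter: $S_B = \sum_{i=1}^{C} \frac{n_i}{N} (m_i - m)(m_i - m)^T = \sum_{i<j} \frac{n_i}{N} \frac{n_j}{N} (m_i - m_j)(m_i - m_j)^T$ (cf. Appendix A)
Here $n_i / N$ is an estimate of the prior probability for class i, and $m_i - m_j$ is the class-mean difference. Note that in general a covariance matrix is symmetric and positive semidefinite; it is positive definite, with strictly positive eigenvalues, when it is nonsingular.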

Statistical Facts (3)
– Total covariance (sample): $S_T = \frac{1}{N} \sum_{x} (x - m)(x - m)^T = S_W + S_B$, where $S_W$ and $S_B$ are the within-class and between-class scatter matrices (cf. Appendix B)
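The three statistical facts can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic Gaussian data; the function name `scatter_matrices` and the toy class means are illustrative and not from the slides. It uses the biased (divide-by-count) estimates, for which the decomposition of the total covariance into within-class plus between-class scatter holds exactly.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B, S_T) using biased (divide-by-count) estimates."""
    N, d = X.shape
    m = X.mean(axis=0)                      # total mean vector
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        p_c = Xc.shape[0] / N               # prior estimate n_i / N
        m_c = Xc.mean(axis=0)               # class mean vector
        Sigma_c = (Xc - m_c).T @ (Xc - m_c) / Xc.shape[0]  # biased class covariance
        S_W += p_c * Sigma_c
        S_B += p_c * np.outer(m_c - m, m_c - m)
    S_T = (X - m).T @ (X - m) / N           # biased total covariance
    return S_W, S_B, S_T

# Synthetic three-class data (an assumption made only for this illustration).
rng = np.random.default_rng(0)
class_means = ([0, 0, 0], [2, 1, 0], [0, 3, 1])
X = np.vstack([rng.normal(loc=mu, size=(50, 3)) for mu in class_means])
y = np.repeat([0, 1, 2], 50)

S_W, S_B, S_T = scatter_matrices(X, y)
print(np.allclose(S_T, S_W + S_B))
```

With unbiased (divide by n - 1) estimates the identity would hold only approximately, which is one reason to keep the biased form here.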

Linear Discriminant Analysis - Problem (1)
PROBLEM: To separate populations.
– Suppose we have C classes from n-dimensional distributions. We want to project these classes onto a p-dimensional subspace (p < n) so that the variation among the classes is as large as possible, relative to the variation within the classes. (Wilks, 1963)

Linear Discriminant Analysis - Problem (2)
Practically speaking, after the projection, we want
– class means to be as far apart from each other as possible, that is, the between-class scatter to be large;
– samples from the same class to be as close to their mean as possible, that is, the within-class scatter to be small.

Linear Discriminant Analysis - Solution (1)
SOLUTION: Linear Discriminant Analysis (LDA)
– The goal of LDA is to seek a linear transformation $\Theta$ (an n × p matrix) that reduces the dimensionality of a given n-dimensional feature vector to p (p < n) by maximizing one of the discrimination criteria (Fukunaga, 1990): the trace criterion $J(\Theta) = \mathrm{tr}\left((\Theta^T S_W \Theta)^{-1} (\Theta^T S_B \Theta)\right)$ or the determinant criterion $J(\Theta) = |\Theta^T S_B \Theta| / |\Theta^T S_W \Theta|$, where $S_W$ and $S_B$ are the within-class and between-class scatter matrices.
This technique was developed by R. A. Fisher (1936) for the two-class case and extended by C. R. Rao (1948) to handle the multiclass case.

Linear Discriminant Analysis - Solution (2)
Theoretically, LDA can be interpreted from two aspects: (Johnson et al., 2002), (Gao et al., 2006), (Welch, 1939)
– One comes from Bayesian decision theory, with LDA as a straightforward application of C Gaussians with equal covariance.
– The other is Fisher’s LDA, which is defined by maximizing the ratio of the between-class to the within-class scatter in a linear feature space.
– Here only the latter is discussed.

Linear Discriminant Analysis - Trace (1)
In the trace criterion, what is the meaning of “trace”?
– The Mahalanobis distance can be used as a measure of class separation. (Fisher, 1936)
– Why don’t we first use principal component analysis (PCA) to decorrelate the variables, and then use the Euclidean distance as a measure of class separation?
Note: the Mahalanobis distance is a distance measure based on correlations between variables and is scale-invariant.

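Linear Discriminant Analysis - Trace (2)
In the trace criterion, what is the meaning of “trace”? (cont.)
– Assume all classes share the same covariance $S_W$. The square of the Mahalanobis distance between the class means $m_i$ and $m_j$ is $\Delta_{ij}^2 = (m_i - m_j)^T S_W^{-1} (m_i - m_j)$.
– The prior-weighted average of $\Delta_{ij}^2$ over all class pairs is proportional to $\mathrm{tr}(S_W^{-1} S_B)$, where $S_B$ is the between-class scatter.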
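The link between the trace and the average Mahalanobis distance can be verified numerically. The sketch below assumes NumPy, randomly generated class means, and an arbitrary positive definite matrix standing in for the shared covariance; all names are illustrative. The constant 2 is specific to the convention used here (a double sum over all ordered class pairs weighted by the priors), since the exact constant on the original slide is not preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
C, n = 4, 3
means = rng.normal(size=(C, n))             # hypothetical class means
priors = np.array([0.1, 0.2, 0.3, 0.4])     # hypothetical class priors

A = rng.normal(size=(n, n))
S_W = A @ A.T + n * np.eye(n)               # shared covariance (positive definite)
S_W_inv = np.linalg.inv(S_W)

m = priors @ means                          # total mean
S_B = sum(p * np.outer(mi - m, mi - m) for p, mi in zip(priors, means))

# Prior-weighted average of squared Mahalanobis distances over ordered pairs.
avg_dist = 0.0
for i in range(C):
    for j in range(C):
        diff = means[i] - means[j]
        avg_dist += priors[i] * priors[j] * (diff @ S_W_inv @ diff)

trace_term = np.trace(S_W_inv @ S_B)
print(avg_dist, 2.0 * trace_term)
```

Both printed numbers agree up to rounding, so maximizing the trace is the same as maximizing this average distance.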

Linear Discriminant Analysis - Determinant (1)
In the determinant criterion, what is the meaning of “determinant”?
– The determinant is the product of the eigenvalues, and hence is the product of the “variances” in the principal directions, thereby measuring the square of the hyperellipsoidal scattering volume. (Duda et al., 2001)
– The determinant of a nonsingular covariance matrix can also be viewed as the generalized variance. (Wilks, 1963)
– The volume of space occupied by the cloud of data points is proportional to the square root of the generalized variance.
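A small two-dimensional case makes the three statements concrete. This is only a sketch, assuming NumPy and a hand-picked covariance matrix that does not come from the slides: the determinant equals the product of the eigenvalues, and the area of the one-standard-deviation ellipse is pi times the square root of the generalized variance.

```python
import numpy as np

# Hypothetical 2-D covariance matrix chosen for the illustration.
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

eigvals = np.linalg.eigvalsh(Sigma)          # variances along the principal axes
gen_var = np.linalg.det(Sigma)               # generalized variance |Sigma|

# The 1-sigma ellipse has semi-axes sqrt(eigvals), hence area pi * sqrt(l1 * l2).
ellipse_area = np.pi * np.sqrt(eigvals[0]) * np.sqrt(eigvals[1])

print(gen_var, np.prod(eigvals))
print(ellipse_area, np.pi * np.sqrt(gen_var))
```

In n dimensions the same relation holds with the volume of the unit ball replacing pi.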

Linear Discriminant Analysis - Determinant (2) to (4)
The concept of the generalized variance. (The illustrations on these slides are not preserved in the transcript.)

Linear Discriminant Analysis - Determinant (5) and (6)
The concept of the generalized variance (cont.):
– Assume all classes share the same covariance.

Linear Discriminant Analysis - Optimization (1)
Optimization of the trace criterion: (Fukunaga, 1990)
– Differentiate the criterion with respect to the transformation matrix $\Theta$ and set the result to zero.

Linear Discriminant Analysis - Optimization (2)
Optimization of the trace criterion: (cont.)
– Simultaneously diagonalize the within-class scatter $S_W$ and the between-class scatter $S_B$ (cf. Appendix C).
– The $\lambda_i$ are the eigenvalues of $S_W^{-1} S_B$.
Note that the eigenvector matrix is not unique.

Linear Discriminant Analysis - Optimization (3)
Optimization of the trace criterion: (cont.)
– Note that the original criterion becomes the sum of the selected eigenvalues, $\sum_{i=1}^{p} \lambda_i$.
– That is, we can maximize the criterion by selecting the largest p eigenvalues of $S_W^{-1} S_B$. The corresponding eigenvectors form the transformation matrix.
– Note also that $p \le C - 1$, since the between-class scatter $S_B$ has a maximum rank of C-1: $S_B$ is the sum of C matrices of rank one or less, and only C-1 of these are independent.
Note that all of the eigenvalues are nonnegative, and at most C-1 of them are nonzero.
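The eigenvalue selection can be sketched as follows. This is a minimal illustration, assuming NumPy, synthetic class means, and an arbitrary positive definite within-class scatter; `lda_transform` is a hypothetical helper name. It solves the eigenproblem of the product of the inverse within-class scatter and the between-class scatter directly, then checks that at most C-1 eigenvalues are nonzero and that the trace criterion equals the sum of the p retained eigenvalues.

```python
import numpy as np

def lda_transform(S_W, S_B, p):
    """Return all eigenvalues (descending) of S_W^{-1} S_B and the top-p eigenvectors."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    # The eigenvalues are real in exact arithmetic; drop numerical imaginary parts.
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order[:p]]

# Synthetic setup (assumed for illustration): C = 3 classes in n = 4 dimensions.
rng = np.random.default_rng(2)
C, n, p = 3, 4, 2
means = rng.normal(size=(C, n))
priors = np.full(C, 1.0 / C)
m = priors @ means
S_B = sum(pk * np.outer(mk - m, mk - m) for pk, mk in zip(priors, means))
A = rng.normal(size=(n, n))
S_W = A @ A.T + n * np.eye(n)

eigvals, Theta = lda_transform(S_W, S_B, p)
num_nonzero = int(np.sum(eigvals > 1e-9))

# Trace criterion evaluated at the selected transformation.
J = np.trace(np.linalg.inv(Theta.T @ S_W @ Theta) @ (Theta.T @ S_B @ Theta))
print(num_nonzero, J, eigvals[:p].sum())
```

The value of the criterion does not depend on how the eigenvectors are scaled, which is why no normalization is applied in this sketch.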

Linear Discriminant Analysis - Optimization (4)
Optimization of the determinant criterion:
– Differentiate the criterion with respect to the transformation matrix $\Theta$ and set the result to zero.

Linear Discriminant Analysis - Optimization (5)
Optimization of the determinant criterion: (cont.)
– We can see that the procedure is similar to that of the trace criterion.
– That is, we can maximize the criterion by selecting the largest p eigenvalues of $S_W^{-1} S_B$. The corresponding eigenvectors form the transformation matrix.
Note that all of the eigenvalues are nonnegative.

Some Remarks - Trace & Determinant
Comparisons between two measures of class separation: (Pena, 2003)
– Trace: total variation (Seber, 1984); a measure of the average of the Mahalanobis distance between each class-mean pair; related to principal components analysis.
– Determinant: generalized variance (Wilks, 1963); a measure of the hyper-volume that the distribution of the random variables occupies in the space; related to maximum likelihood estimation.

Some Remarks - Solvability
The solving procedure of LDA is lightweight. If an optimization problem can be expressed in the same ratio form as the LDA criterion, then it can be solved as a generalized eigen-analysis problem, with the solution given by the eigenvectors and eigenvalues of the corresponding matrix product. Prieto (2003) demonstrated a general solution of the optimization problem of LDA.

Some Remarks - Decorrelation (1)
Unlike principal component analysis (PCA), the linear discriminant transformation from the original variates to the new variates is not orthogonal. But the linear discriminant transformation makes the transformed variates statistically uncorrelated. (Krzanowski, 1988) That’s why we sometimes use LDA to replace the discrete cosine transform (DCT). (Li, 2004)

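Some Remarks - Decorrelation (2)
From the generalized eigen-equation $S_B \phi = \lambda S_W \phi$, take any two particular eigenvalue/eigenvector pairs $(\lambda_i, \phi_i)$ and $(\lambda_j, \phi_j)$.
– Pre-multiplying the two equations by $\phi_j^T$ and $\phi_i^T$, respectively, gives $\phi_j^T S_B \phi_i = \lambda_i \phi_j^T S_W \phi_i$ and $\phi_i^T S_B \phi_j = \lambda_j \phi_i^T S_W \phi_j$.
– Because $S_B$ and $S_W$ are symmetric, subtracting yields $(\lambda_i - \lambda_j) \phi_i^T S_W \phi_j = 0$, so for $\lambda_i \ne \lambda_j$ both $\phi_i^T S_W \phi_j = 0$ and $\phi_i^T S_B \phi_j = 0$.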

Some Remarks - Decorrelation (3)
– To overcome the arbitrary scaling of the eigenvectors, it is usual to adopt the normalization $\Theta^T S_W \Theta = I$. With this constraint, the transformation $\Theta$ will be unique.
– The optimization problems for the trace and determinant criteria are then equivalent to maximizing the projected between-class scatter subject to this constraint.
– Many researchers tried to modify the constraint to suit their tasks. (Sammon, 1970), (Foley, 1975), (Duchene, 1988)

Some Remarks - Decorrelation (4)
Thus, the LDA-transformed variates are uncorrelated both within and between classes, and are scaled to have unit variance within classes. With the constraint, we can give LDA a geometrical meaning.

Some Remarks - Geometry (1)
The derivation of the LDA transformation matrix can be geometrically viewed as a two-stage procedure: (Campbell, 1981)
– In the first stage, a whitening matrix derived from the within-class scatter is used as an orthogonal and whitening transformation of the original feature vectors (or variables).
After this stage the distribution contour for each class becomes a unit circle, which is convenient for measuring class separation by the Euclidean distance.

Some Remarks - Geometry (2)
– The second stage involves a principal component analysis (PCA) on the transformed class means, which seeks new axes that coincide with the directions having the maximum variations of the class means.
PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Some Remarks - Geometry (3)
The algebraic meaning: (Lee, 2008a)
– After the first stage (the whitening stage), the within-class scatter becomes the identity matrix.
– The second stage (the PCA stage) diagonalizes the whitened between-class scatter.

Some Remarks - Geometry (4)
– We can see that after the two stages, both between- and within-class scatter matrices become diagonal.
– To get back the transformation matrix for the original variables, the PCA-stage eigenvector matrix has to be pre-multiplied by the whitening matrix.
– It can be algebraically proven that the resulting matrix also maximizes the LDA criterion in the original feature space.

Some Remarks - Geometry (5)
Thus, we can give an alternative algorithm for deriving a unique LDA transformation.
Algorithm I. An alternative procedure of LDA
1. Find the matrix made up of the eigenvectors corresponding to the p largest eigenvalues of the whitened between-class scatter.
2. Derive the LDA transformation by pre-multiplying this matrix by the whitening matrix.
According to the geometric analysis of LDA, at least two possible directions are offered to further generalize LDA.
– To obtain more effective estimates of the within-class scatter.
– To modify the between-class scatter for better class discrimination.
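The two-stage view can be sketched in a few lines. This is an illustrative sketch, assuming NumPy and synthetic scatter matrices; the helper name `lda_two_stage` and the specific whitening matrix (eigenvectors of the within-class scatter scaled by the inverse square roots of its eigenvalues) are assumptions, since the matrices named on the original slide are not preserved. Any matrix that maps the within-class scatter to the identity would serve for stage one.

```python
import numpy as np

def lda_two_stage(S_W, S_B, p):
    """Whitening followed by PCA on the whitened between-class scatter."""
    # Stage 1: whitening. With S_W = Phi diag(lam) Phi^T, use W = Phi diag(lam^-1/2).
    lam, Phi = np.linalg.eigh(S_W)
    W = Phi / np.sqrt(lam)                  # divides column j by sqrt(lam[j])
    # Stage 2: PCA (eigen-decomposition) of the whitened between-class scatter.
    S_B_white = W.T @ S_B @ W
    mu, Psi = np.linalg.eigh(S_B_white)
    top = np.argsort(mu)[::-1][:p]
    # Map back to the original variables by pre-multiplying with W.
    return W @ Psi[:, top], mu[top]

# Synthetic setup (assumed for illustration only).
rng = np.random.default_rng(3)
C, n, p = 3, 4, 2
means = rng.normal(size=(C, n))
priors = np.full(C, 1.0 / C)
m = priors @ means
S_B = sum(pk * np.outer(mk - m, mk - m) for pk, mk in zip(priors, means))
A = rng.normal(size=(n, n))
S_W = A @ A.T + n * np.eye(n)

Theta, mu = lda_two_stage(S_W, S_B, p)
within = Theta.T @ S_W @ Theta              # expected: identity
between = Theta.T @ S_B @ Theta             # expected: diag(mu)
direct = np.sort(np.linalg.eigvals(np.linalg.solve(S_W, S_B)).real)[::-1][:p]
print(np.round(within, 6))
print(np.round(between, 6))
```

The transformed within-class scatter is the identity and the transformed between-class scatter is diagonal, matching the decorrelation remarks, and the retained eigenvalues agree with those obtained by solving the eigenproblem directly.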

Error-Rate Related Limitations - Overemphasis (1)
After the whitening stage, we can see that class pairs with a larger distance in the original space still have a larger distance after the projection.

Error-Rate Related Limitations - Overemphasis (2)
– The larger the distance between a pair of class means, the more the direction of their difference becomes visible in the eigenvectors corresponding to the larger eigenvalues of the whitened between-class scatter.
– Thus, class pairs with large distances completely dominate the eigenvalue decomposition.
– Consequently, there is an overlap among the remaining classes, leading to an overall low and suboptimal classification rate.
In the illustrated example, the principal axis is dominantly determined by the distance between classes 2 and 3, but classes 1 and 2 need more separation for better classification accuracy.

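Error-Rate Related Limitations - Overemphasis (3)
To alleviate the overemphasis of the influence of classes that are already well separated, some weighting-based approaches were proposed. They modify the LDA criterion by replacing the between-class scatter $S_B$ with a weighted form built from the class-mean differences, $\sum_{i<j} p_i p_j \, w(\Delta_{ij}) (m_i - m_j)(m_i - m_j)^T$, where $p_i$ is the prior of class i, $\Delta_{ij}$ is the Mahalanobis distance between the means of classes i and j, and $w(\cdot)$ is the weighting function.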

Error-Rate Related Limitations - Overemphasis (4)
The weighting-based LDA has good solvability, since the Mahalanobis distance between class means is invariant to any nonsingular linear transformation (such as whitening). Thus, similar to LDA, we can give an algorithm for deriving a unique transformation.
Algorithm II. WLDA
1. Find the matrix made up of the eigenvectors corresponding to the p largest eigenvalues of the whitened, weighted between-class scatter.
2. Derive the transformation by pre-multiplying this matrix by the whitening matrix.

Error-Rate Related Limitations - Overemphasis (5)
A simple and intuitive weighting function was proposed by (Li, 2000) and (Liang, 2007).
– This function is in essence a monotonically decreasing function of the distance between class means, such that class pairs with a large distance will not be overemphasized.

Error-Rate Related Limitations - Distance Measure (1)
As a distance-measure based approach, LDA tries to maximize the Mahalanobis distance between each class-mean pair. LDA is not directly associated with the classification error. But LDA is optimal for classification in a Bayesian sense under the following conditions:
– The problem is a two-class problem.
– The classes are normally distributed with equal covariance.

Error-Rate Related Limitations - Distance Measure (2)
Under the equal-covariance assumption, for two normally distributed classes, LDA is an optimal approach to two-class classification.
– The Bayes error is a decreasing function of the Mahalanobis distance between the two class means.

Error-Rate Related Limitations - Distance Measure (3)
– The Chernoff bound gives an upper bound on the Bayes error that also decreases as the Mahalanobis distance between the class means grows.
– That is, for this two-class case, maximizing the LDA criterion is equivalent to minimizing the Bayes error.
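The two-class relationship can be illustrated with closed-form expressions. This sketch uses only the Python standard library and assumes two Gaussian classes with a shared covariance and equal priors; under those assumptions the Bayes error is the standard normal tail probability at half the Mahalanobis distance, and the Chernoff bound with parameter 1/2 (the Bhattacharyya bound) depends on the distance only through its square. The function names are illustrative, and the exact expressions on the original slides are not preserved.

```python
import math

def bayes_error_equal_cov(delta):
    """Bayes error for two equal-covariance Gaussians with equal priors.

    delta is the Mahalanobis distance between the class means.
    """
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))

def bhattacharyya_bound(delta, p1=0.5):
    """Chernoff bound at s = 1/2 for the same setting, with prior p1 for class 1."""
    return math.sqrt(p1 * (1.0 - p1)) * math.exp(-delta ** 2 / 8.0)

deltas = [0.0, 0.5, 1.0, 2.0, 4.0]
errors = [bayes_error_equal_cov(d) for d in deltas]
bounds = [bhattacharyya_bound(d) for d in deltas]
for d, e, b in zip(deltas, errors, bounds):
    print(f"delta={d:.1f}  bayes_error={e:.4f}  bound={b:.4f}")
```

Both quantities fall monotonically as the distance grows, which is the sense in which a larger Mahalanobis distance implies a smaller two-class error.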

Error-Rate Related Limitations - Distance Measure (4)
On multi-class classification, LDA is suboptimal.
– The multi-class classification error rates have not been fully investigated.
– Geometrically speaking, after the whitening stage, even if the equal-covariance assumption is satisfied, LDA cannot guarantee a smaller (not necessarily minimum) overlap among the classes overall.
Loog (2001) proposed a new criterion, the approximate pairwise accuracy criterion (aPAC), to solve this problem.
– aPAC is derived from an attempt to approximate the Bayes error for pairs of classes.
– aPAC still retains the equal-covariance assumption of LDA, and simultaneously retains the computational simplicity of LDA.

Error-Rate Related Limitations - Distance Measure (5)
– The criterion of aPAC is a weighted sum of pairwise Fisher criteria, with weights built from the error function erf.
– Here erf means the error function, which is twice the integral of the Gaussian distribution with mean 0 and variance 1/2.
– aPAC can confine the influence of outlier classes well, which makes it more robust than LDA.
– But it cannot be guaranteed that aPAC always leads to an improved classification rate.

Error-Rate Related Limitations - Distance Measure (6)
Loog’s method (aPAC) (2001) can also solve the overemphasis problem.
– Its weighting function is in essence a monotonically decreasing function of the Mahalanobis distance between class means, such that class pairs with a large distance will not be overemphasized.
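The decreasing behaviour of such a weighting can be checked directly. The sketch below uses only the standard library and assumes the weighting takes the form erf(delta / (2 sqrt 2)) / (2 delta^2), which is how the aPAC weight is usually quoted; the slide's own formula is not preserved in the transcript, so this form should be checked against Loog (2001) before being relied on. The function name `apac_weight` is illustrative.

```python
import math

def apac_weight(delta):
    """Assumed aPAC-style weight for a class pair at Mahalanobis distance delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return math.erf(delta / (2.0 * math.sqrt(2.0))) / (2.0 * delta ** 2)

deltas = [0.5, 1.0, 2.0, 4.0, 8.0]
weights = [apac_weight(d) for d in deltas]
for d, w in zip(deltas, weights):
    print(f"delta={d:.1f}  weight={w:.4f}")
```

Class pairs that are already far apart receive a weight close to zero, so they no longer dominate the eigen-decomposition of the between-class scatter.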

Error-Rate Related Limitations - Distance Measure (7)
To deal with heteroscedastic data, Loog (2004) modified the aPAC and proposed a new idea of directed distance matrices (DDMs).
– DDMs, which are based on the Chernoff distance, can be considered as generalizations of the between-class scatter.
– The Chernoff distance gives an upper bound on the error probability for two normally distributed densities.

Error-Rate Related Limitations - Distance Measure (8)
– The Chernoff criterion and the property shown for it on this slide are not preserved in the transcript.

Error-Rate Related Limitations - Distance Measure (9)
Although Loog’s method (2004) skillfully transformed the distance-measure based approach into a classification-accuracy based one, it suffers from some limitations.
– It is not distribution-free.
– Similar to Loog’s method (2001), it approximates the theoretical C-class Bayes error by the sum of the two-class errors, which is an upper bound on the C-class error.
– Not all classifiers are designed purely as Bayesian ones.

Error-Rate Related Limitations - Empirical Error Rate (1)
Lee (2008a) incorporated the empirical classification information from the training data into the derivation of LDA to form a classifier-related objective function.
– Advocating the concept of pairwise class confusion information, Lee’s method takes the empirical classification error rates resulting from a given classifier into consideration, rather than operating merely in the Bayesian sense.
– The corresponding weighting function is expressed in terms of the number of samples that originally belong to class i but are misallocated to class j by the classifier.

Error-Rate Related Limitations - Empirical Error Rate (2)
– The misallocation count can be used to measure the confusability between classes i and j. That is, for class i, the higher the count, the more confusable it would be with class j.
– The weighting also contains an adjustable factor trading off between the empirical classification error rates and the class-mean distances, which can only be set heuristically.
– At one extreme of this factor, the approach reduces to LDA; at the other extreme, not only will the class pairs with the most misallocations dominate the whole weighting, but the class pairs with no misallocations will be completely ignored.

Error-Rate Related Limitations - Empirical Error Rate (3)
By investigating the relationship between the empirical classification error rates of a given ASR system and the Mahalanobis distances of the respective class pairs of speech features, Lee (2008b) proposed a reformulation of the LDA criterion, called distance-error coupled LDA (DE-LDA).
– Define the empirical pairwise classification error rate for each class pair.
– Divide all of the class pairs into two groups, phone-phone and silence-phone, according to their corresponding class labels.

Error-Rate Related Limitations - Empirical Error Rate (4)
Observations: (figures of error rate against distance for the class pairs)

Error-Rate Related Limitations - Empirical Error Rate (5)
Observations: (cont.)
– The error rates of most of the class pairs in the silence-phone group are much lower than those in the phone-phone group.
– The correlation between the two variables, i.e., the distance and the error rate, in the silence-phone group is less pronounced than that in the phone-phone group.
– In Fig. 2, we can roughly depict the relationship between these two variables: class pairs with shorter distances tend to have higher error rates; class pairs with larger distances are likely to have lower error rates.
– Such a phenomenon, to some extent, conforms to our expectation: the statistics of class pairs with shorter distances need to be emphasized, while those of class pairs with larger distances should be deemphasized when deriving the LDA-based feature transformation matrix.

Error-Rate Related Limitations - Empirical Error Rate (6)
Observations: (cont.)
– It is reasonable to disregard the contributions of the class pairs in the silence-phone group to the LDA derivation, due to their irregularities in the distance-error distribution and their smaller influence on the overall error rates.
Data-fitting:
– Use a data-fitting (or regression) scheme to find a function of the Mahalanobis distance that approximates the relationship between the empirical pairwise classification error rate and the corresponding Mahalanobis distance.

Error-Rate Related Limitations - Empirical Error Rate (7)
Data-fitting: (cont.)
– The fitted function may, for example, be taken to be a quadratic polynomial of the Mahalanobis distance.
– The fitted error function can then be used as a new weighting function.

Error-Rate Related Limitations - Empirical Error Rate (8)
– The curves of the error functions, derived on the basis of data-fitting for polynomials of degree 1 up to degree 5, are almost monotonically decreasing with the distance.

Alternative Formulations - Homoscedasticity (1)
From the formulations of the LDA criteria, we can see that LDA assumes that all classes share the same covariance, and the common covariance is the within-class scatter $S_W$. (But LDA does not make any distribution assumptions.) If the assumption is not satisfied, LDA will not perform well.

Alternative Formulations - Homoscedasticity (2)
Campbell (1984a) tried to generalize LDA by proposing an alternative formulation of LDA based on a weighted mean, which he called the weighted between-class criterion. The objective function needs to be solved vector by vector with some constraints, say, by gradient descent.

Alternative Formulations - Homoscedasticity (3)
Similarly, to generalize the determinant criterion, Saon (2000) proposed a new criterion that considers the individual covariances of the classes in the objective function, named Heteroscedastic Discriminant Analysis (HDA).
– HDA can be interpreted as a constrained ML projection, the constraint being given by the maximization of the projected between-class scatter volume.
– HDA is distribution-free, and its solution is not unique.

Alternative Formulations - Homoscedasticity (4)
Sakai (2007) found that the difference between LDA and HDA is the denominator part of their criteria.
– The common covariance assumed by LDA is the weighted arithmetic mean of the class covariances, i.e., the within-class scatter $S_W$.
– The common covariance assumed by HDA is the weighted geometric mean of the class covariances.
– This leads to a new criterion named Power Linear Discriminant Analysis (PLDA), which uses a weighted power mean of the class covariances.
– PLDA tries to generalize LDA by re-estimating the within-class covariance. But it is difficult to determine the most appropriate power mean as an estimate of the within-class covariance.
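The role of the power mean can be sketched numerically. The code below assumes NumPy and defines the weighted power mean of symmetric positive definite matrices as the 1/m-th matrix power of the weighted sum of m-th matrix powers; the helper names, the toy covariances, and the choice to apply the mean to the unprojected class covariances are assumptions made for illustration, and the exact PLDA criterion should be taken from Sakai (2007).

```python
import numpy as np

def sym_power(S, a):
    """Matrix power S^a of a symmetric positive definite matrix via eigh."""
    lam, U = np.linalg.eigh(S)
    return (U * lam ** a) @ U.T             # U diag(lam^a) U^T

def power_mean_cov(covs, priors, m):
    """Weighted power mean (sum_k p_k S_k^m)^(1/m) for a nonzero exponent m."""
    acc = sum(pk * sym_power(Sk, m) for pk, Sk in zip(priors, covs))
    return sym_power(acc, 1.0 / m)

# Hypothetical class covariances and priors.
covs = [np.diag([1.0, 4.0]), np.diag([4.0, 1.0])]
priors = [0.5, 0.5]

arithmetic = power_mean_cov(covs, priors, 1.0)      # LDA-like: arithmetic mean
near_geometric = power_mean_cov(covs, priors, 1e-3) # small m approaches geometric mean
harmonic = power_mean_cov(covs, priors, -1.0)       # m = -1: harmonic mean

print(np.round(arithmetic, 4))
print(np.round(near_geometric, 4))
print(np.round(harmonic, 4))
```

For these commuting matrices the three means are 2.5, about 2, and 1.6 times the identity, which shows how the exponent moves the estimate between the LDA-like and HDA-like choices.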

Alternative Formulations - Homoscedasticity (5)
Also, LDA is related to the maximum-likelihood estimation of parameters for a Gaussian model, with two a priori assumptions on the structure of the model. (Campbell, 1984b)
– All the class-discrimination information resides in a p-dimensional subspace of the n-dimensional feature space.
– The within-class variances are equal for all the classes.
Under the maximum likelihood framework, Kumar (1998) proposed an approach, Heteroscedastic Linear Discriminant Analysis (HLDA), and showed that HLDA is a maximum likelihood solution for normal populations with common covariances in the rejected (n-p)-dimensional subspace.

Alternative Formulations - Homoscedasticity (6)
Some Remarks
– From PLDA, we can see that HDA does not seem to successfully relax the equal-covariance assumption of LDA.
– PLDA also implicitly showed that HDA does not necessarily outperform LDA.
– LDA, HDA, and PLDA make no distribution assumption, but HLDA does.
– The difference between HDA and HLDA is that HDA tries to maximize the between-class separation in the projected (p-dimensional) space, whereas HLDA tries to minimize the between-class separation in the rejected ((n-p)-dimensional) space.

Conclusions
Linear discriminant analysis (LDA) is a simple approach that is easy to solve.
It has been shown that LDA can lead to consistent performance improvements for small-vocabulary recognition tasks and mixed results on large-vocabulary applications (Haeb-Umbach, 1992).
We roughly unveiled the relationship between front-end processing by LDA and back-end recognition by the recognizer, and verified that increasing class separability is indeed helpful to speech recognition.

Appendix A

Appendix B

Appendix C
Simultaneous Diagonalization (both matrices are symmetric):
– Whiten
– Diagonalize
– Derive the transformation by combining the two steps
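The whiten-then-diagonalize procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the two symmetric matrices are taken to be a within-class scatter `s_w` (assumed positive definite) and a between-class scatter `s_b`. These names, the example values, and the function `simultaneous_diagonalization` are hypothetical and not taken from the slides.

```python
import numpy as np

def simultaneous_diagonalization(s_w, s_b):
    """Return (transform, diag_vals) with transform.T @ s_w @ transform = I
    and transform.T @ s_b @ transform = diag(diag_vals)."""
    # Step 1 (whiten): s_w = Phi Lambda Phi^T, so Phi Lambda^{-1/2} maps s_w to I.
    lam, phi = np.linalg.eigh(s_w)
    whitener = phi / np.sqrt(lam)  # scales column j of Phi by 1/sqrt(lam_j)
    # Step 2 (diagonalize): eigendecompose s_b in the whitened coordinates.
    s_b_white = whitener.T @ s_b @ whitener
    diag_vals, psi = np.linalg.eigh(s_b_white)
    # Step 3: the overall transformation is the product of the two steps.
    transform = whitener @ psi
    return transform, diag_vals

# Hypothetical symmetric example matrices.
s_w = np.array([[4.0, 1.0], [1.0, 3.0]])
s_b = np.array([[2.0, -1.0], [-1.0, 1.0]])

transform, diag_vals = simultaneous_diagonalization(s_w, s_b)
print(transform.T @ s_w @ transform)  # close to the identity
print(transform.T @ s_b @ transform)  # close to diag(diag_vals)
```

Because the second step uses an orthogonal matrix, it does not disturb the identity obtained in the first step. The columns of the resulting transformation are also eigenvectors of inv(s_w) @ s_b, which ties this construction back to the LDA eigenproblem.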

References (1)
M. Hunt, “A Statistical Approach to Metrics for Word and Syllable Recognition,” 98th Meeting of the Acoustical Society of America.
P. F. Brown, “The Acoustic-Modeling Problem in Automatic Speech Recognition,” Ph.D. Thesis, Carnegie Mellon University.
R. Haeb-Umbach et al., “Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition,” ICASSP.
K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York.
R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, vol. 7.
C. R. Rao, “The Utilization of Multiple Measurements in Problems of Biological Classification,” J. Royal Statistical Soc., Series B, vol. 10.
R. A. Johnson et al., Applied Multivariate Statistical Analysis, Prentice Hall, 5th Ed.

References (2)
H. Gao et al., “Why Direct LDA is not equivalent to LDA,” Pattern Recognition, vol. 39.
B. L. Welch, “Note on Discriminant Functions,” Biometrika, vol. 31.
R. O. Duda et al., Pattern Classification, John Wiley and Sons, New York.
G. A. F. Seber, Multivariate Observations, John Wiley, New York.
D. Pena et al., “Descriptive Measures of Multivariate Scatter and Linear Dependence,” Journal of Multivariate Analysis, vol. 85, issue 2.
R. E. Prieto, “A General Solution to the Maximization of the Multidimensional Generalized Rayleigh Quotient Used in Linear Discriminant Analysis for Signal Classification,” ICASSP.
W. J. Krzanowski, Principles of Multivariate Analysis - A User’s Perspective, Oxford Press.
X. B. Li, “Dimensionality Reduction Using MCE-Optimized LDA Transformation,” ICASSP.

References (3)
J. W. Sammon, “An Optimal Discriminant Plane,” IEEE Trans. on Computers.
D. H. Foley et al., “An Optimal Set of Discriminant Vectors,” IEEE Trans. on Computers.
J. Duchene et al., “An Optimal Transformation for Discriminant and Principal Component Analysis,” IEEE Trans. on PAMI.
N. A. Campbell et al., “The Geometry of Canonical Variate Analysis,” Syst. Zool., 30(3).
H. S. Lee and Berlin Chen, “Linear Discriminant Feature Extraction Using Weighted Classification Confusion Information,” INTERSPEECH.
N. A. Campbell, “Canonical Variate Analysis with Unequal Covariance Matrices - Generalizations of the Usual Solution,” Mathematical Geology, 16(2).
R. Saon et al., “Maximum Likelihood Discriminant Feature Spaces,” ICASSP.

References (4)
N. A. Campbell, “Canonical Variate Analysis - A General Model Formulation,” Australian Journal of Statistics, vol. 26.
N. Kumar et al., “Heteroscedastic Discriminant Analysis and Reduced Rank HMMs for Improved Speech Recognition,” Speech Communication, vol. 26.
M. Sakai et al., “Generalization of Linear Discriminant Analysis Used in Segmental Unit Input HMM for Speech Recognition,” ICASSP.
Y. Li et al., “Weighted Pairwise Scatter to Improve Linear Discriminant Analysis,” ICSLP.
Y. Liang, “Uncorrelated Linear Discriminant Analysis Based on Weighted Pairwise Fisher Criterion,” Pattern Recognition, vol. 40.
M. Loog et al., “Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria,” IEEE Trans. on PAMI.
M. Loog et al., “Linear Dimensionality Reduction via a Heteroscedastic Extension of LDA,” IEEE Trans. on PAMI.

References (5)
H. S. Lee and Berlin Chen, “Improved Linear Discriminant Analysis Considering Empirical Pairwise Classification Error Rates,” ISCSLP, 2008, submitted.