1
PCA, LDA, HLDA and HDA. References:
E. Alpaydin, “Introduction to Machine Learning,” The MIT Press, 2004.
S. R. Searle, “Matrix Algebra Useful for Statistics,” Wiley Series in Probability and Mathematical Statistics, New York, 1982.
Berlin Chen’s slides.
N. Kumar and A. G. Andreou, “Heteroscedastic Discriminant Analysis and Reduced-Rank HMMs for Improved Speech Recognition,” Speech Communication, 26, 1998.
G. Saon, M. Padmanabhan, R. Gopinath and S. Chen, “Maximum Likelihood Discriminant Feature Spaces,” ICASSP, 2000.
X. Liu, “Linear Projection Schemes for Automatic Speech Recognition,” M.Phil. thesis, University of Cambridge, 2001.
張志豪 (Chang), “A Study on Robust and Discriminative Speech Feature Extraction Techniques for Large Vocabulary Continuous Speech Recognition,” Master’s thesis, 2005.
2
PCA: Introduction PCA (Principal Component Analysis) is a one-group, unsupervised projection method for reducing data dimensionality (feature extraction). We use PCA to find a mapping from the inputs in the original d-dimensional space to a new k-dimensional space (k < d) with minimum loss of information, i.e., with the maximum amount of information retained, measured in terms of variability. That is, we find new variables, or linear transformations (major axes, principal components), that achieve this goal. The projection of x on the direction of w is z = w^T x: input space X (d dimensions), feature space Z (k dimensions), linear transform W.
3
PCA: Methodology (General)
For maximizing the amount of information, PCA centers the sample and then rotates the axes to line up with the directions of highest variance. That is, find w such that Var(z) is maximized; this is the criterion:

Var(z) = Var(w^T x) = E[(w^T x - w^T μ)^2] = E[(w^T x - w^T μ)(w^T x - w^T μ)^T] = E[w^T (x - μ)(x - μ)^T w] = w^T E[(x - μ)(x - μ)^T] w = w^T Σ w,

where Cov(x) = E[(x - μ)(x - μ)^T] = Σ.
4
PCA: Methodology (General) (cont.)
Maximize Var(z_1) = w_1^T Σ w_1 subject to ||w_1|| = 1. For a unique solution, and to make the direction (not the magnitude) the important factor, we require ||w_1|| = 1, i.e., w_1^T w_1 = 1. Introducing a Lagrange multiplier λ for this constraint and setting the derivative to zero (worked out below) shows that w_1 is an eigenvector of Σ and λ is the eigenvalue associated with w_1. Then Var(z_1) = Var(w_1^T x) = w_1^T Σ w_1 = w_1^T λ w_1 = λ w_1^T w_1 = λ, so max Var(z_1) = max λ: choose the eigenvector with the largest eigenvalue for Var(z_1) to be maximal.
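The Lagrangian step referred to above, written out in LaTeX (a standard reconstruction; the slide's own formula is not in the text):

\max_{w_1}\; w_1^{T}\Sigma w_1-\lambda\,(w_1^{T}w_1-1)
\;\;\Rightarrow\;\; 2\Sigma w_1-2\lambda w_1=0
\;\;\Rightarrow\;\; \Sigma w_1=\lambda w_1 .

So w_1 is an eigenvector of Σ and the multiplier λ is its eigenvalue; substituting back gives Var(z_1) = w_1^T Σ w_1 = λ.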
5
PCA: Methodology (General) (cont.)
Second principal component: max Var(z_2), subject to ||w_2|| = 1 and w_2 orthogonal to w_1; the solution is the eigenvector of Σ with the second largest eigenvalue, so max Var(z_2) = λ_2. Conclusions: w_i is the eigenvector of Σ associated with the i-th largest eigenvalue, and Var(z_i) = λ_i. (The above can be proved by mathematical induction.) The w_i are orthogonal, so the z_i are uncorrelated. w_1 explains as much as possible of the original variance in the data set, w_2 explains as much as possible of the remaining variance, and so on.
6
PCA: Some Discussions About dimensions:
If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues; k will be much smaller than d, and a large reduction in dimensionality can be attained. If the dimensions are not correlated, k will be as large as d and there is no gain from PCA.
7
PCA: Some Discussions (cont.)
About Σ: For two different eigenvalues, the eigenvectors are orthogonal. If Σ is positive definite (x^T Σ x > 0 for all x ≠ 0), all eigenvalues are positive. If Σ is singular, then its rank, the effective dimensionality, is k < d, and λ_i = 0 for i > k. About scaling: Different variables can have completely different scales, and the eigenvalues of the matrix are scale-dependent. If the scale of the data is unknown, it is better to use the correlation matrix instead of the covariance matrix (a small illustration follows). The interpretation of the principal components derived by these two methods can be completely different.
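A small, self-contained illustration of the scale-dependence point, with made-up data (NumPy assumed; all names are hypothetical):

import numpy as np

# Two measurements of the same underlying signal on very different scales,
# plus an unrelated noise dimension (synthetic data, for illustration only).
rng = np.random.default_rng(0)
s = rng.normal(size=(500, 1))
X = np.hstack([s + 0.1 * rng.normal(size=(500, 1)),            # unit scale
               100.0 * (s + 0.1 * rng.normal(size=(500, 1))),  # same signal, 100x scale
               rng.normal(size=(500, 1))])                     # independent noise

# The first principal component of the covariance matrix is dominated by the
# large-scale column; with the correlation matrix the two signal columns get
# roughly equal weight.
w_cov = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]
w_corr = np.linalg.eigh(np.corrcoef(X, rowvar=False))[1][:, -1]
print(np.round(w_cov, 3), np.round(w_corr, 3))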
8
PCA: Methodology (SD) z = W^T x, or z = W^T (x - m) to center the data on the origin. We want to find a matrix W such that, with z = W^T x, Cov(z) = D is a diagonal matrix, i.e., the z_i are uncorrelated. Let C = [c_1, ..., c_d] be the matrix of normalized eigenvectors of S, so that C^T C = I. Then

S = S C C^T = S (c_1, c_2, ..., c_d) C^T = (S c_1, S c_2, ..., S c_d) C^T = (λ_1 c_1, λ_2 c_2, ..., λ_d c_d) C^T = λ_1 c_1 c_1^T + λ_2 c_2 c_2^T + ... + λ_d c_d c_d^T = C D C^T,

and D = C^T S C. Spectral decomposition is the factorization of a positive definite matrix S into S = C D C^T, where D is a diagonal matrix of eigenvalues and the columns of C are the eigenvectors. When only k eigenvectors are retained (W is the d×k matrix of leading eigenvectors), the dimensions are D (k×k) = W^T (k×d) S (d×d) W (d×k), where d is the input-space dimension and k the feature-space dimension; in this notation Cov(z) = W^T Σ W. A short Python sketch of the procedure follows.
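A minimal Python sketch of the above procedure (eigendecomposition of the sample covariance, then projection onto the leading eigenvectors); the function and variable names are mine, not from the slides, and NumPy is assumed:

import numpy as np

def pca(X, k):
    """Project the rows of X (N x d) onto the top-k principal components."""
    m = X.mean(axis=0)
    Xc = X - m                             # center the data on the origin
    S = np.cov(Xc, rowvar=False)           # d x d sample covariance
    eigvals, C = np.linalg.eigh(S)         # S symmetric: real eigenpairs, C^T C = I
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
    W = C[:, order[:k]]                    # d x k matrix of leading eigenvectors
    Z = Xc @ W                             # z = W^T (x - m) for every sample
    return Z, W, eigvals[order]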
9
Appendix A Another criterion for PCA is the MMSE (Minimum Mean-Squared Error) criterion, which reaches the same destination as the two approaches above, though there may be an interesting difference among them. Some important properties of symmetric matrices: Eigenvalues are all real. Symmetric matrices are diagonalizable. Eigenvectors are orthogonal: eigenvectors corresponding to different eigenvalues are orthogonal, and m_k linearly independent eigenvectors corresponding to any eigenvalue λ_k of multiplicity m_k can be chosen so that they are orthogonal. The rank equals the number of nonzero eigenvalues.
10
LDA: Introduction LDA (Linear Discriminant Analysis) (Fisher, 1936) (Rao, 1935) is a supervised dimension-reduction method for classification problems. To obtain features suitable for speech-sound classification, the use of LDA was proposed by Hunt (1979). Brown showed that the LDA transform is superior to the PCA transform when using a DHMM classifier and incorporating context information (Brown, 1987). Later researchers applied LDA to DHMM and CHMM speech recognition systems and reported improved performance on small-vocabulary tasks, but mixed results on large-vocabulary phoneme-based systems.
11
LDA: Assumptions LDA is related to the MLE (Maximum Likelihood Estimation) of parameters for a Gaussian model, with two a priori assumptions (Campbell, 1984). First, all the class-discrimination information resides in a p-dimensional subspace of the n-dimensional feature space. Second, the within-class variances are equal for all classes. Another notable assumption is that the class distributions are mixtures of Gaussians (Hastie & Tibshirani, 1994). (Why not a single Gaussian?) This means LDA is optimal if the classes are normally distributed, but we can still use LDA for classification even when they are not.
12
LDA: Methodology Criterion: Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal. After projection, for all classes to be well separated, we would like the means to be as far apart as possible and the examples of each class to be scattered in as small a region as possible.
13
LDA: Methodology (cont.)
Let x be an n-dimensional feature vector. We seek a linear transformation R^n → R^p (p < n) of the form y_p = θ_p^T x, where θ_p is an n×p matrix. Let θ be a nonsingular n×n matrix used to define the linear transformation y = θ^T x, and partition it as θ = [θ_p θ_{n-p}], where θ_p consists of the first p columns. First, we apply the nonsingular linear transformation to x to obtain y = θ^T x. Second, we retain only the first p rows of y to give y_p.
14
LDA: Methodology (cont.)
Let {x_i} be the set of training examples available, let there be a total of J classes, and let g(i) ∈ {1, …, J} indicate the class associated with x_i. The sample mean, the class sample means, and the class sample covariances are defined below.
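The three definitions, whose formulas are not in the extracted text, in their standard form (N is the total number of examples, N_j the number in class j):

\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i,\qquad
\bar{x}_j=\frac{1}{N_j}\sum_{i:\,g(i)=j}x_i,\qquad
W_j=\frac{1}{N_j}\sum_{i:\,g(i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^{T}.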
15
LDA: Methodology (cont.)
The average within-class variation W, the average between-class variation B, and the total sample covariance T are defined below.
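In the same notation, the standard definitions (again a reconstruction, since the slide's formulas are missing):

W=\sum_{j=1}^{J}\frac{N_j}{N}\,W_j,\qquad
B=\sum_{j=1}^{J}\frac{N_j}{N}\,(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^{T},\qquad
T=\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^{T}=W+B.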
16
LDA: Methodology (cont.)
To get a p-dimensional transformation, we maximize the ratio of between-class to within-class variation, |θ_p^T B θ_p| / |θ_p^T W θ_p|. To obtain θ_p, we choose those eigenvectors of W^{-1}B that correspond to the largest p eigenvalues, and let θ_p be the n×p matrix of these eigenvectors. The p-dimensional features obtained by y_p = θ_p^T x are then uncorrelated. (A Python sketch follows.)
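A minimal Python sketch of this recipe using the W and B defined on the previous slide; the function and variable names are mine, labels is assumed to be a NumPy array of class ids, and NumPy is assumed:

import numpy as np

def lda(X, labels, p):
    """Return the n x p LDA transform theta_p for data X (N x n)."""
    N, n = X.shape
    xbar = X.mean(axis=0)
    W = np.zeros((n, n))                   # average within-class covariance
    B = np.zeros((n, n))                   # average between-class covariance
    for c in np.unique(labels):
        Xc = X[labels == c]
        Nc, xbar_c = len(Xc), Xc.mean(axis=0)
        W += (Nc / N) * np.cov(Xc, rowvar=False, bias=True)
        B += (Nc / N) * np.outer(xbar_c - xbar, xbar_c - xbar)
    # eigenvectors of W^{-1} B, ordered by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:p]].real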
17
HLDA: ML framework For LDA, since the final objective is classification, the implicit assumption is that the rejected subspace does not carry any classification information. For Gaussian models, the assumption of lack of classification information is equivalent to the assumption that the means and the variances of the class distributions are the same for all classes in the rejected (n-p)-dimensional subspace. Now, in an alternative view, let the full-rank linear transformation θ be such that the first p columns of θ span the p-dimensional subspace in which the class means, and possibly the class variances, differ. When rank(θ) = n for an n×n matrix θ, θ is said to have full rank, or to be of full rank: its rank equals its order, it is nonsingular, and its inverse exists. (Can the θ obtained by LDA be full rank?) Since the data variables x are Gaussian, their linear transformation y is also Gaussian.
18
HLDA: ML framework (cont.)
The goal of HLDA (Heteroscedastic Linear Discriminant Analysis) is to generalize LDA under the ML (Maximum Likelihood) framework. For notational convenience, we define the transformed class parameters as below, where μ_j represents the class means and Σ_j represents the class covariances after transformation.
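The definition itself did not survive extraction; in Kumar and Andreou's constrained model it has the block form below (the symbols μ_j^{(p)}, Σ_j^{(p)}, μ_0, Σ_0 are my labels), with class-specific parameters only in the first p dimensions and shared parameters in the rejected (n-p) dimensions:

\mu_j=\begin{pmatrix}\mu_j^{(p)}\\ \mu_0\end{pmatrix},\qquad
\Sigma_j=\begin{pmatrix}\Sigma_j^{(p)} & 0\\ 0 & \Sigma_0\end{pmatrix}.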
19
HLDA: ML framework (cont.)
The probability density of x_i under the preceding model is given below, where g(i) is the class to which x_i belongs. Note that although the Gaussian distribution is defined on the transformed variable y_i, we are interested in maximizing the likelihood of the original data x_i. The term |θ| comes from the Jacobian of the linear transformation y = θ^T x.
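Written out (a reconstruction consistent with the description above; the slide's own expression is missing):

p(x_i)=\frac{|\theta|}{(2\pi)^{n/2}\,\bigl|\Sigma_{g(i)}\bigr|^{1/2}}
\exp\!\Bigl(-\tfrac{1}{2}\bigl(\theta^{T}x_i-\mu_{g(i)}\bigr)^{T}\Sigma_{g(i)}^{-1}\bigl(\theta^{T}x_i-\mu_{g(i)}\bigr)\Bigr).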
20
HLDA: ML framework (cont.)
21
HLDA: Full rank The log-likelihood of the data under the linear transformation θ and under the constrained Gaussian model assumption for each class is Doing a straightforward maximization with respect to various parameters is computationally intensive. (Why?)
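Summing the log of the per-sample density above over all N samples gives the log-likelihood (reconstructed, with J the number of classes):

\log L(\theta,\{\mu_j,\Sigma_j\})
= N\log|\theta|-\frac{Nn}{2}\log(2\pi)
-\sum_{j=1}^{J}\frac{N_j}{2}\log|\Sigma_j|
-\frac{1}{2}\sum_{i=1}^{N}\bigl(\theta^{T}x_i-\mu_{g(i)}\bigr)^{T}\Sigma_{g(i)}^{-1}\bigl(\theta^{T}x_i-\mu_{g(i)}\bigr).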
22
HLDA: Full rank (cont.) We simplify it considerably by first calculating the values of the mean and variance parameters that maximize the likelihood for a fixed linear transformation θ; the resulting estimates are given below. (Transformations vs. ML estimators?)
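The resulting estimates, in the notation of the earlier LDA slides (a reconstruction; x̄ and x̄_j are the total and class sample means, W_j the class sample covariances, T the total sample covariance):

\hat{\mu}_j^{(p)}=\theta_p^{T}\bar{x}_j,\qquad
\hat{\mu}_0=\theta_{n-p}^{T}\bar{x},\qquad
\hat{\Sigma}_j^{(p)}=\theta_p^{T}W_j\,\theta_p,\qquad
\hat{\Sigma}_0=\theta_{n-p}^{T}T\,\theta_{n-p}.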
23
HLDA: Full rank (cont.) By replacing the mean and variance parameters with these estimates in terms of θ, the log-likelihood becomes a function of θ alone.
24
HLDA: Full rank (cont.) We can simplify the above log-likelihood to solve for θ. Proposition 1: Let F be any full-rank n×n matrix, and let t be any n×p matrix of rank p (p < n). Then Trace( t (t^T F t)^{-1} t^T F ) = p. Proposition 2:
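The end result of this simplification, as given in Kumar and Andreou (1998) and reconstructed here rather than taken from the slide, is the full-rank HLDA objective:

\hat{\theta}=\arg\max_{\theta}\;\Bigl\{N\log|\theta|
-\frac{N}{2}\log\bigl|\theta_{n-p}^{T}\,T\,\theta_{n-p}\bigr|
-\sum_{j=1}^{J}\frac{N_j}{2}\log\bigl|\theta_p^{T}\,W_j\,\theta_p\bigr|\Bigr\}

(up to an additive constant independent of θ).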
25
HLDA: Full rank (cont.) Since there is no closed-form solution for maximizing the likelihood with respect to θ, the maximization has to be performed numerically (see the sketch below). An initial guess of θ: the LDA solution. Quadratic programming algorithms in the MATLAB toolbox. After optimization, we use only the first p columns of θ to obtain the dimension-reduction transformation.
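A rough sketch of this numerical optimization, assuming the simplified objective reconstructed above; SciPy's general-purpose minimizer stands in for the MATLAB quadratic-programming routine mentioned on the slide, and all names (T for the total covariance, W_list and N_list for the per-class covariances and counts) are mine:

import numpy as np
from scipy.optimize import minimize

def neg_hlda_loglik(theta_flat, T, W_list, N_list, n, p):
    """Negative of the simplified full-rank HLDA objective (constants dropped)."""
    theta = theta_flat.reshape(n, n)
    th_p, th_r = theta[:, :p], theta[:, p:]
    N = sum(N_list)
    ll = N * np.log(abs(np.linalg.det(theta)))
    ll -= 0.5 * N * np.log(np.linalg.det(th_r.T @ T @ th_r))
    ll -= 0.5 * sum(Nj * np.log(np.linalg.det(th_p.T @ Wj @ th_p))
                    for Nj, Wj in zip(N_list, W_list))
    return -ll

# theta0: an initial guess, e.g. the LDA eigenvectors padded to a full n x n basis
# res = minimize(neg_hlda_loglik, theta0.ravel(), args=(T, W_list, N_list, n, p))
# theta_p = res.x.reshape(n, n)[:, :p]     # keep only the first p columns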
26
HLDA: Diagonal In speech recognition, we often assume that the within-class variances are diagonal. The log-likelihood of the data can be written as
27
HLDA: Diagonal (cont.) Using the same method as before, and maximizing the likelihood with respect to means and variances, we get
28
HLDA: Diagonal (cont.) Substituting values of the maximized mean and variance parameters gives the maximized likelihood of the data in terms of θ.
29
HLDA: Diagonal (cont.) We can simplify this maximization to the following
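The simplified diagonal-covariance objective, in the form commonly quoted for Kumar and Andreou's method (a reconstruction; θ_i denotes the i-th column of θ):

\hat{\theta}=\arg\max_{\theta}\;\Bigl\{N\log|\theta|
-\frac{N}{2}\sum_{i=p+1}^{n}\log\bigl(\theta_i^{T}\,T\,\theta_i\bigr)
-\frac{1}{2}\sum_{j=1}^{J}N_j\sum_{i=1}^{p}\log\bigl(\theta_i^{T}\,W_j\,\theta_i\bigr)\Bigr\}.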
30
HLDA: with equal parameters
We finally consider the case where every class has an equal covariance matrix. Then, the maximum-likelihood parameter estimates can be written as follows:
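These estimates are not in the extracted text; a reconstruction in the earlier notation (with W the average within-class covariance) is:

\hat{\mu}_j^{(p)}=\theta_p^{T}\bar{x}_j,\qquad
\hat{\mu}_0=\theta_{n-p}^{T}\bar{x},\qquad
\hat{\Sigma}^{(p)}=\theta_p^{T}W\,\theta_p,\qquad
\hat{\Sigma}_0=\theta_{n-p}^{T}T\,\theta_{n-p},

so that the maximized log-likelihood becomes, up to a constant,

N\log|\theta|-\frac{N}{2}\log\bigl|\theta_{n-p}^{T}\,T\,\theta_{n-p}\bigr|
-\frac{N}{2}\log\bigl|\theta_p^{T}\,W\,\theta_p\bigr|.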
31
HLDA: with equal parameters (cont.)
The solution that we obtain by taking the eigenvectors corresponding to the largest p eigenvalues of W^{-1}B (the LDA solution) also maximizes the expression above, thus asserting the claim that LDA is the maximum-likelihood parameter estimate of a constrained model.
32
HDA: Introduction As with HLDA, the essence of HDA (Heteroscedastic Discriminant Analysis) is to remove the equal within-class covariance constraint. HDA defines an objective function similar to LDA's, which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions. The assumptions of HDA: being the intuitive heteroscedastic extension of LDA, HDA shares the same assumptions as LDA (Chang, 2005). (But why?) First, all of the classification information lies in the first p-dimensional feature subspace. Second, every class distribution is normal.
33
HDA: Derivation With the uniform class-specific variance assumption removed, HDA tries to maximize the objective below; taking the log and rearranging terms gives the function H(θ) used in the following slides.
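The two missing expressions, as given by Saon et al. (2000) and written here in the y_p = θ_p^T x convention used earlier (so B is the between-class and W_j the class-j within-class covariance), are roughly:

H(\theta_p)=\prod_{j=1}^{J}
\left(\frac{\bigl|\theta_p^{T}\,B\,\theta_p\bigr|}{\bigl|\theta_p^{T}\,W_j\,\theta_p\bigr|}\right)^{N_j},
\qquad
\log H(\theta_p)=\sum_{j=1}^{J}N_j\Bigl[\log\bigl|\theta_p^{T}\,B\,\theta_p\bigr|
-\log\bigl|\theta_p^{T}\,W_j\,\theta_p\bigr|\Bigr].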
34
HDA: Derivation (cont.)
H has useful invariance properties. For every nonsingular matrix φ, H(φθ) = H(θ); this means that subsequent feature-space transformations of the range of θ will not affect the value of the objective function. So, like LDA, the HDA solution is invariant to linear transformations of the data in the original space. No special provisions have to be made for θ during the optimization of H except for |θ^T θ| ≠ 0. The objective function is invariant to row or column scalings of θ, and to eigenvalue scalings of θ^T θ. Using matrix differentiation, the derivative of H can be written in closed form, but there is no closed-form solution for H'(θ) = 0. Instead, a quasi-Newton conjugate gradient descent routine from the NAG Fortran library was used for the optimization of H.
35
HDA: Derivation (cont.)
36
HDA: Likelihood interpretation
Assuming a single full-covariance Gaussian model for each class, we can write the log likelihood of the projected samples under the induced ML model. It may be seen that the summation in H is related to this log likelihood. Thus, θ can be interpreted as a constrained ML projection, the constraint being given by the maximization of the projected between-class scatter volume.
37
HDA: diagonal variance
Consider the case when diagonal-covariance modeling constraints are present in the final feature space. MLLT (Maximum Likelihood Linear Transform) is introduced for the case when the dimensions of the original and the projected space are the same. MLLT aims at minimizing the loss in likelihood between full and diagonal covariance Gaussian models. The objective is to find a transformation φ that maximizes the log-likelihood difference of the data, as written below.
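The slide's formula is missing; the commonly cited MLLT criterion (a reconstruction, with Σ_j the class covariances in the projected space) maximizes the likelihood difference between full- and diagonal-covariance models after the transform:

\hat{\varphi}=\arg\max_{\varphi}\;
\sum_{j=1}^{J}\frac{N_j}{2}\Bigl[\log\bigl|\varphi\,\Sigma_j\,\varphi^{T}\bigr|
-\log\bigl|\mathrm{diag}\bigl(\varphi\,\Sigma_j\,\varphi^{T}\bigr)\bigr|\Bigr].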
38
HDA: HDA vs. HLDA Consider the diagonal constraint in the projected feature space. The two schemes lead to different criteria: for HDA, the resulting term is to be maximized; for HLDA, the corresponding term is to be minimized.