Power Linear Discriminant Analysis (PLDA)


Power Linear Discriminant Analysis (PLDA)
M. Sakai, N. Kitaoka and S. Nakagawa, “Generalization of Linear Discriminant Analysis Used in Segmental Unit Input HMM for Speech Recognition,” Proc. ICASSP, 2007
M. Sakai, N. Kitaoka and S. Nakagawa, “Selection of Optimal Dimensionality Reduction Method Using Chernoff Bound for Segmental Unit Input HMM,” Proc. INTERSPEECH, 2007
Presented by Winston Lee
Reference:
S. Nakagawa and K. Yamamoto, “Evaluation of Segmental Unit Input HMM,” Proc. ICASSP, 1996
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Ed.

M. Sakai, N. Kitaoka and S. Nakagawa, “Generalization of Linear Discriminant Analysis Used in Segmental Unit Input HMM for Speech Recognition,” Proc. ICASSP, 2007

Abstract
Precisely modeling the time dependency of features is one of the important issues in speech recognition. Segmental unit input HMM with a dimensionality reduction method is widely used to address this issue. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are classical and popular approaches to reducing dimensionality. However, it is difficult to find one particular criterion suitable for any kind of data set when carrying out dimensionality reduction while preserving discriminative information. In this paper, we propose a new framework, which we call power linear discriminant analysis (PLDA). PLDA can describe various criteria, including LDA and HDA, with one control parameter. Experimental results show that PLDA is more effective than PCA, LDA, and HDA for various data sets.

Introduction
Hidden Markov Models (HMMs) have been widely used to model speech signals for speech recognition. However, HMMs cannot precisely model the time dependency of feature parameters.
–Output-independence assumption of HMMs: each observation depends only on the state that generated it, not on neighboring observations.
Segmental unit input HMM is widely (?) used to overcome this limitation. In segmental unit input HMM, a feature vector is derived from several successive frames. The immediate use of several successive frames inevitably increases the dimensionality of the parameters. Therefore, a dimensionality reduction method is applied to the spliced frames.

Segmental Unit Input HMM
Let the observation sequence be $O = o_1, o_2, \ldots, o_T$ and the state sequence be $q = q_1, q_2, \ldots, q_T$. The output probability of an HMM is computed as:
$$P(o_1, \ldots, o_T) = \sum_{q} P(o_1, \ldots, o_T, q_1, \ldots, q_T) \qquad \text{(marginalizing over the state sequence)}$$
$$= \sum_{q} \prod_{t} P(o_t \mid o_1, \ldots, o_{t-1}, q_1, \ldots, q_t)\, P(q_t \mid o_1, \ldots, o_{t-1}, q_1, \ldots, q_{t-1}) \qquad \text{(Bayes' rule)}$$

Segmental Unit Input HMM (cont.)
The exact output probability above is approximated in several ways:
$$P(o_1, \ldots, o_T) \approx \sum_{q} \prod_{t} P(o_t \mid o_{t-3}, o_{t-2}, o_{t-1}, q_t)\, P(q_t \mid q_{t-1}) \qquad \text{(conditional density HMM of 4-frame segments)}$$
$$\approx \sum_{q} \prod_{t} P(o_t \mid o_{t-1}, q_t)\, P(q_t \mid q_{t-1}) \qquad \text{(conditional density HMM of 2-frame segments)}$$
$$\approx \sum_{q} \prod_{t} P(o_{t-1}, o_t \mid q_t)\, P(q_t \mid q_{t-1}) \qquad \text{(segmental unit input HMM of 2-frame segments)}$$
$$\approx \sum_{q} \prod_{t} P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) \qquad \text{(the standard HMM)}$$

Segmental Unit Input HMM (cont.)
The segmental unit input HMM in (Nakagawa, 1996) is an approximation based on the joint density of the segment:
$$P(o_1, \ldots, o_T) \approx \sum_{q} \prod_{t} P(o_{t-3}, o_{t-2}, o_{t-1}, o_t \mid q_t)\, P(q_t \mid q_{t-1}) \qquad \text{(segmental unit input HMM of 4-frame segments)}$$
In the segmental unit input HMM, several successive frames are input as one vector; since the dimensionality of the vector increases, the covariance matrix is estimated with less precision. In (Nakagawa, 1996), the Karhunen-Loève (K-L) expansion and the Modified Quadratic Discriminant Function (MQDF) are used to deal with this problem.

K-L Expansion
–Estimate the covariance matrix $\Sigma$ from the samples.
–Compute the eigenvalues $\lambda_i$ and eigenvectors $\phi_i$ of $\Sigma$.
–Sort the eigenvalues, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, together with the eigenvectors corresponding to them.
–Compute the parameters with compressed dimension by using $y = B^{\top} x$, where the transformation matrix is $B = [\phi_1, \phi_2, \ldots, \phi_p]$ (the eigenvectors of the $p$ largest eigenvalues).
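A minimal numerical sketch of these four steps, assuming numpy is available; the function and variable names are illustrative and not taken from the paper:

```python
import numpy as np

def kl_expansion(X, p):
    """K-L expansion (PCA): project n-dim samples X (N x n) onto the p
    eigenvectors of the sample covariance with the largest eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center the samples
    cov = Xc.T @ Xc / len(X)                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort descending
    B = eigvecs[:, order[:p]]               # n x p transformation matrix
    return Xc @ B, B                        # compressed features and transform

# usage sketch: Y, B = kl_expansion(np.random.randn(1000, 39 * 4), p=39)
```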

K-L Expansion (cont.)
In the statistical literature, the K-L expansion is generally called principal components analysis (PCA).
Some criteria of the K-L expansion:
–minimum mean-square error (MMSE)
–maximum scatter measure
–minimum entropy
Remarks:
–Why orthonormal linear transformations? Ans: To maintain the structure of the distribution.

Review on LDA
Given $n$-dimensional features $x_i \in \mathbb{R}^{n}$, $i = 1, \ldots, N$ (e.g., spliced successive frames), let us find a transformation matrix $B \in \mathbb{R}^{n \times p}$ that maps these features to $p$-dimensional features $z_i = B^{\top} x_i$, where $p < n$ and $N$ denotes the number of features.
Within-class covariance matrix:
$$\Sigma_w = \sum_{k=1}^{K} \frac{N_k}{N}\, \Sigma_k, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i \in c_k} (x_i - \mu_k)(x_i - \mu_k)^{\top}$$
Between-class covariance matrix:
$$\Sigma_b = \sum_{k=1}^{K} \frac{N_k}{N}\, (\mu_k - \mu)(\mu_k - \mu)^{\top}$$
where $K$ is the number of classes, $N_k$ and $\mu_k$ are the number of samples and the mean of class $c_k$, and $\mu$ is the total mean.

Review on LDA (cont.)
In LDA, the objective function is defined as follows:
$$J_{LDA}(B) = \frac{\left|B^{\top} \Sigma_b B\right|}{\left|B^{\top} \Sigma_w B\right|} \qquad (1)$$
LDA finds a transformation matrix $B$ that maximizes the above function. The eigenvectors corresponding to the $p$ largest eigenvalues of $\Sigma_w^{-1} \Sigma_b$ are the solution.
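A hedged sketch of how this solution can be computed (not the authors' code), assuming numpy/scipy and the $\Sigma_w$, $\Sigma_b$ definitions above:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, p):
    """Fisher LDA: eigenvectors of Sigma_w^{-1} Sigma_b with the largest
    eigenvalues (assumes the within-class covariance is full rank)."""
    X = np.asarray(X, dtype=float)
    N, n = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / N            # weighted within-class covariance
        d = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) / N * (d @ d.T)                    # weighted between-class covariance
    # generalized eigenproblem Sb v = lambda Sw v; eigenvalues come back ascending
    eigvals, eigvecs = eigh(Sb, Sw)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:p]]        # n x p transform
    return X @ B, B
```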

Review on HDA
LDA is not the optimal transform when the class distributions are heteroscedastic.
–HLDA: Kumar incorporated maximum likelihood estimation of the parameters for differently distributed Gaussians.
–HDA: Saon proposed another objective function, similar to Kumar's, and showed its relationship to a constrained maximum likelihood estimation.
Saon's HDA objective function:
$$J_{HDA}(B) = \frac{\left|B^{\top} \Sigma_b B\right|}{\prod_{k=1}^{K} \left|B^{\top} \Sigma_k B\right|^{N_k / N}} \qquad (2)$$

Dependency on Data Set
Figure 1(a) shows that HDA has higher separability than LDA for one data set. Figure 1(b) shows that LDA has higher separability than HDA for another data set. Figure 1(c) shows a case, with yet another data set, where both LDA and HDA have low separability. All results show that the separabilities of LDA and HDA depend significantly on the data set.

Dependency on Data Set (cont.)
[Figure 1(a)-(c): the three example data sets and their LDA and HDA projections]

Relationship between LDA and HDA
The denominator in Eq. (1) can be viewed as a determinant of the weighted arithmetic mean of the class covariance matrices. The denominator in Eq. (2) can be viewed as a determinant of the weighted geometric mean of the class covariance matrices.

PLDA
The difference between LDA and HDA lies in the definition of the mean of the class covariance matrices. As an extension of this interpretation, their denominators can be replaced by a determinant of the weighted harmonic mean, a determinant of the root mean square, etc. In this paper, a more general definition of a mean, called the weighted mean of order m or the weighted power mean, is used. The new approach using the weighted power mean as the denominator of the objective function is called Power Linear Discriminant Analysis (PLDA).

PLDA (cont.)
The new objective function is as follows:
$$J_{PLDA}(B, m) = \frac{\left|B^{\top} \Sigma_b B\right|}{\left|\left( \sum_{k=1}^{K} \frac{N_k}{N} \left(B^{\top} \Sigma_k B\right)^{m} \right)^{1/m}\right|}$$
It can be seen that both LDA and HDA are special cases of PLDA:
–m = 1 (arithmetic mean) gives LDA.
–m → 0 (geometric mean) gives HDA.
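A minimal sketch of evaluating this objective for a candidate transform B and order m, assuming numpy/scipy; this illustrates the formula above and is not the authors' optimization procedure. The m = 0 branch uses the matrix analogue of the geometric-mean limit derived in Appendix B:

```python
import numpy as np
from scipy.linalg import expm, fractional_matrix_power, logm

def plda_objective(B, Sigma_b, class_covs, class_weights, m):
    """Value of the PLDA objective J(B, m) for one candidate transform B."""
    Wb = B.T @ Sigma_b @ B                                # projected between-class covariance
    proj = [B.T @ S @ B for S in class_covs]              # projected class covariances
    if m == 0:
        # m -> 0: weighted geometric mean of the projected covariances (HDA case)
        mean = expm(sum(w * logm(S) for w, S in zip(class_weights, proj)))
    else:
        powered = sum(w * fractional_matrix_power(S, m)
                      for w, S in zip(class_weights, proj))
        mean = fractional_matrix_power(powered, 1.0 / m)  # weighted power mean of order m
    return np.linalg.det(Wb) / np.linalg.det(mean)
```

For m = 1 the denominator reduces to the determinant of the weighted arithmetic mean (LDA), and for m = 0 its determinant equals the product of the projected class determinants raised to their weights (HDA).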

Appendix A
Weighted power mean:
–If $x_1, \ldots, x_n$ are positive real numbers and $w_1, \ldots, w_n$ are positive weights such that $\sum_{i=1}^{n} w_i = 1$, we define the $r$-th weighted power mean of the $x_i$ as:
$$M_w^{r}(x_1, \ldots, x_n) = \left( \sum_{i=1}^{n} w_i x_i^{r} \right)^{1/r}$$
Special cases (symbol: weighted mean):
min: Minimum (r → −∞)
H: Harmonic mean (r = −1)
G: Geometric mean (r → 0)
A: Arithmetic mean (r = 1)
RMS: Root-mean-square (r = 2)
max: Maximum (r → +∞)
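A small numeric illustration of these special cases, assuming numpy; the values and weights are arbitrary and only meant to show the limits:

```python
import numpy as np

def weighted_power_mean(x, w, r):
    """r-th weighted power mean of positive numbers x with weights w summing to 1;
    r = 0 is handled as the geometric-mean limit (see Appendix B)."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    if r == 0:
        return float(np.exp(np.sum(w * np.log(x))))
    return float(np.sum(w * x ** r) ** (1.0 / r))

# arbitrary example values; large |r| approximates min / max
x, w = [1.0, 4.0, 16.0], [0.5, 0.25, 0.25]
for r in (-50, -1, 0, 1, 2, 50):
    print(r, weighted_power_mean(x, w, r))
```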

Appendix B
Let $f(r) = \left( \sum_{i} w_i x_i^{r} \right)^{1/r}$; we want to find $\lim_{r \to 0} f(r)$.
First we take the logarithm of $f(r)$:
$$\ln f(r) = \frac{\ln \left( \sum_{i} w_i x_i^{r} \right)}{r}$$
Then, by l'Hôpital's rule (both numerator and denominator tend to 0 as $r \to 0$, since $\sum_{i} w_i = 1$):
$$\lim_{r \to 0} \ln f(r) = \lim_{r \to 0} \frac{\sum_{i} w_i x_i^{r} \ln x_i}{\sum_{i} w_i x_i^{r}} = \sum_{i} w_i \ln x_i$$
So
$$\lim_{r \to 0} f(r) = \prod_{i} x_i^{w_i},$$
which is the weighted geometric mean.

PLDA (cont.)
Assuming that the control parameter m is constrained to be an integer, the derivatives of the PLDA objective function can be formulated in closed form; the cases m > 0, m = 0, and m < 0 are treated separately (see Appendix C).

Appendix C
The case m > 0.

Appendix C (cont.)
The case m = 0 (too trivial!) and the case m < 0.

The Diagonal Case
For computational simplicity, the covariance matrix of class k is often assumed to be diagonal. Since multiplication of diagonal matrices is commutative, the derivatives of the PLDA objective function are simplified.
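Under one plausible reading of this diagonal assumption (keeping only the diagonal of each projected class covariance), the objective itself reduces to element-wise operations. This is an illustrative sketch only, not the paper's simplified derivative formulas:

```python
import numpy as np

def plda_objective_diag(B, Sigma_b, class_covs, class_weights, m):
    """PLDA objective when each projected class covariance is treated as diagonal.
    An assumed simplification for illustration, not the paper's equations."""
    Wb = B.T @ Sigma_b @ B
    diags = [np.diag(B.T @ S @ B) for S in class_covs]     # diagonal elements only
    if m == 0:
        mean = np.exp(sum(w * np.log(d) for w, d in zip(class_weights, diags)))
    else:
        mean = sum(w * d ** m for w, d in zip(class_weights, diags)) ** (1.0 / m)
    return np.linalg.det(Wb) / np.prod(mean)               # det of a diagonal matrix = product
```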

Experiments
Corpus: CENSREC-3
–The CENSREC-3 is designed as an evaluation framework for Japanese isolated word recognition in real driving-car environments.
–Speech data were collected using two microphones: a close-talking (CT) microphone and a hands-free (HF) microphone.
–For training, a total of 14,050 utterances spoken by 293 drivers (202 males and 91 females) were recorded with both microphones.
–For evaluation, a total of 2,646 utterances spoken by 18 speakers (8 males and 10 females) were evaluated for each microphone.

Experiments (cont.)

P.S.
Apparently, the derivation of PLDA is merely an induction from LDA and HDA. The authors don't seem to give any expressive statistical or physical interpretation of PLDA. The experimental results show that PLDA (with some parameter m) outperforms the other two approaches, but the paper does not explain why.
The revised version of Fisher's criterion!!!!!
The concepts of MEAN!!!!!

M. Sakai, N. Kitaoka and S. Nakagawa, “Selection of Optimal Dimensionality Reduction Method Using Chernoff Bound for Segmental Unit Input HMM,” Proc. INTERSPEECH, 2007

Abstract
To precisely model the time dependency of features, segmental unit input HMM with a dimensionality reduction method has been widely used for speech recognition. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are popular approaches to reduce the dimensionality. We have previously proposed another dimensionality reduction method called power linear discriminant analysis (PLDA). Selecting the dimensionality reduction method that yields the highest recognition performance by trial and error requires much time to train HMMs and to test the recognition performance of each candidate method. In this paper we propose a performance comparison method that requires neither training nor testing. We show that the proposed method, using the Chernoff bound, can rapidly and accurately evaluate the relative recognition performance.

Performance Comparison Method
Instead of a recognition error, the class separability error of the features in the projected space is used as a criterion to estimate the parameter m of PLDA.

Performance Comparison Method (cont.)
Two-class problem:
–Bayes error of the projected features on the evaluation data:
$$\varepsilon = \int \min\left[\, P(c_1)\, p(z \mid c_1),\; P(c_2)\, p(z \mid c_2) \,\right] dz$$
–The Bayes error ε can represent a classification error, assuming that the training data and the evaluation data come from the same distributions.
–But it is hard to measure the Bayes error.

Performance Comparison Method (cont.)
Two-class problem (cont.):
–Instead, we use the Chernoff bound between class 1 and class 2 as the class separability error:
$$\varepsilon_u = P(c_1)^{s} P(c_2)^{1-s} \int p(z \mid c_1)^{s}\, p(z \mid c_2)^{1-s}\, dz, \qquad 0 \le s \le 1$$
–For normally distributed classes, we can rewrite the above equation as
$$\varepsilon_u = P(c_1)^{s} P(c_2)^{1-s}\, e^{-\mu(s)},$$
where
$$\mu(s) = \frac{s(1-s)}{2} (\mu_2 - \mu_1)^{\top} \left[ s\Sigma_1 + (1-s)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| s\Sigma_1 + (1-s)\Sigma_2 \right|}{\left| \Sigma_1 \right|^{s} \left| \Sigma_2 \right|^{1-s}}$$
and $\mu_i$, $\Sigma_i$ are the mean and covariance matrix of class $i$ in the projected space.
–s = 0.5 gives the Bhattacharyya bound.
–The covariance matrices are treated as diagonal ones here.
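A hedged sketch of this closed form for two Gaussian classes (the standard expression from Fukunaga's textbook cited on the title slide), assuming numpy; full covariance matrices are used here, whereas the slide notes the paper treats them as diagonal:

```python
import numpy as np

def chernoff_bound(mu1, cov1, mu2, cov2, p1=0.5, p2=0.5, s=0.5):
    """Chernoff bound on the two-class Bayes error for Gaussian class models;
    s = 0.5 gives the Bhattacharyya bound."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    diff = mu2 - mu1
    cov_s = s * cov1 + (1 - s) * cov2
    mu_s = (s * (1 - s) / 2.0) * diff @ np.linalg.solve(cov_s, diff) \
        + 0.5 * np.log(np.linalg.det(cov_s)
                       / (np.linalg.det(cov1) ** s * np.linalg.det(cov2) ** (1 - s)))
    return p1 ** s * p2 ** (1 - s) * np.exp(-mu_s)
```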

Performance Comparison Method (cont.)

Performance Comparison Method (cont.)
Multi-class problem:
–It is possible to define several error functions for multi-class data.
–Sum of pairwise approximated errors.
–Maximum pairwise approximated error.

Performance Comparison Method (cont.)
Multi-class problem (cont.):
–Sum of maximum approximated errors in each class (a sketch of these three error functions is given below).
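Below is a hedged sketch of how these three multi-class error functions could be built from the pairwise Chernoff bounds, reusing the chernoff_bound sketch above; the class_stats format and the absence of any extra weighting are assumptions, not details from the paper:

```python
import itertools
import numpy as np

def pairwise_errors(class_stats):
    """Matrix of approximated pairwise errors from the chernoff_bound sketch above.
    class_stats is an assumed format: a list of (mean, covariance, prior) per class,
    computed in the projected space."""
    K = len(class_stats)
    eps = np.zeros((K, K))
    for i, j in itertools.combinations(range(K), 2):
        mu_i, cov_i, p_i = class_stats[i]
        mu_j, cov_j, p_j = class_stats[j]
        eps[i, j] = eps[j, i] = chernoff_bound(mu_i, cov_i, mu_j, cov_j, p_i, p_j)
    return eps

def sum_of_pairwise(eps):        # sum of pairwise approximated errors
    return eps[np.triu_indices(len(eps), k=1)].sum()

def max_pairwise(eps):           # maximum pairwise approximated error
    return eps.max()

def sum_of_max_per_class(eps):   # sum of the maximum approximated error in each class
    return eps.max(axis=1).sum()
```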

Experimental Results

Experimental Results (cont.)

Experimental Results (cont.)
No comparison method could predict the best dimensionality reduction method simultaneously for both of the two evaluation sets.
–This is presumably because the class separability error neglects the time information of the speech feature sequences and models each class distribution as a unimodal normal distribution.
Computational costs

P.S.
The experimental results didn't explicitly explain the relationship between WER and class separability error for a given m. That is, a better class separability error cannot explicitly guarantee a better WER. (The authors only say that they "agree well".) In the experiments, the authors didn't explain the differences among the three criteria used to calculate the approximated errors. Still, this is a good attempt to take something out of the black box (WERs).