Sample-Separation-Margin Based Minimum Classification Error Training of Pattern Classifiers with Quadratic Discriminant Functions
Yongqiang Wang 1,2, Qiang Huo 1
1 Microsoft Research Asia, Beijing, China
2 The University of Hong Kong, Hong Kong, China
ICASSP-2010, Dallas, Texas, U.S.A., March 14-19, 2010

Outline
Background
What’s our new approach
How does it work
Conclusions

Background of Minimum Classification Error (MCE) Formulation for Pattern Classification
Pioneered by Amari and Tsypkin in late 1960s
  – S. Amari, “A theory of adaptive pattern classifiers,” IEEE Trans. On Electronic Computers, Vol. EC-16, No. 3, pp ,
  – Y. Z. Tsypkin, Adaptation and learning in automatic systems,
  – Y. Z. Tsypkin, Foundations of the theory of learning systems,
Proposed originally for supervised online adaptation of a pattern classifier
  – to minimize the expected risk (cost)
  – via a sequential probabilistic descent (PD) algorithm
Extended by Juang and Katagiri in early 1990s
  – B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. on Signal Processing, Vol. 40, No. 12, pp , 1992.

MCE Formulation by Juang and Katagiri (1)
Define a proper discriminant function of an observation for each pattern class
To enable a maximum discriminant decision rule for pattern classification
Largely an art and application dependent
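
The maximum discriminant decision rule can be illustrated with a minimal sketch. The discriminant used here (negative squared Euclidean distance to a class mean) and the names `classify`, `neg_sq_dist_to` and `discriminants` are assumptions chosen for illustration only, not the choice made in this work.

```python
def classify(x, discriminants):
    """Maximum discriminant decision rule: pick the class whose
    discriminant function gives the largest value for x."""
    return max(discriminants, key=lambda label: discriminants[label](x))


def neg_sq_dist_to(mean):
    """Toy discriminant (an assumption): negative squared Euclidean
    distance from x to a class mean, so closer means a larger score."""
    return lambda x: -sum((xi - mi) ** 2 for xi, mi in zip(x, mean))


# Two hypothetical classes with hand-picked means.
discriminants = {
    "A": neg_sq_dist_to([0.0, 0.0]),
    "B": neg_sq_dist_to([4.0, 4.0]),
}

print(classify([1.0, 0.5], discriminants))
print(classify([3.0, 3.5], discriminants))
```

Any other family of discriminant functions can be dropped into the same rule, which is why the choice is described as application dependent.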

MCE Formulation by Juang and Katagiri (2)
Define a misclassification measure for each observation
  – to embed the decision process in the overall MCE formulation
  – to characterize the degree of confidence (or margin) in making decision for this observation
  – a differentiable function of the classifier parameters
A popular choice: where
Many possible ways => which one is better? => an open problem!
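
As a reference point, a widely cited form of the misclassification measure from Juang and Katagiri (1992) is sketched below. The notation is assumed here and may differ from the slide: $g_i$ is the discriminant function of the true class $i$, $\Lambda$ the classifier parameters, $M$ the number of classes, and $\eta > 0$ a smoothing constant; the form also assumes positive-valued discriminant functions.

```latex
d_i(x;\Lambda) = -g_i(x;\Lambda)
  + \left[ \frac{1}{M-1} \sum_{j \neq i} g_j(x;\Lambda)^{\eta} \right]^{1/\eta}
```

As $\eta \to \infty$ the bracketed term approaches $\max_{j \neq i} g_j(x;\Lambda)$, so $d_i > 0$ indicates a misclassification and $d_i < 0$ a correct decision.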

MCE Formulation by Juang and Katagiri (3)
Define a loss (cost) function for each observation
  – a differentiable and monotonically increasing function of the misclassification measure
  – many possibilities => sigmoid function most popular for approximating MCE
MCE training via minimizing
  – empirical average loss (cost) by an appropriate optimization procedure, e.g., gradient descent (GD), Quickprop, Rprop, etc., or
  – expected loss (cost) by a sequential probabilistic descent (PD) algorithm (a.k.a. GPD)
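
The sigmoid loss and the empirical average loss can be sketched as follows. The parameter names `alpha` (slope) and `beta` (offset), their default values, and the function names are assumptions for illustration; the slides do not specify them.

```python
import math


def sigmoid_loss(d, alpha=1.0, beta=0.0):
    """Smooth approximation of the 0-1 error count.

    d is the misclassification measure: the loss is close to 0 for
    d << 0 (confidently correct) and close to 1 for d >> 0 (error).
    Written in a numerically stable way to avoid overflow in exp().
    """
    z = alpha * d - beta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_loss_grad(d, alpha=1.0, beta=0.0):
    """Derivative of the sigmoid loss with respect to d, as needed by
    gradient-based optimizers."""
    loss = sigmoid_loss(d, alpha, beta)
    return alpha * loss * (1.0 - loss)


def empirical_average_loss(measures, alpha=1.0, beta=0.0):
    """Average sigmoid loss over the misclassification measures of all
    training observations."""
    return sum(sigmoid_loss(d, alpha, beta) for d in measures) / len(measures)


print(empirical_average_loss([-3.0, -0.5, 0.2, 4.0]))
```

Because the gradient `alpha * loss * (1 - loss)` is largest near `d = 0`, samples close to the decision boundary dominate the parameter updates.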

Some Remarks
Combinations of different choices for each of the previous three steps and optimization methods lead to various MCE training algorithms.
The power of MCE training has been demonstrated by many research groups for different pattern classifiers in different applications.
How to improve the generalization capability of an MCE-trained classifier?

One Possible Solution: SSM-based MCE Training
Sample Separation Margin (SSM)
  – Defined as the smallest distance of an observation to the classification boundary formed by the true class and the most competing class
  – There is a closed-form solution for piecewise linear classifiers
Define misclassification measure as negative SSM
  – Other parts of the formulation are the same as in “traditional” MCE
A happy result
  – Minimized empirical error rate, and
  – Improved generalization
Correctly recognized training samples have a large margin from the decision boundaries!
For more info:
  – T. He and Q. Huo, “A study of a new misclassification measure for minimum classification error training of prototype-based pattern classifiers,” in Proc. ICPR-2008
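
For intuition, the closed-form case can be sketched for a nearest-prototype classifier with Euclidean distance. This is a simplified illustration under stated assumptions, not the exact formulation of He and Huo: the relevant boundary is taken to be the hyperplane bisecting the nearest true-class prototype and the nearest rival prototype, and all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ssm_two_prototypes(x, m_true, m_rival):
    """Signed distance from x to the hyperplane bisecting m_true and m_rival.

    Positive when x lies on the true-class side (correctly classified),
    negative when it lies on the rival side.
    """
    numerator = sq_dist(x, m_rival) - sq_dist(x, m_true)
    denominator = 2.0 * math.sqrt(sq_dist(m_true, m_rival))
    return numerator / denominator


def misclassification_measure(x, true_protos, rival_protos):
    """Negative SSM, using the nearest prototype on each side.

    rival_protos is assumed to pool the prototypes of all other classes.
    """
    m_true = min(true_protos, key=lambda m: sq_dist(x, m))
    m_rival = min(rival_protos, key=lambda m: sq_dist(x, m))
    return -ssm_two_prototypes(x, m_true, m_rival)


# The bisecting boundary of [0, 0] and [2, 0] is the line x1 = 1.
print(ssm_two_prototypes([0.0, 0.0], [0.0, 0.0], [2.0, 0.0]))
print(misclassification_measure([3.0, 0.0], [[0.0, 0.0]], [[2.0, 0.0]]))
```

Dividing by the distance between the two prototypes is what turns a difference of discriminant values into a geometric distance in feature space.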

What’s New in This Study?
Extend SSM-based MCE training to pattern classifier with a quadratic discriminant function (QDF)
  – No closed-form solution to calculate SSM
Demonstrate its effectiveness on a large-scale Chinese handwriting recognition task
  – Modified QDF (MQDF) is widely used in state-of-the-art Chinese handwriting recognition systems

Two Technical Issues
How to calculate the SSM efficiently?
  – Formulated as a nonlinear programming problem
  – Can be solved efficiently because it is a quadratically constrained quadratic programming (QCQP) problem with a very special structure: a convex objective function with one quadratic equality constraint
How to calculate the derivative of the SSM?
  – Using a technique known as sensitivity analysis in nonlinear programming
  – Calculated by using the solution to the problem in Eq. (1)
Please refer to our paper for details
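
One way to write down the two steps above is sketched below. The notation is assumed for illustration and does not come from the slides: $x$ is the observation, $z$ a point on the decision boundary, $g_i$ and $g_j$ the quadratic discriminant functions of the true and the most competing class, $\Lambda$ the classifier parameters, and $\lambda$ the Lagrange multiplier. The derivative line is the standard sensitivity (envelope) result under the usual regularity conditions; the exact derivation in the paper may differ.

```latex
\begin{aligned}
z^{*} &= \arg\min_{z} \; \lVert z - x \rVert^{2}
  \quad \text{s.t.} \quad
  h(z;\Lambda) \equiv g_i(z;\Lambda) - g_j(z;\Lambda) = 0, \\
\lvert \mathrm{SSM}(x) \rvert &= \lVert z^{*} - x \rVert, \\
\frac{\partial}{\partial \Lambda} \lVert z^{*} - x \rVert^{2}
  &= \lambda^{*} \, \frac{\partial h(z^{*};\Lambda)}{\partial \Lambda},
  \quad \text{with Lagrangian } \lVert z - x \rVert^{2} + \lambda \, h(z;\Lambda).
\end{aligned}
```

Since $g_i$ and $g_j$ are both quadratic in $z$, their difference $h$ is a single quadratic equality constraint, and the objective is convex, which matches the special structure noted above.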

Experimental Setup
Vocabulary:
  – 6763 simplified Chinese characters
Dataset:
  – Training: 9,447,328 character samples (# of samples per class: 952 to 5,600)
  – Testing: 614,369 character samples
Feature extraction:
  – 512 “8-directional features”
  – Use LDA to reduce dimension to 128
Use MQDF for each character class
  – # of retained eigenvectors: 5 and 10
SSM-based MCE Training
  – Use maximum likelihood (ML) trained model as seed model
  – Update mean vectors only in MCE training
  – Optimize MCE objective function by batch-mode Quickprop (20 epochs)
[Figure: Distribution of writing styles in testing data]
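
To make the role of the retained eigenvectors concrete, the sketch below evaluates the commonly cited MQDF distance (after Kimura et al.) for one class. It is an illustration under assumptions: the slides do not give the formula, the function and argument names are hypothetical, and `delta` stands for the constant that replaces the minor eigenvalues.

```python
import math


def mqdf_distance(x, mean, eigvals, eigvecs, delta):
    """MQDF distance of x to one class (smaller means a better match).

    eigvals: the K largest eigenvalues of the class covariance matrix
    eigvecs: the K matching orthonormal eigenvectors, each of length D
    delta:   constant substituted for the remaining D - K eigenvalues
    """
    diff = [xi - mi for xi, mi in zip(x, mean)]
    sq_norm = sum(d * d for d in diff)

    principal = 0.0    # Mahalanobis-like term in the retained subspace
    proj_energy = 0.0  # energy of diff captured by the retained subspace
    for lam, phi in zip(eigvals, eigvecs):
        p = sum(a * b for a, b in zip(phi, diff))
        principal += p * p / lam
        proj_energy += p * p

    residual = (sq_norm - proj_energy) / delta
    log_det = sum(math.log(lam) for lam in eigvals)
    log_det += (len(x) - len(eigvals)) * math.log(delta)
    return principal + residual + log_det


# Toy case with D = 2 and K = 1 retained eigenvector.
print(mqdf_distance([2.0, 1.0], [0.0, 0.0], [4.0], [[1.0, 0.0]], 1.0))
```

With a 128-dimensional feature vector and K = 5 or 10, only K projections per class are needed, which keeps storage and computation manageable for thousands of classes.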

Experimental Results (1)
MQDF, K=5
Regular (error in %)
  – ML: 1.73
  – MCE: 1.29
  – SSM-MCE: 1.19
Cursive (error in %)

Experimental Results (2)
MQDF, K=10
Regular (error in %)
  – ML: 1.39
  – MCE: 1.30
  – SSM-MCE: 1.07
Cursive (error in %)
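
The reported error rates on regular-style test data can be turned into relative error reductions with a few lines of arithmetic. The helper name `relative_error_reduction` is an assumption; the numbers are the ones listed for K=5 and K=10.

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative error reduction in percent, going from baseline to new."""
    return 100.0 * (baseline_err - new_err) / baseline_err


# Error rates (%) on regular-style test data, as reported in the slides.
regular_errors = {
    5: {"ML": 1.73, "MCE": 1.29, "SSM-MCE": 1.19},
    10: {"ML": 1.39, "MCE": 1.30, "SSM-MCE": 1.07},
}

for k, errs in regular_errors.items():
    mce_gain = relative_error_reduction(errs["ML"], errs["MCE"])
    ssm_gain = relative_error_reduction(errs["MCE"], errs["SSM-MCE"])
    print(f"K={k}: ML to MCE {mce_gain:.2f}%, MCE to SSM-MCE {ssm_gain:.2f}%")
```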

Experimental Results (3)
Histogram of SSMs on training set
  – SSM-based MCE-trained classifier vs. conventional MCE-trained one
  – Training samples are pushed away from decision boundaries
  – The bigger the SSM, the better the generalization

Conclusion and Discussions
SSM-based MCE training offers an implicit way of minimizing empirical error rate and maximizing sample separation margin simultaneously
  – Verified for quadratic classifiers in this study
  – Verified for piecewise linear classifiers previously (He & Huo, ICPR-2008)
Ongoing and future work
  – SSM-based MCE training for discriminative feature extraction
  – SSM-based MCE training for more flexible classifiers based on GMM and HMM
  – Searching for other (hopefully better) methods to combine MCE training and maximum margin training