Sample-Separation-Margin Based Minimum Classification Error Training of Pattern Classifiers with Quadratic Discriminant Functions
Yongqiang Wang (1,2), Qiang Huo (1)
(1) Microsoft Research Asia, Beijing, China; (2) The University of Hong Kong, Hong Kong, China
ICASSP-2010, Dallas, Texas, U.S.A., March 14-19, 2010

Outline
– Background
– What's our new approach
– How does it work
– Conclusions

Background of Minimum Classification Error (MCE) Formulation for Pattern Classification
Pioneered by Amari and Tsypkin in the late 1960s
– S. Amari, "A theory of adaptive pattern classifiers," IEEE Trans. on Electronic Computers, Vol. EC-16, No. 3.
– Y. Z. Tsypkin, Adaptation and Learning in Automatic Systems.
– Y. Z. Tsypkin, Foundations of the Theory of Learning Systems.
Proposed originally for supervised online adaptation of a pattern classifier
– to minimize the expected risk (cost)
– via a sequential probabilistic descent (PD) algorithm
Extended by Juang and Katagiri in the early 1990s
– B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. on Signal Processing, Vol. 40, No. 12, 1992.

MCE Formulation by Juang and Katagiri (1)
Define a proper discriminant function of an observation for each pattern class
– to enable a maximum-discriminant decision rule for pattern classification
– largely an art, and application dependent
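In standard notation (the symbols below are the usual ones from the MCE literature, not copied from the slides), given discriminant functions g_k(x; Λ) for classes k = 1, ..., M, the decision rule the formulation relies on is

  C(x) = \arg\max_{1 \le k \le M} g_k(x; \Lambda)

i.e., the classifier assigns x to the class whose discriminant function scores highest.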

MCE Formulation by Juang and Katagiri (2)
Define a misclassification measure for each observation
– to embed the decision process in the overall MCE formulation
– to characterize the degree of confidence (or margin) in the decision made for this observation
– a differentiable function of the classifier parameters
A popular choice: a smoothed competitor-based measure (one common form is written out below)
Many possible ways => which one is better? => an open problem!
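One widely used instance of such a measure, written in the notation introduced above (the smoothing constant η is illustrative; the formula on the original slide is not reproduced here), is

  d_k(x; \Lambda) = -g_k(x; \Lambda) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} \exp\big( \eta \, g_j(x; \Lambda) \big) \right]^{1/\eta}

so that d_k(x; Λ) > 0 roughly corresponds to a misclassification of x and d_k(x; Λ) < 0 to a correct decision, with η controlling how closely the smoothed competitor term tracks the single best competing class.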

MCE Formulation by Juang and Katagiri (3)
Define a loss (cost) function for each observation
– a differentiable and monotonically increasing function of the misclassification measure
– many possibilities => the sigmoid function is the most popular choice for approximating MCE
MCE training via minimizing
– the empirical average loss (cost) by an appropriate optimization procedure, e.g., gradient descent (GD), Quickprop, Rprop, etc., or
– the expected loss (cost) by a sequential probabilistic descent (PD) algorithm (a.k.a. GPD)
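As a hedged restatement in the same notation (γ and θ below are the customary sigmoid slope and offset parameters; they are assumptions, not read off the slides), the per-sample loss and the empirical criterion are

  \ell_k(x; \Lambda) = \frac{1}{1 + \exp\big(-\gamma (d_k(x; \Lambda) + \theta)\big)}, \qquad L(\Lambda) = \frac{1}{N} \sum_{n=1}^{N} \ell_{k_n}(x_n; \Lambda)

where k_n is the true class of training sample x_n; minimizing L(Λ) by GD, Quickprop or Rprop gives batch-mode MCE training, while stochastic updates aimed at the expected loss give the GPD variant.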

Some Remarks
– Combinations of different choices for each of the previous three steps, together with different optimization methods, lead to a variety of MCE training algorithms.
– The power of MCE training has been demonstrated by many research groups for different pattern classifiers in different applications.
– Open question: how to improve the generalization capability of an MCE-trained classifier?

One Possible Solution: SSM-Based MCE Training
Sample Separation Margin (SSM)
– defined as the smallest distance from an observation to the classification boundary formed by the true class and the most competing class (see the formula below)
– a closed-form solution exists for piecewise linear classifiers
Define the misclassification measure as the negative SSM
– the other parts of the formulation are the same as in "traditional" MCE
A happy result
– minimized empirical error rate, and
– improved generalization: correctly recognized training samples have a large margin from the decision boundaries!
For more info:
– T. He and Q. Huo, "A study of a new misclassification measure for minimum classification error training of prototype-based pattern classifiers," in Proc. ICPR-2008.
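In symbols, for a training sample x with true class k and most competing class r = arg max_{j≠k} g_j(x; Λ), the SSM can be written as the distance to the decision boundary between the two classes (a standard restatement; the notation is mine, not copied from the slides):

  \mathrm{SSM}(x; \Lambda) = \min_{x'} \|x - x'\| \quad \text{s.t.} \quad g_k(x'; \Lambda) = g_r(x'; \Lambda)

and, per the slide, the misclassification measure is then taken as d_k(x; Λ) = -SSM(x; Λ) for correctly classified samples, so that driving the loss down pushes training samples away from the decision boundary.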

What's New in This Study?
Extend SSM-based MCE training to pattern classifiers with quadratic discriminant functions (QDFs)
– no closed-form solution exists for calculating the SSM
Demonstrate its effectiveness on a large-scale Chinese handwriting recognition task
– the modified QDF (MQDF) is widely used in state-of-the-art Chinese handwriting recognition systems
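For orientation, the (unmodified) QDF for class k with mean μ_k, covariance Σ_k and prior P(k) is the Gaussian log-likelihood-based discriminant (standard form, not copied from the slides):

  g_k(x; \Lambda) = -\tfrac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2} \log |\Sigma_k| + \log P(k)

The MQDF keeps only the K leading eigenvectors of each class covariance (K = 5 or 10 in the experiments below) and replaces the remaining eigenvalues with a constant, which regularizes the estimate and cuts storage and computation; this is why it is the workhorse model in large-vocabulary Chinese handwriting recognition.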

Two Technical Issues
How to calculate the SSM efficiently?
– formulated as a nonlinear programming problem
– can be solved efficiently because it is a quadratically constrained quadratic programming (QCQP) problem with a very special structure: a convex objective function with a single quadratic equality constraint (a numerical sketch is given below)
How to calculate the derivative of the SSM?
– using a technique known as sensitivity analysis in nonlinear programming
– calculated from the solution to the problem in Eq. (1)
Please refer to our paper for details.
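The following is a minimal numerical sketch of the SSM computation for quadratic discriminants, assuming Gaussian (QDF) scores and using SciPy's generic SLSQP solver rather than the dedicated QCQP solver described in the paper; all function and variable names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def qdf_score(x, mean, cov_inv, log_det):
    """Quadratic discriminant g(x) = -0.5*(x-mu)' Sigma^{-1} (x-mu) - 0.5*log|Sigma| (constants dropped)."""
    d = x - mean
    return -0.5 * d @ cov_inv @ d - 0.5 * log_det

def sample_separation_margin(x0, true_cls, comp_cls):
    """Smallest Euclidean distance from sample x0 to the decision boundary
    {x : g_true(x) = g_comp(x)} between the true class and its top competitor.

    `true_cls` and `comp_cls` are dicts with keys 'mean', 'cov_inv', 'log_det'.
    This generic nonlinear-programming formulation is only an illustration; the
    paper exploits the special QCQP structure (convex objective, one quadratic
    equality constraint) for an efficient solver, and differentiates the result
    via sensitivity analysis.
    """
    objective = lambda x: float(np.sum((x - x0) ** 2))        # squared distance to the sample (convex)
    boundary = lambda x: (qdf_score(x, true_cls['mean'], true_cls['cov_inv'], true_cls['log_det'])
                          - qdf_score(x, comp_cls['mean'], comp_cls['cov_inv'], comp_cls['log_det']))
    x_init = 0.5 * (x0 + comp_cls['mean'])                     # start between the sample and the competitor
    res = minimize(objective, x_init, method='SLSQP',
                   constraints=[{'type': 'eq', 'fun': boundary}])
    return np.sqrt(res.fun), res.x                             # (margin, closest point on the boundary)
```

As a sanity check on hypothetical 2-D data, two identity-covariance classes with equal priors and means at the origin and at (2, 0) give a margin of about 1.0 for a sample at the true-class mean, i.e., half the distance between the means, as expected for a linear boundary.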

Experimental Setup
Vocabulary:
– 6,763 simplified Chinese characters
Dataset:
– training: 9,447,328 character samples (952 to 5,600 samples per class)
– testing: 614,369 character samples
Feature extraction:
– 512 "8-directional features"
– LDA used to reduce the dimension to 128
An MQDF is used for each character class
– number of retained eigenvectors: 5 and 10
SSM-based MCE training:
– the maximum likelihood (ML) trained model is used as the seed model
– only the mean vectors are updated in MCE training
– the MCE objective function is optimized by batch-mode Quickprop (20 epochs)
(Figure: distribution of writing styles in the testing data.)

Experimental Results (1)
MQDF, K=5 (error rates in %):
  Method     Regular   Cursive
  ML         1.73      –
  MCE        1.29      –
  SSM-MCE    1.19      –

Experimental Results (2)
MQDF, K=10 (error rates in %):
  Method     Regular   Cursive
  ML         1.39      –
  MCE        1.30      –
  SSM-MCE    1.07      –

Experimental Results (3)
Histogram of SSMs on the training set
– SSM-based MCE-trained classifier vs. conventional MCE-trained one
– training samples are pushed away from the decision boundaries
– the bigger the SSM, the better the generalization

Conclusion and Discussions
SSM-based MCE training offers an implicit way of minimizing the empirical error rate and maximizing the sample separation margin simultaneously
– verified for quadratic classifiers in this study
– verified for piecewise linear classifiers previously (He & Huo, ICPR-2008)
Ongoing and future work
– SSM-based MCE training for discriminative feature extraction
– SSM-based MCE training for more flexible classifiers based on GMMs and HMMs
– searching for other (hopefully better) ways to combine MCE training and maximum margin training