Minimum Classification Error Networks. Based on book chapter 9 by Shigeru Katagiri. Jaakko Peltonen, 28th February 2002.

Presentation transcript:

Minimum Classification Error Networks. Based on book chapter 9 by Shigeru Katagiri. Jaakko Peltonen, 28th February 2002

1 Introduction
• Speech recognition: map a dynamic speech instantiation to a class
• Adjust the recognizer parameters for the best future recognition accuracy
• Bayes approach: decision rule is C(x) = C_i iff P(C_i | x) = max_j P(C_j | x)
• Estimate the class posterior probabilities as precisely as possible
• Direct estimation with ANNs has not been successful
• Typical modules: feature extractor, classifier (language model, acoustic model)

2 Introduction
• Feature extraction is crucial; feature sets: LPC, cepstrum, filter-bank spectrum
• Problems with ML estimation: the form of the class distributions is unknown, and likelihood is not directly linked to classification error
• Discriminative training: a discriminant function in place of the conditional class probability, trained to minimize a loss
• More closely linked to classification error, but optimality is often not well proven, and versatility is limited
• No interaction between feature extractor and classifier, so there is no guarantee that the features are optimal for classification

3 Discriminative Pattern Classification
• Bayes decision theory (static samples): feature pattern x
• Individual loss ℓ_ik: the loss of deciding class C_i when the true class is C_k
• Expected loss: R(C_i | x) = Σ_k ℓ_ik P(C_k | x)
• Overall risk: L = E_x[ R(C(x) | x) ] (defines accuracy)
• Error count loss: ℓ_ik = 0 if i = k, 1 otherwise (0-1 loss)
• Minimizing the overall risk with the error count loss gives the minimum error rate; simulating the expected loss requires estimating the class probabilities: the ML approach (difficult)

4 Discriminative Pattern Classification
• Discriminative training: evaluate the error count loss accurately
• Classification criterion: C(x) = C_i iff g_i(x; Λ) = max_j g_j(x; Λ)
• Training is characterized by
  - discriminant function: specific to the pattern type
  - design objective (loss): e.g. the number of classification errors
  - optimization method: heuristic or with proven convergence
  - consistency/generalization: depends on the choice of discriminant
• Ideal overall loss: L(Λ) = E[ℓ(x; Λ)]
• Empirical average loss: L_N(Λ) = (1/N) Σ_n ℓ(x_n; Λ) (see the sketch below)
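As a concrete illustration (not from the slides), the maximum-discriminant decision rule and the empirical error-count loss can be sketched in Python; the linear discriminant functions and the toy data are invented for the example:

    import numpy as np

    # Toy linear discriminant functions g_j(x; Lambda) = w_j . x + b_j (illustrative only).
    rng = np.random.default_rng(0)
    M, F, N = 3, 4, 20                      # classes, feature dimension, samples
    W = rng.normal(size=(M, F))             # one weight vector per class
    b = rng.normal(size=M)
    X = rng.normal(size=(N, F))             # feature patterns
    y = rng.integers(0, M, size=N)          # true class labels

    def classify(X):
        """Decision rule: C(x) = C_i iff g_i(x) = max_j g_j(x)."""
        scores = X @ W.T + b                # discriminant values g_j(x)
        return np.argmax(scores, axis=1)

    def empirical_error_loss(X, y):
        """Empirical average of the 0-1 (error count) loss over the training set."""
        return np.mean(classify(X) != y)

    print("empirical error-count loss:", empirical_error_loss(X, y))

The empirical 0-1 loss is a step function of the parameters, which is why the following slides replace it with a smooth surrogate.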

5 Discriminative Pattern Classification
• Other loss forms:
  - perceptron loss, squared error loss, mutual information
  - suboptimal; results may be inconsistent with minimum classification error (MCE)
• Optimization: batch/sequential, error correction, stochastic approximation, simulated annealing, gradient search
• Purpose: accurate classification for the task at hand, not just the training data
  - additional information is needed
  - ML introduces a parametric probability function
  - discriminative training is more moderate: consistency comes from the choice of discriminant function

6 Generalized Probabilistic Descent
• Problems with existing recognizers:
  - lack of optimality results (LVQ, corrective training)
  - minimal squared error or maximal mutual information does not imply minimal misclassification
  - they concentrate on acoustic modeling, not the overall process
• Generalized probabilistic descent (GPD): approximate the classification error count loss using sigmoidal functions and an Lp norm (see the sketch below)
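A minimal numeric sketch of the idea (illustrative, not from the slides): the 0-1 error count is replaced by a smooth sigmoid of a misclassification measure d, so the loss becomes differentiable; the slope parameter gamma is an assumption for the example.

    import numpy as np

    def sigmoid_loss(d, gamma=2.0):
        """Smooth surrogate for the 0-1 loss: ~0 when d << 0 (correct), ~1 when d >> 0 (error)."""
        return 1.0 / (1.0 + np.exp(-gamma * d))

    # d > 0 indicates a misclassification, d < 0 a correct decision.
    for d in [-3.0, -0.5, 0.0, 0.5, 3.0]:
        print(f"d = {d:+.1f}  0-1 loss = {float(d > 0)}  smoothed loss = {sigmoid_loss(d):.3f}")

As gamma grows, the sigmoid approaches the hard error count; smaller gamma spreads the gradient over more samples.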

7 GPD Basics
• Input sample: a sequence of T F-dimensional vectors, x = (x_1, ..., x_T)
• Classifier: M classes, each with a set of trainable prototypes
• Decision rule based on distances: C(x) = C_i iff g_i(x; Λ) = min_j g_j(x; Λ)
• Distances are minimal-path distances to the closest prototypes (see the simplified sketch below)
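A simplified sketch of the prototype-based distance classifier (assumption: static vectors instead of a dynamic sequence, so plain Euclidean distances stand in for the minimal-path distances; the dimensions and data are invented):

    import numpy as np

    rng = np.random.default_rng(1)
    M, P, F = 3, 4, 12                        # classes, prototypes per class, feature dimension
    prototypes = rng.normal(size=(M, P, F))   # trainable prototype vectors per class

    def class_distances(x):
        """g_j(x): distance from x to the closest prototype of each class."""
        d = np.linalg.norm(prototypes - x, axis=2)   # (M, P) prototype distances
        return d.min(axis=1)                         # closest prototype per class

    def classify(x):
        """Decision rule: pick the class whose closest prototype is nearest."""
        return int(np.argmin(class_distances(x)))

    x = rng.normal(size=F)
    print("class distances:", class_distances(x), "-> class", classify(x))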

8 GPD Basics
• Design target: find the optimal parameters Λ, adjusting them based on the individual loss
• Adjustment is gradient-based, but the classification error is not differentiable w.r.t. the parameters
• Solution: replace the hard discriminant decision by smooth functional approximations
• Misclassification measure d_k(x; Λ): combines the discriminants using
  - an Lp norm over the competing classes
  - an Lp norm over the paths
  (a smooth measure that is positive for misclassifications; see the sketch below)
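A sketch of a misclassification measure for score-type discriminants (assumptions: the soft-max / log-sum-exp combination over competing classes used in the MCE literature, larger g_j meaning stronger evidence for class j, and invented toy scores):

    import numpy as np

    def misclassification_measure(g, k, eta=5.0):
        """d_k = -g_k + (1/eta) * log( mean_{j != k} exp(eta * g_j) ).

        Approaches -g_k + max_{j != k} g_j as eta -> infinity:
        d_k > 0  <=>  some competing class scores higher than the correct one.
        """
        g = np.asarray(g, dtype=float)
        competitors = np.delete(g, k)
        soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
        return -g[k] + soft_max

    # Toy discriminant scores for 3 classes; the true class is k = 0.
    print(misclassification_measure([2.0, 0.5, -1.0], k=0))   # negative: correct decision
    print(misclassification_measure([0.2, 1.5, -1.0], k=0))   # positive: misclassification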

9 Probabilistic Descent Theorem
• If the classifier parameters are adjusted by Λ(t+1) = Λ(t) − ε_t U ∇ℓ(x_t; Λ(t)), with U a positive-definite matrix (see the sketch below),
• then the overall loss decreases on average: E[ΔL(Λ)] ≤ 0
• The parameters converge to a local loss minimum if Σ_t ε_t → ∞ and Σ_t ε_t² < ∞
• The functions are smooth, so gradient adjustment can be used
• The adjustment is done for all paths and all prototypes
• Loss function: a sigmoidal function of the misclassification measure, ℓ_k = 1 / (1 + exp(−γ d_k))
  - if the probability form is known, a smooth loss yields the MAP decision
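A minimal sequential-update sketch combining the pieces above (assumptions: linear discriminants g_j(x) = w_j·x, the limiting form of the misclassification measure that uses only the strongest competitor, a fixed learning rate, and invented toy data):

    import numpy as np

    rng = np.random.default_rng(2)
    M, F = 3, 5
    W = rng.normal(scale=0.1, size=(M, F))     # classifier parameters Lambda (one w_j per class)

    def gpd_step(x, k, W, eps=0.1, gamma=2.0):
        """One probabilistic-descent step on the sigmoidal MCE loss for sample (x, true class k)."""
        g = W @ x                              # discriminant values g_j(x) = w_j . x
        j = np.argmax(np.where(np.arange(M) == k, -np.inf, g))   # strongest competing class
        d = -g[k] + g[j]                       # misclassification measure (eta -> infinity limit)
        ell = 1.0 / (1.0 + np.exp(-gamma * d)) # sigmoidal loss
        coeff = gamma * ell * (1.0 - ell)      # derivative of the loss w.r.t. d
        W = W.copy()
        W[k] += eps * coeff * x                # gradient descent: d(d)/d(w_k) = -x
        W[j] -= eps * coeff * x                #                   d(d)/d(w_j) = +x
        return W, ell

    # One pass over random samples drawn around class-dependent means (toy data).
    means = rng.normal(scale=2.0, size=(M, F))
    for _ in range(200):
        k = rng.integers(M)
        x = means[k] + rng.normal(size=F)
        W, ell = gpd_step(x, k, W)
    print("loss on last sample:", round(float(ell), 3))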

10 Experiments
• E-set task: classify E-rhyme letters (9 classes)
  - modified k-means, 3 prototypes: 64.1% to 64.9%
  - MCE/GPD, 3 prototypes: 74.0% to 77.2% (up to 84.4% for 4 prototypes)
• P-set task: classify Japanese phonemes (41 classes)
  - segmental k-means, 5 prototypes: 86.8%
  - MCE/GPD, 5 prototypes: 96.2%

11 Derivatives
• More realistic applications are needed: connected word recognition, open-vocabulary speech recognition...
• GPD has been extended to a family of more suitable methods
• Segmental GPD: classify continuous speech
  - sub-word models: divide the input into segments
  - HMM-based acoustic models
  - discriminant function: class membership of a connected-word sample, a likelihood measure
  - misclassification: an Lp norm over competing word sequences
  - softmax reparameterization (see the sketch below)
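A sketch of the softmax reparameterization idea (assumption: it is shown here for one row of HMM transition probabilities; unconstrained parameters are mapped through a softmax so that any gradient update keeps the probabilities positive and summing to one):

    import numpy as np

    def softmax(u):
        """Map unconstrained parameters u to a valid probability vector."""
        e = np.exp(u - u.max())          # shift for numerical stability
        return e / e.sum()

    # Unconstrained parameters for one row of an HMM transition matrix.
    u = np.array([0.3, -1.2, 0.8])
    a = softmax(u)
    print("transition probs:", a.round(3), "sum =", a.sum())

    # A gradient step on u (any step) still yields a valid distribution.
    u_new = u - 0.5 * np.array([0.1, -0.4, 0.2])   # illustrative gradient
    print("after update:   ", softmax(u_new).round(3))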

12 Open-Vocabulary Speech Recognition
• Recognize only selected keywords
• Approaches: 1) keyword spotting (threshold comparison), 2) a 'filler' (non-keyword) model with continuous recognition
• 1) Design the model and threshold to minimize the spotting error
  - if the discriminant (distance) is low, a keyword is present
  - error types: false detection, false alarm
  - the loss function can be adjusted to emphasize either error type
[Figure: the mechanism of keyword spotting — speech goes into the recognizer, which outputs the detected keywords]

13 Open-Vocabulary Speech Recognition
• 2) Two classifiers: target and 'imposter'
  - the target must be more likely than the 'imposter' (ratio above a threshold) for the keyword to be accepted
  - GPD is used to minimize false detections and false alarms
  - distance: log of the likelihood ratio (see the sketch below)
  - the loss is selected by error type
  - speaker recognition is similar
[Figure: likelihood ratio test — speech is scored by a target (keyword) classifier and an alternate classifier; training produces the keyword models and the alternate models (language and filler models), yielding a keyword hypothesis in context versus the alternate hypothesis]
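A minimal sketch of the accept/reject rule (assumptions: single diagonal Gaussians stand in for the target and alternate classifiers, and the threshold value is invented):

    import numpy as np

    def log_gaussian(x, mean, var):
        """Log density of a diagonal Gaussian, summed over dimensions."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def accept_keyword(x, target, alternate, threshold=0.0):
        """Accept iff the log likelihood ratio exceeds the threshold."""
        llr = log_gaussian(x, *target) - log_gaussian(x, *alternate)
        return llr >= threshold, llr

    # Toy models: (mean, variance) pairs for the target keyword and the alternate ('imposter') hypothesis.
    target = (np.array([1.0, 1.0]), np.array([0.5, 0.5]))
    alternate = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))

    for x in [np.array([0.9, 1.1]), np.array([-0.5, 0.2])]:
        decision, llr = accept_keyword(x, target, alternate, threshold=0.5)
        print(x, "log likelihood ratio = %.2f" % llr, "-> accept" if decision else "-> reject")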

14 Discriminative Feature Extraction
• Replace the classification rule with a recognition rule that includes the feature extraction process T: C(x) = C_i iff g_i(T(x); Λ) = max_j g_j(T(x); Λ)
• The overall recognizer is optimized (the extraction parameters via the chain rule; see the sketch below)
• Also applicable to intermediate features
[Figure: speech passes through the feature extractor into the classifier (acoustic and language models); training evaluates the class output using the loss and updates both modules]
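A sketch of the chain-rule idea (assumptions: a diagonal weighting of the features stands in for the feature extractor, linear discriminants for the classifier, and the strongest-competitor misclassification measure; all names and data are invented for the example):

    import numpy as np

    rng = np.random.default_rng(3)
    M, F = 3, 6
    theta = np.ones(F)                       # feature-extractor parameters (a learnable weighting)
    W = rng.normal(scale=0.1, size=(M, F))   # classifier parameters

    def dfe_step(x, k, theta, W, eps=0.05, gamma=2.0):
        """One joint update of extractor and classifier on the sigmoidal MCE loss."""
        z = theta * x                        # feature extraction T(x; theta)
        g = W @ z                            # discriminants on the extracted features
        j = np.argmax(np.where(np.arange(M) == k, -np.inf, g))   # strongest competing class
        d = -g[k] + g[j]                     # misclassification measure
        ell = 1.0 / (1.0 + np.exp(-gamma * d))
        coeff = gamma * ell * (1.0 - ell)
        # Chain rule: dL/dtheta = dL/dz * dz/dtheta, with dL/dz = coeff * (W[j] - W[k]).
        grad_theta = coeff * (W[j] - W[k]) * x
        W_new = W.copy()
        W_new[k] += eps * coeff * z          # classifier update (as in plain GPD)
        W_new[j] -= eps * coeff * z
        theta_new = theta - eps * grad_theta # extractor update via the chain rule
        return theta_new, W_new

    x, k = rng.normal(size=F), 1
    theta, W = dfe_step(x, k, theta, W)
    print("updated extractor weights:", theta.round(3))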

15 Discriminative Feature Extraction
• Example: cepstrum-based speech recognition
• Lifter shape: low-quefrency components are important; the shape is usually found by trial and error
• DFE: design the lifter to minimize recognition errors
  - error reduction from 14.5% to 11.3%

16 Discriminative Feature Extraction
• Discriminative metric design:
  - each class has its own metric (feature extractor), used in its discriminant in the decision rule
• Minimum error learning subspace method:
  - PCA subspace design does not guarantee a low classification error
  - iterative training is better but not rigorous; DFE has been used instead

17 Exercise
• Give example problems (training situations) where these alternative design objectives give suboptimal solutions in the minimum classification error sense:
  - maximum likelihood
  - minimum perceptron loss
  - minimum squared error
  - maximal mutual information