Automatic Speech Recognition Based on Weighted Minimum Classification Error Training Method
Qiang Fu, Biing-Hwang Juang
School of Electrical & Computer Engineering, Georgia Institute of Technology
ASRU 2007
Presented by: Fang-Hui Chu

Outline
Introduction
Weighted word error rate
The minimum risk decision rule & the weighted MCE method
Training scenarios & weighting strategies in ASR
Experimental results for weighted MCE
Conclusion & future work

Review of Bayes decision theory
A conditional loss is incurred when classifying an observation into a class event; averaging it over observations gives the expected loss function
If we impose the assumption that the error loss function is uniform, minimizing the expected loss reduces to the maximum a posteriori (MAP) decision rule
–This transforms the classifier design problem into a distribution estimation problem
This formulation has several limitations!
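For reference, a sketch of these quantities in standard Bayes decision-theory notation; the symbols (ε_ij for the cost of deciding class C_i when C_j is true, M for the number of classes) are assumptions of this sketch, since the slide's own equations were not preserved:

```latex
% Conditional loss (risk) of deciding class C_i given observation X,
% where \epsilon_{ij} is the cost of deciding C_i when C_j is true:
R(C_i \mid X) = \sum_{j=1}^{M} \epsilon_{ij}\, P(C_j \mid X)

% Expected loss of a decision rule C(\cdot):
\mathcal{L} = \int R\big(C(X)\mid X\big)\, p(X)\, dX

% With the uniform (0/1) cost \epsilon_{ij} = 1 - \delta_{ij},
% minimizing the risk reduces to the MAP rule:
C(X) = \arg\max_{i} P(C_i \mid X)
```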

Introduction
In a variety of ASR applications, some errors should be considered more critical than others in terms of the system objective
–e.g., keyword spotting systems, speech understanding systems, …
–Differentiating the significance of recognition errors is therefore necessary, and a nonuniform error cost function becomes appropriate
–This turns classifier design into an error cost minimization problem instead of a distribution estimation problem

An example for non-uniform error rate
Here is an example of using a non-uniform error rate. The weighted word error rate (WWER) weights each word error by its significance.
Reference: AT N. E. C. THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
Hypothesis 1: AT ANY SEE THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
Hypothesis 2: AT N. E. C. NEEDS FOR INTERNATIONAL MANAGER'S WILL KEEP RISING
The two recognition results have the same equal-significance word error rate. But which is better?

An example for non-uniform error rate (cont.)
An example of the weighted word error rate, on the same reference and hypotheses:
Reference: AT N. E. C. THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
Hypothesis 1: AT ANY SEE THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
Hypothesis 2: AT N. E. C. NEEDS FOR INTERNATIONAL MANAGER'S WILL KEEP RISING
If errors on the keyword "N. E. C." carry more weight than the morphological errors in Hypothesis 2, the WWER ranks Hypothesis 2 as the better result, even though the two unweighted word error rates are equal.
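A minimal Python sketch of how a WWER of this kind can be computed for the example above. The per-word weights (keyword tokens weighted 5x) and the helper names are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of weighted word error rate (WWER), assuming the common
# definition: edit errors are weighted per word and normalized by the
# total reference weight. Weights below are illustrative only.

def wwer(ref, hyp, weight):
    """Weighted WER via dynamic-programming alignment.

    Substitutions/deletions cost the reference word's weight;
    insertions cost the hypothesis word's weight.
    """
    n, m = len(ref), len(hyp)
    # D[i][j] = minimal weighted edit cost of aligning ref[:i] to hyp[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight(ref[i - 1])          # deletions
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight(hyp[j - 1])          # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else weight(ref[i - 1])
            D[i][j] = min(D[i - 1][j - 1] + sub,             # match/substitute
                          D[i - 1][j] + weight(ref[i - 1]),  # delete
                          D[i][j - 1] + weight(hyp[j - 1]))  # insert
    return D[n][m] / sum(weight(w) for w in ref)

# Illustrative weighting: the keyword "N. E. C." matters more than fillers.
KEYWORDS = {"N.", "E.", "C."}
w = lambda word: 5.0 if word in KEYWORDS else 1.0

ref  = "AT N. E. C. THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING".split()
hyp1 = "AT ANY SEE THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING".split()
hyp2 = "AT N. E. C. NEEDS FOR INTERNATIONAL MANAGER'S WILL KEEP RISING".split()

print(wwer(ref, hyp1, w))  # errors hit the heavily weighted keyword
print(wwer(ref, hyp2, w))  # errors hit low-weight words -> lower WWER
```

Both hypotheses have three word errors (equal unweighted WER), but the keyword errors in Hypothesis 1 give it a much larger WWER under this weighting.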

The minimum risk decision rule
The minimum risk (MR) decision rule:
–involves a weighted combination of the a posteriori probabilities for all the classes (sketched below)
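In the same assumed notation as above, the MR rule can be sketched as choosing the class with minimum conditional risk:

```latex
% Minimum risk (MR) decision rule: pick the class with the smallest
% weighted combination of posteriors (the conditional risk):
C(X) = \arg\min_{i} \sum_{j=1}^{M} \epsilon_{ij}\, P(C_j \mid X)
```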

A practical MR rule
In practice the true a posteriori probabilities are unknown, so the MR rule is recast in terms of model-based discriminant functions.

A practical MR rule (cont.)
We can prescribe a discriminant function g_i(X; Λ) for each class, and define the practical decision rule for the recognizer in terms of these discriminants, as sketched below. The alternative system loss is then the expected non-uniform error cost incurred by this rule.
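A sketch in common MCE notation, where g_i(X; Λ) denotes the class-i discriminant under model parameters Λ and 1(·) the indicator function (both are notational assumptions):

```latex
% Practical decision rule via discriminant functions:
C(X) = C_i \quad \text{iff} \quad i = \arg\max_{j}\, g_j(X;\Lambda)

% Alternative system loss: the expected non-uniform cost incurred by
% this rule, with \mathbb{1}(\cdot) the indicator function:
L(\Lambda) = \sum_{i=1}^{M} \int \mathbb{1}\big(C(X)=C_i\big)
             \Big[\sum_{j=1}^{M} \epsilon_{ij}\, P(C_j \mid X)\Big] p(X)\, dX
```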

A practical MR rule (cont.)
An approximation then needs to be made to the summands, since the true a posteriori probabilities are not available in practice; one common form is sketched below.
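One common choice in the discriminative-training literature, sketched here under the assumption that posteriors are approximated by a softmax over discriminant scores (the paper's exact approximation may differ):

```latex
% Posteriors approximated by a softmax over discriminant scores,
% with \eta > 0 a smoothing constant:
P(C_j \mid X) \approx
  \frac{\exp\big(\eta\, g_j(X;\Lambda)\big)}
       {\sum_{k=1}^{M}\exp\big(\eta\, g_k(X;\Lambda)\big)}
```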

The weighted MCE method
The objective function of the weighted MCE attaches a non-uniform error cost to each training token's MCE loss, as sketched below.
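A sketch of the objective in the standard MCE form (sigmoid loss ℓ, misclassification measure d_i, smoothing constants γ, θ, η), with a per-token cost ε_i(X) attached; the notation is the conventional one from the MCE literature rather than a verbatim copy of the slide:

```latex
% Weighted MCE objective: each token's smoothed error is scaled by a
% non-uniform error cost \varepsilon_i(X):
L(\Lambda) = \sum_{i=1}^{M} \sum_{X \in C_i}
             \varepsilon_i(X)\, \ell\big(d_i(X;\Lambda)\big),
\qquad
\ell(d) = \frac{1}{1 + e^{-\gamma d + \theta}}

% with the usual MCE misclassification measure
d_i(X;\Lambda) = -g_i(X;\Lambda)
  + \frac{1}{\eta}\log\Big[\frac{1}{M-1}\sum_{j\neq i}
      e^{\eta\, g_j(X;\Lambda)}\Big]
```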

Training scenarios
Intra-level training
–The training and recognition decisions are on the same semantic level as the performance measure
Inter-level training
–The training and recognition decisions are on a different semantic level from the performance metric
–Minimizing the cost of wrong recognition decisions therefore does not directly optimize the recognizer's performance in terms of the evaluation metric
–To alleviate this inconsistency, the error weighting strategy can be built in a cross-level fashion

Two types of error cost
User-defined cost
–Usually characterized by the system requirement, and relatively straightforward
Data-defined cost
–More complicated
–Wrong decisions occur because the underlying data observation deviates from the distribution represented by the models
–"Bad" data, or "bad" models?
–It is possible to measure the "reliability" of the errors by introducing the data-defined weighting

Error weighting for intra-level training
In the intra-level training situation, the system performance is directly measured by the loss of wrong recognition decisions
We can absorb both types of error weighting into the error cost function as one universal functional form
The objective function for the weighted MCE can then be written as shown below
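One natural universal form, assuming (for illustration only) that the two weightings combine multiplicatively; the paper's exact functional form may differ:

```latex
% Universal per-token cost absorbing both weighting types
% (multiplicative combination assumed for illustration):
\varepsilon_i(X) = \varepsilon_i^{\mathrm{user}}(X)\cdot\varepsilon_i^{\mathrm{data}}(X)

% giving the intra-level W-MCE objective
L(\Lambda) = \sum_{i=1}^{M}\sum_{X \in C_i}
             \varepsilon_i(X)\,\ell\big(d_i(X;\Lambda)\big)
```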

Error weighting for inter-level training
We need to use cross-level weighting in this case, to break down the high-level cost and impose the appropriate weights upon the low-level models
The user-defined weighting of the weighted MCE in inter-level training distributes the high-level cost over the low-level units

Error weighting for inter-level training (cont.)
The data-defined weighting of the weighted MCE in inter-level training is constructed analogously
A W-MCE objective function under the inter-level training scenario combines both weighting functions

Weighted MCE & the MPE/MWE method
MPE/MWE is a training method with a weighted objective function that mimics the accumulated training errors

Weighted MCE & the MPE/MWE method (cont.)
Maximizing the original MPE/MWE objective function is equivalent to minimizing a modified objective function, sketched below
In summary, MPE/MWE builds an objective function that incorporates the non-uniform error cost of each training utterance
–W-MCE & MPE/MWE are both rooted in Bayes decision theory, directed at the same aim of designing the optimal classifier that minimizes the non-uniform error cost
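A sketch of this relationship using the standard MPE/MWE formulation from the literature, where A(W, W_r) is the raw phone/word accuracy of hypothesis W against reference W_r and A_max(W_r) its maximum attainable value (notation assumed, not taken from the slide):

```latex
% MPE/MWE objective: the expected raw accuracy of hypotheses W under
% the model's hypothesis posterior for each training utterance O_r:
F(\Lambda) = \sum_{r}\sum_{W} P_\Lambda(W \mid O_r)\, A(W, W_r)

% Maximizing F is equivalent to minimizing the expected shortfall from
% the best attainable accuracy A_{\max}(W_r), i.e., a non-uniform
% error cost accumulated over training utterances:
\tilde{F}(\Lambda) = \sum_{r}\sum_{W} P_\Lambda(W \mid O_r)
                     \big[A_{\max}(W_r) - A(W, W_r)\big]
```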

W-MCE implementation
In our experiments, we assume that the weighting function contains only the data-defined weighting, for simplicity (a toy sketch follows)
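A toy numerical sketch of one W-MCE step with data-defined weighting only. The toy log-linear classifier, the confidence-based weight definition, and the finite-difference gradient are all illustrative assumptions; the actual system trains HMMs with GPD-style updates:

```python
# Toy sketch of a W-MCE training step. The data-defined weight used
# here (the smoothed posterior of the correct class, as a confidence
# proxy) is an illustrative assumption, not the paper's definition.
import numpy as np

def discriminants(x, W):
    return W @ x                      # g_j(x; W), one score per class

def mce_measure(g, i, eta=1.0):
    # d_i = -g_i + (1/eta) * log(mean_{j != i} exp(eta * g_j))
    others = np.delete(g, i)
    return -g[i] + np.log(np.mean(np.exp(eta * others))) / eta

def wmce_loss(X, y, W, weights, gamma=1.0):
    # Sigmoid-smoothed, per-token-weighted classification error.
    loss = 0.0
    for x, i, eps in zip(X, y, weights):
        d = mce_measure(discriminants(x, W), i)
        loss += eps / (1.0 + np.exp(-gamma * d))
    return loss / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4)); y = rng.integers(0, 3, size=20)
W = rng.normal(scale=0.1, size=(3, 4))

# Data-defined weighting only, as assumed in the experiments:
post = np.exp([discriminants(x, W) for x in X])
post /= post.sum(axis=1, keepdims=True)
weights = post[np.arange(len(y)), y]   # confidence of the correct class

# One descent step via finite differences (GPD would be used in practice).
eps_fd, grad = 1e-5, np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp = W.copy(); Wp[idx] += eps_fd
    grad[idx] = (wmce_loss(X, y, Wp, weights)
                 - wmce_loss(X, y, W, weights)) / eps_fd
W -= 0.5 * grad
print("W-MCE loss after one step:", wmce_loss(X, y, W, weights))
```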

Experiments
Database: WSJ0