An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training
Yik-Cheung Tam and Brian Mak
Outline
Motivation
Overview of MCE training
Problem using N-best hypotheses
Alternative: 1-nearest hypothesis (What? Why? How?)
Evaluation
Conclusion
MCE Overview
The MCE loss function: l(d(X)) = 1 / (1 + exp(-γ d(X) + θ)), a 0-1 soft error-counting function (Sigmoid).
Distance (misclassification) measure: d(X) = -g_correct(X) + G(X), where G(X) is the score of the competing hypotheses and may be computed using the N-best hypotheses.
Gradient descent is then applied to obtain a better estimate of the HMM parameters.
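A minimal sketch of these quantities, assuming the common soft-max form of G(X) over the N-best scores (the slide does not show the exact form); the function name mce_loss and the smoothing constant eta are illustrative, not from the paper:

```python
import numpy as np

def mce_loss(g_correct, g_competing, gamma=0.1, theta=0.0, eta=1.0):
    """Soft 0-1 MCE error count for one utterance.

    g_correct   : log score of the correct transcription.
    g_competing : log score(s) of the competing hypotheses
                  (a single value for the 1-best / 1-nearest case,
                  or an array for an N-best list).
    gamma, theta: slope and offset of the Sigmoid.
    eta         : smoothing constant of the soft-max over competitors.
    """
    g_competing = np.atleast_1d(np.asarray(g_competing, dtype=float))
    # Anti-discriminant G(X): smoothed maximum over the competing scores.
    # (A log-sum-exp with max subtraction would be used in practice for
    # numerical stability; kept simple here.)
    G = np.log(np.mean(np.exp(eta * g_competing))) / eta
    # Misclassification measure d(X) and its Sigmoid soft error count.
    d = -g_correct + G
    return 1.0 / (1.0 + np.exp(-gamma * d + theta))
```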
Problem Using N-best Hypotheses
When d(X) gets large enough, it falls out of the steep, trainable region of the Sigmoid, so the utterance contributes almost nothing to the gradient.
(Figure: Sigmoid loss with its steep trainable region marked.)
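Why such utterances stop contributing (standard Sigmoid algebra, with l(d) as on the previous slide):

```latex
\frac{\partial \ell}{\partial d}
  = \gamma\,\ell(d)\,\bigl(1 - \ell(d)\bigr)
  \;\longrightarrow\; 0
  \qquad \text{as } |d(X)| \to \infty .
```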
What Is the 1-nearest Hypothesis?
The competing hypothesis whose score is closest to that of the correct transcription, so that d(1-nearest) <= d(1-best).
The idea can be generalized to N-nearest hypotheses.
Using the 1-nearest Hypothesis
Keeping d(X) small keeps the training data inside the steep, trainable region of the Sigmoid.
(Figure: Sigmoid loss with the training tokens inside the trainable region.)
How to Find the 1-nearest Hypothesis?
Method 1 (exact approach): a stack-based N-best decoder. Drawback: N may be very large, causing memory problems, so the size of N must be limited.
Method 2 (approximate approach): modify the Viterbi algorithm with a special pruning scheme (sketched after the next two slides).
Approximated 1-nearest Hypothesis
Notation:
V(t+1, j): accumulated score at time t+1 and state j
a_ij: transition probability from state i to state j
b_j(o_{t+1}): observation probability at time t+1 and state j
V_c(t+1): accumulated score of the Viterbi path of the correct string at time t+1
Beam(t+1): beam width applied at time t+1
Approximated 1-nearest Hypothesis (.)
There exists some "nearest" path inside the pruned search space (shaded area of the figure).
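The slides do not spell out the pruning rule, so the following is only one plausible reading: a Viterbi pass over the competing-hypothesis network in which, at every frame, partial paths survive only if their accumulated score lies within Beam(t+1) of the correct string's score V_c(t+1). Function and variable names (viterbi_nearest, log_a, log_b, etc.) are illustrative, and the check that the recovered string actually differs from the correct transcription is omitted for brevity:

```python
import numpy as np

def viterbi_nearest(log_a, log_b, v_correct, beam):
    """Viterbi pass with frame-level pruning around the correct string's score.

    log_a     : (S, S) log transition probabilities of the search network.
    log_b     : (T, S) log observation probabilities b_j(o_t).
    v_correct : (T,)   accumulated score V_c(t) of the correct string's Viterbi path.
    beam      : (T,)   beam width Beam(t) applied at each frame.

    Returns a state sequence that, under this reading, approximates the
    1-nearest competing hypothesis, together with its final score.
    """
    T, S = log_b.shape
    v = np.full((T, S), -np.inf)        # accumulated scores V(t, j)
    back = np.zeros((T, S), dtype=int)  # backpointers for traceback

    v[0] = log_b[0]                     # assume every state may start a path
    for t in range(1, T):
        for j in range(S):
            scores = v[t - 1] + log_a[:, j]
            back[t, j] = int(np.argmax(scores))
            v[t, j] = scores[back[t, j]] + log_b[t, j]
        # Pruning (assumed two-sided): drop partial paths whose score strays
        # more than Beam(t) from the correct string's score V_c(t).
        v[t, np.abs(v[t] - v_correct[t]) > beam[t]] = -np.inf

    # Trace back from the best surviving final state.
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(v[-1].max())
```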
System Evaluation
Corpus: Aurora
Noisy connected digits derived from TIDIGITS.
Multi-condition training (train on noisy conditions): {subway, babble, car, exhibition} x {clean, 20, 15, 10, 5 dB} (5 noise levels); 8,440 training utterances.
Testing (test on matched noisy conditions): same as above, plus additional samples at 0 and -5 dB (7 noise levels); 28,028 testing utterances.
System Configuration
Standard 39-dimension MFCC (cepstra + deltas + delta-deltas).
11 whole-word digit HMMs (0-9, oh), 16 states, 3 Gaussians per state.
3-state silence HMM, 6 Gaussians per state; 1-state short-pause HMM tied to the 2nd state of the silence model.
Baum-Welch training to obtain the initial HMMs, followed by corrective MCE training of the HMM parameters.
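For reference, the same setup summarized as a small, purely illustrative Python dictionary; the key names are not from the paper, only the numbers above are:

```python
# Illustrative summary of the acoustic-model setup described on this slide.
AURORA_SETUP = {
    "features": {"type": "MFCC + delta + delta-delta", "dim": 39},
    "digit_models": {"words": "0-9 and oh (11 models)",
                     "states": 16, "gaussians_per_state": 3},
    "silence_model": {"states": 3, "gaussians_per_state": 6},
    "short_pause": {"states": 1, "tied_to": "2nd state of the silence model"},
    "training": ["Baum-Welch initialization",
                 "corrective MCE training of the HMM parameters"],
}
```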
System Configuration (.)
Compare 3 kinds of competing hypotheses: the 1-best hypothesis, the exact 1-nearest hypothesis, and the approx. 1-nearest hypothesis.
Sigmoid parameters: various slopes γ (controlling the steepness of the Sigmoid); offset = 0.
Experiment I: Effect of Sigmoid Slope
Learning rate = 0.05, with different slopes γ: 0.1 (best test performance), 0.5 (steeper), 0.02 (flatter).
(Figure: test WER versus Sigmoid slope.) Baseline: 12.71%; 1-best: 11.01%; approx. 1-nearest: 10.71%; exact 1-nearest: 10.45%.
Effective Amount of Training Data
Soft error < 0.95 is defined to be "effective".
The 1-nearest approach has more effective training data when the Sigmoid slope is relatively steep: 1-best 40%, approx. 1-nearest 51%, exact 1-nearest 67%.
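A trivial helper making the definition above concrete; effective_fraction is an illustrative name, not from the paper:

```python
def effective_fraction(soft_errors, threshold=0.95):
    """Fraction of training utterances whose Sigmoid soft error l(d) falls
    below the threshold, i.e. utterances still inside the trainable region."""
    return sum(1 for e in soft_errors if e < threshold) / len(soft_errors)
```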
Experiment II: Compensation With More Training Iterations
With 100% effective training data (γ = 0.004, learning rate = 0.05), apply more training iterations.
Result: improvement is slow compared with the best case (exact 1-nearest with γ = 0.1).
Experiment II: Compensation Using a Larger Learning Rate
Use a larger learning rate (0.05 -> 1.25) and fix γ = 0.004 (100% effective training data).
Result: the 1-nearest approach is better than the 1-best approach after compensation.

System                Before compensation   After compensation
Baseline              12.71%
1-best                12.07%                11.55%
Approx. 1-nearest     12.27%                10.70%
Exact 1-nearest       12.16%                10.79%
Using a Larger Learning Rate (.)
Training performance: MCE loss versus the number of training iterations (curves for 1-best, approx. 1-nearest, and exact 1-nearest).
Using a Larger Learning Rate (..)
Test performance: WER versus the number of training iterations (approx. 1-nearest: 10.70%; exact 1-nearest: 10.79%; 1-best: 11.55%).
Conclusion
The 1-best and 1-nearest methods were compared in MCE training, studying the effect of the Sigmoid slope and compensation when a flat Sigmoid is used.
The 1-nearest method is better than the 1-best approach: more trainable data are available with the 1-nearest approach.
The approx. and exact 1-nearest methods yield comparable performance.
Questions and Answers