Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures. Author: Eduardo Lleida, Richard C. Rose. Reporter: 陳燦輝.


2 Reference
[1] Eduardo Lleida and Richard C. Rose, "Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures," IEEE Trans. Speech and Audio Processing.
[2] J. K. Chan and F. K. Soong, "An N-best candidates-based discriminative training for speech recognition applications," Computer Speech and Language, 1995.
[3] W. Chou, B. H. Juang, and C. H. Lee, "Segmental GPD training of HMM based speech recognizer," ICASSP 1992.
[4] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Processing, 1992.

3 Outline
Introduction to Utterance Verification (UV)
Utterance Verification Paradigms
Utterance Verification Procedures
Confidence Measures
Likelihood Ratio-based Training
Experimental Results
Summary and Conclusions

4 Introduction to Utterance Verification Utterance Verification Paradigms

5 Introduction to Utterance Verification (cont) Utterance Verification Paradigms Some problems of UV: The observation vectors Y may be associated with a hypothesized word that is embedded in a string of words. The lack of a language model.
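The paradigm amounts to a statistical hypothesis test on each hypothesized word. A sketch in generic notation, where the symbols are chosen here for illustration (the slides' own equations are not reproduced): λ_w^c is the target (correct) model for hypothesis w, λ_w^a the alternate model, N the number of frames in the segment, and τ a decision threshold.

```latex
LR(Y; w) \;=\; \frac{P(Y \mid \lambda_w^{c})}{P(Y \mid \lambda_w^{a})},
\qquad
\text{accept } w \;\Longleftrightarrow\; \frac{1}{N}\,\log LR(Y; w) \;\ge\; \tau
```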

6 Introduction to Utterance Verification (cont) Utterance Verification Procedures Two-Pass Procedure: Fig. 1. Two-pass utterance verification, where a word string and associated segmentation boundaries hypothesized by a maximum likelihood CSR decoder are verified in a second stage using a likelihood ratio test.
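A minimal sketch of the second pass, assuming the first-pass decoder supplies, for each hypothesized word, the segment's target and alternate log-likelihoods and its frame count (this input format is hypothetical, not the paper's):

```python
def two_pass_verify(decoded, tau=0.0):
    """decoded: list of (word, target_loglik, alt_loglik, n_frames)
    tuples from a first-pass ML decoder (hypothetical format).
    Returns (word, accepted) pairs from the second-pass LR test."""
    out = []
    for word, lc, la, n in decoded:
        llr = (lc - la) / n          # frame-normalized log likelihood ratio
        out.append((word, llr >= tau))
    return out
```

For example, a segment whose target model clearly outscores the alternate model is accepted, while one the alternate model explains better is flagged as a likely false alarm.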

7 Introduction to Utterance Verification (cont) Utterance Verification Procedures One-Pass Procedure: Fig. 2. One-pass utterance verification, where the optimum decoded string is that which directly maximizes a likelihood ratio criterion.

8 Introduction to Utterance Verification (cont) Utterance Verification Procedures Likelihood Ratio Decoder

9 Introduction to Utterance Verification (cont) Utterance Verification Procedures Likelihood Ratio Decoder

10 Introduction to Utterance Verification (cont) Utterance Verification Procedures Likelihood Ratio Decoder There are two issues that must be addressed if LR decoding is to be applicable to actual speech recognition tasks: 1. computational complexity; 2. the definition of the alternate hypothesis model.

11 Introduction to Utterance Verification (cont) Utterance Verification Procedures Computational complexity Fig. 3. A possible three-dimensional HMM search space.

12 Introduction to Utterance Verification (cont) Utterance Verification Procedures Computational complexity Unit-level constraint: the target model and the alternate model must occupy their unit-initial states and unit-final states at the same time instants, i.e., their state sequences are aligned at the unit boundaries.

13 Introduction to Utterance Verification (cont) Utterance Verification Procedures Computational complexity State-level constraint: the target and alternate models are further constrained to occupy corresponding states at the same time instants, so the likelihood ratio can be accumulated state by state.

14 Introduction to Utterance Verification (cont) Utterance Verification Procedures Definition of alternative models The alternative hypothesis model has two roles in UV: 1. to reduce the effect of sources of variability; 2. to represent more specifically the incorrectly decoded hypotheses that are frequently confused with a given lexical item.

15 Introduction to Utterance Verification (cont) Utterance Verification Procedures Definition of alternative models The alternate model must somehow "cover" the entire space of out-of-vocabulary lexical units. If OOV utterances that are easily confused with vocabulary words are to be detected, the alternate model must provide a more detailed representation of the utterances that are likely to be decoded as false alarms for individual vocabulary words.

16 Introduction to Utterance Verification (cont) Utterance Verification Procedures Definition of alternative models

17 Introduction to Utterance Verification (cont) Utterance Verification Procedures Confidence measures It was suggested that modeling errors may result in extreme values in local likelihood ratios, which may exert undue influence at the word or phrase level. To minimize these effects, we investigated several word-level likelihood-ratio-based confidence measures that can be computed as a non-uniform combination of sub-word-level confidence measures.

18 Introduction to Utterance Verification (cont) Utterance Verification Procedures Confidence measures

19 Introduction to Utterance Verification (cont) Utterance Verification Procedures Confidence measures
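The idea of bounding extreme local ratios can be sketched as follows. The uniform average and the sigmoid-compressed average below are illustrative forms only, not the paper's exact W1–W4 definitions; alpha and beta are hypothetical smoothing parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def word_confidence_mean(subword_llrs):
    """Uniform average of sub-word log likelihood ratios:
    a single extreme value can dominate the word-level score."""
    return sum(subword_llrs) / len(subword_llrs)

def word_confidence_sigmoid(subword_llrs, alpha=1.0, beta=0.0):
    """Average of sigmoid-compressed sub-word scores: each local
    contribution is bounded to (0, 1), so one extreme local ratio
    cannot dominate the word-level measure."""
    return sum(sigmoid(alpha * (s - beta)) for s in subword_llrs) / len(subword_llrs)
```

With sub-word scores like [10, -1, -1], the uniform average is pulled strongly positive by the single large value, while the bounded form stays near the middle of its range.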

20 Likelihood Ratio-based Training The goal of the training procedure is to increase the average value of the likelihood ratio for correct hypotheses and decrease its average value for false alarms. LR-based training is a discriminative training algorithm based on a cost function which approximates a log likelihood ratio.

21 Likelihood Ratio-based Training (cont) A distance measure underlies the cost function.

22 Likelihood Ratio-based Training (cont)

23 Likelihood Ratio-based Training (cont) Imposters with scores greater than the threshold and targets with scores lower than the threshold tend to increase the average cost function. Therefore, minimizing this function reduces the misclassification between targets and imposters.
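A minimal sketch of such a smoothed cost, assuming one scalar confidence score per token and a sigmoid in place of the hard error count (the constants tau and alpha are illustrative, not the paper's values):

```python
import math

def smoothed_cost(scores, labels, tau=0.0, alpha=2.0):
    """Smoothed (sigmoid) error count: an imposter scoring above tau
    or a target scoring below tau contributes close to 1; a correctly
    placed token contributes close to 0."""
    total = 0.0
    for s, is_target in zip(scores, labels):
        d = (tau - s) if is_target else (s - tau)   # d > 0 means an error
        total += 1.0 / (1.0 + math.exp(-alpha * d))
    return total / len(scores)
```

The sigmoid makes the cost differentiable, which is what allows gradient-based model updates rather than counting hard errors.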

24 Likelihood Ratio-based Training (cont)

25 Likelihood Ratio-based Training (cont)

26 Likelihood Ratio-based Training (cont)

27 Likelihood Ratio-based Training (cont)

28 Likelihood Ratio-based Training (cont)

29 Likelihood Ratio-based Training (cont)

30 Likelihood Ratio-based Training (cont) The complete likelihood-ratio-based training procedure:
Train initial ML target and alternate HMMs for each unit.
For each iteration over the training database:
Obtain the hypothesized sub-word unit string and its segmentation using the LR decoder.
Align the decoded sub-word units as correct or false alarms to obtain the indicator function.
Update the gradient of the expected cost.
Update the model parameters as in (17).
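The loop above can be illustrated with a toy scalar model whose confidence score for token i is theta * x_i; the gradient step follows the smoothed-cost idea in spirit, not the paper's exact update (17), and every name below is hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(xs, labels, theta=0.0, tau=0.0, alpha=2.0, lr=0.5, iters=50):
    """Toy LR-style training: score of token i is theta * xs[i];
    labels[i] is True for a target (correct hypothesis) and False
    for an imposter (false alarm).  Gradient descent on the smoothed
    error count pushes target scores above tau and imposter scores
    below it."""
    for _ in range(iters):
        grad = 0.0
        for x, is_target in zip(xs, labels):
            s = theta * x
            d = (tau - s) if is_target else (s - tau)  # d > 0: misclassified
            dd_dtheta = -x if is_target else x         # derivative of d w.r.t. theta
            p = sigmoid(alpha * d)                     # smoothed per-token error
            grad += p * (1.0 - p) * alpha * dd_dtheta / len(xs)
        theta -= lr * grad                             # descend on the average cost
    return theta
```

Starting from theta = 0, training drives theta positive here, so targets (positive x) score above the threshold and imposters (negative x) below it.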

31 Experimental results Speech corpora: movie locator task In a trial of the system over the public switched telephone network, the service was configured to accept approximately 105 theater names, 135 city names, and between 75 and 100 current movie titles. A corpus of 4777 spontaneous spoken utterances from the trial was used in our evaluation.

32 Experimental results (cont) A total of 3025 sentences were used for training acoustic models and 1752 utterances were used for testing. The sub-word models used in the recognizer consisted of 43 context-independent units. Recognition was performed using a finite state grammar built from the specification of the service, with a lexicon of 570 different words.

33 Experimental results (cont) The total number of words in the test set was 4864, of which 134 were OOV. Recognition performance of 94% word accuracy was obtained on the "in-grammar" utterances. The feature set used for recognition included 12 mel-cepstrum, 12 delta mel-cepstrum, 12 delta-delta mel-cepstrum, energy, delta energy, and delta-delta energy coefficients; cepstral mean normalization was applied.

34 Experimental results (cont) A single "background" alternate HMM containing three states with 32 mixtures per state was used. A separate "imposter" alternate HMM was trained for each sub-word unit; these models contained three states with eight mixtures per state.

35 Experimental results (cont) Performance is described both in terms of receiver operating characteristic (ROC) curves and in terms of curves of type I + type II error plotted against the decision threshold setting.
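Given per-token confidence scores for targets (correct hypotheses) and imposters (false alarms), the type I + type II error at each threshold, and the threshold minimizing it, can be computed directly; a sketch with hypothetical score lists:

```python
def type1_plus_type2(target_scores, imposter_scores, tau):
    """Type I error: targets rejected (score < tau).
    Type II error: imposters accepted (score >= tau)."""
    type1 = sum(s < tau for s in target_scores) / len(target_scores)
    type2 = sum(s >= tau for s in imposter_scores) / len(imposter_scores)
    return type1 + type2

def min_error_threshold(target_scores, imposter_scores, candidate_taus):
    """Candidate threshold with the smallest summed error."""
    return min(candidate_taus,
               key=lambda t: type1_plus_type2(target_scores, imposter_scores, t))
```

Sweeping tau over candidate values traces the error-versus-threshold curve; the minimizing tau is the operating point reported as the minimum type I + type II error.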

36 Experimental results (cont) Experiment 1: Comparison of UV Measures Fig. 4. ROC curves comparing the performance of confidence measures using W1(w) (dashed line) and W2(w) (solid line) (left figure), and using W3(w) (dashed line) and W4(w) (solid line) (right figure).

37 Experimental results (cont) Experiment 1: Comparison of UV Measures Fig. 5. Type I + type II error comparing the performance of confidence measures using W3(w) (dashed line) and W4(w) (solid line). The error plot in Fig. 5 suggests that W4 is less sensitive to the setting of the confidence threshold. In the remaining experiments, W4 is used.

38 Experimental results (cont) Experiment 2: Investigation of LR Training and UV Strategies TABLE I Utterance verification performance: type I + type II minimum error rate for the one-pass (OP) and two-pass (TP) utterance verification procedures (b = number of mixtures for the background model, i = number of mixtures for the imposter model). Fig. 6. Likelihood ratio training: ROC curves for initial models (dash-dot line), one iteration (dashed line), and two iterations (solid line). The *-points are the minimum type I + type II error.

39 Experimental results (cont) Experiment 2: Investigation of LR Training and UV Strategies Fig. 7. One-pass versus two-pass UV comparison with the b32.i8 configuration and two iterations of likelihood ratio training.

40 Experimental results (cont) Experiment 3: whether the LR training procedure actually improved speech recognition performance. TABLE II Speech recognition performance given in terms of word accuracy without utterance verification, and utterance verification performance given as the sum of type I and type II error.

41 Experimental results (cont) Experiment 4: performance measured over in-grammar and out-of-grammar utterances separately. Fig. 8. In-grammar and out-of-grammar sentences. Initial models: dash-dot line; one iteration: dashed line; two iterations: solid line.

42 Summary and Conclusions The one-pass decoding procedure improved UV performance over the two-pass approach. Likelihood ratio training and decoding have also been successfully applied to other tasks, including speaker-dependent voice label recognition. Further research should investigate decoding and training paradigms for UV that incorporate additional, non-acoustic sources of knowledge.