NTNU Speech and Machine Intelligence Laboratory — Discriminative pronunciation modeling using the MPE criterion. Meixu SONG, Jielin PAN, Qingwei ZHAO, Yonghong YAN.

Presentation transcript:

Discriminative pronunciation modeling using the MPE criterion — Meixu SONG, Jielin PAN, Qingwei ZHAO, Yonghong YAN. Presented by Ming-Han Yang, 2015/08/11.


Outline
– Summary
– Introduction
– Pronunciation Models
– Incorporate PMs into MPE training
– Experiments and Results
– Conclusion and Discussion

Summary
Introducing pronunciation models into decoding has been proven beneficial to LVCSR. In this paper, a discriminative pronunciation modeling method is presented for HMM/GMM pronunciation models, within the framework of Minimum Phone Error (MPE) training. In order to bring the pronunciation models into MPE training, the auxiliary function is rewritten at the word level and decomposed into two parts:
– one for co-training the acoustic models;
– the other for discriminatively training the pronunciation models.
On a Mandarin conversational telephone speech recognition task:
– the discriminative pronunciation models reduced the absolute Character Error Rate (CER) by 0.7% on the LDC test set;
– with acoustic model co-training, an additional 0.8% absolute CER decrease was achieved.

Introduction (1/3)
Current LVCSR technology aims at transcribing real-world speech into sentences. Due to data sparsity, it is almost impossible to find a sufficiently direct conversion between speech and sentence. Therefore, this conversion is divided into three parts:
a) the conversion between speech feature vectors and subwords (phones, for example), described by Acoustic Models (AMs);
b) the conversion between words and sentence, described by the Language Model (LM);
c) the conversion between subwords and words, described by a lexicon.
We consider that a lexicon is composed of three parts: words, pronunciations, and Pronunciation Models (PMs). In many LVCSR systems, the lexicon is hand-crafted, which usually means the pronunciations are in canonical forms
– and the probability in the PMs can be considered a constant 1.
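The three conversions above combine in the standard Bayes decision rule for LVCSR. As a sketch in common ASR notation (symbols are ours, not necessarily the paper's), with B denoting a subword (pronunciation) sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \sum_{B}
          \underbrace{p(O \mid B)}_{\text{AMs}} \;
          \underbrace{P(B \mid W)}_{\text{lexicon / PMs}} \;
          \underbrace{P(W)}_{\text{LM}}
```

With a canonical hand-crafted lexicon, P(B|W) collapses to 1 for the single listed pronunciation, which is exactly the "constant 1" case described above.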

Introduction (2/3)
As to pronunciation learning:
[1] presented a discriminative pronunciation learning method using a phonetic decoder and the minimum classification error criterion. Previous work [2], [3] made use of a state-of-the-art letter-to-sound (L2S) system based on joint-sequence modeling [4] to generate pronunciations. Specifically for Mandarin pronunciation learning, the pronunciation variants of each constituent character in a word were enumerated to construct a pronunciation dictionary in [5].
– This method is used to generate pronunciations for words in this paper, and the implementation details will be described.

[1] O. Vinyals, L. Deng, D. Yu, and A. Acero, "Discriminative pronunciation learning using phonetic decoder and minimum classification error criterion"
[2] I. Badr, I. McGraw, and J. Glass, "Pronunciation learning from continuous speech"
[3] I. McGraw, I. Badr, and J. Glass, "Learning lexicons from speech using a pronunciation mixture model"
[4] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion"
[5] X. Lei, W. Wang, and A. Stolcke, "Data-driven lexicon expansion for Mandarin broadcast news and conversation speech recognition"

Introduction (3/3)
As to PM training:
In [2], [3], a pronunciation mixture model (PMM) was presented by treating the pronunciations of a particular word as components in a mixture,
– and the distribution probabilities were learned by maximizing the likelihood of the acoustic training data.
By contrast, in our work we modify the auxiliary function of standard MPE training [6] to incorporate PMs.
– By doing so, a discriminative pronunciation modeling method using the minimum phone error criterion is proposed, called MPEPM.
– In this method, the acoustic models and pronunciation models are co-trained in an iterative fashion under the MPE training framework.
In experiments on two Mandarin conversational telephone speech test sets, compared to the baseline using a canonical lexicon, the proposed method achieves 1.5% and 1.1% absolute decreases in CER, respectively.

[6] D. Povey, "Discriminative training for large vocabulary speech recognition"

Pronunciation Models

Incorporate PMs into MPE training (1/2)
The incorporation of PMs into MPE training is investigated, starting from the MPE objective function (Eq. (2)). To make Eq. (2) tractable, the auxiliary function of the MPE objective function is used; it involves the average phone accuracy of all paths in the lattice of the r-th training utterance [6].

[6] D. Povey, "Discriminative training for large vocabulary speech recognition"
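The equations on this slide were images in the original deck. As a sketch, the standard MPE objective from [6] (our notation, which may differ from the paper's) is:

```latex
% MPE objective: lattice-weighted expected phone accuracy over the training set
\mathcal{F}_{\mathrm{MPE}}(\lambda)
  = \sum_{r=1}^{R}
    \frac{\sum_{W} p_{\lambda}(O_r \mid W)^{\kappa} \, P(W) \, A(W, W_r)}
         {\sum_{W'} p_{\lambda}(O_r \mid W')^{\kappa} \, P(W')}
```

where A(W, W_r) is the raw phone accuracy of hypothesis W against the reference W_r and κ is an acoustic scale. In Povey's auxiliary function, each phone arc q in the lattice is weighted by γ_q^MPE = γ_q (c(q) − c̄_r), where c̄_r is exactly the average phone accuracy of all lattice paths of the r-th utterance mentioned above.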

Incorporate PMs into MPE training (2/2)
To incorporate PMs, as in Eq. (1), P(O|W) is expanded to P(O|B)P(B|W); the MPE objective function is then rewritten accordingly.
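With P(O|W) expanded to P(O|B)P(B|W), the objective gains the PM parameters θ. A sketch obtained by direct substitution into the MPE objective (ours; the paper's exact scaling of P(B|W) may differ):

```latex
\mathcal{F}_{\mathrm{MPE}}(\lambda, \theta)
  = \sum_{r=1}^{R}
    \frac{\sum_{W,B} p_{\lambda}(O_r \mid B)^{\kappa}
          \big[ P_{\theta}(B \mid W) \, P(W) \big]^{\kappa} A(W, W_r)}
         {\sum_{W'\!,B'} p_{\lambda}(O_r \mid B')^{\kappa}
          \big[ P_{\theta}(B' \mid W') \, P(W') \big]^{\kappa}}
```

Separating the terms in λ (acoustic models) from the terms in θ (pronunciation models) is what yields the two-part decomposition of the auxiliary function described in the Summary.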

Co-train AMs (1/2)

Co-train AMs (2/2)

MPEPM
Referring to the auxiliary function used to update mixture weights in MPE training, we define an auxiliary function for Eq. (13).
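In [6], mixture weights are updated by maximizing a weak-sense auxiliary function of the form below; by analogy, the pronunciation probabilities θ_{w,b} = P(b|w) can be updated in the same way. This is a sketch only: the paper's Eq. (13) is not visible in this transcript, so its exact form is an assumption.

```latex
% Povey-style weak-sense auxiliary for weight-type parameters,
% applied here to per-word pronunciation probabilities
G(\theta) = \sum_{w,b} \left(
    \gamma^{\mathrm{num}}_{w,b} \log \theta_{w,b}
  - \gamma^{\mathrm{den}}_{w,b} \,
    \frac{\theta_{w,b}}{\theta^{(\mathrm{old})}_{w,b}} \right),
  \qquad \text{subject to} \quad \sum_{b} \theta_{w,b} = 1
```

Here γ^num and γ^den are numerator and denominator occupancies accumulated from the reference and hypothesis lattices during standard MPE training, which is why the required statistics come for free from the existing training pass.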

Experiments and Results (1/3) – Construct Pronunciation Dictionary
The task is Mandarin conversational speech recognition. We utilized the method employed in [5] to construct a pronunciation dictionary for 43k Chinese words.
– A character pronunciation dictionary with 7.8k pronunciations for 6.7k Chinese characters was used to construct a full pronunciation set with 85k pronunciations.
After performing a forced alignment of the acoustic training data, a threshold of 0.5 relative to the maximum frequency among each word's pronunciations was set to prune out low-frequency pronunciations.
– Finally, the pronunciation dictionary used to train PMs consisted of 47k pronunciations.
– The frequencies of the remaining pronunciations of every word are normalized to form the initial pronunciation models.

[5] X. Lei, W. Wang, and A. Stolcke, "Data-driven lexicon expansion for Mandarin broadcast news and conversation speech recognition"
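The pruning-and-normalization step above can be sketched as follows (function and variable names are ours, for illustration): given per-word pronunciation counts from forced alignment, drop any pronunciation below 0.5 of that word's most frequent one, then renormalize the survivors into initial PM probabilities.

```python
def prune_and_normalize(pron_counts, threshold=0.5):
    """Build initial pronunciation models from forced-alignment counts.

    pron_counts: {word: {pronunciation: count}} from forced alignment.
    threshold:   keep a pronunciation only if its count is at least
                 `threshold` times the word's most frequent pronunciation.
    Returns {word: {pronunciation: probability}} with probs summing to 1.
    """
    models = {}
    for word, counts in pron_counts.items():
        max_count = max(counts.values())
        # Prune low-frequency pronunciations relative to the word's maximum.
        kept = {p: c for p, c in counts.items() if c >= threshold * max_count}
        # Normalize remaining frequencies into initial PM probabilities.
        total = sum(kept.values())
        models[word] = {p: c / total for p, c in kept.items()}
    return models
```

Applied to the paper's setting, this is the step that reduces the 85k-pronunciation set to the 47k pronunciations used for PM training.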

Experiments and Results (2/3) – Baseline
Training data: 400 hours (2 parts)
– LDC: CallHome & CallFriend (45.9 hours) & LDC04 training sets (150 hours)
– 200 hours of additional speech data
Three steps in the front-end processing:
1) reduced-bandwidth analysis to generate 56-dimensional feature vectors (13-dimensional PLP and smoothed F0, appended with the first, second, and third order derivatives);
2) utterance-based cepstral mean and variance normalization (CMS/CVN);
3) finally, a heteroscedastic linear discriminant analysis (HLDA) applied to project the 56-dimensional feature vectors down to 42 dimensions.
The phone set for HMM modeling consists of 179 tonal phones. The final HMMs are cross-word triphone models with a 3-state left-to-right topology, trained with the Minimum Phone Error (MPE) criterion. Robust state clustering with phonetic decision trees is used; 7995 tied triphone states are empirically determined, with 16 Gaussian components per state.
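Step 2 of the front-end, utterance-based cepstral mean and variance normalization, can be sketched as a minimal NumPy routine (names are ours): per utterance, subtract each feature dimension's mean and divide by its standard deviation over all frames.

```python
import numpy as np

def cmvn(features):
    """Utterance-based cepstral mean and variance normalization.

    features: (num_frames, num_dims) array of one utterance's features,
              e.g. the 56-dimensional PLP+F0 vectors described above.
    Returns an array of the same shape with per-dimension zero mean
    and unit variance across the utterance.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Guard against constant dimensions to avoid division by zero.
    return (features - mean) / np.maximum(std, 1e-10)
```

In a real front-end this would run after PLP/F0 extraction and before the HLDA projection.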

Experiments and Results (3/3) – Recognition Results
Two test sets:
– HTest04: HKUST, 4 hours (24 phone calls)
– GDTest: 0.5 hour (354 conversations by phone)
From these results, MPEPM shows its effectiveness.

Conclusions and Discussion (1/2)
In this work, we presented a discriminative pronunciation modeling method based on MPE training. We rewrote the auxiliary function of the MPE training at the word level and incorporated PMs into it.
– By doing this, we explored a way to discriminatively co-train the acoustic models and the pronunciation models in an iterative fashion.
We demonstrated that the required statistics can be obtained from standard MPE training.
– Thus, this method is easy and efficient to implement.
– Finally, experimental results on a Mandarin conversational speech recognition task demonstrated the effectiveness of this method.

Conclusions and Discussion (2/2)
Since current state-of-the-art systems make use of Deep Neural Networks (DNNs), we discuss how this MPEPM could be used in a DNN-based framework. Currently, there are two main approaches to incorporating DNNs into acoustic modeling:
1) the TANDEM system: DNNs act as feature extractors to derive bottleneck features, which can then be used to train traditional HMM/GMM models. In this case, the MPEPM implementation remains unchanged.
2) the hybrid system: DNNs estimate the posterior probabilities of HMM states. MPEPM can be efficiently implemented within the sequence-discriminative training of DNNs,
– as both are based on reference and hypothesis lattices, especially for training with the state-level minimum Bayes risk (sMBR) criterion, which is derived from the MPE criterion.
