Performance Improvement of Dysarthric Speech Recognition Using Context-Dependent Pronunciation Variation Modeling Based on Kullback-Leibler Distance

Woo Kyeong Seong, Ji Hun Park, and Hong Kook Kim
School of Information and Communications
Gwangju Institute of Science and Technology (GIST)
1 Oryong-dong, Buk-gu, Gwangju 500-712, Korea
{wkseong, jh_park, hongkook}@gist.ac.kr

Abstract. In this paper, we propose context-dependent pronunciation variation modeling based on the Kullback-Leibler (KL) distance for improving the performance of dysarthric automatic speech recognition (ASR). To this end, we construct a triphone confusion matrix based on KL distances between triphone models and build a weighted finite state transducer (WFST) from the triphone confusion matrix. Dysarthric speech is first recognized by a baseline ASR system, and the phoneme sequence of the recognized sentence is then passed through the WFST to correct recognition errors. Dysarthric ASR experiments show that the average word error rate of an ASR system employing error correction based on the proposed method is relatively reduced by 16.54% and 3.34%, compared to that of an ASR system without any error correction and that of a system using error correction based on a conventional context-dependent phoneme confusion matrix, respectively.

Keywords: Dysarthric speech recognition, Kullback-Leibler distance, context-dependent pronunciation variation model, weighted finite state transducer, speech recognition error correction

1 Introduction

Dysarthria refers to impairment of speech resulting from damaged control of the oral, pharyngeal, or laryngeal articulators [1]. This impairment results in pronunciation variations such as deletion, substitution, insertion, and distortion of particular phonemes, which can degrade dysarthric automatic speech recognition (ASR) performance [2]. To compensate for these pronunciation variations, several works have automatically extracted pronunciation variation rules using a data-driven approach [3][4]. However, in the case of dysarthric speech, the rules extracted by a data-driven approach are unreliable because the available databases are too small [5]. For this reason, a model-based approach can be considered as an alternative for extracting pronunciation variation rules for dysarthric speech. We therefore propose a context-dependent (CD) pronunciation variation modeling method for correcting speech recognition errors. The proposed method is based on the Kullback-Leibler (KL) distance between acoustic models, and it is incorporated into the construction of a weighted finite state transducer (WFST) for error correction.

2 Proposed KL Distance Based Context-Dependent Pronunciation Variation Modeling Method

In order to obtain pronunciation variation rules, we first calculate the KL distances between acoustic models. In this paper, the acoustic models are composed of triphones, where each triphone is represented as a three-state, left-to-right hidden Markov model (HMM) with four Gaussian mixtures. Thus, the KL distance between two triphones is defined as [6]

$$ KL(L{-}C_i{+}R \,\|\, L{-}C_j{+}R) = \sum_{k=1}^{4} \omega_{i,k}\, KL(b_{i,k} \,\|\, b_{j,k}) + \sum_{k=1}^{4} \omega_{i,k} \log \frac{\omega_{i,k}}{\omega_{j,k}} \tag{1} $$

where L−C_i+R denotes a triphone whose center phoneme is C_i and whose left and right contexts are the phonemes L and R, and ω_{i,k} is the k-th Gaussian mixture weight of the triphone L−C_i+R. In addition, KL(b_{i,k} || b_{j,k}) is the KL distance between the k-th Gaussian mixtures of L−C_i+R and L−C_j+R, defined as

$$ KL(b_{i,k} \,\|\, b_{j,k}) = \frac{1}{2}\left( \ln\frac{|\Sigma_{j,k}|}{|\Sigma_{i,k}|} + \mathrm{tr}\!\left(\Sigma_{j,k}^{-1}\Sigma_{i,k}\right) + (\mu_{j,k}-\mu_{i,k})^{T}\,\Sigma_{j,k}^{-1}\,(\mu_{j,k}-\mu_{i,k}) - K \right) \tag{2} $$

where μ_{i,k} and Σ_{i,k} represent the mean vector and covariance matrix of b_{i,k}, respectively, and K is the feature vector dimension, which is 39 in this paper. Therefore, KL(L−C_i+R || L−C_j+R) in Eq. (1) corresponds to the degree of pronunciation variation from phoneme C_i to phoneme C_j with the same left and right contexts L and R.
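As a concrete illustration of Eqs. (1) and (2), the following minimal Python/NumPy sketch computes the matched-component KL distance between two triphone GMM states. This is our own sketch, not the authors' implementation: the function names and the random test data are assumptions, and a full system would accumulate the distance over the three HMM states of each triphone.

```python
import numpy as np

def kl_gaussian(mu_i, cov_i, mu_j, cov_j):
    """Eq. (2): KL distance between two K-dimensional Gaussians b_{i,k} || b_{j,k}."""
    K = mu_i.shape[0]
    cov_j_inv = np.linalg.inv(cov_j)
    diff = mu_j - mu_i
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)
    return 0.5 * (logdet_j - logdet_i
                  + np.trace(cov_j_inv @ cov_i)
                  + diff @ cov_j_inv @ diff
                  - K)

def kl_triphone(weights_i, means_i, covs_i, weights_j, means_j, covs_j):
    """Eq. (1): matched-component KL distance between two triphone GMM states,
    summing the per-mixture Gaussian KL terms and the mixture-weight term."""
    d = 0.0
    for k in range(len(weights_i)):
        d += weights_i[k] * kl_gaussian(means_i[k], covs_i[k], means_j[k], covs_j[k])
        d += weights_i[k] * np.log(weights_i[k] / weights_j[k])
    return d

# Example: two states with 4 mixtures over 39-dimensional MFCC features,
# using random diagonal covariances purely for demonstration.
rng = np.random.default_rng(0)
means = rng.normal(size=(2, 4, 39))
covs = np.array([[np.diag(rng.uniform(0.5, 2.0, 39)) for _ in range(4)]
                 for _ in range(2)])
weights = np.full((2, 4), 0.25)
print(kl_triphone(weights[0], means[0], covs[0], weights[1], means[1], covs[1]))
```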

Next, we obtain a triphone confusion matrix whose ij-th element is KL(L−C_i+R || L−C_j+R). Finally, the triphone confusion matrix is transformed into a weighted finite state transducer (WFST). Each element of the triphone confusion matrix is represented as a self-loop in which the WFST, given an input phoneme C_i in the context L−C_i+R, outputs a phoneme C_j weighted by KL(L−C_i+R || L−C_j+R).

Fig. 1. An example of the weighted finite state transducer obtained from a triphone confusion matrix.

Fig. 1 illustrates an example of a WFST obtained from the triphone confusion matrix. If the phoneme /SH/ in the context /AH/-/SH/+/IH/ passes through the WFST, the WFST outputs the phoneme /CH/, /SH/, or /S/ with KL distance values of 4.145, 5.254, or 8.013, respectively. Here, a higher value indicates more confusability.
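As a rough illustration of how the confusion WFST acts on a recognized phoneme sequence, the sketch below emulates the self-loop lookup with a plain dictionary populated with the Fig. 1 values. The data structure and function names are hypothetical; the paper's actual method composes a genuine WFST (e.g., with standard WFST toolkits), and selecting among the weighted candidates is left to that machinery.

```python
# Illustrative stand-in for the confusion WFST built from the triphone
# confusion matrix: each (left, center, right) triphone maps to candidate
# output phonemes with their KL-distance weights (values from Fig. 1).
confusion_wfst = {
    ("AH", "SH", "IH"): [("CH", 4.145), ("SH", 5.254), ("S", 8.013)],
}

def candidate_corrections(phonemes):
    """For each interior phoneme of a recognized sequence, look up the
    weighted alternatives the confusion WFST would emit for its triphone."""
    out = []
    for i in range(1, len(phonemes) - 1):
        triphone = (phonemes[i - 1], phonemes[i], phonemes[i + 1])
        # Triphones absent from the matrix pass through unchanged with weight 0.
        out.append(confusion_wfst.get(triphone, [(phonemes[i], 0.0)]))
    return out

print(candidate_corrections(["AH", "SH", "IH"]))
# [[('CH', 4.145), ('SH', 5.254), ('S', 8.013)]]
```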

3 Performance Evaluation

To evaluate the performance of the proposed method, we constructed the following three ASR systems: a baseline ASR system (Baseline), an ASR system with error correction based on a data-driven CD phoneme confusion matrix (EC-DD) [4], and an ASR system with error correction based on the proposed CD phoneme confusion matrix (EC-KL). For the baseline ASR system, 10,000 utterances of British English sentences from 92 non-dysarthric speakers were used as a training database [7]. As recognition features, 39-dimensional mel-frequency cepstral coefficient vectors were used. The acoustic models were composed of 22,433 triphones, where each triphone was represented as a three-state, left-to-right hidden Markov model (HMM) with four Gaussian mixtures. In addition, the triphone models of the baseline ASR system were adapted to each dysarthric speaker with 34 utterances from the Nemours database [8]. The lexicon size was 112 words, and a back-off bigram model was employed as the language model. For EC-DD and EC-KL, the CD phoneme confusion matrices were constructed using the same 34 utterances used for adapting the triphone models. As a test database, we used 400 utterances from ten dysarthric speakers [8].

Table 1 compares the average word error rates (WERs) of Baseline, EC-DD, and EC-KL. As shown in the table, EC-KL provided the lowest WER, relatively reducing the average WER by 3.34% compared to EC-DD.

Table 1. Comparison of average word error rates (%) of the baseline ASR system (Baseline) and the ASR systems with error correction based on a conventional data-driven CD phoneme confusion matrix (EC-DD) and the proposed KL-distance-based CD phoneme confusion matrix (EC-KL).

System                 Baseline   EC-DD   EC-KL
Word error rate (%)       36.75   31.73   30.67

4 Conclusion

In this paper, we proposed a context-dependent pronunciation variation modeling method based on the KL distance for improving the performance of dysarthric speech recognition. First, a triphone confusion matrix was constructed based on the KL distances between triphone models, and a weighted finite state transducer (WFST) was built from the triphone confusion matrix. The WFST was then used to correct speech recognition errors by passing the phoneme sequence of each recognized sentence through it. Speech recognition experiments showed that a dysarthric ASR system employing the proposed pronunciation variation modeling method provided a relative WER reduction of 3.34%, compared to one employing a data-driven pronunciation variation modeling method.

Acknowledgments. This research was supported in part by the R&D Program of MKE/KEIT [1003641, Development of an embedded key-word spotting speech recognition system individually customized for disabled persons with dysarthria], and by the basic research project through a grant provided by the Gwangju Institute of Science and Technology (GIST) in 2012.

References

1. Haines, D.: Neuroanatomy: An Atlas of Structures, Sections, and Systems. Lippincott Williams and Wilkins, Hagerstown, MD (2004)
2. Hasegawa-Johnson, M., Gunderson, J., Perlman, A., Huang, T.: HMM-based and SVM-based recognition of the speech of talkers with spastic dysarthria. In: Proceedings of ICASSP, Toulouse, France, pp. 1060-1063 (2006)
3. Morales, S. O. C., Cox, S. J.: Modeling errors in automatic speech recognition for dysarthric speakers. EURASIP Journal on Advances in Signal Processing, Article ID 308340, 14 pages (2009)
4. Seong, W. K., Park, J. H., Kim, H. K.: Dysarthric speech recognition error correction using weighted finite state transducers based on context-dependent pronunciation variation. In: Proceedings of International Conference on Computers Helping People with Special Needs, Linz, Austria, pp. 475-482 (2012)
5. You, H., Alwan, A.: A statistical acoustic confusability metric between hidden Markov models. In: Proceedings of ICASSP, Los Angeles, CA, pp. 745-748 (2007)
6. Do, M. N.: Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE Signal Processing Letters 10(4), 115-118 (2003)
7. Robinson, T., Fransen, J., Pye, D., Foote, J., Renals, S.: WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition. In: Proceedings of ICASSP, Detroit, MI, pp. 81-84 (1995)
8. Menendez-Pidal, X., Polikoff, J. B., Peters, S. M., Leonzio, J. E., Bunnell, H. T.: The Nemours database of dysarthric speech. In: Proceedings of ICSLP, Philadelphia, PA, pp. 1962-1965 (1996)