A Two-pass Framework of Mispronunciation Detection & Diagnosis for Computer-aided Pronunciation Training
Xiaojun Qian, Member, IEEE, Helen Meng, Fellow, IEEE, and Frank Soong, Fellow, IEEE
IEEE Transactions on Audio, Speech and Language Processing, 2016
Presented by Yao-Chi Hsu, 2016/03/27

Abstract
This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnosis (MD&D).
1. The first pass, mispronunciation detection, does not require explicit phonetic error pattern modeling.
2. The second pass, mispronunciation diagnosis, expands the detected insertions and substitutions into phone networks, and another recognition pass attempts to reveal the phonetic identities of the detected mispronunciation errors.

Introduction (1/5)
Second-language learners are eager to attain high pronunciation proficiency in the target language, but learning the unfamiliar sounds and prosody of a new language is not easy. High proficiency is attained through extensive practice with guidance from language instructors, yet good (especially native) language instruction is often a scarce resource. Automatic speech recognition (ASR) technologies can support an online computer-aided pronunciation training (CAPT) platform that supplements teachers' instruction with round-the-clock accessibility and individualized feedback for second-language learners.

Introduction (2/5)
CAPT platforms integrate various spoken language technologies. One essential technology is mispronunciation detection, which locates phones that are incorrectly articulated; this is a binary decision. To provide useful feedback for the learner, another important technology is mispronunciation diagnosis, which identifies the incorrect phone(s) produced in place of the canonical phone.

Introduction (3/5)
Predominant approaches to CAPT perform mispronunciation detection by extending ASR technologies, especially through post-processing of recognition scores (e.g., by thresholding [Witt and Young 2000] or classification [Wei et al. 2009]). Other studies model error patterns in the learners' speech using linguistic knowledge [Herron et al. 1999] or a data-driven approach [Wang and Lee 2013]. Error pattern modeling enables the generation of a set of possible phonetic error patterns for a given text prompt.

Introduction (4/5)
A single-pass forced alignment over a recognition network expanded with these error patterns enables phonetic error detection and diagnosis of erroneous phonetic production in an integrated fashion. The major consideration behind such a paradigm of MD&D is to properly constrain the search space in ASR and thus avoid drifting towards free-phone recognition (a far less tractable problem).

Introduction (5/5)
In the proposed framework, a first-pass recognition over a network of canonical-phone/anti-phone pairs (detailed below) detects phonetic substitutions. This first pass can also detect insertions and deletions by introducing a filler model and designing the network topology to allow phone skips. Once the insertions and substitutions are detected, a second pass performs free-phone recognition on the segments of detected insertions and substitutions to identify the actual phones. Correspondingly, we refine the two sets of HMM-based acoustic models by individualized discriminative training to minimize the expected full-sequence phone-level errors.

Previous Work (1/3)
A. Mispronunciation Detection Using Phone-level Scores
- The "Goodness of Pronunciation" (GOP) score, introduced by [Witt and Young 2000], is an approximation to the posterior score and uses either a frame-based or a phone-based likelihood ratio.
- Detection can also be cast as a binary classification problem, where an array of phone-level confidence measures spanning a "pronunciation space" [Wei et al. 2009] serves as the classification features.
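In its phone-based likelihood-ratio form, the GOP score is commonly written as follows (standard notation from the literature, not copied from these slides):

```latex
% Phone-based Goodness of Pronunciation (GOP) score
\mathrm{GOP}(p) = \frac{1}{N_f(p)}
  \left| \log \frac{p\left(\mathbf{O}^{(p)} \mid p\right)}
                   {\max_{q \in Q} p\left(\mathbf{O}^{(p)} \mid q\right)} \right|
```

Here O^(p) is the speech segment force-aligned to the canonical phone p, N_f(p) is its number of frames, and Q is the phone set; the phone is flagged as mispronounced when its GOP falls below a (typically phone-dependent) threshold.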

Previous Work (2/3)
B. Mispronunciation Detection Using Other Features
A considerable amount of feature design and engineering has been done in the literature, using a priori knowledge to identify the feature sets that are most effective in distinguishing confusable phone pairs. Examples include:
1. segment duration;
2. acoustic-phonetic features, including log root-mean-square (RMS) energy, the first-order derivative of log RMS energy, and zero-crossing rate (see the sketch after this list);
3. adaptively-warped cepstrum;
4. low-dimensional sub-space features.
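As an illustration of item 2, here is a minimal NumPy sketch of these frame-level measurements (the function name and framing conventions are our own, not the paper's):

```python
import numpy as np

def frame_features(frames, eps=1e-10):
    """Log RMS energy, its first-order delta, and zero-crossing rate per frame.

    frames: (num_frames, frame_len) array of windowed speech samples.
    """
    # Log root-mean-square energy of each frame.
    log_rms = np.log(np.sqrt(np.mean(frames ** 2, axis=1)) + eps)
    # First-order derivative, approximated by a frame-to-frame difference.
    d_log_rms = np.diff(log_rms, prepend=log_rms[0])
    # Zero-crossing rate: fraction of adjacent samples whose sign changes.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([log_rms, d_log_rms, zcr], axis=1)
```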

Previous Work (3/3)
C. Single-pass Mispronunciation Detection and Diagnosis Using Extended Recognition Networks
The "extended recognition network" (ERN) framework achieves mispronunciation detection and diagnosis at the same time by explicitly incorporating common phonetic error patterns in addition to the canonical transcription of each prompted word. It thus constrains the search space during phone recognition of non-native prompted speech.
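For concreteness, a hypothetical ERN fragment is shown below; the word and error patterns are illustrative stand-ins (of the kind often reported for Cantonese learners of English), not entries from the paper:

```python
# Hypothetical ERN: each prompted word maps to its canonical pronunciation
# plus a few anticipated error patterns, all decodable in a single pass.
ern = {
    "north": [
        ["n", "ao", "r", "th"],  # canonical pronunciation
        ["n", "ao", "r", "f"],   # anticipated substitution: /th/ -> /f/
        ["n", "ao", "th"],       # anticipated deletion of /r/
    ],
}
```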

The Two-pass Framework (1/3)
The two-pass framework abandons explicit error pattern modeling and aims to relax the search space.
1. The first-pass mispronunciation detection tells whether each phone is mispronounced or not. Insertions and deletions are also taken into account through a careful design of the recognition network.
2. Once the insertions and substitutions are detected, the second-pass mispronunciation diagnosis performs free-phone recognition on the segments of detected insertions and substitutions to identify the actual phone mispronunciations.

The Two-pass Framework (2/3)
A. Mispronunciation Detection
Anti-phones: the concept is simple. In addition to fitting the labelled data belonging to a phone, we directly construct a model that fits the data not belonging to that phone. To capture insertions, we introduce a universal phone model (UPM, also known as the filler model) which covers all the non-silence phones. A sketch of the resulting network topology follows.
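Below is a minimal sketch of such a first-pass detection network as a flat arc list (the representation and names are ours; the paper's actual topology may differ in detail):

```python
def build_detection_network(canonical_phones):
    """Return arcs (src_state, dst_state, label) for the first-pass network.

    Each canonical phone gets three parallel choices: itself, its anti-phone
    (a detected substitution), and an epsilon skip (a detected deletion).
    A UPM self-loop at every state absorbs inserted phones.
    """
    arcs = []
    for i, phone in enumerate(canonical_phones):
        arcs.append((i, i, "UPM"))                # optional insertion(s) before phone i
        arcs.append((i, i + 1, phone))            # canonical phone accepted
        arcs.append((i, i + 1, "anti_" + phone))  # substitution detected
        arcs.append((i, i + 1, None))             # epsilon skip: deletion detected
    final = len(canonical_phones)
    arcs.append((final, final, "UPM"))            # insertions after the last phone
    return arcs

# Example: the prompt "cat" /k ae t/
arcs = build_detection_network(["k", "ae", "t"])
```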

The Two-pass Framework (3/3)
B. Mispronunciation Diagnosis
The second pass is much more constrained than free-phone recognition over the whole utterance, because the mispronunciation diagnosis network is built upon the results of the first-pass mispronunciation detection. A sketch follows.
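A simplified, illustrative sketch of how the second-pass network could be assembled from the first-pass labels:

```python
def build_diagnosis_network(detected_labels, phone_set):
    """Expand first-pass detections into the second-pass diagnosis network.

    Segments recognized as canonical phones are kept fixed; each detected
    substitution (anti-phone) or insertion (UPM) is expanded into a parallel
    choice over the whole phone set, so only those segments are re-recognized.
    """
    network = []
    for label in detected_labels:
        if label == "UPM" or label.startswith("anti_"):
            network.append(sorted(phone_set))  # free choice among all phones
        else:
            network.append([label])            # keep the detected canonical phone
    return network

# Example: the first pass flagged /ae/ as substituted and found one insertion.
net = build_diagnosis_network(["k", "anti_ae", "t", "UPM"],
                              {"k", "ae", "eh", "t", "s"})
```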

Phone-level Performance Metrics (1/3)
We propose to evaluate the performance of the two-pass framework using the phone error rate (PER), the normalized aggregation of insertions, deletions and substitutions. In mispronunciation detection, PER1 is calculated by aligning the converted manual transcription with the recognition transcription and extracting the mismatches; a minimal computation is sketched below.
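The PER itself is the standard edit-distance metric; here is a small dynamic-programming implementation (our own sketch, not the paper's scoring tool):

```python
def phone_error_rate(reference, hypothesis):
    """PER = (substitutions + deletions + insertions) / len(reference)."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: minimum edits aligning reference[:i] with hypothesis[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                            # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub,                 # match or substitution
                             dist[i - 1][j] + 1,  # deletion
                             dist[i][j - 1] + 1)  # insertion
    return dist[n][m] / max(n, 1)

# Example: one substitution in a four-phone reference gives PER = 0.25.
per = phone_error_rate(["k", "ae", "t", "s"], ["k", "eh", "t", "s"])
```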

Phone-level Performance Metrics (2/3)
Two kinds of detection mismatches are distinguished:
- false acceptance: a mispronounced phone (appearing as an anti-phone in the converted transcription) is recognized as the canonical phone;
- false rejection: a canonical phone is recognized as an anti-phone.

Phone-level Performance Metrics (3/3)
In mispronunciation diagnosis, PER2 is calculated by aligning the manual transcription with the recognition transcription and extracting the mismatches. Note that the accuracy of mispronunciation diagnosis depends on the accuracy of mispronunciation detection; diagnosis is therefore also evaluated on top of "perfect" (oracle) detection, shown as part (a) of the accompanying figure.

Corpus and Experimental Setup
Experiments are set up on the Cantonese subset of the Chinese University Chinese Learners of English (CU-CHLOE) corpus. Recordings are phonetically transcribed and cross-checked by three trained linguists, using all the TIMIT symbols except [hv]. The 7.3-hour training set contains recordings from 25 male and 24 female speakers; the 7.8-hour test set contains recordings from the other 25 male and 26 female speakers.

Experiments (1/7)
A. Single-pass Framework Setup and Baseline HMM Results
A direct comparison between the recognition transcription and the manual transcription gives a PER2 of 31.7%. We also align the recognition transcription from the single-pass framework with the canonical transcription and convert it to a form that replaces insertions with UPMs and substitutions with anti-phones. The resulting transcription is compared with the converted manual transcription, yielding a PER1 of 27.2% (FRR = 14.1%, FAR = 20.0%).

Experiments (2/7)
B. Two-pass Framework Setup and Baseline HMM Results (FRR = 13.8%, FAR = 20.5%)
PER2 based on the transcription out of the mispronunciation detection: 27.7%
PER2 based on the transcription of "perfect" detection: 6.9%

Experiments (3/7)

Experiments (4/7)
Detection results (PER1): 26.1% before discriminative training (DT), 14.9% after DT (two-pass framework).

Experiments (5/7)

Experiments (6/7)
D. Minimum Diagnosis Error Training and Results
We generate mispronunciation diagnosis lattices based on "perfect" detection of errors, using the baseline diagnosis HMMs.

Experiments (7/7)

Conclusions (1/2)
Maximum likelihood training of the detection HMMs, especially the anti-phones, fails to make a sharp distinction between the canonical phones and their corresponding anti-phones, with a benchmark PER1 of 46.0%. This is because the universal phone model (UPM), which is used to account for phone insertions, overlaps greatly with the anti-phones, so the UPM and the anti-phones compete for the speech segments accessible to both of them. The problem can be partially solved by attaching a penalty to the UPM, which reduces PER1 to 26.1%.

Conclusions (2/2)
Discriminative training (DT) of the mispronunciation detection HMMs reduces PER1 further to 14.9%. Visualization shows that DT separates not only the canonical phones from their anti-phones but also the anti-phones from the UPM. Discriminative training of the diagnosis HMMs is performed on lattices based on "perfect" detection, so that training focuses on discriminating the correct phone from the rest of the competing diagnoses instead of being biased towards handling insertions and deletions caused by detection errors. Future directions include replacing the GMMs with deep neural networks for further acoustic modeling improvements.
