Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning
Speech Communication, 2000
Authors: S. M. Witt, S. J. Young
Presenter: Davidson
Date: 2009/07/08, 2009/07/15
Contents
- Introduction
- Goodness of Pronunciation (GoP) algorithm
  - Basic GoP algorithm
  - Phone-dependent thresholds
  - Explicit error modeling
- Collection of a non-native database
- Performance measures
- The labeling consistency of the human judges
- Experimental results
- Conclusions and future work
Introduction (1/3)
- CAPT systems (Computer-Assisted Pronunciation Training)
- Word- and phrase-level scoring ('93, '94, '97)
  - Intonation, stress, and rhythm
  - Requires several recordings of native utterances for each word
  - Difficult to add new teaching material
- Selected phonemic error teaching (1997)
  - Uses duration information or models trained on non-native speech
Introduction (2/3)
- HMMs have been used to produce sentence-level scores (1990, 1996)
- Eskenazi's system (1996) produces phone-level scores but makes no attempt to relate them to human judgement
- The authors' proposed system measures pronunciation quality for non-native speech at the phone level
Introduction (3/3)
Other issues covered:
- GoP algorithms with refinements
- Performance measures for both GoP scores and scores by human judges
- A non-native database
- Experiments on these performance measures
Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm (1/5)
- A score for each phone = likelihood of the acoustic segment corresponding to that phone
- GoP = duration-normalized log of the posterior probability of a phone given the corresponding acoustic segment
Basic GoP algorithm (2/5)
- $Q$ = the set of all phone models; $NF(p)$ = the number of frames in the acoustic segment $O^{(p)}$
- $GoP(p) = \frac{1}{NF(p)} \left| \log P(p \mid O^{(p)}) \right|$, with $P(p \mid O^{(p)}) = \frac{p(O^{(p)} \mid p)\,P(p)}{\sum_{q \in Q} p(O^{(p)} \mid q)\,P(q)}$
- By assuming equal phone priors and approximating the sum by its maximum: $GoP(p) \approx \frac{1}{NF(p)} \left| \log \frac{p(O^{(p)} \mid p)}{\max_{q \in Q} p(O^{(p)} \mid q)} \right|$
Basic GoP algorithm (3/5)
- The numerator term $p(O^{(p)} \mid p)$ is computed using forced alignment with the known transcription
- The denominator term $\max_{q \in Q} p(O^{(p)} \mid q)$ is determined using an unconstrained phone loop
Basic GoP algorithm (4/5)
- If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum-likelihood phone to be identical to the assumed phone
- Hence, the denominator score is computed by summing the phone-loop log likelihood per frame over the duration of the segment $O^{(p)}$
- In practice, this often means that more than one phone in the unconstrained phone sequence contributes to the denominator (see the sketch below)
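As a minimal sketch (not the paper's implementation), assuming the recognizer has already produced the forced-alignment log likelihood for the segment and the per-frame log likelihoods of the best phone-loop path, the GoP score reduces to a duration-normalized log-likelihood ratio:

```python
def gop_score(forced_ll, loop_frame_lls):
    """Basic GoP for one phone segment.

    forced_ll      -- log p(O(p)|p): segment log likelihood under the phone
                      given by the known transcription (forced alignment)
    loop_frame_lls -- per-frame log likelihoods of the best path through the
                      unconstrained phone loop over the same frames; several
                      loop phones may contribute to this segment
    """
    nf = len(loop_frame_lls)            # NF(p): duration in frames
    denominator_ll = sum(loop_frame_lls)
    # Duration-normalized log posterior under the equal-priors approximation
    return abs(forced_ll - denominator_ll) / nf


# Toy example: a 5-frame segment
print(gop_score(-42.0, [-8.0, -8.5, -7.9, -8.2, -8.1]))
```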
Basic GoP algorithm (5/5)
- It is intuitive to use speech data from native speakers to train the acoustic models
- However, non-native speech is characterized by different formant structures than a native speaker's for the same phone
- Adapt the Gaussian means by MLLR, using only a single global transform of the HMM Gaussian component means to avoid adapting to specific phone error patterns
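A hedged illustration of the single global transform: every Gaussian mean is updated with the same affine map $\mu' = A\mu + b$; estimating $A$ and $b$ from the learner's adaptation data (the actual MLLR estimation step) is omitted, and the values below are placeholders:

```python
import numpy as np

def apply_global_mllr_means(means, A, b):
    """Apply one shared affine transform mu' = A @ mu + b to all HMM
    Gaussian component means (means: n_components x dim). A single
    global transform avoids adapting to specific phone error patterns."""
    return means @ A.T + b

# Toy example: three 2-dimensional component means
means = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
A = 0.9 * np.eye(2)          # placeholder transform; normally estimated
b = np.array([0.1, -0.2])    # by MLLR from the non-native adaptation data
print(apply_global_mllr_means(means, A, b))
```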
Phone-dependent thresholds
- The acoustic fit of phone-based HMMs differs from phone to phone; e.g., fricatives tend to have lower log likelihoods than vowels
- 2 ways to determine phone-specific thresholds (see the sketch below):
  - Using the mean and variance of the scores for each phone
  - Approximating human labeling behavior
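A sketch of the first scheme, under the assumption that the threshold is an offset from each phone's mean GoP score scaled by its spread; the weight `alpha` is an illustrative free parameter, not a value from the paper:

```python
from statistics import mean, stdev

def phone_thresholds(gop_scores_by_phone, alpha=1.0):
    """Per-phone rejection threshold T(p) = mean + alpha * stdev of the
    GoP scores observed for phone p (higher GoP = worse pronunciation)."""
    return {phone: mean(scores) + alpha * stdev(scores)
            for phone, scores in gop_scores_by_phone.items()
            if len(scores) > 1}

# Phones whose acoustic fit is looser (e.g. fricatives) have a higher
# GoP distribution, so they receive a more lenient threshold.
gops = {"f": [3.1, 3.5, 2.9, 3.8], "aa": [1.2, 1.0, 1.5, 0.9]}
print(phone_thresholds(gops))
```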
Explicit error modeling (1/3)
- 2 types of pronunciation errors:
  - Individual mispronunciations
  - Systematic mispronunciations: substitutions of native-language sounds for sounds of the target language that do not exist in the native language
- Knowledge of the learner's native language is included in order to detect systematic mispronunciations
Explicit error modeling (2/3)
- Solution: a recognition network incorporating both the correct pronunciation and common pronunciation errors, in the form of error sublattices for each phone (e.g., the word "but"; see the sketch below)
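A minimal sketch of such a network for "but": each target phone expands into a sublattice containing the correct phone plus common error variants. The variants listed here are hypothetical placeholders; the real sublattices depend on the learner's native language:

```python
# Hypothetical error sublattices: target phone -> [correct + error variants]
ERROR_SUBLATTICES = {
    "b":  ["b", "v"],          # illustrative b/v confusion
    "ah": ["ah", "uh", "aa"],  # illustrative vowel substitutions
    "t":  ["t", "d"],          # illustrative final-consonant voicing
}

def build_error_network(phone_transcription):
    """Expand a phone transcription into a lattice of alternatives:
    one slot per target phone, listing correct and error arcs."""
    return [ERROR_SUBLATTICES.get(p, [p]) for p in phone_transcription]

print(build_error_network(["b", "ah", "t"]))
# [['b', 'v'], ['ah', 'uh', 'aa'], ['t', 'd']]
```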
Explicit error modeling (3/3)
- Target phone posterior probability
- Scores for systematic mispronunciations
- GoP that includes an additional penalty for systematic mispronunciation (see the sketch below)
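A hedged sketch of the penalized score (the paper's exact formulas were lost from this slide and are not reproduced): when recognition through the error network selects an error variant instead of the target phone, a penalty is added to the basic GoP. The penalty value and the comparison logic here are illustrative assumptions:

```python
def gop_with_error_penalty(basic_gop, recognized_phone, target_phone,
                           penalty=2.0):
    """Add a fixed penalty (illustrative value) to the basic GoP when the
    error-network recognizer picks an error variant rather than the target,
    flagging a likely systematic mispronunciation."""
    if recognized_phone != target_phone:
        return basic_gop + penalty
    return basic_gop

print(gop_with_error_penalty(1.4, recognized_phone="v", target_phone="b"))
print(gop_with_error_penalty(1.4, recognized_phone="b", target_phone="b"))
```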
Collection of a non-native database (1/2)
- Based on the procedures used for the WSJCAM0 corpus
- Texts are composed of a limited vocabulary of 1500 words
- 6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1)
- Each speaker reads 120 sentences:
  - 40 from a common set of phonetically balanced sentences
  - 80 sentences varied from session to session
Collection of a non-native database (2/2)
- 6 human judges who are native speakers of British English
- Each speaker was labeled by 1 judge
- 20 sentences from a female Spanish speaker are used as calibration sentences, annotated by all 6 judges
- Transcriptions reflect the actual sounds uttered by the speakers, including phonemes from other languages
Performance measures (1/3)
- Compare 2 transcriptions of the same sentence; transcriptions are either produced by human judges or generated automatically
- 4 types of performance measures:
  - Strictness
  - Agreement
  - Cross-correlation
  - Overall phone correlation
Performance measures (2/3)
- Transcriptions are compared on a frame-by-frame basis: each error frame is marked as 1, and 0 otherwise, yielding a binary vector whose length equals the number of frames
- A Hamming window is applied, because the transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain, and the forced alignment might be erroneous due to poor acoustic modeling of non-native speech (see the sketch below)
- The window length is a tunable parameter
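A sketch of the smoothing step, assuming a binary per-frame error vector; the window length used in the paper is not reproduced on this slide, so `win_len` below is a free parameter:

```python
import numpy as np

def smooth_error_vector(error_frames, win_len=11):
    """Convolve a 0/1 per-frame error vector with a normalized Hamming
    window, softening the abrupt 0-to-1 transitions caused by uncertain
    boundaries and imperfect forced alignment of non-native speech."""
    window = np.hamming(win_len)
    window /= window.sum()
    return np.convolve(error_frames, window, mode="same")

e = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=float)
print(np.round(smooth_error_vector(e, win_len=5), 2))
```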
Performance measures (3/3): the four measures are defined individually on the following slides
Strictness (S)
- Measures how strict the judge was in marking pronunciation errors: the fraction of frames marked as mispronounced
- Relative strictness: the difference in strictness between two transcriptions
Overall Agreement (A)
- Measures the agreement over all frames between 2 transcriptions
- Defined in terms of the cityblock distance between the 2 transcription vectors: $A = 1 - \frac{1}{N}\sum_{i=1}^{N} |e_{1,i} - e_{2,i}|$
Cross-correlation (CC)
- Measures the agreement between the error frames in either or both transcriptions
- $CC = \frac{e_1 \cdot e_2}{\lVert e_1 \rVert \, \lVert e_2 \rVert}$, where $\lVert \cdot \rVert$ is the Euclidean norm
Phoneme Correlation (PC)
- Measures the overall agreement of the rejection statistics for each phone between 2 judges/systems
- $PC = \frac{\sum_i (r_{1,i} - \bar{r}_1)(r_{2,i} - \bar{r}_2)}{\sqrt{\sum_i (r_{1,i} - \bar{r}_1)^2 \sum_i (r_{2,i} - \bar{r}_2)^2}}$, where $r$ is a vector of rejection counts for each phone and $\bar{r}$ denotes the mean rejection count
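Putting the four measures together in one sketch, under the reconstructed definitions above ($e$ vectors are smoothed per-frame error traces, $r$ vectors are per-phone rejection counts); any normalization details not recoverable from the slides are assumptions:

```python
import numpy as np

def strictness(e):
    """S: fraction of frames marked as mispronounced."""
    return float(e.mean())

def agreement(e1, e2):
    """A: overall agreement via cityblock distance, 1 - mean |e1 - e2|."""
    return 1.0 - float(np.abs(e1 - e2).mean())

def cross_correlation(e1, e2):
    """CC: normalized inner product of the two error vectors."""
    denom = np.linalg.norm(e1) * np.linalg.norm(e2)
    return float(e1 @ e2) / denom if denom else 0.0

def phoneme_correlation(r1, r2):
    """PC: Pearson correlation of per-phone rejection counts."""
    return float(np.corrcoef(r1, r2)[0, 1])

e1 = np.array([0, 1, 1, 0, 0], dtype=float)
e2 = np.array([0, 1, 0, 0, 0], dtype=float)
print(strictness(e1), agreement(e1, e2), cross_correlation(e1, e2))
print(phoneme_correlation([3, 1, 0, 2], [2, 1, 1, 2]))
```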
Labeling consistency of the human judges (1/4) [table of pairwise judge comparisons on the calibration sentences not reproduced]
Labeling consistency of the human judges (2/4)
- All results are within an acceptable range:
  - 0.85 < A < 0.95
  - CC < 0.65
  - PC < 0.85
  - Relative strictness < 0.14, mean = 0.06
- These mean values can be used as benchmark values
Labeling consistency of the human judges (3/4) [figure not reproduced]
Labeling consistency of the human judges (4/4) [figure not reproduced]
Experimental results (1/7)
- Multiple-mixture monophone models
- Corpus: WSJCAM0
- The range of the rejection threshold was restricted to lie within one standard deviation of the judges' strictness (see the sketch below)
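A hedged sketch of that restriction: sweep the threshold scaling and keep only settings whose resulting system strictness lies within one standard deviation of the judges' mean strictness. The judge values and the toy strictness curve below are illustrative:

```python
from statistics import mean, stdev

def admissible_settings(system_strictness, judge_strictness, alphas):
    """Keep threshold scalings whose rejection rate falls within one
    standard deviation of the human judges' mean strictness."""
    mu, sigma = mean(judge_strictness), stdev(judge_strictness)
    return [a for a in alphas if abs(system_strictness(a) - mu) <= sigma]

judges = [0.04, 0.06, 0.08, 0.05, 0.07, 0.06]   # illustrative strictness
toy_curve = lambda a: 0.12 - 0.02 * a            # toy system strictness
print(admissible_settings(toy_curve, judges, [0.0, 1.0, 2.0, 3.0, 4.0]))
```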
Experimental results (2/7)-(5/7) [result tables and figures not reproduced]
Experimental results (6/7)
- Adding error handling with Latin-American Spanish models to detect systematic mispronunciations
Experimental results (7/7)
- Comparison of transcriptions between the human judges and the system with the error network
Conclusions and future work
- 2 GoP scoring mechanisms:
  - Basic GoP
  - GoP with a systematic-mispronunciation penalty
- Refinement methods:
  - MLLR adaptation
  - Phone-dependent thresholds trained from human judgements
  - Error network
- Future work: information about the type of mistake