Presented by, K.L Srinivas (M.Tech 2 nd year) Guided by, Prof. Preeti Rao (Elect. Dept) Department of Electrical Engineering, IIT Bombay Mumbai, India.

Slides:



Advertisements
Similar presentations
Presented by Erin Palmer. Speech processing is widely used today Can you think of some examples? Phone dialog systems (bank, Amtrak) Computers dictation.
Advertisements

Robust Speech recognition V. Barreaud LORIA. Mismatch Between Training and Testing n mismatch influences scores n causes of mismatch u Speech Variation.
Building an ASR using HTK CS4706
Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.
EE3P BEng Final Year Project – 1 st meeting SLaTE – Speech and Language Technology in Education Martin Russell
Acoustic Model Adaptation Based On Pronunciation Variability Analysis For Non-Native Speech Recognition Yoo Rhee Oh, Jae Sam Yoon, and Hong Kook Kim Dept.
Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence Sankaranarayanan Ananthakrishnan, Shrikanth S. Narayanan IEEE 2007 Min-Hsuan.
Nonparametric-Bayesian approach for automatic generation of subword units- Initial study Amir Harati Institute for Signal and Information Processing Temple.
Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience Helen Meng, Wai-Kit.
LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor : Prof Michael R. Lyu Students: Gao Zheng Hong Lei Mo.
Phoneme Alignment. Slide 1 Phoneme Alignment based on Discriminative Learning Shai Shalev-Shwartz The Hebrew University, Jerusalem Joint work with Joseph.
Speaker Adaptation for Vowel Classification
Feature vs. Model Based Vocal Tract Length Normalization for a Speech Recognition-based Interactive Toy Jacky CHAU Department of Computer Science and Engineering.
Language and Speaker Identification using Gaussian Mixture Model Prepare by Jacky Chau The Chinese University of Hong Kong 18th September, 2002.
Why is ASR Hard? Natural speech is continuous
Authors: Anastasis Kounoudes, Anixi Antonakoudi, Vasilis Kekatos
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab.
Introduction to Automatic Speech Recognition
Introduction Mel- Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are.
Classification of place of articulation in unvoiced stops with spectro-temporal surface modeling V. Karjigi , P. Rao Dept. of Electrical Engineering,
International Conference on Intelligent and Advanced Systems 2007 Chee-Ming Ting Sh-Hussain Salleh Tian-Swee Tan A. K. Ariff. Jain-De,Lee.
A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.
Hierarchical Dirichlet Process (HDP) A Dirichlet process (DP) is a discrete distribution that is composed of a weighted sum of impulse functions. Weights.
World Languages Mandarin English Challenges in Mandarin Speech Recognition  Highly developed language model is required due to highly contextual nature.
Csc Lecture 7 Recognizing speech. Geoffrey Hinton.
Kishore Prahallad IIIT-Hyderabad 1 Unit Selection Synthesis in Indian Languages (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)
Recognition of spoken and spelled proper names Reporter : CHEN, TZAN HWEI Author :Michael Meyer, Hermann Hild.
Building a sentential model for automatic prosody evaluation Kyuchul Yoon School of English Language & Literature Yeungnam University Korea.
Presented by: Fang-Hui Chu Boosting HMM acoustic models in large vocabulary speech recognition Carsten Meyer, Hauke Schramm Philips Research Laboratories,
LML Speech Recognition Speech Recognition Introduction I E.M. Bakker.
A Phonetic Search Approach to the 2006 NIST Spoken Term Detection Evaluation Roy Wallace, Robbie Vogt and Sridha Sridharan Speech and Audio Research Laboratory,
Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features in Sphinx III.
Speaker Verification Speaker verification uses voice as a biometric to determine the authenticity of a user. Speaker verification systems consist of two.
國立交通大學 電信工程研究所 National Chiao Tung University Institute of Communication Engineering 1 Phone Boundary Detection using Sample-based Acoustic Parameters.
1 CSE 552/652 Hidden Markov Models for Speech Recognition Spring, 2005 Oregon Health & Science University OGI School of Science & Engineering John-Paul.
Conditional Random Fields for ASR Jeremy Morris July 25, 2006.
Speech Communication Lab, State University of New York at Binghamton Dimensionality Reduction Methods for HMM Phonetic Recognition Hongbing Hu, Stephen.
Voice Activity Detection based on OptimallyWeighted Combination of Multiple Features Yusuke Kida and Tatsuya Kawahara School of Informatics, Kyoto University,
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Supervised Learning Resources: AG: Conditional Maximum Likelihood DP:
1 CRANDEM: Conditional Random Fields for ASR Jeremy Morris 11/21/2008.
Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy.
Combining Speech Attributes for Speech Recognition Jeremy Morris November 9, 2006.
Performance Comparison of Speaker and Emotion Recognition
Presented by: Fang-Hui Chu Discriminative Models for Speech Recognition M.J.F. Gales Cambridge University Engineering Department 2007.
BY KALP SHAH Sentence Recognizer. Sphinx4 Sphinx4 is the best and versatile recognition system. Sphinx4 is a speech recognition system which is written.
0 / 27 John-Paul Hosom 1 Alexander Kain Brian O. Bush Towards the Recovery of Targets from Coarticulated Speech for Automatic Speech Recognition Center.
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
Copyright © 2013 by Educational Testing Service. All rights reserved. Evaluating Unsupervised Language Model Adaption Methods for Speaking Assessment ShaSha.
A New Approach to Utterance Verification Based on Neighborhood Information in Model Space Author :Hui Jiang, Chin-Hui Lee Reporter : 陳燦輝.
Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction EuroSpeech 1997 Authors: Y. Kim, H. Franco, L. Neumeyer Presenter:
Statistical Models for Automatic Speech Recognition Lukáš Burget.
1 Electrical and Computer Engineering Binghamton University, State University of New York Electrical and Computer Engineering Binghamton University, State.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning Speech Communication, 2000 Authors: S. M. Witt, S. J. Young Presenter:
Spoken Language Group Chinese Information Processing Lab. Institute of Information Science Academia Sinica, Taipei, Taiwan
Author :K. Thambiratnam and S. Sridharan DYNAMIC MATCH PHONE-LATTICE SEARCHES FOR VERY FAST AND ACCURATE UNRESTRICTED VOCABULARY KEYWORD SPOTTING Reporter.
1 A Two-pass Framework of Mispronunciation Detection & Diagnosis for Computer-aided Pronunciation Training Xiaojun Qian, Member, IEEE, Helen Meng, Fellow,
Cross-Dialectal Data Transferring for Gaussian Mixture Model Training in Arabic Speech Recognition Po-Sen Huang Mark Hasegawa-Johnson University of Illinois.
Flexible Speaker Adaptation using Maximum Likelihood Linear Regression Authors: C. J. Leggetter P. C. Woodland Presenter: 陳亮宇 Proc. ARPA Spoken Language.
Utterance verification in continuous speech recognition decoding and training Procedures Author :Eduardo Lleida, Richard C. Rose Reporter : 陳燦輝.
Dean Luo, Wentao Gu, Ruxin Luo and Lixin Wang
A NONPARAMETRIC BAYESIAN APPROACH FOR
Dean Luo, Wentao Gu, Ruxin Luo and Lixin Wang
Conditional Random Fields for ASR
Statistical Models for Automatic Speech Recognition
EXPERIMENTS WITH UNIT SELECTION SPEECH DATABASES FOR INDIAN LANGUAGES
CRANDEM: Conditional Random Fields for ASR
Statistical Models for Automatic Speech Recognition
2017 APSIPA A Study on Landmark Detection Based on CTC and Its Application to Pronunciation Error Detection Chuanying Niu1, Jinsong Zhang1, Xuesong Yang2.
Presentation transcript:

Presented by, K.L Srinivas (M.Tech 2 nd year) Guided by, Prof. Preeti Rao (Elect. Dept) Department of Electrical Engineering, IIT Bombay Mumbai, India CS : Lecture 34 Pronunciation Scoring For Language Learners Using A Phone Recognition System

Department of Electrical Engineering, IIT Bombay2 Introduction Pronunciation refers to the manner in which a particular word of a language is uttered. Motivation  Accurate pronunciation or articulation is a vital component of a language acquisition process.  Fluency in speech of a non-native speaker of a language can be judged by pronunciation and prosody.  Non availability of a classroom environment for learners. Subjective Evaluation Word spoken: Kaleidoscopic Speaker 1 Speaker 2

Department of Electrical Engineering, IIT Bombay3 Problem statement  Developing computer based automatic pronunciation scoring system.  Accessing the closeness of language learner pronunciation to that of reference speaker (already stored in system).  To provide language learner with pronunciation score and feedback.

A brief on Automatic Speech Recognition

Department of Electrical Engineering, IIT Bombay5 Introduction  Automatic speech recognition (ASR) is a process by which an acoustic speech signal is converted into a set of words.  Getting a computer to understand spoken language.  Approaches to ASR  Template matching  Knowledge-based (rule based approach)  Statistical approach (machine learning)

Department of Electrical Engineering, IIT Bombay6 Statistical based approach :  Collect a large corpus of transcribed speech recordings.  Train the computer to learn the corresponding instances (Machine learning).  At run time, apply statistical processes to search through the space of all possible solutions and pick the statistically most likely one.

Department of Electrical Engineering, IIT Bombay7 Speech recognition tool kits :  Sphinx and HTK are two widely accepted and used speech recognition tools. –CMU sphinx : Carnegie Mellon University (CMU) –HTK : Cambridge University  Both the frameworks are used for developing, training and testing a speech model from existing corpus speech data.  Both use Hidden Markov Modeling techniques.

Department of Electrical Engineering, IIT Bombay8 MFCC feature vector : The Mel-Frequency Cepstrum Coefficients (MFCC) is a popular choice Frame size : 25 msec Hop size : 10 msec 39 feature per 10ms frame Absolute : Log Frame Energy (1) and MFCCs (12) Delta : First-order derivatives of the 13 absolute coefficients Delta-Delta : Second-order derivatives of the 13 absolute coefficients

Department of Electrical Engineering, IIT Bombay9 Sphinx 3 : Training : Testing / Decoding:

Department of Electrical Engineering, IIT Bombay10 Decoder ouput : Recognition Hypothesis : –This gives the single best recognition result for each utterance processed. –Linear word sequence with their time segmentation and their scores. Output format :

Department of Electrical Engineering, IIT Bombay11  Non-native speech characters:  Phone substitutions: S in word ‘she’ pronounced as s  Phonotactic constraints: Stop cluster sk in school pronounced as iskUl  Use of language model masks out the non-nativeness during recognition.  Accuracy of state-of-the-art phone recognition systems as low as 50%-70%  Traditional ASR techniques cannot be used for non-native speech  Phone recognition to be carried out in constrained mode Automatic Speech Recognition for non-native speech

Back to pronunciation scoring

Department of Electrical Engineering, IIT Bombay13 Pronunciation Scoring System Generation of Pronunciation Variants Constrained Phone Decoder Variant Selection Boundary Refinement and Prosodic analysis Canonical transcription of the utterance Input speech signal Prosody ScoreArticulation Score Pronunciation Score

Department of Electrical Engineering, IIT Bombay14 Pronunciation Variants  Challenges No ready database of speakers of Indian English Multiple L1s for Indian speakers poses further challenges. Native Hindi and native English databases are available Canonical form: SIL f aa n d aa m ee n’ clt t aa l s SIL Variant_1: SIL f aa n d a m ee n’ clt t aa l s SIL Variant_2: SIL f aa n d ee m ee n’ clt t aa l s SIL Word: Fundamentals

Department of Electrical Engineering, IIT Bombay15 Constrained Phone Decoding HMM based recognizers  HTK 3.4  Sphinx 3 Decoding Input Speech Utterance Acoustic Models from training Variants Aligned Phone Sequence with likelihood for each variant

Department of Electrical Engineering, IIT Bombay16 Variant Selection Aligned Phone Sequence with likelihood for each variant Visual Feedback and Articulation Score  Strik and Cucchiarini (2000): Pronunciation variations and modeling  Goronzy, Rapp and Komp (2004 ): Non-native pronunciation variations and generation ( native English speakers speaking German)  Wesenick and Schiel (1994 ), Wesenick (1996): Generation of rules for German pronunciation variations  Franco et al. (1997), Franco et al. (2000): A paradigm for automatic assessment of pronunciation quality.  Witt and Young (2000): Presented likelihood based goodness of pronunciation scheme

Department of Electrical Engineering, IIT Bombay17 Databases  TIMIT database 630 speakers of 8 major dialects of American English. Each speaking 10 phonetically rich sentences.  TIFR database 100 native speakers of Hindi. Each speaking 10 phonetically rich sentences.  Indian English database - Testing 30 Indian college students each speaking the 2 common sentences from TIMIT database.  47 class TIMIT models: Entire phone set from TIMIT.  52 class Union models: Entire TIMIT phone set(47 phones) and 5 additional phones from the TIFR phone set making a total of 52 phones.  48 class Union models: Entire TIFR Hindi phone set(36 phones) and 12 phones from TIMIT. Training

Department of Electrical Engineering, IIT Bombay18 Experiments and Evaluation The focus of this work is to investigate the effect of selection of phone models from one of 47, 52-union and 48- union phone models  Method I: The number of instances in which the surface transcription is within the top N decoded sequences in terms of likelihood score.  Method II: The edit distance between the most likely phone sequence and the surface transcription in terms of %correct and %accuracy  Method III: Normalized likelihood error. A value of “0” for this measure indicates the best achievable performance.

Department of Electrical Engineering, IIT Bombay19 Performances of Method I and II Tabulation of Method I and Method II of evaluation for HTK 3.4 and Sphinx 3 HTK 3.4Sphinx3 Method IMethod IIMethod IMethod II Decoder models # of Unique variants Reference transcription in %Corr%AccReference transcription in %Corr%Acc Top 1Top 5Top 1Top 5 SA1 47class class class SA2 47class class class

Department of Electrical Engineering, IIT Bombay20 Performance of Method III Distribution of the likelihood scores across the 60 utterances 48-phone class has average likelihood error closest to zero of the three phone sets.

Department of Electrical Engineering, IIT Bombay21 Articulation Scoring methods :  Articulation score indicates the closeness of language learner’s pronunciation with native speaker (of target language) pronunciation.  Detects phoneme level mispronunciation and extent to which phoneme has been mispronounced.  Algorithm uses speech models derived from speech database of native speakers.  Uses forced align tools in the background to get acoustic scores (quantitative measure indicating acoustic fit for that particular speech segment).  Two methods investigated: GOP (Goodness of Pronunciation) score [2]. Method by Sunil K. Gupta [9].

Department of Electrical Engineering, IIT Bombay22 GOP scoring method :  Confidence with which particular phone has been recognized.  Also called as Goodness Of Pronunciation (GOP) score.  GOP score is given by normalized log posterior probability

Department of Electrical Engineering, IIT Bombay23 GOP scoring method (cont.) : Block diagram for Articulation scoring

Department of Electrical Engineering, IIT Bombay24 Method by Sunil K. Gupta :  Shortcoming of GOP score : Threshold selection was based on subjective rating of human judges. Not providing any quantitative measure to measure extent of mispronunciation. Free decoder not accurate enough leading to alignment errors.  In this method two speech models have to be prepared: 48 class phone models ( 36 TIFR Hindi + 12 TIMIT English) Garbage model ( all phonemes of speech data combined to get one speech model)

Department of Electrical Engineering, IIT Bombay25 Garbage model :  A single speech model combining all the phonemes of speech data.  Entire speech corpus trained with garbage transcription.

Department of Electrical Engineering, IIT Bombay26 Methodology (cont.) :  Utterance is force aligned using Sphinx3_align with the reference transcription using 48 class phone models. Each phoneme of the transcription will have its own acoustic score. These log-likelihood scores are duration normalized given by  Similarly, utterance is force aligned with garbage transcription using Garbage model.  Difference between these two likelihood is current phoneme likelihood  This difference score (d) is used for coming up with phone articulation score using lookup score table.(explained in next slide)

Department of Electrical Engineering, IIT Bombay27 Formation of score table :  For each utterance “In-grammar” and “Out-grammar” is formed In-grammar : When the transcription is conforming to target acoustic waveform. Out-grammar : Transcription selected is some random phrase from training database not conforming to target acoustic waveform.  In-grammar and Out-grammar transcriptions are force aligned to come up with log-likelihood scores: In-grammar : Out-grammar :

Department of Electrical Engineering, IIT Bombay28 Score table (cont.) :  Using all the in-grammar points and out-grammar points, pdf is formed for each phoneme.  Using these probability density functions are used for coming up with score table.(table shown in results section)

Department of Electrical Engineering, IIT Bombay29 Results :  Histograms and Gaussian pdf (approximating data points) for both In- grammar and Out-grammar for phoneme “aa”:

Department of Electrical Engineering, IIT Bombay30 Results (cont.) :  Histograms and Gaussian pdf (approximating data points) for both In- grammar and Out-grammar for phoneme “ee”:

Department of Electrical Engineering, IIT Bombay31 Results (cont.) :  Combined PDF of In-grammar and Out-grammar for “aa” :

Department of Electrical Engineering, IIT Bombay32 Results (cont.) :  Combined PDF of In-grammar and Out-grammar for “ee” :

Department of Electrical Engineering, IIT Bombay33 Results (score table) :  Below calculations and table is for phoneme “aa” :  f denotes probability density function.  and are In-grammar and Out-grammar mean respectively.  For In-grammar and Out-grammar points :  Score table for phoneme “aa” in next slide :

Department of Electrical Engineering, IIT Bombay34 Score table (phoneme “aa” ) : Dh (x)Score

Department of Electrical Engineering, IIT Bombay35 Score table (phoneme “aa” ) : Dh (x)Score

Department of Electrical Engineering, IIT Bombay36 Result (Speaker 1 : Fundamentals) : PhoneCorrect Pronun.Incorrect Pronun. dScored h aa % % n d aa % % m ee n’ SI t aa % % l

Department of Electrical Engineering, IIT Bombay37 Result (Speaker 2 : Fundamentals) : PhoneCorrect Pronun.Incorrect Pronun. dScored h aa % % n d aa % % m ee n’ SI t aa % % l

Department of Electrical Engineering, IIT Bombay38 Duration scoring :  Duration score provides feedback on normalized relative duration difference between language learner speech and reference speaker speech.  Denotes whether a particular syllable is stressed or not.  If L i and R i are their respective durations corresponding to phoneme q i then,utterance consisting of N phones can be denoted by:  Normalized durations given by:

Department of Electrical Engineering, IIT Bombay39 Duration scoring (cont.) :  Overall duration score given by :  Maximum duration score is ‘1’ and minimum is ‘0’.

Department of Electrical Engineering, IIT Bombay40 Duration scoring (Results) :  Speaker_1 was taken as reference and duration scores were calculated for other speakers.  Speaker_1

Department of Electrical Engineering, IIT Bombay41 Duration scoring (Results) :  Speaker_1 Vs Speaker_2  Duration score = (low due to differences in ‘f’, ’a’ and ‘s’ duration)

Department of Electrical Engineering, IIT Bombay42 Duration scoring (Results) :  Speaker_1 Vs Speaker_3  Duration score = (low due to differences in ‘f’, ’a’ and ‘s’ duration)

Department of Electrical Engineering, IIT Bombay43 Feedback, Articulation and Duration Score Speaker 2 Canonical Transcription (Reference speaker) SIL f aa n d aa m ee n’ clt t aa l s SIL Speaker 1 Transcription SIL f aa n d ee m ee n’ clt t aa l s SIL Feedback Articulation Score: 72%Duration Score: SIL f aa n d ee m ee n’ clt t aa l s SIL

Department of Electrical Engineering, IIT Bombay44 References 1)Strik, H., Neri, A., and Cucchiarini, C Speech technology for language tutoring. In Proceedings of LangTech ( Rome, Italy,February 28-29, 2008). 2)Witt, S., and Young, S Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication. Vol. 30, pp , )Franco, H., et al Automatic scoring of pronunciation quality. Speech Communication. Vol. 30, pp , )Kawai, G., Hirose, K A method for measuring the intelligibility and non nativeness of phone quality in foreign language pronunciation training. In Proceedings of ICSLP-98 (Sydney, Australia, November 30- December 04,1998).pp )Goronzy, S., Rapp, S., Kompe, R Generating non-native pronunciation variants for lexicon adaption. Speech Communication. Vol. 42, pp , )Lee, K.F Large-vocabulary speaker-independent continuous speech recognition: The SPHINX system. Ph.D. dissertation, Comput. Sci. Dep., Carnegie Mellon University. 7)Young, S., et al The HTK Book v3. Cambridge University, )Samudravijaya, K., Rawat, K.D., and Rao, P.V.S Design of Phonetically Rich Sentences for Hindi Speech Database. J. Ac. Soc. Ind. Vol. XXVI, December 1998, pp )Sunil K. Gupta, Ziyi Lu and Fengguang Zhao, “ Automatic Pronunciation Scoring for Language Learning ”, U.S. Patent 7,219,059, May 15, 2007.