Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction (EuroSpeech 1997) Authors: Y. Kim, H. Franco, L. Neumeyer Presenter: Davidson Date: 2009/10/14

1/20 Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction (EuroSpeech 1997)  Authors: Y. Kim, H. Franco, L. Neumeyer  Presenter: Davidson  Date: 2009/10/14

2/20 Outline  Introduction  Speech database  Consistency of human ratings  Pronunciation scoring  Experimental results  Conclusions and future work

3/20 Introduction  CAPT (Computer-Assisted Pronunciation Training): French spoken by Americans  Pronunciation scoring on: Entire sentences; Specific phone segments (10 phones)  Trade-off: number of phone utterances vs. reliable feedback on a speaker's pronunciation proficiency

4/20 Speech database (1/2)  Target language: French  Native corpus 100 Parisian French speakers Used to train models for speech recognition  Nonnative corpus 100 American students Rated by 5 French teachers Only the selected phone segments in each utterance were rated

5/20 Speech database (2/2)  4656 phone segments were selected and rated, consisting of 10 phones: /an/, /eh/, /eo/, /eu/, /ey/, /in/, /on/, /r/, /uw/, /uy/  Score scale: 1 (unintelligible) to 5 (native-like)  Sentences with serious disfluencies or unacceptable audio quality were discarded  Each rater scored some utterances more than once without being informed (self-consistency test)

6/20 Consistency of human ratings (1/4)  Inter-rater correlation: Phone level; Phone-specific speaker level; Overall speaker level  Intra-rater correlation  Correlation between score vectors x and y: r(x, y) = Σ_t (x_t − x̄)(y_t − ȳ) / (n · σ_x · σ_y), where n = vector length and σ_x, σ_y = standard deviations
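As a rough sketch of how one of these rater-agreement numbers could be computed, here is a plain Pearson correlation over two rating vectors; the rater scores below are invented for illustration, not the paper's data:

```python
import math

def pearson(x, y):
    # Pearson correlation between two equal-length rating vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical 1-5 scores from two raters on the same phone segments
rater1 = [5, 4, 3, 4, 2, 5, 3]
rater2 = [4, 4, 2, 5, 2, 5, 3]
print(round(pearson(rater1, rater2), 3))
```

The same function covers all three levels in the slide: only the vectors change (individual phone scores, per-phone speaker averages, or overall speaker averages).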

7/20 Consistency of human ratings (2/4)  Phone level inter-rater correlation [slide figure: phone scores (P1, …) from Rater 1 and Rater 2 across the sentences of Speakers 1–100]

8/20 Consistency of human ratings (3/4)  Phone-specific speaker level inter-rater correlation [slide figure: per-phone (e.g. /an/, /eh/, /uy/) speaker-averaged scores from Rater 1 and Rater 2 across Speakers 1–100]  Overall speaker level inter-rater correlation [slide figure: overall speaker-averaged scores from Rater 1 and Rater 2 across Speakers 1–100]

9/20 Consistency of human ratings (4/4)  Average inter- and intra-rater correlation across all phones for 5 human raters:
Corr. type   Level                    # of scores   Corr.
inter        Phone
inter        Sentence                 n/a           0.65
inter        Phone-specific speaker
inter        Overall speaker          n/a           0.87
intra        Phone

10/20 Pronunciation scoring  HMM-based log-likelihood scores  HMM-based log-posterior probability scores  Segment duration scores

11/20 HMM-based log-likelihood scores  For each phone segment, the log-likelihood score is defined as: ℓ_i = (1/d) · Σ_{t = t0}^{t0 + d − 1} log p(y_t | q_i) where t0 = starting frame index of the phone segment, d = number of frames of the phone segment, y_t = observation vector at frame t, q_i = the i-th phone model, p(y_t | q_i) = likelihood of the current frame
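A minimal sketch of this duration-normalized score, assuming the per-frame log-likelihoods log p(y_t | q_i) have already been produced by a forced alignment; the frame values below are invented:

```python
def log_likelihood_score(frame_loglikes, t0, d):
    """Duration-normalized HMM log-likelihood score for one phone segment.

    frame_loglikes: per-frame log p(y_t | q_i) values for the whole utterance
    t0: starting frame index of the segment; d: number of frames
    """
    segment = frame_loglikes[t0:t0 + d]
    return sum(segment) / d

# Hypothetical per-frame log-likelihoods from a forced alignment
frame_ll = [-8.1, -7.9, -9.2, -6.5, -7.0, -8.3]
print(log_likelihood_score(frame_ll, t0=1, d=3))
```

Dividing by d makes segments of different lengths comparable, which is what the 1/d factor in the slide's formula does.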

12/20 HMM-based log-posterior probability scores  For each phone segment, the frame-based posterior probability is defined as: P(q_i | y_t) = p(y_t | q_i) · P(q_i) / Σ_j p(y_t | q_j) · P(q_j) where P(q_i) = prior probability of the phone class  The posterior score for the phone segment is then defined as: ρ_i = (1/d) · Σ_{t = t0}^{t0 + d − 1} log P(q_i | y_t)
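A sketch of the frame posterior and the segment posterior score, assuming per-frame log-likelihoods are available for every phone class. The three-phone setup and uniform priors are hypothetical, and log-sum-exp is used for numerical stability:

```python
import math

def frame_posterior(loglikes, priors, i):
    """log P(q_i | y_t) from per-class frame log-likelihoods and priors."""
    joint = [ll + math.log(p) for ll, p in zip(loglikes, priors)]
    # log-sum-exp over all phone classes for the denominator
    m = max(joint)
    log_denom = m + math.log(sum(math.exp(j - m) for j in joint))
    return joint[i] - log_denom

def posterior_score(frames, priors, i, t0, d):
    """Duration-normalized log-posterior score for one phone segment."""
    return sum(frame_posterior(f, priors, i) for f in frames[t0:t0 + d]) / d

# Hypothetical 3-phone setup: per-frame log-likelihoods and uniform priors
frames = [[-7.0, -9.0, -8.0], [-6.5, -8.5, -9.0], [-7.2, -8.0, -7.9]]
priors = [1 / 3, 1 / 3, 1 / 3]
print(posterior_score(frames, priors, i=0, t0=0, d=3))
```

Because the denominator normalizes over all phone classes, the posterior score is less sensitive to acoustic mismatch than the raw likelihood, which is consistent with its better results later in the slides.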

13/20 Segment duration scores  Phone lengths are measured in frames  Phone lengths are normalized by the speaker's rate of speech  Log-probability of the normalized duration is computed using a discrete distribution of durations (trained from native training data)
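A sketch of this duration score under stated assumptions: the rate-of-speech normalization is reduced to a simple division, and the add-one smoothing of the discrete distribution is an assumption of this sketch, not something the slides specify:

```python
import math
from collections import Counter

def duration_score(duration_frames, rate_of_speech, native_durations):
    """Log-probability of a rate-normalized phone duration under a discrete
    distribution estimated from native-speaker durations (inputs hypothetical)."""
    norm = round(duration_frames / rate_of_speech)
    counts = Counter(native_durations)
    total = sum(counts.values())
    # add-one smoothing so unseen durations still get a small probability
    prob = (counts[norm] + 1) / (total + len(counts) + 1)
    return math.log(prob)

# Hypothetical native training durations (in frames) for one phone
native = [8, 9, 9, 10, 10, 10, 11, 12]
print(duration_score(12, rate_of_speech=1.2, native_durations=native))
```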

14/20 Experimental results  Test set: (average) 30 sentences from each of the 100 American speakers  Experiments Human-machine correlation for phone scores Effect of varying the amount of speaker data

15/20 Human-machine correlation for phone scores (1/3)  Phone level correlations with about 450 phone scores in each phone class

16/20 Human-machine correlation for phone scores (2/3)  Phone-specific speaker level correlations with a total of 4656 phone segments across 100 speakers

17/20 Human-machine correlation for phone scores (3/3)  Comparison between human-human and human-machine correlation at the phone level and phone-specific speaker level [slide figure comparing human-machine correlation against human-human correlation]

18/20 Effect of varying the amount of speaker data (1/2)  To evaluate the system's performance as a function of the number of test utterances per speaker  The number (N) of phone scores per speaker is varied from 10 to 320 (phone proportions are preserved)  Then, for each N, the speaker-level correlation is computed between: Speaker-averaged machine scores (of N scores); Speaker-averaged human scores (of the entire human score data)
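This evaluation loop can be sketched as follows; the speaker score dictionaries and the subsampling seed are hypothetical stand-ins for the paper's data:

```python
import math
import random

def speaker_corr_at_n(machine, human, n, seed=0):
    """Correlation between speaker-averaged machine scores computed from a
    random subset of n phone scores and full human speaker averages.
    machine/human map speaker -> list of phone-level scores (hypothetical)."""
    rng = random.Random(seed)
    m_avg, h_avg = [], []
    for spk in machine:
        subset = rng.sample(machine[spk], min(n, len(machine[spk])))
        m_avg.append(sum(subset) / len(subset))
        h_avg.append(sum(human[spk]) / len(human[spk]))
    mx, my = sum(m_avg) / len(m_avg), sum(h_avg) / len(h_avg)
    cov = sum((a - mx) * (b - my) for a, b in zip(m_avg, h_avg))
    sx = math.sqrt(sum((a - mx) ** 2 for a in m_avg))
    sy = math.sqrt(sum((b - my) ** 2 for b in h_avg))
    return cov / (sx * sy)

# Toy two-speaker example with invented scores
print(speaker_corr_at_n({"s1": [1.0] * 4, "s2": [3.0] * 4},
                        {"s1": [1.0], "s2": [3.0]}, n=2))
```

Sweeping n over 10, 20, …, 320 and plotting the returned correlation reproduces the shape of the experiment on the next slide.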

19/20 Effect of varying the amount of speaker data (2/2)  [slide figure: speaker-level correlation vs. N for the posterior, likelihood, and duration scores, with N = 40 marked]

20/20 Conclusions and future work  The posterior score performs better than the likelihood and duration scores  The system's performance is comparable to human raters at the speaker level, but not at the phone level  Future work: More human-rated utterances; Scoring algorithms with mispronunciation detection