2017 APSIPA A Study on Landmark Detection Based on CTC and Its Application to Pronunciation Error Detection Chuanying Niu, Jinsong Zhang, Xuesong Yang and Yanlu Xie


2017 APSIPA A Study on Landmark Detection Based on CTC and Its Application to Pronunciation Error Detection Chuanying Niu1, Jinsong Zhang1, Xuesong Yang2 and Yanlu Xie1 1Beijing Language and Culture University, Beijing, China 2University of Illinois at Urbana-Champaign, USA

Outline
01 INTRODUCTION
02 CTC-BASED LANDMARK DETECTION
03 THE FRAMEWORK OF PRONUNCIATION ERROR DETECTION
04 CONCLUSIONS

INTRODUCTION
A number of CAPT approaches have been proposed over the last few decades to detect pronunciation errors at the segmental level. Most of them are based on automatic speech recognition (ASR) frameworks, and they have the advantage of predicting pronunciation errors easily and flexibly, in the same way for all phonemes. However, ASR-based CAPT systems are heavily limited by the size of the training data and by learners' language backgrounds. For specific error detection, their detection accuracy needs to be improved in order to give learners more precise feedback. Moreover, many L2 learners' erroneous sounds cannot simply be categorized as insertions, deletions, or substitutions; they are pronunciation erroneous tendencies (PETs).

INTRODUCTION
Some approaches identify subtle distinctions:
Voice onset time (VOT) features with an SVM to detect aspirated stops (/p/, /t/, /k/) in Mandarin Chinese pronounced by Japanese (L2) learners
LDA to differentiate a plosive (/k/) from a fricative (/x/) in Dutch, combining energy features (Rate of Rise values) with duration information
For mispronounced Dutch vowel substitutions made by L2 learners, formants (F1-F3) and segment intensity were considered
It is difficult to find distinctive features for all kinds of phonetic contrasts. Stevens's acoustic landmark theory, which explores regions of quantal nonlinear correlation between articulators and acoustic attributes, provides a cue for choosing distinctive features that are suitable for pronunciation error detection and speech recognition.

INTRODUCTION
However, landmark annotation requires studying the speech production mechanism and carrying out extensive annotation based on human speech perception experiments, which is laborious and expensive.
An RNN acoustic model trained with the CTC technique ultimately represents a modeling unit with a pulse signal (a spike, or peak). This spiky property of the network output is similar to that of landmarks, because both implicitly assume that information is not evenly distributed across the speech signal.

CTC-BASED LANDMARK DETECTION
We assume that the positions of key frames are landmarks.

CTC-BASED LANDMARK DETECTION
Peak Detection Algorithm
1. Decode each utterance using the BLSTM-RNN acoustic model.
2. Extract the posterior of the phone detected at each time step, forming a one-dimensional sequence sorted by time index.
3. Compute the peak function value ai of each point xi at each time step. We use the S1 peak function, which computes the average of the maximum distances between xi and its k left neighbors and its k right neighbors. k is set to 4 time steps, about half the average phone duration estimated on the corpus.
4. Compute the mean and standard deviation of all positive values of ai.
5. Remove small local peaks in the global context according to the Chebyshev inequality and store the temporal information of the remaining peaks.
6. Re-order the peaks by their temporal index.
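The steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the S1 peak function follows the description in step 3, while the exact Chebyshev-style rejection threshold (mean plus h standard deviations of the positive peak values) is an assumption.

```python
# Sketch of the peak-detection procedure, assuming a one-dimensional
# sequence of per-frame phone posteriors has already been extracted
# from the BLSTM-RNN output.
from statistics import mean, stdev

def s1_peak_value(x, i, k):
    """S1 peak function: average of the maximum distances between x[i]
    and its k left neighbors and its k right neighbors."""
    left = [x[i] - x[j] for j in range(max(0, i - k), i)]
    right = [x[i] - x[j] for j in range(i + 1, min(len(x), i + k + 1))]
    left_max = max(left) if left else 0.0
    right_max = max(right) if right else 0.0
    return (left_max + right_max) / 2.0

def detect_peaks(posteriors, k=4, h=1.0):
    """Return landmark candidates (time indices) in temporal order.
    A peak is kept only if its peak value exceeds the mean of all
    positive peak values by more than h standard deviations -- an
    assumed form of the Chebyshev-inequality rejection rule."""
    a = [s1_peak_value(posteriors, i, k) for i in range(len(posteriors))]
    positive = [v for v in a if v > 0]
    if len(positive) < 2:          # not enough statistics to threshold
        return []
    m, sd = mean(positive), stdev(positive)
    return sorted(i for i, v in enumerate(a) if v > 0 and (v - m) > h * sd)
```

For example, on a posterior track with two strong spikes and a few small local bumps, only the strong spikes survive the global threshold.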

FRAMEWORK

EXPERIMENTS AND RESULTS
CTC-based Landmark Detection
Corpora (English):
100 hours of the LibriSpeech corpus for training
The TIMIT corpus was selected as the test set (excluding the dialect 'SA' utterances)
TIMIT transcriptions are based on abrupt acoustic changes: if no acoustic evidence exists for a certain phone, no label is put there.
About 68% of the landmarks in the corpus are acoustically abrupt landmarks, which are associated with consonantal segments, e.g., a stop closure or release. We chose the stops (/p/, /t/, /k/, /b/, /d/, /g/) to verify our hypothesis.

EXPERIMENTS AND RESULTS
Experiment setup
A phone recognition experiment was built using EESEN [23]. The BLSTM-RNN acoustic model was trained with the CTC technique. 40-dimensional filterbank features with their first- and second-order derivatives were extracted with a 25 ms window shifted every 10 ms and fed to the input layer of the BLSTM-RNN. Four bidirectional LSTM hidden layers were used, with 320 cells in each forward and backward hidden layer. We employed the CMU dictionary as the lexicon, without stress markers.
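Appending first- and second-order derivatives turns the 40-dimensional filterbank vectors into 120-dimensional input frames. A minimal sketch of the standard delta regression formula, d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2), is shown below; the context width n=2 is an assumption, as the slides do not state it.

```python
# Sketch of delta-feature computation over a sequence of filterbank
# frames (each frame is a list of floats, one per 10 ms step).

def deltas(frames, n=2):
    """First-order regression deltas with context width n.
    Edge frames are handled by clamping indices to the sequence."""
    denom = 2 * sum(i * i for i in range(1, n + 1))
    t_max = len(frames) - 1
    out = []
    for t in range(len(frames)):
        d = [0.0] * len(frames[0])
        for i in range(1, n + 1):
            prev = frames[max(0, t - i)]
            nxt = frames[min(t_max, t + i)]
            for j in range(len(d)):
                d[j] += i * (nxt[j] - prev[j]) / denom
        out.append(d)
    return out

def add_derivatives(fbank):
    """Concatenate static, delta, and delta-delta features,
    e.g. 40 dims -> 120 dims per frame."""
    d1 = deltas(fbank)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(fbank, d1, d2)]
```

On a linearly increasing one-dimensional track, the interior deltas come out as the slope, and the delta-deltas are near zero, which is a quick sanity check on the formula.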

EXPERIMENTS AND RESULTS
Evaluation Metrics
Recall: the ratio of the number of hits to the number of hand-labeled landmarks.
Precision: the ratio of the number of hits to the total number of landmarks detected.
F-measure: the harmonic mean of recall and precision.
For example, the transcription "train/dr2/fajw0/sx273.phn" contains "but didn't": /t/ is in the final position of "but" and /d/ is in the initial position of "didn't", so there is no release of /t/ and no closure of /d/.
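These three metrics can be computed as below. This is a hedged sketch: the tolerance window within which a detected peak counts as a hit, and the greedy one-to-one matching, are assumptions not specified on the slide.

```python
# Sketch of landmark-detection scoring: detected and reference are
# lists of landmark times (in frames); a detection within `tol` frames
# of an unmatched reference landmark counts as a hit.

def landmark_scores(detected, reference, tol=2):
    """Return (recall, precision, f_measure) via greedy matching."""
    used = set()
    hits = 0
    for d in detected:
        for i, r in enumerate(reference):
            if i not in used and abs(d - r) <= tol:
                used.add(i)
                hits += 1
                break
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(detected) if detected else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f_measure
```

For instance, with detections at frames [5, 15, 30] against hand labels at [4, 16, 40], two of three detections are hits, giving recall, precision, and F-measure of 2/3 each.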

EXPERIMENTS AND RESULTS
Landmark-based Pronunciation Error Detection
Corpora (Chinese):
Acoustic model: the Chinese National Hi-Tech Project 863 corpus (100 hours), plus 6 native Chinese speakers from the Chinese part of the BLCU inter-Chinese speech corpus. Their utterances were first force-aligned with HTK, and human transcribers then corrected the boundaries.
For the purposes of CAPT, we collected a large-scale Chinese interlanguage corpus read by Japanese learners, referred to as the BLCU inter-Chinese corpus. We selected 7 Japanese female speakers who read 1899 utterances; 80% of the data was used as the training set and the rest as the test set. The 16 most frequent PETs and their canonical sounds constituted 16 binary phonetic contrasts.

EXPERIMENTS AND RESULTS DATA-DRIVEN AND KNOWLEDGE-BASED LANDMARKS

EXPERIMENTS AND RESULTS Detection Results

EXPERIMENTS AND RESULTS THE RESULTS OF LANDMARK-BASED SYSTEM AND DNN-HMM-BASED SYSTEM

Conclusion
We first verified the hypothesis that the positions of the spiky phone posterior outputs of a model trained with the CTC technique are consistent with the stop burst landmarks annotated in the TIMIT corpus. We therefore consider the peaks produced by the CTC-based acoustic model to be similar to landmarks, and expect that they can be generalized to other phones. We then proposed a landmark- and SVM-based pronunciation error detection framework for Chinese learning, in which the landmarks are predicted automatically by the BLSTM-RNN acoustic model. Experiments show that the data-driven CTC landmark model is comparable to the knowledge-based model for pronunciation error detection, and that their combination further improves performance, outperforming the DNN-HMM+MFCC system.

THANKS