2017 APSIPA
A Study on Landmark Detection Based on CTC and Its Application to Pronunciation Error Detection
Chuanying Niu1, Jinsong Zhang1, Xuesong Yang2 and Yanlu Xie1
1Beijing Language and Culture University, Beijing, China
2University of Illinois at Urbana-Champaign, USA
Outline
01 INTRODUCTION
02 CTC-BASED LANDMARK DETECTION
03 THE FRAMEWORK OF PRONUNCIATION ERROR DETECTION
04 CONCLUSIONS
INTRODUCTION
A number of CAPT approaches have been presented in the last few decades to detect pronunciation errors at the segmental level.
- Most of them are based on automatic speech recognition (ASR) frameworks, which have the advantage of predicting pronunciation errors easily and flexibly, in the same way for all phonemes.
- ASR-based CAPT systems are heavily limited by the size of the training data and by language backgrounds.
- For specific error detection, their accuracy needs to be improved in order to give learners more precise feedback.
- Many L2 learners' erroneous sounds cannot be simply categorized as insertion, deletion or substitution, i.e., they show a pronunciation erroneous tendency (PET).
INTRODUCTION
Some approaches to identifying subtle distinctions:
- Use voice onset time (VOT) features with an SVM to detect aspirated stops (/p/, /t/, /k/) in Mandarin Chinese pronounced by Japanese (L2) learners.
- Employ LDA to differentiate a plosive (/k/) from a fricative (/x/) in Dutch, combining energy features (Rate of Rise values) with duration information.
- For mispronounced Dutch vowel substitutions made by L2 learners, consider formants (F1-F3) and segment intensity.
It is difficult to find distinctive features for all kinds of phonetic contrasts.
Stevens's acoustic landmark theory, which explores regions of quantal nonlinear correlates between articulators and acoustic attributes, provides a cue for choosing distinctive features suitable for pronunciation error detection and speech recognition.
INTRODUCTION
However…
- Landmark theory requires studying the speech production mechanism and carrying out a large amount of annotation based on human speech perception experiments, which is laborious and expensive.
- An RNN acoustic model trained with the CTC technique ultimately represents a modeling unit with a pulse signal (a spike, or peak). This spiky property of the network output is similar to that of landmarks, because both implicitly assume that information is not evenly distributed in the speech signal.
CTC-BASED LANDMARK DETECTION
We assume that the positions of these key frames (the CTC spikes) are landmarks.
CTC-BASED LANDMARK DETECTION
Peak Detection Algorithm
1. Decode each utterance using the BLSTM-RNN acoustic model and extract the posteriors of the phones detected at each time step. This forms a one-dimensional sequence sorted by time index.
2. Compute the peak function value ai of each point xi at each time step. We select S1 as the peak function: it computes the average of the maximum distances between xi and its k left neighbors and k right neighbors. k is set to 4 time steps, half of the average phone duration estimated on the corpus.
3. Compute the mean and standard deviation of all positive values of ai. Remove small local peaks in the global context according to the Chebyshev inequality, and store the temporal information of the remaining peaks.
4. Order the peaks again by their temporal index.
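The steps above can be sketched as follows. This is a minimal illustration that assumes the phone posteriors are already a plain 1-D sequence; the mean + h·std filtering rule (h = 1 here) stands in for the Chebyshev-style thresholding, whose exact constant the slides do not specify:

```python
import numpy as np

def s1_peak_scores(x, k=4):
    """Palshikar's S1 peak function: the average of the maximum
    differences between x[i] and its k left / k right neighbours."""
    x = np.asarray(x, dtype=float)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        left = x[max(0, i - k):i]
        right = x[i + 1:i + 1 + k]
        if left.size == 0 or right.size == 0:
            continue  # boundary frames keep score 0
        scores[i] = ((x[i] - left).max() + (x[i] - right).max()) / 2.0
    return scores

def detect_peaks(x, k=4, h=1.0):
    """Keep frames whose S1 score exceeds mean + h*std of all positive
    scores (global filtering of small local peaks), then merge peaks
    closer than k frames, keeping the stronger one."""
    a = s1_peak_scores(x, k)
    pos = a[a > 0]
    if pos.size == 0:
        return []
    thresh = pos.mean() + h * pos.std()
    candidates = [i for i in range(len(a)) if a[i] > thresh]
    kept = []
    for i in candidates:
        if kept and i - kept[-1] <= k:
            if a[i] > a[kept[-1]]:
                kept[-1] = i  # stronger peak replaces the nearby one
        else:
            kept.append(i)
    return kept  # already ordered by temporal index
```

Because `candidates` is built by scanning the sequence left to right, the returned peaks are already ordered by time, matching the final step of the algorithm.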
FRAMEWORK
EXPERIMENTS AND RESULTS
CTC-based Landmark Detection: Corpora (English)
- 100 hours of the LibriSpeech corpus for training.
- The TIMIT corpus was selected as the test set (excluding the dialect 'SA' utterances).
- TIMIT transcriptions were based on abrupt acoustic changes: if no acoustic evidence existed for a certain phone, no label was placed there.
- About 68% of the landmarks in the corpus are acoustically abrupt landmarks associated with consonantal segments, e.g., a stop closure or release.
- We chose the stops (/p/, /t/, /k/, /b/, /d/, /g/) to verify our hypothesis.
EXPERIMENTS AND RESULTS
Experiment setup
- The phone recognition experiment was built with EESEN [23]; the BLSTM-RNN acoustic model was trained with the CTC technique.
- 40-dimensional filterbank features with their first- and second-order derivatives were extracted with a 25 ms window shifted every 10 ms and fed to the input layer of the BLSTM-RNN.
- Four bidirectional LSTM hidden layers were used, with 320 cells in each forward and backward hidden layer.
- We employed the CMU dictionary as the lexicon, without considering stress.
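The derivative features mentioned above can be illustrated with the standard regression-style delta computation; the window size N = 2 and the HTK-style formula are our assumptions, since the slides do not state how the derivatives were obtained:

```python
import numpy as np

def deltas(feats, N=2):
    """First-order regression deltas over a (T, D) feature matrix,
    using the standard HTK-style formula with window N."""
    feats = np.asarray(feats, dtype=float)
    T = len(feats)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(feats)
    for t in range(T):
        for n in range(1, N + 1):
            # padded index t+N corresponds to original frame t
            d[t] += n * (padded[t + N + n] - padded[t + N - n])
    return d / denom

def add_dynamic_features(fbank):
    """Stack static 40-dim filterbank features with their first- and
    second-order derivatives -> 120-dim input per frame."""
    d1 = deltas(fbank)
    d2 = deltas(d1)
    return np.hstack([fbank, d1, d2])
```

On a linearly increasing feature track the interior delta values recover the slope exactly, which is a quick sanity check for the formula.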
EXPERIMENTS AND RESULTS
Evaluation Metrics
- Recall: the ratio of the number of hits to the number of hand-labeled landmarks.
- Precision: the ratio of the number of hits to the total number of landmarks detected.
- F-measure: the harmonic mean of recall and precision.
Example: the transcription train/dr2/fajw0/sx273.phn contains "but didn't", with /t/ in the final position of "but" and /d/ in the initial position of "didn't". There is no release of the /t/ and no closure of the /d/.
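The three metrics can be written directly from their definitions; the counter names `n_hits`, `n_ref` and `n_detected` are hypothetical:

```python
def landmark_metrics(n_hits, n_ref, n_detected):
    """Recall, precision and F-measure for landmark detection.
    n_hits: detected landmarks matching a hand-labeled landmark;
    n_ref: hand-labeled landmarks; n_detected: total detected."""
    recall = n_hits / n_ref
    precision = n_hits / n_detected
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```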
EXPERIMENTS AND RESULTS
Landmark-based Pronunciation Error Detection: Corpora (Chinese)
- Acoustic model: the Chinese National Hi-Tech Project 863 corpus (100 hours), plus 6 native Chinese speakers from the Chinese part of the BLCU inter-Chinese speech corpus. Their utterances were first force-aligned with HTK, and then human transcribers corrected the boundaries.
- For the purpose of CAPT, we collected a large-scale Chinese interlanguage corpus read by Japanese learners, referred to as the BLCU inter-Chinese corpus. We selected 7 Japanese females who read 1899 utterances; 80% of the data was used as the training set and the rest as the test set.
- The 16 most frequent PETs and their canonical sounds constituted 16 binary phonetic contrasts.
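One step of such a framework, turning a detected landmark into a fixed-length vector for the per-contrast SVM classifiers, might look like the sketch below. The ±5-frame context window and the use of raw filterbank frames are our assumptions for illustration, not choices stated in the slides:

```python
import numpy as np

def landmark_features(fbank, landmark_frame, context=5):
    """Splice the +-context frames around a detected landmark into one
    fixed-length vector, clamping indices at utterance boundaries.
    fbank: (T, D) feature matrix; landmark_frame: detected peak index."""
    T, D = fbank.shape
    idx = np.clip(np.arange(landmark_frame - context,
                            landmark_frame + context + 1), 0, T - 1)
    return fbank[idx].reshape(-1)  # (2*context + 1) * D values
```

A binary SVM per phonetic contrast would then be trained on these vectors, one classifier for each of the 16 PET/canonical pairs.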
EXPERIMENTS AND RESULTS DATA-DRIVEN AND KNOWLEDGE-BASED LANDMARKS
EXPERIMENTS AND RESULTS Detection Results
EXPERIMENTS AND RESULTS THE RESULTS OF LANDMARK-BASED SYSTEM AND DNN-HMM-BASED SYSTEM
Conclusion
We first verified the hypothesis that the positions of the spiky phone posterior outputs of a model trained with the CTC technique are consistent with the stop burst landmarks annotated in the TIMIT corpus. We therefore regard these peaks produced by the CTC-based acoustic model as landmark-like, and expect them to generalize to other phones.
We then proposed a pronunciation error detection framework for Chinese learning based on landmarks and SVMs, where the landmarks are predicted automatically by the BLSTM-RNN acoustic model.
Experiments show that the data-driven CTC landmark model is comparable to the knowledge-based model in pronunciation error detection, and that their combination further improves performance, outperforming the DNN-HMM+MFCC system.
THANKS