Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features
Ju Lin, Yanlu Xie, Yingming Gao, Jinsong Zhang
Speech Acquisition and Intelligent Technology (SAIT) Lab, Beijing Language and Culture University
Outline
  Introduction
  Proposed Method
  Experiments and Results
  Conclusion
  Date: 2019/4/23
Introduction
  Mandarin
    Syllabic and tonal
    Tones distinguish otherwise ambiguous words
    Four basic lexical tones and a neutral tone
      Tone 1 (high-level), Tone 2 (high-rising), Tone 3 (low-dipping), and Tone 4 (high-falling)
      The neutral tone is usually highly context-dependent
Introduction
  Impact of tone recognition
    Speech recognition
    Language learning
  Continuous speech: difficult
    Tonal co-articulation
    Sentential intonation structure
    Cross-speaker effects, variable emphasis, topic-shift effects, etc.
Introduction
  Previous studies
    MFCCs: I. Lehiste (1961); J. L. Zhang (1988); N. Ryant (2014, 2015)
    Prosodic features (F0, duration, energy): W. J. Yang (1988); J. S. Zhang (2000, 2004); L. Wang (2013)
    Articulatory features (AFs): J. M. Hombert (1978); H. Chao (2012)
Proposed Method
  Our proposed procedure integrates phonetic information to improve tone recognition:
    1) estimating posterior probabilities of different AFs using a DNN classifier;
    2) combining the estimated posterior probabilities with MFCC and F0 as input features;
    3) realizing tone recognition using a DNN-HMM.
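Step 2 above can be sketched as a per-frame concatenation of the three feature streams. This is a minimal illustration, not the authors' implementation: the function names, the 13-dimensional MFCC, the scalar F0, and the random stand-in data are all assumptions, and the 20 AF classes simply mirror the number of articulatory categories listed in this deck.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; stands in for the AF classifier's output layer."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def combine_features(mfcc, f0, af_posteriors):
    """Concatenate per-frame MFCC, F0 and AF posteriors into one vector per frame."""
    if not (len(mfcc) == len(f0) == len(af_posteriors)):
        raise ValueError("all feature streams must have the same number of frames")
    return np.hstack([mfcc, f0.reshape(len(f0), -1), af_posteriors])

# Hypothetical sizes and random data, for illustration only.
NUM_FRAMES, MFCC_DIM, NUM_AF_CLASSES = 5, 13, 20
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(NUM_FRAMES, MFCC_DIM))
f0 = rng.uniform(80.0, 300.0, size=NUM_FRAMES)
af_post = softmax(rng.normal(size=(NUM_FRAMES, NUM_AF_CLASSES)))

combined = combine_features(mfcc, f0, af_post)
print(combined.shape)  # (5, 34)
```

Each row of `combined` holds the acoustic features followed by the AF posteriors for one frame; how frames are spliced into the final network input is not specified here.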
Proposed Method - Tone Modeling
  6 labels: 1) one no-tone label; 2) five tone labels
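Proposed Method - Articulatory Features
  Articulatory categories

  Category  Units             Description                         Type
  1         m n l r y w       Voiced                              Initial
  2         b p d t g k       Stop                                Initial
  3         z c zh ch j q     Affricate                           Initial
  4         f s sh x h r      Fricative                           Initial
  5         a ia ua           Simple vowel and tail-dominant      Final
  6         e ie üe           Simple vowel and tail-dominant      Final
  7         o uo              Simple vowel and tail-dominant      Final
  8         i                 Simple vowel and tail-dominant      Final
  9         u                 Simple vowel and tail-dominant      Final
  10        ü                 Simple vowel and tail-dominant      Final
  11        er                Simple vowel and tail-dominant      Final
  12        ai uai            Head-dominant and centre-dominant   Final
  13        ei uei            Head-dominant and centre-dominant   Final
  14        ao iao            Head-dominant and centre-dominant   Final
  15        ou iou            Head-dominant and centre-dominant   Final
  16        an ian üan uan    Nasal                               Final
  17        in en uen üen     Nasal                               Final
  18        ang iang uang     Nasal                               Final
  19        eng ong ing iong  Nasal                               Final
  20        SIL               Silence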
Experiments and Results
  Data set: Chinese National Hi-Tech Project 863

                                 Train    Test
  Hours
  Speakers                       74       9
  Utterances                     42748    5625
  Average length per utterance   12

  The training and test sets do not overlap at the speaker level or the utterance level.
Articulatory DNN Classifier
  Frame accuracy of the articulatory DNN classifier on the cross-validation set

  Hidden layers   Hidden nodes   Frame acc.
  2               1024           96.30%
  2               2048           97.21%
  3               1024           96.29%
  3               2048           97.29%
  4               1024           96.26%
  4               2048           97.10%
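For reference, frame accuracy is conventionally the fraction of frames whose highest-posterior class matches the reference label. The sketch below assumes that definition; the function name and the toy posteriors and labels are hypothetical, not values from the experiments.

```python
import numpy as np

def frame_accuracy(posteriors, labels):
    """Fraction of frames whose arg-max class equals the reference label."""
    predicted = posteriors.argmax(axis=1)
    return float((predicted == labels).mean())

# Toy example: 4 frames, 3 classes (made-up numbers).
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 2, 1])

acc = frame_accuracy(posteriors, labels)
print(f"{acc:.2%}")  # 75.00%
```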
Tone DNN-HMM Classifier
  Experiment setup
    A 660-unit input layer
    6 hidden layers, each consisting of 2048 sigmoid units
    An output layer consisting of 204 softmax units
    Output labels correspond to context-dependent HMM states
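The layer sizes in this setup can be made concrete with a small forward-pass sketch. Only the sizes (660, 6 x 2048, 204) and the sigmoid and softmax activations come from the slide; the random initialisation, the helper names, and the all-zero input frames are assumptions for illustration, and no training is shown.

```python
import numpy as np

# 660 inputs, six hidden layers of 2048 units, 204 outputs (from the slide).
LAYER_SIZES = [660] + [2048] * 6 + [204]

def init_params(layer_sizes, seed=0):
    """Small random weights and zero biases; a placeholder for trained parameters."""
    rng = np.random.default_rng(seed)
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = 0.01 * rng.standard_normal((fan_in, fan_out), dtype=np.float32)
        biases = np.zeros(fan_out, dtype=np.float32)
        params.append((weights, biases))
    return params

def forward(params, frames):
    """Sigmoid hidden layers followed by a softmax over the output units."""
    activations = frames
    for weights, biases in params[:-1]:
        activations = 1.0 / (1.0 + np.exp(-(activations @ weights + biases)))
    weights, biases = params[-1]
    logits = activations @ weights + biases
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

params = init_params(LAYER_SIZES)
num_parameters = sum(w.size + b.size for w, b in params)
frames = np.zeros((3, 660), dtype=np.float32)  # three dummy input frames
state_posteriors = forward(params, frames)

print(state_posteriors.shape)  # (3, 204)
print(num_parameters)          # 22753484
```

In a DNN-HMM system the softmax outputs are typically treated as HMM state posteriors during decoding; that decoding step is outside this sketch.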
Tone DNN-HMM Classifier
  Tone error rate of different systems

  System   Features (MFCC, F0, AF)   Overall   Five tones
  DNN-A    √                         14.97%    18.29%
  DNN-B                              12.83%    14.20%
  DNN-C                              5.36%     9.73%
  DNN-D                              4.78%     8.75%

  Overall: includes the no-tone label and the five tone labels
  Five tones: includes only the five tone labels
Confusion Matrix (%) for DNN-D System

          Tone1   Tone2   Tone3   Tone4
  Tone1   94.39    2.25    0.56    2.12
  Tone2    1.49   92.79    3.48    0.94
  Tone3    0.57   11.26   84.36    2.32
  Tone4    1.29    0.84    2.48   94.12
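The matrix values can be inspected programmatically to find the weakest tone and the dominant confusion. The sketch below copies the percentages from the slide; it assumes that rows are reference tones and columns are recognized tones, which the slide does not state, and the helper name is hypothetical.

```python
import numpy as np

TONES = ["Tone1", "Tone2", "Tone3", "Tone4"]

# Percentages from the DNN-D confusion matrix.
# Assumption: rows = reference tone, columns = recognized tone.
confusion = np.array([
    [94.39, 2.25, 0.56, 2.12],
    [1.49, 92.79, 3.48, 0.94],
    [0.57, 11.26, 84.36, 2.32],
    [1.29, 0.84, 2.48, 94.12],
])

def largest_confusion(matrix, names):
    """Return (row name, column name, value) of the largest off-diagonal cell."""
    off_diagonal = matrix.copy()
    np.fill_diagonal(off_diagonal, -np.inf)
    row, col = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
    return names[row], names[col], float(matrix[row, col])

per_tone_correct = dict(zip(TONES, np.diag(confusion).tolist()))
weakest_tone = min(per_tone_correct, key=per_tone_correct.get)

print(largest_confusion(confusion, TONES))  # ('Tone3', 'Tone2', 11.26)
print(weakest_tone)                         # Tone3
```

Under the stated row/column assumption, Tone 3 is most often confused with Tone 2.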
Tone DNN-HMM Classifier
  Comparison of three kinds of errors (insertion, deletion, substitution) between DNN-C and DNN-D
  Discussion: AFs
    1) offer information about the pitch contour as affected by articulatory characteristics;
    2) provide boundary information for tone and no-tone labels.
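The three error types are usually counted from a minimum-edit-distance alignment between the reference and recognized label sequences. The slides do not describe the scoring tool, so the following is a sketch of that conventional approach; the function name and the toy sequences are hypothetical.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) from a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] = (total, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, subs, dels, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                best = (total, subs, dels, ins)          # match
            else:
                best = (total + 1, subs + 1, dels, ins)  # substitution
            total, subs, dels, ins = table[i - 1][j]
            best = min(best, (total + 1, subs, dels + 1, ins))  # deletion
            total, subs, dels, ins = table[i][j - 1]
            best = min(best, (total + 1, subs, dels, ins + 1))  # insertion
            table[i][j] = best
    _, subs, dels, ins = table[n][m]
    return subs, dels, ins

# Toy tone sequences (made up): one substitution and one insertion.
reference = [1, 2, 3, 4]
hypothesis = [1, 3, 3, 4, 2]
subs, dels, ins = count_errors(reference, hypothesis)
error_rate = (subs + dels + ins) / len(reference)

print(subs, dels, ins)       # 1 0 1
print(f"{error_rate:.0%}")   # 50%
```

Running such a count separately on the outputs of two systems gives the per-type comparison; better boundary information would be expected to show up mainly in the insertion and deletion counts.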
Conclusion
  Articulatory features are helpful for tone recognition.
  Articulatory features
    provide information about the pitch contour as influenced by articulatory characteristics;
    give boundary information for tone and no-tone labels, which reduces insertion and deletion errors in tone recognition.
  The DNN model may be able to extract more useful information for tone recognition from the MFCC parameters than from the F0 parameters.
Thank you!