
Improving Mandarin Tone Recognition Based on DNN by Combining Acoustic and Articulatory Features
Ju Lin, Yanlu Xie, Yingming Gao, Jinsong Zhang
Speech Acquisition and Intelligent Technology (SAIT) Lab, Beijing Language and Culture University

Outline
- Introduction
- Proposed Method
- Experiments and Results
- Conclusion

2019/4/23

Introduction
Mandarin is:
- Syllabic
- Tonal: tones distinguish otherwise ambiguous words
- Four basic lexical tones and a neutral tone: Tone 1 (high-level), Tone 2 (high-rising), Tone 3 (low-dipping), and Tone 4 (high-falling)
- The neutral tone is usually highly context-dependent

Introduction
Why tone recognition matters:
- Speech recognition
- Language learning
Tone recognition in continuous speech is difficult because of:
- Tonal co-articulation
- Sentential intonation structure
- Cross-speaker variation, variable emphasis, topic-shift effects, etc.

Introduction
Previous studies:
- MFCCs: I. Lehiste (1961); J. L. Zhang (1988); N. Ryant (2014, 2015)
- Prosodic features (F0, duration, energy): W. J. Yang (1988); J. S. Zhang (2000, 2004); L. Wang (2013)
- Articulatory features (AFs): J. M. Hombert (1978); H. Chao (2012)

Proposed Method
Our proposed tone-recognition procedure focuses on integrating phonetic information to improve tone recognition:
1) estimate the posterior probabilities of the AF categories with a DNN classifier;
2) combine the estimated posterior probabilities with MFCC and F0 as input features;
3) perform tone recognition with a DNN-HMM.
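The three steps above can be sketched in a minimal NumPy example. The per-frame dimensionalities (39 MFCCs, 3 F0-related features, 20 AF posteriors matching the 20 categories) and the random "posteriors" standing in for a trained classifier are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed per-frame dimensionalities (illustrative, not from the slides):
# 39 MFCCs, 3 F0-related features, 20 AF categories.
T = 5                                  # number of frames
mfcc = rng.normal(size=(T, 39))
f0 = rng.normal(size=(T, 3))

# Step 1: AF posteriors would come from the trained DNN classifier;
# here we fake its output with a softmax over random logits.
logits = rng.normal(size=(T, 20))
af_post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Step 2: concatenate MFCC, F0 and AF posteriors frame by frame.
features = np.concatenate([mfcc, f0, af_post], axis=1)

# Step 3: 'features' would then feed the DNN-HMM tone recognizer.
print(features.shape)  # (5, 62)
```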

Proposed Method - Tone Modeling
Six labels: 1) one no-tone label; 2) the five tones.

Proposed Method - Articulatory Features
Articulatory categories:

Initials:
  1: m n l r y w        (voiced)
  2: b p d t g k        (stops)
  3: z c zh ch j q      (affricates)
  4: f s sh x h r       (fricatives)
Finals - simple vowel and tail-dominant:
  5: a ia ua
  6: e ie üe
  7: o uo
  8: i
  9: u
 10: ü
 11: er
Finals - head-dominant and centre-dominant:
 12: ai uai
 13: ei uei
 14: ao iao
 15: ou iou
Finals - nasal:
 16: an ian üan uan
 17: in en uen üen
 18: ang iang uang
 19: eng ong ing iong
Silence:
 20: SIL
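The category table can be turned into a per-unit lookup for generating AF training targets. This is a minimal sketch with only a handful of the 20 categories entered; the `AF_CATEGORY` name and the loop are illustrative, not from the slides:

```python
# Map each initial/final to its articulatory category number; only a
# few rows of the 20-category table are entered here, for illustration.
AF_CATEGORY = {}
for cat, units in {
    1: "m n l r y w",
    2: "b p d t g k",
    3: "z c zh ch j q",
    5: "a ia ua",
    20: "SIL",
}.items():
    for unit in units.split():
        AF_CATEGORY[unit] = cat

print(AF_CATEGORY["zh"], AF_CATEGORY["ua"])  # 3 5
```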

Experiments and Results
Data set: Chinese National Hi-Tech Project 863

                              Train    Test
Hours
Speakers                      74       9
Utterances                    42748    5625
Average length per utterance  12

The training and testing sets have no overlap at either the speaker or the utterance level.

Articulatory DNN Classifier
Frame accuracy of the articulatory DNN classifier on the cross-validation set:

Hidden layers   1024 nodes   2048 nodes
2               96.30%       97.21%
3               96.29%       97.29%
4               96.26%       97.10%

Tone DNN-HMM Classifier
Experimental setup:
- A 660-unit input layer
- 6 hidden layers of 2048 sigmoid units each
- An output layer of 204 softmax units
- Output labels correspond to context-dependent HMM states
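A forward pass with these layer sizes can be sketched in NumPy. The weights below are random placeholders, not the trained parameters, so the output is only shape-correct:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Layer sizes from the slide: 660 inputs, six 2048-unit sigmoid hidden
# layers, 204 softmax outputs (context-dependent HMM states).
sizes = [660] + [2048] * 6 + [204]
weights = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes, sizes[1:])]

x = rng.normal(size=(1, 660))       # one (context-expanded) input frame
for W in weights[:-1]:
    x = sigmoid(x @ W)              # sigmoid hidden layers
probs = softmax(x @ weights[-1])    # posterior over the 204 HMM states

print(probs.shape)  # (1, 204)
```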

Tone DNN-HMM Classifier
Tone error rates of the different systems:

System   MFCC   F0   AF   Overall   Five tones
DNN-A            √         14.97%    18.29%
DNN-B     √                12.83%    14.20%
DNN-C     √      √          5.36%     9.73%
DNN-D     √      √    √     4.78%     8.75%

Overall: includes the no-tone label and the five tone labels
Five tones: includes only the five tone labels
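From the Overall column above, the relative error reduction gained by adding AFs (DNN-D vs. DNN-C) works out to roughly 11%, a quick check:

```python
# Overall tone error rates from the table above.
ter_dnn_c = 5.36   # without articulatory features
ter_dnn_d = 4.78   # with articulatory features

rel_reduction = (ter_dnn_c - ter_dnn_d) / ter_dnn_c * 100
print(round(rel_reduction, 1))  # 10.8
```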

Confusion Matrix (%) for the DNN-D System
(rows: reference tone; columns: recognized tone)

         Tone1   Tone2   Tone3   Tone4
Tone1    94.39    2.25    0.56    2.12
Tone2     1.49   92.79    3.48    0.94
Tone3     0.57   11.26   84.36    2.32
Tone4     1.29    0.84    2.48   94.12
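Reading rows as the reference tone, the dominant confusion can be located programmatically; the largest off-diagonal entry is Tone 3 recognized as Tone 2:

```python
# Confusion matrix (%) for the DNN-D system, rows/columns in tone order.
conf = [
    [94.39,  2.25,  0.56,  2.12],
    [ 1.49, 92.79,  3.48,  0.94],
    [ 0.57, 11.26, 84.36,  2.32],
    [ 1.29,  0.84,  2.48, 94.12],
]

# Largest off-diagonal entry: (percentage, reference tone, recognized tone).
worst = max(
    (conf[i][j], i + 1, j + 1)
    for i in range(4) for j in range(4) if i != j
)
print(worst)  # (11.26, 3, 2)
```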

Tone DNN-HMM Classifier
Comparison of the three error types (insertion, deletion, substitution) between DNN-C and DNN-D.
Discussion - AFs:
1) offer information about the pitch contour as affected by articulatory characteristics;
2) provide boundary information between the tone and no-tone labels.

Conclusion
- Articulatory features are helpful for tone recognition. They:
  - provide information about the pitch contour as influenced by articulatory characteristics;
  - give tone/no-tone boundary information, reducing insertion and deletion errors in tone recognition.
- The DNN model may be able to extract more useful information for tone recognition from the MFCC parameters than from the F0 parameters.

Thank you!