Mandarin Chinese Speech Recognition
Mandarin Chinese
- Tonal language (inflection matters!)
  - 1st tone: high, constant pitch (like saying “aaah”)
  - 2nd tone: rising pitch (“Huh?”)
  - 3rd tone: low pitch (“ugh”)
  - 4th tone: high pitch with a rapid descent (“No!”)
  - “5th tone”: neutral, used for de-emphasized syllables
- Monosyllabic language
  - Each character represents a single base syllable and tone
  - Most words consist of 1, 2, or 4 characters
- Heavily contextual language
Mandarin Chinese and Speech Processing
- Acoustic representations of Chinese syllables
- Structural form
  - (consonant) + vowel + (consonant)
Mandarin Chinese and Speech Processing
- Phone sets
  - Initial/final phones [1]
    - e.g. shi, ge, zi = (sh + ib), (g + e), (z + if)
  - Initial phones: unvoiced
    - 1 phone
  - Final phones: voiced (tone 1-5)
    - Can consist of multiple phones
Mandarin Chinese and Speech Processing
- Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)
- Creating tone models is difficult
  - Discontinuities exist in the F0 contour between voiced and unvoiced regions
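One simple way to remove the F0 discontinuities mentioned above is to interpolate linearly across unvoiced frames. The sketch below is only an illustration under assumptions: the function name `interpolate_f0`, the convention that unvoiced frames are marked with 0.0, and the sample contour are all hypothetical, and the cited papers may use a different smoothing method.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (assumed to be marked 0.0) by linear interpolation.

    Leading and trailing unvoiced frames copy the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value > 0.0]
    out = list(f0)
    if not voiced:
        return out  # nothing to interpolate from

    # Edges: hold the first / last voiced value.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]

    # Interior gaps: straight line between the two surrounding voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for k in range(1, gap):
            out[left + k] = f0[left] + (f0[right] - f0[left]) * k / gap
    return out


# Hypothetical contour in Hz; 0.0 marks an unvoiced frame.
contour = [0.0, 200.0, 210.0, 0.0, 0.0, 240.0, 0.0]
print(interpolate_f0(contour))
# → [200.0, 200.0, 210.0, 220.0, 230.0, 240.0, 240.0]
```

The result is a continuous contour that a single continuous-density model can be trained on, at the cost of inventing pitch values where no voicing exists.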
Prosody
- Prosody: “the rhythmic and intonational aspect of language” [2]
- Embedded Tone Modeling [4]
- Explicit Tone Modeling [4]
Tone Modeling
- Embedded Tone Modeling
  - Tonal acoustic units are joined with spectral features at each frame [4]
- Explicit Tone Modeling
  - Tone recognition is completed independently and combined after post-processing [4]
Tone Modeling
- Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling
- Coarticulation
  - Variations in syllables can cause variations in tone:
    - Bu4 + Dui4 = Bu2 Dui4 (wrong)
    - Ni3 + Hao3 = Ni2 Hao3 (hello)
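The two tone changes above can be written as simple rewrite rules: a 3rd tone before another 3rd tone becomes a 2nd tone, and "bu" in the 4th tone before another 4th tone becomes a 2nd tone. The sketch below is a minimal illustration; the function name `apply_sandhi` and the (syllable, tone) tuple format are assumptions, and real tone sandhi in longer phrases depends on prosodic grouping that this sketch ignores.

```python
def apply_sandhi(syllables):
    """Apply two tone-sandhi rules to a list of (base_syllable, tone) pairs.

    Rules (simplified, pairwise only):
      - tone 3 followed by tone 3  -> tone 2
      - "bu" with tone 4 followed by tone 4 -> tone 2
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        base, tone = syllables[i]
        next_tone = syllables[i + 1][1]  # look at the underlying (citation) tone
        if tone == 3 and next_tone == 3:
            out[i] = (base, 2)
        elif base == "bu" and tone == 4 and next_tone == 4:
            out[i] = (base, 2)
    return out


print(apply_sandhi([("ni", 3), ("hao", 3)]))   # → [('ni', 2), ('hao', 3)]
print(apply_sandhi([("bu", 4), ("dui", 4)]))   # → [('bu', 2), ('dui', 4)]
```

A recognizer that models tones per syllable has to account for this: the surface tone it hears (Ni2) differs from the dictionary tone (Ni3).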
Embedded Tone Modeling: Two Stream Modeling (Ni, Liu, Xu)
- Spectral stream: MFCCs (Mel-frequency cepstral coefficients)
  - Describe vocal tract information
  - Distinctive for phones (short time duration)
- Pitch/tone stream: requires smoothing
  - Describes vibrations of the vocal cords
  - Independent of spectral features
  - d/dt(pitch), aka tone, and d2/dt2(pitch) are added
  - Embedded in an entire syllable
  - Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)
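The first and second time derivatives of pitch can be approximated per frame with finite differences. The sketch below uses central differences as an assumption; speech toolkits often use a regression window over several frames instead, and the slides do not say which method the authors used. The function name `delta` and the sample pitch values are hypothetical.

```python
def delta(track):
    """Approximate the time derivative of a per-frame track.

    Uses central differences in the interior and one-sided differences
    at the two edges. Units are "per frame".
    """
    n = len(track)
    if n < 2:
        return [0.0] * n
    out = [0.0] * n
    out[0] = track[1] - track[0]
    out[-1] = track[-1] - track[-2]
    for t in range(1, n - 1):
        out[t] = (track[t + 1] - track[t - 1]) / 2.0
    return out


# Hypothetical smoothed pitch track in Hz (a rising contour).
pitch = [200.0, 210.0, 230.0, 260.0, 300.0]
d_pitch = delta(pitch)        # first derivative
dd_pitch = delta(d_pitch)     # second derivative

# One pitch-stream feature vector per frame: (pitch, delta, delta-delta).
pitch_stream = list(zip(pitch, d_pitch, dd_pitch))
print(pitch_stream[2])  # → (230.0, 25.0, 10.0)
```

In a two-stream model these pitch vectors would be kept separate from the MFCC vectors and scored by their own stream.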
Embedded Tone Modeling: Two Stream Modeling [4]
- Tonal identification features
  - F0
  - Energy
  - Duration
  - Coarticulation (continuous speech)
- Initially use a 2-stream embedded model, followed by explicit modeling during lattice rescoring (alignment?)
- Explicit tone modeling uses a maximum entropy framework [4] (discriminative model)
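A maximum entropy classifier assigns each tone a score that is a weighted sum of the input features, then normalizes the exponentiated scores into probabilities. The sketch below shows only that general form; the function name `maxent_posteriors`, the two features, and all weight values are hypothetical and are not taken from [4].

```python
import math


def maxent_posteriors(features, weights):
    """Return P(tone | features) under a maximum entropy (log-linear) model.

    weights maps each tone label to one weight per feature.
    """
    scores = {
        tone: sum(w * x for w, x in zip(tone_weights, features))
        for tone, tone_weights in weights.items()
    }
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {tone: math.exp(score - top) for tone, score in scores.items()}
    total = sum(exps.values())
    return {tone: value / total for tone, value in exps.items()}


# Hypothetical weights for two features: (normalized F0 slope, normalized duration).
weights = {1: [0.5, -1.0], 2: [1.5, 0.2], 3: [-1.0, 0.3], 4: [-0.8, 1.0]}
features = [1.0, 0.5]  # a rising F0 slope, so tone 2 should score highest

posteriors = maxent_posteriors(features, weights)
best_tone = max(posteriors, key=posteriors.get)
print(best_tone)  # → 2
```

During lattice rescoring, a posterior like this could be combined with the acoustic and language model scores of each candidate path.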
Explicit Tone Modeling [4]

| No. | Feature description | # of features |
| 1 | Duration of current, previous, and following syllables | 3 |
| 2 | Previous syllable is or is not sp | 1 |
| 3 | Slope and intercept of F0 contour of current syllable, its delta, and delta-delta | 6 |
| 4 | Statistical parameters of pitch and log-energy of current syllable (i.e. max, min, mean, etc.) | 10 |
| 5 | Normalized max and mean of pitch and energy in each syllable in the context window | 12 |
| 6 | Location of current syllable within word | 1 |
| 7 | Tones of preceding and following syllables | 2 |
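The slope and intercept in feature 3 can be obtained by fitting a straight line to the F0 values of a syllable. The sketch below uses an ordinary least-squares fit over the frame index, which is an assumption about how the fit is done; the function name `fit_line` and the sample contour are hypothetical.

```python
def fit_line(values):
    """Least-squares line fit of values against frame index 0..n-1.

    Returns (slope, intercept), where intercept is the fitted value at frame 0.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames to fit a line")
    mean_t = (n - 1) / 2.0
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


# Hypothetical F0 contour of one syllable in Hz (steadily rising).
syllable_f0 = [100.0, 110.0, 120.0, 130.0]
slope, intercept = fit_line(syllable_f0)
print(slope, intercept)  # → 10.0 100.0
```

Applying the same fit to the delta and delta-delta tracks of the contour would give the remaining four values, for six features in total.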
Other Work: Chang, Zhou, Di, Huang, & Lee [1]
- 3 methods
  - Powerful language model (no tone modeling)
    - CER = 7.32%
  - Embedded 2 stream
    - Tone stream + feature stream
    - CER = 6.43%
  - Embedded 1 stream
    - Developed pitch extractor
    - Pitch track added to feature vector
    - CER = 6.03%
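CER (character error rate) is the edit distance between the recognized and reference character sequences, divided by the reference length. The sketch below shows that computation and the relative reduction implied by the figures above; the function names and the sample strings are hypothetical, and the scoring details used in [1] are not given in the slides.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return (baseline - improved) / baseline * 100.0


print(cer("你好世界", "你号世界"))                  # → 0.25
print(round(relative_reduction(7.32, 6.43), 1))   # → 12.2
print(round(relative_reduction(7.32, 6.03), 1))   # → 17.6
```

Relative to the language-model-only baseline, the two-stream system cuts CER by about 12% and the one-stream system by about 18%.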
Other Work: Qian, Soong [3]
- F0 contour smoothing
- Multi-Space Distribution (MSD)
  - Models 2 probability spaces
    - Unvoiced: discrete
    - Voiced (F0 contour): continuous
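The idea of two probability spaces can be sketched as an observation likelihood with two branches: an unvoiced frame contributes only a discrete weight, while a voiced frame contributes the voiced weight times a continuous density over F0. The sketch below assumes a single Gaussian for the voiced space and uses `None` to mark unvoiced frames; the function name and all parameter values are hypothetical and not taken from [3].

```python
import math


def msd_likelihood(f0, w_voiced, mean, var):
    """Likelihood of one frame under a two-space MSD model.

    f0 is None for an unvoiced frame, otherwise the F0 value.
    w_voiced is the weight of the voiced space; the unvoiced space
    gets the remaining weight 1 - w_voiced.
    """
    if not 0.0 <= w_voiced <= 1.0:
        raise ValueError("w_voiced must be in [0, 1]")
    if f0 is None:
        # Discrete (zero-dimensional) space: only the space weight counts.
        return 1.0 - w_voiced
    # Continuous space: space weight times a Gaussian density over F0.
    density = math.exp(-0.5 * (f0 - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * density


# Hypothetical state parameters: 80% voiced, F0 centered at 220 Hz.
print(msd_likelihood(None, 0.8, 220.0, 400.0))   # unvoiced frame
print(msd_likelihood(220.0, 0.8, 220.0, 400.0))  # voiced frame at the mean
```

Because unvoiced frames are modeled directly, this formulation does not need interpolated F0 values in unvoiced regions.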
Other Work: Lamel, Gauvain, Le, Oparin, Meng [6]
- Multi-layer perceptron features
  - Combined with MFCCs and pitch features
- Compare language models
  - N-gram: back-off language model
  - Neural network language model
- Language model adaptation
Other Work: O. Kalinli [7]
- Replace prosodic features with biologically inspired auditory attention cues
  - Cochlear filtering, inner hair cell, etc.
- Other features are extracted from the auditory spectrum
  - Intensity
  - Frequency contrast
  - Temporal contrast
  - Orientation (phase)
Other Work: Qian, Xu, Soong [8]
- Cross-lingual voice transformation
- Phonetic mapping between languages
  - Difficult for Mandarin and English
    - Very different prosodic features
References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”
[2] Merriam-Webster Dictionary
[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework”
[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004
[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models for Mandarin Speech to Text Transcription”, ICASSP, 2011
[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011