Mandarin Chinese Speech Recognition

Mandarin Chinese
- Tonal language (inflection matters!)
  - 1st tone: high, constant pitch (like saying "aaah")
  - 2nd tone: rising pitch ("Huh?")
  - 3rd tone: low pitch ("ugh")
  - 4th tone: high pitch with a rapid descent ("No!")
  - "5th tone": neutral, used for de-emphasized syllables
- Monosyllabic language
  - Each character represents a single base syllable and tone
  - Most words consist of 1, 2, or 4 characters
- Heavily contextual language

Mandarin Chinese and Speech Processing
- Acoustic representations of Chinese syllables
- Structural form: (consonant) + vowel + (consonant)

Mandarin Chinese and Speech Processing
- Phone sets
  - Initial/final phones [1]
    - e.g. Shi, ge, zi = (shi + ib), (ge + e), (z + if)
  - Initial phones: unvoiced
    - 1 phone
  - Final phones: voiced (tone 1-5)
    - Can consist of multiple phones

Mandarin Chinese and Speech Processing
- Strong tone recognition is crucial for distinguishing between homonyms [3] (especially without context)
- Creating tone models is difficult
  - Discontinuities exist in the F0 contour between voiced and unvoiced regions
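To make the discontinuity problem concrete: a pitch tracker usually reports no F0 for unvoiced frames, so the contour has gaps. One common workaround is to interpolate linearly across each gap. The sketch below is only an illustration of that idea, not the method of any cited paper. It assumes unvoiced frames are marked with 0.0, and the function name `interpolate_f0` is hypothetical.

```python
def interpolate_f0(f0, unvoiced=0.0):
    """Fill unvoiced gaps in an F0 track by linear interpolation.

    Leading and trailing gaps are filled with the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v != unvoiced]
    if not voiced:
        return list(f0)  # no voiced frame to anchor on; return a copy
    out = list(f0)
    # Extend the first and last voiced values out to the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate inside each gap between consecutive voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            out[i] = f0[left] + frac * (f0[right] - f0[left])
    return out


# Assumed example track in Hz; 0.0 marks an unvoiced frame.
track = [0.0, 200.0, 0.0, 220.0, 0.0]
smooth = interpolate_f0(track)
print(smooth)  # [200.0, 200.0, 210.0, 220.0, 220.0]
```

Interpolation gives a continuous contour that a standard continuous-density model can consume, at the cost of inventing pitch values where none exist.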

Prosody
- Prosody: "the rhythmic and intonational aspect of language" [2]
- Embedded Tone Modeling [4]
- Explicit Tone Modeling [4]

Tone Modeling
- Embedded Tone Modeling
  - Tonal acoustic units are joined with spectral features at each frame [4]
- Explicit Tone Modeling
  - Tone recognition is completed independently and combined after post-processing [4]

Tone Modeling
- Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling
- Coarticulation
  - Variations in syllables can cause variations in tone:
    - Bu4 + Dui4 = Bu2 Dui4 (meaning "wrong")
    - Ni3 + Hao3 = Ni2 Hao3 (meaning "hello")
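The two tone changes in these examples can be written as simple rewrite rules: a 3rd tone directly before another 3rd tone becomes a 2nd tone, and "bu" in the 4th tone directly before another 4th tone becomes a 2nd tone. The following is a minimal sketch of just those two rules. The function name `apply_sandhi` and the (pinyin, tone) pair representation are assumptions made for illustration, and real sandhi in longer runs of 3rd tones depends on phrasing that this sketch ignores.

```python
def apply_sandhi(syllables):
    """Apply two simple sandhi rules to a list of (pinyin, tone) pairs.

    Rule 1: a 3rd tone directly before another 3rd tone becomes 2nd tone.
    Rule 2: "bu" in 4th tone directly before a 4th tone becomes 2nd tone.
    Decisions are based on the original (citation) tones.
    """
    result = []
    for i, (syl, tone) in enumerate(syllables):
        next_tone = syllables[i + 1][1] if i + 1 < len(syllables) else None
        if tone == 3 and next_tone == 3:
            tone = 2
        elif syl.lower() == "bu" and tone == 4 and next_tone == 4:
            tone = 2
        result.append((syl, tone))
    return result


print(apply_sandhi([("ni", 3), ("hao", 3)]))  # [('ni', 2), ('hao', 3)]
print(apply_sandhi([("bu", 4), ("dui", 4)]))  # [('bu', 2), ('dui', 4)]
```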

Embedded Tone Modeling: Two Stream Modeling (Ni, Liu, Xu)
- Spectral stream: MFCCs (Mel-frequency cepstral coefficients)
  - Describe vocal tract information
  - Distinctive for phones (short time duration)
- Pitch/tone stream (requires smoothing)
  - Describes vibrations of the vocal cords
  - Independent of spectral features
  - d/dt(pitch), aka tone, and d2/dt2(pitch) are added
  - Embedded in an entire syllable
  - Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)
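As a rough illustration of the pitch/tone stream, the sketch below computes first and second time derivatives of a smoothed pitch track and pairs them, frame by frame, with a spectral vector. It assumes a simple central difference with repeated edge values, which is not necessarily the delta formula used in [4]. The names `delta` and `two_stream_frames` and the placeholder MFCC values are hypothetical.

```python
def delta(seq):
    """Central-difference derivative, repeating the edge values."""
    n = len(seq)
    return [(seq[min(t + 1, n - 1)] - seq[max(t - 1, 0)]) / 2.0 for t in range(n)]


def two_stream_frames(mfcc_frames, pitch):
    """Pair each spectral frame with its [pitch, delta, delta-delta] vector."""
    d1 = delta(pitch)   # d/dt(pitch)
    d2 = delta(d1)      # d2/dt2(pitch)
    tone_stream = [[p, a, b] for p, a, b in zip(pitch, d1, d2)]
    return list(zip(mfcc_frames, tone_stream))


# Placeholder inputs: 2-dimensional "MFCC" frames and a rising pitch track (Hz).
mfcc = [[0.1, 0.2] for _ in range(4)]
pitch = [100.0, 110.0, 120.0, 130.0]
frames = two_stream_frames(mfcc, pitch)
print(frames[0])  # ([0.1, 0.2], [100.0, 5.0, 2.5])
```

Keeping the two vectors separate mirrors the two-stream setup; a one-stream variant would instead concatenate them into a single feature vector per frame.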

Embedded Tone Modeling: Two Stream Modeling [4]
- Tonal identification features
  - F0
  - Energy
  - Duration
  - Coarticulation (continuous speech)
- Initially use a 2-stream embedded model, followed by explicit modeling during lattice rescoring (alignment?)
- Explicit tone modeling uses a maximum entropy framework [4] (discriminative model)
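For readers unfamiliar with maximum entropy models: they score each candidate tone with a weighted sum of features and normalize the exponentiated scores into probabilities. The sketch below shows that computation in its generic form. The feature name `f0_slope` and all weights are made-up values for illustration and are not taken from [4].

```python
import math


def maxent_posteriors(features, weights):
    """P(tone | features) under a maximum entropy (multinomial logistic) model.

    features: dict mapping feature name -> value
    weights:  dict mapping tone -> dict of feature name -> weight
    """
    scores = {
        tone: sum(w.get(name, 0.0) * value for name, value in features.items())
        for tone, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tone: math.exp(s - top) for tone, s in scores.items()}
    z = sum(exps.values())
    return {tone: e / z for tone, e in exps.items()}


# Hypothetical weights: a rising contour favors tone 2, a falling one tone 4.
weights = {1: {"f0_slope": 0.0}, 2: {"f0_slope": 2.0}, 4: {"f0_slope": -2.0}}
post = maxent_posteriors({"f0_slope": 1.0}, weights)
print(max(post, key=post.get))  # 2
```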

Explicit Tone Modeling [4]

No. | Feature Description | # of Features
1 | Duration of current, previous, and following syllables | 3
2 | Previous syllable is or is not sp | 1
3 | Slope and intercept of F0 contour of current syllable, its delta, and delta-delta | 6
4 | Statistical parameters of pitch and log-energy of current syllable (i.e. max, min, mean, etc.) | 10
5 | Normalized max and mean of pitch and energy in each syllable in the context window | 12
6 | Location of current syllable within word | 1
7 | Tones of preceding and following syllables | 2
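A few of the features in this table can be sketched directly. The example below fits a least-squares line to a syllable's F0 contour to get a slope and an intercept (as in feature 3) and computes some simple pitch statistics (as in feature 4). It is a sketch under assumptions: frame index is used as the time axis, and the function names and dictionary keys are hypothetical rather than taken from [4].

```python
def fit_line(values):
    """Least-squares slope and intercept of values against frame index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def syllable_features(f0, duration_ms):
    """Collect a handful of per-syllable prosodic features."""
    slope, intercept = fit_line(f0)
    return {
        "duration_ms": duration_ms,
        "f0_slope": slope,
        "f0_intercept": intercept,
        "f0_max": max(f0),
        "f0_min": min(f0),
        "f0_mean": sum(f0) / len(f0),
    }


# Assumed example: a steadily rising contour (Hz), as in a 2nd-tone syllable.
feats = syllable_features([200.0, 210.0, 220.0, 230.0], duration_ms=180)
print(feats["f0_slope"], feats["f0_intercept"])  # 10.0 200.0
```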

Other Work: Chang, Zhou, Di, Huang, & Lee [1]
- 3 methods
  - Powerful language model (no tone modeling)
    - CER = 7.32%
  - Embedded 2 stream
    - Tone stream + feature stream
    - CER = 6.43%
  - Embedded 1 stream
    - Developed pitch extractor
    - Pitch track added to feature vector
    - CER = 6.03%
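CER (character error rate) is conventionally the character-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below shows that computation and the relative improvement implied by the figures above. The function names and the sample strings are illustrative assumptions, and the scoring details used in [1] may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, new):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - new) / baseline


print(cer("你好世界", "你好视界"))                 # 0.25 (one substitution in four characters)
print(round(relative_reduction(7.32, 6.43), 1))  # 12.2
print(round(relative_reduction(7.32, 6.03), 1))  # 17.6
```

On these numbers, the 1-stream system reduces CER by about 17.6% relative to the language-model-only baseline, and the 2-stream system by about 12.2%.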

Other Work: Qian, Soong [3]
- F0 contour smoothing
- Multi-Space Distribution (MSD)
  - Models 2 probability spaces
    - Unvoiced: discrete
    - Voiced (F0 contour): continuous
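The two-space idea can be illustrated with a per-frame likelihood: an unvoiced frame is scored by a discrete weight alone, while a voiced frame is scored by the voiced weight times a continuous density over F0. The sketch below assumes a single Gaussian for the voiced space and uses made-up parameter values. It is meant to show the structure only and is not the model configuration from [3].

```python
import math


def msd_likelihood(obs, w_voiced, mean, var):
    """Likelihood of one frame under a two-space MSD state.

    obs is None for an unvoiced frame, otherwise an F0 (or log-F0) value.
    """
    if obs is None:
        return 1.0 - w_voiced  # discrete space: only a weight, no density
    density = math.exp(-0.5 * (obs - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return w_voiced * density  # continuous space: weight times Gaussian density


# Hypothetical state: 80% voiced, log-F0 mean 5.3, variance 0.04.
unvoiced_score = msd_likelihood(None, 0.8, 5.3, 0.04)
voiced_score = msd_likelihood(5.3, 0.8, 5.3, 0.04)
```

Because unvoiced frames get their own space, no artificial F0 values have to be invented for them.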

Other Work: Lamel, Gauvain, Le, Oparin, Meng [6]
- Multi-Layer Perceptron features
  - Combined with MFCCs and pitch features
- Compare language models
  - N-gram: back-off language model
  - Neural network language model
- Language model adaptation

Other Work: O. Kalinli [7]
- Replace prosodic features with biologically inspired auditory attention cues
  - Cochlear filtering, inner hair cell, etc.
- Other features are extracted from the auditory spectrum
  - Intensity
  - Frequency contrast
  - Temporal contrast
  - Orientation (phase)

Other Work: Qian, Xu, Soong [8]
- Cross-lingual voice transformation
  - Phonetic mapping between languages
  - Difficult for Mandarin and English
    - Very different prosodic features

References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, "Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones"
[2] Merriam-Webster Dictionary
[3] Yao Qian & Frank Soong, "A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition", Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, "Improved Large Vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework"
[5] Yi Liu & Pascale Fung, "Pronunciation Modeling for Spontaneous Mandarin Speech Recognition", International Journal of Speech Technology, 2004
[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, "Improved Models For Mandarin Speech to Text Transcription", ICASSP, 2011
[7] O. Kalinli, "Tone and Pitch Accent Classification Using Auditory Attention Cues", ICASSP, 2011