Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR

Xin Lei, Mei-Yuh Hwang and Mari Ostendorf
SSLI Lab, Department of Electrical Engineering, University of Washington, Seattle, WA

Overview

Motivation
– Tone has a crucial role in Mandarin speech
– Tone depends on the F0 contour of a time span much longer than a frame; frame-level F0 and its derivatives only capture part of the tonal patterns

Our approach
– Use an MLP with a longer input window to extract tone-related posteriors as a better model of tone
– Incorporate the tone-related posteriors in the feature vector

Five Tones in Mandarin
– High-level (tone 1), high-rising (tone 2), low-dipping (tone 3), high-falling (tone 4) and neutral (tone 5)
– Characterized by the syllable-level F0 contour

Confusion caused by tones
– 烟 (yan1, cigarette), 盐 (yan2, salt), 眼 (yan3, eye), 厌 (yan4, hate)

Common tone modeling techniques
– Tonal acoustic units (such as tonal syllables or tonemes)
– Add pitch and its derivatives to the feature vector

Tandem Features

Hermansky et al. proposed the tandem approach
– Use log MLP posterior outputs as the input features for the Gaussian mixture models of a conventional speech recognizer

References
– H. Hermansky et al., "Tandem connectionist feature extraction for conventional HMM systems", ICASSP 2000
– N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition", ICASSP 2004

Tone-related Posterior Feature Extraction

Overall configuration: consider two types of features (see the tandem pipeline sketch after this section).

Tone posterior feature
– MLP targets: five tones and a no-tone target
– MLP input: 9-frame window of MFCC+F0 features

Toneme posterior feature
– MLP targets: 62 speech tonemes, plus one target each for silence, laughter and all other noise
– MLP outputs reduced to 25 dimensions with PCA
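The tandem pipeline above can be summarized in a short sketch: stack a 9-frame window of MFCC+F0 features as MLP input, take log posteriors from a trained tone or toneme MLP, reduce them with PCA, and append the result to the per-frame PLP(+F0) vector. This is a minimal illustration only; the function names (stack_frames, pca_project) and the tone_mlp callable are assumptions for the sketch, not the QuickNet/Decipher tools actually used.

    import numpy as np

    def stack_frames(feats, context=4):
        """Stack +/- `context` neighboring frames around each frame
        (context=4 gives the 9-frame window used as MLP input)."""
        T, _ = feats.shape
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

    def pca_project(x, n_components=25):
        """Plain PCA via eigendecomposition of the covariance matrix."""
        centered = x - x.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
        top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
        return centered @ top

    def tandem_features(plp_f0, mfcc_f0, tone_mlp, n_components=25):
        """Append PCA-reduced log tone/toneme posteriors to the PLP(+F0) stream.
        plp_f0, mfcc_f0: (T, dim) per-frame feature arrays for one utterance.
        tone_mlp: hypothetical trained MLP returning softmax posteriors."""
        mlp_in = stack_frames(mfcc_f0, context=4)            # (T, 9 * d_mfcc_f0)
        posteriors = tone_mlp(mlp_in)                        # (T, n_targets)
        log_post = np.log(posteriors + 1e-10)                # tandem approach: log posteriors
        reduced = pca_project(log_post, n_components)        # e.g. toneme posteriors -> 25 dims
        return np.hstack([plp_f0, reduced])                  # feature vector for the HMM system

The HMM system then treats the concatenated vector like any other acoustic feature stream.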
Experiment Setup

Corpora
– Mandarin conversational telephone speech data collected by HKUST
– Training set: train04, 251 phone calls, 57.7 hours of speech
– Testing set: dev04, 24 calls totaling 2 hours, manually segmented

Front-end features
– Standard MFCC/PLP features from the SRI Decipher front end
– 3-dim pitch and its delta features computed by ESPS get_f0
– Pitch features post-processed with the SRI pitch tool graphtrack (reduces doubling and halving errors, plus smoothing)
– Various MLP-generated tone-related posterior features

Decoding setup
– 7-class unsupervised MLLR adaptation
– Rescoring a lattice of word hypotheses with a trigram LM

MLP Training Results

MLP training with the ICSI QuickNet package
– Features: 9-frame window of MFCC + F0
– Tone/toneme labels: from forced alignment
– 10% of the data used for cross validation

Cross-validation results of MLP training:

    Targets    Cardinality    Frame accuracy
    tone       6              80.3%
    toneme     64             68.6%

For reference, frame accuracy for 46 English phones is around 67%.

Speech Recognition Results

Tone posterior feature results (non-crossword models):

    Feature                       Dimension    CER
    PLP                           39           36.8%
    PLP+F0                        42
    PLP+(tone posterior)          45           35.4%
    PLP+PCA(tone posterior)       42           35.6%
    PLP+F0+(tone posterior)       48           35.2%
    PLP+F0+PCA(tone posterior)    45           35.2%

Toneme posterior feature results (non-crossword models):

    Feature                               Dimension    CER
    PLP+F0                                42
    PLP+PCA(toneme posterior)             64           33.7%
    PLP+F0+PCA(toneme posterior)          67           33.2%
    PLP+PCA(tone, toneme posterior)       64           33.3%
    PLP+F0+PCA(tone, toneme posterior)    67

Cross-word system results:

    Feature                         Dimension    CER
    PLP+F0                          42
    PLP+F0+PCA(toneme posterior)    67

Conclusions
– Tone posterior features by themselves outperform F0
– Tone posteriors and the plain F0 features complement each other, though with some overlap
– More improvement is achieved with toneme posteriors than with tone posteriors, since they carry both tone and segmental information
– Combining tone-related posterior features with F0 gives a 2-2.5% absolute improvement in CER

Future Work
– Context-dependent tone ("tri-tone") classification (see the sketch below)
– Tri-tone clustering
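To make the tri-tone idea concrete, here is a minimal sketch (an assumption about how such targets could be defined, not part of the system above) that derives context-dependent tone labels from syllable-level tone sequences, in analogy to triphones. With five tones plus a no-tone class there are 6^3 = 216 possible tri-tones, and the rare contexts are what the clustering step would need to tie.

    from collections import Counter

    TONES = ["1", "2", "3", "4", "5", "nt"]  # five Mandarin tones plus a no-tone class

    def tritone_labels(tone_seq):
        """Map a syllable-level tone sequence to context-dependent 'tri-tone'
        labels, analogous to triphones: left-tone - center-tone + right-tone."""
        padded = ["nt"] + list(tone_seq) + ["nt"]   # pad utterance boundaries with no-tone
        return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                for i in range(1, len(padded) - 1)]

    # Hypothetical usage: count tri-tone types over a toy corpus of tone sequences.
    corpus = [["1", "4", "3", "2"], ["3", "3", "5", "1"], ["2", "1", "4"]]
    counts = Counter(label for seq in corpus for label in tritone_labels(seq))
    print(len(counts), "distinct tri-tones observed out of", len(TONES) ** 3, "possible")
    # Rare contexts would be tied/clustered (e.g. with a decision tree) before MLP training.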