Improved Tone Modeling for Mandarin Broadcast News Speech Recognition Xin Lei 1, Manhung Siu 2, Mei-Yuh Hwang 1, Mari Ostendorf 1, Tan Lee 3 1 SSLI Lab,

Slides:



Advertisements
Similar presentations
The SRI 2006 Spoken Term Detection System Dimitra Vergyri, Andreas Stolcke, Ramana Rao Gadde, Wen Wang Speech Technology & Research Laboratory SRI International,
Advertisements

SRI 2001 SPINE Evaluation System Venkata Ramana Rao Gadde Andreas Stolcke Dimitra Vergyri Jing Zheng Kemal Sonmez Anand Venkataraman.
IBM Labs in Haifa © 2007 IBM Corporation SSW-6, Bonn, August 23th, 2007 Maximum-Likelihood Dynamic Intonation Model for Concatenative Text to Speech System.
Mandarin Chinese Speech Recognition. Mandarin Chinese Tonal language (inflection matters!) Tonal language (inflection matters!) 1 st tone – High, constant.
Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence Sankaranarayanan Ananthakrishnan, Shrikanth S. Narayanan IEEE 2007 Min-Hsuan.
Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs Andrew Rosenberg Queens College / CUNY Interspeech 2013 August 26, 2013.
Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation Man-Wai MAK and Hon-Bill YU The Hong Kong Polytechnic University.
Acoustic / Lexical Model Derk Geene. Speech recognition  P(words|signal)= P(signal|words) P(words) / P(signal)  P(signal|words): Acoustic model  P(words):
Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News Gina-Anne Levow University of Chicago SIGHAN July 25, 2004.
LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor : Prof Michael R. Lyu Students: Gao Zheng Hong Lei Mo.
Error Analysis: Indicators of the Success of Adaptation Arindam Mandal, Mari Ostendorf, & Ivan Bulyko University of Washington.
Speaker Adaptation for Vowel Classification
Context in Multilingual Tone and Pitch Accent Recognition Gina-Anne Levow University of Chicago September 7, 2005.
Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang, Xin Lei, Wen Wang*, Takahiro Shinozaki University of Washington, *SRI 9/19/2006,
On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg, Julia Hirschberg Columbia University Interspeech /14/06.
Language and Speaker Identification using Gaussian Mixture Model Prepare by Jacky Chau The Chinese University of Hong Kong 18th September, 2002.
Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR Overview Motivation Tone has a crucial role in Mandarin speech.
LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor : Prof Michael R. Lyu Students: Gao Zheng Hong Lei Mo.
WEB-DATA AUGMENTED LANGUAGE MODEL FOR MANDARIN SPEECH RECOGNITION Tim Ng 1,2, Mari Ostendrof 2, Mei-Yuh Hwang 2, Manhung Siu 1, Ivan Bulyko 2, Xin Lei.
Varying Input Segmentation for Story Boundary Detection Julia Hirschberg GALE PI Meeting March 23, 2007.
Toshiba Update 14/09/2005 Zeynep Inanoglu Machine Intelligence Laboratory CU Engineering Department Supervisor: Prof. Steve Young A Statistical Approach.
Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab.
ASRU 2007 Survey Presented by Shih-Hsiang. 2 Outline LVCSR –Building a Highly Accurate Mandarin Speech RecognizerBuilding a Highly Accurate Mandarin Speech.
The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments MITRE / MS State - ISIP Burhan Necioglu Bryan George George Shuttic The MITRE.
1 International Computer Science Institute Data Sampling for Acoustic Model Training Özgür Çetin International Computer Science Institute Andreas Stolcke.
Midterm Review Spoken Language Processing Prof. Andrew Rosenberg.
1IBM T.J. Waston CLSP, The Johns Hopkins University Using Random Forests Language Models in IBM RT-04 CTS Peng Xu 1 and Lidia Mangu 2 1. CLSP, the Johns.
Speech and Language Processing
On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings Jáchym Kolář 1,2 Elizabeth Shriberg 1,3 Yang Liu 1,4.
World Languages Mandarin English Challenges in Mandarin Speech Recognition  Highly developed language model is required due to highly contextual nature.
1M4 speech recognition University of Sheffield M4 speech recognition Vincent Wan, Martin Karafiát.
1 Improved Speaker Adaptation Using Speaker Dependent Feature Projections Spyros Matsoukas and Richard Schwartz Sep. 5, 2003 Martigny, Switzerland.
Yun-Nung (Vivian) Chen, Yu Huang, Sheng-Yi Kong, Lin-Shan Lee National Taiwan University, Taiwan.
1 Using TDT Data to Improve BN Acoustic Models Long Nguyen and Bing Xiang STT Workshop Martigny, Switzerland, Sept. 5-6, 2003.
Data Sampling & Progressive Training T. Shinozaki & M. Ostendorf University of Washington In collaboration with L. Atlas.
Improving out of vocabulary name resolution The Hanks David Palmer and Mari Ostendorf Computer Speech and Language 19 (2005) Presented by Aasish Pappu,
NOISE DETECTION AND CLASSIFICATION IN SPEECH SIGNALS WITH BOOSTING Nobuyuki Miyake, Tetsuya Takiguchi and Yasuo Ariki Department of Computer and System.
Automatic Cue-Based Dialogue Act Tagging Discourse & Dialogue CMSC November 3, 2006.
Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments 張智星
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
Recognizing Discourse Structure: Speech Discourse & Dialogue CMSC October 11, 2006.
1 Modeling Long Distance Dependence in Language: Topic Mixtures Versus Dynamic Cache Models Rukmini.M Iyer, Mari Ostendorf.
1 Prosody-Based Automatic Segmentation of Speech into Sentences and Topics Elizabeth Shriberg Andreas Stolcke Speech Technology and Research Laboratory.
New Acoustic-Phonetic Correlates Sorin Dusan and Larry Rabiner Center for Advanced Information Processing Rutgers University Piscataway,
ISL Meeting Recognition Hagen Soltau, Hua Yu, Florian Metze, Christian Fügen, Yue Pan, Sze-Chen Jou Interactive Systems Laboratories.
National Taiwan University, Taiwan
1 Broadcast News Segmentation using Metadata and Speech-To-Text Information to Improve Speech Recognition Sebastien Coquoz, Swiss Federal Institute of.
Problems of Modeling Phone Deletion in Conversational Speech for Speech Recognition Brian Mak and Tom Ko Hong Kong University of Science and Technology.
Results of the 2000 Topic Detection and Tracking Evaluation in Mandarin and English Jonathan Fiscus and George Doddington.
© 2005, it - instituto de telecomunicações. Todos os direitos reservados. Arlindo Veiga 1,2 Sara Cadeias 1 Carla Lopes 1,2 Fernando Perdigão 1,2 1 Instituto.
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
Copyright © 2013 by Educational Testing Service. All rights reserved. Evaluating Unsupervised Language Model Adaption Methods for Speaking Assessment ShaSha.
Dynamic Tuning Of Language Model Score In Speech Recognition Using A Confidence Measure Sherif Abdou, Michael Scordilis Department of Electrical and Computer.
Using Neural Network Language Models for LVCSR Holger Schwenk and Jean-Luc Gauvain Presented by Erin Fitzgerald CLSP Reading Group December 10, 2004.
1 Voicing Features Horacio Franco, Martin Graciarena Andreas Stolcke, Dimitra Vergyri, Jing Zheng STAR Lab. SRI International.
Dec. 4-5, 2003EARS STT Workshop1 Broadcast News Training Experiments Anand Venkataraman, Dimitra Vergyri, Wen Wang, Ramana Rao Gadde, Martin Graciarena,
Speech Enhancement based on
NTNU SPEECH AND MACHINE INTELEGENCE LABORATORY Discriminative pronunciation modeling using the MPE criterion Meixu SONG, Jielin PAN, Qingwei ZHAO, Yonghong.
H ADVANCES IN MANDARIN BROADCAST SPEECH RECOGNITION Overview Goal Build a highly accurate Mandarin speech recognizer for broadcast news (BN) and broadcast.
Qifeng Zhu, Barry Chen, Nelson Morgan, Andreas Stolcke ICSI & SRI
Investigating Pitch Accent Recognition in Non-native Speech
An overview of decoding techniques for LVCSR
Recognizing Structure: Sentence, Speaker, andTopic Segmentation
Jun Wu and Sanjeev Khudanpur Center for Language and Speech Processing
Segment-Based Speech Recognition
Research on the Modeling of Chinese Continuous Speech Recognition
Ju Lin, Yanlu Xie, Yingming Gao, Jinsong Zhang
Anthor: Andreas Tsiartas, Prasanta Kumar Ghosh,
Speaker Identification:
Presentation transcript:

Improved Tone Modeling for Mandarin Broadcast News Speech Recognition Xin Lei 1, Manhung Siu 2, Mei-Yuh Hwang 1, Mari Ostendorf 1, Tan Lee 3 1 SSLI Lab, Univ. of Washington 2 Hong Kong Univ. of Science and Tech. 3 Chinese Univ. of Hong Kong 9/19/2006 Interspeech’06

Why Model Tones? In Mandarin, recognition of tones is needed for decoding words Ng et al. (SPL 2005) showed that word context alone (modeled by an n-gram) cannot disambiguate tones In our work: upperbound evaluation –Pruning lattice arcs with the wrong tone improves CER 12.0%  8.2% Interspeech’06

Background & Approach Previous work: –Embedded tone modeling: augment observation vector with F0 features extend phone set to include tonal units –Explicit tone modeling: separate detection of tones and phones Our approach –Improve pitch features for embedded modeling –Combine embedded approach with explicit tone models in lattice rescoring Interspeech’06

Baseline Mandarin BN System Train and test data –Acoustic: 28hrs of Hub4 –Text: 121M words from Hub4, TDT[2,3,4], Gigaword(Xinhua) –Test: 1hr of eval04 (CTV, RFA and NTDTV) Feature and models –39-dim MFCC + 3-dim pitch/delta/double delta –Maximum likelihood AM 2000x32, bigram LM Decoding structure –Auto segmentation, speaker clustering, –First pass decoding  3-class MLLR in second pass decoding Evaluation in terms of character error rate (CER) Interspeech’06

Smoothing and Normalization of F0 Interspeech’06 Composition of sentence F0 contour: phrase intonation + lexical tone + segmental effects + tracking errors F0 processing algorithm 1.Spline interpolate F0 contour with piecewise cubic Hermite interpolating polynomial (PCHIP) 2.Take the log of F0 3.Moving window normalization (MWN) for phrase level effects 4.5-point moving average (MA) smoother 5.Mean/var normalization as for MFCC features Raw F0 Final feature

Embedded Modeling Experiments Compare CER using different F0 processing techniques to ASR without F0 Baseline IBM-style smoothing interpolates unvoiced regions with waveform F0 average Observations: –Biggest win from using some F0 (vs. none) –Next biggest gain is from normalization Interspeech’06 Feature CTVRFANTDTVOverall MFCC only IBM-style F spline F spline + MWN+ MA F

Explicit Tone Modeling Approach 4-way tone T i classification –Neural net using features f i : Sampled F0 Duration Polynomial regression coefficients (not very helpful) –Train on and score only longer syllables Lattice rescoring combines –Acoustic score from embedded model –Language model score –Duration-weighted tone posterior: d i log p(T i |f i ) where d i is the syllable duration Interspeech’06

Explicit Tone Model Experiments 4-way tone classification results on CTV Impact after lattice rescoring for CTV: –12.0% CER  11.5% CER –Significant with p<.04 Interspeech’06 UnitsFeaturesDim # of NN nodes % Acc. CI toneFinal F0 + dur CI toneSyllable F0 + dur Left tone contextSyllable F0 + dur

Summary and Future Work Contributions –Upperbound evaluation of explicit tone modeling in lattice rescoring –Improved F0 features for embedded tone modeling via new smoothing and normalization methods –Combining explicit tone models in lattice rescoring gives significant CER reduction Future work –Modeling the coarticulation effects of tones –Tone model adaptation Interspeech’06