ADVANCES IN MANDARIN BROADCAST SPEECH RECOGNITION

Presentation transcript:

M.-Y. Hwang (1), W. Wang (2), X. Lei (1), J. Zheng (2), O. Cetin (3), G. Peng (1)
(1) Department of Electrical Engineering, University of Washington, Seattle, WA, USA
(2) SRI International, Menlo Park, CA, USA
(3) ICSI, Berkeley, CA, USA

Overview

Goal: build a highly accurate Mandarin speech recognizer for broadcast news (BN) and broadcast conversation (BC).

Improvements over the previous version:
- Increased training data
- Discriminative features
- Frame-level discriminative training criterion
- Multiple-pass AM adaptation
- System combination
- LM adaptation
Together these yield a 24%-64% relative reduction in character error rate (CER).

Increased Training Data

- Acoustic training data increased from 97 hours to 465 hours (2/3 BN, 1/3 BC).
- ML (maximum-likelihood) word segmentation used for Chinese text.
- LM training text increased from 420M words to 849M words.
- Lexicon size increased from 49K to 60K entries (including 1700 English words).
- Six LMs trained for interpolation into one: bigrams (qLM2), trigrams (LM3, qLM3), and 5-grams (LM5a, LM5b). LM5b uses count-based smoothing.

Two Acoustic Models

1. MFCC, 39-dim, CW+SAT, fMPE+MPE, 3000x128 Gaussians.
2. MFCC + MPE-phoneme posterior (MLP) feature, 74-dim, nonCW, fMPE+MPE, 3000x64 Gaussians.

MLP features:
[1] J. Zheng et al., "Combining discriminative feature, transform, and model training for large vocabulary speech recognition," ICASSP 2007.
[2] B. Chen et al., "Learning long-term temporal features in LVCSR using neural networks," ICSLP 2004.

Future Work

- Topic-dependent language model adaptation.
- Machine-translation (MT) targeted error rates.
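The six component LMs are merged by linear interpolation, with mixture weights conventionally tuned on held-out text via EM. A minimal sketch of that tuning loop, assuming each LM is exposed as a probability function (the function names and toy setup are illustrative, not from the poster):

```python
def em_weights(prob_fns, heldout, iters=20):
    """Tune linear-interpolation weights on held-out (context, word) pairs
    with EM; prob_fns are callables returning P_k(word | context)."""
    k = len(prob_fns)
    w = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for ctx, word in heldout:
            # E-step: posterior responsibility of each component LM
            comps = [wi * p(ctx, word) for wi, p in zip(w, prob_fns)]
            total = sum(comps)
            counts = [c + comp / total for c, comp in zip(counts, comps)]
        # M-step: renormalize expected counts into new weights
        w = [c / len(heldout) for c in counts]
    return w

def interp_prob(prob_fns, w, ctx, word):
    """P(word | ctx) under the interpolated model."""
    return sum(wi * p(ctx, word) for wi, p in zip(w, prob_fns))
```

In practice the weights would be tuned separately per test condition (BN vs. BC) on matched held-out data.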
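CER, the metric used throughout the results, is Levenshtein distance computed over characters (natural for Mandarin, which has no unambiguous word boundaries) divided by the reference length. A straightforward sketch:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance(ref, hyp) / len(ref),
    counting substitutions, insertions, and deletions equally."""
    n, m = len(ref), len(hyp)
    prev = list(range(m + 1))          # distance from "" to hyp[:j]
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # match / substitution
        prev = cur
    return prev[m] / n
```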
Test Data

  Test set   Genre   Size
  Dev04      BN      0.5 hr
  Eval04     BN      1 hr
  Ext06      BN      1 hr
  Dev05bc    BC      2.7 hr

Training Text

  Source          BN        BC
  (1) TDT+        17.7M     -
  (2) GALE        3.0M      2.7M
  (3) Giga-CNA    451.4M    -
  (4) Giga-XIN    260.9M    -
  (5) Giga-ZBN    15.8M     -
  (6) NTU-Web     95.5M     2.1M
  Final LM        844.3M    4.8M

Perplexity

  LM       Word perplexity
  LM5a     77.9

Character Error Rates

  AM data   LM data    Lex   Acoustic model                         Dev04   Eval04
  97 hr     420M wrd   49K   MFCC CW+SAT MPE                        6.0%    16.0%
  465 hr    420M wrd   49K   MFCC CW+SAT MPE                        5.3%    15.1%
  465 hr    850M wrd   60K   MFCC CW+SAT fMPE+MPE || MLP fMPE+MPE   3.7%    12.2%

  AM data   LM data    Lex   Acoustic model                         Ext06   Dev05bc
  97 hr     420M wrd   49K   MFCC CW+SAT MPE                        15.0%   34.0%
  465 hr    850M wrd   60K   MFCC CW+SAT fMPE+MPE || MLP fMPE+MPE   5.4%    22.5%

LM Adaptation for BC (Dev05bc)

- LM_BN = (1)-(6) BN + EARS Conversational Telephone Speech (159M words)
- LM_BC = (2)+(6) BC
- LM_ALL = interpolation(LM_BN, LM_BC)
- LM_BN-C = LM_BN adapted by (2) GALE-BC
- LM_BN' = LM_BN adapted dynamically per show i by its first-pass hypothesis h_i
- One LM adaptation per show i, maximizing the likelihood of h_i; the entire recognition process is re-started after LM adaptation.
- The same strategy gives no improvement on BN test data, where BN training text is already plentiful.

  Adaptation setup                       First-pass   Final CER
  LM_ALL (no LM adaptation)              24.9%        21.9%
  λ_i LM_BC + (1 - λ_i) LM_BN            24.4%        21.2%
  λ_i LM_BC + (1 - λ_i) LM_BN'           24.3%        21.0%
  λ_i LM_BC + (1 - λ_i) LM_BN-C          24.0%        20.6%

Decoding Architecture

1. Acoustic segmentation, speaker clustering, VTLN/CMN/CVN.
2. Pass 1: nonCW MLP system with qLM3.
3. Pass 2: CW MFCC system, MLLR, LM3; LM5a and LM5b rescoring.
4. Pass 3: nonCW MLP system, MLLR, LM3; LM5a and LM5b rescoring.
5. Confusion network combination of the rescored outputs; emit the top-1 hypothesis.
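The per-show weight λ_i is chosen to maximize the likelihood of the first-pass hypothesis h_i under the mixture λ_i·LM_BC + (1 - λ_i)·LM_BN. A one-dimensional grid search is the simplest way to sketch this; the toy unigram models below are illustrative, not the poster's LMs:

```python
import math

def adapt_weight(p_bc, p_bn, hyp_words, grid=None):
    """Return the mixture weight lam in (0, 1) maximizing the
    log-likelihood of the first-pass hypothesis hyp_words under
    lam * P_BC + (1 - lam) * P_BN."""
    grid = grid or [i / 100 for i in range(1, 100)]

    def loglik(lam):
        return sum(math.log(lam * p_bc(w) + (1 - lam) * p_bn(w))
                   for w in hyp_words)

    return max(grid, key=loglik)
```

A show whose first-pass output looks conversational pushes λ_i toward the BC model; a newsy show pushes it toward the BN model.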
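The final combination step merges the two systems' rescored outputs. Real confusion network combination aligns word posteriors from lattices; the core voting idea can be sketched over already-aligned word bins (the alignment is assumed given here, and all names are illustrative):

```python
from collections import defaultdict

def cnc_vote(aligned_bins, sys_weights):
    """Pick the top-1 word per aligned bin by weighted voting.
    aligned_bins: list of dicts mapping system index -> word,
                  with '' marking a deletion (empty arc) in that system."""
    output = []
    for bin_ in aligned_bins:
        score = defaultdict(float)
        for sys_idx, word in bin_.items():
            score[word] += sys_weights[sys_idx]
        best = max(score, key=score.get)
        if best:                     # skip bins where the empty arc wins
            output.append(best)
    return output
```

System weights would typically be tuned on a development set so that the stronger system dominates ties.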