Investigation on Mandarin Broadcast News Speech Recognition
Mei-Yuh Hwang, Xin Lei, Wen Wang*, Takahiro Shinozaki
University of Washington, *SRI
9/19/2006, Interspeech, Pittsburgh

Outline
The task
Text training data and language modeling
Acoustic training data and acoustic modeling
Decoding structure
Experimental results
Recent progress and future direction

The Task
Mandarin broadcast news (BN) transcription
Mainland Mandarin speech
TV/radio programs in China and the USA:
CCTV 中央电视台, NTDTV 新唐人电视台, PHOENIX TV 凤凰卫视, VOA 美国之音, RFA 自由亚洲电台, CNR 中国广播网

Text Training Data
LM1: 1997 Mandarin BN Hub4 transcriptions; Chinese TDT2, 3, 4; Multiple-Translation Chinese (MTC) corpus, parts 1, 2, 3
LM2: Gigaword XIN (China)
LM3: Gigaword ZBN (Singapore)
LM4: Gigaword CNA (Taiwan)
Altogether 420M words; the 4 LMs are interpolated.
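The interpolation weights for the four component LMs are typically tuned by EM on held-out text. A minimal sketch of that weight estimation (the probability matrix here is illustrative, not the actual toolkit setup used):

```python
import numpy as np

def em_interp_weights(P, iters=100):
    """EM for linear LM interpolation: p(w|h) = sum_i lam[i] * p_i(w|h).

    P is an (n_heldout_words, n_lms) matrix holding each component LM's
    probability for each held-out word; returns interpolation weights lam.
    """
    n_lms = P.shape[1]
    lam = np.full(n_lms, 1.0 / n_lms)          # start from uniform weights
    for _ in range(iters):
        post = lam * P                          # E-step: per-word component posteriors
        post /= post.sum(axis=1, keepdims=True)
        lam = post.mean(axis=0)                 # M-step: average posterior mass
    return lam
```

Each column of P would come from one of the four component LMs scored on the same held-out text; the resulting weights sum to 1.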

Chinese Word Segmentation
Start from BBN's 64k-word lexicon, derived from LDC.
Segment the text by longest-first match against the 64k lexicon.
Choose the 49k most frequent words as the new lexicon.
Train an n-gram LM.
Use its unigram component to re-segment the text along the maximum-likelihood (ML) path.

Chinese Word Segmentation
Longest-first: 民进党 / 和亲 / 民党 … ("The Green Party made peace with the Min Party via marriage…")
Maximum-likelihood: 民进党 / 和 / 亲民党 … ("The Green Party and the Qin-Min Party…")
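The two segmentation strategies can be sketched as follows; the toy lexicon and unigram probabilities below are purely illustrative, not the real 49k lexicon:

```python
import math

def longest_first(text, lexicon):
    """Greedy longest-match segmentation, left to right.
    Out-of-lexicon characters fall back to single-character words."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

def ml_segment(text, unigram):
    """Maximum-likelihood segmentation: Viterbi over unigram log-probs."""
    n = len(text)
    best = [(-math.inf, -1)] * (n + 1)   # (score, backpointer) per position
    best[0] = (0.0, -1)
    for j in range(1, n + 1):
        for i in range(j):
            w = text[i:j]
            if w in unigram:
                score = best[i][0] + math.log(unigram[w])
                if score > best[j][0]:
                    best[j] = (score, i)
    out, j = [], n                       # backtrace the best path
    while j > 0:
        i = best[j][1]
        out.append(text[i:j])
        j = i
    return out[::-1]
```

With unigram probabilities that favor 和 and 亲民党 over 和亲 and 民党, the ML path recovers the intended 民进党 / 和 / 亲民党 where longest-first does not.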

Perplexity (49k-word lexicon)
Word perplexity: 2-gram = 495, 4-gram = 288
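As a reminder, word perplexity is the exponentiated average negative log-probability the LM assigns to the test words. A minimal sketch:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-word natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))
```

A model that assigned every test word probability 1/288 would score a perplexity of 288, matching the 4-gram figure above.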

Acoustic Training Data
1997 Hub4 BN: 28.5 hrs
*TDT4-CCTV: 25 hrs
*TDT4-VOA: 43.5 hrs
Total: 97 hours
*Automatically selected via a flexible alignment with the closed captions.

Acoustic Feature Representation
39-dim MFCC: cepstra + Δ + ΔΔ
3-dim pitch: F0 + Δ + ΔΔ
Automatic speaker clustering
VTLN per auto speaker
Speaker-based CMN + CVN for training
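A sketch of the delta computation and the speaker-based normalization; the regression window width N=2 is an assumption, and the input arrays are illustrative:

```python
import numpy as np

def deltas(c, N=2):
    """Regression-based delta features over a +/-N frame window.
    c: (frames, dims) array; edges are handled by frame replication."""
    pad = np.pad(c, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(c)
    return sum(n * (pad[N + n:T + N + n] - pad[N - n:T + N - n])
               for n in range(1, N + 1)) / denom

def cmn_cvn(frames):
    """Speaker-based cepstral mean and variance normalization:
    zero mean, unit variance per dimension over one speaker's frames."""
    mu = frames.mean(axis=0)
    sd = frames.std(axis=0)
    return (frames - mu) / np.maximum(sd, 1e-8)
```

Stacking the static cepstra with their deltas and delta-deltas, then normalizing per speaker cluster, yields the training features described above.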

Acoustic Models
2500 senones (clustered states) × 32 Gaussians
ML training vs. MPE training with phone lattices
Gender-independent; nonCW vs. CW (cross-word) triphones
Speaker-adaptive training (SAT):
N(x; Aμ + b, AΣAᵀ) = |A|⁻¹ N(A⁻¹(x − b); μ, Σ)
i.e., the linear transformation A⁻¹x + (−A⁻¹b) is applied in the feature domain.
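The SAT identity above — a Gaussian with affinely transformed parameters evaluated at x equals the original Gaussian evaluated at the inversely transformed feature, scaled by the Jacobian |A|⁻¹ — can be checked numerically (the dimensions and random parameters are illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, S):
    """Multivariate Gaussian density N(x; mu, S)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(S, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

rng = np.random.default_rng(1)
d = 3
A = rng.normal(size=(d, d)) + 3 * np.eye(d)    # invertible feature transform
b = rng.normal(size=d)
mu = rng.normal(size=d)
S = np.diag(rng.uniform(0.5, 2.0, size=d))     # diagonal covariance
x = rng.normal(size=d)

# Model-space view: Gaussian with transformed mean/covariance at x
lhs = gauss_pdf(x, A @ mu + b, A @ S @ A.T)
# Feature-space view: original Gaussian at A^-1 (x - b), times |A|^-1
rhs = gauss_pdf(np.linalg.inv(A) @ (x - b), mu, S) / abs(np.linalg.det(A))
assert np.isclose(lhs, rhs)
```

This equivalence is why SAT can be implemented purely as a feature-domain transform, leaving the Gaussians untouched.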

2-Pass Search Architecture
Search 1: nonCW, nonSAT, ML model + small bigram → initial hypothesis
MLLR adaptation of the SAT model using that hypothesis
Search 2: CW, SAT, MPE model + big 4-gram → final word sequence
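MLLR estimates an affine transform of the Gaussian means from the first-pass alignment. Under two simplifying assumptions (identity covariances and hard frame-to-Gaussian assignments), the transform W = [A b] has a closed-form least-squares solution, sketched below:

```python
import numpy as np

def mllr_mean_transform(X, mus):
    """Least-squares estimate of W = [A b] minimizing
    sum_t || x_t - (A mu_t + b) ||^2,
    where mu_t is the mean of the Gaussian aligned to frame t
    (identity covariances assumed for this sketch).
    X: (frames, dims) adaptation features; mus: (frames, dims) aligned means."""
    ext = np.hstack([mus, np.ones((len(mus), 1))])   # extended means [mu; 1]
    # Normal equations: W = (X^T ext) (ext^T ext)^{-1}
    W = np.linalg.solve(ext.T @ ext, ext.T @ X).T
    return W[:, :-1], W[:, -1]                       # A, b
```

The adapted model replaces each mean mu with A mu + b; full MLLR additionally weights frames by occupancy posteriors and covariances.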

Adding Pitch: SA Results (CER)
No pitch: 14.5% (Dev04), 24.1% (Eval04)
IBM-style smoothing (mean-based): 14.0% (Dev04), 22.2% (Eval04)
SPLINE smoothing (cubic): 12.7% (Dev04), 21.4% (Eval04)
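The SPLINE variant fits a cubic spline through the voiced-frame F0 values so that unvoiced regions receive smoothly interpolated pitch, giving every frame a usable pitch feature. A minimal sketch (the frame values below are illustrative):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def spline_f0(f0):
    """Interpolate F0 over unvoiced frames (marked f0 == 0) with a
    cubic spline fit through the voiced frames."""
    t = np.arange(len(f0))
    voiced = f0 > 0
    return CubicSpline(t[voiced], f0[voiced])(t)
```

Voiced frames keep their measured F0 exactly; unvoiced gaps are filled by the spline rather than by a constant mean value.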

2-Pass Search Results (CER)
nonCW, nonSAT, ML: 7.4% (Dev04)
nonCW, nonSAT, MPE: 6.9% (Dev04)
nonCW, SAT, ML: 6.8% (Dev04)
CW, SAT, ML: 6.4% (Dev04)
CW, SAT, MPE: 6.0% (Dev04), 16.0% (Eval04)

More Recent Progress
Add more acoustic (440 hrs) and text (840M words) training data
Increased and improved lexicon (60k words)
fMPE training
Add the ICSI feature as a second system
5-gram LM
Between the MFCC system and the ICSI system: cross-adaptation and ROVER
3.7% CER on dev04, 12.1% on eval04
Submitted to ICASSP 2007
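ROVER combines the two systems by aligning their hypotheses into a word transition network and voting per slot. A much-simplified sketch, assuming the hypotheses are already aligned slot by slot (real ROVER also performs the alignment and can weight votes by confidence scores):

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Majority vote over pre-aligned hypotheses.
    aligned_hyps: list of equal-length word lists, one per system;
    '' marks a NULL (deletion) slot."""
    out = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                      # drop slots where NULL wins
            out.append(word)
    return out
```

With more than two systems (or ties broken by confidence), this voting is where ROVER's gain over either single system comes from.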

Challenges
Channel compensation
Conversational speech
Overlapped speech
Speech with music background
Commercials
Language ID (in addition to English)
Is CER the best measurement for MT?