Investigation on Mandarin Broadcast News Speech Recognition
  Mei-Yuh Hwang, Xin Lei, Wen Wang*, Takahiro Shinozaki
  University of Washington, *SRI
  9/19/2006, Interspeech, Pittsburgh
Outline
  The task
  Text training data and language modeling
  Acoustic training data and acoustic modeling
  Decoding structure
  Experimental results
  Recent progress and future direction
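The Task
  Mandarin broadcast news (BN) transcription
  Mainland Mandarin speech
  TV/radio programs in China and the USA:
    CCTV 中央电视台 (China Central Television)
    NTDTV 新唐人电视台 (New Tang Dynasty Television)
    PHOENIX TV 凤凰卫视 (Phoenix Satellite Television)
    VOA 美国之音 (Voice of America)
    RFA 自由亚洲电台 (Radio Free Asia)
    CNR 中国广播网 (China National Radio)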
Text Training Data
  LM1: 1997 Mandarin BN Hub4 transcriptions
       Chinese TDT2, 3, 4
       Multiple-translation Chinese (MTC) corpus, parts 1, 2, 3
  LM2: Gigaword XIN (China)
  LM3: Gigaword ZBN (Singapore)
  LM4: Gigaword CNA (Taiwan)
  420M words in total
  The 4 LMs are interpolated
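A minimal sketch of how four component LMs can be combined, assuming plain linear interpolation (the slide does not say which interpolation scheme was used). The function name, the probabilities, and the weights below are all hypothetical; in practice the weights would be tuned on held-out text.

```python
def interpolate(probs, weights):
    """Linearly interpolate per-LM probabilities P_i(w | h).

    probs   -- probability of the same word under each component LM
    weights -- interpolation weights, which must sum to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probs))

# Hypothetical probabilities of one word under LM1..LM4 (not from the slides).
component_probs = [0.020, 0.010, 0.004, 0.002]
# Hypothetical weights (not from the slides).
weights = [0.4, 0.3, 0.2, 0.1]

p_mix = interpolate(component_probs, weights)
print(p_mix)
```

Because the weights sum to 1, the mixture remains a valid probability distribution over the vocabulary.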
Chinese Word Segmentation
  Start from the BBN 64k-word lexicon, derived from LDC
  Longest-first match with the 64k-word lexicon
  Choose the most frequent 49k words as the new lexicon
  Train the n-gram
  Use the unigram part to redo word segmentation based on the ML path
Chinese Word Segmentation
  Longest-first: 民进党 / 和亲 / 民党 …
    "The Green Party made peace with the Min Party via marriage…"
  Maximum-likelihood: 民进党 / 和 / 亲民党 …
    "The Green Party and the Qin-Min Party..."
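The contrast between the two segmenters can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy unigram probabilities, the single-character floor probability, and the function names are hypothetical. Longest-first is a greedy match; maximum-likelihood is a Viterbi search over unigram log-probabilities.

```python
import math

def longest_first(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            # A single character is always accepted so the loop can advance.
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

def max_likelihood(text, unigram, max_len=4, floor=1e-8):
    """Viterbi search for the segmentation with the highest unigram log-probability."""
    best = [(-math.inf, 0)] * (len(text) + 1)  # (score, backpointer) per position
    best[0] = (0.0, 0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            p = unigram.get(word)
            if p is None:
                if end - start > 1:
                    continue        # unknown multi-character string: not a word
                p = floor           # unknown single character: assumed floor
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical unigram probabilities, chosen only to reproduce the slide's example.
unigram = {"民进党": 2e-3, "和": 1e-2, "亲民党": 1e-3, "和亲": 1e-5, "民党": 1e-5}
text = "民进党和亲民党"

greedy = longest_first(text, unigram)
ml = max_likelihood(text, unigram)
print(" / ".join(greedy))
print(" / ".join(ml))
```

With these toy values the greedy matcher commits to 和亲 and is left with 民党, while the Viterbi search prefers the more probable 和 followed by 亲民党.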
Perplexity
  49k-word lexicon
  N-gram order   Word perplexity
  2-gram         495
  4-gram         288
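For reference, word perplexity is the inverse geometric mean of the per-word probabilities the LM assigns to a test text. The sketch below uses the standard definition; the function name and the input are hypothetical, and the example input is constructed so that every word has probability 1/288.

```python
import math

def perplexity(log10_probs):
    """Perplexity = 10 ** (-(1/N) * sum of log10 P(w_i | history))."""
    n = len(log10_probs)
    return 10 ** (-sum(log10_probs) / n)

# Hypothetical test text of 5 words, each predicted with probability 1/288.
ppl = perplexity([math.log10(1 / 288)] * 5)
print(round(ppl, 3))
```

Read this way, a perplexity of 288 means the 4-gram is on average as uncertain as a uniform choice among 288 words.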
Acoustic Training Data
  Corpus          Size
  1997 Hub4 BN    28.5 hrs
  *TDT4-CCTV      25 hrs
  *TDT4-VOA       43.5 hrs
  Total           97 hrs
  *Automatically selected via a flexible alignment with closed captions
Acoustic Feature Representation
  39-dim MFCC: cepstra + Δ + ΔΔ
  3-dim pitch: pitch + Δ + ΔΔ
  Auto speaker clustering
  VTLN per auto speaker
  Speaker-based CMN+CVN for training
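Speaker-based CMN+CVN can be sketched as follows: for each (automatically clustered) speaker, every feature dimension is shifted to zero mean and scaled to unit variance over all of that speaker's frames. This is a minimal sketch; the function name, the `eps` guard, and the toy frames are hypothetical.

```python
import math

def cmn_cvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance
    over all frames belonging to one speaker."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # eps avoids division by zero for a constant dimension.
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)] for f in frames]

# Hypothetical 2-dimensional frames from a single speaker cluster.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = cmn_cvn(frames)
print(normalized)
```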
Acoustic Models
  2500 senones (clustered states) x 32 Gaussians
  ML training vs. MPE training with phone lattices
  Gender independent
  Non-crossword (nonCW) vs. crossword (CW) triphones
  Speaker-adaptive training (SAT):
    $N(x;\, A\mu + b,\, A\Sigma A^{T}) = |A|^{-1}\, N(A^{-1}(x - b);\, \mu,\, \Sigma)$
  The linear transformation $A^{-1}x + (-A^{-1}b)$ is applied in the feature domain.
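The SAT identity says that transforming the model (mean and covariance) is equivalent, up to the Jacobian factor, to applying the inverse transform to the features. A quick numerical check of the scalar case is sketched below, where A reduces to a number a and the covariance to a variance; all numeric values are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

a, b = 1.5, -0.3      # hypothetical speaker transform (scalar A and b)
mu, var = 0.7, 2.0    # hypothetical canonical-model Gaussian
x = 1.1               # hypothetical observed feature value

# Left-hand side: transform the model, N(x; a*mu + b, a*var*a).
model_side = gaussian(x, a * mu + b, a * a * var)
# Right-hand side: transform the feature, |a|^-1 * N((x - b) / a; mu, var).
feature_side = gaussian((x - b) / a, mu, var) / abs(a)

print(model_side, feature_side)
```

The two values agree to floating-point precision, which is why the transform can be applied once per speaker to the features instead of to every Gaussian.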
2-Pass Search Architecture
  Search 1: nonCW, nonSAT, ML model; small bigram LM -> hypothesis
  Adaptation: SAT, MLLR
  Search 2: CW, SAT, MPE model; big 4-gram LM -> final word sequence
Adding Pitch: SA Results (CER)
  Smoothing                  Dev04    Eval04
  No pitch                   14.5%    24.1%
  IBM-style (mean based)     14.0%    22.2%
  SPLINE (cubic smoothing)   12.7%    21.4%
2-pass Search Results (CER)
  Acoustic model        Dev04    Eval04
  nonCW, nonSAT, ML     7.4%     --
  nonCW, nonSAT, MPE    6.9%     --
  nonCW, SAT, ML        6.8%     --
  CW, SAT, ML           6.4%     --
  CW, SAT, MPE          6.0%     16.0%
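The results above are character error rates (CER): the character-level edit distance between the reference transcript and the recognizer output, divided by the number of reference characters. A minimal sketch follows; the function names and the two example strings are hypothetical, not taken from the test sets.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate relative to the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical reference and recognizer output with one substituted character.
reference = "民进党和亲民党"
hypothesis = "民进党和新民党"
rate = cer(reference, hypothesis)
print(f"{rate:.1%}")
```

Scoring on characters rather than words avoids penalizing hypotheses that differ from the reference only in word segmentation.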
More Recent Progress
  More acoustic training data (440 hrs) and text training data (840M words)
  Larger and improved lexicon (60k words)
  fMPE training
  ICSI feature added as a second system
  5-gram LM
  Cross adaptation and ROVER between the MFCC system and the ICSI system
  3.7% on dev04, 12.1% on eval04
  Submitted to ICASSP 2007
Challenges
  Channel compensation
  Conversational speech
  Overlapped speech
  Speech with music background
  Commercials
  Language ID (in addition to English)
  Is CER the best measurement for MT?