1 Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007
2 Outline
Goal: a highly accurate Mandarin ASR system.
- Baseline: System-2006
- Improvements:
  - Acoustic segmentation
  - Two complementary, comparable systems
  - Language models and adaptation
  - More data
- Error analysis
- Future work
3 Background: System-2006
- 849M words of LM training text; 60K-word lexicon; static 5-gram rescoring.
- 465 hours of acoustic training data.
- Two AMs (same Phone-72 pronunciation lexicon):
  - MFCC+pitch (42-dim), SAT+fMPE, CW MPE, 3000x128 Gaussians.
  - MFCC+MLP+pitch (74-dim), SAT+fMPE, non-CW MPE, 3000x64 Gaussians.
- CER 18.4% on Eval06.
4 2007: Increased Training Data
- 870 hours of acoustic training data; 3500x128 Gaussians.
- 1.2G words of LM training text; trigrams and 4-grams.

LM     #bigrams   #trigrams   #4-grams   Dev07-IV perplexity
LM3    58M        108M        ---        325.7
qLM3   6M         3M          ---        379.8
LM4    58M        316M        201M       297.8
qLM4   19M        24M         6M         383.2
5 Acoustic Segmentation
- The former segmenter caused high deletion errors: it misclassified some speech segments as noise.
- Minimum speech-segment duration: 18 units x 30 ms = 540 ms, about 0.5 s.
- Decoding graph: Start/null -> {speech, silence, noise} -> End/null.

Vocabulary   Pronunciation
speech       18+ fg
noise        rej
silence      bg
6 New Acoustic Segmenter
- Allows shorter speech durations.
- Models Mandarin vs. foreign (English) speech separately.
- Decoding graph: Start/null -> {Mandarin1, Mandarin2, Foreign, silence, noise} -> End/null.

Vocabulary   Pronunciation
Mandarin1    I1 F
Mandarin2    I2 F
Foreign      forgn
noise        rej
silence      bg
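A minimal sketch of the two segmenter vocabularies as data, assuming (per the "18 x 30 ms = 540 ms" arithmetic above) that each pronunciation unit must dwell at least 30 ms; the dictionaries and the `min_duration_ms` helper are illustrative, not SRI's actual configuration.

```python
# Hypothetical rendering of the old and new segmenter vocabularies.
MS_PER_UNIT = 30  # assumed minimum dwell time per pronunciation unit

OLD_SEGMENTER = {
    "speech":  ["fg"] * 18,   # "18+ fg": at least 18 generic speech units
    "noise":   ["rej"],
    "silence": ["bg"],
}

NEW_SEGMENTER = {
    "Mandarin1": ["I1", "F"],  # Mandarin speech: only two units minimum
    "Mandarin2": ["I2", "F"],
    "Foreign":   ["forgn"],    # foreign (English) speech modeled separately
    "noise":     ["rej"],
    "silence":   ["bg"],
}

def min_duration_ms(pron):
    """Shortest segment a word can produce, given the per-unit dwell time."""
    return len(pron) * MS_PER_UNIT

print(min_duration_ms(OLD_SEGMENTER["speech"]))     # 540 ms, as on the slide
print(min_duration_ms(NEW_SEGMENTER["Mandarin1"]))  # 60 ms: shorter speech allowed
```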
7 Improved Acoustic Segmentation
Pruned trigram, SI non-CW MLP MPE models, on Eval06:

Segmenter   Sub   Del   Ins   Total
Old         9.7   7.0   1.9   18.6
New         9.9   6.4   2.0   18.3
Oracle      9.5   6.8   1.8   18.1
8 Decoding Architecture
- First pass: non-CW MLP system with pruned trigram (qLM3).
- Two cross-adapted passes: CW PLP SAT+fMPE and CW MLP SAT systems, MLLR-adapted on the first-pass output, decoded with LM3.
- Adapt/rescore with qLM4.
- Confusion network combination (CNC), including Aachen (RWTH) output.
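The flow above, as a pseudo-pipeline; every function here (`decode`, `rescore`, `cnc`) is a hypothetical placeholder for tooling the slides do not name.

```python
# Hypothetical sketch of the decoding flow; placeholder functions only.

def decode(system, lm, audio, adapt_hyps=None):
    """One decoding pass, optionally MLLR-adapted on another system's output."""
    ...

def rescore(lattice, lm):
    """Adapt/rescore a lattice with a larger or adapted LM."""
    ...

def cnc(*hyps):
    """Confusion network combination of several systems' outputs."""
    ...

def recognize(audio, aachen_hyps):
    first = decode("MLP nonCW", "qLM3", audio)             # fast first pass
    plp = decode("PLP CW SAT+fMPE", "LM3", audio, first)   # cross-adapted passes
    mlp = decode("MLP CW SAT", "LM3", audio, first)
    plp, mlp = rescore(plp, "qLM4"), rescore(mlp, "qLM4")  # 4-gram adapt/rescore
    return cnc(plp, mlp, aachen_hyps)                      # combine, incl. Aachen
```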
9 Two Sets of Acoustic Models
- For cross-adaptation and system combination.
- Different error behaviors, similar error-rate performance.

           System-MLP            System-PLP
Features   74 (MFCC+pitch+MLP)   42 (PLP+pitch)
fMPE       no                    yes
Phones     72                    81
10 MLP Phoneme Posterior Features
- Compute Tandem features with PLP+pitch input.
- Compute HATs features from 19 critical bands.
- Combine the Tandem and HATs posterior vectors into one 71-dim vector.
- PCA(log(71-dim posteriors)) -> 32 dims; MFCC + pitch + MLP = 42 + 32 = 74 dims.
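A minimal numpy sketch of the per-frame feature construction just described; the slide does not give the Tandem/HATs combination rule, so the simple average below is an assumption, and the PCA basis is a stand-in for one estimated on training data.

```python
import numpy as np

def mlp_feature(tandem_post, hats_post, pca_basis, mfcc_pitch):
    """Build the 74-dim HMM feature from phone posteriors plus MFCC+pitch.

    tandem_post, hats_post: (71,) posterior vectors from the two MLPs.
    pca_basis: (71, 32) PCA projection (estimated offline).
    mfcc_pitch: (42,) MFCC+pitch feature for the same frame.
    """
    post = 0.5 * (tandem_post + hats_post)      # assumed combination: average
    logp = np.log(post + 1e-10)                 # log posteriors (floored)
    mlp32 = logp @ pca_basis                    # PCA(log(71)) -> 32 dims
    return np.concatenate([mfcc_pitch, mlp32])  # 42 + 32 = 74 dims

# Shape check with stand-in data:
feat = mlp_feature(np.random.dirichlet(np.ones(71)),
                   np.random.dirichlet(np.ones(71)),
                   np.random.randn(71, 32),
                   np.random.randn(42))
assert feat.shape == (74,)
```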
11 Tandem Features [T1, T2, ..., T71]
- Input: 9 frames of PLP+pitch, i.e., PLP (39x9) + pitch (3x9).
- MLP topology: (42x9) x 15000 x 71.
12 HATS Features [H1, H2, ..., H71]
- One MLP per critical band (E1 ... E19): 51 x 60 x 71 each.
- Merger MLP over the 19 bands' hidden activations: (60x19) x 8000 x 71.
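The two topologies in PyTorch, using the layer sizes from these two slides; sigmoid hidden units and softmax outputs are assumptions based on typical ICSI Tandem/HATs setups, not details from the talk.

```python
import torch.nn as nn

# Tandem MLP: 9 frames of PLP+pitch (42*9=378) -> 15000 hidden -> 71 posteriors.
tandem = nn.Sequential(
    nn.Linear(42 * 9, 15000), nn.Sigmoid(),
    nn.Linear(15000, 71), nn.Softmax(dim=-1),
)

# HATs: a small MLP per critical band, then a merger over the 19 bands'
# 60-dim hidden activations: (60*19) -> 8000 -> 71.
band_nets = nn.ModuleList([
    nn.Sequential(nn.Linear(51, 60), nn.Sigmoid()) for _ in range(19)
])
merger = nn.Sequential(
    nn.Linear(60 * 19, 8000), nn.Sigmoid(),
    nn.Linear(8000, 71), nn.Softmax(dim=-1),
)
```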
13 MLP and Pitch Features

HMM Feature                MLP Input       CER
MFCC (39-dim)              none            24.1
MFCC+F0 (42-dim)           none            21.4
MFCC+F0+Tandem (74-dim)    PLP (39x9)      20.3
MFCC+F0+Tandem (74-dim)    PLP+F0 (42x9)   19.7

Non-CW ML models, Hub4 training, MLLR, LM2, on Eval04.
14 Phone-81: Diphthongs for BC
- Add diphthongs (4 diphthongs x 4 tones = 16) for fast speech and for modeling longer triphone context.
- Maintains unique syllabification; syllable-ending W and Y are no longer needed.

Example     Phone-72   Phone-81
要 /yao4/   a4 W       aw4
北 /bei3/   E3 Y       ey3
有 /you3/   o3 W       ow3
爱 /ai4/    a4 Y       ay4
15 Phone-81: Frequent Neutral Tones for BC
- Neutral tones are more common in conversation.
- Neutral tones were previously not modeled; the 3rd tone was used as a replacement.
- Add 3 neutral-tone phones for frequent characters.

Example    Phone-72   Phone-81
了 /le5/   e3         e5
吗 /ma5/   a3         a5
子 /zi5/   i3         i5
16 Phone-81: Special CI Phones for BC
- Filled pauses (hmm, ah) are common in BC; add two CI phones for them.
- Add CI /V/ for English.

Example    Phone-72   Phone-81
victory    w          V
呃 /ah/    o3         fp_o
嗯 /hmm/   e3 N       fp_en
17 Phone-81: Simplification of Other Phones
- Now 72 + 14 + 3 + 3 = 92 phones: too many triphones to model.
- Merge similar phones to reduce the number of triphones; e.g., I2 (previously modeled by I1) is now merged into i2.
- 92 - (4x3 - 1) = 81 phones.

Example     Phone-72   Phone-81
安 /an1/    A1 N       a1 N
词 /ci2/    I1         i2
池 /chi2/   IH2        i2
18 Different Phone Sets
Pruned trigram, SI non-CW PLP ML models, on Dev07:

           BN    BC     Avg
Phone-81   7.6   27.3   18.9
Phone-72   7.4   27.6   19.0

The two phone sets indeed show different error behaviors: good for system combination.
19 PLP Models with fMPE
- Add an fMPE transform to the PLP model so it can compete with the MLP model.
- Smaller ML-trained Gaussian posterior model: 3500x32, CW+SAT.
- 5 neighboring frames of Gaussian posteriors: M is 42 x (3500x32x5), h is (3500x32x5) x 1.
- Ref: Zheng, ICASSP 2007.
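A minimal numpy sketch of the fMPE feature transform with the dimensions above (x' = x + M h, h being the stacked Gaussian posteriors of 5 neighboring frames); the sparse-posterior remark in the comment is standard fMPE practice, not something the slide states.

```python
import numpy as np

N_GAUSS = 3500 * 32  # Gaussians in the ML-trained posterior model
N_FRAMES = 5         # neighboring frames of posteriors
DIM = 42             # base PLP+pitch feature dimension

def fmpe_transform(x, h_frames, M):
    """fMPE: add a discriminatively trained offset to the base feature.

    x:        (42,) base feature for the current frame.
    h_frames: (5, N_GAUSS) Gaussian posteriors for the 5 neighboring frames
              (in practice sparse: only top-scoring Gaussians are nonzero).
    M:        (42, 5 * N_GAUSS) MPE-trained transform.
    """
    h = h_frames.reshape(-1)  # stack into a (5 * N_GAUSS,) vector
    return x + M @ h          # x' = x + M h
```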
20 Topic-based LM Adaptation
- Latent Dirichlet Allocation topic model over {w | w in the same story}.
- A 4 s window around each sentence is used to make adaptation more robust against ASR errors.
- The words {w} are weighted by their distance from the current sentence.
21 Topic-based LM Adaptation
- Training: one topic per sentence; train 64 topic-dependent LMs.
- Testing: top-n topics per sentence, with topic weights inferred from the neighboring 4 s of speech.
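A rough sketch of the test-time mixture implied above; `infer_topic_weights`, `topic_lms`, and the choice of top_n are hypothetical stand-ins, since the slides do not give the inference or interpolation formulas.

```python
import math

def adapted_logprob(ngram, context_words, topic_lms, infer_topic_weights, top_n=4):
    """Score an n-gram with a mixture of the top-n topic-dependent LMs.

    context_words: distance-weighted words from the ~4 s window around the
                   sentence (more robust to ASR errors than the sentence alone).
    topic_lms: the 64 topic-dependent LMs; topic_lms[t].logprob(ngram).
    infer_topic_weights: returns the LDA topic posterior given context_words.
    """
    weights = infer_topic_weights(context_words)              # length-64 list
    top = sorted(range(len(weights)), key=weights.__getitem__)[-top_n:]
    norm = sum(weights[t] for t in top)                       # renormalize top-n
    p = sum(weights[t] / norm * math.exp(topic_lms[t].logprob(ngram))
            for t in top)
    return math.log(p)
```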
22 Topic-based LM Adaptation
Open questions:
- Is each topic LM still a 60K-word LM?
- Per-sentence adaptation?
- Computational cost?
23 LM Adaptation and CNC on Dev07
UW's two systems only:

Dev07          CW PLP   CW MLP   CNC
LM3            12.0     11.9     ---
LM4            11.9     11.7     11.4
Adapted qLM4   11.7     11.4     11.2
24 LM Adaptation and CNC on Eval07
Each AM is cross-adapted on the hypotheses in parentheses:

AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   ROVER
LM3                10.2        9.6         9.9            10.1           ---
qLM4               10.2        9.7         10.0           10.1           ---
LM4                10.0        9.6         9.8            10.0           9.1
Adapted qLM4       9.7         9.3         9.6            9.7            8.9
25 Eval07

Team      CER
UW        9.1%
RWTH      12.1%
UW+RWTH   8.9%
CU+BBN    9.4%
IBM+CMU   9.8%
26 2006 vs. 2007 on Eval07

              Sub   Del   Ins   Total
2006 system   7.2   6.5   0.4   14.1
2007 system   5.5   3.0   0.4   8.9

(14.1 - 8.9) / 14.1 = 37% relative improvement!
27 Progress

Test set   2006    2007-06   2007-12
Eval06     18.4%   15.3%     14.7%
Dev07      ---     11.2%     9.6% *
Eval07     14.1%   8.9%      ---
28 RWTH Demo
- UW acoustic segmenter; RWTH single-system ASR.
- Foreign (Korean) speech is skipped; misrecognitions are highlighted.
- Manual sentence segmentation; machine translation.
- Not real time.
29 MT Error Analysis on Extreme Cases

Snippet       Dur   CER     HTER
a) Worst BN   87s   10.9%   47.73%
b) Worst BC   72s   24.9%   48.37%
c) Best BN    62s   0%      12.67%
d) Best BC    77s   15.2%   14.20%

CER is not directly related to HTER; genre matters. Still, better CER does ease MT.
30 MT Error Analysis
- (a) Worst BN: OOV names.
- (b) Worst BC: overlapped speech.
- (c) Best BN: composite sentences.
- (d) Best BC: simple sentences with disfluencies and restarts.
(Examples: *.html, *.wav.)
31 Error Analysis
OOV words (especially names) are problematic for ASR, MT, and distillation. Homophone confusions for two names:
- 徐 昌 霖 / 徐 成 民 / 徐 长 明: Xu, Chang-Lin
- 黄 竹 琴 / 黄 朱 琴 / 黄 朱 勤 / 皇 猪 禽 / 黄 朱 其: Huang, Zhu-Qin
32 Error Analysis
- MT BN high errors: composite syntactic structure; syntactic parsing would be useful.
- MT BC high errors: overlapped speech, and high ASR errors due to disfluency.
- Conjecture: MT on perfect BC ASR output is easy, given its simple/short sentence structure.
33 Next ASR: Chinese Organization Names
Semi-automatic abbreviation generation for long words (see the sketch below):
- Segment a long word into a sequence of shorter words.
- Extract the 1st character of each shorter word: 世界卫生组织 -> 世卫.
- (Make sure the abbreviations are in the MT translation table, too.)
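A toy sketch of the first-character rule; the `segment` word segmenter is a hypothetical stand-in, and real abbreviations often keep or drop whole segments, so this is only the starting heuristic the slide describes.

```python
def abbreviate(long_word, segment):
    """First-character abbreviation: segment, then keep each segment's 1st char.

    segment: hypothetical word segmenter, e.g.
             segment("世界卫生组织") -> ["世界", "卫生", "组织"].
    """
    return "".join(part[0] for part in segment(long_word))

# With that segmentation this yields "世卫组"; the common abbreviation 世卫
# additionally drops the generic head word 组织 ("organization").
```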
34 Next ASR: Chinese Person Names
- Mandarin has a high rate of homophones: 408 syllables vs. ~6000 common characters, i.e., ~14 homophonic characters per syllable!
- Given a spoken Chinese OOV name, there is no way to be sure which characters to use.
- But MT does not care, as long as the syllables are correct.
- Recognize repetitions of the same name in the same snippet: CNC at the syllable level, e.g., Xu {Chang, Cheng} {Lin, Min, Ming}; Huang Zhu {Qin, Qi}.
- After syllable-level CNC, apply the same name to all occurrences, in Pinyin (see the sketch below).
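A rough sketch of the syllable-level voting, assuming the recognized variants of a name are already grouped and position-aligned; `to_pinyin` is a hypothetical character-to-syllable converter, not a tool named in the talk.

```python
from collections import Counter

def consensus_name(occurrences, to_pinyin):
    """Pick one syllable sequence for all occurrences of a (likely OOV) name.

    occurrences: recognized character sequences for the same name,
                 e.g. ["徐昌霖", "徐成民", "徐长明"].
    to_pinyin:   hypothetical char -> toneless syllable converter.
    Returns the majority syllable per position; this Pinyin form is then
    applied to every occurrence of the name in the snippet.
    """
    syllables = [[to_pinyin(ch) for ch in occ] for occ in occurrences]
    length = min(len(s) for s in syllables)  # assume equal-length variants
    return [Counter(s[i] for s in syllables).most_common(1)[0][0]
            for i in range(length)]
```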
35 Next ASR: Foreign Names
- English spelling in the lexicon, with (multiple) Mandarin pronunciations:
  - Bush: /bu4 shi2/ or /bu4 xi1/
  - Bin Laden: /ben1 la1 deng1/ or /ben3 la1 deng1/
  - John: /yue1 han4/
  - Sadr: /sa4 de2 er3/
- Name mapping from MT?
- Need to run name tagging on the training text (Yang Liu), convert Chinese names to English spelling, and retrain the n-gram LM.
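The same lexicon entries as data, purely to illustrate the format; the pronunciations are from the slide, but the dict representation is not the actual lexicon file format.

```python
# English spelling -> list of Mandarin pronunciations (Pinyin with tone digits).
FOREIGN_NAME_LEXICON = {
    "Bush":      ["bu4 shi2", "bu4 xi1"],
    "Bin Laden": ["ben1 la1 deng1", "ben3 la1 deng1"],
    "John":      ["yue1 han4"],
    "Sadr":      ["sa4 de2 er3"],
}
```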
36 Next ASR: LM
- LM adaptation with finer topics, each topic with a small vocabulary.
- Spontaneous speech: should the n-gram back off to content words in search or on N-best lists? Text-paring modeling?
  - 我想那 (也) (也) 也是 -> 我想那也是
  - "I think it, (too), (too), is, too." -> "I think it is, too."
- If optimizing CER, the STM references need to be designed so that disfluencies are optionally deletable, e.g., 小孩 (儿).
37 Next ASR: AM
- Add explicit tone modeling (Lei07): prosodic information such as duration and pitch contour at the word level; various backoff schemes for infrequent words.
- Better understand why outside regions do not help with AM adaptation.
- Add SD MLLR regression trees (Mandal06).
- Improve automatic speaker clustering: smaller clusters give better performance; gender ID first.
38 ASR & MT Integration
- Do we need to merge the ASR and MT lexicons?
- Do we need to use the same word segmenter?
- Is word-level or character-level CNC output better for MT?
- Open questions; feedback welcome!