1 Building A Highly Accurate Mandarin Speech Recognizer Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU) Mari Ostendorf 12/12/2007
2 Outline Goal: A highly accurate Mandarin ASR Background Acoustic segmentation Acoustic models and adaptation Language models and adaptation Cross adaptation System combination Error analysis Future
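3 Background 870 hours of acoustic training data. N-gram based (N=1) ML Chinese word segmentation. 60K-word lexicon. 1.2G words of training text. Trigrams and 4-grams.
N-gram counts (n2 = bigrams, n3 = trigrams, n4 = 4-grams) and Dev07-IV perplexity; no perplexity appears for the trigram LMs:
LM      n2     n3     n4     Dev07-IV perplexity
LM3     58M    108M   -      -
qLM3    6M     3M     -      -
LM4     58M    316M   201M   297.8
qLM4    19M    24M    6M     383.2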
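The unigram (N=1) maximum-likelihood word segmentation mentioned above can be sketched as a dynamic-programming search over all ways of cutting a character string into lexicon words. This is a minimal illustration, not the system's actual segmenter: the function name `segment`, the toy lexicon `toy_lm`, the maximum word length, and the fixed out-of-vocabulary penalty are all assumptions.

```python
import math

def segment(text, unigram_logprob, max_word_len=4, oov_logprob=-20.0):
    """Return the segmentation of `text` with the highest unigram log-probability."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in unigram_logprob:
                score = unigram_logprob[word]
            elif len(word) == 1:
                score = oov_logprob  # assumed penalty for an unknown single character
            else:
                continue             # unknown multi-character strings are not words
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the string.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Hypothetical toy lexicon with made-up probabilities.
toy_lm = {w: math.log(p) for w, p in {
    "北京": 0.02, "大学": 0.02, "北京大学": 0.01, "大学生": 0.01,
    "生": 0.005, "北": 0.001, "京": 0.001, "大": 0.002, "学": 0.002,
}.items()}

print(segment("北京大学生", toy_lm))  # ['北京', '大学生']
```

With these toy probabilities, 北京 + 大学生 (2e-4) beats 北京大学 + 生 (5e-5), which shows how the ML criterion resolves an ambiguous cut.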
4 Acoustic segmentation Former segmenter caused high deletion errors: it misclassified some speech segments as noise. Minimum speech segment duration: 18*30 = 540 ms ≈ 0.5 s.
[Figure: segmenter HMM with Start/null and End/null states and speech, silence, and noise branches]
Vocabulary   Pronunciation
speech       18 + fg
Noise        rej
silence      bg
5 New Acoustic Segmenter Allow shorter speech duration. Model Mandarin vs. foreign (English) speech separately.
[Figure: segmenter HMM with Start/null and End/null states and Foreign, silence, Mandarin1, 2, and noise branches]
Vocabulary   Pronunciation
Mandarin1    I1 F
Mandarin2    I2 F
Foreign      forgn
Noise        rej
Silence      bg
6 Two Sets of Acoustic Models For cross adaptation and system combo. Different error behaviors. Similar error rate performance.
            System-MLP        System-PLP
Features    74 (MFCC+3+32)    42 (PLP+3)
fMPE        no                yes
Phones      72                81
7 MLP Phoneme Posterior Features Compute Tandem features with pitch+PLP input. Compute HATS features with 19 critical bands. Combine the two posterior vectors (Tandem and HATS) into one. Log, then PCA (71 → 32). MFCC + pitch + MLP = 74-dim. 3500x128 Gaussians, MPE trained. Both cross-word (CW) and nonCW triphones trained.
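The post-processing on this slide (log of the 71-dim posteriors, PCA down to 32 dims, then concatenation with MFCC and pitch) can be sketched with NumPy as follows. This is an illustration under assumptions: the inputs are random stand-ins, the Tandem and HATS posteriors are taken as already combined into one 71-dim vector (the slide does not say how), the 39/3 split of MFCC and pitch is inferred from 74 = MFCC + 3 + 32, and the function names are hypothetical.

```python
import numpy as np

def fit_pca(log_post, out_dim=32):
    """Estimate a PCA projection (mean and top `out_dim` directions) from log posteriors."""
    mean = log_post.mean(axis=0)
    _, _, vt = np.linalg.svd(log_post - mean, full_matrices=False)
    return mean, vt[:out_dim].T            # projection matrix of shape (71, out_dim)

def mlp_features(posteriors, mean, proj, eps=1e-10):
    """Log-compress the posteriors, then project them with the PCA transform."""
    return (np.log(posteriors + eps) - mean) @ proj

def build_observation(mfcc, pitch, posteriors, mean, proj):
    """Concatenate MFCC, pitch, and reduced MLP features frame by frame."""
    return np.hstack([mfcc, pitch, mlp_features(posteriors, mean, proj)])

# Random stand-in data: 200 frames.
rng = np.random.default_rng(0)
num_frames = 200
posteriors = rng.dirichlet(np.ones(71), size=num_frames)   # each row sums to 1
mfcc = rng.standard_normal((num_frames, 39))
pitch = rng.standard_normal((num_frames, 3))

mean, proj = fit_pca(np.log(posteriors + 1e-10))
obs = build_observation(mfcc, pitch, posteriors, mean, proj)
print(obs.shape)  # (200, 74)
```

In a real system the PCA transform would be estimated once on training data and reused at test time, which is why fitting and applying are kept in separate functions here.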
8 Tandem Features Input: 9 frames of PLP+pitch, i.e. PLP (39x9) and pitch (3x9). MLP topology: (42x9) x 15000 x 71. Output: [T1, T2, …, T71].
9 HATS Features Input: 19 critical-band energies E1, E2, …, E19, each with its own 51x60x71 MLP. Merger MLP: (60*19) x 8000 x 71. Output: [H1, H2, …, H71].
10 Phone-81: Diphthongs for BC Add diphthongs (4x4=16) for fast speech and for modeling longer triphone context. Maintain unique syllabification. Syllable-ending W and Y are no longer needed.
Example      Phone-72   Phone-81
要 /yao4/    a4 W       aw4
北 /bei3/    E3 Y       ey3
有 /you3/    o3 W       ow3
爱 /ai4/     a4 Y       ay4
11 Phone-81: Frequent Neutral Tones for BC Neutral tones are more common in conversation. Neutral tones were not modeled before; the 3rd tone was used as a replacement. Add 3 neutral tones for frequent chars.
Example      Phone-72   Phone-81
了 /le5/     e3         e5
吗 /ma5/     a3         a5
子 /zi5/     i3         i5
12 Phone-81: Special CI Phones for BC Filled pauses (hmm, ah) are common in BC; add two CI phones for them. Add CI /V/ for English.
Example      Phone-72   Phone-81
victory      w          V
呃 /ah/      o3         fp_o
嗯 /hmm/     e3 N       fp_en
13 Phone-81: Simplification of Other Phones Now 92 phones: too many triphones to model. Merge similar phones to reduce the number of triphones. I2 was modeled by I1; it is now i2. 92 – (4x3–1) = 81 phones.
Example      Phone-72   Phone-81
安 /an1/     A1 N       a1 N
词 /ci2/     I1         i2
池 /chi2/    IH2        i2
14 PLP Models with fMPE Transform PLP model with fMPE transform to compete with MLP model. Smaller ML-trained Gaussian posterior model: 3500x32 CW+SAT 5 Neighboring frames of Gaussian posteriors. M is 42 x (3500*32*5), h is (3500*32*5)x1. Ref: Zheng ICASSP 07 paper
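The dimensions on this slide match the standard fMPE feature transform y_t = x_t + M h_t, where x_t is the 42-dim PLP frame and h_t stacks the Gaussian posteriors of 5 neighboring frames. The sketch below illustrates only the shapes and the arithmetic, with a toy number of Gaussians instead of 3500*32; the function names, the edge padding at utterance boundaries, and the random data are assumptions, and real posterior vectors would be sparse.

```python
import numpy as np

def stack_context(post, context=2):
    """Stack posteriors of frames t-context..t+context (5 frames when context=2)."""
    num_frames = post.shape[0]
    # Assumed boundary handling: repeat the first and last frame.
    padded = np.pad(post, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + num_frames] for i in range(2 * context + 1)])

def fmpe_transform(x, post, M, context=2):
    """Apply y_t = x_t + M h_t to every frame; rows of x and post are frames."""
    h = stack_context(post, context)       # shape (T, G * 5)
    return x + h @ M.T                     # shape (T, 42)

# Toy sizes: 10 frames, 8 Gaussians (the slide uses 3500*32), 42-dim features.
rng = np.random.default_rng(0)
num_frames, num_gauss, dim = 10, 8, 42
x = rng.standard_normal((num_frames, dim))
post = rng.dirichlet(np.ones(num_gauss), size=num_frames)
M = 0.01 * rng.standard_normal((dim, num_gauss * 5))

y = fmpe_transform(x, post, M)
print(y.shape)  # (10, 42)
```

With M set to zero the transform leaves the features unchanged, which is the usual starting point before M is trained discriminatively.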
15 Topic-based LM Adaptation [Figure: the words {w | w in the same story (4 secs)} around one sentence are fed to the Latent Dirichlet Allocation topic model.] A 4s window is used to make adaptation more robust against ASR errors. {w} are weighted based on distance.
16 Topic-based LM Adaptation Training: one topic per sentence. Train 64 topic-dep. 4-gram LMs: LM1, LM2, …, LM64. Decoding: top n topics per sentence, where θi' > threshold. [Figure: each sentence is passed through the Latent Dirichlet Allocation topic model.]
17 Improved Acoustic Segmentation Pruned trigram, SI nonCW-MLP MPE, on eval06. [Table: columns Segmenter, Sub, Del, Ins, Total; rows OLD, NEW, Oracle; cell values not recovered.]
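The Sub/Del/Ins/Total breakdown used in these results comes from a minimum-edit-distance alignment between reference and hypothesis characters. A minimal sketch of that scoring is shown below; the function names are hypothetical, and ties between equally cheap alignments are broken arbitrarily, so the split between error types (not the total) can differ from what a scoring tool reports.

```python
def cer_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) from a minimum-edit alignment."""
    num_ref, num_hyp = len(ref), len(hyp)
    # cost[i][j] = (total, sub, del, ins) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (num_hyp + 1) for _ in range(num_ref + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, num_ref + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference character
    for j in range(1, num_hyp + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis character
    for i in range(1, num_ref + 1):
        for j in range(1, num_hyp + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)        # match, no error
            else:
                diag = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])
    _, s, d, n = cost[num_ref][num_hyp]
    return s, d, n

def cer(ref, hyp):
    """Character error rate: (Sub + Del + Ins) / number of reference characters."""
    s, d, n = cer_counts(ref, hyp)
    return (s + d + n) / len(ref)

print(cer_counts("我想那也是", "我想也是"))  # (0, 1, 0)
print(cer("我想那也是", "我想也是"))         # 0.2
```

A segmenter that drops speech produces hypotheses with missing characters, which show up in the deletion count of this alignment.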
18 Different Phone Sets Pruned trigram, SI nonCW-PLP ML, on dev07. [Table: columns BN, BC, Avg; one row per phone set; cell values not recovered.] Indeed different error behaviors: good for system combo.
19 Decoding Architecture [Figure components: MLP nonCW with qLM3; PLP CW+SAT+fMPE with MLLR, LM3; MLP CW+SAT with MLLR, LM3; qLM4 Adapt/Rescore; Confusion Network Combination; Aachen.]
20 Topic-based LM Adaptation (NTU) Training, per sentence: 64 topics, θ = (θ1, θ2, …, θm); Topic(sentence) = k = argmax {θ1, θ2, …, θm}; train 64 topic-dep. (TD) 4-grams. Testing, per utterance: {w}: N-best confidence-based weighting + distance weighting; pick all TD 4-grams whose θi is above a threshold; interpolate with the topic-indep. 4-gram; rescore the N-best list.
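The test-time steps above (select the topic-dependent LMs whose topic weight exceeds a threshold, then interpolate them with the topic-independent LM) can be sketched as follows. The slide does not give the interpolation weights, so the scheme here (renormalized topic weights for the selected LMs and a fixed `general_weight` for the topic-independent LM) is an assumption, as are the function names and the constant toy LMs.

```python
def select_topics(theta, threshold):
    """Indices of topics whose weight exceeds the threshold."""
    return [k for k, weight in enumerate(theta) if weight > threshold]

def adapted_prob(word, history, theta, topic_lms, general_lm,
                 threshold=0.1, general_weight=0.5):
    """Interpolate the selected topic-dependent LMs with the topic-independent LM."""
    chosen = select_topics(theta, threshold)
    if not chosen:
        return general_lm(word, history)   # no confident topic: no adaptation
    mass = sum(theta[k] for k in chosen)
    topic_part = sum(theta[k] / mass * topic_lms[k](word, history) for k in chosen)
    return general_weight * general_lm(word, history) + (1 - general_weight) * topic_part

# Toy stand-ins: each "LM" returns a constant probability for any word and history.
topic_lms = [lambda w, h: 0.2, lambda w, h: 0.1, lambda w, h: 0.9]
general_lm = lambda w, h: 0.05
theta = [0.6, 0.3, 0.1]

p = adapted_prob("升值", ("人民币",), theta, topic_lms, general_lm, threshold=0.2)
print(round(p, 4))  # 0.1083
```

In N-best rescoring, a probability like this would replace the unadapted 4-gram probability for each word of each hypothesis.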
21 CERs with diff LMs (internal use) [Table: columns under AM (adapt. hyps): PLP (MLP), MLP (PLP), MLP (Aachen), PLP (Aachen), Rover; remaining labels as extracted: LM, qLM, LM, Adapted qLM; cell values not recovered.]
22 Topic-based LM Adaptation (NTU) [Table: columns under AM (adapt. hyps): PLP (MLP), MLP (PLP), MLP (Aachen), PLP (Aachen), CNC, Rover; remaining labels as extracted: LM, Adapted qLM; cell values not recovered.] “q” stands for “quick”, i.e. tightly pruned. Oracle CNC: 4.7%. Could it be a broken word sequence? Need to verify that with word perplexity and HTER.
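23 ASR System vs [Table: CER on Eval07; columns SUB, DEL, INS, TOTAL; rows: 2006 system and a second system whose label did not survive extraction; cell values not recovered.] 37% relative improvement!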
24 Eval07 BN ASR Error Distribution 66 BN snippets (avg CER 3.4%). [Figure: histogram, x-axis CER (%), y-axis % snippets; SRI]
25 Eval07 BC ASR Error Distribution 53 BC snippets (avg CER 15.9%). [Figure: histogram, x-axis CER (%), y-axis % snippets; SRI]
26 What Worked for Mandarin ASR? MLP features MPE CW+SAT fMPE Improved acoustic segmentation, particularly for deletion errors. CNC Rover.
27 Small Help for ASR Topic-dep. LM adaptation. Outside regions for additional AM adaptation data. A new phone set with diphthongs to offer different error behaviors. Pitch input in tandem features. Cross adaptation with Aachen Successful collaboration among 5 team members from 3 continents.
28 Error Analysis on Extreme Cases
Snippet        Dur    CER      HTER
a) Worst BN    87s    10.9%    47.73%
b) Worst BC    72s    24.9%    48.37%
c) Best BN     62s    0        12.67%
d) Best BC     77s    15.2%    14.20%
CER not directly related to HTER; genre matters. Better CER does ease MT.
29 Error Analysis (a) worst BN: OOV names (b) worst BC: overlapped speech (c) best BN: composite sentences (d) best BC: simple sentences with disfluency and re-starts.
30 Error Analysis OOV (especially names): problematic for both ASR and MT. Overlapped speech: what to do? Content word mis-reco (not all errors are equal!): 升值 (increase in value) confused with 甚至 (even). Parsing scores?
Name variants: 徐昌霖, 徐成民, 徐长明 (Xu, Chang-Lin); 黄竹琴, 黄朱琴, 黄朱勤, 皇猪禽, 黄朱其 (Huang, Zhu-Qin).
31 Error Analysis MT BN high errors: composite syntax structure; syntactic parsing would be useful. MT BC high errors: overlapped speech; ASR high errors due to disfluency. Conjecture: MT on perfect BC ASR is easy, given its simple/short sentence structure.
32 Next ASR: Chinese OOV Org Names Semi-auto abbreviation generation for long words: segment a long word into a sequence of shorter words, then extract the 1st char of each shorter word, e.g. World Health Organization → WHO. (Make sure the abbreviations are in the MT translation table, too.)
33 Next ASR: Chinese OOV Per. Names Mandarin has a high rate of homophones: 408 syllables vs. 6000 common characters, i.e. more than 14 homophone chars per syllable on average (6000/408 ≈ 14.7). Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT does not care, as long as the syllables are correct. Recognizing repetition of the same name in the same snippet: CNC at syllable level, e.g. Xu {Chang, Cheng} {Lin, Min, Ming}; Huang Zhu {Qin, Qi}. After syllable CNC, apply the same name to all occurrences in Pinyin.
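The idea of making repeated occurrences of a name consistent at the syllable level can be illustrated with a simple per-position majority vote. This is a simplification and an assumption: a real confusion network combination would weight alternatives by posterior probability, and the sketch takes for granted that the occurrences have already been identified as the same name, converted to toneless Pinyin, and have equal length. The function name is hypothetical.

```python
from collections import Counter

def vote_syllables(occurrences):
    """Majority vote per syllable position across recognized variants of one name."""
    length = len(occurrences[0])
    if any(len(occ) != length for occ in occurrences):
        raise ValueError("all occurrences must have the same number of syllables")
    return [Counter(occ[i] for occ in occurrences).most_common(1)[0][0]
            for i in range(length)]

# Hypothetical recognized variants of the same name within one snippet.
hyps = [
    ["xu", "chang", "lin"],
    ["xu", "cheng", "min"],
    ["xu", "chang", "ming"],
    ["xu", "chang", "lin"],
]
name = vote_syllables(hyps)
print(name)  # ['xu', 'chang', 'lin']

# Apply the voted name to every occurrence.
normalized = [list(name) for _ in hyps]
```

Because MT only needs the correct syllables for a transliterated name, agreeing on one Pinyin sequence is enough even when the characters remain uncertain.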
34 Next ASR: English OOV Names English spelling in the lexicon, with (multiple) Mandarin pronunciations: Bush /bu4 shi2/ or /bu4 xi1/; Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/; John /yue1 han4/; Sadr /sa4 de2 er3/. Name mapping from MT? Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, and retrain the n-gram.
35 Next ASR: LM LM adaptation with fine topics, each topic with small vocabulary size. Spontaneous speech: n-gram backtraces to content words in search or N-best? Text paring modeling? Example: 我想那 (也)(也) 也是 → 我想那也是; "I think it, (too), (too), is, too." → "I think it is, too." If optimizing CER, stm needs to be designed such that disfluency is optionally deletable.
36 Next ASR: AM Add explicit tone modeling (Lei07). Prosody info: duration and pitch contour at word level Various backoff schemes for infrequent words More understanding why outside regions not helping with AM adaptation. Add SD MLLR regression tree (Mandal06). Improve auto speaker clustering Smaller clusters, better performance
37 ASR & MT Integration Do we need to merge lexicon? ASR <= MT. Do we need to use the same word segmenter? Is word/char -level CNC output better for MT? Open questions and feedback!!!