Jeff Ma and Spyros Matsoukas EARS STT Meeting March , Philadelphia Post-RT04 work on Mandarin
2 Outline
A new phoneme set
Using long-span features
Pitch features
Automatic segmentation
Data Gaussianization
Silence chopping
Updated system results
3 A new phoneme set
Problem
–Hundreds of single-phoneme words dramatically increased lattice sizes after cross-word quinphone expansion, which caused out-of-memory failures
Solution
–Added 5 dummy phonemes to turn all single-phoneme words into double-phoneme words (“a” → “Da-a”, “i” → “Di-i”, “e” → “De-e”, “o” → “Do-o” and “er” → “Der-er”)
–The number of cross-word quinphones was reduced by 40%
Phoneme-set size | Mix. Exp. | Un-adapted CER | Adapted CER
77 | Yes | 40.9 | n/a
82 | Yes | 40.8 | n/a
77 | No | |
 | No | |
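The dummy-phoneme fix can be sketched in a few lines. This is a minimal illustration, not the actual lexicon tooling: the dictionary layout, the word keys and the helper name `expand_single_phoneme_words` are assumptions; only the "D" + phoneme naming comes from the examples on the slide.

```python
# Hypothetical helper: the "D" + phoneme naming follows the slide's examples
# ("a" -> "Da-a"); the lexicon layout (word -> list of phonemes) is assumed.
DUMMY_PREFIX = "D"


def expand_single_phoneme_words(lexicon):
    """Return a copy of the lexicon in which every one-phoneme pronunciation
    is prefixed with a dummy phoneme, making it two phonemes long."""
    expanded = {}
    for word, phonemes in lexicon.items():
        if len(phonemes) == 1:
            expanded[word] = [DUMMY_PREFIX + phonemes[0], phonemes[0]]
        else:
            expanded[word] = list(phonemes)
    return expanded


# Placeholder word keys; real entries would be Mandarin words.
lexicon = {"w1": ["a"], "w2": ["er"], "w3": ["n", "i"]}
expanded = expand_single_phoneme_words(lexicon)
```

Multi-phoneme words pass through unchanged, so only the single-phoneme entries gain a dummy phoneme.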
4 Long-span features (ML-SI)
Frame concatenation
–Basic frame includes normalized energy, 14 PLP coefficients, pitch (F0) and probability of voicing (PV)
–Use LDA+MLLT to project the concatenated frames down to a lower dimensionality
Input feature | LDA+MLLT output dimension | Un-adapted CER
basic frame + derivatives | |
Concatenate 9 basic frames | |
Concatenate 13 basic frames | 46 | 40.6
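Frame concatenation followed by a linear projection can be sketched as below. The 17-dimensional basic frame follows from the slide (1 energy + 14 PLP + F0 + PV). Repeating edge frames as padding and the placeholder projection matrix are assumptions; a trained LDA+MLLT matrix would take the placeholder's place.

```python
def stack_frames(frames, context=4):
    """Concatenate each frame with `context` neighbours on each side
    (2 * context + 1 frames in total), repeating edge frames as padding."""
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp at utterance edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def project(vectors, matrix):
    """Apply a linear projection; each row of `matrix` is one output dimension."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix]
            for vec in vectors]


BASIC_DIM = 17  # normalized energy + 14 PLP + F0 + PV
OUT_DIM = 46

# Toy utterance: frame t holds the value t in every dimension.
frames = [[float(t)] * BASIC_DIM for t in range(20)]
stacked = stack_frames(frames, context=4)  # 9 frames -> 153 dimensions

# Placeholder projection that simply keeps the first 46 dimensions.
matrix = [[1.0 if j == i else 0.0 for j in range(9 * BASIC_DIM)]
          for i in range(OUT_DIM)]
projected = project(stacked, matrix)
```

With `context=6` the same function gives the 13-frame variant (221 dimensions before projection).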
5 Long-span features (ML-SAT)
Modified HLDA-SAT
–Do the first CMLLR on the original basic frames rather than on the concatenated frames, to avoid the high-dimensionality problem
Feature Set | HLDA-SAT | Adapted CER
basic frame + 1st, 2nd, 3rd deriv. | old | 36.8
concatenate 9 basic frames | modified | 36.0
concatenate 9 basic frames | modified (state re-clust.) | 35.6
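A rough way to see the dimensionality problem is to count the parameters of the speaker transform. The sketch assumes a full-matrix affine CMLLR transform x' = A x + b, which the slide does not specify; the toy transform values are placeholders.

```python
def cmllr_param_count(dim):
    """Free parameters of a full-matrix affine transform x' = A x + b."""
    return dim * dim + dim


def apply_cmllr(frames, A, b):
    """Apply x' = A x + b to every frame (pure-Python matrix-vector product)."""
    return [
        [sum(a * x for a, x in zip(row, frame)) + bias
         for row, bias in zip(A, b)]
        for frame in frames
    ]


BASIC_DIM = 17    # normalized energy + 14 PLP + F0 + PV
NUM_STACKED = 9   # frames concatenated per feature vector

params_basic = cmllr_param_count(BASIC_DIM)                  # 306
params_stacked = cmllr_param_count(BASIC_DIM * NUM_STACKED)  # 23562

# Toy speaker transform (placeholder values): scale by 2, shift by 1.
A = [[2.0 if i == j else 0.0 for j in range(BASIC_DIM)]
     for i in range(BASIC_DIM)]
b = [1.0] * BASIC_DIM
adapted = apply_cmllr([[1.0] * BASIC_DIM, [0.5] * BASIC_DIM], A, b)
```

Under this assumption, transforming the basic frames first needs 306 parameters per speaker instead of 23562; concatenation and projection then run on the adapted frames.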
6 Long-span features (MPE-SAT)
MPE model training
–Trained on top of the modified HLDA-SAT model
–Lattices were generated directly from the backward decoding pass (instead of using N-best lattices)
Feature Set | Model | Adapted CER
basic frame + 1st, 2nd, 3rd deriv. | ML-SAT | 36.8
basic frame + 1st, 2nd, 3rd deriv. | MPE-SAT (N-best lattices) | 34.8
concatenate 9 basic frames | ML-SAT | 35.6
concatenate 9 basic frames | MPE-SAT (deeper lattices) | 33.3
7 Pitch features: algorithms
Two algorithms (old vs. new)
–The old algorithm: poor smoothing on unvoiced speech, not in the log domain, normalized to zero mean and unit variance
–The new algorithm: similar to IBM’s, good smoothing on unvoiced speech, in the log10 domain, normalized to zero mean and unit variance
–The new one was used in RT04, since an initial comparison showed it outperformed the old one in terms of CER
Pitch | Un-adapted CER
old | 43.8
new | 43.6
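The log10 and normalization steps of the new algorithm can be sketched as follows. The slide gives no details of the smoothing over unvoiced speech, so linear interpolation between neighbouring voiced frames is an assumption here, as are the function names and the convention that an F0 of 0 marks an unvoiced frame.

```python
import math


def fill_unvoiced(log_f0):
    """Linearly interpolate over None (unvoiced) entries; leading and
    trailing gaps copy the nearest voiced value. Assumed smoothing scheme."""
    voiced = [i for i, v in enumerate(log_f0) if v is not None]
    if not voiced:
        raise ValueError("no voiced frames")
    filled = list(log_f0)
    for i in range(len(filled)):
        if filled[i] is not None:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            filled[i] = log_f0[right]
        elif right is None:
            filled[i] = log_f0[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * log_f0[left] + w * log_f0[right]
    return filled


def normalize_log_pitch(f0_hz):
    """log10 F0 with unvoiced gaps filled, normalized to zero mean and
    unit variance. An F0 of 0 marks an unvoiced frame (assumed convention)."""
    log_f0 = [math.log10(f) if f > 0 else None for f in f0_hz]
    filled = fill_unvoiced(log_f0)
    mean = sum(filled) / len(filled)
    var = sum((v - mean) ** 2 for v in filled) / len(filled)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero on flat contours
    return [(v - mean) / std for v in filled]


f0 = [100.0, 0.0, 0.0, 1000.0, 0.0]
filled = fill_unvoiced([math.log10(f) if f > 0 else None for f in f0])
pitch = normalize_log_pitch(f0)
```

The statistics are computed over the whole input list; whether the real system normalizes per utterance or per conversation side is not stated on the slide.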
8 Pitch features: a thorough comparison
Pitch | Feature Set | Un-adapted CER | Adapted CER
old | basic frame + 1st, 2nd, 3rd deriv. | |
new | basic frame + 1st, 2nd, 3rd deriv. | |
old | concatenate 9 basic frames | |
new | concatenate 9 basic frames | |
The old pitch is better in terms of CER
9 Automatic segmentation: problem
Our RT04 system lost much more on the Eval04 set than on Dev04 due to the automatic segmentation
–Found that two conversations have severe channel leakage (cross-talk) and another one has strong background noise
–Our auto-segmentation algorithm misclassified the cross-talk as clean speech on both channels, which caused many insertion errors
–The misclassification was caused by errors in the “SS” (speech on both channels) class labeling of the training data
Test set | Manu-seg | Auto-seg | %CER increase
Dev | | |
Eval | | |
10 Automatic segmentation: solution
Retrain the 4-class GMM with corrected labels
–Re-assign the SS (speech on both channels) class data to either the SN class or the NS class (one channel is speech and the other noise) if the channel correlation is higher than a threshold (0.27)
Test set | Manu-seg | New Auto-seg | %CER increase
Dev | | |
Eval | | |
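The label correction can be sketched as below. Only the 0.27 threshold and the SS/SN/NS class names come from the slide. Measuring the correlation as a Pearson coefficient over per-frame channel energies, and assigning the speech to the louder channel, are assumptions made for illustration.

```python
import math

CORR_THRESHOLD = 0.27  # threshold quoted on the slide


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def relabel_ss(energy_a, energy_b, threshold=CORR_THRESHOLD):
    """Relabel a segment tagged SS. Highly correlated channels are treated as
    cross-talk; the louder channel is assumed to carry the real speech."""
    if pearson(energy_a, energy_b) <= threshold:
        return "SS"  # genuinely independent speech on both channels
    return "SN" if sum(energy_a) >= sum(energy_b) else "NS"


loud = [1.0, 5.0, 2.0, 8.0, 3.0, 9.0]
leak = [0.1 * e for e in loud]  # attenuated copy, as in channel leakage
label_ab = relabel_ss(loud, leak)
label_ba = relabel_ss(leak, loud)
label_indep = relabel_ss([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 2.0, 1.0])
```

The corrected labels would then feed the retraining of the 4-class GMM.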
11 Automatic segmentation: more testing
Tested on adapted decoding passes (on Eval04)
–Using cross-model adaptation
–The loss was reduced to 0.4% after adaptation
Segmentation | Un-adapted CER | 1st adapted CER | 2nd adapted CER
manual | | |
RT04 auto-seg | | |
New auto-seg | | |
12 Gaussianization – after HLDA
Inspired by the gain CU reported
Gaussianization applied to the HLDA features
–Tested various numbers of mixture components in the GMMs used to gaussianize the data
–Best case: 4-mixture GMM gives 0.5% gain
# GMM components | Un-adapted CER
0 (no gaussianization) |
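One common form of GMM-based gaussianization maps each feature dimension through the GMM's cumulative distribution and then through the inverse standard-normal CDF. The sketch below assumes that per-dimension form and assumes the GMM parameters have already been estimated; the slides do not give the exact recipe, and the parameter values shown are placeholders.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def gaussianize(x, weights, means, stds, eps=1e-12):
    """Map a scalar feature value through the CDF of a 1-D GMM, then through
    the inverse standard-normal CDF, so the output is roughly N(0, 1)."""
    cdf = sum(w * NormalDist(m, s).cdf(x)
              for w, m, s in zip(weights, means, stds))
    cdf = min(max(cdf, eps), 1.0 - eps)  # inv_cdf requires 0 < p < 1
    return _STD_NORMAL.inv_cdf(cdf)


# Placeholder 2-component GMM for one feature dimension.
weights, means, stds = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
y_mid = gaussianize(0.0, weights, means, stds)
y_low = gaussianize(-1.0, weights, means, stds)
y_high = gaussianize(1.0, weights, means, stds)

# With a single standard-normal component the mapping is the identity.
y_identity = gaussianize(0.7, [1.0], [0.0], [1.0])
```

A 4-mixture GMM, the best case on the slide, would use four entries in each parameter list.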
13 Gaussianization – before HLDA
Gaussianization applied to the static cepstra and energy
–Easy to combine with HLDA-SAT training
–Best case: 4-mixture GMM gives 0.6% gain, which is similar to the gain in the “after-HLDA” case
# GMM components | Un-adapted CER
0 (no gaussianization) |
14 Gaussianization – test on HLDA-SAT
Trained HLDA-SAT with the gaussianization done before HLDA
–0.3% gain
–No gain compared to the HLDA-SAT that re-clusters the quinphone states in the transformed space (35.6%)
# GMM components | Adapted CER
0 (no gaussianization) |
15 Silence chopping
Long silences in the new HKUST data
–Did a quick silence chopping of endpoint silences during RT04
Set up an automatic silence chopping procedure
–Chops segments on long silences
–0.2% gain
Silence chop | Un-adapted CER | 1st adapted CER (HLDA-SAT) | 1st adapted CER (MPE)
RT04 chop | | |
New chop | | |
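Chopping a segment on long silences can be sketched as follows, assuming a per-frame speech/silence decision is already available. The thresholds `max_silence` and `pad` are hypothetical values, not the ones used in the system.

```python
def chop_on_long_silences(is_speech, max_silence=50, pad=10):
    """Split a segment wherever a silence run exceeds `max_silence` frames.
    Returns (start, end) frame ranges, end exclusive, each keeping up to
    `pad` frames of silence around the speech."""
    n = len(is_speech)
    chunks = []
    start = None
    last_speech = None
    for t, speech in enumerate(is_speech):
        if not speech:
            continue
        if start is None:
            start = t
        elif t - last_speech - 1 > max_silence:
            # The silence since the last speech frame is too long: cut here.
            chunks.append((start, last_speech + 1))
            start = t
        last_speech = t
    if start is not None:
        chunks.append((start, last_speech + 1))
    return [(max(0, s - pad), min(n, e + pad)) for s, e in chunks]


# 30 silent frames, 20 speech, 100 silent, 10 speech, 5 silent.
labels = [False] * 30 + [True] * 20 + [False] * 100 + [True] * 10 + [False] * 5
chopped = chop_on_long_silences(labels)
unchopped = chop_on_long_silences(labels, max_silence=200)
```

Keeping `pad` at no more than half of `max_silence` guarantees that the padded ranges never overlap. Endpoint silences are trimmed by the same padding step.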
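16 An updated system
[System diagram: un-adapted pass, adapted 1st pass, adapted 2nd pass, then ROVER. Legible CERs: M14r (un-adapted) 36.3; M10r 29.0; M15r 29.3; M16r 29.4; ROVER 27.2. The remaining model labels and values are illegible.]
Note: CER measured on Eval04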
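All results in this deck are character error rates. A minimal sketch of the metric, assuming the usual definition (character-level edit distance divided by reference length, in percent); the scoring tool actually used is not named on the slides, and the strings below are toy examples.

```python
def cer(reference, hypothesis):
    """Character error rate in percent: (substitutions + deletions +
    insertions) / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return 100.0 * prev[-1] / len(ref)


exact = cer("abcd", "abcd")
one_sub = cer("abcd", "abxd")
one_ins = cer("abcd", "abcdx")
```

Insertions count against the score, which is why cross-talk recognized as speech raises CER.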
17 Model characteristics
Model | Feature | MixExp | Phn-set | SAT | Pitch
M2 | PLP, deriv. | yes | 77 | yes | new
M4 | PLP, deriv. | yes | 147 | yes | new
M10r | PLP | no | 82 | yes | old
M13 | PLP, MPE-HLDA | no | 82 | yes | new
M15r | MFCC | no | 82 | yes | old
M16r | PLP | no | 147 | yes | old
M14r | PLP | no | 82 | no | old
All models trained with MPE
M2 and M4 use derivatives and HLDA to project to 46 dim.
All other models concatenate 9 frames and project to 46 dim. with LDA+MLLT or MPE-HLDA
M2, M4 and M13 have not been updated yet