ISL Meeting Recognition
Hagen Soltau, Hua Yu, Florian Metze, Christian Fügen, Yue Pan, Sze-Chen Jou
Interactive Systems Laboratories
RT-02 Workshop
Outline
– Experiments on SWB
– Adaptation to meeting data
– Multi-domain training for meeting recognition
– Model-based acoustic mapping
SWB: from 1997 to now
'97 eval system [Finke97]:
– Frontend: CMS, no CVN, LDA
– AM: 25k distributions defined over 10k codebooks
– LM: 3gram swb + class swb + 4gram BN
– Multiple search passes to estimate VTLN, MLLR
– 45.1% on eval97 (best result was 45.0%)
Tested '97 swb system on eval2001 -> 36.5% error rate
– More than 11% worse compared to 2001 top system
Training Data
Used ISIP transcripts
– Re-checked segments by flexible transcription alignment
– Skipped all turns containing noises or single words only
– No gain compared to '97 transcripts!
Development data:
– 1h subset from eval2001
Traditional MFCC Front-end
Pipeline: FFT -> Mel-scale filterbank -> log -> DCT -> CMN -> Δ, ΔΔ -> LDA -> MLLT
Observations:
– Many linear transform stages
– Many dimensionality reduction stages
– Many different criteria!
Can we streamline this process using data-driven optimization?
Optimizing the MFCC Front-end
Pipeline: FFT -> Mel-scale filterbank -> log -> DCT -> CMN -> Δ, ΔΔ -> LDA -> MLLT
– Δ, ΔΔ can be generalized by concatenating N adjacent frames, then using LDA to choose the final projection.
– DCT can be removed without affecting performance.
– Mel-scale filterbank can be removed, but with a big increase in computation.
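The idea of replacing delta features by stacked frames followed by LDA can be sketched as below. This is a minimal illustration, not the ISL implementation: the function names `stack_frames` and `lda_projection`, the edge padding, the ridge term, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def stack_frames(feats, context):
    """Concatenate each frame with its +/- `context` neighbours.

    Edges are padded by repeating the first/last frame (an assumption).
    """
    n_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n_frames] for i in range(2 * context + 1)])

def lda_projection(x, labels, out_dim):
    """Return a (dim, out_dim) LDA matrix from within/between-class scatter."""
    dim = x.shape[1]
    mean = x.mean(axis=0)
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        s_w += (xc - mc).T @ (xc - mc)
        s_b += len(xc) * np.outer(mc - mean, mc - mean)
    s_w += 1e-6 * np.eye(dim)  # small ridge keeps S_w invertible
    evals, evecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
    order = np.argsort(-evals.real)[:out_dim]  # largest eigenvalues first
    return evecs[:, order].real

# Toy data: 200 frames of 13-dim features with 5 hypothetical class labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 13))
labels = rng.integers(0, 5, size=200)

stacked = stack_frames(feats, context=2)      # 5 frames -> 65 dims
proj = lda_projection(stacked, labels, out_dim=4)
reduced = stacked @ proj                      # final low-dimensional features
```

With `context=7` the same function stacks 15 adjacent frames; the smaller context here only keeps the toy example fast.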
Front-end Experiments on SWB
System                           WER (%)
Baseline                         39.8*
+data-driven Δ, ΔΔ, plain CVN    39.7
+SCMN                            37.8
+MLLT                            35.6
Test set: hub5e_01 subset
* The baseline is trained on 180 hrs, while the rest use a 66 hrs subset for training.
Frontend, Semi-tied covariances
Front-end
– Speaker based CVN
– 15 adjacent frames instead of delta, delta-deltas: 34.2% -> 33.1%
Semi-tied covariances
             No MLLR   MLLR
No STC       36.7%     34.1%
Global       33.7%     32.2%
Per Phone    33.4%     33.1%
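Speaker-based normalization, as mentioned on the front-end slide, can be sketched as follows. The sketch assumes mean subtraction is applied together with variance normalization and that statistics are pooled over all frames of a speaker; the function name `speaker_cvn` and the toy data are hypothetical.

```python
import numpy as np

def speaker_cvn(feats, speaker_ids, eps=1e-8):
    """Normalize features to zero mean and unit variance per speaker."""
    feats = np.asarray(feats, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(feats)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = feats[idx].mean(axis=0)
        sigma = feats[idx].std(axis=0)
        out[idx] = (feats[idx] - mu) / (sigma + eps)  # eps avoids division by zero
    return out

# Toy data: 100 frames from two hypothetical speakers "A" and "B".
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(100, 13))
spk = np.array(["A"] * 60 + ["B"] * 40)
norm = speaker_cvn(feats, spk)
```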
AM Training
Decision Tree
– 10000 context dependent states
– Increased context from ±2 to ±3: 34.7% -> 34.2%
– modalities
Growing of Gaussians:
             K-means   Incr. growing
10000x24     33.8%     33.1%
10000x32     33.7%     32.4%
Speaker Adaptive Training
Feature Space Adaptation
– Training: single FSA matrix for each conversation side
– Decoding: estimate FSA matrix first, compute MLLR matrices on adapted feature space
– System A: 34.8% -> 33.7%
– System B: 30.9% -> 30.8% (better frontend, STC, ...)
SAT (model space)
– Dynamic number of transforms (~10) for each training speaker
– Full and diagonal transforms tried
– No gains even on non-STC system!
SWB summary
LM: interpolation with 5gram class SWB: 33.5% -> 32.9%
Results on eval2001: 29.5%
– But single system only
– No ROVER, consensus
Next steps:
– Fix SAT problems
– Improve LM (BN corpus, distance ngrams)
– Modality dependent training on SWB
  - Integrate gender, speaking rate, etc. into decision tree
  - Did work on dialects, hyperarticulated speech [Fuegen2000, Soltau2000]
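The LM interpolation mentioned in the SWB summary is commonly done as a linear mixture of the component models' probabilities. The sketch below assumes that form; the slides do not state the interpolation weights, so `lam` and the probabilities used here are purely illustrative.

```python
def interpolate(p_word, p_class, lam):
    """Linearly interpolate a word n-gram and a class n-gram probability.

    lam is the weight of the word-based model (hypothetical value below).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_class

# Illustrative numbers only: P(w|h) under each model for one word.
p = interpolate(p_word=0.02, p_class=0.01, lam=0.7)
```

In practice the weight would be tuned on held-out data, for example by minimizing perplexity.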
Experiments on Meetings
– LIMSI style automatic partitioning scheme for the table mic, manual segmentation for personal mic
– AM: SWB 244k Gaussians, BN 104k Gaussians
– BN LM, 40k vocab
– 1st pass decoding, no adaptation
[Tables: b008 devtest and b009 devtest; rows: SWB models, BN models; columns: Personal mic, Table mic]
The Cross-talk Challenge
– Cross-talk causes many problems, especially in the table-mic case
– The current scoring tool won't handle overlap correctly!
– Scoring on the non-crosstalk region (using word level alignment)
– But cross-talk is much more than just a scoring issue!
[Table: b008 table-mic; rows: SWB models, BN models; columns: Original scoring on entire devtest, Scoring on non-crosstalk region]
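Restricting scoring to the non-crosstalk region can be sketched as a filter over the word-level alignment: a word is kept only if its time span does not overlap speech from any other speaker. The data layout (tuples of word, start, end) and the function names are assumptions made for this example, not the format of the actual scoring tool.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the two half-open time intervals intersect."""
    return a_start < b_end and b_start < a_end

def non_crosstalk_words(words, other_speech):
    """Keep words whose time span is free of other speakers' speech.

    words:        list of (word, start_sec, end_sec) from the word alignment
    other_speech: list of (start_sec, end_sec) segments of other speakers
    """
    return [
        w for w in words
        if not any(overlaps(w[1], w[2], s, e) for s, e in other_speech)
    ]

# Toy alignment: "there" overlaps another speaker's segment and is dropped.
words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("okay", 1.2, 1.5)]
other_speech = [(0.8, 1.0)]
kept = non_crosstalk_words(words, other_speech)
```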
Multi-domain Training
Combined BN with ESST
– ESST = 30h, conversational speech, but clean channels
– Tested on in-house meeting dev. set (4 meetings, 1h)
ESST       54.1%
BN         44.2%
ESST+BN    42.2%
SWB        42.0%
Model-combination-based Acoustic Mapping
– MAM [Westphal2001] tries to find a non-linear mapping of feature vectors using a pair of corresponding clean and noisy GMMs (used for car data, distance talking)
– SNR differs for each speaker, so we segmented the data by speaker using the reference and tested each speaker individually.
– Finally, we concatenated the hypotheses of all speakers to score the overall ER.
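Scoring the concatenated per-speaker hypotheses amounts to pooling errors and reference words over all speakers before dividing. A minimal sketch under that assumption, using a plain word-level edit distance; the function names and the toy transcripts are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def pooled_error_rate(per_speaker):
    """Overall ER in percent: total errors over total reference words."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in per_speaker.values())
    n_words = sum(len(ref) for ref, _ in per_speaker.values())
    return 100.0 * errors / n_words

# Toy transcripts for two hypothetical speakers.
per_speaker = {
    "spk1": ("a b c d".split(), "a x c d".split()),
    "spk2": ("e f".split(), "e".split()),
}
overall = pooled_error_rate(per_speaker)
```

Pooling this way weights each speaker by the number of reference words, so long turns dominate the overall figure.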
Acoustic Mapping Result
Tested on b008 devtest (6 spk, 600 sec.)
Individual result
[Table: ER per speaker; columns (speaker/length in s): a/203, n/177, j/126, l/77, c/45, jc/22; rows: BN baseline, BN baseline+AM]
Overall result
– BN baseline: 68.6%
– BN baseline + AM: 66.5%
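For reference, the overall result above corresponds to the following absolute and relative error reductions (simple arithmetic on the two reported numbers):

```latex
\Delta_{\text{abs}} = 68.6\% - 66.5\% = 2.1\%,
\qquad
\Delta_{\text{rel}} = \frac{68.6 - 66.5}{68.6} \approx 3.1\%
```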