Slide 1: Improved Speaker Adaptation Using Speaker Dependent Feature Projections
Spyros Matsoukas and Richard Schwartz
Sep. 5, 2003, Martigny, Switzerland
Slide 2: Overview
- Baseline system
- Technical background
  – Heteroscedastic Linear Discriminant Analysis (HLDA)
  – Constrained Maximum Likelihood Linear Regression (CMLLR)
  – Speaker Adaptive Training using CMLLR (CMLLR-SAT)
- HLDA adaptation
- SAT using HLDA adaptation (HLDA-SAT)
- Results
- Conclusions
Slide 3: Baseline SI system description
- PLP front end, speaker-turn-based cepstral mean normalization
- HLDA used to find an 'optimal' feature space
  – Original space consists of 14 cepstral coefficients and energy, plus their first, second, and third derivatives (60 dimensions total)
  – Reduced space has 46 dimensions
- Trained three gender-independent (GI) HMMs:
  – Phonetically Tied Mixture (PTM), within-word triphone model
  – State Clustered Tied Mixture (SCTM), within-word quinphone model
  – SCTM cross-word quinphone model
- Estimated separate HLDA transforms for each model
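The 60-dimensional input space on this slide is just the 15 static coefficients (14 cepstra plus energy) stacked with three orders of derivatives. As an illustration only (not the authors' code), a toy numpy sketch of per-utterance cepstral mean normalization and standard regression-based deltas:

```python
import numpy as np

def cmn(feats):
    """Cepstral mean normalization: subtract the per-utterance mean."""
    return feats - feats.mean(axis=0, keepdims=True)

def deltas(feats, window=2):
    """Regression-based delta features over +/- `window` frames."""
    T, D = feats.shape
    denom = 2 * sum(k * k for k in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for k in range(1, window + 1):
        out += k * (padded[window + k:window + k + T]
                    - padded[window - k:window - k + T])
    return out / denom

# 15 static coefficients (14 cepstra + energy) per frame, toy data
static = cmn(np.random.randn(100, 15))
d1 = deltas(static)          # first derivatives
d2 = deltas(d1)              # second derivatives
d3 = deltas(d2)              # third derivatives
full = np.hstack([static, d1, d2, d3])   # 60-dimensional feature vector
```

The window length and regression formula are common defaults, assumed here rather than taken from the slide.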
Slide 4: HLDA
- HLDA is being adopted by many state-of-the-art systems
  – Like LDA, its goal is to find a feature subspace in which it is easier to discriminate among a given set of classes
  – Unlike LDA, it does not assume that the class Gaussian distributions share a common covariance matrix
  – Formulated within the ML framework
- Many choices are available for the definition of the classes
  – Phonemes, tied states, mixture components
- This work uses the SCTM codebook clusters (HMM tied states) as the classes
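The ML formulation mentioned on this slide can be made concrete. The sketch below evaluates a Kumar-style HLDA objective for a candidate square transform: the first p rows span the useful subspace with class-dependent diagonal Gaussians, the remaining rows are nuisance dimensions modelled by one shared Gaussian. The function name and the toy statistics are assumptions for illustration, not from the presentation:

```python
import numpy as np

def hlda_objective(A, p, class_covs, class_counts, total_cov):
    """Kumar-style HLDA ML objective (A-independent constants dropped).

    Rows 0..p-1 of the square transform A are the useful dimensions,
    where each class keeps its own diagonal-covariance Gaussian; rows
    p.. are nuisance dimensions modelled with the total covariance.
    """
    N = float(sum(class_counts))
    useful, nuisance = A[:p], A[p:]
    q = N * np.log(abs(np.linalg.det(A)))
    # shared Gaussian over the nuisance dimensions
    q -= 0.5 * N * np.log(
        np.einsum('ij,jk,ik->i', nuisance, total_cov, nuisance)).sum()
    # class-dependent diagonal Gaussians in the useful subspace
    for n_c, W_c in zip(class_counts, class_covs):
        q -= 0.5 * n_c * np.log(
            np.einsum('ij,jk,ik->i', useful, W_c, useful)).sum()
    return q

# toy 2-D statistics: two classes whose means differ only along dim 0
W = np.diag([0.1, 0.1])        # within-class covariance (both classes)
T = np.diag([1.1, 0.1])        # total covariance, inflated along dim 0
keep_x = hlda_objective(np.eye(2), 1, [W, W], [100, 100], T)
keep_y = hlda_objective(np.eye(2)[::-1], 1, [W, W], [100, 100], T)
# keep_x > keep_y: the objective prefers keeping the discriminative dimension
```

The toy covariances are constructed so that dimension 0 carries all the class separation, so a projection keeping it scores higher.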
Slide 5: CMLLR adaptation
- Widely used adaptation method
  – Estimates a constrained linear transformation that adapts both the means and covariances of a set of Gaussians
  – Equivalent to transforming the input features with the inverse transformation matrix
  – A reliable row-iterative estimation method is available when the model to be adapted consists of diagonal-covariance Gaussians
- The formulation can be extended to handle full-covariance Gaussians
  – The objective function and its first derivative are easy to compute
  – Standard gradient descent methods are used to estimate the ML transformation
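The equivalence stated on this slide (adapting the Gaussian parameters versus transforming the features with the inverse transform) can be checked numerically. A minimal numpy sketch with arbitrary toy parameters; the transform and Gaussian here are random stand-ins, not estimated transforms:

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log-density of a multivariate Gaussian."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
sigma = np.diag(rng.uniform(0.5, 1.5, size=d))   # diagonal-covariance Gaussian
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))    # feature transform: x_hat = A x + b
b = rng.normal(size=d)
x = rng.normal(size=d)

# model-space view of the same transform: adapt mean and covariance
Ainv = np.linalg.inv(A)
mu_adapt = Ainv @ (mu - b)
sigma_adapt = Ainv @ sigma @ Ainv.T

lhs = gauss_logpdf(x, mu_adapt, sigma_adapt)
rhs = np.linalg.slogdet(A)[1] + gauss_logpdf(A @ x + b, mu, sigma)
# lhs == rhs: adapting the model matches transforming the features,
# up to the Jacobian term log|det A|
```

The Jacobian term log|det A| is what makes the two views give identical likelihoods, which is why feature-space CMLLR can be plugged under an unchanged model.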
Slide 6: Speaker Adaptive Training (SAT)
- SAT brings speaker awareness to acoustic model reestimation
  – Extends the set of model parameters with speaker-dependent transformations
  – Reduces inter-speaker variability, resulting in more compact acoustic models
  – Improves performance on test data after speaker adaptation
- Multiple flavors of SAT
  – MLLR-based, with transforms applied to the model parameters: complicated update equations, hard to integrate with MMI
  – CMLLR-based, with transforms applied to the features: integrates transparently with regular SI reestimation methods (ML, MMI, etc.)
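The compactness claim above (SAT removes inter-speaker variability from the canonical model) can be illustrated with a toy loop. This is a deliberately simplified sketch: the per-speaker "CMLLR" transform is reduced to a bias-only shift, and the model to a single diagonal Gaussian, which are assumptions for the illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n_speakers, frames, D = 8, 200, 5
canonical = rng.normal(size=D)
# each speaker shifts the canonical space by an unknown offset
offsets = rng.normal(scale=2.0, size=(n_speakers, D))
speaker_data = [canonical + off + rng.normal(scale=0.3, size=(frames, D))
                for off in offsets]

# speaker-independent (SI) model: variance inflated by inter-speaker shifts
all_frames = np.vstack(speaker_data)
si_var = all_frames.var(axis=0)

# SAT loop: alternate per-speaker transform estimation and model reestimation
mean = all_frames.mean(axis=0)
for _ in range(5):
    # bias-only "CMLLR" per speaker: move each speaker onto the current model
    adapted = [x + (mean - x.mean(axis=0)) for x in speaker_data]
    stacked = np.vstack(adapted)
    mean, sat_var = stacked.mean(axis=0), stacked.var(axis=0)
# sat_var is close to the within-speaker variance, far below si_var
```

The SAT variances end up near the within-speaker noise floor, while the SI variances also absorb the speaker offsets, which is the "more compact model" effect the slide describes.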
Slide 7: CMLLR-SAT
Slide 8: HLDA adaptation
- A mismatch between training and testing acoustic conditions can reduce the effectiveness of HLDA
- HLDA adaptation alleviates this problem by transforming the test features so that their statistics look more like those of the training data
  – Uses CMLLR in the full space, based on the single-Gaussian-per-tied-state HMM
  – The CMLLR transform is then combined with the global HLDA matrix to form speaker-dependent projections
  – Most effective when applied to both training and testing
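Combining the full-space CMLLR transform with the global HLDA matrix, as this slide describes, is plain matrix composition: projecting A_s x + b_s with P is the same as applying the speaker-dependent projection P A_s with offset P b_s. A minimal numpy check, where all matrices are random stand-ins rather than estimated transforms:

```python
import numpy as np

rng = np.random.default_rng(2)
full_d, reduced_d = 6, 4
P = rng.normal(size=(reduced_d, full_d))        # global HLDA projection
A = np.eye(full_d) + 0.1 * rng.normal(size=(full_d, full_d))  # full-space CMLLR
b = rng.normal(size=full_d)

# speaker-dependent projection: fold the speaker transform into HLDA
P_s = P @ A
b_s = P @ b

x = rng.normal(size=full_d)
# transform-then-project equals the composed speaker-dependent projection
same = np.allclose(P_s @ x + b_s, P @ (A @ x + b))
```

Folding the transform into the projection means the decoder only ever sees 46-dimensional features, with no per-frame full-space work at test time.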
Slide 9: HLDA-SAT
Slide 10: Experimental setup
- Trained gender-independent (GI), band-independent (BI) models on 145 hours of Broadcast News (BN) data, using ML
  – 6,300 tied states
  – 25.6 Gaussians per state
- Trigram language model (LM), trained on 600M words
  – 13M bigrams, 43M trigrams
- Tested on the h4e97 and h4d03 test sets
  – Automatic segmentation and speaker clustering
  – Two decoding passes: an unadapted pass generating hypotheses for adaptation, and an adapted pass using SI or SAT adapted models
Slide 11: Results I (effect of HLDA adaptation using SI models)

  HLDA adapt.   CMLLR   MLLR   h4e97   h4d03
      no          no     no     17.6    18.6
      no          yes    no     15.6    16.5
      no          yes    yes    15.4    15.7
      yes         yes    yes    14.4    15.3

- Significant gain from HLDA adaptation, even on top of CMLLR and MLLR
Slide 12: Results II (effect of HLDA adaptation using SAT models)

  Model       HLDA adapt.   CMLLR   MLLR   h4e97   h4d03
  SI              no          yes    yes    15.4    15.7
  CMLLR-SAT       no          yes    yes    14.8    15.2
  CMLLR-SAT       yes         yes    yes    14.4    15.2
  HLDA-SAT        yes         yes    yes    13.6    14.6

- 0.6-0.8% absolute gain from HLDA-SAT compared to CMLLR-SAT
Slide 13: Understanding the improvements
- HLDA-SAT extends CMLLR-SAT in two ways
  – Uses a single-Gaussian-per-state (1gps) model to estimate transforms in the full space
  – Updates the HLDA projection in the transformed space
- Which of the two has the larger effect on recognition accuracy?
  – The 1gps model makes it possible to estimate CMLLR transforms that move the speakers closer to the canonical model
  – Reestimating HLDA in the transformed space results in a significantly higher objective function value
- Tried two variations of HLDA-SAT in which the SI HLDA is used
  – HLDA-SAT1: 1gps-based CMLLR in the reduced space
  – HLDA-SAT2: 1gps-based CMLLR in the full space
Slide 14: Results III (effects of HLDA update and full-space transforms)

  Model       h4e97   h4d03
  CMLLR-SAT    14.4    15.2
  HLDA-SAT1    14.1    14.6
  HLDA-SAT2    14.0    14.9
  HLDA-SAT     13.6    14.6

- Most of the improvement from HLDA-SAT comes from using a 1gps model; the rest comes from updating the HLDA projection in the transformed space
Slide 15: HLDA-SAT on CTS data
- Applied HLDA-SAT to English and Mandarin CTS, with mixed results
  – 0.7% gain on Mandarin CTS
  – 0.1% gain on English CTS
- Suspect a problem with the English CTS run; more debugging is needed to determine the cause of the poor performance
Slide 16: Conclusions
- Significant gain from HLDA adaptation
- Additional improvement from HLDA-SAT
- Future work:
  – Find out why there is no gain from HLDA-SAT on English CTS
  – Extend the method to use non-linear transformations