HIWIRE Progress Report Chania, May 2007 Presenter: Prof. Alex Potamianos Technical University of Crete
Motivation
Combining several sources of information can improve performance. Unfortunately, not all sources are equally reliable across environments and noise conditions, and training and test conditions are often mismatched.
Goal: propose estimators of the optimal stream weights s_i that can be computed in an unsupervised manner.
Optimal Stream Weights
Two cases are considered: equal error rate in the single-stream classifiers, and equal estimation error variance in each stream.
Experimental Results
Subset of the CUAVE database: 36 speakers (30 training, 6 testing), 5 sequences of 10 connected digits per speaker.
Training set: 1500 digits (30 x 5 x 10). Test set: 300 digits (6 x 5 x 10).
Features:
  Audio: 39 features (MFCC_D_A)
  Visual: 39 features (ROIDCT_D_A, odd columns)
Multi-stream HMM models:
  8-state, left-to-right, whole-digit HMMs
  Single Gaussian mixture
  The AV-HMM uses separate audio and video feature streams
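As a sketch of how stream weights enter an audio-visual classifier: a multi-stream model is commonly scored by a weighted sum of the per-stream log-likelihoods. The function names, the example scores, and the weights below are illustrative assumptions, not values from these experiments.

```python
def combine_streams(log_likes, weights):
    """Return the stream-weighted score sum_i s_i * log p(x_i | class)."""
    if len(log_likes) != len(weights):
        raise ValueError("one weight per stream is required")
    return sum(s * ll for s, ll in zip(weights, log_likes))


def classify(per_class_log_likes, weights):
    """Pick the class whose weighted multi-stream score is largest."""
    return max(per_class_log_likes,
               key=lambda c: combine_streams(per_class_log_likes[c], weights))


# Hypothetical (audio, visual) log-likelihoods for two digit classes.
scores = {"one": (-10.0, -30.0), "two": (-12.0, -20.0)}
audio_only = classify(scores, (1.0, 0.0))   # trusts the audio stream only
balanced = classify(scores, (0.5, 0.5))     # weighs both streams equally
```

With these toy numbers the decision changes with the weights, which is why the choice of stream weights matters when one stream becomes unreliable.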
Results (classification)
Two classes → anti-models. Class membership → inter- and intra-class distance.
Results (recognition)
Generalization of the inter- and intra-class distance measure → inter-class distance among all the classes.
Conclusions
Stream weights for a multi-class classification task are computed from theoretical results for two-class classification, using an anti-model technique.
We use only the test utterance and the information contained in the trained models.
This generalizes towards unsupervised estimation of stream weights for multi-stream classification and recognition problems.
Vocal Tract Length Normalization. Dependence between warping and phonemes. Frame Segmentation into Regions. Warping Factor and Function Estimation. VTLN in Recognition. Evaluation.
Dependence between warping and phonemes [1].
Examining the similarity between two frames before and after warping:
For each phoneme and speaker, the average spectral envelope is computed from the middle frame of the utterance.
An optimal warping factor is computed for each phoneme utterance, so that the MSE between the warped spectrum and the corresponding unwarped spectrum is minimized.
Optimization is a full search over warping factors from 0.8 to 1.2, where 1 corresponds to no warping.
The mapped spectrum is warped according to this optimal warping factor.
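The full search described here can be sketched as follows. This is a minimal illustration, assuming a linear warping of a uniformly sampled spectral envelope with linear interpolation; the search step of 0.02, the helper names, and the toy spectrum are assumptions, since the slide gives only the search interval.

```python
import math


def warp(spectrum, alpha):
    """Linearly warp the frequency axis: output bin k reads input position alpha * k."""
    last = len(spectrum) - 1
    out = []
    for k in range(len(spectrum)):
        pos = min(alpha * k, last)      # clamp at the top of the band
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return out


def mse(a, b):
    """Mean squared error between two equally long spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_warping_factor(reference, mapped, lo=0.80, hi=1.20, step=0.02):
    """Full search for the factor that minimises the MSE to the reference."""
    n_steps = int(round((hi - lo) / step))
    candidates = [round(lo + i * step, 4) for i in range(n_steps + 1)]
    return min(candidates, key=lambda a: mse(warp(mapped, a), reference))


# Toy envelope: a single smooth peak; the reference is the same peak warped by 1.1.
spec = [math.exp(-((k - 20) / 6.0) ** 2) for k in range(64)]
reference = warp(spec, 1.1)
estimated = best_warping_factor(reference, spec)
```

On this synthetic pair the search recovers the factor that generated the reference; on real spectra the minimum MSE is generally non-zero.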
Dependence between warping and phonemes [2].
Bi-Parametric Warping Function (2pts). Different warping factors are evaluated for the low (f < 3 kHz) and high (f ≥ 3 kHz) frequencies. Constraints: a bounded range for each warping factor and a fixed step. A full search over the 25 (5^2) candidate warping functions provides the optimal pair of warping factors.
Four-Parametric Warping Function (4pts). Different warping factors are evaluated for four frequency ranges: 0-1.5 kHz, 1.5-3 kHz, and two higher ranges. The constraints and step remain the same as in the bi-parametric case. Full search over the 625 (5^4) candidate warping functions.
Bias addition before the warping process. Based on the ML algorithm, we evaluate a linear bias that minimizes the spectral distance between the reference and mapped spectra. The extracted linear bias is added to the unwarped mapped spectrum.
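The candidate counts follow from assigning one of five factor values to each frequency band independently. The sketch below enumerates the candidates; the five factor values are placeholders, since the slide's actual range and step are not given here.

```python
from itertools import product

# Assumed candidate values per band; the counts 25 and 625 imply five per factor.
FACTORS = (0.90, 0.95, 1.00, 1.05, 1.10)


def candidate_functions(n_bands, factors=FACTORS):
    """Enumerate every assignment of one warping factor to each band."""
    return list(product(factors, repeat=n_bands))


two_pts = candidate_functions(2)    # bi-parametric: low and high band
four_pts = candidate_functions(4)   # four-parametric: four bands
```

Each tuple is one candidate warping function, and the full search evaluates all of them and keeps the best.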
Results (over all speakers) after bias addition.
Frame Segmentation into Regions.
Using the unsupervised K-Means algorithm, the sequence of M frames of a testing utterance is divided into a number of regions that we specify.
The algorithm's output is a function F mapping each frame m to its region index c.
As an additional constraint, median filtering is applied to the sequence of region indices. This smooths the sequence so that it reflects a more physiologically plausible rate of region transition between successive frames.
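A minimal sketch of this segmentation step, assuming squared Euclidean distance, a deterministic initialisation from evenly spaced frames, and a 3-frame median window; none of these choices are stated in the slides, and the toy one-dimensional frames are purely illustrative.

```python
from statistics import median_low


def kmeans_regions(frames, n_regions, n_iter=20):
    """Assign each frame (a feature vector) to one of n_regions clusters."""
    step = max(1, len(frames) // n_regions)
    centroids = [list(frames[i * step]) for i in range(n_regions)]  # deterministic init
    labels = [0] * len(frames)
    for _ in range(n_iter):
        for m, x in enumerate(frames):
            labels[m] = min(
                range(n_regions),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
        for c in range(n_regions):
            members = [frames[m] for m in range(len(frames)) if labels[m] == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def median_smooth(labels, width=3):
    """Median-filter the region index sequence to remove isolated jumps."""
    half = width // 2
    return [median_low(labels[max(0, m - half): m + half + 1])
            for m in range(len(labels))]


frames = [[0.0], [0.1], [5.0], [0.2], [0.1], [5.1], [5.2], [4.9]]
raw = kmeans_regions(frames, n_regions=2)
smoothed = median_smooth(raw)
```

In the toy sequence the third frame is an isolated outlier; the median filter reassigns it to the region of its neighbours, so only one region transition remains.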
Warping Factor and Function Estimation.
After the frames are divided into regions, an optimal warping factor and function for each region is obtained by maximizing the likelihood of the warped vectors with respect to the first-pass transcription and the unnormalized Hidden Markov Model.
In this maximization, the warped testing utterance is the one in which every frame, after its assignment to region c, is warped according to one of the R candidate factors and one of the N candidate functions.
The optimum warping factor for each region is found by searching values between 0.88 and 1.12 with a fixed step.
λ is the Hidden Markov Model trained with unnormalized training vectors, and W is the transcription obtained by the first pass.
VTLN in Recognition.
During recognition, no preliminary transcription of the testing utterances is given, so a multiple-pass strategy is used:
A preliminary transcription W is obtained through a first-pass recognition using the unwarped sequence of cepstral vectors X and the unnormalized model λ.
The utterance's frames are categorized into c regions.
For each region c, an optimal warping factor and function is found through a multi-dimensional grid search.
After the vectors are warped with the optimal per-region factor and function, the optimally warped sequence is decoded to obtain the final recognition result.
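The multiple-pass strategy can be outlined in code as below. This is only a structural sketch: `decode`, `score`, and `warp` are hypothetical caller-supplied stand-ins for the recognizer, the likelihood under the unnormalized model given the first-pass transcription, and the frame warping. As a simplification, each region is searched independently rather than jointly.

```python
from itertools import product


def vtln_multipass(frames, regions, factors, functions, decode, score, warp):
    """Region-wise VTLN sketch: first pass, per-region search, final decode."""
    transcription = decode(frames)              # pass 1 on unwarped features
    choice = {}
    for c in sorted(set(regions)):
        def warped_with(factor, function, c=c):
            # Warp only the frames of region c; other regions stay unwarped.
            return [warp(x, factor, function) if r == c else x
                    for x, r in zip(frames, regions)]
        choice[c] = max(product(factors, functions),
                        key=lambda fw: score(warped_with(*fw), transcription))
    final = [warp(x, *choice[r]) for x, r in zip(frames, regions)]
    return decode(final), choice                # final pass on warped features


# Toy stand-ins: scalar "frames", multiplicative warping, squared-error score.
target = [1.1, 1.1, 1.8, 1.8]
result, chosen = vtln_multipass(
    frames=[1.0, 1.0, 2.0, 2.0],
    regions=[0, 0, 1, 1],
    factors=(0.9, 1.0, 1.1),
    functions=("linear",),
    decode=lambda xs: tuple(xs),
    score=lambda xs, w: -sum((x - t) ** 2 for x, t in zip(xs, target)),
    warp=lambda x, factor, function: x * factor,
)
```

In the toy run each region selects the factor that brings its frames closest to the target, which mirrors the per-region likelihood maximization.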
Results: WER (%)
# of Utters 15
Baseline 50.83
Li & Rose (2 pass) regions 41.73 (+4.7%) 42.79 (+1.60)
3 regions 43.11 (+1.56) 43.66 (-0.46)
The Linear Dynamic Model (LDM)
Discrete-time linear dynamical systems efficiently model the evolution of spectral dynamics.
An observation y_k is produced at each time step.
The state process is first-order Markov.
The initial state is Gaussian.
The state and observation noises w_k, v_k are uncorrelated, temporally white, and zero-mean Gaussian distributed.
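These properties correspond to the standard state-space form of a linear dynamical system. The slide's own equations are not reproduced here, so the notation below is an assumption that follows the usual convention: state x_k, transition and observation matrices F and H, noise covariances Q and R, and initial mean and covariance μ_0 and Σ_0.

```latex
\begin{aligned}
x_{k+1} &= F x_k + w_k, & w_k &\sim \mathcal{N}(0, Q),\\
y_k &= H x_k + v_k, & v_k &\sim \mathcal{N}(0, R),\\
x_0 &\sim \mathcal{N}(\mu_0, \Sigma_0).
\end{aligned}
```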
Generalized canonical form of LDM
Noise covariances are not constrained.
Matrices F, H have canonical forms.
A canonical form is identifiable if it is also controllable (Ljung).
Experimental Setup
Training set: Aurora 2 clean database, 3800 training sentences.
Test set: Aurora 2, test A, subway sentences; 1000 test sentences.
Noise levels: clean and SNR 20, 15, 10, 5 dB.
Front-end: HTK standard front-end, extracting 14-dimensional static features.
2 feature configurations:
  12 cepstral coefficients + C0 + energy
  + first- and second-order derivatives (δ, δδ)
Model Training on Speech Data
Word models with different numbers of segments, based on the phonetic transcription.
Segment alignments produced using HTK.
Segments   Models
2          oh
4          two, eight
6          one, three, four, five, six, nine, zero
8          seven
Classification process
Keep the true word boundaries fixed (digit-level alignments produced by an HMM).
Apply a suboptimal search and pruning algorithm: keep the 11 most probable word histories for each word in the sentence.
Classification is based on maximizing the likelihood.
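The pruning step can be illustrated with a small beam over word histories. This is a sketch under assumptions: histories are (log-probability, word tuple) pairs, and the per-word scores are made-up numbers rather than LDM likelihoods.

```python
import heapq


def extend_histories(histories, word_scores, beam=11):
    """Extend each (log_prob, words) history by every candidate word, keep the best `beam`."""
    extended = [(lp + score, words + (w,))
                for lp, words in histories
                for w, score in word_scores.items()]
    return heapq.nlargest(beam, extended, key=lambda h: h[0])


# Hypothetical log-scores for 12 candidate words at each word position.
word_scores = {str(d): -float(d) for d in range(12)}
histories = [(0.0, ())]
for _ in range(2):                      # two word positions in the sentence
    histories = extend_histories(histories, word_scores)
best_log_prob, best_words = histories[0]
```

After every word position only 11 histories survive, so the number of hypotheses stays bounded regardless of sentence length.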
Classification results
Comparison of LDM segment models and HTK HMM classification (% accuracy).
Same front-end configuration, same alignments. Both models trained on clean training data.
AURORA Subway   HMM (HTK)             LDMs
                MFCC, E    +δ +δδ     MFCC, E    +δ +δδ
Clean           97.19%     97.57%     97.53%     97.61%
SNR20           90.91%     95.71%     93.23%     95.12%
SNR15           80.09%     91.76%     87.91%     91.13%
SNR10           57.68%     81.93%     76.29%     82.69%
SNR5            36.01%     64.24%     54.87%     63.56%
Classification results Performance Comparison (MFCCs)
Classification results Performance Comparison (MFCCs + δ + δδ)
Sub-optimal Viterbi decoding (SOVD)
We use a Viterbi-like decoding algorithm for speech classification.
The LDM equivalent of an HMM state is the pair [x_k, s_i].
The search is applied among the segments of each word model.
It provides segment alignments based on the likelihood of the LDM, estimated with a Kalman filter.
It allows decoding at each time k using possible histories leading to a different [x_k, s_i] combination at several depth levels.
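To make the Kalman-filter likelihood concrete, here is a minimal scalar sketch. It assumes a one-dimensional state and observation with model x_{k+1} = F x_k + w_k, y_k = H x_k + v_k; the parameter names Q, R, x0, P0 and the toy sequence are assumptions, and the real models are multivariate.

```python
import math


def kalman_log_likelihood(ys, F, H, Q, R, x0, P0):
    """Log-likelihood of a scalar observation sequence under a scalar LDM."""
    x, P, total = x0, P0, 0.0
    for y in ys:
        # Innovation (prediction error) and its variance.
        e = y - H * x
        S = H * P * H + R
        total += -0.5 * (math.log(2 * math.pi * S) + e * e / S)
        # Measurement update.
        K = P * H / S
        x = x + K * e
        P = (1 - K * H) * P
        # Time update to the next frame.
        x = F * x
        P = F * P * F + Q
    return total


# A decaying trajectory scored by a matching and a mismatched model.
ys = [1.0, 0.9, 0.81, 0.729]
ll_match = kalman_log_likelihood(ys, F=0.9, H=1.0, Q=0.01, R=0.01, x0=1.0, P0=0.01)
ll_mismatch = kalman_log_likelihood(ys, F=-0.9, H=1.0, Q=0.01, R=0.01, x0=1.0, P0=0.01)
```

The model whose dynamics match the trajectory yields the higher log-likelihood, which is the quantity a likelihood-based segment decision compares.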
SOVD Steps
Sub-Optimal Viterbi-like Search
[Figure: search trellis over segments S1 to S4 against time (frames) t1 to t5; nodes show the state predictions F_i x_k, from F_1 x_0 to F_4 x_4.]
Visualization of Model Predictions Trajectories of true and predicted observations for c 1, c 3
Classification results
Comparison of segment models and HTK HMM classification (% accuracy).
Same fixed word boundaries based on the HMM alignments. Same front-end configuration. Both models trained on clean training data.
AURORA Subway   HMM-alignments        Segment Models
                HMM        LDM        d=1        d=2
Clean           97.19%     97.85%     97.73%     97.76%
SNR20           90.91%     92.53%     93.52%
SNR15           80.09%     85.93%     89.68%     89.77%
SNR10           57.68%     71.30%     77.21%     77.33%
SNR5            36.01%     46.72%     53.66%     53.98%
Classification results (Larger State dimension)
Comparison of SOVD-LDM for LDMs with several state dimensions.
Same front-end configuration (MFCCs+E0+c0), same word alignments.
AURORA Subway   HMM        Segment Models (three state dimensions)
Clean           97.19%     97.73%     98.22%     98.28%
SNR20           90.91%     93.52%     92.22%     91.58%
SNR15           80.09%     89.68%     84.98%     84.70%
SNR10           57.68%     77.21%     73.27%     73.08%
SNR5            36.01%     53.66%     55.12%     52.99%
Conclusions
We investigated generalized canonical forms for the LDM.
We proposed an element-wise ML estimation process.
With alignments from an equivalent HMM:
  Without derivatives, LDMs significantly outperform HMMs, particularly under highly noisy conditions.
  When derivatives are used for both models, their performance is similar.
Conclusions
With segment alignments based on LDM:
  HMM alignments hurt recognition performance
  Viterbi-like search for LDM
Larger state dimension:
  Beneficial on clean data
  Performance degrades on noisy data
Future work:
  Lower-dimension, articulatory-based features
  Non-linear state-to-observation mappings
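Noise removal formulated as a BSS problem
I mutually uncorrelated speaker signals, J microphones.
Each microphone signal is a mixture of the speaker signals, written in compact form with a mixing matrix A.
If A is invertible, the sources are recovered with W = A^-1.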
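A small numeric illustration of the compact form, assuming an instantaneous (memoryless) mixture x = A s with two speakers and two microphones. The matrix values and signal samples are made up. In actual blind separation A is unknown and W has to be estimated from the mixtures; here A is known only to show that W = A^-1 undoes the mixing.

```python
def mix(A, s):
    """x = A s for one time sample (A is J x I, s has I entries)."""
    return [sum(a * si for a, si in zip(row, s)) for row in A]


def invert_2x2(A):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[1.0, 0.5], [0.25, 1.0]]                       # illustrative mixing matrix
sources = [[1.0, -2.0], [0.0, 4.0], [3.0, 1.0]]     # samples of two speaker signals
mixed = [mix(A, s) for s in sources]                # what the microphones record
W = invert_2x2(A)
recovered = [mix(W, x) for x in mixed]              # s = W x
```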
The simulated room
Simulated with Douglas Campbell's "Roomsim".
The figure depicts the positions of the speakers and mics.
The mixed file was the one received by the first mic (top left).
Database
We considered Aurora 4 and TIMIT.
BSS shows better separability for speech signals longer than 30 s.
  Aurora 4 average utterance length: ~7 s
  TIMIT average utterance length: ~3 s
We concatenated sentences of the same speakers.
When the two signals did not overlap for the whole duration, we extended the shorter sentence with samples from its beginning.
We normalized the sources to ensure the same energy.
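The last two preparation steps can be sketched as follows. This assumes signals are plain lists of samples, that "same energy" means equal sum of squared samples, and that the shorter signal is extended by wrapping around to its beginning; the function names and sample values are illustrative.

```python
import math


def extend_by_wrapping(signal, length):
    """Lengthen a signal by repeating samples from its beginning."""
    return [signal[n % len(signal)] for n in range(length)]


def normalize_energy(signal, target_energy=1.0):
    """Scale a signal so that its total energy (sum of squares) equals the target."""
    energy = sum(x * x for x in signal)
    if energy == 0:
        return list(signal)
    gain = math.sqrt(target_energy / energy)
    return [gain * x for x in signal]


def prepare_pair(a, b):
    """Give two sources equal length and equal energy before mixing."""
    length = max(len(a), len(b))
    a, b = extend_by_wrapping(a, length), extend_by_wrapping(b, length)
    return normalize_energy(a), normalize_energy(b)


long_src, short_src = prepare_pair([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, -2.0])
```

After this step both sources overlap for the whole duration and contribute equal energy, so any level difference in the mixture is set explicitly rather than by the recordings.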
Experimental Setup
Test set: 330 utterances (AURORA4), 16 kHz, 16 bits.
Performance on the clean test set: 11.13% WER.
Conclusions
The baseline model (with spectral subtraction) fails to separate the signals.
Retraining the recognizer with mixed signals can significantly improve performance for small noise levels.
BSS test data with the baseline model:
  Significantly reduces WER when the speaker's signal is at the same level.
  Performance degrades sharply as the energy of the second speaker decreases.
BSS test data with models retrained on BSS data:
  Best performance for noise levels of 10 dB or lower.
  For smaller noise levels (>10 dB), use the recognizer retrained on mixed signals rather than BSS.
Combined Results
Optimal Bayes Adaptation
The quantity we want to determine is a weighted average of many estimators.
In our approach, θ denotes a Gaussian component and Θ is a subset of Gaussians.
Phone-Based Clustering
Cluster the output distributions based on a common central phone.
For example, using the entropy-based distance between the Gaussians, the least distant Gaussians (in gray in the figure) are clustered together.
[Figure: Gaussians 1, 2, M grouped into genone 1 and genone 2; labels: Gaussian Size, Number of Mixture Components.]
Likelihoods Collection
To compute the likelihoods, for each voice frame we track which triphones are used and calculate the probability for each θ.
We then apply delta smoothing to the distributions of θ.
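The smoothing formula itself is not reproduced on this slide. As an illustration only, the sketch below uses the common add-delta form, in which a constant δ is added to every count before normalizing; the value of δ, the function name, and the example counts are assumptions.

```python
def delta_smooth(counts, delta=0.5):
    """Add-delta smoothing: (count + delta) / (total + delta * number_of_items)."""
    total = sum(counts.values()) + delta * len(counts)
    return {theta: (c + delta) / total for theta, c in counts.items()}


# Hypothetical usage counts for three Gaussian components.
probs = delta_smooth({"g1": 3, "g2": 1, "g3": 0}, delta=0.5)
```

The component that was never observed still receives a small non-zero probability, which is the usual reason for smoothing sparse adaptation statistics.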
Adaptation Configuration
Baseline trained on the WSJ database.
Adaptation data: spoke3 WSJ task, non-native speakers (5 male and 5 female).
40 adaptation sentences per speaker.
40 test sentences per speaker.
Gender-dependent Results
Conclusions
A small improvement over the baseline.
Recent experiments have shown that dynamic association of distributions gives better results.
Recent experiments have also shown that increasing the amount of adaptation data improves the recognition results.