ICASSP 2007 Robustness Techniques Survey Presenter: Shih-Hsiang Lin
2 Topics
Word Graph based Feature Enhancement for Noisy Speech Recognition
Stereo-based Stochastic Mapping for Robust Speech Recognition
Combination of Recognizers and Fusion of Features Approach to Missing Data ASR Under Non-Stationary Noise Conditions
WORD GRAPH BASED FEATURE ENHANCEMENT FOR NOISY SPEECH RECOGNITION
Zhi-Jie Yan 1, Frank K. Soong 2, Ren-Hua Wang 1
1 iFlytek Speech Lab, University of Science and Technology of China, Hefei, P. R. China
2 Microsoft Research Asia, Beijing, P. R. China
SPE-L3: Robust Features and Acoustic Modeling
4 Introduction
This paper presents a word graph based feature enhancement method for robust speech recognition in noise
–A word graph is more likely than the first best hypothesis alone to contain the correct hypothesis, even when that hypothesis has a lower posterior probability (or likelihood) than an incorrect first best hypothesis
The proposed method is based upon Wiener filtering of the Mel-filter bank energy, given
–the input noisy speech
–a signal processing based estimate of the noise
–a clean trained Hidden Markov Model (HMM)
As a result, the enhanced speech feature after Wiener filtering matches the clean speech model better in the acoustic space, which leads to improved recognition performance
5 Algorithm Overview
[Block diagram; stage labels: rough estimate of noise spectrum, re-estimate noise, mean normalized speech, model based estimate of the clean speech, final estimate of clean speech, corresponding clean speech]
6 More Details…
Kernel posterior probabilities can be calculated for each Gaussian component of the model
–These posterior probabilities serve as the weighting coefficients for synthesizing the model based clean speech used in Wiener filtering
–Using the word graph, the posterior probability of kernel k at time t, given the entire observation sequence, is formulated in terms of three quantities: Word Posterior Probability (WPP), State Occupancy Probability, and Kernel Occupancy Probability
The model based clean speech estimate for Wiener filtering is constructed in two steps
–First step: for each time frame t, the expected values of the mean and covariance of the clean speech feature are calculated using the kernel posterior probabilities along with the kernel parameters
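The first step can be sketched as a posterior-weighted average of the kernel parameters for one frame. This is only an illustration under stated assumptions: diagonal covariances, posteriors already computed from the word graph, and a plain weighted average as the "expected value". The function name `expected_kernel_stats` and its arguments are hypothetical and do not come from the paper.

```python
def expected_kernel_stats(posteriors, means, variances):
    """Posterior-weighted mean and variance for a single frame.

    posteriors: kernel posterior probabilities for this frame (assumed given)
    means, variances: per-kernel parameter vectors (diagonal covariance assumed)
    """
    total = sum(posteriors)
    # Renormalize defensively in case the posteriors do not sum exactly to 1.
    weights = [p / total for p in posteriors]
    dim = len(means[0])
    exp_mean = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dim)]
    exp_var = [sum(w * v[d] for w, v in zip(weights, variances)) for d in range(dim)]
    return exp_mean, exp_var


if __name__ == "__main__":
    # Two hypothetical kernels with two-dimensional features.
    mean, var = expected_kernel_stats(
        [0.25, 0.75],
        [[0.0, 4.0], [4.0, 0.0]],
        [[1.0, 1.0], [3.0, 1.0]],
    )
    print(mean, var)
```

Kernels with a larger posterior pull the per-frame statistics toward their own parameters, which is the sense in which the posteriors act as weighting coefficients.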
7 More Details… (cont.)
–Second step: the clean speech S3 is synthesized in the ML sense
–The ML solution of S3 is obtained by solving the weighted normal equation
–S3: synthesized clean speech
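A weighted normal equation of this kind can be illustrated with a small sketch. The slide does not show the equation, so the setup here is an assumption: a one-dimensional static track c, observations made of static and delta features o = W c, per-row target means mu and variances, and the ML solution given by (W^T P W) c = W^T P mu with P = diag(1/variance). The delta window (0.5 * (c[t+1] - c[t-1]) with clamped edges) and all function names are hypothetical.

```python
def build_window_matrix(num_frames):
    """Rows 2t and 2t+1 map the static track to [static, delta] at frame t."""
    rows = []
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        delta = [0.0] * num_frames
        prev_t, next_t = max(t - 1, 0), min(t + 1, num_frames - 1)
        delta[next_t] += 0.5  # assumed delta window, edges clamped
        delta[prev_t] -= 0.5
        rows.append(static)
        rows.append(delta)
    return rows


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ml_synthesize(means, variances):
    """Solve (W^T P W) c = W^T P mu, where P = diag(1 / variance)."""
    num_frames = len(means) // 2
    w = build_window_matrix(num_frames)
    prec = [1.0 / v for v in variances]
    rows = range(len(w))
    lhs = [[sum(w[r][i] * prec[r] * w[r][j] for r in rows)
            for j in range(num_frames)] for i in range(num_frames)]
    rhs = [sum(w[r][i] * prec[r] * means[r] for r in rows)
           for i in range(num_frames)]
    return solve_linear(lhs, rhs)


if __name__ == "__main__":
    # Hypothetical targets: [static, delta] per frame for three frames.
    print(ml_synthesize([1.0, 0.5, 2.0, 1.5, 4.0, 1.0], [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]))
```

Rows with small variance constrain the solution more strongly, so frames where the model is confident dominate the synthesized track.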
8 More Details… (cont.)
Wiener filtering of the Mel-filter bank energy is performed in the linear spectral domain
In the last step, the final estimate of clean speech S4 is converted to the cepstral domain and used to rescore the word graph
–S4 is re-decoded within the constrained search space defined by the word graph
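A minimal sketch of Wiener filtering on Mel-filter bank energies in the linear domain, assuming the textbook gain H = S / (S + N) per filter bank channel. Here S stands for the model based clean speech estimate and N for the noise estimate. The exact gain used in the paper is not shown on the slide, and the `gain_floor` parameter is a hypothetical addition for illustration.

```python
def wiener_filter_mel(noisy_energy, clean_estimate, noise_estimate, gain_floor=0.0):
    """Apply a per-channel Wiener gain to linear-domain Mel-filter bank energies."""
    enhanced = []
    for y, s, n in zip(noisy_energy, clean_estimate, noise_estimate):
        denom = s + n
        # Textbook Wiener gain; fall back to zero when both estimates are zero.
        gain = s / denom if denom > 0.0 else 0.0
        enhanced.append(max(gain, gain_floor) * y)
    return enhanced


if __name__ == "__main__":
    # Hypothetical energies for two Mel channels.
    print(wiener_filter_mel([10.0, 8.0], [6.0, 0.0], [2.0, 4.0]))
```

Channels where the clean estimate dominates the noise estimate pass almost unchanged, while noise-dominated channels are attenuated.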
9 Experimental Results
Signal processing based feature enhancement consistently improves the recognition performance; the overall relative error rate reduction is 35.44%
The graph error rate (GER) of the decoded word graph is significantly lower than the word error rate (WER) of the first best hypothesis (only about 1/4 to 1/5)
The results show that an overall relative error rate reduction of 57.89% is obtained
–Using word graph constrained second pass decoding, this result is obtained with only a minor increase in computational cost
The experimental results suggest that the difference between the two decoding scenarios is minimal
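For reference, a relative error rate reduction is conventionally computed as (baseline - improved) / baseline. The sketch below uses made-up error rates purely to show the arithmetic; they are not the figures behind the percentages reported on this slide.

```python
def relative_error_reduction(baseline_wer, improved_wer):
    """Relative error rate reduction, in percent."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical error rates, for illustration only.
    print(relative_error_reduction(20.0, 10.0))
```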
STEREO-BASED STOCHASTIC MAPPING FOR ROBUST SPEECH RECOGNITION
Mohamed Afify, Xiaodong Cui, and Yuqing Gao
IBM T.J. Watson Research Center, 1101 Old Kitchawan Road, Yorktown Heights, NY
SPE-L3: Robust Features and Acoustic Modeling
11 Introduction
The idea is to build a GMM for the joint distribution of the clean and noisy channels during training and to use an iterative compensation algorithm during testing
–The mapping can also be interpreted as a mixture of linear transforms that are estimated in a special way using stereo data
–The clean and noisy channels are stacked to form a large augmented space, and a statistical model is built in this new space
The observed noisy speech and the augmented statistical model are used to predict the clean speech
12 Algorithm Formulation
Assume we have a set of stereo data {(x_i, y_i)}, where x is the clean channel and y is the noisy channel
Define z ≡ (x, y) as the concatenation of the two channels
The first step in constructing the mapping is training the joint probability model for p(z)
Once this model is constructed, it can be used during testing to estimate the clean speech x given the noisy observations y
–The problem of estimating x resembles a mixture estimation problem
13 Algorithm Formulation (cont.)
Hence, an EM objective function is optimized iteratively
–The objective depends on the value of x from the previous iteration
14 Algorithm Formulation (cont.)
The update for x is obtained by differentiating the EM objective function with respect to x and setting the resulting derivative to zero
An interesting special case arises when x is a scalar
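The scalar case can be sketched as follows. The slide's equations were not preserved, so this is a reconstruction under stated assumptions rather than the authors' exact formulas: a joint GMM over z = (x, y) with scalar x and y, standard Gaussian conditioning for each component, component posteriors evaluated at the previous estimate of x, and the update obtained by setting the derivative of the EM objective to zero, which gives a posterior- and precision-weighted average of the conditional means. Initializing x from the noisy observation, the tuple layout of a component, and all function names are hypothetical.

```python
import math


def joint_density(x, y, comp):
    """Weighted bivariate Gaussian density of one GMM component at (x, y)."""
    weight, mx, my, sxx, sxy, syy = comp
    det = sxx * syy - sxy * sxy
    dx, dy = x - mx, y - my
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return weight * math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def map_estimate(y, components, num_iters=10):
    """Iteratively estimate clean x from noisy y using a joint GMM (scalar case)."""
    x = y  # assumption: start from the noisy observation
    for _ in range(num_iters):
        scores = [joint_density(x, y, comp) for comp in components]
        total = sum(scores)
        num = 0.0
        den = 0.0
        for score, (_, mx, my, sxx, sxy, syy) in zip(scores, components):
            gamma = score / total                    # posterior at previous x
            cond_mean = mx + sxy / syy * (y - my)    # linear transform of y
            cond_var = sxx - sxy * sxy / syy
            num += gamma * cond_mean / cond_var
            den += gamma / cond_var
        x = num / den  # zero-derivative update of the EM objective
    return x


if __name__ == "__main__":
    # Hypothetical components: (weight, mean_x, mean_y, var_x, cov_xy, var_y).
    gmm = [(0.5, 0.0, 0.0, 1.0, 0.5, 1.0), (0.5, 4.0, 4.0, 1.0, 0.5, 1.0)]
    print(map_estimate(1.0, gmm))
```

Each conditional mean is a linear function of y, so the estimate is a soft combination of per-component linear transforms, which matches the "mixture of linear transforms" interpretation given in the introduction.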
15 Experimental Results
Digit recognition in the car
–The first three lines refer to train/test conditions, where clean refers to the CT and noisy to the HF
–The proposed mapping outperforms SPLICE for all GMM sizes, with the difference decreasing as the GMM size increases
–Both methods are considerably better than the VTS result in the last row of Table 1
–Using a time window gives an improvement over the baseline SSM at a slight runtime cost
–These results are not given for SPLICE because using biases requires that the input and output spaces have the same dimensions
16 Experimental Results (cont.)
English large vocabulary speech recognition
–Results are shown for the MFCC feature and the LDA+MLLT feature
–SSM brings considerable improvement over MST even in the clean speech condition
–Building maps for the final feature space (after LDA and MLLT) looks to be slightly better than for the original cepstral space
Combination of Recognizers and Fusion of Features Approach to Missing Data ASR Under Non-Stationary Noise Conditions
Neil Joshi and Ling Guan
Department of Electrical and Computer Engineering, Ryerson University, Toronto ON M5B 2K3, Canada
SPE-P14: Robustness II
18 Introduction
This paper proposes a method to enhance speech recognition performance using missing data techniques under non-stationary noise conditions
–More resilient feature sets are incorporated into the decoding process
–Two separate HMM based models are used: one with spectral features, the other with MFCC features
The statistical dependencies between the models are based upon a coupled HMM methodology, the Fused HMM model
–One model uses standard ASR techniques (traditional MFCC based HMM models)
–The other is missing data based (missing data theory spectral HMM models)
19 Coupled Fused HMM
The fused HMM models the relationship between the two HMMs using a probabilistic fusion model
The statistical dependencies between the two HMM processes are expressed through this fusion model
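The slide's dependency equation was not preserved. As a rough illustration only, the sketch below assumes a factorization often associated with fused HMMs, p(O1, O2) ≈ p(O1) · p(O2 | Û1), where Û1 is the most likely state sequence of the first HMM and the second stream is scored frame by frame given those states. Whether the paper uses exactly this form is an assumption. The function name, the discrete coupling table, and the inputs (a precomputed log p(O1) and a precomputed state path) are hypothetical.

```python
import math


def fused_log_likelihood(log_p_o1, state_path, o2_symbols, coupling):
    """log p(O1) + sum_t log p(o2_t | u1_t) under the assumed factorization.

    log_p_o1: log-likelihood of stream 1 under its own HMM (assumed given)
    state_path: most likely state sequence of HMM 1 (assumed given)
    o2_symbols: discrete observations of stream 2, one per frame
    coupling: coupling[state][symbol] = p(symbol | state), hypothetical table
    """
    total = log_p_o1
    for state, symbol in zip(state_path, o2_symbols):
        total += math.log(coupling[state][symbol])
    return total


if __name__ == "__main__":
    table = [[0.5, 0.5], [0.25, 0.75]]
    print(fused_log_likelihood(math.log(0.5), [0, 1], [1, 0], table))
```

In this form the second stream contributes evidence only through the state path of the first model, which keeps decoding close in cost to running a single HMM.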
20 Experimental Results
The fused decoder is found to significantly increase recognition performance over the conventional missing data decoding process