Survey of Robust Speech Techniques in ICASSP 2009
Shih-Hsiang Lin (林士翔)




Introduction

Stereo-based stochastic mapping (SSM) is a front-end, data-driven technique for noise robustness
– It assumes a joint GMM in the stereo feature space
– The mapping between clean and noisy features is estimated from the GMM to compensate the noisy features

SSM can be estimated under various criteria
– Maximum A Posteriori (MAP) → iteratively optimized
– Minimum Mean Square Error (MMSE) → closed-form solution

Moreover, the SSM-compensated features are further modeled by multi-style MPE training

Noise Robustness in feature space (1/2)

Compared to model-space robust speech techniques, feature-space noise-robust techniques have the advantages of
– low computational complexity
– being easy to decouple from the acoustic model

Front-end computation of an IBM LVCSR system with MFCC features
– The computation evolves through various feature spaces: the linear spectral space, Mel spectral space, cepstral space, and a discriminatively trained feature space

Noise Robustness in feature space (2/2)

Depending on the nature of the algorithm, feature-space noise-robust techniques apply compensation in different spaces
– spectral subtraction → linear spectral space
– phase-sensitive feature enhancement → log Mel spectral space
– data-driven approaches → can be flexibly applied to different feature spaces (e.g., MFCC, LDA, or fMPE)
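As an illustration of compensation in the linear spectral domain, a minimal magnitude spectral-subtraction sketch (synthetic magnitudes; `noise` is an assumed stationary noise estimate, not one of the paper's methods):

```python
import numpy as np

# Illustrative sketch, not from the paper: subtract an assumed noise
# magnitude estimate from noisy spectra, flooring to avoid negative values.
def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Magnitude spectral subtraction with a spectral floor."""
    cleaned = noisy_mag - noise_mag
    return np.maximum(cleaned, floor * noisy_mag)

rng = np.random.default_rng(0)
clean = np.abs(rng.normal(size=(10, 64)))   # fake clean magnitude spectra
noise = 0.5 * np.ones((10, 64))             # assumed stationary noise level
noisy = clean + noise
enhanced = spectral_subtract(noisy, noise)
```

The floor keeps the subtraction from producing negative magnitudes, which would otherwise introduce "musical noise" artifacts.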

SSM and Discriminative Training (1/6)

SSM is based on stereo features that are concatenations of clean speech feature vectors and noisy speech feature vectors
Define z = [x, y] as the joint stereo feature vector, where x is the clean and y the noisy feature vector. A GMM p(z) = Σ_k c_k N(z; μ_{z,k}, Σ_{z,k}) is assumed and trained by the EM algorithm
– x and y are obtained by fMPE training on the LDA features
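A minimal sketch of the stereo construction, fitting a single joint Gaussian (one component of the assumed GMM) to synthetic parallel clean/noisy features; all data and dimensions are toy assumptions:

```python
import numpy as np

# Build stereo vectors z = [x; y] from parallel clean (x) and noisy (y)
# features, then fit one joint Gaussian and partition it into the blocks
# (mu_x, mu_y, S_xx, S_xy, S_yy) used by the mapping formulas.
rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 4))             # "clean" features (synthetic)
y = x + 0.5 * rng.normal(size=(1000, 4))   # parallel "noisy" features
z = np.hstack([x, y])                      # stereo feature vectors

mu = z.mean(axis=0)
Sigma = np.cov(z, rowvar=False)
d = x.shape[1]
mu_x, mu_y = mu[:d], mu[d:]
S_xx, S_xy = Sigma[:d, :d], Sigma[:d, d:]
S_yx, S_yy = Sigma[d:, :d], Sigma[d:, d:]
```

The block structure of the joint mean and covariance is what the MMSE and MAP estimates on the following slides operate on.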

SSM and Discriminative Training (2/6)
MMSE-based SSM

Given the observed noisy speech feature y, the MMSE estimate of the clean speech is given by

x̂_MMSE = E[x|y] = Σ_k p(k|y) [μ_{x,k} + Σ_{xy,k} Σ_{yy,k}^{-1} (y − μ_{y,k})]

where p(k|y) is the posterior probability of mixture component k against p(y), the marginal noisy-speech distribution of the joint stereo distribution
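A sketch of the MMSE mapping for a single Gaussian component (the full SSM estimate sums such terms weighted by the mixture posteriors p(k|y)); all data is synthetic:

```python
import numpy as np

# One-component MMSE estimate: E[x|y] = mu_x + S_xy S_yy^{-1} (y - mu_y).
# Joint statistics are fitted on toy parallel clean/noisy features.
rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 4))
y = x + 0.5 * rng.normal(size=(1000, 4))
z = np.hstack([x, y])
mu = z.mean(axis=0)
Sigma = np.cov(z, rowvar=False)
d = 4
mu_x, mu_y = mu[:d], mu[d:]
S_xy, S_yy = Sigma[:d, d:], Sigma[d:, d:]

def mmse_estimate(y_obs):
    """Conditional mean of the clean feature given the noisy one."""
    return mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)

x_hat = np.array([mmse_estimate(yi) for yi in y])
err_noisy = np.mean((y - x) ** 2)       # error of using y directly
err_mmse = np.mean((x_hat - x) ** 2)    # error after compensation
```

On this toy data the compensated features are closer to the clean ones than the raw noisy features, which is the intended effect of the closed-form MMSE solution.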

SSM and Discriminative Training (3/6)
MAP-based SSM

Given the observed noisy speech feature y, the MAP estimate of the clean speech is given by

x̂_MAP = argmax_x p(x|y)

This equation can be solved using the EM algorithm, which results in an iterative estimation process: at each iteration, the mixture posteriors are recomputed with the current estimate and the per-component conditional means are recombined
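A hedged sketch of that iterative estimation with a two-component joint GMM: at each iteration the posteriors are recomputed with the current estimate and the per-component conditional means are combined, precision-weighted. All component parameters are toy values, not the paper's:

```python
import numpy as np

d = 2
# Two joint components with different clean means; noisy mean = clean + 0.5.
mus = [np.concatenate([m, m + 0.5]) for m in (np.zeros(d), 3 * np.ones(d))]
Sig = np.eye(2 * d) + 0.8 * np.kron(np.array([[0, 1], [1, 0]]), np.eye(d))
weights = np.array([0.5, 0.5])

def cond_moments(mu, S, y):
    """Mean and covariance of x | y for one joint Gaussian component."""
    mx, my = mu[:d], mu[d:]
    Sxx, Sxy, Syy = S[:d, :d], S[:d, d:], S[d:, d:]
    mean = mx + Sxy @ np.linalg.solve(Syy, y - my)
    cov = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)
    return mean, cov

def log_gauss(v, mu, S):
    r = v - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (r @ np.linalg.solve(S, r) + logdet + len(v) * np.log(2 * np.pi))

def map_estimate(y, iters=3):
    x = y.copy()                            # initialize with the noisy feature
    for _ in range(iters):
        z = np.concatenate([x, y])          # posteriors use the current x
        logp = np.array([np.log(w) + log_gauss(z, m, Sig)
                         for w, m in zip(weights, mus)])
        gam = np.exp(logp - logp.max())
        gam /= gam.sum()
        # Precision-weighted combination of per-component conditional means.
        P = sum(g * np.linalg.inv(cond_moments(m, Sig, y)[1])
                for g, m in zip(gam, mus))
        b = sum(g * np.linalg.inv(cond_moments(m, Sig, y)[1])
                @ cond_moments(m, Sig, y)[0] for g, m in zip(gam, mus))
        x = np.linalg.solve(P, b)
    return x

y_obs = np.array([0.6, 0.4])                # observation near component 1
x_map = map_estimate(y_obs)
```

With the noisy feature as the starting point, the posteriors quickly lock onto the dominant component and the estimate converges in a few iterations, matching the slide's description of the EM-based process.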

SSM and Discriminative Training (4/6)
Mathematical Connections

The MMSE estimate of SSM is a special tying case of one iteration of the corresponding MAP estimate
– It assumes all Gaussians in the GMM share the same conditional covariance matrix
– This is a reasonable result of the “averaging” effect of the expectation in the MMSE estimate of SSM

Due to the iterative nature of the MAP estimate of SSM, an initial guess has to be made
– A natural choice would be the noisy speech feature itself
– or setting the MMSE estimate as the starting point

SSM and Discriminative Training (5/6)
Mathematical Connections

SPLICE is a special case of the MMSE estimate of SSM under the assumption that Σ_{xy,k} Σ_{yy,k}^{-1} is an identity matrix, which is equivalent to x and y having a perfect correlation
– SPLICE estimates the bias terms under the ML criterion

Li Deng also gives a connection between SPLICE and fMPE
– fMPE estimates its bias terms under the minimum phone error criterion

Both SPLICE and fMPE share a similar piece-wise linear structure with posterior probability weighting

SSM and Discriminative Training (6/6)
Mathematical Connections

Therefore, the overall MAP-based SSM estimation in the fMPE space, with the MMSE-based SSM estimate as the starting point, can be expressed as a composition of the two estimates
– This amounts to applying a sequence of posterior-probability-weighted piece-wise linear mappings to the noisy LDA features

After the stochastic mapping, the compensated features can be directly decoded by the clean acoustic models
– For better performance, an environment-adaptive multi-style discriminative re-training can be further applied (e.g., MPE)

Experimental Results (1/3)

LVCSR task (a vocabulary of 32k English words)
– Back-end
150 hrs / 55k Gaussians / 4.5k states (clean acoustic model)
300 hrs / 90k Gaussians / 5k states (multi-style acoustic model)
Noisy data are generated by adding a mix of humvee, tank, and babble noise to the clean data at around 15 dB
– Front-end
24-dim MFCCs (CMS) → super-vector (9 frames: 216 dims) → LDA 40 dims
– GMMs are trained on the noisy training data, and the mapping is SNR-specific
During testing, a GMM-based environment classifier is used to estimate the SNR of each sentence
– The proposed technique is evaluated on two test sets
Set A: 2070 utterances (around 1.7 hrs) recorded in a clean condition
Set B: 1421 utterances (around 1.2 hrs) recorded in a real-world noisy condition (with humvee noise running in the background at 5–8 dB)

Experimental Results (2/3)

All the MAP estimations are run for 3 iterations
SSM gives the same results for Set A after environment detection
As the acoustic model is discriminatively trained on clean speech, the baseline result on the Set B noisy data is very poor
– But SSM is able to significantly improve the results
Compared to SSM_MAP, SSM_MMSE→MAP reduces the WER relatively by 50%

Experimental Results (3/3)

The baseline with multi-style training in Table 2 improves in the noisy condition (Set B) but degrades in the clean condition (Set A)
When using the compensated features for multi-style training, the performance improves for both Set A and Set B
– It significantly reduces the WER in the noisy condition (Set B) while maintaining decent performance in the clean condition (Set A)

Summary and Discussion

SSM is a data-driven, feature-space noise-robust technique that exploits stereo data. Hence, it has its advantages and disadvantages
– Since it is data-driven and does not rely on a model for feature computation, it is quite flexible to apply to various speech features
e.g., MFCC, PLP, the linear or Mel spectral space, the cepstral space, the LDA and fMPE spaces, etc.
– However, stereo data is usually expensive to collect
A suboptimal alternative, as done in this paper, would be to artificially generate data for the noisy channel
– SSM, as a data-driven approach, relies on the noise in the training data and may not handle unseen noise very well


Introduction (1/2)

Recently, several techniques have been proposed which aim to exploit properties of the speech signal, e.g., that the spectral peaks are more robust to broad-band noise than the spectral valleys, or harmonicity information. Examples include techniques that
– lock the spectral peak-to-valley ratio to alleviate the mismatch between clean and noisy features caused by the spectral valleys being buried in noise
– append information on spectral peaks to the acoustic features
– modify the likelihood calculation with the aim of emphasizing the parts of the spectrum corresponding to peaks

In this paper, they investigate the incorporation of mask modeling into an HMM-based ASR system

Introduction (2/2)

As the mask expresses which spectro-temporal regions are uncorrupted by noise
– It can also be seen as a generalized and soft incorporation of the spectral peak information
– The mask model is associated with each HMM state and mixture
It expresses what mask information the state/mixture would expect to find in the signal
The mask modeling is performed by employing the Bernoulli distribution

The incorporation of mask modeling is evaluated in a standard model and in two models that compensate for the effect of the noise: a missing-feature model and a multi-condition training model

Incorporating Mask Modeling into HMM-based ASR System (1/6)

The HMM-based ASR system with the incorporation of mask modeling combines four terms: the emission probability, the mask-model probability, the HMM state transition probability, and the language model probability
– The emission term corresponds to the employment of missing-feature techniques
– The mask-model term expresses how likely the given mask M is to be generated by the HMM state sequence S
It serves as a penalization factor for states whose mask model is not in agreement with the mask extracted from the given signal

Incorporating Mask Modeling into HMM-based ASR System (2/6)

How can we estimate the mask model?
– Having an example of the noise: the mask model could be estimated based on masks obtained from the training data corrupted by the given noise
– Having no information about the noise: it could be estimated by using a mask reflecting some a-priori knowledge about speech, e.g., the fact that high-energy regions of speech spectra are less likely to be corrupted by noise

The estimation of the mask model is performed by a separate training procedure carried out after the HMMs have been trained

Incorporating Mask Modeling into HMM-based ASR System (3/6)

Estimating the mask model for HMM states
– Let m_t = (m_{t,1}, …, m_{t,D}) denote the mask vector at a given frame, where m_{t,d} is the binary mask information of channel d
– The mask-model probability for each HMM state s and mixture k is modeled by a multivariate Bernoulli distribution

p(m_t | s, k) = Π_d μ_{s,k,d}^{m_{t,d}} (1 − μ_{s,k,d})^{1 − m_{t,d}}

where μ_{s,k,d} is the parameter of the distribution
The parameters can be estimated by a Baum-Welch or Viterbi-style training procedure
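A small sketch of the multivariate Bernoulli mask likelihood for one state/mixture; the parameters below are toy values, not trained ones:

```python
import numpy as np

# Per-channel Bernoulli parameters for one state/mixture (toy values):
# high values mean the state expects those channels to be reliable (mask=1).
p = np.array([0.9, 0.8, 0.2, 0.1])

def mask_log_likelihood(mask, p, eps=1e-10):
    """log P(mask) under an independent Bernoulli per channel."""
    p = np.clip(p, eps, 1 - eps)
    return np.sum(mask * np.log(p) + (1 - mask) * np.log(1 - p))

agree = np.array([1, 1, 0, 0])      # mask agreeing with the model
disagree = np.array([0, 0, 1, 1])   # mask contradicting it
```

A mask that matches the state's expectation scores higher, which is exactly the penalization behavior described on the previous slide.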

Incorporating Mask Modeling into HMM-based ASR System (4/6)

The Viterbi algorithm is used to obtain the state-time alignment of the sequence of feature vectors on the HMMs
The posterior probability γ_t(s, k) that mixture component k (at state s) generated the feature vector at frame t is then calculated from the alignment
The parameters of the mask models are then estimated as posterior-weighted averages of the observed masks:

μ_{s,k,d} = Σ_t γ_t(s, k) m_{t,d} / Σ_t γ_t(s, k)
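The update above can be sketched as a posterior-weighted average; the masks and occupation posteriors here are random toy values rather than outputs of a real alignment:

```python
import numpy as np

# Toy stand-ins: binary mask frames and mixture occupation posteriors for
# one state/mixture, as would come from a Viterbi alignment.
rng = np.random.default_rng(3)
T, D = 200, 8
masks = (rng.random((T, D)) < 0.7).astype(float)   # binary masks per frame
gamma = rng.random(T)                              # occupation posteriors

# Posterior-weighted average of the masks -> Bernoulli parameters.
mu_hat = (gamma[:, None] * masks).sum(axis=0) / gamma.sum()
```

Because each mask entry is 0 or 1 and the weights are non-negative, the resulting parameters are guaranteed to be valid Bernoulli probabilities.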

Incorporating Mask Modeling into HMM-based ASR System (5/6)

Regions with a high value of the mask-model parameter reflect that the masks associated with the given state were often one for those regions, i.e., little affected by noise

Incorporating Mask Modeling into HMM-based ASR System (6/6)

The value of the mask probability, when incorporated in the overall probability calculation, may need to be scaled (akin to language model scaling)
– This is done by employing a sigmoid function
– The bigger the value of the scaling parameter, the greater the effect of the mask probability on the overall probability
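One plausible reading of this scaling, sketched under the assumption that the sigmoid is applied to the log mask probability with a tuning parameter `alpha` (both the form and the parameter name are assumptions, not taken from the paper):

```python
import numpy as np

def scaled_mask_score(log_mask_prob, alpha):
    """Sigmoid-compressed mask score; larger alpha -> stronger effect."""
    return 1.0 / (1.0 + np.exp(-alpha * log_mask_prob))

weak = scaled_mask_score(-1.0, alpha=0.5)    # mild penalization
strong = scaled_mask_score(-1.0, alpha=2.0)  # same mask, stronger weight
```

For the same (negative) log mask probability, a larger `alpha` pushes the score further below 0.5, i.e., penalizes disagreeing states more heavily.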

Experimental Results (1/5)

The experiments were carried out on the Aurora-2 database
– Frequency-filtered logarithmic filter-bank energies were used as the speech feature representation, due to their suitability for missing-feature based recognition
– The noisy speech data from Set A were used for the recognition experiments

Experimental Results (2/5)

Experimental Results (3/5)

Experimental Results (4/5)

Experimental Results (5/5)


Introduction

The idea of the feature mapping method is to obtain “enhanced” or “clean” features from the “noisy” features
– In theory, the mapping need not be performed between equivalent domains

In this paper
– They first investigate feature mapping between different domains under the MMSE criterion and regression optimizations
– Second, they investigate data-driven filtering for speech separation using the neural-network-based mapping method

Mapping Approach (1/3)

Assume that we know the directions of both the target and the interfering sound sources through the use of a microphone array
The mapping approach takes the two corresponding features and maps them to the features of the “clean” recordings
– To allow a non-linear mapping, they use a generic multilayer perceptron (MLP) with one hidden layer, estimating the feature vector of the clean speech

Mapping Approach (2/3)

The parameters are obtained by minimizing the mean squared error between the MLP output and the clean-speech features
– The optimal parameters can be found through the error back-propagation algorithm
– Note that training requires parallel recordings of clean and noisy data, while only the noisy features are required for the estimation of the clean data during testing
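A minimal numpy sketch of this training loop: a one-hidden-layer MLP fit by MSE back-propagation on synthetic parallel noisy/clean features (all sizes, data, and the learning rate are toy assumptions):

```python
import numpy as np

# Synthetic parallel data: "clean" targets and a stacked noisy input that
# stands in for the two beamformer-derived feature streams.
rng = np.random.default_rng(4)
N, Din, H, Dout = 512, 8, 16, 4
clean = rng.normal(size=(N, Dout))
A = 0.5 * rng.normal(size=(Din, Dout))
inputs = clean @ A.T + 0.1 * rng.normal(size=(N, Din))

W1 = 0.1 * rng.normal(size=(Din, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, Dout)); b2 = np.zeros(Dout)
lr = 0.05

def forward(X):
    h = np.tanh(X @ W1 + b1)       # one hidden layer, tanh non-linearity
    return h, h @ W2 + b2          # linear output for regression

_, out0 = forward(inputs)
mse0 = np.mean((out0 - clean) ** 2)

for _ in range(300):               # plain batch gradient descent on MSE
    h, out = forward(inputs)
    err = (out - clean) / N
    gW2 = h.T @ err;      gb2 = err.sum(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)   # back-propagate through tanh
    gW1 = inputs.T @ dh;  gb1 = dh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, out1 = forward(inputs)
mse1 = np.mean((out1 - clean) ** 2)
```

The training error drops from `mse0` to `mse1`, illustrating the parallel-data requirement: the targets (`clean`) are only needed during training, while at test time only `inputs` would be fed through `forward`.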

Mapping Approach (3/3)

Under the assumption that the distribution of the target data is Gaussian, minimizing the mean squared error follows from the principle of maximum likelihood
From the perspective of Blind Source Separation (BSS) and Independent Component Analysis (ICA)
– The principle of maximum likelihood is highly related to the minimization of mutual information between the clean sources
– Their methods, however, lead to a linear transformation, and the probability densities of the sources must be estimated correctly

Experimental Data and Setup (1/2)

The Multichannel Overlapping Numbers Corpus (MONC) was used for the speech recognition experiments
– There are four recording scenarios: S1 (no overlapping speech), S12 (1 competing speaker, L2), S13 (1 competing speaker, L3), S123 (2 competing speakers, L2 and L3)
– Training data: 6049 utterances; development: 2026 utterances; testing: 2061 utterances
– The MLP is trained on 2,000 utterances drawn from the development set (500 utterances of each recording scenario)

Experimental Data and Setup (2/2)

– In this paper, two delay-and-sum (DS) beamformer-enhanced speech signals are used
– The ASR front-end generates 12 MFCCs and log-energy, with the corresponding delta and acceleration coefficients

Feature Mapping Between Different Domains (1/3)

Three domains are selected as the input
– spectral amplitude, log Mel-filterbank energies (log MFBE), and Mel-frequency cepstral coefficients (MFCC)

As mentioned earlier, target data with a Gaussian distribution is optimal from the point of view of the MMSE criterion
– The PDFs of the amplitudes of the clean speech are far from Gaussian
– The PDFs of the log MFBEs are bi-modal (the lower mode may be due to the low-SNR segments)
– The PDFs of MFCCs have approximately Gaussian distributions
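The Gaussianity argument can be illustrated with a toy check (synthetic white-noise frames, not the MONC data; the cosine-weighted log-energy coefficient is only a crude stand-in for an MFCC): spectral amplitudes are visibly skewed, while the cepstrum-like quantity is much closer to symmetric.

```python
import numpy as np

rng = np.random.default_rng(5)
frames = rng.normal(size=(2000, 64))                  # synthetic signal frames
amp = np.abs(np.fft.rfft(frames, axis=1))             # spectral amplitudes
log_e = np.log(amp ** 2 + 1e-8)                       # log energies
ceps_like = np.real(np.fft.fft(log_e, axis=1))[:, 1]  # crude cepstral coeff

def skewness(v):
    """Sample skewness: near zero for a symmetric (e.g. Gaussian) PDF."""
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

skew_amp = skewness(amp[:, 1])    # Rayleigh-distributed: clearly skewed
skew_ceps = skewness(ceps_like)   # closer to symmetric
```

The cosine weighting sums many log-energy bins with mixed signs, which cancels the third moment and (by the central limit effect) pulls the distribution toward Gaussian, consistent with the slide's claim about MFCCs.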

Feature Mapping Between Different Domains (2/3)

Feature Mapping Between Different Domains (3/3)

In fact, mapping to MFCCs is more straightforward in the context of the ASR system, in which MFCCs are used as the features
Furthermore, MMSE in the MFCCs also results in MMSE in the delta coefficients (and likewise for the acceleration coefficients)

Experimental Results (1/2)

The mapping of the log MFBEs from the two DS-enhanced speech signals to MFCCs yields the best ASR performance, both without and with model adaptation
– The smaller dynamic range of the log MFBE vectors is advantageous for regression optimization

The gains from model adaptation are marginal
– The mapping methods evaluated are already very effective at suppressing the influence of interfering speakers on the extracted features

Experimental Results (2/2)