ICASSP 2006 Robustness Techniques Survey ShihHsiang 2006.

Slides:



Advertisements
Similar presentations
Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition Speaker: Chang-wen Hsu Advisor: Lin-shan Lee 2007/02/08.
Advertisements

Robust Speech recognition V. Barreaud LORIA. Mismatch Between Training and Testing n mismatch influences scores n causes of mismatch u Speech Variation.
Combining Heterogeneous Sensors with Standard Microphones for Noise Robust Recognition Horacio Franco 1, Martin Graciarena 12 Kemal Sonmez 1, Harry Bratt.
Acoustic Model Adaptation Based On Pronunciation Variability Analysis For Non-Native Speech Recognition Yoo Rhee Oh, Jae Sam Yoon, and Hong Kook Kim Dept.
Pitch Prediction From MFCC Vectors for Speech Reconstruction Xu shao and Ben Milner School of Computing Sciences, University of East Anglia, UK Presented.
Distribution-Based Feature Normalization for Robust Speech Recognition Leveraging Context and Dynamics Cues Yu-Chen Kao and Berlin Chen Presenter : 張庭豪.
HIWIRE MEETING Paris, February 11, 2005 JOSÉ C. SEGURA LUNA GSTC UGR.
AN INVESTIGATION OF DEEP NEURAL NETWORKS FOR NOISE ROBUST SPEECH RECOGNITION Michael L. Seltzer, Dong Yu Yongqiang Wang ICASSP 2013 Presenter : 張庭豪.
Advances in WP1 Turin Meeting – 9-10 March
Advances in WP1 Nancy Meeting – 6-7 July
HIWIRE MEETING Nancy, July 6-7, 2006 José C. Segura, Ángel de la Torre.
HIWIRE MEETING Torino, March 9-10, 2006 José C. Segura, Javier Ramírez.
Single-Channel Speech Enhancement in Both White and Colored Noise Xin Lei Xiao Li Han Yan June 5, 2002.
MODULATION SPECTRUM EQUALIZATION FOR ROBUST SPEECH RECOGNITION Source: Automatic Speech Recognition & Understanding, ASRU. IEEE Workshop on Author.
HIWIRE MEETING CRETE, SEPTEMBER 23-24, 2004 JOSÉ C. SEGURA LUNA GSTC UGR.
HIWIRE Progress Report Trento, January 2007 Presenter: Prof. Alex Potamianos Technical University of Crete Presenter: Prof. Alex Potamianos Technical University.
Modeling of Mel Frequency Features for Non Stationary Noise I.AndrianakisP.R.White Signal Processing and Control Group Institute of Sound and Vibration.
Advances in WP1 and WP2 Paris Meeting – 11 febr
HIWIRE MEETING Trento, January 11-12, 2007 José C. Segura, Javier Ramírez.
1 Speech Enhancement Wiener Filtering: A linear estimation of clean signal from the noisy signal Using MMSE criterion.
HIWIRE MEETING Athens, November 3-4, 2005 José C. Segura, Ángel de la Torre.
Robust Automatic Speech Recognition by Transforming Binary Uncertainties DeLiang Wang (Jointly with Dr. Soundar Srinivasan) Oticon A/S, Denmark (On leave.
EE513 Audio Signals and Systems Statistical Pattern Classification Kevin D. Donohue Electrical and Computer Engineering University of Kentucky.
Normalization of the Speech Modulation Spectra for Robust Speech Recognition Xiong Xiao, Eng Siong Chng, and Haizhou Li Wen-Yi Chu Department of Computer.
HMM-BASED PSEUDO-CLEAN SPEECH SYNTHESIS FOR SPLICE ALGORITHM Jun Du, Yu Hu, Li-Rong Dai, Ren-Hua Wang Wen-Yi Chu Department of Computer Science & Information.
A VOICE ACTIVITY DETECTOR USING THE CHI-SQUARE TEST
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Example Clustered Transformations MAP Adaptation Resources: ECE 7000:
Cepstral Vector Normalization based On Stereo Data for Robust Speech Recognition Presenter: Shih-Hsiang Lin Luis Buera, Eduardo Lleida, Antonio Miguel,
Speech Enhancement Using Spectral Subtraction
Chapter 14 Speaker Recognition 14.1 Introduction to speaker recognition 14.2 The basic problems for speaker recognition 14.3 Approaches and systems 14.4.
International Conference on Intelligent and Advanced Systems 2007 Chee-Ming Ting Sh-Hussain Salleh Tian-Swee Tan A. K. Ariff. Jain-De,Lee.
Reporter: Shih-Hsiang( 士翔 ). Introduction Speech signal carries information from many sources –Not all information is relevant or important for speech.
REVISED CONTEXTUAL LRT FOR VOICE ACTIVITY DETECTION Javier Ram’ırez, Jos’e C. Segura and J.M. G’orriz Dept. of Signal Theory Networking and Communications.
Robust Speech Feature Decorrelated and Liftered Filter-Bank Energies (DLFBE) Proposed by K.K. Paliwal, in EuroSpeech 99.
LOG-ENERGY DYNAMIC RANGE NORMALIZATON FOR ROBUST SPEECH RECOGNITION Weizhong Zhu and Douglas O’Shaughnessy INRS-EMT, University of Quebec Montreal, Quebec,
Jun-Won Suh Intelligent Electronic Systems Human and Systems Engineering Department of Electrical and Computer Engineering Speaker Verification System.
Authors: Sriram Ganapathy, Samuel Thomas, and Hynek Hermansky Temporal envelope compensation for robust phoneme recognition using modulation spectrum.
NOISE DETECTION AND CLASSIFICATION IN SPEECH SIGNALS WITH BOOSTING Nobuyuki Miyake, Tetsuya Takiguchi and Yasuo Ariki Department of Computer and System.
1 Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine.
Speaker Identification by Combining MFCC and Phase Information Longbiao Wang (Nagaoka University of Technologyh, Japan) Seiichi Nakagawa (Toyohashi University.
Speech Communication Lab, State University of New York at Binghamton Dimensionality Reduction Methods for HMM Phonetic Recognition Hongbing Hu, Stephen.
Voice Activity Detection based on OptimallyWeighted Combination of Multiple Features Yusuke Kida and Tatsuya Kawahara School of Informatics, Kyoto University,
Performance Comparison of Speaker and Emotion Recognition
A DYNAMIC APPROACH TO THE SELECTION OF HIGH ORDER N-GRAMS IN PHONOTACTIC LANGUAGE RECOGNITION Mikel Penagarikano, Amparo Varona, Luis Javier Rodriguez-
ICASSP 2007 Robustness Techniques Survey Presenter: Shih-Hsiang Lin.
Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation Authors: Naveen Parihar and Joseph Picone Inst. for Signal and Info.
RCC-Mean Subtraction Robust Feature and Compare Various Feature based Methods for Robust Speech Recognition in presence of Telephone Noise Amin Fazel Sharif.
Statistical Models for Automatic Speech Recognition Lukáš Burget.
1 Voicing Features Horacio Franco, Martin Graciarena Andreas Stolcke, Dimitra Vergyri, Jing Zheng STAR Lab. SRI International.
Survey of Robust Speech Techniques in ICASSP 2009 Shih-Hsiang Lin ( 林士翔 ) 1Survey of Robustness Techniques in ICASSP 2009.
Exploring the Use of Speech Features and Their Corresponding Distribution Characteristics for Robust Speech Recognition Shih-Hsiang Lin, Berlin Chen, Yao-Ming.
Feature Transformation and Normalization Present by Howard Reference : Springer Handbook of Speech Processing, 3.3 Environment Robustness (J. Droppo, A.
Flexible Speaker Adaptation using Maximum Likelihood Linear Regression Authors: C. J. Leggetter P. C. Woodland Presenter: 陳亮宇 Proc. ARPA Spoken Language.
A Study on Speaker Adaptation of Continuous Density HMM Parameters By Chin-Hui Lee, Chih-Heng Lin, and Biing-Hwang Juang Presented by: 陳亮宇 1990 ICASSP/IEEE.
Voice Activity Detection Based on Sequential Gaussian Mixture Model Zhan Shen, Jianguo Wei, Wenhuan Lu, Jianwu Dang Tianjin Key Laboratory of Cognitive.
1 LOW-RESOURCE NOISE-ROBUST FEATURE POST-PROCESSING ON AURORA 2.0 Chia-Ping Chen, Jeff Bilmes and Katrin Kirchhoff SSLI Lab Department of Electrical Engineering.
Feature Mapping FOR SPEAKER Diarization IN NOisy conditions
Statistical Models for Automatic Speech Recognition
ECE539 final project Instructor: Yu Hen Hu Fall 2005
Statistical Models for Automatic Speech Recognition
SMEM Algorithm for Mixture Models
A Tutorial on Bayesian Speech Feature Enhancement
EE513 Audio Signals and Systems
Missing feature theory
Anthor: Andreas Tsiartas, Prasanta Kumar Ghosh,
Wiener Filtering: A linear estimation of clean signal from the noisy signal Using MMSE criterion.
Speech / Non-speech Detection
Presented by Chen-Wei Liu
Presenter: Shih-Hsiang(士翔)
Combination of Feature and Channel Compensation (1/2)
Presentation transcript:

ICASSP 2006 Robustness Techniques Survey ShihHsiang 2006

PARAMETRIC NONLINEAR FEATURE EQUALIZATION FOR ROBUST SPEECH RECOGNITION Luz Garc’ıa, Jos’e C. Segura, Javier Ram’ırez, Angel de la Torre, Carmen Ben’ıtez Dpto. Teor’ıa de la Se˜nal, Telem’atica y Comunicaciones (TSTC) Universidad de Granada

3 Introduction HEQ have been successfully applied to deal with the nonlinear effect of the acoustic environment in the feature domain –Normalize the probability distributions of the features in such a way that the acoustic environment effects are (partially) removed HEQ still suffer from several limitations –Rely on a local estimation of the probability distributions of the features based on a reduced number of observations belonging to a single utterance to be equalized –Nonlinear transformation is based on mapping the global CDF of each feature into a reference one –The transformations are usually based on a component-by- component equalization of the feature vector, thus discarding any cross-information between features in the equalization process

4 Introduction (cont.) In this paper, a parametric nonlinear equalization technique is proposed –Relies on a two Gaussian model for the probability distribution of the features –And on a simple Gaussian classifier to label the input frames as belonging to the speech or non-speech classes Recognition experiments on the AURORA 4 database have been performed and the effectiveness of the algorithm is analyzed in comparison with other linear and nonlinear feature equalization techniques

5 Review Histogram Equalization For a given random variable y with probability density function, a function mapping into a reference distribution

6 Review Histogram Equalization (cont.) The relative content of non-speech frames is a cause of variability in the HEQ transformation –because an estimation of the global probability distribution is used, that takes into account both speech and non-speech frames

7 Review Histogram Equalization (cont.) The unwanted variability of the transformation induced by the variable proportion of non-speech frames of each utterance can be –Reduced by removing non-speech frames before the estimation of the transformation –Another possibility is to use different transformations for speech and non-speech frames Instead of using a transformation to map the global CDF’s of the features, we can build separate mappings for speech and non-speech frames –As an alternative, the propose the use of a parametric form of the equalization transform based on a two Gaussian mixture model The first Gaussian is used to represent non-speech frames, while the second one represents speech frames

8 Two-class parametric equalization For each class, a parametric linear transformation is defined to map the clean and noisy representation spaces The clean Gaussians for speech and non-speech frames can be estimated from the training database, while the noisy Gaussians should be estimated from the utterance to be equalized non-speech speech clean Gaussian noise noisy Gaussian noisy speech

9 Two-class parametric equalization (cont.) In order to select whether the current frame y is speech or non-speech, a voice activity detector could be used –Implies a hard decision between both linear transformations that could create discontinuities in the limit of the non-speech/speech decision Instead, a soft decision can be used The posterior probabilities P(n|y) and P(s|y) are obtained using a simple two-class Gaussian classifier on MFCC C0

10 Two-class parametric equalization (cont.) Training the two-class Gaussian classifier –Initially, those frames with C0 below the mean value are assigned to the non-speech class and those with C0 above the mean are assigned to the speech class –The EM algorithm is then iterated until convergence (usually, 10 iterations are enough) to obtain the final classifier This classifier is used to obtain the class probabilities P(n|y) and P(s|y) and also to obtain the mean and covariance matricesμ n,y 、 Σ n,y 、 μ s,y and Σ s,y for the non- speech and speech classes for the given noisy input utterance

11 Two-class parametric equalization (cont.) The two Gaussian model for the C0 and C1 cepstral coefficients (used as reference model) along with the histograms of the speech and non-speech frames for a set of clean utterances

12 Experimental Results The proposed parametric equalization algorithm has been tested on the AURORA4 (WSJ0) database –The recognition system used in all cases is based on continuous crossword triphone models with 3 tied states and a mixture of 6 Gaussians per state –The language model is the standard bigram for the WSJ0 task –A feature vector of 13 cepstral coefficients is used as the basic parameterization of the speech signal using C0 instead of the logarithmic energy –The baseline reference system (BASE) uses sentence-by-sentence subtraction of the mean values of each cepstral coefficient (CMS) –The parameters of the reference distribution have been obtained by averaging over the whole clean training set of utterances

13 Experimental Results First row (BASE) corresponds to the baseline system which is based on a simple CMS linear normalization technique. The second row (HEQ) shows the word error rates when using a standard quantile-based implementation of HEQ –relative word error reduction of 17.8% The performance of HEQ is clearly improved by PEQ as shown in the third row, with a relative word error reduction of 30.8%. This result is very close to the one obtained for the AFE, which yields a 31.4% reduction of the word error rate Moreover, PEQ outperforms AFE in half of the tests (i.e. 02, 06, 08, 09, 10, 11 and 13).

14 Conclusions and Future Work The transformation is based on a nonlinear interpolation of two independent linear transformations –The linear transformations are obtained using a simple Gaussian model for the classes of speech and non-speech features The technique evaluated on a complex continuous speech recognition task showing its competitive performance against linear and nonlinear feature equalization techniques like CMS and HEQ A study of influence of within class cross-correlations is currently under development

MODEL-BASED WIENER FILTER FOR NOISE ROBUST SPEECH RECOGNITION Takayuki Arakawa, Masanori Tsujikawa and Ryosuke Isotani Media and Information Research Laboratories, NEC Corporation, Japan

16 Introduction Various kinds of background noise exist in the real world –Therefore robustness against various kinds of noise is quite important. Several approaches have been proposed to deal with this issue –Signal-processing-based spectral enhancement Spectral Subtraction (SS), Wiener Filter (WF) Less computational costs, but needs many tuning costs depending on the kind of noise and signal-to-noise ratio (SNR) –Statistical-model-based noise adaptation The acoustic model i.e., a hidden Markov model (HMM), is adapted to the noisy environment It needs huge computational costs to adapt the distributions to a noisy environment

17 Introduction (cont.) –Statistical-model-based compensation Using Gaussian mixture model (GMM) The computational cost is still much more than that of the signal-processing-based spectral enhancement. In this paper, they proposed Model-Based Wiener filter (MBW) Concept

18 Proposed Method (Cont.) A GMM with K Gaussian distributions is used as knowledge of clean speech in the cepstrum domain MBW algorithm

19 Proposed Method (Cont.) The noisy speech signal X(t) is modeled as Step 1: Perform Spectral Subtraction (SS) Step 2: Derive the expected value of the clean speech noisy speechclean speechnoise temporary clean speech estimated noise Spectrum Domain Cepstrum Domain MMSE Estimation flooring parameter

20 Proposed Method (Cont.) Step 3: Calculate Wiener Gain Step 4: Get the final estimated clean speech smoothing parameter

21 Experiments and Results Experiment Condition –The Mel-frequency cepstral coefficients (MFCC) and their 1st and 2nd derivatives are used as feature value of speech (include C0) –The feature value for GMM is composed of a 13-dimensional MFCC only –The flooring parameter α is set at 0.1, and the smoothing parameter β is set at 0.98 The MBW method was tested on the Aurora2-J task –contains utterances (in Japanese) of consecutive digit string recorded in clean environments –The other conditions are the same as Aurora2

22 Experiments and Results (cont.) The performance of different mixture number of the GMM 5dB restaurant noise At the point of 128 or 256, it becomes saturated

23 Experiments and Results (cont.) The word accuracy for each SNR almost equivalent to that of the AFE.

24 Experiments and Results (cont.) The Word Accuracy over the SNR for each kind of noise. These results show that the proposed method is much more robust than the AFE against various kinds of noise.

25 Conclusions Review MBW algorithm –Roughly estimates clean speech signals using SS –Compensates them using a GMM to improve robustness against non-stationary noise –The compensated speech signal is used to calculate the Wiener gain –Performing Wiener filtering The results show that the proposed method performs as well as the ETSI AFE These results demonstrate that the proposed method is robust against various kinds of noise