A Further Study of Histogram Equalization for Noisy Speech Recognition
An Improved Histogram Equalization Approach for Robust Speech Recognition
2012/05/22
Presenter: 汪逸婷
Shih-Hsiang Lin, Yao-Ming Yeh, Berlin Chen
Department of Computer Science & Information Engineering, National Taiwan Normal University
2 Outline
– Introduction
– Review of Conventional Histogram Equalization (HEQ) Approaches
– Proposed Polynomial-Fit Histogram Equalization (PHEQ) Approach
– Integration with Other Robustness Techniques
– Experimental Setup and Results
– Conclusions and Future Work
3 Introduction
Varying environmental effects lead to a severe mismatch between the acoustic conditions of the training and test speech data
– Accordingly, the performance of an automatic speech recognition (ASR) system can degrade dramatically
Techniques dealing with this issue generally fall into three categories
– Speech Enhancement: Spectral Subtraction (SS), Wiener Filtering (WF), etc.
– Robust Speech Features or Feature Normalization: Cepstral Mean Subtraction (CMS), Cepstral Mean and Variance Normalization (CMVN), etc.
– Acoustic Model Adaptation: Maximum a Posteriori (MAP) adaptation, Maximum Likelihood Linear Regression (MLLR), etc.
4 Introduction (cont.)
A Simplified Distortion Framework
– Channel effects (convolutional noise) are usually assumed to be constant while uttering
– Additive background noise can be either stationary or non-stationary
[Diagram: clean speech passes through the channel effect and is corrupted by additive background noise, yielding noisy speech]
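The simplified framework above is commonly written as y[n] = x[n] * h[n] + d[n]: clean speech convolved with the channel response, plus additive noise. The following is a minimal synthetic sketch of that model; the toy signal, the 3-tap channel, and the 10 dB SNR target are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean speech" signal (purely illustrative).
clean = np.sin(2 * np.pi * 0.01 * np.arange(400))

# Channel effect: assumed constant while uttering, modeled here as a
# short FIR filter (convolutional noise).
channel = np.array([1.0, 0.5, 0.25])
convolved = np.convolve(clean, channel, mode="same")

# Additive background noise; here stationary white noise, scaled so the
# resulting SNR is exactly 10 dB.
signal_power = np.mean(convolved ** 2)
noise = rng.normal(size=convolved.shape)
noise *= np.sqrt(signal_power / (10 ** (10 / 10)) / np.mean(noise ** 2))

noisy = convolved + noise  # noisy speech = clean * channel + additive noise

snr_db = 10 * np.log10(np.mean(convolved ** 2) / np.mean(noise ** 2))
```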
5 Introduction (cont.)
Non-linear Environmental Distortions
– Clean speech was corrupted by 10 dB subway noise
– Not only linear but also non-linear distortions were involved
6 Introduction (cont.)
A constraint of the conventional CMN and CMVN approaches is their linearity
– Linear distortions can be dealt with effectively; however, non-linear environmental distortions cannot be adequately compensated
Recently, histogram equalization (HEQ) approaches have been widely investigated for compensating non-linear environmental effects
– HEQ attempts not only to match the means/variances of the speech features but to completely match the feature distributions of the training and test data
– Superior performance gains have been demonstrated
7 Roots of HEQ
HEQ is a general non-parametric method that makes the cumulative distribution function (CDF) of some given data match a reference one
– E.g., equalizing the CDF of the test speech to that of the training (reference) speech
[Figure: a transformation function mapping the CDF of the test speech onto the CDF of the reference speech; both CDFs rise to 1.0]
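The mapping sketched above can be written as x̂ = C_ref⁻¹(C_test(x)): push each test value through its own CDF, then through the inverse CDF of the reference data. A minimal sketch with empirical CDFs follows; the synthetic data, the tanh distortion, and the helper `empirical_cdf` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Reference (training) and test samples of one feature dimension; the
# test data is nonlinearly distorted relative to the reference.
reference = rng.normal(loc=0.0, scale=1.0, size=5000)
test = np.tanh(rng.normal(loc=0.0, scale=1.0, size=2000)) * 3 + 0.5

def empirical_cdf(sorted_data, x):
    """Empirical CDF of the data behind sorted_data, evaluated at x."""
    return np.searchsorted(sorted_data, x, side="right") / len(sorted_data)

sorted_test = np.sort(test)
sorted_ref = np.sort(reference)

# HEQ: map each test value through the test CDF, then through the
# inverse CDF (quantile function) of the reference distribution.
p = empirical_cdf(sorted_test, test)
equalized = np.quantile(sorted_ref, np.clip(p, 0.0, 1.0))
```

After equalization, the test features follow the reference distribution, so not only their means and variances but their full histograms match.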
8 Practical Implementation of HEQ
Because only a finite number of speech features is available, cumulative histograms are used instead of the true CDFs
HEQ can be simply implemented by table lookup (THEQ)
– e.g., a table of {(quantile i, restored feature value)} pairs
– To achieve good performance, the table sizes cannot be too small, which requires considerable disk storage
– Table lookup is also time-consuming
9 Quantile-based Histogram Equalization (QHEQ)
QHEQ calibrates the CDF of each feature vector component of the test data to that of the training data in a quantile-corrective manner
– Instead of fully matching the cumulative histograms, a parametric transformation function is used
– For each sentence, the optimal parameters of the transformation function must be obtained from the quantile-correction step
– An exhaustive online grid search is required, which is time-consuming
10 Polynomial-Fit Histogram Equalization (PHEQ)
We propose to use least-squares regression to fit the inverse of the CDF of the training speech
– For each speech feature vector dimension of the training data, given a feature value y_t and its corresponding CDF value p_t, the inverse CDF can be approximated by a polynomial function
  G(p_t) = a_0 + a_1 p_t + a_2 p_t^2 + … + a_M p_t^M
– The corresponding squared error is
  E = Σ_t ( y_t − G(p_t) )^2
– The coefficients a_0, …, a_M can be estimated by minimizing this squared error
11 PHEQ (cont.)
Implementation details
– For each feature vector dimension, the CDF value of each frame can be estimated using the following steps:
  – The feature values y_1, …, y_T are sorted in ascending order
  – The CDF value of each frame is approximated by p_t ≈ Idx(y_t) / T, where Idx(·) is an indication function giving the position of y_t in the sorted data
– During recognition, the CDF value of each test frame is estimated in the same way and taken as the input to the corresponding inverse function G to obtain a restored feature component
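The training and test steps above can be sketched as follows. This is not the authors' implementation: the data is synthetic, the (rank + 0.5)/T CDF estimate is a common smoothed variant of the Idx(y_t)/T approximation on the slide, and `np.polyfit` stands in for the least-squares polynomial regression.

```python
import numpy as np

rng = np.random.default_rng(2)

# One feature dimension of "training" data (illustrative stand-in for MFCCs).
train = rng.normal(size=4000)

ORDER = 7  # 7th-order polynomial, as chosen in the experiments

# Training phase: sort the values, approximate each frame's CDF value from
# its rank in the sorted data, then fit a polynomial G that maps CDF values
# back to feature values (i.e., the inverse CDF).
rank = np.argsort(train)
cdf = np.empty_like(train)
cdf[rank] = (np.arange(len(train)) + 0.5) / len(train)
coeffs = np.polyfit(cdf, train, ORDER)  # least-squares regression

def restore(test_values):
    """Test phase: estimate each test frame's CDF value from the test data
    itself, then evaluate the fitted inverse CDF G at that value."""
    idx = np.argsort(test_values)
    p = np.empty_like(test_values)
    p[idx] = (np.arange(len(test_values)) + 0.5) / len(test_values)
    return np.polyval(coeffs, p)

# Nonlinearly distorted test data; after PHEQ its distribution should
# closely match the training distribution.
test = np.tanh(rng.normal(size=1500)) * 2.0 - 0.3
restored = restore(test)
```

Compared with table lookup, only the M + 1 polynomial coefficients per dimension need to be stored, and restoration is a single polynomial evaluation per frame.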
12 Polynomial-Fit Histogram Equalization (cont.)
Although, as will be shown, PHEQ is effective
– Some undesired sharp peaks or valleys, caused by non-stationary noise or introduced during equalization, cannot be well compensated by HEQ alone
13 Temporal Averaging (TA)
Several approaches using moving averages of the temporal information were also investigated
– Non-Causal Moving Average (MA)
– Causal Moving Average
– Non-Causal Auto-Regressive Moving Average (ARMA)
– Causal Auto-Regressive Moving Average
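Two of the smoothers above can be sketched as follows. The exact recursions used in the slides are not given, so the ARMA form below (an MVA-style smoother that feeds back its own past outputs over a span of m frames) is an assumption for illustration.

```python
import numpy as np

def non_causal_ma(x, m):
    """Non-causal moving average over a symmetric window of span m:
    y[t] = mean(x[t-m : t+m+1]), with the window clamped at the edges."""
    t = np.arange(len(x))
    lo = np.maximum(t - m, 0)
    hi = np.minimum(t + m, len(x) - 1)
    return np.array([x[a:b + 1].mean() for a, b in zip(lo, hi)])

def non_causal_arma(x, m):
    """Non-causal ARMA smoother (assumed form): the AR part reuses the m
    previously smoothed outputs, the MA part looks m frames ahead:
    y[t] = (y[t-m] + ... + y[t-1] + x[t] + ... + x[t+m]) / (2m + 1)."""
    y = x.astype(float).copy()
    for t in range(m, len(x) - m):
        y[t] = (y[t - m:t].sum() + x[t:t + m + 1].sum()) / (2 * m + 1)
    return y

# A feature trajectory with one sharp spurious peak, e.g. from a
# non-stationary noise burst; both smoothers flatten it out.
traj = np.zeros(50)
traj[25] = 10.0
ma = non_causal_ma(traj, 3)
smoothed = non_causal_arma(traj, 3)
```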
14 Block Diagram of Proposed Approach
Training phase: speech signal → feature extraction (MFCC) → calculation of CDFs over the training data → estimation of polynomial coefficients
Test phase: speech signal → feature extraction (MFCC) → calculation of CDFs over the test data → polynomial regression (using the stored coefficients) → temporal averaging → PHEQ feature vector
15 Experimental Setup
The speech recognition experiments were conducted under various noise conditions using the Aurora-2 database and task
– Front-end speech analysis: 39-dimensional feature vectors were extracted at each time frame, comprising 12 MFCCs + log energy plus the corresponding delta and acceleration coefficients
– Back-end recognizer: the HTK toolkit was used to train the acoustic models
  – Each digit model was a left-to-right continuous-density HMM with 16 states (3 diagonal-covariance Gaussian mixtures per state)
  – Two additional silence models were defined: short pause (1 state, 6 Gaussians) and silence (3 states, 6 Gaussians per state)
16 Experimental Results: PHEQ
[Table: average word error rates (WERs) w.r.t. different amounts of training data (all training data vs. various numbers of quantiles) and different polynomial orders (3, 5, 7, 9) used in estimating the inverse CDFs, under clean- and multi-condition training; the numeric entries are not recoverable from this copy]
– WER improves slightly as the order of the polynomial regression becomes higher
– 100 quantiles and a 7th-order polynomial were used in the following experiments
17 Experimental Results: PHEQ-TA
[Table: average WERs for PHEQ combined with different temporal averaging techniques (non-causal/causal MA and ARMA) and different span orders, under clean- and multi-condition training; the numeric entries are not recoverable from this copy]
– The non-causal ARMA smoother yields the best performance
– In clean-condition training, it provides a relative improvement of about 20% over PHEQ alone
18 Experimental Results: PHEQ-TA
[Table: average WERs (Sets A, B, C, and overall average) of different feature normalization approaches under clean-condition training: MFCC baseline, AFE, CMVN, MS+VN+ARMA(3), THEQ, QHEQ, PHEQ, PHEQ-TA; the numeric entries are not recoverable from this copy]
– PHEQ provides significant performance boosts over the baseline MFCC system
– It is also better than CMVN and competitive with THEQ and QHEQ
19 Experimental Results: PHEQ-TA (cont.)
[Table: average WERs (Sets A, B, C, and overall average) of the same feature normalization approaches under multi-condition training; the numeric entries are not recoverable from this copy]
– In multi-condition training, PHEQ also provides consistently better results, as it does in clean-condition training
20 Integration with Other Robustness Techniques
Finally, we integrated our proposed feature normalization approach with two conventional feature de-correlation and compensation techniques
– Heteroscedastic Linear Discriminant Analysis (HLDA) and Maximum Likelihood Linear Transform (MLLT)
  – HLDA and MLLT were applied directly to the Mel-frequency filter-bank outputs
  – HLDA is used for dimensionality reduction and MLLT for feature de-correlation
– Stereo-based Piecewise LInear Compensation for Environments (SPLICE)
  – The piecewise linearity is intended to approximate the true non-linear relationship between clean and corresponding noisy utterances
  – It provides accurate estimates of the bias (correction) vectors without the need for an explicit noise model
  – SPLICE is a frame-based bias-removal algorithm
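A toy sketch of the SPLICE idea described above, not the published algorithm: partition the noisy feature space into regions and learn one correction (bias) vector per region from stereo clean/noisy data, so that the piecewise-constant corrections approximate the non-linear clean-from-noisy mapping. The region function, the synthetic stereo data, and the 2-D feature space are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stereo (paired clean/noisy) training frames in a 2-D feature space with
# a non-linear clean-to-noisy relationship (synthetic, for illustration).
clean = rng.normal(size=(2000, 2))
noisy = clean + 0.5 * np.tanh(clean) + 0.1 * rng.normal(size=clean.shape)

def region(y):
    """Crude partition of the noisy feature space (by sign pattern);
    the real SPLICE uses GMM posteriors over many regions."""
    return (y[:, 0] > 0).astype(int) * 2 + (y[:, 1] > 0).astype(int)

# Learn one bias/correction vector per region from the stereo data.
labels = region(noisy)
biases = np.array([(clean[labels == s] - noisy[labels == s]).mean(axis=0)
                   for s in range(4)])

def splice_like(y):
    """Frame-based bias removal: add the correction vector of each
    frame's region -- a piecewise approximation of the true mapping."""
    return y + biases[region(y)]

# Applying the corrections shrinks the distance to the clean features.
test_noisy = clean[:200] + 0.5 * np.tanh(clean[:200])
corrected = splice_like(test_noisy)
err_before = np.abs(test_noisy - clean[:200]).mean()
err_after = np.abs(corrected - clean[:200]).mean()
```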
21 Integration with Other Robustness Techniques (cont.)
[Table: average WERs (Sets A, B, C, and overall average) achieved by combining different normalization, de-correlation, and compensation approaches (HLDA-MLLT+CMVN, HLDA-MLLT+PHEQ-TA, SPLICE+CMVN, SPLICE+PHEQ-TA) under clean- and multi-condition training; the numeric entries are not recoverable from this copy]
– Both the feature de-correlation technique (HLDA-MLLT) and the feature compensation technique (SPLICE) achieve significant further performance gains when combined with PHEQ-TA
22 Conclusions and Future Work
The HEQ approaches to feature normalization were extensively investigated and compared
– We have proposed the use of data-fitting schemes to efficiently approximate the inverse of the CDF of the training speech for HEQ
– Further investigation of PHEQ is currently underway
Different moving-average methods were also exploited to alleviate the influence of sharp peaks and valleys
Combinations with other feature de-correlation and compensation techniques also demonstrated very encouraging results