Exploring the Use of Speech Features and Their Corresponding Distribution Characteristics for Robust Speech Recognition. Shih-Hsiang Lin, Berlin Chen, Yao-Ming Yeh.


Exploring the Use of Speech Features and Their Corresponding Distribution Characteristics for Robust Speech Recognition. Shih-Hsiang Lin, Berlin Chen, Yao-Ming Yeh. Presenter: Wen-Yi Chu, Department of Computer Science & Information Engineering, National Taiwan Normal University

Outline
- Introduction
- Cluster-based polynomial-fit histogram equalization
- Polynomial-fit histogram equalization
- Experiments and results
- Summary
- Conclusions

Introduction (1/3)

The performance of current automatic speech recognition (ASR) systems often deteriorates radically when the input speech is corrupted by various kinds of noise. Broadly speaking, existing methods can be classified into two categories, according to whether they operate directly in the feature domain or instead exploit specific statistical characteristics of the features. Methods operating in the feature domain can be further divided into three subcategories: feature compensation, feature transformation, and feature reconstruction. Another school of thought seeks remedies based on noise-resistant statistical characteristics of the speech features rather than the feature values themselves.

Introduction (2/3)

Histogram equalization (HEQ) methods attempt not only to match the means and variances of the speech features but to completely match the feature distributions between the training and test speech data. Noise not only modifies the distributions of the speech features but also injects uncertainty into them, owing to its random behavior. However, most HEQ approaches can only handle the mismatch between the training and test conditions; few can deal with such uncertainty. We therefore expect that research conducted along these two directions can complement each other, making it possible to combine their individual merits and overcome their inherent limitations.

Introduction (3/3)

In this paper, we propose a cluster-based polynomial-fit histogram equalization (CPHEQ) approach, which makes use of both the speech features and their corresponding distribution characteristics for speech feature compensation. CPHEQ inherits the merits of the two directions above and uses a data-fitting technique, in a purely data-driven manner, to approximate the actual distributions without unrealistic assumptions about the form of the speech feature distributions.

Cluster-based polynomial-fit histogram equalization (CPHEQ) (1/5)

The basic idea behind CPHEQ stems from two distinct approaches. The first is stereo-based piecewise linear compensation for environments (SPLICE), which uses a Gaussian mixture model (GMM) to characterize the noisy feature space. SPLICE can fail to handle the nonlinear relationship between clean and noisy speech when the number of mixtures in the GMM is insufficient to characterize the noisy feature space. To avoid this shortcoming, we incorporate the idea of HEQ: HEQ uses nonlinear transformation functions to compensate for nonlinear distortions, exploiting the relationship between the cumulative distribution function (CDF) of the test speech and that of the corresponding training (or reference) speech.

Cluster-based polynomial-fit histogram equalization (CPHEQ) (2/5)

For CPHEQ, we first use the noisy speech data to train a GMM whose parameters are initialized by the k-means algorithm and then refined by the expectation-maximization (EM) algorithm. The GMM is expressed as follows:

p(y_t) = \sum_{k=1}^{K} P(k)\, \mathcal{N}(y_t; \mu_k, \Sigma_k)

Furthermore, we assume that the compensated feature vector \hat{x}_t can be derived as a posterior-weighted combination of per-mixture restorations:

\hat{x}_t = \sum_{k=1}^{K} P(k \mid y_t)\, \hat{x}_{t,k}    (1)

where the posterior probability is given by

P(k \mid y_t) = \frac{P(k)\, \mathcal{N}(y_t; \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} P(k')\, \mathcal{N}(y_t; \mu_{k'}, \Sigma_{k'})}
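The training recipe above (k-means initialization, then EM refinement of the noisy-feature GMM) can be sketched in a few lines. The following is an illustrative one-dimensional NumPy implementation, not the paper's code; the iteration counts and quantile-based seeding are assumptions made for the sketch:

```python
import numpy as np

def fit_gmm_1d(y, K=2, em_iters=50):
    """Fit a 1-D Gaussian mixture to noisy feature components:
    k-means initialization of the means, followed by EM refinement."""
    # k-means initialization (means seeded at quantiles to avoid empty clusters)
    mu = np.quantile(y, (np.arange(K) + 0.5) / K)
    for _ in range(20):
        assign = np.argmin(np.abs(y[:, None] - mu[None, :]), axis=1)
        mu = np.array([y[assign == k].mean() for k in range(K)])
    w = np.full(K, 1.0 / K)    # mixture weights P(k)
    var = np.full(K, y.var())  # per-mixture variances
    for _ in range(em_iters):
        # E-step: posterior P(k | y_t) for every component y_t
        dens = w * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances in closed form
        Nk = post.sum(axis=0)
        w, mu = Nk / len(y), (post * y[:, None]).sum(axis=0) / Nk
        var = (post * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
    return w, mu, var, post
```

The returned posteriors are exactly the P(k | y_t) terms used to weight the per-mixture restorations in Eq. (1).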

Cluster-based polynomial-fit histogram equalization (CPHEQ) (3/5)

The restored value \hat{x}_{t,k} given the k-th mixture is defined as the conditional expectation

\hat{x}_{t,k} = E[x_t \mid y_t, k]

Unlike SPLICE, which uses an additive bias r_k to approximate this conditional expectation for the k-th mixture (\hat{x}_{t,k} \approx y_t + r_k), we introduce the idea originating from HEQ. The restored value for the k-th mixture is therefore calculated as

\hat{x}_{t,k} = F_k^{-1}\big(C(y_t)\big) \approx \sum_{m=0}^{M} a_{k,m}\, \big[C(y_t)\big]^m    (2)

where F_k^{-1}(\cdot) is the inverse (or transformation) function, approximated here by a polynomial of order M, which maps each CDF value C(y_t) onto its corresponding predefined feature value for the k-th mixture.

Cluster-based polynomial-fit histogram equalization (CPHEQ) (4/5)

For the feature vector component sequence of a specific dimension of a speech utterance, the CDF value of each feature component can be computed approximately through the following two steps:
Step 1: The sequence is first sorted in ascending order according to the values of the feature vector components.
Step 2: The order-statistics-based approximation of the CDF value of a feature vector component y_t is then given as

\hat{C}(y_t) = \frac{r_t - 0.5}{T}

where r_t is the rank of y_t in the sorted sequence and T is the total number of components.

In the training phase, the coefficients of the polynomial function for the k-th mixture can be estimated with a set of stereo data by minimizing the squared error

E_k = \sum_{t} P(k \mid y_t)\, \Big( x_t - \sum_{m=0}^{M} a_{k,m}\, \big[\hat{C}(y_t)\big]^m \Big)^2

where x_t is the clean feature component paired with the noisy component y_t.
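The two-step CDF approximation and the posterior-weighted least-squares fit above can be sketched as follows. This is an illustrative NumPy version (the exact rank offset in the paper may differ); it relies on `np.polyfit`'s weight argument, which weights the unsquared residuals, hence the square root of the posteriors:

```python
import numpy as np

def order_stat_cdf(y):
    """Steps 1-2: rank each component in the sorted sequence, then map
    the 1-based rank r_t to the order-statistics estimate (r_t - 0.5) / T."""
    ranks = np.argsort(np.argsort(y)) + 1  # 1-based rank of each component
    return (ranks - 0.5) / len(y)

def fit_mixture_polynomials(clean, noisy, post, order=3):
    """Estimate per-mixture polynomial coefficients from stereo data by
    minimizing the posterior-weighted squared error between the clean
    targets x_t and a polynomial in the noisy components' CDF values."""
    cdf = order_stat_cdf(noisy)
    # np.polyfit minimizes sum_t (w_t * residual_t)^2, so pass sqrt(posterior)
    return [np.polyfit(cdf, clean, order, w=np.sqrt(post[:, k]))
            for k in range(post.shape[1])]
```

Each entry of the returned list holds the coefficients a_{k,M}, ..., a_{k,0} for one mixture, in the order expected by `np.polyval`.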

Cluster-based polynomial-fit histogram equalization (CPHEQ) (5/5)

In the test phase, each feature vector component y_t of the test speech is first used to estimate its corresponding CDF value \hat{C}(y_t), and then the restored value \hat{x}_t can be obtained by

\hat{x}_t = \sum_{k=1}^{K} P(k \mid y_t) \sum_{m=0}^{M} a_{k,m}\, \big[\hat{C}(y_t)\big]^m

In order to reduce the computation time, we use the maximum a posteriori (MAP) criterion and redefine Eqs. (1) and (2) with a hard decision:

\hat{k}_t = \arg\max_{k} P(k \mid y_t), \qquad \hat{x}_t \approx \sum_{m=0}^{M} a_{\hat{k}_t,m}\, \big[\hat{C}(y_t)\big]^m    (3)
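The soft (posterior-weighted) and hard (MAP) restoration rules differ only in how the per-mixture polynomial outputs are combined. A minimal sketch, assuming per-mixture coefficient arrays in `np.polyval` order:

```python
import numpy as np

def restore_component(cdf_value, post, mixture_polys, hard=False):
    """Restore one test feature component from its CDF value: evaluate every
    mixture's polynomial, then either take the posterior-weighted sum (soft)
    or keep only the most probable mixture (MAP hard decision)."""
    vals = np.array([np.polyval(p, cdf_value) for p in mixture_polys])
    if hard:
        return vals[np.argmax(post)]  # MAP: single most probable mixture
    return float(np.dot(post, vals))  # soft: posterior-weighted combination
```

The hard decision skips K-1 polynomial evaluations per component once the argmax is known, which is the computational saving the slide refers to.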

Polynomial-fit histogram equalization (PHEQ) (1/2)

In this paper, we also present a variant of CPHEQ, named polynomial-fit histogram equalization (PHEQ). In the implementation of PHEQ, only a single global transformation function is used to obtain the restored value of each noisy feature vector component, so Eq. (2) can be rewritten as

\hat{x}_t = \sum_{m=0}^{M} a_m\, \big[\hat{C}(y_t)\big]^m

where the coefficients a_m are estimated merely from the clean training speech feature vector components x_t, by minimizing the squared error

E = \sum_{t} \Big( x_t - \sum_{m=0}^{M} a_m\, \big[\hat{C}(x_t)\big]^m \Big)^2
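PHEQ thus reduces to fitting one polynomial to the clean training data's inverse CDF and reusing it at test time. A self-contained NumPy sketch (the polynomial order and data are illustrative, not the paper's settings):

```python
import numpy as np

def rank_cdf(v):
    """Order-statistics CDF estimate (r - 0.5) / T for each component."""
    return (np.argsort(np.argsort(v)) + 0.5) / len(v)

def train_pheq(clean, order=5):
    """Fit one global polynomial mapping CDF values of the clean training
    components back to the component values (an inverse-CDF approximation)."""
    return np.polyfit(rank_cdf(clean), clean, order)

def apply_pheq(coefs, noisy):
    """Map each noisy test component through its own rank-based CDF,
    then evaluate the global polynomial to obtain the restored values."""
    return np.polyval(coefs, rank_cdf(noisy))
```

Because the transform depends only on ranks, any monotone distortion of the clean features is undone up to the polynomial-fit error, which is the core HEQ property PHEQ preserves.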

Polynomial-fit histogram equalization (PHEQ) (2/2)

A summary of the storage requirements and computational complexities of these three approaches is presented in Table I. In brief, PHEQ is advantageous in terms of storage and computational requirements as compared with the other two conventional HEQ approaches.

Experiments on CPHEQ (1/3)

CPHEQ provides significant performance gains over the MFCC-based baseline system, especially when the number of mixtures is large (e.g., 512 or 1024). However, there is no significant difference between the soft-decision and hard-decision approaches. This suggests that using Eq. (3) to derive the polynomial functions for CPHEQ is sufficient, and it simplifies the computation of CPHEQ in both the training and recognition phases.

Experiments on CPHEQ (2/3)

In the next set of experiments, we assess the performance of CPHEQ with respect to different numbers of mixtures in the GMM and different orders of the polynomial function.

Experiments on CPHEQ (3/3)

In the third set of experiments, we combine CPHEQ with two other feature representations to further verify its effectiveness: linear discriminant analysis (LDA) and heteroscedastic linear discriminant analysis (HLDA). Both are derived directly from the outputs of the Mel-scaled log filter banks and post-processed by the maximum likelihood linear transform (MLLT) for feature decorrelation. The feature vectors from every nine successive frames are spliced together to form supervectors for the construction of the transformation matrix. The dimension of the resultant vectors is set to

Experiments on PHEQ (1/2)

Next, we evaluate the performance of PHEQ with respect to the polynomial order; the results are presented in Table III. Going a step further, we integrate PHEQ with the two discriminative feature representations described in the preceding section. The corresponding average WER results are shown in Table IV.

Experiments on PHEQ (2/2)

However, as the above experiments suggest, a small mixture number may be insufficient to delineate the noise characteristics. To overcome this shortcoming, we combine CPHEQ with PHEQ through a simple linear interpolation of the restored values derived from each of the two methods. The results reveal that CPHEQ and PHEQ can, to some extent, complement each other well.
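The combination just described is a plain linear interpolation of the two restored values. A one-function sketch; the weight `alpha` is a hypothetical value that would be tuned on development data:

```python
import numpy as np

def interpolate_restorations(x_cpheq, x_pheq, alpha=0.5):
    """Linearly interpolate the CPHEQ- and PHEQ-restored feature values;
    alpha weights the CPHEQ output, (1 - alpha) the PHEQ output."""
    return alpha * np.asarray(x_cpheq) + (1.0 - alpha) * np.asarray(x_pheq)
```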

Comparison with Various Feature Normalization Methods (1/2)

Here we compare our two proposed feature normalization methods (i.e., CPHEQ and PHEQ) with several typical feature normalization methods under the clean-condition training scenario. CPHEQ does not significantly outperform the other approaches on Test Set C. This is mainly because Test Set C additionally includes convolutional distortions, which can lead to a substantial discrepancy in calculating the posterior probabilities for the test speech. A straightforward remedy for this discrepancy is to apply cepstral mean subtraction (CMS) to remove the channel distortions.

Comparison with Various Feature Normalization Methods (2/2)

To confirm that feature normalization based on both the speech features and their distribution characteristics is superior to normalization based on the speech features alone, we also investigated a cluster-based polynomial feature compensation (CPFC) approach, which restores the speech features directly on the basis of their value domain rather than their distribution characteristics (i.e., the CDF values).

Further Comparison with Three Sophisticated Robustness Methods (1/2)

Finally, we compare CPHEQ with three more sophisticated and effective robustness methods, namely the ETSI advanced front-end (AFE), the Mel-LPC-based Mel-Wiener filter (MLMWF), and feature-based vector Taylor-series speech enhancement (F-VTS). Given a test utterance, CPHEQ operates on the MFCC features directly, without any explicit online noise estimation or reduction; in other words, the noise characteristics of the test utterance are determined entirely by the pretrained noisy GMM. This deficiency will no doubt limit the performance of CPHEQ.

Further Comparison with Three Sophisticated Robustness Methods (2/2)

Since CPHEQ uses both clean speech and its noisy counterpart to estimate the polynomial functions, we also compare its performance with AFE and MLMWF under the multi-condition training scenario.

Summary

The results shown in Fig. 1 reveal a strong correlation between the order of the polynomial function and the mixture number of the GMM. Even though CPHEQ performs worse than the sophisticated robustness methods described in Section IV-F, its simplicity still makes it well suited to dealing with noise distortions, alone or combined with other, more complicated robustness methods. Each of these methods has its own merits and defects. The need for stereo data sometimes limits the applicability of CPHEQ, since stereo data are not always easy to collect. One possible solution to this difficulty is to borrow the idea of VTS enhancement.

Conclusions

Since it is sometimes difficult to collect stereo data, one future research direction is the use of mono data (either clean or noisy speech) to estimate the parameters of the transformation functions. Data fitting is prone to being affected by abnormal values, so another future direction is outlier detection and elimination, i.e., robust regression. Finally, since speech signals are slowly time-varying, the contextual information between consecutive speech feature vectors may be an important cue that CPHEQ and PHEQ could exploit.