Distribution-Based Feature Normalization for Robust Speech Recognition Leveraging Context and Dynamics Cues
Yu-Chen Kao and Berlin Chen
Presenter: 張庭豪

Outline
Introduction
Histogram equalization (HEQ)
Proposed methods
Experiments and results
Conclusions

Introduction
Varying environmental effects, such as ambient noise and noises introduced by recording devices and transmission channels, often give rise to a severe mismatch between the acoustic conditions of training and test. Such a mismatch will no doubt cause severe degradation in the performance of an automatic speech recognition (ASR) system.
Among the most common schools of thought for robust speech feature extraction, feature normalization is believed to be a simpler and more effective way to compensate for the mismatch caused by noises, because it normally does not require a priori knowledge about the actual distortions.

Introduction
Cepstral mean normalization (CMN) is a simple and efficient technique for removing the time-invariant distortion introduced by the transmission channel, while a natural extension of CMN, cepstral mean and variance normalization (CMVN), attempts to normalize not only the means of the speech features but also their variances.
In order to compensate for non-linear environmental effects, the histogram equalization (HEQ) approach has been proposed and extensively studied in recent years.
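As a side note on how these two baselines differ in practice, here is a minimal per-utterance sketch in NumPy; the function names, array shapes, and the epsilon guard are assumptions of this illustration, not part of the original slides.

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization: subtract the per-dimension mean
    computed along the time axis of a single utterance."""
    # features: (num_frames, num_dims) cepstral feature matrix
    return features - features.mean(axis=0, keepdims=True)

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization: additionally scale
    each dimension to unit variance."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True)
    return (features - mu) / (sigma + eps)
```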

Introduction
In this paper, we present a novel extension to the conventional HEQ approach in two significant aspects. First, polynomial regression of various orders is employed to efficiently perform feature normalization, building on the notion of HEQ. Second, not only the contextual distributional statistics but also the dynamics of the feature values are taken as input to the presented regression functions for better normalization performance.
By doing so, we can to some extent relax the dimension-independence and bag-of-frames assumptions made by the conventional HEQ approach.

Histogram equalization (HEQ)
Histogram equalization is a popular feature compensation technique that has been well studied and practiced in the field of image processing for normalizing the visual features of digital images, such as brightness and grey-level scale.
HEQ attempts not only to match the means and variances of the speech features, but also to completely match the speech feature distributions of the training and test data, using transformation functions estimated from the cumulative density functions (CDFs) of the training and test data.
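To make the CDF-matching idea concrete, below is a minimal sketch of a table-lookup (quantile-mapping) style of HEQ for one feature dimension; this is a generic illustration under my own assumptions, not the specific implementation used in the paper.

```python
import numpy as np

def heq_one_dim(test_values, train_values):
    """Map one feature dimension of a test utterance so that its empirical
    distribution matches that of the training data."""
    # Empirical CDF value of each test sample (rank / N, in (0, 1]).
    ranks = np.argsort(np.argsort(test_values))
    cdf = (ranks + 1) / len(test_values)
    # Evaluate the inverse CDF (quantile function) of the training data
    # at those CDF values.
    return np.quantile(train_values, cdf)
```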

Proposed methods
We recently presented an efficient method that explores the use of data fitting to approximate the inverse of the CDFs of the speech feature vector components for HEQ, named polynomial-fit histogram equalization (PHEQ).
PHEQ makes use of data fitting (so-called least-squared-error regression) to estimate the inverse functions of the CDFs of the training speech: for a specific speech feature vector dimension d of the clean training utterances at frame l, the CDF value of $x_{d,l}$ is mapped back to the feature value by a low-order polynomial (e.g., a cubic of the form $ax^3 + bx^2 + cx + d$).

Proposed methods
The coefficients $a_{d,m}$ of each dimension d can be estimated by minimizing the sum of squared errors:
$E_d = \sum_{l=1}^{N} \left( x_{d,l} - \sum_{m=0}^{M} a_{d,m} \left[ C_d(x_{d,l}) \right]^m \right)^2$,
where N is the total number of training speech feature vectors, $C_d(\cdot)$ is the CDF of dimension d, and M is the polynomial order.
PHEQ has been shown to perform comparably to existing HEQ methods while offering the additional merits of lower storage and computation requirements.
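A rough sketch of the PHEQ training and equalization steps for a single dimension, assuming the empirical CDF values of the data are used as regression inputs and NumPy's polynomial fitting as the least-squares solver; the names and defaults here are illustrative, not taken from the paper.

```python
import numpy as np

def empirical_cdf(values):
    """Empirical CDF value (rank / N) of each sample in a 1-D array."""
    ranks = np.argsort(np.argsort(values))
    return (ranks + 1) / len(values)

def train_pheq_one_dim(clean_values, order=3):
    """Fit coefficients a_{d,m} that map CDF values of the clean training
    data back to the feature values (least-squares polynomial fit)."""
    return np.polyfit(empirical_cdf(clean_values), clean_values, deg=order)

def apply_pheq_one_dim(test_values, coeffs):
    """Equalize a test dimension by feeding its empirical CDF values
    through the fitted polynomial (approximate inverse training CDF)."""
    return np.polyval(coeffs, empirical_cdf(test_values))
```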

Proposed methods
However, as with most conventional HEQ methods, PHEQ makes an independence assumption over the speech feature vector components, and thereby performs equalization along the time trajectory of each feature component in isolation. Such an assumption would be invalid when the feature vector components of different dimensions are not completely decorrelated.
Furthermore, since speech signals are slowly time-varying, the local contextual (or structural) relationships of consecutive speech frames might provide additional information cues for feature normalization.

Proposed methods
The resulting CDF vector $C_l$ of frame l is further concatenated with its K preceding and K succeeding vectors to form a spliced CDF vector $S_l$ of (2K+1)D dimensions. Then, the restored value of a feature vector component $x_{d,l}$ of frame l can be obtained through a polynomial regression over the spliced vector:
$\tilde{x}_{d,l} = \sum_{j=1}^{(2K+1)D} \sum_{m=0}^{M} a_{d,j,m} \left[ S_{j,l} \right]^m$  (EPHEQ)
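The splicing step itself is straightforward; below is a small sketch under my own assumptions about array layout (edge frames are padded by repetition), showing how the per-frame CDF vectors are stacked into the (2K+1)D-dimensional spliced vectors.

```python
import numpy as np

def splice_cdf_frames(cdf_frames, K=1):
    """Concatenate each frame's CDF vector with its K preceding and
    K succeeding neighbours, giving (2K+1)*D dimensions per frame."""
    # cdf_frames: (num_frames, D) matrix of per-dimension CDF values
    padded = np.pad(cdf_frames, ((K, K), (0, 0)), mode="edge")
    spliced = [padded[t:t + 2 * K + 1].reshape(-1)
               for t in range(cdf_frames.shape[0])]
    return np.stack(spliced)  # (num_frames, (2K+1)*D)
```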

Proposed methods
To this end, a simple derivative operation is first performed on each feature vector component $x_{d,l}$ of a given training (or test) utterance, e.g. a simple difference $\Delta x_{d,l} = x_{d,l+h} - x_{d,l-h}$, where h is the step size of the difference.
By doing so, we can additionally incorporate such dynamics cues surrounding a feature vector component $x_{d,l}$ when performing EPHEQ, extending the regression with a second set of terms over the dynamics; $a_{d,j,m}$ and $b_{d,j,m}$ are the two sets of regression coefficients for the static and dynamic parts, respectively. This formulation is referred to hereafter as EPHEQ-D.
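A minimal sketch of the dynamics cue, assuming a symmetric difference with step size h and edge padding; the exact difference formula and boundary handling in the paper may differ.

```python
import numpy as np

def simple_delta(features, h=1):
    """Approximate the temporal derivative of each feature dimension
    with a symmetric difference of step size h."""
    # features: (num_frames, D); edge frames reuse their nearest neighbours
    padded = np.pad(features, ((h, h), (0, 0)), mode="edge")
    return padded[2 * h:] - padded[:-2 * h]  # x_{l+h} - x_{l-h}
```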

Experiments and results
The speech recognition experiments were conducted under various noise conditions using the Aurora-2 database and task, and further verified on the Aurora-4 database and task.
Aurora-2: Test Sets A and B are artificially contaminated with eight different types of real-world noise (e.g., subway noise, street noise) at a wide range of SNRs (-5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, and clean), while Test Set C additionally includes channel distortion.
Aurora-4: a noise-corrupted, medium-to-large-vocabulary speech recognition task based on the Wall Street Journal (WSJ) database, consisting of clean speech utterances corrupted by various noise sources at SNR levels ranging from 5 to 15 dB.

Experiments and results
PHEQ is more attractive than THEQ since it has the additional merit of lower storage and computation costs.
EPHEQ and EPHEQ-D, which consider the contextual distributional statistics for feature normalization, can further boost the performance of PHEQ.
By virtue of the additional use of the contextual dynamics of the speech features during the normalization process, EPHEQ-D provides superior robustness over EPHEQ.

Experiments and results
We can see that EPHEQ and EPHEQ-D do not give competitive performance relative to AFE.

Conclusions
In this paper, we have presented a novel feature normalization framework stemming from HEQ. We have elaborated on how the notion of leveraging contextual statistics and/or dynamics cues surrounding speech feature components for feature normalization can be crystallized; these cues are deemed complementary to those exploited by state-of-the-art robust methods, such as AFE.
In future work, we plan to explore additional information cues for speech feature normalization.