Survey of Robust Techniques


Survey of Robust Techniques 2005/5/26 Presented by Chen-Wei Liu

Conferences
Log-Energy Dynamic Range Normalization for Robust Speech Recognition
Weizhong Zhu and Douglas O'Shaughnessy, INRS-EMT, University of Quebec, Canada, ICASSP 2005
Static and Dynamic Spectral Features: Their Noise Robustness and Optimal Weights for ASR
Chen Yang, Tan Lee, The Chinese University of Hong Kong; Frank K. Soong, ATR, Kyoto, Japan

Introduction
Methods of robust speech recognition can be classified into two approaches: front-end processing for speech feature extraction, and back-end processing for HMM decoding. The front-end approach suppresses the noise to obtain more robust parameters; the back-end approach compensates for the noise by adapting the parameters inside the HMM system. This paper focuses on the first approach.

Introduction
Compared with the cepstral coefficients, the log-energy feature has quite different characteristics: logE is the log of the summation of the energy of all samples in one frame, whereas c0 is the summation of the log filter-bank outputs. This paper tries to find a more effective way, named log-energy dynamic range normalization (ERN), to remove the effects of additive noise by minimizing the mismatch between training and testing data.
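The logE/c0 distinction above can be made concrete with a small NumPy sketch; this is an illustrative computation, not code from the paper, and the constant DCT scaling of c0 is omitted:

```python
import numpy as np

def log_energy(frame):
    """logE: log of the summed energy (squared samples) of one frame."""
    return np.log(np.sum(frame ** 2))

def c0_from_filterbank(log_fbank):
    """c0: summation of the log filter-bank energies (the 0th DCT
    coefficient of the log filter-bank, up to a constant factor)."""
    return np.sum(log_fbank)

frame = np.ones(4)                                    # toy 4-sample frame
print(log_energy(frame))                              # log(4) ≈ 1.386
print(c0_from_filterbank(np.array([1.0, 2.0, 3.0])))  # 6.0
```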

Energy Dynamic Range Normalization
Observations: the minimum value is elevated, i.e. the valleys are buried by additive noise energy, while the peaks are not affected as much.

Energy Dynamic Range Normalization
The larger difference at the valleys leads to a mismatch between the clean and noisy speech. To minimize this mismatch, the paper proposes an algorithm that scales the log-energy feature sequence of clean speech, lifting the valleys while keeping the peaks unchanged. The log-energy dynamic range is defined as the difference between the maximum and the minimum log-energy values of an utterance. (A decibel, dB, is one tenth of a bel; the bel expresses the gain or attenuation of a power signal as the logarithm of the ratio of the power after amplification to the power before it, and in practice the tenth-of-a-bel unit, dB, is used.)

Energy Dynamic Range Normalization
In the presence of noise, the minimum log-energy value is affected by additive noise, while the maximum is not affected as much. Denoting the target energy dynamic range by X, the target minimum value becomes the maximum minus X. In this way, X can be used to set the target minimum value based on a given target dynamic range.

Energy Dynamic Range Normalization
The steps of the proposed log-energy feature dynamic range normalization algorithm are:
1st: find the maximum (Max) and the minimum (Min) of the log-energy sequence
2nd: calculate the target minimum, Target_Min = Max − X
3rd: if Min < Target_Min, go to the 4th step
4th: for i = 1…n, lift each log-energy value toward the target so that the valleys are raised while the peak stays unchanged
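A minimal sketch of these steps in Python; the linear lifting formula is an assumption reconstructed from the description (valleys raised to the target minimum, the peak left fixed), not the paper's exact equation:

```python
import numpy as np

def ern_linear(log_e, target_range):
    """Log-energy dynamic range normalization, linear scaling (sketch).

    Lifts low log-energy frames toward a target minimum while the
    frame holding the maximum log-energy is left unchanged.
    """
    log_e = np.asarray(log_e, dtype=float)
    mx, mn = log_e.max(), log_e.min()        # step 1: Max and Min
    target_min = mx - target_range           # step 2: Target_Min = Max - X
    if mn >= target_min:                     # step 3: range already narrow enough
        return log_e
    # step 4 (assumed form): per-frame lift proportional to the distance
    # from the peak, so Min maps to Target_Min and Max stays put
    return log_e + (target_min - mn) * (mx - log_e) / (mx - mn)
```

Applied to a toy sequence [0, 5, 10] with X = 4, the valley is lifted to 6 and the peak stays at 10, matching the stated behavior.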

Energy Dynamic Range Normalization
The scaling effect decreases as a frame's own log-energy value goes up, and the maximum of the sequence is left unchanged.

Experimental Results: Linear Scaling
The proposed method was evaluated on the Aurora 2.0 database.

Experimental Results: Non-Linear Scaling
A non-linear scaling equation is used as follows.

Experimental Results: Comparison of Linear and Non-Linear Scaling
Performance comparisons at different SNR levels are shown as follows.

Experimental Results: Combination with Other Techniques

Conclusions
When systems were trained on a clean-speech training set, the proposed technique achieved an overall relative performance improvement of about 30.83%. Like CMS, the proposed method does not require any prior knowledge of the noise or its level. Reducing the mismatch in log-energy leads to a large recognition improvement.

Conferences
Log-Energy Dynamic Range Normalization for Robust Speech Recognition
Weizhong Zhu and Douglas O'Shaughnessy, INRS-EMT, University of Quebec, Canada, ICASSP 2005
Static and Dynamic Spectral Features: Their Noise Robustness and Optimal Weights for ASR
Chen Yang, Tan Lee, The Chinese University of Hong Kong; Frank K. Soong, ATR, Kyoto, Japan

Introduction
Dynamic cepstral features help the static features characterize the time-varying rate of the speech trajectory. It has been shown that such a representation (static + dynamic) yields higher speech and speaker recognition performance than static cepstra alone. This paper tries to quantify the robustness of static and dynamic features under different types of noise and variable SNRs.
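For concreteness, dynamic (delta) cepstra are commonly obtained from a regression window over the static features; the HTK-style formula below is a standard sketch, not necessarily the exact window used in the paper:

```python
import numpy as np

def delta(static, K=2):
    """Regression-based delta features over a 1-D static sequence:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with the first/last values repeated as edge padding."""
    static = np.asarray(static, dtype=float)
    T = len(static)
    padded = np.concatenate([[static[0]] * K, static, [static[-1]] * K])
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return np.array([
        sum(k * (padded[t + K + k] - padded[t + K - k])
            for k in range(1, K + 1)) / denom
        for t in range(T)
    ])
```

On a linearly increasing static sequence the interior deltas come out as the slope (1.0), which is a quick sanity check of the formula.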

Noise Robustness Analysis Recognition with only Static or Dynamic Features

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech
For a given sequence of noisy speech observations, the output likelihood is presented as follows, using a single Gaussian for simplicity. The mismatch between clean and noisy conditions lies mainly in the exponent term, which can be re-written as the clean-speech exponent plus a quadratic term in the clean-to-noisy feature difference plus a cross term whose expected value is zero.

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech
Since the expected value of the cross term is zero, the likelihood difference between noisy and clean speech reduces to the remaining quadratic term, measured by defining a cepstral distance as follows, where the diagonal covariance of the clean speech model is used as an approximation and the bar denotes the time average over the whole utterance.
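Such a variance-weighted, time-averaged cepstral distance can be sketched as below; the clean/noisy feature matrices and the diagonal inverse variances are hypothetical inputs:

```python
import numpy as np

def cepstral_distance(clean, noisy, inv_var):
    """Variance-weighted cepstral distance between clean and noisy
    features, averaged over time (a sketch of the paper's distance;
    inv_var stands in for the model's diagonal inverse covariance)."""
    diff = np.asarray(noisy, float) - np.asarray(clean, float)
    # per-frame squared deviation weighted by 1/sigma^2 per dimension,
    # then the time average over the whole utterance
    return np.mean(np.sum(diff ** 2 * inv_var, axis=1))
```

With unit variances and a constant offset of 1 in every dimension of a 3-dimensional feature, the distance is simply 3.0, which makes the weighting easy to verify.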

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech
The weighted distances between clean and noisy speech are computed for the static and the dynamic features respectively, where the superscripts d and s denote the dynamic and the static features.

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech
The following depicts the scatter diagrams of the dynamic distance (between clean and noisy dynamic cepstra) vs. its static counterpart.

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech
Two observations can be made from the figure: both distances are larger for increasingly mismatched conditions at lower SNRs, and the majority of the points fall below the diagonal line. In other words, the dynamic cepstral distance between noisy and clean features is smaller than its static counterpart.

Exponential Weighting in Decoding: Exponential Weightings
Based on the findings in the previous figure, it would make sense to weight the log-likelihoods of the static and dynamic features differently in decoding, to exploit their uneven noise robustness. The output likelihood of an observation can be split into two separate corresponding terms, d and s, and the acoustic likelihood components can then be computed with different exponential weightings.
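The split, exponentially weighted score can be sketched as follows; raising a likelihood to a power multiplies its log by that weight. The diagonal-Gaussian scoring and the `model` dictionary are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gauss_loglike(x, mean, var):
    """Log-likelihood of one observation under a diagonal Gaussian."""
    x, mean, var = map(np.asarray, (x, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def weighted_loglike(o_static, o_dynamic, model, w_s, w_d):
    """Stream-weighted acoustic score: the static and dynamic parts of
    the observation are scored separately and their log-likelihoods are
    scaled by exponential weights (the paper constrains w_s + w_d = 1).
    `model` is a hypothetical dict of per-stream Gaussian parameters."""
    return (w_s * gauss_loglike(o_static, model["mu_s"], model["var_s"])
            + w_d * gauss_loglike(o_dynamic, model["mu_d"], model["var_d"]))
```

Setting w_s = 1 and w_d = 0 recovers the static-only score, so the unweighted baseline is the special case w_s = w_d.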

Exponential Weighting in Decoding: Recognition with Bracketed Weightings
The two weights are bracketed at a step of 0.1 under the constraint that they sum to one.
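Bracketing the weights amounts to a grid search under the unity-sum constraint; in the sketch below, the hypothetical `score_fn` stands in for a full recognition run that returns accuracy for a weight pair:

```python
def bracket_weights(score_fn, step=0.1):
    """Grid-search (w_static, w_dynamic) pairs with w_s + w_d = 1,
    stepping w_dynamic from 0 to 1 by `step` (0.1 in the paper).
    Returns the pair with the highest score."""
    best = None
    n = round(1 / step)
    for i in range(n + 1):
        w_d = i * step
        w_s = 1.0 - w_d
        acc = score_fn(w_s, w_d)       # one recognition run per pair
        if best is None or acc > best[0]:
            best = (acc, w_s, w_d)
    return best[1], best[2]
```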

Exponential Weighting in Decoding: Discriminative Weight Training (Weight Optimization)
The log-likelihood difference (lld) between the recognized and the correct states is chosen as the objective function for optimization. For the u-th speech utterance of T observations, the lld is as follows, and the cost averaged over the whole training set of U utterances is:

Exponential Weighting in Decoding: Discriminative Weight Training (Weight Optimization)
This cost is minimized by iteratively adjusting the dynamic weight and the static weight via steepest descent.
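Because the combined score is linear in the two weights, the gradient of the lld cost reduces to differences of accumulated per-stream scores. One steepest-descent step can be sketched as below; the `stats` bookkeeping is a hypothetical simplification of the per-utterance alignment scores:

```python
def update_weights(w_d, w_s, stats, eta=0.01):
    """One steepest-descent step on the average log-likelihood
    difference between recognized and correct state sequences.

    `stats` holds per-utterance tuples (Ld_rec, Ls_rec, Ld_cor, Ls_cor):
    the dynamic/static stream log-likelihoods accumulated along the
    recognized and the correct alignments.  Since the total score is
    linear in the weights, d(cost)/d(w) is the averaged difference of
    the corresponding accumulated stream scores."""
    U = len(stats)
    grad_d = sum(ld_r - ld_c for ld_r, _, ld_c, _ in stats) / U
    grad_s = sum(ls_r - ls_c for _, ls_r, _, ls_c in stats) / U
    return w_d - eta * grad_d, w_s - eta * grad_s
```

When the recognized hypothesis scores higher than the correct one on the dynamic stream, the dynamic weight is pushed down, which is the intended discriminative behavior.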

Experimental Results: Evaluation on the Aurora 2.0 Database
Overall, a 36.6% relative WER reduction is obtained.

Experimental Results: Evaluation on the CUDIGIT Database
The relative WER improvement is 41.9%, averaged over all noise conditions.

Conclusions
The dynamic features were found to be more resilient to additive-noise interference than their static counterparts. Optimal exponential weights exploiting the unequal robustness of the two cepstral feature streams were used, and better performance was obtained.