Noise Compensation for Speech Recognition with Arbitrary Additive Noise Ji Ming School of Computer Science Queen’s University Belfast, Belfast BT7 1NN,


Noise Compensation for Speech Recognition with Arbitrary Additive Noise
Ji Ming, School of Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK
Presented by Shih-Hsiang
IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, No. 3, May 2006

2 Introduction
Speech recognition performance is known to degrade dramatically when a mismatch occurs between training and testing conditions
Traditional approaches for removing the mismatch, and thereby reducing the effect of noise on recognition, include
–Removing the noise from the test signal: noise filtering or speech enhancement (spectral subtraction, Wiener filtering, RASTA filtering)
–Assuming the availability of a priori knowledge: constructing a new acoustic model to match the test environment, i.e., noise or environment compensation (model adaptation, parallel model combination (PMC), multi-condition training, SPLICE); real-world noisy training data is needed
More recent studies focus on methods requiring less knowledge, since such knowledge can be difficult to obtain in real-world applications

3 Introduction (cont.)
This paper investigates noise compensation for speech recognition
–Involving additive noise of any corruption type (e.g., full, partial, stationary, or time-varying)
–Assuming no knowledge about the noise characteristics and no training data from the noisy environment
This paper proposes a method that not only focuses recognition on reliable features but is also robust to full noise corruption affecting all time-frequency components of the speech representation
–It combines artificial noise compensation with the missing-feature method, to accommodate mismatches between the simulated noise condition and the actual noise condition
–It is possible to accommodate sophisticated spectral distortion, e.g., full, partial, white, colored, or none
–It is based on clean-speech training data and simulated noise data
–The method is named "Universal Compensation (UC)"

4 Methodology
The UC method comprises three steps
–Construct a set of models for short-time speech spectra using artificial multi-condition speech data, generated by corrupting the clean training data with artificial wide-band flat-spectrum noise at consecutive SNRs
–Given a test spectrum, search for the spectral components in each model spectrum that best match the corresponding spectral components in the test spectrum, and produce a score based on the matched components for each model spectrum
–Combine the scores from the individual model spectra to form an overall score for recognition

5 Methodology (cont.)
Step 1: Generate noise by passing white noise through a low-pass filter
Step 2: Calculate a score for each model spectrum based only on the matched spectral components
Step 3: Combine the individual scores from the model spectra to produce an overall score
(Figure: clean training spectrum, artificial wide-band flat-spectrum noise, and noisy test spectrum)
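Step 1 above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `flat_spectrum_noise` is our own, and a brick-wall FFT-domain cutoff stands in for the paper's analog-style low-pass filter.

```python
import numpy as np

def flat_spectrum_noise(n_samples, fs=8000.0, cutoff=3500.0, seed=0):
    """Wide-band flat-spectrum noise: white noise passed through a
    low-pass filter (here realized as a hard cutoff in the FFT domain)."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)          # flat-spectrum white noise
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectrum[freqs > cutoff] = 0.0                  # remove energy above the cutoff
    return np.fft.irfft(spectrum, n=n_samples)

noise = flat_spectrum_noise(8000)                   # one second at 8 kHz
```

The result is noise with an approximately flat spectrum up to the cutoff and (here, exactly) no energy above it; the paper instead specifies a filter with a 3-dB bandwidth of 3.5 kHz, so its noise rolls off smoothly rather than abruptly.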

6 Methodology (cont.)
A key to the success of the UC method is the accuracy of converting a full-band corruption into a partial-band corruption. This accuracy is determined by two factors
–The frequency-band resolution: determines the bandwidth of each spectral component. The smaller the bandwidth, the more accurate the approximation of an arbitrary noise spectrum by piecewise-flat spectra, but a small bandwidth usually results in a loss of correlation between the spectral components, and thus poor phonetic discrimination. An optimum frequency-band subdivision, in terms of a good balance between noise spectral resolution and phonetic discrimination, remains a topic for study
–The amplitude resolution: refers to the number of steps used to quantize the SNR. The finer the quantization steps, the more accurate the approximation of any given level of noise, but the use of a large number of SNRs may result in low computational efficiency

7 Formulation — A. Model and Training Algorithms
Assume that each training frame is represented by a spectral vector consisting of sub-band spectral components, and that a number of SNR levels are used to generate the wide-band flat-spectrum noise forming the noisy training data
Let a model spectrum be expressed as the probability distribution of the model spectral vector, associated with a speech state and trained on one SNR level, and let a test spectral vector be given
Recognition involves classifying each test spectrum into an appropriate speech state, based on the probabilities of the test spectrum associated with the individual model spectra within the state
When computing the probability for each model spectrum
–Only the matched spectral components are retained
–The mismatched components are ignored

8 Formulation (cont.) — A. Model and Training Algorithms
The probability can be approximated by the marginal distribution obtained by ignoring the mismatched spectral components, to improve mismatch robustness
Given this marginal probability for each model spectrum, the overall probability associated with a speech state can be obtained by combining over all the different SNRs (1)
For simplicity, assume that the individual spectral components are independent of one another, so the probability of any subset of components can be written as a product over that subset (2)
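The transcript drops the bodies of equations (1) and (2); a plausible reconstruction from the surrounding text, in our own notation (the paper's symbols may differ), is:

```latex
% Hypothetical reconstruction; y_t, s, l, B, w_{s,l} are our notation.
\begin{align}
p(\mathbf{y}_t \mid s) &= \sum_{l=1}^{L} w_{s,l}\, p(\mathbf{y}_t \mid s, l) \tag{1}\\
p(\mathbf{y}_{t,B} \mid s, l) &= \prod_{n \in B} p(y_{t,n} \mid s, l) \tag{2}
\end{align}
```

where $\mathbf{y}_t$ is the test spectral vector, $s$ the speech state, $l$ indexes the $L$ training SNR levels with weights $w_{s,l}$, and $B$ is a subset of the subband components (the matched ones).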

9 Formulation (cont.) — A. Model and Training Algorithms
The model spectrum may be constructed in two different ways
–We may estimate each model explicitly, using the training data corresponding to a specific SNR
–Alternatively, we may build the model by pooling the training data from all SNR conditions together and training it as a usual mixture model on the mixed data set (more flexible); the EM algorithm decides the association between the data, the mixtures, and the weights

10 Formulation (cont.) — B. Recognition Algorithm
Given a test spectral vector, the mixture probability in (1) is computed using only a subset of the data for each of the mixture densities
–This reduces the effect of mismatched noisy spectral components
–But we need to find the matched subset, containing all the matched components, for each model spectrum
If we can assume that the matched subset produces a large probability, then it may be defined as the subset that maximizes the probability among all possible subsets
However, (2) indicates that the probabilities of different-sized subsets are of different orders of magnitude and thus not directly comparable
–An appropriate normalization of the probability is needed
–A possible solution is to replace the conditional probability of the test subset with the posterior probability of the model spectrum, which always produces a value in the range [0, 1]

11 Formulation (cont.) — B. Recognition Algorithm
By maximizing the posterior probability, we should be able to obtain, for each model spectrum, the subset that contains all the matched components; this is the optimum (MAP) decision, with the mismatched components treated as "don't care", assuming an equal prior p(s) for all states (3)
The optimized posterior probability can be incorporated into an HMM to form the state-based emission probability
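The subset selection described on this slide can be illustrated on a toy model. This is a sketch under our own assumptions (the helper `best_matched_subset`, the equal-prior default, and the brute-force subset enumeration are ours), not the paper's algorithm as published:

```python
import itertools
import numpy as np

def best_matched_subset(comp_like, model_idx, prior=None):
    """For the model at row `model_idx`, find the subset of spectral
    components maximizing that model's posterior, MAP-style.
    comp_like[m, n] = likelihood of component n under model m
    (components assumed independent, as in eq. (2))."""
    n_models, n_comp = comp_like.shape
    if prior is None:
        prior = np.full(n_models, 1.0 / n_models)   # equal priors p(s)
    best_post, best_subset = -1.0, ()
    # enumerate every non-empty subset of components
    for size in range(1, n_comp + 1):
        for subset in itertools.combinations(range(n_comp), size):
            joint = prior * np.prod(comp_like[:, subset], axis=1)
            post = joint[model_idx] / joint.sum()   # posterior, always in [0, 1]
            if post > best_post:
                best_post, best_subset = post, subset
    return best_subset, best_post

# Component 2 is badly mismatched for model 0, so the chosen subset drops it
like = np.array([[0.9, 0.8, 0.01],
                 [0.2, 0.3, 0.5]])
subset, post = best_matched_subset(like, 0)   # -> subset (0, 1)
```

Because the posterior is normalized across models, subsets of different sizes become directly comparable, which is exactly the motivation given on the previous slide. Brute-force enumeration is exponential in the number of components; with only 6 subbands (as used later in the paper) it stays cheap.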

12 Experimental Evaluation
A. Databases
Two databases are used to evaluate the performance of the UC method
–The first database is Aurora 2, for speaker-independent recognition of digit sequences in noisy conditions
–The second database contains the highly confusing E-set words (b, c, d, e, g, p, t, v), used to further examine the ability of the new UC model to deal with acoustically confusing recognition tasks

13 Experimental Evaluation (cont.)
Acoustic Modeling for Aurora 2
The performance of the UC model is compared with that of four baseline systems
–The first is trained on the clean training set: 3 mixtures per state for the digits / 6 mixtures per state for the silence
–The second is trained on the multi-condition training set: 3 mixtures per state for the digits / 6 mixtures per state for the silence
–The third is an improved model corresponding to the complex back-end model: 20 mixtures per state for the digits / 36 mixtures per state for the silence
–The fourth uses 32 mixtures for all the states, and thus has the same model complexity as the UC model
The UC model is trained using only the clean training set
–Expanded by adding wide-band flat-spectrum noise to each of the utterances
–10 different SNR levels, from 20 dB to 2 dB in 2-dB steps
–The wide-band flat-spectrum noise is computer-generated white noise filtered by a low-pass filter with a 3-dB bandwidth of 3.5 kHz

14 Experimental Evaluation (cont.)
Acoustic Modeling for Aurora 2
The speech is divided into frames of 25 ms at a frame rate of 10 ms. For each frame
–A 13-channel mel filter bank is used to obtain 13 log filter-bank amplitudes
–These 13 amplitudes are then decorrelated by a high-pass filter, resulting in 12 decorrelated log filter-bank amplitudes
–The bandwidth of the subbands can be increased conveniently by grouping neighboring subband components together to form new subband components; for example, a 6-subband spectral vector can be formed by pairing the 12 components
–In this paper, each feature vector consists of 18 components: 6 static subband spectra, 6 delta subband spectra, and 6 delta-delta subband spectra. The overall size of the feature vector for a frame is 18 × 2 = 36
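The grouping of neighboring subband components might look like the following sketch (our own helper `group_subbands`; we assume here that grouped components are kept as pairs rather than summed, which matches the "18 × 2 = 36" count but is our reading of the transcript, not a confirmed detail):

```python
import numpy as np

def group_subbands(amps, n_groups=6):
    """Group neighboring components to widen the bands:
    12 decorrelated log filter-bank amplitudes -> 6 subband pairs."""
    amps = np.asarray(amps)
    return amps.reshape(n_groups, -1)   # each row = one wider subband

frame = np.arange(12.0)        # stand-in for 12 log filter-bank amplitudes
six_band = group_subbands(frame)   # shape (6, 2)
```

Under this reading, 6 static subbands of 2 values each, plus the same for delta and delta-delta, gives the 36 values per frame quoted on the slide.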

15 Experimental Evaluation (cont.)
Tests on Aurora 2
The table shows the recognition results for the clean test data
For the clean data, the best accuracy rates were obtained by the multi-condition baseline models with 20 and 32 mixtures per state
The UC model performed on average slightly better than the multi-condition model with 3 mixtures per state

16 Experimental Evaluation (cont.)
Tests on Aurora 2
The tables show the recognition results on test set A and test set B
The UC model improved significantly over the baseline model trained on clean data, and achieved an average performance close to that obtained by the multi-condition model with three mixtures per state
Car noise exhibits a less sophisticated spectral structure than babble noise, and thus may be more accurately matched by the piecewise-flat spectra implemented in the UC model

17 Experimental Evaluation (cont.)
Tests on Aurora 2
The table shows the recognition results on test set C
The channel-mismatch problem can be solved by Multi-20 and Multi-32
The UC model also showed a capability of coping with this mismatch
–Its performance is little affected by channel mismatch
The figure summarizes the average word-accuracy results for the five systems

18 Experimental Evaluation (cont.)
Tests on Noise Unseen in Aurora 2
The purpose of this study is to further investigate the capability of the UC model to offer robustness to a wide variety of noises
–Three additional noises are used: a polyphonic mobile-phone ring, a pop-song segment, and a broadcast-news segment
–(Figure: the spectral characteristics of the three noises)

19 Experimental Evaluation (cont.)
Tests on Noise Unseen in Aurora 2
The UC model offered improved accuracy over all three baseline models
The UC model produced particularly good results for the ringtone noise, because that noise causes mainly partial corruption over the speech frequency band
The table also indicates that increasing the number of mixtures in the mismatched baseline model
–produced only a small improvement for the news noise
–produced no improvement for the phone-ring noise

20 Experimental Evaluation (cont.)
Tests on Noise Unseen in Aurora 2
The UC model, with a complexity similar to that of Multi-32, performed similarly to Multi-3 trained in matched conditions
The UC model was able to outperform Multi-32 in the case of unknown/mismatched noise conditions

21 Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database
This experiment investigates the ability of the UC model to discriminate between acoustically confusing words
–While it reduces the mismatch between training and testing conditions, does it also reduce the discrimination between utterances of different words?
The experiments use a new database containing the highly confusing E-set words (b, c, d, e, g, p, t, v), extracted from the Connex speaker-independent alphabetic database provided by British Telecom
–It contains three repetitions of each word by 104 speakers (53 male and 51 female); 52 speakers are used for training and the other 52 for testing
–For each word, about 156 utterances are available for training; a total of 1219 utterances are available for testing
–Different noises from Aurora 2 test set A are artificially added
Two baseline HMMs are built
–One with the clean training set (1 mixture per state)
–The other with the multi-condition training set (11 mixtures per state)

22 Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database
For the clean E-set, the UC model achieved a recognition accuracy close to that of the baseline model, with only a small loss in accuracy (84.91% → 83.33%)
For the given noise conditions, the UC model achieved an average performance close to that obtained by the multi-condition baseline model

23 Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database
Finally, the performance of the UC model is tested with different resolutions for quantizing the SNR
–Three training sets are generated with increasing SNR resolution
Coarse quantization (6 mixtures per state): only five different SNRs, from 20 dB to 4 dB with a 4-dB step
Medium-resolution quantization (11 mixtures per state): ten different SNRs, from 20 dB to 2 dB with a 2-dB step
Fine quantization (21 mixtures per state): twenty different SNRs, from 20 dB to 2 dB with a 1-dB step
–Additionally, all three sets include the clean training data

24 Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database
The two models with the medium and fine quantization produced quite similar recognition accuracy in many test conditions
The model with the coarse quantization, trained with 6 SNRs, produced poorer results than the other two models, but still showed a significant performance improvement over the baseline model trained on clean data

25 Summary
This paper investigated noise compensation for speech recognition
–Assuming no knowledge about the noise characteristics and no training data from the noisy environment
–Universal compensation (UC) is proposed as a possible solution to the problem
–The UC method involves a novel combination of the principle of multi-condition training and the principle of the missing-feature method
Experiments on Aurora 2 have shown that the UC model has the potential to achieve a recognition performance close to that of the multi-condition model without assuming knowledge of the noise
Further experiments with noises unseen in Aurora 2 have indicated the ability of the UC model to offer robust performance for a wide variety of noises
Finally, the experimental results on an E-set database have demonstrated the ability of the UC model to deal with acoustically confusing recognition tasks