Temporal envelope compensation for robust phoneme recognition using modulation spectrum
Authors: Sriram Ganapathy, Samuel Thomas, and Hynek Hermansky

Outline
- Introduction
- Frequency domain linear prediction
  - Noise compensation in FDLP
  - Gain normalization in FDLP
  - Robustness in telephone channel noise
- Feature extraction
- Experiments
  - Phoneme recognition task (TIMIT database)
  - Results
  - Phoneme recognition in CTS
  - Relative contribution of various processing stages
  - Modifications and results
- Summary

Introduction (1/2) Conventional speech analysis techniques start by estimating the spectral content of relatively short (about 10–20 ms) segments of the signal (the short-term spectrum). Most of the information contained in these acoustic features relates to formants, which provide important cues for recognizing basic speech units. It has been shown that important information for phoneme perception lies in the 1–16 Hz range of modulation frequencies. Recognition of consonants, especially stops, suffers more when temporal modulations below 16 Hz are filtered out. Even when the spectral information is limited to four sub-bands, the use of temporal amplitude modulations alone supports good human phoneme recognition.
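The notion of a modulation frequency can be illustrated with a short sketch: the spectrum of a sub-band temporal envelope peaks at the signal's amplitude-modulation rate. The carrier and modulation frequencies below are toy values chosen for illustration, not values from the paper.

```python
import numpy as np
from scipy.signal import hilbert

# Toy AM signal: a 100 Hz carrier modulated at 4 Hz, a typical
# syllabic-rate modulation well inside the 1-16 Hz range.
fs = 1000
t = np.arange(fs) / fs                        # 1 second of signal
x = (1 + 0.5 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 100 * t)

env = np.abs(hilbert(x))                      # Hilbert (temporal) envelope
spec = np.abs(np.fft.rfft(env - env.mean()))  # modulation spectrum
freqs = np.fft.rfftfreq(len(env), d=1 / fs)
peak_mod_freq = freqs[np.argmax(spec)]        # peaks near the 4 Hz modulation rate
```

Low-pass filtering this envelope spectrum below 16 Hz would leave the 4 Hz modulation intact, which is the sense in which the perceptually important information is said to lie in the 1–16 Hz range.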

Introduction (2/2) However, in the presence of noise, the number of spectral channels needed for good vowel recognition increases, whereas the contribution of temporal modulations remains similar in clean and noisy conditions. For machine recognition of phonemes in noisy speech, there is considerable benefit in using a larger temporal context in the feature representation of a single phoneme. Techniques based on deriving long-term modulation frequencies do not preserve fine temporal events such as onsets and offsets, which are important for separating some phoneme classes. On the other hand, signal-adaptive techniques, which try to represent local temporal fluctuations, strongly attenuate higher modulation frequencies, which makes them less effective even in clean conditions.

Frequency domain linear prediction (1/2) Typically, AR models have been used in speech/audio applications to represent the envelope of the power spectrum of the signal (time-domain linear prediction, TDLP). This paper instead uses AR models to obtain smoothed, minimum-phase, parametric models of temporal rather than spectral envelopes. The input signal is transformed into the frequency domain using the DCT. The full-band DCT is windowed with Bark-spaced windows to yield sub-band DCT components. In each sub-band, the inverse discrete Fourier transform (IDFT) of the DCT coefficients represents the discrete-time analytic signal. Spectral autocorrelations are derived by applying a DFT to the squared magnitude of the analytic signal, and these autocorrelations are used for linear prediction. The output of linear prediction is a set of AR model parameters that characterize the sub-band Hilbert envelopes.
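The chain above can be sketched in a few lines of Python. This is a simplified single-band illustration, not the paper's implementation: it uses the common shortcut of applying autocorrelation-method linear prediction directly to the DCT coefficients (equivalent, via Wiener–Khinchin, to the spectral-autocorrelation route described above), and the model order is an assumed toy value.

```python
import numpy as np
from scipy.fftpack import dct

def fdlp_envelope(x, order=20):
    """AR (FDLP) estimate of the squared Hilbert envelope of x."""
    n = len(x)
    c = dct(x, type=2, norm='ortho')          # signal -> DCT domain
    # Autocorrelation of the DCT coefficients (the "spectral autocorrelation")
    r = np.correlate(c, c, mode='full')[n - 1:] / n
    # Levinson-Durbin recursion for the AR coefficients
    a, e = np.array([1.0]), r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / e
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        e *= 1.0 - k * k
    # Because LP ran in the DCT domain, the AR model's "power spectrum"
    # is a smooth power envelope over the original time axis.
    A = np.fft.fft(a, 2 * n)
    return e / np.abs(A[:n]) ** 2
```

The returned curve is smooth and positive by construction, which is exactly the minimum-phase, parametric envelope property the slide refers to.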

Frequency domain linear prediction (2/2)

Noise compensation in FDLP (1/2) When the speech signal x[n] is corrupted by additive noise v[n], the observed signal is y[n] = x[n] + v[n]. Assuming that speech and noise are uncorrelated, their short-term power spectral densities (PSDs) add: P_y(f) = P_x(f) + P_v(f), where P_y, P_x, and P_v are the noisy, clean, and noise PSDs at frequency f. Conventional feature extraction techniques for ASR estimate the short-term (10–30 ms) PSD of speech on a Bark or mel scale. Hence, most recently proposed noise-robust feature extraction techniques apply some form of spectral subtraction, in which an estimate of the noise PSD is subtracted from the noisy-speech PSD. A voice activity detector (VAD) operates on the input speech signal to indicate the non-speech frames from which the noise estimate is obtained.

Noise compensation in FDLP (2/2) Long segments of the input speech signal are transformed to the DCT domain, where they are decomposed into sub-band DCT components. The discrete-time analytic signal is obtained as the IDFT of each sub-band DCT component, and its squared magnitude gives the Hilbert envelope. We apply short-term noise subtraction to this envelope. The proposed approach therefore operates like an envelope normalization procedure rather than a noise removal technique.
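The spectral-subtraction idea referenced above can be illustrated with a minimal sketch. This is the generic textbook rule, not the paper's exact compensation; the floor fraction is an assumed parameter that keeps the subtracted estimate positive.

```python
import numpy as np

def spectral_subtract(noisy_psd, noise_psd, floor=0.01):
    """Subtract a noise-PSD estimate from the noisy-speech PSD.

    Flooring at a small fraction of the noisy PSD prevents negative
    power estimates, at the cost of some residual "musical" noise.
    """
    noisy_psd = np.asarray(noisy_psd, dtype=float)
    clean_est = noisy_psd - np.asarray(noise_psd, dtype=float)
    return np.maximum(clean_est, floor * noisy_psd)
```

Applied to short-term Hilbert-envelope samples rather than short-term spectra, the same rule becomes the envelope-domain subtraction described on this slide.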

Gain normalization in FDLP In reverberant environments, the speech signal that reaches the microphone is superimposed with multiple reflected versions of the original speech signal: y[n] = x[n] * h[n], where y[n] is the reverberant speech, x[n] the original speech, and h[n] the room impulse response. The envelope function of a typical room impulse response can be modeled as a slowly varying, positive envelope function multiplying an instantaneous phase function; gain normalization of the sub-band envelopes suppresses this slowly varying multiplicative component.
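Since a channel or reverberation gain multiplies the sub-band envelope, dividing each envelope by its own gain makes the result invariant to that scale. A minimal sketch, assuming simple per-window mean normalization (a simplification of the paper's gain normalization):

```python
import numpy as np

def gain_normalize(env, eps=1e-12):
    """Divide a sub-band envelope by its mean gain over the analysis window."""
    env = np.asarray(env, dtype=float)
    return env / (env.mean() + eps)
```

Because gain_normalize(g * env) equals gain_normalize(env) for any constant g > 0, a fixed per-window channel gain drops out of the normalized envelope.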

Robustness in telephone channel noise When the speech signal is passed through a telephone channel, the output is y[n] = x[n] * h[n] + v[n], where h[n] is the channel (convolutive) distortion and v[n] is additive noise. In such conditions, the combination of noise compensation and gain normalization suppresses both additive and convolutive distortions.
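Putting the two pieces together, a toy compensation chain (an illustrative composition under assumed parameters, not the paper's exact algorithm) subtracts an additive-noise estimate from the envelope and then normalizes out the channel gain:

```python
import numpy as np

def compensate(env, noise_env_est, floor=0.01, eps=1e-12):
    """Additive compensation followed by gain normalization."""
    env = np.asarray(env, dtype=float)
    # Additive part: floored subtraction of the noise-envelope estimate
    sub = np.maximum(env - np.asarray(noise_env_est, dtype=float),
                     floor * env)
    # Convolutive part: a constant channel gain cancels in the division
    return sub / (sub.mean() + eps)
```

Scaling both the envelope and the noise estimate by any channel gain g leaves the output unchanged, which is the sense in which the combination handles telephone-channel conditions.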

Feature extraction (1/2) The sub-band Hilbert envelopes are compressed in two stages: a static logarithmic compression, and a dynamic adaptive compression governed by a set of time constants.

Feature extraction (2/2)
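As a sketch of how a compressed sub-band envelope becomes modulation-spectral features, a DCT can be taken over short envelope segments. Only the static (logarithmic) branch is shown, and the 200-sample window, 100-sample hop, and 14 coefficients are assumed toy values, not the paper's settings.

```python
import numpy as np
from scipy.fftpack import dct

def modulation_features(env, win=200, hop=100, n_coef=14, eps=1e-12):
    """Static (log-compressed) modulation features from one sub-band envelope."""
    log_env = np.log(np.asarray(env, dtype=float) + eps)  # static compression
    frames = [log_env[i:i + win]
              for i in range(0, len(log_env) - win + 1, hop)]
    # DCT across time within each frame -> modulation-frequency coefficients
    return np.stack([dct(f, type=2, norm='ortho')[:n_coef] for f in frames])
```

Each row of the output is the low-order modulation spectrum of one envelope segment, ready to be stacked across sub-bands into a feature vector.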

Experiments (1/3) Baseline techniques compared: MRASTA (multi-resolution RASTA), LDMN (log-DFT mean normalization), and LTLSS (long-term log spectral subtraction).

Experiments (2/3) GFCC: gammatone frequency cepstral coefficients.

Experiments (3/3) The Hilbert envelopes form an improved representation compared to short-term critical band energy trajectories.

Summary The application of linear prediction in the frequency domain is an efficient method for deriving sub-band modulations. The two-stage compression scheme, deriving static and dynamic modulation spectra, yields good phoneme recognition for all phoneme classes even in the presence of noise. The noise compensation technique provides a robust representation of speech across almost all noise types and SNR conditions. The robustness of the proposed features is further enhanced by the gain normalization technique. Together, these envelope normalization techniques provide substantial improvements in noisy conditions over previous work.