ICASSP 2004: Speech Discrimination Based on Multiscale Spectro–Temporal Modulations. Nima Mesgarani and Shihab Shamma (University of Maryland), Malcolm Slaney (IBM).

Presentation transcript:

Speech Discrimination Based on Multiscale Spectro–Temporal Modulations
Nima Mesgarani, Shihab Shamma (University of Maryland), Malcolm Slaney (IBM)
Reporter: Chen, Hung-Bin

Outline
– Introduction
– VAD (Voice Activity Detection and Speech Segmentation)
  – discriminate speech from non-speech, which consists of noise sounds
  – multiscale spectro-temporal modulation features extracted using a model of the auditory cortex
– Two state-of-the-art systems
  – Robust Multifeature Speech/Music Discriminator
  – Robust Speech Recognition in Noisy Environments
– Auditory model
– Experimental results
– Summary and Conclusions

Introduction - VAD
Significance
– For speech recognition systems designed for real-world conditions, robust discrimination of speech from other sounds is a crucial first step.
Advantage
– Speech discrimination can also be used in coding and telecommunication applications.
Proposed system
– a feature set inspired by investigations of various stages of the auditory system

Two state-of-the-art systems
Multi-feature System
– Features: thirteen features in the time, frequency, and cepstrum domains are used to model speech and music (noise).
– Classification: a Gaussian mixture model (GMM) models each class of data as the union of several Gaussian clusters in the feature space.
Reference:
– [1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP '97, 1997.
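The class-likelihood comparison behind this baseline can be sketched in a few lines. This is a simplified stand-in, not the paper's system: a single diagonal Gaussian per class instead of a full mixture, and random vectors in place of the real 13-dimensional features.

```python
import numpy as np

def fit_gaussian(X):
    """Fit a single diagonal Gaussian to feature vectors X of shape (n, d).
    A one-component stand-in for the GMM classifier described above."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(x, mean, var):
    """Log-density of x under a diagonal Gaussian (up to the usual constants)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
# synthetic stand-ins for the 13-D speech and music feature vectors
speech_feats = rng.normal(0.0, 1.0, size=(200, 13))
music_feats = rng.normal(3.0, 1.0, size=(200, 13))

ms, vs = fit_gaussian(speech_feats)
mm, vm = fit_gaussian(music_feats)

x = rng.normal(0.0, 1.0, size=13)  # an unseen "speech-like" vector
is_speech = log_likelihood(x, ms, vs) > log_likelihood(x, mm, vm)
```

A full GMM would replace `fit_gaussian` with an EM-trained mixture of several such clusters per class; the decision rule (pick the class with the higher likelihood) stays the same.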

Two state-of-the-art systems (cont.)
Voicing-energy System
– Features: frame-by-frame maximum autocorrelation and log-energy features make the speech/non-speech decision (PLP front end with LDA+MLLT).
– Segmentation: an HMM-based segmentation procedure with two models, one for speech segments and one for non-speech segments.
Reference:
– [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002.
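The two per-frame features named above can be computed directly. A minimal sketch, assuming an 8 kHz sampling rate and a pitch-lag search range of 20–160 samples (illustrative values, not from the paper):

```python
import numpy as np

def frame_features(frame, min_lag=20, max_lag=160):
    """Voicing and energy features for one analysis frame.

    Returns (max_autocorr, log_energy): the maximum normalized
    autocorrelation over a candidate pitch-lag range (voicing cue)
    and the log of the frame energy."""
    frame = frame - frame.mean()
    log_energy = np.log(np.dot(frame, frame) + 1e-12)
    ac = []
    for lag in range(min_lag, max_lag):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        ac.append(np.dot(a, b) / denom)  # normalized autocorrelation
    return max(ac), log_energy

# a voiced-like frame: a 100 Hz sinusoid sampled at 8 kHz (period = 80 samples)
t = np.arange(400) / 8000.0
voiced = np.sin(2 * np.pi * 100 * t)
noise = np.random.default_rng(0).standard_normal(400)

v_corr, _ = frame_features(voiced)   # near 1.0: strongly periodic
n_corr, _ = frame_features(noise)    # low: no dominant periodicity
```

Periodic (voiced) frames score high on the autocorrelation feature and noise frames score low, which is exactly the separation the speech/non-speech decision exploits.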

Auditory model
The computational auditory model is based on neurophysiological, biophysical, and psychoacoustical investigations at various stages of the auditory system. It transforms the acoustic signal into an internal neural representation (an auditory spectrogram).

Auditory model (cont.)
– Vibrations along the basilar membrane of the cochlea form a complex spatiotemporal pattern.
3-step process:
1) highpass filtering, followed by an instantaneous nonlinear compression
2) lowpass filtering (hair-cell membrane leakage)
3) detection of discontinuities in the responses across the tonotopic axis of the auditory-nerve array
– implemented computationally via a bank of modulation-selective filters centered at each frequency along the tonotopic axis.
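The three steps above can be sketched as follows. The specific operators (first difference for the highpass, `tanh` compression, a one-pole lowpass, and half-wave-rectified cross-channel differencing for step 3) and all constants are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def early_auditory_stage(channels, fs=8000.0, tau=0.002):
    """Sketch of the three post-cochlear steps described above.
    `channels` is an array of shape (n_channels, n_samples)."""
    # 1) temporal highpass (first difference) + instantaneous compression
    compressed = np.tanh(np.diff(channels, axis=1))
    # 2) one-pole lowpass filter modeling hair-cell membrane leakage
    alpha = np.exp(-1.0 / (tau * fs))
    lp = np.zeros_like(compressed)
    for n in range(compressed.shape[1]):
        prev = lp[:, n - 1] if n else 0.0
        lp[:, n] = alpha * prev + (1 - alpha) * compressed[:, n]
    # 3) discontinuity detection across the tonotopic (channel) axis,
    #    half-wave rectified, modeling lateral inhibition
    return np.maximum(np.diff(lp, axis=0), 0.0)

x = np.random.default_rng(1).standard_normal((4, 100))  # 4 toy channels
out = early_auditory_stage(x)
```

Each step loses one sample (time derivative) or one channel (tonotopic derivative), so 4 channels of 100 samples come out as a 3 x 99 map of non-negative responses.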

Auditory model (cont.)
Sound is analyzed by a model of the cochlea (depicted on the left) consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis.
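"Constant-Q with log-spaced centers" means the ratio between adjacent center frequencies is fixed, so each filter's bandwidth scales with its center frequency. A minimal sketch of the center-frequency layout; the band edges (180 Hz to 7 kHz) are assumed for illustration, not taken from the paper:

```python
import numpy as np

def cochlear_center_freqs(n_channels=128, f_lo=180.0, f_hi=7000.0):
    """Center frequencies of a constant-Q filterbank, equally spaced
    on a logarithmic frequency axis as in the cochlear model above."""
    return np.geomspace(f_lo, f_hi, n_channels)

cf = cochlear_center_freqs()
# constant-Q property: the ratio between adjacent center frequencies is constant
ratios = cf[1:] / cf[:-1]
```

With these assumed edges the spacing works out to roughly 2.9% per channel, i.e. a fraction of a semitone, giving fine spectral resolution across the whole range.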

Multilinear Analysis of Cortical Representation
The auditory model output is a multidimensional array. The time dimension is averaged over a given window, which yields a three-mode tensor for each window, each element representing the overall modulation at the corresponding frequency, rate, and scale: 128 (frequency channels) × 26 (rates) × 6 (scales).
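The windowed time-averaging step is a one-liner. A sketch on a hypothetical cortical output with 50 time frames (the frame count is an assumption; the 128 x 26 x 6 shape is the paper's):

```python
import numpy as np

# hypothetical 4-D cortical output for one window: (time, frequency, rate, scale)
rng = np.random.default_rng(0)
cortical = rng.random((50, 128, 26, 6))  # 50 time frames (assumed)

# average over the time window, leaving the frequency x rate x scale
# tensor (128 x 26 x 6) described above for this window
window_tensor = cortical.mean(axis=0)
```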

Multilinear Analysis of Cortical Representation (cont.)
Multi-dimensional PCA tailors the amount of reduction in each subspace independently. To generalize PCA to multidimensional tensors, we use a generalization of the SVD (Singular Value Decomposition) to tensors:
D = S ×1 U_frequency ×2 U_rate ×3 U_scale ×4 U_samples
– D: the resulting data tensor
– S: the core tensor, of size I_1 × I_2 × ... × I_N
– Original size: 128 (frequency channels) × 26 (rates) × 6 (scales)
The tensor retaining the leading singular vectors in each mode (7 for frequency, 5 for rate, and 3 for scale) is used for classification. Classification was performed using a Support Vector Machine (SVM).
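The per-mode reduction above can be sketched with mode-n unfoldings and n-mode products, the standard ingredients of the higher-order SVD. This is a generic HOSVD sketch, not the paper's code, run here on a random stand-in tensor:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, U, mode):
    """n-mode product T x_n U, where U has shape (new_dim, old_dim)."""
    return np.moveaxis(np.tensordot(U, T, axes=(1, mode)), 0, mode)

def hosvd_reduce(T, ranks):
    """Project each mode of T onto its leading left singular vectors,
    e.g. ranks (7, 5, 3) for the frequency, rate, and scale modes."""
    core = T
    for mode, r in enumerate(ranks):
        # leading singular vectors of this mode's unfolding of the original tensor
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        core = mode_multiply(core, U[:, :r].T, mode)
    return core

T = np.random.default_rng(0).random((128, 26, 6))  # one window's tensor
reduced = hosvd_reduce(T, (7, 5, 3))
features = reduced.ravel()  # 7*5*3 = 105-D vector for the SVM
```

Each mode is compressed independently (128 to 7, 26 to 5, 6 to 3), which is exactly the "tailor the reduction in each subspace" idea: the 19,968-element tensor becomes a 105-dimensional feature vector.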

Experimental Results
Audio database from TIMIT:
– Training data: 300 samples
– Testing data: 150 different sentences spoken by 50 different speakers (25 male, 25 female)
– The training and test sets were disjoint.
Non-speech class:
– assembled from the BBC Sound Effects audio CD, the RWC Genre Database, and the Noisex and Aurora databases.
Training set: 300 speech and 740 non-speech samples.
Test set: 150 speech and 450 non-speech samples.
The audio lengths are equal.

Experimental Results (cont.)
Speech detection/discrimination
– Tables 1 and 2 show the results.

Experimental Results (cont.)
Noise tests: white and pink noise were added to the speech at specified signal-to-noise ratios (SNR).
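Mixing noise into speech at a specified SNR just means scaling the noise so the speech-to-noise power ratio hits the target. A minimal sketch (the 200 Hz test tone and 8 kHz rate are illustrative, not the paper's test material):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals
    `snr_db` (in dB), then mix it into the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)  # 1 s toy signal
white = rng.standard_normal(8000)                            # white noise
noisy = add_noise_at_snr(speech, white, snr_db=10.0)

# verify: the achieved SNR matches the request
measured = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
```

Pink noise would be handled identically; only the noise source changes, not the scaling.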

Experimental Results (cont.)
Effect of different levels of reverberation on performance.

Summary and Conclusions
This work is but one in a series of efforts at incorporating multi-scale cortical representations (and, more broadly, perceptual insights) into a variety of audio and speech processing applications, such as:
– automatic classification
– segmentation of animal sounds
– efficient encoding of speech and music

References
Two state-of-the-art systems:
– [1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP '97, 1997.
– [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
Central auditory system:
– [4] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system", IEEE Trans. Speech Audio Proc., vol. 3, no. 5, pp. 382–395, 1995.
– [6] M. Elhilali, T. Chi, and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility", Speech Communication, vol. 41, pp. 331–348, 2003.
– S. A. Shamma, "Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method".