
Slide 1: Speech Discrimination Based on Multiscale Spectro–Temporal Modulations (ICASSP 2004)
Nima Mesgarani, Shihab Shamma (University of Maryland); Malcolm Slaney (IBM)
Reporter: Chen, Hung-Bin

Slide 2: Outline
– Introduction
– VAD (Voice Activity Detection and Speech Segmentation)
  – discriminates speech from non-speech, which consists of noise and other sounds
  – multiscale spectro-temporal modulation features extracted using a model of the auditory cortex
– Two state-of-the-art reference systems
  – Robust multifeature speech/music discriminator
  – Robust speech recognition in noisy environments
– Auditory model
– Experimental results
– Summary and conclusions

Slide 3: Introduction
– Significance
  – For speech recognition systems designed for real-world conditions, robust discrimination of speech from other sounds is a crucial first step.
– Advantage
  – Speech discrimination can also be used in coding and telecommunication applications.
– Proposed system
  – a feature set inspired by investigations of various stages of the auditory system

Slide 4: Two state-of-the-art systems
Multi-feature system
– Features
  – Thirteen features in the time, frequency, and cepstrum domains are used to model speech and music (noise).
– Classification
  – A Gaussian mixture model (GMM) models each class of data as the union of several Gaussian clusters in the feature space (a sketch follows below).
– Reference
  – [1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP '97, 1997.
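A minimal sketch of GMM-based speech/non-speech classification in the spirit of [1]. The 13-dimensional feature vectors are assumed to be precomputed; the function names and mixture size are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(speech_feats, nonspeech_feats, n_components=8):
    """Fit one GMM per class on (n_frames, 13) feature arrays."""
    gmm_speech = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm_nonspeech = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm_speech.fit(speech_feats)
    gmm_nonspeech.fit(nonspeech_feats)
    return gmm_speech, gmm_nonspeech

def classify(frame_feats, gmm_speech, gmm_nonspeech):
    """Label each frame by the higher class log-likelihood."""
    ll_speech = gmm_speech.score_samples(frame_feats)
    ll_nonspeech = gmm_nonspeech.score_samples(frame_feats)
    return ll_speech > ll_nonspeech  # True = speech
```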

Slide 5: Two state-of-the-art systems (cont.)
Voicing-energy system
– Features
  – Frame-by-frame maximum autocorrelation and log-energy features are used to make the speech/non-speech decision (see the feature sketch below).
  – The recognizer itself uses PLP features with LDA+MLLT transforms.
– Segmentation
  – An HMM-based segmentation procedure with two models is used: one for speech segments and one for non-speech segments.
– Reference
  – [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
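A hedged sketch of the two voicing-energy features from [2]: per-frame maximum autocorrelation (voicing evidence) and log energy. The frame, hop, and pitch-range settings here are assumptions, not values from the paper.

```python
import numpy as np

def voicing_energy_features(x, sr, frame_len=0.025, hop=0.010,
                            f0_min=60.0, f0_max=400.0):
    n, h = int(frame_len * sr), int(hop * sr)
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    feats = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n] - x[start:start + n].mean()
        energy = np.sum(frame ** 2)
        ac = np.correlate(frame, frame, mode="full")[n - 1:]  # lags 0..n-1
        ac_norm = ac / (ac[0] + 1e-12)                        # normalize by lag 0
        max_ac = ac_norm[lag_min:lag_max].max()               # peak in pitch range
        feats.append((max_ac, np.log(energy + 1e-12)))
    return np.array(feats)  # shape (n_frames, 2)
```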

Slide 6: Auditory model
– The computational auditory model is based on neurophysiological, biophysical, and psychoacoustical investigations of various stages of the auditory system.
– It transforms the acoustic signal into an internal neural representation (the auditory spectrogram).

Slide 7: Auditory model (cont.)
– The cochlea converts sound into a complex spatiotemporal pattern of vibrations along the basilar membrane.
– This is followed by a 3-step process (sketched in code below):
  1) a highpass filter, followed by an instantaneous nonlinear compression
  2) a lowpass filter (hair cell membrane leakage)
  3) a stage that detects discontinuities in the responses across the tonotopic axis of the auditory nerve array
– Further analysis is carried out computationally via a bank of modulation-selective filters centered at each frequency along the tonotopic axis.
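A minimal sketch, not the authors' exact implementation, of the three-step hair-cell/lateral-inhibition stage described above, applied to the output of a cochlear filter bank `y` with shape (n_channels, n_samples). The compression gain and membrane time constant are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def early_auditory_stage(y, sr, tau_membrane=0.002):
    # 1) highpass (temporal derivative) + instantaneous nonlinear compression
    v = np.diff(y, axis=1, prepend=y[:, :1])
    v = np.tanh(v * 100.0)  # illustrative compression gain
    # 2) lowpass: leaky integration modeling hair-cell membrane leakage
    a = np.exp(-1.0 / (tau_membrane * sr))
    v = lfilter([1 - a], [1, -a], v, axis=1)
    # 3) detect discontinuities across the tonotopic axis: difference across
    #    adjacent channels, half-wave rectified
    spec = np.maximum(np.diff(v, axis=0, prepend=v[:1, :]), 0.0)
    return spec  # auditory-spectrogram-like representation
```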

Slide 8: Auditory model (cont.)
– Sound is analyzed by a model of the cochlea consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis (a sketch of such a layout follows below).
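A sketch of how 128 constant-Q center frequencies might be laid out on a logarithmic axis. The starting frequency and channels-per-octave density are assumptions, not values stated on the slide.

```python
import numpy as np

def cochlear_center_freqs(n_channels=128, f_low=180.0, channels_per_octave=24):
    octaves = np.arange(n_channels) / channels_per_octave
    return f_low * 2.0 ** octaves  # equally spaced on a log-frequency axis

cf = cochlear_center_freqs()
print(cf[0], cf[-1])  # 180 Hz up to roughly 7.1 kHz under these assumptions
```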

Slide 9: Multilinear Analysis of the Cortical Representation
– The output of the auditory model is a multidimensional array.
– The time dimension is averaged over a given time window, which results in a three-mode tensor for each window, with each element representing the overall modulation at the corresponding frequency, rate, and scale: 128 (frequency channels) × 26 (rates) × 6 (scales). A sketch of this windowed averaging follows below.
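A sketch of collapsing a 4-D cortical output (time × frequency × rate × scale) into one 128 × 26 × 6 tensor per analysis window by averaging over time. The array name and window length are assumptions.

```python
import numpy as np

def window_tensors(cortical, frames_per_window):
    """cortical: (n_frames, 128, 26, 6) -> (n_windows, 128, 26, 6)."""
    n_windows = cortical.shape[0] // frames_per_window
    trimmed = cortical[:n_windows * frames_per_window]
    blocks = trimmed.reshape(n_windows, frames_per_window, *cortical.shape[1:])
    return blocks.mean(axis=1)  # average the time dimension within each window
```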

Slide 10: Multilinear Analysis of the Cortical Representation (cont.)
– Multi-dimensional PCA is used to tailor the amount of reduction in each subspace independently.
– To handle the multidimensional tensors, a generalization of the SVD (Singular Value Decomposition) to tensors is used:
  D = S ×1 U_frequency ×2 U_rate ×3 U_scale ×4 U_samples
  – D: the resulting data tensor
  – S: the I1 × I2 × ... × IN core tensor
  – Original dimensions: 128 (frequency channels) × 26 (rates) × 6 (scales)
– The reduced tensor, formed by retaining a few singular vectors in each mode (7 for frequency, 5 for rate, and 3 for scale), is used for classification (see the sketch below).
– Classification is performed with a Support Vector Machine (SVM).
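A hedged sketch of this HOSVD-style reduction: compute the leading left singular vectors of each mode-n unfolding, project every windowed tensor onto them, and feed the flattened result to an SVM. The ranks (7, 5, 3) follow the slide; everything else (function names, kernel choice) is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def mode_unfold(T, mode):
    """Unfold tensor T along `mode` into a matrix (I_mode x prod(other dims))."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fit_mode_bases(tensors, ranks=(7, 5, 3)):
    """tensors: (n_samples, 128, 26, 6); returns one basis U per data mode."""
    bases = []
    for mode, r in zip((1, 2, 3), ranks):  # sample mode is axis 0
        U, _, _ = np.linalg.svd(mode_unfold(tensors, mode), full_matrices=False)
        bases.append(U[:, :r])             # keep the leading singular vectors
    return bases

def project(tensors, bases):
    """Multilinear projection D x1 Uf^T x2 Ur^T x3 Us^T, then flatten."""
    out = tensors
    for mode, U in zip((1, 2, 3), bases):
        out = np.moveaxis(np.tensordot(out, U, axes=([mode], [0])), -1, mode)
    return out.reshape(out.shape[0], -1)   # (n_samples, 7*5*3)

# Usage (assuming train_tensors, train_labels exist):
# bases = fit_mode_bases(train_tensors)
# clf = SVC(kernel="rbf").fit(project(train_tensors, bases), train_labels)
```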

Slide 11: Experimental results
– Speech data from TIMIT
  – Training data: 300 samples
  – Testing data: 150 different sentences spoken by 50 different speakers (25 male, 25 female)
  – The training and test sets were disjoint.
– Non-speech class
  – Assembled from the BBC Sound Effects audio CDs, the RWC Genre Database, and the Noisex and Aurora databases.
– Training set: 300 speech and 740 non-speech samples
– Testing set: 150 speech and 450 non-speech samples
– All samples have equal audio length.

Slide 12: Experimental results (cont.)
– Speech detection/discrimination: Tables 1 and 2 (in the paper) show the results.

Slide 13: Experimental results (cont.)
– For robustness tests, white and pink noise were added to the speech at specified signal-to-noise ratios (SNRs); a mixing sketch follows below.
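A minimal sketch of mixing noise into speech at a target SNR in dB, as used for such robustness tests. The function name is illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    noise = noise[:len(speech)]        # assume the noise clip is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```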

Slide 14: Experimental results (cont.)
– The effect of different levels of reverberation on performance was also evaluated (a simulation sketch follows below).
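A sketch of one common way to simulate reverberation: convolving speech with a room impulse response. The exponentially decaying noise RIR here is a stand-in assumption, not the paper's actual reverberation conditions.

```python
import numpy as np

def reverberate(speech, sr, rt60=0.5, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(int(rt60 * sr)) / sr
    # noise with exponential decay reaching -60 dB at t = rt60
    rir = rng.standard_normal(len(t)) * 10 ** (-3.0 * t / rt60)
    y = np.convolve(speech, rir)[:len(speech)]
    return y / (np.max(np.abs(y)) + 1e-12)  # normalize to avoid clipping
```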

Slide 15: Summary and conclusions
– This work is one in a series of efforts to incorporate multiscale cortical representations (and, more broadly, perceptual insights) into a variety of audio and speech processing applications.
– Applications include:
  – automatic classification
  – segmentation of animal sounds
  – efficient encoding of speech and music

Slide 16: References
Two state-of-the-art systems
– [1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP '97, 1997.
– [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
Central auditory system
– [4] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system", IEEE Trans. Speech Audio Proc., vol. 3, no. 5, pp. 382–395, 1995.
– [6] M. Elhilali, T. Chi, and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility", Speech Communication, vol. 41, pp. 331–348, 2003.
– S. A. Shamma, "Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method".
– http://www.isr.umd.edu/People/faculty/Shamma.html

