Temporal masking of spectrally reduced speech: psychoacoustical experiments and links with ASR
Frédéric Berthommier and Angélique Grosgeorges, ICP, 46 av. Félix Viallet, Grenoble, France

Introduction and motivations
We used the experimental paradigm proposed by [Shannon et al., 95], from which we developed a series of experiments. Following (Horii et al., 1971), they varied the spectro-temporal resolution of speech utterances. The stimuli were composed of white noise modulated by the filtered envelopes extracted in 4 subbands. The task was consonant identification for VCVCV utterances among 16 French consonants. We then evaluated the transmission of their phonetic features: voicing, manner and place of articulation. We extend this paradigm by masking this residual signal with stationary [Lorenzi et al., 99] or non-stationary noises [Grosgeorges et al., 00]. In this framework, we replace the pair (local SNR / acoustic representation), analysed in terms of identification rate, with another pair (global SNR / phonetic representation), analysed in terms of feature transmission. We then focus on the problem of acoustic-phonetic decoding in noise, and on the impact of the noise on the features grounding the classification process. In other words, we postulate the existence of an intermediate level preceding phonetic categorisation, and we study its properties.

Introduction and motivations (2)
We thus expect a set of complementary results from this approach: informative about the link between auditory and speech processes, useful for CASA, and useful for developing ASR for noisy and distorted speech. For RESPITE, the goal of this project is to set up a plausible multi-stream model in which the phonetic identification of consonants is grounded in the extraction of three phonetic characteristics (voicing, place and manner) by specialised modules with different spectro-temporal resolutions. A pre-classification according to this phonetic representation could be more robust than direct classification, the streams easier to weight according to their information content, and the fusion process easier to control. Remark: vowel identification is considered well modelled in current implementations. Moreover, the visual modality can easily be integrated into this model for the same reason: the audio-visual complementarity is optimally represented.

The Shannon et al. experiment
Spectral degradation: the signal was divided into one, two, three or four frequency bands. Temporal degradation: the amplitude envelope extracted from each band was low-pass filtered with cutoff frequencies Fc = 16, 50, 160 or 500 Hz. The identification of 3 features (voicing, manner and place) for 16 French consonants « a/C/a » was evaluated by the classical information transmission analysis (Miller and Nicely, 1955). The main conclusion: despite the great spectro-temporal reduction, voicing and manner are remarkably well transmitted by the residual envelope, i.e. by the temporal components of the speech. Some questions arise: how is this residue processed? How can it be used to increase robustness? One way is to mask it and to analyse what happens.

Factorial design of the masking experiment
Factor 1: the spectral resolution was held constant at 4 frequency bands, and the envelope was filtered with a cutoff frequency Fc of 10 or 500 Hz. Factor 2: we added different temporal maskers in order to selectively degrade the different components of the residual signal: (1) to mask the coarse component of temporal information, we used low-frequency AM (amplitude modulation < 8 Hz) white noise applied in each subband, for all maskers; (2) to degrade the residual spectral information, we decorrelated the low-frequency AM across the 4 frequency bands; (3) to mask the fine temporal information, we re-modulated the low-frequency AM of the masker at 100 Hz.
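As a rough illustration, a minimal Python sketch of how the three masker levels could be synthesised is given below. The function name, filter orders and the omission of subband bandpass filtering are assumptions made for brevity; only the < 8 Hz AM cutoff, the decorrelation across the 4 bands and the 100 Hz re-modulation come from the design described above.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def make_masker(n_samples, fs, level=1, n_bands=4, am_cut=8.0, seed=0):
    """Sketch of the three masker levels.
    level 1: white noise modulated by one common low-frequency (< 8 Hz)
             AM envelope in each subband;
    level 2: the AM envelopes are decorrelated across the subbands;
    level 3: the decorrelated AM is re-modulated at 100 Hz.
    Subband bandpass filtering is omitted for brevity."""
    rng = np.random.default_rng(seed)
    sos = butter(2, am_cut, 'low', fs=fs, output='sos')
    n_env = 1 if level == 1 else n_bands          # shared vs decorrelated AM
    envs = [np.abs(sosfiltfilt(sos, rng.standard_normal(n_samples)))
            for _ in range(n_env)]
    t = np.arange(n_samples) / fs
    masker = np.zeros(n_samples)
    for b in range(n_bands):
        env = envs[min(b, n_env - 1)]
        if level == 3:                            # 100 Hz re-modulation
            env = env * (1.0 + np.sin(2.0 * np.pi * 100.0 * t)) / 2.0
        masker += env * rng.standard_normal(n_samples)
    return masker / np.max(np.abs(masker))        # normalise peak amplitude
```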

Factorial design (2)
Task: consonant identification in a quiet room, with forced choice and no feedback.
Subjects: 6 normal-hearing listeners, untrained for this task; however, all had experience with psychoacoustical experiments.
Stimuli: 384 stimuli covering the 6 different conditions, presented in random order.

Speech and signal processing
16 utterances aCaCa, with C = {b, d, g, v, Z, z, m, n, r, l, p, t, k, f, s, S}.
Consonant features:
- voicing: voiced = {b, d, g, v, Z, z, m, n, r, l} / voiceless = {p, t, k, f, s, S}
- manner: fricative + liquid = {f, s, S, v, Z, z, r, l} / occlusive + nasal = {p, t, k, b, d, g, m, n}
- place: labial = {p, b, f, v, m} / dental = {t, d, s, z, n, l} / palatal = {k, g, Z, S, r}
Stimulus: FS = Hz; frame analysis = 92.8 ms.
Processing chain (from the block diagram): nonsense speech at SNR = +6 dB plus the temporal masker; FFT spectral band decomposition; signal rectification; low-pass filtering at 500 Hz or 10 Hz; iFFT; bandpass filtering; signal reconstruction by modulating white noise with the subband envelopes. The white-noise maskers combine the degradations (1), (1) + (2), or (1) + (2) + (3).
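To make the envelope-extraction chain concrete, here is a minimal noise-vocoding sketch in Python, in the spirit of the pipeline above. The band edges, filter orders, sampling rate and the use of time-domain Butterworth filters in place of the FFT/iFFT decomposition are assumptions; only the 4-band split, the rectification and the 10/500 Hz envelope low-pass come from the slides.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def noise_vocode(speech, fs, n_bands=4, fc_env=10.0,
                 f_lo=100.0, f_hi=6000.0):
    """Noise vocoding in the spirit of Shannon et al. (1995): split the
    spectrum into n_bands log-spaced bands, extract each band's envelope
    by rectification and low-pass filtering at fc_env (10 or 500 Hz here),
    and re-apply it to band-limited white noise. f_hi must stay below
    fs / 2 (e.g. fs = 16000)."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # assumed band edges
    env_sos = butter(4, fc_env, 'low', fs=fs, output='sos')
    out = np.zeros_like(speech)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], 'bandpass', fs=fs, output='sos')
        band = sosfiltfilt(band_sos, speech)
        env = sosfiltfilt(env_sos, np.abs(band))    # rectify + low-pass
        env = np.maximum(env, 0.0)                  # clip filter undershoot
        noise = sosfiltfilt(band_sos, np.random.randn(len(speech)))
        out += env * noise                          # modulated subband noise
    return out
```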

Example of stimulus

Results of the experiment
For all conditions, the chance level for consonant recognition was 6.25% (1/16). Overall mean correct identification for the 6 subjects was 28%. A confusion matrix was generated for each listener and summed across listeners. Then, the mean transmitted information (Miller and Nicely, J. Acoust. Soc. Am., 1955) for voicing, manner and place of articulation was evaluated. The average information received for each consonant feature is plotted as a function of the masker level number, and compared with the average information received when there was no temporal masker (dashed lines).
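For reference, a small Python sketch of this transmission analysis applied to one feature: the consonant confusion matrix is collapsed into feature categories and the relative transmitted information is computed. The function and variable names are mine; the formula is the standard mutual information normalised by the stimulus entropy, as in Miller and Nicely (1955).

```python
import numpy as np

def transmitted_information(conf, groups):
    """Relative transmitted information for one phonetic feature.
    conf:   16x16 consonant confusion matrix (stimulus x response counts)
    groups: feature category of each consonant, e.g. 1 = voiced, 0 = voiceless
    Returns T(x;y) / H(x), between 0 and 1."""
    k = max(groups) + 1
    m = np.zeros((k, k))                      # collapse into feature categories
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            m[gi, gj] += conf[i, j]
    p = m / m.sum()                           # joint probabilities
    px, py = p.sum(axis=1), p.sum(axis=0)     # stimulus / response marginals
    nz = p > 0
    t = np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return t / hx

# Example: voicing with the 10 voiced consonants listed first.
# groups = [1] * 10 + [0] * 6
```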

Results: transmission of voicing
Voicing is not transmitted by the fine temporal modulation (as in Shannon et al.), and it decreases slightly with the degradation of the residual spectral information induced by decorrelation. We therefore conclude that voicing is acoustically "distributed", so that the degradation produced by the different maskers' characteristics (low-frequency AM, decorrelation and 100 Hz re-modulation) is cumulative.

Results: transmission of manner
Manner of consonant articulation is completely suppressed by all the temporal maskers, which share a low-frequency AM characteristic. There is no significant difference from 0% information received. When spectral information is reduced, manner is conveyed by the coarse envelope component, and it interferes strongly with a low-frequency AM masker: the differentiation between fricatives and occlusives is encoded temporally, and it is well masked by noise with close temporal characteristics.

Nullification of manner transmission

Results: place transmission
Place of articulation is transmitted significantly less (P < 0.05; t-test) for levels 2 and 3 than for level 1, for Fc = 10 Hz (*). Decorrelation degrades the residual spectral information (for Fc at 10 Hz).

Conclusion of the masking experiment
We reproduce Shannon et al.'s main results. Our experiment suggests that:
- voicing is a redundant consonant feature which depends on both categories of information, the coarse temporal envelope and spectral information;
- manner, in contrast, is mainly carried by the coarse temporal envelope.
This experiment supports the hypothesis that consonant identification is a complex process which can compensate for the reduction or masking of either temporal or spectral information by using residual information for voicing and place, but not for manner.

Perspective (1): variation of the spectro-temporal resolution
[Figure: clean signal, Fc = 10 Hz and 500 Hz]
Intelligibility is weak for 1 and 2 subbands, with poor transmission of the place of articulation. The difference between Fc at 10 and 500 Hz is weak.

Perspective (2): interaction between spectral reduction and masking
[Figure: information received (%) for voicing, place and manner of articulation, in four conditions: 4 subbands at SNR = +6 dB, 4 subbands clean, 16 subbands clean, 16 subbands at SNR = +6 dB]
This preliminary experiment (Fc = 10 Hz) shows that, for manner, the effects of spectral reduction and of temporal masking are rather independent, the latter having the stronger impact. This confirms that manner is mainly encoded temporally. One proposal for multistream ASR is therefore to decode this feature temporally in a separate 4-subband stream, as sketched below.
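One possible reading of this multistream proposal, as a hedged sketch only: each specialised stream (voicing, manner, place) produces class posteriors, and the streams are fused log-linearly with weights reflecting their estimated reliability. The log-linear rule and all names here are illustrative assumptions, not the RESPITE implementation.

```python
import numpy as np

def fuse_streams(posteriors, weights):
    """Hypothetical weighted log-linear fusion of per-stream posteriors.
    posteriors: list of arrays, one per stream, each over the same classes
    weights:    one reliability weight per stream"""
    logp = sum(w * np.log(p + 1e-12) for p, w in zip(posteriors, weights))
    p = np.exp(logp - logp.max())             # renormalise for stability
    return p / p.sum()

# Example: down-weighting an unreliable manner stream in noise.
# fuse_streams([p_voicing, p_manner, p_place], [1.0, 0.3, 1.0])
```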

Perspective (3): audio-visual complementarity
As shown by Erber (1972), intelligibility is high even for 1 and 2 subbands: place of articulation is the feature best transmitted by the visual modality, whereas it is the worst transmitted for the spectrally reduced audio, so global intelligibility is restored thanks to the direct complementarity of transmission in the two modalities.