Speech Segregation Based on Sound Localization DeLiang Wang & Nicoleta Roman The Ohio State University, U.S.A. Guy J. Brown University of Sheffield, U.K.

2. Outline of presentation
- Background & objective
- Description of a novel approach
- Evaluation
  - Using SNR and ASR measures
  - Speech intelligibility measure
  - A comparison with an existing model
- Summary

3. Cocktail-party problem
- How can we model a listener's remarkable ability to attend selectively to one talker while filtering out other acoustic interference?
- The auditory system performs auditory scene analysis (Bregman, 1990) using various cues, including fundamental frequency, onset/offset, and location.
- Our study focuses on location cues:
  - Interaural time difference (ITD)
  - Interaural intensity difference (IID)

4. Background
- Auditory masking: within a narrow band, a stronger signal masks a weaker one.
- With multiple sources, one source generally dominates each local time-frequency region.
- Our computational goal for speech segregation is to identify a time-frequency (T-F) binary mask that extracts the T-F units dominated by the target speech.

5. Ideal binary mask
- An ideal binary mask is defined as follows (s: target signal; n: noise), with $E_s(t,f)$ and $E_n(t,f)$ denoting the signal and noise energy in the T-F unit at time frame t and frequency channel f:
  - Relative strength: $R(t,f) = \dfrac{E_s(t,f)}{E_s(t,f) + E_n(t,f)}$
  - Binary mask: $M(t,f) = 1$ if $R(t,f) > 0.5$ (the target dominates), and $M(t,f) = 0$ otherwise.
- So our research aims at computing, or estimating, the ideal binary mask.
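
For concreteness, here is a minimal sketch (our illustration, not the authors' code) of the ideal-mask computation, assuming the premixed target and noise have already been passed through the same T-F analysis:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy):
    """Compute the ideal binary mask from premixed target and noise.

    target_energy, noise_energy: arrays of shape (channels, frames)
    holding the energy of each source in every time-frequency unit.
    Returns a 0/1 mask of the same shape: 1 where the target dominates.
    """
    eps = 1e-12  # guard against empty T-F units
    relative_strength = target_energy / (target_energy + noise_energy + eps)
    return (relative_strength > 0.5).astype(float)  # > 0.5 <=> local SNR > 0 dB
```

Applying this mask to the mixture's T-F units and resynthesizing yields the segregated target.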

6. Model architecture
[Figure: block diagram of the model]

7. Head-related transfer function
- The pinna, torso, and head act acoustically as a linear filter whose transfer function depends on the direction of, and distance to, a sound source.
- We use a catalogue of HRTF measurements collected by Gardner and Martin (1994) from a KEMAR dummy head under anechoic conditions.
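
Binaural input can be simulated by convolving a monaural source with the left- and right-ear head-related impulse responses (HRIRs) for its direction and summing the sources at each ear. A minimal sketch with stand-in data (real HRIRs would come from the KEMAR catalogue):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(source, hrir_l, hrir_r):
    """Convolve a monaural source with a left/right HRIR pair."""
    return fftconvolve(source, hrir_l), fftconvolve(source, hrir_r)

# Stand-in data; real HRIRs would be loaded from the KEMAR measurements.
fs = 16000
target, noise = np.random.randn(fs), np.random.randn(fs)
impulse = np.r_[1.0, np.zeros(127)]              # impulse stand-in for an HRIR
hrir_t_l, hrir_t_r = impulse, impulse            # target at 0 deg (symmetric)
hrir_n_l, hrir_n_r = 0.7 * impulse, impulse      # interferer off to the right

tl, tr = spatialize(target, hrir_t_l, hrir_t_r)
nl, nr = spatialize(noise, hrir_n_l, hrir_n_r)
left, right = tl + nl, tr + nr                   # binaural mixture
```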

8. Auditory periphery
- Cochlear filtering is modelled by a bank of 128 gammatone filters covering the frequency range 80 Hz - 5 kHz.
- The gains of the gammatone filters are adjusted to simulate the middle-ear transfer function.
- A simple model of the auditory nerve: half-wave rectification followed by a square-root operation (to simulate saturation).
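
A sketch of such a peripheral front end, using an FIR approximation of the 4th-order gammatone impulse response with standard ERB parameters (Glasberg and Moore); the middle-ear gain adjustment is omitted here:

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth of an auditory filter at fc Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.064, order=4):
    """FIR approximation of a gammatone impulse response (unnormalized)."""
    t = np.arange(0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def erb_rate(f):
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def inv_erb_rate(e):
    return (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def periphery(x, fs, n_channels=128, f_lo=80.0, f_hi=5000.0):
    """Gammatone filterbank followed by a crude hair-cell model."""
    # Center frequencies equally spaced on the ERB-rate scale.
    cfs = inv_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels))
    out = np.stack([np.convolve(x, gammatone_ir(fc, fs), mode="same")
                    for fc in cfs])
    out = np.maximum(out, 0.0)        # half-wave rectification
    return np.sqrt(out), cfs          # square-root compression (saturation)
```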

9. Azimuth localization
- A cross-correlation mechanism detects ITD (Jeffress, 1948).
- A frequency-dependent nonlinear transformation maps the time-delay axis to the azimuth axis.
- Sharpening the cross-correlogram, with an effect similar to lateral inhibition, yields a skeleton cross-correlogram.
- Source locations are identified as peaks in the skeleton cross-correlogram.
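
The first step can be sketched as a normalized cross-correlation per channel; the delay-to-azimuth mapping and the sharpening stage are omitted from this illustration:

```python
import numpy as np

def cross_correlogram(left, right, fs, max_itd=1e-3):
    """Normalized cross-correlation of left/right channels over candidate lags.

    left, right: (channels, samples) peripheral outputs for one time frame.
    Returns a (channels, lags) correlogram and the lag axis in seconds.
    """
    max_lag = int(round(max_itd * fs))          # e.g. +/- 1 ms
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty((left.shape[0], lags.size))
    for c in range(left.shape[0]):
        l, r = left[c] - left[c].mean(), right[c] - right[c].mean()
        denom = np.sqrt(np.dot(l, l) * np.dot(r, r)) + 1e-12
        for i, lag in enumerate(lags):
            # Circular shift used for brevity in this sketch.
            cc[c, i] = np.dot(l, np.roll(r, lag)) / denom
    # Summing across channels pools evidence; peaks of the pooled pattern
    # indicate the ITDs (and hence azimuths) of the active sources.
    return cc, lags / fs
```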

10. Azimuth localization: example (target: 0°, noise: 20°)
[Figure: conventional cross-correlogram for one frame, and the skeleton cross-correlogram]

11. Binaural cue extraction
- Interaural time difference:
  - Cross-correlation mechanism.
  - To resolve the multiple-peak problem at high frequencies, ITD is estimated as the peak of the cross-correlation pattern within one period of the channel center frequency, centered at the target's ITD.
- Interaural intensity difference: the ratio of right-ear to left-ear energy, in dB: $\mathrm{IID}(t,f) = 10\log_{10}\big(E_R(t,f)/E_L(t,f)\big)$.
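
A sketch of the per-unit cue extraction under these conventions; restricting the lag search to one period around the target ITD is the point of the high-frequency fix:

```python
import numpy as np

def extract_cues(left, right, fs, cfs, itd_target=0.0):
    """Per-channel ITD (s) and IID (dB) for one time frame.

    left, right: (channels, samples) peripheral outputs for the frame.
    cfs: channel center frequencies in Hz. itd_target: ITD of the
    localized source, used to restrict the high-frequency search range.
    """
    n_ch = left.shape[0]
    itd, iid = np.empty(n_ch), np.empty(n_ch)
    for c in range(n_ch):
        # Search lags within one period of the center frequency,
        # centered at the target ITD (resolves multiple peaks).
        half_period = 0.5 / cfs[c]
        lo = int(np.floor((itd_target - half_period) * fs))
        hi = int(np.ceil((itd_target + half_period) * fs))
        lags = np.arange(lo, hi + 1)
        cc = [np.dot(left[c], np.roll(right[c], k)) for k in lags]
        itd[c] = lags[int(np.argmax(cc))] / fs
        iid[c] = 10 * np.log10(np.sum(right[c] ** 2) /
                               (np.sum(left[c] ** 2) + 1e-12))
    return itd, iid
```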

12. Ideal binary mask estimation
- For narrowband stimuli, we observe that the extracted ITD and IID values change systematically as the relative strength of the original signals changes. This interaction produces characteristic clustering in the joint ITD-IID space.
- The core of our model is deriving the statistical relationship between the relative strength and the values of the binaural cues.
- We employ utterances from the TIMIT corpus for training, and both TIMIT and the corpus collected by Cooke (1993) for testing.

13. Theoretical analysis
- We perform a theoretical analysis with two pure tones to derive the relationship between the ITD and IID values and the relative strength of the tones.
- The main conclusion is that both ITD and IID shift systematically as the relative strength changes.
- The theoretical results for pure tones match closely the corresponding data from real speech.
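
The shift can be reproduced with a few lines of phasor arithmetic. In this illustration (our numbers, not the authors' derivation), a target tone at 0° (zero ITD and IID) is mixed with an equal-frequency interferer assumed to have an ITD of 0.25 ms and an IID of 3 dB; as the amplitude ratio varies, the composite ITD and IID move between the two sources' values:

```python
import numpy as np

f = 500.0                       # tone frequency (Hz)
w = 2 * np.pi * f
itd_n, iid_n = 0.25e-3, 3.0     # assumed interferer ITD (s) and IID (dB)

for a in [0.1, 0.5, 1.0, 2.0, 10.0]:   # target/interferer amplitude ratio
    # Ear signals as phasors: target (amplitude a, no interaural
    # difference) plus interferer (delayed and louder at the right ear).
    left = a + 1.0
    right = a + 10 ** (iid_n / 20.0) * np.exp(-1j * w * itd_n)
    itd = -np.angle(right / left) / w            # composite ITD (s)
    iid = 20 * np.log10(abs(right) / abs(left))  # composite IID (dB)
    print(f"ratio {a:5.1f}: ITD = {itd * 1e3:6.3f} ms, IID = {iid:5.2f} dB")
```

As the ratio grows, the printed ITD and IID converge from the interferer's values toward the target's (0 ms, 0 dB), which is the systematic shift exploited by the model.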

14. Two-source configuration: ITD
[Figure: theoretical ITD vs. mean ITD from data for one channel (CF: 500 Hz)]

15. Two-source configuration: IID
[Figure: theoretical IID vs. mean IID from data for one channel (CF: 2.5 kHz)]

16. Three-source configuration
[Figure: data histograms for one channel (CF: 1.5 kHz) from speech sources with the target at 0° and two intrusions at -30° and 30°, showing clustering in the joint ITD-IID space]

17. Pattern classification
- Independent supervised learning for each spatial configuration and each frequency band, in the joint ITD-IID feature space.
- Define, for a T-F unit with observed cues $x = (\mathrm{ITD}, \mathrm{IID})$, the hypotheses $H_1$: the target dominates ($R > 0.5$) and $H_2$: the interference dominates ($R \le 0.5$).
- Decision rule (MAP): label the unit 1 if $p(H_1)\,p(x \mid H_1) > p(H_2)\,p(x \mid H_2)$, and 0 otherwise.

18. Pattern classification (cont.)
- The probability densities $p(x \mid H_i)$ are estimated nonparametrically by kernel density estimation.
- We employ the least-squares cross-validation method (Sain et al., 1994) to determine optimal smoothing parameters.
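
A compact sketch of the resulting classifier using scipy's Gaussian KDE; note that gaussian_kde applies a rule-of-thumb bandwidth, whereas the system described here selects bandwidths by least-squares cross-validation:

```python
import numpy as np
from scipy.stats import gaussian_kde

class MapUnitClassifier:
    """MAP labelling of T-F units in the joint ITD-IID space.

    A sketch for one frequency band and one spatial configuration.
    """

    def fit(self, cues_target, cues_noise):
        # cues_*: (2, n) arrays of [ITD; IID] from training units where
        # the target (resp. the interference) dominates.
        n1, n2 = cues_target.shape[1], cues_noise.shape[1]
        self.prior1 = n1 / (n1 + n2)           # p(H1) from class frequencies
        self.kde1 = gaussian_kde(cues_target)  # p(x | H1)
        self.kde2 = gaussian_kde(cues_noise)   # p(x | H2)
        return self

    def predict(self, cues):
        # Label a unit 1 when p(H1) p(x|H1) > p(H2) p(x|H2).
        post1 = self.prior1 * self.kde1(cues)
        post2 = (1 - self.prior1) * self.kde2(cues)
        return (post1 > post2).astype(float)
```

One such classifier is trained per frequency channel and spatial configuration; evaluating it on each T-F unit's (ITD, IID) pair yields the estimated binary mask.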

19. Example (target: 0°, noise: 30°)
[Figure: target, noise, mixture, ideal binary mask, and segregation result]

20. Demo: two-source configuration (target: 0°, noise: 30°)
[Audio demos: target, noise, mixture, and segregated target for five interferences: white noise, 'cocktail party', rock music, siren, and female speech]

21. Demo: three-source configuration (target: 0°, noise 1: -30°, noise 2: 30°)
[Audio demos: target, noise 1, noise 2, mixture, and segregated target for two interferences: 'cocktail party' and female speech]

22. Systematic evaluation: two-source configuration
[Figure: SNR (dB) results; the average SNR gain at the better ear ranges from 13.7 dB for the upper two panels to 5 dB for the lower-left panel]
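
For reference, output SNR can be measured against the premixed target at the better ear; a minimal sketch of that bookkeeping (the exact evaluation protocol is our assumption):

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of an estimate against a reference signal, in dB."""
    n = min(len(reference), len(estimate))
    ref, err = reference[:n], reference[:n] - estimate[:n]
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum(err ** 2) + 1e-12))

# SNR gain: output SNR of the segregated target minus the input SNR
# of the mixture at the better ear, e.g.:
# gain = snr_db(target, segregated) - snr_db(target, mixture)
```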

23. Three-source configuration
[Figure: SNR results; the average SNR gain is 11.3 dB]

24. Comparison with the Bodden model
We have implemented and compared against the Bodden (1993) model, which estimates a Wiener filter for segregation. Our system yields an average improvement of 3.5 dB over it.

25. ASR evaluation
- We employ the missing-data technique for robust speech recognition developed by Cooke et al. (2001); the decoder uses only the acoustic features indicated as reliable by a binary mask.
- The task domain is connected-digit recognition; both training and testing use the left-ear signal from the male-speaker portion of the TIDigits database.

26. ASR evaluation: results
[Figure: recognition results for the target at 0° with a male-speech intrusion at 30°, and for the target at 0° with two intrusions at 30° and -30°]

27. Speech intelligibility tests
- We employ the Bamford-Kowal-Bench sentence database, which contains short, semantically predictable sentences, as the target. The score is the percentage of keywords correctly identified.
- In the unprocessed condition, binaural signals are convolved with HRTFs and presented dichotically to the listener. In the processed condition, our algorithm reconstructs the target signal at the better ear and the result is presented diotically.

28. Speech intelligibility results
[Figure: keyword scores, unprocessed vs. segregated, for the two-source (0°, 5°) condition with babble-noise interference, and the three-source (0°, 30°, -30°) condition with male and female utterance interference]

29. Summary
- We have proposed a classification-based approach to speech segregation in the joint ITD-IID feature space.
- Evaluation using both SNR and ASR measures shows that our model estimates ideal binary masks very well.
- The system produces substantial ASR and speech intelligibility improvements in noisy conditions.
- Our work shows that computed location cues can be very effective for across-frequency grouping.
- Future work needs to address reverberant environments and moving sources.

30. Acknowledgement
- Work supported by the AFOSR and the NSF.