USING COMPUTATIONAL MODELS OF BINAURAL HEARING TO IMPROVE AUTOMATIC SPEECH RECOGNITION: Promise, Progress, and Problems. Richard Stern, Department of Electrical and Computer Engineering, Carnegie Mellon University.


Slide 1: USING COMPUTATIONAL MODELS OF BINAURAL HEARING TO IMPROVE AUTOMATIC SPEECH RECOGNITION: Promise, Progress, and Problems
Richard Stern
Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
AFOSR Workshop on Computational Audition, August 9, 2002

Slide 2: Introduction: using binaural processing to improve speech recognition accuracy
We’re doing better than I expected, but there is still a lot to be done.
Talk outline …
–Briefly review some relevant binaural phenomena
–Talk briefly about “classical” modeling approaches
–Discuss some specific implementations of binaural processors that are particularly relevant to automatic speech recognition (ASR)
–Comment a bit on results and future prospects

Slide 3: I’ll focus on …
Studies that
–are based on the binaural temporal representation, and
–are applied in a meaningful fashion to automatic speech recognition
Will not consider …
–Other types of multiple-microphone processing
»Fixed and adaptive beamforming
»Noise cancellation using adaptive arrays
–Applications to source localization and separation
–Applications to the hearing impaired
–Other ASR applications using other approaches (such as sub-band coherence-based enhancement)

Slide 4: How can two (or more) ears help us?
The binaural system can focus attention on a single speaker in a complex acoustic scene.

Slide 5: Another big binaural advantage …
The binaural system can suppress reflected components of sound in a reverberant environment.

Slide 6: Primary binaural phenomena
–Interaural time delays (ITDs)
–Interaural intensity differences (IIDs)
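These two cues can be made concrete in a few lines of code. The sketch below (plain NumPy; the stereo pair is synthetic and the parameter choices are mine, purely for illustration) estimates a single broadband ITD from the peak of the interaural cross-correlation and an IID from the interaural energy ratio; real binaural models extract both cues per frequency band after peripheral filtering.

```python
import numpy as np

def itd_iid(left, right, fs):
    """Broadband ITD/IID estimate from a stereo pair (illustrative only)."""
    # IID: interaural level difference in dB (positive => left is louder)
    iid_db = 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
    # ITD: lag of the cross-correlation peak (positive => left leads)
    xcorr = np.correlate(left, right, mode="full")
    lags = np.arange(-(len(right) - 1), len(left))
    itd_s = -lags[np.argmax(xcorr)] / fs
    return itd_s, iid_db

# Synthetic stereo pair: the right ear hears the same noise 10 samples
# later and about 6 dB weaker, as if the source were off to the left.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
left = x
right = 0.5 * np.roll(x, 10)
itd, iid = itd_iid(left, right, fs)   # itd = +625 us, iid ~ +6 dB
```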

Slide 7: The classical model of binaural processing (Colburn and Durlach, 1978)

Slide 8: Jeffress’s model of ITD extraction (1948)
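Jeffress’s coincidence network can be approximated as a bank of internal delays, each feeding a correlator; the peak of activity across the delay axis marks the stimulus ITD. The toy sketch below (my own parameter choices, not any published implementation) also reproduces the periodic ambiguity illustrated on the following slides: a 500-Hz tone with a nominal –1500-µs ITD produces its in-range activity peak at +500 µs, one full period away.

```python
import numpy as np

def jeffress_display(left, right, fs, max_delay_s=1e-3, n_taps=41):
    """Coincidence-counter display in the spirit of Jeffress (1948).

    Each internal delay tau stands in for one coincidence counter; a
    sketch only, since real models do this per critical band after
    half-wave rectification.
    """
    taus = np.linspace(-max_delay_s, max_delay_s, n_taps)
    t = np.arange(len(left)) / fs
    activity = np.empty(n_taps)
    for i, tau in enumerate(taus):
        # Delay the left signal by tau (linear interpolation, zero fill)
        left_delayed = np.interp(t - tau, t, left, left=0.0, right=0.0)
        activity[i] = np.dot(left_delayed, right)
    return taus, activity

# 500-Hz tone with a nominal -1500-us ITD (right channel leads)
fs = 48000
t = np.arange(4800) / fs
f0 = 500.0
itd = -1.5e-3
left_sig = np.sin(2 * np.pi * f0 * t)
right_sig = np.sin(2 * np.pi * f0 * (t - itd))
taus, activity = jeffress_display(left_sig, right_sig, fs)
best_tau = taus[np.argmax(activity)]   # peak at +500 us, one period away
```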

Slide 9: Response to a 500-Hz tone with a –1500-µs ITD

Slide 10: Response to 500-Hz noise with a –1500-µs ITD

Slide 11: Other important binaural phenomena
Binaural “sluggishness”
–Temporal integration “blurs” effects of running time
Information from head-related transfer functions (HRTFs)
–Impetus for significant work in externalization and virtual environments
–Also enables analysis of relationships between ITDs and IIDs
The precedence effect
–The first-arriving wavefront has the greatest impact on localization
Secondary “slipped-cycle” effects and “straightness weighting”
–Localization mechanisms are also responsive to consistency over frequency

Slide 12: Some of the groups involved
–Pittsburgh: Stern, Sullivan, Palm, Raj, Seltzer et al.
–Bochum: Blauert, Lindemann, Gaik, Bodden et al.
–Oldenburg: Kollmeier, Peissig, Kleinschmidt, Schortz et al.
–Dayton: Anderson, DeSimio, Francis
–Sheffield: Green, Cooke, Brown (and Ellis, Wang, Roman) et al.

Slide 13: Typical elements of binaural models for ASR
Peripheral processing
–HRTFs or explicit extraction of ITDs and IIDs vs. frequency band
Model of auditory transduction
–Prosaic (BPF, rectification, nonlinear compression) or AIM
Interaural timing comparison
–Direct (cross-correlation, stereausis, etc.) or enhanced for precedence (à la Lindemann)
Time-intensity interaction
–Use of interaural intensity information to reinforce/vitiate temporal information (e.g., Gaik, Peissig)
Possible restoration of “missing” features
Feature extraction from the enhanced display
Decision making (Bayesian or using neural networks)

Slide 14: Some (old) work from CMU: correlation-based ASR motivated by binaural hearing

Slide 15: The good news: vowel representations improved by correlation processing
Reconstructed features of the vowel /a/ (panels: two inputs, zero delay; two inputs, 120-µs delay; eight inputs, 120-µs delay)
But the bad news is that error rates in real environments go down only a small amount, with a lot more processing.

Slide 16: The Lindemann model to accomplish the precedence effect
(Figure: Blauert cross-correlation; Lindemann inhibition)

Slide 17: Sharpening effect of Lindemann inhibition
Comment: We also observe precedence phenomena (as expected) and a natural time-intensity trade.
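The sharpening can be caricatured in a few lines. The sketch below is emphatically a toy of my own construction: it applies a static, iterated contralateral inhibition in which each delay channel is attenuated in proportion to the summed activity elsewhere on the delay line, so the dominant peak suppresses its competitors. Lindemann’s actual model applies the inhibition dynamically as the signals propagate along the delay line.

```python
import numpy as np

def sharpen_by_inhibition(activity, strength=0.5, iterations=5):
    """Toy contralateral-inhibition sketch (not Lindemann's dynamics)."""
    a = np.clip(activity, 0.0, None).astype(float)
    a = a / a.max()
    for _ in range(iterations):
        # Each channel is inhibited by the total activity of the others
        inhibition = strength * (a.sum() - a) / len(a)
        a = np.clip(a - inhibition, 0.0, None)
        a = a / a.max()
    return a

# A cross-correlation pattern with the true-ITD peak plus a weaker echo
taus = np.linspace(-1e-3, 1e-3, 81)
pattern = (np.exp(-((taus - 2e-4) / 2e-4) ** 2)
           + 0.7 * np.exp(-((taus + 4e-4) / 2e-4) ** 2))
sharp = sharpen_by_inhibition(pattern)
# The echo peak shrinks relative to the main peak, and the pattern narrows
```

After a few iterations the secondary (echo) peak is clearly suppressed relative to the primary one, which is the qualitative effect the slide illustrates.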

Slide 18: Other techniques used by the Bochum group
Gaik
–Collected statistics of ITDs and IIDs of signals passed through HRTF filters
–Used the statistics to estimate the joint pdf of ITD and IID, conditioned on source location
Bodden
–Detected source location and implemented a source-separation algorithm by differentially weighting different frequency bands
Comment: The Oldenburg group has developed a similar model (that differs in many details), but without the Lindemann inhibition.

Slide 19: Missing-feature recognition
General approach:
–Determine which cells of a spectrogram-like display are unreliable (or “missing”)
–Ignore the missing features, or make the best guess about their values based on the data that are present

Slide 20: Original speech spectrogram

Slide 21: Spectrogram corrupted by white noise at an SNR of 15 dB
Some regions are affected far more than others.

Slide 22: Ignoring regions in the spectrogram that are corrupted by noise
–All regions with SNR less than 0 dB are deemed missing (dark blue)
–Recognition is performed based on the colored regions alone
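With the clean and noisy spectrograms both in hand, this masking rule is nearly a one-liner. The oracle sketch below (the toy "spectrograms" are my own construction, and it assumes speech and noise powers add) marks a cell reliable when its local SNR clears the 0-dB threshold from the slide; in a deployed system the mask must of course be estimated, which is where the binaural cues come in.

```python
import numpy as np

def oracle_missing_feature_mask(clean_power, noisy_power, threshold_db=0.0):
    """True = reliable cell, False = 'missing' (local SNR below threshold)."""
    # Local noise power per cell, assuming powers add: noise = noisy - clean
    noise_power = np.maximum(noisy_power - clean_power, 1e-12)
    local_snr_db = 10.0 * np.log10(np.maximum(clean_power, 1e-12) / noise_power)
    return local_snr_db >= threshold_db

# Toy power "spectrograms" (freq x time): one strong speech band, flat noise
clean = np.zeros((8, 10))
clean[3, :] = 4.0               # speech energy, 6 dB above the noise floor
noise = np.ones((8, 10))        # flat noise, power 1
mask = oracle_missing_feature_mask(clean, clean + noise)
# Only the speech band survives; everything else is marked missing
```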

Slide 23: Recognition accuracy using compensated cepstra, speech corrupted by white noise
–Large improvements in recognition accuracy can be obtained by reconstruction of the corrupted regions of noisy speech spectrograms
–Knowledge of the locations of the “missing” features is needed
(Plot: accuracy (%) vs. SNR (dB) for cluster-based reconstruction, temporal correlations, spectral subtraction, and the baseline)

Slide 24: Recognition accuracy using compensated cepstra, speech corrupted by music
Recognition accuracy rises from 7% to 69% at 0 dB with cluster-based reconstruction.
(Plot: accuracy (%) vs. SNR (dB) for cluster-based reconstruction, temporal correlations, spectral subtraction, and the baseline)

Slide 25: Latest system from the Oldenburg group
Peripheral processing:
–Gammatone filters
–Envelope extraction, lowpass filtering
–Nonlinear temporal adaptation
–Lowpass filtering
Binaural processing:
–Direct running cross-correlation (no inhibition)
–Learning of corresponding ITDs and IIDs using a neural network
–Feature extraction from the representation in the “look direction”
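The first two peripheral stages can be sketched directly. The code below (all parameter values are illustrative choices of mine, not the Oldenburg group’s) builds one channel: a fourth-order gammatone band-pass filter, half-wave rectification, and a moving-average lowpass that extracts the envelope.

```python
import numpy as np

def gammatone_ir(fc, fs, dur=0.025, order=4, b=None):
    """Impulse response of a gammatone filter, a standard cochlear
    band-pass model. Bandwidth defaults to 1.019 * ERB(fc)."""
    t = np.arange(int(dur * fs)) / fs
    if b is None:
        b = 1.019 * (24.7 + 0.108 * fc)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def band_envelope(x, fc, fs, lp_ms=8.0):
    """One peripheral channel: gammatone band-pass, half-wave
    rectification, then a moving-average lowpass for the envelope."""
    y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
    y = np.maximum(y, 0.0)                      # half-wave rectification
    win = max(1, int(lp_ms * 1e-3 * fs))
    return np.convolve(y, np.ones(win) / win, mode="same")

# A 1-kHz tone excites the 1-kHz channel but not the 4-kHz channel
fs = 16000
t = np.arange(8000) / fs
x = np.sin(2 * np.pi * 1000 * t)
env_on = band_envelope(x, 1000, fs)
env_off = band_envelope(x, 4000, fs)
```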

Slide 26: Sample results from the Oldenburg group (Kleinschmidt et al., 2001)
Anechoic environment vs. “moderate” reverberation (result figures not reproduced)
Comment: The system performs worse in reverberation.

Slide 27: Some systems developed by the Dayton group
Binaural Auditory Image Model (BAIM):
–HRTFs
–Auditory image model (AIM)
–Cross-correlation with and without Lindemann inhibition
–ITD/IID comparison using a Kohonen self-organizing feature map
Cocktail-party Processor (1995):
–HRTFs
–Conventional peripheral processing with the Kates model
–Cross-correlation with Lindemann inhibition
[BAIM worked somewhat better for most conditions]

Slide 28: Some comments, kudos, and concerns …
–Be very skeptical of results obtained using artificially added signals and noise!
»Digitally adding noise almost invariably inflates performance
»Use of room image models to simulate reverberant room acoustics may be more reasonable
–Nevertheless, some progress has definitely been made.
–Lots of information is being ignored in many current models:
»Synchrony information à la Seneff, Ghitza
»Complex timbre information as suggested by Lyon, Slaney?
–The Lindemann model may not be the best way to capture precedence
–Missing-feature approaches should be very promising
–Too much computational modeling and not enough insight into fundamental processes

Slide 29: Summary
–Binaural processing has the ability (in principle) to improve speech recognition accuracy by providing spatial filtering and by combating the effects of room reverberation.
–Current systems are achieving some gains, but they are only now beginning to realize that promise.
–Faster and more efficient computation will be a real spur for research in this area over the next five years.