1 Linking Computational Auditory Scene Analysis with 'Missing Data' Recognition of Speech
Guy J. Brown, Department of Computer Science, University of Sheffield
Collaborators: Kalle Palomäki (University of Sheffield and Helsinki University of Technology); DeLiang Wang (The Ohio State University)

2 Introduction
Human speech perception is remarkably robust, even in the presence of interfering sounds and reverberation.
In contrast, automatic speech recognition (ASR) is very problematic in such conditions:
"error rates of humans are much lower than those of machines in quiet, and error rates of current recognizers increase substantially at noise levels which have little effect on human listeners" – Lippmann (1997)
Can we improve ASR performance by taking an approach that models auditory processing more closely?

3 Auditory processing in ASR
Until recently, the influence of auditory processing on ASR has been largely limited to the front end:
– 'Noise robust' feature vectors, e.g. RASTA-PLP, modulation-filtered spectrograms.
Can auditory processing be applied in the recogniser itself?
Cooke et al. (2001) suggest that speech perception is robust because listeners can recognise speech from a partial description, i.e. with missing data.
Approach: modify a conventional recogniser to deal with missing or unreliable features.

4 Missing data approach to ASR
The aim of ASR is to assign an acoustic vector Y to a class W such that the posterior probability P(W|Y) is maximised:
P(W|Y) ∝ P(Y|W) P(W)
where P(Y|W) is the acoustic model and P(W) is the language model.
If components of Y are unreliable or missing, P(Y|W) cannot be computed as usual.
Solution: partition Y into reliable parts Y_r and unreliable parts Y_u, and use the marginal distribution P(Y_r|W).
A time-frequency mask indicates the reliable regions.
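As an illustrative sketch (not the authors' implementation), marginalisation for a diagonal-covariance Gaussian acoustic model is straightforward: the unreliable dimensions simply drop out of the likelihood. All names and values below are hypothetical.

```python
import numpy as np

def marginal_log_likelihood(y, mask, mean, var):
    """Log-likelihood of frame y under a diagonal Gaussian, marginalising
    over the components flagged as unreliable (missing data approach).

    y, mean, var : 1-D arrays, one element per spectral channel
    mask         : boolean array, True where the feature is reliable
    """
    yr, mr, vr = y[mask], mean[mask], var[mask]      # keep reliable parts Y_r
    # log N(y_r; mean_r, var_r): the unreliable dimensions integrate to 1
    return -0.5 * np.sum(np.log(2 * np.pi * vr) + (yr - mr) ** 2 / vr)

# Toy usage: a 4-channel frame with two channels masked as unreliable.
y    = np.array([1.0, 5.0, 0.2, 3.0])
mask = np.array([True, False, True, False])
mean = np.array([1.1, 0.0, 0.3, 0.0])
var  = np.array([0.5, 1.0, 0.4, 1.0])
print(marginal_log_likelihood(y, mask, mean, var))
```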

5 Missing data mask
[Figure: a rate map (frequency vs. time) alongside its corresponding missing data mask.]

6 Binaural hearing and ASA
The spatial location of sound sources is encoded by:
– Interaural time difference (ITD)
– Interaural level difference (ILD)
– Spectral (pinna) cues
Intelligibility of masked speech is improved if the speech and masker originate from different locations in space (Spieth, 1954).
Gestalt principle of similarity/proximity: events that arise from a similar location are grouped.

7 Binaural processor for MD ASR
Assumptions:
– Two sound sources, speech and an interfering sound;
– Sources spatialised by filtering with realistic head-related impulse responses (HRIR);
– Reverberation may be present.
Key features of the system:
– Components of the same source are identified by common azimuth;
– Azimuth is estimated from ITD, with an ILD constraint;
– A spectral normalisation technique handles convolutional distortion due to HRIR filtering and reverberation.

8 Block diagram of the system
[Figure: processing pipeline – auditory filterbank → envelope → precedence model → cross-correlation → grouping by common azimuth → missing data ASR.]

9 Stimulus generation
Speech and noise sources are located in a virtual room, at the same height but different azimuthal angles.
The transfer function of the path between source and ears is modelled by a binaural room impulse response, with three components:
– Surface reflections, estimated by the image model;
– An air propagation filter (assuming 50% relative humidity);
– The head-related impulse response (HRIR).
Surface absorption is altered to vary the reverberation time.
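The talk does not say how surface absorption is mapped to reverberation time; Sabine's formula is one standard way to make that link. The sketch below assumes a uniform absorption coefficient over all surfaces, using the room dimensions given on the next slide.

```python
# Sabine's formula relates reverberation time to absorption:
#   T60 ~ 0.161 * V / (S * alpha)
# Solving for alpha gives the uniform absorption coefficient needed to
# reach a target T60 in the 6 m x 4 m x 3 m virtual room (slide 10).
def absorption_for_t60(t60, length=6.0, width=4.0, height=3.0):
    volume = length * width * height                   # room volume (m^3)
    surface = 2 * (length * width + length * height + width * height)
    return 0.161 * volume / (surface * t60)

print(absorption_for_t60(0.3))    # 'small office' condition, T60 = 0.3 s
print(absorption_for_t60(0.45))   # 'larger office' condition, T60 = 0.45 s
```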

10 Virtual room
[Figure: virtual room, 6 m long × 4 m wide × 3 m high, containing the speech and noise sources.]

11 Auditory periphery
Cochlear frequency analysis is modelled by a bank of 32 gammatone filters; the output of each filter is rectified and cube-root compressed.
The instantaneous envelope is computed, then smoothed and downsampled to obtain a 'rate map', which provides the feature vectors for the recogniser.
[Figure: example rate map, frequency vs. time.]
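A minimal rate-map sketch, assuming a direct FIR approximation of the gammatone impulse response; the filter order, smoothing window and frame hop below are illustrative choices, not values from the talk.

```python
import numpy as np
from scipy.signal import fftconvolve

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.04, order=4):
    """Direct FIR approximation of a gammatone impulse response."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def rate_map(x, fs, centre_freqs, hop_ms=10.0, smooth_ms=8.0):
    """Rectified, compressed, smoothed, downsampled filterbank envelopes."""
    hop = int(fs * hop_ms / 1000)
    smoother = np.ones(int(fs * smooth_ms / 1000))
    smoother /= smoother.size
    rows = []
    for fc in centre_freqs:
        y = fftconvolve(x, gammatone_ir(fc, fs), mode='same')  # cochlear filter
        y = np.cbrt(np.maximum(y, 0.0))               # rectify, then compress
        env = fftconvolve(y, smoother, mode='same')   # smooth the envelope
        rows.append(env[::hop])                       # downsample to frame rate
    return np.array(rows)                             # shape: (channels, frames)

# Toy usage: a 1 s, 440 Hz tone through a 32-channel filterbank.
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
print(rate_map(x, fs, np.geomspace(50, 7000, 32)).shape)   # -> (32, 100)
```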

12 A model of precedence processing
A simple model of a complex phenomenon!
An inhibitory signal is created by lowpass filtering the envelope with:
h_lp(t) = A t exp(-t/τ)
The inhibited auditory nerve response r(t,f) is given by:
r(t,f) = [a(t,f) - G (h_lp(t) * env(t,f))]+
where a(t,f) is the auditory nerve response, [·]+ denotes half-wave rectification and G determines the strength of inhibition.
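The two equations above translate directly into code; the values of A and τ, and the kernel normalisation, are illustrative assumptions (the slide gives no parameter values).

```python
import numpy as np
from scipy.signal import fftconvolve

def precedence_inhibit(a, env, fs, A=1.0, tau=0.01, G=1.0):
    """Delayed-inhibition precedence model from the slide:
        r(t) = [a(t) - G * (h_lp * env)(t)]+
        h_lp(t) = A * t * exp(-t / tau)
    a, env : auditory nerve response and its envelope for one channel
    """
    t = np.arange(int(5 * tau * fs)) / fs              # truncate kernel at 5*tau
    h_lp = A * t * np.exp(-t / tau)
    h_lp /= h_lp.sum()                                 # illustrative normalisation
    inhibition = G * fftconvolve(env, h_lp)[: len(a)]  # causal lowpass of envelope
    return np.maximum(a - inhibition, 0.0)             # [.]+ : half-wave rectify

# Toy usage: after an abrupt onset, inhibition builds up and suppresses
# the later (echo-dominated) part of the response.
fs = 16000
a = np.ones(fs // 10)                # sustained unit response, 100 ms
r = precedence_inhibit(a, a, fs)
print(r[0], r[-1])                   # onset passes; later response shrinks
```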

13 Output from the precedence model
[Figure: channel envelope and fine time structure, inhibitory signal, and inhibited fine structure; amplitude vs. time, 0–50 ms.]

14 Azimuth estimation
Estimate ITD by computing the cross-correlation in each frequency band.
Form a cross-correlogram (CCG): a two-dimensional plot of ITD against frequency band.
Sum across frequency to give a pooled cross-correlogram.
Warp the ITD axis to azimuth, since HRIR-filtered sounds show only a weak frequency dependence in ITD.
Sharpen the CCG by replacing local peaks with narrow Gaussians, giving a 'skeleton' CCG (akin to lateral inhibition).
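A sketch of the cross-correlogram and its 'skeleton' sharpening, assuming per-ear filterbank outputs are already available; the Gaussian width is an arbitrary choice. The pooled CCG is then just `ccg.sum(axis=0)`; warping ITD to azimuth would need the HRIR measurements and is omitted here.

```python
import numpy as np

def cross_correlogram(left, right, max_lag):
    """Per-channel cross-correlation of left/right filterbank outputs.

    left, right : (channels, samples) arrays
    Returns (ccg, lags): ccg has shape (channels, 2*max_lag + 1).
    """
    n_ch, n = left.shape
    lags = np.arange(-max_lag, max_lag + 1)
    ccg = np.empty((n_ch, lags.size))
    for c in range(n_ch):
        full = np.correlate(left[c], right[c], mode='full')  # every lag
        zero = n - 1                                         # zero-lag index
        ccg[c] = full[zero - max_lag: zero + max_lag + 1]
    return ccg, lags

def skeleton(ccg, lags, sigma=2.0):
    """Sharpen: replace each local peak with a narrow Gaussian of the
    same height, mimicking lateral inhibition (sigma is arbitrary)."""
    out = np.zeros_like(ccg)
    for c, row in enumerate(ccg):
        peaks = np.where((row[1:-1] > row[:-2]) & (row[1:-1] > row[2:]))[0] + 1
        for p in peaks:
            out[c] += row[p] * np.exp(-0.5 * ((lags - lags[p]) / sigma) ** 2)
    return out
```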

15 Cross-correlogram (ITD)
[Figure: cross-correlogram, channel centre frequency vs. interaural time difference (ITD), for a mixture of male speech at +20 deg azimuth and female speech at -20 deg azimuth.]

16 Skeleton cross-correlogram (azimuth)
[Figure: skeleton cross-correlogram, channel centre frequency vs. azimuth (degrees), for the same mixture of male speech at +20 deg and female speech at -20 deg.]

17 Grouping by common azimuth
Locate the source azimuths from the pooled CCG.
For each channel i at each time frame j, set the mask to 1 iff:
C(i,j,θ_s) > C(i,j,θ_n) and C(i,j,θ_s) > δ
where C(i,j,θ) is the cross-correlogram, θ_s is the azimuth of the speech, θ_n is the azimuth of the noise and δ is a threshold.
Motivation: select channels for the missing data mask in which the speech dominates the noise, and the energy is not too low.
Hint given: the system knows that θ_s > θ_n.
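The mask rule above is a pair of elementwise comparisons; a minimal sketch, assuming the CCG has already been warped to an azimuth axis.

```python
import numpy as np

def azimuth_mask(ccg_az, az_speech, az_noise, delta):
    """Missing-data mask from an azimuth-warped cross-correlogram.

    ccg_az    : (channels, frames, azimuths) array of C(i, j, theta)
    az_speech : index of the speech azimuth theta_s on the azimuth axis
    az_noise  : index of the noise azimuth theta_n
    delta     : threshold on C(i, j, theta_s)
    """
    c_s = ccg_az[:, :, az_speech]
    c_n = ccg_az[:, :, az_noise]
    return (c_s > c_n) & (c_s > delta)   # 1 where speech dominates and is strong
```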

18 ILD constraint
Compute the interaural level difference as:
ILD(i,j) = 10 log10 [eng_R(i,j) / eng_L(i,j)]
where eng_k(i,j) is the energy in channel i at time frame j for ear k.
Store the 'ideal' ILD for a particular azimuth in a lookup table.
Cross-check the observed ILD against the 'ideal' ILD for the observed azimuth; if they do not agree to within 0.5 dB, set the mask to zero.
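A sketch of the ILD cross-check, assuming per-ear channel energies and a lookup table of 'ideal' ILDs for the hypothesised azimuth; the 0.5 dB tolerance is from the slide, the rest is illustrative.

```python
import numpy as np

def ild_constraint(eng_left, eng_right, ideal_ild, tol_db=0.5):
    """Keep only mask cells whose observed ILD agrees with the 'ideal'
    lookup-table ILD for the hypothesised azimuth to within tol_db.

    eng_left, eng_right : (channels, frames) per-ear channel energies
    ideal_ild           : (channels,) ILDs stored for this azimuth
    """
    eps = 1e-12                                        # guard against log(0)
    ild = 10.0 * np.log10((eng_right + eps) / (eng_left + eps))
    return np.abs(ild - ideal_ild[:, None]) <= tol_db  # True = keep cell
```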

19 Spectral energy normalisation
HRIR filtering and reverberation introduce convolutional distortion.
Normalisation is usually by the mean and variance of the features in each frequency band; but what if data are missing?
The current approach is simple: normalise by the mean of the N largest reliable feature values Y_r in each channel.
Motivation: features that have high energy and are marked as reliable should be least affected by the noise background.
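A per-channel sketch of this normalisation; the value of N is not given in the talk, so n_largest below is a placeholder.

```python
import numpy as np

def normalise_by_reliable(ratemap, mask, n_largest=20):
    """Per-channel normalisation by the mean of the N largest feature
    values marked reliable (n_largest is a placeholder; N is unspecified).

    ratemap, mask : (channels, frames); mask is boolean, True = reliable
    """
    out = ratemap.astype(float).copy()
    for c in range(ratemap.shape[0]):
        reliable = ratemap[c, mask[c]]
        if reliable.size == 0:
            continue                             # no reliable data: leave as-is
        top = np.sort(reliable)[-n_largest:]     # N largest (or fewer) values
        out[c] /= top.mean()
    return out
```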

20 A priori mask
To assess the limits of the missing data approach, we employ an a priori mask.
It is derived by measuring the difference between the rate map for clean speech and its noise/reverberation-contaminated counterpart.
Mask elements are set to 1 only if this difference lies within a threshold value (tuned for each condition).
This should give near-optimal performance.
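A minimal oracle-mask sketch; the slide does not specify the distance measure, so a simple absolute difference is assumed here.

```python
import numpy as np

def a_priori_mask(clean, noisy, threshold):
    """Oracle mask: a cell is reliable where the contaminated rate map
    differs from the clean-speech rate map by less than a per-condition
    threshold (the exact distance measure is an assumption here).
    """
    return np.abs(noisy - clean) <= threshold
```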

21 Masks estimated by binaural grouping
[Figure: rate maps, a priori masks and masks estimated by the binaural processor, for a mixture of speech (+20 deg azimuth) and an interfering talker (-20 deg azimuth) at 0 dB SNR. Top: anechoic. Bottom: T60 reverberation time of 0.3 s.]

22 Evaluation
Hidden Markov model (HMM) recogniser, modified for the missing data approach.
Tested on 240 utterances from the TIDigits connected digit corpus.
12 word-level HMMs (silence, 'oh', 'zero' and '1' to '9').
Noise intrusions from Cooke's (1993) corpus: a male speaker and rock music.
A baseline recogniser, trained on mel-frequency cepstral coefficients (MFCCs) and their derivatives, is used for comparison.

23 Example sounds
– 'one five zero zero six', male speaker, anechoic
– With T60 reverberation time 0.3 s
– With interfering male speaker, 0 dB SNR, anechoic, 40 degrees azimuth separation
– Two speakers, T60 reverberation time 0.3 s

24 Effect of reverberation (anechoic)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0 s, male speech masker, 40 degrees separation.]

25 Effect of reverberation (small office)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s, male speech masker, 40 degrees separation.]

26 Effect of spatial separation (10 deg)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s.]

27 Effect of spatial separation (20 deg)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s.]

28 Effect of spatial separation (40 deg)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s.]

29 Effect of noise source (rock music)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s.]

30 Effect of noise source (male speech)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s.]

31 Effect of precedence processing
[Plot: results without inhibition (G = 0.0) and with inhibition (G = 1.0).]

32 Summary of results
The binaural missing data system is more robust than a conventional MFCC-based recogniser when interfering sounds and reverberation are present.
The performance of the binaural system depends on the angular separation between the sources.
Source characteristics influence the performance of the binaural system; it is most helpful when the spectra of the speech and interfering sounds substantially overlap.
Performance of the binaural system is close to that of the a priori masks in anechoic conditions; there is room for improvement elsewhere.

33 Conclusions and future work
The combination of a binaural model and the missing data framework appears promising. However, it is still far from matching human performance.
Major outstanding issues:
– A better model of precedence processing;
– Source identification (top-down constraints);
– Source selection (the role of attention);
– Moving sound sources;
– More complex acoustic environments.

34 Additional Slides

35 Precedence effect
A group of phenomena which underlie the ability of listeners to localise sound sources in reverberant spaces.
A direct sound is followed by its reflections, yet listeners usually report that the source originates from the direction corresponding to the first wavefront.
This is usually explained by delayed inhibition, which suppresses location information from about 1 ms after the onset of an abrupt sound.

36 Full set of example sounds
– 'one five zero zero six', male speaker, anechoic
– With T60 reverberation time 0.3 s (small office)
– With T60 reverberation time 0.45 s (larger office)
– With interfering male speaker, 0 dB SNR, anechoic, 40 degrees azimuth separation
– Two speakers, T60 reverberation time 0.3 s
– Two speakers, T60 reverberation time 0.45 s

37 Effect of reverberation (larger office)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.45 s, male speech masker, 40 degrees separation.]

38 Effect of noise source (female speech)
[Plot: accuracy (%) vs. signal-to-noise ratio (dB) for the MFCC, a priori and binaural systems; reverberation time 0.3 s.]