A glimpsing model of speech perception

A glimpsing model of speech perception
Martin Cooke & Sarah Simpson
Speech and Hearing Research, Department of Computer Science, University of Sheffield
http://www.dcs.shef.ac.uk/~martin

Motivation: the nonstationarity 'paradox'
Speech technology performance falls as the nonstationarity of the noise background increases (Aurora eval) ...
... while listeners appear to prefer a nonstationary background, showing an 8-12 dB SRT gain (Miller, 1947; Simpson & Cooke, 2003).

Possible factors
In a 1-speaker background, listeners can ...
- employ organisational cues from the background source to help segregate the foreground
- employ schemas for both foreground and background
- benefit from better glimpses of the speech target
but multi-speaker backgrounds have certain advantages ...
- less chance of informational masking
- easier enhancement algorithms

Glimpsing opportunities
Spectro-temporal glimpse densities: the percentage of time-frequency regions with a locally positive SNR.
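As a minimal sketch of this measure (the random arrays below are invented stand-ins for real spectro-temporal excitation patterns of a target and a masker), the glimpse density is simply the fraction of cells where the target's local power exceeds the masker's:

```python
import numpy as np

def glimpse_density(speech_power, noise_power):
    """Fraction of time-frequency cells with a locally positive SNR,
    i.e. cells where the target's power exceeds the masker's."""
    return float(np.mean(speech_power > noise_power))

# Toy "spectrograms" (40 channels x 100 frames) standing in for real
# auditory excitation patterns of a target and a masker.
rng = np.random.default_rng(0)
speech = rng.exponential(1.0, size=(40, 100))
noise = rng.exponential(1.0, size=(40, 100))
print(f"glimpse density: {glimpse_density(speech, noise):.2f}")
```

With two statistically identical sources, roughly half the cells favour the target; for real speech, which is sparse in time-frequency, the density varies strongly with masker type and SNR.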

Glimpsing
Informal definition: a glimpse is a time-frequency region that contains a reasonably undistorted 'view' of local signal properties.
Precursors:
- Term used by Miller & Licklider (1950) to explain the intelligibility of interrupted speech.
- Related to the 'multiple looks' model of Viemeister & Wakefield (1991), which demonstrated 'intelligent' temporal integration of tone bursts.
- Assmann & Summerfield (in press) suggest 'glimpsing and tracking' as a way of understanding how listeners cope with adverse conditions.
- Culling & Darwin (1994) developed a glimpsing model to explain double-vowel identification for small ΔF0s.
- de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification.
- Closely related to missing data processing (Cooke et al., 1994).

Types of glimpses
- Comodulated, e.g. Miller & Licklider (1950)
- Spectral, e.g. Warren et al. (1995)
- General uncomodulated, e.g. Howard-Jones & Rosen (1993); Buss et al. (2003)

Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 quarter-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value.
Intelligibility remained at 60% when 98% of the signal was missing.
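The envelope manipulation can be sketched as follows (an illustration of the idea only: using the target level itself as the fill constant is an assumption, not necessarily Drullman's exact procedure):

```python
import numpy as np

def flatten_low_envelope(envelope, target_level):
    """Replace the parts of a band's temporal envelope that fall below
    target_level with a constant (here the target level itself -- an
    assumption, not necessarily Drullman's exact fill value)."""
    envelope = np.asarray(envelope, dtype=float)
    return np.where(envelope >= target_level, envelope, target_level)

# One band's envelope, with two dips below the target level of 0.4.
env = np.array([0.1, 0.5, 2.0, 0.3, 1.5])
print(flatten_low_envelope(env, 0.4))   # dips replaced by the constant 0.4
```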

Glimpsing in natural conditions: the dominance effect
Although audio signals add linearly, the occlusion metaphor is more appropriate because of the log-like compression in the auditory system.
Consequently, most regions in a mixture are dominated by one source or the other, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
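The dominance effect can be illustrated numerically: after log compression, the mixture of two sources is well approximated by the per-cell maximum (the 'log-max' approximation), and only cells where the two sources are within a few dB of each other are ambiguous. The exponential toy cells below are invented stand-ins for real spectro-temporal power; real speech is sparser, so even fewer cells are ambiguous:

```python
import numpy as np

rng = np.random.default_rng(1)
# Per-cell power of two independent toy sources on a 40 x 200 grid.
a = rng.exponential(1.0, size=(40, 200))
b = rng.exponential(1.0, size=(40, 200))

# After log compression, the mixture is close to the per-cell maximum:
# log(a + b) - max(log a, log b) = log(1 + min/max), which lies in (0, log 2].
err = np.log(a + b) - np.maximum(np.log(a), np.log(b))

# A cell is "ambiguous" only when the two sources are within ~3 dB of
# each other; the rest are clearly dominated by one source.
ratio_db = 10 * np.abs(np.log10(a / b))
ambiguous = float(np.mean(ratio_db < 3))
print(f"max log-max error: {err.max():.3f} (bound: {np.log(2):.3f})")
print(f"ambiguous cells: {100 * ambiguous:.0f}%")
```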

Issues for a glimpsing model
- What constitutes a useful glimpse?
- Is sufficient information contained in glimpses?
- How do listeners detect glimpses? (glimpse detection)
- How can they be integrated? (glimpse integration)

Glimpsing study
Aims:
- determine whether glimpses contain sufficient information
- explore the definition of a useful glimpse
Comparison between listeners and model using natural VCV stimuli:
- subset of the Shannon et al. (1999) corpus: V = /a/, C = { b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch }
- background source: reversed multispeaker babble for N = 1, 8, allowing variation in glimpsing opportunities
- 3 SNRs (TMRs): 0, -6 and -12 dB
- 12 listeners heard 160 tokens in each condition (2 repeats x 16 VCVs x 5 male speakers)

Identification results
(figure: listener identification scores for the 1-speaker and 8-speaker backgrounds)

Glimpsing model
CDHMM employing missing data techniques:
- 16 whole-word HMMs, 8 states, 4-component Gaussian mixture per state
Input representation:
- 10 ms frames of a modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)
- NB: only simultaneous masking is modelled
Training: 8 repetitions of each VCV by 5 male speakers per model.
Testing: as for listeners, viz. 2 repetitions of each VCV by 5 male speakers.
Performance in clean speech: > 99%.
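The per-channel part of the input representation (Hilbert envelope, ~8 ms smoothing, one value per 10 ms frame) can be sketched as below; the 40-channel gammatone filtering stage that precedes it is omitted, and the AM test signal is purely illustrative:

```python
import numpy as np
from scipy.signal import hilbert

def envelope_features(band_signal, fs, frame_ms=10, smooth_ms=8):
    """One row of the model's input representation: the Hilbert envelope
    of a single filterbank channel, smoothed over ~8 ms and sampled once
    per 10 ms frame. (The gammatone filtering that precedes this stage
    is omitted from the sketch.)"""
    env = np.abs(hilbert(band_signal))            # instantaneous envelope
    win = max(1, int(fs * smooth_ms / 1000))      # ~8 ms moving average
    env = np.convolve(env, np.ones(win) / win, mode="same")
    hop = int(fs * frame_ms / 1000)               # one value per frame
    return env[::hop]

# Illustrative channel output: a 1 kHz carrier with 4 Hz amplitude modulation.
fs = 8000
t = np.arange(fs) / fs
band = np.sin(2 * np.pi * 1000 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
feats = envelope_features(band, fs)
print(feats.shape)   # 100 frames of 10 ms for 1 s of signal
```

The smoothed envelope tracks the 4 Hz modulation while discarding the 1 kHz fine structure, which is the point of the excitation-pattern representation.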

Model performance I: ideal glimpses
Ideal glimpses: all time-frequency regions whose local SNR exceeds a threshold; the optimum threshold is 0 dB.
For this task, there is more than sufficient information in the glimpsed regions.
Listeners perform suboptimally with respect to this glimpse definition.
(figure: 1-speaker and 8-speaker conditions)
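This ideal glimpse definition amounts to an ideal binary mask over local SNR. A minimal sketch (the per-cell power values are invented for illustration):

```python
import numpy as np

def ideal_glimpses(speech_power, noise_power, theta_db=0.0):
    """Ideal binary mask: True for time-frequency cells whose local SNR
    exceeds theta_db (0 dB being the optimum threshold reported here)."""
    snr_db = 10 * np.log10(speech_power / noise_power)
    return snr_db > theta_db

# Invented per-cell powers for a 2 x 2 patch (channels x frames).
speech = np.array([[10.0, 1.0], [4.0, 0.1]])
noise = np.array([[1.0, 1.0], [2.0, 1.0]])
print(ideal_glimpses(speech, noise))   # two cells glimpsed (10 dB, 3 dB), two masked
```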

Model performance: variation in detection threshold
Q: can varying the local SNR threshold for glimpse detection produce a better match?
No choice of local SNR threshold provides a good fit to listeners; the closest fit (-6 dB) is shown.
(figure: 1-speaker and 8-speaker conditions)

Analysis
It is unreasonable to expect listeners to detect individual glimpses in a sea of noise unless the glimpse region is large enough.

Model performance: useable glimpses
Definition: a glimpsed region must occupy at least N ERBs and T ms.
Search over 1-15 ERBs and 10-100 ms at various detection thresholds.
Best match: 6.3 ERBs (9 channels), 40 ms, 0 dB local SNR threshold.
Howard-Jones & Rosen (1993) suggested a 2-4 band limit for uncomodulated glimpsing; Buss et al. (2003) found evidence for uncomodulated glimpsing in up to 9 bands.
(figure: 1-speaker and 8-speaker conditions)
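The 'useable glimpse' criterion can be sketched as a size filter over connected regions of the binary glimpse mask. This is a sketch under the assumption that a region's extent is measured by its bounding box; the demo mask and thresholds are illustrative:

```python
import numpy as np
from scipy import ndimage

def useable_glimpses(mask, min_channels, min_frames):
    """Keep only connected glimpse regions whose bounding box spans at
    least min_channels filterbank channels and min_frames time frames
    (e.g. 9 channels ~ 6.3 ERBs, and 4 x 10 ms frames = 40 ms).
    Measuring extent by bounding box is an assumption of this sketch."""
    labels, n = ndimage.label(mask)               # 4-connected regions
    out = np.zeros_like(mask)
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        chans = sl[0].stop - sl[0].start          # spanned channels
        frames = sl[1].stop - sl[1].start         # spanned frames
        if chans >= min_channels and frames >= min_frames:
            out[sl] |= labels[sl] == i
    return out

# Demo mask: one 7-cell region plus two isolated single-cell glimpses.
mask = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
], dtype=bool)
kept = useable_glimpses(mask, min_channels=2, min_frames=2)
print(kept.sum())   # 7: only the large region survives
```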

Consonant identification
Reasonable matches overall, apart from b, s and z.
However, there is little token-by-token agreement between common listener errors and model errors. Why?

Factors (diagram)
Successful identification depends on:
- audibility of the target (energetic masking)
- 'confusability' (informational masking)
- organisational cues in the target and in the background
- existence of schemas for the target and for the background

Measuring energetic masking
Approach: resynthesise the glimpses alone.
- Filter, time-reverse, refilter to remove phase distortion.
- Select regions based on the local SNR mask.
Results (glimpses alone vs speech+noise):
- Little difference for the 1-speaker background, suggesting a relatively low contribution of informational masking in this case (due to the reversed masker?).
- Larger difference for the 8-speaker case, possibly due to 'unrealistic' glimpses.
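The 'filter, time-reverse, refilter' step is forward-backward filtering: the second pass cancels the phase shift of the first, leaving a zero-phase response (the same idea scipy.signal.filtfilt implements). A minimal sketch with an illustrative two-tone signal and band:

```python
import numpy as np
from scipy.signal import butter, lfilter

def zero_phase_band(x, fs, lo, hi, order=4):
    """Bandpass filter, time-reverse, filter again, reverse back.
    The two passes cancel each other's phase shift, leaving a
    zero-phase response (the trick scipy.signal.filtfilt implements)."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    y = lfilter(b, a, x)
    return lfilter(b, a, y[::-1])[::-1]

# Illustrative two-tone signal: 500 Hz (in band) + 3000 Hz (out of band).
fs = 8000
t = np.arange(fs) / fs
in_band = np.sin(2 * np.pi * 500 * t)
out_band = np.sin(2 * np.pi * 3000 * t)
y = zero_phase_band(in_band + out_band, fs, 300.0, 800.0)

# Projection onto the original tones: in-band gain near 1 with no phase
# lag, out-of-band component strongly attenuated.
g_in = np.dot(y, in_band) / np.dot(in_band, in_band)
g_out = np.dot(y, out_band) / np.dot(out_band, out_band)
print(f"in-band gain {g_in:.2f}, out-of-band gain {g_out:.4f}")
```

That the in-band tone comes through at near-unit gain when projected onto the original (un-shifted) sinusoid confirms that the two passes cancel each other's phase distortion.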

Comparison with ideal model
Results: the ideal model performs well in excess of listeners when supplied with precisely the same information.
Possible reasons:
- distortions
- glimpses do not occur in isolation: a noise background may actually help
- the lack of a nonsimultaneous masking model will inflate model performance
(figure: ideal model vs listeners)

The glimpse decoder
An attempt at a unifying statistical theory for primitive and model-driven processes in CASA.
Basic idea: the decoder not only determines the most likely speech hypothesis but also decides which glimpses to use.
Key advantage: no longer any need to rely on clean acoustics!
Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation.
Barker, J., Cooke, M.P. & Ellis, D.P.W., "Decoding speech in the presence of other sources", accepted for Speech Communication.

Summary & outlook
- Proposed a glimpsing model of speech identification in noise.
- Demonstrated the sufficiency of information in target glimpses, at least for the VCV task.
- A preliminary definition of a useful glimpse gives a good overall model-listener match.
- Introduced 2 procedures for measuring the amount of energetic masking: (i) via ASR, (ii) via glimpse resynthesis.
Outlook:
- Need a nonsimultaneous masking model.
- Need to isolate effects due to schemas.
- Repeat using non-reversed speech to introduce more informational masking.
- Need to quantify the effect of distortion in glimpse resynthesis ...

Masking noise can be beneficial
Warren et al. (1995) demonstrated a spectral induction effect with 2 narrow bands of speech and intervening noise (vs fullband).
Cooke & Cunningham (in prep): spectral induction with single speech bands.

Speech modulated noise
As in Brungart (2001).
Model results and glimpse distributions indicate an increase in energetic masking for this type of masker.
(figure: glimpse distributions for natural speech and speech-modulated noise, 1- and 8-speaker)

Speech modulated noise
Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf. the SMN model), but not quite as well as they do with a natural speech masker.
This suggests that energetic masking is not the whole story (cf. Brungart, 2001), but further work is needed to quantify the relative contributions of:
- release from informational masking
- the absence of background models/cues
(figure: SMN and natural maskers, model vs listeners, 1- and 8-speaker)