Attentional Tracking in Real-Room Reverberation

Presentation transcript:

Attentional Tracking in Real-Room Reverberation
S. J. Makin, A. J. Watkins and A. P. Raimond
November 13, 2018

Intro: Attending at Cocktail Parties

Typical listening situations:
- multiple sound sources
- numerous reflecting surfaces
- reflections degrade both the intelligibility and the segregation of sounds
Yet listeners are able to selectively attend to a single source. Any detectable difference seems to help listeners track a message over time, e.g. a filtering difference (Spieth and Webster, 1955). There are two main sources of differences.

Speaker notes: Typical situations are like the famous "cocktail party": the required sound must be selected from a mixture of multiple direct sounds and numerous reflected "images" of EACH sound, and reverb both degrades the intelligibility of single sources and impairs the separation of streams in such situations. Spieth and Webster showed that filtering EITHER the target message or the interfering message improved performance on a task requiring selection and recognition of one of two simultaneous messages, and the degree of filtering was non-critical. There are two main sources of naturally occurring differences between signals…

Intro: Attending at Cocktail Parties

Spatial position:
- Cues from spatial separation seem to aid tracking (Spieth et al., 1954; Broadbent, 1954)
- Interaural differences can be used to track a speech message over time (e.g. ITD, Darwin & Hukin, 1999)
- But these differences are corrupted by reverberation (Rakerd & Hartmann, 1985; Kidd et al., 2005)
Source characteristics:
- Talker-difference tracking cues (which include vocal-tract size) are very robust to reverberation (Darwin & Hukin, 2000)
So: in realistic levels of reverberation, do listeners simply ignore the corrupted cues from spatial position and rely on source (i.e. talker) characteristics for tracking?

Speaker notes: A number of authors (Spieth et al., 1954; Broadbent, 1954), in studies similar to that of Spieth & Webster, found that spatial separation aids selection. In a study specifically designed to measure effects on auditory attention or "tracking", Darwin & Hukin showed that interaural differences can be used to track a speech message over time. Both ITD and ILD are increasingly degraded as reverb increases. Source characteristics can also include pitch differences, prosody, and possibly other factors.

Experimental Paradigm

Based on Darwin & Hukin's (1999a, b, 2000) paradigm, where listeners hear two simultaneous sentences played in a (simulated) room:
- a target sentence ("on this trial you'll get the word ___ to select")
- a 'distractor' sentence ("you'll also hear the sound ___ played here")
- two simultaneous test words (e.g. "bead" and "globe"), one carried in each sentence
Listeners are asked to attend to the target sentence and report which test word they hear as 'belonging' with it. Their responses tell us which is the more influential source of cues for tracking.

Speaker notes: On the original slide a printed example illustrates the idea: colour/boldness competes with a line difference on the page. Just substitute position in a room for position on the page, and talker for colour, and you have our experiments: talker versus room-position, with the response indicating which cue listeners were tracking.

Stimuli

BRIRs recorded in a room (135 m³) using dummy heads:
- measurement signal: a log-sine-sweep (Farina)
- talker: a B&K HATS dummy head
- listener: a KEMAR dummy head with microphones in its ears
- the BRIR is obtained by deconvolving the sweep recorded in the real room

The BRIRs were then convolved with 'dry' speech recordings to reproduce real-room reverberation over headphones:
'dry' speech recording → convolution with BRIR → 'spatialised' speech recording → headphones → listener (approximating real-room listening)

Speaker notes: 'Virtualised' or 'spatialised'? The listener has an experience very close to being in the room where the recordings were made; all that is missing are individualised HRTF (pinna and head) effects.
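The 'spatialisation' step is just a convolution of the dry recording with each channel of the BRIR. The following is a minimal illustrative sketch in Python (numpy/scipy), not the authors' code; the file names are invented placeholders.

```python
# Minimal sketch of the 'spatialisation' step: convolving a dry (anechoic)
# speech recording with a two-channel binaural room impulse response (BRIR).
# File names are illustrative placeholders, not files from the original study.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs_dry, dry = wavfile.read("dry_speech.wav")          # mono dry recording
fs_brir, brir = wavfile.read("brir_position_A.wav")   # stereo BRIR (left, right)
assert fs_dry == fs_brir, "dry recording and BRIR must share a sample rate"

dry = dry.astype(np.float64)
brir = brir.astype(np.float64)

# Convolve the dry signal with each ear's impulse response to obtain the
# 'spatialised' binaural stimulus that is then played over headphones.
left = fftconvolve(dry, brir[:, 0])
right = fftconvolve(dry, brir[:, 1])
binaural = np.stack([left, right], axis=1)

# Normalise to avoid clipping before writing out 16-bit audio.
binaural /= np.max(np.abs(binaural))
wavfile.write("spatialised_speech.wav", fs_dry, (binaural * 32767).astype(np.int16))
```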

Stimuli

Numerous room-position pairs were sampled, varying the distance separation, the bearing difference, or both (talker positions relative to the dummy-head listener).
Stimuli were presented both dichotically and diotically; diotic stimuli were the L or R channel presented to both ears, level-corrected to match the dichotic stimuli.
The probability of a 'room-position' response is averaged across position-pairs.
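The slide does not say exactly how the diotic level correction was done; one plausible reading, assumed in this short sketch, is that the chosen channel is scaled so its RMS matches that of the two-channel dichotic stimulus.

```python
# Sketch of deriving a diotic stimulus from one channel of a dichotic
# (two-channel) stimulus. The slide does not specify the level-correction
# rule; matching overall RMS, as done here, is only one plausible reading.
import numpy as np

def make_diotic(dichotic: np.ndarray, channel: int = 0) -> np.ndarray:
    """dichotic: (n_samples, 2) array. Returns the chosen channel presented to
    both ears, scaled so its RMS matches that of the original two-channel
    stimulus."""
    mono = dichotic[:, channel]
    target_rms = np.sqrt(np.mean(dichotic ** 2))
    gain = target_rms / (np.sqrt(np.mean(mono ** 2)) + 1e-12)
    return np.stack([mono, mono], axis=1) * gain
```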

Results: Talker Differences

[Plots: probability of a 'room-position' response (roughly 0.2 to 0.8) for different-sex and same-sex talker pairs, under diotic and dichotic presentation.]

- Talker difference is dominant, especially when listening is diotic
- Diotic: talker (if anything)
- Dichotic: room-position
Which cues from room-position are responsible?

Methods: BRIR Processing

BRIRs were processed to limit the available cues:
- 'Spectral Only' (SO) BRIRs: ITDs and the temporal-envelope 'tails' are removed, leaving only spectral-envelope and level information
- 'Spectral-plus-Temporal' (S+T) BRIRs: as SO, but with the temporal-envelope 'tails' restored

Processing chain: BRIR → FFT → rotate all components to cosine phase → IFFT → window and time-align → SO BRIR; the SO BRIR is then convolved with SCN (signal-correlated noise) to give the S+T BRIR.

Speaker notes: Take the inverse transform and window the resulting function with a short Hann window. The SCN is made by flipping the polarity of a randomly selected half of the signal samples (Schroeder 68?) - check all this!
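As a rough illustration of this chain (not the authors' implementation; the window length, the time-alignment, and the derivation of the SCN from the BRIR itself are all assumptions here):

```python
# Rough sketch of the BRIR processing chain on this slide. Window length,
# time-alignment, and the use of the BRIR itself as the basis of the
# signal-correlated noise are assumptions; they are not specified on the slide.
import numpy as np
from scipy.signal import fftconvolve

def spectral_only_brir(brir: np.ndarray, win_len: int = 256) -> np.ndarray:
    """'Spectral Only' (SO) BRIR: rotate all spectral components to cosine
    (zero) phase, inverse-transform, then window with a short Hann window.
    This keeps spectral-envelope and level information while removing the ITD
    and the temporal-envelope 'tail'."""
    zero_phase = np.abs(np.fft.rfft(brir))          # cosine phase = magnitude only
    impulse = np.fft.irfft(zero_phase, n=len(brir))
    impulse = np.roll(impulse, win_len // 2)        # time-align: centre the peak in the window
    so = np.zeros_like(impulse)
    so[:win_len] = impulse[:win_len] * np.hanning(win_len)
    return so

def signal_correlated_noise(brir: np.ndarray, rng=np.random.default_rng(0)) -> np.ndarray:
    """Noise with the BRIR's temporal envelope, made by flipping the polarity
    of a randomly selected half of the samples (Schroeder-style SCN, as the
    speaker notes describe)."""
    flips = np.where(rng.random(len(brir)) < 0.5, -1.0, 1.0)
    return brir * flips

def spectral_plus_temporal_brir(brir: np.ndarray) -> np.ndarray:
    """'Spectral-plus-Temporal' (S+T) BRIR: the SO BRIR convolved with the
    signal-correlated noise, restoring the temporal-envelope 'tail'."""
    return fftconvolve(spectral_only_brir(brir), signal_correlated_noise(brir))
```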

Results: Comparing BRIRs

[Plots: probability of a 'room-position' response (roughly 0.2 to 0.8) for Unprocessed, SO and S+T BRIRs, under diotic and dichotic presentation.]

Diotic:
- talker-difference is dominant, but less so when the 'tails' are present
- this suggests that one effect of the 'tails' is to disrupt pitch cues (Culling et al., 1994)
Dichotic:
- room-position is much more dominant, even without ITD cues
- is this due to ear 'selection', or to inter-aural processing?

Speaker notes: The 'tails' result suggests this to us because, with changing F0s, reflections arriving at different times tend to have different F0s, so harmonicity gets disrupted. This would impair pitch-based tracking cues (cues that would otherwise tend to promote "talker" responses), leading either to an increased reliance on other cues, in this instance position-based ones, or else simply to a reduction in the information available and so to a "guessing" situation. So what might be going on here? There seem to be two candidate explanations: listeners could be 'selecting an ear', or it could be due to genuine inter-aural processing. What do I mean by this? Let me explain by reference to the information available to the listeners. The only information remaining in the processed conditions is spectral and level information, so let's look at some impulse-response spectra…

Spectral Characteristics of BRIRs

Power (dB) in the channels of an 'auditory' (gammatone) filterbank.

[Plots: left-ear, right-ear and ILD spectra of the two BRIRs in a position-pair (distances of 0.65 m and 5 m, bearings of -25° and +25°), shown as channel power in dB against channel centre frequency (200 to 7252 Hz), with the Euclidean distance d between the members of each pair marked on each panel.]

- Spectral distances between corresponding channels of the two BRIRs in a position-pair differ among room positions, and between the ears
- We can also compute the distance between the ILD profiles of the two BRIRs in a pair, and these distances also differ widely among room positions
So: is the dichotic effect due to 'selecting' the ear with the biggest distance, or is it due to the different ILD profiles of the room-positions?

Speaker notes: Here I'm plotting "auditory" spectra generated by processing the BRIRs with a gammatone filterbank (32 channels, equally spaced on an ERB-rate scale between 200 Hz and 8 kHz); dB is on the ordinate and channel centre frequency (Hz) on the abscissa. You can see that the two spectra are different, so below them I plot the absolute difference between the two. Taking the rms of these values across channels gives a Euclidean distance between the spectra, i.e. a monaural spectral distance between the corresponding channels of the two BRIRs of a particular position-pair. This is shown at the closest distance, but it differs among different room positions and it is different for the two ears. We can also compute an ILD profile for each BRIR by subtracting the R channel from the L, and subtract the ILD profiles of the two BRIRs in a position-pair to get a distance between the two ILD profiles; these also differ widely among room positions. In other words: are these MONAURAL distances associated with an increased frequency of room-position responses?
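A hedged sketch of this distance computation, with scipy's gammatone filter standing in for the filterbank actually used; the 32-channel, 200 Hz to 8 kHz ERB-rate layout follows the notes, and the rest is assumed:

```python
# Sketch of the spectral-distance analysis described in the notes. scipy's
# gammatone filter stands in for the filterbank actually used; the channel
# count and frequency range follow the speaker notes, everything else here
# (sample rate handling, exact power measure) is an assumption.
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(f_lo, f_hi, n):
    """Centre frequencies equally spaced on the Glasberg & Moore ERB-rate scale."""
    erb_rate = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inverse = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inverse(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n))

def channel_powers_db(x, fs, cfs):
    """Power (dB) of signal x in each gammatone channel."""
    powers = []
    for cf in cfs:
        b, a = gammatone(cf, 'iir', fs=fs)
        y = lfilter(b, a, x)
        powers.append(10 * np.log10(np.mean(y ** 2) + 1e-20))
    return np.array(powers)

def spectral_distance(p_db, q_db):
    """rms difference across channels (the 'Euclidean distance' in dB)."""
    return np.sqrt(np.mean((p_db - q_db) ** 2))

# Example use for one position-pair (brir_near, brir_far: stereo arrays, fs in Hz):
# cfs = erb_space(200.0, 8000.0, 32)
# near_left = channel_powers_db(brir_near[:, 0], fs, cfs)
# far_left = channel_powers_db(brir_far[:, 0], fs, cfs)
# monaural_left_distance = spectral_distance(near_left, far_left)
# ild_near = near_left - channel_powers_db(brir_near[:, 1], fs, cfs)
# ild_far = far_left - channel_powers_db(brir_far[:, 1], fs, cfs)
# ild_profile_distance = spectral_distance(ild_near, ild_far)
```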

Correlation with Spectral Distances

[Scatter plots (SO BRIRs): room-position response probability (0 to 1) against the monaural Euclidean spectral distance (dB) and against the Euclidean distance between ILD profiles (dB), for diotic and dichotic listening; r = 0.17 for the monaural spectral distance and r = 0.68 for the ILD-profile distance.]

- The monaural distance does not predict listeners' responses; the associated cues seem to be ignored, so listeners are unlikely to be 'selecting' an ear on this basis in dichotic conditions
- Distances between ILD profiles are predictive of responses (r² = 0.46)
- So the dichotic effect appears to arise through inter-aural processing of ILD

Speaker notes: Here I'm treating the R and L versions as distinct and plotting each distinct positioning of the impulse responses, which includes cases where the near IR is on the right and the far IR on the left, and vice versa. And the answer is… no: the monaural distances are only very WEAKLY correlated with the frequency of room-position responses.
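For completeness, the reported correlation amounts to a Pearson r between a per-pair cue distance and the per-pair response probability; a trivial sketch with invented placeholder numbers:

```python
# The reported correlations are just Pearson r between a per-position-pair cue
# distance and the per-pair 'room-position' response probability. The numbers
# below are invented placeholders, not the study's data.
import numpy as np
from scipy.stats import pearsonr

ild_profile_distance = np.array([12.0, 18.5, 22.0, 30.5, 41.0])   # dB, placeholder
room_response_prob = np.array([0.35, 0.40, 0.55, 0.60, 0.75])     # placeholder

r, p = pearsonr(ild_profile_distance, room_response_prob)
print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}, p = {p:.3f}")
```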

Conclusions

- When tracking speech in simultaneous messages, cues from talker differences are not always dominant
- Cues from position differences can sometimes be more influential, even in reverberation
- This is mostly seen when talker differences are subtle, listening is dichotic, and pitch cues are degraded
- The cues from position differences are not the messages' ITDs; they seem to be the ILDs, which still differ among positions in a typical room
- The dichotic effect doesn't seem to arise through listeners 'selecting' an ear; it appears to be due to inter-aural processing

Thanks for attending!