
Auditory Scene Analysis Chris Darwin

Need for sound segregation
– Ears receive a mixture of sounds
– We hear each sound source as having its own appropriate timbre, pitch and location
– Stored information about sounds (e.g. acoustic/phonetic relations) probably concerns a single source
– Need to make single-source properties (e.g. silence) explicit

Making properties explicit
– Single-source properties are not explicit in the input signal, e.g. silence (Darwin & Bethell-Fox, JEP:HPP 1977)
– NB: experience of yodelling may alter your susceptibility to this effect

Mechanisms of segregation
– Primitive grouping mechanisms, based on general heuristics such as harmonicity and onset time: “bottom-up” / “pure audition”
– Schema-based mechanisms, based on specific knowledge (general speech constraints?): “top-down”

Segregation of simple musical sounds
Successive segregation:
– different frequency (or pitch)
– different spatial position
– different timbre
Simultaneous segregation:
– different onset time
– irregular spacing in frequency
– location (rather unreliable)
– uncorrelated FM (not used)

Successive grouping by frequency (Tracks 7 & 8): Bugandan xylophone music, “Ssematimba ne Kikwabanga”

Not peripheral channelling
Streaming occurs for sounds:
– with the same auditory excitation pattern but different periodicities (Vliegen, J. and Oxenham, A. J. (1999). “Sequential stream segregation in the absence of spectral cues,” J. Acoust. Soc. Am. 105)
– with Huggins-pitch sounds that are defined only binaurally (Akeroyd, Carlyon & Deeks (2005), JASA 118)

Huggins pitch: the same noise is presented to the two ears, except that the interaural phase difference passes from 0 through 2π in a narrow band around 500 Hz. Listeners hear “a faint tone” at that frequency.
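A minimal synthesis sketch (my own illustration, not code from the lecture), assuming numpy; the 480–520 Hz transition band, sample rate and duration are arbitrary choices:

```python
import numpy as np

fs, dur = 44100, 2.0
n = int(fs * dur)
rng = np.random.default_rng(0)
left = rng.standard_normal(n)              # broadband noise to the left ear

# Right ear: the same noise, with an interaural phase shift ramping from
# 0 to 2*pi across a narrow band centred on 500 Hz (as described above).
spec = np.fft.rfft(left)
freqs = np.fft.rfftfreq(n, 1.0 / fs)
lo, hi = 480.0, 520.0
band = (freqs >= lo) & (freqs <= hi)
phase = 2 * np.pi * (freqs[band] - lo) / (hi - lo)   # 0 -> 2*pi ramp
spec[band] *= np.exp(1j * phase)
right = np.fft.irfft(spec, n)

stereo = np.stack([left, right], axis=1)
stereo /= np.abs(stereo).max()             # normalise for playback
# Over headphones this is heard as flat noise plus a faint ~500-Hz tone,
# even though each ear's signal on its own is just flat noise.
```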

Successive grouping by frequency Track 2

Successive grouping by timbre: the Wessel illusion

Successive grouping by spatial separation Track 41

Sach & Bailey: rhythm unmasking by ITD or by spatial position? ITD is sufficient, but sequential segregation follows spatial position rather than ITD alone. (Conditions: target ITD = 0, ILD = 0; target ITD = 0, ILD = +4 dB; plus masker.)

Simultaneous: frequency vs space. Left/right presentation of the introduction to the 4th movement of Tchaikovsky’s Pathétique Symphony (from Diana Deutsch, 1999): frequency wins!

Build-up of segregation: “Horse” → “Morse”
-LHL-LHL-LHL-  →  --H---H---H--  plus  -L-L-L-L-L-L-L
Segregation takes a few seconds to build up. Once it has, between-stream temporal/rhythmic judgments are very difficult.
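The sequence is easy to synthesise for yourself; here is a rough sketch (my own illustration; the tone durations and frequencies are arbitrary, not those of the demo):

```python
import numpy as np

fs = 44100  # sample rate (Hz)

def tone(freq, dur=0.08):
    """A short tone with a raised-cosine envelope to avoid clicks."""
    t = np.arange(int(fs * dur)) / fs
    return np.hanning(len(t)) * np.sin(2 * np.pi * freq * t)

def horse_morse(f_low, f_high, repeats=10):
    """-LHL- cycles: silence, low, high, low, repeated."""
    gap = np.zeros(int(fs * 0.08))
    cycle = np.concatenate([gap, tone(f_low), tone(f_high), tone(f_low)])
    return np.tile(cycle, repeats)

one_stream = horse_morse(400, 440)    # small separation: one galloping rhythm
two_streams = horse_morse(400, 1000)  # large separation: after a few seconds,
                                      # splits into --H---H-- and -L-L-L-L-
```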

Some interesting points:
– Sequential streaming may require attention, rather than being a pre-attentive process.

Attention necessary for build-up of streaming (Carlyon et al., JEP:HPP 2000)
-LHL-LHL-LHL-  →  --H---H---H--  plus  -L-L-L-L-L-L-L
“Horse” → “Morse” takes a few seconds to segregate, and those have to be seconds spent attending to the tone stream. Does this also apply to other types of segregation?

Continuity and streaming Discontinuous frequency changes produce more streaming than do continuous changes

Capturing a component from a mixture by frequency proximity (Bregman & Pinker, 1978, Canad. J. Psychol.): tone A alternates with the two-tone complex BC; capture depends on the frequency separation of A and B versus the harmonicity and synchrony of B and C. Disjoint allocation?

Rhythmic masking release Simultaneous onset prevents a component from forming part of a sequential stream Track 22

Simultaneous grouping
What is the timbre / pitch / location of a particular sound source?
Important grouping cues:
– continuity (or repetition): “Old + New”
– onset time
– harmonicity (or regularity of frequency spacing)

Bregman’s Old + New principle. Stimulus: A followed by A+B. Percept: A as continuous (or repeated), with B added as a separate percept.

(Schematic of the Old+New heuristic: tone A alternating with mixture M = A + B; each mixture is parsed as the continuing A plus a new B.)
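As a concrete illustration of the stimulus (a sketch of my own; the 500-Hz A and 1500-Hz B are arbitrary choices):

```python
import numpy as np

fs = 44100

def tone(freq, dur=0.25):
    t = np.arange(int(fs * dur)) / fs
    return np.hanning(len(t)) * np.sin(2 * np.pi * freq * t)

A = tone(500)                  # the "old" sound
M = tone(500) + tone(1500)     # the mixture M = A + B
seq = np.concatenate([A, M] * 6)
# Percept: A is heard as continuing (or repeating) through each mixture,
# with the 1500-Hz component B popping out as a separate added sound.
```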

Rate of onset and continuity: rapid increases in level lead to an Old+New percept; gradual increases are just heard as the sound getting louder. (Track 32)

Grouping & vowel quality (schematic time/frequency diagram)

Grouping & vowel quality (2): schematic time/frequency diagrams of a vowel plus a captor tone, with the continuation either removed from the vowel or not removed from the vowel.

Onset time: allocation is subtractive, not exclusive. Bregman’s Old-plus-New heuristic indicates the importance of coding change.

Asynchrony & vowel quality: F1 boundary (Hz) plotted against onset asynchrony T (ms) for a 90-ms vowel; 8 subjects; comparison condition with no 500-Hz component.

Mistuning & pitch: mean pitch shift (Hz) for 90-ms vowel and complex, plotted against % mistuning of the 4th harmonic; 8 subjects.

Onset asynchrony & pitch: mean pitch shift (Hz) for vowel and complex with ±3% mistuning, plotted against onset asynchrony T (ms); 8 subjects; 90-ms sounds.

Some interesting points:
– Sequential streaming may require attention, rather than being a pre-attentive process.
– Parametric behaviour of grouping depends on what it is for.

Grouping for… The effectiveness of a parameter on grouping depends on the task, e.g. a 10-ms onset time allows a harmonic to be heard out, a 40-ms onset time is needed to remove it from vowel quality, and >100 ms is needed to remove it from pitch.

Minimum onset asynchrony needed for:
– a harmonic in a vowel to be heard out: c. 10 ms
– a harmonic to be removed from the vowel: 40 ms
– a harmonic to be removed from the pitch: 200 ms

Grouping is not absolute, nor independent of classification (group → classify).

Apparent continuity: if B would have been masked if it HAD been there, then you don’t notice that it is not there.
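Here is a rough sketch of such a stimulus (my own illustration with arbitrary parameters; the noise level is chosen to be well above what would be needed to mask the tone):

```python
import numpy as np

fs = 44100
rng = np.random.default_rng(1)

def seg(dur, freq=1000.0):
    t = np.arange(int(fs * dur)) / fs
    return np.sin(2 * np.pi * freq * t)

gap = int(fs * 0.15)
interrupted  = np.concatenate([seg(0.4), np.zeros(gap), seg(0.4)])
noise_filled = np.concatenate([seg(0.4), 4.0 * rng.standard_normal(gap), seg(0.4)])
# With silence, the tone is clearly interrupted. With a noise burst intense
# enough that the tone would have been masked had it continued, the tone
# seems to continue right through the noise.
```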

Continuity & grouping. Harmonic: heard as (1) a pulsing complex. Enharmonic: heard as (1) a pulsing high tone and (2) a steady low tone. Tones are grouped first; then continuity is decided.

Some interesting points:
– Sequential streaming may require attention, rather than being a pre-attentive process.
– Parametric behaviour of grouping depends on what it is for.
– Not everything that is obvious on an auditory spectrogram can be used: FM of Fo is irrelevant for segregation (Carlyon, JASA 1991; Summerfield & Culling 1992).

ΔFo between two sentences (Bird & Darwin 1998; after Brokx & Nooteboom, 1982)
– Two sentences (same talker), only voiced consonants (with very few stops), thus maximising the Fo effect
– Task: write down the target sentence
– Target sentence Fo = 140 Hz; masking sentence Fo = 140 Hz ± 0, 1, 2, 5, 10 semitones
– Replicates & extends Brokx & Nooteboom

Carlyon: across-frequency FM coherence. Which is the odd one out of 2 or 3? (5-Hz, 2.5% FM.) Harmonic: easy; inharmonic: impossible. Carlyon, R. P. (1991). “Discriminating between coherent and incoherent frequency modulation of complex tones,” J. Acoust. Soc. Am. 89.

McAdams: FM in sung vowels (Bregman demo 24)

Role of localisation cues
What role do localisation cues play in helping us to hear one voice in the presence of another?
– Head shadow increases S/N at the nearer ear (Bronkhorst & Plomp, 1988)…
– … but this advantage is reduced if the high frequencies are inaudible (B & P, 1989)
But do localisation cues also contribute to selectively grouping different sound sources?

Some interesting points:
– Sequential streaming may require attention, rather than being a pre-attentive process.
– Parametric behaviour of grouping depends on what it is for.
– Not everything that is obvious on an auditory spectrogram can be used: FM of Fo is irrelevant for segregation (Carlyon, JASA 1991; Summerfield & Culling 1992).
– Although we can group sounds by ear, ITDs by themselves are remarkably useless for simultaneous grouping. Group first, then localise the grouped object.

Separating two simultaneous sound sources: noise bands played to different ears group by ear, but noise bands differing only in ITD do not segregate.

Segregation by ear but not by ITD (Culling & Summerfield, 1995): noise-band vowels (“ar”, “ee”, “er”, “oo”) with the bands distributed either across the two ears or across two interaural delays. Task: what vowel is on your left? (“ee”)

Two models of attention

Phase ambiguity
500 Hz: period = 2 ms, so “R leads by 1.5 ms” is equivalent to “L leads by 0.5 ms”. The cross-correlation peaks at +0.5 ms and -1.5 ms, and the auditory system is weighted to the peak closest to zero: a 500-Hz pure tone leading in the right ear by 1.5 ms is heard on the left side.
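The ambiguity falls directly out of the cross-correlation; here is a small numerical sketch (my own, assuming numpy; circular correlation is acceptable here because the tone is steady and exactly periodic over the buffer):

```python
import numpy as np

fs = 96000                      # high rate for fine lag resolution
f, itd = 500.0, 1.5e-3          # 500 Hz (period 2 ms); right ear leads by 1.5 ms
t = np.arange(int(0.05 * fs)) / fs
left = np.sin(2 * np.pi * f * t)
right = np.sin(2 * np.pi * f * (t + itd))

# Interaural cross-correlation: lag k = internal delay applied to the right ear.
max_lag = int(2e-3 * fs)
lags = np.arange(-max_lag, max_lag + 1)
icc = np.array([np.dot(left, np.roll(right, k)) for k in lags])

peaks_ms = lags[icc > 0.9999 * icc.max()] / fs * 1e3
print(peaks_ms)  # ~[-0.5, 1.5]: "L leads by 0.5 ms" and "R leads by 1.5 ms"
# For a pure tone the two interpretations are one period apart and equally
# good; a weighting toward lags near zero picks the 0.5-ms interpretation,
# so the tone is heard on the left.
```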

Disambiguating phase ambiguity: narrowband noise at 500 Hz with an ITD of 1.5 ms (3/4 cycle) is heard at the lagging side; increasing the noise bandwidth changes the location to the leading side. Explained by across-frequency consistency of ITD (Jeffress; Trahiotis & Stern).

Resolving phase ambiguity: cross-correlation peaks for noise delayed in one ear by 1.5 ms, plotted as cross-correlator delay (ms) against auditory-filter frequency (Hz). At 500 Hz (period 2 ms), “L lags by 1.5 ms” is ambiguous with “L leads by 0.5 ms”; at 300 Hz (period 3.3 ms), with “L leads by 1.8 ms”. Only the actual delay (left ear lags by 1.5 ms) lines up across frequency.

Segregation by onset time: schematic spectra (frequency in Hz against duration, 0–400 ms) for synchronous and asynchronous conditions; ITD ±1.5 ms (3/4 cycle at 500 Hz).

Segregated tone changes location: pointer IID (dB, R/L) as a function of onset asynchrony (ms), for pure and complex conditions.

Segregation by mistuning: schematic spectra (frequency in Hz against duration, 0–400 ms) for in-tune and mistuned conditions.

Mistuned tone changes location

Mechanisms of segregation
– Primitive grouping mechanisms, based on general heuristics such as harmonicity and onset time: “bottom-up” / “pure audition”
– Schema-based mechanisms, based on specific knowledge (general speech constraints?): “top-down”

Hierarchy of sound sources? Orchestra → 1st violin section → leader → chord → lowest note → attack (2nd violins…). A corresponding hierarchy of constraints?

Is speech a single sound source? Multiple sources of sound:
– vocal folds vibrating
– aspiration
– frication
– burst explosion
– clicks (Nama: “baboon’s arse”)

Tuvan throat music

Mechanisms of grouping / segregation
– Primitive grouping mechanisms, based on general heuristics such as harmonicity and onset time: “bottom-up” / “pure audition”. Evidence: Fo differences on simultaneous speech.
– Schema-based mechanisms, based on specific knowledge (general speech constraints?): “top-down” / “segregation by recognition”. Evidence: sine-wave speech.

Sine-wave speech: one is OK... (Bailey et al., Haskins SR 1977; Remez et al., Science 1981)

... but how about two? (Barker & Cooke, Speech Comm 1999.) Onset time provides the only bottom-up cue.

Both approaches could be true: bottom-up processes constrain the alternatives considered by top-down processes, e.g. the cafeteria model (Darwin, QJEP 1981). Evidence: onset time segregates a harmonic from a vowel even if it produces a “worse” vowel (Darwin, JASA 1984).

Low-level cues for separating a mixture of two sounds such as speech. Look for:
– harmonic series (a toy sketch of this cue follows below)
– sounds starting at the same time
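As a toy illustration of the harmonicity cue (my own sketch, not a model from the lecture): a simple “harmonic sieve” that keeps the spectral peaks lying close to integer multiples of a candidate Fo and rejects the rest:

```python
def harmonic_sieve(peaks_hz, f0, tol=0.01):
    """Split spectral peaks into those within a proportional tolerance of
    a harmonic of f0 (grouped) and the remainder (attributed elsewhere)."""
    grouped, rest = [], []
    for f in peaks_hz:
        n = max(1, round(f / f0))          # nearest harmonic number
        if abs(f - n * f0) <= tol * n * f0:
            grouped.append(f)
        else:
            rest.append(f)
    return grouped, rest

# Idealised peak frequencies from a 140-Hz voice mixed with a 190-Hz voice:
peaks = [140, 190, 280, 380, 420, 560, 570, 760]
print(harmonic_sieve(peaks, 140))
# -> ([140, 280, 420, 560], [190, 380, 570, 760])
```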

Plan
– How does ΔFo help in separating sound sources? (within- vs across-formant grouping)
– Effect of localisation cues on grouping & attention: grouping by ear & by ITD; maintaining attention to a sound source (ITD, prosody, VT length)

Broadbent & Ladefoged (1957): PAT-generated sentence “What did you say before that?”, with F1 and F2 presented separately.
– When Fo was the same, 125 Hz (either natural or monotone), listeners heard: one voice only (16/18), in one place (18/18).
– When Fo was different, 125/135 Hz (monotone), listeners heard: two voices (15/18), in two places (12/18).
But as B & L admit...

... Harvey Fletcher (1953) was there first! (Almost.) p. 216 describes an experiment (suggested by Arnold): speech fuses, but polyphonic music sounds weird, since different notes are heard at different places.

Conclusion: common Fo integrates broadband frequency regions of a single voice, coming simultaneously to different ears, into a single voice heard in one position. But what about Fo’s ability to separate different voices (the original B & L question)?

… but Cutting (1976): /da/ with F1 + F2 on the same Fo to different ears gave only 60% “one-item” responses. Listening to Broadbent & Ladefoged-type sentences gives me a very clear impression of two different things in the two ears. Does common Fo help to integrate formants?

ΔFo improves identification of double vowels and of sentences.

Mechanisms of the ΔFo improvement: A. across-formant grouping by Fo; B. better definition of individual formants, especially where harmonics are resolved. B is more important than A for double vowels (Culling & Darwin, JASA 1993). Is this also true for sentences?

ΔFo between two sentences (Bird & Darwin 1998; after Brokx & Nooteboom, 1982)
– Target sentence Fo = 140 Hz; masking sentence Fo = 140 Hz ± 0, 1, 2, 5, 10 semitones
– Two sentences (same talker), only voiced consonants (with very few stops)
– Task: write down the target sentence
– Replicates & extends Brokx & Nooteboom

Chimeric sentences (Bird & Darwin, Grantham, 1998): the Fo below 800 Hz and the Fo above 800 Hz are paired from sentences differing in Fo by various numbers of semitones.

ΔFo only in the low- or high-frequency region: all the action is in the low-frequency region (<800 Hz).

Segregating Fo-chimeric sentences: inappropriate pairing of Fo is only detrimental at or above 4 semitones, so across-formant grouping is only important at or above 4 semitones.

Summary of ΔFo effects in separating competing voices:
– Intelligibility is increased by a small ΔFo in the F1 region…
– … but not by a ΔFo in the higher-frequency region.
– Across-formant consistency of Fo is only important at larger ΔFo.

Hi / low complementarity. Intelligibility of competing voices is increased in:
– low frequencies: Fo differences allow better estimates of F1
– high frequencies: spatial separation and head shadow give a better S/N ratio (Bronkhorst & Plomp, 1988), but these frequencies may not be audible?

Harmonicity or regular spacing? Roberts & Brunstrom, “Perceptual coherence of complex tones” (2001), J. Acoust. Soc. Am. 110. Listeners adjust a mistuned component; similar results for harmonic and for linearly frequency-shifted complexes.

Auditory grouping and ICA / BSS: do grouping principles work because they provide some degree of statistical independence in a time-frequency space? If so, why do the parametric values vary with the task?

Speech music

Bregman long summary

Cues used by the ASA process
*The perceptual segregation of sounds in a sequence depends upon differences in their frequencies, pitches, timbres (spectral envelopes), center frequencies (of noise bands), amplitudes, and locations, and upon sudden changes of these variables. Segregation also increases as the duration of silence between sounds in the same frequency range gets longer.
*The perceptual fusion of simultaneous components to form single perceived sounds depends on their onset and offset synchrony, frequency separation, regularity of spectral spacing, binaural frequency matches, harmonic relations, parallel amplitude modulation, and parallel gliding of components. [Note to physicists: All these cases of fusion can be obtained at room temperature.]
*Different cues for stream segregation compete to control the grouping, and different cues have different strengths.
*Primitive grouping occurs even when the frequency and timing of the sequence is unpredictable.
*An increased biasing toward stream segregation builds up with longer exposure to sounds in the same frequency region.
*Stream segregation is context-dependent, involving the competition of alternative organizations.

Effects of ASA on perception
*A change in perceptual grouping can alter the perception of rhythms, melodic patterns, and overlap of sounds.
*Patterns of sounds whose members are distributed into more than one perceptual stream are much harder to perceive than those wholly contained within a single stream.
*Perceptual organization can affect perceived loudness and spatial location.
*The rules of ASA try to prevent the crossing of streams in frequency, whether the acoustic material is a sequence of discrete tones or continuously gliding tones.
*Known principles of ASA can predict the camouflage of melodies and rhythms when interfering sounds are interspersed or mixed with a to-be-recognized sequence of sounds.
*The apparent continuity of sounds through masking noise depends on ASA principles. Stimuli have included frequency glides, amplitude-varying tones, and narrow-band noises.
*A perceptual stream can alter another one by capturing some of its elements.
*The apparent spatial position of a sound can be altered if some of its energy becomes grouped with other sounds.
*Comodulation masking release (CMR) does not make the presence of the target more discriminable by simply altering the timbre of the target-masker mixture; it actually increases the subjective experience that the target is present.
*Sequential capturing can affect the perception of speech, specifically the integration of perceptually isolated components in speech-sound identification.
*The segregation of vowels increases when they have different pitches and different pitch transitions. We have looked at synthetic vowels that do or do not have harmonic relations between frequency components.
*ASA principles help explain the construction of music, e.g., rules of voice leading.
*ASA principles are used intuitively by composers to control dissonance in polyphonic music.
*The segregation of streams of visual apparent motion works in exactly the same way as auditory stream segregation.