
Mr Background Noise and Miss Speech Perception, by Elvira Perez and Georg Meyer.


1

2 Mr Background Noise and Miss Speech Perception, by Elvira Perez and Georg Meyer

3 Structure
1. Introduction
2. Background
3. Experiments
4. Conclusions
5. Future research

4 1. Introduction: Ears receive mixtures of sounds, yet we can tolerate surprisingly high levels of noise and still orient our attention to whatever we want to attend to. But how can the auditory system do this so accurately?

5 1. Introduction: Auditory scene analysis (Bregman, 1990) is a theoretical framework that aims to explain auditory perceptual organisation. Basics:
– The environment contains multiple objects.
– The mixture is decomposed into its constituent elements, which are then grouped.
It proposes two grouping mechanisms:
– 1. ‘Bottom-up’: primitive cues (F0, intensity, location); a grouping mechanism based on Gestalt principles.
– 2. ‘Top-down’: schema-based (speech pattern matching).

6 1. Introduction: Primitive processes (Gestalt; Koffka, 1935):
– Similarity: sounds are grouped into a single perceptual stream if they are similar in pitch, timbre, loudness or location.
– Good continuation: smooth changes in frequency, intensity, location or spectrum are perceived as changes in a single source, whereas abrupt changes indicate a change in source.
– Common fate: two components of a sound sharing the same kind of change at the same time (e.g. onset-offset) are perceived and grouped as parts of the same single source.
– Disjoint locations: a given element in a sound can only form part of one stream at a time (cf. duplex perception).
– Closure: when parts of a sound are masked or occluded, that sound is perceived as continuous if there is no direct sensory evidence to indicate that it has been interrupted.

7 1. Introduction: Criticisms:
– Too simplistic: whatever cannot be explained by the primitive processes is explained by the schema-based processes.
– Primitive processes only work in the lab.
Sine-wave replicas of utterances (Remez et al., 1992):
– Phonetic principles of organisation find a single speech stream, whereas auditory principles find several simultaneous whistles.
– Grouping is by phonetic rather than by simple auditory coherence.

8 2. Background: Previous studies (Meyer and Barry, 1999; Harding and Meyer, 2003) have shown that perceptual streams may also be formed from complex features such as formants in speech, and not just from low-level sounds like tones.

9 2. Background: The perception of a synthetic /m/ changes to /n/ when it is preceded by a vowel with a high second formant and there is no transition between the vowel and the consonant: heard as /m/ when the grouped F2 matches that of an /m/, heard as /n/ when the grouped F2 matches that of an /n/.

10 3. Experiments (baseline): The purpose of these studies is to explore how noise (a chirp) affects speech perception. The stimulus is a vowel-nasal syllable which is perceived as /en/ if presented in isolation, but as /em/ if it is presented with a frequency-modulated sine wave in the position where the second formant transition would be expected. In the three experiments participants categorised the synthetic syllable they heard as /em/ or /en/. Direction, duration and position of the chirp were the manipulated variables.

11 3. Experiments: The perception of a nasal /n/ changes to /m/ when a chirp is added between the vowel and the nasal F2. [Figure: schematic spectrogram; formant frequency (Hz) marked at 375, 800, 2000 and 2700; vowel followed by nasal, 100-200 ms.]

12 3. Experiments: ‘Chirp down’ means that the frequency of its sine wave changes from 2 kHz to 1 kHz; ‘chirp up’ means that its frequency changes from 1 kHz to 2 kHz.
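The two chirp stimuli can be sketched in a few lines with `scipy.signal.chirp`, assuming a 44.1 kHz sample rate and a linear frequency sweep (neither is stated on the slides):

```python
import numpy as np
from scipy.signal import chirp

FS = 44100                    # assumed sample rate (Hz)
DUR = 0.030                   # a 30 ms chirp, one of the durations tested
n = int(round(FS * DUR))
t = np.arange(n) / FS

# "Chirp down": a frequency-modulated sine sweeping from 2 kHz to 1 kHz.
chirp_down = chirp(t, f0=2000, t1=DUR, f1=1000, method="linear")
# "Chirp up": the same sweep reversed, 1 kHz to 2 kHz.
chirp_up = chirp(t, f0=1000, t1=DUR, f1=2000, method="linear")
```

Splicing such a chirp into the vowel-nasal gap (in place of the missing F2 transition) reproduces the stimulus manipulation described above.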

13 3. Experiments: Subjects: 7 male and 6 female. Their task was always the same: after each signal presentation, subjects were asked to judge whether the syllable sounded more like an /em/ or an /en/ by pressing the appropriate button on a computer graphics tablet. There was no transition between the vowel and the nasal. A chirp between the vowel and the nasal F2 was added to the signal.

14 Experiment 1: Baseline/Direction (chirp up vs. chirp down). In 80% of the trials the participants heard the difference between the up and down chirp.

15 Experiment 2: Duration. [Figure: vowel-nasal stimulus schematic, 100-200 ms.]

16 Experiment 3: Position. [Figure: vowel-nasal stimulus schematic.]

17 5. Conclusions: Chirps of 4 ms and 10 ms duration, independent of their direction, are apparently integrated into the speech signal and change the percept from /en/ to /em/. For longer chirps the direction matters, but for chirp durations up to 30 ms the majority of stimuli are still heard as /em/. Subjects very clearly hear two objects, so some scene analysis is taking place, since the chirp is not completely integrated into the speech.

18 Duplex perception with one ear. It seems that listeners can also discriminate the direction of motion of the chirp when they focus their attention on the chirp and a higher level of auditory processing takes place (80% accuracy).

19 The data suggest that a low-level spectro-temporal analysis of the transitions is carried out, because: (1) the spectral structure of the chirp is very different from the harmonically structured speech sound, and (2) the direction of the chirp motion has little significance in the percept. These results complement the auditory scene analysis theoretical framework.

20 Mr Background Noise: Do human listeners actively generate a representation of background noise to improve speech recognition? Hypothesis (noise prediction): recognition performance should be highest if the spectral and temporal structure of the interfering noise is regular, so that a good noise model can be generated, and lower for random noise.

21 Experiments 4 & 5: Same stimuli as before (chirp down). The amplitude of the chirp varied (5 conditions). Background noise (down chirps):
– Quantity: lots vs. few
– Quality: random vs. regular
Two-alternative forced-choice categorisation task. Threshold shifts measured.

22 [Figure: example stimulus sequences in the regular and irregular background-noise conditions, with /en/ target syllables occurring throughout.]

23 Each point in the scatter plot is the mean threshold over all subjects for a given session. The solid lines show the Boltzmann fit (Eq. (1)) for each individual subject in the five different conditions. All the fits have the same upper and lower asymptotes.
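The slides do not reproduce Eq. (1); a standard Boltzmann sigmoid fitted with `scipy.optimize.curve_fit` on made-up psychometric data would look like this (the amplitudes and response proportions below are illustrative, not the experiment's values):

```python
import numpy as np
from scipy.optimize import curve_fit

def boltzmann(x, x0, dx, a1, a2):
    """Boltzmann sigmoid: rises from lower asymptote a1 to upper
    asymptote a2, with midpoint x0 and slope parameter dx."""
    return a2 + (a1 - a2) / (1.0 + np.exp((x - x0) / dx))

# Hypothetical data: chirp amplitude (dB) vs. proportion of /em/ responses.
amp = np.array([-20.0, -15.0, -10.0, -5.0, 0.0, 5.0, 10.0])
p_em = np.array([0.05, 0.10, 0.30, 0.55, 0.80, 0.95, 0.98])

popt, _ = curve_fit(boltzmann, amp, p_em, p0=[-5.0, 3.0, 0.0, 1.0])
threshold = popt[0]   # midpoint of the sigmoid = threshold estimate
```

Fixing `a1` and `a2` across subjects, as the slide describes, would correspond to sharing those two parameters over all fits and only letting `x0` and `dx` vary.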

24 Experiment 4 (random/regular): lots vs. few (t = -3.34, df = 38, p = 0.001); control vs. lots (t = -3.34, df = 38, p = 0.001). No effect between random and regular.
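A sketch of the reported comparison using `scipy.stats.ttest_ind` on hypothetical threshold shifts; the per-session values are not given in the slides, only the group sizes implied by df = 38:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical threshold shifts (dB): 20 sessions per condition, so that
# df = 20 + 20 - 2 = 38 as reported on the slide.
few = rng.normal(0.0, 1.0, size=20)
lots = rng.normal(1.5, 1.0, size=20)

t_stat, p_value = ttest_ind(few, lots)   # independent-samples t-test
df = len(few) + len(lots) - 2
```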

25 Two aspects change from Experiment 4 to 5: the amplitude scale of the chirps, and the ‘lots’ condition, which now includes 100/20’’ where before it was 170/20’’.

26 Experiment 5 (random/regular): lots vs. few (t = 2.27, df = 38, p = 0.05); control vs. lots (t = 3.12, df = 38, p < 0.05). No effect between random and regular.

27 Conclusions: Only the amount of background noise seems to affect performance in the recognition task. The sequential grouping of the background noise seems to be an irrelevant cue for improving auditory stream segregation, and therefore speech perception. This is a counterintuitive phenomenon: attention must be focused on an object (here, the background noise) for a change in that object to be detected (Rensink et al., 1997).

28 Change deafness may occur as a function of (not) attending to the relevant stimulus dimension (Vitevitch, 2003). Speech carries lexical information (meaning), indexical information (age, gender, emotions of the talker) and acoustic information (loudness, pitch, timbre).

29 The irrelevant sound effect (ISE) (Colle & Welsh, 1976) disrupts serial recall. The level of meaning (reversed vs. forward speech), the predictability of the sequence (random vs. regular) and the similarity (semantic or physical) of the irrelevant sound to the target material seem to have little impact on the focal task (Jones et al., 1990). Changing state: the degree of variability or physical change within an auditory stream is the primary determinant of the degree of disruption in the focal task.

30 [Figure: frequency-time schematics. Left: one stream, a changing single stream alternating between two frequencies. Right: two streams, two unchanging streams at fixed frequencies.]
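The two configurations above can be mocked up as tone sequences; the sample rate, tone duration and frequencies here are illustrative, not the ones used in the experiment:

```python
import numpy as np

FS = 16000                 # assumed sample rate (Hz)
TONE = 0.1                 # assumed 100 ms per tone

def tone(freq, dur=TONE, fs=FS):
    """A pure tone of the given frequency and duration."""
    t = np.arange(int(round(dur * fs))) / fs
    return np.sin(2 * np.pi * freq * t)

# Changing single stream: one source alternating between low and high tones.
changing = np.concatenate([tone(f) for f in (500, 1000) * 4])

# Two unchanging streams: a steady low tone plus a steady high tone,
# each continuous for the whole sequence.
steady = tone(500, dur=8 * TONE) + tone(1000, dur=8 * TONE)
```

On the changing-state account, the `changing` sequence should be the more disruptive background, even though both signals contain the same two frequencies.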

31 Experiment 6: Shannon et al. (1999) database of CVC syllables. Constant white noise + high-frequency burst + low-frequency burst. 3 conditions: control, background noise as 1 stream, background noise as 2 streams. Does auditory stream segregation require attention?

32 Results: Is there a confounding effect (simply a greater quantity of noise)? Tukey test, p < 0.001.

33 Procedure: white noise 68 dB; high-frequency burst 71 dB; low-frequency burst 67 dB; speech signal 59-64 dB (61.5 dB).
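Setting stimuli to relative levels like these can be sketched by RMS scaling, taking the 68 dB white noise as the reference; absolute SPL calibration would depend on the playback chain and is not attempted here:

```python
import numpy as np

def set_level(x, level_db, ref_db=68.0):
    """Scale signal x so its RMS corresponds to level_db, taking ref_db
    (here the 68 dB white noise) as the level of an RMS of 1.0."""
    gain = 10.0 ** ((level_db - ref_db) / 20.0)
    return x * (gain / np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(1)
noise = set_level(rng.standard_normal(16000), 68.0)    # white noise, 68 dB
speech = set_level(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000), 61.5)
mix = noise + speech   # speech sits 6.5 dB below the noise
```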

34 Conclusions: A changing-state stream is more disruptive than two steady-state streams. This relates to working memory or STM and the phonological loop (Baddeley, 1999). Auditory streaming can occur in the absence of focal attention.

35 6. Future research: Streaming and focal attention. Streaming and location (diotic vs. dichotic presentation).

36 Thank you

37 References:
Bregman, A.S. (1990). Auditory Scene Analysis: The Perceptual Organisation of Sound. MIT Press, Cambridge, MA.
Harding and Meyer, G. (2003). Changes in the perception of synthetic nasal consonants as a result of vowel formant manipulations. Speech Communication, 39, 173-189.
Meyer, G. and Barry, W. (1999). Continuity-based grouping affects the perception of vowel-nasal syllables. In: Proceedings of the 14th International Congress of Phonetic Sciences. University of California.

38 Table 1: Klatt synthesizer parameters. Fx: formant frequency (Hz), B: bandwidth (Hz), A: amplitude (dB). The nasals have nasal formants at 250 Hz and nasal zeros at 750 Hz (/m/) and 1000 Hz (/n/).

        F1    B    A    F2    B    A    F3    B    A
Vowel   375   75   60   2000  75   60   2700  75   50
/n/     250   —    60   2000  —    60   2700  300  60
/m/     250   —    60   1000  —    60   2700  300  60

(—: value missing in the source.)
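A minimal cascade formant synthesizer using the vowel parameters from Table 1 might look like the sketch below. The sample rate and glottal source are assumptions, and the nasal branch (nasal formants and zeros) is omitted:

```python
import numpy as np
from scipy.signal import lfilter

FS = 10000  # assumed sample rate (Hz); classic Klatt synthesizers ran at 10 kHz

def resonator(x, f, bw, fs=FS):
    """Second-order digital resonator at centre frequency f (Hz) with
    bandwidth bw (Hz), normalised to unity gain at DC."""
    r = np.exp(-np.pi * bw / fs)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * f / fs)
    a2 = r * r
    b0 = 1.0 + a1 + a2
    return lfilter([b0], [1.0, a1, a2], x)

# Crude 100 Hz impulse-train source, 200 ms long (a stand-in for the glottal source).
src = np.zeros(int(round(0.2 * FS)))
src[:: FS // 100] = 1.0

# Cascade the vowel formants from Table 1: F1=375, F2=2000, F3=2700 Hz, B=75 Hz.
vowel = src
for f in (375, 2000, 2700):
    vowel = resonator(vowel, f, 75)
```

Swapping in the /m/ or /n/ rows of the table (and adding the nasal pole-zero pair) would give the nasal portion of the syllable.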

39 Spectrogram of the stimulus used in the three experiments, with vowel formants at 375, 2000 and 2700 Hz corresponding to an /e/, and with nasal prototype formants at 250, 2000 and 2700 Hz corresponding to an /n/. There were no transitions between the vowel and the nasal.

