Using Fo and vocal-tract length to attend to one of two talkers. Chris Darwin, University of Sussex. With thanks to: Rob Hukin, John Culling, John Bird; MRC & EPSRC.
1. Review past work on the way that the human auditory system uses differences in Fo to separate two voices; 2. Present new data on the use of Fo, vocal-tract length and their combination to allow listeners to select one of two simultaneous messages. Something old, something new, something borrowed, background blue.
Three types of experiment. Difference in Fo leads to: 1. binaural separation of sound sources; 2. increase in intelligibility; 3. ability to track a sound source over time.
Broadbent & Ladefoged (1957): PAT-generated sentence “What did you say before that?” (F1, F2). When Fo the same (125 Hz, either natural or monotone), listeners heard: one voice only 16/18; in one place 18/18. When Fo different (125/135 Hz, monotone), listeners heard: two voices 15/18; in two places 12/18.
... Harvey Fletcher (1953) was there first! (almost). P. 216 describes an experiment (suggested by Arnold). Speech fuses but polyphonic music sounds weird since different notes are heard at different
B & L conclusion: a common Fo integrates broadband frequency regions of a single voice, coming simultaneously to different ears, into a single voice heard in one position.
Is a common Fo sufficient for fusion? Broadbent & Ladefoged's stimuli used formant resonators with broad low-frequency skirts. Sharply-filtered sounds sometimes give impression of two sound sources even with common Fo.
Formant T(f) & abs difference
Dichotic: same Fo. Original → PSOLA (Fo → 0%) → LP filter → left ear; original → PSOLA (Fo → 0%) → HP filter → right ear. (Apologies to Hideki.)
Dichotic: different Fo. Original → PSOLA (Fo → −4%) → LP filter → left ear; original → PSOLA (Fo → +4%) → HP filter → right ear.
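To put the ±4% Fo shifts on the same scale as the semitone steps used later in the talk, the percentage can be converted to an interval in semitones. This is a minimal sketch; the function names are illustrative and not from the slides.

```python
import math

def percent_to_ratio(percent):
    """Frequency ratio for a percentage shift, e.g. +4 -> 1.04."""
    return 1.0 + percent / 100.0

def ratio_to_semitones(ratio):
    """Interval in equal-tempered semitones for a frequency ratio."""
    return 12.0 * math.log2(ratio)

# Fo shifted down 4% in one band and up 4% in the other band.
down = percent_to_ratio(-4.0)
up = percent_to_ratio(+4.0)
separation = ratio_to_semitones(up / down)
print(f"Fo separation between bands: {separation:.2f} semitones")
```

Under this reading, the two bands differ in Fo by a little under one and a half semitones.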
Complementary LP/HP filters Variable bandwidth
Complementary LP/HP filters (dB)
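A rough sketch of what "complementary" filters with a variable transition bandwidth could look like, and how the level difference between the ears follows from them. The raised-cosine crossover shape, the -60 dB floor and all names are assumptions made for illustration; the slides do not specify the filter design actually used.

```python
import math

def lp_gain(f_hz, cutoff_hz, bandwidth_hz):
    """Linear gain of the low-pass channel (assumed raised-cosine transition)."""
    lo = cutoff_hz - bandwidth_hz / 2.0
    hi = cutoff_hz + bandwidth_hz / 2.0
    if f_hz <= lo:
        return 1.0
    if f_hz >= hi:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (f_hz - lo) / bandwidth_hz))

def hp_gain(f_hz, cutoff_hz, bandwidth_hz):
    """Complementary high-pass gain: the two gains sum to 1 at every frequency."""
    return 1.0 - lp_gain(f_hz, cutoff_hz, bandwidth_hz)

def level_difference_db(f_hz, cutoff_hz, bandwidth_hz, floor_db=-60.0):
    """LP-ear level minus HP-ear level in dB, with gains clipped at an assumed floor."""
    floor = 10.0 ** (floor_db / 20.0)
    lp = max(lp_gain(f_hz, cutoff_hz, bandwidth_hz), floor)
    hp = max(hp_gain(f_hz, cutoff_hz, bandwidth_hz), floor)
    return 20.0 * math.log10(lp / hp)

# Wider transition bands give smaller interaural level differences near the cut-off.
for bw in (100.0, 400.0, 1600.0):
    print(bw, round(level_difference_db(900.0, 1000.0, bw), 1))
```

With a wider transition band, more of the spectrum around the cut-off reaches both ears at similar levels, which is the overlap the dichotic results turn on.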
Dichotic Results (female voice) Filter 1 kHz
Dichotic Results (male voice)
-| Level difference | between ears (dB)
Same Fo: higher filter cut-offs need wider bandwidths.
Low-frequency overlap cf natural ILDs higher for low frequency sounds
ITD: same Fo. Original → PSOLA (Fo → 0%) → LP filter; original → PSOLA (Fo → 0%) → HP filter; both bands to left and right ears, delay ±571 µs.
ITD: different Fo. Original → PSOLA (Fo → −4%) → LP filter; original → PSOLA (Fo → +4%) → HP filter; both bands to left and right ears, delay ±571 µs.
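An interaural time difference of this size can be imposed by delaying one channel by a whole number of samples. The sketch below assumes a 44.1 kHz sample rate and the sign convention that a positive ITD means the left ear leads; neither is stated on the slides.

```python
def itd_to_samples(itd_us, fs_hz):
    """Nearest whole-sample delay for an ITD given in microseconds."""
    return round(abs(itd_us) * 1e-6 * fs_hz)

def apply_itd(mono, itd_us, fs_hz):
    """Return (left, right) sample lists with one channel delayed.

    Assumed convention: positive ITD delays the right ear (left ear leads).
    """
    n = itd_to_samples(itd_us, fs_hz)
    pad = [0.0] * n
    leading = list(mono) + pad   # zero-padded at the end to keep lengths equal
    lagging = pad + list(mono)   # delayed by n samples
    if itd_us >= 0:
        return leading, lagging
    return lagging, leading

left, right = apply_itd([1.0, 0.0, 0.0], 571.0, 44100.0)
print(itd_to_samples(571.0, 44100.0), len(left), len(right))
```

Opposite signs of ITD for the low-pass and high-pass bands would then place the two bands on opposite sides of the head.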
ITD Results (female voice) ±570 µs ITD
ITD Results (male voice) ±570 µs ITD
Summary (dichotic). Fusion at same Fo? Low-frequency overlap needed. Fusion at different Fo (±4%)? No. But what about Fo’s ability to separate different voices? (original B & L question)
Three types of experiment. Difference in Fo leads to: 1. binaural separation of sound sources; 2. increase in intelligibility; 3. ability to track a sound source over time.
A difference in Fo improves identification of double vowels and of sentences. Double vowels: improvement is over by 1 semitone. Sentences: improve for longer.
Mechanisms of Fo improvement. A. Global: across-formant grouping by Fo (as originally conceived by B & L). B. Local: better definition of individual formants, especially F1, where harmonics are resolved. At small ∆Fos, B is more important than A for double vowels (Culling & Darwin, JASA 1993). Also true for sentences?
∆Fo between two sentences (Bird & Darwin 1998; after Brokx & Nooteboom, 1982). Target sentence Fo = 140 Hz; masking sentence Fo = 140 Hz ± 0, 1, 2, 5, 10 semitones. Two sentences (same talker), only voiced consonants (with very few stops). Task: write down target sentence. Replicates & extends Brokx & Nooteboom.
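For reference, the masker Fo values implied by these semitone offsets can be worked out with the usual equal-tempered conversion (one semitone is a frequency ratio of 2^(1/12)). A minimal sketch; the names are illustrative.

```python
def shift_semitones(f0_hz, semitones):
    """Frequency reached by moving f0_hz up (or down) by a number of semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)

TARGET_F0_HZ = 140.0  # target sentence Fo from the slide

for st in (0, 1, 2, 5, 10):
    below = shift_semitones(TARGET_F0_HZ, -st)
    above = shift_semitones(TARGET_F0_HZ, st)
    print(f"±{st:2d} semitones: masker Fo = {below:6.1f} Hz or {above:6.1f} Hz")
```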
Chimeric sentences (Bird & Darwin, Grantham Meeting 1998): one Fo below 800 Hz, another Fo above 800 Hz.
Paired sentences' Fos in the low-pass and high-pass regions. Conditions: Normal; Same Fo in High; Same Fo in Low; Swapped (gives wrong grouping). 112100
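The idea of a chimeric stimulus can be illustrated with harmonic frequencies alone: components below the 800 Hz split follow one Fo, and components above it follow another. This is only a sketch of the frequency layout, not the speech resynthesis used in the experiments; the function names, the 4 kHz upper limit, the 5-semitone masker and the particular "swapped" assignment are assumptions.

```python
def harmonics(f0_hz, lo_hz, hi_hz):
    """Harmonics of f0_hz that fall in the band lo_hz <= f < hi_hz."""
    out = []
    n = 1
    while n * f0_hz < hi_hz:
        if n * f0_hz >= lo_hz:
            out.append(n * f0_hz)
        n += 1
    return out

def chimeric_harmonics(f0_low, f0_high, split_hz=800.0, max_hz=4000.0):
    """Component frequencies: f0_low harmonics below the split, f0_high above it."""
    return harmonics(f0_low, 0.0, split_hz) + harmonics(f0_high, split_hz, max_hz)

target_f0 = 140.0
masker_f0 = 140.0 * 2.0 ** (5.0 / 12.0)  # hypothetical masker, 5 semitones higher

# "Normal": each sentence keeps its own Fo in both regions.
normal_target = chimeric_harmonics(target_f0, target_f0)
# "Swapped" (assumed reading): each sentence takes the other's Fo above 800 Hz.
swapped_target = chimeric_harmonics(target_f0, masker_f0)
swapped_masker = chimeric_harmonics(masker_f0, target_f0)
print(len(normal_target), len(swapped_target), len(swapped_masker))
```

In the swapped case, grouping by common Fo would pair the low region of one sentence with the high region of the other, which is why it "gives wrong grouping".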
Segregating sentence pairs by Fo: all the action is in the low-frequency region (<800 Hz); no strong evidence of across-formant grouping.
Adding the Fo-swapped condition: inappropriate pairing of Fo is only detrimental above 4 semitones.
Summary of Fo differences: across-formant grouping is only significant for large Fo differences (> ~4 semitones). Most of the improvement with small Fo differences happens in the F1 frequency region.
Another caveat for autocorrelation: improvement in identification of double vowels for small ∆Fos is about as good when each vowel is made up of alternating harmonics of the two Fos (Culling & Darwin). Autocorrelation would pull out completely wrong envelopes.
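The alternating-harmonic manipulation can be made concrete by listing which component frequencies each vowel would carry. In this sketch, odd-numbered harmonics of one vowel come from the first Fo and even-numbered ones from the second, with the other vowel getting the complement; which parity goes with which Fo is an assumption, as are the names and the example Fos.

```python
def interleaved_vowel_harmonics(f0_a, f0_b, n_harmonics):
    """Harmonic frequencies for two vowels built from alternating harmonics.

    Vowel 1: odd harmonics of f0_a, even harmonics of f0_b.
    Vowel 2: odd harmonics of f0_b, even harmonics of f0_a.
    """
    vowel_1, vowel_2 = [], []
    for n in range(1, n_harmonics + 1):
        if n % 2 == 1:
            vowel_1.append(n * f0_a)
            vowel_2.append(n * f0_b)
        else:
            vowel_1.append(n * f0_b)
            vowel_2.append(n * f0_a)
    return vowel_1, vowel_2

# Hypothetical Fos about one semitone apart.
v1, v2 = interleaved_vowel_harmonics(100.0, 106.0, 4)
print(v1)
print(v2)
```

A mechanism that groups components by a common periodicity would collect all harmonics of one Fo together, and so would mix the spectral envelopes of the two vowels.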
No simultaneous effect of FM (different frequency modulations of Fo). Although separation by Fo shows strong effects, there is no detectable effect of simultaneous separation by different frequency modulations of Fo. Listeners are unable to discriminate correlated from uncorrelated FM in simultaneous inharmonic sine waves (Carlyon).
Summary of Fo effects in separating competing voices: intelligibility is increased by a small ∆Fo only in the F1 region (and harmonic alternation is tolerated), but not by a ∆Fo in only the higher-frequency region. Across-formant consistency of Fo is only important at larger ∆Fo. FM produces no additional separation.
Three types of experiment. Difference in Fo leads to: 1. binaural separation of sound sources; 2. increase in intelligibility; 3. ability to track a sound source over time.
Tracking by Fo: we can also use continuity of an Fo contour to track a particular sound source over time.
CRM task (tracking a sound source) (Bolia et al., 2000). 2 simultaneous sentences, each of the form: Ready (Call Sign) go to (Color) (Number) now. Same talker (TT); same sex (TS); different sex (TD). Target denoted by call sign "Baron". 8 talkers in corpus, 2048 tokens.
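The token count follows from the sentence frame. The sketch below enumerates every combination; the word lists (8 call signs, 4 colors, digits 1 to 8) are taken from the published description of the CRM corpus rather than from the slide, and the variable names are illustrative.

```python
from itertools import product

# Word lists as described for the CRM corpus (not listed on the slide itself).
CALL_SIGNS = ["Arrow", "Baron", "Charlie", "Eagle", "Hopper", "Laker", "Ringo", "Tiger"]
COLORS = ["blue", "green", "red", "white"]
NUMBERS = list(range(1, 9))
N_TALKERS = 8  # from the slide

def crm_sentence(call_sign, color, number):
    """Fill the CRM sentence frame."""
    return f"Ready {call_sign} go to {color} {number} now"

tokens = [
    (talker, crm_sentence(call_sign, color, number))
    for talker, call_sign, color, number
    in product(range(N_TALKERS), CALL_SIGNS, COLORS, NUMBERS)
]
print(len(tokens))
```

On each trial the target is the sentence containing "Baron", and the listener reports its color and number.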
CRM task (Bolia et al., 2000): listeners responded by selecting the appropriate colored digit with the computer mouse.
CRM task results (Brungart et al)
Effect of change in Fo
Fo contours for 2 individuals. Individuals with the most constant Fo contours show the most improvement with ∆Fo.
Effect of change of VT
Effect of joint change of Fo and VT Original: male
Effect of joint change of Fo and VT Original: female
Superadditivity of ∆Fo and ∆VT. [Figure: predicted d' vs. actual d', male and female originals.] ∆Fo & ∆VT are superadditive … and still less than real different-sex talkers.
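The slide does not say how the predicted d' was derived. As an illustration only, two common baselines for combining cues are a simple sum of the individual d' values and the independent-cue rule, where the combined d' is the square root of the sum of squares. The sketch below uses hypothetical numbers and names; superadditivity then means the measured joint d' exceeds the chosen prediction.

```python
import math

def predicted_dprime_additive(d_f0, d_vt):
    """Baseline 1 (assumed): the two d' values simply add."""
    return d_f0 + d_vt

def predicted_dprime_independent(d_f0, d_vt):
    """Baseline 2 (assumed): independent cues combine as root-sum-of-squares."""
    return math.hypot(d_f0, d_vt)

def is_superadditive(actual_dprime, predicted_dprime):
    """True when the measured joint d' exceeds the prediction."""
    return actual_dprime > predicted_dprime

# Hypothetical values, for illustration only (not data from the talk).
d_f0, d_vt, d_joint = 0.6, 0.5, 1.4
print(is_superadditive(d_joint, predicted_dprime_additive(d_f0, d_vt)))
print(is_superadditive(d_joint, predicted_dprime_independent(d_f0, d_vt)))
```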
Conclusions. Same Fo is not a sufficient condition for dichotic fusion of complementarily filtered speech. Intelligibility increase for small ∆Fo is confined to the F1 region; across-formant grouping only for larger ∆Fo. Fo & VT size are useful for tracking sources across time, and are superadditive.