1
Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information
Steven Greenberg, International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704, USA, http://www.icsi.berkeley.edu/~steveng, steveng@icsi.berkeley.edu
Ken W. Grant, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, D.C. 20307, USA, http://www.wramc.amedd.army.mil/departments/aasc/avlab, grant@tidalwave.net
2
Acknowledgements and Thanks
Technical Assistance: Takayuki Arai, Rosaria Silipo
Research Funding: U.S. National Science Foundation
3
BACKGROUND
4
What's the Big Deal with Speech Reading?
Superior recognition and intelligibility under many conditions
Provides phonetic-segment information that is potentially redundant with acoustic information (vowels)
Provides segmental information that complements acoustic information (consonants)
Directs auditory analyses to the target signal: who, where, when, what (spectral)
5
Audio-Visual vs. Audio-Only Recognition (NH = normal hearing, HI = hearing impaired)
The visual modality provides a significant gain in speech processing, particularly under low signal-to-noise-ratio conditions and for hearing-impaired listeners
Figure courtesy of Ken Grant
6
Articulatory Information via Visual Cues
Figure: percent information transmitted relative to total information received, by articulatory feature (voicing, manner, place, other). Place of articulation accounts for 93% of the transmitted information; the remaining features account for only 0-4% each.
Place of Articulation Most Important
Figure courtesy of Ken Grant
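For context (an explanatory note, not part of the original slide): "percent information transmitted" for an articulatory feature is conventionally estimated from a stimulus-response confusion matrix via the Miller-Nicely transmitted-information measure, and the normalization "relative to total information received" is assumed here to express each feature's transmitted information as a share of the summed transmitted information across features:

T_f(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}, \qquad
\text{percent for feature } f = 100 \cdot \frac{T_f(X;Y)}{\sum_{f'} T_{f'}(X;Y)}

where p(x,y) is the joint probability of stimulus category x and response category y for feature f, estimated from the confusion matrix.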
7
Are Auditory & Visual Processing Independent?
Key issues pertaining to early versus late integration models of bi-modal information:
Most contemporary models favor late integration of information
However ... preliminary evidence (Sams et al., 1991) that silent speechreading can activate auditory cortex in humans (but Bernstein et al., 2002 say "nay")
The superior colliculus (an upper brainstem nucleus) may also serve as a site of bimodal integration (or at least interaction; Stein and colleagues)
8
Time Constraints Underlying A/V Integration
What are the temporal factors underlying integration of audio-visual information for speech processing?
Two sets of data are examined:
Spectro-temporal integration – audio-only signals
Audio-visual integration using sparse spectral cues and speechreading
In each experiment the cues (acoustic and/or visual) are desynchronized and the impact on word intelligibility is measured (for English sentences)
9
EXPERIMENT OVERVIEW
10
Spectro-temporal Integration
Time course of integration:
Within (the acoustic) modality – four narrow spectral slits; the central slits are desynchronized relative to the lateral slits
Across modalities – two acoustic slits (the lateral channels) plus speechreading video information; the video and audio streams are desynchronized relative to each other
11
Auditory-Visual Asynchrony - Paradigm
Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material
The audio and video streams are desynchronized (video leads or audio leads) by 40–400 ms; the baseline condition is SYNCHRONOUS A/V
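As an illustration of the desynchronization manipulation (a minimal sketch, not the authors' stimulus-preparation code; the function name, sample rate, and sign convention are assumptions):

```python
import numpy as np

def shift_audio(audio, offset_ms, fs=16000):
    """Shift the audio track relative to the video by offset_ms.
    Positive offset: video leads (audio is delayed); negative: audio leads.
    Illustrative sketch only."""
    n = int(round(offset_ms * fs / 1000.0))
    if n > 0:                                   # video leads: prepend silence
        return np.concatenate([np.zeros(n), audio])[: len(audio)]
    if n < 0:                                   # audio leads: drop the initial samples
        return np.concatenate([audio[-n:], np.zeros(-n)])
    return audio

# e.g. shift_audio(audio, +120) delays the audio 120 ms behind the video,
# shift_audio(audio, -120) advances it 120 ms ahead, covering the 40-400 ms range.
```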
12
Auditory-Visual Integration - Preview (9 subjects)
When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms
Why? Why? Why?
We'll return to these data shortly. But first, let's take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case. The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences.
13
AUDIO-ALONE EXPERIMENTS
14
Audio (Alone) Spectral Slit Paradigm
Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? – YES (cf. next slide)
What is the intelligibility of each slit alone and in combination with others?
The edge of each slit was separated from its nearest neighbor by an octave
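A minimal sketch of how four 1/3-octave slits like these could be carved out of a wideband signal with standard band-pass filters (an illustration of the paradigm, not the filtering actually used in the study; filter order and method are assumptions):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Slit center frequencies (Hz), taken from the figure on the next slide
CENTER_FREQS = [334, 841, 2120, 5340]

def third_octave_slit(signal, cf, fs):
    """Band-pass a single 1/3-octave-wide 'slit' centered at cf (illustrative)."""
    lo, hi = cf * 2 ** (-1 / 6), cf * 2 ** (1 / 6)   # +/- 1/6 octave around cf
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def slit_stimulus(signal, fs, which=(0, 1, 2, 3)):
    """Sum of the selected slits, e.g. which=(0, 3) for the two lateral slits."""
    return sum(third_octave_slit(signal, CENTER_FREQS[i], fs) for i in which)
```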
15
Word Intelligibility - Single and Multiple Slits
Figure: word intelligibility for the four slits (slit numbers 1-4; center frequencies 334, 841, 2120, and 5340 Hz) presented singly and in combination. Values shown: 2%, 4%, 9%, 13%, 60%, and 89%; single slits yield only 2-9% (cf. following slides), with much higher intelligibility for slit combinations.
16
Word Intelligibility - Single Slits
The intelligibility associated with any single slit is only 2 to 9%
The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits
17
Word Intelligibility - 4 Slits
18
Word Intelligibility - 2 Slits
20
Slit Asynchrony Affects Intelligibility
Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
The effect of asynchrony on intelligibility is relatively symmetrical
These data are from a different set of subjects than those participating in the study described earlier - hence the slightly different numbers for the baseline conditions
21
Intelligibility and Slit Asynchrony
Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility
Asynchrony greater than 50 ms results in intelligibility lower than baseline
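Continuing the sketch above, the within-modality manipulation amounts to time-shifting the summed central slits before recombining them with the lateral slits (again a hypothetical illustration, not the study's stimulus-generation code):

```python
import numpy as np

def desynchronize_central_slits(lateral, central, asynchrony_ms, fs):
    """Shift the summed central slits (2+3) relative to the lateral slits (1+4).
    Positive asynchrony_ms delays the central slits; negative advances them.
    Illustrative sketch only."""
    n = int(round(asynchrony_ms * fs / 1000.0))
    if n > 0:
        shifted = np.concatenate([np.zeros(n), central])[: len(central)]
    elif n < 0:
        shifted = np.concatenate([central[-n:], np.zeros(-n)])
    else:
        shifted = central
    return lateral + shifted
```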
22
AUDIO-VISUAL EXPERIMENTS
23
Focus on Audio-Leading-Video Conditions
When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
These data are next compared with data from the previous slide to illustrate the similarity in the slope of the function
24
Comparison of A/V and Audio-Alone Data
The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition
The similarity in the slopes of the intelligibility functions for the two experiments suggests that the underlying mechanisms may be similar
The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves
25
Focus on Video-Leading-Audio Conditions
When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms
These data are rather strange, implying some form of "immunity" against intelligibility degradation when the video channel leads the audio
We'll consider a variety of interpretations in a few minutes
26
Auditory-Visual Integration - the Full Monty
The slope of intelligibility decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions
WHY? WHY? WHY?
There are several interpretations of these data – we'll consider them on the following slides
27
INTERPRETATION OF THE DATA
35
Possible Interpretations of the Data – 1
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
Light travels faster than sound – therefore the video signal arrives in advance of the audio signal, and consequently the brain is adapted to dealing with video-leading-audio situations much more than vice versa
Some problems with this interpretation (at least by itself) ...
The speed of light is ca. 186,300 miles per second (effectively instantaneous)
The speed of sound is ca. 1129 feet per second (at sea level, 70° F, etc.)
Subjects in this study were wearing headphones
Therefore the time disparity between audio and visual signals was short (perhaps a few milliseconds)
(Let's put this potential interpretation aside for a few moments)
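To put the speed-of-sound argument in perspective, here is the back-of-the-envelope arithmetic (an added illustration, not part of the original slides): at roughly 344 m/s (about 1129 ft/s, as cited above), sound trails the effectively instantaneous visual signal by only about 3 ms per metre of talker-listener distance.

```python
SPEED_OF_SOUND_M_PER_S = 344.0   # ~1129 ft/s, as cited on the slide

def audio_lag_ms(distance_m):
    """Free-field delay of the acoustic signal relative to the (effectively
    instantaneous) visual signal, for a talker at distance_m metres."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_PER_S

for d in (1, 3, 10, 30):
    print(f"{d:2d} m -> audio lags by {audio_lag_ms(d):5.1f} ms")
# ~2.9 ms at 1 m, ~8.7 ms at 3 m, ~29 ms at 10 m, ~87 ms at 30 m --
# all well below the ~200 ms asynchrony tolerated when the video leads.
```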
40
Possible Interpretations of the Data – 2
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
Visual information is processed in the brain much more slowly than auditory information; therefore the video data actually arrive after the audio data in the current experimental situation. Thus, when the video channel leads the audio channel, the asynchrony compensates for an internal (neural) asynchrony and the auditory and visual information arrive relatively in synch with each other
Some problems with this interpretation ...
Even if we assume the validity of this assumption (visual processing lagging auditory processing), this interpretation would merely imply that the intelligibility-degradation functions associated with the audio-leading and video-leading conditions should be parallel (but offset from each other)
However, the data do not correspond to this pattern
41
Auditory-Visual Integration
The slope of intelligibility decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions
WHY? WHY? WHY?
46
Possible Interpretations of the Data – 3
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation. Under such conditions the brain must be tolerant of audio-visual asynchrony, since it is so common and ubiquitous
Some problems with this interpretation ...
If the brain were merely tolerant of audio-visual asynchrony, then why would the audio-leading-the-video condition be so much more vulnerable to asynchronies less than 200 ms?
There must be some other factor (or set of factors) associated with this perceptual integration asymmetry. What would it (they) be?
52
Possible Interpretations of the Data – 4
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
There is certain information in the video component of the speech signal that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?
The visual component of the speech signal is most closely associated with place-of-articulation information (Grant and Walden, 1996)
In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)
This syllable-length interval pertaining to place-of-articulation cues would be appropriate for information that is encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality
BUT ... the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels
53
VARIABILITY AMONG SUBJECTS
56
One Further Wrinkle to the Story ...
Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects
For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio
The optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms
57
Auditory-Visual Integration - by Individual Ss
Variation across subjects: video-leading presentation is better than synchronous presentation for 8 of 9 subjects
These data are complex, but the implications are clear: audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility
58
SUMMARY
66
Audio-Video Integration – Summary
Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone, in the absence of the other modality
This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)
When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony
When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms
For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80-120 ms)
There are many potential interpretations of the data
The interpretation currently favored by the presenter posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information (as occurs when the video signal leads the audio)
The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms in length and could therefore potentially apply to models of speech processing in general
67
That’s All Many Thanks for Your Time and Attention