Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information
Steven Greenberg, International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704, USA
Ken W. Grant, Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, D.C., USA

Acknowledgements and Thanks
Technical Assistance: Takayuki Arai, Rosaria Silipo
Research Funding: U.S. National Science Foundation

BACKGROUND

What's the Big Deal with Speech Reading?
Superior recognition and intelligibility under many conditions
Provides phonetic-segment information that is potentially redundant with the acoustic information (vowels)
Provides segmental information that complements the acoustic information (consonants)
Directs auditory analyses to the target signal – who, where, when, what (spectral)

Audio-Visual vs. Audio-Only Recognition
The visual modality provides a significant gain in speech processing
Particularly under low signal-to-noise-ratio conditions
And for hearing-impaired listeners
[Figure courtesy of Ken Grant; NH = normal hearing, HI = hearing impaired]

Articulatory Information Conveyed via Visual Cues
[Figure courtesy of Ken Grant: percent information transmitted relative to total information received for voicing, manner, place of articulation, and other features. Place of articulation dominates at 93%, while the remaining features fall in the 0–4% range.]
Place of articulation is by far the most important feature conveyed visually
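
The "percent information transmitted" measure in the figure is conventionally computed from stimulus-response confusion matrices in the Miller-and-Nicely style: the mutual information between stimulus and response categories for a feature, divided by the stimulus entropy. A minimal sketch follows; the confusion counts are hypothetical illustration values, not Grant's data.

```python
import numpy as np

def relative_info_transmitted(confusions):
    """Relative transmitted information T(x;y) / H(x) for one articulatory
    feature, computed from a stimulus-by-response confusion-count matrix."""
    p = confusions / confusions.sum()        # joint probabilities p(x, y)
    px, py = p.sum(axis=1), p.sum(axis=0)    # stimulus and response marginals

    def entropy(q):
        q = q[q > 0]
        return -(q * np.log2(q)).sum()

    mutual_info = entropy(px) + entropy(py) - entropy(p.ravel())
    return mutual_info / entropy(px)

# Hypothetical visual-only confusion counts for the binary voicing feature
# (rows = stimulus category, columns = response category) -- near chance,
# so almost no voicing information is transmitted by the visual channel.
voicing = np.array([[52.0, 48.0],
                    [47.0, 53.0]])
print(f"voicing: {100 * relative_info_transmitted(voicing):.0f}% transmitted")
```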

Are Auditory & Visual Processing Independent?
Key issues pertain to early versus late integration models of bi-modal information
Most contemporary models favor late integration of information
However ... there is preliminary evidence (Sams et al., 1991) that silent speechreading can activate auditory cortex in humans (but Bernstein et al. say "nay")
The superior colliculus (an upper brainstem nucleus) may also serve as a site of bimodal integration (or at least interaction; Stein and colleagues)

Time Constraints Underlying A/V Integration
What are the temporal factors underlying integration of audio-visual information for speech processing?
Two sets of data are examined:
Spectro-temporal integration – audio-only signals
Audio-visual integration using sparse spectral cues and speechreading
In each experiment the cues (acoustic and/or visual) are desynchronized and the impact on word intelligibility is measured (for English sentences)

EXPERIMENT OVERVIEW

Spectro-temporal Integration
Time course of integration
Within (the acoustic) modality – four narrow spectral slits, with the central slits desynchronized relative to the lateral slits
Across modalities – two acoustic slits (the lateral channels) plus speechreading (video) information, with the video and audio streams desynchronized relative to each other

Auditory-Visual Asynchrony – Paradigm
Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material
[Figure: either the video leads or the audio leads, by 40–400 ms; the baseline condition is synchronous A/V]
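
A minimal sketch of how such an asynchrony can be imposed in software, assuming the video frames are left untouched and only the audio track is shifted; the function and variable names are illustrative, not the stimulus-generation code actually used in the study.

```python
import numpy as np

def desynchronize_audio(audio, sample_rate, offset_ms):
    """Shift the audio track relative to a fixed video track.
    offset_ms > 0: the video leads (audio is delayed by prepending silence);
    offset_ms < 0: the audio leads (leading samples are removed)."""
    shift = int(round(abs(offset_ms) * sample_rate / 1000.0))
    if offset_ms > 0:
        return np.concatenate([np.zeros(shift), audio])[: len(audio)]
    if offset_ms < 0:
        return np.concatenate([audio[shift:], np.zeros(shift)])
    return audio  # baseline: synchronous A/V

# e.g., the video-leads-by-120-ms condition for a 16-kHz sparse-slit audio track
# shifted = desynchronize_audio(slit_audio, sample_rate=16_000, offset_ms=120)
```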

Auditory-Visual Integration – Preview
When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms. Why?
We'll return to these data shortly. But first, let's take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case
The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences
[Nine subjects participated in the A/V study]

AUDIO-ALONE EXPERIMENTS

Audio (Alone) Spectral Slit Paradigm
Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? – YES (cf. next slide)
What is the intelligibility of each slit alone and in combination with others?
The edge of each slit was separated from its nearest neighbor by an octave
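
A sketch of how such slits can be generated with standard band-pass filtering. The center frequencies below are illustrative placeholders chosen so that adjacent slit edges sit roughly an octave apart; the actual CF values appeared only in the figure and are not reproduced here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def make_slit(signal, sample_rate, center_hz, width_octaves=1.0 / 3.0, order=6):
    """Band-pass one narrow spectral 'slit' (1/3 octave wide) around center_hz."""
    lo = center_hz * 2.0 ** (-width_octaves / 2.0)
    hi = center_hz * 2.0 ** (+width_octaves / 2.0)
    sos = butter(order, [lo, hi], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, signal)

# Illustrative center frequencies (Hz), spaced so that neighboring slit edges
# are separated by about an octave -- not the values used in the original study.
center_freqs = [330, 850, 2135, 5400]
# slits = [make_slit(speech, 16_000, cf) for cf in center_freqs]
# four_slit_stimulus = np.sum(slits, axis=0)   # all four slits, synchronous
```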

Word Intelligibility – Single and Multiple Slits
[Figure: word intelligibility as a function of slit number and center frequency (CF, Hz). Single slits yield on the order of 2–9% of words correct; slit combinations yield roughly 13%, 60%, and 89%.]

Word Intelligibility – Single Slits
The intelligibility associated with any single slit is only 2 to 9%
The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

Word Intelligibility - 4 Slits

Word Intelligibility - 2 Slits

Slit Asynchrony Affects Intelligibility
Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
The effect of asynchrony on intelligibility is relatively symmetrical
These data are from a different set of subjects than those participating in the study described earlier – hence the slightly different numbers for the baseline conditions

Intelligibility and Slit Asynchrony
Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility
Asynchrony greater than 50 ms results in intelligibility lower than baseline
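
The within-modality manipulation can be sketched the same way: the two central slits are shifted as a pair while the lateral slits stay in place. This assumes a list of equal-length band signals such as the one produced by the earlier slit sketch; the helper names are illustrative.

```python
import numpy as np

def shift_signal(x, sample_rate, offset_ms):
    """Delay (offset_ms > 0) or advance (offset_ms < 0) a signal by padding
    with or removing leading samples; the length is preserved."""
    n = int(round(abs(offset_ms) * sample_rate / 1000.0))
    if offset_ms > 0:
        return np.concatenate([np.zeros(n), x])[: len(x)]
    if offset_ms < 0:
        return np.concatenate([x[n:], np.zeros(n)])
    return x

def desynchronize_central_slits(slits, sample_rate, offset_ms):
    """Shift slits 2 and 3 (the central channels) as a pair relative to
    slits 1 and 4 (the lateral channels), then recombine."""
    lateral = slits[0] + slits[3]
    central = slits[1] + slits[2]
    return lateral + shift_signal(central, sample_rate, offset_ms)

# e.g., stimulus = desynchronize_central_slits(slits, 16_000, offset_ms=50)
```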

AUDIO-VISUAL EXPERIMENTS

Focus on Audio-Leading-Video Conditions
When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
These data are next compared with the data from the previous slide to illustrate the similarity in the slope of the function

Comparison of A/V and Audio-Alone Data
The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition
Such similarity in the slopes of the intelligibility functions for the two experiments suggests that the underlying mechanisms may be similar
The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves

Focus on Video-Leading-Audio Conditions
When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms
These data are rather strange, implying some form of "immunity" against intelligibility degradation when the video channel leads the audio
We'll consider a variety of interpretations in a few minutes

Auditory-Visual Integration – the Full Monty
The slope of the intelligibility decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions. Why?
There are several interpretations of these data – we'll consider them on the following slides

INTERPRETATION OF THE DATA

Possible Interpretations of the Data – 1
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
Light travels faster than sound – therefore the video signal arrives in advance of the audio signal, and consequently the brain is adapted to dealing with video-leading-audio situations much more than vice versa
Some problems with this interpretation (at least by itself):
The speed of light is ca. 186,300 miles per second (effectively instantaneous)
The speed of sound is ca. 1,130 feet per second (at sea level, 70° F, etc.)
Subjects in this study were wearing headphones
Therefore the time disparity between the audio and visual signals was short (perhaps a few milliseconds)
(Let's put this potential interpretation aside for a few moments)
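
To put the travel-time argument in concrete terms, here is a back-of-the-envelope calculation using the standard value of roughly 343 m/s for the speed of sound; the distances are illustrative conversational ranges, not conditions from the experiment.

```python
SPEED_OF_SOUND_M_PER_S = 343.0   # ~1,130 ft/s at sea level, ~70 degrees F

for distance_m in (1.0, 3.0, 10.0):
    delay_ms = distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0
    print(f"talker at {distance_m:>4.1f} m -> audio lags video by ~{delay_ms:.0f} ms")

# ~3 ms at 1 m, ~9 ms at 3 m, ~29 ms at 10 m -- all far smaller than the
# 200-ms tolerance observed, and effectively zero over headphones.
```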

Possible Interpretations of the Data – 2
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
Visual information is processed in the brain much more slowly than auditory information; therefore the video data actually arrive after the audio data in the current experimental situation. Thus, when the video channel leads the audio channel, the external asynchrony compensates for an internal (neural) asynchrony, and the auditory and visual information arrive relatively in synch with each other
Some problems with this interpretation:
Even granting the assumption that visual processing lags auditory processing, this interpretation would merely imply that the intelligibility-degradation functions associated with the audio-leading and video-leading conditions should be parallel (but offset from each other)
However, the data do not correspond to this pattern

Auditory-Visual Integration
The slope of the intelligibility decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions. Why?

Possible Interpretations of the Data – 3
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation. Under such conditions the brain must be tolerant of audio-visual asynchrony, since it is so common and ubiquitous
Some problems with this interpretation:
If the brain were merely tolerant of audio-visual asynchrony, why would the audio-leading-the-video condition be so much more vulnerable to asynchronies of less than 200 ms?
There must be some other factor (or set of factors) associated with this perceptual integration asymmetry. What would it (they) be?

Possible Interpretations of the Data – 4
The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ...
There is certain information in the video component of the speech signal that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?
The visual component of the speech signal is most closely associated with place-of-articulation information (cf. Grant and Walden, 1996)
In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., roughly a syllable in length)
This syllable-length interval for place-of-articulation cues would be appropriate for information encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality
BUT ... the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels
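
One way to make the proposed asymmetry concrete is a toy rule of thumb (an illustration of the idea sketched above, not a model proposed by the authors): the modality that arrives first sets the integration window, roughly 200 ms when the visually carried place cues lead and only a few tens of milliseconds (comparable to the slit-asynchrony data) when the audio leads. The window values are eyeballed from the figures, not fitted.

```python
def integration_window_ms(leading_modality):
    """Toy rule: the stream that arrives first sets the time constant over
    which the two streams can still be combined (illustrative values)."""
    return 200.0 if leading_modality == "video" else 40.0

def integration_predicted(audio_onset_ms, video_onset_ms):
    """Predict whether intelligibility should remain near baseline for a
    given pair of stream onsets under the toy rule."""
    leading = "video" if video_onset_ms <= audio_onset_ms else "audio"
    return abs(audio_onset_ms - video_onset_ms) <= integration_window_ms(leading)

print(integration_predicted(audio_onset_ms=160, video_onset_ms=0))  # True: video leads by 160 ms
print(integration_predicted(audio_onset_ms=0, video_onset_ms=160))  # False: audio leads by 160 ms
```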

VARIABILITY AMONG SUBJECTS

One Further Wrinkle to the Story ...
Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects
For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio
The length of the optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms

Auditory-Visual Integration – by Individual Subjects
Variation across subjects: a video signal leading the audio is better than synchronous presentation for 8 of 9 subjects
These data are complex, but the implications are clear: audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility

SUMMARY

Audio-Video Integration – Summary
Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone, in the absence of the other modality
This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)
When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony
When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms
For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80–120 ms)
There are many potential interpretations of the data
The interpretation currently favored by the presenter posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information (as occurs when the video signal leads the audio)
The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms in length and could therefore potentially apply to models of speech processing in general

That’s All Many Thanks for Your Time and Attention