
1 What are the Essential Cues for Understanding Spoken Language? Steven Greenberg, Centre for Applied Hearing Research, Technical University of Denmark; Silicon Speech, Santa Venetia, CA, USA http://www.icsi.berkeley.edu/~steveng steveng@savant-garde.net

2 Acknowledgements and Thanks Research Funding: U.S. Department of Defense (before Iraq), U.S. National Science Foundation. Research Collaborators: Takayuki Arai, Ken Grant, Rosaria Silipo

3 For Further Information Consult the web site: www.icsi.berkeley.edu/~steveng

4 Summary of the Presentation This presentation examines the origins of word intelligibility in the acoustic and visual properties of the speech signal

5 Summary of the Presentation Low-frequency modulation of acoustic energy, produced by the articulators during speech, is crucial for understanding spoken language, NOT the fine spectral details of the speech signal

6 Summary of the Presentation These modulation patterns are differentially distributed across the acoustic frequency spectrum

7 Summary of the Presentation The specific configuration of the modulation patterns across the frequency spectrum provides the essential cues for understanding spoken language

8 Summary of the Presentation The acoustic frequency spectrum serves as a medium for distribution of modulation patterns – however, much of the spectrum is dispensable

9 Summary of the Presentation Modulation patterns reflect syllables and their specific contents/structure

10 Summary of the Presentation Disrupting the modulation information via desynchronization provides an estimate of the temporal window associated with phonetic integration

11 Summary of the Presentation Visual cues (speechreading) can combine with the acoustic speech signal, providing information analogous to the modulation patterns [Figure: audio-visual asynchrony paradigm, with video or audio leading by 40-400 ms vs. a synchronous A/V baseline]

12 Summary of the Presentation The temporal limits for combining visual and acoustic information are on the order of syllable length, particularly when the video precedes the audio signal

13 Summary of the Presentation Such temporal properties reflect a basic sensory-motor time constant of ca. 200 ms – the sampling rate of consciousness – and are consistent with a multi-tier framework for understanding spoken language

14 Effects of Reverberation on the Speech Signal Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions Yet, the intelligibility of speech is remarkably stable This implies that intelligibility is NOT based on the spectro-temporal details but rather on some more basic parameter(s)

15 Intelligibility Based on Slow Modulations 75% of the spectrum is discarded, leaving four 1/3-octave slits across the spectrum The edge of each slit is separated from its nearest neighbor by an octave The modulation pattern for each slit differs from that of the others The four-slit compound waveform looks very similar to the full-band signal And is HIGHLY INTELLIGIBLE

16 Quantifying Modulation Patterns in Speech The modulation spectrum provides a quantitative method for computing the amount of modulation in the speech signal The technique is illustrated for a paradigmatic, simple signal for clarity The computation is performed for each spectral channel separately
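Slide 16 describes the per-channel computation only in words; the minimal Python sketch below shows one common way to realize it. The waveform x, sample rate fs, the 875-1100 Hz channel edges and the 4th-order Butterworth filter are illustrative assumptions, not the author's actual implementation.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def modulation_spectrum(x, fs, band=(875, 1100), max_mod_hz=30):
        """Modulation spectrum of one spectral channel (assumed band edges in Hz)."""
        # 1. Isolate the spectral channel with a band-pass filter.
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        channel = sosfiltfilt(sos, x)
        # 2. Extract the slowly varying amplitude envelope (Hilbert magnitude).
        envelope = np.abs(hilbert(channel))
        # 3. Fourier-transform the (mean-removed) envelope; the low-frequency
        #    portion (< 30 Hz) is the modulation spectrum of this channel.
        spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
        mod_freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
        keep = mod_freqs <= max_mod_hz
        return mod_freqs[keep], spectrum[keep]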

17 An Invariant Property of the Speech Signal? Houtgast and Steeneken demonstrated that the modulation spectrum, a temporal property, is highly predictive of speech intelligibility This is significant, as it is difficult to degrade intelligibility through normal spectral distortion (many have tried, few have succeeded ….) In highly reverberant environments, the peak of the modulation spectrum is attenuated and shifts down to ca. 2 Hz, and the speech becomes increasingly unintelligible [Figure: modulation spectrum, based on an illustration by Hynek Hermansky]

18 The Modulation Spectrum Reflects Syllables Given the importance of the modulation spectrum for intelligibility, what does it reflect linguistically? The distribution of syllable duration matches the modulation spectrum, suggesting that the integrity of the syllable is essential for understanding speech [Figure: modulation spectrum of 15 minutes of spontaneous Japanese speech (OGI-TS corpus) compared with the syllable duration distribution for the same material (Arai and Greenberg, 1997)]

19 Intelligibility Derived from Modulation Patterns Many perceptual studies emphasize the importance of low-frequency modulation patterns for understanding spoken language Historically, this was first demonstrated by Homer Dudley in 1939 with what has become known as the VOCODER – modulations higher than 25 Hz can be filtered out without significant impact on intelligibility As mentioned earlier, Houtgast and Steeneken demonstrated that the low-frequency modulation spectrum is a good predictor of intelligibility in a wide range of acoustic listening environments (1970s and 1980s) In the mid-1990s, Rob Drullman demonstrated the impact of low-pass filtering the modulation spectrum on intelligibility and segment identification – modulations below 8 Hz appeared to be most important However, …. all of these studies were performed on broadband speech There was no attempt to examine the interaction between temporal and spectral factors for coding speech information (Other studies, such as those by Shannon and associates, have examined spectral-temporal interactions, but only in a crude way)

20 Intelligibility Studies Using Spectral Slits The interaction between spectral and temporal information for coding speech information can be examined with some degree of precision using spectral slits In what follows, the use of the term “spectral” refers to operations and processes in the acoustic frequency (i.e., tonotopic) domain The term “temporal” or “modulation spectrum” refers to operations and processes that specifically involve low-frequency (< 30 Hz) modulations First, we’ll examine the impact of extreme band-pass SPECTRAL filtering on intelligibility without consideration of the modulation spectrum

21 Intelligibility of Sparse Spectral Speech In Collaboration with Takayuki Arai and Rosaria Silipo

22 Intelligibility of Sparse Spectral Speech The spectrum of spoken sentences (TIMIT corpus) can be partitioned into narrow (1/3-octave) channels (“slits”) In the example below, there are four one-third-octave slits distributed across the frequency spectrum The edge of a slit is separated from its nearest neighbor by an octave No single slit, by itself, is particularly intelligible
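As a sketch of how such stimuli might be constructed (not the author's code), the band-pass filtering below uses the approximate slit centre frequencies shown later for the consonant-recognition slits (330, 875, 2100 and 5400 Hz) as assumed centres; a 1/3-octave slit spans a factor of 2^(1/3) between its lower and upper edges.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    SLIT_CENTERS_HZ = [330, 875, 2100, 5400]   # assumed centres (cf. slide 86)

    def third_octave_slits(x, fs, centers=SLIT_CENTERS_HZ):
        """Return the individual 1/3-octave slits and their sum (the compound signal)."""
        slits = []
        for fc in centers:
            lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)   # hi/lo = 2**(1/3)
            sos = butter(6, (lo, hi), btype="bandpass", fs=fs, output="sos")
            slits.append(sosfiltfilt(sos, x))
        compound = np.sum(slits, axis=0)   # four-slit stimulus; ~75% of the spectrum discarded
        return slits, compound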

23 Word Intelligibility - Single Slits The intelligibility associated with any single slit is only 2 to 9%

24 The intelligibility associated with any single slit is only 2 to 9% The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

25 Intelligibility of Sparse Spectral Speech Two slits, when combined, provide a higher degree of intelligibility, as shown on the following slides

26 Word Intelligibility - 2 Slits

27

28

29

30

31 Intelligibility of Sparse Spectral Speech Clearly, the degree of intelligibility depends on precisely where the slits are situated in the frequency spectrum, as well as their relationship to each other Spectrally contiguous slits may (or may not) be more intelligible than those far apart Slits in the mid-frequency region, corresponding to the signal’s second formant, are the most intelligible of any two-slit combination

32 Word Intelligibility - 2 Slits

33 Intelligibility of Sparse Spectral Speech There is a marked improvement in intelligibility when three slits are presented together, particularly when the slits are spectrally contiguous

34 Word Intelligibility - 3 Slits

35

36

37

38 Intelligibility of Sparse Spectral Speech Four slits combined yield nearly (but not quite) perfect intelligibility

39 Word Intelligibility - 4 Slits

40 Intelligibility of Sparse Spectral Speech The four-slit condition was intentionally designed to fall just short of perfect intelligibility, so that the contribution of each slit could be precisely delineated without having to worry about “ceiling” effects for highly intelligible conditions

41 Spectral Slits - Summary A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility

42 Modulation Spectrum Across Frequency The modulation spectrum varies in magnitude across frequency

43 Modulation Spectrum Across Frequency The shape of the modulation spectrum is similar for the three lowest slits….


46 Modulation Spectrum Across Frequency The modulation spectrum varies in magnitude across frequency The shape of the modulation spectrum is similar for the three lowest slits….

47 Modulation Spectrum Across Frequency The modulation spectrum varies in magnitude across frequency The shape of the modulation spectrum is similar for the three lowest slits…. But the highest frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies

48 Word Intelligibility - Single Slits The intelligibility associated with any single slit ranges between 2 and 9%

49 The intelligibility associated with any single slit ranges between 2 and 9%, suggesting that the shape and magnitude of the modulation spectrum, per se, is NOT the controlling variable for intelligibility

50 Desynchronizing Slits Affects Intelligibility Four channels, presented synchronously, yield ca. 90% intelligibility Intelligibility for two channels ranges between 10 and 60% When the center slits lead or lag the lateral slits by more than 25 ms, intelligibility suffers significantly
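A sketch of the desynchronization manipulation itself is shown below; it reuses the hypothetical third_octave_slits helper sketched earlier and simply delays (or advances) the two centre slits relative to the two lateral ones. The experimental details (ramping, level equalization, etc.) are omitted.

    import numpy as np

    def desynchronize(slits, fs, asynchrony_ms):
        """Shift slits 2 and 3 (centre) relative to slits 1 and 4 (lateral).

        Positive asynchrony_ms: centre slits lag; negative: centre slits lead.
        """
        shift = int(round(abs(asynchrony_ms) * fs / 1000))
        shifted = []
        for i, s in enumerate(slits):
            is_centre = i in (1, 2)
            lags = (asynchrony_ms > 0) == is_centre   # which slits get the onset delay
            pad = np.zeros(shift)
            shifted.append(np.concatenate([pad, s]) if lags else np.concatenate([s, pad]))
        return np.sum(shifted, axis=0)   # recombined, desynchronized stimulus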

51 Slit Asynchrony Affects Intelligibility Asynchrony greater than 50 ms results in intelligibility lower than baseline A trough in performance occurs at ca. 200-250 ms asynchrony A slight rebound in intelligibility occurs for longer asynchronies These data are from a different set of subjects than those participating in the study described earlier – hence slightly different numbers for the baseline conditions

52 What are the Essential Cues? - So Far
A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language
An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility
The magnitude of the modulation pattern does not appear to be the controlling variable for intelligibility
Even small amounts of asynchrony (>25 ms) imposed on spectral slits may result in significant degradation of intelligibility
Asynchrony greater than 50 ms has a profound impact on intelligibility
Intelligibility progressively declines with greater amounts of asynchrony up to an asymptote of ca. 250 ms
Beyond asynchronies of 250 ms, intelligibility IMPROVES, but the amount of improvement depends on individual factors
BOTH the amplitude and phase components of the modulation spectrum appear to be extremely important for speech intelligibility
The modulation phase is of particular importance for cross-spectral integration of phonetic information

53 Some Interesting Implications Cross-spectral (tonotopic) integration of modulation information is important for understanding spoken language The auditory system appears to be quite sensitive to phase variation across frequency channels Although reverberation typically imposes phase shifts of this sort, intelligibility is usually unaffected unless the amount of reverberation is extremely large (or the listener is hearing impaired) Are the imposed phase shifts small in magnitude? Or is there something else going on? If speechreading can be likened to a visible form of modulation spectrum (a big if, but let’s assume so for the current argument), then this would imply that the brain should be as sensitive to time shifts of the visual stream as it is to the acoustic signal Let’s find out …

54 Audio-Visual Integration of Speech In Collaboration with Ken Grant

55 Audio-Visual Integration of Speech In face-to-face interaction the visual component of the speech signal can be extremely important for understanding spoken language (particularly in noisy and/or reverberant conditions) It is therefore of interest to ascertain the brain’s tolerance for asynchrony between the audio and visual components of the speech signal This exercise can also provide potentially illuminating insights into the nature of the neural mechanisms underlying speech comprehension Specifically, the contribution of speechreading cues can provide clues about what REALLY is IMPORTANT in the speech signal for INTELLIGIBILITY that is independent of the sensory modality involved

56 Auditory-Visual Asynchrony - Paradigm Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material [Figure: A/V asynchrony paradigm, with video or audio leading by 40-400 ms vs. a synchronous A/V baseline]

57 Auditory-Visual Asynchrony - Paradigm The mid-frequency channels are the region from which place-of-articulation cues are derived, and these are the cues most closely associated with speechreading [Figure: A/V asynchrony paradigm, with video or audio leading by 40-400 ms vs. a synchronous A/V baseline; the mid-frequency (place-of-articulation) region is highlighted]

58 Focus on Audio-Leading-Video Conditions When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals These data are next compared with data from the audio-alone study to illustrate the similarity in the slope of the function

59 Comparison of A/V and Audio-Alone Data The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition Such similarity in the slopes associated with intelligibility for both experiments suggests that the underlying mechanisms may be similar The intelligibility of the audio-alone signals is higher than the A/V signals due to slits 2+3 being highly intelligible by themselves

60 Focus on Video-Leading-Audio Conditions When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio

61 Auditory-Visual Integration - the Full Monty The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions WHY? WHY? WHY?


70 Possible Interpretations of the Data
The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….
For the sake of time we consider only the barest skeleton of possibilities
We can rule out explanations based exclusively on transmission time differences across the modalities, either physically or neurologically
The performance functions would then be offset and parallel to each other, which they are NOT
The explanation I currently favor (though there are other possibilities) is:
In the (audio) speech signal, place-of-articulation information is frequency-specific and evolves over syllable-length intervals
This syllable interval pertaining to place-of-articulation cues would be appropriate for information encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality
BUT ….. the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels
This is the REAL mystery, but let’s go a little further

71 One Further Wrinkle to the Story …. Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the VIDEO signal LED the AUDIO The length of optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms

72 The Ability to Understand Speech Under Reverberant Conditions (Spectral Asynchrony) In Collaboration with Takayuki Arai

73 Spectro-temporal Jittering of Speech So far, we’ve examined the ability to understand spoken language using sparse spectral signals (with most of the spectrum thrown out) However, under most conditions, we encounter the entire, full-band spectrum What difference does this make? We can find out by time-shifting spectral channels relative to each other and measuring the relation between the amount of temporal jitter and intelligibility

74 Spectral Asynchrony - Paradigm The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each (4 or 7 channel sub-band) as a function of spectral asynchrony The modulation spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished for asynchronies of 160 ms or more Is intelligibility of read sentences (TIMIT) correlated with the reduction in the 3-6 Hz modulation spectrum?
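The 3-6 Hz metric described above can be approximated by integrating the per-sub-band modulation spectrum over that range. The sketch below reuses the hypothetical modulation_spectrum helper from earlier and is not the analysis actually used in the study; the 750-1500 Hz sub-band in the usage comment is only an example.

    import numpy as np

    def modulation_energy_3_6(x, fs, band):
        """Energy in the 3-6 Hz region of one sub-band's modulation spectrum."""
        freqs, spec = modulation_spectrum(x, fs, band=band, max_mod_hz=30)
        mask = (freqs >= 3.0) & (freqs <= 6.0)
        return np.sum(spec[mask] ** 2)

    # Illustrative usage: relative reduction for a desynchronized signal
    # ratio = modulation_energy_3_6(desynched, fs, (750, 1500)) / \
    #         modulation_energy_3_6(original, fs, (750, 1500))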

75 Sub-band Modulation Variation The magnitude of the modulation spectrum varies as a function of the frequency of the sub-band and spectral asynchrony The lowest sub-band exhibits the least amount of modulation It is necessary to normalize the modulation magnitude (relative to baseline)

76 Intelligibility and Spectral Asynchrony The sentences are highly intelligible for asynchronies as large as 140 ms Intelligibility is roughly correlated with the amount of energy in the modulation spectrum between 3 and 6 Hz However, the correlation varies, depending on the sub-band and the degree of spectral asynchrony

77 Frequency Dependence of Intelligibility From a piece-wise discriminant analysis (based on performance slopes) …. The LOWER frequency (<1.5 kHz) channels appear to be most important when the degree of asynchrony is LOW The HIGHER frequency (>1.5 kHz) channels are most important when the degree of asynchrony is HIGH This frequency-selective pattern provides important clues as to the frequency-selective intelligibility deficits associated with sensorineural hearing loss

78 Implications of Spectral Asynchrony The results imply that the brain is able to tolerate large amounts of temporal jitter without significantly compromising intelligibility – however … Because there are potentially hundreds (or thousands) of frequency channels in the auditory system, this result doesn’t really prove the point The “TRUE” amount of asynchrony (from the ear’s perspective) may have been overestimated [Figures: distribution of channel asynchrony; intelligibility of spectrally desynchronized speech]

79 In Conclusion ….

80 Grand Summary and Conclusions
The controlling parameters for understanding spoken language appear to be based on the low-frequency modulation patterns in the acoustic signal associated with the syllable
Encoding information in terms of low-frequency modulations provides a certain degree of robustness to the speech signal that enables it to be decoded under a wide range of acoustic and speaking conditions
The apparent tolerance of spectral asynchrony masks an exquisite auditory sensitivity to modulation asynchrony across frequency
Both the magnitude and phase of the modulation patterns are important
The importance of modulation phase is apparent in both spectrally sparse and full-spectrum conditions
The visual component of the speech signal (a.k.a. speechreading) can provide important information analogous to the mid-frequency channels
The preceding data are consistent with a multi-tier theoretical framework in which the brain integrates information from both the auditory and visual modalities across a broad range of time constants to derive a complex linguistic representation of the speech signal

81 Germane Publications
Greenberg, S. and Arai, T. (2004) What are the essential cues for understanding spoken language? IEICE Transactions on Information and Systems E87: 1059-1070.
Greenberg, S. (2005) A multi-tier theoretical framework for understanding spoken language. In Listening to Speech: An Auditory Perspective (S. Greenberg and W.A. Ainsworth, eds.). Mahwah, NJ: Lawrence Erlbaum Associates.
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Proceedings of the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001), pp. 132-137.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. 6th European Conference on Speech Communication and Technology (Eurospeech-99), pp. 2687-2690.

82 That’s All Many Thanks for Your Time and Attention

83 What is a Syllable? The SYLLABLE, rather than the PHONE, is the most basic organizational unit of spoken language – the patterns of pronunciation variation observed are incompatible with phonetic segment-based models It corresponds to a linguistic unit associated with articulatory gestures and slow oscillations of energy between 3 and 10 Hz (mostly) Most syllables have an onset constituent (usually a consonant), a nucleus (usually a vowel) and occasionally a coda (at the end, usually a consonant and often [t], [d] or [n] in English)

84 Language – A Syllable-Centric Perspective An empirically grounded perspective of spoken language focuses on the SYLLABLE and PROSODIC ACCENT as the interface between “sound” and “meaning” (or at least lexical form) [Figure: linguistic tiers for the word “seven”, showing prosodic accent, phonetic interpretation, manner segmentation (fricative, vocalic, nasal) and time-frequency energy]

85 Place of Articulation The reasons for this seeming vulnerability are controversial, but can be understood through analysis of data shown on the following slides In this experiment, nonsense VC and CV syllables were presented to listeners, who were asked to identify the consonant The syllables were spectrally filtered, so that most of the spectrum was discarded The proportion of consonants correctly recognized was scored as a function of the number of spectral slits presented and their frequency location, as shown on the next series of slides The really interesting analysis comes afterwards ….

86 Consonant Recognition - Single Slits [Figure: single 1/3-octave slits centered at approximately 330, 875, 2100 and 5400 Hz]

87 Consonant Recognition - 1 Slit

88 Consonant Recognition - 2 Slits

89

90 Consonant Recognition - 3 Slits

91

92 Consonant Recognition - 4 Slits

93 Consonant Recognition - 5 Slits

94 Articulatory-Feature Analysis The results, as scored in terms of raw consonant identification accuracy, are not particularly insightful (or interesting) in and of themselves They show that the broader the spectral bandwidth of the slits, the more accurate is consonant recognition Moreover, a more densely sampled spectrum results in higher recognition However, we can perform a more detailed analysis by examining the pattern of errors made by listeners From the confusion matrices we can ascertain precisely WHICH ARTICULATORY FEATURES are affected by the various manipulations imposed And from this error analysis we can make certain deductions about the distribution of phonetic information across the tonotopic frequency axis potentially relevant to understanding why speech is most effectively communicated via a broad spectral carrier

95 Correlation - AFs/Consonant Recognition Consonant recognition is almost perfectly correlated with place-of-articulation performance This correlation suggests that PLACE features are based on cues distributed across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower span of the spectrum MANNER is also highly correlated with consonant recognition, implying that such features are extracted from a fairly broad portion of the spectrum as well

96 Syllable Duration & the Modulation Spectrum

97

98

99 Serendipity’s Role in Science A few years ago, Saberi and Perrott published a paper in Nature that aroused a lot of attention I, personally, was interviewed by three separate news organizations about their study What Saberi and Perrott claimed was that time reversal of the speech waveform had minimal impact on speech intelligibility I was listed as a primary source because the authors cited some of my work as a means of explaining their results The original study did not actually measure intelligibility, and the data were largely nonsense However, their paradigm provided a useful starting point for dissociating the phase and magnitude components of the modulation spectrum across a broad range of frequency channels “Garbage in, science (and insight) out!”

100 What is (Locally) Time-Reversed Speech? Each segment of the speech signal is “flipped” along the time axis (i.e., reversed in time) The length of the segment thus flipped is the primary experimental parameter This signal manipulation has the effect of dissociating the phase and magnitude components of the modulation spectrum What impact does this manipulation (truly) have on intelligibility? Let’s find out! Stimulus paradigm based on K. Saberi and D. Perrott (1999) “Cognitive restoration of reversed speech,” Nature 398: 760. The experimental paradigm and acoustic analysis here bear virtually no relation to those described in the Saberi and Perrott study
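Local time reversal is simple to express in code. The sketch below is an assumption about the procedure rather than the authors' implementation: it cuts the waveform into fixed-length segments and reverses each one in place, with segment_ms as the single experimental parameter mentioned above.

    import numpy as np

    def locally_time_reverse(x, fs, segment_ms):
        """Reverse each consecutive segment_ms-long chunk of the waveform in time."""
        seg_len = max(1, int(round(segment_ms * fs / 1000)))
        y = np.empty_like(x)
        for start in range(0, len(x), seg_len):
            # The final (possibly shorter) segment is reversed as well.
            y[start:start + seg_len] = x[start:start + seg_len][::-1]
        return y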

101 Intelligibility of (Locally) Time-Reversed Speech What impact does local time reversal have on intelligibility? There is a progressive decline in intelligibility with increasing length of the reversed segment When the segment exceeds 40 ms, intelligibility is very poor What acoustic properties are correlated with this decline in intelligibility? Stimuli were sentences from the TIMIT corpus Sample sentence: “She washed his dark suit in greasy wash water all year” 80 different sentences, each spoken by a different speaker

102 Intelligibility Does NOT Depend Solely on the Magnitude Component of the Modulation Spectrum Saberi and Perrott had conjectured that the results of their experiment could be explained on the basis of the magnitude component of the modulation spectrum Brain – 1, (Cognitive) Scientists – 0 [Figures: intelligibility as a function of reversed-segment length; modulation spectrum (magnitude component only)]

103 Increasing Modulation Phase Dispersion Across Frequency as a Function of Increasing Reversed-Segment Length Let’s examine the relation between modulation phase and intelligibility from a slightly different perspective …. For reversed-segment lengths greater than 40 ms there is significant phase dispersion (relative to the original) that becomes severe for segments > 80 ms [Figure: phase dispersion across the frequency spectrum for a single sentence at 4.5 Hz]

104 Increasing Modulation Phase Dispersion as a Function of Increasing Reversed-Segment Length Let’s examine the relation between modulation phase and intelligibility …. [Figures: phase dispersion (relative to the original signal) across 40 sentences as a function of reversed-segment length (20-100 ms, plus the original); example = 750-1500 Hz sub-band at 4.5 Hz; intelligibility as a function of reversed-segment length]

105 What is the Complex Modulation Spectrum? The complex modulation spectrum combines both the magnitude and phase of the modulation pattern distributed across the tonotopic frequency axis This representation predicts the intelligibility of (locally) time-reversed speech, dissociating the phase and magnitude parts of the modulation spectrum Thereby demonstrating the importance of modulation phase (across the frequency spectrum) for understanding spoken language

106 Computing the Complex Modulation Spectrum Complex Modulation Spectrum = Magnitude x Phase It is important to compute the phase dispersion across the spectrum with precision and to ascertain its impact on the global modulation spectral representation (shown on the following slide)
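One way to realize this computation (a sketch under the same assumptions as the earlier modulation-spectrum code, not the authors' analysis) is to keep the complex Fourier coefficients of each channel's envelope, so that both the magnitude and the cross-channel modulation phase are available.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def complex_modulation_spectrum(x, fs, band, max_mod_hz=30):
        """Complex (magnitude AND phase) modulation spectrum of one spectral channel."""
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        envelope = np.abs(hilbert(sosfiltfilt(sos, x)))
        coeffs = np.fft.rfft(envelope - envelope.mean())     # complex coefficients
        freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
        keep = freqs <= max_mod_hz
        return freqs[keep], coeffs[keep]

    # Illustrative usage: the phase difference near 4.5 Hz between the original and a
    # locally time-reversed version of the same (assumed 750-1500 Hz) sub-band
    # quantifies the phase dispersion discussed on the preceding slides.
    #   f, c_orig = complex_modulation_spectrum(original, fs, (750, 1500))
    #   f, c_rev  = complex_modulation_spectrum(reversed_sig, fs, (750, 1500))
    #   idx = np.argmin(np.abs(f - 4.5))
    #   dispersion = np.angle(c_rev[idx] * np.conj(c_orig[idx]))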

107 Intelligibility is Based on BOTH the Magnitude and Phase Components of the Modulation Spectrum The relation between intelligibility and the complex modulation spectrum isn’t bad! [Figures: intelligibility as a function of reversed-segment length; complex modulation spectrum (both magnitude and phase), computed for all 80 sentences]

108 Complex Modulation Spectrum - Summary Locally time-reversed speech provides a convenient means to dissociate the magnitude and phase components of the modulation spectrum The intelligibility of time-reversed speech decreases as the segment length increases up to ca. 100 ms Speech intelligibility is NOT correlated with the magnitude component of the low-frequency modulation spectrum Speech intelligibility IS CORRELATED with the COMPLEX modulation spectrum (magnitude x phase) Thus, the phase of the modulation pattern distributed across the frequency spectrum appears to play an important role in understanding spoken language

109 Language - A Syllable-Centric Perspective A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between “sound,” “vision” and “meaning” Important linguistic information is embedded in the TEMPORAL DYNAMICS of the speech signal (irrespective of the modality)

