
1 The Modulation Spectrum – Its Role in Sentence and Consonant Identification
Steven Greenberg
Centre for Applied Hearing Research, Technical University of Denmark
Silicon Speech, Santa Venetia, CA USA
http://www.icsi.berkeley.edu/~steveng
steveng@savant-garde.net

2 Acknowledgements and Thanks
Research Funding: U.S. National Science Foundation, Otto Mønsted Foundation (Denmark), Danish Research Council, Technical University of Denmark (Torsten Dau)
Research Collaborators: Takayuki Arai, Thomas Christiansen, Rosaria Silipo

3 The Crux of the Problem ….

4 Effects of Reverberation on the Speech Signal Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions. Yet THE INTELLIGIBILITY OF SPEECH IS REMARKABLY STABLE, implying that intelligibility is based NOT on fine spectral detail, but rather on some more basic parameter(s) – what might these be?

5 Intelligibility Based on Modulation Patterns This presentation examines the origins of word intelligibility in the low-frequency (< 30 Hz) modulation properties of the acoustic speech signal. These modulation patterns, which reflect articulatory movement, are differentially distributed across the acoustic frequency spectrum.

6 Intelligibility Based on Modulation Patterns The specific configuration of the modulation patterns across the frequency spectrum reflects the essential cues for understanding spoken language.

7 Intelligibility Based on Modulation Patterns The acoustic frequency spectrum serves as a DISTRIBUTION MEDIUM for the modulation patterns. However, much of the ACOUSTIC SPECTRUM is, in fact, DISPENSABLE (Harvey Fletcher, Jont Allen and others to the contrary).

8 Structure of the Presentation This presentation will focus on the following issues:
– A sparse spectral representation of speech is sufficient for good intelligibility (though not entirely natural sounding, nor particularly robust in background noise)
– Low-frequency modulations below 30 Hz appear to serve as the primary carriers of phonetic information in the speech signal
– The role played by different parts of the modulation spectrum is at the outset unclear; the presentation will attempt to elucidate this question, both for sentences (first part) and for consonants (second part)
– The distribution of modulation information across the audio-frequency (tonotopic) spectrum is also important (and will be addressed as well)
– The perceptual data described may be useful for developing future-generation speech technology (e.g., automatic speech recognition and synthesis) germane to auditory prostheses (e.g., hearing aids and cochlear implants)

9 An Invariant Property of the Speech Signal? Houtgast and Steeneken demonstrated that the modulation spectrum, a temporal property, is highly predictive of speech intelligibility. This is significant, as it is difficult to degrade intelligibility through normal spectral distortion (many have tried, few have succeeded ….). In highly reverberant environments, the modulation spectrum’s peak is strongly attenuated and shifts down to ca. 2 Hz, and the speech becomes increasingly difficult to comprehend. [Figure: modulation spectrum, based on an illustration by Hynek Hermansky]

10 Quantifying Modulation Patterns in Speech The modulation spectrum provides a quantitative method for computing the amount of modulation in the speech signal The technique is illustrated for a paradigmatic signal for clarity’s sake The computation is performed for each spectral channel separately
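This computation can be made concrete in a few lines of code. Below is a minimal Python sketch, assuming a Hilbert-envelope implementation (the talk does not specify the exact method used in the studies); the function name and example band edges are illustrative only.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_spectrum(x, fs, band=(875.0, 1100.0), fmod_max=30.0):
    """Modulation spectrum of one acoustic channel (illustrative band edges).

    1. Band-pass the signal into a single spectral channel.
    2. Extract the temporal envelope (Hilbert magnitude).
    3. Fourier-analyze the envelope; energy below ~30 Hz is the
       low-frequency modulation spectrum discussed in this talk.
    """
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    channel = sosfiltfilt(sos, x)
    envelope = np.abs(hilbert(channel))
    envelope = envelope - envelope.mean()        # remove DC before the FFT
    spectrum = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    keep = freqs <= fmod_max
    return freqs[keep], spectrum[keep]

# As stated above, the computation is repeated for each spectral
# channel separately, e.g. over a bank of band edges.
```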

11 The Modulation Spectrum Reflects Syllables Given the importance of the modulation spectrum for intelligibility, what does it reflect linguistically? The distribution of syllable duration matches the modulation spectrum, suggesting that the integrity of the syllable is essential for understanding speech. [Figure: modulation spectrum of 15 minutes of spontaneous Japanese speech (OGI-TS corpus) compared with the syllable-duration distribution for the same material (Arai and Greenberg, 1997); x-axis: syllable duration (modulation frequency)] Comparable comparisons have been performed for (American) English.

12 Intelligibility Derived from Modulation Patterns Many perceptual studies emphasize the importance of low-frequency modulation patterns for understanding spoken language. Historically, this was first demonstrated by Homer Dudley in 1939 with what has become known as the VOCODER – modulations higher than 25 Hz can be filtered out without significant impact on intelligibility. As mentioned earlier, Houtgast and Steeneken demonstrated that the low-frequency modulation spectrum is a good predictor of intelligibility in a wide range of acoustic listening environments (1970s and 1980s). In the mid-1990s, Rob Drullman demonstrated the impact of low-pass filtering the modulation spectrum on intelligibility and segment identification – modulations below 8 Hz appeared to be most important. However, all of these studies were performed on broadband speech. There was no attempt to examine the interaction between temporal and spectral factors for coding speech information. (Other studies, such as those by Shannon and associates, have examined spectral-temporal interactions, but not at a fine level of detail.)

13 Intelligibility Studies Using Spectral Slits The interaction between spectral and temporal information for coding speech information can be examined with some degree of precision using spectral slits In what follows, the use of the term “spectral” refers to operations and processes in the acoustic frequency (i.e., tonotopic) domain The term “temporal” or “modulation spectrum” refers to operations and processes that specifically involve low-frequency (< 30 Hz) modulations First, we’ll examine the impact of extreme band-pass SPECTRAL filtering on intelligibility without consideration of the modulation spectrum

14 Intelligibility of Sparse Spectral Speech The spectrum of spoken sentences (TIMIT corpus) can be partitioned into narrow (1/3-octave) channels (“slits”). In the example below, there are four one-third-octave slits distributed across the frequency spectrum. The edge of a slit is separated from its nearest neighbor by an octave. No single slit, by itself, is particularly intelligible.
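A hedged sketch of how such slit stimuli can be constructed follows. The center frequencies are borrowed from the consonant experiment described later in the deck (330, 875, 2100 and 5400 Hz); the exact values and filter design used for the sentence materials are assumptions here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_slit(x, fs, fc):
    """Extract a 1/3-octave-wide band-pass 'slit' centered at fc (Hz)."""
    lo = fc * 2.0 ** (-1.0 / 6.0)   # lower 1/3-octave band edge
    hi = fc * 2.0 ** (1.0 / 6.0)    # upper 1/3-octave band edge
    sos = butter(6, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def sparse_spectral_speech(x, fs, centers=(330.0, 875.0, 2100.0, 5400.0)):
    """Sum of four narrow slits distributed across the spectrum; with
    these centers, adjacent slit edges are roughly an octave apart."""
    return sum(third_octave_slit(x, fs, fc) for fc in centers)
```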

15 Word Intelligibility - Single Slits The intelligibility associated with any single slit is only 2 to 9%

16 The intelligibility associated with any single slit is only 2 to 9% The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

17 Intelligibility of Sparse Spectral Speech Two slits, when combined, provide a higher degree of intelligibility, as shown on the following slides

18 Word Intelligibility - 2 Slits

19 – 22 [Figure-only slides: word intelligibility for additional two-slit combinations]

23 Intelligibility of Sparse Spectral Speech Clearly, the degree of intelligibility depends on precisely where the slits are situated in the frequency spectrum, as well as their relationship to each other Spectrally contiguous slits may (or may not) be more intelligible than those far apart Slits in the mid-frequency region, corresponding to the signal’s second formant, are the most intelligible of any two-slit combination

24 Word Intelligibility - 2 Slits

25 Intelligibility of Sparse Spectral Speech There is a marked improvement in intelligibility when three slits are presented together, particularly when the slits are spectrally contiguous

26 Word Intelligibility - 3 Slits

27 – 29 [Figure-only slides: word intelligibility for additional three-slit combinations]

30 Intelligibility of Sparse Spectral Speech Four slits combined yield nearly (but not quite) perfect intelligibility

31 Word Intelligibility - 4 Slits

32 Intelligibility of Sparse Spectral Speech The four-slit condition was designed to fall just short of perfect intelligibility. This was done intentionally so that the contribution of each slit could be precisely delineated, without having to worry about “ceiling” effects for highly intelligible conditions.

33 Modulation Spectrum Across Frequency The modulation spectrum varies in magnitude across frequency

34 Modulation Spectrum Across Frequency The shape of the modulation spectrum is similar for the three lowest slits….

35 Modulation Spectrum Across Frequency But the highest-frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies, raising the prospect that the mid-frequency modulation spectrum (10–30 Hz) may be important under certain conditions.

36 Modulation Spectrum Across Frequency The high amount of energy in the mid-frequency MODULATION spectrum is typical of material whose ACOUSTIC spectrum is higher than 3 kHz, and does not depend solely on the use of narrow spectral slices, as shown in this sample of OCTAVE-WIDE channels of broadband speech (or broader than an octave for the lowest sub-band). [Figure: modulation spectra of octave-wide channels; TIMIT corpus, 40 sentences]

37 Low-pass Modulation Filtering of Slits The MODULATION SPECTRUM of the spectral slits shown in previous slides can be LOW-PASS FILTERED in order to ascertain the relation between modulation patterns and their spectral affiliation For simplicity’s sake, either the lowest two (slits 1 + 2) or highest two (3 + 4) slits were low-pass modulation filtered in tandem
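A minimal sketch of this operation, assuming a Drullman-style envelope-filtering implementation (the deck does not spell out the exact signal processing): the slit’s Hilbert envelope is low-pass filtered and re-imposed on the original fine structure.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def lowpass_modulation_filter(slit, fs, cutoff_hz):
    """Restrict a slit's modulation spectrum to frequencies below cutoff_hz."""
    envelope = np.abs(hilbert(slit))
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    smoothed = np.clip(sosfiltfilt(sos, envelope), 0.0, None)
    fine_structure = slit / np.maximum(envelope, 1e-12)   # carrier ~ [-1, 1]
    return smoothed * fine_structure

# Parametric sweep as on the next slide (24 Hz down to 3 Hz in 3-Hz steps),
# applied to a pair of slits while the other two are left unfiltered:
# for cutoff in range(24, 0, -3):
#     processed = lowpass_modulation_filter(slit, fs, cutoff)
```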

38 Modulation Spectrum Across Frequency Each sentence presented contained four spectral slits. Baseline performance – 4 slits without modulation filtering – was 87% intelligibility. The modulation spectrum was systematically low-pass filtered between 24 Hz and 3 Hz, in 3-Hz steps, for each of the two-slit combinations, without modulation filtering the other two slits in the stimulus.

39 Modulation Spectrum Across Frequency The general effect of low-pass modulation filtering is similar Low-pass filtering below 12 Hz has a significant impact on intelligibility, which is particularly pronounced when the modulation spectrum is restricted to frequencies lower than 6 Hz However, there is a significant difference in the impact of low-pass modulation filtering depending on whether the slits are in the low or high portion of the ACOUSTIC frequency spectrum

40 Modulation Spectrum Across Frequency When the low-pass modulation filtered slits are in the low spectral frequencies (<1 kHz) there is a progressive decline of intelligibility Moreover, low-pass modulation filtering above 15 Hz has no significant impact on intelligibility In contrast, low-pass modulation filtering the high-frequency (>2 kHz) slits does impact intelligibility, even for a low-pass cutoff of 24 Hz

41 Modulation Spectrum Across Frequency This result implies that modulation frequencies higher than 24 Hz contribute to intelligibility, but only for the acoustic spectrum above 2 kHz According to some, only modulation frequencies below 8 Hz contribute to intelligibility However, recall that these other studies used full bandwidth speech signals Low-pass filtering the modulation spectrum of such broadband stimuli does not necessarily remove the upper portion of the modulation spectrum

42 Modulation Spectrum Across Frequency Much of the higher modulation spectrum could have been re-introduced through cross-channel phase distortion (as suggested by Ghitza, 2001). In addition, the inherent redundancy of the full-bandwidth signal makes it difficult to ascertain the specific contribution of each spectral region and modulation frequency. Some other method is required to tease apart the spectro-temporal components of intelligibility.

43 The Story So Far The low-frequency modulation spectrum is crucial for understanding spoken language However, it is unclear precisely WHICH parts of the modulation spectrum contribute most heavily, and how much their contribution depends on their acoustic spectral (i.e., tonotopic) affiliation The details are important for technical exploitation of these ideas (for application in hearing aids, automatic speech recognition and synthesis)

44 The Next Chapter – Consonant Identification Word intelligibility ultimately depends on listeners’ ability to decode phonetic information in the acoustic signal; a more fine-grained approach to the spectro-temporal foundations of speech processing may therefore benefit from examining the ability to identify specific consonantal segments. By focusing on consonant identification it is possible to study certain aspects of auditory processing associated with speech understanding in greater detail (and with more precision) than is possible through intelligibility alone. Moreover, it is possible to decompose consonants into more elementary “building blocks” known as articulatory-acoustic (or phonetic) features. This phonetic decomposition into the fundamental phonetic dimensions of “voicing,” “manner of production,” and “place of articulation” provides some interesting insights into the auditory basis of speech processing.

45 A Brief Introduction to Phonetic Features Three principal articulatory dimensions are distinguished (among others) – VOICING, MANNER and PLACE of articulation, as illustrated for a sample word, “nap” [n ae p] (def: seminar activity). In order to correctly identify a consonant, all three principal phonetic feature dimensions need to be decoded correctly (at least in principle).

  Segment   [n]                [ae]      [p]
  Voicing   Voiced             Voiced    Unvoiced
  Manner    Nasal              Vocalic   Stop
  Place     Alveolar (Medial)  Front     Bilabial (Front)
  Prosodic Accent: Lightly Accented

46 The Arc’s Relation to Phonotactics & Manner If we return to the basic question – WHY are syllables realized as rises and falls of energy – and we make the simple assumption that each manner of production (vowel, fricative, nasal, stop, etc.) is associated with a relative energy level – vowels being highest, stops and fricatives lowest, with nasals, liquids and glides in between – then we gain some insight as to why the segments occur in the order they do within the syllable.

47 The Energy Arc’s Relation to Syllable Phonotactics In effect, the segments reflect various manners of production, which are associated with different energy levels From the perspective of “command and control” the relation between syllable production and the energy arc is automatic and unconscious Syllables are intrinsically arcs that are readily digested by the auditory system and the brain This may account for why it is possible to articulate (and perceive) in terms of syllables, but not in terms of isolated phones (unless they are syllables themselves)

48 The Syllabic Control of Voicing – Significance The most energetic components of the speech signal are usually voiced Voicing helps to build up energy in the syllable Voicing provides implicit structure for the syllable This structure could be extremely important in decoding the speech signal, particularly in noisy environments Recall the importance of fundamental-frequency information for separating concurrent talkers or distinguishing speech from a noisy background Pitch-related cues could only play such an important role if the speech signal is largely voiced

49 The Relation Between Voicing and Manner Thus, voicing appears to cut across segmental boundaries It only APPEARS to be associated with individual segments Voicing serves to bind the segments into a syllabic whole through its temporal continuity It is probably not coincidental that 80% (or more) of the speech signal is voiced And that relatively few manner classes (usually stops, affricates, fricatives) can be realized as unvoiced (except in whispered or exaggerated speech) Voicing is indirectly related to the energy arc, in that it is associated with the most intense components of the syllable, and is most robust to noise and reverberation Thus, it is extremely important for decoding speech in noisy environments

50 Place of Articulation – the Key Dimension Articulatory place information is important for distinguishing among syllables and words (particularly for consonants). The distinction among [b], [d] and [g], and [p], [t] and [k] is primarily one of “place,” in that the location of maximum articulatory constriction varies from front to back. [Figure: FRONT – MEDIAL – BACK loci of constriction] Generally, there are only three distinct loci of constriction for any single manner class. Hence, the problem of determining articulatory place is greatly simplified if the manner of production is known. Manner-dependent place of articulation classifiers have been successfully applied in automatic phonetic transcription (e.g., Chang, Wester & Greenberg, 2001, 2005).

51 Place of Articulation The formant patterns associated with place of articulation cues vary broadly over frequency and time When speech is described as “dynamic” it is usually such formant patterns that are meant (this is a little misleading, in that syllable cues are also highly dynamic, but this is a separate story ….) In low signal-to-noise ratio conditions and among the hearing impaired, place-of- articulation cues are usually among the first to degrade

52 Place of Articulation The reasons for this seeming vulnerability are controversial, but can be understood through analysis of data shown on the following slides In this experiment, nonsense VC and CV (Am. English) syllables were presented to listeners, who were asked to identify the consonant The syllables were spectrally filtered (in one-third octave bands), so that most of the spectrum was discarded The proportion of consonants correctly recognized was scored as a function of the number of spectral slits presented and their frequency location, as shown on the next series of slides The really interesting analysis comes afterwards (so please be patient) ….

53 Consonant Recognition - Single Slits [Figure: four 1/3-octave-wide slits centered at 330, 875, 2100 and 5400 Hz]

54 Consonant Recognition - 1 Slit

55 Consonant Recognition - 2 Slits

56 [Figure-only slide: additional two-slit consonant-recognition results]

57 Consonant Recognition - 3 Slits

58 [Figure-only slide: additional three-slit consonant-recognition results]

59 Consonant Recognition - 4 Slits

60 Consonant Recognition - 5 Slits

61 Articulatory-Feature Analysis The results, as scored in terms of raw consonant identification accuracy, are not particularly insightful (or interesting) in and of themselves. They show that the broader the spectral bandwidth of the slits, the more accurate is consonant recognition. Moreover, a more densely sampled spectrum results in higher recognition. However, we can perform a more detailed analysis by examining the pattern of errors made by listeners. From the confusion matrices we can ascertain precisely WHICH ARTICULATORY FEATURES are affected by the various manipulations imposed. And from this error analysis we can make certain deductions about the distribution of phonetic information across the tonotopic frequency axis, potentially relevant to understanding why speech is most effectively communicated via a broad spectral carrier.

62 The Bottom Line – So Far The results, as scored in terms of raw consonant identification accuracy, are not particularly insightful (or interesting) in and of themselves. They show that the broader the spectral bandwidth of the slits, the more accurate is consonant recognition. Moreover, a more densely sampled spectrum results in higher recognition.

63 Phonetic Feature Specification – English The data can also be scored in terms of the proportion of phonetic features correctly decoded. This permits a more detailed analysis, based on the pattern of errors made by listeners.

64 Phonetic Feature Specification – English In order to understand how this is done (and its significance) it is useful to examine a phonetic feature specification for the consonants involved:

       VOICING  MANNER     PLACE
  p    –        Stop       Front
  t    –        Stop       Medial
  k    –        Stop       Back
  b    +        Stop       Front
  d    +        Stop       Medial
  g    +        Stop       Back
  s    –        Fricative  Medial
  f    –        Fricative  Front
  v    +        Fricative  Front
  m    +        Nasal      Front
  n    +        Nasal      Medial
  y    +        Glide      Front
  w*   +        Glide      Back     (w* = +[round])

65 Perceptual Confusion Matrix – Example (rows: stimulus; columns: response)

       p   t   k   b   d   g   s   f   v   m   n
  p   19  13   1   0   0   0   0   3   0   0   0
  t    1  32   2   0   1   0   0   0   0   0   0
  k    1   8  27   0   0   0   0   0   0   0   0
  b    0   0   0  25  10   0   0   0   0   1   0
  d    0   0   0   2  34   0   0   0   0   0   0
  g    0   0   1   5   7  22   0   0   1   0   0
  s    0   0   0   0   0   0  32   4   0   0   0
  f    0   1   0   0   1   0  23  10   1   0   0
  v    0   0   0   0   0   1   0   0  27   7   1
  m    0   0   0   0   0   0   0   0   1  24   4
  n    0   0   0   0   0   0   0   0   8   3  32

The pattern of identification errors can be used to deduce which phonetic features are more robust and which are most vulnerable to distortion
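As a concrete illustration of this kind of error analysis, the sketch below collapses a consonant confusion matrix onto the three feature dimensions using the specification table from the previous slide. The helper function is hypothetical, not from the original study; it simply counts the trials in which stimulus and response share a feature value.

```python
import numpy as np

CONSONANTS = ["p", "t", "k", "b", "d", "g", "s", "f", "v", "m", "n"]
FEATURES = {  # consonant -> (voicing, manner, place), per the table above
    "p": ("-", "Stop", "Front"),       "t": ("-", "Stop", "Medial"),
    "k": ("-", "Stop", "Back"),        "b": ("+", "Stop", "Front"),
    "d": ("+", "Stop", "Medial"),      "g": ("+", "Stop", "Back"),
    "s": ("-", "Fricative", "Medial"), "f": ("-", "Fricative", "Front"),
    "v": ("+", "Fricative", "Front"),  "m": ("+", "Nasal", "Front"),
    "n": ("+", "Nasal", "Medial"),
}

def feature_accuracy(confusions, dim):
    """Proportion of trials in which the response shares the stimulus's
    value on feature dimension dim (0 = voicing, 1 = manner, 2 = place).
    `confusions` is an 11x11 count matrix in CONSONANTS order (rows =
    stimulus, columns = response), e.g. the matrix above."""
    correct = sum(
        confusions[i, j]
        for i, s in enumerate(CONSONANTS)
        for j, r in enumerate(CONSONANTS)
        if FEATURES[s][dim] == FEATURES[r][dim]
    )
    return correct / confusions.sum()
```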

66 Perceptual Confusion Matrix – Example (continued) Consonant errors are not random – perceptual confusions are more likely to occur with respect to place of articulation than to voicing or manner, as indicated (by yellow rectangles in the original slide) by the common place-of-articulation confusions within the same manner and voicing class (e.g., [p]→[t], [b]→[d], [f]→[s], [m]↔[n] in the matrix above).

67 Phonetic-Feature/Consonant Identity – Correlation Consonant recognition is almost perfectly correlated with place-of-articulation performance This correlation suggests that PLACE features are based on cues DISTRIBUTED across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower span of the spectrum MANNER is also highly correlated with consonant recognition, implying that such features are extracted from a fairly broad portion of the spectrum as well

68 Let’s Go Danish (for Consonant Identification) An analogous experiment was performed for 11 spoken Danish consonants. One, two or three spectral bands (each three-quarters of an octave wide) were used. Because the slit bandwidth was much wider than that used for the English consonants, consonant identification accuracy is considerably higher. This was done intentionally for reasons described shortly.

69 Danish Consonant Identification The objective was to reach nearly (but not quite) perfect consonant recognition with (only) three slits, while achieving much lower consonant identification with single slits. This was to obtain a reasonably large dynamic range between the single-slit and multiple-slit conditions, for reasons described shortly.

70 Danish Consonant Identification The specific reason for structuring the stimuli in this fashion was to parametrically manipulate the modulation spectrum of individual slits. The modulation spectrum of single slits was low-pass filtered between 24 Hz and 5 Hz in order to ascertain the combined effect of spectro-temporal filtering on consonant identification. Slits marked in magenta were low-pass modulation filtered, while those in black were not. The percent-correct recognition scores are not of particular interest, except in one respect … [Figure legend: cutoffs < 6 Hz, < 3 Hz]

71 Phonetic Feature Specification – Danish As with the English consonants, the Danish material can be decomposed into constituent phonetic features, in order to perform a phonetic-feature confusion analysis:

       VOICING  MANNER     PLACE
  p    –        Stop       Front
  t    –        Stop       Medial
  k    –        Stop       Back
  b    +        Stop       Front
  d    +        Stop       Medial
  g    +        Stop       Back
  s    –        Fricative  Medial
  f    –        Fricative  Front
  v    +        Fricative  Front
  m    +        Nasal      Front
  n    +        Nasal      Medial

72 Consonant Identification vs. Feature Decoding Consonant errors are not random – perceptual confusions are more likely to occur for place of articulation than for voicing or manner. Consonant recognition is, in fact, nearly perfectly correlated with the decoding of place of articulation information. [Figure panels: Manner, Place, Voicing]

73 Perceptual Confusion Matrix – Example (The example matrix from slide 65 is shown again.) Consonant errors are not random – perceptual confusions are more likely to occur with respect to place of articulation than to voicing or manner, as indicated by the common place-of-articulation confusions within the same manner and voicing class.

74 Phonetic Feature Decoding Accuracy Consonant errors are not random – perceptual confusions are more likely to occur with respect to place of articulation than to voicing or manner. The specific pattern of confusions can be used to deduce decoding strategies used by listeners. (Rows: stimulus; columns: response.)

  Voicing (99% correct)    Voiced  Unvoiced
    Voiced                   215       1
    Unvoiced                   3      77

  Manner (94% correct)     Stop  Fricative  Nasal
    Stop                    211       4       1
    Fricative                 3      97       8
    Nasal                     0       9      63

  Place (77% correct)      Front  Medial  Back
    Front                    125      53     2
    Medial                    11     131     2
    Back                       7      15    50

  Information Transmitted (IT, bits): Voicing 0.91, Manner 1.09, Place 0.60
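IT values of this kind can be computed from a confusion matrix with the classic Miller & Nicely (1955) transmitted-information measure; a minimal sketch follows (whether this study used exactly this estimator, or a bias-corrected variant, is an assumption).

```python
import numpy as np

def information_transmitted(confusions):
    """Transmitted information (bits) between stimulus and response,
    estimated from a raw confusion-count matrix:
    T = sum_xy p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    p_xy = confusions / confusions.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # stimulus marginals
    p_y = p_xy.sum(axis=0, keepdims=True)   # response marginals
    nz = p_xy > 0                           # skip zero cells (0 * log 0 = 0)
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

# Voicing sub-matrix from this slide:
voicing = np.array([[215, 1], [3, 77]])
print(information_transmitted(voicing))  # ~0.74 bits with this estimator;
# the slide's reported values may derive from a different estimator or data
```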

75 Total Information Transmitted When total information transmitted is examined (essentially consonant ID), the patterns are systematic. Low-pass modulation filtering single slits results in a progressive decline, whereas multi-slit stimuli are much less affected by such filtering. [Figure: information transmitted (bits) vs. modulation-frequency cutoff (< 6 Hz, < 3 Hz) for stimulus conditions 0.75, 1.5, 3, 0.75 + 3, and 0.75 + 1.5 + 3 kHz; slits marked in magenta were low-pass modulation filtered]

76 Voicing Information Transmitted Three slits do not add information beyond what is achieved by two. The greatest decline in IT occurs below 5 Hz (3 kHz slit) and above 12 Hz (0.75 kHz slit). Notice the relatively slight impact of modulation filtering on the decoding of voicing information in the two- and three-slit conditions. [Figure: information transmitted (bits) vs. modulation-frequency cutoff for the same stimulus conditions; magenta slits were low-pass modulation filtered]

77 Manner Information Transmitted Manner information is integrated in roughly linear fashion across the acoustic frequency spectrum. There is a progressive (but relatively slight) decline in manner decoding with low-pass modulation filtering. [Figure: information transmitted (bits) vs. modulation-frequency cutoff for the same stimulus conditions; magenta slits were low-pass modulation filtered]

78 Place Information Transmitted The integration of place information across the acoustic frequency spectrum is highly synergistic (and expansive). Notice that low-pass modulation filtering has a significant impact on the decoding of place information for the two-slit conditions. [Figure: information transmitted (bits) vs. modulation-frequency cutoff for the same stimulus conditions; magenta slits were low-pass modulation filtered]

79 Phonetic Feature Information – All Dimensions Place of articulation exhibits a pattern of cross-spectral integration distinct from voicing and manner – it requires a broader region of the audio and modulation spectrum. [Figure: four panels (Total, Voicing, Manner, Place) plotting information transmitted (bits) vs. modulation-frequency cutoff for stimulus conditions 0.75, 1.5, 3, 0.75 + 3, and 0.75 + 1.5 + 3 kHz; slits marked in magenta were low-pass modulation filtered]

80 Phonetic Features & Modulation Spectrum Phonetic features vary with respect to their modulation spectral properties Place is associated with frequencies higher than 8 Hz Manner is mostly associated with frequencies above 12 Hz and below 8 Hz Voicing’s association with the modulation spectrum is frequency-specific; Below 8 Hz for high audio frequencies and above 12 Hz for low audio frequencies

81 Cross-channel Synergy (or not) The degree of cross-channel integration depends on the phonetic dimension. Voicing and manner are quasi-linear with respect to cross-channel integration. Place is highly synergistic, in that the amount of information associated with two and three slits is far more than predicted on the basis of linear integration. [Figure: observed IT / predicted IT (linear summation) vs. modulation-frequency cutoff (∞, < 24, < 12, < 6, < 3 Hz) for Total, Voicing, Manner and Place, stimulus conditions 0.75 + 3 kHz and 0.75 + 1.5 + 3 kHz; values larger than 1 indicate greater-than-linear summation]
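A tiny sketch of the synergy measure plotted here: observed IT for a multi-slit condition divided by the IT predicted from linear summation of the single-slit ITs. The numerical values in the example are placeholders, not data from the experiment.

```python
def synergy_ratio(it_combined, it_singles):
    """Observed IT of a multi-slit condition divided by the IT predicted
    from linear summation of the corresponding single-slit conditions.
    Ratios above 1 indicate super-linear (synergistic) integration."""
    return it_combined / sum(it_singles)

# Illustrative placeholder values only (not data from the experiment):
print(synergy_ratio(0.60, [0.10, 0.15]))  # -> 2.4, strongly synergistic
```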

82 The auditory system performs a spectro-temporal analysis in order to extract phonetic information from the acoustic speech signal Summary

83 A detailed analysis of the audio (tonotopic) spectrum is not required to understand spoken language Summary

84 However, a comprehensive sampling of the modulation characteristics of the speech signal across the audio spectrum is essential Summary

85 This is particularly true for place of articulation information, which is crucial for decoding consonant identity Summary

86 Place of articulation is the only phonetic feature whose information transmission increases expansively across the audio frequency spectrum Summary

87 Moreover, place is the only dimension to be intensively based on the portion of the modulation spectrum above 8 Hz Summary

88 The other phonetic dimensions, voicing and manner, are less tied to the modulation spectrum than place Summary

89 Voicing is associated with low modulation frequencies (and high to a degree) in an audio-frequency-selective manner Summary

90 Manner is associated with modulation frequencies above 12 Hz Summary

91 This modulation spectral “division of labor” is consistent with an auditory analysis based on modulation maps (but is also consistent with other interpretations, such as a range of integration time constants) Summary

92 Summary
– The auditory system performs a spectro-temporal analysis in order to extract phonetic information from the acoustic speech signal
– A detailed analysis of the audio (tonotopic) spectrum is not required to understand spoken language
– However, a comprehensive sampling of the modulation characteristics of the speech signal across the audio spectrum is required
– This is particularly true for place of articulation information, which is crucial for decoding consonant identity
– Place of articulation is the only phonetic feature whose information transmission increases expansively across the audio frequency spectrum
– Moreover, place is the only dimension to be intensively based on the portion of the modulation spectrum above 8 Hz
– The other phonetic dimensions, voicing and manner, are less tied to the modulation spectrum than place
– Voicing is associated with low modulation frequencies (and high to a degree)
– Manner is associated with modulation frequencies above 12 Hz
– This modulation spectral “division of labor” is consistent with an auditory analysis based on modulation maps (but is also consistent with other interpretations, such as a range of integration time constants)

93 For Additional Information Consult the web site: www.icsi.berkeley.edu/~steveng

94 Many Thanks for Your Time and Attention

95 Language – A Syllable-Centric Perspective An empirically grounded perspective of spoken language focuses on the SYLLABLE and PROSODIC ACCENT as the interface between “sound” and “meaning” (or at least lexical form). [Figure: modes of analysis for the word “seven” – linguistic tiers spanning prosodic accent, phonetic interpretation, manner segmentation (Fric, Voc, V, Nas), energy, and a time–frequency representation]

96 Language - A Syllable-Centric Perspective A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between “sound,” “vision” and “meaning.” Important linguistic information is embedded in the TEMPORAL DYNAMICS of the speech signal (irrespective of the modality).

97 The Energy Arc Syllables are characterized by rises and falls in energy (see below, left) The “energy arc” reflects both production and perception From production’s perspective, the arc reflects the articulatory cycle from closure to maximally open aperture and back again (in crude terms) From the ear’s perspective, the energy arc reflects the packaging of information within the temporal limits that the auditory system (and other sensory organs) has evolved to process This temporal dimension is reflected in the modulation spectrum of spoken language (below, right) [Figure panels: Spectro-temporal Profile (left), Modulation Spectrum (right)]

98 Place of Articulation PLACE of articulation is, ironically, the most information-laden articulatory feature dimension in speech, and is inherently TRANS-SEGMENTAL, binding vocalic nuclei with preceding and/or following consonants It is also the most stable phonetic dimension LINGUISTICALLY, although paradoxically, it is extremely vulnerable to acoustic interference when presented exclusively in the acoustic modality (i.e., without visual cues) [Figure: FRONT – MEDIAL – BACK loci of articulation]

99 The Energy Arc and Voicing Within the traditional framework, voicing is considered a segmental property A segment is either voiced or not However, we know that this segmental perspective on voicing is only a crude caricature of the acoustic properties of speech Many theoretically voiced segments are at least partially unvoiced For example, in Am. English it is common for [z] to be unvoiced – particularly in syllable-final position in unaccented syllables The so-called voiced obstruents ([b], [d], [g]) are usually realized as partially unvoiced (this is what voice-onset-time refers to), with various languages differing with respect to the specific values of VOT This sort of behavior implies that voicing is NOT a segmental feature, but rather one that is under SYLLABIC control and actually reflects prosodic factors (which is WHY languages vary with respect to VOT) How can this be so?

100 The Syllabic Control of Voicing Recall, that the core of the syllable – the nucleus – is almost always voiced The nucleus is usually a vowel and contains the peak energy in the syllable Voicing spreads from the nucleus forward in time to the coda, as well as backward to the onset Voicing is continuous in time, and is associated with the higher-energy parts of the syllable The lower-energy components of the syllable may or may not be voiced But where the signal is unvoiced, the associated constituents reside in the “tails” of the syllable – the onset and/or coda It is probably not a coincidence that the most linguistically informative components in speech are NOT associated with voicing


Download ppt "The Modulation Spectrum – Its Role in Sentence and Consonant Identification Steven Greenberg Centre for Applied Hearing Research Technical University of."

Similar presentations


Ads by Google