Speech Perception Chris Plack PSYC 60041 Auditory Science
Speech Perception Learning Outcomes: understand the nature of the speech signal; understand how speech is produced; understand the cues that are used by the auditory system to decode the speech signal; understand the problems of variability and context; gain a basic understanding of categorical perception; understand the speech intelligibility index (SII); understand how speech perception is affected by hearing loss.
Audio-vestibular Anatomy and Physiology Lecture 1 Speech Production Karolina Kluk - de Kort, karolina.kluk@manchester.ac.uk
Speech Production The Vocal Tract [Figure: the vocal tract, labelled: tongue, teeth, lips, vocal folds, nasal cavity, soft palate, pharynx, larynx, air from lungs] Speech is produced when air is forced from the lungs past the vocal folds, causing them to vibrate.
Speech Production The Vocal Tract Articulators in the nose, throat, and mouth (e.g. the tongue) move into different positions to modify the spectrum of the sound waves and produce different vowel sounds.
[Figure: waveforms and spectra of the vowels /a/ and /i/, with the formants F1, F2 and F3 (peaks in the spectrum) marked]
Speech Production The Vocal Tract Consonants are produced by constrictions in the vocal tract at different positions (e.g. “p” is produced by constriction at the lips, “k” at the back of the mouth).
Speech Production Air is forced out from the lungs. Vocal folds vibrate (voiced sounds) or stay open (unvoiced). Air-flow enters the mouth (and nasal cavity for nasal sounds). Resonances in these cavities result in formants (e.g. in vowel sounds). Consonants are produced by differing degrees of constriction of the vocal tract occurring at different places. Narrow constriction: fricatives. Complete closure: stops (plosives).
Speech Production So normal speech is a series of complex tones (vowel sounds) interrupted by regular constrictions in the vocal tract (consonant sounds). In whispered speech the vocal folds do not vibrate regularly: air passing through the partly open glottis produces noise rather than a regular series of pulses.
Types of Speech Sound
Speech Sounds Speech can be broken down into words and often further into syllables and phonemes. The phoneme is the smallest unit of speech sounds. Phonemes are represented by the “phonetic alphabet” – International Phonetic Alphabet (IPA). Phonemes describe what is perceived, not a specific acoustic pattern. It is not the case that each phoneme has a set acoustic pattern.
Speech Sounds English contains about 40 phonemes. These are represented by IPA symbols and are shown between two slash marks / / to indicate that they are phonetic symbols. IPA web site: http://www.langsci.ucl.ac.uk/ipa/ IPA fonts for PC can be found at: http://www.sil.org/computing/fonts/encore.html
Consonants Consonants can be distinguished in terms of: Voicing: whether or not the vocal folds are vibrating in the larynx. Place: site of constriction or closure in the vocal tract. Manner: type of constriction (plosive, fricative, affricate, nasal, approximant).
Consonants Fricatives: e.g. /s/, /f/, /ʃ/ and voiced equivalents /z/, /v/, /ʒ/. Plosives: e.g. /p/, /t/, /k/ and /b/, /d/, /g/. Affricates (combination of fricative and plosive): e.g. /tʃ/ as in “church”. Nasals: e.g. /m/, /n/ and /ŋ/ as in “ring”. Approximants: e.g. /w/, /r/, /j/ (the “y” in “yes”) and /l/.
Speech Perception
Acoustic Signals and Cues In most of this module we have talked about experiments that are designed to look at a single aspect of sound perception. It has been important to design the experiments so that the only perceptual cues present are those we are investigating. With speech sounds there are multiple cues present, both acoustic and non-acoustic.
Spectrograms When we analyse speech we generally want to see how the spectral content varies through time. To do this, we use spectrograms. Narrow-band spectrogram: fine frequency resolution, but smears timing information. Broad-band spectrogram: fine time resolution, poor frequency resolution.
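The trade-off between the two spectrogram types follows from the length of the analysis window. A minimal sketch, assuming the rule of thumb that frequency resolution is roughly the reciprocal of the window duration; the function name and the two window lengths are illustrative choices, not values from these slides:

```python
def spectrogram_resolution(window_ms):
    """Return (time_resolution_ms, frequency_resolution_hz) for a window.

    Assumption: frequency resolution is about 1 / window duration.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    frequency_resolution_hz = 1000.0 / window_ms
    return window_ms, frequency_resolution_hz


# Hypothetical window lengths chosen only for illustration.
for label, window_ms in [("narrow-band", 25.0), ("broad-band", 4.0)]:
    t_res, f_res = spectrogram_resolution(window_ms)
    print(f"{label}: window {t_res:.0f} ms, resolves about {f_res:.0f} Hz")
```

A long window resolves individual harmonics but blurs rapid events such as stop bursts; a short window does the opposite.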
Acoustic Cues Phonemes can be characterised by their spectro-temporal characteristics. [Figure: spectrogram of the utterance “a good speech”, frequency against time]
Acoustic Cues The position and movement of the spectral peaks (formants) can carry a great deal of information about the identity of consonants and vowels. Different vowel sounds have formant peaks at different frequencies. Second formant transitions in particular are important in identifying consonants.
Acoustic Cues Temporal fluctuations are important in consonant perception. Stop consonants can be identified by periods of silence (e.g., “say” vs. “stay”).
Variable Nature of Acoustic Cues Unfortunately, the acoustic waveforms corresponding to an individual phoneme are highly variable. This variability arises in part from co-articulation and the way articulators have to move between particular locations in the vocal tract. Also, speaker accent, sex, age, and individual differences in the shape of the vocal tract (e.g. Michael Jackson vs. Barry White) can have a big influence on the sound associated with a given phoneme.
Variable Nature of Acoustic Cues The same phoneme can be perceived from a variety of acoustic patterns. E.g. the formant transitions following the phoneme /d/: in /di/ the second formant rises from 2200 to 2400 Hz; in /du/ the second formant falls from 1200 to 700 Hz. Similarly, other vowels result in different acoustic patterns after the same consonant.
Co-articulation In free-flowing speech a vowel sound is not a static pattern of spectral peaks. The sound is constantly changing as the articulators in the vocal tract move between the position for one speech sound and that for the next sound. Hence, the sound waveform corresponding to a phoneme is heavily modified by the nature of the phonemes before and after - this is called “co-articulation.”
Variable Nature of Acoustic Cues Hence, there is a complex relationship between the acoustic pattern and the phoneme. Liberman et al. (1967) refer to "encoded" and "unencoded" phonemes. Encoded: phonemes whose acoustic patterns show considerable context-dependent restructuring. Unencoded: those for which there is less restructuring.
Redundancy The problems of variability and listening to speech in difficult conditions (e.g., noisy environment, crackly phone line) are made easier because the speech signal contains a lot of redundant information. In other words, there is usually much more information in the signal than we need to identify the utterance. This can be demonstrated by sine-wave speech and by noise-vocoded speech.
Sine-Wave Speech Take natural speech and extract only the frequencies of the first three formants (Remez et al., 1981). Replace these three formants with sinusoids that vary in frequency and amplitude according to the original formants.
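The replacement step can be sketched as follows. This is a minimal illustration, not the procedure of Remez et al.: it assumes the formant tracks have already been estimated, and the helper `linear_track` and all F1 and F3 values are made up (only the F2 glide is the /di/ transition quoted earlier in these slides).

```python
import math


def sine_wave_speech(formant_tracks, sample_rate=16000):
    """Sum one frequency- and amplitude-varying sinusoid per formant track.

    formant_tracks: list of tracks; each track is a list of
    (frequency_hz, amplitude) pairs, one pair per output sample.
    """
    n_samples = len(formant_tracks[0])
    output = [0.0] * n_samples
    for track in formant_tracks:
        phase = 0.0
        for n, (freq_hz, amp) in enumerate(track):
            # Accumulate phase so the tone stays continuous as it glides.
            phase += 2.0 * math.pi * freq_hz / sample_rate
            output[n] += amp * math.sin(phase)
    return output


def linear_track(f_start, f_end, amp, n_samples):
    """Hypothetical helper: a formant gliding linearly in frequency."""
    step = (f_end - f_start) / max(n_samples - 1, 1)
    return [(f_start + step * n, amp) for n in range(n_samples)]


# 100 ms at 16 kHz; F1 and F3 values are illustrative assumptions.
n = 1600
tracks = [
    linear_track(300, 280, 1.0, n),     # F1
    linear_track(2200, 2400, 0.5, n),   # F2 rising, as in /di/
    linear_track(3000, 3000, 0.25, n),  # F3
]
signal = sine_wave_speech(tracks)
```

With real speech, each track would follow the measured formant frequency and amplitude frame by frame rather than a straight line.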
[Audio demos: tone 1 alone; tone 2 alone; tone 3 alone; all three tones; original speech]
[Audio demos: sine-wave speech examples, each paired with the original]
Noise Vocoded Speech Another example of simple acoustic patterns being perceived as speech is with white noise modulated by the speech amplitude (developed to simulate cochlear implants). Speech is split into a number of frequency channels. The slow amplitude fluctuations (envelope) of these channels are used to modulate a band of noise centred on the frequency of the channel. These channels are re-combined to form a speech signal that has very limited spectral information compared to the original speech.
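The processing chain described above (split into channels, extract each envelope, modulate band-limited noise, recombine) can be sketched in plain Python. The filter design, Q value, envelope cutoff and channel centre frequencies are all assumptions made for illustration; published vocoders use their own filter banks.

```python
import math
import random


def bandpass(signal, centre_hz, q, sample_rate):
    """Second-order band-pass filter (standard biquad, 0 dB peak gain)."""
    w0 = 2.0 * math.pi * centre_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def envelope(signal, cutoff_hz, sample_rate):
    """Slow amplitude fluctuations: rectify, then one-pole low-pass."""
    coeff = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    env, state = [], 0.0
    for x in signal:
        state = coeff * state + (1.0 - coeff) * abs(x)
        env.append(state)
    return env


def noise_vocode(speech, centre_freqs_hz, sample_rate=16000, q=2.0,
                 envelope_cutoff_hz=30.0, seed=0):
    """Replace each channel's fine structure with envelope-modulated noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    output = [0.0] * len(speech)
    for centre in centre_freqs_hz:
        channel = bandpass(speech, centre, q, sample_rate)
        channel_env = envelope(channel, envelope_cutoff_hz, sample_rate)
        carrier = bandpass(noise, centre, q, sample_rate)
        for n, (e, c) in enumerate(zip(channel_env, carrier)):
            output[n] += e * c
    return output


# Stand-in for speech: a 1 kHz tone with a slow 4 Hz amplitude fluctuation.
fs = 16000
speech = [
    math.sin(2 * math.pi * 1000 * n / fs)
    * (0.5 + 0.5 * math.sin(2 * math.pi * 4 * n / fs))
    for n in range(fs // 4)
]
vocoded = noise_vocode(speech, [500, 1000, 2000, 4000], sample_rate=fs)
```

Fewer entries in the list of centre frequencies means fewer channels and therefore coarser spectral information in the output.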
Noise Vocoded Speech For 16 channels it’s quite easy to follow the speech. But even for 2 channels we can identify some of the words (especially when we’ve heard the original).
[Audio demos: effects of channel number, with 1, 2, 4, 8 and 16 channels]
Non-Acoustic Cues For normal speech, perception doesn’t depend solely on acoustic cues. If a highly probable word is replaced in a sentence by a cough, for example, the replaced word is often “heard.” This kind of “filling in” process often occurs in noisy environments. This illustrates the importance of non-acoustic cues.
Non-Acoustic Cues Semantic cues (the meaning of preceding and following words and the subject matter). Syntactic cues (grammatical rules). Circumstantial cues (speaker identity, listening environment, etc.). Visual cues (e.g., lip reading).
Audio-Visual Integration Movements of a speaker’s lips and face provide important cues for speech perception. What we hear is heavily influenced by what we see. For example, in the McGurk effect, a spoken /ba/ dubbed onto a video of a face saying /ga/ is often heard as /da/.
Is Speech Special?
Is Speech Special? Some researchers have argued that the perception of speech sounds is different from that of non-speech sounds. In particular, that there is a special "speech mode" engaged when we listen to speech sounds. In what way does speech require a special decoding mechanism?
Rate of Speech Sounds In rapid speech, as many as 30 phonemes may occur per second. Liberman et al. (1967) argued that this is too fast for the auditory system to resolve. More recent evidence refutes this - with practice we can identify sequences of non-speech sounds where individual items are as short as 10 ms (100 items per second). At these rates individual items aren't recognised - only the overall pattern.
Categorical Perception For highly encoded phonemes, certain small changes to the acoustic signal make little or no change in the perceived phoneme. Other small changes result in perception of a different phoneme.
Categorical Perception Essentially we don't hear changes within a category. But we do hear changes that cross category boundaries. E.g. for synthetic, two-formant speech, where the second formant transition varies in small steps.
[Figure: two category boundaries marked along the second-formant transition continuum]
Categorical Perception To demonstrate categorical perception we can use both identification and discrimination tasks. Ideally, the identification task would establish where the category boundaries lie, and the discrimination task should be difficult within categories (about chance level) but easy across boundaries.
Example of categorical perception (identification) task – a continuum of date/gate speech is synthesized by computer. These are presented either in random order or adaptively and the listener is asked to identify the sound as “date” or “gate.”
The percentage of one of the responses is plotted against the stimulus parameter: at one end of the scale 100% of responses are “date”, at the other end 0% (i.e. 100% “gate”).
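An identification function of this kind, and the category boundary read from it, can be sketched as below. The logistic shape, the 10-step continuum, and the boundary and slope values are assumptions for illustration, not data from the date/gate experiment.

```python
import math


def percent_date(step, boundary, slope):
    """Assumed logistic identification function: % "date" responses."""
    return 100.0 / (1.0 + math.exp(slope * (step - boundary)))


def estimate_boundary(steps, percentages):
    """Interpolate linearly to find the step where responses cross 50%."""
    for i in range(len(steps) - 1):
        p0, p1 = percentages[i], percentages[i + 1]
        if p0 >= 50.0 >= p1 and p0 != p1:
            s0, s1 = steps[i], steps[i + 1]
            return s0 + (p0 - 50.0) * (s1 - s0) / (p0 - p1)
    return None  # the responses never cross 50%


# Hypothetical 10-step date/gate continuum.
steps = list(range(1, 11))
percentages = [percent_date(s, boundary=5.5, slope=2.0) for s in steps]
boundary = estimate_boundary(steps, percentages)
```

A steep slope corresponds to an abrupt change in labelling at the boundary, which is the pattern expected under categorical perception.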
Categorical Perception CP has been taken as evidence for a special speech decoder. However, CP can also occur for non-speech sounds. Miller et al. (1976) presented a low-level noise whose onset occurred at various times relative to a more intense buzz. The task was to label the sound as “noise” or “no-noise”. Labelling changed abruptly at 16 ms: “noise” when the noise started more than 16 ms before the buzz, “no-noise” when it started less than 16 ms before the buzz. Discrimination was also best across this boundary.
Native Language Magnet Theory Kuhl (1993): The exposure to a specific language results in stored representations of phonetic category prototypes for that language. Each prototype acts like a “perceptual magnet” and attracts stimuli with similar acoustic patterns to be perceived as more like that prototype. Certain perceptual distinctions are minimised (e.g. for sounds close to prototype) and others maximised (e.g. those lying on opposite sides of a category boundary).
Evidence from Brain Specialisation Different brain regions play a role in the perception of speech and non-speech sounds. The “crossed” pathways are assumed to be more effective than uncrossed (i.e. right ear connected to left hemisphere and vice versa). If two competing stimuli are presented to the two ears: For speech sounds, those presented to the right ear are better identified than those presented to the left. For musical melodies the reverse is true.
Evidence from Brain Specialisation This suggests that speech signals are more readily decoded in the left hemisphere. Studies of deficiencies in speech perception and production in people with brain lesions support this, e.g. aphasias associated with lesions of Broca’s (production) or Wernicke’s (reception) regions. However, it could be that the initial processing of speech and non-speech sounds is similar.
A Speech Mode? When listening to sounds with the acoustic characteristics of speech we seem to change to a “speech mode” of listening. Signals where the spectral and temporal characteristics are similar to those of speech are learned more easily.
A Speech Mode? Sine-wave speech is a good example of this. Listeners with no knowledge of the stimuli just heard beeps and whistles. However, listeners asked to transcribe the “strange speech” were able to do so. Once “speech mode” is engaged it is difficult to reverse the process.
Stevens and House (1972) “although one can imagine an acoustic continuum in which sounds bear a closer and closer relation to speech, there is no such continuum as far as the perception of the sound is concerned – the perception is dichotomous. Sounds are perceived either as linguistic or nonlinguistic entities” Stevens & House (1972) Speech Perception, in Foundations of Modern Auditory Theory. Ed. J.V. Tobias
Effects of Hearing Loss
Effects of Cochlear Hearing Loss on Speech Perception Hearing loss may affect the audibility of some speech components. In addition, a reduction in frequency selectivity will degrade the representation of spectral features in the speech, and will also impair the separation of the target speech from background speech or noise.
Loss of Audibility If we plot the speech spectrum on an audiogram we can easily see which speech sounds would be affected by different degrees of hearing loss. For example, a high-frequency loss with 0 dB HL up to 2 kHz and 70 dB HL above that would disrupt the ability to distinguish between stops (or plosives, /b/, /d/, /g/, /p/, /t/, /k/) and fricatives (e.g. /f/, /θ/, /s/, /ʃ/ etc.).
Speech Intelligibility Index (SII) If audibility were all that was needed for speech intelligibility, we should be able to predict speech intelligibility (SI) from the audiogram and the speech spectrum. We would need to know: (1) how much of each part of the speech spectrum is audible; (2) how important each part of the spectrum is to speech intelligibility. Multiply the two for each frequency band and add the products together across frequency. This is what the SII does.
Calculations can be done for octave bands, 1/3-octave bands, or critical bands. In order to do this we need: the speech spectrum, expressed as a level for each band; weightings for speech importance; pure-tone thresholds and internal or background noise (to calculate the signal-to-noise ratio).
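The multiply-and-sum calculation can be sketched for octave bands as follows. This is a simplified illustration rather than the full standard: it assumes speech cues span 30 dB in each band (15 dB either side of the band level), that all levels share one dB reference, and that the importance weights and levels are made-up numbers.

```python
def band_audibility(speech_db, threshold_db, noise_db):
    """Fraction (0 to 1) of a band's speech range that is audible.

    Assumes a 30 dB speech range per band and that the higher of
    threshold and noise sets the floor.
    """
    floor_db = max(threshold_db, noise_db)
    return min(1.0, max(0.0, (speech_db - floor_db + 15.0) / 30.0))


def sii(speech_db, threshold_db, noise_db, importance):
    """Importance-weighted sum of band audibility across frequency."""
    return sum(
        w * band_audibility(s, t, n)
        for s, t, n, w in zip(speech_db, threshold_db, noise_db, importance)
    )


# Octave bands centred on 250, 500, 1000, 2000, 4000 and 8000 Hz.
# Every number below is illustrative, not taken from the standard.
importance = [0.06, 0.17, 0.24, 0.26, 0.21, 0.06]  # sums to 1
speech_db = [55.0, 55.0, 50.0, 45.0, 40.0, 35.0]
quiet = [-100.0] * 6  # effectively no background noise
normal_thresholds = [10.0] * 6
hf_loss_thresholds = [10.0, 10.0, 10.0, 70.0, 70.0, 70.0]

sii_normal = sii(speech_db, normal_thresholds, quiet, importance)
sii_hf_loss = sii(speech_db, hf_loss_thresholds, quiet, importance)
```

In this toy case the high-frequency loss removes the three upper bands entirely, so only the importance of the three lower bands contributes to the index.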
SII for Octave Bands [Figure: band importance and the average spectrum of speech, by octave band]
Low-frequency parts of speech are less important than mid frequencies.
[Figures: five example cases, with SII = 0.9958, 0.8692, 0.6047, 0.2183 and 0.0]
High-frequency (HF) loss, no amplification: SII = 0.6435. With ½-gain rule amplification: SII = 0.8453.
Speech Intelligibility Index (SII) SII is a useful tool, but only accounts for audibility of the signal. It doesn’t account for context effects; it doesn’t account for loss of frequency selectivity or distortion; and it doesn’t do well with more severe or profound losses (although the new standard does try to utilise the fact that, in normal listeners, speech intelligibility drops off at high levels).
Reduction in Frequency Selectivity Broadened auditory filters result in a "smoothed" basilar membrane excitation pattern. This effectively smooths the representation of the spectrum of the speech.
Excitation pattern for vowel /i/ with broadened filters. The smoothed excitation pattern means that the internal representation of formants is also smoothed.
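The smoothing can be illustrated with a toy excitation pattern. This sketch assumes symmetric rounded-exponential filters whose normal bandwidth follows the ERB formula of Glasberg and Moore (1990), a made-up /i/-like spectrum with three formant peaks, and a hypothetical `broadening` factor that widens every filter; real impaired filters are typically asymmetric.

```python
import math


def erb_hz(centre_hz):
    """Equivalent rectangular bandwidth of a normal auditory filter."""
    return 24.7 * (4.37 * centre_hz / 1000.0 + 1.0)


def excitation_pattern(freqs_hz, power, broadening=1.0):
    """Output power of a filter centred at each frequency in freqs_hz."""
    pattern = []
    for fc in freqs_hz:
        # Slope parameter of the rounded-exponential weighting function.
        p = 4.0 * fc / (broadening * erb_hz(fc))
        total = 0.0
        for f, pw in zip(freqs_hz, power):
            g = abs(f - fc) / fc
            total += (1.0 + p * g) * math.exp(-p * g) * pw
        pattern.append(total)
    return pattern


def contrast_db(pattern):
    """Difference in dB between the largest and smallest excitation."""
    return 10.0 * math.log10(max(pattern) / min(pattern))


# Illustrative spectrum: three formant peaks on a low noise floor.
freqs = [100.0 + 50.0 * k for k in range(79)]  # 100 Hz to 4 kHz
formants = (300.0, 2300.0, 3000.0)
power = [
    1e-3 + sum(math.exp(-0.5 * ((f - F) / 80.0) ** 2) for F in formants)
    for f in freqs
]
normal = excitation_pattern(freqs, power, broadening=1.0)
impaired = excitation_pattern(freqs, power, broadening=3.0)
```

Comparing `contrast_db(normal)` with `contrast_db(impaired)` shows the peak-to-valley contrast shrinking as the filters broaden, which is the sense in which the formant representation is smoothed.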
Normal Hearing 20.6 sones
High-Frequency IHC Loss 9.3 sones
High-Frequency OHC Loss 19.0 sones
Simulation of Broadened Filters In impaired listeners, reduced selectivity cannot be separated from recruitment and loss of sensitivity. Instead, reduced selectivity is simulated and presented to normal-hearing listeners. Speech stimuli can be processed to "smear" the spectrum. This processed speech can then be used to see the effect on speech intelligibility in noise. [Audio demos: original speech; asymmetric filter broadening]
Baer and Moore (1993), JASA 94, 1229.
Simulation of Broadened Filters "Smearing" has relatively little effect for speech in quiet. But as signal-to-noise ratio decreases, intelligibility of smeared speech is reduced.
Effects on Speech Perception Having broad auditory filters impairs ability to separate speech from noisy backgrounds. Perhaps the most common complaint of hearing-impaired listeners is understanding speech when there is background noise. Hearing aids cannot compensate for loss of frequency selectivity…
Some More Simulations [Audio demos: the original (unprocessed) speech; sloping loss, mild at low frequencies and severe at high frequencies; linear amplification combined with a sloping loss; two-channel compression amplification with a sloping loss]
Hearing Loss Simulation Software UCL Phonetics Department: http://www.phon.ucl.ac.uk/resource/hearloss/ National Institute for Occupational Safety and Health (US): http://www.cdc.gov/niosh/mining/products/product47.htm
Speech Perception Learning Outcomes: understand the nature of the speech signal; understand how speech is produced; understand the cues that are used by the auditory system to decode the speech signal; understand the problems of variability and context; gain a basic understanding of categorical perception; understand the speech intelligibility index (SII); understand how speech perception is affected by hearing loss.