CS 551/651: Structure of Spoken Language Lecture 12: Tests of Human Speech Perception John-Paul Hosom Fall 2008.



Reading

Recommended Reading:
Chapter 5: Strong/Weak Forms, Intonation, and Stress
Chapter 11, pp. 267−275: Balance between Phonetic Forces and “Physical Phonetics”

The Final Exam will be a take-home exam with about 10 questions (same style as the midterm, but may require use of a calculator) and a number of spectrograms to be deciphered. It will be handed out at the end of class on Wednesday, December 3, and will be due back to me by Friday, December 12. It is worth about 30% of your grade.

The final will cover material from Lecture 7 (“Syllable Structure…”) until the end of the term. Material covered on the midterm will probably not be covered on the final. The spectrogram reading exercises will be similar to those on the midterm, but will include the other classes of speech that we’ve been studying (nasals, approximants, and affricates) as well as the usual vowels (and diphthongs), fricatives, and stops.

The Perceptual Second Formant: F2'

Most vowels can be simulated using two resonances. In one study, the lower resonance was fixed at the frequency of a vowel formant, and the subject was asked to vary the higher resonance (F2') until the perceived sound most closely matched the target vowel.

For back vowels and central vowels, subjects adjusted F2' to a frequency near the vowel’s F2. For front vowels except for /iy/, F2' was between the vowel’s F2 and F3; for /iy/, F2' was at or above the vowel’s F3.

[Figure: two-resonance example for target /ih/, with labels 400 Hz and 2200 Hz]

The Perceptual Second Formant: F2'

These findings suggest that when formants are close in frequency, they are integrated so that there is a single “effective” formant equivalent to an average of the two peaks.

It has also been shown that when two or more formants occur within 3 to 3.5 Barks of each other, the perceived vowel quality is equivalent to that of a resonance pattern with a single formant at the center of gravity of those formants.

So, for two formants within 3 Barks, the formant positions affect a center-of-gravity measure of a single perceived resonance; beyond 3 Barks, the two formants are heard as perceptually distinct.

These results suggest that for steady vowels, there is an internal representation that has fairly low resolution.
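The 3-Bark integration rule can be sketched in a few lines. This is only an illustration under stated assumptions: the Hz-to-Bark conversion uses Traunmüller's approximation (the slides do not say which Bark formula was used), the center of gravity is an unweighted mean in Bark (the slides do not specify amplitude weighting), and the function names are hypothetical.

```python
def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def effective_formants(formants_hz, threshold_bark=3.0):
    """Merge neighbouring formants that lie within threshold_bark of each
    other into one peak at their (unweighted) center of gravity in Bark."""
    groups = []
    for f in sorted(formants_hz):
        z = hz_to_bark(f)
        # Compare with the most recent formant in the current group.
        if groups and z - groups[-1][-1] < threshold_bark:
            groups[-1].append(z)
        else:
            groups.append([z])
    return [bark_to_hz(sum(g) / len(g)) for g in groups]


# F1 is far from F2 and stays distinct; F2 and F3 are under 3 Barks apart
# and collapse into a single effective peak between them.
print(effective_formants([400.0, 2200.0, 2900.0]))
```

With these assumed inputs, the sketch returns two peaks: one at 400 Hz and one lying between 2200 Hz and 2900 Hz, which mirrors the F2' behaviour described for front vowels.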

Perception of Coarticulation

In most cases, vowels are affected by coarticulation. In some cases, the vowel does not reach its “target” formant pattern. How does the brain deal with this variation in the signal?

The acoustic effects of coarticulation are referred to by Lindblom as “target undershoot”; the amount of undershoot depends on syllable duration, as well as on speaking style, and varies both across and within speakers.

For vowel perception, Lindblom hypothesized that people compensate for target undershoot and attempt to recover the canonical vowel targets.

In an experiment, synthetic speech stimuli in wVw and yVy contexts were presented to listeners, with the F2 of V varying from high (for an /ih/ vowel) to low (for an /uh/ vowel).

Perception of Coarticulation

The boundary for perception of /ih/ and /uh/ (given the varying F2 values) was different in the wVw context and the yVy context.

In yVy contexts, mid-level values of F2 were heard as /uh/, and in wVw contexts, mid-level values of F2 were heard as /ih/.

[Figure: /ih/ and /uh/ responses in the wVw and yVy contexts]

Perception of Coarticulation

This demonstrates perceptual overshoot; subjects rely on the direction and slope of formant transitions to classify vowels.

Lindblom proposed a Perceptual Compensation model, which “normalizes” formant frequencies based on the formants of the surrounding consonants, the canonical vowel targets, and syllable duration.

However, many factors may account for target undershoot, and so a simple model is not effective in this case. Also, if the model is applied to automatic speech recognition, determining the locations of consonants and vowels is a non-trivial problem.

Are Formant Targets Important??

Strange et al. conducted an experiment in which target information, dynamic information (in formant transitions), and duration information were manipulated independently in CVC syllables.

Given a CVC, the middle region of the V was removed, or the transition regions were removed, or the duration was normalized, or some combination of these was applied.

The CVCs were presented to subjects, who were asked to identify the vowel.

Stimuli with no target information are referred to as “Silent-Center”, stimuli with no transitions as “Centers-Alone”, and time-normalized versions as “Neutral-Duration”.

Are Formant Targets Important??

Identification of Silent-Center vowels was “remarkably accurate”; in some cases, it was as good as identification of the unmodified CVC.

Neutral-Duration Silent-Center vowels were not correctly identified as often as Silent-Center vowels.

However, Neutral-Duration Silent-Center vowels were still correctly identified more often than Neutral-Duration Centers-Alone vowels.

Conclusions:
(1) when vowel transition and duration information is present, recognition is highly accurate;
(2) with no duration information, transition information is more useful than nucleus information for vowel identification;
(3) vowel targets alone are neither sufficient nor necessary.
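The stimulus manipulations described for this experiment can be sketched as simple operations on a sampled waveform. This is a minimal illustration, not the procedure used by Strange et al.: the function names, the use of a plain list of samples, and the assumption that the vowel-center boundaries are already known as sample indices are all hypothetical.

```python
def silent_center(samples, center_start, center_end, gap_len=None):
    """Keep the initial and final transitions and replace the vowel center
    with silence. If gap_len is given, the silent gap is set to that fixed
    length, a crude stand-in for the Neutral-Duration condition."""
    if gap_len is None:
        gap_len = center_end - center_start  # preserve original duration
    return samples[:center_start] + [0.0] * gap_len + samples[center_end:]


def centers_alone(samples, center_start, center_end):
    """Keep only the vowel center and discard both transition regions."""
    return samples[center_start:center_end]


# Toy 10-sample "syllable" whose center occupies samples 3..6.
syllable = [1.0] * 10
print(silent_center(syllable, 3, 7))             # same length, zeros in the middle
print(silent_center(syllable, 3, 7, gap_len=2))  # shortened silent gap
print(centers_alone(syllable, 3, 7))             # center only
```

In practice the boundaries would come from hand labeling or formant tracking, and the cut points would typically be smoothed to avoid clicks; both are outside this sketch.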

Are Formant Targets Important??

In another study, by Furui, CV syllables were truncated either from the beginning or from the end, and perception of the truncated syllable was measured.

In another experiment, both the initial and final sections of the syllable were truncated, with a minimum duration of 40 msec.

The “perceptual critical point” was defined as the truncation position at which there was 80% correct recognition.

Furui found:
(a) The 10 msec around the point of greatest spectral transition is most important for identification of CV syllables, and
(b) The crucial information for both vowels and consonants is in this 10-msec region; consonants can be perceived mainly from the spectral transition into the following vowel.

Are Formant Targets Important??

Tekieli and Cullinan showed that
(a) Given the first 10 msec of an isolated vowel, Place and Height can be distinguished at levels above chance; the tense-lax feature requires 30 msec.
(b) Place of articulation in a CV can be identified based on the 10 msec after release, but the voicing feature requires msec.

In short, timing information is critical for tense-lax and voiced-unvoiced distinctions, and making these distinctions requires about 30 msec of speech; other features can be identified in 10 msec.

Finally, DiBenedetto demonstrated that the F1 trajectory influences perception of front vowels; synthetic syllables in which F1 targets are reached earlier than normal are perceived as lower in Height (/iy/ → /ih/, /ih/ → /eh/, /eh/ → /ae/).

Perception of Place of Articulation

Acoustic cues to perception of place of articulation reside primarily in spectral transitions between phonemes (with some exceptions, notably weak /f, th/ vs. strong /s, sh/).

In perceptual experiments with two synthetic formants, different plosives can be heard by changing the slope of the initial part of F2; a locus of 720 Hz causes perception of /b/, a locus of 1800 Hz causes perception of /d/, and a locus of 3000 Hz often causes perception of /g/.

Different plosives can also be perceived based on the shape of the burst (see next slide).
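As a toy illustration of the locus idea, the three F2 locus values quoted above can be used as reference points for a nearest-locus lookup. The nearest-neighbour rule and the function name are assumptions made for this sketch; the slides only report which percept each locus tends to produce, not how intermediate values are classified.

```python
# F2 locus values (Hz) quoted in the slide for two-formant synthetic stimuli.
F2_LOCI_HZ = {"b": 720.0, "d": 1800.0, "g": 3000.0}


def plosive_from_locus(f2_locus_hz):
    """Return the plosive whose reference locus is closest to the given
    F2 locus (hypothetical nearest-neighbour rule)."""
    return min(F2_LOCI_HZ, key=lambda p: abs(F2_LOCI_HZ[p] - f2_locus_hz))


for locus in (700.0, 1900.0, 2800.0):
    print(locus, "->", plosive_from_locus(locus))
```

A real listener model would also need the burst cues mentioned above, and would treat the /g/ locus as less reliable than the other two.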

Perception of Place of Articulation

Categorical Perception

In labeling speech, we use a fixed symbol set (e.g. Worldbet, IPA) to record what is spoken. But what do we hear? Do we hear discrete symbols, or a continuum of sounds? In other words, is perception categorical, or continuous?

If perception is categorical, then there will be a range of stimuli that yield no perceptual difference, a boundary at which the perception changes, and another range of stimuli with no perceptual difference.

One example of a categorically perceived feature is voice-onset time (VOT); if VOT is long, people hear unvoiced plosives, and if VOT is short, people hear voiced plosives. But people don’t hear ambiguous plosives at the boundary between short and long VOT (30 msec).
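The prediction of strictly categorical perception can be made concrete with an idealized listener. The sketch below assumes a hard boundary at the 30-msec VOT value mentioned above and assumes that two stimuli are discriminated only when they receive different labels; both are simplifications, and the function names are hypothetical.

```python
VOT_BOUNDARY_MS = 30.0  # boundary value quoted in the slide


def label_plosive(vot_ms):
    """Idealized categorical label: long VOT is unvoiced, short is voiced."""
    return "unvoiced" if vot_ms >= VOT_BOUNDARY_MS else "voiced"


def heard_as_different(vot_a_ms, vot_b_ms):
    """A strictly categorical listener distinguishes two stimuli only if
    they fall on opposite sides of the boundary."""
    return label_plosive(vot_a_ms) != label_plosive(vot_b_ms)


# Three pairs with the same 20-msec physical difference.
print(heard_as_different(5.0, 25.0))   # both voiced
print(heard_as_different(25.0, 45.0))  # straddles the boundary
print(heard_as_different(45.0, 65.0))  # both unvoiced
```

Equal physical steps give no perceived difference within a category and a clear difference across the boundary, which is the signature pattern described above.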

Categorical Perception

In another experiment, the F2 transition was varied along a continuous scale, but what was heard were “essentially quantal jumps from one perceptual category to another” (namely /b/, /d/, and /g/) (Moore, p. 283).

On the other hand, small changes in the formants of vowels are easily perceived, leading to perception of “blended” vowels. However, for continuous-speech vowels, perception may be more categorical (Stevens, 1968), and there is evidence that vowels are encoded in memory using distinctive features (when vowels are forgotten, other vowels with similar features are more likely to be remembered; Cole, 1968).

Other evidence for categorical perception comes from second-language learning, e.g. Japanese speakers distinguishing /r/ and /l/ (by the age of 6, perception of speech is altered).

Categorical Perception

However, another study presented subjects with a range of stimuli between /b/, /d/, and /g/, but subjects were asked to respond with either /b/ or /g/. If perception were completely categorical, the responses in the /d/ region should have been random, but in fact there were systematic responses (Barclay, 1972).

Perception may be continuous but have sharp category boundaries (e.g. Massaro, 1998).

Cue Trading

Consider perception of “slit” vs. “split”, with the duration of silence between /s/ and /l/ varied, and the formant transitions of /l/ varied from flat to more like those following /p/.

Long silence durations yield “split”; however, stimuli with formant transitions closer to those of /p/ require less silence to be heard as “split”.

Conclusion: both acoustic cues are integrated by the listener into a single phonemic perception; cues can be “traded” so that more of one cue requires less of another for one type of perception (e.g. “split”).
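One way to picture cue trading is a model in which both cues add into a single decision variable. The sketch below uses a logistic function of a weighted sum; the logistic form, the weights, the bias, and the 0-to-1 scale for the transition cue are all illustrative assumptions and are not fitted to any data from the experiment.

```python
import math


def p_split(silence_ms, p_transition, w_silence=0.1, w_transition=3.0, bias=-6.0):
    """Probability of hearing 'split' rather than 'slit'.
    p_transition runs from 0.0 (flat formants) to 1.0 (fully /p/-like)."""
    x = w_silence * silence_ms + w_transition * p_transition + bias
    return 1.0 / (1.0 + math.exp(-x))


def silence_at_boundary(p_transition, w_silence=0.1, w_transition=3.0, bias=-6.0):
    """Silence duration (msec) at which 'slit' and 'split' are equally likely."""
    return -(bias + w_transition * p_transition) / w_silence


# More /p/-like transitions lower the silence needed to reach the boundary.
for p in (0.0, 0.5, 1.0):
    print(p, round(silence_at_boundary(p), 1))
```

Because the two cues enter the same sum, an increase in one can be offset by a decrease in the other while the predicted percept stays the same, which is the trading relation described above.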

Cue Trading

As Moore stated, “within limits, a change in the setting or value of one cue, which leads to a change in the phonetic percept, can be offset by an opposed setting of a change in another cue so as to maintain the original phonetic percept.” (p. 291)

McGurk Effect:
(1) audio signal contains /ba/, video signal contains /ga/, perceived sound is /da/
(2) audio signal contains /ma/, video signal contains /ta/, perceived sound is /na/

Subjects are not aware of the conflicting cues.

Fuzzy-Logic Model of Perception (FLMP)

Massaro has proposed the FLMP, in which cues are:
(a) evaluated according to their degree of presence; this evaluation returns a high number (up to 1.0) if the feature is present, and a low number (as low as 0.0) if the feature is absent.
(b) matched to a prototype higher-level feature, such as a high degree of lip rounding matching a bilabial sound.
(c) incorporated in a pattern-classification step, to determine which higher-level feature best matches the available cues.

The “best” high-level feature is selected as the actual feature. For example, given the following prototypes:
phn(labial, voiced) = /b/
phn(labial, not_voiced) = /p/
phn(alveolar, voiced) = /d/
phn(alveolar, not_voiced) = /t/

Fuzzy-Logic Model of Perception (FLMP)

And then given measurements of place of articulation along a scale of 0.0 = bilabial, 1.0 = alveolar, and of voicing along a scale of 0.0 = not_voiced, 1.0 = voiced,

then the probability of identifying the sound as /b/ is:

[Equation]

where A is the evidence for alveolar place, and V is the evidence for voicing.

This assumes that all of the cues (pieces of evidence) are independent. It is equivalent to Bayes’ rule if the “fuzzy” scales are interpreted as probabilities.
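The equation itself is not preserved in this transcript. As a hedged sketch, the computation can be written using the relative-goodness form usually associated with the FLMP: the support for each prototype is the product of its feature matches, and the probability of a response is that support divided by the summed support of all prototypes. Under that assumption, the support for /b/ is (1 − A)·V. The function name and dictionary layout below are hypothetical.

```python
# Prototypes from the previous slide: phoneme -> (place, voicing).
PROTOTYPES = {
    "b": ("labial", "voiced"),
    "p": ("labial", "not_voiced"),
    "d": ("alveolar", "voiced"),
    "t": ("alveolar", "not_voiced"),
}


def flmp_probabilities(a, v):
    """a: evidence for alveolar place (0.0 = bilabial, 1.0 = alveolar).
    v: evidence for voicing (0.0 = not_voiced, 1.0 = voiced).
    Returns the predicted identification probability of each phoneme,
    assuming independent cues and a relative-goodness decision rule."""
    place = {"alveolar": a, "labial": 1.0 - a}
    voicing = {"voiced": v, "not_voiced": 1.0 - v}
    support = {ph: place[pl] * voicing[vc] for ph, (pl, vc) in PROTOTYPES.items()}
    total = sum(support.values())
    return {ph: s / total for ph, s in support.items()}


# Mostly bilabial, clearly voiced evidence: /b/ should win.
probs = flmp_probabilities(a=0.2, v=0.9)
print(max(probs, key=probs.get), probs)
```

For these four prototypes the denominator always sums to 1, so the normalization only matters once cue weights or a different prototype set are introduced.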

Fuzzy-Logic Model of Perception (FLMP)

With exponential weights on the pieces of evidence, the predicted probabilities of identification agree well with actual probabilities of identification, varying place of articulation and voice-onset time of synthetic speech sounds.

[Figure: predicted vs. actual probabilities of identification]