Speech Perception Chris Plack

Speech Perception Chris Plack PSYC 60041 Auditory Science

Speech Perception Learning Outcomes Understand the nature of the speech signal Understand how speech is produced Understand the cues that are used by the auditory system to decode the speech signal Understand the problems of variability and context Basic understanding of categorical perception Understand the speech intelligibility index (SII) Understand how speech perception is affected by hearing loss

Speech Production

Speech Production The Vocal Tract [Diagram of the vocal tract: tongue, teeth, lips, vocal folds, nasal cavity, soft palate, pharynx, larynx, air from the lungs] Speech is produced when air is forced from the lungs past the vocal folds, causing them to vibrate:

Speech Production The Vocal Tract [vocal tract diagram, as above] Articulators in the nose, throat, and mouth (e.g. the tongue) move into different positions to modify the spectrum of the sound waves and produce different vowel sounds.

Waveforms and Spectra [Figure: waveforms and spectra of the vowels /a/ and /i/, with the formants F1, F2, and F3 marked as peaks in the spectra]

Speech Production The Vocal Tract [vocal tract diagram, as above] Consonants are produced by constrictions of the vocal tract at different positions (e.g. "p" is produced by constriction at the lips, "k" at the back of the mouth).

Speech Production Air is forced out from the lungs. The vocal folds vibrate (voiced sounds) or stay open (unvoiced sounds). The airflow enters the mouth (and the nasal cavity for nasal sounds). Resonances in these cavities result in formants (e.g. in vowel sounds). Consonants are produced by differing degrees of constriction of the vocal tract occurring at different places: narrow constriction gives fricatives; complete closure gives stops (plosives).
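As an informal illustration of this source-filter process, here is a minimal Python sketch (assuming NumPy and SciPy are available): an impulse train at the fundamental frequency acts as a crude glottal source, and a cascade of second-order resonators shapes its spectrum into formants. The formant frequencies and bandwidths are illustrative values for an /a/-like vowel, not measurements.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz)
f0 = 120                         # fundamental frequency of the voice (Hz)
n = int(fs * 0.5)                # half a second of sound

# Source: an impulse train, one pulse per pitch period (a crude glottal source)
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: a cascade of second-order resonators, one per formant.
# Centre frequencies and bandwidths are illustrative values for an /a/-like vowel.
formants = [(700, 130), (1200, 70), (2600, 160)]     # (frequency Hz, bandwidth Hz)
vowel = source
for fc, bw in formants:
    r = np.exp(-np.pi * bw / fs)                      # pole radius from bandwidth
    theta = 2 * np.pi * fc / fs                       # pole angle from centre frequency
    a = [1.0, -2.0 * r * np.cos(theta), r ** 2]       # resonator denominator
    b = [1.0 - r]                                     # rough gain normalisation
    vowel = lfilter(b, a, vowel)
```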

Speech Production So normal speech is a series of complex tones (vowel sounds) interrupted by constrictions of the vocal tract (consonant sounds). In whispered speech the vocal folds do not vibrate regularly; instead, turbulent airflow through the partially open glottis provides a noisy source rather than a regular series of pulses.

Types of Speech Sound

Speech Sounds Speech can be broken down into words, and often further into syllables and phonemes. The phoneme is the smallest unit of speech sound. Phonemes are represented by the "phonetic alphabet" – the International Phonetic Alphabet (IPA). Phonemes describe what is perceived, not a specific acoustic pattern: it is not the case that each phoneme has a fixed acoustic pattern.

Speech Sounds English contains about 40 phonemes. These are represented by IPA symbols and are shown between two slash marks / / to indicate that they are phonetic symbols. IPA web site: http://www.langsci.ucl.ac.uk/ipa/ IPA fonts for PC can be found at: http://www.sil.org/computing/fonts/encore.html

Consonants Consonants can be distinguished in terms of: Voicing: whether or not vocal folds are vibrating in the larynx. Place: site of constriction or closure in the vocal tract. Manner: type of constriction (plosive, fricative, affricate, nasal, approximants).

Consonants Fricatives e.g. /s/, /f/, /ʃ/ and voiced equivalents /z/, /v/, /ʒ/ Plosives e.g. /p/, /t/, /k/ and /b/, /d/, /g/ Affricates (combination of fricative and plosive) e.g. /tʃ/ as in "church" Nasals e.g. /m/, /n/ and /ŋ/ as in "ring" Approximants e.g. /w/, /r/, /j/ (as in "yes") and /l/
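As a rough way of organising the voicing / place / manner classification above, a small lookup table might look like the sketch below; the entries cover only a subset of English consonants and use standard phonetic labels.

```python
# Voicing / place / manner for a few English consonants (illustrative subset)
consonants = {
    "p":  ("voiceless", "bilabial",     "plosive"),
    "b":  ("voiced",    "bilabial",     "plosive"),
    "t":  ("voiceless", "alveolar",     "plosive"),
    "k":  ("voiceless", "velar",        "plosive"),
    "s":  ("voiceless", "alveolar",     "fricative"),
    "z":  ("voiced",    "alveolar",     "fricative"),
    "ʃ":  ("voiceless", "postalveolar", "fricative"),
    "tʃ": ("voiceless", "postalveolar", "affricate"),
    "m":  ("voiced",    "bilabial",     "nasal"),
    "ŋ":  ("voiced",    "velar",        "nasal"),
    "w":  ("voiced",    "labial-velar", "approximant"),
    "j":  ("voiced",    "palatal",      "approximant"),
}

voicing, place, manner = consonants["s"]
print(f"/s/ is a {voicing} {place} {manner}")
```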

Speech Perception

Acoustic Signals and Cues In most of this module we have talked about experiments that are designed to look at a single aspect of sound perception. It has been important to design the experiments so that the only perceptual cues present are those we are investigating. With speech sounds there are multiple cues present, both acoustic and non-acoustic.

Spectrograms When we analyse speech we generally want to see how the spectral content varies through time. To do this, we use spectrograms: Narrow-band spectrogram – fine frequency resolution, but smears timing information Broad-band spectrogram – fine time resolution, poor frequency resolution
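For concreteness, the two kinds of spectrogram differ only in the analysis window length. A minimal sketch using SciPy (the window lengths are arbitrary choices, and the sine wave is a stand-in for a real speech recording):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 150 * t)    # stand-in signal; in practice x would be a speech recording

# Narrow-band spectrogram: long analysis window, fine frequency resolution, smeared timing
f_nb, t_nb, S_nb = spectrogram(x, fs, window='hann', nperseg=1024, noverlap=896)

# Broad-band spectrogram: short analysis window, fine time resolution, poor frequency resolution
f_bb, t_bb, S_bb = spectrogram(x, fs, window='hann', nperseg=128, noverlap=96)
```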

Acoustic Cues [Spectrogram of the phrase "a good speech": frequency against time, with the segments labelled along the time axis] Phonemes can be characterised by their spectro-temporal characteristics.

Acoustic Cues The position and movement of the spectral peaks (formants) can carry a great deal of information about the identity of consonants and vowels. Different vowel sounds have formant peaks at different frequencies. Second formant transitions in particular are important in identifying consonants.

Acoustic Cues Temporal fluctuations are important in consonant perception. Stop consonants can be identified by periods of silence (e.g., “say” vs. “stay”).

Variable Nature of Acoustic Cues Unfortunately, the acoustic waveforms corresponding to an individual phoneme are highly variable. This variability arises in part from co-articulation and the way articulators have to move between particular locations in the vocal tract. Also, speaker accent, sex, age, and individual differences in the shape of the vocal tract (e.g. Michael Jackson vs. Barry White) can have a big influence on the sound associated with a given phoneme.

Variable Nature of Acoustic Cues The same phoneme can be perceived from a variety of acoustic patterns. E.g. the formant transitions following the phoneme /d/: /di/ - second formant rising from 2200 to 2400 Hz /du/ - second formant falling from 1200 to 700 Hz Similarly other vowels result in different acoustic patterns after the same consonant.

Co-articulation In free-flowing speech a vowel sound is not a static pattern of spectral peaks. The sound is constantly changing as the articulators in the vocal tract move between the position for one speech sound and that for the next sound. Hence, the sound waveform corresponding to a phoneme is heavily modified by the nature of the phonemes before and after - this is called “co-articulation.”

Co-articulation

Variable Nature of Acoustic Cues Hence, there is a complex relationship between the acoustic pattern and the phoneme. Liberman et al. (1967) referred to "encoded" and "unencoded" phonemes: encoded phonemes are those whose acoustic patterns show considerable context-dependent restructuring; unencoded phonemes are those for which there is less restructuring.

Redundancy The problems of variability and listening to speech in difficult conditions (e.g., noisy environment, crackly phone line) are made easier because the speech signal contains a lot of redundant information. In other words, there is usually much more information in the signal than we need to identify the utterance. This can be demonstrated by: Sine-wave speech Noise-vocoded speech

Sine-Wave Speech Take natural speech and extract only the frequencies of the first three formants (Remez et al., 1981). Replace these three formants with sinusoids that vary in frequency and amplitude according to the original formants.
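A minimal sketch of the synthesis step, assuming the formant tracks have already been extracted; the tracks below are made-up values, whereas real sine-wave speech uses frequency and amplitude contours measured from natural speech:

```python
import numpy as np

fs = 16000
t = np.arange(int(fs * 0.3)) / fs           # 300 ms

# Hypothetical formant tracks (Hz)
f1 = np.linspace(300, 700, t.size)
f2 = np.linspace(2200, 2400, t.size)        # e.g. a rising second-formant transition
f3 = np.full(t.size, 2800.0)

def formant_tone(f_track, amp):
    # Integrate the instantaneous frequency to get phase, then synthesise the sinusoid
    phase = 2 * np.pi * np.cumsum(f_track) / fs
    return amp * np.sin(phase)

sine_wave_speech = (formant_tone(f1, 1.0) +
                    formant_tone(f2, 0.5) +
                    formant_tone(f3, 0.25))
```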

[Audio demos: Tone 1 alone; Tone 2 alone; Tone 3 alone; All three tones; Original speech]

[Audio demos: sine-wave speech examples, each paired with the original recording]

Noise Vocoded Speech Another example of simple acoustic patterns being perceived as speech is with white noise modulated by the speech amplitude (developed to simulate cochlear implants). Speech is split into a number of frequency channels. The slow amplitude fluctuations (envelope) of these channels are used to modulate a band of noise centred on the frequency of the channel. These channels are re-combined to form a speech signal that has very limited spectral information compared to the original speech.
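A minimal noise-vocoder sketch along these lines (band-split, envelope extraction, envelope-modulated noise, recombination). The channel edges, filter order, and use of the Hilbert envelope are illustrative choices rather than the parameters of any particular cochlear-implant simulation:

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(x, fs, n_channels=4, lo=100.0, hi=7000.0):
    """Band-split the speech, take each channel's envelope, use it to modulate
    noise filtered into the same band, then sum the channels (hi must be below fs/2)."""
    edges = np.geomspace(lo, hi, n_channels + 1)       # log-spaced channel edges
    noise = np.random.randn(len(x))
    out = np.zeros(len(x))
    for f_lo, f_hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f_lo, f_hi], btype='bandpass', fs=fs, output='sos')
        band = sosfilt(sos, x)
        envelope = np.abs(hilbert(band))                # slow amplitude fluctuations of this channel
        out += sosfilt(sos, noise) * envelope           # envelope-modulated noise in the same band
    return out
```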

Noise Vocoded Speech With 16 channels it is quite easy to follow the speech, but with 2 channels we can identify only some of the words (especially once we have heard the original).

[Audio demos, effects of channel number: 1, 2, 4, 8, and 16 channels]

Non-Acoustic Cues For normal speech, perception doesn’t depend solely on acoustic cues. If a highly probable word is replaced in a sentence by a cough, for example, the replaced word is often “heard.” This kind of “filling in” process often occurs in noisy environments. This illustrates the importance of non-acoustic cues.

Non-Acoustic Cues Semantic cues (the meaning of preceding and following words and the subject matter). Syntactic cues (grammatical rules). Circumstantial cues (speaker identity, listening environment, etc.). Visual cues (e.g., lip reading).

Audio-Visual Integration Movements of a speaker's lips and face provide important cues for speech perception. What we hear is heavily influenced by what we see. For example, the McGurk effect.

The McGurk Effect

Is Speech Special?

Is Speech Special? Some researchers have argued that the perception of speech sounds differs from that of non-speech sounds; in particular, that a special "speech mode" is engaged when we listen to speech. In what way does speech require a special decoding mechanism?

Rate of Speech Sounds In rapid speech, as many as 30 phonemes may occur per second. Liberman et al. (1967) argued that this is too fast for the auditory system to resolve. More recent evidence refutes this - with practice we can identify sequences of non-speech sounds where individual items are as short as 10 ms (100 items per second). At these rates individual items aren't recognised - only the overall pattern.

Categorical Perception For highly encoded phonemes, certain small changes to the acoustic signal make little or no change in the perceived phoneme. Other small changes result in perception of a different phoneme.

Categorical Perception Essentially we don't hear changes within a category. But we do hear changes that cross category boundaries. E.g. for synthetic, two-formant speech, where the second formant transition varies in small steps.

[Figure: category boundaries]

Categorical Perception To demonstrate categorical perception we can use both identification and discrimination tasks. Ideally: Identification task would establish where the category boundaries lie Discrimination task should be difficult within categories (about chance level), but easy across boundaries

Example of a categorical perception (identification) task: a date/gate continuum of synthetic speech is generated by computer. The stimuli are presented either in random order or adaptively, and the listener is asked to identify each sound as "date" or "gate."

The percentage of one response ("date") is plotted against the stimulus parameter: at one end of the continuum there is 100% categorisation as "date", at the other end 0% (i.e. 100% "gate").
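The category boundary is typically estimated by fitting a sigmoid to the identification data. A sketch with made-up response proportions along a hypothetical eight-step date/gate continuum:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical proportions of "date" responses at each step of an eight-step continuum
steps = np.arange(1, 9)
p_date = np.array([1.00, 0.98, 0.95, 0.85, 0.30, 0.08, 0.02, 0.00])

def logistic(x, boundary, slope):
    # Falls from 1 ("date") to 0 ("gate") as x passes the category boundary
    return 1.0 / (1.0 + np.exp(slope * (x - boundary)))

(boundary, slope), _ = curve_fit(logistic, steps, p_date, p0=[4.5, 2.0])
print(f"Estimated category boundary at step {boundary:.2f}")
```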

Categorical Perception CP has been taken as evidence for a special speech decoder; however, CP can also occur for non-speech sounds. Miller et al. (1976) used a low-level noise whose onset occurred at various times relative to a more intense buzz. The task was to label the sound as "noise" or "no-noise." Labelling changed abruptly at 16 ms: "noise" when the noise started more than 16 ms before the buzz; "no-noise" when it started less than 16 ms before the buzz. Discrimination was also best across this boundary.

Native Language Magnet Theory Kuhl (1993): exposure to a specific language results in stored representations of phonetic category prototypes for that language. Each prototype acts like a "perceptual magnet," attracting stimuli with similar acoustic patterns so that they are perceived as more like the prototype. Certain perceptual distinctions are minimised (e.g. for sounds close to a prototype) and others maximised (e.g. for sounds lying on opposite sides of a category boundary).

Evidence from Brain Specialisation Different brain regions play a role in the perception of speech and non-speech sounds. The “crossed” pathways are assumed to be more effective than uncrossed (i.e. right ear connected to left hemisphere and vice versa). If two competing stimuli are presented to the two ears: For speech sounds, those presented to the right ear are better identified than those presented to the left. For musical melodies the reverse is true.

Evidence from Brain Specialisation This suggests that speech signals are more readily decoded in the left hemisphere. Studies of deficits in speech perception and production in people with brain lesions support this, e.g. aphasias associated with lesions of Broca's (production) or Wernicke's (reception) areas. However, it could be that the initial processing of speech and non-speech sounds is similar.

A Speech Mode? When listening to sounds with the acoustic characteristics of speech we seem to change to a “speech mode” of listening. Signals where the spectral and temporal characteristics are similar to those of speech are learned more easily.

A Speech Mode? Sine-wave speech is a good example of this. Listeners with no knowledge of the stimuli just heard beeps and whistles. However, listeners asked to transcribe the “strange speech” were able to do so. Once “speech mode” is engaged it is difficult to reverse the process.

Stevens and House (1972): "although one can imagine an acoustic continuum in which sounds bear a closer and closer relation to speech, there is no such continuum as far as the perception of the sound is concerned – the perception is dichotomous. Sounds are perceived either as linguistic or nonlinguistic entities." Stevens & House (1972), Speech Perception, in Foundations of Modern Auditory Theory, ed. J. V. Tobias.

Effects of Hearing Loss

Effects of Cochlear Hearing Loss on Speech Perception Hearing loss may affect the audibility of some speech components. In addition, a reduction in frequency selectivity will degrade the representation of spectral features in the speech, and will also impair the separation of the target speech from background speech or noise.

Loss of Audibility If we plot the speech spectrum on an audiogram we can easily see which speech sounds would be affected by different degrees of hearing loss. For example, a high-frequency loss with 0 dB HL up to 2 kHz and 70 dB HL above that would disrupt the ability to distinguish between stops (or plosives: /b/, /d/, /g/, /p/, /t/, /k/) and fricatives (e.g. /f/, /θ/, /s/, /ʃ/, etc.).

Speech Intelligibility Index (SII) If audibility were all that was needed for speech intelligibility, we should be able to predict intelligibility from the audiogram and the speech spectrum. We would need to know: How much of each part of the speech spectrum is audible How important each part of the spectrum is to speech intelligibility Multiply the two and sum across frequency This is what the SII does

Calculations can be done for: Octave bands 1/3-octave bands Critical bands In order to do this we need: The spectrum, expressed in levels for each band Weightings for speech importance Pure-tone threshold and internal or background noise (to calculate the signal-to-noise ratio)
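A toy octave-band calculation in this spirit is sketched below. The band-importance weights, speech levels, and thresholds are illustrative placeholders rather than the values tabulated in the ANSI S3.5 standard, and audibility is reduced to a simplified 30-dB speech dynamic range:

```python
# Octave-band SII sketch: audibility (0-1) in each band, multiplied by that band's
# importance weight and summed. All numbers below are illustrative placeholders.
bands_hz     = [250,  500,  1000, 2000, 4000, 8000]
importance   = [0.08, 0.14, 0.22, 0.28, 0.20, 0.08]    # speech-importance weights (sum to 1)
speech_level = [50,   55,   52,   48,   42,   35]       # long-term speech spectrum level, dB SPL
threshold    = [10,   10,   15,   60,   70,   75]       # listener threshold or noise floor, dB SPL

sii = 0.0
for imp, sp, thr in zip(importance, speech_level, threshold):
    # Audibility: fraction of a simplified 30-dB speech dynamic range
    # (from 15 dB above to 15 dB below the long-term level) that lies above threshold
    audible = min(max((sp + 15 - thr) / 30.0, 0.0), 1.0)
    sii += imp * audible

print(f"SII = {sii:.2f}")   # lower for audiograms with more high-frequency loss
```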

SII for Octave Bands [Figure: octave-band importance weights and the average spectrum of speech]

Low-frequency parts of speech are less important than mid frequencies.

[Worked examples on successive slides: SII = 0.9958, 0.8692, 0.6047, 0.2183, 0.0]

HF loss, no amplification: SII = 0.6435. With half-gain rule amplification: SII = 0.8453.
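For reference, the half-gain rule sets the hearing-aid gain at each frequency to roughly half the hearing loss at that frequency. A toy sketch with an illustrative audiogram:

```python
# Half-gain rule sketch: prescribed insertion gain = 0.5 x hearing loss in each band.
# The audiogram values are illustrative, not a real patient's.
audiogram_hl = {250: 10, 500: 15, 1000: 25, 2000: 55, 4000: 70, 8000: 75}   # dB HL
gains = {f: 0.5 * hl for f, hl in audiogram_hl.items()}
for f in sorted(gains):
    print(f"{f} Hz: loss {audiogram_hl[f]} dB HL -> gain {gains[f]:.0f} dB")
```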

Speech Intelligibility Index (SII) The SII is a useful tool, but it only accounts for audibility of the signal: Doesn't account for context effects Doesn't account for loss of frequency selectivity / distortion Doesn't do well with more severe or profound losses (although the newer standard does try to account for the fact that, even in normal-hearing listeners, speech intelligibility drops off at high presentation levels)

Reduction in Frequency Selectivity Broadened auditory filters result in a "smoothed" basilar membrane excitation pattern. This effectively smoothes the representation of the spectrum of the speech.

Excitation pattern for vowel /i/ with broadened filters. The smoothed excitation pattern means that the internal representation of formants is also smoothed.

Normal Hearing 20.6 sones

High-Frequency IHC Loss 9.3 sones

High-Frequency OHC Loss 19.0 sones

Simulation of Broadened Filters In impaired listeners, reduced frequency selectivity cannot be separated from recruitment and loss of sensitivity, so reduced selectivity is simulated and the processed stimuli are presented to normal-hearing listeners. Speech can be processed to "smear" the spectrum, and the processed speech then used to measure the effect on speech intelligibility in noise (see the sketch below). Original speech: Asymmetric filter broadening:

Baer and Moore (1993), JASA 94, 1229.
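A rough sketch of spectral smearing in this spirit: each point of a short-term spectrum is replaced by a weighted average over a widened auditory filter. The simple exponential weighting and ERB-scaling factor are stand-ins for the more elaborate filter shapes used by Baer and Moore (1993):

```python
import numpy as np

def smear_spectrum(spectrum_db, freqs_hz, broadening=3.0):
    """Smooth a short-term spectrum with auditory filters whose bandwidths are
    `broadening` times the normal ERB. A rough stand-in for spectral smearing;
    the exponential weighting is a simplification of a roex filter shape."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    power = 10.0 ** (np.asarray(spectrum_db, dtype=float) / 10.0)
    smeared = np.zeros_like(power)
    for i, fc in enumerate(freqs_hz):
        erb = broadening * 24.7 * (4.37 * fc / 1000.0 + 1.0)   # widened ERB (Glasberg & Moore, 1990)
        weights = np.exp(-np.abs(freqs_hz - fc) / (0.25 * erb))
        smeared[i] = np.sum(weights * power) / np.sum(weights)
    return 10.0 * np.log10(smeared)
```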

Simulation of Broadened Filters "Smearing" has relatively little effect for speech in quiet, but as the signal-to-noise ratio decreases, the intelligibility of smeared speech is reduced.

Effects on Speech Perception Having broad auditory filters impairs the ability to separate speech from noisy backgrounds. Perhaps the most common complaint of hearing-impaired listeners is difficulty understanding speech when there is background noise. Hearing aids cannot compensate for loss of frequency selectivity…

Some More Simulations The original (unprocessed) speech: Sloping loss, mild at low frequencies and severe at high frequencies: Linear amplification combined with a sloping loss: Two-channel compression amplification & sloping loss:

Hearing Loss Simulation Software UCL Phonetics Department http://www.phon.ucl.ac.uk/resource/hearloss/ National Institute of Occupational Safety & Health (US) http://www.cdc.gov/niosh/mining/products/product47.htm

Speech Perception Learning Outcomes Understand the nature of the speech signal Understand how speech is produced Understand the cues that are used by the auditory system to decode the speech signal Understand the problems of variability and context Basic understanding of categorical perception Understand the speech intelligibility index (SII) Understand how speech perception is affected by hearing loss