CSE 551/651: Structure of Spoken Language Lecture 13: Theories of Human Speech Perception; Formant Based Speech Synthesis; Automatic Speech Recognition.


CSE 551/651: Structure of Spoken Language Lecture 13: Theories of Human Speech Perception; Formant Based Speech Synthesis; Automatic Speech Recognition John-Paul Hosom Fall 2010

Comprehensive Theories of Human Speech Perception
Two main theories deal with decoding the speech signal into words:
• The Motor Theory of Speech Perception
• Cue-Based Speech Recognition
These theories overlap in some respects and conflict in others. Also, some theories are more general and some are more specific. No theory addresses all aspects of human speech perception in enough detail to directly enable computer modeling of the theory.

The Motor Theory of Speech Perception
Alvin M. Liberman et al., originally 1963, revised in 1985
Motor Theory has 3 principles:
(a) We perceive the speaker’s intended phonetic gestures; these gestures are the invariant parts of the communication process.
(b) Speech perception is automatically mediated by an innate, specialized module in the brain to which we have no conscious access and which has unique speech-specific properties (e.g. Wernicke’s and Broca’s areas).
(c) Speech production and perception share a common link and a common processing strategy.
Essentially, listeners interpret the acoustic signal in terms of the articulatory states and movements that would produce that signal. The listeners reconstruct the intended gestures of phonetic categories.

The Motor Theory of Speech Perception
• Each gesture contributes to part of a “pure” phoneme; coarticulation is thus not essential to the linguistic structure and is not represented in the abstract gestures. Coarticulation is “removed” from the signal using some unknown process.
• Perception and production are directly linked; we cannot perceive that which we cannot pronounce. Yet the link is innate, not wholly learned. “Until and unless the child (tacitly) appreciates the gestural source of the sounds, he can hardly be expected to perceive, or ever learn to perceive, a phonetic structure.” (Liberman and Mattingly, 1985)
• The percept (intended gestures) is entirely different from the auditory signal (speech waveform). How the mapping is done is unclear, but it uses a specialized module.

The Motor Theory of Speech Perception
• The reconstruction of intended gestures occurs in a specialized phonetic module in the brain, and is an innate process to which we have no conscious access. Translation from acoustics to gestures is automatic and direct, and the “user” has no control over this process.
• The initial Motor Theory was developed because it was found that there is no simple relationship between the speech signal (spectrogram) and perceived speech. Similar signals can yield different percepts, while different signals can yield the same percept. MT postulates a “higher” level of invariance.
• For example, the F2 transition in /d iy/ has a different slope than the F2 transition in /d uw/, but both yield the perceived consonant /d/. The gesture, voiced alveolar, is the same.

The Motor Theory of Speech Perception
• The underlying representation is not context-sensitive; this representation was originally thought to be the neural control of the articulators, and is now thought to be the intended articulatory gestures.
• The perception of speech gestures is special, because perception of non-speech sounds is closely tied to the acoustics, whereas speech sounds are perceived more abstractly. Example: categorical perception of consonants.
• Possible method of specialized processing in the module: analysis-by-synthesis, using an “internal, innately specified vocal-tract synthesizer … that incorporates complete information about the anatomical and physiological characteristics of the vocal tract and also about the articulatory and acoustic consequences of linguistically significant gestures” (Liberman and Mattingly, 1985)

The Motor Theory of Speech Perception
Criticisms of the Motor Theory:
• The ability to read spectrograms argues against an inherently biological, innate, specialized module in the brain.
• If formant frequencies are inverted on the frequency axis, the stimuli are heard as non-speech sounds. However, these non-speech sounds can be classified after training. “The categorization of speech cues [is] not necessarily due to the operation of a special motor reference process, because the same results can be obtained, after proper auditory training, for stimulus differences that are not producible by speaking” (Lane, 1965)
• Two speech sounds may be perceived as the same even when produced with different articulatory gestures, e.g. “r”-colored vowels may be produced with the tongue tip raised or with a raised tongue position further back in the mouth; /uw/ may be produced with lip rounding or with a lowered larynx position (Ladefoged p.275)

Cue-Based Theories of Speech Recognition
Cole and Scott (1974), Stevens and Blumstein (1978)
• Cue-Based Theories claim that there are specific cues in the speech signal that are directly correlated with phonetic perception.
• These cues may be context dependent or context independent.
• The various cues are integrated to arrive at a single phonetic representation of the speech signal.
• There may be speech-specific processes in the brain, but there is nothing in the speech-recognition process that can be accomplished only for speech signals (nothing inherently “special” about speech).
• There is no requirement that ties speech recognition to speech production.

Cue-Based Theories of Speech Recognition
• Cole and Scott provided evidence for invariant cues in all consonant phonemes. The invariant cues may uniquely identify the phoneme (as in the case of /s/, /z/, /sh/, /zh/, /ch/, and /jh/), or they may be used in conjunction with other cues to identify the phoneme (as in the case of /f/, /th/, /v/, /dh/, /m/, /n/, and /ng/).
• In the case of plosives, the voicing distinction (/p/ vs. /b/) is signaled by invariant cues, while the place of articulation (/b/ vs. /d/) involves either invariant cues or formant-transition cues.
• In addition to phonetic cues, the speech signal contains cues about prosody (pitch, rate, energy, timing).
• This model is more computationally tractable than the Motor Theory.

Cue-Based Theories of Speech Recognition
• Stevens and Blumstein’s work focused on cues for plosive perception, and found that place of articulation can be uniquely identified based on the gross spectral shape at the burst release and at the onset of voicing.
• Eimas and Corbit (1973) found evidence for feature detectors in speech processing. In particular, voice onset time (VOT) can be used to distinguish voiced from unvoiced plosives. The existence of a VOT feature detector would imply that such a detector can, like visual detectors, become fatigued with repeated presentations, causing a shift in the perceived boundary between voiced and unvoiced consonants. Such a shift was found.
• The existence of specific feature detectors lends support to a cue-based theory over a motor theory.

Cue-Based Theories of Speech Recognition
Agreement of Motor Theory and Cue-Based Theories:
Motor Theory and Cue-Based Theories agree on the wide variety of cues in the speech signal: “every ‘potential’ cue − that is, each of the many acoustic events peculiar to a linguistically significant gesture − is an actual cue. (For example, every one of 18 potential cues to the voicing distinction in medial position has been shown to have some perceptual value; Lisker, 1978.) All possible cues have not been tested, … but no potential cue has yet been found that could not be shown to be an actual one.” (Liberman and Mattingly, 1985)

Cue-Based Theories of Speech Recognition
Criticisms of Cue-Based Theories:
Motor Theory claims that there are too many cues: “Putting together all the generalizations about the multiplicity and variety of acoustic cues, we should conclude that there is simply no way to define a phonetic category in purely acoustic terms. A complete list of cues − surely a cumbersome matter at best − is not feasible… But even if it were possible to compile such a list, the result would not repay the effort, because none of the cues on the list could be deemed truly essential” (Liberman and Mattingly, 1985)
Therefore, MT concludes that the cues are too numerous and varied to serve as the basis for perceiving speech, and that a more “direct” approach, such as analysis-by-synthesis, is required.

Text-to-Speech (TTS) Synthesis
Generating a Waveform: Formant Synthesis
• Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform.
• Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a “buzzy” quality most likely due to the glottal source model.
• Formant synthesis can sound nearly identical to a natural utterance if details of the prosody, glottal source, and formants are well modeled.
[Audio demo: original, resynthesized, slowed down (“please say the yeed word again”)] Demo from Alex Kain,

Text-to-Speech (TTS) Synthesis
Formant TTS Synthesis: Architecture
Formant-synthesis systems contain a number of sound sources, which are passed to filters connected either in parallel or in cascade (series). Each filter corresponds to one formant (resonance) or anti-resonance.
[Figure: formant synthesizer architecture (From Yamaguchi, 1993)]
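As a rough illustration of the cascade arrangement described above, the following Python sketch filters an impulse-train source through a series of two-pole digital resonators, one per formant. The resonator difference equation is the standard two-pole form often used in formant synthesizers; the function names, the impulse-train source (a crude stand-in for a glottal source model), and the formant values are assumptions chosen for illustration, not taken from the lecture.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs_hz):
    """Pass a signal through one two-pole resonator (one formant)."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, fs_hz, n_samples):
    """Very crude voiced source: one unit impulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def synthesize_vowel(formants, f0_hz=100.0, fs_hz=10000, dur_s=0.3):
    """Cascade synthesis: the output of each resonator feeds the next."""
    signal = impulse_train(f0_hz, fs_hz, int(round(dur_s * fs_hz)))
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs_hz)
    return signal

# Assumed (frequency, bandwidth) pairs in Hz, loosely /aa/-like.
AA_FORMANTS = [(700.0, 90.0), (1200.0, 110.0), (2600.0, 170.0)]
samples = synthesize_vowel(AA_FORMANTS)
```

Because the source here is a bare impulse train, the result would be expected to show exactly the "buzzy" quality that formant synthesis is criticized for; a more realistic glottal pulse model would replace `impulse_train`.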

Text-to-Speech (TTS) Synthesis
Formant systems: Rule-Based Synthesis
• For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing the speech of a single person.
• The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.
• The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory.
• Duration, pitch, and energy rules are applied.
Result: something like this: [audio example]
(This example is quite old. It does not represent the state of the art, but is a “typical” example of what formant-based synthesis sounds like. Another example is S. Hawking’s voice.)
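One common way to write a locus-style coarticulation rule is to place the formant onset partway between a consonant-specific locus frequency and the vowel target. The Python sketch below uses that form; it is a simplified illustration, not Klatt's actual rule set, and the locus, vowel targets, and the factor `k` are assumed values. It also shows how a single /d/ locus can give the differently sloped F2 transitions of /d iy/ and /d uw/ mentioned earlier.

```python
def onset_frequency(locus_hz, vowel_target_hz, k):
    """Formant value at vowel onset: between the locus (k=0) and the target (k=1)."""
    return locus_hz + k * (vowel_target_hz - locus_hz)

def transition(locus_hz, vowel_target_hz, k, n_frames):
    """Linear formant trajectory from the onset value to the vowel target."""
    start = onset_frequency(locus_hz, vowel_target_hz, k)
    step = (vowel_target_hz - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]

# Assumed illustrative values, in Hz.
D_F2_LOCUS = 1800.0                        # F2 locus for alveolar /d/
F2_TARGETS = {"iy": 2300.0, "uw": 900.0}   # vowel F2 targets
K = 0.5                                    # assumed degree of coarticulation

for vowel, target in F2_TARGETS.items():
    print("d", vowel, transition(D_F2_LOCUS, target, K, n_frames=5))
```

With these numbers, F2 rises into /iy/ and falls into /uw/, even though both trajectories are derived from the same locus.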

Automatic Speech Recognition (ASR) Hidden Markov Models

Automatic Speech Recognition (ASR)
HMM-Based System Characteristics
• The system is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs.
• At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or an artificial neural network (ANN). The classifier uses a fixed time window usually extending no more than 60 msec. Each frame is typically classified into each phoneme in a particular left and right context, e.g. /y−eh+s/, and as the left, middle, or right region of that context-dependent phoneme (3 states per phoneme).
• The probability of transferring from one state to the next is independent of the observed (test) speech utterance, being computed over the entire training corpus.
• The Viterbi search determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
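The Viterbi search mentioned above can be sketched in a few lines of Python. This toy version only recovers the most likely state sequence for a single three-state (left, middle, right) phoneme model; a real recognizer would also search over a pronunciation lexicon and language model. All probabilities below are made-up illustrative values, and the per-frame likelihoods stand in for the output of a GMM or ANN classifier.

```python
import math

NEG_INF = float("-inf")

def log_prob(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def viterbi(frame_loglik, log_trans, log_init):
    """Return (best state path, its log score).

    frame_loglik[t][s] is the log-likelihood of frame t under state s.
    log_trans[p][s] is the log-probability of moving from state p to state s.
    """
    n_states = len(log_init)
    score = [log_init[s] + frame_loglik[0][s] for s in range(n_states)]
    backpointers = []
    for frame in frame_loglik[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at this frame.
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[prev] + log_trans[prev][s] + frame[s])
            pointers.append(prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Assumed left-to-right model: each state can repeat or advance by one.
TRANS = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
INIT = [1.0, 0.0, 0.0]
# Assumed per-frame state likelihoods for six frames.
FRAMES = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.7, 0.1],
          [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]

path, score = viterbi(
    [[log_prob(p) for p in frame] for frame in FRAMES],
    [[log_prob(p) for p in row] for row in TRANS],
    [log_prob(p) for p in INIT],
)
print(path)
```

Note that the transition table is fixed in advance and the same for every utterance, which is the independence property described in the third bullet above.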

Automatic Speech Recognition (ASR)
Issues with HMMs:
• Independence is assumed between frames.
• The implicit duration model for phonemes is exponential decay, whereas phoneme durations actually have Gamma distributions.
• Independence is required between features within one frame for GMM classification (not so for ANN classification).
• All frames of speech contribute equally to the final result.
• Duration is not used in phoneme classification.
• Duration is modeled using a priori averages over the entire training set. There is no modeling of relative duration.
• The language model uses the probability of word N given words N−1, N−2, etc. (bigram, trigram, etc. language model); infrequently occurring word combinations are poorly recognized (e.g. “black Monday”, a stock-market ‘crash’ in 1987).
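The duration issue can be made concrete with a small Python sketch. A state with self-loop probability a is occupied for exactly d frames with probability a^(d−1)(1−a), which always peaks at d = 1 and decays from there; a Gamma density with shape greater than 1 instead peaks at some longer duration. The self-loop probability and Gamma parameters below are assumptions picked so that both models have a mean of 10 frames.

```python
import math

def geometric_duration_pmf(d_frames, self_loop_prob):
    """P(staying exactly d frames in a state with this self-loop probability)."""
    return self_loop_prob ** (d_frames - 1) * (1.0 - self_loop_prob)

def gamma_pdf(x, shape, scale):
    """Gamma density with the given shape and scale parameters."""
    return (x ** (shape - 1) * math.exp(-x / scale)) / (math.gamma(shape) * scale ** shape)

SELF_LOOP = 0.9                       # assumed; mean duration 1 / (1 - 0.9) = 10 frames
GAMMA_SHAPE, GAMMA_SCALE = 4.0, 2.5   # assumed; mean = shape * scale = 10 frames

durations = range(1, 41)
hmm_probs = [geometric_duration_pmf(d, SELF_LOOP) for d in durations]
gamma_probs = [gamma_pdf(d, GAMMA_SHAPE, GAMMA_SCALE) for d in durations]

hmm_mode = durations[hmm_probs.index(max(hmm_probs))]
gamma_mode = durations[gamma_probs.index(max(gamma_probs))]
print("most likely duration under the HMM self-loop (frames):", hmm_mode)
print("most likely duration under the Gamma model (frames):", gamma_mode)
```

Even with matched means, the HMM's implicit model treats a one-frame phoneme as the single most likely outcome, which is the mismatch the slide points to.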

Automatic Speech Recognition (ASR)
Why is the HMM the Dominant Technique for ASR?
• well-defined mathematical structure
• does not require expert knowledge about the speech signal (more people study statistics than study speech)
• errors in analysis don’t propagate and accumulate
• does not require prior segmentation
• does not require a large number of templates
• results are usually the best or among the best

ASR Technology vs. Spectrogram Reading
HMM-Based ASR:
• frame based; no identification of landmarks in the speech signal
• duration of phonemes not identified until end of processing
• all frames are equally important
• “cues” are completely unspecified, learned by training
• coarticulation model = context-dependent phoneme models
Spectrogram Reading:
• first identify landmarks in the signal
  ◦ Where’s the vowel? Is that change in energy a plosive?
• identify change over the duration of a phoneme, relative durations
  ◦ Is that formant movement a diphthong or coarticulation?
• identify activity at phoneme boundaries
  ◦ F2 goes to 1800 Hz at onset of voicing
  ◦ voicing continues into frication, so it’s a voiced fricative
• specific cues to phoneme identity
  ◦ 1800 Hz implies alveolar, F3 near 2000 Hz implies retroflex
• coarticulation model = tends toward locus theory

ASR Technology vs. Spectrogram Reading
HMM-Based ASR:
• frame based; no identification of landmarks in the speech signal
• duration of phonemes not identified until end of processing
• all frames are equally important
• “cues” are completely unspecified, learned by training
• coarticulation model = context-dependent phoneme models
Spectrogram Reading and Human Speech Recognition:
• first identify landmarks in the signal
  ◦ Humans are thought to have landmark (e.g. plosive) detectors
• identify change over the duration of a phoneme, relative durations
  ◦ Humans are very sensitive to small changes, especially at vowel/consonant boundaries
• identify activity at phoneme boundaries
  ◦ The transition into the vowel is the most important region for human speech perception
• specific cues to phoneme identity
  ◦ Humans use a (large) set of specific cues, e.g. VOT
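To make "coarticulation model = context-dependent phoneme models" concrete, the Python sketch below expands a phoneme sequence into context-dependent labels of the /y−eh+s/ kind, and then into three sub-phonetic states per label. The ASCII `left-mid+right` spelling, the `sil` boundary symbol, and the function names are assumptions for illustration. The point is that the HMM captures coarticulation only by giving each phoneme-in-context its own model, rather than through an explicit rule such as locus theory.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phoneme with its left and right neighbours."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{left}-{mid}+{right}"
        for left, mid, right in zip(padded, padded[1:], padded[2:])
    ]

def to_states(triphones, regions=("left", "middle", "right")):
    """Expand each context-dependent phoneme into its three regions (states)."""
    return [f"{tri}[{region}]" for tri in triphones for region in regions]

triphones = to_triphones(["y", "eh", "s"])
print(triphones)             # ['sil-y+eh', 'y-eh+s', 'eh-s+sil']
print(to_states(triphones))  # 9 state labels, 3 per triphone
```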

The Structure of Spoken Language
Final Points:
• Speech is complex! It is not as simple as a “sequence of phonemes”.
• There is structure in speech, related to broad phonetic categories.
• Identifying formant locations and movement is important.
• Duration is important even for phoneme identity.
• Phoneme boundaries are important.
• There are numerous cues to phoneme identity.
• Little is understood about how humans process speech.
• Current ASR technology is incapable of accounting for all the information that humans use in reading spectrograms, and what is known about human speech processing is often not used. This implies (but does not prove) that current technology may be incapable of reaching human levels of performance.
Speech is complex!