CSE 551/651: Structure of Spoken Language Lecture 13: Theories of Human Speech Perception; Formant Based Speech Synthesis; Automatic Speech Recognition John-Paul Hosom Fall 2010
Comprehensive Theories of Human Speech Perception Two main theories deal with decoding the speech signal into words: the Motor Theory of Speech Perception and Cue-Based Speech Recognition. These theories overlap in some respects and conflict in others. Also, some theories are more general and some are more specific. No theory addresses all aspects of human speech perception in sufficient detail to directly enable computer modeling of the theory.
The Motor Theory of Speech Perception Alvin M. Liberman et al., originally 1963, revised in 1985. Motor Theory has 3 principles: (a) We perceive the speaker’s intended phonetic gestures; these gestures are the invariant parts of the communication process. (b) Speech perception is automatically mediated by an innate, specialized module in the brain to which we have no conscious access and which has unique speech-specific properties (e.g. Wernicke’s and Broca’s areas). (c) Speech production and perception share a common link and a common processing strategy. Essentially, listeners interpret the acoustic signal in terms of the articulatory states and movements that would produce that signal. The listeners reconstruct the intended gestures of phonetic categories.
The Motor Theory of Speech Perception Each gesture contributes to part of a “pure” phoneme; coarticulation is thus not essential to the linguistic structure and is not represented in the abstract gestures. Coarticulation is “removed” from the signal using some unknown process. Perception and production are directly linked; we cannot perceive that which we cannot pronounce. Yet the link is innate, not wholly learned. “Until and unless the child (tacitly) appreciates the gestural source of the sounds, he can hardly be expected to perceive, or ever learn to perceive, a phonetic structure.” (Liberman and Mattingly, 1985) The percept (intended gestures) is entirely different from the auditory signal (speech waveform)… how the mapping is done is unclear, but uses a specialized module.
The Motor Theory of Speech Perception The reconstruction of intended gestures occurs in a specialized phonetic module in the brain, and is an innate process to which we have no conscious access. Translation from acoustics to gestures is automatic and direct, and the “user” has no control over this process. Initial Motor Theory developed because it was found that there is no simple relationship between speech signal (spectrogram) and perceived speech. Similar signals can yield different percepts, while different signals can yield the same percept. MT postulates a “higher” level of invariance. For example, the F2 transition in /d iy/ has a different slope than the F2 transition in /d uw/, but both yield the perceived consonant /d/. The gesture, voiced alveolar, is the same.
The Motor Theory of Speech Perception The underlying representation is not context-sensitive; this representation was originally thought to be the neural control of the articulators, and is now thought to be the intended articulatory gestures. The perception of speech gestures is special, because perception of non-speech sounds is closely tied to the acoustics, whereas speech sounds are perceived more abstractly. Example: categorical perception of consonants. Possible method of specialized processing in the module: analysis-by-synthesis, using an “internal, innately specified vocal-tract synthesizer … that incorporates complete information about the anatomical and physiological characteristics of the vocal tract and also about the articulatory and acoustic consequences of linguistically significant gestures” (Liberman and Mattingly, 1985).
The Motor Theory of Speech Perception Criticisms of the Motor Theory: The ability to read spectrograms argues against an inherently biological, innate, specialized module in the brain. If formant frequencies are inverted on the frequency axis, the stimuli are heard as non-speech sounds. However, these non-speech sounds can be classified after training. “The categorization of speech cues [is] not necessarily due to the operation of a special motor reference process, because the same results can be obtained, after proper auditory training, for stimulus differences that are not producible by speaking” (Lane, 1965). Two speech sounds may be perceived as the same even when produced with different articulatory gestures, e.g. “r”-colored vowels may be produced with the tongue tip raised or with a raised tongue position further back in the mouth; /uw/ may be produced with lip rounding or with a lowered larynx position (Ladefoged p.275).
Cue-Based Theories of Speech Recognition Cole and Scott (1974), Stevens and Blumstein (1978). Cue-Based Theories claim that there are specific cues in the speech signal that are directly correlated with phonetic perception. These cues may be context dependent or context independent. The variety of cues is integrated to arrive at a single phonetic representation of the speech signal. There may be speech-specific processes in the brain, but there is nothing in the speech-recognition process that can be accomplished only for speech signals (nothing inherently “special” about speech). There is no requirement that ties speech recognition to speech production.
Cue-Based Theories of Speech Recognition Cole and Scott provided evidence for invariant cues in all consonant phonemes. The invariant cues may uniquely identify the phoneme (as in the case of /s/, /z/, /sh/, /zh/, /ch/, and /jh/), or they may be used in conjunction with other cues to identify the phoneme (as in the case of /f/, /th/, /v/, /dh/, /m/, /n/, and /ng/). In the case of plosives, the voicing distinction (/p/ vs. /b/) is signaled by invariant cues, while the place of articulation (/b/ vs. /d/) involves either invariant cues or formant-transition cues. In addition to phonetic cues, the speech signal contains cues about prosody (pitch, rate, energy, timing). This model is more computationally tractable than the Motor Theory.
Cue-Based Theories of Speech Recognition Stevens and Blumstein’s work focused on cues for plosive perception, and found that place of articulation can be uniquely identified based on the gross spectral shape at the burst release and at the onset of voicing. Eimas and Corbit (1973) found evidence for feature detectors in speech processing. In particular, voice onset time (VOT) can be used to distinguish voiced from unvoiced plosives. The existence of a VOT feature detector would imply that such a detector can, like visual detectors, become fatigued with repeated presentations, causing a shift in the perceived boundary between voiced and unvoiced consonants. Such a shift was found. The existence of specific feature detectors lends support to a cue-based theory over a motor theory.
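The VOT cue and the boundary shift described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the logistic form, the 25 ms boundary, the slope, and the size and direction of the adaptation shift are all hypothetical illustration values, not data from Eimas and Corbit.

```python
import math

def prob_voiceless(vot_ms, boundary_ms=25.0, slope=0.5):
    """Probability of hearing a voiceless plosive (e.g. /p/) at a given VOT.

    A steep logistic curve mimics a categorical boundary.
    boundary_ms and slope are hypothetical, not measured values.
    """
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

def classify(vot_ms, boundary_ms=25.0):
    """Label a stimulus as voiced (/b/) or voiceless (/p/) from its VOT alone."""
    return "/p/" if prob_voiceless(vot_ms, boundary_ms) >= 0.5 else "/b/"

def adapted_boundary(boundary_ms, shift_ms):
    """Model detector fatigue as a simple shift of the category boundary.

    The sign and size of shift_ms are assumptions made for illustration.
    """
    return boundary_ms + shift_ms

# An ambiguous 22 ms stimulus changes category when the boundary moves.
before = classify(22.0, boundary_ms=25.0)
after = classify(22.0, boundary_ms=adapted_boundary(25.0, -5.0))
print(before, after)
```

The point of the sketch is only that a single acoustic cue plus a movable threshold is enough to reproduce a boundary shift; it says nothing about how such a detector is implemented in the auditory system.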
Cue-Based Theories of Speech Recognition Agreement of Motor Theory and Cue-Based Theories Motor Theory and Cue-Based Theories agree on the wide variety of cues in the speech signal: “every ‘potential’ cue − that is, each of the many acoustic events peculiar to a linguistically significant gesture − is an actual cue. (For example, every one of 18 potential cues to the voicing distinction in medial position has been shown to have some perceptual value; Lisker, 1978.) All possible cues have not been tested, … but no potential cue has yet been found that could not be shown to be an actual one.” (Liberman, 1985)
Cue-Based Theories of Speech Recognition Criticisms of Cue-Based Theories: Motor Theory claims that there are too many cues: “Putting together all the generalizations about the multiplicity and variety of acoustic cues, we should conclude that there is simply no way to define a phonetic category in purely acoustic terms. A complete list of cues − surely a cumbersome matter at best − is not feasible… But even if it were possible to compile such a list, the result would not repay the effort, because none of the cues on the list could be deemed truly essential” (Liberman 1985). Therefore, MT concludes that the cues are too numerous and varied to serve as the basis for speech perception, and that a more “direct” approach, such as analysis-by-synthesis, is required.
Text-to-Speech (TTS) Synthesis Generating a Waveform: Formant Synthesis Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform. Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a “buzzy” quality most likely due to the glottal source model. Formant synthesis can sound nearly identical to a natural utterance if details of the prosody, glottal source, and formants are well modeled. [Audio demo: original, resynthesized, and slowed-down versions of “please say the yeed word again”. Demo from Alex Kain.]
Text-to-Speech (TTS) Synthesis Formant TTS Synthesis: Architecture Formant-synthesis systems contain a number of sound sources, whose output is passed through filters connected either in parallel or in cascade (series). Each filter corresponds to one formant (resonance) or anti-resonance. (From Yamaguchi, 1993)
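A cascade arrangement of resonators can be sketched in a few lines. This is a minimal illustration, not the architecture in the figure: it assumes a plain impulse train as the voiced source (a crude glottal model, which is exactly the kind of source that sounds "buzzy"), a standard two-pole digital resonator per formant, and rough illustrative formant and bandwidth values for an /iy/-like vowel. All function names are hypothetical.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (two-pole resonator)."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs):
    """Filter a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, duration_s, fs):
    """Very crude voiced source: one unit impulse per pitch period."""
    period = int(round(fs / f0_hz))
    n = int(round(duration_s * fs))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.3, fs=16000):
    """Cascade synthesis: the output of each resonator feeds the next one."""
    signal = impulse_train(f0_hz, duration_s, fs)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal

# Illustrative (frequency, bandwidth) pairs in Hz; the bandwidths are assumptions.
IY_FORMANTS = [(270.0, 60.0), (2290.0, 90.0), (3010.0, 150.0)]
samples = synthesize_vowel(IY_FORMANTS)
```

A parallel arrangement would instead feed the same source to every resonator and sum the scaled outputs, which requires an explicit amplitude for each formant.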
Text-to-Speech (TTS) Synthesis Formant systems: Rule-Based Synthesis For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing the speech of a single person. The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time. The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory. Duration, pitch, and energy rules are then applied. Result: [audio example] (This example is quite old. It does not represent the state of the art, but is a “typical” example of what formant-based synthesis sounds like. Another example is S. Hawking’s voice.)
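The idea of combining per-phoneme formant targets over time can be sketched as follows. Note that this is not Klatt’s modified locus theory; it substitutes the simplest possible coarticulation model, a linear transition centered on each phoneme boundary. The target table, the 40 ms transition length, and the 5 ms frame step are assumptions chosen for illustration.

```python
# Approximate, illustrative (F1, F2) targets in Hz for three vowels.
TARGETS = {
    "iy": (270.0, 2290.0),
    "aa": (730.0, 1090.0),
    "uw": (300.0, 870.0),
}

def interpolate(anchors, t):
    """Piecewise-linear interpolation between (time_ms, formants) anchor points."""
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return v1
            w = (t - t0) / (t1 - t0)
            return tuple(a + w * (b - a) for a, b in zip(v0, v1))
    return anchors[-1][1]

def formant_track(phonemes, durations_ms, transition_ms=40.0, frame_ms=5.0):
    """Return one (F1, F2) tuple per frame.

    Each phoneme holds its target in its middle portion; formants glide
    linearly across a transition region centered on each boundary.
    Assumes every duration is at least transition_ms.
    """
    half = transition_ms / 2.0
    anchors = []
    start = 0.0
    for i, (ph, dur) in enumerate(zip(phonemes, durations_ms)):
        end = start + dur
        hold_start = start if i == 0 else start + half
        hold_end = end if i == len(phonemes) - 1 else end - half
        anchors.append((hold_start, TARGETS[ph]))
        anchors.append((hold_end, TARGETS[ph]))
        start = end
    n_frames = int(start / frame_ms)
    return [interpolate(anchors, i * frame_ms) for i in range(n_frames)]

track = formant_track(["aa", "iy"], [100.0, 100.0])
```

Each frame of such a track could then set the resonator frequencies of a formant synthesizer; duration, pitch, and energy rules would be applied separately.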
Automatic Speech Recognition (ASR) Hidden Markov Models
Automatic Speech Recognition (ASR) HMM-Based System Characteristics The system is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs. At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or an artificial neural network (ANN). The classifier uses a fixed time window, usually extending no more than 60 msec. Each frame is typically classified into each phoneme in a particular left and right context, e.g. /y−eh+s/, and as the left, middle, or right region of that context-dependent phoneme (3 states per phoneme). The probability of transferring from one state to the next is independent of the observed (test) speech utterance, being computed over the entire training corpus. The Viterbi search determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
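The Viterbi search can be sketched on a toy problem. This example assumes a single phoneme with three left-to-right states (left, middle, right) and made-up per-frame likelihoods standing in for the GMM or ANN output; a real recognizer would search over whole word sequences using a lexicon and a language model. All numbers are hypothetical.

```python
import math

def safe_log(p):
    """log(p), with log(0) defined as -infinity."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state path and its log probability.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_obs[t][s]   : log likelihood of frame t given state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_obs)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_obs[t][s])
            ptr.append(best_prev)
        score = new_score
        backpointers.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for ptr in reversed(backpointers):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Hypothetical 3-state left-to-right model: self-loop or advance by one state.
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
# Hypothetical frame likelihoods for 6 frames (rows) and 3 states (columns).
obs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.7, 0.1],
       [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]

path, best_score = viterbi([safe_log(p) for p in init],
                           [[safe_log(p) for p in row] for row in trans],
                           [[safe_log(p) for p in row] for row in obs])
```

Note that the transition probabilities are fixed numbers that do not depend on the test utterance, and that phoneme durations only become known once the backtrace is complete.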
Automatic Speech Recognition (ASR) Issues with HMMs: Independence is assumed between frames. The implicit duration model for phonemes is exponential decay, whereas phoneme durations actually follow Gamma distributions. Independence is required between features within one frame for GMM classification (not so for ANN classification). All frames of speech contribute equally to the final result. Duration is not used in phoneme classification. Duration is modeled using a priori averages over the entire training set, with no modeling of relative duration. The language model uses the probability of word N given words N−1, N−2, etc. (bigram, trigram, etc. language model); infrequently occurring word combinations are poorly recognized (e.g. “black Monday”, a stock-market ‘crash’ in 1987).
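The duration issue can be made concrete. A state with self-loop probability a is occupied for exactly d frames with probability a^(d−1)(1−a), which always peaks at d = 1 and decays from there, whereas a Gamma density with shape greater than 1 peaks at a nonzero duration. The sketch below compares the two; the self-loop probability and the Gamma shape and scale are hypothetical values chosen only to show the difference in shape.

```python
import math

def geometric_duration_pmf(d, self_loop_prob):
    """P(staying in an HMM state for exactly d frames), d = 1, 2, 3, ..."""
    return self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)

def gamma_pdf(x, shape, scale):
    """Gamma density, used here as a more realistic duration model."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

a = 0.8  # hypothetical self-loop probability
frames = range(1, 1001)
pmf = [geometric_duration_pmf(d, a) for d in frames]

# The expected duration is 1 / (1 - a) frames, but the most likely
# duration under the HMM is always a single frame.
mean_frames = sum(d * p for d, p in zip(frames, pmf))
most_likely_hmm = max(frames, key=lambda d: geometric_duration_pmf(d, a))

# A Gamma model with hypothetical shape 4 and scale 2 peaks at
# (shape - 1) * scale = 6 frames instead.
most_likely_gamma = max(frames, key=lambda d: gamma_pdf(d, 4.0, 2.0))
print(mean_frames, most_likely_hmm, most_likely_gamma)
```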
Automatic Speech Recognition (ASR) Why is the HMM the Dominant Technique for ASR? Well-defined mathematical structure; does not require expert knowledge about the speech signal (more people study statistics than study speech); errors in analysis don’t propagate and accumulate; does not require prior segmentation; does not require a large number of templates; results are usually the best or among the best.
ASR Technology vs. Spectrogram Reading HMM-Based ASR: frame based, with no identification of landmarks in the speech signal; duration of phonemes not identified until end of processing; all frames are equally important; “cues” are completely unspecified, learned by training; coarticulation model = context-dependent phoneme models. Spectrogram Reading: first identify landmarks in the signal (Where’s the vowel? Is that change in energy a plosive?); identify change over the duration of a phoneme and relative durations (Is that formant movement a diphthong or coarticulation?); identify activity at phoneme boundaries (F2 goes to 1800 Hz at onset of voicing; voicing continues into frication, so it’s a voiced fricative); use specific cues to phoneme identity (1800 Hz implies alveolar, F3 2000 Hz implies retroflex); coarticulation model = tends toward locus theory.
ASR Technology vs. Spectrogram Reading HMM-Based ASR: frame based, with no identification of landmarks in the speech signal; duration of phonemes not identified until end of processing; all frames are equally important; “cues” are completely unspecified, learned by training; coarticulation model = context-dependent phoneme models. Spectrogram Reading and Human Speech Recognition: first identify landmarks in the signal (humans are thought to have landmark, e.g. plosive, detectors); identify change over the duration of a phoneme and relative durations (humans are very sensitive to small changes, especially at vowel/consonant boundaries); identify activity at phoneme boundaries (the transition into the vowel is the most important region for human speech perception); use specific cues to phoneme identity (humans use a large set of specific cues, e.g. VOT).
The Structure of Spoken Language Final Points: Speech is complex! It is not as simple as a “sequence of phonemes”. There is structure in speech, related to broad phonetic categories. Identifying formant locations and movement is important. Duration is important even for phoneme identity. Phoneme boundaries are important. There are numerous cues to phoneme identity. Little is understood about how humans process speech. Current ASR technology is incapable of accounting for all information that humans use in reading spectrograms, and what is known about human speech processing is often not used… this implies (but does not prove) that current technology may be incapable of reaching human levels of performance. Speech is complex!