Published by Johnathan Powell. Modified over 9 years ago.
1
CS 551/651: Structure of Spoken Language
Lecture 13: Text-to-Speech (TTS) Technology and Automatic Speech Recognition (ASR)
John-Paul Hosom
Fall 2008
2
Text-to-Speech (TTS) Synthesis
Having looked at theories of human speech production and speech perception, we now look at the structures and algorithms currently used to implement these technologies.
Text-to-Speech (TTS) has three main approaches:
(1) formant-based
(2) concatenative
(3) articulatory
All TTS approaches must address:
(a) text analysis: from text, predicting phonemes, stress, and phrase boundaries
(b) prosody: from text-analysis output, predicting the pitch contour, energy contour, and duration of each phoneme
(c) signal processing: given phoneme symbols and timing, generating the speech waveform
3
Text-to-Speech (TTS) Synthesis From a linguistic perspective, there may be many more things to consider… (from Klatt 1987)
4
Text-to-Speech (TTS) Synthesis
Generating a Waveform: Articulatory Synthesis
The vocal tract is divided into a large number of short tubes, as in the electrical transmission line analog (Lecture 11). The tubes are then combined and the resonant frequencies of the combined system are calculated.
(from Sinder, 1999; thesis work with Flanagan, Rutgers)
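As a rough illustration of how tube geometry determines resonances, the sketch below handles only the simplest special case: one uniform, lossless tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. The tube length (17.5 cm), the speed of sound (350 m/s), and the function name are assumptions chosen for illustration; the multi-tube model on this slide is considerably more involved and is not reproduced here.

```python
def uniform_tube_resonances(length_cm, count=3, sound_speed_cm_s=35000.0):
    """Resonant frequencies (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    Assumes a lossless tube with constant cross-sectional area.
    """
    return [
        (2 * n - 1) * sound_speed_cm_s / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


# Assumed vocal-tract length of 17.5 cm (a common textbook value).
print(uniform_tube_resonances(17.5))  # [500.0, 1500.0, 2500.0]
```

Under these assumptions the first three resonances land at 500, 1500, and 2500 Hz, close to the formants of a neutral vowel.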
5
Text-to-Speech (TTS) Synthesis
Generating a Waveform: Articulatory Synthesis
Vocal-tract sources include noise and a “buzz” source for voiced sounds.
Articulatory synthesis is important for validating the Motor Theory of Speech Perception.
Demos from 1976 and circa 1992 (Haskins Labs)
6
Text-to-Speech (TTS) Synthesis
Generating a Waveform: Formant Synthesis
Instead of specifying mouth shapes, formant synthesis specifies the frequencies and bandwidths of resonators, which are used to filter a source waveform.
Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances; it is a “buzzy” quality, most likely due to the glottal source model.
Formant synthesis can sound identical to a natural utterance if the details of the glottal source and formants are well modeled.
Demo: NATURAL SPEECH vs. SYNTHETIC SPEECH (John Holmes, 1973)
7
Text-to-Speech (TTS) Synthesis
Formant TTS Synthesis: Architecture
Formant-synthesis systems contain a number of sound sources, which are passed to filters arranged either in parallel or in cascade (series). Each filter corresponds to one formant (resonance) or anti-resonance.
(From Yamaguchi, 1993)
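The cascade arrangement can be sketched with a chain of second-order digital resonators, one per formant, driven by a crude “buzz” source. The coefficient formulas follow the widely used two-pole resonator design from Klatt-style synthesizers; the impulse-train source, the formant frequencies and bandwidths, the sample rate, and all function names are illustrative assumptions, not values taken from the slide or from any particular system.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a signal through a single two-pole resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Very crude voiced source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n_samples = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def cascade_synthesize(source, formants, sample_rate):
    """Pass the source through each (frequency, bandwidth) resonator in series."""
    out = source
    for freq_hz, bw_hz in formants:
        out = resonate(out, freq_hz, bw_hz, sample_rate)
    return out


SAMPLE_RATE = 16000
# Hypothetical vowel-like formant targets (Hz) and bandwidths (Hz).
FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

source = impulse_train(100.0, 0.05, SAMPLE_RATE)
wave = cascade_synthesize(source, FORMANTS, SAMPLE_RATE)
print(len(wave), "samples synthesized")
```

An impulse train is a deliberately poor glottal model; replacing it with a more realistic pulse shape is where much of the perceived naturalness would have to come from.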
8
Text-to-Speech (TTS) Synthesis
Formant systems: Rule-Based Synthesis
For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing the speech of a single person.
The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.
The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory.
Duration, pitch, and energy rules are applied.
Result: something like this:
9
Text-to-Speech (TTS) Synthesis
Despite great success in copy synthesis, synthesis by rule using formants has severely degraded quality. It’s not clear why:
Problem with glottal source?
Problem with coarticulation and formant transitions?
Problem with prosody?
Formant synthesis was the main TTS technique until the early or mid 1990s, when increasing memory size and CPU speed allowed concatenative synthesis to become a viable approach.
Concatenative synthesis uses recordings of small units of speech (typically a diphone unit: the region from the middle of one phoneme to the middle of the next) and glues these units together to form words and sentences.
With concatenative synthesis, you don’t have to worry about glottal source models or coarticulation, since the synthesis is just a concatenation of different waveforms containing “natural” glottal source and coarticulation.
10
Text-to-Speech (TTS) Synthesis
Concatenative Synthesis: Units
The basic unit for concatenative synthesis is the diphone, e.g.: sil-jh jh-aa aa-n n-sil
More recent TTS research is on using larger units. Issues include:
(a) how to decide what units will be used?
(b) how to select the best unit from a very large database?
With increasing size and variety of units, there is an exponential growth in the database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete. There is a very large number of infrequent events in speech.
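The diphone sequence on this slide (sil-jh jh-aa aa-n n-sil) can be derived mechanically from a phoneme string by pairing each phoneme with its successor and padding both ends with silence. The sketch below is a minimal illustration; the function name and the `sil` padding symbol as a parameter are assumptions, and a real system would go on to look up a recorded waveform for each label.

```python
def phonemes_to_diphones(phonemes, silence="sil"):
    """Return the diphone labels that span a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so N phonemes padded with silence yield N + 1 diphones.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(phonemes_to_diphones(["jh", "aa", "n"]))
# ['sil-jh', 'jh-aa', 'aa-n', 'n-sil']
```

With P phonemes in the inventory there are on the order of P squared possible diphones, which hints at why inventories of larger units grow so quickly.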
11
Text-to-Speech (TTS) Synthesis
Concatenative Synthesis: Signal Processing
Waveform-based: Pitch-Synchronous Overlap Add (PSOLA). Perform pitch modification by changing the spacing of pitch-synchronous units.
Or, use Line Spectral Frequencies (LSFs), which are computed from Linear Predictive Coding (LPC) coefficients.
12
DEMOS
Klatt’s DECtalk (formant synthesis, early 1990s): sample 1
AT&T (large-unit selection): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)
Bell Labs (large-unit selection): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)
OGI (diphone units): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)
13
ASR Technology: Frame-Based Approaches Stochastic Approach includes HMMs and HMM/ANN hybrids
14
ASR Technology: Frame-Based Approaches
15
HMM-Based System Characteristics
The system is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs.
At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or an ANN. The classifier uses a fixed time window, usually extending no more than 60 msec.
Each frame is typically classified into each phoneme in a particular left and right context, e.g. /y−eh+s/, and as the left, middle, or right region of that context-dependent phoneme (3 states per phoneme).
The probability of transferring from one state to the next is independent of the observed (test) speech utterance, being computed over the entire training corpus.
The Viterbi search determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
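The core of the Viterbi search can be sketched at the state level: given per-frame state likelihoods and fixed transition probabilities, it finds the single most likely state path. The toy below uses one three-state, left-to-right phoneme model with made-up probabilities; it is a simplified illustration only, since a real recognizer searches over a network of words built from many such models, and every number and name here is an assumption.

```python
import math


def safe_log(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(log_init, log_trans, log_obs):
    """Most likely state path through an HMM, computed in the log domain.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_obs[t][s]   : log likelihood of frame t under state s
    Returns (best_path, best_log_probability).
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_obs[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical left/middle/right states of one context-dependent phoneme.
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
# Made-up per-frame likelihoods, standing in for GMM or ANN outputs.
obs = [[0.8, 0.1, 0.1],
       [0.7, 0.2, 0.1],
       [0.2, 0.7, 0.1],
       [0.1, 0.6, 0.3],
       [0.1, 0.2, 0.7]]

path, log_prob = viterbi(
    [safe_log(p) for p in init],
    [[safe_log(p) for p in row] for row in trans],
    [[safe_log(p) for p in row] for row in obs],
)
print(path)  # [0, 0, 1, 1, 2]
```

Note that the transition table is fixed in advance and never looks at the test utterance, which is the property the slide points out.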
16
ASR Technology: Frame-Based Approaches
Issues with HMMs:
Independence is assumed between frames.
The implicit duration model for phonemes is Geometric, whereas phonemes actually have Gamma distributions.
Independence is required between features within one frame for GMM classification (not so for ANN classification).
All frames of speech contribute equally to the final result.
Duration is not used in phoneme classification.
Duration is modeled using a priori averages over the entire training set.
The language model uses the probability of word N given words N−1, N−2, etc. (bigram, trigram, etc. language model); infrequently occurring word combinations are poorly recognized (e.g. “black Monday”, a stock-market ‘crash’ in 1987).
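The geometric duration model follows directly from the self-loop structure of an HMM state: with self-transition probability a, the chance of staying exactly d frames is a^(d-1) * (1 - a). The sketch below tabulates that distribution; the value a = 0.9, the 10 msec frame step, and the function names are assumptions for illustration.

```python
def geometric_duration_pmf(self_loop_prob, max_frames):
    """P(duration = d frames) for d = 1..max_frames in a single HMM state."""
    return [
        self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)
        for d in range(1, max_frames + 1)
    ]


def mean_duration_frames(self_loop_prob):
    """Expected number of frames spent in the state: 1 / (1 - a)."""
    return 1.0 / (1.0 - self_loop_prob)


A = 0.9            # hypothetical self-transition probability
FRAME_MSEC = 10.0  # assumed frame step

pmf = geometric_duration_pmf(A, 500)
print("most likely duration (frames):", pmf.index(max(pmf)) + 1)
print("mean duration (msec):", mean_duration_frames(A) * FRAME_MSEC)
```

Whatever value of a is chosen, the most probable duration is always a single frame and the probability only decays from there, whereas a Gamma-shaped distribution peaks at some typical non-trivial duration. That mismatch is the issue the slide describes.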
17
ASR Technology: Frame-Based Approaches
Why is the HMM the Dominant Technique for ASR?
well-defined mathematical structure
does not require expert knowledge about the speech signal (more people study statistics than study speech)
errors in analysis don’t propagate and accumulate
does not require prior segmentation
does not require a large number of templates
results are usually the best or among the best
18
Issues in Developing ASR Systems
Type of Channel
A microphone signal is different from a telephone signal, and a “land-line” telephone signal is different from a cellular signal.
Channel characteristics: pick-up pattern (omni-directional, unidirectional, etc.), frequency response, sensitivity, noise, etc.
Typical channels:
  desktop boom mic: unidirectional, 100 to 16000 Hz
  hand-held mic: super-cardioid, 40 to 20000 Hz
  telephone: unidirectional, 300 to 8000 Hz
Training on data from one type of channel automatically “learns” that channel’s characteristics; switching channels degrades performance.
19
Issues in Developing ASR Systems
Speaking Rate
Even the same speaker may vary the rate of speech.
Most ASR systems require a fixed window of input speech.
Formant dynamics change with different speaking rates and speaking styles (e.g. “frustrated speech”).
ASR performance is best when tested on the same rate of speech as the training data.
Training on a wide variation in speaking rate results in lower overall performance.
20
Issues in Developing ASR Systems
Noise
Two types of noise: additive and convolutional
  additive: white noise (random values added to the waveform)
  convolutional: filter (additive values in the log spectrum)
Techniques for removing noise: RASTA, Cepstral Mean Subtraction (CMS)
It is (nearly) impossible to remove all noise while preserving all speech.
Stochastic training “learns” noise as well as speech; if the noise changes, performance degrades.
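Because a convolutional (filter) distortion shows up as a constant additive offset in the log spectrum, and therefore in the cepstrum, subtracting the per-coefficient mean over an utterance cancels it. The sketch below shows Cepstral Mean Subtraction on toy two-dimensional "cepstral" frames; the feature values, the channel offset, and the function name are made up for illustration, and real systems operate on many more coefficients per frame.

```python
def cepstral_mean_subtraction(frames):
    """Subtract each coefficient's utterance-level mean from every frame."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]


# Hypothetical cepstral frames from a "clean" recording.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# A fixed channel (filter) adds the same offset to every frame's cepstrum.
channel_offset = [0.5, -1.5]
filtered = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_filtered = cepstral_mean_subtraction(filtered)
print(normalized_clean)
print(normalized_filtered)  # matches the line above: the channel offset is gone
```

The same subtraction also removes the long-term average of the speech itself, and it does nothing for additive noise, which is consistent with the slide's point that noise cannot be fully removed while preserving all speech.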
21
Issues in Developing ASR Systems
Vocabulary
The vocabulary must be specified in advance (can’t recognize new words).
The pronunciation of each word must be specified exactly (phonetic substitutions may degrade performance).
Grammar: either very simple or very structured.
Reasons:
  phonetic recognition is so poor that confidence in each recognized phoneme is usually very low.
  humans often speak ungrammatically or disfluently.
22
Issues in Developing ASR Systems
How Well Does ASR Do?
[Figure: Error Rates on Increasingly Difficult Problems. Word error rate (log scale, 1% to 100%) plotted by year, 1988 to 2003. Chart labels: Read Speech; Structured Speech; Spontaneous Speech (2-3k); Broadcast Speech; Conversational Speech; Varied Microphones; Noisy Speech; 1k, 5k, 20k; Noisy. Annotations: human speech recognition of Broadcast Speech (0.9% WER); 2.5%; 19%.]
Current best performance on conversational telephone speech is around 10% word error rate.
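Word error rate, the measure used on this slide, is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below uses a standard edit-distance recurrence; the example sentences and the function name are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "black monday was a stock market crash"
hypothesis = "black money was stock market crash"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 28.6% (2 errors / 7 words)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.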
23
ASR Technology vs. Spectrogram Reading
HMM-Based ASR:
  frame based (no identification of landmarks in the speech signal)
  duration of phonemes not identified until end of processing
  all frames are equally important
  “cues” are completely unspecified, learned by training
  coarticulation model = context-dependent phoneme models
Spectrogram Reading:
  first identify landmarks in the signal
    Where’s the vowel? Is that change in energy a plosive?
  identify change over duration of a phoneme, relative durations
    Is that formant movement a diphthong or coarticulation?
  identify activity at phoneme boundaries
    F2 goes to 1800 Hz at onset of voicing, voicing continues into frication, so it’s a voiced fricative.
  specific cues to phoneme identity
    1800 Hz implies alveolar, F3 2000 Hz implies retroflex
  coarticulation model = tends toward locus theory
24
ASR Technology vs. Spectrogram Reading
HMM-Based ASR:
  frame based (no identification of landmarks in the speech signal)
  duration of phonemes not identified until end of processing
  all frames are equally important
  “cues” are completely unspecified, learned by training
  coarticulation model = context-dependent phoneme models
Spectrogram Reading and Human Speech Recognition:
  first identify landmarks in the signal
    Humans are thought to have landmark (e.g. plosive) detectors
  identify change over duration of a phoneme, relative durations
    Humans are very sensitive to small changes, especially at vowel/consonant boundaries
  identify activity at phoneme boundaries
    The transition into the vowel is the most important region for human speech perception
  specific cues to phoneme identity
    Humans use a (large) set of specific cues, e.g. VOT
25
The Structure of Spoken Language
Final Points:
Speech is complex! Not as simple as a “sequence of phonemes”.
There is structure in speech, related to broad phonetic categories.
Identifying formant locations and movement is important.
Duration is important even for phoneme identity.
Phoneme boundaries are important.
There are numerous cues to phoneme identity.
Little is understood about how humans process speech.
Current ASR technology is incapable of accounting for all the information that humans use in reading spectrograms, and what is known about human speech processing is often not used. This implies (but does not prove) that current technology may be incapable of reaching human levels of performance.
Speech is complex!