Speech Generation: From Concept and from Text
  Julia Hirschberg
  CS 6998
  11/13/2018
Today
  TTS
  CTS
Traditional TTS Systems
  Monologue
  News articles, email, books, phone directories
  Input: plain text
  How to infer intention behind text?
Human Speech Production Levels
  World Knowledge
  Semantics
  Syntax
  Word
  Phonology
  Motor Commands, articulator movements, F0, amplitude, duration
  Acoustics
TTS Production Levels: Back End and Front End
  Orthographic input: The children read to Dr. Smith
  Levels: World Knowledge, Semantics, Syntax, Word, Phonology, F0/amplitude/duration, Acoustics
  Processing steps: text normalization, word pronunciation, intonation assignment, synthesis
Text Normalization
  Context-independent: Mr., 22, $N, NAACP, MAACO, VISA
  Context-dependent: Dr., St., 1997, 3/16
  Abbreviation ambiguities: how to resolve?
    Application restrictions: all names?
    Rule- or corpus-based decision procedure (Sproat et al. ’01)
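
The split between context-independent and context-dependent abbreviations can be sketched as below. This is a minimal illustration, not the procedure from Sproat et al.: the tables `CONTEXT_FREE` and `CONTEXT_DEPENDENT` and the capitalization cue ("the next word looks like a name") are assumptions made for the example.

```python
# Context-independent expansions: one reading regardless of neighbours.
CONTEXT_FREE = {"Mr.": "Mister"}

# Context-dependent abbreviations:
# (reading before a capitalised word, reading otherwise).
# The capitalisation cue is an assumed heuristic, not a rule from the slides.
CONTEXT_DEPENDENT = {"Dr.": ("Doctor", "Drive"), "St.": ("Saint", "Street")}


def expand_token(tokens, i):
    """Expand tokens[i], looking one token to the right when needed."""
    tok = tokens[i]
    if tok in CONTEXT_FREE:
        return CONTEXT_FREE[tok]
    if tok in CONTEXT_DEPENDENT:
        before_name, otherwise = CONTEXT_DEPENDENT[tok]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        return before_name if nxt[:1].isupper() else otherwise
    return tok


def normalize(text):
    """Whitespace-tokenise and expand every abbreviation the tables know."""
    tokens = text.split()
    return " ".join(expand_token(tokens, i) for i in range(len(tokens)))


print(normalize("The children read to Dr. Smith"))
print(normalize("He lives on Main St. near St. Louis"))
```

The heuristic breaks as soon as an abbreviation ends a sentence and the next sentence starts with a capital, which is one reason corpus-based decision procedures are attractive here.
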
Part-of-speech ambiguity:
  The convict went to jail / They will convict him
  Said said hello
  They read books / They will read books
  Use: local lexical context, POS tagger, parser?
Sense ambiguity:
  I fish for bass / I play the bass
  Use: decision lists (Yarowsky ’94)
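
A decision list tests an ordered sequence of contextual cues and lets the first matching cue decide the sense. The sketch below shows only that lookup step for "bass"; the cue words, their order, and the default sense are hand-picked assumptions, whereas a trained list would rank cues by evidence collected from a corpus.

```python
# Each rule is (cue word, sense). The order stands in for the ranking
# that a trained decision list would learn; these rules are hypothetical.
BASS_RULES = [
    ("fish", "fish"),
    ("play", "music"),
    ("guitar", "music"),
    ("river", "fish"),
]
DEFAULT_SENSE = "music"  # assumed fallback when no cue matches

# ARPAbet-style pronunciations for the two senses.
PRONUNCIATION = {"fish": "B AE S", "music": "B EY S"}


def disambiguate_bass(sentence):
    """Return the sense chosen by the first rule whose cue is in the sentence."""
    words = {w.strip(".,?!").lower() for w in sentence.split()}
    for cue, sense in BASS_RULES:
        if cue in words:
            return sense
    return DEFAULT_SENSE


for s in ("I fish for bass", "I play the bass"):
    sense = disambiguate_bass(s)
    print(s, "->", sense, PRONUNCIATION[sense])
```
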
Word Pronunciation
  Letter-to-sound rules vs. large dictionary
    O: _{C}e$ → /o/ (hope)
    O → /a/ (hop)
  Morphological analysis
    Popemobile
    Hoped
  Ethnic classification
    Fujisaki, Infiniti
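
The letter-to-sound rule for "o" can be written as a regular expression: "o" followed by one consonant and a word-final "e" maps to /o/, and any other "o" falls through to /a/. This is a sketch of that single rule only; the function name `o_sound` and the consonant set are assumptions for the example.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxyz"
# 'o' followed by exactly one consonant and a word-final 'e'.
LONG_O = re.compile(rf"o(?=[{CONSONANTS}]e$)")


def o_sound(word):
    """Return '/o/' or '/a/' for the first 'o' in word, or None if there is none."""
    word = word.lower()
    idx = word.find("o")
    if idx == -1:
        return None
    # Pattern.match(string, pos) anchors the match at position idx.
    return "/o/" if LONG_O.match(word, idx) else "/a/"


for w in ("hope", "hop", "hoped"):
    print(w, o_sound(w))
```

Applied to the raw string "hoped", the rule yields /a/ because the "e" is no longer word-final. Stripping the suffix first (hope + d) restores the right context, which is where morphological analysis comes in.
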
Rhyming by analogy
  Meronymy/metonymy
Exception dictionary
  Beethoven
Goal: phonemes + syllabification + lexical stress
Context-dependent too:
  Give the book to John.
  To John I said nothing.
Intonation Assignment: Phrasing
  Traditional: hand-built rules
    Punctuation: 234-5682
    Context/function word: no breaks after function word
      He went to dinner
    Parse?
      She favors the nuts and bolts approach
  Current: statistical analysis of large labeled corpus
    Punctuation, POS window, utt length, …
Intonation Assignment: Accent
  Hand-built rules
    Function/content distinction
      He went out the back door / He threw out the trash
    Complex nominals:
      Main Street / Park Avenue
      city hall parking lot
  Statistical procedures trained on large corpora
  Contrastive stress, given/new distinction?
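
The function/content distinction is the simplest hand-built accent rule: accent content words, leave function words unaccented. The sketch below assumes a small, hypothetical `FUNCTION_WORDS` list and shows where the rule falls short.

```python
# A tiny, hypothetical closed-class word list; a real system would use
# a much larger list or a part-of-speech tagger.
FUNCTION_WORDS = {
    "a", "an", "the", "he", "she", "it", "they", "to", "of",
    "in", "on", "out", "and", "or", "did", "do",
}


def assign_accents(sentence):
    """Return (word, accented) pairs: content words True, function words False."""
    pairs = []
    for raw in sentence.split():
        word = raw.strip(".,?!")
        pairs.append((word, word.lower() not in FUNCTION_WORDS))
    return pairs


print(assign_accents("He went out the back door"))
print(assign_accents("He threw out the trash"))
```

Both sentences leave "out" unaccented because the word list cannot tell the preposition in "went out the back door" from the particle in "threw out the trash". Separating the two takes the kind of part-of-speech or syntactic information discussed earlier.
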
Intonation Assignment: Contours
  Simple rules
    ‘.’ = declarative contour
    ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence
      Well, how did he do it?
      And what do you know?
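
These contour rules translate almost directly into code. In the sketch below, "near the front" is taken to mean the first three words (the `window` parameter), and wh-questions are assigned the declarative contour; both choices are assumptions, since the rule only says what a wh-question is not.

```python
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why", "which", "how"}


def choose_contour(sentence, window=3):
    """Pick a contour label from final punctuation and early wh-words."""
    s = sentence.strip()
    if not s.endswith("?"):
        return "declarative"
    words = [w.strip(".,?!").lower() for w in s.split()]
    if any(w in WH_WORDS for w in words[:window]):
        # Assumption: wh-questions fall back to the declarative contour.
        return "declarative"
    return "yes-no-question"


for s in ("Well, how did he do it?", "And what do you know?",
          "Do you want to fly first class?", "He went to dinner."):
    print(s, "->", choose_contour(s))
```
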
The TTS Front End Today
  Corpus-based statistical methods instead of hand-built rule sets
  Dictionaries instead of rules (but fall back to rules)
  Modest attempts to infer contrast, given/new
  Text analysis tools: POS tagger, morphological analyzer, little parsing
TTS Back End: Phonology to Acoustics
  Goal:
    Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment)
    Convert to an acoustic signal (spectrum, pitch, duration, amplitude)
  From phonetics to signal processing
Phonological Modeling: Duration
  How long should each phoneme be?
    Identity of context phonemes
    Position within syllable and # syllables
    Phrasing
    Stress
    Speaking rate
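
One way to combine these factors is a multiplicative rule model in the style of Klatt's duration rules: each phoneme has an inherent and a minimum duration, and every contextual factor scales the stretchable part. The slide does not name a specific model, so treat this as one possible sketch; all numbers below are made up for illustration.

```python
def phoneme_duration(inherent_ms, minimum_ms, factors):
    """Scale the stretchable part of a phoneme's duration by each factor.

    duration = minimum + (inherent - minimum) * product(factors)
    """
    scale = 1.0
    for f in factors:
        scale *= f
    return minimum_ms + (inherent_ms - minimum_ms) * scale


# Hypothetical values: a vowel with 120 ms inherent and 60 ms minimum duration,
# lengthened phrase-finally (1.4) and shortened because it is unstressed (0.5).
print(phoneme_duration(120, 60, [1.4, 0.5]))
```
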
Phonological Modeling: Pitch
  How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes?
    Contour or target models for accents, phrase boundaries
    Rules to align phoneme string and smooth
  How does F0 align with different phonemes?
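
In a target model, accents and phrase boundaries contribute F0 targets at particular times, and the contour is filled in between them. The sketch below uses plain linear interpolation, which is an assumption; real systems apply more careful alignment and smoothing rules. The target times and frequencies are invented for the example.

```python
def f0_contour(targets, step_ms=10):
    """Linearly interpolate F0 between (time_ms, f0_hz) targets.

    targets must hold at least two points with strictly increasing times.
    Returns a list of (time_ms, f0_hz) samples every step_ms.
    """
    if len(targets) < 2:
        raise ValueError("need at least two targets")
    contour = []
    seg = 0  # index of the target at the left edge of the current segment
    t = targets[0][0]
    end = targets[-1][0]
    while t <= end:
        # Advance to the segment that contains time t.
        while seg + 1 < len(targets) - 1 and t > targets[seg + 1][0]:
            seg += 1
        (t0, f0), (t1, f1) = targets[seg], targets[seg + 1]
        frac = (t - t0) / (t1 - t0)
        contour.append((t, f0 + frac * (f1 - f0)))
        t += step_ms
    return contour


# Hypothetical targets: low start, accent peak at 100 ms, low boundary at 200 ms.
print(f0_contour([(0, 100), (100, 200), (200, 100)], step_ms=50))
```
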
Phonetic Component: Segmentals
  Phonemes have different acoustic realizations depending on nearby phonemes, stress
    To/to, butter/tail
  Approaches:
    Articulatory synthesis
    Formant synthesis
    Concatenative synthesis
      Diphone or unit selection
Articulatory Synthesis-by-Rule
  Model articulators: tongue body, tip, jaw, lips, velum, vocal folds
  Rules control timing of movements of each articulator
  Easy to model coarticulation since articulators modeled separately
  But: sounds very unnatural
    Transform from vocal tract to acoustics not well understood
    Knowledge of articulator control rules incomplete
Formant (Acoustic) Synthesis by Rule
  Model of acoustic parameters:
    Formant frequencies, bandwidths, amplitude of voicing, aspiration…
  Phonemes have target values for parameters
  Given a phonemic transcription of the input:
    Rules select sequence of targets
    Other rules determine duration of target values and transitions between
Speech quality not natural
  Acoustic model incomplete
  Human knowledge of linguistic and acoustic control rules incomplete
Concatenative Synthesis
  Pre-recorded human speech
    Cut up into units, code, store (indexed)
    Diphones typical
  Given a phonemic transcription
    Rules select unit sequence
    Rules concatenate units based on some selection criteria
    Rules modify duration, amplitude, pitch, source, and smooth spectrum across junctures
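
A diphone runs from the middle of one phone to the middle of the next, so a phonemic transcription of n phonemes padded with silence maps to n + 1 diphone units. A minimal sketch of that mapping, with a hypothetical `sil` symbol and unit-naming scheme:

```python
def to_diphones(phonemes, silence="sil"):
    """Turn a phoneme sequence into the diphone unit names to look up."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "hope" as three phonemes (transcription symbols are illustrative).
print(to_diphones(["h", "ow", "p"]))
```

Each name would then index a stored, coded unit; the concatenation and smoothing rules operate on the retrieved units.
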
Issues
  Speech quality varies based on
    Size and number of units (coverage)
    Rules
    Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters
    How much the signal must be modified to produce the output
Coding Methods
  LPC: Linear Predictive Coding
    Decompose waveform into vocal tract/formant frequencies, F0, amplitude: simple model of glottal excitation
    Robotic
  More elaborate variants (MPLPC, RELP) less robotic, but distortions when F0, duration change
  PSOLA (pitch synchronous overlap/add):
    No waveform decomposition
    Delete/repeat pitch periods to change duration
    Overlap pitch periods to change F0
    Distortion if large F0, durational change
    Sensitive to definition of pitch periods
  No coding (use natural speech)
    Avoid distortions of coding methods
    But how to change duration, F0, amplitude?
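
The PSOLA idea of deleting or repeating pitch periods to change duration can be sketched on a list of already-segmented periods. This shows only the selection step: locating pitch periods, windowing them, and the overlap-add itself are all left out, and `change_duration` is a hypothetical helper name.

```python
def change_duration(periods, factor):
    """Repeat (factor > 1) or drop (factor < 1) pitch periods.

    periods: non-empty list of pitch-period segments, in order.
    factor:  desired duration ratio, e.g. 2.0 doubles the length.
    """
    if not periods or factor <= 0:
        raise ValueError("need at least one period and a positive factor")
    n_out = max(1, round(len(periods) * factor))
    last = len(periods) - 1
    # Map each output slot back to the input period it should reuse.
    return [periods[min(last, int(i / factor))] for i in range(n_out)]


periods = ["p0", "p1", "p2", "p3"]  # placeholders for waveform segments
print(change_duration(periods, 2.0))
print(change_duration(periods, 0.5))
```

Large factors repeat or skip many neighbouring periods, which is where the audible distortion mentioned above comes from.
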
Corpus-based Unit Selection
  Units determined case-by-case from large hand- or automatically labeled corpus
  Amount of concatenation depends on input and corpus
  Algorithms for determining best units to use
    Longest match to phonemes in input
    Spectral distance measures
    Matching prosodic, amplitude, durational features???
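
One common way to frame "determining the best units" is a dynamic-programming (Viterbi-style) search that minimizes a target cost (how well a unit matches the desired features) plus a join cost (how well adjacent units fit together). The slide does not name an algorithm, so this is a sketch under that assumption; the unit tuples and both cost functions are invented for the example.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the cheapest unit sequence, one unit per target.

    targets:    list of target specifications.
    candidates: candidates[i] is the list of corpus units usable for targets[i].
    """
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k] + target_cost(targets[i], u), k))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Hypothetical units: (phoneme, f0_hz, position in corpus).
targets = [("h", 120), ("ow", 120), ("p", 120)]
candidates = [
    [("h", 120, 10), ("h", 118, 40)],
    [("ow", 125, 11), ("ow", 120, 70)],
    [("p", 121, 12), ("p", 120, 90)],
]
f0_mismatch = lambda t, u: abs(t[1] - u[1])            # prosodic target cost
non_adjacent = lambda p, u: 0 if u[2] == p[2] + 1 else 5  # join cost
print(select_units(targets, candidates, f0_mismatch, non_adjacent))
```

With a zero join cost for units that were adjacent in the corpus, the search favours long contiguous stretches, which approximates the "longest match" criterion; a spectral distance could replace the fixed penalty of 5.
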
TTS Back End: Summary
  Speech most natural when least signal processing: corpus-based unit selection and no coding….but….
TTS: Where are we now?
  Natural sounding speech for some utterances
    Where good match between input and database
  Still…hard to vary prosodic features and retain naturalness
    Yes-no questions: Do you want to fly first class?
  Context-dependent variation still hard to infer from text and hard to realize naturally:
    Appropriate contours from text
    Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me.
    Variation in pitch range, rate, pausal duration to convey topic structure
  Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative….
  How to mimic real voices?
TTS vs. CTS
  Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, … information explicitly available to NLG
  Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how
  But….generating prosody for CTS isn’t so easy

In principle, the information TTS systems lack to support natural prosodic assignment is readily available to CTS systems. So the initial hope in the NLG community was that prosodic assignment would be a simple problem. It has, however, proven fairly hard. Why?
Next Week
  Read
  Discussion questions
  Write an outline of your class project and what you’ve done so far