1
Speech Generation: From Concept and from Text
Julia Hirschberg CS 6998 11/13/2018
2
Today
TTS
CTS
3
Traditional TTS Systems
Monologue
News articles, books, phone directories
Input: plain text
How to infer intention behind text?
4
Human Speech Production Levels
World Knowledge
Semantics
Syntax
Word
Phonology
Motor commands, articulator movements, F0, amplitude, duration
Acoustics
5
TTS Production Levels: Back End and Front End
Orthographic input: The children read to Dr. Smith
Levels: World Knowledge, Semantics, Syntax, Word, Phonology, F0/amplitude/duration, Acoustics
TTS modules: text normalization, word pronunciation, intonation assignment, synthesis
6
Text Normalization
Context-independent: Mr., 22, $N, NAACP, MAACO, VISA
Context-dependent: Dr., St., 1997, 3/16
Abbreviation ambiguities: How to resolve?
  Application restrictions: all names?
  Rule or corpus-based decision procedure (Sproat et al. ’01)
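The distinction between context-independent and context-dependent expansion can be sketched roughly as follows. This is a toy illustration, not the decision procedure from Sproat et al.: the two tables and the "capitalized next word" cue are assumptions made for the example.

```python
# Hypothetical expansion tables; a real system would use much larger lexicons.
CONTEXT_INDEPENDENT = {"Mr.": "Mister"}
# (reading before a capitalized word, reading otherwise): an assumed heuristic.
CONTEXT_DEPENDENT = {"Dr.": ("Doctor", "Drive"), "St.": ("Saint", "Street")}

def normalize(text):
    """Expand abbreviations, using the next token to resolve ambiguous ones."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in CONTEXT_INDEPENDENT:
            out.append(CONTEXT_INDEPENDENT[tok])
        elif tok in CONTEXT_DEPENDENT:
            before_name, otherwise = CONTEXT_DEPENDENT[tok]
            out.append(before_name if nxt[:1].isupper() else otherwise)
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("The children read to Dr. Smith"))
print(normalize("Mr. Jones lives on Elm St."))
```

A single lexical cue like this breaks quickly on real text, which is presumably why the slide raises rule-based versus corpus-based procedures as an open choice.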
7
Part-of-speech ambiguity:
  The convict went to jail/They will convict him
  Said said hello
  They read books/They will read books
  Use: local lexical context, pos tagger, parser?
Sense ambiguity:
  I fish for bass/I play the bass
  Use: decision lists (Yarowsky ’94)
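A decision list for the bass example might look like the sketch below. The cue words, their ordering, and the default are invented for illustration; in Yarowsky's approach the rules are learned from a corpus and ranked by how strongly each cue predicts a sense.

```python
# Ordered (cue word, pronunciation) rules; the first matching rule decides.
# Cues and ordering are hypothetical stand-ins for corpus-derived rankings.
BASS_RULES = [
    ("fish", "b ae s"),   # the fish
    ("play", "b ey s"),   # the instrument
]
BASS_DEFAULT = "b ey s"   # assumed fallback when no cue is present

def pronounce_bass(sentence, rules=BASS_RULES, default=BASS_DEFAULT):
    """Pick a pronunciation for 'bass' from words in the same sentence."""
    words = {w.strip(".,?!").lower() for w in sentence.split()}
    for cue, pronunciation in rules:
        if cue in words:
            return pronunciation
    return default

print(pronounce_bass("I fish for bass"))
print(pronounce_bass("I play the bass"))
```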
8
Word Pronunciation
Letter-to-Sound rules vs. large dictionary
  O: _{C}e$ → /o/ hope
  O → /a/ hop
Morphological analysis
  Popemobile
  Hoped
Ethnic classification
  Fujisaki, Infiniti
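The letter-to-sound rule for O, combined with a dictionary lookup that takes priority, could be sketched as below. The function name and the `lexicon` argument are assumptions for the example; only the rule itself (O before one consonant and a final e gives /o/, otherwise /a/) comes from the slide.

```python
import re

def o_vowel(word, lexicon=None):
    """Return the vowel for the letter 'o': dictionary first, then the rule."""
    word = word.lower()
    if lexicon and word in lexicon:
        return lexicon[word]            # dictionary entry overrides the rule
    if re.search(r"o[^aeiou]e$", word):
        return "o"                      # O: _{C}e$ -> /o/, as in "hope"
    return "a"                          # elsewhere -> /a/, as in "hop"

print(o_vowel("hope"))
print(o_vowel("hop"))
```

Words such as "hoped" do not end in the rule's context, which is one reason morphological analysis (hope + d) would have to run before the rule applies.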
9
Rhyming by analogy
  Meronymy/metonymy
Exception Dictionary
  Beethoven
Goal: phonemes+syllabification+lexical stress
Context-dependent too:
  Give the book to John.
  To John I said nothing.
10
Intonation Assignment: Phrasing
Traditional: hand-built rules
  Punctuation
  Context/function word: no breaks after function word
    He went to dinner
  Parse?
    She favors the nuts and bolts approach
Current: statistical analysis of large labeled corpus
  Punctuation, pos window, utt length,…
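The traditional hand-built rules for phrasing can be approximated in a few lines. In this sketch the function-word list and the `max_len` threshold are assumptions added to make the "no breaks after function word" constraint observable; the slide itself only names punctuation and the function-word constraint.

```python
# Small, hypothetical function-word list.
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "and", "he", "she", "it"}
PUNCTUATION = ".,;:?!"

def phrase(text, max_len=3):
    """Split text into phrases: break at punctuation, or once a phrase has
    max_len words, but never directly after a function word."""
    phrases, current = [], []
    for tok in text.split():
        current.append(tok)
        word = tok.strip(PUNCTUATION).lower()
        at_punct = tok[-1] in PUNCTUATION
        long_enough = len(current) >= max_len and word not in FUNCTION_WORDS
        if at_punct or long_enough:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(phrase("He went to dinner", max_len=3))
print(phrase("He went to dinner", max_len=2))
```

With `max_len=3` the third word is "to", so the break is deferred until after "dinner" rather than stranding the function word at a phrase boundary.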
11
Intonation Assignment: Accent
Hand-built rules
  Function/content distinction
    He went out the back door/He threw out the trash
  Complex nominals:
    Main Street/Park Avenue
    city hall parking lot
Statistical procedures trained on large corpora
Contrastive stress, given/new distinction?
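A bare function/content rule might be sketched like this, with a hypothetical word list. It accents content words and deaccents function words, and it treats "out" identically in both example sentences, which suggests why a word list alone is not enough.

```python
# Hypothetical function-word list; "out" is listed as a preposition.
FUNCTION_WORDS = {"a", "an", "the", "he", "she", "it", "to", "of", "in", "on", "out", "and"}

def assign_accents(sentence):
    """Return (word, accented) pairs using only the function/content split."""
    return [
        (w, w.strip(".,?!").lower() not in FUNCTION_WORDS)
        for w in sentence.split()
    ]

for sent in ("He went out the back door", "He threw out the trash"):
    print(assign_accents(sent))
```

Telling the preposition in "went out the back door" apart from the particle in "threw out the trash" would need part-of-speech or syntactic information that this rule does not use.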
12
Intonation Assignment: Contours
Simple rules
  ‘.’ = declarative contour
  ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence
    Well, how did he do it?
    And what do you know?
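These contour rules translate almost directly into code. In the sketch below, the wh-word list and the three-word `window` standing in for "at/near front" are assumptions, and the label returned for wh-questions is just a placeholder, since the slide does not say which contour they receive.

```python
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}

def contour(sentence, window=3):
    """Choose a contour label from final punctuation and leading wh-words."""
    s = sentence.strip()
    if s.endswith("?"):
        front = [w.strip(",.?!").lower() for w in s.split()[:window]]
        if any(w in WH_WORDS for w in front):
            return "wh-question"
        return "yes-no-question"
    return "declarative"

for s in ("Well, how did he do it?",
          "And what do you know?",
          "Do you want to fly first class?",
          "The children read to Dr. Smith."):
    print(s, "->", contour(s))
```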
13
The TTS Front End Today
Corpus-based statistical methods instead of hand-built rule-sets
Dictionaries instead of rules (but fall-back to rules)
Modest attempts to infer contrast, given/new
Text analysis tools: pos tagger, morphological analyzer, little parsing
14
TTS Back End: Phonology to Acoustics
Goal:
  Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment)
  Convert to an acoustic signal (spectrum, pitch, duration, amplitude)
From phonetics to signal processing
15
Phonological Modeling: Duration
How long should each phoneme be?
  Identity of context phonemes
  Position within syllable and # syllables
  Phrasing
  Stress
  Speaking rate
16
Phonological Modeling: Pitch
How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes?
  Contour or target models for accents, phrase boundaries
  Rules to align phoneme string and smooth
How does F0 align with different phonemes?
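One way to picture a target model is as a list of (time, F0) points joined by interpolation. The sketch below uses linear interpolation and made-up target values; real systems align targets to phonemes and use smoother transitions.

```python
def f0_at(t, targets):
    """Linearly interpolate F0 (Hz) at time t (s) between sorted targets."""
    if t <= targets[0][0]:
        return float(targets[0][1])
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return float(targets[-1][1])

# Hypothetical targets: phrase start, pitch-accent peak, final low boundary.
targets = [(0.0, 120), (0.5, 180), (1.0, 100)]
print([f0_at(t, targets) for t in (0.0, 0.25, 0.5, 0.75, 1.0)])
```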
17
Phonetic Component: Segmentals
Phonemes have different acoustic realizations depending on nearby phonemes, stress
  To/to, butter/tail
Approaches:
  Articulatory synthesis
  Formant synthesis
  Concatenative synthesis
    Diphone or unit selection
18
Articulatory Synthesis-by-Rule
Model articulators: tongue body, tip, jaw, lips, velum, vocal folds
Rules control timing of movements of each articulator
Easy to model coarticulation since articulators modeled separately
But: sounds very unnatural
  Transform from vocal tract to acoustics not well understood
  Knowledge of articulator control rules incomplete
19
Formant (Acoustic) Synthesis by Rule
Model of acoustic parameters:
  Formant frequencies, bandwidths, amplitude of voicing, aspiration…
Phonemes have target values for parameters
Given a phonemic transcription of the input:
  Rules select sequence of targets
  Other rules determine duration of target values and transitions between
20
Speech quality not natural
  Acoustic model incomplete
  Human knowledge of linguistic and acoustic control rules incomplete
21
Concatenative Synthesis
Pre-recorded human speech
  Cut up into units, code, store (indexed)
  Diphones typical
Given a phonemic transcription
  Rules select unit sequence
  Rules concatenate units based on some selection criteria
  Rules modify duration, amplitude, pitch, source, and smooth spectrum across junctures
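For diphone synthesis, the unit sequence follows mechanically from the phonemic transcription: each unit spans the transition from one phone into the next. A minimal sketch, assuming `#` as a silence marker and a `left-right` naming scheme for units:

```python
def diphones(phonemes, silence="#"):
    """Map a phoneme sequence to the diphone units needed to synthesize it."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(diphones(["h", "o", "p"]))
```

Each name would then index a stored, coded recording, with later rules adjusting duration, amplitude, and pitch and smoothing the joins.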
22
Issues
Speech quality varies based on
  Size and number of units (coverage)
  Rules
  Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters
  How much the signal must be modified to produce the output
23
Coding Methods
LPC: Linear Predictive Coding
  Decompose waveform into vocal tract/formant frequencies, F0, amplitude: simple model of glottal excitation
  Robotic
More elaborate variants (MPLPC, RELP) less robotic but distortions when change in F0, duration
PSOLA (pitch synchronous overlap/add):
  No waveform decomposition
24
Delete/repeat pitch periods to change duration
Overlap pitch periods to change F0
Distortion if large F0, durational change
Sensitive to definition of pitch periods
No coding (use natural speech)
  Avoid distortions of coding methods
  But how to change duration, F0, amplitude?
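The duration half of PSOLA, deleting or repeating pitch periods, can be illustrated without any signal processing. The sketch below treats each pitch period as an opaque item and only chooses which periods to keep or repeat; the windowing and overlap-add steps of a real implementation are omitted, and `factor` is a hypothetical parameter name.

```python
def change_duration(periods, factor):
    """Stretch (factor > 1) or shrink (factor < 1) by repeating or dropping
    whole pitch periods, preserving their order."""
    n_out = max(1, round(len(periods) * factor))
    last = len(periods) - 1
    return [periods[min(last, int(i / factor))] for i in range(n_out)]

periods = ["p0", "p1", "p2", "p3"]
print(change_duration(periods, 2))    # each period repeated
print(change_duration(periods, 0.5))  # every other period dropped
```

Because whole periods are copied unchanged, quality depends on how accurately the period boundaries were marked, which matches the sensitivity noted on the slide.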
25
Corpus-based Unit Selection
Units determined case-by-case from large hand-labeled or automatically labeled corpus
Amount of concatenation depends on input and corpus
Algorithms for determining best units to use
  Longest match to phonemes in input
  Spectral distance measures
  Matching prosodic, amplitude, durational features???
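The "longest match" criterion can be sketched as a greedy search over an inventory of recorded phoneme sequences. The inventory below is invented, and the sketch ignores the spectral and prosodic criteria that a fuller selection algorithm would weigh alongside match length.

```python
def select_units(target, inventory):
    """Greedily cover the target phonemes with the longest units available."""
    units, i = [], 0
    while i < len(target):
        for j in range(len(target), i, -1):      # try the longest span first
            candidate = tuple(target[i:j])
            if candidate in inventory:
                units.append(candidate)
                i = j
                break
        else:
            raise KeyError(f"no unit in the inventory covers {target[i]!r}")
    return units

# Hypothetical inventory of phoneme sequences found in a labeled corpus.
inventory = {("h",), ("o",), ("p",), ("h", "o"), ("h", "o", "p"), ("i", "ng")}
print(select_units(["h", "o", "p", "i", "ng"], inventory))
```

The closer the input is to material in the corpus, the fewer joins are needed, which is how the amount of concatenation comes to depend on both input and corpus.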
26
TTS Back End: Summary
Speech most natural when least signal processing: corpus-based unit selection and no coding … but …
27
TTS: Where are we now?
Natural sounding speech for some utterances
  Where good match between input and database
Still … hard to vary prosodic features and retain naturalness
  Yes-no questions: Do you want to fly first class?
Context-dependent variation still hard to infer from text and hard to realize naturally:
28
Appropriate contours from text
Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me.
Variation in pitch range, rate, pausal duration to convey topic structure
Characteristics of ‘emotional speech’ little understood, so hard to convey: … a voice that sounds friendly, sympathetic, authoritative …
How to mimic real voices?
29
TTS vs. CTS
Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, … information explicitly available to NLG
Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how
But … generating prosody for CTS isn’t so easy
In principle, the information TTS systems lack to support natural prosodic assignment is readily available to CTS systems. So the initial hope in the NLG community was that prosodic assignment would be a simple problem. It has proven, however, to be fairly hard. Why?
30
Next Week
Read
Discussion questions
Write an outline of your class project and what you’ve done so far