Speech Generation: From Concept and from Text

Presentation transcript:

Speech Generation: From Concept and from Text Julia Hirschberg CS 6998 11/13/2018

Today
  TTS
  CTS

Traditional TTS Systems
  Monologue
  News articles, email, books, phone directories
  Input: plain text
  How to infer intention behind text?

Human Speech Production Levels
  World Knowledge
  Semantics
  Syntax
  Word
  Phonology
  Motor commands, articulator movements, F0, amplitude, duration
  Acoustics

TTS Production Levels: Back End and Front End
  Orthographic input: The children read to Dr. Smith
  Levels: World Knowledge, Semantics, Syntax, Word, Phonology, F0/amplitude/duration, Acoustics
  TTS processing steps: text normalization, word pronunciation, intonation assignment, synthesis

Text Normalization
  Context-independent: Mr., 22, $N, NAACP, MAACO, VISA
  Context-dependent: Dr., St., 1997, 3/16
  Abbreviation ambiguities: How to resolve?
    Application restrictions: all names?
    Rule or corpus-based decision procedure (Sproat et al. ’01)
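
To make the context-dependent case concrete, here is a minimal sketch of a rule-based expander for "Dr." and "St.". The lookup table, the function name `expand_abbreviations`, and the "capitalized next word" heuristic are all illustrative assumptions, not the procedure from Sproat et al.; a real system would use a much larger lexicon or a trained classifier.

```python
# Hypothetical table for abbreviations whose expansion never depends on context.
CONTEXT_INDEPENDENT = {"Mr.": "Mister"}


def expand_abbreviations(text: str) -> str:
    """Expand a few abbreviations using only the following token as context."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in CONTEXT_INDEPENDENT:
            out.append(CONTEXT_INDEPENDENT[tok])
        elif tok in ("Dr.", "St."):
            # Assumed heuristic: before a capitalized word the abbreviation is
            # a title ("Doctor", "Saint"); otherwise it is a street suffix.
            before_name = nxt[:1].isupper()
            if tok == "Dr.":
                out.append("Doctor" if before_name else "Drive")
            else:
                out.append("Saint" if before_name else "Street")
        else:
            out.append(tok)
    return " ".join(out)


print(expand_abbreviations("The children read to Dr. Smith"))
print(expand_abbreviations("Mr. Jones lives at 12 Oak Dr."))
```

The heuristic fails on inputs such as "Oak Dr. Smith lives here", which is one reason the slide points to corpus-based decision procedures.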

Text Normalization (continued)
  Part-of-speech ambiguity:
    The convict went to jail / They will convict him
    Said said hello
    They read books / They will read books
    Use: local lexical context, POS tagger, parser?
  Sense ambiguity:
    I fish for bass / I play the bass
    Use: decision lists (Yarowsky ’94)
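
The sense-ambiguity case can be sketched as a tiny decision list: an ordered set of context cues, where the first cue that fires decides the pronunciation. In Yarowsky-style decision lists the ordering is learned from corpus statistics; in this sketch the cues, their order, the default, and the ARPAbet-style labels are hand-picked assumptions for illustration only.

```python
# Ordered (cue word, pronunciation) rules; order stands in for learned reliability.
BASS_RULES = [
    ("fish", "B AE S"),    # the fish
    ("play", "B EY S"),    # the instrument
    ("guitar", "B EY S"),
]
DEFAULT_PRONUNCIATION = "B EY S"  # assumed fallback when no cue is present


def pronounce_bass(sentence: str) -> str:
    """Pick a pronunciation for 'bass' from the first matching context cue."""
    words = {w.strip(".,?!").lower() for w in sentence.split()}
    for cue, pronunciation in BASS_RULES:
        if cue in words:
            return pronunciation
    return DEFAULT_PRONUNCIATION


print(pronounce_bass("I fish for bass"))
print(pronounce_bass("I play the bass"))
```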

Word Pronunciation
  Letter-to-Sound rules vs. large dictionary
    O: _{C}e$ → /o/ hope
    O → /a/ hop
  Morphological analysis
    Popemobile
    Hoped
  Ethnic classification
    Fujisaki, Infiniti
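
The two letter-to-sound rules for "o" can be sketched with a regular expression: "o" followed by one consonant and a word-final "e" maps to /o/, and any other "o" falls through to /a/. The function name `pronounce_o` and the five-letter vowel set are assumptions made for this sketch.

```python
import re

# Context "_ C e $": an 'o', then exactly one consonant letter, then final 'e'.
O_BEFORE_CONSONANT_E = re.compile(r"o[^aeiou]e$")


def pronounce_o(word: str) -> str:
    """Return the vowel chosen for 'o' under the two rules on the slide."""
    if O_BEFORE_CONSONANT_E.search(word.lower()):
        return "/o/"
    return "/a/"


print(pronounce_o("hope"))
print(pronounce_o("hop"))
print(pronounce_o("hoped"))
```

Applied directly to "hoped", the rule returns /a/, because the "e" is no longer word-final. This is the kind of case where morphological analysis (stripping the suffix to recover "hope") has to run before the letter-to-sound rules.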

Word Pronunciation (continued)
  Rhyming by analogy
    Meronymy/metonymy
  Exception dictionary
    Beethoven
  Goal: phonemes + syllabification + lexical stress
  Context-dependent too:
    Give the book to John.
    To John I said nothing.

Intonation Assignment: Phrasing
  Traditional: hand-built rules
    Punctuation: 234-5682
    Context/function word: no breaks after function word
      He went to dinner
    Parse?
      She favors the nuts and bolts approach
  Current: statistical analysis of large labeled corpus
    Punctuation, POS window, utterance length, …

Intonation Assignment: Accent
  Hand-built rules
    Function/content distinction
      He went out the back door / He threw out the trash
    Complex nominals:
      Main Street / Park Avenue
      city hall parking lot
  Statistical procedures trained on large corpora
  Contrastive stress, given/new distinction?

Intonation Assignment: Contours
  Simple rules
    ‘.’ = declarative contour
    ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence
      Well, how did he do it?
      And what do you know?
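
The punctuation rules above are simple enough to sketch directly. The wh-word list, the three-token window standing in for "at/near front of sentence", and the label "wh-question" are assumptions; the slide does not say which contour a wh-question should receive, so the sketch only flags it as a separate case.

```python
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why", "which", "how"}


def assign_contour(sentence: str, window: int = 3) -> str:
    """Choose a contour label from final punctuation and early wh-words."""
    s = sentence.strip()
    if s.endswith("?"):
        # Look only at the first few tokens, with punctuation stripped.
        front = [w.strip(",.?!").lower() for w in s.split()[:window]]
        if any(w in WH_WORDS for w in front):
            return "wh-question"
        return "yes-no-question"
    return "declarative"


for example in ["Well, how did he do it?",
                "And what do you know?",
                "Do you want to fly first class?",
                "I own a cat."]:
    print(example, "->", assign_contour(example))
```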

The TTS Front End Today
  Corpus-based statistical methods instead of hand-built rule sets
  Dictionaries instead of rules (but fall back to rules)
  Modest attempts to infer contrast, given/new
  Text analysis tools: POS tagger, morphological analyzer, little parsing

TTS Back End: Phonology to Acoustics
  Goal:
    Produce a phonological representation from segmentals (phonemes) and suprasegmentals (accent and phrasing assignment)
    Convert to an acoustic signal (spectrum, pitch, duration, amplitude)
  From phonetics to signal processing

Phonological Modeling: Duration
  How long should each phoneme be?
    Identity of context phonemes
    Position within syllable and number of syllables
    Phrasing
    Stress
    Speaking rate

Phonological Modeling: Pitch
  How to create F0 contour from accent/phrasing/contour assignment plus duration assignment and phonemes?
    Contour or target models for accents, phrase boundaries
    Rules to align phoneme string and smooth
    How does F0 align with different phonemes?

Phonetic Component: Segmentals
  Phonemes have different acoustic realizations depending on nearby phonemes, stress
    To/to, butter/tail
  Approaches:
    Articulatory synthesis
    Formant synthesis
    Concatenative synthesis
      Diphone or unit selection

Articulatory Synthesis-by-Rule
  Model articulators: tongue body, tip, jaw, lips, velum, vocal folds
  Rules control timing of movements of each articulator
  Easy to model coarticulation since articulators modeled separately
  But: sounds very unnatural
    Transform from vocal tract to acoustics not well understood
    Knowledge of articulator control rules incomplete

Formant (Acoustic) Synthesis by Rule
  Model of acoustic parameters:
    Formant frequencies, bandwidths, amplitude of voicing, aspiration, …
  Phonemes have target values for parameters
  Given a phonemic transcription of the input:
    Rules select sequence of targets
    Other rules determine duration of target values and transitions between them

Formant (Acoustic) Synthesis by Rule (continued)
  Speech quality not natural
    Acoustic model incomplete
    Human knowledge of linguistic and acoustic control rules incomplete

Concatenative Synthesis
  Pre-recorded human speech
    Cut up into units, code, store (indexed)
    Diphones typical
  Given a phonemic transcription
    Rules select unit sequence
    Rules concatenate units based on some selection criteria
    Rules modify duration, amplitude, pitch, source, and smooth spectrum across junctures
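
As a small illustration of how a phonemic transcription maps onto diphone units, the sketch below pairs each phoneme with its successor, so that every unit spans one phoneme-to-phoneme transition. The silence symbol `sil`, the hyphenated unit names, and the function name are assumptions of this sketch, not a convention taken from the slides.

```python
from typing import List, Sequence


def to_diphones(phonemes: Sequence[str], silence: str = "sil") -> List[str]:
    """Return the diphone names needed to synthesize a phoneme sequence."""
    # Pad with silence so the utterance-initial and utterance-final
    # transitions are covered as well.
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["h", "ow", "p"]))
```

Each name would then be used as an index into the stored inventory of recorded units.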

Issues
  Speech quality varies based on
    Size and number of units (coverage)
    Rules
    Speech coding method used to decompose acoustic signal into spectral, F0, amplitude parameters
    How much the signal must be modified to produce the output

Coding Methods
  LPC: Linear Predictive Coding
    Decompose waveform into vocal tract/formant frequencies, F0, amplitude: simple model of glottal excitation
    Robotic
    More elaborate variants (MPLPC, RELP) less robotic, but distortions when F0 or duration change
  PSOLA (pitch synchronous overlap/add):
    No waveform decomposition

Coding Methods (continued)
  PSOLA (continued)
    Delete/repeat pitch periods to change duration
    Overlap pitch periods to change F0
    Distortion if large F0, durational change
    Sensitive to definition of pitch periods
  No coding (use natural speech)
    Avoid distortions of coding methods
    But how to change duration, F0, amplitude?
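
The "delete/repeat pitch periods to change duration" idea can be sketched on an already-segmented signal. The sketch assumes the pitch periods have been found beforehand and are passed in as a list; it only picks which periods to keep or repeat and leaves out the overlap-add smoothing that a full PSOLA implementation performs. The function name and the nearest-index mapping are illustrative choices.

```python
from typing import List, Sequence, TypeVar

Period = TypeVar("Period")


def change_duration(periods: Sequence[Period], factor: float) -> List[Period]:
    """Stretch (factor > 1) or shrink (factor < 1) by repeating or dropping periods."""
    if not periods or factor <= 0:
        raise ValueError("need at least one pitch period and a positive factor")
    n_out = max(1, round(len(periods) * factor))
    last = len(periods) - 1
    # Map each output slot back to a source period; slots that map to the
    # same source repeat it, and skipped sources are deleted.
    return [periods[min(last, int(i / factor))] for i in range(n_out)]


print(change_duration(["p0", "p1", "p2"], 2.0))
print(change_duration(["p0", "p1", "p2", "p3"], 0.5))
```

With large factors the same period is repeated many times, which is one source of the distortion noted on the slide for large durational changes.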

Corpus-based Unit Selection
  Units determined case-by-case from large hand-labeled or automatically labeled corpus
  Amount of concatenation depends on input and corpus
  Algorithms for determining best units to use
    Longest match to phonemes in input
    Spectral distance measures
    Matching prosodic, amplitude, durational features???
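
One common way to frame "determining the best units" is as a shortest-path search: each candidate unit pays a cost for mismatching the requested target, and each pair of adjacent units pays a cost for a poor join. The slides do not commit to this formulation, so the sketch below is an assumption-laden illustration. The unit representation (an id plus one pitch value), both cost functions, and the name `select_units` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Unit = Tuple[str, float]  # (unit id, a single pitch value standing in for features)


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[float, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> List[Unit]:
    """Dynamic-programming search for the cheapest unit sequence."""
    # layers[i][k] = (best cumulative cost ending in candidate k, backpointer)
    layers: List[List[Tuple[float, int]]] = []
    for i, target in enumerate(targets):
        layer = []
        for unit in candidates[i]:
            cost = target_cost(target, unit)
            back = -1
            if i > 0:
                prev = layers[-1]
                back = min(
                    range(len(prev)),
                    key=lambda j: prev[j][0] + join_cost(candidates[i - 1][j], unit),
                )
                cost += prev[back][0] + join_cost(candidates[i - 1][back], unit)
            layer.append((cost, back))
        layers.append(layer)

    # Trace back from the cheapest final candidate.
    k = min(range(len(layers[-1])), key=lambda j: layers[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = layers[i][k][1]
    return path[::-1]


targets = [100.0, 120.0]
candidates = [[("a1", 100.0), ("a2", 140.0)], [("b1", 180.0), ("b2", 125.0)]]
chosen = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t - u[1]),
    join_cost=lambda a, b: abs(a[1] - b[1]),
)
print([unit_id for unit_id, _ in chosen])
```

In a real system the join cost would be a spectral distance measured at the unit boundaries, and the target cost would combine prosodic, amplitude, and durational mismatches.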

TTS Back End: Summary
  Speech is most natural with the least signal processing: corpus-based unit selection and no coding … but …

TTS: Where are we now?
  Natural sounding speech for some utterances
    Where good match between input and database
  Still … hard to vary prosodic features and retain naturalness
    Yes-no questions: Do you want to fly first class?
  Context-dependent variation still hard to infer from text and hard to realize naturally:

TTS: Where are we now? (continued)
  Appropriate contours from text
  Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me.
  Variation in pitch range, rate, pausal duration to convey topic structure
  Characteristics of ‘emotional speech’ little understood, so hard to convey: … a voice that sounds friendly, sympathetic, authoritative …
  How to mimic real voices?

TTS vs. CTS
  Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, … information explicitly available to NLG
  Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how
  But … generating prosody for CTS isn’t so easy

  In principle, the information TTS systems lack to support natural prosodic assignment is readily available to CTS systems. So the initial hope in the NLG community was that prosodic assignment would be a simple problem. It has, however, proven fairly hard. Why?

Next Week
  Read
  Discussion questions
  Write an outline of your class project and what you’ve done so far