Back-End Synthesis*
Julia Hirschberg
(*Thanks to Dan, Jim, Richard Sproat, and Erica Cooper for slides)


Architectures of Modern Synthesis
–Articulatory Synthesis: model the movements of the articulators and the acoustics of the vocal tract
–Formant Synthesis: start from the acoustics; create rules/filters to generate each formant
–Concatenative Synthesis: use databases of stored speech to assemble new utterances (diphone, unit selection)
–HMM Synthesis
Text from Richard Sproat's slides

Formant Synthesis
–In the past, the most common commercial systems (while computers were relatively underpowered)
–1979: MIT's MITalk (Allen, Hunnicutt, Klatt)
–1983: DECtalk system; the voice of Stephen Hawking

Concatenative Synthesis
All current commercial systems
Diphone Synthesis
–Units are diphones: middle of one phone to middle of the next
–Why? The middle of a phone is its steady state
–Record 1 speaker saying each diphone
Unit Selection Synthesis
–Larger units
–Record 10 hours or more, so there are multiple copies of each unit
–Use search to find the best sequence of units

TTS Demos (all Unit-Selection)
–Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
–Cepstral: http://cepstral.com/cgi-bin/demos/general
–AT&T: http://www2.research.att.com/~ttsweb/tts/demo.php

How do we get from Text to Speech?
–The TTS back end takes a representation of segments + F0 + duration + ?? and creates a waveform
–A full system needs to go all the way from arbitrary text to sound

Front End and Back End
Example: "PG&E will file schedules on April 20."
–TEXT ANALYSIS: text to intermediate representation
–WAVEFORM SYNTHESIS: from intermediate representation to waveform

The Hourglass

Waveform Synthesis
Given:
–A string of phones
–Prosody: the desired F0 for the entire utterance, a duration for each phone, a stress value (and possibly an accent value) for each phone, intensity?
Generate/find:
–Waveforms
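To make the back end's input concrete, here is a minimal sketch of what that intermediate representation might look like as a data structure. This is illustrative only; the field names and types are assumptions, not from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PhoneSpec:
    """One phone in the target specification (hypothetical fields)."""
    phone: str               # e.g. "ih"
    duration_ms: float       # predicted duration for this phone
    f0_targets: List[float]  # desired F0 samples across the phone, in Hz
    stressed: bool = False
    accented: Optional[bool] = None  # pitch accent, if the front end predicts one

# "will" might reach the waveform synthesizer roughly as:
spec = [
    PhoneSpec("w",  55.0, [110.0, 112.0]),
    PhoneSpec("ih", 70.0, [115.0, 118.0], stressed=True),
    PhoneSpec("l",  60.0, [116.0, 114.0]),
]
```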

Diphone TTS
Training:
–Choose units (kinds of diphones)
–Record 1 speaker saying at least 1 example of each
–Mark boundaries and segment to create the diphone database
Synthesis:
–Select the relevant set of diphones from the database
–Concatenate them in order, doing minor signal processing at the boundaries
–Use signal processing techniques to change the prosody (F0, energy, duration) of the sequence
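The concatenation step can be sketched in a few lines. This is a toy illustration, assuming a dict mapping diphone names to waveform arrays; real systems do more careful pitch-synchronous smoothing at the joins.

```python
import numpy as np

def phones_to_diphones(phones):
    """['sil', 'w', 'ih', 'l'] -> ['sil-w', 'w-ih', 'ih-l']"""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

def concatenate(diphone_db, diphone_names, xfade=64):
    """Join stored waveforms with a short linear crossfade at each boundary."""
    out = diphone_db[diphone_names[0]].astype(float).copy()
    ramp = np.linspace(0.0, 1.0, xfade)
    for name in diphone_names[1:]:
        nxt = diphone_db[name].astype(float)
        out[-xfade:] = out[-xfade:] * (1 - ramp) + nxt[:xfade] * ramp
        out = np.concatenate([out, nxt[xfade:]])
    return out
```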

Diphones
Where is the stable region?

Diphone Database
–The middle of a phone is more stable than its edges
–Need O(phones²) units
Some phone-phone sequences don't exist
The AT&T system (Olive et al. '98) had 43 phones: 1849 possible diphones but only 1172 actual
–Phonotactics:
[h] only occurs before vowels
Don't need diphones across silence
But… may want to include stress or accent differences, consonant clusters, etc.
–Requires much knowledge of phonetics in design
–Database relatively small (by today's standards): around 8 megabytes for English (16 kHz, 16-bit)
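These numbers are easy to sanity-check with simple arithmetic. A quick sketch (the 250 s of total diphone audio is an assumption chosen to match the ~8 MB figure quoted above):

```python
phones = 43
possible = phones ** 2          # 1849 possible diphones
actual = 1172                   # after phonotactic gaps (Olive et al. '98)

# Storage: 16 kHz sample rate * 2 bytes per sample (16-bit mono)
seconds_of_audio = 250          # assumed total duration of the diphone set
size_bytes = seconds_of_audio * 16_000 * 2
print(possible, actual)         # 1849 1172
print(size_bytes / 1e6, "MB")   # 8.0 MB
```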

Voice
Speaker
–Called the voice talent
–How to choose?
Diphone database
–Called a voice
–Modern TTS systems have multiple voices

Prosodic Modification
–Modify pitch and duration independently
–Changing the sample rate modifies both: chipmunk speech
–Duration: duplicate/remove parts of the signal
–Pitch: resample to change pitch
Text from Alan Black
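Why does naive resampling give "chipmunk speech"? Playing samples back faster compresses the signal in time and raises every frequency by the same factor. A toy sketch (illustrative only):

```python
import numpy as np

def naive_speedup(x, factor):
    """Resample and play at the original rate: duration shrinks by 'factor'
    and pitch rises by the same factor -- the chipmunk effect."""
    positions = np.arange(0, len(x) - 1, factor)  # read every factor-th sample
    return np.interp(positions, np.arange(len(x)), x)

# naive_speedup(signal, 2.0): half the duration AND one octave higher,
# which is why pitch and duration must be handled separately (e.g., PSOLA).
```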

Speech as a Sequence of Short-Term Signals
Alan Black

Duration Modification
Duplicate/remove short-term signals
Slide from Richard Sproat

Pitch Modification
–Move short-term signals closer together or further apart: more cycles per second means higher pitch, and vice versa
–Add frames as needed to maintain the desired duration
Slide from Richard Sproat

TD-PSOLA™
–Time-Domain Pitch-Synchronous Overlap-and-Add
–Patented by France Telecom (CNET)
–Epoch detection and windowing
–Pitch-synchronous overlap-and-add
–Very efficient
–Can raise F0 by up to a factor of two, or lower it by half
–Smoother transitions
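The core idea fits in a short function. This is a much-simplified sketch, assuming epoch (pitch-mark) positions are already detected and the signal is fully voiced; real implementations handle unvoiced regions, voicing decisions, and window placement far more carefully.

```python
import numpy as np

def td_psola(x, epochs, f0_scale=1.0, dur_scale=1.0):
    """Simplified TD-PSOLA. 'epochs' are pitch-mark sample positions,
    assumed to come from an epoch detector. Each frame is ~two pitch
    periods, Hanning-windowed, overlap-added at spacing period/f0_scale."""
    epochs = np.asarray(epochs)
    out_len = int(len(x) * dur_scale)
    out = np.zeros(out_len + len(x))           # headroom for the last frame
    t_out = float(epochs[0]) * dur_scale
    while t_out < out_len:
        # nearest input epoch once output time is mapped back to input time
        i = int(np.argmin(np.abs(epochs * dur_scale - t_out)))
        left = epochs[max(i - 1, 0)]
        right = epochs[min(i + 1, len(epochs) - 1)]
        p = max((right - left) // 2, 1)        # local pitch period estimate
        lo, hi = max(epochs[i] - p, 0), min(epochs[i] + p, len(x))
        frame = x[lo:hi] * np.hanning(hi - lo)
        start = int(t_out) - (epochs[i] - lo)
        s = max(start, 0)
        out[s:start + len(frame)] += frame[s - start:]
        # closer spacing -> more pitch periods per second -> higher F0
        t_out += p / f0_scale
    return out[:out_len]
```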

Unit Selection Synthesis
Generalization of the diphone intuition:
–Larger units: from diphones to phrases to… sentences
–Record many copies of each unit: e.g., 10 hours of speech instead of 1500 diphones (a few minutes of speech)
–Label diphones and their midpoints

Unit Selection Intuition
Given a large labeled database, find the unit that best matches the desired synthesis specification
What does "best" mean?
–Target cost: find the closest match in terms of phonetic context; F0, stress, phrase position, …
–Join cost: find the best join with neighboring units: matching formants and other spectral characteristics, matching energy, matching F0

Targets and Target Costs
–Target cost T(u_t, s_t): how well does the target specification s_t match a potential database unit u_t?
–Goal: find the unit least unlike the target
–Examples of labeled diphone midpoints:
/ih-t/ +stress, phrase internal, high F0, content word
/n-t/ -stress, phrase final, high F0, function word
/dh-ax/ -stress, phrase initial, low F0, word=the
–The costs of different features have different weights

Target Costs
Comprised of p subcosts:
–Stress
–Phrase position
–F0
–Phone duration
–Lexical identity
Target cost for a unit: T(u_t, s_t) = sum_{j=1..p} w_j T_j(u_t, s_t)
Slide from Paul Taylor
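In code, the weighted sum might look as follows. A sketch with invented weights and feature names; categorical features use a 0/1 mismatch subcost, numeric ones an absolute difference:

```python
# Illustrative weights: the slides only say that different features get
# different weights, not these particular values.
WEIGHTS = {"stress": 1.0, "phrase_pos": 0.5, "f0": 2.0,
           "duration": 1.5, "word": 3.0}

def subcost(spec_val, unit_val):
    if isinstance(spec_val, str):          # categorical: 0/1 mismatch
        return 0.0 if spec_val == unit_val else 1.0
    return abs(spec_val - unit_val)        # numeric: absolute difference

def target_cost(spec, unit, weights=WEIGHTS):
    """T(u_t, s_t) = sum_j w_j * T_j(u_t, s_t)"""
    return sum(w * subcost(spec[f], unit[f]) for f, w in weights.items())
```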

Join (Concatenation) Cost
–A measure of the smoothness of the join between each pair of units to be joined (the target is irrelevant)
–Features, costs, and weights (w)
–Comprised of p subcosts:
Spectral features
F0
Energy
Join cost: J(u_t, u_{t+1}) = sum_{j=1..p} w_j J_j(u_t, u_{t+1})
Slide from Paul Taylor
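A matching sketch for the join cost, comparing features at the end of one unit and the start of the next; the field names and weights are assumptions:

```python
import numpy as np

def join_cost(left, right, w_spec=1.0, w_f0=0.5, w_energy=0.3):
    """J(u_t, u_{t+1}): smoothness across the boundary. Units are assumed
    to carry boundary features: an MFCC vector, F0, and energy."""
    spectral = np.linalg.norm(left["mfcc_end"] - right["mfcc_start"])
    f0 = abs(left["f0_end"] - right["f0_start"])
    energy = abs(left["energy_end"] - right["energy_start"])
    return w_spec * spectral + w_f0 * f0 + w_energy * energy
```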

Total Costs (Hunt and Black 1996)
–We now have weights (per phone type) for the features compared between target and database units
–Find the best path of units through the database that minimizes the total cost:
U* = argmin_{u_1..u_T} [ sum_t T(u_t, s_t) + sum_t J(u_{t-1}, u_t) ]
–A standard problem, solvable with Viterbi search, with a beam-width constraint for pruning
Slide from Paul Taylor
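The search itself is ordinary Viterbi over candidate units. A minimal sketch without beam pruning, reusing the target_cost and join_cost functions sketched above:

```python
def unit_select(targets, candidates, target_cost, join_cost):
    """Viterbi search: candidates[t] is the list of database units that
    could realize targets[t]; returns the minimum-total-cost sequence."""
    T = len(targets)
    # best[t][j] = (cheapest cost of a path ending in candidates[t][j], backptr)
    best = [dict() for _ in range(T)]
    for j, u in enumerate(candidates[0]):
        best[0][j] = (target_cost(targets[0], u), None)
    for t in range(1, T):
        for j, u in enumerate(candidates[t]):
            tc = target_cost(targets[t], u)
            cost, back = min(
                (best[t - 1][i][0] + join_cost(candidates[t - 1][i], u) + tc, i)
                for i in best[t - 1]
            )
            best[t][j] = (cost, back)
    # trace back the cheapest path
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for t in range(T - 1, 0, -1):
        j = best[t][j][1]
        path.append(j)
    return [candidates[t][j] for t, j in enumerate(reversed(path))]
```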

Synthesizing…

Unit Selection Summary
Advantages:
–Quality far superior to diphones: fewer joins, more choices of units
–Natural prosody selection sounds better
Disadvantages:
–Quality is very bad when there is no good match in the database (HCI issue: a mix of very good and very bad output is quite annoying)
–Synthesis is computationally expensive
–Can't control prosody well at all: the diphone technique can vary emphasis; unit selection can give a result that conveys the wrong meaning

New Trend
Major problem with unit selection synthesis: can't modify the signal
–Mixing modified and unmodified units sounds unpleasant
–The database often doesn't have exactly what you want
Solution? HMM (Hidden Markov Model) synthesis
–Won a recent TTS bakeoff: sounds less natural to researchers, but naïve subjects preferred it
–Has the potential to improve over both diphone and unit selection synthesis
–Generates speech parameters from statistics trained on data
–Voice quality can be changed by transforming the HMM parameters

Concatenative Synthesis

HMM Synthesis
–A parametric model
–Can train on mixed data from many speakers
–The model takes up a very small amount of space
–Speaker adaptation is possible

HMMs
Some hidden process has generated some visible observation.

HMMs
Hidden states have transition probabilities and emission probabilities.
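To pin down the two probability types, here is a toy left-to-right HMM of the kind typically used per phone model; all numbers are invented for illustration:

```python
# Toy 3-state left-to-right HMM (the usual per-phone topology).
# Transition probabilities: each state loops or moves one step right.
trans = {
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s2": 0.7, "s3": 0.3},
    "s3": {"s3": 0.5, "end": 0.5},
}

# Emission probabilities: each state emits observations from a Gaussian
# (one dimension here; real systems emit full spectral + F0 vectors).
emit = {
    "s1": {"mean": -1.2, "var": 0.8},
    "s2": {"mean":  0.3, "var": 0.5},
    "s3": {"mean":  1.0, "var": 0.9},
}
```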

HMM Synthesis
–Every phoneme+context is represented by an HMM
–Example training sentences: "The cat is on the mat." / "The cat is near the door."
–Acoustic features extracted: F0, spectrum, duration
–Train the HMMs with these examples
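What "phoneme+context" means in practice is a rich model name per phone. A sketch of building such labels; the exact format here is invented (real HTS-style full-context labels encode far more: syllable, word, and phrase features):

```python
def context_label(phones, i, stress, pos_in_word):
    """Name the model for phones[i] in its context (illustrative format)."""
    prev = phones[i - 1] if i > 0 else "sil"
    nxt = phones[i + 1] if i < len(phones) - 1 else "sil"
    return f"{prev}-{phones[i]}+{nxt}/stress:{stress}/pos:{pos_in_word}"

# "the cat" -> one context-dependent model per phone:
phones = ["dh", "ax", "k", "ae", "t"]
labels = [context_label(phones, i, 0, i) for i in range(len(phones))]
# e.g. labels[1] == 'dh-ax+k/stress:0/pos:1'
```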

HMM Synthesis
Each state outputs acoustic features (a spectrum, an F0, and a duration)
Interpolate…

Problems with HMM Synthesis
–Many contextual features = data sparsity
–Solution: cluster similar-sounding phones, e.g. 'bog' and 'dog': the /aa/ in both has similar acoustic features despite the different contexts
–Create a single HMM that produces both, trained on examples of both
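In practice this clustering is done with decision trees over context questions; all contexts reaching the same leaf share one model. A toy sketch (the questions and tree shape are invented):

```python
# Yes/no questions about a phone's context; real systems learn which
# questions to split on and when to stop, based on the training data.
QUESTIONS = [
    ("L_voiced_stop", lambda ctx: ctx["left"] in {"b", "d", "g"}),
    ("R_nasal", lambda ctx: ctx["right"] in {"m", "n", "ng"}),
]

def leaf_for(ctx):
    """Descend the (hypothetical) tree; the leaf id names a shared model."""
    return "/".join(f"{name}={'y' if q(ctx) else 'n'}" for name, q in QUESTIONS)

# The /aa/ of 'bog' and of 'dog' land in the same leaf, so they share
# one HMM trained on examples of both:
print(leaf_for({"left": "b", "right": "g"}))  # L_voiced_stop=y/R_nasal=n
print(leaf_for({"left": "d", "right": "g"}))  # L_voiced_stop=y/R_nasal=n
```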

Experiments: Google, Summer 2010
–Can we train on lots of mixed data? (~1 utterance per speaker)
–More data vs. better data
–15k utterances from Google Voice Search as training data
–Example queries: "ace hardware", "rural supply"

More Data vs. Better Data
Voice Search utterances filtered by speech recognition confidence scores:
–50% confidence: 6849 utterances
–75%: 4887 utterances
–90%: 3100 utterances
–95%: 2010 utterances
–99%: 200 utterances

HMM Synthesis
–Unit selection (Roger)
–HMM (Roger)
–Unit selection (Nina)
–HMM (Nina)

Tokuda et al. '02

Demo TTS Systems