6-Text To Speech (TTS) Speech Synthesis

6-Text To Speech (TTS) Speech Synthesis
- Speech Synthesis Concept
- Speech Naturalness
- Phone Sequence To Speech
- Articulatory Approaches
- Concatenative Approaches
- HMM-Based Approaches
- Rule-Based Approaches

Speech Synthesis Concept
Text → Text to Phone Sequence (Natural Language Processing, NLP) → Phone Sequence to Speech (Speech Processing) → Speech waveform
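As a rough illustration of this two-stage pipeline, the sketch below wires a stub NLP front end to a stub speech-processing back end; the toy lexicon, the fixed phone duration, and the silent "waveform" are placeholder assumptions, not part of the slides.

```python
import numpy as np

TOY_LEXICON = {"hello": ["h", "e", "l", "o"], "world": ["w", "o", "r", "l", "d"]}

def text_to_phone_sequence(text):
    """NLP front end (stub): tokenize and look each word up in a toy lexicon."""
    phones = []
    for word in text.lower().split():
        phones.extend(TOY_LEXICON.get(word, []))
    return phones

def phone_sequence_to_speech(phones, sr=16000, phone_dur=0.08):
    """Speech-processing back end (stub): allocate a waveform of the right length."""
    return np.zeros(int(len(phones) * phone_dur * sr))

waveform = phone_sequence_to_speech(text_to_phone_sequence("hello world"))
```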

Speech Naturalness
- Avoiding undesirable noise, distortion, and artifacts that do not belong to the speech
- Prosody generation: speech energy, duration, pitch
- Intonation
- Stress

Speech Naturalness (Cont’d)
- Intonation and stress are very important for speech naturalness
- Intonation: variation of the pitch (F0) contour over the course of an utterance
- Stress: a local increase in pitch frequency at a specific point in time

Which word receives the accent? It depends on the context. The ‘new’ information in the answer to a question is often accented, while the ‘old’ information usually is not.
Q1: What types of foods are a good source of vitamins?
A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a source of vitamins?
A2: Legumes are a GOOD source of vitamins.
Q3: I’ve heard that legumes are healthy, but what are they a good source of?
A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti

Same ‘tune’, different alignment LEGUMES are a good source of vitamins The main rise-fall accent (= “I assert this”) shifts locations. Slide from Jennifer Venditti

Same ‘tune’, different alignment Legumes are a GOOD source of vitamins The main rise-fall accent (= “I assert this”) shifts locations. Slide from Jennifer Venditti

Same ‘tune’, different alignment legumes are a good source of VITAMINS The main rise-fall accent (= “I assert this”) shifts locations. Slide from Jennifer Venditti

Types of Waveform Synthesis
- Articulatory Synthesis: model the movements of the articulators and the acoustics of the vocal tract
- Concatenative Synthesis: use databases of stored speech to assemble new utterances (Diphone, Unit Selection)
- Statistical (HMM) Synthesis: train parameters on databases of speech
- Rule-Based (Formant) Synthesis: start with acoustics, create rules/filters to create the waveform

Articulatory Synthesis
Simulation of the physical processes of human articulation. Wolfgang von Kempelen (1734-1804) and others used bellows, reeds, and tubes to construct mechanical speaking machines. Modern versions “simulate” electronically the effect of articulator positions, vocal tract shape, etc., on air flow.

Concatenative approaches
Two main approaches:
1. Concatenating phone units. Example: concatenating samples of recorded diphones or syllables.
2. Unit selection: uses several samples for each phone unit and selects the most appropriate one when synthesizing.

Phone Units
- Paragraph
- Sentence
- Word (depends on the language; usually more than 100,000)
- Syllable
- Diphone & Triphone
- Phoneme (between 10 and 100)

Phone Units (Cont’d)
Diphone: we model the transition between two phonemes.
[Figure: phoneme sequence p1 p2 p3 p4 p5 with the diphone spans marked between neighbouring phonemes]
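A small illustration (not from the slides) of how a phoneme string maps onto the diphone units a concatenative system would retrieve; the padding symbol "sil" is an assumption.

```python
def phonemes_to_diphones(phonemes, pad="sil"):
    """Turn a phoneme sequence into the diphone units spanning each transition."""
    seq = [pad] + list(phonemes) + [pad]
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

print(phonemes_to_diphones(["p1", "p2", "p3", "p4", "p5"]))
# ['sil-p1', 'p1-p2', 'p2-p3', 'p3-p4', 'p4-p5', 'p5-sil']
```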

Phone Units (Cont’d)
- Farsi phonemes: 30
- Farsi diphones: 30 × 30 = 900; the phoneme combination /zho/ is missing (?)
- Farsi triphones: 30³ = 27,000 in theory; not all of the triphones are actually used

Phone Units (Cont’d)
- Syllable = Onset (consonant) + Rhyme
- A syllable is a set of phonemes that contains exactly one vowel
- Syllables in Farsi: CV, CVC, CVCC; there are about 4,000 syllables in Farsi
- Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, ...; the number of syllables in English is very large
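A toy sketch of checking the Farsi syllable templates named above (CV, CVC, CVCC); the vowel set passed in is a made-up placeholder, not the real Farsi inventory.

```python
import re

VOWELS = {"a", "e", "i", "o", "u"}  # assumed toy vowel set for illustration only

def syllable_template(phonemes, vowels=VOWELS):
    """Map phonemes to a C/V string and check it against the Farsi templates CV, CVC, CVCC."""
    pattern = "".join("V" if p in vowels else "C" for p in phonemes)
    return pattern if re.fullmatch(r"CVC{0,2}", pattern) else None

print(syllable_template(["k", "e", "t"]))        # 'CVC'
print(syllable_template(["k", "e", "t", "f"]))   # 'CVCC'
```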

Phone Sequence To Speech (Cont’d)
Text → Text to Phone Sequence (NLP) → Phone sequence to primitive utterance → Primitive utterance to natural speech (Speech Processing) → Speech

Concatenative Approaches
- In these approaches we store units of natural speech and reconstruct the desired speech from them
- We select the appropriate phone unit for speech synthesis
- We can store compressed parameters instead of the raw waveform

Concatenative Approaches (Cont’d)
Benefits of storing compressed parameters instead of the raw waveform:
- Less memory use
- A general representation instead of a specific stored utterance
- Easier prosody generation

Concatenative Approaches (Cont’d)

Phone Unit            | Type of Storing
Paragraph, Sentence   | Main waveform
Word, Syllable        | Coded / main waveform
Diphone, Phoneme      | Coded waveform

Concatenative Approaches (Cont’d)
Pitch-Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing the transitions between phone units. Overlap-add itself is a standard DSP method, and PSOLA is also a basic operation in voice conversion. In the analysis stage of this method, frames are selected synchronously with the pitch markers.
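A minimal sketch of the pitch-synchronous overlap-add idea, assuming pitch marks are already available and a constant pitch period for brevity (real PSOLA uses the local period at each mark, and pitch-mark detection is not shown); this is an illustration, not a production implementation.

```python
import numpy as np

def psola_overlap_add(x, analysis_marks, synthesis_marks, period):
    """Extract a two-period Hann-windowed frame around each analysis pitch mark and
    overlap-add it at the corresponding synthesis mark. Moving the synthesis marks
    closer together raises F0; spreading them apart lowers it."""
    win = np.hanning(2 * period)
    y = np.zeros(int(max(synthesis_marks)) + 2 * period)
    for ma, ms in zip(analysis_marks, synthesis_marks):
        a0, s0 = int(ma) - period, int(ms) - period
        if a0 < 0 or a0 + 2 * period > len(x) or s0 < 0:
            continue                      # skip frames that fall outside the signal
        y[s0:s0 + 2 * period] += win * x[a0:a0 + 2 * period]
    return y

# Usage sketch: stretch the marks by 20% to lower the pitch of a (hypothetical) voiced segment.
# marks = detected pitch marks of `speech`; new_marks = np.round(marks * 1.2).astype(int)
# lower = psola_overlap_add(speech, marks, new_marks, period=int(16000 / 120))
```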

Diphone Architecture Example
Training:
- Choose units (kinds of diphones)
- Record 1 speaker saying 1 example of each diphone
- Mark the boundaries of each diphone, cut each diphone out, and create a diphone database
Synthesizing an utterance:
- Grab the relevant sequence of diphones from the database
- Concatenate the diphones, doing slight signal processing at the boundaries
- Use signal processing to change the prosody (F0, energy, duration) of the selected sequence of diphones
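A toy sketch of the synthesis half of this recipe: looking up stored waveforms in a hypothetical in-memory diphone database and concatenating them with a short crossfade at each join; the prosody-modification step is omitted.

```python
import numpy as np

def concatenate_diphones(diphone_db, diphone_names, xfade=80):
    """Concatenate stored diphone waveforms, crossfading `xfade` samples at each join.
    diphone_db: dict mapping diphone names to 1-D waveforms (assumed longer than xfade)."""
    out = np.array(diphone_db[diphone_names[0]], dtype=float)
    ramp = np.linspace(0.0, 1.0, xfade)
    for name in diphone_names[1:]:
        unit = np.array(diphone_db[name], dtype=float)
        out[-xfade:] = out[-xfade:] * (1.0 - ramp) + unit[:xfade] * ramp
        out = np.concatenate([out, unit[xfade:]])
    return out
```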

Unit Selection
Same idea as concatenative synthesis, but the database contains a bigger variety of “phone units”, from diphones to sentences. Multiple examples of each phone unit (under different prosodic conditions) are recorded. Selection of the appropriate unit therefore becomes more complex, as the database contains competing candidates for selection.

Unit Selection
- Unlike diphone concatenation, little or no signal processing is applied to each unit
- Natural data solves problems with diphones: diphone databases are carefully designed, but the speaker makes errors, the speaker doesn’t speak the intended dialect, and the database design has to be right
- If the labeling is automatic, the database is labeled with what the speaker actually said; coarticulation, schwas, and flaps are natural
- “There’s no data like more data”: lots of copies of each unit mean you can choose just the right one for the context, and larger units mean you can capture wider effects

Unit Selection Issues
Given a big database, for each segment (diphone) that we want to synthesize, find the unit in the database that is best for synthesizing this target segment. What does “best” mean?
- “Target cost”: closest match to the target description, in terms of phonetic context, F0, stress, phrase position
- “Join cost”: best join with neighboring units, in terms of matching formants and other spectral characteristics, matching energy, matching F0

Unit Selection Search
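A compact sketch of the search described on this and the previous slide: dynamic programming (Viterbi) over candidate units, minimizing the total target cost plus join cost. The cost functions are placeholders for the phonetic-context, F0, and spectral terms listed above.

```python
def unit_selection_search(targets, candidates, target_cost, join_cost):
    """targets: list of target segment descriptions.
    candidates: list (per target) of candidate database units.
    Returns the candidate sequence minimizing the sum of target and join costs."""
    # best[i][j] = (cost of best path ending in candidate j of target i, backpointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], c), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((prev_cost + tc, prev_j))
        best.append(row)
    # Backtrace the cheapest path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy usage with numeric "units": target cost = distance to the target value,
# join cost = discontinuity between consecutive units.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 1.5], [1.8, 2.4], [2.9, 3.5]]
print(unit_selection_search(targets, candidates,
                            target_cost=lambda t, u: abs(t - u),
                            join_cost=lambda u1, u2: abs(u2 - u1)))  # -> [1.5, 1.8, 2.9]
```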

Joining Units
Unit selection, just like diphone synthesis, needs to join the units pitch-synchronously. For diphone synthesis, we need to modify F0 and duration. For unit selection, in principle we also need to modify the F0 and duration of the selected units, but in practice, if the unit-selection database is big enough (as in commercial systems), no prosodic modifications are made (the selected units may already be close to the desired prosody).

Unit Selection Summary
Advantages:
- Quality is far superior to diphones
- Natural prosody selection sounds better
Disadvantages:
- Quality can be very bad in some places
- HCI problem: a mix of very good and very bad is quite annoying
- Synthesis is computationally expensive
- Needs more memory than diphone synthesis

Rule-Based Approach
Stages:
- Determine the speech model and model parameters
- Determine the type of phone units
- Determine the parameter values for each phone unit
- Substitute the sequence of phone units with its equivalent parameter sequence
- Feed the parameter sequence into the speech model

KLATT 80 Model

KLATT 88 Model

THE KLSYN88 CASCADE/PARALLEL FORMANT SYNTHESIZER
[Block diagram: laryngeal sound sources (modified LF model, KLGLOTT88 model as default, filtered impulse train, aspiration noise generator, spectral tilt low-pass resonator), a cascade vocal tract model (nasal and tracheal pole/zero pairs, first through fifth formant resonators), and a parallel vocal tract model (frication noise generator, formant resonators, bypass path), with the associated frequency, bandwidth, and amplitude control parameters]
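To make the cascade branch concrete, here is a minimal Python sketch (an illustration, not the KLSYN88 implementation): each formant is a second-order digital resonator using the Klatt-style coefficient formulas, driven by a plain impulse train instead of the LF/KLGLOTT88 sources; the formant frequencies and bandwidths are typical values assumed for an /a/-like vowel.

```python
import numpy as np
from scipy.signal import lfilter

def resonator_coeffs(freq, bw, sr):
    """Second-order resonator (Klatt-style): y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    C = -np.exp(-2 * np.pi * bw / sr)
    B = 2 * np.exp(-np.pi * bw / sr) * np.cos(2 * np.pi * freq / sr)
    A = 1 - B - C
    return [A], [1, -B, -C]            # (b, a) for scipy.signal.lfilter

def cascade_formant_synth(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                          dur=0.5, sr=16000):
    """Impulse-train source passed through formant resonators in cascade."""
    n = int(dur * sr)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0       # simple glottal pulse train
    y = source
    for freq, bw in formants:
        b, a = resonator_coeffs(freq, bw, sr)
        y = lfilter(b, a, y)
    return y / np.max(np.abs(y))
```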

Three Voicing Source Models in KLATT 88
- The old KLSYN impulsive source
- The KLGLOTT88 model
- The modified LF model

HMM-Based Synthesis
- Corpus-based, statistical parametric synthesis
- Proposed in the mid-1990s; popular since the mid-2000s
- Large data + automatic training => automatic voice building
- Source-filter model + statistical acoustic model; flexible to change its voice characteristics
- HMM as the statistical acoustic model; we focus on HMM-based speech synthesis

Training: first, extract parametric representations of speech, including spectral and excitation parameters, from a speech database; then model them using a set of generative models (e.g., HMMs). [Diagram: training and synthesis stages]

Speech Parameter Modeling Based on HMM
- Spectral parameter modeling
- Excitation parameter modeling
- State duration modeling

Spectral parameter modeling
Mel-cepstral analysis is used for spectral estimation. A continuous-density HMM is used for vocal tract modeling, in the same way as in speech recognition systems. The continuous-density Markov model is a finite state machine that makes one state transition per time unit (i.e., per frame). First, a decision is made as to which state to move to next (possibly the same state); then an output vector is generated according to the probability density function (pdf) of the current state.
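A toy sketch of this generative behaviour: at each frame the model chooses the next state from the transition probabilities and then emits an observation vector from that state's pdf. A single diagonal Gaussian per state is assumed here for simplicity, rather than the mixture or context-dependent densities a real system would use.

```python
import numpy as np

def sample_from_hmm(trans, means, variances, n_frames, rng=np.random.default_rng(0)):
    """trans[i, j] = P(next state j | current state i); means/variances: per-state Gaussians."""
    state = 0
    obs = []
    for _ in range(n_frames):
        state = rng.choice(len(trans), p=trans[state])                    # transition (may self-loop)
        obs.append(rng.normal(means[state], np.sqrt(variances[state])))   # emit from the state pdf
    return np.array(obs)

# Toy left-to-right 3-state HMM emitting 2-dimensional vectors.
trans = np.array([[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
variances = np.ones((3, 2)) * 0.01
frames = sample_from_hmm(trans, means, variances, n_frames=30)
```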

Cont’d…

F0 parameter modeling
While the F0 observation has a continuous value in voiced regions, no value exists in unvoiced regions. We can model this kind of observation sequence by assuming that the observed F0 values come from a one-dimensional space and the “unvoiced” symbol comes from a zero-dimensional space.
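One way to make this concrete (the notation is assumed for illustration, simplified to a single state and stream): each F0 observation is either ("voiced", value) or ("unvoiced", None), and the state holds a voiced-space weight, a Gaussian for the one-dimensional voiced space, and the complementary weight for the zero-dimensional unvoiced space.

```python
import math

def msd_likelihood(obs, w_voiced, mean, var):
    """obs is ('voiced', f0_value) or ('unvoiced', None).
    Voiced: voiced-space weight times a 1-D Gaussian density; unvoiced: the remaining weight."""
    kind, value = obs
    if kind == "unvoiced":
        return 1.0 - w_voiced
    gauss = math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return w_voiced * gauss

print(msd_likelihood(("voiced", 118.0), w_voiced=0.8, mean=120.0, var=25.0))
print(msd_likelihood(("unvoiced", None), w_voiced=0.8, mean=120.0, var=25.0))
```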

Calculation of dynamic features: as mentioned, mel-cepstral coefficients are used as the spectral parameters; their dynamic features Δc and Δ²c are calculated as follows. Dynamic features for F0: in unvoiced regions, p_t, Δp_t and Δ²p_t are defined as a discrete symbol. When the dynamic features at a boundary between voiced and unvoiced regions cannot be calculated, they are also defined as a discrete symbol.
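The formula itself is not reproduced on this slide; a commonly used definition (a ±1-frame window, stated here as an assumption rather than taken from the slide) is

\[
\Delta c_t = \tfrac{1}{2}\,(c_{t+1} - c_{t-1}), \qquad
\Delta^2 c_t = c_{t-1} - 2\,c_t + c_{t+1},
\]

and analogously for the F0 parameter p_t in voiced regions.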

Effect of dynamic features
By using dynamic features, the generated speech parameter vectors reflect not only the means of the static and dynamic feature vectors but also their covariances. The estimated trajectories become smoother, which has both good and bad effects.
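For reference, the parameter generation step behind this smoothing is usually written as follows (a standard formulation; the notation is an assumption, not from the slide): with the static and dynamic observations constrained as \(o = W c\), where \(W\) stacks the delta windows, the maximum-likelihood static parameter sequence \(c\) satisfies

\[
W^{\top} \Sigma^{-1} W \, c = W^{\top} \Sigma^{-1} \mu ,
\]

where \(\mu\) and \(\Sigma\) are the means and covariances given by the HMM state sequence; solving this linear system yields trajectories that respect both the static means and the delta constraints.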

Multi-Stream HMM structure
The sequence of mel-cepstral coefficient vectors and the F0 pattern are modeled by a continuous-density HMM and a multi-space probability distribution HMM, respectively. Putting all of this together in one multi-stream model has some advantages.

Synthesis part
An arbitrarily given text to be synthesized is converted to a context-dependent label sequence by a text analyzer. For a TTS system, the text analyzer should be able to extract contextual information; however, no available text analyzer can extract accentual phrases and decide the accent type of each accentual phrase.

Some contextual factors (when we have an HMM for each phoneme)
- {preceding, current, succeeding} phoneme
- position of the breath group in the sentence
- {preceding, current, succeeding} part of speech
- position of the current accentual phrase in the current breath group
- position of the current phoneme in the current accentual phrase
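A toy sketch of packing these factors into a context-dependent label for one phoneme; the field layout below is made up for illustration and is not the real full-context label format used by HTS-style systems.

```python
def context_label(prev_ph, cur_ph, next_ph,
                  pos_in_accentual_phrase, accentual_phrase_pos_in_breath_group,
                  breath_group_pos_in_sentence, part_of_speech):
    """Combine the contextual factors listed above into one label string
    (hypothetical field layout, for illustration only)."""
    return (f"{prev_ph}-{cur_ph}+{next_ph}"
            f"/pos_in_phrase:{pos_in_accentual_phrase}"
            f"/phrase_in_bg:{accentual_phrase_pos_in_breath_group}"
            f"/bg_in_sent:{breath_group_pos_in_sentence}"
            f"/pos:{part_of_speech}")

print(context_label("sil", "h", "e", 1, 1, 1, "NOUN"))
```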

According to the obtained state durations, a sequence of mel-cepstral coefficients and F0 values, including voiced/unvoiced decisions, is generated from the sentence HMM by the speech parameter generation algorithm. Finally, speech is synthesized directly from the generated mel-cepstral coefficients and F0 values by the MLSA filter.
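A minimal sketch of the excitation half of this final step: a pulse train for voiced frames and white noise for unvoiced frames, built from the generated F0 track. In a real system this excitation would then be shaped by the MLSA filter driven by the generated mel-cepstra; the filter itself is not implemented here, and the frame shift and noise gain are assumed values.

```python
import numpy as np

def make_excitation(f0_per_frame, frame_shift=80, sr=16000, rng=np.random.default_rng(0)):
    """f0_per_frame: generated F0 in Hz per frame, 0 meaning unvoiced.
    Returns pulse-train excitation for voiced frames and white noise for unvoiced frames."""
    excitation = []
    phase = 0.0
    for f0 in f0_per_frame:
        frame = np.zeros(frame_shift)
        if f0 > 0:                          # voiced: place pulses every sr/f0 samples
            period = sr / f0
            while phase < frame_shift:
                frame[int(phase)] = 1.0
                phase += period
            phase -= frame_shift
        else:                               # unvoiced: white noise
            frame = rng.standard_normal(frame_shift) * 0.1
            phase = 0.0
        excitation.append(frame)
    return np.concatenate(excitation)

exc = make_excitation([120.0, 120.0, 0.0, 0.0, 130.0])
```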

Spectral representation & corresponding synthesis filter
- cepstrum: LMA filter
- generalized cepstrum: GLSA filter
- mel-cepstrum: MLSA (Mel Log Spectrum Approximation) filter
- mel-generalized cepstrum: MGLSA filter
- LSP: LSP filter
- PARCOR: all-pole lattice filter
- LPC: all-pole filter

Advantages
Most of the advantages of statistical parametric synthesis over unit-selection synthesis are related to its flexibility, which comes from the statistical modeling process: transforming voice characteristics, speaking styles, and emotions. Although combining unit selection with voice conversion (VC) techniques can also provide some of this flexibility, high-quality voice conversion is still problematic.

Adaptation (mimicking voices): adaptation techniques were originally developed in speech recognition to adjust a general acoustic model to a particular speaker. These techniques have also been applied to HMM-based speech synthesis to obtain speaker-specific synthesis systems from a small amount of speech data.
Interpolation (mixing voices): interpolate the parameters among representative HMM sets. New voices can be obtained even when no adaptation data is available, and we can gradually change between speakers and speaking styles.
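A sketch of the interpolation idea: a new voice is a weighted combination of the parameters of representative speaker models. Only Gaussian means are interpolated here for brevity; real systems also combine covariances and weights.

```python
import numpy as np

def interpolate_means(speaker_means, weights):
    """speaker_means: list of per-speaker mean arrays (same shape); weights are normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * np.asarray(m, dtype=float) for w, m in zip(weights, speaker_means))

# Gradually morph from speaker A to speaker B by sweeping the interpolation weight.
speaker_a = np.array([1.0, 2.0, 3.0])
speaker_b = np.array([2.0, 1.0, 0.0])
for alpha in (0.0, 0.5, 1.0):
    print(interpolate_means([speaker_a, speaker_b], [1 - alpha, alpha]))
```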

Some rule-based applications
- KLATT C implementation: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/speech/systems/klatt/0.html
- A web page that generates a speech waveform online given KLATT parameters: http://www.asel.udel.edu/speech/tutorials/synthesis/Klatt.html
- Formant synthesis demo using Fant's formant model: http://www.speech.kth.se/wavesurfer/formant/

TTS Online Demos
- AT&T: http://www2.research.att.com/~ttsweb/tts/demo.php
- Festival: http://www.cstr.ed.ac.uk/projects/festival/morevoices.html
- Cepstral: http://www.cepstral.com/cgi-bin/demos/general
- IBM: http://www.research.ibm.com/tts/coredemo.shtml

Festival
- Open-source speech synthesis system
- Designed for development and runtime use
- Used in many commercial and academic systems
- Distributed with RedHat 9.x, etc.
- Hundreds of thousands of users
- Multilingual: no built-in language; designed to allow the addition of new languages
- Additional tools for rapid voice development: statistical learning tools, scripts for building models
1/5/07  Text from Richard Sproat

Festival as software
http://festvox.org/festival/
- General system for multilingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules: lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing
- General tools: intonation analysis (F0, Tilt), signal processing, CART building, N-gram, SCFG, WFST
1/5/07  Text from Richard Sproat

Festival as software
http://festvox.org/festival/
- No fixed theories; new languages without new C++ code
- Multiplatform (Unix/Windows)
- Full sources in distribution
- Free software
1/5/07  Text from Richard Sproat

CMU FestVox project
Festival is an engine; how do you make voices?
- Festvox: building synthetic voices: tools, scripts, documentation
- Discussion and examples for building voices
- Example voice databases
- Step-by-step walkthroughs of processes
- Support for English and other languages
- Support for different waveform synthesis methods: diphone, unit selection, limited domain
1/5/07  Text from Richard Sproat

Future Trends
- Speaker adaptation
- Language adaptation