Overview on Text to Speech Systems (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)
Kishore Prahallad International Institute of Information Technology (IIIT) Hyderabad, India & Language Technologies Institute, Carnegie Mellon University Kishore Prahallad IIIT-Hyderabad
Kishore Prahallad (kishore@iiit.ac.in), IIIT-Hyderabad
Topics
- Overview and components of a text to speech system
- Text normalization
- Linguistic analysis
- Speech generation
  - Formant synthesis
  - Concatenative synthesis
  - Statistical parametric synthesis
A Text to Speech (TTS) system converts text into spoken language
Example input text: "Welcome to the world of text to speech systems…"
Text -> Text to Speech System -> Speech
Types of TTS Systems
- Limited domain
  - Voice built specifically for an application
  - Limited set of words and sentences
  - Examples: weather forecasts, air/rail travel information systems, agriculture information systems, etc.
- Unrestricted
  - A generic voice capable of reading anything!
  - Examples: news reading, story-telling, desktop assistant, etc.
How to synthesize speech?
- Record a set of phones (say /a/, /aa/, /i/, /ii/, /k/, /kh/)
- Given a text, for each word obtain the sequence of phones to be concatenated
  - For example: amma -> /a/ /m/ /m/ /a/
- Concatenate the *pre-recorded phones* to get the speech
- No!
What needs to be incorporated then?
- Coarticulation
  - Coupling effect when two sounds are produced together
  - Production of /k/ and /a/ in isolation is different from producing /ka/
- Prosody
  - Energy: a suitable energy contour
  - Pitch: pitch and its contour (variation across the phones)
  - Duration: how long each phone should be
Architecture of a TTS System
Text / Tagged Text -> Pre-processing -> Text Normalization -> Word sequence -> Linguistic Analysis -> Prosodic Prediction -> Phone sequence -> Waveform Generation -> Speech
- Pre-processing: document structure detection; conversion from Unicode and fonts
- Text Normalization: handling numbers, symbols, abbreviations, etc.
- Linguistic Analysis: part of speech tagging, phrase breaks, letter to sound rules
- Prosodic Prediction: duration, F0 contour, energy
Why Preprocessing?
- Is the input to a TTS system a sequence of phones? No! The input is *text*:
  - Raw text
  - Formatted text (MS Word, PDF/PS, MS PPT)
  - Tagged text (XML-like tags used as markup for synthesis)
  - Encoded text
    - Multilingual text in Unicode, fonts, etc.
Preprocessing contd...
- Conversion from different formats (PDF/PS/DOC) to a generic tagged format or raw text
- Handle multilingual text in Unicode
  - Unicode is a character code table similar to ASCII, but it can represent practically any language in the world
Text Normalization
- An advanced TTS system should be able to handle non-standard words
- Standard words are those whose entry can be found in the pronunciation dictionary
- A pronunciation dictionary maps a word to a sequence of phones
Abbreviations and Acronyms
- Titles: Dr., MD, Mr., Mrs., St. (Saint), etc.
- Measures: ft., Hz, mm, cm, in, kg
- Place names: CO, LA, PA, USA, IN, St. (street), Dr. (drive)
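The list above shows that "St." and "Dr." each have two readings. A minimal sketch of context-based expansion follows. The table `AMBIGUOUS`, the function name, and the capitalisation heuristic are all assumptions made for illustration; a real system would use richer context.

```python
# Hypothetical table: abbreviation -> (reading as a title, reading as a place name)
AMBIGUOUS = {
    "St.": ("Saint", "Street"),
    "Dr.": ("Doctor", "Drive"),
}

def expand_abbreviations(tokens):
    """Expand ambiguous abbreviations using the following token as context."""
    expanded = []
    for i, token in enumerate(tokens):
        if token in AMBIGUOUS:
            as_title, as_place = AMBIGUOUS[token]
            next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed heuristic: a capitalised word right after the abbreviation
            # suggests a title (Dr. Smith); otherwise read it as a place name.
            expanded.append(as_title if next_token[:1].isupper() else as_place)
        else:
            expanded.append(token)
    return expanded
```

With this heuristic, `expand_abbreviations("Dr. Smith lives on Park Dr.".split())` yields the tokens of "Doctor Smith lives on Park Drive". The heuristic fails when a capitalised word follows a street name, which is why such rules are usually backed by statistical disambiguation.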
Numbers
- Phone numbers: (717) …
- Dates: mm/dd/yy, dd/mm/yy, July 4 05
- Times: 13:00, 1:00 PM, 12:15:35
- Money: $20, 300 €
Numbers
- Account numbers: 13-digit, 9-digit numbers
- Ordinal numbers and fractions: 1st, 2nd, 1000th, ½, ¼, 1/100
- Cardinal numbers (amounts, statements): 2426 can be read as
  - two four two six
  - twenty four twenty six
  - two thousand four hundred and twenty six
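The three readings of 2426 can be sketched in Python. This is an illustration only: the function names are hypothetical, `as_cardinal` covers 0 to 9999, and `as_pairs` assumes an even number of digits.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def under_100(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

def as_digits(digits):
    """Digit-by-digit reading, e.g. for account numbers."""
    return " ".join(ONES[int(ch)] for ch in digits)

def as_pairs(digits):
    """Reading in two-digit groups; assumes an even number of digits."""
    if len(digits) % 2 != 0:
        raise ValueError("as_pairs expects an even number of digits")
    return " ".join(under_100(int(digits[i:i + 2]))
                    for i in range(0, len(digits), 2))

def as_cardinal(n):
    """Full cardinal reading for 0..9999."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        if parts:
            parts.append("and")
        parts.append(under_100(rest))
    return " ".join(parts)
```

Choosing among the three readings is the hard part of normalization: it depends on whether the token is an account number, a year-like code, or an amount, and that has to be decided from context.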
Linguistic Analysis
- Part of Speech (POS) tagging
  - Proper noun, verb, adjective, etc.
  - A mapping table: word -> pos_tag
  - Statistically trained or manually prepared
- Prosodic phrase breaks
  - POS tags are useful to predict phrase breaks in a sentence
  - Ex: man’triji ne kahaa ki aaj hamaare desh
        man’triji ne kahaa ki [pau] aaj hamaare desh
  - “ki” is a conjunction ("that"), and we give a short pause after it while speaking
  - The task is to predict these phrase breaks, so that short pauses can be introduced during synthesis
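A rule-based version of the phrase-break example can be sketched as follows. The tag names and the rule "break after a conjunction" are assumptions for illustration; in practice the break predictor is usually trained statistically on POS sequences.

```python
# Hypothetical tag set: POS tags after which a short pause is inserted.
BREAK_AFTER_TAGS = {"CONJ"}

def insert_phrase_breaks(tagged_words):
    """tagged_words: list of (word, pos_tag). Returns words with [pau] markers."""
    output = []
    last = len(tagged_words) - 1
    for i, (word, tag) in enumerate(tagged_words):
        output.append(word)
        # No pause marker after the final word of the sentence.
        if tag in BREAK_AFTER_TAGS and i < last:
            output.append("[pau]")
    return output

sentence = [("man'triji", "NOUN"), ("ne", "POSTP"), ("kahaa", "VERB"),
            ("ki", "CONJ"), ("aaj", "ADV"), ("hamaare", "PRON"), ("desh", "NOUN")]
print(" ".join(insert_phrase_breaks(sentence)))
```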
Letter to Sound Rules
- Given a word, output the sequence of phones
- How? Pronunciation dictionary
  - Maps the spelling to a sequence of phones
  - ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/data/anonftp/project/fgdata/dict/cmudict.0.4
  - An entry looks like:
      SPEECH   S P IY1 CH
      [word]   [phones]
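Dictionary lookup for entries of the form shown above can be sketched like this. The function names are hypothetical, and the sketch assumes one entry per line with the word first and the phones separated by whitespace.

```python
def parse_dictionary(lines):
    """Build a word -> phone list mapping from 'WORD  PH1 PH2 ...' lines."""
    lexicon = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        lexicon[word] = phones
    return lexicon

def lookup(lexicon, word):
    """Return the phone sequence, or None if the word is not in the dictionary."""
    return lexicon.get(word.upper())

lexicon = parse_dictionary(["SPEECH  S P IY1 CH"])
print(lookup(lexicon, "speech"))
```

A `None` result marks the word as out of vocabulary, which is where letter to sound rules take over.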
What if there is no pronunciation Dictionary?
- If there is no pronunciation dictionary, use a set of simple rules
- Example: Indian languages
  - A direct correspondence between what is written and what is spoken
  - Hindi word: namaskaara -> /n/ /a/ /m/ /a/ /s/ /k/ /aa/ /r/ $
  - Note: the last /a/ -> $ (null)
  - /a/ is a short vowel, often referred to as schwa
  - The process of mapping /a/ -> $ is known as schwa deletion
  - Schwa deletion can be captured using a set of simple rules
    - Ex: when /a/ occurs at the end of a word, map it to $
- Letter to sound rules can be learnt using statistical models too: CART, HMM, neural networks
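The word-final rule from the namaskaara example can be sketched in Python. The phone inventory `PHONES` is a small hypothetical subset chosen to cover the example, and only the single rule stated above is implemented; real schwa deletion also needs word-internal rules.

```python
# Hypothetical phone inventory (romanized), just enough for the example.
PHONES = {"a", "aa", "i", "ii", "k", "kh", "m", "n", "r", "s"}

def letters_to_phones(word):
    """Greedy longest-match split of a romanized word into phones."""
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):  # try two-letter phones such as 'aa' first
            chunk = word[i:i + size]
            if chunk in PHONES:
                phones.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"no phone for {word[i]!r} in {word!r}")
    return phones

def delete_final_schwa(phones):
    """Map a word-final short /a/ to null; long /aa/ is left untouched."""
    if len(phones) > 1 and phones[-1] == "a":
        return phones[:-1]
    return list(phones)

print(delete_final_schwa(letters_to_phones("namaskaara")))
```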
Wave-Form Generation
- Formant synthesis
- Concatenative synthesis
  - Diphone synthesis
  - Unit selection synthesis
- Statistical parametric synthesis
Formant Synthesis
- Each phone is produced by specifying its formants and pitch
- A set of rules is also specified to modify pitch and formants, so that the transition from one phone to another is sufficiently smooth
- Knowledge base (manually built): rules to generate the transitions (coarticulation), formants, pitch
- Text -> Preprocessing, text normalization, linguistic analysis -> Phones -> Formant synthesizer -> Speech
Formant Synthesis contd..
- Formant synthesizers were deployed in the commercial market in the late 70s and early 80s (e.g. DECtalk)
- Pros
  - Flexible: the parameters can be changed
  - Generates intelligible speech with a small number of parameters
- Cons
  - Synthesized speech is not natural
  - Knowledge base has to be built manually (not an easy task)
  - A new language needs a brand new effort
Concatenative Speech Synthesis
- Don’t build a knowledge base; instead record a speech database and *select* the required phone
  - Needs a speech database and disk space to store it
  - Needs CPU time to select the segment
  - Practical from the engineering perspective
- Then why did people build formant synthesizers?
  - Motivation from speech science
  - Disk space and CPU time were much costlier in the 70s and 80s
A Typical Architecture of Concatenative Synthesis
- A unit can be a phone or a set of phones. If the set of phones corresponds to a word, then the unit is a word.
- Text -> Preprocessing, text normalization, linguistic analysis -> Phones -> Unit selection algorithm (uses a recorded speech database) -> Speech
Choice of Unit
- Word as a unit
  - A large number of units to store
  - Difficult to ensure coverage of all possible words (proper nouns etc.)
  - Useful for limited domain
- Phone as a unit
  - No coarticulation present!
- Diphone as a unit
  - Preserves the transition region between two phones, and thus coarticulation is present
  - Widely used unit for concatenation
What is a diphone?
[Figure: Phone 1, Phone 2]
- A diphone starts at the middle of the first phone and ends at the middle of the second phone
- Preserves the transition region between two phones
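Given this definition, a phone sequence maps to a sequence of overlapping phone pairs. A minimal sketch, assuming (as a common convention, not stated in the slides) that the word is padded with a silence symbol `pau` on both sides:

```python
def phones_to_diphones(phones, silence="pau"):
    """Return the diphone names needed to synthesize a phone sequence."""
    # Assumption: pad with silence so the first and last phones get full coverage.
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(phones_to_diphones(["a", "m", "m", "a"]))
```

A sequence of N phones therefore needs N + 1 diphones under this padding convention.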
How to build a Diphone Voice
- Record all possible phone-phone combinations in a language
  - Example: record ka, ku, ki, kii, … kk, ks, kj, …
  - Some combinations may not occur!
- From each phone-phone recording, manually label the diphone boundaries
  - Tools such as Emulabel display the waveform and allow you to label the boundaries
- Pool all the diphones to form a diphone database
A Typical Architecture of Diphone Synthesis
- Text -> Preprocessing, text normalization, linguistic analysis -> Phones -> Concatenation and prosodic manipulation -> Speech
- Resources used: prosodic rules; diphone units (speech database)
Pros & Cons of Diphone Synthesis
- Advantages over formant synthesis
  - Easy to adapt to a new language
  - Makes use of recorded speech as opposed to modeling the formants and their transitions
- Cons
  - Needs explicit modeling of prosody
  - Output: intelligible, but not natural
Diphone to Unit selection synthesis
- Formant to diphone
  - Avoids building a knowledge base
  - Makes use of recorded speech
- Diphone to unit selection
  - Avoids prosodic modeling
  - Speech database consists of multiple examples of each diphone
  - Record a diphone several times, but in different contexts
  - Store diphone units with varying prosody
  - Don’t model the prosody, but *select* a diphone with suitable prosody
A Typical Architecture of Unit-Selection Synthesis
- Text -> Preprocessing, text normalization, linguistic analysis -> Phones -> Unit selection -> Speech
- Resources used: n diphone units (large speech database)
Building a Unit Selection Voice
- Take newspaper text, say about 2000 sentences
  - A more careful approach is to make sure that these 2000 sentences have good coverage of all possible diphones
- Speak the sentences one by one, thus creating 1-2 hours of speech
- Recording should be done in a quiet environment
- Speech can be recorded using your desktop
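One way to pick sentences with good diphone coverage is a greedy selection: repeatedly take the sentence that adds the most diphones not yet covered. The slides do not name a method, so this is an assumed approach, and the function names are hypothetical.

```python
def diphones_of(phones):
    """Set of adjacent phone pairs in one sentence."""
    return set(zip(phones, phones[1:]))

def select_sentences(candidates, max_sentences):
    """candidates: dict of sentence id -> phone list.

    Returns (chosen ids in selection order, set of covered diphones).
    """
    remaining = dict(candidates)
    covered, chosen = set(), []
    while remaining and len(chosen) < max_sentences:
        # Pick the sentence contributing the most new diphones.
        best = max(remaining,
                   key=lambda sid: len(diphones_of(remaining[sid]) - covered))
        gain = diphones_of(remaining[best]) - covered
        if not gain:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered

corpus = {
    "s1": ["k", "a", "m", "a"],
    "s2": ["k", "a"],
    "s3": ["m", "i"],
}
print(select_sentences(corpus, max_sentences=2000))
```

In the toy corpus, `s2` is never chosen because its only diphone is already covered by `s1`.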
Build Process: A higher level view
- Goal: automatically *extract* the diphones from the speech database and *index* them
- Automatic extraction
  - Label the phone boundaries in each spoken sentence. This task is referred to as speech segmentation.
  - Speech segmentation is performed using HMMs (neural networks could also be used)
  - Given the phone boundaries, approximate the diphone boundaries; thus diphone-like units are obtained
- Indexing
  - For each diphone-like unit, store context information
  - Context information: previous phone, next phone, position in the syllable, position in the word, etc.
  - There could be thousands of units for each type
  - For each unit type, build a decision tree to split the thousands of units into several sub-clusters
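The step from phone boundaries to diphone-like units can be sketched as below. It assumes the segmenter output is a list of `(phone, start_ms, end_ms)` tuples and approximates each diphone boundary by the phone midpoint; both the data layout and the midpoint choice are assumptions for illustration.

```python
def diphone_like_units(segments):
    """segments: list of (phone, start_ms, end_ms) for one utterance.

    Returns (name, start_ms, end_ms, context) tuples, where each unit runs
    from the middle of one phone to the middle of the next.
    """
    units = []
    for i in range(len(segments) - 1):
        left, l_start, l_end = segments[i]
        right, r_start, r_end = segments[i + 1]
        # Context stored for indexing: neighbouring phones, or None at the edges.
        context = {
            "previous_phone": segments[i - 1][0] if i > 0 else None,
            "next_phone": segments[i + 2][0] if i + 2 < len(segments) else None,
        }
        units.append((f"{left}-{right}",
                      (l_start + l_end) / 2,
                      (r_start + r_end) / 2,
                      context))
    return units

labels = [("k", 0, 100), ("a", 100, 300), ("m", 300, 380)]
for unit in diphone_like_units(labels):
    print(unit)
```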
During Synthesis..
- Given a sequence of phones
- For each phone, traverse the corresponding decision tree and arrive at a set of target units
- Select a unit based on how well it matches the input specification and how well it matches the other units in the sequence
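The two selection criteria are commonly expressed as a target cost (match to the input specification) and a join cost (match to the neighbouring unit), minimized together with dynamic programming. The sketch below assumes that formulation; the function signature and the cost callbacks are hypothetical, and every position is assumed to have at least one candidate.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position, minimizing target cost plus join cost.

    candidates: list (one entry per position) of non-empty candidate lists.
    target_cost(position, unit) and join_cost(prev_unit, unit) return numbers.
    """
    # best[i][j] = (cheapest total cost ending in candidate j, backpointer)
    best = [[(target_cost(0, unit), None) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1]))
            row.append((cost + target_cost(i, unit), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

As a usage example, units could be `(name, f0)` pairs with `target_cost` measuring the distance to a desired F0 and `join_cost` measuring the F0 jump between neighbours; real systems combine several such features.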
Pros and Cons of Unit Selection
- For the best examples, quality is high!
- Quality varies from high to often bad, due to bad selection of units (or missing units)
- Strongly resembles the style of the recorded speech
- Hard to modify the characteristics for varying style, emotion, etc.
Statistical Parametric Synthesis (SPS)
- Speech is synthesized from parameters
- Parametric models are trained from speech data
  - Vs. older non-statistical techniques such as DECtalk, whose parameters were constructed by hand
- In Blizzard Challenges, the quality of SPS-based voices is rated higher by native listeners
  - Consistency in the quality of the voice
- Has reached a mature level where the quality is quite acceptable
- Hidden Markov Model based (HTS), decision tree based (CLUSTERGEN), etc.
Basic Technique
- Speech parameter generation from HMMs with the use of dynamic (delta) features
- Speech synthesis from mel-cepstrum
  - A vocoding technique based on mel-cepstrum
  - F0 used for excitation generation
- F0 pattern modeling using HMMs
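Dynamic (delta) features describe how a static parameter changes over time. A minimal sketch for one parameter track, assuming the window 0.5 * (c[t+1] - c[t-1]) and edge frames repeated at the boundaries; the window choice is an assumption here, not taken from the slides.

```python
def delta_features(track):
    """Delta of a 1-D parameter track: 0.5 * (c[t+1] - c[t-1])."""
    if not track:
        return []
    # Repeat the first and last frames so every frame has two neighbours.
    padded = [track[0]] + list(track) + [track[-1]]
    return [0.5 * (padded[t + 2] - padded[t]) for t in range(len(track))]

print(delta_features([1.0, 2.0, 4.0]))
```

During training, static and delta values are modeled together; at synthesis time the delta constraints are what let the generated parameter trajectory vary smoothly instead of jumping between state means.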
HMM Based Speech Synthesis
[Figure: HMM based speech synthesis system]
Comparison of diphone, unit and HTS voices
Audio samples: diphone, unit selection, HTS
Open Source Tools
- Festival: multi-lingual speech synthesis engine
- Festvox: a set of tools to create a new voice in a new language
Further Directions
- Reading style
  - Commonly used mode for many applications
- Emphasis
- Story-telling
- Emotional
  - Neutral, sad, happy and angry moods
- Stylistic
  - Specific to a speaker
References
- CMU course slides
- CMU course lecture notes
- Building Synthetic Voices
- The Festival Speech Synthesis System
- Black, A. (2006), "CLUSTERGEN: A Statistical Parametric Synthesizer using Trajectory Modeling," Interspeech ICSLP, Pittsburgh, PA.
- Black, A., Zen, H., and Tokuda, K. (2007), "Statistical Parametric Synthesis," ICASSP 2007, Hawaii.
- Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., and Kitamura, T. (2000), "Speech parameter generation algorithms for HMM-based speech synthesis," ICASSP 2000, Istanbul, Turkey.