5 - Text To Speech (TTS)
Speech Synthesis
- Speech Synthesis Concept
- Phone Units
- Phone Sequence To Speech
- Speech Naturalness
- Concatenative Approaches
- Rule-Based Approaches
Speech Synthesis Concept
Text -> Speech
Text -> [Text to Phone Sequence] -> [Phone Sequence to Speech] -> Speech
- Text to Phone Sequence: Natural Language Processing (NLP)
- Phone Sequence to Speech: Speech Processing
Phone Units
- Paragraph ( )
- Sentence ( )
- Word (depends on the language; usually more than 100,000)
- Syllable
- Diphone & Triphone
- Phoneme (between 10 and 100)
Phone Units (Cont’d)
- Diphone: models the transition between two adjacent phonemes.
[Figure: phoneme sequence p1 p2 p3 p4 p5, with the phoneme and diphone spans marked]
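The idea that a diphone covers the transition between two neighbouring phonemes can be sketched as a simple pairing of adjacent symbols. This is a minimal illustration only: the function name `to_diphones` and the `a-b` label format are assumptions made for the example, not part of the slides.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor to name the transition units.

    Hypothetical helper: labels each diphone as "left-right".
    """
    phonemes = list(phonemes)
    # zip stops at the shorter input, so N phonemes give N - 1 diphones.
    return [f"{left}-{right}" for left, right in zip(phonemes, phonemes[1:])]


if __name__ == "__main__":
    print(to_diphones(["p1", "p2", "p3", "p4", "p5"]))
```

A sequence of five phonemes yields four diphones, one per transition.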
Phone Units (Cont’d)
- Farsi has 30 phonemes, so theoretically there are 30 × 30 = 900 diphones.
- In practice, the only diphone that does not occur in Farsi is /zho/.
- Theoretically there are 30 × 30 × 30 = 27,000 triphones, but in practice only about 15,000 occur in Farsi.
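The inventory sizes above follow from counting ordered sequences of phonemes. A short check of the arithmetic, using the figures quoted on the slide (30 phonemes, about 15,000 attested triphones); the function name `theoretical_units` is a hypothetical helper for this example:

```python
def theoretical_units(n_phonemes, context_length):
    """Number of ordered phoneme sequences of the given length."""
    return n_phonemes ** context_length


FARSI_PHONEMES = 30            # figure from the slide
ATTESTED_TRIPHONES = 15_000    # approximate figure from the slide

diphones = theoretical_units(FARSI_PHONEMES, 2)
triphones = theoretical_units(FARSI_PHONEMES, 3)
coverage = ATTESTED_TRIPHONES / triphones

if __name__ == "__main__":
    print(diphones, triphones, f"{coverage:.1%}")
```

So roughly 56% of the theoretical triphones are reported to occur in practice.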
Phone Units (Cont’d)
- Syllable = Onset (consonant) + Rhyme
- A syllable is a set of phonemes that contains exactly one vowel.
- Syllables in Farsi: CV, CVC, CVCC
- Farsi has about 4,000 syllables.
- Syllables in English: V, CV, CVC, CCVC, CCVCC, CCCVC, CCCVCC, ...
- The number of syllables in English is very large.
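Because every Farsi syllable template listed above (CV, CVC, CVCC) starts with exactly one consonant followed by the vowel, a consonant/vowel pattern can be split by starting a new syllable one position before each vowel. The sketch below assumes the input is already a string of `C` and `V` symbols; the function name `syllabify_cv` is hypothetical.

```python
FARSI_TEMPLATES = {"CV", "CVC", "CVCC"}  # templates listed on the slide


def syllabify_cv(pattern):
    """Split a C/V pattern into Farsi syllable templates.

    Each syllable begins one symbol before its vowel, since every
    template has a single-consonant onset.
    """
    if not pattern:
        return []
    if set(pattern) - {"C", "V"}:
        raise ValueError("pattern may contain only 'C' and 'V'")
    starts = [i - 1 for i, symbol in enumerate(pattern) if symbol == "V"]
    if not starts or starts[0] != 0:
        raise ValueError("pattern must begin with a consonant and a vowel")
    ends = starts[1:] + [len(pattern)]
    syllables = [pattern[s:e] for s, e in zip(starts, ends)]
    for syllable in syllables:
        if syllable not in FARSI_TEMPLATES:
            raise ValueError(f"{syllable!r} is not a Farsi syllable template")
    return syllables


if __name__ == "__main__":
    print(syllabify_cv("CVCCVC"))
    print(syllabify_cv("CVCVCC"))
```

The same approach would not work for English, where onsets such as CC and CCC make the boundary ambiguous.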
Phone Sequence To Speech
- Concatenative approaches: a trade-off among naturalness, memory usage, and the variety of desired functions.
- Rule-based approaches: the most important rule-based approach is the Klatt method.
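Phone Sequence To Speech (Cont’d)
Text -> [Text to Phone Sequence] -> [Phone Sequence to Primitive Utterance] -> [Primitive Utterance to Natural Speech] -> Speech
- NLP: Text to Phone Sequence
- Speech Processing: Phone Sequence to Primitive Utterance, Primitive Utterance to Natural Speech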
Speech Naturalness
- Removal of undesirable noise, distortion, and discontinuities from the speech
- Prosody generation
  - Speech energy
  - Duration
  - Pitch
  - Intonation
  - Stress
Speech Naturalness (Cont’d)
- Intonation and stress strongly affect speech naturalness.
- Intonation: the variation of pitch frequency over the course of an utterance.
- Stress: a rise in pitch frequency at a specific point in time.
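The two definitions above can be pictured as a pitch (F0) contour: a slowly varying baseline for intonation plus a local rise for stress. The sketch below is only an illustration under stated assumptions: the linear declination, the Gaussian-shaped stress bump, and all numeric defaults are hypothetical choices, not values from the slides.

```python
import math


def pitch_contour(duration_s, frame_s=0.01, f0_start=130.0, f0_end=100.0,
                  stress_time=None, stress_boost=25.0, stress_width=0.05):
    """Return a list of (time, F0 in Hz) pairs.

    Intonation: linear fall from f0_start to f0_end (assumed shape).
    Stress: Gaussian-shaped pitch rise centred on stress_time (assumed shape).
    """
    n_frames = int(round(duration_s / frame_s)) + 1
    contour = []
    for i in range(n_frames):
        t = i * frame_s
        f0 = f0_start + (f0_end - f0_start) * (t / duration_s)
        if stress_time is not None:
            f0 += stress_boost * math.exp(-0.5 * ((t - stress_time) / stress_width) ** 2)
        contour.append((t, f0))
    return contour


if __name__ == "__main__":
    contour = pitch_contour(1.0, stress_time=0.5)
    peak_time, peak_f0 = max(contour, key=lambda point: point[1])
    print(f"pitch peak of {peak_f0:.1f} Hz at t = {peak_time:.2f} s")
```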
Concatenative Approaches
- In these approaches, units of natural speech are stored and used to reconstruct the desired speech.
- The appropriate phone unit can be selected for speech synthesis.
- Compressed parameters can be stored instead of the original waveform.
Concatenative Approaches (Cont’d)
Benefits of storing compressed parameters instead of the original waveform:
- Less memory use
- A general representation rather than one specific stored utterance
- Easier prosody generation
Concatenative Approaches (Cont’d)
Phone Unit (largest to smallest): Paragraph, Sentence, Word, Syllable, Diphone, Phoneme
Type of Storing (from larger to smaller units): Main Waveform, Coded/Main Waveform, Coded Waveform
Concatenative Approaches (Cont’d)
- Pitch Synchronous Overlap-Add (PSOLA) is a well-known method for smoothing phoneme transitions.
- Overlap-add is a standard DSP method.
- PSOLA is a basic operation for voice conversion.
- In the analysis stage, frames are selected synchronously with the pitch markers.
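The analysis and overlap-add steps described above can be sketched as follows. This is a simplified illustration, not a full PSOLA implementation: it assumes a constant pitch period in samples, pitch marks given as integer sample indices, and two-period Hann-windowed frames. The function names are hypothetical.

```python
import math


def hann_window(length):
    """Periodic Hann window; copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def extract_frames(signal, pitch_marks, period):
    """Analysis: cut two-period windowed frames centred on each pitch mark."""
    window = hann_window(2 * period)
    frames = []
    for mark in pitch_marks:
        start, end = mark - period, mark + period
        if start < 0 or end > len(signal):
            continue  # skip marks too close to the signal edges
        frames.append([signal[start + n] * window[n] for n in range(2 * period)])
    return frames


def overlap_add(frames, hop):
    """Synthesis: place frames hop samples apart and sum the overlaps."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[k * hop + n] += value
    return out


if __name__ == "__main__":
    period = 80
    signal = [math.sin(2 * math.pi * n / period) for n in range(800)]
    marks = list(range(period, 721, period))
    frames = extract_frames(signal, marks, period)
    same_pitch = overlap_add(frames, hop=period)    # reconstructs the input
    higher_pitch = overlap_add(frames, hop=64)      # frames closer together
    print(len(frames), len(same_pitch), len(higher_pitch))
```

With the hop equal to the original period, the overlapping windows sum to one and the signal is reconstructed between the first and last marks. A smaller hop raises the pitch but also shortens the output; a complete PSOLA system would repeat or drop frames to keep the duration unchanged.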
Rule-Based Approach Stages
1. Determine the speech model and the model parameters.
2. Determine the type of phone units.
3. Determine the parameter values for each phone unit.
4. Replace the sequence of phone units with its equivalent parameter sequence.
5. Feed the parameter sequence into the speech model.
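Stages 3 and 4 above amount to a table lookup followed by a substitution. A minimal sketch, assuming the model parameters are the first three formant frequencies and that each phone is held for a fixed number of frames; the table values are rough illustrative vowel formants in Hz, and the names `FORMANT_TARGETS` and `to_parameter_sequence` are hypothetical.

```python
# Stage 3: parameter values per phone unit (illustrative values only).
FORMANT_TARGETS = {
    "a": (730, 1090, 2440),
    "i": (270, 2290, 3010),
    "u": (300, 870, 2240),
}


def to_parameter_sequence(phones, frames_per_phone=3, table=FORMANT_TARGETS):
    """Stage 4: replace each phone with its parameter frames."""
    sequence = []
    for phone in phones:
        if phone not in table:
            raise KeyError(f"no parameters stored for phone {phone!r}")
        sequence.extend([table[phone]] * frames_per_phone)
    return sequence


if __name__ == "__main__":
    for frame in to_parameter_sequence(["a", "i"], frames_per_phone=2):
        print(frame)
```

The resulting frame sequence is what stage 5 would pass to the speech model. A real rule-based system would also smooth the parameters across phone boundaries rather than holding them constant.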
KLATT 80 Model
KLATT 88 Model
The KLSYN88 Cascade/Parallel Formant Synthesizer
[Figure: block diagram of the KLSYN88 synthesizer. Recoverable blocks and parameter labels:]
- Laryngeal sound sources (F0, AV, OQ, SQ, FL, DI; source switch SS): filtered impulse train, KLGLOTT88 model (default), modified LF model, spectral tilt low-pass resonator (TL), aspiration noise generator (AH)
- Cascade vocal tract model, laryngeal sound sources (switch CP): nasal pole-zero pair (FNP, BNP, FNZ, BNZ), tracheal pole-zero pair (FTP, BTP, FTZ, BTZ), first to fifth formant resonators (F1 B1 with DF1 DB1, F2 B2, F3 B3, F4 B4, F5 B5)
- Parallel vocal tract model, laryngeal sound sources (normally not used): nasal formant resonator (ANV), first to fourth formant resonators (A1V, A2V, A3V, A4V), tracheal formant resonator (ATV), first-difference preemphasis
- Parallel vocal tract model, frication sound sources: frication noise generator (AF), second to sixth formant resonators (A2F B2F, A3F B3F, A4F B4F, A5F B5F, A6F B6F F6), bypass path (AB)
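The formant resonators in the diagram are second-order digital filters, and the cascade branch chains them in series. The sketch below uses the standard two-pole resonator form y[n] = A x[n] + B y[n-1] + C y[n-2] with unity gain at 0 Hz. The formant frequencies, bandwidths, sample rate, and the plain impulse-train source are illustrative assumptions, not settings taken from the slides.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate):
    """Filter samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Simplified voicing source: one unit impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(int(duration_s * sample_rate))]


def cascade(source, formants, sample_rate):
    """Pass the source through each (frequency, bandwidth) resonator in series."""
    signal = source
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


if __name__ == "__main__":
    sample_rate = 10_000
    # Hypothetical formant settings (Hz) for a neutral vowel.
    formants = [(500, 60), (1500, 90), (2500, 150), (3500, 200), (4500, 200)]
    speech = cascade(impulse_train(100, 0.5, sample_rate), formants, sample_rate)
    print(len(speech), max(abs(v) for v in speech))
```

One attraction of the cascade arrangement is that the relative formant amplitudes follow from the frequencies and bandwidths alone, whereas the parallel branch needs an explicit amplitude control per formant (the A1V, A2F, ... labels in the diagram).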
Three Voicing Source Models in KLATT 88
- The old KLSYN impulsive source
- The KLGLOTT88 model
- The modified LF model