1
HMM training strategy for incremental speech synthesis.
Pouget M., Hueber T., Bailly G., Baumann T. Good morning everybody, and thank you for coming to this presentation. My talk is about incremental speech synthesis, and more precisely, HMM training strategies for incremental speech synthesis.
2
Outline
Introduction to incremental speech synthesis
State of the art
Strategies for training HMMs: baseline ('Default' strategy) and proposed ('Joker' strategy)
Experiments: material and methods, results
Conclusion and perspectives

The presentation follows this outline: first, a short introduction to what incremental speech synthesis is and the question we address in this paper. After a brief review of the existing systems for synthesizing speech incrementally, I will present what we consider to be the baseline technique and the technique we propose in the paper. Finally, I will briefly explain the experimental setup we used to assess both strategies, objectively and subjectively. Pouget et al. – Incremental Speech Synthesis – Interspeech 2015
3
Incremental Text-To-Speech (iTTS) systems
Text-to-speech (TTS) systems are classically designed to process whole sentences at once. Incremental text-to-speech (iTTS) systems aim at delivering the speech output 'word after word', i.e. while the user is still typing.
[Figure: timeline comparing conventional TTS (speech output only after each full sentence Sn, Sn+1, Sn+2 has been typed) with incremental TTS (output triggered word after word, Wn, Wn+1).]
I will now briefly contrast conventional text-to-speech systems with incremental ones. Conventionally, TTS systems output a whole sentence at once, after it has been completely written down: in order to infer the most natural prosody, they need to process the entire sentence. We call incremental text-to-speech systems those systems that are able to deliver the speech signal while the user is still typing the end of the sentence. The figure illustrates the idea behind incremental TTS: the upper part shows the synthesis of three sentences, where the synthesized speech only starts once the whole sentence has been input. With an incremental device, by contrast, the waiting time between the start of the text input and the synthesis is shorter, since the signal is output word after word. One could assume that a classical TTS device can simply be run incrementally; however, such systems work with full sentences, and we do not want the speech signal to sound like this. AUDIO EXAMPLE.
Meeting notes (04/09/15 14:44): rather than saying "to process whole sentences at once", say "in classical TTS, speech is generated sentence after sentence", with an example: "my name is Mael", press the button, "my name is Mael". "We aim at having the synthetic speech signal follow the text input." Something that appears in red is missing: "iTTS, unlike TTS, is triggered word after word". AUDIO EXAMPLE: each word is interpreted as a whole sentence. Pouget et al. – Incremental Speech Synthesis – Interspeech 2015
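To make the triggering difference concrete, here is a minimal Python sketch. The `synthesize()` function is a hypothetical stand-in for the whole HMM synthesis back-end; this is not the authors' implementation, only an illustration of when synthesis is triggered.

```python
from typing import Iterable, List

def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for the HMM synthesis back-end (audio buffer)."""
    return text.encode()

def conventional_tts(sentence: str) -> List[bytes]:
    # Classical TTS: speech is generated sentence after sentence;
    # nothing is played before the full sentence has been typed.
    return [synthesize(sentence)]

def incremental_tts(words: Iterable[str]) -> List[bytes]:
    # iTTS: synthesis is triggered word after word,
    # while the user is still typing the end of the sentence.
    return [synthesize(w) for w in words]

print(conventional_tts("my name is Mael"))           # one chunk, after the whole sentence
print(incremental_tts("my name is Mael".split()))    # one chunk per word
```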
4
HMM-based speech synthesis
Estimation of spectral and prosodic trajectories → waveform generation (vocoder). Natural language processing: current phone + linguistic contextual features, split into left context (past) and right context (future).
I am now going to tackle the main issue with incrementality in speech synthesis. The figure shows the general principle of HMM-based speech synthesis: an NLP module performs the morphological and syntactic analysis of the text input and computes several symbolic linguistic features for the phonemes to synthesize. Those features (phoneme to be synthesized, preceding and following phonemes, position of the word in the sentence, number of syllables in the current word, etc.) are used to estimate the prosodic and spectral parameter trajectories, which the vocoder turns into speech. The spectrum only needs short-range features such as the preceding and following phonemes, whereas prosody inference requires longer-range features such as the number of remaining words or the number of syllables in the word. We therefore distinguish two kinds of features: those related to the past (e.g. number of words since the beginning of the sentence), which are known without ambiguity, and those related to the future (e.g. number of remaining words), which require information about the end of the sentence. The latter are the problem when synthesizing incrementally.
Meeting notes (04/09/15 14:44): Right context: NOT AVAILABLE IN iTTS. Get across that the problem concerns prosody. 1) describe the overview; 2) decompose into left + right context; 3) in incremental mode, only the left context is available; 4) state of the art → the problem is prosody!
HMM-based speech synthesis:
Spectrum: short-range features such as the quinphone (Le Maguer, 2013) – not a major issue in iTTS.
Prosody estimation: relies on long-range symbolic linguistic features (number of words in the sentence, part-of-speech (POS) of the next word, etc.)
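As an illustration of this left/right split, here is a sketch with hypothetical feature names; these are examples chosen for readability, not the exact COMPOST/HTS feature inventory used by the authors.

```python
# Hypothetical feature names illustrating the left/right context split.
LEFT_CONTEXT = {
    "prev_phonemes",        # identity of the n-1, n-2 phonemes
    "words_since_start",    # forward position of the word in the sentence
    "pos_prev_word",        # POS tag of the previous word
}
RIGHT_CONTEXT = {
    "next_phonemes",        # identity of the n+1, n+2 phonemes
    "remaining_words",      # backward position of the word in the sentence
    "syllables_next_word",  # number of syllables in the next word
}

def available_in_itts(feature: str) -> bool:
    # In iTTS, only features related to the past are known without ambiguity.
    return feature in LEFT_CONTEXT
```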
5
iTTS: state of the art
Incremental waveform generation assuming complete contextual features (i.e. information about both past/left and future/right context):
First proof of concept formulated by (Edlund et al., 2008).
(Astrinaki et al., 2012): reactive implementation of the MLPG algorithm (pHTS/MAGE platform).
Incremental speech synthesis from incomplete contextual features:
(d'Alessandro, …, Dall, … et al., eNTERFACE 2013): preliminary version of an iTTS system based on the MAGE platform.
(Baumann, ICASSP 2014) – InproTK toolkit: a first strategy for dealing with incomplete contextual features in iTTS.
Let us briefly review the existing work. The concept of incremental speech synthesis was initially tackled by Edlund and colleagues. Astrinaki and colleagues also worked on reactive text-to-speech synthesis, but on the incremental generation of sentences that were already fully known. Dall and colleagues implemented a preliminary version of an incremental speech synthesizer based on MAGE. Baumann and colleagues implemented a complete system for incremental dialogue, InproTK, which included an incremental speech synthesis unit working from incomplete contextual features; they proposed a strategy for dealing with those missing features.
6
Goal of this study
Evaluate for French the baseline approach proposed by Baumann et al. (2014) for German and English: the 'Default' strategy.
Investigate a new approach which explicitly deals with potentially missing features when training the HTS voice: the 'Joker' strategy.
The goal of our study is twofold: first, we implemented and evaluated for French the strategy Baumann proposed for German and English, which we call the Default strategy. Second, we propose a new approach for dealing with missing features when training the voice: the Joker strategy.
7
Baseline technique (Baumann, 2014)
Principle: contextual features which cannot be calculated when processing the text incrementally are substituted with 'default' values at synthesis time. Default values are defined as the mean value observed in the training set.
Implementation: standard HTS voice training procedure on full-context labels.
Voice training → unseen label → substitution of missing values with default values → synthesis.
PROS: the HTS voice is trained using full-context labels.
CONS: only a limited number of models are used at synthesis time.
In the following section I will explain how these two strategies work. The idea of the strategy used by Baumann is to use models that have been trained non-incrementally, i.e. full-context models. He analyzed the training corpus and created a set of default values, one per feature: either the mean of the observed values, or the most common value (for phonemes, for example). During synthesis, when encountering an unknown value, he replaces it with the default value in order to reconstruct full-context labels. We call this the Default strategy, since it relies on default values. Its advantage is that the HTS voice is trained using all available information. However, since there is only one default value per feature, some models may never be used; moreover, the training and synthesis stages do not receive the same input, which makes them inconsistent.
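A minimal sketch of this Default strategy, assuming each label is represented as a dict mapping feature names to values, with None marking a feature that cannot be computed incrementally; this is an illustration, not the authors' or Baumann's actual code.

```python
from collections import Counter
from statistics import mean

def learn_defaults(training_labels):
    """One default per feature: the mean for numeric features,
    the most frequent value for symbolic ones (e.g. phonemes)."""
    per_feature = {}
    for label in training_labels:
        for feat, val in label.items():
            per_feature.setdefault(feat, []).append(val)
    defaults = {}
    for feat, values in per_feature.items():
        if all(isinstance(v, (int, float)) for v in values):
            defaults[feat] = mean(values)
        else:
            defaults[feat] = Counter(values).most_common(1)[0][0]
    return defaults

def complete_label(partial_label, defaults):
    """At synthesis time, missing right-context values are replaced with
    their defaults so that a full-context model can be selected."""
    return {feat: (val if val is not None else defaults[feat])
            for feat, val in partial_label.items()}
```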
8
Proposed strategy: ‘Joker’
Principle: re-training the HTS voice with a possible 'unknown' value for each feature related to the right context.
Objective: consistency between voice training and synthesis.
Implementation: two slight adaptations of the original HTS voice training procedure.
Adaptation 1 – text labeling: each segmental feature which cannot be estimated when processing text incrementally is tagged with the same joker (#) value.
We propose another strategy for dealing with missing features. It consists in training models on labels that are similar to the labels available at synthesis time. Its implementation is rather simple: since we aim at being able to synthesize at the end of every word, we train the models with no information beyond the current word. In this scenario, the labels to be synthesized at synthesis time are consistent with the training.
Meeting notes (04/09/15 14:59): contrary to Baumann, we retrain the voice, with two cases for each feature: either it is known (its value is used) or it is not (the joker is used). Joker tagging.
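A sketch of the joker tagging of Adaptation 1, applied identically at training and synthesis time so the two stages stay consistent; the feature names are the same illustrative ones as above, not the exact label fields.

```python
JOKER = "#"

# Illustrative set of features that depend on the right context.
RIGHT_CONTEXT_FEATURES = {
    "next_phonemes", "remaining_words", "syllables_next_word", "pos_next_word",
}

def joker_tag(label: dict) -> dict:
    """Tag every right-context feature with the same joker value so that
    training labels match what is available at synthesis time."""
    return {feat: (JOKER if feat in RIGHT_CONTEXT_FEATURES else val)
            for feat, val in label.items()}
```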
9
Proposed strategy: ‘Joker’
Adaptation 2 – state tying procedure: clustering models which share the same uncertainty about one specific feature, by adding new questions to the clustering tree related to the joker value, e.g.:
QS "is the number of phonemes in the next syllable unknown in iTTS?" {*/C:#} → Yes / No
This clusters states/models which share the same 'uncertainty' on a specific contextual feature. Note that this is NOT equivalent to a voice trained with left context only.
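A sketch of how such joker questions could be generated for the HTS question set, one per right-context feature. The label-field patterns ("*/C:…", "*/G:…") follow the usual HTS full-context label conventions but are illustrative here, not the exact fields of the authors' label format.

```python
# Illustrative mapping from a feature description to its label-field pattern.
RIGHT_CONTEXT_FIELDS = {
    "number of phonemes in the next syllable": "*/C:{}",
    "POS of the next word": "*/G:{}",
}

def joker_questions(joker: str = "#") -> list:
    """One extra clustering-tree question per right-context feature,
    testing whether that feature carries the joker value."""
    return [
        f'QS "is the {name} unknown in iTTS ?" {{{pattern.format(joker)}}}'
        for name, pattern in RIGHT_CONTEXT_FIELDS.items()
    ]

for q in joker_questions():
    print(q)
# e.g. QS "is the number of phonemes in the next syllable unknown in iTTS ?" {*/C:#}
```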
10
Summary
Voice training — Default: full-context; Joker: left-context + 'Joker' value for contextual features related to the right context; Non-incremental: full-context.
Synthesis — Default: reconstructed full-context model (using default values); Joker: left-context + 'Joker' value (same as training); Non-incremental: full-context.
Consistency between training and synthesis — Default: no; Joker: yes; Non-incremental: yes.
This table sums up the characteristics of the different training strategies. The Joker strategy, like non-incremental modeling, keeps the labels used for training and synthesis consistent. The Default strategy, on the other hand, is trained with full-context labels, and at synthesis time the labels are reconstructed so that full-context models can be used.
11
Experiments
Goal: comparing the Default and Joker strategies, both in terms of spectrum and prosody estimation, using objective measurements and a perceptive test.
Material and methods: GIPSA-lab HMM-based TTS system for French, based on:
3 h 17 min of speech extracted from the audiobook "Le tour du monde en 80 jours" (Jules Verne).
NLP module: COMPOST (Bailly et al., 1990).
Vocoder: harmonic plus noise modeling (HNM) of the full-band spectral envelope (similarly to Hueber, 2009); 13/17 LSF coefficients for modeling the harmonic/noise parts + Δ + ΔΔ; log(f0) + Δ + ΔΔ.
Context-dependent HMMs (5 emitting states) trained using an adapted version of the HTS 2.2 training script (without GV and post-filtering).
The next part is dedicated to the assessment of the strategies just described. The goal is to compare the Joker and Default strategies against the non-incremental scenario as a reference, both objectively and subjectively with a perceptive listening test. We developed an HMM-based TTS system for French using a 3 h 17 min corpus from a French audiobook, analyzed by an NLP module also developed at GIPSA-lab. The vocoder uses harmonic plus noise modeling of the full-band spectral envelope. We used a modified version of the HTS 2.2 voice training script, without GV and post-filtering.
12
Audio examples
Non-incremental TTS
iTTS with the « Default » strategy (baseline)
iTTS with the « Joker » strategy (proposed method)
13
Objective evaluation
Similarly to Baumann et al. (2014), the acoustic features estimated with the non-incremental approach (i.e. when considering both left and right context) are taken as the reference, i.e. the best possible result.
Objective measurements:
Spectrum – mel-cepstral distortion:
$$\mathrm{MCD}(\mathbf{y}_t^{S}, \mathbf{y}_t^{NI}) = \frac{1}{T} \sum_{t=1}^{T} \frac{10}{\ln(10)} \sqrt{2 \sum_{d=0}^{D} \left(y_{t,d}^{S} - y_{t,d}^{NI}\right)^2}$$
where $\mathbf{y}_t^{S}, \mathbf{y}_t^{NI}$ are vectors of $D+1$ mel-cepstral coefficients, $S \in \{\text{joker}, \text{default}\}$, $NI$ = non-incremental, and $T$ is the number of frames in the utterance.
Prosody (pitch), in cents:
$$E_{f_0} = \frac{1}{T} \sum_{t=1}^{T} \left| \log_2\!\left( \frac{f_0^{S}(t)}{f_0^{NI}(t)} \right) \right|$$
Prosody (duration):
$$E_{dur} = \frac{1}{P} \sum_{p=1}^{P} \left| \log\!\left( \frac{d_p^{NI}}{d_p^{S}} \right) \right|$$
where $P$ is the number of phonemes in the utterance.
Statistical significance between the Joker and Default strategies was assessed using paired t-tests.
The objective error is assessed by comparing the speech signal generated with each strategy to the non-incremental synthesis. We used the mel-cepstral distortion for the spectrum; pitch was evaluated with a perceptually based frequency error, in cents, and duration estimation was assessed with a log-duration ratio.
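A sketch of how these three measures could be computed with NumPy, assuming mel-cepstral frames of shape (T, D+1), frame-synchronous f0 tracks in Hz, and per-phoneme durations. The 1200·log2 conversion from octaves to cents and the use of absolute errors are our assumptions, not taken verbatim from the paper.

```python
import numpy as np

def mcd(y_s: np.ndarray, y_ni: np.ndarray) -> float:
    """Mel-cepstral distortion (dB) between strategy S and the
    non-incremental reference; inputs have shape (T, D+1)."""
    diff = y_s - y_ni
    return float(np.mean(10.0 / np.log(10.0)
                         * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def e_f0(f0_s: np.ndarray, f0_ni: np.ndarray) -> float:
    """Pitch error in cents (assumption: 1 octave = 1200 cents)."""
    return float(np.mean(np.abs(1200.0 * np.log2(f0_s / f0_ni))))

def e_dur(d_s: np.ndarray, d_ni: np.ndarray) -> float:
    """Log-duration ratio over the P phonemes of the utterance."""
    return float(np.mean(np.abs(np.log(d_ni / d_s))))
```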
14
Objective evaluation: results
Metric          Default         Joker           Joker vs. Default
MCD (dB)        0.78 ± 0.26     0.94 ± 0.15     ***
E_f0 (cents)    197.4 ± 88.7    178.2 ± 78.4    NS
E_dur           0.20 ± 0.06     0.17 ± 0.04

This table summarizes the objective results. For the mel-cepstral distortion, the Default strategy outperforms the Joker strategy. In terms of prosody, however, the duration estimation is better with the Joker strategy, and no significant difference was found for pitch. Even though the objective results do not show a clear difference, the stimuli sounded different enough for us to set up a perceptual test.
Spectrum: Default > Joker (by 0.16 dB). Possible explanation: the error is minimal when the 'right' full-context label happens to be selected.
Prosody (pitch and duration): the 'objective' difference between Default and Joker is either not significant (pitch) or very small (duration), despite the fact that the two strategies sound quite different.
15
Subjective evaluation
Ranking test: for each sentence, the listener was asked to rate 3 stimuli (Default, Joker, non-incremental) and place them on a continuous scale (MOS 1–5).
18 participants (in the same anechoic room).
Statistical analysis: both parametric (ANOVA) and non-parametric (Kruskal-Wallis) tests.
The perceptual test was carried out with 18 subjects, each of whom evaluated 12 sentences synthesized with the Joker strategy, the Default strategy, and the non-incremental voice, placing each stimulus on a continuous perceptive scale. Here is an example of the stimuli the listeners had to rate.
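A sketch of the two statistical tests named above, using SciPy on hypothetical ratings; the data here are randomly generated placeholders, one value per (listener, sentence) pair and per condition, not the study's results. (The full analysis described in the appendix also adds a random listener effect, omitted here for brevity.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical MOS-style ratings: 18 listeners x 12 sentences per condition.
ratings_default = rng.normal(2.8, 0.6, size=18 * 12)
ratings_joker = rng.normal(3.6, 0.6, size=18 * 12)
ratings_non_inc = rng.normal(3.7, 0.6, size=18 * 12)

# Parametric: one-way ANOVA across the three conditions.
f_stat, p_anova = stats.f_oneway(ratings_default, ratings_joker, ratings_non_inc)

# Non-parametric counterpart: Kruskal-Wallis H-test.
h_stat, p_kw = stats.kruskal(ratings_default, ratings_joker, ratings_non_inc)

print(f"ANOVA p={p_anova:.3g}, Kruskal-Wallis p={p_kw:.3g}")
```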
16
Subjective evaluation: results and discussion
No significant difference between the Joker strategy and the non-incremental scenario (!), partially explained by the quality of the non-incremental system.
The Joker strategy significantly outperforms the Default strategy (for French) → a good candidate for iTTS systems.
The results are the following: surprisingly, there was no significant difference between the non-incremental voice and the Joker strategy. Moreover, the Joker strategy significantly outperforms the Default strategy for French.
17
Conclusion and perspectives
Context: prosody estimation in incremental HMM-based TTS.
Evaluated for French the baseline approach proposed by Baumann et al. for German and English.
Proposed an original approach. Core idea: integrate the potential uncertainty on some features when training the HTS voice.
Main results: listening tests showed that our approach outperforms the baseline technique for French.
Perspectives: replicate our study for English and German; integrate the Joker strategy into a complete iTTS system.
18
Subjective evaluation
Ranking test: the listener has to listen to each sample and place it on a continuous scale.
18 participants (in the same anechoic room).
Statistical analysis: both a parametric test (ANOVA) and a non-parametric one (Kruskal-Wallis).
Variable to explain: x-position on the ranking area. A 3-level explanatory factor (Joker, Default, non-incremental) + a random effect for the listener.
[Screenshots of the ranking interface: three numbered stimuli can be played and dragged onto a continuous scale labeled Bad – Poor – Fair – Good – Excellent, then the listener is asked "Do you want to change the order or validate it?" before validating.]
Meeting note: ADD SOUNDS!
19
HTS label format for French language
In order to synthesize the n-th phoneme of an utterance, we need to know (in red: features possibly related to the next word):
Identity of the n−2, n−1, n (current), n+1, n+2 phonemes
Position of the current phoneme in the current syllable (forward & backward)
Number of phonemes in the previous/current/next syllable
Identity of the vowel in the current syllable
Position of the current syllable in the word (forward & backward)
Position of the current syllable in the sentence (forward & backward)
Part-of-speech tag (POS tag) of the previous/current/next word
Number of syllables in the previous/current/next word
Position of the current word in the sentence (forward & backward)
Sentence type (assertion, wh-question, full question, etc.)
⇒ More than 10^19 possible combinations ⇒ need for clustering
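To show how such a full-context label degrades in iTTS mode, here is a deliberately simplified sketch of label construction where the right-context fields fall back to the joker value. The field layout and the `pos_next` helper are illustrative, not the exact HTS label format used by the authors; in a real system, future phonemes inside the current word would still be known.

```python
def make_label(phones, n, word_info, itts=False, joker="#"):
    """Build a simplified full-context label for the n-th phoneme."""
    get = lambda i: phones[i] if 0 <= i < len(phones) else "sil"
    nxt = (lambda i: joker) if itts else get  # future context unknown in iTTS
    return (f"{get(n-2)}^{get(n-1)}-{get(n)}+{nxt(n+1)}={nxt(n+2)}"
            f"/POS_next:{joker if itts else word_info['pos_next']}")

label = make_label(list("bonjour"), 3, {"pos_next": "NOUN"}, itts=True)
print(label)  # o^n-j+#=#/POS_next:#
```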
20
Objective evaluation
Objective measurements:
Spectrum – mel-cepstral distortion:
$$\mathrm{MCD}(\mathbf{y}_t^{S}, \mathbf{y}_t^{NI}) = \frac{1}{T} \sum_{t=1}^{T} \frac{10}{\ln(10)} \sqrt{2 \sum_{d=0}^{D} \left(y_{t,d}^{S} - y_{t,d}^{NI}\right)^2}$$
where $\mathbf{y}_t^{S}, \mathbf{y}_t^{NI}$ are vectors of $D+1$ mel-cepstral coefficients, $S \in \{\text{joker}, \text{default}\}$, $NI$ = non-incremental, and $T$ is the number of frames in the utterance.
Prosody (pitch), in cents:
$$E_{f_0} = \frac{1}{T} \sum_{t=1}^{T} \left| \log_2\!\left( \frac{f_0^{S}(t)}{f_0^{NI}(t)} \right) \right|$$
Prosody (duration):
$$E_{dur} = \frac{1}{P} \sum_{p=1}^{P} \left| \log\!\left( \frac{d_p^{NI}}{d_p^{S}} \right) \right|$$
where $P$ is the number of phonemes in the utterance.
Statistical significance between the Joker and Default strategies was assessed using paired t-tests.