Incremental Text-To-Speech synthesis
Outline
Incremental Text-To-Speech (iTTS) systems

Text-to-speech (TTS) systems are classically designed to process whole sentences at once. Incremental text-to-speech (iTTS) systems aim at delivering the speech output word after word, i.e. while the user is still typing.

[Figure: timeline comparing a conventional TTS system, where text input and speech output are aligned at the sentence level (S_n, S_n+1, S_n+2), with an incremental TTS system, where they are aligned at the word level (W_n, W_n+1).]

Performing word-by-word synthesis with a classical TTS system would lead to unnatural prosody, since each single word would be processed as a complete sentence.
Spectrum and prosody estimation in HMM-based speech synthesis

Spectrum estimation relies mainly on short-range features such as the quinphone (Le Maguer, 2013). This is not a major issue in iTTS (assuming a lookahead of one word).

Prosody estimation (pitch and segment duration) in HMM-based TTS systems relies mainly on long-range symbolic linguistic features describing the morphological and syntactic structure, such as the number of words in the sentence or the POS of the next word. Some of these features cannot be computed when processing the text input incrementally. Estimating an acceptable prosody from an 'incomplete' sentence is therefore one of the main challenges of iTTS systems.

Goals of our study:
– Evaluating on French a baseline approach proposed by Baumann et al. (2014) for German and English
– Investigating a new approach which explicitly deals with potentially missing features when training the HTS voice
iTTS: state of the art

First proof of concept by Edlund et al. (2008): the acoustic signal was computed from the full sentence and "only" delivered incrementally.

Astrinaki et al. (2012): development of a reactive implementation of the MLPG algorithm (pHTS/MAGE platform) from full-context labels (i.e. including information about past and future context).
– d'Alessandro, …, Dall, … et al. (eNTERFACE 2013): preliminary version of an iTTS system based on the MAGE platform.

Baumann et al. (2012-2014): complete incremental dialogue system (InProTK) including an HMM-based iTTS module (based on OpenMary).
Prosody estimation in iTTS systems (Baumann et al.)

'Default' strategy: linguistic features which cannot be computed when processing the text incrementally are substituted with 'default' values computed from the training set. The default value for each feature is defined as the mean value observed in the training set.

The voice is trained on full-context labels in the standard HTS format:

p1^p2-p3+p4=p5@p6_p7/A:a1_a2_a3/B:b1-b2-b3@b4-b5&b6-b7#b8-b9$b10-b11!b12-b13;b14-b15|b16/C:c1+c2+c3/D:d1_d2/E:e1+e2@e3+e4&e5+e6#e7+e8/F:f1_f2/G:g1_g2/H:h1=h2@h3=h4|h5/I:i1_i2/J:j1+j2-j3

At synthesis time, an unseen label with missing right-context values is completed with the default values before the regression tree is queried.
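As a minimal sketch of the Default substitution (hypothetical feature names and data layout; not the authors' actual code), one can compute the per-feature mean over the training set and use it to complete labels at synthesis time:

```python
from statistics import mean

# Right-context features that may be unknown when processing text incrementally
# (hypothetical names, for illustration only).
RIGHT_CONTEXT_FEATURES = ["n_phones_next_syl", "n_syls_next_word"]

def compute_defaults(training_labels):
    """Default value of a feature = mean value observed in the training set."""
    defaults = {}
    for feat in RIGHT_CONTEXT_FEATURES:
        values = [lab[feat] for lab in training_labels if feat in lab]
        defaults[feat] = round(mean(values))
    return defaults

def complete_label(label, defaults):
    """Substitute missing right-context features with their default values
    before querying the regression tree."""
    completed = dict(label)
    for feat in RIGHT_CONTEXT_FEATURES:
        completed.setdefault(feat, defaults[feat])
    return completed
```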
'Default' strategy (Baumann, 2014)

PRO: the HTS voice is trained on full-context labels.
CON: the regression tree used to build an 'unseen' HMM model at synthesis time is only partially exploited (i.e. some branches are unreachable).

[Figure: a decision tree queried with a reconstructed full-context label ('…-i+…/C:?+...' completed into '…-i+…/C:3+...'). For a question on a right-contextual feature such as 'C == 1?' (is the number of phonemes in the next syllable equal to 1?), the answer is always 'no' whenever C is unknown, since the default value is 3; the corresponding branches of the tree are therefore inaccessible.]
Proposed strategy: 'Joker'

Uncertainty about right-contextual features is handled when training the voice (rather than at synthesis time as in the Default strategy).

Core principle: treat the potential absence of a feature as relevant information in itself when training the HTS voice.

Practical implementation: for each phoneme of the training set, the linguistic features related to the right context (i.e. the future) are tagged with the same 'Joker' value; in the label format above, this concerns the fields that may depend on the next word(s).

State-tying: takes into account the potential uncertainty on each feature when clustering HMM states.
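A sketch of the Joker tagging step, under the same hypothetical feature names (the exact joker symbol is an assumption here; 'xx' is the HTS convention for undefined label fields):

```python
JOKER = "xx"  # assumed joker symbol; 'xx' denotes undefined fields in HTS labels

RIGHT_CONTEXT_FEATURES = ["n_phones_next_syl", "n_syls_next_word"]

def jokerize(label):
    """Tag every right-context feature of a training label with the joker
    value, so that state-tying sees the uncertainty on these features."""
    tagged = dict(label)
    for feat in RIGHT_CONTEXT_FEATURES:
        tagged[feat] = JOKER
    return tagged

example = {"phone": "a", "n_phones_cur_syl": 2, "n_phones_next_syl": 3}
print(jokerize(example))
# {'phone': 'a', 'n_phones_cur_syl': 2, 'n_phones_next_syl': 'xx', 'n_syls_next_word': 'xx'}
```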
Proposed strategy: 'Joker'

[Figure: a decision tree in which a node can ask "is a specific feature known?" — if yes, ask more about this feature; if no, investigate other features. When a feature is unknown, the rest of the tree remains exploitable.]

CON: the HTS voice is trained from incomplete labels.
PROS, at synthesis time:
– The regression tree used to build an 'unseen' HMM model is fully exploited (i.e. all branches remain potentially reachable).
– When synthesizing unseen models, there is no risk of using an incorrect model because of a wrongly guessed feature value; unresolved uncertainties are expected to lead to a more neutral intonation.
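A toy illustration of why the Joker tree remains fully exploitable (hypothetical node structure, not the HTS internals): a node can first ask whether a feature is known and only then ask about its value, so an unknown feature is routed to another subtree instead of being forced through a default answer:

```python
from dataclasses import dataclass
from typing import Any, Optional

JOKER = "xx"

@dataclass
class Node:
    kind: str                     # "known", "equals" or "leaf"
    feature: str = ""
    value: Any = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    state: str = ""               # tied HMM state id (leaves only)

def traverse(node, label):
    """Descend the decision tree down to a tied HMM state."""
    while node.kind != "leaf":
        v = label.get(node.feature, JOKER)
        if node.kind == "known":
            answer = v != JOKER          # question: is the feature known?
        else:
            answer = v == node.value     # question: does the feature equal value?
        node = node.yes if answer else node.no
    return node.state
```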
Experiments

Goal: compare Default vs. Joker (iTTS) vs. non-incremental TTS in terms of spectrum and prosody estimation, using both objective metrics and a perceptual test. The non-incremental TTS is considered here as the best achievable result (as in Baumann et al., 2014).

Material and method: the GIPSA-lab HMM-based TTS system for French, based on:
– 3h17min of speech extracted from the audiobook "Le tour du monde en 80 jours" (Jules Verne)
– NLP module: COMPOST (Bailly et al., 1990)
– Vocoder: harmonic plus noise modeling (HNM) of the full-band spectral envelope
– Features: 13/17 LSF coefficients modeling the harmonic/noise parts, plus Δ and ΔΔ; log(f0), plus Δ and ΔΔ
– Context-dependent HMMs (5 emitting states) trained with the HTS 1.X scripts (without GV and post-filtering)
Objective evaluation
Objective evaluation: results

                 Default          Joker            Joker vs. Default
MCD (dB)         0.78 ± 0.26      0.94 ± 0.15      ***
E_f0 (cents)     197.4 ± 88.7     178.2 ± 78.4     NS
E_dur            0.20 ± 0.06      0.17 ± 0.04      ***

Spectrum: Default outperforms Joker (by 0.16 dB). Possible explanation: …

Prosody (pitch and duration): the 'objective' difference between Default and Joker is either not significant (pitch) or very small (duration), despite the fact that the two strategies sound quite different (two audio examples played); hence the listening test.
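For reference, a sketch of how such objective metrics are conventionally computed (standard formulas; not necessarily the exact scripts used in the study):

```python
import numpy as np

def mcd_db(c_ref, c_syn):
    """Mel-cepstral distortion (dB) between two aligned sequences of
    spectral parameter vectors, excluding the 0th (energy) coefficient."""
    diff = np.asarray(c_ref)[:, 1:] - np.asarray(c_syn)[:, 1:]
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))))

def f0_error_cents(f0_ref, f0_syn):
    """Mean absolute F0 error in cents, on frames voiced in both signals."""
    f0_ref, f0_syn = np.asarray(f0_ref), np.asarray(f0_syn)
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.mean(np.abs(1200.0 * np.log2(f0_syn[voiced] / f0_ref[voiced]))))

def duration_error(dur_ref, dur_syn):
    """Mean relative error on per-phoneme durations."""
    dur_ref, dur_syn = np.asarray(dur_ref), np.asarray(dur_syn)
    return float(np.mean(np.abs(dur_syn - dur_ref) / dur_ref))
```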
Subjective evaluation

Ranking test: the user has to listen to each sample and place it on a continuous scale (labeled Bad / Poor / Fair / Good / Excellent). 18 participants (in the same anechoic room).

Statistical analysis:
– Both a parametric ANOVA test and a non-parametric one (Kruskal-Wallis)
– Dependent variable: the X position on the ranking area
– A 3-level explanatory factor (Joker, Default, non-incremental) plus a random effect for the listener
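A sketch of the described analysis using common Python tools (pandas/statsmodels/scipy assumed; the file and column names are hypothetical): a linear mixed model with strategy as a fixed factor and listener as a random effect, plus a non-parametric Kruskal-Wallis check.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import kruskal

# results.csv: one row per rated sample, with the X position on the ranking
# scale, the strategy (Joker / Default / NonIncremental) and the listener id.
df = pd.read_csv("results.csv")

# Parametric analysis: linear mixed model, random intercept per listener.
model = smf.mixedlm("position ~ C(strategy)", df, groups=df["listener"]).fit()
print(model.summary())

# Non-parametric check: Kruskal-Wallis across the three strategies.
groups = [g["position"].values for _, g in df.groupby("strategy")]
print(kruskal(*groups))
```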
Subjective evaluation: results and discussion

No significant difference between the Joker strategy and the non-incremental scenario (!), which is partially explained by the overall quality of the non-incremental system. The Joker strategy significantly outperforms the Default strategy (for French): it is therefore a very good candidate for iTTS systems.
Conclusion and perspectives

Context: prosody estimation in incremental HMM-based TTS.

Evaluated on French the baseline approach proposed by Baumann et al. for German and English; its core idea is to substitute missing linguistic features with default ones at synthesis time.

Proposed an original approach whose core idea is to integrate the potential uncertainty on some features when training the HTS voice.

Main results: listening tests showed that our approach outperforms the baseline for French.

Perspectives:
– Replicate our study for English and German
– Integrate the Joker strategy into a complete iTTS system
– …
HTS label format for the French language

In order to synthesize the n-th phoneme of an utterance, we need to know (the features possibly related to the next word are those referring to the next phoneme, syllable, or word below):
– Identity of the n-2, n-1, n (current), n+1, n+2 phonemes
– Position of the current phoneme in the current syllable (forward and backward)
– Number of phonemes in the previous/current/next syllable
– Identity of the vowel in the current syllable
– Position of the current syllable in the word (forward and backward)
– Position of the current syllable in the sentence (forward and backward)
– Part-of-speech tag (POS-tag) of the previous/current/next word
– Number of syllables in the previous/current/next word
– Position of the current word in the sentence (forward and backward)
– Sentence type (assertion, wh-question, full question, etc.)

=> More than 10^19 possible combinations => need for clustering.
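A minimal sketch of assembling one piece of such a full-context label, here just the quinphone field of the standard HTS format shown earlier (the phone notation and padding convention are illustrative assumptions):

```python
def quinphone(phones, n):
    """Identities of the n-2 .. n+2 phonemes around position n,
    padded with 'xx' at the utterance boundaries."""
    pad = ["xx", "xx"] + list(phones) + ["xx", "xx"]
    p1, p2, p3, p4, p5 = pad[n:n + 5]
    return f"{p1}^{p2}-{p3}+{p4}={p5}"

phones = ["b", "o~", "Z", "u", "R"]   # 'bonjour' in a SAMPA-like notation
print(quinphone(phones, 0))           # xx^xx-b+o~=Z  (utterance-initial phoneme)
```

In an iTTS setting, the p4/p5 slots (next phonemes) are exactly the kind of fields that may still be unknown at synthesis time and would receive the default or joker value.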