1
HMM-Based Synthesis of Creaky Voice
Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl
2
Creaky voice
- Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration
- Highly irregular, with secondary laryngeal excitations
3
Use of creaky voice
- Usually involuntary, but various systematic usages have been reported
- For instance, creaky voice has been observed as a
  - phrase boundary marker
  - turn-yielding mechanism
  - indication of hesitation
  - portrayal of social status
  - cue for communicating attitude and affective states
4
Synthesis of creaky voice
HMM-based synthesis of creaky voice requires
- an algorithm for automatic detection of creaky voice
- accurate f0 estimation and voicing decision
- prediction of creaky voice from context (text input)
- a vocoder capable of rendering creaky excitation
5
Previous work (1)
- Algorithm for automatic detection of creaky voice: Kane, Drugman & Gobl (Interspeech 2012; Computer Speech & Language 2013)
  - Based on two features derived from the linear prediction (LP) residual
- Prediction of creaky voice from context: Drugman, Kane, Raitio & Gobl (ICASSP 2013)
  - The contextual factors used in HTS training are adequate for predicting the use of creaky voice
6
Previous work (2)
Rendering of synthetic creaky voice
- Silén, Helander, Nurminen & Gabbouj (Interspeech, 2009)
  - Improved f0 and voicing decision, two-band voicing
  - Improved creaky rendering compared to STRAIGHT
  - Creaky voice NOT modeled explicitly
- Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku (IEEE TASLP, 2011)
  - Accurate low-pitch f0 estimation
- Drugman, Kane & Gobl (Interspeech, 2012)
  - Modeling of creaky excitation
7
This work…
- Compares different f0 estimation methods suitable for building creaky voice synthesis
- Culminates the previous research by creating a framework for creaky voice synthesis
- Explores the conversion of a normal synthetic voice to a creaky one
8
What modifications are required in order to construct creaky voice synthesis from a conventional HTS system?
9
[Diagram: HTS system]
Training: speech data and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model
Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech
10
A) Use a database of creaky voice
11
[Diagram: HTS system trained on creaky speech data]
Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model
Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech
12
B) Replace the f0 estimation method with one suitable for creaky voice
14
f0 estimation of creaky voice
- Creaky voice has low f0 and irregular excitation
- Many f0 trackers output spurious values or classify creak as unvoiced
- A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice:
  - GlottHMM
  - SWIPE (with SPTK 3.6 voicing decision)
  - RAPT (SPTK 3.6)
  - SPTK 3.1 cepstrum-based pitch function
  - STRAIGHT
  - TEMPO
15
f0 estimation of creaky voice – Evaluation
- Methods were mostly used with default settings
- Frame length was set to 45 ms whenever possible
- Speech data:
  - 3 databases of read speech for TTS development
    - American English male BDL
    - Finnish male MV
    - Finnish female HS
  - Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
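As a rough illustration of why a long analysis frame matters for low-pitched creak: assuming the common rule of thumb that a frame should span at least two pitch periods (an assumption, not a figure from the slides), the lowest f0 a frame can cover follows directly from its length. The function name `min_f0_hz` and the 25 ms comparison value are illustrative only.

```python
def min_f0_hz(frame_ms: float, periods: float = 2.0) -> float:
    """Lowest f0 (Hz) for which `periods` pitch periods fit in one frame.

    The default of two periods is an assumed rule of thumb, not a value
    taken from the presentation.
    """
    return periods * 1000.0 / frame_ms


# Compare a typical short frame (assumed 25 ms) with the 45 ms frame
# mentioned in the evaluation setup.
for frame_ms in (25.0, 45.0):
    print(f"{frame_ms:.0f} ms frame: f0 down to about {min_f0_hz(frame_ms):.1f} Hz")
```

Under this assumption, a longer frame extends the usable range toward the very low f0 values typical of creak.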
16
f0 estimation of creaky voice – Results
- GlottHMM [1] performed best with TTS data
- SPTK performed best with conversational speech
- For creaky voice TTS development, GlottHMM f0 estimation was chosen
[1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2011
17
What modifications are required in order to construct creaky voice synthesis from a conventional HTS system?
18
C) Detect creaky regions and model creak as a special case
22
Creaky voice detection
[Diagram: creaky voice synthesis system]
Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; creaky voice detection; extraction of creaky excitation (average creaky residual); HMM training (spectrum, f0, creaky probability); creaky voice model
Synthesis: text; front-end; parameter generation (creaky probability, f0, spectrum); synthesis (normal/creak); speech (creaky)
23
Creaky voice detection
- Hand annotation is too laborious, so automatic methods are needed
- An automatic creaky voice detection method by Kane & Drugman [1,2]
- Based on linear prediction (LP) residual features
[1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012
[2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
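The detector's features are computed from the LP residual. The sketch below shows only how such a residual can be obtained (autocorrelation-method LPC via Levinson-Durbin, then inverse filtering); it does not implement the creak features of [1,2]. The toy signal, the filter coefficients, and all function names are assumptions for illustration, and no analysis window is applied, unlike a practical front end.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion; returns A(z) coefficients with a[0] = 1."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a


def lp_residual(x, a):
    """Inverse filtering: e[n] = sum_k a[k] * x[n - k]."""
    order = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
            for n in range(len(x))]


def toy_voiced(n_samples=400, period=80):
    """Toy 'speech': an impulse train through an assumed 2-pole filter."""
    x = []
    for n in range(n_samples):
        e = 1.0 if n % period == 0 else 0.0
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        x.append(1.3 * x1 - 0.6 * x2 + e)
    return x


signal = toy_voiced()
coeffs = lpc(signal, order=2)
residual = lp_residual(signal, coeffs)
peak = max(range(len(residual)), key=lambda n: abs(residual[n]))
print("LP coefficients:", [round(c, 3) for c in coeffs])
print("largest residual peak at sample", peak)
```

On this toy signal the residual concentrates at the excitation instants, which is the property that residual-based features rely on.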
24
[Figure: probability of creak; LP residual]
25
Modeling creaky excitation
- Extension of the deterministic plus stochastic model (DSM) [1,2] which integrates a proper modeling of creaky voice
[1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012
[2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, IEEE Trans. on Audio, Speech and Lang. Proc., 2012
26
[Figure: deterministic component and envelope of the stochastic component of the creaky excitation, with the main excitation, the secondary excitation, and the glottal closure instants (GCIs) marked]
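A heavily simplified sketch of the deterministic-plus-stochastic idea: each pitch period is a fixed pulse plus noise shaped by a time envelope. The pulse shape with a main and a weaker secondary excitation, the envelope, the gains, and the function names are all hypothetical; the real DSM also separates the two components in frequency, which is omitted here.

```python
import random


def dsm_period(det_pulse, envelope, noise_gain, rng):
    """One pitch period: deterministic pulse plus envelope-shaped noise."""
    if len(det_pulse) != len(envelope):
        raise ValueError("pulse and envelope must have the same length")
    return [d + noise_gain * env * rng.gauss(0.0, 1.0)
            for d, env in zip(det_pulse, envelope)]


def toy_creaky_pulse(length=200, secondary_at=120):
    """Hypothetical creaky pulse: strong main peak, weaker secondary peak."""
    pulse = [0.0] * length
    pulse[0] = -1.0              # main excitation at the GCI
    pulse[secondary_at] = -0.4   # secondary excitation later in the period
    # Assumed noise envelope: largest around the two excitations.
    envelope = [max(0.1,
                    1.0 - abs(n) / 20.0,
                    0.6 - abs(n - secondary_at) / 20.0)
                for n in range(length)]
    return pulse, envelope


rng = random.Random(0)           # fixed seed so the sketch is repeatable
pulse, envelope = toy_creaky_pulse()
excitation = []
for _ in range(3):               # concatenate three pitch periods
    excitation.extend(dsm_period(pulse, envelope, noise_gain=0.05, rng=rng))
print(len(excitation), "excitation samples")
```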
28
Voice building and synthesis
- Training:
  - Standard HTS method with the addition of a 1-dimensional stream of creaky probability
  - Spectrum: 30th-order mel-generalized cepstral analysis with alpha = and gamma = -1/3 (converted to LSFs)
- Synthesis:
  - Excitation: DSM vocoder with creaky parts rendered with the creaky excitation
  - Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
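One way to picture how a generated creaky-probability stream could steer the vocoder is a per-frame switch between normal and creaky excitation. The slides do not say how the probability is mapped to a decision, so the 0.5 threshold and the function names below are assumptions.

```python
from itertools import groupby


def choose_excitation(creaky_prob, threshold=0.5):
    """Label each frame 'creak' or 'normal' from its creaky probability.

    The threshold is an assumed value, not one given in the presentation.
    """
    return ["creak" if p >= threshold else "normal" for p in creaky_prob]


def segments(labels):
    """Collapse frame labels into (label, start_frame, end_frame) runs."""
    runs, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        runs.append((label, start, start + length))
        start += length
    return runs


# Hypothetical generated trajectory: creak appears toward the phrase end.
prob = [0.05, 0.10, 0.20, 0.65, 0.90, 0.85]
labels = choose_excitation(prob)
print(segments(labels))
```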
29
Evaluation
The following systems were compared:
1. Conventional (STRAIGHT f0)
2. Proposed (GlottHMM f0)
3. Proposed (GlottHMM f0 and creaky excitation)
- Subjective online listening tests
- Stimuli: 20 sentences from the held-out data of BDL and MV
- 29 test subjects
31
Evaluation – MOS naturalness
- Results indicate that systems 2 and 3 have higher (p<0.001) ratings than system 1
- The difference between systems 2 and 3 is not significant
Conclusions:
- Use of GlottHMM f0 improves naturalness
- Modeling of creaky excitation has no effect on MOS
[Figure: MOS naturalness for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation]
32
Evaluation – Creaky rendering
- Pairwise comparison of samples
- Systems 2 and 3 are preferred over system 1
- System 3 is preferred over system 2
Conclusions:
- Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering
[Figure: pairwise preferences, each with a "no preference" option, for system 3 (GlottHMM f0 + creaky excitation) vs. system 2 (GlottHMM f0), system 3 vs. system 1 (STRAIGHT f0), and system 2 vs. system 1]
33
Is it possible to transplant a creaky voice quality to a non-creaky speaker?
34
Adding creak for a non-creaky speaker
- Convert the non-creaky voice of Scottish English male AWB to creaky
- Transplantation strategy:
  - Creaky voice is predicted from American English male BDL
  - Creaky excitation pulse from BDL is used to render creak
  - f0 is either:
    - kept as is
    - substituted with BDL f0 by stream substitution
    - transformed only in the creaky parts
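The three f0 options can be pictured at the frame level as follows. This is only a sketch: stream substitution actually happens in the voice model rather than on generated frames, and the slides do not specify the transformation used in creaky parts, so the scaling factor, the threshold, the use of 0 for unvoiced frames, and the function name are all assumptions.

```python
def transplant_f0(target_f0, donor_f0, creaky_prob, strategy,
                  threshold=0.5, scale=0.5):
    """Return an f0 trajectory under one of three illustrative strategies.

    'keep'             : target speaker's f0 unchanged
    'substitute'       : donor speaker's f0 used throughout
    'transform_creaky' : target f0 lowered only in frames judged creaky
    Unvoiced frames are assumed to be coded as f0 = 0.
    """
    if not (len(target_f0) == len(donor_f0) == len(creaky_prob)):
        raise ValueError("all trajectories must have the same length")
    if strategy == "keep":
        return list(target_f0)
    if strategy == "substitute":
        return list(donor_f0)
    if strategy == "transform_creaky":
        return [f * scale if (p >= threshold and f > 0) else f
                for f, p in zip(target_f0, creaky_prob)]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical frame values (Hz); 0 marks an unvoiced frame.
target = [120, 0, 110, 100]
donor = [90, 0, 45, 40]
prob = [0.1, 0.0, 0.8, 0.9]
for name in ("keep", "substitute", "transform_creaky"):
    print(name, transplant_f0(target, donor, prob, name))
```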
35
Evaluation
Four different voices were built:
1. AWB (baseline)
2. AWB with BDL creaky excitation
3. AWB with BDL creaky excitation and BDL f0
4. AWB with BDL creaky excitation and f0 transformation
37
Evaluation
- Subjective online listening tests
- 14 test subjects
- 28 synthesized stimuli
- Samples were rated on two scales:
  - Standard MOS naturalness
  - Impression of creakiness from 1 to 5, with labeled endpoints: 1 = does not sound like creaky voice, 5 = sounds exactly like creaky voice
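For readers unfamiliar with how 1-to-5 ratings are usually summarized, the sketch below computes a mean opinion score with a normal-approximation 95% confidence interval. The ratings are made-up placeholders, not data from this study, and the slides do not state which statistical test produced the reported significance levels.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and a normal-approximation confidence interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - half_width, mean + half_width)


# Hypothetical ratings for one system (NOT results from the presentation).
example_ratings = [1, 2, 3, 4, 5]
mos, (low, high) = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```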
38
Evaluation results – MOS
- System 3 is rated lower than system 1
- No other statistically significant differences
Conclusions:
- Creaky voice transformation does not decrease naturalness, except when the f0 of BDL was used
- The degradation of system 3 is probably due to different prosody
[Figure: MOS naturalness for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]
39
Evaluation results – Creakiness
- System 1 is rated less creaky than the other systems
Conclusions:
- Creaky voice transformation is successful: all transformed voices are rated creaky
- f0 has less effect on the impression of creakiness, but it contributes to naturalness
[Figure: creakiness ratings for AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]
40
Summary
- Methods for the HMM-based synthesis of creaky voice were investigated
- This requires:
  - a method for detecting creaky voice
  - a robust pitch tracker and voicing decision
  - prediction of creaky voice from contextual factors
  - a dedicated vocoder for rendering the creaky excitation
- Evaluation showed a significant improvement in naturalness and creakiness
- Transformation of a non-creaky speaker to a creaky one was successful
Thank you!