HMM-Based Synthesis of Creaky Voice

Uploaded: 2017-10-05T21:03:36+00:00
Duration: PTM17S36
Channel: Mercy Houston

HMM-Based Synthesis of Creaky Voice
Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl

Creaky voice
- Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration
- Highly irregular, with secondary laryngeal excitations

Use of creaky voice
- Usually involuntary, but various systematic usages have been reported
- For instance, creaky voice has been observed as:
  - a phrase boundary marker
  - a turn-yielding mechanism
  - an indication of hesitation
  - a portrayal of social status
  - a cue for communicating attitude and affective states

Synthesis of creaky voice
HMM-based synthesis of creaky voice requires:
- An algorithm for automatic detection of creaky voice
- Accurate f0 estimation and voicing decision
- Prediction of creaky voice from context (text input)
- A vocoder capable of rendering creaky excitation

Previous work (1)
Algorithm for automatic detection of creaky voice:
- Kane, Drugman & Gobl (Interspeech 2012; Speech Comm. 2013)
- Based on two features derived from the linear prediction (LP) residual
Prediction of creaky voice from context:
- Drugman, Kane, Raitio & Gobl (ICASSP 2013)
- The contextual factors used in HTS training are adequate for predicting the use of creaky voice

Previous work (2)
Rendering of synthetic creaky voice:
- Silén, Helander, Nurminen & Gabbouj (Interspeech, 2009)
  - Improved f0 and voicing decision, two-band voicing
  - Improved creaky rendering compared to STRAIGHT
  - Creaky voice NOT modeled explicitly
- Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku (IEEE TASLP, 2011)
  - Accurate low-pitch f0 estimation
- Drugman, Kane & Gobl (Interspeech, 2012)
  - Modeling of creaky excitation

This work
- Compares different f0 estimation methods suitable for building creaky voice synthesis
- Builds on the previous research by creating a framework for creaky voice synthesis
- Explores the conversion of a normal synthetic voice to a creaky one

What modifications are required in order to construct a creaky voice synthesis system from a conventional HTS system?

[Diagram: conventional HTS pipeline. Training: speech data and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]

A) Use a database of creaky voice

[Diagram: conventional HTS pipeline with the training speech data replaced by creaky speech data.]

B) Replace the f0 estimation method with one suitable for creaky voice

f0 estimation of creaky voice
- Creaky voice has low f0 and irregular excitation
- Many f0 trackers output spurious values or classify creak as unvoiced
- A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice:
  - GlottHMM
  - SWIPE (with SPTK 3.6 voicing decision)
  - RAPT (SPTK 3.6)
  - SPTK 3.1 cepstrum-based pitch function
  - STRAIGHT TEMPO

f0 estimation of creaky voice – Evaluation
- Methods were mostly used with default settings
- Frame length was set to 45 ms whenever possible
- Speech data:
  - 3 databases of read speech for TTS development
    - American English male BDL
    - Finnish male MV
    - Finnish female HS
  - Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
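A rough illustration of why a long analysis frame matters for low f0: the sketch below counts how many pitch periods fit into one frame. The f0 values and the 25 ms comparison frame are illustrative assumptions, not figures from the slides; only the 45 ms frame length comes from the evaluation setup.

```python
def periods_per_frame(f0_hz: float, frame_ms: float) -> float:
    """Return how many pitch periods one analysis frame spans."""
    return f0_hz * frame_ms / 1000.0


if __name__ == "__main__":
    # Assumed, illustrative f0 values: a typical male pitch and two low, creak-like values.
    for f0_hz in (120.0, 50.0, 30.0):
        # 25 ms is an assumed "short" frame for comparison; 45 ms is from the slides.
        for frame_ms in (25.0, 45.0):
            n = periods_per_frame(f0_hz, frame_ms)
            print(f"f0 = {f0_hz:5.1f} Hz, frame = {frame_ms:4.1f} ms: {n:.2f} periods")
```

When a frame covers less than about two periods, periodicity-based trackers have little to correlate against, which is one plausible reason creak gets labeled unvoiced.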

f0 estimation of creaky voice – Results
- GlottHMM [1] performed best with TTS data
- SPTK performed best with conversational speech
- For creaky voice TTS development, GlottHMM f0 estimation was chosen
[1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2011


C) Detect creaky regions and model creak as a special case


Creaky voice detection
[Diagram: extended HTS pipeline. Training: creaky speech data and labels; spectrum estimation; f0 and voicing decision; creaky voice detection; HMM training (spectrum, f0, creaky probability); creaky voice model; extraction of creaky excitation into an average creaky residual. Synthesis: text; front-end; parameter generation (creaky probability, f0, spectrum); synthesis (normal/creak); creaky speech.]

- Hand-annotation is too laborious → automatic methods are needed
- An automatic creaky voice detection method by Kane & Drugman [1,2]
- Based on linear prediction (LP) residual features
[1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012
[2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013

[Figure: LP residual of a speech segment and the corresponding probability of creak.]
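The detector operates on features of the LP residual. As background only, here is a minimal pure-Python sketch of how an LP residual can be obtained with the autocorrelation method and the Levinson-Durbin recursion. It is not the Kane & Drugman detector: the toy signal, the filter order, and all function names are assumptions made for illustration.

```python
def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns a[0..order] with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # prediction error shrinks each step
    return a


def lp_residual(x, order):
    """Inverse-filter x with its own LP coefficients: e[n] = sum_j a[j] * x[n - j]."""
    a = levinson_durbin(autocorrelation(x, order), order)
    return [sum(a[j] * x[n - j] for j in range(order + 1) if n - j >= 0)
            for n in range(len(x))]


def synth_resonator(n_samples, period):
    """Toy signal (assumption): impulse train through a fixed two-pole resonator."""
    x = []
    for n in range(n_samples):
        excitation = 1.0 if n % period == 0 else 0.0
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(1.3 * prev1 - 0.6 * prev2 + excitation)
    return x


signal = synth_resonator(800, 80)
residual = lp_residual(signal, order=2)
# Sample indices where the residual is large, i.e. the recovered excitation pulses.
peaks = [n for n, e in enumerate(residual) if abs(e) > 0.5]
print(peaks)
```

In the toy case the residual collapses to the excitation pulses; in real creaky speech the residual additionally shows the secondary excitations that the detection features are designed to capture.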

Modeling creaky excitation
- Extension of the deterministic plus stochastic model (DSM) [1,2] that integrates a proper modeling of creaky voice
[1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012
[2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, IEEE Trans. on Audio, Speech and Lang. Proc., 2012

[Figure: creaky excitation model showing the deterministic component and the envelope of the stochastic component, with the main and secondary excitations marked at glottal closure instants (GCIs).]


Voice building and synthesis
Training:
- Standard HTS method with the addition of a 1-dimensional stream of creaky probability
- Spectrum: 30th-order mel-generalized cepstral analysis with alpha = and gamma = -1/3 (converted to LSFs)
Synthesis:
- Excitation: DSM vocoder, with creaky parts rendered with the creaky excitation
- Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
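One way to picture how the generated creaky-probability stream could steer the vocoder at synthesis time is a per-frame switch between excitation types. The sketch below is an assumption-laden illustration: the 0.5 threshold, the `Frame` type, and the function name are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    voiced: bool                # voicing decision generated for this frame
    creaky_probability: float   # value of the extra 1-dimensional stream


def choose_excitation(frame: Frame, threshold: float = 0.5) -> str:
    """Pick the excitation type for one frame (the threshold is an assumed value)."""
    if not frame.voiced:
        return "unvoiced"
    if frame.creaky_probability >= threshold:
        return "creaky"         # rendered with the creaky excitation
    return "normal"             # rendered with the regular DSM excitation


def excitation_plan(frames: List[Frame], threshold: float = 0.5) -> List[str]:
    """Excitation type for every frame of an utterance."""
    return [choose_excitation(f, threshold) for f in frames]


frames = [Frame(True, 0.05), Frame(True, 0.80), Frame(False, 0.90), Frame(True, 0.50)]
print(excitation_plan(frames))
```

A hard threshold is the simplest choice; smoothing the probability over time or adding hysteresis would avoid rapid switching, but whether the original system does so is not stated.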

Evaluation
The following systems were compared:
1. Conventional (STRAIGHT f0)
2. Proposed (GlottHMM f0)
3. Proposed (GlottHMM f0 and creaky excitation)
Subjective online listening tests
- Stimuli: 20 sentences from the held-out data of BDL and MV
- 29 test subjects

Evaluation – MOS naturalness
- Results indicate that systems 2 and 3 have higher ratings than system 1 (p < 0.001)
- Difference between systems 2 and 3 is not significant
Conclusions:
- Use of GlottHMM f0 improves naturalness
- Modeling of creaky excitation has no effect on MOS
[Figure: MOS naturalness for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation.]

Evaluation – Creaky rendering
- Pairwise comparison of samples
- Systems 2 and 3 are preferred over system 1
- System 3 is preferred over system 2
Conclusions:
- Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering
[Figure: pairwise preference results, including "no preference", for each pair of system 1 (STRAIGHT f0), system 2 (GlottHMM f0), and system 3 (GlottHMM f0 + creaky excitation).]

Is it possible to transplant a creaky voice quality to a non-creaky speaker?

Adding creak for non-creaky speaker
- Convert the non-creaky voice of Scottish English male AWB to creaky
- Transplantation strategy:
  - Creaky voice is predicted from American English male BDL
  - Creaky excitation pulse from BDL is used to render creak
  - f0 is either:
    - kept as is
    - substituted with BDL f0 by stream substitution
    - transformed only in the creaky parts
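The three f0 options can be sketched on already generated, frame-aligned f0 contours. This is a simplification: in an HTS system, stream substitution swaps the f0 stream models rather than finished contours, and the slides do not say how f0 is transformed in the creaky parts. The scaling factor of 0.5, the contour values, and the function names are therefore hypothetical.

```python
from typing import List, Sequence


def keep_f0(target_f0: Sequence[float]) -> List[float]:
    """Option 1: leave the target speaker's f0 untouched."""
    return list(target_f0)


def substitute_f0(target_f0: Sequence[float], donor_f0: Sequence[float]) -> List[float]:
    """Option 2: use the donor speaker's f0 for every frame."""
    if len(target_f0) != len(donor_f0):
        raise ValueError("contours must be frame-aligned")
    return list(donor_f0)


def transform_f0_in_creak(target_f0: Sequence[float],
                          creaky: Sequence[bool],
                          scale: float = 0.5) -> List[float]:
    """Option 3: change f0 only in voiced frames marked creaky (scale is assumed)."""
    return [f * scale if is_creaky and f > 0.0 else f
            for f, is_creaky in zip(target_f0, creaky)]


# Made-up contours in Hz; 0.0 marks an unvoiced frame.
awb_f0 = [110.0, 100.0, 0.0, 90.0]
bdl_f0 = [95.0, 60.0, 0.0, 45.0]
creaky_mask = [False, True, False, True]

print(keep_f0(awb_f0))
print(substitute_f0(awb_f0, bdl_f0))
print(transform_f0_in_creak(awb_f0, creaky_mask))
```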

Evaluation
Four different voices were built:
1. AWB (baseline)
2. AWB with BDL creaky excitation
3. AWB with BDL creaky excitation and BDL f0
4. AWB with BDL creaky excitation and f0 transformation


Evaluation
Subjective online listening tests
- 14 test subjects
- 28 synthesized stimuli
- Samples were rated with two scales:
  - Standard MOS naturalness
  - Impression of creakiness from 1 to 5 (1 = does not sound like creaky voice, 5 = sounds exactly like creaky voice; intermediate points unlabeled)
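Ratings on either 1-to-5 scale are typically summarized as a mean with a confidence interval. The sketch below uses a normal approximation for the 95% interval; the ratings are invented for illustration and are not data from this study, and the slides do not state which statistical procedure was actually used.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Invented ratings for a single system, for illustration only.
example_ratings = [3, 4, 4, 5, 4]
mean, half_width = mean_with_ci(example_ratings)
print(f"mean = {mean:.2f} +/- {half_width:.2f}")
```

With as few as 14 listeners, a t-based interval or a non-parametric test would be a more cautious choice than the normal approximation shown here.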

Evaluation results – MOS
- System 3 is rated lower than system 1
- No other statistically significant differences
Conclusions:
- Creaky voice transformation does not decrease naturalness, except when the f0 of BDL was used
- Degradation of system 3 is probably due to different prosody
[Figure: MOS naturalness for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation.]

Evaluation results – Creakiness
- System 1 is rated less creaky than the other systems
Conclusions:
- Creaky voice transformation is successful: all transformed voices are rated creaky
- f0 has less effect on the impression of creakiness, but it contributes to naturalness
[Figure: creakiness ratings for AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation.]

Summary
- Methods for the HMM-based synthesis of creaky voice were investigated
- This requires:
  - a method for detecting creaky voice
  - a robust pitch tracker and voicing decision
  - prediction of creaky voice from contextual factors
  - a dedicated vocoder for rendering the creaky excitation
- Evaluation showed a significant improvement in naturalness and creakiness
- Transformation of a non-creaky speaker to a creaky one was successful
Thank you!
