2 A Bayesian Approach to HMM-Based Speech Synthesis
Kei Hashimoto, Heiga Zen, Yoshihiko Nankaku, Takashi Masuko, and Keiichi Tokuda
Nagoya Institute of Technology / Tokyo Institute of Technology

3 Background
 HMM-based speech synthesis system
  Spectrum, excitation, and duration are modeled by HMMs
  Speech parameter sequences are generated from the trained models
 Maximum likelihood (ML) criterion
  Used both to train the HMMs and to generate speech parameters
  Point estimate ⇒ the over-fitting problem
 Bayesian approach
  Estimates the posterior distribution of the model parameters
  Prior information can be used ⇒ alleviates the over-fitting problem
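The over-fitting problem behind this slide can be shown numerically. Below is a minimal sketch (not from the paper): an ML Gaussian fit to two samples collapses its variance estimate, while a Bayesian posterior predictive with a simple conjugate prior (an assumed N(0, 1) prior on the mean, known unit noise variance) stays robust on a held-out point.

```python
import math

# ML fit of a Gaussian to very little data: the variance estimate collapses,
# so plausible unseen samples get a vanishingly small likelihood.
data = [0.48, 0.52]            # two samples, nominally from a true N(0, 1)
ml_mean = sum(data) / len(data)
ml_var = sum((x - ml_mean) ** 2 for x in data) / len(data)   # ML variance

def gauss_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# A held-out point that is perfectly plausible under the true N(0, 1):
held_out = 1.5
ml_score = gauss_logpdf(held_out, ml_mean, ml_var)

# Bayesian alternative (assumed prior N(0, 1) on the mean, known variance 1):
# the posterior predictive is wider, so it is far less surprised.
n = len(data)
post_mean = sum(data) / (n + 1)   # prior precision 1 plus n observations
post_var = 1.0 / (n + 1)
pred_var = 1.0 + post_var         # predictive variance = noise + posterior
bayes_score = gauss_logpdf(held_out, post_mean, pred_var)

print(ml_score, bayes_score)      # ML assigns a far lower log-likelihood
```

The point estimate commits to whatever the tiny sample happened to show; the predictive distribution averages over the remaining parameter uncertainty.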

4 Outline
 Bayesian speech synthesis
  Variational Bayesian method
  Speech parameter generation
 Bayesian context clustering
  Prior distribution using cross validation
 Experiments
 Conclusion & future work

5 Bayesian speech synthesis (1/2)
Model training and speech synthesis
 Notation
  O : training data seq.   o : synthesis data seq.
  L : label seq. for training   l : label seq. for synthesis
  λ : model parameters
 ML: train a point estimate of the model parameters, then generate from it
  λ̂ = argmax_λ p(O | λ, L),   ô = argmax_o p(o | λ̂, l)
 Bayes: generate directly from the predictive distribution
  ô = argmax_o p(o | O, l, L)

6 Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood):
 p(o | O, l, L) ∝ Σ_{z,Z} ∫ p(o, z | λ, l) p(O, Z | λ, L) p(λ) dλ
  z : HMM state seq. for synthesis data   Z : HMM state seq. for training data
  p(o, z | λ, l) : likelihood of synthesis data
  p(O, Z | λ, L) : likelihood of training data
  p(λ) : prior distribution for model parameters
The sum over state sequences and the integral over λ are intractable
 ⇒ variational Bayesian method [Attias; '99]

7 Variational Bayesian method (1/2)
Estimate an approximate posterior distribution ⇒ maximize a lower bound F:
 log p(o, O | l, L) ≥ E_Q[ log ( p(o, z | λ, l) p(O, Z | λ, L) p(λ) / Q(λ, z, Z) ) ] ≡ F
  (by Jensen's inequality)
 E_Q[·] : expectation w.r.t. Q(λ, z, Z)
 Q(λ, z, Z) : approximate distribution of the true posterior distribution
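Jensen's inequality above can be checked numerically in a toy model where the evidence is available in closed form. A minimal sketch, assuming a scalar Gaussian model x ~ N(μ, 1) with prior μ ~ N(0, 1), so that p(x) = N(x; 0, 2); the `elbo` function plays the role of the slide's lower bound F:

```python
import math

# For any Q(mu) = N(m, s2), the lower bound never exceeds log p(x),
# with equality exactly when Q is the true posterior N(x/2, 1/2).
x = 1.7

def elbo(m, s2):
    # E_Q[log p(x | mu)] under Q(mu) = N(m, s2)
    e_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # E_Q[log p(mu)] with prior N(0, 1)
    e_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)
    # Entropy of Q(mu)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return e_lik + e_prior + entropy

# Closed-form evidence: log N(x; 0, 2)
log_evidence = -0.5 * math.log(2 * math.pi * 2) - x ** 2 / 4

print(elbo(0.0, 1.0), elbo(x / 2, 0.5), log_evidence)
```

Any suboptimal Q leaves a gap (the KL divergence to the true posterior); maximizing F therefore drives Q toward the posterior.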

8 Variational Bayesian method (2/2)
 Assume the random variables are statistically independent:
  Q(λ, z, Z) = Q(λ) Q(z) Q(Z)
 Optimal posterior distributions (C_λ, C_z, C_Z are normalization terms):
  Q(λ) = (1/C_λ) p(λ) exp E_{Q(z)Q(Z)}[ log p(o, z | λ, l) + log p(O, Z | λ, L) ]
  Q(z) = (1/C_z) exp E_{Q(λ)}[ log p(o, z | λ, l) ]
  Q(Z) = (1/C_Z) exp E_{Q(λ)}[ log p(O, Z | λ, L) ]
 The updates depend on each other ⇒ iterative updates, as in the EM algorithm
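The mutually dependent updates above can be sketched on a much simpler model than the slide's HMM: a scalar Gaussian with unknown mean μ and precision τ, factorized as Q(μ)Q(τ) with a conjugate Normal-Gamma prior. This is a textbook VB example, not the paper's posteriors, and the hyperparameter values are arbitrary illustrations.

```python
import math, random

# Coordinate-ascent VB: Q(mu) needs E_Q[tau], Q(tau) needs the moments of
# Q(mu), so the two updates are iterated to convergence, EM-style.
# Prior: mu ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0).
random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(500)]
n = len(data)
xbar = sum(data) / n
sq = sum((x - xbar) ** 2 for x in data)

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3
e_tau = 1.0                       # initial guess for E_Q[tau]
for _ in range(50):
    # Update Q(mu) = N(m, v): uses the current E_Q[tau]
    m = (lam0 * mu0 + n * xbar) / (lam0 + n)
    v = 1.0 / ((lam0 + n) * e_tau)
    # Update Q(tau) = Gamma(a, b): uses E_Q[(mu - c)^2] = v + (m - c)^2
    a = a0 + 0.5 * (n + 1)
    b = b0 + 0.5 * (lam0 * (v + (m - mu0) ** 2)
                    + sq + n * (v + (m - xbar) ** 2))
    e_tau = a / b

print(m, e_tau)   # approximate posterior mean of mu, and E_Q[tau]
```

With 500 samples drawn around mean 2.0 and precision 1.0, the converged `m` sits near the sample mean and `e_tau` near the sample precision.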

9 Approximation for speech synthesis
 Q(λ) depends on the synthesis data
  ⇒ huge computational cost in the synthesis part
 Ignore the dependency on the synthesis data
  ⇒ Q(λ) is estimated from the training data only

10 Prior distribution
 Conjugate prior distribution
  ⇒ the posterior becomes the same family of distribution as the prior
  (Gaussian likelihood function ⇒ Gauss-Wishart conjugate prior)
 Hyperparameters are determined using the statistics of prior data:
  D : dimension of the feature vector
  N : number of prior data
  μ : mean of prior data
  Σ : covariance of prior data
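A univariate sketch of this determination step (the slide's actual prior is a multivariate Gauss-Wishart; the Normal-Gamma mapping below is an illustrative assumption): hyperparameters are set from the count, mean, and variance of the prior data, and a conjugate update returns parameters of the same family.

```python
# Map prior-data statistics to Normal-Gamma hyperparameters, then update on
# observed data: conjugacy means the posterior is the same 4-tuple family,
# only with shifted hyperparameters.

def hyper_from_stats(n_prior, mean_prior, var_prior):
    """Prior-data statistics -> (mu0, lam0, a0, b0) for p(mu, tau)."""
    return mean_prior, float(n_prior), n_prior / 2.0, n_prior * var_prior / 2.0

def posterior(hyper, data):
    """Conjugate update: returns hyperparameters of the same family."""
    mu0, lam0, a0, b0 = hyper
    n = len(data)
    xbar = sum(data) / n
    sq = sum((x - xbar) ** 2 for x in data)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (sq + lam0 * n * (xbar - mu0) ** 2 / lam_n)
    return mu_n, lam_n, a_n, b_n

prior = hyper_from_stats(n_prior=10, mean_prior=0.0, var_prior=1.0)
post = posterior(prior, [1.0, 1.2, 0.8, 1.1, 0.9])
print(post)   # same family as `prior`, mean pulled toward the data
```

The posterior mean lands between the prior mean and the data mean, weighted by the effective counts, which is how the prior data regularizes the estimate.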

11 Speech parameter generation
 The speech parameter vectors consist of static and dynamic features
  ⇒ only the static feature sequence is generated
 Speech parameter generation based on the Bayesian approach
  ⇒ maximize the lower bound F with respect to the static feature sequence
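The generation step can be illustrated with the standard static/dynamic-feature construction o = Wc. A minimal ML-style sketch (plug-in means and precisions rather than the paper's Bayesian expectations; scalar statics; assumed delta windows [-0.5, 0, 0.5] and [1, -2, 1] with edge replication): solve the normal equations (WᵀPW)c = WᵀPμ for the static trajectory c.

```python
# Each frame's observation stacks [static, delta, delta-delta], all linear in
# the static sequence c via a window matrix W. Generation solves for c.
T = 5
DELTA = [-0.5, 0.0, 0.5]       # assumed window for delta
DELTA2 = [1.0, -2.0, 1.0]      # assumed window for delta-delta

def window_matrix(t_len):
    """Dense 3T x T matrix; rows ordered [static_0, d_0, dd_0, static_1, ...]."""
    w = []
    for t in range(t_len):
        for win in ([0.0, 1.0, 0.0], DELTA, DELTA2):
            row = [0.0] * t_len
            for k, coef in enumerate(win):
                idx = min(max(t + k - 1, 0), t_len - 1)   # replicate edges
                row[idx] += coef
            w.append(row)
    return w

def solve(a, b):
    """Tiny Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

w = window_matrix(T)
mu = [1.0, 0.0, 0.0] * T       # target: constant static 1.0, zero deltas
prec = [1.0] * (3 * T)         # unit precisions for simplicity

# Normal equations (W' P W) c = W' P mu
wpw = [[sum(prec[r] * w[r][i] * w[r][j] for r in range(3 * T))
        for j in range(T)] for i in range(T)]
wpmu = [sum(prec[r] * w[r][i] * mu[r] for r in range(3 * T)) for i in range(T)]
c = solve(wpw, wpmu)
print(c)   # a constant trajectory at 1.0 satisfies all targets exactly
```

Because the dynamic-feature targets couple neighboring frames, the solution is a smooth trajectory rather than a frame-by-frame pick of the means.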

12 Relation between Bayes and ML
Comparing the output distributions with the ML criterion:
 ML ⇒ uses point estimates of the model parameters
 Bayes ⇒ uses expectations of the model parameters under the posterior
 Generation can therefore be solved in the same fashion as in ML

13 Outline
 Bayesian speech synthesis
  Variational Bayesian method
  Speech parameter generation
 Bayesian context clustering
  Prior distribution using cross validation
 Experiments
 Conclusion & future work

14 Bayesian context clustering
Decision-tree context clustering based on maximizing the lower bound F:
 At each node, select the yes/no question (e.g., "Is this phoneme a vowel?")
  that gives the largest gain of F after splitting
 Stopping condition ⇒ split the node only while the gain of F is positive
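The greedy split selection can be sketched as follows. This is an illustrative stand-in: the node score below is a per-leaf Gaussian log-likelihood minus a fixed penalty, not the paper's variational lower bound F, and the context questions are toy examples.

```python
import math

# At a leaf, try every yes/no context question, score each split, and keep
# the best one; a split is only worthwhile while its gain is positive.

def node_score(values, penalty=2.0):
    n = len(values)
    if n < 2:
        return -penalty
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    return loglik - penalty        # per-leaf penalty discourages tiny splits

def best_split(samples, questions):
    """samples: list of (context_dict, value); questions: list of (name, fn)."""
    parent = node_score([v for _, v in samples])
    best = None
    for name, ask in questions:
        yes = [v for ctx, v in samples if ask(ctx)]
        no = [v for ctx, v in samples if not ask(ctx)]
        gain = node_score(yes) + node_score(no) - parent
        if yes and no and (best is None or gain > best[0]):
            best = (gain, name, yes, no)
    return best    # None, or (gain, question name, yes values, no values)

# Toy data: "vowel" contexts have clearly higher values than consonants.
samples = [({"vowel": True}, 5.0 + 0.1 * i) for i in range(10)] + \
          [({"vowel": False}, -5.0 + 0.1 * i) for i in range(10)]
questions = [("is-vowel", lambda ctx: ctx["vowel"]),
             ("always-no", lambda ctx: False)]
gain, name, yes, no = best_split(samples, questions)
print(name, gain > 0)
```

Recursing on the yes/no children with the same procedure, and stopping when no question yields a positive gain, produces the clustering tree.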

15 Impact of prior distribution
 The prior hyperparameters affect model selection as tuning parameters
  ⇒ a technique for determining the prior distribution is required
 Conventional: maximize the marginal likelihood
  Leads to the over-fitting problem, as in ML
  Tuning parameters are still required
 Determination technique for the prior distribution using cross validation [Hashimoto; '08]

16 Bayesian approach using CV
Prior distribution based on cross validation:
 The training data is randomly divided into K groups
 The prior distribution for group k is constructed from the remaining K-1 groups
  (e.g., with K = 3: groups {2,3}, {1,3}, {1,2})
 The likelihood of each group is calculated under its cross-valid prior,
  and the posterior distribution is estimated accordingly
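The cross-validation idea can be sketched on a deliberately simple model (a scalar Gaussian with known unit variance and a conjugate Normal prior on the mean; K = 3; all of this is an illustrative assumption, not the paper's HMM setting). Each group is scored under a prior built from the statistics of the remaining groups:

```python
import math

def gauss_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def cv_score(data, k_folds=3):
    """Sum over folds of the log marginal likelihood of the held-out fold,
    computed under a cross-valid prior built from the other folds."""
    folds = [data[i::k_folds] for i in range(k_folds)]
    total = 0.0
    for k in range(k_folds):
        held_out = folds[k]
        rest = [x for i, f in enumerate(folds) if i != k for x in f]
        # Cross-valid prior on the mean: N(mean(rest), 1/len(rest))
        m0, n0 = sum(rest) / len(rest), float(len(rest))
        for x in held_out:
            # Posterior predictive of the next point: N(m0, 1 + 1/n0)
            total += gauss_logpdf(x, m0, 1.0 + 1.0 / n0)
            # Conjugate update before scoring the next held-out point, so the
            # product of predictives equals the fold's joint marginal.
            m0 = (n0 * m0 + x) / (n0 + 1.0)
            n0 += 1.0
    return total

data = [0.9, 1.1, 1.0, 0.8, 1.2, 1.05, 0.95, 1.15, 0.85]
print(cv_score(data))
```

Because every group is scored as held-out data, a model cannot inflate this score by fitting noise, which is why the CV construction alleviates over-fitting.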

17 Outline
 Bayesian speech synthesis
  Variational Bayesian method
  Speech parameter generation
 Bayesian context clustering
  Prior distribution using cross validation
 Experiments
 Conclusion & future work

18 Experimental conditions (1/2)
  Database            ATR Japanese speech database B-set
  Speaker             MHT
  Training data       450 utterances
  Test data           53 utterances
  Sampling rate       16 kHz
  Window              Blackman window
  Frame size / shift  25 ms / 5 ms
  Feature vector      24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
  HMM                 5-state left-to-right HMM without skip transitions

19 Experimental conditions (2/2)
 Compared approaches:
               Training   Context clustering                    # of states
  ML-MDL       ML         MDL                                    2,491
  Bayes-Bayes  Bayes      Bayes using CV                        25,911
  Bayes-MDL    Bayes      Bayes using CV (adjusted threshold)    2,553
  ML-Bayes     ML         MDL (adjusted threshold)              27,106
 Mean Opinion Score (MOS) test
  Subjects were 10 Japanese students
  20 sentences were chosen at random

20 Subjective listening test
[Figure: mean opinion scores of the four systems; bars annotated with the number of states: 2,491 (ML-MDL), 25,911 (Bayes-Bayes), 27,106 (ML-Bayes), 2,553 (Bayes-MDL)]

21 Conclusions and future work
 A new framework based on the Bayesian approach
  All processes are derived from a single predictive distribution
  Improves the naturalness of synthesized speech
 Future work
  Introduce HSMMs instead of HMMs
  Investigate the relation between speech quality and model structure


23 Cross-valid prior distribution
 Marginal likelihood using cross validation
  ⇒ alleviates the over-fitting problem

24 Experimental conditions (2/2)
 Compared approaches:
               Training   Context clustering
  ML-MDL       ML         MDL
  Bayes-Bayes  Bayes      Bayes using cross validation
  Bayes-MDL    Bayes      Bayes using threshold
  ML-Bayes     ML         MDL using threshold
 Number of states:
               Spectrum      F0   Duration      Sum
  ML-MDL           956    1,151       280     2,491
  Bayes-Bayes    9,070   12,836     4,005    25,911
  Bayes-MDL      1,941      565        47     2,553
  ML-Bayes      15,077    8,844     3,185    27,106

25 Bayesian context clustering using CV
 Select the model structure that maximizes the lower bound F
  Example question: "Is the preceding phoneme a vowel?" (yes/no)
 Stopping condition: compute F at each node, and perform the split
  that gives the largest gain of F before vs. after splitting
 ⇒ Selects a model structure with high generalization ability

26 Bayesian criterion
 Model parameters are represented by probability distributions
  Prior distribution, posterior distribution, and predictive distribution
  over the training data and the synthesis data
 All model parameters are taken into account
  ⇒ high generalization ability

