Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010
Background
Bayesian speech synthesis [Hashimoto et al., '08]
- Represents the whole problem of speech synthesis in a unified way
- All processes can be derived from one single predictive distribution
Approximation for estimating the posterior
- The posterior is assumed independent of the synthesis data
  ⇒ the training and synthesis processes are separated
This work: integration of training and synthesis processes
- Derive an algorithm in which the posterior and the synthesis data are updated iteratively
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & Proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & Future work
Bayesian speech synthesis (1/2)
Model training and speech synthesis.
Notation: \Lambda: model parameters, O: training data, o: synthesis data,
S: label seq. for training, s: label seq. for synthesis.
ML:
  Training:  \hat{\Lambda} = \arg\max_{\Lambda} p(O \mid S, \Lambda)
  Synthesis: \hat{o} = \arg\max_{o} p(o \mid s, \hat{\Lambda})
Bayes:
  Training & synthesis: \hat{o} = \arg\max_{o} p(o \mid s, O, S)
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood):
  p(o \mid s, O, S) \propto p(o, O \mid s, S)
    = \sum_{z} \sum_{Z} \int p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)\, d\Lambda
where z: HMM state seq. for the synthesis data, Z: HMM state seq. for the training data,
p(o, z \mid s, \Lambda): likelihood of the synthesis data,
p(O, Z \mid S, \Lambda): likelihood of the training data,
p(\Lambda): prior distribution of the model parameters.
The sums and the integral are approximated by the variational Bayesian method [Attias, '99].
Variational Bayesian method (1/2)
Estimate an approximate posterior distribution Q(z, Z, \Lambda) of the true posterior
⇒ maximize a lower bound \mathcal{F} of the log marginal likelihood (Jensen's inequality):
  \log p(o, O \mid s, S)
    \geq \left\langle \log \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}{Q(z, Z, \Lambda)} \right\rangle_{Q(z, Z, \Lambda)}
    = \mathcal{F}
where \langle \cdot \rangle_{Q} denotes the expectation w.r.t. Q.
Variational Bayesian method (2/2)
Assume the random variables are statistically independent:
  Q(z, Z, \Lambda) = Q(z)\, Q(Z)\, Q(\Lambda)
Optimal posterior distributions (C_z, C_Z, C_\Lambda are normalization terms):
  Q(\Lambda) = \frac{1}{C_\Lambda}\, p(\Lambda) \exp \left\langle \log p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda) \right\rangle_{Q(z) Q(Z)}
  Q(z) = \frac{1}{C_z} \exp \left\langle \log p(o, z \mid s, \Lambda) \right\rangle_{Q(\Lambda)}
  Q(Z) = \frac{1}{C_Z} \exp \left\langle \log p(O, Z \mid S, \Lambda) \right\rangle_{Q(\Lambda)}
These are updated iteratively, as in the EM algorithm; a toy illustration of this coordinate ascent follows.
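Not from the slides: a minimal runnable sketch of the same coordinate-ascent pattern on a toy model, a univariate Gaussian with a conjugate Normal-Gamma prior and factorization Q(mu)Q(tau). The HMM case in the slides follows the same update pattern, with forward-backward statistics replacing these closed-form moments.

```python
# Toy factorized VB: Q(mu, tau) ~= Q(mu) Q(tau) for a Gaussian with a
# conjugate Normal-Gamma prior. Illustrates the iterative updates only;
# it is not the paper's HMM implementation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)      # observed "training data"
N, xbar = len(x), x.mean()

mu0, beta0, a0, b0 = 0.0, 1.0, 1.0, 1.0           # weak prior hyperparameters
E_tau = a0 / b0                                   # initial E[precision]

for _ in range(20):
    # Update Q(mu) = N(mu_N, 1/lam_N), holding Q(tau) fixed
    mu_N = (beta0 * mu0 + N * xbar) / (beta0 + N)
    lam_N = (beta0 + N) * E_tau
    # Update Q(tau) = Gamma(a_N, b_N), holding Q(mu) fixed
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + 0.5 * (N + 1)
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + beta0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(f"E[mu] ~ {mu_N:.3f}, E[tau] ~ {E_tau:.3f}")
```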
Speech parameter generation
Speech parameter generation based on the Bayesian approach:
- The lower bound \mathcal{F} approximates the true marginal likelihood well
- Generate the synthesis data by maximizing the lower bound:
  \hat{o} = \arg\max_{o} \mathcal{F}
A sketch of the resulting linear solve follows.
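Not from the slides: once the dynamic features are written as o = W c, maximizing the Gaussian output term over the static sequence c reduces to a linear system. A minimal numpy sketch with illustrative sizes; in the Bayesian case the means and precisions below would be posterior expectations, and real systems exploit the banded structure of W.

```python
# Sketch of parameter generation: solve (W^T P W) c = W^T P m for the
# static sequence c, where o = W c stacks static and delta features,
# m is the mean sequence and P the (diagonal) precision matrix.
import numpy as np

T = 5                                   # frames, 1-dim static feature
W = np.zeros((2 * T, T))                # rows per frame: [static; delta]
for t in range(T):
    W[2 * t, t] = 1.0                   # static window
    if 0 < t < T - 1:                   # delta = 0.5 * (c[t+1] - c[t-1])
        W[2 * t + 1, t - 1] = -0.5
        W[2 * t + 1, t + 1] = 0.5

static_mean = np.linspace(0.0, 1.0, T)  # illustrative per-frame means
delta_mean = np.gradient(static_mean)
m = np.empty(2 * T)
m[0::2], m[1::2] = static_mean, delta_mean
P = np.eye(2 * T)                       # unit precisions for the sketch

c = np.linalg.solve(W.T @ P @ W, W.T @ P @ m)
print(np.round(c, 3))                   # generated static sequence
```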
Bayesian speech synthesis
Maximize the lower bound of the log marginal likelihood consistently for:
- Estimation of the posterior distributions
- Speech parameter generation
⇒ All processes are derived from the single predictive distribution
Approximation of the posterior
The posterior distribution depends on the synthesis data,
but the synthesis data is not observed.
Conventional approximation [Hashimoto et al., '08]:
- Assume that the posterior Q(\Lambda) is independent of the synthesis data
⇒ Estimate the posterior from the training data only
Separation of training & synthesis
[Diagram] Training (uses the training data only):
- Update of the posterior distribution Q(\Lambda) (model parameters)
- Update of the posterior distribution Q(Z) (HMM state sequence of the training data)
Synthesis (uses the trained posteriors):
- Update of the posterior distribution Q(z) (HMM state sequence of the synthesis data)
- Generation of the synthesis data
Use of generated data
Problem:
- The posterior distribution depends on the synthesis data,
  but the synthesis data is not observed.
Proposed method:
- Use the generated data instead of observed data for estimating the posterior distribution
- Update the posterior distributions and the synthesis data iteratively, as in the EM algorithm
Previous method
[Diagram] As above: the posterior updates (model parameters and training-data
state sequence) use the training data only; the synthesis-data state sequence
and the generated synthesis data depend on the posteriors but do not feed back.
Proposed method
[Diagram] The generated synthesis data is fed back: the updates of Q(\Lambda),
Q(Z), and Q(z), the generation step, and the training and synthesis data all
form one loop, so the posteriors and the synthesis data are updated jointly.
The synthesis data can include several utterances, and the synthesis data
affects the posterior distributions.
⇒ How many utterances should be generated in one update step?
Two methods are discussed:
- Batch-based method: update the posterior distributions for several test sentences
- Sentence-based method: update the posterior distributions for one test sentence
Update method (1/2): Batch-based method
- The generated synthesis data of all test sentences is used to update the posterior distributions
- The synthesis data of all test sentences is generated using the same posterior distributions
[Diagram: sentences 1, 2, ..., N share one posterior update]
Update method (2/2): Sentence-based method
- The generated synthesis data of one test sentence is used to update the posterior distributions
- The synthesis data of each test sentence is generated using a different posterior distribution
[Diagram: sentences 1, 2, ..., N each get their own posterior update]
A schematic contrast of the two schedules follows.
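Not from the slides: a runnable schematic of the two schedules. `estimate_posterior` and `generate` are stand-ins that only record which data each step consumed; they are not the actual VB update or generation step.

```python
# Schematic contrast of the batch-based and sentence-based schedules.
def estimate_posterior(train, generated):
    # Stand-in: a "posterior" that remembers how much data built it.
    return {"n_train": len(train), "n_gen": len(generated)}

def generate(Q, sentence):
    # Stand-in for parameter generation under posterior Q.
    return f"{sentence}<-Q({Q['n_train']}+{Q['n_gen']})"

train = [f"utt{i}" for i in range(450)]
tests = ["s1", "s2", "s3"]

# Batch-based: one shared posterior per iteration, re-estimated from the
# training data plus ALL generated test utterances.
gen = []
for _ in range(2):
    Q = estimate_posterior(train, gen)
    gen = [generate(Q, s) for s in tests]
print("batch:   ", gen)

# Sentence-based: each test sentence has its own posterior, re-estimated
# from the training data plus that single generated utterance.
out = []
for s in tests:
    g = []
    for _ in range(2):
        Q = estimate_posterior(train, g)
        g = [generate(Q, s)]
    out.extend(g)
print("sentence:", out)
```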
Experimental conditions
Database:           ATR Japanese speech database B-set
Speaker:            MHT
Training data:      450 utterances
Test data:          53 utterances
Sampling rate:      16 kHz
Window:             Blackman window
Frame size / shift: 25 ms / 5 ms
Feature vector:     24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM:                5-state left-to-right HSMM without skip transitions
Iteration process
Update of the posterior distributions and the synthesis data:
1. The posterior distributions are estimated from the training data
2. Initial synthesis data is generated
3. Context clustering using the training data and the generated synthesis data
4. The posterior distributions are re-estimated from the training data and the
   generated synthesis data (number of updates is 5)
5. The synthesis data is re-generated
6. Steps 3, 4, and 5 are iterated
A skeleton of this loop is sketched below.
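Not from the slides: a runnable skeleton fixing the order of operations in steps 1-6. The three functions are stubs standing in for Bayesian context clustering, the (five) VB posterior updates, and parameter generation.

```python
# Skeleton of the proposed training/synthesis loop (steps 1-6).
def context_clustering(train, synth):                   # step 3 (stub)
    return f"tree({len(train)}+{len(synth)})"

def vb_estimate(train, synth, tree=None, n_updates=5):  # steps 1 and 4 (stub)
    return {"tree": tree, "data": len(train) + len(synth)}

def generate(Q, labels):                                # steps 2 and 5 (stub)
    return [f"gen({lab}|{Q['data']})" for lab in labels]

train, labels = [f"utt{i}" for i in range(450)], ["s1"]
Q = vb_estimate(train, [])                   # 1. posterior from training data
synth = generate(Q, labels)                  # 2. initial synthesis data
for _ in range(3):                           # 6. iterate steps 3-5
    tree = context_clustering(train, synth)  # 3. clustering on both data sets
    Q = vb_estimate(train, synth, tree)      # 4. re-estimate posteriors
    synth = generate(Q, labels)              # 5. re-generate synthesis data
print(Q, synth)
```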
Comparison of the number of updates
Data for estimation of the posterior distributions:
Iteration 0: 450 training utterances
Iteration 1: 450 utterances + 1 utterance generated in Iteration 0
Iteration 2: 450 utterances + 1 utterance generated in Iteration 1
Iteration 3: 450 utterances + 1 utterance generated in Iteration 2
Experimental results: comparison of the number of updates
[Figure not preserved in this transcript]
Comparison of Batch and Sentence
Training & generation data for estimation of the posterior distributions:
ML:             450 utterances
Baseline Bayes: 450 utterances
Batch Bayes:    450 + 53 generated utterances
Sentence Bayes: 450 + 1 generated utterance (53 different posterior dists.)
Experimental results: comparison of Batch and Sentence
[Figure not preserved in this transcript]
Conclusions and future work
Integration of the training and synthesis processes:
- The generated synthesis data is used for estimating the posterior distributions
- The posterior distributions and the synthesis data are updated iteratively
- The proposed method outperforms the baseline method
Future work:
- Investigate the relation between the amounts of training and synthesis data
- Experiments on various amounts of training data
Thank you
Advantage
- Represents the predictive distribution more exactly
- Optimizes the posterior distributions more accurately
Integration of training and synthesis
- Estimate the posterior from the generated data instead of observed data
Bayesian speech synthesis:
- The synthesis and training processes are iterated
- The training process includes model selection
Prior distribution
Conjugate prior distribution
⇒ the posterior dist. becomes the same family of dist. as the prior dist.
Hyperparameters are determined using statistics of prior data:
the dimension of the feature, the covariance of the prior data,
the number of prior data, and the mean of the prior data.
A hedged reconstruction of the conjugate form follows.
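The slide's formulas did not survive extraction. For a Gaussian output distribution the conjugate prior is Gauss-Wishart; up to the paper's exact parameterization it has the form below, with hyperparameters set from the listed prior-data statistics.

```latex
% Assumed Gauss-Wishart form (not the slide's exact notation):
% \xi: number of prior data, \nu: mean of prior data,
% \eta, B: degrees of freedom and scale from the covariance of prior data,
% D: dimension of the feature vector (enters the Wishart normalizer).
p(\Lambda) = p(\mu, \Sigma^{-1})
           = \mathcal{N}\!\left(\mu \mid \nu, (\xi\,\Sigma^{-1})^{-1}\right)
             \mathcal{W}\!\left(\Sigma^{-1} \mid \eta, B\right)
```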
Relation between Bayes and ML
Compared with the ML criterion, the Bayesian output distribution uses
expectations of the model parameters instead of point estimates, so the
updates can be solved in the same fashion as ML.
A schematic comparison follows.
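Not the slide's exact formulas, but schematically the output term changes as follows: ML plugs in point estimates, while VB takes the posterior expectation of the log output density.

```latex
% ML: point estimates \hat{\mu}, \hat{\Sigma} in the output distribution
\text{ML:} \quad \log \mathcal{N}\!\left(o_t \mid \hat{\mu}, \hat{\Sigma}\right)
% Bayes (VB): expectation of the log output density under Q(\Lambda)
\text{Bayes:} \quad
\bigl\langle \log \mathcal{N}\!\left(o_t \mid \mu, \Sigma\right) \bigr\rangle_{Q(\mu,\Sigma)}
```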
Impact of the prior distribution
The prior affects model selection through its hyperparameters (tuning parameters)
⇒ a technique for determining the prior dist. is required.
Maximizing the marginal likelihood w.r.t. the prior:
- leads to the over-fitting problem, as in ML
- tuning parameters are still required
⇒ Determine the prior distribution using cross validation [Hashimoto; '08]
Speech parameter generation
Speech parameters consist of static and dynamic features
⇒ only the static feature sequence is generated.
Speech parameter generation based on the Bayesian approach
⇒ maximize the lower bound w.r.t. the static feature sequence.
The standard static/dynamic relation is recalled below.
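Standard in HMM-based synthesis (the slide itself shows no surviving formula): the full observation is a linear function of the static sequence, which is what makes generation a closed-form maximization.

```latex
% Each frame stacks static and dynamic (delta) features; all are linear
% in the static sequence c, via a fixed window matrix W.
o_t = \left[\, c_t^{\top},\ \Delta c_t^{\top},\ \Delta^2 c_t^{\top} \,\right]^{\top},
\qquad o = W c
```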
Bayesian context clustering
Context clustering based on maximizing the lower bound \mathcal{F}:
- Select the question with the largest gain of \mathcal{F}
  (e.g., "Is this phoneme a vowel?", splitting a node into yes/no children)
- Stopping condition: split a node only while the gain is positive
A schematic of the greedy splitting loop follows.
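Not from the slides: a runnable schematic of the greedy split loop. `lower_bound` here is a stand-in score (the negative within-node scatter of a toy value), not the actual variational lower bound computed from node statistics.

```python
# Greedy context-clustering sketch: at each node, pick the question with
# the largest gain in the score and split while the gain is positive.
def lower_bound(items):
    # Stand-in for F(node): negative within-node scatter of the values.
    vals = [v for _, v in items]
    m = sum(vals) / len(vals)
    return -sum((v - m) ** 2 for v in vals)

def is_vowel(item):                      # example question from the slide
    return item[0] in "aeiou"

def grow(items, questions):
    best = None
    for q in questions:
        yes = [x for x in items if q(x)]
        no = [x for x in items if not q(x)]
        if yes and no:
            gain = lower_bound(yes) + lower_bound(no) - lower_bound(items)
            if best is None or gain > best[0]:
                best = (gain, q, yes, no)
    if best is None or best[0] <= 0:     # stopping condition: no positive gain
        return items                     # leaf node
    gain, q, yes, no = best
    return {q.__name__: {"yes": grow(yes, questions),
                         "no": grow(no, questions)}}

# Toy data: (phoneme, some per-phone statistic)
data = [("a", 5.0), ("k", 1.0), ("a", 5.5), ("i", 4.8), ("t", 1.2), ("o", 5.2)]
print(grow(data, [is_vowel]))
```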
Use of generated data
Problem: the synthesis data is not observed.
Proposed method: the generated data is used for estimating the posterior
distributions instead of observed data.
The synthesis data and the posterior distributions influence each other
⇒ update them iteratively, as in the EM algorithm.
Batch-based & sentence-based methods
[Diagram: batch-based updates share one posterior across sentences 1 ... N;
sentence-based updates use a separate posterior per sentence]