Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010
Background
Bayesian speech synthesis [Hashimoto et al., '08]
- Represents the whole problem of speech synthesis in a unified way
- All processes can be derived from one single predictive distribution
Approximation for estimating the posterior
- The posterior is assumed independent of the synthesis data
  ⇒ the training and synthesis processes are separated
This work: integration of training and synthesis processes
- Derive an algorithm in which the posterior and the synthesis data are updated iteratively
Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & Proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & Future work
Bayesian speech synthesis (1/2)
Model training and speech synthesis.
Notation: \Lambda: model parameters, O: training data, o: synthesis data,
S: label seq. for training, s: label seq. for synthesis.
ML:
  Training:  \hat{\Lambda} = \arg\max_{\Lambda} p(O \mid S, \Lambda)
  Synthesis: \hat{o} = \arg\max_{o} p(o \mid s, \hat{\Lambda})
Bayes:
  Training & synthesis: \hat{o} = \arg\max_{o} p(o \mid s, O, S)
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood):
  p(o \mid s, O, S) \propto p(o, O \mid s, S)
    = \sum_{z} \sum_{Z} \int p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)\, d\Lambda
where z: HMM state seq. for the synthesis data, Z: HMM state seq. for the training data,
p(o, z \mid s, \Lambda): likelihood of the synthesis data,
p(O, Z \mid S, \Lambda): likelihood of the training data,
p(\Lambda): prior distribution of the model parameters.
The sums and the integral are approximated by the variational Bayesian method [Attias, '99].
Variational Bayesian method (1/2)
Estimate an approximate posterior distribution Q(z, Z, \Lambda) of the true posterior
⇒ maximize a lower bound \mathcal{F} of the log marginal likelihood (Jensen's inequality):
  \log p(o, O \mid s, S)
    \geq \left\langle \log \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}{Q(z, Z, \Lambda)} \right\rangle_{Q(z, Z, \Lambda)}
    = \mathcal{F}
where \langle \cdot \rangle_{Q} denotes the expectation w.r.t. Q.
Variational Bayesian method (2/2)
Assume the random variables are statistically independent:
  Q(z, Z, \Lambda) = Q(z)\, Q(Z)\, Q(\Lambda)
Optimal posterior distributions (C_z, C_Z, C_\Lambda are normalization terms):
  Q(\Lambda) = \frac{1}{C_\Lambda}\, p(\Lambda) \exp \left\langle \log p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda) \right\rangle_{Q(z) Q(Z)}
  Q(z) = \frac{1}{C_z} \exp \left\langle \log p(o, z \mid s, \Lambda) \right\rangle_{Q(\Lambda)}
  Q(Z) = \frac{1}{C_Z} \exp \left\langle \log p(O, Z \mid S, \Lambda) \right\rangle_{Q(\Lambda)}
These are updated iteratively, as in the EM algorithm; a toy illustration of this coordinate ascent follows.
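Not from the slides: a minimal runnable sketch of the same coordinate-ascent pattern on a toy model, a univariate Gaussian with a conjugate Normal-Gamma prior and factorization Q(mu)Q(tau). The HMM case in the slides follows the same update pattern, with forward-backward statistics replacing these closed-form moments.

```python
# Toy factorized VB: Q(mu, tau) ~= Q(mu) Q(tau) for a Gaussian with a
# conjugate Normal-Gamma prior. Illustrates the iterative updates only;
# it is not the paper's HMM implementation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)      # observed "training data"
N, xbar = len(x), x.mean()

mu0, beta0, a0, b0 = 0.0, 1.0, 1.0, 1.0           # weak prior hyperparameters
E_tau = a0 / b0                                   # initial E[precision]

for _ in range(20):
    # Update Q(mu) = N(mu_N, 1/lam_N), holding Q(tau) fixed
    mu_N = (beta0 * mu0 + N * xbar) / (beta0 + N)
    lam_N = (beta0 + N) * E_tau
    # Update Q(tau) = Gamma(a_N, b_N), holding Q(mu) fixed
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + 0.5 * (N + 1)
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + beta0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(f"E[mu] ~ {mu_N:.3f}, E[tau] ~ {E_tau:.3f}")
```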
Speech parameter generation
Speech parameter generation based on the Bayesian approach:
- The lower bound \mathcal{F} approximates the true marginal likelihood well
- Generate the synthesis data by maximizing the lower bound:
  \hat{o} = \arg\max_{o} \mathcal{F}
A sketch of the resulting linear solve follows.
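Not from the slides: once the dynamic features are written as o = W c, maximizing the Gaussian output term over the static sequence c reduces to a linear system. A minimal numpy sketch with illustrative sizes; in the Bayesian case the means and precisions below would be posterior expectations, and real systems exploit the banded structure of W.

```python
# Sketch of parameter generation: solve (W^T P W) c = W^T P m for the
# static sequence c, where o = W c stacks static and delta features,
# m is the mean sequence and P the (diagonal) precision matrix.
import numpy as np

T = 5                                   # frames, 1-dim static feature
W = np.zeros((2 * T, T))                # rows per frame: [static; delta]
for t in range(T):
    W[2 * t, t] = 1.0                   # static window
    if 0 < t < T - 1:                   # delta = 0.5 * (c[t+1] - c[t-1])
        W[2 * t + 1, t - 1] = -0.5
        W[2 * t + 1, t + 1] = 0.5

static_mean = np.linspace(0.0, 1.0, T)  # illustrative per-frame means
delta_mean = np.gradient(static_mean)
m = np.empty(2 * T)
m[0::2], m[1::2] = static_mean, delta_mean
P = np.eye(2 * T)                       # unit precisions for the sketch

c = np.linalg.solve(W.T @ P @ W, W.T @ P @ m)
print(np.round(c, 3))                   # generated static sequence
```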
Bayesian speech synthesis
Maximize the lower bound of the log marginal likelihood consistently for:
- Estimation of the posterior distributions
- Speech parameter generation
⇒ All processes are derived from the single predictive distribution
Approximation of the posterior
The posterior distribution depends on the synthesis data,
but the synthesis data is not observed.
Conventional approximation [Hashimoto et al., '08]:
- Assume that the posterior Q(\Lambda) is independent of the synthesis data
⇒ Estimate the posterior from the training data only
Separation of training & synthesis
[Diagram] Training (uses the training data only):
- Update of the posterior distribution Q(\Lambda) (model parameters)
- Update of the posterior distribution Q(Z) (HMM state sequence of the training data)
Synthesis (uses the trained posteriors):
- Update of the posterior distribution Q(z) (HMM state sequence of the synthesis data)
- Generation of the synthesis data
Use of generated data
Problem:
- The posterior distribution depends on the synthesis data,
  but the synthesis data is not observed.
Proposed method:
- Use the generated data instead of observed data for estimating the posterior distribution
- Update the posterior distributions and the synthesis data iteratively, as in the EM algorithm
Previous method
[Diagram] As above: the posterior updates (model parameters and training-data
state sequence) use the training data only; the synthesis-data state sequence
and the generated synthesis data depend on the posteriors but do not feed back.
Proposed method
[Diagram] The generated synthesis data is fed back: the updates of Q(\Lambda),
Q(Z), and Q(z), the generation step, and the training and synthesis data all
form one loop, so the posteriors and the synthesis data are updated jointly.
The synthesis data can include several utterances, and the synthesis data
affects the posterior distributions.
⇒ How many utterances should be generated in one update step?
Two methods are discussed:
- Batch-based method: update the posterior distributions for several test sentences
- Sentence-based method: update the posterior distributions for one test sentence
Update method (1/2): Batch-based method
- The generated synthesis data of all test sentences is used to update the posterior distributions
- The synthesis data of all test sentences is generated using the same posterior distributions
[Diagram: sentences 1, 2, ..., N share one posterior update]
Update method (2/2): Sentence-based method
- The generated synthesis data of one test sentence is used to update the posterior distributions
- The synthesis data of each test sentence is generated using a different posterior distribution
[Diagram: sentences 1, 2, ..., N each get their own posterior update]
A schematic contrast of the two schedules follows.
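Not from the slides: a runnable schematic of the two schedules. `estimate_posterior` and `generate` are stand-ins that only record which data each step consumed; they are not the actual VB update or generation step.

```python
# Schematic contrast of the batch-based and sentence-based schedules.
def estimate_posterior(train, generated):
    # Stand-in: a "posterior" that remembers how much data built it.
    return {"n_train": len(train), "n_gen": len(generated)}

def generate(Q, sentence):
    # Stand-in for parameter generation under posterior Q.
    return f"{sentence}<-Q({Q['n_train']}+{Q['n_gen']})"

train = [f"utt{i}" for i in range(450)]
tests = ["s1", "s2", "s3"]

# Batch-based: one shared posterior per iteration, re-estimated from the
# training data plus ALL generated test utterances.
gen = []
for _ in range(2):
    Q = estimate_posterior(train, gen)
    gen = [generate(Q, s) for s in tests]
print("batch:   ", gen)

# Sentence-based: each test sentence has its own posterior, re-estimated
# from the training data plus that single generated utterance.
out = []
for s in tests:
    g = []
    for _ in range(2):
        Q = estimate_posterior(train, g)
        g = [generate(Q, s)]
    out.extend(g)
print("sentence:", out)
```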
Experimental conditions
Database:           ATR Japanese speech database B-set
Speaker:            MHT
Training data:      450 utterances
Test data:          53 utterances
Sampling rate:      16 kHz
Window:             Blackman window
Frame size / shift: 25 ms / 5 ms
Feature vector:     24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM:                5-state left-to-right HSMM without skip transitions
Iteration process
Update of the posterior distributions and the synthesis data:
1. The posterior distributions are estimated from the training data
2. Initial synthesis data is generated
3. Context clustering using the training data and the generated synthesis data
4. The posterior distributions are re-estimated from the training data and the
   generated synthesis data (number of updates is 5)
5. The synthesis data is re-generated
6. Steps 3, 4, and 5 are iterated
A skeleton of this loop is sketched below.
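Not from the slides: a runnable skeleton fixing the order of operations in steps 1-6. The three functions are stubs standing in for Bayesian context clustering, the (five) VB posterior updates, and parameter generation.

```python
# Skeleton of the proposed training/synthesis loop (steps 1-6).
def context_clustering(train, synth):                   # step 3 (stub)
    return f"tree({len(train)}+{len(synth)})"

def vb_estimate(train, synth, tree=None, n_updates=5):  # steps 1 and 4 (stub)
    return {"tree": tree, "data": len(train) + len(synth)}

def generate(Q, labels):                                # steps 2 and 5 (stub)
    return [f"gen({lab}|{Q['data']})" for lab in labels]

train, labels = [f"utt{i}" for i in range(450)], ["s1"]
Q = vb_estimate(train, [])                   # 1. posterior from training data
synth = generate(Q, labels)                  # 2. initial synthesis data
for _ in range(3):                           # 6. iterate steps 3-5
    tree = context_clustering(train, synth)  # 3. clustering on both data sets
    Q = vb_estimate(train, synth, tree)      # 4. re-estimate posteriors
    synth = generate(Q, labels)              # 5. re-generate synthesis data
print(Q, synth)
```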
Comparison of the number of updates
Data for estimation of the posterior distributions:
Iteration 0: 450 training utterances
Iteration 1: 450 utterances + 1 utterance generated in Iteration 0
Iteration 2: 450 utterances + 1 utterance generated in Iteration 1
Iteration 3: 450 utterances + 1 utterance generated in Iteration 2
Experimental results: comparison of the number of updates
[Figure not preserved in this transcript]
Comparison of Batch and Sentence
Training & generation data for estimation of the posterior distributions:
ML:             450 utterances
Baseline Bayes: 450 utterances
Batch Bayes:    450 + 53 generated utterances
Sentence Bayes: 450 + 1 generated utterance (53 different posterior dists.)
Experimental results: comparison of Batch and Sentence
[Figure not preserved in this transcript]
Conclusions and future work
Integration of the training and synthesis processes:
- The generated synthesis data is used for estimating the posterior distributions
- The posterior distributions and the synthesis data are updated iteratively
- The proposed method outperforms the baseline method
Future work:
- Investigate the relation between the amounts of training and synthesis data
- Experiments on various amounts of training data
Thank you
Advantage
- Represents the predictive distribution more exactly
- Optimizes the posterior distributions more accurately
Integration of training and synthesis
- Estimate the posterior from the generated data instead of observed data
Bayesian speech synthesis:
- The synthesis and training processes are iterated
- The training process includes model selection
Prior distribution
Conjugate prior distribution
⇒ the posterior dist. becomes the same family of dist. as the prior dist.
Hyperparameters are determined using statistics of prior data:
the dimension of the feature, the covariance of the prior data,
the number of prior data, and the mean of the prior data.
A hedged reconstruction of the conjugate form follows.
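The slide's formulas did not survive extraction. For a Gaussian output distribution the conjugate prior is Gauss-Wishart; up to the paper's exact parameterization it has the form below, with hyperparameters set from the listed prior-data statistics.

```latex
% Assumed Gauss-Wishart form (not the slide's exact notation):
% \xi: number of prior data, \nu: mean of prior data,
% \eta, B: degrees of freedom and scale from the covariance of prior data,
% D: dimension of the feature vector (enters the Wishart normalizer).
p(\Lambda) = p(\mu, \Sigma^{-1})
           = \mathcal{N}\!\left(\mu \mid \nu, (\xi\,\Sigma^{-1})^{-1}\right)
             \mathcal{W}\!\left(\Sigma^{-1} \mid \eta, B\right)
```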
Relation between Bayes and ML
Compared with the ML criterion, the Bayesian output distribution uses
expectations of the model parameters instead of point estimates, so the
updates can be solved in the same fashion as ML.
A schematic comparison follows.
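Not the slide's exact formulas, but schematically the output term changes as follows: ML plugs in point estimates, while VB takes the posterior expectation of the log output density.

```latex
% ML: point estimates \hat{\mu}, \hat{\Sigma} in the output distribution
\text{ML:} \quad \log \mathcal{N}\!\left(o_t \mid \hat{\mu}, \hat{\Sigma}\right)
% Bayes (VB): expectation of the log output density under Q(\Lambda)
\text{Bayes:} \quad
\bigl\langle \log \mathcal{N}\!\left(o_t \mid \mu, \Sigma\right) \bigr\rangle_{Q(\mu,\Sigma)}
```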
Impact of the prior distribution
The prior affects model selection through its hyperparameters (tuning parameters)
⇒ a technique for determining the prior dist. is required.
Maximizing the marginal likelihood w.r.t. the prior:
- leads to the over-fitting problem, as in ML
- tuning parameters are still required
⇒ Determine the prior distribution using cross validation [Hashimoto; '08]
Speech parameter generation
Speech parameters consist of static and dynamic features
⇒ only the static feature sequence is generated.
Speech parameter generation based on the Bayesian approach
⇒ maximize the lower bound w.r.t. the static feature sequence.
The standard static/dynamic relation is recalled below.
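Standard in HMM-based synthesis (the slide itself shows no surviving formula): the full observation is a linear function of the static sequence, which is what makes generation a closed-form maximization.

```latex
% Each frame stacks static and dynamic (delta) features; all are linear
% in the static sequence c, via a fixed window matrix W.
o_t = \left[\, c_t^{\top},\ \Delta c_t^{\top},\ \Delta^2 c_t^{\top} \,\right]^{\top},
\qquad o = W c
```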
Bayesian context clustering
Context clustering based on maximizing the lower bound \mathcal{F}:
- Select the question with the largest gain of \mathcal{F}
  (e.g., "Is this phoneme a vowel?", splitting a node into yes/no children)
- Stopping condition: split a node only while the gain is positive
A schematic of the greedy splitting loop follows.
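Not from the slides: a runnable schematic of the greedy split loop. `lower_bound` here is a stand-in score (the negative within-node scatter of a toy value), not the actual variational lower bound computed from node statistics.

```python
# Greedy context-clustering sketch: at each node, pick the question with
# the largest gain in the score and split while the gain is positive.
def lower_bound(items):
    # Stand-in for F(node): negative within-node scatter of the values.
    vals = [v for _, v in items]
    m = sum(vals) / len(vals)
    return -sum((v - m) ** 2 for v in vals)

def is_vowel(item):                      # example question from the slide
    return item[0] in "aeiou"

def grow(items, questions):
    best = None
    for q in questions:
        yes = [x for x in items if q(x)]
        no = [x for x in items if not q(x)]
        if yes and no:
            gain = lower_bound(yes) + lower_bound(no) - lower_bound(items)
            if best is None or gain > best[0]:
                best = (gain, q, yes, no)
    if best is None or best[0] <= 0:     # stopping condition: no positive gain
        return items                     # leaf node
    gain, q, yes, no = best
    return {q.__name__: {"yes": grow(yes, questions),
                         "no": grow(no, questions)}}

# Toy data: (phoneme, some per-phone statistic)
data = [("a", 5.0), ("k", 1.0), ("a", 5.5), ("i", 4.8), ("t", 1.2), ("o", 5.2)]
print(grow(data, [is_vowel]))
```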
Use of generated data
Problem: the synthesis data is not observed.
Proposed method: the generated data is used for estimating the posterior
distributions instead of observed data.
The synthesis data and the posterior distributions influence each other
⇒ update them iteratively, as in the EM algorithm.
Batch-based & sentence-based methods
[Diagram: batch-based updates share one posterior across sentences 1 ... N;
sentence-based updates use a separate posterior per sentence]