2 Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis
  Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
  Nagoya Institute of Technology
  28 August 2011

3 Background (1/2)
  - Model estimation
    - Maximum likelihood (ML) approach
    - Bayesian approach
      - Estimation of posterior distributions
      - Utilization of prior distributions
      - Model selection according to the posterior probability
  - Bayesian speech synthesis [Hashimoto et al., '08]
    - Model estimation and speech parameter generation can be derived from the predictive distribution
    - Represents the whole problem of speech synthesis within a single framework

4 Background (2/2)
  - Acoustic features common to every speaker
    - Speaker Adaptive Training (SAT) [Anastasakos et al., '97]
    - Shared Tree Clustering (STC) [Yamagishi et al., '03]
    - Universal Background Model (UBM) [Reynolds et al., '00]
  - Multi-speaker modeling with shared prior distributions and model structures
    - Appropriate acoustic models can be estimated from training data of multiple speakers

5 Outline
  - Bayesian speech synthesis
    - Bayesian speech synthesis framework
    - Variational Bayesian method
  - Shared model structures and prior distributions
    - Multi-speaker modeling
    - Shared model structures
    - Shared prior distributions
  - Experiments
  - Conclusion & Future work

6 Bayesian speech synthesis (1/3)
  Model training and speech synthesis
  - ML:
    - Training: \hat{\lambda} = \arg\max_{\lambda} p(O \mid s, \lambda)
    - Synthesis: \hat{o} = \arg\max_{o} p(o \mid \bar{s}, \hat{\lambda})
  - Bayes:
    - Training & Synthesis: \hat{o} = \arg\max_{o} p(o \mid \bar{s}, O, s) (see the sketch below)
  - Notation: O: training data, o: synthesis data, s: label seq. for training, \bar{s}: label seq. for synthesis, \lambda: model parameters
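
A minimal numerical sketch of the difference, assuming a toy one-dimensional Gaussian model with known unit variance and a standard-normal prior on the mean (all names are illustrative, not the paper's HMM formulation): ML plugs a point estimate into the synthesis-time distribution, while the Bayesian predictive distribution integrates over the posterior and so carries parameter uncertainty.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
O = rng.normal(1.0, 1.0, size=5)          # toy training data

# ML: point estimate of the mean, plugged in at synthesis time
lam_ml = O.mean()
ml_predictive = stats.norm(lam_ml, 1.0)

# Bayes: prior mu ~ N(0, 1); posterior and predictive are closed-form
n = len(O)
v_post = 1.0 / (1.0 + n)                  # posterior variance of mu
m_post = v_post * n * O.mean()            # posterior mean of mu
# predictive variance = observation variance + parameter uncertainty
bayes_predictive = stats.norm(m_post, np.sqrt(1.0 + v_post))

x = 3.0
print(ml_predictive.logpdf(x), bayes_predictive.logpdf(x))
```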

7 Bayesian speech synthesis (2/3)
  - Introduce model structure m into the predictive distribution:
    p(o \mid \bar{s}, O, s) = \sum_{m} \int p(o \mid \bar{s}, \lambda, m) \, p(\lambda \mid O, s, m) \, P(m \mid O, s) \, d\lambda
  - Model selection according to posterior probability: \hat{m} = \arg\max_{m} P(m \mid O, s)
  - Approximate the predictive distribution:
    p(o \mid \bar{s}, O, s) \approx \int p(o \mid \bar{s}, \lambda, \hat{m}) \, p(\lambda \mid O, s, \hat{m}) \, d\lambda
  - m: model structure

8 Bayesian speech synthesis (3/3)
  - Predictive distribution (marginal likelihood):
    p(o \mid \bar{s}, O, s) \propto p(o, O \mid \bar{s}, s) = \int \sum_{z} \sum_{\bar{z}} p(o, \bar{z} \mid \bar{s}, \lambda) \, p(O, z \mid s, \lambda) \, p(\lambda) \, d\lambda
    - p(o, \bar{z} \mid \bar{s}, \lambda): likelihood of synthesis data
    - p(O, z \mid s, \lambda): likelihood of training data
    - p(\lambda): prior distribution for model parameters
    - z: HMM state seq. for training data, \bar{z}: HMM state seq. for synthesis data
  - Intractable to compute exactly ⇒ Variational Bayesian method [Attias; '99]

9 Variational Bayesian method (1/2)
  - Estimate an approximate posterior distribution Q ⇒ maximize the lower bound \mathcal{F}
  - Jensen's inequality:
    \log p(o, O \mid \bar{s}, s) \geq \mathcal{F} = \left\langle \log \frac{p(o, \bar{z} \mid \bar{s}, \lambda) \, p(O, z \mid s, \lambda) \, p(\lambda)}{Q(z, \bar{z}, \lambda)} \right\rangle_{Q(z, \bar{z}, \lambda)}
  - \langle \cdot \rangle_{Q}: expectation w.r.t. Q; Q: approximate posterior distribution (see the sketch below)
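
A minimal sanity check of this bound, assuming the same toy conjugate Gaussian-mean model as above (no HMM state sequences): with the exact posterior as Q the bound is tight, and any other Q gives a strictly smaller value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10)   # data: x_i ~ N(mu, 1), prior mu ~ N(0, 1)
n, xbar = len(x), x.mean()

# Exact log evidence: marginalizing mu gives x ~ N(0, I + 11^T)
cov = np.eye(n) + np.ones((n, n))
log_evidence = stats.multivariate_normal(np.zeros(n), cov).logpdf(x)

def lower_bound(m, v):
    """Closed-form lower bound for Q(mu) = N(m, v)."""
    e_loglik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + v))
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + v)
    entropy = 0.5 * np.log(2 * np.pi * np.e * v)
    return e_loglik + e_logprior + entropy

v_post = 1.0 / (1.0 + n)
m_post = v_post * n * xbar
print(log_evidence, lower_bound(m_post, v_post))  # equal: tight at the true posterior
print(lower_bound(m_post + 0.5, 2.0 * v_post))    # strictly smaller for any other Q
```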

10 Variational Bayesian method (2/2)
  - Assume the random variables are statistically independent: Q(z, \bar{z}, \lambda) = Q(z) \, Q(\bar{z}) \, Q(\lambda)
  - Optimal posterior distributions (C_z, C_{\bar{z}}, C_{\lambda}: normalization terms):
    Q(\lambda) = \frac{1}{C_{\lambda}} \, p(\lambda) \exp \langle \log p(o, \bar{z} \mid \bar{s}, \lambda) \, p(O, z \mid s, \lambda) \rangle_{Q(z) Q(\bar{z})}
    Q(z) = \frac{1}{C_z} \exp \langle \log p(O, z \mid s, \lambda) \rangle_{Q(\lambda)}
    Q(\bar{z}) = \frac{1}{C_{\bar{z}}} \exp \langle \log p(o, \bar{z} \mid \bar{s}, \lambda) \rangle_{Q(\lambda)}
  - Iterative updates, as in the EM algorithm (see the sketch below)
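
A minimal coordinate-ascent sketch of these alternating updates, using the classic textbook toy case instead of the slides' HMM posteriors: a Gaussian with unknown mean and precision and a factorized Q(mu)Q(tau). The hyperparameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 0.5, size=50)
n, xbar = len(x), x.mean()

# Prior: mu | tau ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

e_tau = a0 / b0                              # initial guess for <tau>
for _ in range(50):                          # alternate the two updates to convergence
    # Update Q(mu) = N(m_n, v_n) given <tau>
    m_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    v_n = 1.0 / ((lam0 + n) * e_tau)
    # Update Q(tau) = Gamma(a_n, b_n) given Q(mu)
    a_n = a0 + 0.5 * (n + 1)
    b_n = b0 + 0.5 * (lam0 * ((m_n - mu0) ** 2 + v_n)
                      + np.sum((x - m_n) ** 2) + n * v_n)
    e_tau = a_n / b_n

print(m_n, 1.0 / e_tau)   # approximate posterior mean and implied variance
```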

11 Outline
  - Bayesian speech synthesis
    - Bayesian speech synthesis framework
    - Variational Bayesian method
  - Shared model structures and prior distributions
    - Multi-speaker modeling
    - Shared model structures
    - Shared prior distributions
  - Experiments
  - Conclusion & Future work

12 Multi-speaker modeling
  - Acoustic features common to every speaker
    - Use training data of multiple speakers: SAT, STC, UBM, etc.
    - Estimate appropriate acoustic models
  - Multi-speaker modeling: shared model structures and prior distributions
  - Log marginal likelihood of multiple speakers:
    \sum_{r} \log p(o^{(r)}, O^{(r)} \mid \bar{s}^{(r)}, s^{(r)})
    - r: speaker index

13 Shared model structures
  - STC based on Bayesian model selection: all speakers share one decision tree, so the same node is split by the same question (e.g., "Is this phoneme a vowel?" yes/no) in every speaker's model
  - Maximize the sum of the lower bounds \sum_{r} \mathcal{F}^{(r)} (\mathcal{F}^{(r)}: lower bound on the posterior dist. of speaker r)
  - Stopping condition: stop splitting when the gain in \sum_{r} \mathcal{F}^{(r)} is no longer positive (see the sketch below)
  [Figure: a shared decision tree; corresponding nodes of all speakers are split by the same question]
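
A minimal sketch of the shared split decision, under stated assumptions: each leaf models its data with a univariate Gaussian under a normal-gamma prior, so the per-node marginal likelihood is closed-form, and a candidate question is accepted only if the gain summed over all speakers is positive. The slides' HMM lower bound is replaced here by this exact toy marginal likelihood.

```python
import numpy as np
from scipy.special import gammaln

# Normal-gamma prior hyperparameters (illustrative values)
MU0, LAM0, A0, B0 = 0.0, 1.0, 1.0, 1.0

def log_marglik(x):
    """Closed-form log marginal likelihood of x under N(mu, tau^-1) with a normal-gamma prior."""
    n = len(x)
    if n == 0:
        return 0.0
    xbar = x.mean()
    lam_n = LAM0 + n
    a_n = A0 + 0.5 * n
    b_n = B0 + 0.5 * (np.sum((x - xbar) ** 2) + LAM0 * n * (xbar - MU0) ** 2 / lam_n)
    return (gammaln(a_n) - gammaln(A0) + A0 * np.log(B0) - a_n * np.log(b_n)
            + 0.5 * (np.log(LAM0) - np.log(lam_n)) - 0.5 * n * np.log(2 * np.pi))

def shared_gain(data_by_speaker, mask_by_speaker):
    """Gain of one candidate question, summed over speakers sharing the same split."""
    gain = 0.0
    for x, mask in zip(data_by_speaker, mask_by_speaker):
        gain += log_marglik(x[mask]) + log_marglik(x[~mask]) - log_marglik(x)
    return gain

rng = np.random.default_rng(2)
# 5 speakers; each node holds interleaved samples from two latent groups
data = [rng.normal([0, 3], 1.0, size=(40, 2)).ravel() for _ in range(5)]
masks = [np.tile([True, False], 40) for _ in data]   # candidate question = true grouping
print(shared_gain(data, masks))   # positive => split the shared node
```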

14 Prior distributions
  - Conjugate prior distribution for the state output probability dist.
  - Determination using prior data
    - Hyperparameters: amount of prior data (a tuning parameter), mean of prior data, covariance of prior data
    - Use training data of multiple speakers as prior data ⇒ speaker-independent prior distribution (see the sketch below)
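
A minimal sketch of setting hyperparameters from pooled multi-speaker statistics, assuming a univariate Gaussian output distribution with a normal-gamma conjugate prior; the mapping below, and the name tau for the "amount of prior data" tuning parameter, are illustrative choices, not the paper's exact parameterization.

```python
import numpy as np

def speaker_independent_prior(data_by_speaker, tau):
    """Set normal-gamma hyperparameters from statistics of pooled multi-speaker data.

    tau plays the role of the 'amount of prior data' tuning parameter:
    larger tau makes the prior stronger relative to each speaker's data.
    """
    pooled = np.concatenate(data_by_speaker)
    mean = pooled.mean()                 # mean of prior data
    var = pooled.var()                   # covariance of prior data (1-D here)
    mu0 = mean
    lam0 = tau                           # pseudo-count on the mean
    a0 = 0.5 * tau                       # pseudo-count on the precision
    b0 = 0.5 * tau * var                 # so the prior's implied variance equals var
    return mu0, lam0, a0, b0

rng = np.random.default_rng(3)
data = [rng.normal(m, 1.0, size=100) for m in (0.8, 1.0, 1.2)]
print(speaker_independent_prior(data, tau=10.0))
```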

15 Speaker adaptive prior distribution
  - Maximize the sum of the lower bounds \sum_{r} \mathcal{F}^{(r)} with respect to the shared prior distribution
  - Prior distributions are estimated so that the posterior distributions Q^{(r)} of all speakers are estimated well
  - Same fashion as speaker adaptive training: per-speaker posteriors and the shared prior are updated alternately (see the sketch below)
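
A minimal sketch of this alternation under strong simplifying assumptions: each speaker's parameter is a single Gaussian mean with known unit observation variance, the shared prior is N(m0, v0), and the prior update maximizes the summed lower bounds in closed form (an empirical-Bayes M-step). None of this is the paper's exact HMM formulation.

```python
import numpy as np

rng = np.random.default_rng(4)
speakers = [rng.normal(mu_r, 1.0, size=30) for mu_r in (0.5, 1.0, 1.5, 2.0, 2.5)]

m0, v0 = 0.0, 10.0                       # shared prior N(m0, v0), to be estimated
for _ in range(100):
    # Per-speaker posterior over its mean under the current shared prior
    post = []
    for x in speakers:
        n = len(x)
        v_r = 1.0 / (1.0 / v0 + n)       # posterior variance (unit obs. variance)
        m_r = v_r * (m0 / v0 + x.sum())  # posterior mean
        post.append((m_r, v_r))
    # Shared prior maximizing the sum of the per-speaker lower bounds
    m0 = np.mean([m for m, _ in post])
    v0 = np.mean([(m - m0) ** 2 + v for m, v in post])

print(m0, v0)   # shared prior centered on the speaker population
```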

16 Outline
  - Bayesian speech synthesis
    - Bayesian speech synthesis framework
    - Variational Bayesian method
  - Shared model structures and prior distributions
    - Multi-speaker modeling
    - Shared model structures
    - Shared prior distributions
  - Experiments
  - Conclusion & Future work

17 Experimental conditions
  Database:            NIT Japanese speech database
  Speakers:            5 male speakers
  Training data:       450 utterances for each speaker
  Test data:           53 utterances for each speaker
  Sampling rate:       16 kHz
  Window:              Blackman window
  Frame size / shift:  25 ms / 5 ms
  Feature vector:      24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
  HMM:                 5-state left-to-right HSMM without skip transitions
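
(Dimension check, assuming the 0th mel-cepstral coefficient is counted in addition to the 24 listed: 25 static + 25 Δ + 25 ΔΔ mel-cepstral coefficients = 75, plus log F0 + Δ + ΔΔ = 3, giving 78.)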

18 Comparison methods
  Compare 5 sharing methods (○ = shared among all speakers):
  Method       Model structure   Prior distribution
  SD           -                 -
  Tree         ○                 -
  Prior        -                 ○
  Tree-Prior   ○                 ○ (speaker independent)
  Tree-SAT     ○                 ○ (speaker adaptive)

19 Experimental result
  [Figure: subjective evaluation results, 5-point Mean Opinion Score for the compared methods]

20 Conclusions and future work
  - Conclusions
    - Investigated sharing prior distributions and model structures among multiple speakers
      - Robust model structures
      - Reliable prior distributions
    - Appropriate acoustic models can be estimated
    - Outperformed the single-speaker modeling method
  - Future work
    - Investigate speaker selection for the sharing methods
    - Experiments comparing with conventional multi-speaker modeling methods

21 Thank you


23 Background
  - Bayesian speech synthesis [Hashimoto et al., '08]
    - Represents the problem of speech synthesis; all processes can be derived from the predictive dist.
    - Model structures affect the quality of speech
    - Prior distributions affect Bayesian model selection
    ⇒ Determination of prior distributions and model selection should be performed simultaneously
  - Acoustic features common to every speaker
    - Investigate prior distributions and model structures
    - Share prior distributions and model structures among all speakers

24 Bayesian speech synthesis
  - Model structure m is marginalized:
    p(o \mid \bar{s}, O, s) = \sum_{m} \int p(o \mid \bar{s}, \lambda, m) \, p(\lambda \mid O, s, m) \, P(m \mid O, s) \, d\lambda
  - Select the model structure that maximizes the posterior: \hat{m} = \arg\max_{m} P(m \mid O, s)
  - Approximate the predictive distribution with the selected structure
  - m: model structure; P(m): prior distribution of the model structure

25 Context clustering based on VB
  - Maximize the marginal likelihood (its lower bound \mathcal{F}) to construct the decision tree
  - Select the question (e.g., "Is this phoneme a vowel?" yes/no) with the largest gain of \mathcal{F}
  - Split the node based on the gain; stopping condition: stop when the gain is no longer positive
  [Figure: a decision-tree node split by the selected yes/no question]

26 Multi-speaker modeling
  - Training and synthesis data of multiple speakers can be used
  - Marginal likelihood of multiple speakers ⇒ sum of the lower bounds of each speaker:
    \sum_{r} \log p(o^{(r)}, O^{(r)} \mid \bar{s}^{(r)}, s^{(r)}) \geq \sum_{r} \mathcal{F}^{(r)}

27 Shared model structures
  - Conventionally, model structures are selected for each speaker
  - Proposed: sharing model structures among speakers
    - Shared Tree Clustering (STC) based on Bayesian model selection

28 Variational Bayesian method (2/2)
  - Assume the random variables are statistically independent: Q(z, \bar{z}, \lambda) = Q(z) \, Q(\bar{z}) \, Q(\lambda)
  - Optimal posterior distributions (C_z, C_{\bar{z}}, C_{\lambda}: normalization terms):
    Q(\lambda) = \frac{1}{C_{\lambda}} \, p(\lambda) \exp \langle \log p(o, \bar{z} \mid \bar{s}, \lambda) \, p(O, z \mid s, \lambda) \rangle_{Q(z) Q(\bar{z})}
    Q(z) = \frac{1}{C_z} \exp \langle \log p(O, z \mid s, \lambda) \rangle_{Q(\lambda)}
    Q(\bar{z}) = \frac{1}{C_{\bar{z}}} \exp \langle \log p(o, \bar{z} \mid \bar{s}, \lambda) \rangle_{Q(\lambda)}
  - Iterative updates, as in the EM algorithm

29 Outline
  - Bayesian speech synthesis
    - Variational Bayesian method
    - Speech parameter generation
  - Problem & Proposed method
    - Approximation of posterior
    - Integration of training and synthesis processes
  - Experiments
  - Conclusion & Future work

30 Bayesian speech synthesis
  - Maximize the lower bound \mathcal{F} of the log marginal likelihood consistently
    - Estimation of posterior distributions
    - Speech parameter generation
  ⇒ All processes are derived from the single predictive distribution

31 Approximation of posterior
  - The posterior distribution Q(\lambda) depends on synthesis data ⇒ but synthesis data is not observed
  - Assume that Q(\lambda) is independent of synthesis data [Hashimoto et al., '08]
  ⇒ Estimate the posterior from training data only

32 Use of generated data
  - Problem: the posterior distribution depends on synthesis data, but synthesis data is not observed
  - Proposed method: use generated data instead of observed data for estimating the posterior distribution
    - Iterative updates, as in the EM algorithm (see the sketch below)
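
A minimal sketch of this fixed-point idea under toy assumptions: a single Gaussian mean with known unit variance and a standard-normal prior, where the "generated data" is the current predictive mean and the posterior is then re-estimated from training data plus the generated point. Purely illustrative, not the paper's HMM procedure.

```python
import numpy as np

rng = np.random.default_rng(5)
train = rng.normal(1.0, 1.0, size=20)     # observed training data
n = len(train)

o_hat = train.mean()                      # initial generated synthesis data
for _ in range(20):
    # Re-estimate the posterior over mu from training data + generated point
    # (prior mu ~ N(0, 1), unit observation variance)
    v = 1.0 / (1.0 + n + 1)
    m = v * (train.sum() + o_hat)
    # Re-generate synthesis data from the updated predictive distribution
    o_hat = m                             # predictive mean = posterior mean here

print(o_hat)                              # converged fixed point
```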

33 Prior distribution
  - Conjugate prior distribution ⇒ the posterior becomes the same family of distribution as the prior
  - Determination using statistics of prior data: # of prior data points, mean of prior data, covariance of prior data, and the feature dimension
  - Likelihood function: Gaussian state output distribution; its conjugate prior is a Gauss-Wishart-type distribution (see the sketch below)
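
A minimal 1-D sketch of conjugacy using the normal-gamma case (the univariate analogue of Gauss-Wishart): the posterior update returns hyperparameters of the same family, so the family stays closed under observation.

```python
import numpy as np

def normal_gamma_posterior(mu0, lam0, a0, b0, x):
    """Posterior hyperparameters for x_i ~ N(mu, tau^-1) with a normal-gamma prior.

    The result is again normal-gamma: conjugacy keeps the family closed.
    """
    n, xbar = len(x), x.mean()
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    a_n = a0 + 0.5 * n
    b_n = b0 + 0.5 * (np.sum((x - xbar) ** 2) + lam0 * n * (xbar - mu0) ** 2 / lam_n)
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(6)
x = rng.normal(1.5, 0.7, size=100)
print(normal_gamma_posterior(0.0, 1.0, 1.0, 1.0, x))
```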

34 Relation between Bayes and ML
  - Compare with the ML criterion
    - ML ⇒ output dist. uses point estimates of the model parameters: \log \mathcal{N}(o_t \mid \hat{\mu}, \hat{\Sigma})
    - Bayes ⇒ output dist. uses expectations of the model parameters: \langle \log \mathcal{N}(o_t \mid \mu, \Sigma) \rangle_{Q(\lambda)}
  - Can be solved in the same fashion as ML (see the sketch below)
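
A minimal 1-D sketch of the expectation, assuming a normal-gamma posterior Q over (mu, tau): the closed form uses the digamma function for <log tau> plus an extra 1/lambda term from the uncertainty in mu, and is verified here against Monte Carlo sampling.

```python
import numpy as np
from scipy.special import digamma
from scipy import stats

m, lam, a, b = 1.0, 20.0, 10.0, 4.0      # posterior Q(mu, tau): normal-gamma
o = 2.0                                   # a synthesis-time observation

# Closed form: <log N(o | mu, tau^-1)> under Q
closed = 0.5 * (digamma(a) - np.log(b)) - 0.5 * np.log(2 * np.pi) \
         - 0.5 * ((a / b) * (o - m) ** 2 + 1.0 / lam)

# Monte Carlo check: tau ~ Gamma(a, rate=b), mu | tau ~ N(m, 1/(lam*tau))
rng = np.random.default_rng(7)
tau = rng.gamma(a, 1.0 / b, size=200_000)
mu = rng.normal(m, 1.0 / np.sqrt(lam * tau))
mc = np.mean(stats.norm.logpdf(o, mu, 1.0 / np.sqrt(tau)))

print(closed, mc)   # should agree to a few decimal places
```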

35 Impact of prior distribution
  - The prior affects model selection through its tuning parameters ⇒ requires a technique for determining the prior distribution
  - Maximizing the marginal likelihood w.r.t. the prior
    - Leads to the over-fitting problem, as in ML
    - Tuning parameters are still required
  - Determination technique of the prior distribution using cross validation [Hashimoto; '08] (see the sketch below)
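
A minimal sketch of choosing the prior's tuning parameter by cross validation, under the same 1-D normal-gamma assumptions as above: each candidate tau is scored by the held-out predictive log-likelihood (a Student-t under this posterior) and the best tau is kept. Illustrative only; [Hashimoto; '08] defines the actual CV procedure.

```python
import numpy as np
from scipy import stats

def predictive_logpdf(x_new, mu0, lam0, a0, b0, x):
    """Posterior predictive (Student-t) log-density under a normal-gamma model."""
    n, xbar = len(x), x.mean()
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    a_n = a0 + 0.5 * n
    b_n = b0 + 0.5 * (np.sum((x - xbar) ** 2) + lam0 * n * (xbar - mu0) ** 2 / lam_n)
    scale = np.sqrt(b_n * (lam_n + 1) / (a_n * lam_n))
    return stats.t.logpdf(x_new, df=2 * a_n, loc=mu_n, scale=scale).sum()

rng = np.random.default_rng(8)
data = rng.normal(1.0, 1.0, size=100)
folds = np.array_split(rng.permutation(data), 5)

best_tau, best_score = None, -np.inf
for tau in (0.1, 1.0, 10.0, 100.0):       # prior strength scales with tau
    score = 0.0
    for k in range(5):
        held_out = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        score += predictive_logpdf(held_out, 0.0, tau, 0.5 * tau, 0.5 * tau, train)
    if score > best_score:
        best_tau, best_score = tau, score
print("best tau:", best_tau)
```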

36 Speech parameter generation
  - Speech parameters consist of static and dynamic features ⇒ only the static feature sequence is generated
  - Speech parameter generation based on the Bayesian approach ⇒ maximize the lower bound \mathcal{F} with respect to the synthesis data (see the sketch below)
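
A minimal sketch of the static/dynamic constraint in the classic ML-style parameter generation: maximize a Gaussian likelihood of o = Wc with respect to the static sequence c; in the Bayesian case the means and precisions would be replaced by their posterior expectations. The delta window (-0.5, 0, 0.5) is an illustrative choice.

```python
import numpy as np

T = 5                                       # number of frames
# W maps static features c (T) to the observation vector o = [c; delta c] (2T)
W = np.zeros((2 * T, T))
W[:T] = np.eye(T)                           # static rows
for t in range(T):                          # delta rows: 0.5 * (c[t+1] - c[t-1])
    if t > 0:
        W[T + t, t - 1] = -0.5
    if t < T - 1:
        W[T + t, t + 1] = 0.5

# Means and precisions of the (diagonal) state output distributions
mu = np.concatenate([np.array([0., 1., 2., 1., 0.]),   # static means
                     np.zeros(T)])                     # delta means
prec = np.diag(np.concatenate([np.full(T, 1.0),        # static precisions
                               np.full(T, 4.0)]))      # delta precisions

# Maximize N(W c | mu, prec^-1) w.r.t. c:  (W' P W) c = W' P mu
A = W.T @ prec @ W
bvec = W.T @ prec @ mu
c = np.linalg.solve(A, bvec)
print(c)                                    # smoothed static trajectory
```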

