Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology

Presentation transcript:

Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010

Background
 Bayesian speech synthesis [Hashimoto et al., '08]
   Represents the whole problem of speech synthesis in a Bayesian framework
   All processes can be derived from one single predictive distribution
 Approximation for estimating the posterior
   The posterior is assumed to be independent of the synthesis data
   ⇒ Training and synthesis processes are separated
 Integration of training and synthesis processes
   Derive an algorithm in which the posterior and the synthesis data are updated iteratively

Outline
 Bayesian speech synthesis
   Variational Bayesian method
   Speech parameter generation
 Problem & Proposed method
   Approximation of posterior
   Integration of training and synthesis processes
 Experiments
 Conclusion & Future work

Bayesian speech synthesis (1/2)
Model training and speech synthesis
Quantities involved: the training data, the synthesis data, the label sequences for training and for synthesis, and the model parameters (the symbols on the original slide were images).
 ML: training and synthesis are carried out as two separate maximization steps.
 Bayes: training and synthesis are handled jointly through the predictive distribution (a reconstruction of the formulas follows).
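The formulas on the original slide were images; the following is a hedged reconstruction consistent with the legend above. The symbol names (O, o, L, l, lambda) and the exact conditioning are assumptions, not taken verbatim from the slide:

  O: training data,  o: synthesis data,  L: label sequence for training,
  l: label sequence for synthesis,  \lambda: model parameters

  ML training:   \hat{\lambda} = \arg\max_{\lambda} p(O \mid \lambda, L)
  ML synthesis:  \hat{o} = \arg\max_{o} p(o \mid \hat{\lambda}, l)
  Bayes (training & synthesis combined):
                 \hat{o} = \arg\max_{o} p(o \mid O, l, L)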

Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood), estimated with the variational Bayesian method [Attias, '99].
Quantities involved: the HMM state sequences for the training and synthesis data, the likelihoods of the training and synthesis data, and the prior distribution over the model parameters (the equation on the original slide was an image; a reconstruction follows).
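A hedged reconstruction of the predictive distribution from the legend above (the state-sequence symbols z, Z and the exact factorization are my assumptions, following standard HMM notation):

  p(o \mid O, l, L) \propto p(o, O \mid l, L)
    = \sum_{z} \sum_{Z} \int p(o, z \mid l, \lambda)\, p(O, Z \mid L, \lambda)\, p(\lambda)\, d\lambda

where z and Z are the HMM state sequences of the synthesis and training data, p(o, z | l, lambda) and p(O, Z | L, lambda) are the likelihoods of the synthesis and training data, and p(lambda) is the prior distribution over the model parameters.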

Variational Bayesian method (1/2)
Estimate an approximate distribution of the true posterior distribution
⇒ Maximize a lower bound on the log marginal likelihood, obtained via Jensen's inequality by taking an expectation with respect to the approximate posterior (reconstruction below).
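As a hedged reconstruction of the lost equation, the lower bound obtained with Jensen's inequality reads:

  \log p(o, O \mid l, L)
    \ge \mathcal{F}
    = \Big\langle \log \frac{p(o, z \mid l, \lambda)\, p(O, Z \mid L, \lambda)\, p(\lambda)}
                            {q(z, Z, \lambda)} \Big\rangle_{q(z, Z, \lambda)}

where q(z, Z, lambda) is the approximate posterior distribution and \langle \cdot \rangle_q denotes the expectation with respect to q.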

Variational Bayesian method (2/2)
 The random variables (model parameters and state sequences) are assumed to be statistically independent in the approximate posterior
 The optimal posterior distributions are obtained up to normalization terms and are updated iteratively, as in the EM algorithm (reconstruction below)
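The update equations were lost with the slide graphics; the following is the standard mean-field solution and should be read as a reconstruction, not a verbatim copy of the talk:

  q(z, Z, \lambda) \approx q(z)\, q(Z)\, q(\lambda)

  q(\lambda) \propto p(\lambda)\,
      \exp\big( \langle \log p(o, z \mid l, \lambda) \rangle_{q(z)}
              + \langle \log p(O, Z \mid L, \lambda) \rangle_{q(Z)} \big)
  q(Z) \propto \exp\big( \langle \log p(O, Z \mid L, \lambda) \rangle_{q(\lambda)} \big)
  q(z) \propto \exp\big( \langle \log p(o, z \mid l, \lambda) \rangle_{q(\lambda)} \big)

Each distribution is normalized by its own normalization term, and the three updates are iterated in the same fashion as the E- and M-steps of the EM algorithm.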

Speech parameter generation
 Speech parameter generation based on the Bayesian approach
 The lower bound approximates the true log marginal likelihood well
 ⇒ Generate the speech parameters by maximizing the lower bound (see below)
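In symbols, using the lower bound F introduced above:

  \hat{o} = \arg\max_{o} \mathcal{F}(o)

i.e. the generated speech parameters are the synthesis data that maximize the lower bound.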

Outline
 Bayesian speech synthesis
   Variational Bayesian method
   Speech parameter generation
 Problem & Proposed method
   Approximation of posterior
   Integration of training and synthesis processes
 Experiments
 Conclusion & Future work

Bayesian speech synthesis
 The lower bound of the log marginal likelihood is maximized consistently for
   Estimation of the posterior distributions
   Speech parameter generation
⇒ All processes are derived from the single predictive distribution

Approximation of posterior
 The posterior distribution depends on the synthesis data
   ⇒ but the synthesis data is not observed
 Assume that the posterior is independent of the synthesis data [Hashimoto et al., '08]
   ⇒ Estimate the posterior from the training data only

Separation of training & synthesis
[Diagram: training side — update of the posterior distributions of the model parameters and of the HMM state sequence of the training data, from the training data; synthesis side — update of the posterior distribution of the HMM state sequence of the synthesis data, followed by generation of the synthesis data.]

Use of generated data
 Problem:
   The posterior distribution depends on the synthesis data
   The synthesis data is not observed
 Proposed method:
   Use generated data instead of observed data for estimating the posterior distribution
   Iterative updates, as in the EM algorithm (see below)
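In my notation, with \hat{o}^{(i)} the synthesis data generated at iteration i, the proposed alternation can be written as:

  q^{(i)}(\lambda), q^{(i)}(Z), q^{(i)}(z) \leftarrow
      \text{estimated from } O \text{ and } \hat{o}^{(i-1)}
  \hat{o}^{(i)} = \arg\max_{o} \mathcal{F}\big(o;\, q^{(i)}\big)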

Previous method
[Diagram: training (update of the posterior distributions of the model parameters and of the HMM state sequence of the training data) and synthesis (update of the posterior distribution of the HMM state sequence of the synthesis data, then generation of the synthesis data) are performed separately.]

Proposed method
[Diagram: training and synthesis form one loop — the posterior distributions of the model parameters and of both HMM state sequences are updated from the training data and the generated synthesis data, and the synthesis data is then regenerated.]

 The synthesis data can include several utterances
 The synthesis data affects the posterior distributions
 How many utterances should be generated in one update step?
 Two methods are discussed:
   Batch-based method: update the posterior distributions using several test sentences
   Sentence-based method: update the posterior distributions using one test sentence

Update method (1/2)
 Batch-based method
   The generated synthesis data of all test sentences is used to update the posterior distributions
   The synthesis data of all test sentences is generated using the same posterior distributions

Update method (2/2)
 Sentence-based method
   The generated synthesis data of one test sentence is used to update the posterior distributions
   The synthesis data of each test sentence is generated using different posterior distributions (formulas below)
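In my notation, with test sentences n = 1, ..., N and generated data \hat{o}_n, the two update methods differ in which generated data enters the posterior estimation:

  Batch-based (one shared posterior):
    q(\lambda) \leftarrow \text{estimated from } O \cup \{\hat{o}_1, \ldots, \hat{o}_N\}, \qquad
    \hat{o}_n = \arg\max_{o_n} \mathcal{F}(o_n; q) \ \text{for all } n

  Sentence-based (one posterior per sentence):
    q_n(\lambda) \leftarrow \text{estimated from } O \cup \{\hat{o}_n\}, \qquad
    \hat{o}_n = \arg\max_{o_n} \mathcal{F}(o_n; q_n)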

Outline
 Bayesian speech synthesis
   Variational Bayesian method
   Speech parameter generation
 Problem & Proposed method
   Approximation of posterior
   Integration of training and synthesis processes
 Experiments
 Conclusion & Future work

Experimental conditions
  Database:            ATR Japanese speech database B-set
  Speaker:             MHT
  Training data:       450 utterances
  Test data:           53 utterances
  Sampling rate:       16 kHz
  Window:              Blackman window
  Frame size / shift:  25 ms / 5 ms
  Feature vector:      24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
  HMM:                 5-state left-to-right HSMM without skip transitions

Iteration process
 Update of the posterior distributions and the synthesis data (a sketch of this loop follows the list):
 1. Posterior distributions are estimated from the training data
 2. Initial synthesis data is generated
 3. Context clustering is performed using the training data and the generated synthesis data
 4. Posterior distributions are re-estimated from the training data and the generated synthesis data (the number of updates is 5)
 5. Synthesis data is re-generated
 6. Steps 3, 4, and 5 are iterated
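A minimal Python sketch of this loop. The helper callables (estimate_posteriors, context_clustering, generate) are hypothetical placeholders standing in for the actual VB re-estimation, Bayesian context clustering, and parameter-generation routines, not a real toolkit API; the mode switch anticipates the batch-based and sentence-based variants described above.

from typing import Callable, List, Sequence

def bayesian_synthesis_loop(
    train_data: Sequence,          # 450 training utterances (O)
    train_labels: Sequence,        # their label sequences (L)
    test_labels: List,             # label sequences of the 53 test sentences (l)
    estimate_posteriors: Callable, # VB re-estimation of the posterior distributions
    context_clustering: Callable,  # Bayesian context clustering (decision-tree rebuild)
    generate: Callable,            # parameter generation maximizing the lower bound
    n_outer: int = 3,              # outer iterations (Iteration 1..3 in the slides)
    n_updates: int = 5,            # posterior updates per iteration (slide: 5)
    mode: str = "sentence",        # "batch" or "sentence"
):
    # Step 1: posterior distributions estimated from the training data only
    posteriors = estimate_posteriors(train_data, train_labels, [], [], None, n_updates)
    # Step 2: initial synthesis data generated for every test sentence
    generated = [generate(posteriors, lab) for lab in test_labels]

    for _ in range(n_outer):
        if mode == "batch":
            # Steps 3-5 (batch-based): clustering and re-estimation pool the generated
            # data of ALL test sentences; every sentence shares the same posteriors
            tree = context_clustering(train_data, train_labels, generated, test_labels)
            posteriors = estimate_posteriors(
                train_data, train_labels, generated, test_labels, tree, n_updates)
            generated = [generate(posteriors, lab) for lab in test_labels]
        else:
            # Steps 3-5 (sentence-based): each test sentence gets its own clustering,
            # posterior distributions, and regenerated synthesis data
            new_generated = []
            for lab, gen in zip(test_labels, generated):
                tree = context_clustering(train_data, train_labels, [gen], [lab])
                post_n = estimate_posteriors(
                    train_data, train_labels, [gen], [lab], tree, n_updates)
                new_generated.append(generate(post_n, lab))
            generated = new_generated
        # Step 6: steps 3-5 are iterated
    return generated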

Comparison of the number of updates
  Data for estimation of posterior distributions:
  Iteration 0: 450 training utterances
  Iteration 1: 450 utterances + 1 utterance generated in Iteration 0
  Iteration 2: 450 utterances + 1 utterance generated in Iteration 1
  Iteration 3: 450 utterances + 1 utterance generated in Iteration 2

Experimental results
 Comparison of the number of updates
[Results figure]

Comparison of Batch and Sentence
  Method    | Training & generation | Data for estimation of posterior distributions
  ML        | ML                    | 450 utterances
  Baseline  | Bayes                 | 450 utterances
  Batch     | Bayes                 | 450 utterances + generated utterances of all test sentences
  Sentence  | Bayes                 | 450 utterances + generated utterance of one test sentence (53 different posterior dists.)

Experimental results
 Comparison of the batch-based and sentence-based methods
[Results figure]

Conclusions and future work
 Integration of the training and synthesis processes
   Generated synthesis data is used for estimating the posterior distributions
   Posterior distributions and synthesis data are updated iteratively
   The proposed method outperforms the baseline method
 Future work
   Investigation of the relation between the amounts of training and synthesis data
   Experiments on various amounts of training data

Thank you

Advantage
 The predictive distribution is represented more exactly
 The posterior distributions are optimized more accurately

Integration of training and synthesis
 Estimate the posterior from generated data instead of observed data
 Bayesian speech synthesis: the synthesis and training processes are iterated
 The training process includes model selection

Prior distribution
 A conjugate prior distribution is used
   ⇒ the posterior distribution belongs to the same family of distributions as the prior
 Its hyperparameters are determined from statistics of prior data (the dimension of the feature vector, the mean and covariance of the prior data, and the number of prior data); the conjugate prior and likelihood function were shown as equations on the original slide (a sketch follows).
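As a hedged sketch only: for a Gaussian state-output distribution with mean mu and precision Lambda, a standard conjugate (Gauss-Wishart) prior whose hyperparameters are set from statistics of prior data has the form (the exact parametrization used in the talk may differ):

  p(\mu, \Lambda) = \mathcal{N}\big(\mu \mid \nu_0, (\xi_0 \Lambda)^{-1}\big)\,
                    \mathcal{W}\big(\Lambda \mid B_0, \eta_0\big)

with \nu_0 set to the mean of the prior data, B_0 derived from its covariance, and \xi_0, \eta_0 reflecting the number of prior data (D denotes the dimension of the feature vector).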

Relation between Bayes and ML
Compared with the ML criterion:
 The Bayesian output distribution uses expectations of the model parameters
 Generation can therefore be solved in the same fashion as with ML
 Output distribution: ML ⇒ and Bayes ⇒ (equations on the original slide; see the reconstruction below)
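The two output distributions shown after the arrows were lost; a hedged reconstruction (mu, Lambda are the state-output mean and precision, q(mu, Lambda) the approximate posterior):

  ML:    \log \mathcal{N}\big(o_t \mid \mu, \Lambda^{-1}\big)
  Bayes: \big\langle \log \mathcal{N}\big(o_t \mid \mu, \Lambda^{-1}\big) \big\rangle_{q(\mu, \Lambda)}

Because the expectation keeps the same quadratic dependence on o_t, parameter generation can be carried out with the same algorithm as in the ML case, simply using expectations of the model parameters.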

Impact of prior distribution
 The prior affects model selection, acting as a set of tuning parameters
   ⇒ A technique for determining the prior distribution is required
 Maximizing the marginal likelihood
   Leads to an over-fitting problem, as in ML
   Tuning parameters are still required
 A determination technique for the prior distribution using cross-validation [Hashimoto; '08]

Speech parameter generation
 The speech parameters consist of static and dynamic features
   ⇒ Only the static feature sequence is generated
 Speech parameter generation based on the Bayesian approach
   ⇒ Maximize the lower bound (see below)
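In the usual formulation (my notation), with c the static feature sequence and W the window matrix that appends the delta and delta-delta features:

  o = W c, \qquad \hat{c} = \arg\max_{c} \mathcal{F}(W c)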

Bayesian context clustering
 Context clustering (decision-tree node splitting) based on maximizing the lower bound
 At each node, a question such as "Is this phoneme a vowel?" splits the data into yes/no children
 The question with the largest gain of the lower bound is selected; the node is split based on the gain, and splitting stops when the stopping condition is met (see below)
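Written out (my notation), the gain of a candidate question Q at a node is:

  \Delta \mathcal{F}(Q) = \mathcal{F}_{\text{split by } Q} - \mathcal{F}_{\text{no split}}

The question with the largest gain is selected; the node is split while the gain is positive, and splitting stops once no question yields a positive gain.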

Use of generated data
 Problem: the synthesis data is not observed
 Proposed method: generated data is used instead of observed data for estimating the posterior distribution
 The synthesis data and the posterior distributions influence each other
   ⇒ Iterative updates, as in the EM algorithm

Batch-based & sentence-based
[Diagram: batch-based method — one update over sentences 1 to N together; sentence-based method — a separate update for each of sentences 1 to N.]