Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda

Presentation transcript:

Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
28 August, 2011

Background (1/2)
- Model estimation
  - Maximum likelihood (ML) approach
  - Bayesian approach
    - Estimation of posterior distributions
    - Utilization of prior distributions
    - Model selection according to the posterior probability
- Bayesian speech synthesis [Hashimoto et al., '08]
  - Model estimation and speech parameter generation can be derived from the predictive distribution
  - Represents the whole problem of speech synthesis

Background (2/2)
- Acoustic features are common to every speaker
  - Speaker Adaptive Training (SAT) [Anastasakos et al., '97]
  - Shared Tree Clustering (STC) [Yamagishi et al., '03]
  - Universal Background Model (UBM) [Reynolds et al., '00]
- Multi-speaker modeling with shared prior distributions and model structures
  - Appropriate acoustic models can be estimated from the training data of multiple speakers

Outline
- Bayesian speech synthesis
  - Bayesian speech synthesis framework
  - Variational Bayesian method
- Shared model structures and prior distributions
  - Multi-speaker modeling
  - Shared model structures
  - Shared prior distributions
- Experiments
- Conclusion & future work

Bayesian speech synthesis (1/3)
Model training and speech synthesis:
- ML:
  - Training: $\hat{\lambda} = \arg\max_{\lambda} p(O \mid \lambda, L)$
  - Synthesis: $\hat{o} = \arg\max_{o} p(o \mid \hat{\lambda}, l)$
- Bayes:
  - Training & synthesis: $\hat{o} = \arg\max_{o} p(o \mid O, l, L)$
- $o$ : synthesis data, $O$ : training data, $l$ : label seq. for synthesis, $L$ : label seq. for training, $\lambda$ : model parameters

Bayesian speech synthesis (2/3)
- Introduce the model structure $m$ into the predictive distribution:
  $p(o \mid O, l, L) = \sum_{m} p(o \mid O, l, L, m)\, P(m \mid O, L)$
- Model selection according to the posterior probability:
  $\hat{m} = \arg\max_{m} P(m \mid O, L)$
- Approximate the predictive distribution:
  $p(o \mid O, l, L) \approx p(o \mid O, l, L, \hat{m})$
- $m$ : model structure

Bayesian speech synthesis (3/3)
- Predictive distribution (marginal likelihood):
  $p(o \mid O, l, L) \propto \sum_{z} \sum_{Z} \int p(o, z \mid \lambda, l)\, p(O, Z \mid \lambda, L)\, p(\lambda)\, d\lambda$
- $p(o, z \mid \lambda, l)$ : likelihood of synthesis data
- $p(O, Z \mid \lambda, L)$ : likelihood of training data
- $z$ : HMM state seq. for synthesis data, $Z$ : HMM state seq. for training data
- $p(\lambda)$ : prior distribution for model parameters
- The sums and integral are intractable ⇒ Variational Bayesian method [Attias; '99]
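The predictive distribution integrates the model parameters out rather than committing to a point estimate. As a minimal illustration of that idea only (a toy 1-D Gaussian with known variance and a conjugate Gaussian prior on the mean, not the paper's HSMM; all names are ours), the sketch below computes the predictive distribution in closed form:

```python
import numpy as np

# Toy illustration (not from the slides): for a 1-D Gaussian likelihood
# N(o | mu, s2) with known variance s2 and conjugate prior N(mu | m0, v0),
# the predictive distribution integrates the parameter out.

def predictive_params(O, s2=1.0, m0=0.0, v0=10.0):
    """Return mean/variance of p(o | O) = integral N(o|mu,s2) p(mu|O) dmu."""
    n = len(O)
    vn = 1.0 / (1.0 / v0 + n / s2)          # posterior variance of mu
    mn = vn * (m0 / v0 + np.sum(O) / s2)    # posterior mean of mu
    return mn, s2 + vn                      # predictive mean and variance

O = np.random.default_rng(0).normal(2.0, 1.0, size=50)   # "training data"
mean, var = predictive_params(O)
print(f"predictive: N(o | {mean:.3f}, {var:.3f})")
# The extra +vn term keeps parameter uncertainty inside the synthesis-time
# distribution -- the point the Bayes row of the previous slide makes.
```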

Variational Bayesian method (1/2)
- Estimate an approximate posterior distribution $q(z, Z, \lambda)$ ⇒ maximize the lower bound $\mathcal{F}$
- By Jensen's inequality:
  $\log p(o \mid O, l, L) \geq \left\langle \log \frac{p(o, z \mid \lambda, l)\, p(O, Z \mid \lambda, L)\, p(\lambda)}{q(z, Z, \lambda)} \right\rangle_{q(z, Z, \lambda)} = \mathcal{F}$
- $\langle \cdot \rangle_{q}$ : expectation w.r.t. $q$
- $q(z, Z, \lambda)$ : approximate posterior distribution

Variational Bayesian method (2/2)
- Assume the random variables are statistically independent:
  $q(z, Z, \lambda) = q(z)\, q(Z)\, q(\lambda)$
- Optimal posterior distributions maximizing $\mathcal{F}$:
  $q(\lambda) = C_{\lambda}\, p(\lambda) \exp \left\langle \log p(o, z \mid \lambda, l) + \log p(O, Z \mid \lambda, L) \right\rangle_{q(z) q(Z)}$
  $q(z) = C_{z} \exp \left\langle \log p(o, z \mid \lambda, l) \right\rangle_{q(\lambda)}$
  $q(Z) = C_{Z} \exp \left\langle \log p(O, Z \mid \lambda, L) \right\rangle_{q(\lambda)}$
- $C_{\lambda}, C_{z}, C_{Z}$ : normalization terms
- Iterative updates as in the EM algorithm
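A hedged sketch of how these coupled updates iterate in practice, using a toy 2-component Gaussian mixture with known variance and weights instead of the paper's HSMMs. The factorization q(z)q(mu) and the "normalized exponentiated expectation" form of each update mirror the slide; the model and every variable name are our simplification:

```python
import numpy as np

# Toy VB-EM (our model, not the paper's): mixture of two Gaussians with
# known variance s2 and weights pi; factorize q(z, mu) = q(z) q(mu).

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
s2, pi = 1.0, np.array([0.5, 0.5])
m0, v0 = 0.0, 100.0                       # Gaussian prior N(mu_k | m0, v0)

m = np.array([-1.0, 1.0])                 # q(mu_k) = N(m_k, v_k)
v = np.array([v0, v0])

for _ in range(50):
    # q(z): responsibilities from <log N(x | mu_k, s2)>_{q(mu_k)}
    # = const - ((x - m_k)^2 + v_k) / (2 s2); the shared constant
    # cancels in the normalization (the slide's C_z term).
    log_r = np.log(pi) - ((x[:, None] - m) ** 2 + v) / (2 * s2)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # q(mu): conjugate Gaussian update from expected sufficient statistics
    Nk = r.sum(axis=0)
    v = 1.0 / (1.0 / v0 + Nk / s2)
    m = v * (m0 / v0 + (r * x[:, None]).sum(axis=0) / s2)

print("posterior means:", m, "posterior variances:", v)
```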

Outline
- Bayesian speech synthesis
  - Bayesian speech synthesis framework
  - Variational Bayesian method
- Shared model structures and prior distributions
  - Multi-speaker modeling
  - Shared model structures
  - Shared prior distributions
- Experiments
- Conclusion & future work

Multi-speaker modeling
- Acoustic features are common to every speaker
  - Use training data of multiple speakers: SAT, STC, UBM, etc.
  - Estimate appropriate acoustic models
- Multi-speaker modeling: log marginal likelihood over multiple speakers
  $\sum_{f} \log p(o^{(f)} \mid O^{(f)}, l, L) \geq \sum_{f} \mathcal{F}^{(f)}$
- $f$ : speaker index
⇒ Shared model structures and prior distributions

Shared model structures
- STC based on Bayesian model selection: maximize the sum of the lower bounds over speakers
  $\hat{m} = \arg\max_{m} \sum_{f} \mathcal{F}^{(f)}(m)$
- All speakers share the same decision tree: a question (e.g., "Is this phoneme a vowel?") splits the same node of every speaker's model into yes/no children
- Stopping condition: stop splitting when the gain of $\sum_{f} \mathcal{F}^{(f)}$ is no longer positive
- $q^{(f)}$ : posterior dist. of each speaker
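A sketch of the shared-structure idea under toy assumptions: the paper's criterion is the VB lower bound, for which a penalized Gaussian log-likelihood stands in here, and the data, question, and function names are all ours. The point illustrated is that the split gain is summed over speakers, so one tree is chosen for everyone:

```python
import numpy as np

def node_score(x):
    """Stand-in for the lower bound F: Gaussian max log-likelihood of a
    node's 1-D data minus a BIC-style penalty for its 2 parameters
    (the real VB bound has such a complexity penalty built in)."""
    n, var = len(x), np.var(x) + 1e-6
    return -0.5 * n * (np.log(2 * np.pi * var) + 1.0) - np.log(n)

def shared_gain(per_speaker_data, question):
    """Gain of one yes/no question, summed over every speaker's data,
    so all speakers end up with the same tree structure."""
    gain = 0.0
    for x, ctx in per_speaker_data:
        yes, no = x[question(ctx)], x[~question(ctx)]
        if len(yes) == 0 or len(no) == 0:
            return -np.inf
        gain += node_score(yes) + node_score(no) - node_score(x)
    return gain

rng = np.random.default_rng(1)
data = []
for spk in range(5):                         # 5 speakers, 200 frames each
    is_vowel = rng.random(200) < 0.4
    x = np.where(is_vowel, rng.normal(2, 1, 200), rng.normal(-2, 1, 200))
    data.append((x, is_vowel))

question = lambda ctx: ctx                   # "Is this phoneme a vowel?"
g = shared_gain(data, question)
print(f"summed gain = {g:.1f} ->", "split" if g > 0 else "stop")
```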

Prior distributions
- Conjugate prior distribution for the state output probability distribution:
  - State output probability dist.: $p(o_t \mid \mu, \Sigma) = \mathcal{N}(o_t \mid \mu, \Sigma)$
  - Prior distribution: $p(\mu, \Sigma) = \mathcal{N}(\mu \mid \nu, \Sigma / \xi)\, \mathcal{W}^{-1}(\Sigma \mid \eta, B)$
- Determination using prior data: $\xi$ : amount of prior data (controlled by a tuning parameter $\tau$), $\nu$ : mean of prior data, $B$ : covariance of prior data; $\{\xi, \nu, \eta, B\}$ : hyper-parameters
- Use the training data of multiple speakers as prior data ⇒ speaker-independent prior distribution
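A minimal sketch of building the speaker-independent prior, simplified to a Gaussian prior on the mean with a known, shared covariance (the slide's full conjugate prior also covers the covariance). Pooling all speakers' training data as the prior data and exposing the tuning parameter as a pseudo-count are the only points illustrated; every name below is ours:

```python
import numpy as np

# Speaker-independent prior sketch (our simplification): the prior's
# hyperparameters are the statistics of the pooled multi-speaker data,
# scaled by a tuning parameter tau playing the role of "amount of prior data".

rng = np.random.default_rng(2)
speakers = [rng.normal(rng.normal(0, 2), 1.0, size=(450, 3))
            for _ in range(5)]             # 5 speakers x 450 frames x 3 dims

pooled = np.vstack(speakers)               # prior data = all speakers pooled
nu = pooled.mean(axis=0)                   # prior mean  <- mean of prior data
Sigma = np.cov(pooled.T)                   # prior scale <- cov of prior data
tau = 50.0                                 # tuning parameter (pseudo-count)

# Posterior over one speaker's mean under the shared prior N(nu, Sigma/tau):
X = speakers[0]
n = len(X)
post_mean = (tau * nu + n * X.mean(axis=0)) / (tau + n)
print("prior mean:", np.round(nu, 2))
print("speaker-0 posterior mean:", np.round(post_mean, 2))
```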

Speaker adaptive prior distribution
- Maximize the sum of the lower bounds w.r.t. the prior distribution:
  $\hat{p}(\lambda) = \arg\max_{p(\lambda)} \sum_{f} \mathcal{F}^{(f)}$
- The shared prior distribution is estimated so that the posterior distribution of each speaker is estimated well
- Same fashion as speaker adaptive training (SAT)
- $q^{(f)}(\lambda)$ : posterior dist. for each speaker, $p(\lambda)$ : shared prior dist.
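A toy sketch of the speaker-adaptive idea (our construction: 1-D parameters with known variance). It alternates between per-speaker posteriors under the current shared prior and re-estimating the shared prior to fit those posteriors, analogous to how SAT re-estimates a canonical model from adapted statistics:

```python
import numpy as np

# Speaker-adaptive prior sketch (toy, our names): alternate
#   (1) per-speaker posterior means under the shared prior, and
#   (2) a prior update that moment-matches those posteriors.

rng = np.random.default_rng(3)
data = [rng.normal(mu, 1.0, 450) for mu in (-1.0, 0.5, 2.0, 3.5, 0.0)]

nu, tau = 0.0, 50.0                        # shared prior N(nu, 1/tau)
for it in range(10):
    # E-like step: posterior mean of each speaker's parameter
    post = [(tau * nu + len(x) * x.mean()) / (tau + len(x)) for x in data]
    # M-like step: move the shared prior toward the per-speaker posteriors
    nu = float(np.mean(post))

print("shared prior mean:", round(nu, 3))
print("per-speaker posterior means:", [round(p, 3) for p in post])
```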

Outline
- Bayesian speech synthesis
  - Bayesian speech synthesis framework
  - Variational Bayesian method
- Shared model structures and prior distributions
  - Multi-speaker modeling
  - Shared model structures
  - Shared prior distributions
- Experiments
- Conclusion & future work

Experimental conditions

Database: NIT Japanese speech database
Speakers: 5 male speakers
Training data: 450 utterances for each speaker
Test data: 53 utterances for each speaker
Sampling rate: 16 kHz
Window: Blackman window
Frame size / shift: 25 ms / 5 ms
Feature vector: 24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM: 5-state left-to-right HSMM without skip transitions
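For concreteness, a sketch of how a 78-dimensional observation vector can be assembled from 26 static coefficients per frame. The slide does not give the delta regression windows, so the common first/second-difference windows below are an assumption:

```python
import numpy as np

def append_deltas(c):
    """c: (T, D) static features -> (T, 3*D) static + delta + delta-delta.
    Assumed windows: delta = [-0.5, 0, 0.5], delta-delta = [1, -2, 1]."""
    pad = np.pad(c, ((1, 1), (0, 0)), mode="edge")     # repeat edge frames
    delta = 0.5 * (pad[2:] - pad[:-2])
    delta2 = pad[2:] - 2 * pad[1:-1] + pad[:-2]
    return np.hstack([c, delta, delta2])

T = 200
static = np.random.default_rng(4).normal(size=(T, 26))  # 26 static dims
feat = append_deltas(static)
print(feat.shape)   # (200, 78) -- matching the slide's 78-dim total
```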

Comparison methods
Compare 5 sharing methods (sharing among all speakers):

Method     | Model structure | Prior distribution
SD         |                 |
Tree       | ○               |
Prior      |                 | ○
Tree-Prior | ○               | ○ (speaker independent)
Tree-SAT   | ○               | ○ (speaker adaptive)

Experimental result
- Subjective evaluation: 5-point Mean Opinion Score (MOS)
[Figure: MOS of the five methods]

Conclusions and future work
- Conclusions
  - Investigated sharing prior distributions and model structures among multiple speakers
    - Robust model structures
    - Reliable prior distributions
  - Appropriate acoustic models can be estimated
  - Outperformed the single-speaker modeling method
- Future work
  - Investigate speaker selection for the sharing methods
  - Experiments comparing with conventional multi-speaker modeling methods

Thank you


Background
- Bayesian speech synthesis [Hashimoto et al., '08]
  - Represents the problem of speech synthesis
  - All processes can be derived from the predictive distribution
  - Model structures affect the quality of synthesized speech
  - Prior distributions affect Bayesian model selection
  ⇒ Determination of the prior distribution and model selection should be performed simultaneously
- Acoustic features are common to every speaker
  ⇒ Investigate prior distributions and model structures
  ⇒ Share prior distributions and model structures among all speakers

Bayesian speech synthesis
- The model structure is marginalized:
  $p(o \mid O, l, L) = \sum_{m} p(o \mid O, l, L, m)\, P(m \mid O, L)$
- Select the model structure that maximizes the posterior:
  $\hat{m} = \arg\max_{m} P(m \mid O, L)$, where $P(m \mid O, L) \propto p(O \mid L, m)\, P(m)$
- Approximate the predictive distribution:
  $p(o \mid O, l, L) \approx p(o \mid O, l, L, \hat{m})$
- $m$ : model structure, $P(m)$ : prior distribution of the model structure

Context clustering based on VB
- Construct the decision tree by maximizing the marginal likelihood (lower bound $\mathcal{F}$)
- Select the question with the largest gain $\Delta\mathcal{F}$ of the lower bound
- Stopping condition ⇒ split a node only while the gain is positive
[Figure: a node split by the question "Is this phoneme a vowel?" into yes/no child nodes]

Multi-speaker modeling
- Data of multiple speakers can be used
- Marginal likelihood of multiple speakers ⇒ sum of the lower bound of each speaker:
  $\sum_{f} \log p(o^{(f)} \mid O^{(f)}, l, L) \geq \sum_{f} \mathcal{F}^{(f)}$
[Figure: training data and synthesis data of each speaker]

Shared model structures
- Conventionally, model structures are selected for each speaker
- Sharing model structures among speakers: Shared Tree Clustering (STC) based on Bayesian model selection


Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & future work

Bayesian speech synthesis
- Maximize the lower bound of the log marginal likelihood consistently:
  - Estimation of posterior distributions
  - Speech parameter generation
⇒ All processes are derived from the single predictive distribution

Approximation of posterior
- The posterior distribution $q(\lambda)$ depends on the synthesis data ⇒ but the synthesis data is not observed
- Assume that $q(\lambda)$ is independent of the synthesis data [Hashimoto et al., '08]
⇒ Estimate the posterior from the training data only

Use of generated data
- Problem: the posterior distribution depends on the synthesis data, but the synthesis data is not observed
- Proposed method: use generated data instead of observed data for estimating the posterior distribution
- Iterative updates as in the EM algorithm
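A toy sketch of this loop (1-D Gaussian with known variance, our names throughout): the posterior is re-estimated with the current generated data included, the data is regenerated from the updated posterior, and the two steps repeat EM-style:

```python
import numpy as np

# "Use of generated data" sketch (toy stand-in, our construction): the
# posterior should condition on synthesis data, which is unobserved, so we
# alternate generation and posterior re-estimation.

rng = np.random.default_rng(5)
O = rng.normal(1.5, 1.0, 450)              # observed training data
n_syn, s2, v0 = 50, 1.0, 100.0             # synthesis frames, known var, prior var

o_gen = np.full(n_syn, O.mean())           # initialize generated data
for it in range(20):
    both = np.concatenate([O, o_gen])      # training + generated data
    vn = 1.0 / (1.0 / v0 + len(both) / s2) # posterior q(mu) variance
    mn = vn * both.sum() / s2              # posterior q(mu) mean
    o_gen = np.full(n_syn, mn)             # regenerate at the posterior mean

print("posterior mean after iterating:", round(mn, 4))
```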

Prior distribution
- Conjugate prior distribution ⇒ the posterior becomes the same family of distributions as the prior
  - Likelihood function (state output dist.): $p(o_t \mid \mu, \Sigma) = \mathcal{N}(o_t \mid \mu, \Sigma)$
  - Conjugate prior distribution: $p(\mu, \Sigma) = \mathcal{N}(\mu \mid \nu, \Sigma / \xi)\, \mathcal{W}^{-1}(\Sigma \mid \eta, B)$
- Determination using the statistics of prior data
- $D$ : dimension of the feature, $\xi$ : # of prior data, $\nu$ : mean of prior data, $B$ : covariance of prior data

Relation between Bayes and ML
- Compare the Bayesian output distribution with the ML criterion:
  ML ⇒ $\mathcal{N}(o_t \mid \hat{\mu}, \hat{\Sigma})$
  Bayes ⇒ $\left\langle \mathcal{N}(o_t \mid \mu, \Sigma) \right\rangle_{q(\mu, \Sigma)}$
- Use of expectations of model parameters
- Can be solved in the same fashion as ML
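A toy numeric illustration (our construction, 1-D with known variance): keeping the posterior uncertainty of the mean inside the output distribution changes the score of a held-out point, while the computation itself stays the same shape as the ML case:

```python
import numpy as np

# ML plug-in vs. Bayesian output distribution (toy): with few training
# frames, the Bayes version widens the variance by the posterior
# uncertainty of the mean; the algebra is otherwise identical to ML.

rng = np.random.default_rng(6)
train = rng.normal(0.0, 1.0, 5)            # deliberately little data
test = 2.5

mu_ml, s2 = train.mean(), 1.0              # known variance for simplicity
v_post = 1.0 / (1.0 / 100.0 + len(train) / s2)   # posterior var of mu

def loglik(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

print("ML   :", round(loglik(test, mu_ml, s2), 3))
print("Bayes:", round(loglik(test, mu_ml, s2 + v_post), 3))
# Bayes scores the outlying test point higher because parameter
# uncertainty is kept in the output distribution.
```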

Impact of prior distribution
- The prior distribution affects model selection through its tuning parameters ⇒ a technique for determining the prior distribution is required
- Maximizing the marginal likelihood leads to the over-fitting problem, as in ML, and tuning parameters are still required
- Determination of the prior distribution using cross validation [Hashimoto; '08]

Speech parameter generation
- The speech parameter vector consists of static and dynamic features ⇒ only the static feature sequence is generated
- Speech parameter generation based on the Bayesian approach ⇒ maximize the lower bound $\mathcal{F}$ w.r.t. the generated features
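A sketch of the generation step in its classic ML closed form, where the static trajectory is solved from normal equations tying static and dynamic features together; the slide's Bayesian version maximizes the lower bound instead, with posterior expectations of the parameters standing in for the means and covariances, but the linear-algebra structure is the same. The window and dimensions below are our toy choices:

```python
import numpy as np

# Generation sketch (toy 1-D static feature with one delta window):
# o_t = [c_t, delta c_t], delta via the window [-0.5, 0, 0.5].
# c_hat = argmax_c N(W c | mu, Sigma)  ->  (W' P W) c = W' P mu, P = Sigma^-1.

T = 6
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                      # static row
    if t > 0:     W[2 * t + 1, t - 1] = -0.5
    if t < T - 1: W[2 * t + 1, t + 1] = 0.5

mu = np.zeros(2 * T)
mu[0::2] = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]  # per-frame static means
mu[1::2] = 0.0                             # delta means
prec = np.eye(2 * T)                       # Sigma^{-1} (identity for brevity)

A = W.T @ prec @ W
b = W.T @ prec @ mu
c = np.linalg.solve(A, b)
print(np.round(c, 3))                      # smooth trajectory through targets
```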