Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Conversion Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano

Presentation transcript:

Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Conversion Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano Nara Institute of Science and Technology (NAIST), Japan August 23rd, 2007

1 Voice Quality Control
Technique for converting a user's voice quality into another one
Applications
–Amusement device
–Speech enhancement device
  for a speaking aid system recovering a disabled person's voice
  for a hearing aid system to make speech sounds more intelligible
Development of voice quality control with high quality and high controllability is desired!

2 Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

3 One-to-Many Eigenvoice Conversion (EVC) [Toda et al., 2006]
A source speaker's voice is statistically converted into an arbitrary speaker's one.
[Figure: in training, an eigenvoice GMM (EV-GMM) is trained on parallel data between the source speaker and multiple pre-stored target speakers; in conversion, the EV-GMM is set manually to convert the source speaker's voice into those of arbitrary speakers.]

4 Eigenvoice GMM (EV-GMM)
Converted voice quality is controlled by weights for eigenvectors.
Parameters of the i th mixture: weight, mean vector (source mean vector and target mean vector), covariance matrix
Target mean vector = Eigenvectors (for eigenvoices) × Weights for eigenvoices + Bias vector (for average voice)
The weights for eigenvoices are free parameters; the other parameters are speaker independent.
Problem: eigenvoices do NOT represent a specific physical meaning (such as a masculine voice or a clear voice). Intuitive control of the converted voice quality is difficult!
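The relation on this slide can be sketched in a few lines of NumPy. This is only an illustration: the names `target_mean`, `B_i`, `b_i` and `w`, and the toy dimensions, are assumptions introduced here and do not come from the slides.

```python
import numpy as np

def target_mean(eigenvectors, bias, weights):
    """Target mean of one mixture: eigenvectors applied to the weights, plus the bias."""
    return eigenvectors @ weights + bias

rng = np.random.default_rng(0)
dim, n_eigen = 4, 3                      # hypothetical toy sizes
B_i = rng.normal(size=(dim, n_eigen))    # eigenvectors of mixture i (one per column)
b_i = rng.normal(size=dim)               # bias vector (average voice) of mixture i

# With all weights at zero, the target mean reduces to the bias vector.
w = np.zeros(n_eigen)
mu_avg = target_mean(B_i, b_i, w)

# Raising one weight moves the mean along the corresponding eigenvector.
w_shift = np.array([1.0, 0.0, 0.0])
mu_shift = target_mean(B_i, b_i, w_shift)
```

Because a single weight vector is shared by all mixtures, changing it moves every target mean at once, which is why the weights act as the only speaker-dependent knobs in this sketch.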

5 Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

6 Proposed Framework We would like to intuitively control the converted voice quality! We propose multiple regression approaches to one-to-many EVC. Converted voice quality is controlled with the voice quality control vector. * Similar approaches have been proposed in HMM-based speech synthesis [Tachibana et al., 2006].

7 Process of Proposed Framework 1. Preparing multiple parallel data sets 2. Setting the voice quality control vector for every pre-stored target speaker 3. Modeling the target mean vectors with voice quality control vector

8 Setting Voice Quality Control Vector
We manually assign scores for expression word pairs to each pre-stored target speaker. Assigned scores are used as components of the voice quality control vector.
[Figure: rating scale for the word pairs masculine / feminine, hoarse / clear, elderly / youthful, thin / deep, and lax / tense, graded "very", "quite", "somewhat", or "no preference", showing the scores assigned to speaker A and the resulting voice quality control vector for speaker A.]
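One way to turn such ratings into a vector is sketched below. The slides do not give the numeric coding, so the values 0 to 3 for "no preference" to "very", the sign convention (first word of a pair negative, second positive), and the names `WORD_PAIRS`, `DEGREES` and `control_vector` are all assumptions made for illustration.

```python
import numpy as np

# Expression word pairs as listed in the experimental conditions.
WORD_PAIRS = [
    ("masculine", "feminine"),
    ("hoarse", "clear"),
    ("elderly", "youthful"),
    ("thin", "deep"),
    ("lax", "tense"),
]
# Assumed numeric coding of the rating degrees (not stated in the slides).
DEGREES = {"no preference": 0, "somewhat": 1, "quite": 2, "very": 3}

def control_vector(ratings):
    """Build a control vector from {pair: (chosen word, degree)} ratings.

    Pairs missing from `ratings` are treated as "no preference".
    """
    vec = np.zeros(len(WORD_PAIRS))
    for k, (left, right) in enumerate(WORD_PAIRS):
        word, degree = ratings.get((left, right), (None, "no preference"))
        magnitude = DEGREES[degree]
        if magnitude == 0:
            continue
        if word not in (left, right):
            raise ValueError(f"{word!r} is not part of the pair {left} / {right}")
        # Assumed sign convention: first word negative, second word positive.
        vec[k] = -magnitude if word == left else magnitude
    return vec

# Hypothetical speaker rated "very masculine" and "somewhat clear".
speaker_a = control_vector({
    ("masculine", "feminine"): ("masculine", "very"),
    ("hoarse", "clear"): ("clear", "somewhat"),
})
```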

9 Process of Proposed Framework 1. Preparing multiple parallel data sets 2. Setting the voice quality control vector for every pre-stored target speaker 3. Modeling the target mean vectors with voice quality control vector We propose 3 regression methods.

10 Proposed Method A
Least-squares (LS) estimation of regression parameters converting the voice quality control vector into principal components
–The principal components for the s th target speaker are modeled by the regression parameters applied to the voice quality control vector for the s th target speaker.
–Minimizing the following error function: the total error over all pre-stored target speakers, where each term is the error of principal components for the s th pre-stored target speaker.
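The LS estimation described for method A can be sketched with ordinary least squares. The exact regression form is not recoverable from the transcript, so this sketch assumes a linear map with an additive bias term; `fit_ls_regression`, `predict_components`, `R`, `r`, and the synthetic data are hypothetical names and values used only to illustrate the idea.

```python
import numpy as np

def fit_ls_regression(Z, W):
    """Least-squares fit of W ~ Z @ R.T + r.

    Z: (S, K) voice quality control vectors, one row per pre-stored speaker.
    W: (S, J) principal components, one row per pre-stored speaker.
    Returns R with shape (J, K) and r with shape (J,).
    """
    Z_aug = np.hstack([Z, np.ones((Z.shape[0], 1))])   # extra column for the bias
    coef, *_ = np.linalg.lstsq(Z_aug, W, rcond=None)   # coef has shape (K + 1, J)
    return coef[:-1].T, coef[-1]

def predict_components(R, r, z):
    """Principal components predicted for a control vector z."""
    return R @ z + r

# Synthetic check with 30 speakers, 5 control dimensions and 29 components.
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 5))
R_true = rng.normal(size=(29, 5))
r_true = rng.normal(size=29)
W = Z @ R_true.T + r_true          # noiseless targets, so the fit should be exact
R, r = fit_ls_regression(Z, W)
```

Summing squared errors over speakers and solving once with `np.linalg.lstsq` is equivalent to minimizing the total error over all pre-stored target speakers under these assumptions.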

11 Resulting EV-GMM in Method A
Parameters of the i th mixture: weight, mean vector, covariance matrix
The target mean vector is the bias vector plus the eigenvectors applied to the principal components, which are predicted from the voice quality control vector by the regression parameters.
The regression parameters are trained; the remaining EV-GMM parameters are speaker independent.
Problem: the desired voice characteristics might not be represented as a linear combination of eigenvectors. Changing the eigenvectors themselves is necessary!

12 Proposed Method B
LS estimation of regression parameters converting the voice quality control vector into the target mean vectors
–The target mean vector for the s th target speaker is modeled by the regression parameters applied to the voice quality control vector for the s th target speaker.
–Minimizing the following error function: the total error over all pre-stored target speakers, where each term is the error of target mean vectors for the s th pre-stored target speaker.
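In method B the dependent variables are the target mean vectors of every mixture rather than the principal components. A minimal sketch, again assuming a linear regression with a bias term per mixture, fits all mixtures in one least-squares solve; `fit_mean_regression`, `means_for`, the array layout and the toy sizes are assumptions made here.

```python
import numpy as np

def fit_mean_regression(Z, means):
    """Least-squares fit of per-mixture regressions onto target mean vectors.

    Z:     (S, K) voice quality control vectors.
    means: (S, M, D) target mean vectors per speaker and mixture.
    Returns R with shape (M, D, K) and r with shape (M, D).
    """
    S, M, D = means.shape
    Z_aug = np.hstack([Z, np.ones((S, 1))])
    # Flatten mixtures and dimensions so all regressions are solved at once.
    coef, *_ = np.linalg.lstsq(Z_aug, means.reshape(S, M * D), rcond=None)
    R = coef[:-1].T.reshape(M, D, Z.shape[1])
    r = coef[-1].reshape(M, D)
    return R, r

def means_for(R, r, z):
    """Target mean vectors of all mixtures for a control vector z, shape (M, D)."""
    return R @ z + r

# Synthetic check: 30 speakers, 5 control dimensions, 4 mixtures, 3-dim means.
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 5))
R_true = rng.normal(size=(4, 3, 5))
r_true = rng.normal(size=(4, 3))
means = np.einsum("mdk,sk->smd", R_true, Z) + r_true
R, r = fit_mean_regression(Z, means)
```

Unlike the method A sketch, nothing here constrains the predicted means to the span of the eigenvectors.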

13 Resulting EV-GMM in Method B
Parameters of the i th mixture: weight, mean vector, covariance matrix
The target mean vector is given by the regression parameters applied to the voice quality control vector.
The regression parameters are trained; the remaining EV-GMM parameters are speaker independent.
Problem: the desired voice quality might not be obtained because the converted voice quality is affected by all EV-GMM parameters.

14 Proposed Method C
Maximum likelihood (ML) estimation of all EV-GMM parameters while fixing the voice quality control vector
–The target mean vector for the s th target speaker is modeled by the regression parameters applied to the voice quality control vector for the s th target speaker.
–Maximizing the following likelihood function: the total likelihood over all pre-stored target speakers, where each term is the likelihood of the adapted EV-GMM for each pre-stored target speaker.
* This process is considered as speaker adaptive training (SAT) of EV-GMM [Ohtani et al., Interspeech 2007].

15 Resulting EV-GMM in Method C
Parameters of the i th mixture: weight, mean vector, covariance matrix
The target mean vector is given by the regression parameters applied to the voice quality control vector.
All of these parameters are training parameters.

16 Comparison of Proposed Methods
Method | Dependent variables | Tied parameters of EV-GMM | Training criterion
Method A | Principal components | Speaker independent | LS
Method B | Target mean vectors | Speaker independent | LS
Method C | Target mean vectors | Optimized | ML

17 Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

18 Verification of Proposed Methods
Objective verification
Subjective verification
Experimental conditions
Source speaker | One female
Pre-stored target speakers | 15 males and 15 females
Sentences | 50 phonetically balanced sentences per speaker
Expression word pairs | masculine / feminine, hoarse / clear, elderly / youthful, thin / deep, lax / tense
Number of mixtures | 128
Number of eigenvectors | 29 (no loss of information)
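The note "29 (no loss of information)" is consistent with a general property of principal component analysis: 30 mean-centered vectors span at most 29 dimensions. The sketch below checks this numerically under the assumption that the eigenvectors come from PCA over one stacked mean vector per pre-stored speaker; the random data and the dimension `sv_dim` are placeholders, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, sv_dim = 30, 200                 # sv_dim is a hypothetical size
supervectors = rng.normal(size=(n_speakers, sv_dim))

# Center on the average over speakers (the role played by the bias vector).
bias = supervectors.mean(axis=0)
centered = supervectors - bias

# PCA via SVD: rows of Vt are the principal directions (eigenvectors).
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

n_eigen = n_speakers - 1                     # 29 eigenvectors
eigenvectors = Vt[:n_eigen].T                # shape (sv_dim, 29)
weights = centered @ eigenvectors            # principal components per speaker
reconstructed = weights @ eigenvectors.T + bias
```

Here the 30th singular value is numerically zero, so keeping 29 eigenvectors reconstructs every speaker's vector exactly.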

19 Objective Verification
Is the correspondence between the voice quality control vector and the converted voice quality appropriately modeled?
For each pre-stored target speaker in the training data, the following two voice quality control vectors were compared.
1. The manually assigned one
2. The one adjusted on the trained EV-GMM so that the converted voice quality becomes similar to the target*
* approximately determined by maximum likelihood eigen-decomposition for EV-GMM [Toda et al., 2006] using two sentences
The Euclidean distance and the correlation coefficient between those two vectors were calculated as objective measures.
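The two objective measures can be computed as follows. This is a minimal sketch: the function name `compare_control_vectors` and the example scores are made up, and the slides do not say how the measures were aggregated over speakers.

```python
import numpy as np

def compare_control_vectors(assigned, adjusted):
    """Euclidean distance and Pearson correlation between two control vectors."""
    a = np.asarray(assigned, dtype=float)
    b = np.asarray(adjusted, dtype=float)
    distance = np.linalg.norm(a - b)
    correlation = np.corrcoef(a, b)[0, 1]
    return distance, correlation

# Hypothetical scores for one speaker over the five word pairs.
assigned = [3, -1, 0, 2, -2]
adjusted = [2.5, -0.5, 0.2, 1.8, -2.4]
distance, correlation = compare_control_vectors(assigned, adjusted)
```

A smaller distance and a correlation closer to 1 indicate that the adjusted vector agrees more closely with the manually assigned one.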

20 Results of Objective Verification
[Figure: objective measures for each method, with axes annotated from "Worse" to "Better".]
* Reassigned: scores assigned by the same listener a second time on a different day
1. Method A does not work at all.
2. Method B works, but not very well.
3. Method C works reasonably well. Too consistent compared with human judgment?

21 Subjective Verification
A preference test on the converted speech quality was conducted.
–Comparison of average voices* by the trained EV-GMMs
* converted voices when setting every component of the voice quality control vector to zero
The speaker individuality is very similar in both method B and method C. Which is better, method B or method C?
Experimental conditions
Test sentences | 50 sentences not included in training data
Number of subjects | 5

22 Result of Subjective Verification
Method B outperforms method C.
Possible explanation
–The EV-GMM parameters trained with the EM algorithm converged to local optima because an inappropriate initial model (i.e., the target-independent GMM) was used.

23 Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

24 Conclusions
Proposal of regression approaches to voice quality control based on one-to-many eigenvoice conversion (EVC)
–Based on a statistical conversion framework
–Allowing intuitive control of converted voice quality with the voice quality control vector
Experimental verification
–Showing the possibility that voice quality control with high quality and high controllability can be realized.