Published by Jody Sims. Modified over 9 years ago.
1
An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets
Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano
Nara Institute of Science and Technology (NAIST), Japan
August 23rd, 2007
2
Contents
Many-to-One VC framework
Many-to-One VC algorithms
Experimental evaluations
Conclusion
3
Conventional Voice Conversion (VC)
Figure: a source speaker and a target speaker are each asked, "Please say the same thing." The resulting parallel utterances are used to train the conversion model.
Training of the conversion model has some limitations: it uses parallel data, it needs around 50 utterance pairs, and it converts only the trained source speaker.
We would like to make VC more flexible: using arbitrary utterances, using only a few utterances, and converting arbitrary source speakers.
4
Many-to-One VC (M-to-O VC)
Convert arbitrary source speakers into the target speaker [T. Toda et al.].
Initial model training with multiple parallel data sets (pre-stored source speakers and the target speaker).
Adaptation of the model parameters for an arbitrary source speaker.
Applications: voice changer to movie stars, speech translation systems, etc.
5
Contents Many-to-One VC framework Many-to-One VC algorithms Experimental evaluations Conclusion
6
M-to-O VC Algorithms
1. Based on source independent GMM (SI-GMM) [T. Toda et al.]
2. Based on speaker selection (new algorithm)
3. Based on eigenvoice conversion (EVC) [T. Toda et al.]
4. Based on EVC with speaker adaptive training (SAT) (new algorithm)
7
1. M-to-O VC based on Source Independent GMM (SI-GMM) [T. Toda et al.]
We train the conversion model for arbitrary source speakers.
Parameters of the i-th mixture component of the SI-GMM: weight, mean vector (source mean vector and target mean vector), and covariance matrix.
Figure: feature space with the 1st, 2nd, and 3rd mixture components over the phonemes /a/, /i/, and /o/. Red: Speaker A. Blue: Speaker B. Green: Speaker C. The tied parameters are marked.
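To make the role of the weight, mean vectors, and covariance matrix concrete, here is a minimal sketch of how a joint-density GMM can map a source frame to a target frame. It assumes the textbook conditional-expectation mapping, which may differ from the exact conversion procedure used in the presentation; the function name `gmm_convert` and all argument names are hypothetical.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Estimate a target frame from one source frame x of shape (D,).

    Assumed layout (illustrative only):
      weights: (M,)       mixture weights
      mu_x, mu_y: (M, D)  source and target mean vectors
      cov_xx: (M, D, D)   source covariance blocks
      cov_yx: (M, D, D)   target-source cross-covariance blocks
    Returns E[y | x] under the joint GMM.
    """
    n_mix = len(weights)
    log_post = np.empty(n_mix)
    cond_means = np.empty_like(mu_y, dtype=float)
    for i in range(n_mix):
        diff = x - mu_x[i]
        solved = np.linalg.solve(cov_xx[i], diff)  # cov_xx^-1 (x - mu_x)
        _, logdet = np.linalg.slogdet(cov_xx[i])
        # log(weight * N(x; mu_x, cov_xx)) without the constant shared by all components
        log_post[i] = np.log(weights[i]) - 0.5 * (logdet + diff @ solved)
        # Conditional mean of the target given the source for component i
        cond_means[i] = mu_y[i] + cov_yx[i] @ solved
    post = np.exp(log_post - log_post.max())  # numerically stable normalization
    post /= post.sum()
    return post @ cond_means
```

With an SI-GMM, the same parameters are used for every source speaker, so this mapping is applied without any adaptation step.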
8
Previous Training of SI-GMM
Previous training process: the SI-GMM is trained using all parallel data sets between the multiple pre-stored source speakers and the target speaker.
The SI-GMM converts an arbitrary source speaker’s voice without any adaptation process.
9
Problem of SI-GMM
The phonemic spaces of one speaker often overlap with those of another speaker, so the SI-GMM might cause conversion errors.
Figure: feature space with the 1st, 2nd, and 3rd mixture components over the phonemes /a/, /i/, and /o/. Red: Speaker A. Blue: Speaker B. Green: Speaker C.
10
2. M-to-O VC based on Speaker Selection
We train the conversion model using a subset of the pre-stored source speakers whose voice characteristics are similar to those of the given source speaker (speaker selection [S. Yoshizawa, et al., 2001]).
Figure: feature space with the 1st, 2nd, and 3rd mixture components over the phonemes /a/, /i/, and /o/. Red: Speaker A. Blue: Speaker B. Green: Speaker C. Black: source speaker. Speakers A and C are selected.
11
Process of Speaker Selection
Previous training process (multiple pre-stored source speakers and the target speaker):
1. Training of SI-GMM
2. Training of speaker dependent GMMs (SD-GMMs)
Adaptation process (adaptation data of the source speaker):
3. Calculation of the likelihood of the adaptation data for each SD-GMM
4. Sorting of the likelihoods
5. Selection of the N-best parallel data sets based on the likelihoods
6. Training of the conversion model with the selected pre-stored source speakers and the target speaker
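Steps 3 to 5 of the speaker selection process can be sketched as follows. This is an illustration under stated assumptions, not the presenters' implementation: each SD-GMM is assumed to have diagonal covariances and is represented as a hypothetical `(weights, means, variances)` tuple, and the function names are invented for this example.

```python
import numpy as np

def diag_gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]              # (T, M, D)
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances)[None]
                       + diff ** 2 / variances[None]).sum(axis=2)  # (T, M)
    log_comp = log_comp + np.log(weights)[None]
    peak = log_comp.max(axis=1, keepdims=True)                 # log-sum-exp trick
    per_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(per_frame.sum())

def select_speakers(adapt_frames, sd_gmms, n_best):
    """Return indices of the n_best pre-stored speakers, most likely first."""
    # Step 3: likelihood of the adaptation data for each SD-GMM
    scores = [diag_gmm_loglik(adapt_frames, *gmm) for gmm in sd_gmms]
    # Step 4: sort by likelihood, highest first
    order = np.argsort(scores)[::-1]
    # Step 5: keep the N-best speakers (their parallel data sets are used in step 6)
    return [int(s) for s in order[:n_best]]
```

Step 6, training the conversion model on the selected parallel data sets, is outside this sketch.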
12
Problem of Speaker Selection
The resulting conversion model just covers the selected pre-stored source speakers. Such a model is not necessarily suitable for the given source speaker.
Figure: the conversion model trained by speaker selection compared with the desired conversion model. Red: Speaker A. Blue: Speaker B. Green: Speaker C. Black: source speaker. Speakers A and C are selected.
13
3. M-to-O VC based on Eigenvoice Conversion (EVC) [T. Toda et al.]
The conversion model is adapted to the source speaker by adjusting the weights for the individual eigenvoices (unsupervised adaptation).
Figure: the weighting of the 1st, 2nd, ..., (S-1)th eigenvectors for the source speaker yields the conversion model.
14
Eigenvoice GMM (EV-GMM)
Parameters of the i-th mixture component: weight, mean vector, and covariance matrix.
Source mean vector = representative vectors (eigenvoices) × weight vector + bias vector (average voice)
The weight vector is the free parameter; it can be estimated with adaptation data. The remaining parameters are tied.
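The relation between the source mean vectors, the representative vectors, the bias vectors, and the weight vector can be sketched in code. The weight update below is one maximum-likelihood step with the component posteriors held fixed and with diagonal source covariances; both are simplifying assumptions, and a full EM procedure would recompute the posteriors at every iteration. All names and array layouts are hypothetical.

```python
import numpy as np

def adapted_source_means(B, b, w):
    """Source means of all components: B_i w + b_i.

    B: (M, D, J) representative vectors, b: (M, D) bias vectors, w: (J,) weight vector.
    """
    return B @ w + b

def estimate_weight_vector(frames, post, B, b, var):
    """One ML update of the weight vector from adaptation data.

    frames: (T, D) source frames, post: (T, M) component posteriors (held fixed),
    var: (M, D) diagonal source covariances (an assumption of this sketch).
    Solves (sum_i occ_i B_i^T S_i^-1 B_i) w = sum_i B_i^T S_i^-1 sum_t post_ti (x_t - b_i).
    """
    n_eigen = B.shape[2]
    lhs = np.zeros((n_eigen, n_eigen))
    rhs = np.zeros(n_eigen)
    for i in range(B.shape[0]):
        occupancy = post[:, i].sum()
        residual = post[:, i] @ (frames - b[i])   # (D,)
        scaled = B[i] / var[i][:, None]           # S_i^-1 B_i, shape (D, J)
        lhs += occupancy * (B[i].T @ scaled)
        rhs += scaled.T @ residual
    return np.linalg.solve(lhs, rhs)
```

Only the weight vector changes during adaptation here, which is why a small amount of source speech can be enough to estimate it.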
15
Previous Training of EV-GMM (algorithm 3)
Previous training process (multiple pre-stored source speakers and the target speaker):
1. Training of SI-GMM
2. Training of SD-GMMs
3. Construction of supervectors
4. Estimation of bias vectors and representative vectors
5. Construction of EV-GMM from the SI-GMM, the bias vectors, and the representative vectors
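Steps 3 and 4 of the EV-GMM training can be illustrated with principal component analysis over the speakers' supervectors. The sketch assumes that each supervector is the concatenation of one SD-GMM's source mean vectors and that the representative vectors are the principal directions; the slides do not spell out these details, and the function name is hypothetical.

```python
import numpy as np

def build_eigenvoices(supervectors, n_eigen=None):
    """Estimate the bias supervector and representative vectors.

    supervectors: (S, M*D), one row per pre-stored source speaker.
    Returns (bias, eigenvoices) with shapes (M*D,) and (M*D, n_eigen).
    """
    n_speakers = supervectors.shape[0]
    if n_eigen is None:
        n_eigen = n_speakers - 1  # centering leaves at most S - 1 independent directions
    bias = supervectors.mean(axis=0)  # average voice
    centered = supervectors - bias
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return bias, vt[:n_eigen].T
```

With S pre-stored speakers this yields at most S - 1 useful representative vectors, which matches the (S-1)th eigenvector shown on the EVC slide.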
16
Problem of EVC
The tied parameters of the EV-GMM come from the SI-GMM. They are not suitable for the given source speaker; e.g., the source covariance values are much larger than those of the desired conversion model.
Figure: the EV-GMM and the adapted EV-GMM compared with the desired conversion model. Red: Speaker A. Blue: Speaker B. Green: Speaker C. Black: source speaker.
17
4. M-to-O VC based on EVC with Speaker Adaptive Training (SAT)
SAT [T. Anastasakos, et al., 1996]: we train the EV-GMM in advance so that the adaptation performance is improved.
Training criterion: the total likelihood over all pre-stored source speakers, where the likelihood for each pre-stored source speaker is that of the EV-GMM adapted to that speaker.
Figure: the EV-GMM and the adapted EV-GMM compared with the EV-GMM with SAT and the adapted EV-GMM with SAT. Red: Speaker A. Blue: Speaker B. Green: Speaker C. Black: source speaker.
18
SAT for EV-GMM
Previous training process (multiple pre-stored source speakers and the target speaker):
1. Training of speaker dependent parameters: weight vectors
2. Training of tied parameters: GMM weights, bias vectors, representative vectors, target mean vectors, and covariance matrices
3. Iteration of steps 1 and 2
The result is the canonical EV-GMM.
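The SAT criterion and its alternating updates can be written compactly. The notation here is an assumption introduced for illustration and does not come from the slides: $\lambda$ denotes the tied parameters, $w^{(s)}$ the weight vector of the $s$-th pre-stored source speaker, $Z^{(s)}$ the joint source-target feature sequence of the $s$-th parallel data set, and $S$ the number of pre-stored source speakers.

```latex
% Training criterion: total likelihood of the adapted EV-GMMs over all pre-stored speakers
\{\hat{\lambda}, \hat{w}^{(1)}, \dots, \hat{w}^{(S)}\}
  = \operatorname*{arg\,max}_{\lambda,\, w^{(1)}, \dots, w^{(S)}}
    \prod_{s=1}^{S} P\!\left(Z^{(s)} \mid \lambda, w^{(s)}\right)

% Step 1: update the speaker dependent parameters with the tied parameters fixed
\hat{w}^{(s)} = \operatorname*{arg\,max}_{w} P\!\left(Z^{(s)} \mid \lambda, w\right),
  \qquad s = 1, \dots, S

% Step 2: update the tied parameters with all weight vectors fixed
\hat{\lambda} = \operatorname*{arg\,max}_{\lambda}
    \prod_{s=1}^{S} P\!\left(Z^{(s)} \mid \lambda, \hat{w}^{(s)}\right)
```

Steps 1 and 2 are repeated (step 3), so the tied parameters are optimized for the model after adaptation rather than inherited unchanged from the SI-GMM.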
19
Comparison of M-to-O VC Algorithms
Algorithm: source mean vectors / tied parameters
Based on SI-GMM: not adapted / not adapted
Based on speaker selection: roughly adapted / roughly adapted
Based on EVC: adapted / not adapted
Based on EVC with SAT: adapted / previously optimized
20
Contents Many-to-One VC framework Many-to-One VC algorithms Experimental evaluations Conclusion
21
Experimental Conditions
Training stage: 160 pre-stored source speakers (80 males and 80 females) and 1 male target speaker; 50 sentences uttered by each speaker.
Adaptation stage: 10 source speakers (5 males and 5 females).
The number of mixtures: 128
The number of representative vectors: 159
The number of selected speakers: 27
22
Experimental Conditions (cont’d)
Objective evaluation
Test data: 21 sentences
Objective measure: spectral distortion
The number of adaptation sentences: varying from 1/32 to 32
Subjective evaluation
Preference test on the speech quality of converted voices
The number of subjects: 6 (each subject evaluated 120 sample-pairs)
The number of adaptation sentences: 2
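The slides name the objective measure only as spectral distortion. As an illustration, the sketch below assumes mel-cepstral distortion in dB, a common choice in voice conversion work; the actual measure used in the evaluation may differ. The function name is hypothetical, and the frames are assumed to be already time-aligned.

```python
import math

def mel_cepstral_distortion(converted, target):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    converted, target: sequences of equal length; each frame is a sequence of
    mel-cepstral coefficients (conventionally excluding the 0th, energy term).
    """
    if len(converted) != len(target) or len(converted) == 0:
        raise ValueError("sequences must be non-empty and have the same length")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for conv_frame, tgt_frame in zip(converted, target):
        squared = sum((c - t) ** 2 for c, t in zip(conv_frame, tgt_frame))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(converted)
```

Lower values mean the converted spectra are closer to the target speaker's spectra.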
23
Result of Objective Evaluation
Figure: objective evaluation results, with an axis running from "Worse" to "Better".
The adaptation techniques improve the conversion accuracy.
SAT yields further improvements.
EVC and EVC with SAT cause large distortions when the amount of adaptation data is very limited.
Speaker selection is effective even with a very limited amount of adaptation data.
24
Result of Subjective Evaluation
Every adaptation technique improves the converted speech quality.
25
Contents Many-to-One VC framework Many-to-One VC algorithms Experimental evaluations Conclusion
26
Conclusions
We conducted an experimental evaluation of many-to-one VC algorithms:
based on SI-GMM and based on EVC [T. Toda, et al.];
based on speaker selection and based on EVC with SAT (new methods).
Results of the objective and subjective evaluations showed that
the adaptation process results in a better conversion model than the SI-GMM, and
the algorithm based on speaker selection works well with a very small amount of adaptation data.
27
Thank you for your attention! Any questions?