1-R-43
Neutral-to-Emotional Voice Conversion with Latent Representations of F0 using Generative Adversarial Networks
Zhaojie Luo, Tetsuya Takiguchi, and Yasuo Ariki (Kobe University)

Canonical Correlation Analysis

Overview

Background
Emotional voice conversion changes the emotion of an utterance (for example, from neutral to sad, happy, or angry) while keeping the linguistic information unchanged. One application is an emotional robot.

Problems
1. The representation of fundamental frequency (F0) is too simple for emotion conversion.
2. The emotional voice data is insufficient.

Goal
1. Apply the continuous wavelet transform (CWT) and cross wavelet transform (XWT) to systematically capture F0 features at different temporal scales.
2. Use the VAE-GAN to train on the MCC and AS-CWT features.

Framework

Training model
[Figure: training model with components E, h, G, and D; x is the input, x' the output, and y the real data.]

$L = L_{GAN} + L_{D_l}^{like} + L_{prior}$

Dataset
[Samples: content not recoverable from the source.]

Results

[Figure: MOS evaluation of emotional voice conversion.]

Table 1. F0-RMSE results for different emotions. N2A, N2S, and N2H denote the datasets for conversion from neutral to angry, sad, and happy voice, respectively.

        Source   LG      NN      VAE     GAN     VA-GAN
N2A     76.8     76.3    70.4    73.4    59.5    51.2
N2S     73.7     72.0    62.3    77.5    56.1    58.5
N2H     100.4    99.1    75.2    85.8    65.5    62.1
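
The F0-RMSE metric reported in Table 1 can be illustrated with a short sketch. The poster does not say how frames are aligned or how unvoiced frames are treated, so the following are assumptions: both F0 contours are already time-aligned (for example by dynamic time warping), values are in Hz, unvoiced frames are marked with 0, and only frames voiced in both contours are scored. The function name `f0_rmse` is hypothetical.

```python
import numpy as np

def f0_rmse(f0_converted, f0_target):
    """Root-mean-square error between two time-aligned F0 contours.

    Assumption (not stated on the poster): unvoiced frames are coded as 0
    and are excluded unless both contours are voiced at that frame.
    """
    a = np.asarray(f0_converted, dtype=float)
    b = np.asarray(f0_target, dtype=float)
    if a.shape != b.shape:
        raise ValueError("contours must be time-aligned to the same length")
    voiced = (a > 0) & (b > 0)
    if not voiced.any():
        raise ValueError("no frame is voiced in both contours")
    return float(np.sqrt(np.mean((a[voiced] - b[voiced]) ** 2)))

# Toy contours: the middle frame is unvoiced in the converted contour
# and is therefore skipped.
rmse = f0_rmse([100.0, 0.0, 200.0], [110.0, 120.0, 190.0])
```

Under this convention a lower value means the converted contour is closer to the target emotional contour.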
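
The first goal, capturing F0 features at different temporal scales with the CWT, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Mexican hat mother wavelet, the dyadic scales, the z-score normalization, and the assumption that the log-F0 contour has already been interpolated over unvoiced frames are all choices made here for the example. The names `mexican_hat` and `cwt_f0` are hypothetical.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet with unit L2 norm."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-(t ** 2) / 2.0)

def cwt_f0(log_f0, scales):
    """Decompose a continuous log-F0 contour into one row per scale.

    Assumption: log_f0 has no unvoiced gaps (already interpolated).
    Scales are given in frames.
    """
    x = np.asarray(log_f0, dtype=float)
    std = x.std()
    x = (x - x.mean()) / (std if std > 0 else 1.0)  # z-score normalization
    n = x.size
    coeffs = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))              # kernel support of +/- 5 scales
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)    # L2-normalized scaled wavelet
        full = np.convolve(x, kernel, mode="full")
        coeffs[i] = full[half:half + n]         # keep the centered n samples
    return coeffs

# Synthetic contour: 120 Hz baseline with a slow oscillation of period 32 frames.
frames = np.arange(512)
log_f0 = np.log(120.0) + 0.1 * np.sin(2 * np.pi * frames / 32)
coeffs = cwt_f0(log_f0, scales=[2, 4, 8, 16])
```

Each row of `coeffs` isolates variation at one temporal scale, so small scales respond to fast, short-term F0 movement and large scales to slow, phrase-level movement.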