Text-to-speech (TTS)
Traditional approaches (before 2016)
- Handcrafted features; requires domain expertise
- Modules are trained independently, so errors may compound
- Substantial engineering effort is required to build a new model
Neural TTS
- Learns directly from text-speech pairs, end-to-end
- Made possible by training neural networks on large amounts of data
- WaveNet [DeepMind, 2016], Tacotron [Google, 2017], Deep Voice [Baidu, 2017]
Shuang Ma, Daniel McDuff, Yale Song, “Neural TTS Stylization with Adversarial and Collaborative Games”, ICLR 2019
Current state of neural TTS
- Difficult to control the style of speech, e.g., identity, emotion
  - May negatively impact user experience
- Deep Voice 3 for multi-speaker neural TTS (Baidu) [ICLR 2018]
  - 2000 identities!
  - Requires categorized samples of identities
  - Does not generalize to new identities
- Global Style Tokens (Google) [ICML 2018]
  - Does not require categorized training data
  - Style is extracted on-the-fly from a reference audio clip
  - Can also control prosodic styles (accents, intonation, etc.)
Tacotron [Interspeech 2017]
Tacotron-GST [ICML 2018]
- Style token = bottleneck
- A, B, C, D: basis vectors
- [0.2, 0.1, 0.3, 0.4]: coefficients
- Basis vectors weighted by the coefficients give the style embedding
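The basis-vector view of the style embedding can be sketched in a few lines. This is only an illustration of the weighted sum described on the slide: the 3-dimensional token values and the function name `style_embedding` are hypothetical, and in the actual model both the tokens and the coefficients are learned.

```python
def style_embedding(basis, coefficients):
    """Return the coefficient-weighted sum of the basis vectors (style tokens)."""
    if len(basis) != len(coefficients):
        raise ValueError("need exactly one coefficient per basis vector")
    dim = len(basis[0])
    return [
        sum(c * token[d] for c, token in zip(coefficients, basis))
        for d in range(dim)
    ]


# Hypothetical 3-dimensional style tokens standing in for A, B, C, D.
tokens = [
    [1.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0],  # B
    [0.0, 0.0, 1.0],  # C
    [1.0, 1.0, 1.0],  # D
]
weights = [0.2, 0.1, 0.3, 0.4]  # coefficients from the slide

embedding = style_embedding(tokens, weights)
print(embedding)
```

Because every embedding is forced through a small set of shared tokens, the token bank acts as the bottleneck named on the slide.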
Limitations of Tacotron-GST
- The model is trained with a reconstruction loss alone
  - The style embedding is underconstrained
  - Content information can “leak into” the style embedding
- We propose a new training technique to improve Tacotron-GST
  - Improves the ability to disentangle content and style from reference audio
What we’ve changed:
- Triplet input formulation
  - Paired (text, audio) samples, with an extra audio sample unpaired with text
- Adversarial training
  - Ensure the realism of synthesized samples
  - No reconstruction required
- Borrow ideas from image style transfer [Gatys et al., CVPR 2016]
  - Apply the idea in the mel-spectrogram domain
Model Architecture: Tacotron-GST [ICML 2018]
[Figure: architecture diagram. Labeled components: Text, Char Embed, Paired audio, CNN1, Attention RNN, Decoder, Reconstruction loss]
Model Architecture: Introduce an extra “unpaired” audio sample
[Figure: architecture diagram. Labeled components: Text, Char Embed, Paired audio, Unpaired audio, CNN1, Attention RNN, Decoder, Reconstruction loss]
- The unpaired sample may contain different content and style than the paired audio input
- How to ensure the correctness of the synthesized output?
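One way to picture the triplet input is as a paired (text, audio) sample plus one extra audio clip drawn from a different sample. The sketch below is an assumption about how such triplets could be assembled, not the authors' data pipeline; `Triplet`, `make_triplets`, and the file names are hypothetical.

```python
import random
from collections import namedtuple

# Hypothetical container for one training triplet.
Triplet = namedtuple("Triplet", ["text", "paired_audio", "unpaired_audio"])


def make_triplets(pairs, seed=0):
    """Attach to each (text, audio) pair an audio clip taken from another pair."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to draw an unpaired clip")
    rng = random.Random(seed)
    triplets = []
    for i, (text, audio) in enumerate(pairs):
        # Draw an index uniformly from all positions except i.
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        triplets.append(Triplet(text, audio, pairs[j][1]))
    return triplets


# Hypothetical dataset of paired samples.
pairs = [(f"text {i}", f"clip_{i}.wav") for i in range(5)]
triplets = make_triplets(pairs)
for t in triplets:
    print(t)
```

Since no ground-truth audio exists for the (text, unpaired audio) combination, a reconstruction loss cannot be applied to that output, which is the question the slide raises.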
Model Architecture
[Figure: architecture diagram. Labeled components: Text, Char Embed, Paired audio, Unpaired audio, CNN1, Attention RNN, Decoder, Discriminator, Reconstruction loss, Adversarial loss]
- Use a discriminator that checks the realism of the output
  - GAN training
- Content information can still leak into the style embedding
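The discriminator's realism check can be illustrated with the standard GAN cross-entropy losses. This is the generic textbook formulation, offered as a sketch; it is not necessarily the exact objective used in the paper, and the function names and the `eps` stabilizer are illustrative. Here `d_real` and `d_fake` are assumed to be the discriminator's probabilities that a real and a synthesized sample, respectively, are real.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Low when real samples score near 1 and synthesized samples near 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake, eps=1e-12):
    """Low when the discriminator believes the synthesized sample is real."""
    return -math.log(d_fake + eps)


# A confident, correct discriminator has a lower loss than an undecided one.
print(discriminator_loss(0.9, 0.1), discriminator_loss(0.5, 0.5))
# The generator's loss falls as its output becomes more convincing.
print(generator_loss(0.1), generator_loss(0.9))
```

Note that this loss only rewards realism. It does not by itself say whether the output carries the style of the reference clip, which is consistent with the slide's remark that content can still leak into the style embedding.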
Architecture Overview
[Figure: architecture diagram. Labeled components: Text, Char Embed, Paired audio, Unpaired audio, CNN1, CNN2, Attention RNN, Decoder, Discriminator, Reconstruction loss, Adversarial loss, Style loss]
- Introduce another CNN (CNN2)
- Compute the style loss on mel-spectrogram images!
Style loss on mel-spectrograms
- Mel-spectrograms represent the short-term power spectrum of sound
- The Gram matrix of feature maps extracted from mel-spectrograms
  - Captures local variations of prosody in the time-frequency domain
- See also: a temporary reduction in the average fundamental frequency significantly correlates with sarcasm expression [Cheang and Pell, 2008]
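A Gram-matrix style loss in the spirit of Gatys et al. can be sketched as follows. The sketch assumes each feature map has already been flattened into a list of activations, one list per channel; the normalization constants and the toy feature values are illustrative choices, not the paper's settings.

```python
def gram_matrix(features):
    """Channel-by-channel inner products of flattened feature maps."""
    n = len(features[0])  # number of positions per channel
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def style_loss(generated, reference):
    """Mean squared difference between the two Gram matrices."""
    g_gen, g_ref = gram_matrix(generated), gram_matrix(reference)
    c = len(g_gen)
    return sum(
        (g_gen[i][j] - g_ref[i][j]) ** 2 for i in range(c) for j in range(c)
    ) / (c * c)


# Toy feature maps: 2 channels, 3 positions each (hypothetical values).
reference = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
shifted = [[3.0, 1.0, 2.0], [0.0, 0.0, 1.0]]    # same activations, moved by one position
different = [[1.0, 0.0, 0.0], [2.0, 2.0, 2.0]]  # different channel statistics

print(style_loss(shifted, reference))
print(style_loss(different, reference))
```

The Gram matrix records which channels co-activate but discards where they do so, so `shifted` matches the reference despite the reordering. That property is what makes it a plausible summary of style rather than content.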
Emotion Style Transfer
[Audio demo: GST and Ours compared for each reference audio clip: Neutral, Happy, Sad, Angry]
Input sentences:
- “no no no, you see, the key to being healthy is being happy! and cookies make you happy.”
- “This is why I want you gone ninety percent of the time.”
- “Dude, you never texted me!!”
Evaluation: Emotion Style Transfer
- Both GST and our model synthesize 15 sentences
  - 10 sentences from the test set, 5 sentences from the web
- All 4 reference audio clips are unseen during training
  - neutral, happy, sad, angry
- Seven human subjects each listen to all 60 permutations of samples
- Our model is rated significantly closer to the reference (𝑝 ≪ 0.001)
  - neutral: 𝑝 = 0.01, happy: 𝑝 ≪ 0.001, sad: 𝑝 ≪ 0.001, angry: 𝑝 ≪ 0.001
- Audio naturalness test using Mean Opinion Score (MOS; higher is better)
  - Ours: 4.3 MOS
  - Tacotron-GST: 4.0 MOS, Tacotron: 3.82 MOS
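The figure of 60 samples is consistent with pairing each of the 15 sentences with each of the 4 reference clips. The enumeration below assumes that reading, which the slide does not state outright; the sentence identifiers are placeholders.

```python
from itertools import product

# Placeholder identifiers for the 15 synthesized sentences.
sentences = [f"sentence_{i:02d}" for i in range(1, 16)]
reference_styles = ["neutral", "happy", "sad", "angry"]

# Every (sentence, reference style) combination a listener would rate.
trials = list(product(sentences, reference_styles))
print(len(trials))
```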
Content-Style Swap
[Audio demo: each sentence below is paired with each style: Neutral, Happy, Sad, Angry]
- Neutral: “love talking to random people. It always cheers me up and makes feel like I'm cheering someone up too.”
- Happy: “it thinks you're gorgeous, too, and yo...very good question. I'll have to find out and you'll be the first to know.”
- Sad: “I'm so bad about sticking to my plans. I just don't want people to start thinking I'm a flake.”
- Angry: “That's ridiculous! It's like they don't even give a damn about their customers.”
t-SNE visualization of the learned embeddings