Text-to-speech (TTS): Traditional approaches (before 2016) vs. neural TTS
Shuang Ma, Daniel McDuff, Yale Song, "Neural TTS Stylization with Adversarial and Collaborative Games", ICLR 2019


Text-to-speech (TTS)

Traditional approaches (before 2016):
- Handcrafted features; requires domain expertise
- Different modules are trained independently, so errors may compound
- Substantial engineering effort required when building a new model

Neural TTS:
- Learns directly from text-speech pairs, end-to-end
- Made possible by training neural networks with large amounts of data
- Examples: WaveNet [DeepMind, 2016], Tacotron [Google, 2017], Deep Voice [Baidu, 2017]

Current state of neural TTS
- Difficult to control the style of speech, e.g., identity, emotion
  - May negatively impact user experience
- Deep Voice 3 for multi-speaker neural TTS (Baidu) [ICLR 2018]
  - 2,000 identities!
  - Requires categorized samples of identities
  - Does not generalize to new identities
- Global Style Tokens (Google) [ICML 2018]
  - Does not require categorized training data
  - Style is extracted on-the-fly from a reference audio clip
  - Can also control prosodic styles (accents, intonation, etc.)

Tacotron [Interspeech 2017]

Tacotron-GST [ICML 2018]
- Style token layer = bottleneck
- A, B, C, D: basis vectors; [0.2, 0.1, 0.3, 0.4]: coefficients
- The coefficient-weighted combination of the basis vectors gives the style embedding (sketched below)
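To make the token mechanism concrete, here is a minimal PyTorch sketch of a style-token layer: the learned tokens play the role of the basis vectors A, B, C, D, and softmaxed attention scores play the role of the coefficients. All module and variable names are illustrative assumptions, not the paper's code, and the real GST layer uses multi-head attention rather than this single-head simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Minimal GST-style layer: a bank of learned 'basis' tokens (A, B, C, D)
    combined by attention coefficients into a single style embedding."""

    def __init__(self, num_tokens=4, token_dim=256, ref_dim=128):
        super().__init__()
        # Learned basis vectors (the style tokens A, B, C, D).
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Projects the reference-audio encoding into a query for attention.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_encoding):
        # ref_encoding: (batch, ref_dim), e.g. the output of a reference encoder.
        query = self.query_proj(ref_encoding)        # (batch, token_dim)
        scores = query @ self.tokens.t()             # (batch, num_tokens)
        coeffs = F.softmax(scores, dim=-1)           # e.g. [0.2, 0.1, 0.3, 0.4]
        style_embedding = coeffs @ self.tokens       # weighted sum of basis vectors
        return style_embedding, coeffs

# Usage: one 128-dim reference encoding -> one 256-dim style embedding.
layer = StyleTokenLayer()
emb, w = layer(torch.randn(1, 128))
```

Because the embedding must pass through this small set of tokens, the layer acts as a bottleneck: it can only carry a low-dimensional mixture of the learned bases.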

Limitations of Tacotron-GST
- The model is trained with a reconstruction loss alone
- The style embedding is therefore underconstrained
- Content information can "leak into" the style embedding
- We propose a new training technique to improve Tacotron-GST
  - Improves the ability to disentangle content and style from reference audio

What we've changed:
- Triplet input formulation: paired (text, audio) samples, plus an extra audio sample unpaired with the text (see the sampling sketch below)
- Adversarial training: ensures the realism of synthesized samples; no reconstruction required
- Style loss: borrows ideas from image style transfer [Gatys et al., CVPR 2016] and applies them in the mel-spectrogram domain
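A minimal sketch of how such a triplet batch could be assembled, assuming `dataset` is a list of (text, audio) pairs; the function name and sampling scheme are assumptions for illustration, not the authors' code:

```python
import random

def make_triplet_batch(dataset, batch_size):
    """Illustrative triplet sampling: each training example combines a
    matched (text, audio) pair with a second audio clip drawn
    independently, so its content and style need not match the text."""
    batch = []
    for _ in range(batch_size):
        text, paired_audio = random.choice(dataset)   # matched (text, audio) pair
        _, unpaired_audio = random.choice(dataset)    # audio only; its text is ignored
        batch.append((text, paired_audio, unpaired_audio))
    return batch
```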

Model Architecture
[Figure: the baseline Tacotron-GST [ICML 2018] pipeline. Text passes through a character embedding into an attention RNN decoder; the paired audio is encoded by CNN1; training uses a reconstruction loss.]

Model Architecture
[Figure: the same pipeline with an extra "unpaired" audio input feeding CNN1 alongside the paired audio.]
- Introduce an extra "unpaired" audio sample
- It may contain different content and style than the paired audio input
- How do we ensure the correctness of the synthesized output?

Model Architecture
[Figure: the pipeline with a discriminator added on the synthesized output, contributing an adversarial loss alongside the reconstruction loss.]
- Use a discriminator that checks the realism of the output (GAN training; a generic sketch follows)
- But content information can still leak into the style embedding
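As a generic illustration of this GAN objective (a standard non-saturating formulation; the paper's exact loss and discriminator design may differ), assuming `disc` maps a batch of mel-spectrograms to realism logits:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_mel, fake_mel):
    """The discriminator learns to score real mel-spectrograms high
    and synthesized ones low."""
    real_logits = disc(real_mel)
    fake_logits = disc(fake_mel.detach())   # don't backprop into the generator
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_adv_loss(disc, fake_mel):
    """Non-saturating generator loss: try to fool the discriminator."""
    fake_logits = disc(fake_mel)
    return F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
```

Realism alone is not enough, though: a realistic output can still mix up which information came from the text and which from the reference audio, which is why a further constraint is needed.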

Architecture Overview
[Figure: the full model. Reconstruction loss on the paired path, adversarial loss via the discriminator, and a style loss computed by a second CNN (CNN2) on the unpaired path.]
- Introduce another CNN (CNN2)
- Compute the style loss on mel-spectrogram images! (a schematic training step follows)
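Schematically, a single training step could then combine the three signals as below. This is a rough sketch under assumed names (`model`, `disc`, `cnn2`, the lambda weights); it reuses `generator_adv_loss` from the previous sketch and the `style_loss` sketched after the next slide, and the paper's adversarial and collaborative game formulation has more structure than this single objective.

```python
import torch.nn.functional as F

def training_step(model, disc, cnn2, text, paired_mel, unpaired_mel,
                  lambda_adv=1.0, lambda_style=1.0):
    # Paired path: the synthesized mel-spectrogram should reconstruct the target.
    mel_paired = model(text, style_ref=paired_mel)
    recon = F.l1_loss(mel_paired, paired_mel)

    # Unpaired path: no ground-truth target exists, so rely on the adversarial
    # loss (realism) and the style loss (match the reference's prosodic statistics).
    mel_unpaired = model(text, style_ref=unpaired_mel)
    adv = generator_adv_loss(disc, mel_unpaired)
    style = style_loss(cnn2, mel_unpaired, unpaired_mel)

    return recon + lambda_adv * adv + lambda_style * style
```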

Style loss on mel-spectrograms
- Mel-spectrograms represent the short-term power spectrum of sound
- The Gram matrix of feature maps extracted from mel-spectrograms captures local variations of prosody in the time-frequency domain
- See also: a temporary reduction in the average fundamental frequency significantly correlates with sarcasm expression [Cheang and Pell, 2008]
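A minimal sketch of such a Gram-matrix style loss, following Gatys et al. but applied to mel-spectrograms treated as single-channel images. Here `cnn2` is assumed to return a list of feature maps from several layers; all names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of CNN feature maps: channel-by-channel correlations,
    which capture local time-frequency statistics rather than exact content.
    feat: (batch, channels, freq_bins, time_frames)."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)   # (batch, c, c)

def style_loss(cnn2, synth_mel, ref_mel):
    """Match second-order feature statistics of the synthesized output to
    those of the reference clip across several CNN2 layers."""
    loss = 0.0
    for f_synth, f_ref in zip(cnn2(synth_mel), cnn2(ref_mel)):
        loss = loss + F.mse_loss(gram_matrix(f_synth), gram_matrix(f_ref))
    return loss
```

Because the Gram matrix discards the positions of features and keeps only their co-occurrence statistics, it penalizes mismatched prosodic texture while remaining largely insensitive to the spoken content.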

Emotion Style Transfer
[Audio demo grid: reference audio clips in four styles (neutral, happy, sad, angry) applied to input sentences, comparing GST with ours.]
Input sentences:
- "no no no, you see, the key to being healthy is being happy! and cookies make you happy."
- "This is why I want you gone ninety percent of the time."
- "Dude, you never texted me!!"

Evaluation: Emotion Style Transfer
- Both GST and our model synthesize 15 sentences: 10 from the test set, 5 from the web
- All 4 reference audio clips (neutral, happy, sad, angry) are unseen during training
- Seven human subjects each listen to all 60 permutations of samples
- Our model is rated significantly closer to the reference (p ≪ 0.001 overall)
  - neutral: p = 0.01; happy: p ≪ 0.001; sad: p ≪ 0.001; angry: p ≪ 0.001
- Audio naturalness test using Mean Opinion Score (MOS; higher is better)
  - Ours: 4.3; Tacotron-GST: 4.0; Tacotron: 3.82

Content-Style Swap
[Audio demo grid: each sentence synthesized in each of the four styles (neutral, happy, sad, angry).]
- Neutral: "love talking to random people. It always cheers me up and makes feel like I'm cheering someone up too."
- Happy: "it thinks you're gorgeous, too, and yo...very good question. I'll have to find out and you'll be the first to know."
- Sad: "I'm so bad about sticking to my plans. I just don't want people to start thinking I'm a flake."
- Angry: "That's ridiculous! It's like they don't even give a damn about their customers."

t-SNE visualization of the learned embeddings