Text-to-speech (TTS): Traditional approaches (before 2016) vs. neural TTS
Shuang Ma, Daniel McDuff, Yale Song, "Neural TTS Stylization with Adversarial and Collaborative Games", ICLR 2019


Text-to-speech (TTS)

Traditional approaches (before 2016):
- Handcrafted features; requires domain expertise
- Different modules are trained independently, so errors may compound
- Substantial engineering effort required when building a new model

Neural TTS:
- Learns directly from text-speech pairs, end-to-end
- Made possible by training neural networks with large amounts of data
- Examples: WaveNet [DeepMind, 2016], Tacotron [Google, 2017], Deep Voice [Baidu, 2017]

Current state of neural TTS
- Difficult to control the style of speech, e.g., identity, emotion
  - May negatively impact user experience
- Deep Voice 3 for multi-speaker neural TTS (Baidu) [ICLR 2018]
  - 2,000 identities!
  - Requires categorized samples of identities
  - Does not generalize to new identities
- Global Style Tokens (Google) [ICML 2018]
  - Does not require categorized training data
  - Style is extracted on-the-fly from a reference audio clip
  - Can also control prosodic styles (accents, intonation, etc.)

Tacotron [Interspeech 2017]

Tacotron-GST [ICML 2018]
- Style token layer = bottleneck
- A, B, C, D: basis vectors; [0.2, 0.1, 0.3, 0.4]: coefficients
- The coefficient-weighted combination of the basis vectors gives the style embedding (sketched below)
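To make the token mechanism concrete, here is a minimal PyTorch sketch of a style-token layer: the learned tokens play the role of the basis vectors A, B, C, D, and softmaxed attention scores play the role of the coefficients. All module and variable names are illustrative assumptions, not the paper's code, and the real GST layer uses multi-head attention rather than this single-head simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Minimal GST-style layer: a bank of learned 'basis' tokens (A, B, C, D)
    combined by attention coefficients into a single style embedding."""

    def __init__(self, num_tokens=4, token_dim=256, ref_dim=128):
        super().__init__()
        # Learned basis vectors (the style tokens A, B, C, D).
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Projects the reference-audio encoding into a query for attention.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_encoding):
        # ref_encoding: (batch, ref_dim), e.g. the output of a reference encoder.
        query = self.query_proj(ref_encoding)        # (batch, token_dim)
        scores = query @ self.tokens.t()             # (batch, num_tokens)
        coeffs = F.softmax(scores, dim=-1)           # e.g. [0.2, 0.1, 0.3, 0.4]
        style_embedding = coeffs @ self.tokens       # weighted sum of basis vectors
        return style_embedding, coeffs

# Usage: one 128-dim reference encoding -> one 256-dim style embedding.
layer = StyleTokenLayer()
emb, w = layer(torch.randn(1, 128))
```

Because the embedding must pass through this small set of tokens, the layer acts as a bottleneck: it can only carry a low-dimensional mixture of the learned bases.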

Limitations of Tacotron-GST
- The model is trained with a reconstruction loss alone
- The style embedding is therefore underconstrained
- Content information can "leak into" the style embedding
- We propose a new training technique to improve Tacotron-GST
  - Improves the ability to disentangle content and style from reference audio

What we've changed:
- Triplet input formulation: paired (text, audio) samples, plus an extra audio sample unpaired with the text (see the sampling sketch below)
- Adversarial training: ensures the realism of synthesized samples; no reconstruction required
- Style loss: borrows ideas from image style transfer [Gatys et al., CVPR 2016] and applies them in the mel-spectrogram domain
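A minimal sketch of how such a triplet batch could be assembled, assuming `dataset` is a list of (text, audio) pairs; the function name and sampling scheme are assumptions for illustration, not the authors' code:

```python
import random

def make_triplet_batch(dataset, batch_size):
    """Illustrative triplet sampling: each training example combines a
    matched (text, audio) pair with a second audio clip drawn
    independently, so its content and style need not match the text."""
    batch = []
    for _ in range(batch_size):
        text, paired_audio = random.choice(dataset)   # matched (text, audio) pair
        _, unpaired_audio = random.choice(dataset)    # audio only; its text is ignored
        batch.append((text, paired_audio, unpaired_audio))
    return batch
```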

Model Architecture
[Figure: the baseline Tacotron-GST [ICML 2018] pipeline. Text passes through a character embedding into an attention RNN decoder; the paired audio is encoded by CNN1; training uses a reconstruction loss.]

Model Architecture
[Figure: the same pipeline with an extra "unpaired" audio input feeding CNN1 alongside the paired audio.]
- Introduce an extra "unpaired" audio sample
- It may contain different content and style than the paired audio input
- How do we ensure the correctness of the synthesized output?

Model Architecture
[Figure: the pipeline with a discriminator added on the synthesized output, contributing an adversarial loss alongside the reconstruction loss.]
- Use a discriminator that checks the realism of the output (GAN training; a generic sketch follows)
- But content information can still leak into the style embedding
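As a generic illustration of this GAN objective (a standard non-saturating formulation; the paper's exact loss and discriminator design may differ), assuming `disc` maps a batch of mel-spectrograms to realism logits:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_mel, fake_mel):
    """The discriminator learns to score real mel-spectrograms high
    and synthesized ones low."""
    real_logits = disc(real_mel)
    fake_logits = disc(fake_mel.detach())   # don't backprop into the generator
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_adv_loss(disc, fake_mel):
    """Non-saturating generator loss: try to fool the discriminator."""
    fake_logits = disc(fake_mel)
    return F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
```

Realism alone is not enough, though: a realistic output can still mix up which information came from the text and which from the reference audio, which is why a further constraint is needed.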

Architecture Overview
[Figure: the full model. Reconstruction loss on the paired path, adversarial loss via the discriminator, and a style loss computed by a second CNN (CNN2) on the unpaired path.]
- Introduce another CNN (CNN2)
- Compute the style loss on mel-spectrogram images! (a schematic training step follows)
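Schematically, a single training step could then combine the three signals as below. This is a rough sketch under assumed names (`model`, `disc`, `cnn2`, the lambda weights); it reuses `generator_adv_loss` from the previous sketch and the `style_loss` sketched after the next slide, and the paper's adversarial and collaborative game formulation has more structure than this single objective.

```python
import torch.nn.functional as F

def training_step(model, disc, cnn2, text, paired_mel, unpaired_mel,
                  lambda_adv=1.0, lambda_style=1.0):
    # Paired path: the synthesized mel-spectrogram should reconstruct the target.
    mel_paired = model(text, style_ref=paired_mel)
    recon = F.l1_loss(mel_paired, paired_mel)

    # Unpaired path: no ground-truth target exists, so rely on the adversarial
    # loss (realism) and the style loss (match the reference's prosodic statistics).
    mel_unpaired = model(text, style_ref=unpaired_mel)
    adv = generator_adv_loss(disc, mel_unpaired)
    style = style_loss(cnn2, mel_unpaired, unpaired_mel)

    return recon + lambda_adv * adv + lambda_style * style
```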

Style loss on mel-spectrograms
- Mel-spectrograms represent the short-term power spectrum of sound
- The Gram matrix of feature maps extracted from mel-spectrograms captures local variations of prosody in the time-frequency domain
- See also: a temporary reduction in the average fundamental frequency significantly correlates with sarcasm expression [Cheang and Pell, 2008]
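A minimal sketch of such a Gram-matrix style loss, following Gatys et al. but applied to mel-spectrograms treated as single-channel images. Here `cnn2` is assumed to return a list of feature maps from several layers; all names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of CNN feature maps: channel-by-channel correlations,
    which capture local time-frequency statistics rather than exact content.
    feat: (batch, channels, freq_bins, time_frames)."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)   # (batch, c, c)

def style_loss(cnn2, synth_mel, ref_mel):
    """Match second-order feature statistics of the synthesized output to
    those of the reference clip across several CNN2 layers."""
    loss = 0.0
    for f_synth, f_ref in zip(cnn2(synth_mel), cnn2(ref_mel)):
        loss = loss + F.mse_loss(gram_matrix(f_synth), gram_matrix(f_ref))
    return loss
```

Because the Gram matrix discards the positions of features and keeps only their co-occurrence statistics, it penalizes mismatched prosodic texture while remaining largely insensitive to the spoken content.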

Emotion Style Transfer
[Audio demo grid: reference audio clips in four styles (neutral, happy, sad, angry) applied to input sentences, comparing GST with ours.]
Input sentences:
- "no no no, you see, the key to being healthy is being happy! and cookies make you happy."
- "This is why I want you gone ninety percent of the time."
- "Dude, you never texted me!!"

Evaluation: Emotion Style Transfer
- Both GST and our model synthesize 15 sentences: 10 from the test set, 5 from the web
- All 4 reference audio clips (neutral, happy, sad, angry) are unseen during training
- Seven human subjects each listen to all 60 permutations of samples
- Our model is rated significantly closer to the reference (p ≪ 0.001 overall)
  - neutral: p = 0.01; happy: p ≪ 0.001; sad: p ≪ 0.001; angry: p ≪ 0.001
- Audio naturalness test using Mean Opinion Score (MOS; higher is better)
  - Ours: 4.3; Tacotron-GST: 4.0; Tacotron: 3.82

Content-Style Swap
[Audio demo grid: each sentence synthesized in each of the four styles (neutral, happy, sad, angry).]
- Neutral: "love talking to random people. It always cheers me up and makes feel like I'm cheering someone up too."
- Happy: "it thinks you're gorgeous, too, and yo...very good question. I'll have to find out and you'll be the first to know."
- Sad: "I'm so bad about sticking to my plans. I just don't want people to start thinking I'm a flake."
- Angry: "That's ridiculous! It's like they don't even give a damn about their customers."

t-SNE visualization of the learned embeddings