Neural Speech Synthesis with Transformer Network

Presentation transcript:

Neural Speech Synthesis with Transformer Network
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu
University of Electronic Science and Technology of China / Microsoft Research Asia / Microsoft STC Asia

Tacotron2
A neural network architecture for speech synthesis directly from text.
- 3-layer CNN: extracts a longer-term text context.
- Bi-directional LSTM: encoder.
- Location-sensitive attention: connects encoder and decoder.
- Decoder pre-net: a 2-layer fully connected network.
- 2-layer LSTM: decoder.
- Mel linear: a fully connected layer that generates mel spectrogram frames.
- Stop linear: a fully connected layer that predicts the stop token for each frame.
- Post-net: a 5-layer CNN with a residual connection that refines the mel spectrogram (see the sketch below).
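
As a rough illustration of the last component, here is a minimal PyTorch-style sketch of a Tacotron2-like post-net: a 5-layer 1-D CNN whose output is added back to the coarse mel prediction as a residual correction. The channel sizes, kernel width, and the `PostNet` name are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PostNet(nn.Module):
    """5-layer 1-D CNN predicting a residual refinement of the mel spectrogram.

    Hypothetical sizes: 80 mel channels, 512 hidden channels, kernel width 5.
    """
    def __init__(self, n_mels=80, channels=512, kernel=5, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_mels
        for i in range(n_layers):
            out_ch = n_mels if i == n_layers - 1 else channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
                       nn.BatchNorm1d(out_ch)]
            if i < n_layers - 1:
                layers.append(nn.Tanh())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (batch, n_mels, frames); the CNN output is added as a residual
        return mel + self.net(mel)
```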

Transformer
A sequence-to-sequence network based solely on attention mechanisms.
- Encoder: 6 blocks.
- Decoder: 6 blocks.
- Positional embeddings: add positional information (PE) to the input embeddings.
- (Masked) multi-head attention: splits each Q, K and V into 8 heads, calculates attention contexts respectively, then concatenates the 8 context vectors (see the sketch below).
- FFN: feed-forward network, 2 fully connected layers.
- Add & Norm: residual connection and layer normalization.
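
To make the "split into 8 heads, attend, then concatenate" step concrete, here is a minimal sketch of multi-head scaled dot-product attention. The tensor shapes and the `multi_head_attention` name are illustrative assumptions, and the per-head linear projections (W_q, W_k, W_v, W_o) of the full Transformer are omitted for brevity.

```python
import math
import torch

def multi_head_attention(Q, K, V, n_heads=8, mask=None):
    """Minimal multi-head attention sketch.

    Q, K, V: (batch, seq_len, d_model) with d_model divisible by n_heads.
    mask: optional boolean tensor, True at positions to be masked
          (e.g. future frames in the decoder).
    """
    B, T, d_model = Q.shape
    d_head = d_model // n_heads

    # Split each of Q, K, V into n_heads heads: (B, n_heads, seq_len, d_head)
    def split(x):
        return x.view(B, -1, n_heads, d_head).transpose(1, 2)

    q, k, v = split(Q), split(K), split(V)

    # Scaled dot-product attention, computed per head
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (B, h, T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    context = torch.softmax(scores, dim=-1) @ v             # (B, h, T_q, d_head)

    # Concatenate the 8 context vectors back to (B, T_q, d_model)
    return context.transpose(1, 2).contiguous().view(B, -1, d_model)
```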

Neural TTS with Transformer
Why apply the Transformer in TTS?
- Parallel training: frames of an input sequence can be provided in parallel.
- Long-range dependencies: self-attention injects the global context of the whole sequence into each input frame.

Neural TTS with Transformer

Neural TTS with Transformer
Text-to-Phoneme Converter
- Difficult to learn all the regularities without sufficient training data.
- Some exceptions have too few occurrences for neural networks to learn.
- Therefore text is converted into phonemes by rule (see the sketch below), e.g.:
  "This is the waistline." → dh . ih . s / ih . z / dh . ax / w . ey . s . t - l . ay . n / punc.
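
A minimal sketch of what such a rule/lexicon-based converter might look like, assuming a hypothetical `LEXICON` dictionary and reusing the slide's own example sentence; the real converter's rules and phone set are not described beyond this example.

```python
# Hypothetical lexicon; entries follow the phone notation used on the slide.
LEXICON = {
    "this":      "dh . ih . s",
    "is":        "ih . z",
    "the":       "dh . ax",
    "waistline": "w . ey . s . t - l . ay . n",
}

def text_to_phonemes(text):
    """Convert normalized text to a phoneme string with word boundaries ('/')
    and a trailing punctuation marker, mirroring the slide's example."""
    words = text.lower().rstrip(".!?").split()
    phones = [LEXICON[w] for w in words]   # real systems fall back to rules for OOV words
    return " / ".join(phones) + " / punc."

print(text_to_phonemes("This is the waistline."))
# dh . ih . s / ih . z / dh . ax / w . ey . s . t - l . ay . n / punc.
```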

Neural TTS with Transformer
Scaled Positional Encoding
- The Transformer adds information about relative or absolute position by adding a positional embedding (PE) to the input embeddings.
- In the TTS scenario, texts and mel spectrograms may have different scales, so scale-fixed positional embeddings may impose heavy constraints on both the encoder and decoder pre-nets.
- Solution: add a trainable weight to the positional embeddings (see the sketch below):
  x_i = prenet(phoneme_i) + α · PE(i)
- The positional embeddings can then adaptively fit the scales of both the encoder and decoder pre-nets' outputs.
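
A minimal PyTorch-style sketch of the scaled sinusoidal positional encoding, assuming the standard Transformer sine/cosine PE; the single trainable scalar `alpha` corresponds to α in the formula above, while the class name and default sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal PE with a trainable scalar weight alpha (one per pre-net)."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))          # trainable scale α
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))       # (1, max_len, d_model)

    def forward(self, x):
        # x: pre-net output, (batch, seq_len, d_model)
        return x + self.alpha * self.pe[:, : x.size(1)]
```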

Neural TTS with Transformer
Pre-nets of Encoder and Decoder
- Similar structure and function as in Tacotron2.
- An additional linear projection is appended: positional embeddings lie in [−1, 1], while after ReLU the pre-net outputs lie in [0, +∞).
- The projection re-centers the outputs to a 0-centered range (see the sketch below).
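
Here is a minimal sketch of a decoder pre-net with the extra linear projection appended after the final ReLU, so its output is 0-centered before the positional embedding is added; layer sizes and the `DecoderPreNet` name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DecoderPreNet(nn.Module):
    """2-layer fully connected pre-net (as in Tacotron2) plus a final linear
    projection that is not followed by ReLU, so its output is 0-centered and
    better matches the [-1, 1] range of the positional embeddings."""
    def __init__(self, n_mels=80, hidden=256, d_model=512, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(n_mels, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.proj = nn.Linear(hidden, d_model)   # additional projection, no ReLU
        self.dropout = nn.Dropout(dropout)

    def forward(self, mel_frame):
        x = self.dropout(torch.relu(self.fc1(mel_frame)))
        x = self.dropout(torch.relu(self.fc2(x)))
        return self.proj(x)                      # 0-centered output
```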

Neural TTS with Transformer
Encoder and Decoder
- Model the frame relationships in multiple different aspects.
- Inject the global context of the whole sequence into each frame.
- Enable parallel computing to improve training speed.

Neural TTS with Transformer
Stop linear
- Predicts whether inference should stop at the current frame.
- Otherwise, unstoppable inference may occur: during training, each sequence has only one "stop" frame against hundreds of "continue" frames, and this imbalance between positive and negative samples results in a biased stop-linear layer.
- Solution: impose a weight (5.0 ∼ 8.0) on the "stop" token when calculating the binary cross-entropy loss during training (see the sketch below).
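
A minimal sketch of that weighted stop-token loss in PyTorch: `pos_weight` plays the role of the 5.0–8.0 weight on the "stop" class, and the value 6.0 here is just an illustrative choice within that range.

```python
import torch
import torch.nn as nn

# Binary cross-entropy where positive ("stop") frames are up-weighted,
# counteracting the one-stop-vs-hundreds-of-continues imbalance.
stop_loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([6.0]))

# stop_logits:  (batch, n_frames) raw outputs of the stop-linear layer
# stop_targets: (batch, n_frames) with 1.0 at the final frame, 0.0 elsewhere
stop_logits = torch.randn(2, 400)
stop_targets = torch.zeros(2, 400)
stop_targets[:, -1] = 1.0

loss = stop_loss_fn(stop_logits, stop_targets)
```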

Experiment
Training Setup
- 4 Nvidia Tesla P100 GPUs.
- Internal US English female dataset: 25 hours of professional speech, 17,584 text-wave pairs.
- Dynamic batch size: a varying number of samples per batch, with a fixed total number of mel spectrogram frames (see the sketch below).
Text-to-Phoneme Conversion and Pre-process
- Phoneme types: normal phonemes, word boundaries, syllable boundaries, punctuations.
- Process pipeline: sentence separation → text normalization → word segmentation → obtaining pronunciation.
WaveNet Settings
- Sample rate: 16000 Hz.
- Frame rate (frames per second): 80.
- 2 QRNN layers, 20 dilated layers.
- Residual and dilation channel size: 256.
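
A minimal sketch of dynamic batching under the stated "varying sample count, fixed total frame budget" reading; the `frame_budget` value and the function name are assumptions for illustration only.

```python
def dynamic_batches(samples, frame_budget=8000):
    """Group (phonemes, mel) pairs so each batch holds a roughly fixed total
    number of mel frames, letting the number of samples per batch vary.

    samples: iterable of dicts with a "mel" array of shape (n_frames, n_mels).
    """
    batch, frames = [], 0
    for sample in samples:
        n = len(sample["mel"])
        if batch and frames + n > frame_budget:
            yield batch                 # emit the current batch, start a new one
            batch, frames = [], 0
        batch.append(sample)
        frames += n
    if batch:
        yield batch
```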

Experiment
Training Time Comparison
                                 Tacotron2    Transformer
  Single step (batch size ≈ 16)   ~1.7 s       ~0.4 s
  Total time                      ~4.5 days    ~3 days

Inference Time Comparison
                                 Tacotron2    Transformer
  Synthesize 1 s of spectrogram   ~0.13 s      ~0.36 s

Evaluation
Evaluation setup
- 38 fixed examples with various lengths.
- Each audio is listened to by at least 20 testers (vs. 8 testers in Shen et al. (2017)).
- Each tester listens to fewer than 30 audios.
Results
- Baseline model: Tacotron2 using the phoneme sequence as input; the other structures are the same as Google's version.
- CMOS: comparison mean opinion score. Testers listen to two audios each time and evaluate how the latter feels compared to the former, using a score in [−3, 3] with intervals of 1 (see the sketch below).
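
As a small illustration of how a CMOS value could be aggregated from such pairwise ratings, here is a sketch; the function name and the example scores are made up, only the [−3, 3] scale and the averaging come from the slide.

```python
def cmos(scores):
    """Average of pairwise comparison scores in {-3, ..., 3}, where a positive
    value means the second (latter) system is preferred over the first."""
    assert all(-3 <= s <= 3 for s in scores)
    return sum(scores) / len(scores)

# Hypothetical ratings from a few testers comparing baseline vs. proposed model
print(cmos([1, 0, 2, 1, 0, 1]))  # 0.833...
```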

Evaluation
Generated sample comparison: our model vs. Tacotron2 (audio samples not reproduced here).
More samples at https://neuraltts.github.io/transformertts/

Evaluation
Mel spectrogram details
- Our model does better at reproducing the high-frequency region.

Ablation Studies
Re-centering the Pre-net's Output
- Re-project the pre-nets' outputs so their center is consistent with the positional embeddings.
- Center-consistent PE performs slightly better.

Ablation Studies
Different Positional Encoding Methods
- The final positional embedding scales of the encoder and decoder turn out to be different.
- The trainable scale performs slightly better. Likely reasons: the constraint on the encoder and decoder pre-nets is relaxed, and the positional information is more adaptive to the different embedding spaces.

Ablation Studies
Different Hyper-Parameters: impact of the number of layers
- For encoder-decoder attention, only alignments from certain heads of the first 2 layers are interpretable diagonal lines (attention alignment plots not reproduced; axes: decoder input time step vs. encoder input time step).
- More layers can still refine the synthesized mel spectrogram and improve audio quality.
- The reason is that, with residual connections between layers, the model fits the target transformation in a Taylor-expansion-like way: the low-order starting terms account for most of the result, while the subsequent ones refine the function.

Ablation Studies
Different Hyper-Parameters: impact of the number of heads
- Reducing the number of heads harms performance.
- Table caption: comparison of time consumed (in seconds) per training step for different layer and head numbers (tested on 4 GPUs with dynamic batch size).

Thank you!