Neural Speech Synthesis with Transformer Network

Presentation transcript:

Neural Speech Synthesis with Transformer Network
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu
University of Electronic Science and Technology of China / Microsoft Research Asia / Microsoft STC Asia

Tacotron2
A neural network architecture for speech synthesis directly from text.
- 3-layer CNN: extracts a longer-term text context.
- Bi-directional LSTM: encoder.
- Location-sensitive attention: connects encoder and decoder.
- Decoder pre-net: a 2-layer fully connected network.
- 2-layer LSTM: decoder.
- Mel linear: a fully connected layer that generates mel spectrogram frames.
- Stop linear: a fully connected layer that predicts the stop token for each frame.
- Post-net: a 5-layer CNN with a residual connection that refines the mel spectrogram (see the sketch below).
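
As a rough illustration of the last component, here is a minimal PyTorch-style sketch of a Tacotron2-like post-net: a 5-layer 1-D CNN whose output is added back to the coarse mel prediction as a residual correction. The channel sizes, kernel width, and the `PostNet` name are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PostNet(nn.Module):
    """5-layer 1-D CNN predicting a residual refinement of the mel spectrogram.

    Hypothetical sizes: 80 mel channels, 512 hidden channels, kernel width 5.
    """
    def __init__(self, n_mels=80, channels=512, kernel=5, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_mels
        for i in range(n_layers):
            out_ch = n_mels if i == n_layers - 1 else channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
                       nn.BatchNorm1d(out_ch)]
            if i < n_layers - 1:
                layers.append(nn.Tanh())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (batch, n_mels, frames); the CNN output is added as a residual
        return mel + self.net(mel)
```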

Transformer
A sequence-to-sequence network based solely on attention mechanisms.
- Encoder: 6 blocks.
- Decoder: 6 blocks.
- Positional embeddings: add positional information (PE) to the input embeddings.
- (Masked) multi-head attention: splits each Q, K and V into 8 heads, calculates attention contexts respectively, then concatenates the 8 context vectors (see the sketch below).
- FFN: feed-forward network, 2 fully connected layers.
- Add & Norm: residual connection and layer normalization.
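
To make the "split into 8 heads, attend, then concatenate" step concrete, here is a minimal sketch of multi-head scaled dot-product attention. The tensor shapes and the `multi_head_attention` name are illustrative assumptions, and the per-head linear projections (W_q, W_k, W_v, W_o) of the full Transformer are omitted for brevity.

```python
import math
import torch

def multi_head_attention(Q, K, V, n_heads=8, mask=None):
    """Minimal multi-head attention sketch.

    Q, K, V: (batch, seq_len, d_model) with d_model divisible by n_heads.
    mask: optional boolean tensor, True at positions to be masked
          (e.g. future frames in the decoder).
    """
    B, T, d_model = Q.shape
    d_head = d_model // n_heads

    # Split each of Q, K, V into n_heads heads: (B, n_heads, seq_len, d_head)
    def split(x):
        return x.view(B, -1, n_heads, d_head).transpose(1, 2)

    q, k, v = split(Q), split(K), split(V)

    # Scaled dot-product attention, computed per head
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (B, h, T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    context = torch.softmax(scores, dim=-1) @ v             # (B, h, T_q, d_head)

    # Concatenate the 8 context vectors back to (B, T_q, d_model)
    return context.transpose(1, 2).contiguous().view(B, -1, d_model)
```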

Neural TTS with Transformer
Why apply the Transformer in TTS?
- Parallel training: frames of an input sequence can be provided in parallel.
- Long-range dependencies: self-attention injects the global context of the whole sequence into each input frame.

Neural TTS with Transformer

Neural TTS with Transformer
Text-to-Phoneme Converter
- Difficult to learn all the regularities without sufficient training data.
- Some exceptions have too few occurrences for neural networks to learn.
- Therefore text is converted into phonemes by rule (see the sketch below), e.g.:
  "This is the waistline." → dh . ih . s / ih . z / dh . ax / w . ey . s . t - l . ay . n / punc.
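
A minimal sketch of what such a rule/lexicon-based converter might look like, assuming a hypothetical `LEXICON` dictionary and reusing the slide's own example sentence; the real converter's rules and phone set are not described beyond this example.

```python
# Hypothetical lexicon; entries follow the phone notation used on the slide.
LEXICON = {
    "this":      "dh . ih . s",
    "is":        "ih . z",
    "the":       "dh . ax",
    "waistline": "w . ey . s . t - l . ay . n",
}

def text_to_phonemes(text):
    """Convert normalized text to a phoneme string with word boundaries ('/')
    and a trailing punctuation marker, mirroring the slide's example."""
    words = text.lower().rstrip(".!?").split()
    phones = [LEXICON[w] for w in words]   # real systems fall back to rules for OOV words
    return " / ".join(phones) + " / punc."

print(text_to_phonemes("This is the waistline."))
# dh . ih . s / ih . z / dh . ax / w . ey . s . t - l . ay . n / punc.
```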

Neural TTS with Transformer
Scaled Positional Encoding
- The Transformer adds information about relative or absolute position by adding a positional embedding (PE) to the input embeddings.
- In the TTS scenario, texts and mel spectrograms may have different scales, so scale-fixed positional embeddings may impose heavy constraints on both the encoder and decoder pre-nets.
- Solution: add a trainable weight to the positional embeddings (see the sketch below):
  x_i = prenet(phoneme_i) + α · PE(i)
- The positional embeddings can then adaptively fit the scales of both the encoder and decoder pre-nets' outputs.
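
A minimal PyTorch-style sketch of the scaled sinusoidal positional encoding, assuming the standard Transformer sine/cosine PE; the single trainable scalar `alpha` corresponds to α in the formula above, while the class name and default sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ScaledPositionalEncoding(nn.Module):
    """Sinusoidal PE with a trainable scalar weight alpha (one per pre-net)."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))          # trainable scale α
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))       # (1, max_len, d_model)

    def forward(self, x):
        # x: pre-net output, (batch, seq_len, d_model)
        return x + self.alpha * self.pe[:, : x.size(1)]
```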

Neural TTS with Transformer
Pre-nets of Encoder and Decoder
- Similar structure and function as in Tacotron2.
- An additional linear projection is appended: positional embeddings lie in [−1, 1], while after ReLU the pre-net outputs lie in [0, +∞).
- The projection re-centers the outputs to a 0-centered range (see the sketch below).
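
Here is a minimal sketch of a decoder pre-net with the extra linear projection appended after the final ReLU, so its output is 0-centered before the positional embedding is added; layer sizes and the `DecoderPreNet` name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DecoderPreNet(nn.Module):
    """2-layer fully connected pre-net (as in Tacotron2) plus a final linear
    projection that is not followed by ReLU, so its output is 0-centered and
    better matches the [-1, 1] range of the positional embeddings."""
    def __init__(self, n_mels=80, hidden=256, d_model=512, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(n_mels, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.proj = nn.Linear(hidden, d_model)   # additional projection, no ReLU
        self.dropout = nn.Dropout(dropout)

    def forward(self, mel_frame):
        x = self.dropout(torch.relu(self.fc1(mel_frame)))
        x = self.dropout(torch.relu(self.fc2(x)))
        return self.proj(x)                      # 0-centered output
```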

Neural TTS with Transformer
Encoder and Decoder
- Model the frame relationships in multiple different aspects.
- Inject the global context of the whole sequence into each frame.
- Enable parallel computing to improve training speed.

Neural TTS with Transformer
Stop linear
- Predicts whether inference should stop at the current frame.
- Otherwise, unstoppable inference may occur: during training, each sequence has only one "stop" frame against hundreds of "continue" frames, and this imbalance between positive and negative samples results in a biased stop-linear layer.
- Solution: impose a weight (5.0 ∼ 8.0) on the "stop" token when calculating the binary cross-entropy loss during training (see the sketch below).
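
A minimal sketch of that weighted stop-token loss in PyTorch: `pos_weight` plays the role of the 5.0–8.0 weight on the "stop" class, and the value 6.0 here is just an illustrative choice within that range.

```python
import torch
import torch.nn as nn

# Binary cross-entropy where positive ("stop") frames are up-weighted,
# counteracting the one-stop-vs-hundreds-of-continues imbalance.
stop_loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([6.0]))

# stop_logits:  (batch, n_frames) raw outputs of the stop-linear layer
# stop_targets: (batch, n_frames) with 1.0 at the final frame, 0.0 elsewhere
stop_logits = torch.randn(2, 400)
stop_targets = torch.zeros(2, 400)
stop_targets[:, -1] = 1.0

loss = stop_loss_fn(stop_logits, stop_targets)
```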

Experiment
Training Setup
- 4 Nvidia Tesla P100 GPUs.
- Internal US English female dataset: 25 hours of professional speech, 17,584 text-wave pairs.
- Dynamic batch size: a varying number of samples per batch, with a fixed total number of mel spectrogram frames (see the sketch below).
Text-to-Phoneme Conversion and Pre-process
- Phoneme types: normal phonemes, word boundaries, syllable boundaries, punctuations.
- Process pipeline: sentence separation → text normalization → word segmentation → obtaining pronunciation.
WaveNet Settings
- Sample rate: 16000 Hz.
- Frame rate (frames per second): 80.
- 2 QRNN layers, 20 dilated layers.
- Residual and dilation channel size: 256.
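
A minimal sketch of dynamic batching under the stated "varying sample count, fixed total frame budget" reading; the `frame_budget` value and the function name are assumptions for illustration only.

```python
def dynamic_batches(samples, frame_budget=8000):
    """Group (phonemes, mel) pairs so each batch holds a roughly fixed total
    number of mel frames, letting the number of samples per batch vary.

    samples: iterable of dicts with a "mel" array of shape (n_frames, n_mels).
    """
    batch, frames = [], 0
    for sample in samples:
        n = len(sample["mel"])
        if batch and frames + n > frame_budget:
            yield batch                 # emit the current batch, start a new one
            batch, frames = [], 0
        batch.append(sample)
        frames += n
    if batch:
        yield batch
```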

Experiment
Training Time Comparison
                                 Tacotron2    Transformer
  Single step (batch size ≈ 16)   ~1.7 s       ~0.4 s
  Total time                      ~4.5 days    ~3 days

Inference Time Comparison
                                 Tacotron2    Transformer
  Synthesize 1 s of spectrogram   ~0.13 s      ~0.36 s

Evaluation
Evaluation setup
- 38 fixed examples with various lengths.
- Each audio is listened to by at least 20 testers (vs. 8 testers in Shen et al. (2017)).
- Each tester listens to fewer than 30 audios.
Results
- Baseline model: Tacotron2 using the phoneme sequence as input; the other structures are the same as Google's version.
- CMOS: comparison mean opinion score. Testers listen to two audios each time and evaluate how the latter feels compared to the former, using a score in [−3, 3] with intervals of 1 (see the sketch below).
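
As a small illustration of how a CMOS value could be aggregated from such pairwise ratings, here is a sketch; the function name and the example scores are made up, only the [−3, 3] scale and the averaging come from the slide.

```python
def cmos(scores):
    """Average of pairwise comparison scores in {-3, ..., 3}, where a positive
    value means the second (latter) system is preferred over the first."""
    assert all(-3 <= s <= 3 for s in scores)
    return sum(scores) / len(scores)

# Hypothetical ratings from a few testers comparing baseline vs. proposed model
print(cmos([1, 0, 2, 1, 0, 1]))  # 0.833...
```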

Evaluation
Generated sample comparison: our model vs. Tacotron2 (audio samples not reproduced here).
More samples at https://neuraltts.github.io/transformertts/

Evaluation
Mel spectrogram details
- Our model does better at reproducing the high-frequency region.

Ablation Studies
Re-centering the Pre-net's Output
- Re-project the pre-nets' outputs so their center is consistent with the positional embeddings.
- Center-consistent PE performs slightly better.

Ablation Studies
Different Positional Encoding Methods
- The final positional embedding scales of the encoder and decoder turn out to be different.
- The trainable scale performs slightly better. Likely reasons: the constraint on the encoder and decoder pre-nets is relaxed, and the positional information is more adaptive to the different embedding spaces.

Ablation Studies
Different Hyper-Parameters: impact of the number of layers
- For encoder-decoder attention, only alignments from certain heads of the first 2 layers are interpretable diagonal lines (attention alignment plots not reproduced; axes: decoder input time step vs. encoder input time step).
- More layers can still refine the synthesized mel spectrogram and improve audio quality.
- The reason is that, with residual connections between layers, the model fits the target transformation in a Taylor-expansion-like way: the low-order starting terms account for most of the result, while the subsequent ones refine the function.

Ablation Studies
Different Hyper-Parameters: impact of the number of heads
- Reducing the number of heads harms performance.
- Table caption: comparison of time consumed (in seconds) per training step for different layer and head numbers (tested on 4 GPUs with dynamic batch size).

Thank you!