DNN-BASED SPEAKER-ADAPTIVE POSTFILTERING WITH LIMITED ADAPTATION DATA FOR STATISTICAL SPEECH SYNTHESIS SYSTEMS
Mirac Goksu Ozturk1, Okan Ulusoy1, Cenk Demiroglu2
Bogazici University1, Ozyegin University2, Istanbul, Turkey
Abstract

Deep neural networks (DNNs) have been successfully deployed for acoustic modelling in statistical parametric speech synthesis (SPSS) systems. Moreover, DNN-based postfilters (PFs) have been shown to outperform the conventional postfilters that are widely used in SPSS systems to increase the quality of synthesized speech. However, existing DNN-based postfilters are trained with speaker-dependent databases. Given that SPSS systems can rapidly adapt to new speakers from generic models, there is a need for DNN-based postfilters that can adapt to new speakers with minimal adaptation data. Here, we compare DNN-, RNN-, and CNN-based postfilters together with adversarial (GAN) training and cluster-based initialization (CI) for rapid adaptation. Results indicate that the feedforward (FF) DNN, together with GAN and CI, significantly outperforms the other recently proposed postfilters.

Introduction

Previously proposed DNN-based postfilters were trained for speaker-dependent systems where a large amount of data is available for a single speaker. A major advantage of the SPSS approach is its ability to adapt to different speakers with limited amounts of data. To retain the rapid-adaptation advantage of SPSS, postfilters that perform well with only a few seconds of adaptation data are needed. Three architectures are compared: FF, CNN, and RNN. The networks are trained with and without the GAN approach. Moreover, a cluster-based initialization method is used, where the postfilter is initialized with a model trained on speakers similar to the target speaker.

Methods

DNN Architectures (Postfilters)

Let $c = [c^T, \Delta c^T, \Delta^2 c^T]^T$ be the output vector of the baseline network model, where $c$, $\Delta c$, and $\Delta^2 c$ are the cepstral, delta-cepstral, and delta-delta-cepstral coefficients, and let $pn$ and $st$ be the phoneme and state information of the utterance. We use $T$, $M$, $P$, and $S$ to denote the number of frames in an utterance, the number of MGC coefficients, the size of the $pn$ vector, and the size of the $st$ vector, respectively.

Feedforward Network

A fully connected model with one hidden layer of 64 units. Since the PF structure does not include any recurrent layer, context information is provided to the PF model by giving the previous frame's feature vector $c_{i-1}$ and the next frame's $c_{i+1}$ together with the current feature vector $c_i$. In addition, the PF model also takes $pn_i$ and $st_i$ as inputs, resulting in a $(3M+P+S)$-dimensional input vector $I_i = [c_{i-1}^T, c_i^T, c_{i+1}^T, pn_i^T, st_i^T]^T$, whereas the corresponding output vector is the $M$-dimensional $c_i^{pf}$.

RNN Network

A fully connected layer on top of a long short-term memory (LSTM) layer. A $T \times (M+P+S)$ matrix containing the state, phoneme, and MGC features is fed in sequentially, one frame at a time, producing an $M$-dimensional output vector for each input frame.

CNN Network

Batch normalization layers followed by ReLU nonlinearities are used between the convolutional layers. Unlike the DNN- and RNN-based postfilters, state and phoneme features are not used as inputs. A $T \times M$ feature matrix is used at the input, producing a reconstructed $T \times M$ matrix at the output.

Adaptation

Cluster-Based Initialization

Because adaptation is performed with very limited data, the optimization algorithm can quickly fall into a nearby local optimum. Thus, good initialization is important for performance. We hypothesized that a PF model pre-adapted with the speakers in the training set whose voice characteristics are similar to a target speaker can be a better initialization for the target speaker's PF model than a more general initialization. We clustered the reference speakers into five groups using i-vectors with the k-means method.

Adversarial Training

When GAN is not used, the PF is trained using the standard mean squared error (MSE) loss:

$$L_{MSE} = \lVert c^{nat} - c^{pf} \rVert^2,$$

where $c^{nat}$ denotes the natural (target) acoustic features. In the adversarial approach, a binary cross-entropy (BCE) loss is used in addition to the MSE loss:

$$L = E[L_{MSE}] + \omega \, E[L_{BCE}(1, a_p(c^{pf}))],$$

where $E[L_{MSE}]$ and $E[L_{BCE}]$ are the expected values of the MSE and binary cross-entropy losses, $a_p(c^{pf})$ is the prediction of the adversarial network when its input is the parameters generated by the PF network, and $\omega$ is a weighting factor balancing the two terms.
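As an illustration of the two adaptation techniques above, the following minimal Python sketches show one way they could be implemented. In the first, reference speakers are grouped by i-vector with k-means, and the cluster whose centroid is nearest the target speaker's i-vector selects the pre-adapted PF model used for initialization; the function names and the i-vector extraction step are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_reference_speakers(ivectors, n_clusters=5):
    """Group reference speakers into clusters by i-vector.

    ivectors: (num_speakers, dim) array, one i-vector per speaker.
    Returns the fitted KMeans model; one pre-adapted postfilter
    would then be trained per cluster.
    """
    return KMeans(n_clusters=n_clusters, random_state=0).fit(ivectors)

def select_init_cluster(kmeans, target_ivector):
    """Pick the cluster (and hence the pre-adapted PF model) whose
    centroid is closest to the target speaker's i-vector."""
    return int(kmeans.predict(target_ivector.reshape(1, -1))[0])
```

The second sketch computes the combined adversarial loss above in PyTorch, assuming `postfilter` and `adv_net` are `nn.Module` instances and that `adv_net` ends in a sigmoid; the name `omega` and its default value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_pf_loss(postfilter, adv_net, c_gen, c_nat, omega=0.1):
    """MSE + adversarial (BCE) loss for postfilter training.

    c_gen: acoustic features generated by the baseline SPSS model
    c_nat: natural (target) acoustic features
    omega: weight on the adversarial term (illustrative value)
    """
    c_pf = postfilter(c_gen)          # enhanced features c^pf
    mse = F.mse_loss(c_pf, c_nat)     # E[L_MSE] over the batch
    # The adversarial network a_p should score enhanced features as
    # natural (label 1); the BCE term pulls the PF output toward that.
    pred = adv_net(c_pf)              # a_p(c^pf), assumed in (0, 1)
    bce = F.binary_cross_entropy(pred, torch.ones_like(pred))
    return mse + omega * bce
```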
Experimental Setup

All experiments were conducted on the Wall Street Journal (WSJ) speech database. 154 features, including 25 mel-generalized cepstral coefficients (MGC), 1 log fundamental frequency (LF0), and 25 band aperiodicity (BAP) coefficients together with their delta and delta-delta features, were extracted from the speech data at a 16 kHz sampling rate and a 5 ms frame rate. A 5-state HMM was used to model and align phonemes. Of the 156 speakers in total, 135 were used for training, while testing and adaptation were performed on the remaining 21 target speakers' data. Adaptation was performed with 5, 10, and 15 seconds of data.

Results

Figure 1: Generation of enhanced acoustic features.

Figure 2: (a) CNN-based postfilter, (b) feedforward postfilter, (c) RNN-based postfilter.

Table 1: Mel cepstral distortion (MCD) scores of the speaker-independent postfilters with and without cluster-based initialization (CI). Scores for tandem use of the FF- and RNN-based postfilters with the CNN-based postfilter are also shown.

Table 2: Mel cepstral distortion (MCD) scores of the speaker-adapted postfilters with 5, 10, and 15 seconds of adaptation data.

Figure 3: In the top figure, the SI baseline system (A) is compared with the RNN-SI postfilter (B) using the AB test; significance (p-value) is 0.01. In the middle figure, the SI baseline system (A) is compared with the RNN-SI postfilter (B) using the AB test; significance (p-value) is 0.01. In the bottom figure, the RNN-SI postfilter (A) is compared with the CNN-SI postfilter (B) using the AB test; significance (p-value) is 0.55.
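For reference, the MCD metric reported in Tables 1 and 2 can be computed as in the following NumPy sketch, assuming the reference and synthesized frames are already time-aligned and following the common convention of excluding the 0th (energy) coefficient; the paper's exact convention may differ.

```python
import numpy as np

def mel_cepstral_distortion(mgc_ref, mgc_syn):
    """Frame-averaged mel cepstral distortion (MCD) in dB.

    mgc_ref, mgc_syn: (T, M) arrays of time-aligned mel-generalized
    cepstral coefficients for natural and synthesized speech.
    """
    diff = mgc_ref[:, 1:] - mgc_syn[:, 1:]   # drop the 0th coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))
```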