DNN-BASED SPEAKER-ADAPTIVE POSTFILTERING WITH LIMITED ADAPTATION DATA FOR STATISTICAL SPEECH SYNTHESIS SYSTEMS
Mirac Goksu Ozturk1, Okan Ulusoy1, Cenk Demiroglu2
Bogazici University1, Ozyegin University2, Istanbul, Turkey
Abstract

Deep neural networks (DNNs) have been successfully deployed for acoustic modelling in statistical parametric speech synthesis (SPSS) systems. Moreover, DNN-based postfilters (PFs) have been shown to outperform the conventional postfilters that are widely used in SPSS systems to increase the quality of synthesized speech. However, existing DNN-based postfilters are trained with speaker-dependent databases. Given that SPSS systems can rapidly adapt to new speakers from generic models, there is a need for DNN-based postfilters that can adapt to new speakers with minimal adaptation data. Here, we compare DNN-, RNN-, and CNN-based postfilters together with adversarial (GAN) training and cluster-based initialization (CI) for rapid adaptation. Results indicate that the feedforward (FF) DNN, together with GAN and CI, significantly outperforms the other recently proposed postfilters.

Introduction

Previously proposed DNN-based postfilters were trained for speaker-dependent systems where a large amount of data is available for a single speaker. A major advantage of the SPSS approach is its ability to adapt to different speakers with limited amounts of data. To retain the rapid-adaptation advantage of SPSS, postfilters that perform well with only a few seconds of adaptation data are needed. Three architectures are compared: FF, CNN, and RNN. The networks are trained with and without the GAN approach. Moreover, a cluster-based initialization method is used, where the postfilter is initialized with a model trained on speakers similar to the target speaker.

Methods

DNN Architectures (Postfilters)

Let $c = [c^T, \Delta c^T, \Delta^2 c^T]^T$ be the output vector of the baseline network model, where $c$, $\Delta c$, and $\Delta^2 c$ are the cepstral, delta-cepstral, and delta-delta-cepstral coefficients, and let $pn$ and $st$ be the phoneme and state information of the utterance. We use $T$, $M$, $P$, and $S$ to denote the number of frames in an utterance, the number of MGC coefficients, the size of the $pn$ vector, and the size of the $st$ vector, respectively.

Feedforward Network

A fully connected model with one hidden layer of 64 units. Since the PF structure does not include any recurrent layer, context information is provided to the PF model by giving the previous frame's feature vector $c_{i-1}$ and the next frame's $c_{i+1}$ together with the current feature vector $c_i$. In addition, the PF model also takes $pn_i$ and $st_i$ as inputs, resulting in a $(3M+P+S)$-dimensional input vector $I_i = [c_{i-1}^T, c_i^T, c_{i+1}^T, pn_i^T, st_i^T]^T$, whereas the corresponding output vector is the $M$-dimensional $c_i^{pf}$.

RNN Network

A fully connected layer on top of a long short-term memory (LSTM) layer. A $T \times (M+P+S)$ matrix containing the state, phoneme, and MGC features is fed in sequentially, one frame at a time, producing an $M$-dimensional output vector for each input frame.

CNN Network

Batch normalization layers followed by ReLU nonlinearities are used between the convolutional layers. Unlike the DNN- and RNN-based postfilters, state and phoneme features are not used as inputs. A $T \times M$ feature matrix is used at the input, producing a reconstructed $T \times M$ matrix at the output.

Adaptation

Cluster-Based Initialization

Because adaptation is performed with very limited data, the optimization algorithm can quickly fall into a nearby local optimum. Thus, good initialization is important for performance. We hypothesized that a PF model pre-adapted with the speakers in the training set whose voice characteristics are similar to a target speaker can be a better initialization for the target speaker's PF model than a more general initialization. We clustered the reference speakers into five groups using i-vectors with the k-means method.

Adversarial Training

When GAN is not used, the PF is trained using the standard mean squared error (MSE) loss:

$$L_{MSE} = \lVert c^{nat} - c^{pf} \rVert^2,$$

where $c^{nat}$ denotes the natural (target) acoustic features. In the adversarial approach, a binary cross-entropy (BCE) loss is used in addition to the MSE loss:

$$L = E[L_{MSE}] + \omega \, E[L_{BCE}(1, a_p(c^{pf}))],$$

where $E[L_{MSE}]$ and $E[L_{BCE}]$ are the expected values of the MSE and binary cross-entropy losses, $a_p(c^{pf})$ is the prediction of the adversarial network when its input is the parameters generated by the PF network, and $\omega$ is a weighting factor balancing the two terms.
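As an illustration of the two adaptation techniques above, the following minimal Python sketches show one way they could be implemented. In the first, reference speakers are grouped by i-vector with k-means, and the cluster whose centroid is nearest the target speaker's i-vector selects the pre-adapted PF model used for initialization; the function names and the i-vector extraction step are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_reference_speakers(ivectors, n_clusters=5):
    """Group reference speakers into clusters by i-vector.

    ivectors: (num_speakers, dim) array, one i-vector per speaker.
    Returns the fitted KMeans model; one pre-adapted postfilter
    would then be trained per cluster.
    """
    return KMeans(n_clusters=n_clusters, random_state=0).fit(ivectors)

def select_init_cluster(kmeans, target_ivector):
    """Pick the cluster (and hence the pre-adapted PF model) whose
    centroid is closest to the target speaker's i-vector."""
    return int(kmeans.predict(target_ivector.reshape(1, -1))[0])
```

The second sketch computes the combined adversarial loss above in PyTorch, assuming `postfilter` and `adv_net` are `nn.Module` instances and that `adv_net` ends in a sigmoid; the name `omega` and its default value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_pf_loss(postfilter, adv_net, c_gen, c_nat, omega=0.1):
    """MSE + adversarial (BCE) loss for postfilter training.

    c_gen: acoustic features generated by the baseline SPSS model
    c_nat: natural (target) acoustic features
    omega: weight on the adversarial term (illustrative value)
    """
    c_pf = postfilter(c_gen)          # enhanced features c^pf
    mse = F.mse_loss(c_pf, c_nat)     # E[L_MSE] over the batch
    # The adversarial network a_p should score enhanced features as
    # natural (label 1); the BCE term pulls the PF output toward that.
    pred = adv_net(c_pf)              # a_p(c^pf), assumed in (0, 1)
    bce = F.binary_cross_entropy(pred, torch.ones_like(pred))
    return mse + omega * bce
```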
Experimental Setup

All experiments were conducted on the Wall Street Journal (WSJ) speech database. 154 features, including 25 mel-generalized cepstral coefficients (MGC), 1 log fundamental frequency (LF0), and 25 band aperiodicity (BAP) coefficients together with their delta and delta-delta features, were extracted from the speech data at a 16 kHz sampling rate and a 5 ms frame rate. A 5-state HMM was used to model and align phonemes. Of the 156 speakers in total, 135 were used for training, while testing and adaptation were performed on the remaining 21 target speakers' data. Adaptation was performed with 5, 10, and 15 seconds of data.

Results

Figure 1: Generation of enhanced acoustic features.

Figure 2: (a) CNN-based postfilter, (b) feedforward postfilter, (c) RNN-based postfilter.

Table 1: Mel cepstral distortion (MCD) scores of the speaker-independent postfilters with and without cluster-based initialization (CI). Scores for tandem use of the FF- and RNN-based postfilters with the CNN-based postfilter are also shown.

Table 2: Mel cepstral distortion (MCD) scores of the speaker-adapted postfilters with 5, 10, and 15 seconds of adaptation data.

Figure 3: In the top figure, the SI baseline system (A) is compared with the RNN-SI postfilter (B) using the AB test; significance (p-value) is 0.01. In the middle figure, the SI baseline system (A) is compared with the RNN-SI postfilter (B) using the AB test; significance (p-value) is 0.01. In the bottom figure, the RNN-SI postfilter (A) is compared with the CNN-SI postfilter (B) using the AB test; significance (p-value) is 0.55.
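For reference, the MCD metric reported in Tables 1 and 2 can be computed as in the following NumPy sketch, assuming the reference and synthesized frames are already time-aligned and following the common convention of excluding the 0th (energy) coefficient; the paper's exact convention may differ.

```python
import numpy as np

def mel_cepstral_distortion(mgc_ref, mgc_syn):
    """Frame-averaged mel cepstral distortion (MCD) in dB.

    mgc_ref, mgc_syn: (T, M) arrays of time-aligned mel-generalized
    cepstral coefficients for natural and synthesized speech.
    """
    diff = mgc_ref[:, 1:] - mgc_syn[:, 1:]   # drop the 0th coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))
```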