Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011 Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda Nagoya.


Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011 Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda Nagoya Institute of Technology 2 September, 2011

Background
- HMM-based speech synthesis
  - Quality of synthesized speech depends on the acoustic models
  - Model estimation is one of the most important problems
  - An appropriate training algorithm is required
- Deterministic annealing EM (DAEM) algorithm
  - Overcomes the local maxima problem
- Step-wise model selection
  - Performs joint optimization of model structures and state sequences

Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work

Overview of HMM-based system
[Figure: block diagram of the system.
 Training part: excitation parameters and spectral parameters are extracted from the speech signals in the speech database and, together with the labels, used to train context-dependent HMMs and duration models.
 Synthesis part: text analysis converts TEXT into labels, excitation and spectral parameters are generated from the HMMs, and excitation generation followed by the synthesis filter produces the synthesized speech.]

Base techniques
- Hidden semi-Markov model (HSMM)
  - HMM with explicit state duration probability distributions
  - Estimates state output and duration probability distributions
- STRAIGHT
  - A high-quality speech vocoding method
  - Spectrum, F0, and aperiodicity measures
- Parameter generation considering global variance (GV)
  - GV features are calculated from speech regions only, excluding silence and pauses
  - Context-dependent GV models
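As a rough illustration of the GV point above, the following sketch computes a per-dimension variance over the frames of one utterance while skipping silence and pause frames. The label names "sil" and "pau", the function name, and the use of the population variance are assumptions made for this example; the slides do not specify them.

```python
from statistics import pvariance

def global_variance(frames, labels, skip=("sil", "pau")):
    """Per-dimension variance over frames that are not silence or pause.

    frames: list of feature vectors (one per frame)
    labels: list of phone labels, aligned frame by frame (hypothetical names)
    """
    speech = [f for f, lab in zip(frames, labels) if lab not in skip]
    if not speech:
        raise ValueError("utterance contains no speech frames")
    dims = len(speech[0])
    # Population variance is an assumption; a sample variance would also work.
    return [pvariance([f[d] for f in speech]) for d in range(dims)]

# Toy utterance: the first and last frames are silence/pause and are ignored.
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [0.0, 0.0]]
labels = ["sil", "a", "k", "pau"]
gv = global_variance(frames, labels)
```

Only the two middle frames contribute, so silence does not pull the variance estimate toward the silence spectrum.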


EM algorithm
- Maximum likelihood (ML) criterion
- Expectation-maximization (EM) algorithm
  - E-step
  - M-step
- Notation: model parameters, training data, HMM state sequence
- Suffers from the local maxima problem
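The slide's equations are not present in this transcript. For reference, the textbook form of the ML criterion and the two EM steps can be written as below. The symbols are assumptions chosen for this sketch, not taken from the slide: \Lambda for the model parameters, O for the training data, q for the HMM state sequence, and \Lambda^{(k)} for the estimate at iteration k.

```latex
% ML criterion: marginalize over all state sequences q
\hat{\Lambda} = \arg\max_{\Lambda} P(O \mid \Lambda)
              = \arg\max_{\Lambda} \sum_{q} P(O, q \mid \Lambda)

% E-step: expected complete-data log likelihood under the current posterior
Q(\Lambda, \Lambda^{(k)}) = \sum_{q} P(q \mid O, \Lambda^{(k)}) \log P(O, q \mid \Lambda)

% M-step: re-estimate the parameters
\Lambda^{(k+1)} = \arg\max_{\Lambda} Q(\Lambda, \Lambda^{(k)})
```

Each iteration cannot decrease the likelihood, but the result depends on the initial parameters, which is the local maxima problem the slide refers to.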

DAEM algorithm
- Posterior probability, controlled by a temperature parameter
- Model update process
  - E-step
  - M-step
  - Increase the temperature parameter
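The update loop above can be sketched on a much simpler model than an HSMM. The example below applies deterministic annealing EM to a one-dimensional Gaussian mixture: the E-step posterior is proportional to the joint likelihood raised to the power beta, and beta is raised toward 1, where the procedure coincides with ordinary EM. The mixture model, the linear beta schedule, the variance floor, and all function names are assumptions for illustration; the 10 temperature updates and 5 EM steps per temperature follow the training settings given later in the slides.

```python
import math

def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def tempered_posteriors(x, weights, means, variances, beta):
    """E-step: posterior proportional to (joint likelihood) ** beta."""
    powered = [(w * gauss(x, m, v)) ** beta
               for w, m, v in zip(weights, means, variances)]
    total = sum(powered)
    return [p / total for p in powered]

def daem_gmm(data, init_means, n_temps=10, em_steps=5):
    """Deterministic annealing EM for a 1-D Gaussian mixture (toy stand-in)."""
    k = len(init_means)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [1.0] * k
    for t in range(1, n_temps + 1):
        beta = t / n_temps  # assumed linear schedule; beta = 1 is plain EM
        for _ in range(em_steps):
            post = [tempered_posteriors(x, weights, means, variances, beta)
                    for x in data]
            # M-step: same closed-form updates as EM, using tempered posteriors
            for j in range(k):
                occ = sum(p[j] for p in post)
                weights[j] = occ / len(data)
                means[j] = sum(p[j] * x for p, x in zip(post, data)) / occ
                var = sum(p[j] * (x - means[j]) ** 2
                          for p, x in zip(post, data)) / occ
                variances[j] = max(var, 1e-3)  # variance floor (assumption)
    return weights, means, variances

weights, means, variances = daem_gmm([-0.5, 0.0, 0.5, 4.5, 5.0, 5.5], [0.0, 1.0])
```

At small beta the posteriors are close to uniform, so early updates depend little on the initial parameters; this is the property the following slides illustrate for state sequences.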

Optimization of state sequence
Likelihood function in the DAEM algorithm
[Figure, shown in three stages: state sequences over time, with the state output probability and state transition probability]
- At first, all state sequences have uniform probability
- The distribution then changes from uniform to sharp
- Reliable acoustic models are estimated


Problem of context clustering
- Context-dependent models
  - Appropriate model structures are required
- Decision-tree-based context clustering
  - Assumption: state occupancies do not change
  - However, state occupancies depend on model structures
- State sequences and model structures should be optimized simultaneously
[Figure: decision tree with example questions "/a/?", "Silence?", "Vowel?"]

Step-wise model selection
- Gradually change the size of the decision trees
- Perform joint optimization of model structures and state sequences
- Minimum description length (MDL) criterion, defined in terms of:
  - Dimension of the feature vector
  - Number of nodes
  - Amount of training data assigned to the root node
  - Tuning parameter
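The MDL formula itself is missing from this transcript. The sketch below shows one common way such a criterion acts as a stopping rule for tree growth: a split is accepted only if the log-likelihood gain exceeds a penalty that scales with the tuning parameter, the feature dimension, and the log of the amount of data at the root node. The exact penalty form (alpha * dim * log(root_count) per added node, for a Gaussian with diagonal covariance) and all names are assumptions for this example, not the slide's definition.

```python
import math

def gaussian_log_likelihood(values):
    """Log likelihood of 1-D data under its own maximum-likelihood Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((x - mean) ** 2 for x in values) / n, 1e-6)  # variance floor
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def mdl_accepts_split(parent, yes, no, root_count, alpha=1.0, dim=1):
    """Return True if splitting `parent` into `yes`/`no` lowers the description length.

    Assumed penalty per added node: alpha * dim * log(root_count).
    """
    gain = (gaussian_log_likelihood(yes) + gaussian_log_likelihood(no)
            - gaussian_log_likelihood(parent))
    penalty = alpha * dim * math.log(root_count)
    return gain > penalty

# Two well-separated groups: the split is worth its cost for moderate alpha.
low, high = [0.0, 0.1, -0.1], [5.0, 5.1, 4.9]
accepted = mdl_accepts_split(low + high, low, high, root_count=6, alpha=4.0)
# A very large tuning parameter makes the same split too expensive.
rejected = mdl_accepts_split(low + high, low, high, root_count=6, alpha=100.0)
```

A larger tuning parameter yields smaller trees, so decreasing it step by step grows the model structure gradually.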

Model training process
1. Estimate monophone models (DAEM)
   - Number of temperature parameter updates: 10
   - Number of EM steps at each temperature: 5
2. Select decision trees by the MDL criterion using the tuning parameter
3. Estimate context-dependent models (EM)
   - Number of EM steps: 5
4. Decrease the tuning parameter
   - The tuning parameter decreases as 4, 2, 1
5. Repeat from step 2
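The five steps above amount to a fixed schedule, which the following sketch enumerates without doing any actual training. The stage names, the tuple layout, and the linear temperature schedule are assumptions for illustration; the counts (10 temperature updates, 5 EM steps, tuning parameter 4, 2, 1) come from the slide.

```python
def training_schedule(n_temps=10, daem_em_steps=5, cd_em_steps=5, tuning=(4, 2, 1)):
    """Yield (stage, setting, em_iteration) tuples in the order of the slide."""
    # Step 1: monophone models with DAEM
    for t in range(1, n_temps + 1):
        beta = t / n_temps  # assumed linear temperature schedule
        for it in range(1, daem_em_steps + 1):
            yield ("monophone-daem", beta, it)
    # Steps 2 to 5: repeat tree selection and EM while decreasing the tuning parameter
    for alpha in tuning:
        yield ("select-trees-mdl", alpha, 0)
        for it in range(1, cd_em_steps + 1):
            yield ("context-dependent-em", alpha, it)

steps = list(training_schedule())
```

Under these settings the schedule contains 50 DAEM iterations, followed by three rounds of one tree selection plus 5 EM iterations each.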


Speech analysis conditions
Training data:  10,000 utterances (pruned by the alignment likelihood)
Sampling rate:  48 kHz
Window:         F0-adaptive Gaussian window
Frame shift:    5 ms
Feature vector: 49-dim. STRAIGHT mel-cepstrum, log F0, 26 band-filtered aperiodicity measures, + Δ + ΔΔ (231 dimensions)
HMM:            5-state left-to-right HSMM without skip transitions
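A quick arithmetic check of these conditions is sketched below. The total of 231 dimensions is reproduced only under the assumption that the mel-cepstrum includes the 0th coefficient (50 values for a 49th-order analysis); that reading is an inference from the numbers, not something the slide states.

```python
sampling_rate_hz = 48_000
frame_shift_ms = 5

samples_per_shift = sampling_rate_hz * frame_shift_ms // 1000  # samples between frames
frames_per_second = 1000 // frame_shift_ms

mel_cepstrum = 49 + 1   # ASSUMPTION: 49th order plus the 0th coefficient
log_f0 = 1
aperiodicity = 26       # band-filtered aperiodicity measures

static_dim = mel_cepstrum + log_f0 + aperiodicity
total_dim = static_dim * 3  # static + delta + delta-delta

print(samples_per_shift, frames_per_second, static_dim, total_dim)
```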

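Likelihood & model structure
- Average log likelihood of monophone model (EM vs. DAEM)
- Number of leaf nodes
- Phone set: Unilex (58 phonemes)
- Number of leaf nodes (full-context): 6,175,
[Table: average log likelihood for EM and DAEM; values not recoverable]
[Table: number of leaf nodes by tuning parameter, with columns Mel-Cep., Log F0, Dur., Sum and a first row "Monophone"; extracted values: ,9343, ,302 23,2707,8991,76012, ,72124,8973,92340,541]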

Experimental results
- Compared with the benchmark HMM-based system
  - The NIT system achieved the same performance
  - High intelligibility
- Compared with the benchmark unit-selection system
  - Worse in speaker similarity
  - Better in intelligibility

                     HMM-based (16kHz)   HMM-based (48kHz)   Unit-selection
Naturalness                 ―                   ―                  ―
Speaker similarity          ―                   ―                  ×
Intelligibility             ―                   ―                  ○

Speech samples
- Generates highly intelligible speech
- Includes voiced/unvoiced errors
- Feature extraction and excitation need to be improved
[Audio samples: original, NIT system]

Conclusion
- NIT HMM-based speech synthesis system
  - DAEM algorithm
    - Overcomes the local maxima problem
  - Step-wise model selection
    - Performs joint optimization of state sequences and model structures
  - Generates highly intelligible speech
- Future work
  - Improve feature extraction and excitation
  - Investigate the schedule of temperature parameters and step-wise model selection

Thank you

Experimental result: naturalness

Experimental result: speaker similarity

Experimental result: intelligibility