2 Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011 Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda Nagoya Institute of Technology 2 September, 2011

3 Background
HMM-based speech synthesis: the quality of synthesized speech depends on the acoustic models, so model estimation is one of the most important problems and an appropriate training algorithm is required.
- Deterministic annealing EM (DAEM) algorithm: overcomes the local maxima problem
- Step-wise model selection: joint optimization of model structures and state sequences

4 Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work

5 Overview of HMM-based system
[Block diagram] Training part: excitation and spectral parameters are extracted from the speech database, and context-dependent HMMs and duration models are trained from these parameters and the labels. Synthesis part: text analysis converts input text into labels, excitation and spectral parameters are generated from the HMMs, and excitation generation followed by a synthesis filter produces the synthesized speech.
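The two-part pipeline above can be sketched as a pair of driver functions. This is only an illustration of the data flow, not the actual NIT toolkit API; every function name and callback here is a hypothetical stand-in.

```python
def train_system(speech_db, extract_excitation, extract_spectrum, train_hmm):
    """Training part: extract excitation and spectral parameters from each
    (waveform, label) pair, then train context-dependent HMMs and duration
    models. All callbacks are hypothetical stand-ins for real components."""
    corpus = []
    for waveform, label in speech_db:
        excitation = extract_excitation(waveform)   # e.g. F0, aperiodicity
        spectrum = extract_spectrum(waveform)       # e.g. mel-cepstrum
        corpus.append((label, excitation, spectrum))
    return train_hmm(corpus)

def synthesize(text, models, analyze_text, generate_parameters,
               make_excitation, synthesis_filter):
    """Synthesis part: text analysis yields labels, parameters are generated
    from the HMMs, and excitation plus a synthesis filter produce speech."""
    labels = analyze_text(text)
    excitation_params, spectral_params = generate_parameters(models, labels)
    excitation = make_excitation(excitation_params)
    return synthesis_filter(excitation, spectral_params)
```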

6 Base techniques
- Hidden semi-Markov model (HSMM): an HMM with an explicit state duration probability distribution; state output and duration distributions are estimated jointly.
- STRAIGHT: a high-quality speech vocoding method providing spectrum, F0, and aperiodicity measures.
- Parameter generation considering GV: GV features are calculated only from speech regions, excluding silence and pauses, using context-dependent GV models.
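The GV computation in the last bullet can be sketched as follows. This is a minimal illustration under assumed inputs (a frame-by-dimension feature matrix and frame-level phone labels); the function name and label representation are not from the paper.

```python
import numpy as np

def global_variance(features, labels, silence_labels=("sil", "pau")):
    """Per-utterance global variance (GV) over speech frames only.

    features: (T, D) array, e.g. mel-cepstral coefficients per frame
    labels:   length-T sequence of frame-level phone labels
    Frames labeled as silence or pause are excluded, as on the slide.
    """
    mask = np.array([lab not in silence_labels for lab in labels])
    speech = features[mask]
    # GV is the variance of each feature dimension across the utterance
    return speech.var(axis=0)
```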

7 Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work

8 EM algorithm
Maximum likelihood (ML) criterion: estimate the model parameter λ that maximizes the likelihood P(O | λ) of the training data O, marginalized over the HMM state sequences q.
Expectation-Maximization (EM) algorithm:
- E-step: compute the state posterior P(q | O, λ) and the auxiliary function Q(λ, λ') = Σ_q P(q | O, λ) log P(O, q | λ')
- M-step: λ ← argmax_λ' Q(λ, λ')
The EM algorithm suffers from the local maxima problem.
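The E-step/M-step alternation can be made concrete with a toy example. The sketch below runs EM on a 1-D two-component Gaussian mixture rather than on an HSMM (the actual model in the paper), since the mixture case fits in a few lines while exhibiting the same structure and the same local-maxima sensitivity.

```python
import numpy as np

def em_gmm(x, means, variances, weights, n_iter=20):
    """Toy EM for a 1-D Gaussian mixture. The E-step computes posterior
    responsibilities; the M-step re-estimates parameters from them.
    Convergence is only to a local maximum of the likelihood."""
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample
        lik = weights * np.exp(-(x[:, None] - means) ** 2 / (2 * variances)) \
              / np.sqrt(2 * np.pi * variances)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log likelihood
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return means, variances, weights
```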

9 DAEM algorithm
Posterior probability with a temperature parameter β: P_β(q | O, λ) = P(O, q | λ)^β / Σ_q' P(O, q' | λ)^β
Model update process:
- E-step: compute the tempered posterior P_β(q | O, λ)
- M-step: update λ as in standard EM
- Increase the temperature parameter (β rises from near 0 toward 1)

10 Optimization of state sequence
Likelihood function in the DAEM algorithm: at the initial temperature, all state sequences have uniform probability. [Figure: state output probability and state transition probability over time]

11 Optimization of state sequence
Likelihood function in the DAEM algorithm: as the temperature parameter increases, the posteriors change from uniform to sharp. [Figure: state output probability and state transition probability over time]

12 Optimization of state sequence
Likelihood function in the DAEM algorithm: with sharp posteriors, reliable acoustic models are estimated. [Figure: state output probability and state transition probability over time]

13 Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work

14 Problem of context clustering
Context-dependent models require appropriate model structures, which are chosen by decision-tree-based context clustering. The clustering assumes that state occupancies do not change, yet the occupancies depend on the model structure, so state sequences and model structures should be optimized simultaneously. [Figure: decision tree for /a/ with questions such as "Silence?" and "Vowel?"]

15 Step-wise model selection
Gradually change the size of the decision trees to perform joint optimization of model structures and state sequences. Trees are selected with the Minimum Description Length (MDL) criterion, whose penalty term depends on the dimension of the feature vector, the number of nodes, and the amount of training data assigned to the root node, scaled by a tuning parameter.
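A simplified form of the MDL split test can be sketched as below. This is an assumption-laden illustration of the trade-off only: the exact penalty term in the paper's criterion may differ, and the function name and arguments are hypothetical.

```python
import math

def mdl_split_gain(loglik_parent, loglik_children, dim, root_occupancy, alpha=1.0):
    """MDL-style split decision sketch: a node split is accepted when the
    log-likelihood gain exceeds the description-length penalty. Lowering
    alpha shrinks the penalty, so the tree grows larger, which is how a
    decreasing alpha schedule (e.g. 4, 2, 1) enlarges the trees step-wise."""
    gain = sum(loglik_children) - loglik_parent
    # splitting adds one leaf: roughly `dim` extra parameters,
    # weighted by the log of the data assigned to the root node
    penalty = alpha * dim * math.log(root_occupancy)
    return gain - penalty  # split if positive
```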

16 Model training process
1. Estimate monophone models (DAEM): 10 temperature parameter updates, 5 EM steps at each temperature
2. Select decision trees by the MDL criterion using the tuning parameter
3. Estimate context-dependent models (EM): 5 EM steps
4. Decrease the tuning parameter (4, then 2, then 1)
5. Repeat from step 2
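The five-step schedule above can be expressed as a short driver loop. The three callbacks are hypothetical stand-ins for the real training components; only the iteration counts and the 4, 2, 1 tuning-parameter schedule come from the slide.

```python
def train(train_monophone_daem, select_trees_mdl, train_context_dependent_em):
    """Slide's training schedule with hypothetical callbacks:
    monophone DAEM training (step 1), then alternating MDL tree selection
    (step 2) and context-dependent EM re-estimation (step 3) while the
    tuning parameter decreases (steps 4-5)."""
    models = train_monophone_daem(n_temperatures=10, em_steps_per_temp=5)
    for alpha in (4, 2, 1):            # decrease the MDL tuning parameter
        trees = select_trees_mdl(models, alpha)
        models = train_context_dependent_em(trees, em_steps=5)
    return models
```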

17 Outline
- HMM-based speech synthesis system
- Deterministic annealing EM (DAEM) algorithm
- Step-wise model selection
- Experiments
- Conclusion & future work

18 Speech analysis conditions
Training data: 10,000 utterances (pruned by the alignment likelihood)
Sampling rate: 48 kHz
Window: F0-adaptive Gaussian window
Frame shift: 5 ms
Feature vector: 49-dim. STRAIGHT mel-cepstrum, log F0, and 26 band-filtered aperiodicity measures + Δ + ΔΔ (231 dimensions)
HMM: 5-state left-to-right HSMM without skip transitions

19 Likelihood & model structure
Average log likelihood of the monophone model: EM 227.716, DAEM 229.174
Phone set: Unilex (58 phonemes)
Number of leaf nodes (full-context): 6,175,466

Number of leaf nodes:
Tuning parameter | Mel-cep. | Log F0 | Dur. | Sum
Monophone | 290 | 290 | 58 | 638
4 | 1,934 | 3,454 | 914 | 6,302
2 | 3,270 | 7,899 | 1,760 | 12,929
1 | 11,721 | 24,897 | 3,923 | 40,541

20 Experimental results
Compared with the benchmark HMM-based systems, the NIT system achieved the same performance and high intelligibility. Compared with the benchmark unit-selection system, it was worse in speaker similarity but better in intelligibility.

NIT system vs. | HMM-based (16 kHz) | HMM-based (48 kHz) | Unit-selection
Naturalness | ― | ― | ―
Speaker similarity | ― | ― | ×
Intelligibility | ― | ― | ○
(―: comparable, ×: worse, ○: better)

21 Speech samples
The system generates highly intelligible speech, but some voiced/unvoiced errors remain, so feature extraction and excitation need improvement. [Audio: original speech and NIT system samples]

22 Conclusion
NIT HMM-based speech synthesis system:
- DAEM algorithm: overcomes the local maxima problem
- Step-wise model selection: joint optimization of state sequences and model structures
- Generates highly intelligible speech
Future work:
- Improve feature extraction and excitation
- Investigate the schedules of the temperature parameter and of step-wise model selection

23 Thank you

24 Experimental result: naturalness [Figure]

25 Experimental result: speaker similarity [Figure]

26 Experimental result: intelligibility [Figure]

