Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Michael Pucher
Speech Synthesis Overview

Conversational speech
- Y. Liu, E. Shriberg, A. Stolcke (2003), Automatic disfluency identification in conversational speech using multiple knowledge sources
- E. Shriberg (2005), Spontaneous Speech: How People Really Talk, and Why Engineers Should Care
  - Recovering hidden punctuation
  - Speaker overlap (multi-party speech)
  - Speaker state and emotion
- N. Campbell (2006), Conversational speech synthesis and the need for some laughter
  - Conversational speech synthesis based on a very large corpus of everyday conversational speech
- Example: phonetic transcription of „Investition": sil (silence), GS (glottal stop), (e-schwa), n, v, E 0.4, s, t_cl (closure), t, i: 0.59, ts_cl (closure), ts, i:, o:, n, sp (short pause), …
- Non-lexical particles
  - Para-linguistic: laughing, whispering, …
  - Reflex: breathing, …
  - Disfluency: filled pauses (äh, ähm), …

Automatic speech segmentation
- F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
  - DTW-based forced alignment
- L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models
  - GMM-based boundary correction
- A. Park, J. R. Glass (2005), Towards Unsupervised Pattern Discovery in Speech
  - Segmental DTW for pattern discovery: finding matching sub-patterns
[Figures: boundary feature vector; phonetic-distance matrix for the DTW alignment of Viennese „ist" [i s] and „können" [k E n a n]]

Synthesis of singing
- K. Saino, H. Zen, Y. Nankaku, A. Lee, K. Tokuda (2006), An HMM-based singing voice synthesis system
  - Time-lag features
- T. Saitou, M. Goto, M. Unoki, M. Akagi (2007), Vocal Conversion from Speaking Voice to Singing Voice using STRAIGHT
  - Converts a speech input signal into a singing voice output signal

Basics and history of unit selection speech synthesis
- Y. Sagisaka (1988), Speech synthesis by rule using an optimal selection of non-uniform synthesis units
  - Non-uniform unit selection synthesis
- A. Hunt, A. Black (1996), Unit selection in a concatenative speech synthesis system using a large speech database
  - Target cost (analogous to an emission probability) and concatenation cost (analogous to a state transition probability)
  - Viterbi decoding of the state transition network (a sketch follows below)
[Figures: possible units for the English word „no" [n ou]; unit selection costs]
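
The Hunt & Black costs map directly onto a Viterbi search over candidate units. A minimal sketch (not the original implementation; target_cost, concat_cost and the per-target candidate lists are assumed, hypothetical inputs):

    # Viterbi search over candidate units, in the spirit of Hunt & Black (1996).
    # `candidates[t]` lists the database units that could realise target t;
    # `target_cost(tgt, u)` and `concat_cost(u1, u2)` are assumed inputs.
    def select_units(targets, candidates, target_cost, concat_cost):
        # best[t][u] = (cheapest cumulative cost ending in unit u, backpointer)
        best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
        for t in range(1, len(targets)):
            layer = {}
            for u in candidates[t]:
                prev = min(candidates[t - 1],
                           key=lambda p: best[t - 1][p][0] + concat_cost(p, u))
                cost = best[t - 1][prev][0] + concat_cost(prev, u)
                layer[u] = (cost + target_cost(targets[t], u), prev)
            best.append(layer)
        # backtrack from the cheapest final unit
        u = min(best[-1], key=lambda k: best[-1][k][0])
        path = [u]
        for t in range(len(targets) - 1, 0, -1):
            u = best[t][u][1]
            path.append(u)
        return path[::-1]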

Concatenation costs and target costs
- A. Black, P. Taylor (1997), Automatically clustering similar units for unit selection in speech synthesis
  - Decision tree-based clustering of target units
  - Mean acoustic distance (mel-cepstrum, F0, power, delta) between all cluster members as impurity measure (sketched below)
  - Greedy tree building:
    1. Split the data points using all possible questions
    2. Keep the question that yields the best (lowest-impurity) split
    3. Remove that question from the pool
    4. Continue with the data points in each resulting cluster
- Pantazis, Y., Stylianou, Y., and Klabbers, E. (2005), Discontinuity detection in concatenated speech synthesis based on nonlinear speech analysis
  - New non-stationary features for measuring discontinuity
  - Discrimination of continuous and discontinuous signals
[Figure: phonetic decision tree]
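
A rough illustration of the impurity measure and the greedy question selection; the unit feature vectors and the question set are assumed to be given (hypothetical names):

    import numpy as np

    # Impurity of a cluster = mean acoustic distance between all members
    # (Black & Taylor, 1997); `feats` holds one feature vector per unit
    # (e.g. mel-cepstrum, F0, power and their deltas).
    def cluster_impurity(feats: np.ndarray) -> float:
        n = len(feats)
        if n < 2:
            return 0.0
        diffs = feats[:, None, :] - feats[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise distances
        return float(dists.sum() / (n * (n - 1)))    # mean over ordered pairs

    # Greedy split: pick the question whose yes/no partition gives the
    # lowest summed child impurity. `units` and `questions` are hypothetical.
    def best_question(units, questions):
        scored = []
        for q in questions:
            yes = np.array([u["feats"] for u in units if q(u)])
            no = np.array([u["feats"] for u in units if not q(u)])
            if len(yes) and len(no):
                scored.append((cluster_impurity(yes) + cluster_impurity(no), q))
        return min(scored, key=lambda s: s[0])[1] if scored else None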

Basics of HMM-based speech synthesis
- K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura (2000), Speech parameter generation algorithms for HMM-based speech synthesis
  - Maximize the output probability of the parameter sequence (reconstructed equations below)
  - Takes dynamic features (derivatives of the cepstral coefficients) into account; otherwise only the state means are selected
- T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura (1999), Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis
  - F0 and duration modeling by multi-space probability distribution HMMs
- H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, Hidden Semi-Markov Model Based Speech Synthesis
  - Explicit modeling of state durations
- K. Tokuda, T. Masuko, N. Miyazaki, T. Kobayashi (2002), Multi-space probability distribution HMM
  - Modeling of observation vectors with variable dimensionality (discrete symbols are 0-dimensional)
  - Useful for F0 modeling with the voiced/unvoiced distinction
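
The formula elided after "Maximize" is, in the usual notation of Tokuda et al. (2000), the output probability of the parameter sequence under the dynamic-feature constraint; a hedged reconstruction, with static features $\boldsymbol{c}$ and window matrix $W$ that appends the deltas:

$$\max_{\boldsymbol{c}}\; P(\boldsymbol{o} \mid \boldsymbol{q}, \lambda) \quad \text{subject to} \quad \boldsymbol{o} = W\boldsymbol{c}$$

Setting the derivative with respect to $\boldsymbol{c}$ to zero gives the linear system

$$W^{\top} U^{-1} W \, \boldsymbol{c} = W^{\top} U^{-1} \boldsymbol{\mu}$$

where $\boldsymbol{\mu}$ and $U$ stack the means and covariances of the state sequence $\boldsymbol{q}$. With $W = I$ (no dynamic features) the solution collapses to the state means, which is exactly the artifact the slide warns about.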

Speaker interpolation
- T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura, Speaker interpolation in HMM-based speech synthesis system
- M. Tachibana, J. Yamagishi, T. Masuko, T. Kobayashi, Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing
- Defines and evaluates three different interpolation methods (equations below):
  - Interpolation among observations
  - Interpolation among output distributions
  - Interpolation based on KL divergence
- For emotional speech synthesis: generating mixed emotions, e.g. between happy and sad
- For variant modeling: generating variants between American English and Indian English, or Austrian German and Viennese
- Only directly applicable if the tying structure of the models is the same
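
For interpolation among observations of $K$ speaker models with ratios $a_k$, $\sum_k a_k = 1$, the interpolated output distribution is again a Gaussian; a sketch of the resulting parameters (following Yoshimura et al., assuming independent per-speaker Gaussians):

$$\boldsymbol{\mu} = \sum_{k=1}^{K} a_k\, \boldsymbol{\mu}_k, \qquad \boldsymbol{\Sigma} = \sum_{k=1}^{K} a_k^{2}\, \boldsymbol{\Sigma}_k$$

For example, $a = (0.5, 0.5)$ gives an even blend of a "happy" and a "sad" model.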

Speaker adaptation
- J. Yamagishi, T. Kobayashi, Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training
- J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi, A training method of average voice model for HMM-based speech synthesis
- M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi, Text-to-speech synthesis with arbitrary speaker's voice from average voice

Signal generation
- S. Imai (1983), Cepstral analysis synthesis on the mel frequency scale
  - Classical approach to signal generation from MFCC and F0
- T. Fukada, K. Tokuda, T. Kobayashi, S. Imai (1992), An adaptive algorithm for mel-cepstral analysis of speech
- H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné (1999), Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds
- H. Zen, T. Toda, M. Nakamura, K. Tokuda (2007), Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005
- R. Maia, T. Toda, H. Zen, Y. Nankaku, K. Tokuda (2007), An excitation model for HMM-based speech synthesis based on residual modeling

Context clustering
- S. J. Young, J. J. Odell, P. C. Woodland (1994), Tree-Based State Tying for High Accuracy Modelling
  - Create and train monophone HMMs
  - Clone and re-estimate untied context-dependent triphone HMMs
  - Cluster (tree-based) the corresponding states, e.g. put all i-th states of the „iy" models in one cluster
  - Find the clustering that maximizes the log likelihood (criterion sketched below)
  - …
  - Increase the number of mixture components
- K. Shinoda, T. Watanabe (1997), Acoustic modeling based on the MDL principle for speech recognition
- J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi (2003), A context clustering technique for average voice models
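
To make the split criterion concrete: under a single-Gaussian approximation, the log likelihood of pooling a state cluster $S$ with occupancy $\Gamma(S)$ and covariance $\Sigma_S$ is, up to constants,

$$L(S) = -\tfrac{1}{2}\, \Gamma(S) \left( \log\!\left( (2\pi)^{d}\, |\Sigma_S| \right) + d \right)$$

and at each node the question maximizing the gain $\Delta L = L(S_{\text{yes}}) + L(S_{\text{no}}) - L(S)$ is chosen. In the MDL formulation of Shinoda & Watanabe (1997), splitting stops once this gain no longer exceeds the description-length penalty for the added parameters, on the order of $\tfrac{K}{2}\log N$ for $K$ extra parameters and $N$ samples (a sketch of the criterion, not the paper's exact form).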

Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Michael Pucher, Volker Strom, Gregor Hofer, Friedrich Neubarth, Sylvia Moosmüller, Gudrun Schuchmann, Christian Kranzler
Viennese Sociolect and Dialect Synthesis (Wiener Soziolekt und Dialektsynthese)

Test voices (speaker selection) and talking clocks
- Recorded approx. 100 training sentences per speaker (9 speakers)
- Generated 10 test sentences
  - Austrian German
  - Viennese
- Talking clocks

Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Michael Pucher
Speech Synthesis: Automatic speech segmentation

Automatic speech segmentation
- F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
  - DTW-based forced alignment
- L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models
  - GMM-based boundary correction
- A. Park, J. R. Glass (2005), Towards Unsupervised Pattern Discovery in Speech
  - Segmental DTW for pattern discovery: finding matching sub-patterns
[Figures: boundary feature vector; phonetic-distance matrix for the DTW alignment of Viennese „ist" [i s] and „können" [k E n a n]]

Automatic speech segmentation – Dynamic Time Warping (DTW) based forced alignment
- F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
  - DTW-based forced alignment
[Figures: phonetic-distance matrix for the DTW alignment of Viennese „ist" [i s] and „können" [k E n a n]; DTW initialization for the alignment of „können" [k E n a n] with an unknown sequence]

Automatic speech segmentation – Dynamic Time Warping (DTW) based forced alignment
- F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
  - DTW-based forced alignment
- Recurrence: DTW[i, j] := cost + minimum(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1]) — implemented in the sketch below
- This is one possible local step pattern; other patterns for constraining the search space are possible (Rabiner & Juang, 1993)
[Figure: DTW initialization for the alignment of „können" [k E n a n] with an unknown sequence]
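
A direct implementation of this recurrence (a sketch; the local distance matrix is assumed precomputed, and no slope constraints are applied):

    import numpy as np

    # DP for the recurrence on this slide:
    # DTW[i, j] = cost(i, j) + min(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1]),
    # with everything outside the matrix initialized to infinity, as above.
    def dtw(dist: np.ndarray):
        n, m = dist.shape
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0                       # only the origin is reachable
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(
                    acc[i - 1, j],            # vertical step
                    acc[i, j - 1],            # horizontal step
                    acc[i - 1, j - 1],        # diagonal step
                )
        # backtrack the optimal warping path
        path, i, j = [], n, m
        while (i, j) != (1, 1):
            path.append((i - 1, j - 1))
            i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                       key=lambda ij: acc[ij])
        path.append((0, 0))
        return acc[n, m], path[::-1]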

Automatic speech segmentation – Boundary correction
- Extract boundary feature vectors from a labeled database
- Do decision-tree based clustering of the feature vectors, using questions based on the left and right phoneme context:
  1. Split the data points using all possible questions
  2. Keep the question that yields the best split
  3. Remove that question from the pool
  4. Continue with the data points in each resulting cluster
- Train a GMM on each class
- Use the GMM to move the segmentation boundary within a certain range (sketched below)
- L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models
[Figure: phonetic decision tree]
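
A rough sketch of the refinement step in the spirit of Wang et al. (2004), assuming scikit-learn's GaussianMixture and a hypothetical extract_boundary_vector feature extractor:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Fit one GMM per boundary class (decision-tree leaf) on vectors
    # extracted around hand-labelled boundaries.
    def train_boundary_model(vectors, n_components=4):
        return GaussianMixture(n_components=n_components).fit(np.array(vectors))

    # Shift an automatic boundary within +/- `search` frames to the
    # position the class GMM scores highest. `extract_boundary_vector`
    # is a hypothetical feature extractor returning a 1-D vector.
    def refine_boundary(frames, boundary, gmm, extract_boundary_vector, search=5):
        lo = max(0, boundary - search)
        hi = min(len(frames) - 1, boundary + search)
        scores = [gmm.score(extract_boundary_vector(frames, b)[None, :])
                  for b in range(lo, hi + 1)]
        return lo + int(np.argmax(scores))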

Automatic speech segmentation – Segmental DTW
- A. Park, J. R. Glass (2005), Towards Unsupervised Pattern Discovery in Speech
  - Segmental DTW for pattern discovery: finding matching sub-patterns
- Divide the distance matrix into diagonal subbands of width W
- Search for the best alignment within each band
- Find the least-average subsequence with minimum length L
- The least-average subsequence is the aligned path fragment that exhibits a good alignment (a simplified sketch follows)
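
A deliberately simplified sketch of the "least average subsequence" search: with band width W = 0 each band is a single diagonal of the distance matrix, so the search reduces to finding the contiguous fragment of length >= L with the lowest average distance (the real algorithm additionally warps within bands of width W):

    import numpy as np

    # Least-average fragment of length >= min_len on one diagonal, via
    # prefix sums; returns (average distance, start, end).
    def least_average_fragment(diag, min_len):
        prefix = np.concatenate([[0.0], np.cumsum(diag)])
        best = (np.inf, 0, 0)
        for s in range(len(diag) - min_len + 1):
            for e in range(s + min_len, len(diag) + 1):
                best = min(best, ((prefix[e] - prefix[s]) / (e - s), s, e))
        return best

    # One band per diagonal offset k; collect the best fragments.
    def segmental_matches(dist, min_len=5):
        n, m = dist.shape
        matches = []
        for k in range(-(n - 1), m):
            diag = np.diagonal(dist, offset=k)
            if len(diag) >= min_len:
                avg, s, e = least_average_fragment(diag, min_len)
                matches.append((avg, k, s, e))
        return sorted(matches)[:10]           # best-matching fragments first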

Automatic speech segmentation – HMM-based forced alignment
- Initialize flat-start HMMs (every state gets the global mean and variance of the data)
- Do HMM embedded training (overfitting is not a concern for speech segmentation)
  - Uses the forward-backward probabilities of Baum-Welch re-estimation
  - But all models of an utterance are trained in parallel
- Do Viterbi decoding to get the alignments (including multiple pronunciations) — see the sketch below
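
A minimal sketch of the final Viterbi alignment step, reduced to stay/advance transitions over the concatenated left-to-right states (a real system, e.g. HTK's HVite, uses the trained transition probabilities and handles multiple pronunciations):

    import numpy as np

    # Viterbi forced alignment over the concatenated left-to-right states.
    # loglik[t, s] = log-likelihood of frame t under state s; transitions
    # are reduced to stay/advance for brevity. Assumes T >= S.
    def force_align(loglik: np.ndarray):
        T, S = loglik.shape
        delta = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        delta[0, 0] = loglik[0, 0]            # must start in the first state
        for t in range(1, T):
            for s in range(S):
                stay = delta[t - 1, s]
                adv = delta[t - 1, s - 1] if s > 0 else -np.inf
                back[t, s] = s if stay >= adv else s - 1
                delta[t, s] = max(stay, adv) + loglik[t, s]
        states = [S - 1]                      # must end in the last state
        for t in range(T - 1, 0, -1):
            states.append(back[t, states[-1]])
        return states[::-1]                   # one state index per frame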

Automatic speech segmentation – HMM-based forced alignment
- Embedded training (initial conditions for the forward probabilities); in HTK-style notation, where state 1 is the non-emitting entry state and N_q the exit state of model q:
- Forward probability of being in state 1 of model q at time 1 (seeing frame 1):
  - $\alpha^{(q)}_1(1) = 1$ if $q = 1$, i.e. for the first of the concatenated models
  - Otherwise, the probability of being in state 1 of the previous model $q-1$ and taking the transition to the end of model $q-1$: $\alpha^{(q)}_1(1) = \alpha^{(q-1)}_1(1)\, a^{(q-1)}_{1 N_{q-1}}$
- Forward probability of being in an emitting state j of model q at time 1: the probability of the transition from the entry state of model q to state j, times the probability of observing observation 1 in state j of model q: $\alpha^{(q)}_j(1) = \alpha^{(q)}_1(1)\, a^{(q)}_{1j}\, b^{(q)}_j(o_1)$
- Forward probability of being in the final state of model q at time 1: the sum of the probabilities of being in one of the emitting states of model q at time 1, times the transition probability from that state to the final state of model q: $\alpha^{(q)}_{N_q}(1) = \sum_{j=2}^{N_q - 1} \alpha^{(q)}_j(1)\, a^{(q)}_{j N_q}$