2
Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Michael Pucher
Speech Synthesis Overview
3
Conversational speech

Y. Liu, E. Shriberg, A. Stolcke (2003), Automatic disfluency identification in conversational speech using multiple knowledge sources
E. Shriberg (2005), Spontaneous speech: How people really talk, and why engineers should care
- Recovery of hidden punctuation
- Speaker overlap (multi-party speech)
- Speaker state and emotion

N. Campbell (2006), Conversational speech synthesis and the need for some laughter
- Conversational speech synthesis, based on a very large corpus of everyday conversational speech

Non-lexical particles:
- Para-linguistic: laughing, whispering, ...
- Reflex: breathing, ...
- Disfluency: filled pauses (äh, ähm), ...

Phonetic transcription of „Investition“ (time in seconds, phone):
0.108  sil    (silence)
0.126  GS     (glottal stop)
0.178  @      (schwa)
0.244  n
0.274  v
0.336  E
0.400  s
0.422  t_cl   (closure)
0.464  t
0.514  i:
0.590  ts_cl  (closure)
0.686  ts
0.724  i:
0.904  o:
1.074  n
1.118  sp     (short pause)
...
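To make the label format above concrete, here is a minimal parsing sketch; the function name and the assumption that each line holds a time stamp followed by a phone symbol are mine, not from the slides.

# Minimal sketch: parse a time-aligned phone transcription like the
# „Investition“ example above. Format assumption: one "time phone" pair
# per line, times in seconds.

def parse_labels(text):
    """Return a list of (time, phone) tuples."""
    segments = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) >= 2:
            segments.append((float(fields[0]), fields[1]))
    return segments

labels = parse_labels("0.108 sil\n0.126 GS\n0.178 @\n0.244 n")
print(labels)  # [(0.108, 'sil'), (0.126, 'GS'), (0.178, '@'), (0.244, 'n')]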
4
Automatic speech segmentation

F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
- DTW-based forced alignment

L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models
- GMM-based boundary correction using boundary feature vectors

A. Park, J. R. Glass (2005), Towards unsupervised pattern discovery in speech
- Segmental DTW for pattern discovery
- Finding matching sub-patterns

DTW alignment of Viennese „ist“ [i s] and „können“ [k E n a n] (cumulative DTW costs on the left, phonetic distances on the right):

         i     s                   i     s
    0    ∞     ∞              K   0.5   0.5
K   ∞   0.5   1.0             E   0.1   0.5
E   ∞   0.6   1.0             n   0.5   0.5
n   ∞   1.1   1.1             a   0.2   0.5
a   ∞   1.3   1.6
n   ∞   1.8   1.8
5
Synthesis of singing

K. Saino, H. Zen, Y. Nankaku, A. Lee, K. Tokuda (2006), An HMM-based singing voice synthesis system
- Time-lag features

T. Saitou, M. Goto, M. Unoki, M. Akagi (2007), Vocal conversion from speaking voice to singing voice using STRAIGHT
- Converts a speech input signal into a singing-voice output signal
6
Basics and history of unit selection speech synthesis

Y. Sagisaka (1988), Speech synthesis by rule using an optimal selection of non-uniform synthesis units
- Non-uniform unit selection synthesis

A. Hunt, A. Black (1996), Unit selection in a concatenative speech synthesis system using a large speech database
- Target cost (analogous to an emission probability) and concatenation cost (analogous to a state transition probability)
- Viterbi decoding of the state transition network over the possible units, e.g. for the English word „no“ [n ou] (a search sketch follows below)
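A minimal sketch of the Hunt & Black style search: a Viterbi pass over candidate units that minimizes the sum of target and concatenation costs. The function names, unit representation, and cost interfaces are hypothetical placeholders, not the authors' code.

# Viterbi unit selection: pick one unit per target position so that the
# total of target costs plus concatenation costs is minimal.

def viterbi_unit_selection(targets, candidates, target_cost, concat_cost):
    """targets: list of target specifications, one per position.
    candidates: list of candidate-unit lists, one per position.
    Returns the minimum-cost unit sequence."""
    # best[i][j] = (cost of best path ending in candidate j at position i, backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(v, u) + tc, k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the lowest-cost path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))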
7
Concatenation costs and target costs

A. Black, P. Taylor (1997), Automatically clustering similar units for unit selection in speech synthesis
- Decision-tree-based clustering of target units with a phonetic decision tree
- Mean acoustic distance (mel-cepstrum, F0, power, deltas) between all cluster members as impurity measure
1. Split the data points using all possible questions
2. Keep the question whose split gives the largest impurity reduction
3. Remove that question from the question set
4. Continue with the data points in each resulting subset
(a sketch of this greedy splitting follows below)

Pantazis, Y., Stylianou, Y., Klabbers, E. (2005), Discontinuity detection in concatenated speech synthesis based on nonlinear speech analysis
- New non-stationary features for measuring discontinuity
- Discrimination of continuous and discontinuous signals
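A minimal sketch of one greedy splitting step, with the mean pairwise acoustic distance as impurity, as described above. The distance function, the question representation (predicates over units), and all names are assumptions for illustration.

import itertools

def impurity(units, distance):
    """Mean pairwise acoustic distance within a cluster."""
    pairs = list(itertools.combinations(units, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def best_split(units, questions, distance):
    """Return (gain, question, yes_units, no_units) with the largest impurity reduction."""
    base = impurity(units, distance)
    best = None
    for q in questions:
        yes = [u for u in units if q(u)]
        no = [u for u in units if not q(u)]
        if not yes or not no:
            continue  # a question that does not split the data is useless
        gain = base - (len(yes) * impurity(yes, distance) +
                       len(no) * impurity(no, distance)) / len(units)
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    return best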
8
Basics of HMM-based speech synthesis

K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura (2000), Speech parameter generation algorithms for HMM-based speech synthesis
- Maximize P(o | q, λ) over the observation sequence o under the constraint o = Wc (static features plus their dynamics)
- Takes dynamic features (derivatives of the cepstral coefficients) into account; otherwise only the state means would be selected

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura (1999), Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis
- F0 and duration modeling by multi-space probability distribution HMMs

H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, Hidden semi-Markov model based speech synthesis
- Explicit modeling of state durations

K. Tokuda, T. Masuko, N. Miyazaki, T. Kobayashi (2002), Multi-space probability distribution HMM
- Modeling of observation vectors with variable dimensionality (discrete symbols are 0-dimensional)
- Useful for F0 modeling with the voiced/unvoiced distinction
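As a pointer to what the maximization looks like in closed form (notation assumed here, following the standard presentation of parameter generation in Tokuda et al., 2000, not copied from the slide):

% Parameter generation sketch: o = Wc stacks the static features c with
% their dynamics; mu_q and Sigma_q are the means and covariances of the
% selected state sequence q. Maximizing the Gaussian log likelihood
\max_{\mathbf{c}} \; \log \mathcal{N}\bigl(W\mathbf{c};\, \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q\bigr)
% leads to the linear system that determines the generated trajectory c:
W^{\top} \boldsymbol{\Sigma}_q^{-1} W\, \mathbf{c} \;=\; W^{\top} \boldsymbol{\Sigma}_q^{-1} \boldsymbol{\mu}_q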
9
Speaker interpolation

T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura, Speaker interpolation in HMM-based speech synthesis system
M. Tachibana, J. Yamagishi, T. Masuko, T. Kobayashi, Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing

- Defines and evaluates three different interpolation methods (a sketch of the second one follows below):
  - Interpolation among observations
  - Interpolation among output distributions
  - Interpolation based on KL divergence
- For emotional speech synthesis: generating mixed emotions, e.g. between happy and sad
- For variant modeling: generating a variant between American English and Indian English, or between Austrian German and Viennese
- Only directly applicable if the tying structure of the models is the same
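A minimal sketch of interpolation among Gaussian output distributions, one of the methods named above. The moment-matching combination used here is an assumption for illustration; the exact scheme in Yoshimura et al. differs in detail.

import numpy as np

def interpolate_gaussians(means, variances, weights):
    """means, variances: per-speaker diagonal-Gaussian parameters (lists of
    arrays for one tied state); weights: interpolation ratios summing to 1."""
    w = np.asarray(weights)[:, None]
    mu = (w * np.asarray(means)).sum(axis=0)
    # Assumption: interpolate second moments, then subtract the squared mean.
    second = (w * (np.asarray(variances) + np.asarray(means) ** 2)).sum(axis=0)
    return mu, second - mu ** 2

mu, var = interpolate_gaussians(
    [np.array([1.0, 2.0]), np.array([3.0, 4.0])],   # two speakers' state means
    [np.array([0.5, 0.5]), np.array([1.0, 1.0])],   # their variances
    [0.7, 0.3],                                     # interpolation ratio
)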
10
Speaker adaptation

J. Yamagishi, T. Kobayashi, Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training
J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi, A training method of average voice model for HMM-based speech synthesis
M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi, Text-to-speech synthesis with arbitrary speaker's voice from average voice
11
Signal generation

S. Imai (1983), Cepstral analysis synthesis on the mel frequency scale
- Classical approach to signal generation from mel-cepstral coefficients and F0

T. Fukada, K. Tokuda, T. Kobayashi, S. Imai (1992), An adaptive algorithm for mel-cepstral analysis of speech
H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné (1999), Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds
H. Zen, T. Toda, M. Nakamura, K. Tokuda (2007), Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005
R. Maia, T. Toda, H. Zen, Y. Nankaku, K. Tokuda (2007), An excitation model for HMM-based speech synthesis based on residual modeling
12
Context clustering

S. J. Young, J. J. Odell, P. C. Woodland (1994), Tree-based state tying for high accuracy modelling
1. Create and train monophone HMMs
2. Clone them and re-estimate untied context-dependent triphone HMMs
3. Cluster (tree-based) the corresponding states: put all i-th states of the „iy“ models in one cluster and find the clustering that maximizes the log likelihood
...
4. Increase the number of mixture components

K. Shinoda, T. Watanabe (1997), Acoustic modeling based on the MDL principle for speech recognition
J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi (2003), A context clustering technique for average voice models
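For the MDL-based variant of Shinoda & Watanabe, the selection criterion can be sketched as follows (notation assumed here, not taken from the slide):

% MDL sketch: for a candidate clustering m with K_m free parameters trained
% on N frames, minimize the description length
\ell(m) \;=\; -\log p(\mathbf{x} \mid \hat{\theta}_m) \;+\; \frac{K_m}{2}\,\log N \;+\; \mathrm{const.}
% A node split is accepted only if its log-likelihood gain exceeds the
% increase of the penalty term (K_m / 2) log N, so no external stopping
% threshold is needed.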
13
Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Michael Pucher, Volker Strom, Gregor Hofer, Friedrich Neubarth, Sylvia Moosmüller, Gudrun Schuchmann, Christian Kranzler
Viennese Sociolect and Dialect Synthesis (Wiener Soziolekt und Dialektsynthese)
14
Test voices (speaker selection) and talking clocks
- Recorded approx. 100 training sentences per speaker (9 speakers)
- Generated 10 test sentences
  - Austrian German
  - Viennese
- Talking clocks: http://magog.ftw.at
15
Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Michael Pucher
Speech Synthesis: Automatic speech segmentation
16
Automatic speech segmentation

F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
- DTW-based forced alignment

L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models
- GMM-based boundary correction using boundary feature vectors

A. Park, J. R. Glass (2005), Towards unsupervised pattern discovery in speech
- Segmental DTW for pattern discovery
- Finding matching sub-patterns

DTW alignment of Viennese „ist“ [i s] and „können“ [k E n a n] (cumulative DTW costs on the left, phonetic distances on the right):

         i     s                   i     s
    0    ∞     ∞              K   0.5   0.5
K   ∞   0.5   1.0             E   0.1   0.5
E   ∞   0.6   1.0             n   0.5   0.5
n   ∞   1.1   1.1             a   0.2   0.5
a   ∞   1.3   1.6
n   ∞   1.8   1.8
17
Automatic speech segmentation – Dynamic Time Warping (DTW) based forced alignment

F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
- DTW-based forced alignment

DTW alignment of Viennese „ist“ [i s] and „können“ [k E n a n] (cumulative DTW costs on the left, phonetic distances on the right):

         i     s                   i     s
    0    ∞     ∞              K   0.5   0.5
K   ∞   0.5   1.0             E   0.1   0.5
E   ∞   0.6   1.0             n   0.5   0.5
n   ∞   1.1   1.1             a   0.2   0.5
a   ∞   1.3   1.6
n   ∞   1.8   1.8

DTW initialization for the alignment of „können“ [k E n a n] with an unknown sequence (columns: the unknown input values 6 7 4 3 8 2 1; rows: the reference sequence for [k E n a n]). All cells are set to ∞ except the origin, which is 0:

                  6   7   4   3   8   2   1
            0     ∞   ∞   ∞   ∞   ∞   ∞   ∞
K  6 (0.1)  ∞
   5 (0.2)  ∞
E  4 (0.3)  ∞
   6 (0.4)  ∞
   7 (0.5)  ∞
n  8 (0.6)  ∞
   5 (0.7)  ∞
a  3 (0.8)  ∞
   4 (0.9)  ∞
n  2 (1.0)  ∞
18
Automatic speech segmentation – Dynamic Time Warping (DTW) based forced alignment

F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation
- DTW-based forced alignment

DTW recursion:
DTW[i, j] := cost(i, j) + min( DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1] )

This is one possible local path pattern; other patterns for constraining the search space are possible (see Rabiner & Juang, 1993). A runnable version of this recursion follows below.

Completed DTW matrix for the alignment of „können“ [k E n a n] with the unknown sequence, with cost(i, j) taken as the absolute difference of the reference and input values:

                  6    7    4    3    8    2    1
            0     ∞    ∞    ∞    ∞    ∞    ∞    ∞
K  6 (0.1)  ∞     0    1    3    6    8   12   17
   5 (0.2)  ∞     1    2    2    4    7   10   14
E  4 (0.3)  ∞     3    4    2    3    7    9   12
   6 (0.4)  ∞     3    4    4    5    5    9   14
   7 (0.5)  ∞     4    3    6    8    6   10   15
n  8 (0.6)  ∞     6    4    7   11    6   12   17
   5 (0.7)  ∞     7    6    5    7    9    9   13
a  3 (0.8)  ∞    10   10    6    5   10   10   11
   4 (0.9)  ∞    12   13    6    6    9   11   13
n  2 (1.0)  ∞    16   17    8    7   12    9   10
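A minimal sketch of the recursion above in Python; the function name and interface are mine. With the absolute-difference cost it reproduces the matrix on this slide.

import math

def dtw(reference, sequence, cost=lambda a, b: abs(a - b)):
    n, m = len(reference), len(sequence)
    # (n+1) x (m+1) matrix, initialized to infinity except the origin.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(reference[i - 1], sequence[j - 1]) + min(
                D[i - 1][j],      # insertion
                D[i][j - 1],      # deletion
                D[i - 1][j - 1],  # match
            )
    return D[n][m], D

total, D = dtw([6, 5, 4, 6, 7, 8, 5, 3, 4, 2], [6, 7, 4, 3, 8, 2, 1])
print(total)  # 10, the bottom-right entry of the matrix above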
19
Automatic speech segmentation – Boundary correction

L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models
- Extract a boundary feature vector at each labeled boundary in the database
- Do decision-tree-based clustering on the feature vectors, using questions based on the left and right phoneme context (phonetic decision tree):
  1. Split the data points using all possible questions
  2. Keep the question whose split gives the largest impurity reduction
  3. Remove that question from the question set
  4. Continue with the data points in each resulting subset
- Train a GMM on each resulting class
- Use the GMM to move the segmentation boundary within a certain range (a sketch follows below)
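A minimal sketch of the refinement step in the spirit of Wang et al. (2004): fit a GMM per boundary class, then move each boundary to the most likely frame within a small search window. The feature data, window size, and all names are assumptions; only the scikit-learn GMM calls are real API.

import numpy as np
from sklearn.mixture import GaussianMixture

def refine_boundary(frames, initial_index, gmm, search_range=3):
    """frames: per-frame boundary feature vectors (2D array).
    Returns the frame index near initial_index that the GMM scores highest."""
    lo = max(0, initial_index - search_range)
    hi = min(len(frames), initial_index + search_range + 1)
    window = np.asarray(frames[lo:hi])
    scores = gmm.score_samples(window)   # log-likelihood per candidate frame
    return lo + int(np.argmax(scores))

# Fit one GMM per boundary class on the boundary vectors of that class.
boundary_vectors = np.random.randn(200, 13)          # stand-in training data
gmm = GaussianMixture(n_components=4).fit(boundary_vectors)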
20
Automatic speech segmentation – Segmental DTW

A. Park, J. R. Glass (2005), Towards unsupervised pattern discovery in speech
- Segmental DTW for pattern discovery: finding matching sub-patterns
- Divide the distance matrix into diagonal subbands of width W
- Search for the best alignment within each band
- Find the minimum-average subsequence with minimum length L
- The minimum-average subsequence is the part of the aligned path that exhibits a good alignment
(a sketch of the band constraint follows below)
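A minimal sketch of the band idea only: run one DTW per diagonal offset, restricted to a band of width W around that diagonal. The band parameterization and start point are my assumptions; the subsequence extraction and minimum-average path search of Park & Glass are omitted.

def banded_dtw(dist, offset, width):
    """dist: n x m distance matrix. Constrained DTW inside the diagonal band
    j - i in [offset - width, offset + width], starting at cell (0, offset)."""
    n, m = len(dist), len(dist[0])
    D = {(0, offset): dist[0][offset]}   # band origin
    for i in range(n):
        for j in range(max(0, i + offset - width), min(m, i + offset + width + 1)):
            if (i, j) in D:
                continue
            # Predecessors that exist inside the band.
            prev = [D[p] for p in ((i - 1, j), (i, j - 1), (i - 1, j - 1)) if p in D]
            if prev:
                D[(i, j)] = dist[i][j] + min(prev)
    return D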
21
Automatic speech segmentation – HMM-based forced alignment
- Initialize flat-start HMMs (use the global mean and variance of the data for every state)
- Do HMM embedded training (overfitting is not a concern for speech segmentation)
  - Uses the forward-backward probabilities as in Baum-Welch re-estimation
  - But all models of an utterance are trained in parallel, concatenated according to the transcription
- Do Viterbi decoding to get the alignments (including multiple pronunciations)
(a flat-start initialization sketch follows below)
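A minimal flat-start sketch: every state of every monophone model is initialized with the global mean and variance of the training data, as described above. The model container is a hypothetical stand-in, not a real toolkit structure.

import numpy as np

def flat_start(features, phone_set, states_per_phone=3):
    """features: list of (T_i x D) utterance feature matrices."""
    stacked = np.vstack(features)
    mean, var = stacked.mean(axis=0), stacked.var(axis=0)
    # Every emitting state of every model gets the same global Gaussian.
    return {
        phone: [{"mean": mean.copy(), "var": var.copy()}
                for _ in range(states_per_phone)]
        for phone in phone_set
    }

models = flat_start([np.random.randn(100, 13)], ["a", "n", "sil"])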
22
Automatic speech segmentation – HMM-based forced alignment

Embedded training (initial conditions for the forward probabilities; HTK-style notation, with entry state 1 and non-emitting exit state $N_q$ of model $q$):

Forward probability of being in state 1 of model $q$ at time 1 (seeing frame 1): 1 if $q = 1$, e.g. for the first of the concatenated models; otherwise the probability of being in model $q-1$ (the previous model) in state 1 and taking the transition to the end of model $q-1$:
$$\alpha_1^{(q)}(1) = \begin{cases} 1 & q = 1 \\ \alpha_1^{(q-1)}(1)\, a_{1 N_{q-1}}^{(q-1)} & q > 1 \end{cases}$$

Forward probability of being in emitting state $j$ of model $q$ at time 1: the probability of the transition from state 1 of model $q$ to state $j$, times the probability of observing observation 1 in state $j$ of model $q$:
$$\alpha_j^{(q)}(1) = \alpha_1^{(q)}(1)\, a_{1j}^{(q)}\, b_j^{(q)}(o_1)$$

Forward probability of being in the final state of model $q$ at time 1: the sum of the probabilities of being in one of the emitting states of model $q$ at time 1, times the transition probability from that state to the final state of model $q$:
$$\alpha_{N_q}^{(q)}(1) = \sum_{j=2}^{N_q - 1} \alpha_j^{(q)}(1)\, a_{j N_q}^{(q)}$$