Sub-Project I: Prosody, Tones and Text-To-Speech Synthesis
Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang
2 Outline
Members
Theme of Sub-project I
Research Roadmap
Current Achievements
Research Infrastructure
Future Direction
3 Members
Sin-Horng Chen, Professor (PI), NCTU
Yih-Ru Wang, Associate Professor (Co-PI), NCTU
Lin-shan Lee, Professor, NTU
Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
Hsin-min Wang, Associate Research Fellow, Academia Sinica
4 Theme of Sub-Project I
Prosody Analysis and Modeling
–Hierarchical modeling of fluent prosody
–Latent factor-based pitch contour model (mean model, shape model)
Tone Behavior and Modeling
–Tone sandhi
Applications in Text-to-speech Synthesis
–High performance TTS
Applications in Speech/Speaker Recognition
–Prosodic model-based tone recognizer
–Speaker recognition
[Figure labels: Fast speakers, Slow speakers, More breaks, Less breaks]
5 Research Focus
How to analyze and model fluent speech prosody
–Approach 1: Hierarchical modeling of fluent speech prosody
  Develop a hierarchical prosody framework of fluent speech
  Construct modular acoustic models for (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
–Approach 2: Latent factor analysis-based modeling
  Assume that latent affecting factors exist
  Apply latent factor analysis to syllable duration, pitch contour, energy and inter-syllable coarticulation
  Explore the relation between latent factors and syntactic information
How to integrate these two approaches and apply them to
–Text-to-speech synthesis
–Speech/tone/speaker recognition
6 Research Roadmap
[Roadmap diagram running from Current Achievements to Future Direction; box labels:]
–COSPRO corpus/Toolkits
–Hierarchical modeling of fluent speech prosody
–Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality
–RNN/VQ-based prosodic modeling
–Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
–Corpus-based TTS
–Model-based TTS
–High performance TTS: Mandarin, Min-south, Hakka
–Tone modeling and recognition: MLP/RNN, HMM
–Model-based tone recognizer
–Language model+pause, PM
–Prosodic cues-dependent LM
–Eigen prosody analysis-based speaker recognition
–Prosodic model-based speaker recognition
–Automatic prosodic labeling
–Prosodic phrase analysis
7 Hierarchical Prosody Framework of Fluent Speech (1/4)
Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
–Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
–Acoustic templates are derived for each prosody level:
  F0 templates
  Syllable duration templates and temporal allocation patterns
  Intensity distribution patterns
  Boundary break patterns
8 Hierarchical Prosody Framework of Fluent Speech (2/4)
The Prosody Hierarchy with Prosodic Boundaries
[Figure: hierarchy diagram. Node labels: Prosodic Group, Breath Group, Prosodic Phrase (Initial PP, Middle PP, Final PP), PW. Boundary labels: B2, B3, B4, B5]
9 Hierarchical Prosody Framework of Fluent Speech (3/4)
F0 cadence of multi-phrase PG (Prosodic Phrase Group): Tide over Wave and Ripple
Syllable duration cadence of multi-phrase PG
[Figure panels: the PW level; the PPh level (PG-initial PPh, PG-medial PPh, PG-final PPh)]
10 Hierarchical Prosody Framework of Fluent Speech (4/4)
Duration re-synthesis, F054C
F0 re-synthesis, F054C
Cross-speaker synthesis: manipulating Speaker A’s duration parameters with Speaker B’s
[Audio sample labels: Original, Modified, Original]
11 Latent Factor Analysis-based Prosody Modeling (1/3)
Syllable Duration Model
–Multiplicative model
–Additive model
[Figure: Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models]
mean: 42.3 frames, 43.9 frames
variance: 180 frame², 2.52 frame²
RMSE: 1.93 frames (5 ms/frame)
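The two model forms named on this slide can be sketched roughly as follows. This is a minimal illustration, assuming the observed syllable duration is a normalized duration that is scaled (multiplicative form) or shifted (additive form) by one companding factor per affecting factor. The function names, factor names and numeric values are hypothetical and are not taken from the slides.

```python
from math import prod


def multiplicative_duration(normalized, factors):
    """Observed duration = normalized duration times every companding factor."""
    return normalized * prod(factors.values())


def additive_duration(normalized, factors):
    """Observed duration = normalized duration plus every companding factor."""
    return normalized + sum(factors.values())


def normalize_multiplicative(observed, factors):
    """Invert the multiplicative form to recover the normalized duration."""
    return observed / prod(factors.values())


# Hypothetical companding factors for one syllable (illustrative values only).
mult_cfs = {"tone": 1.05, "prosodic_state": 0.90, "base_syllable": 1.10}
add_cfs = {"tone": 2.0, "prosodic_state": -4.5, "base_syllable": 3.5}

observed_mult = multiplicative_duration(40.0, mult_cfs)  # in frames
observed_add = additive_duration(40.0, add_cfs)          # in frames
recovered = normalize_multiplicative(observed_mult, mult_cfs)
```

Under this assumption, removing the factors from the observed durations leaves a residual with much smaller spread than the raw durations, which is the kind of variance reduction the statistics on the slide appear to report.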
12 Latent Factor Analysis-based Prosody Modeling (2/3)
Syllable Pitch Contour Model
–Mean model
–Shape model
[Figures: the patterns of x-3-3; reconstructed pitch mean]
13 Latent Factor Analysis-based Prosody Modeling (3/3)
Inter-syllable coarticulation pitch contour model
[Figures: the relationship of syllable pitch contours and affecting factors; reconstructed pitch contour]
14 Mandarin/Taiwanese TTS
Block diagram of TTS system
TTS samples
–Model-based TTS: female 1, female 2, female 3, female 4, female 5
–Corpus-based TTS: female 1, female 2, female 3, female 4, female 5
Taiwanese-
15 Tone Behavior Modeling and Recognition with Inter-Syllabic Features
Gabor-IFAS-based pitch detection
Four inter-syllabic features
–Ratio of duration of adjacent syllables
–Averaged pitch value over a syllable
–Maximum pitch difference within a syllable
–Averaged slope of the pitch contour over a syllable
Context-dependent tone behavior modeling
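The four features listed above could be computed along the following lines. This is a minimal sketch, assuming the pitch detector has already produced one F0 value (in Hz) per voiced frame of the syllable, and that syllable durations are known from segmentation. The function name, the 5 ms frame shift and the direction of the duration ratio (current over previous) are assumptions, not details given on the slide.

```python
import numpy as np


def inter_syllabic_features(f0, duration, prev_duration, frame_shift=0.005):
    """Compute four tone features for one syllable.

    f0            -- F0 values in Hz, one per voiced frame (at least 2 frames)
    duration      -- duration of this syllable in seconds
    prev_duration -- duration of the preceding syllable in seconds
    frame_shift   -- seconds between frames (assumed value)
    """
    f0 = np.asarray(f0, dtype=float)
    return {
        # Ratio of duration of adjacent syllables (assumed: current / previous).
        "duration_ratio": duration / prev_duration,
        # Averaged pitch value over the syllable.
        "mean_pitch": float(f0.mean()),
        # Maximum pitch difference within the syllable.
        "max_pitch_diff": float(f0.max() - f0.min()),
        # Averaged frame-to-frame slope of the pitch contour, in Hz per second.
        "mean_slope": float(np.mean(np.diff(f0)) / frame_shift),
    }


# Hypothetical rising contour, loosely resembling a Tone 2 syllable.
features = inter_syllabic_features([100.0, 110.0, 120.0, 130.0], 0.2, 0.1)
```

The resulting feature dictionary would then feed whatever context-dependent tone model is used; that modeling step is not sketched here.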
16 Eigen-Prosody Analysis-based Robust Speaker Recognition
Use latent semantic analysis (LSA) to efficiently extract useful speaker cues that resist handset mismatch from limited training/test data
–Step 1: Automatic prosodic state labeling and speaker-keyword statistics
–Step 2: Eigen-prosody space construction using latent semantic analysis
[Figure: processing pipeline. Prosodic features → VQ-based prosody modeling → prosody state labeling → sequences of prosody states → prosody keyword parsing (with dictionary) → prosody keywords → co-occurrence matrix A (prosody keywords, speakers) in a high-dimensional prosody space → eigen-prosody analysis (SVD), A = U S V^T → eigen-prosody space]
[Figure labels: Fast speakers, Slow speakers, More breaks, Less breaks]
Experimental results on HTIMIT corpus
–Ten different handsets
–302 speakers
–7/3 utterances for training/test, respectively
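Step 2 above can be illustrated with a small NumPy sketch. It assumes each speaker has already been reduced to a list of prosody keywords (Step 1), builds a keyword-by-speaker co-occurrence matrix A, and truncates its SVD to obtain a low-dimensional space. The keyword strings, function names and the chosen rank are hypothetical; any weighting or normalization of A used in the actual system is not shown.

```python
import numpy as np


def build_cooccurrence(keyword_lists, vocab):
    """Count matrix A: rows are prosody keywords, columns are speakers."""
    index = {kw: i for i, kw in enumerate(vocab)}
    A = np.zeros((len(vocab), len(keyword_lists)))
    for col, keywords in enumerate(keyword_lists):
        for kw in keywords:
            A[index[kw], col] += 1
    return A


def eigen_prosody_space(A, rank):
    """Truncated SVD, A ~= U_k diag(s_k) Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]


def project(counts, U_k, s_k):
    """Fold a keyword-count vector into the reduced space (standard LSA fold-in)."""
    return (counts @ U_k) / s_k


# Hypothetical prosody keywords: pairs of consecutive prosody state labels.
vocab = ["s1 s2", "s2 s3", "s3 s1", "s1 s1"]
speakers = [
    ["s1 s2", "s2 s3", "s1 s2"],  # speaker 0
    ["s3 s1", "s1 s1", "s3 s1"],  # speaker 1
    ["s1 s2", "s2 s3", "s3 s1"],  # speaker 2
]

A = build_cooccurrence(speakers, vocab)
U_k, s_k, Vt_k = eigen_prosody_space(A, rank=2)

# Column j of Vt_k holds the coordinates of training speaker j.
test_vector = project(A[:, 0], U_k, s_k)
```

A test utterance would be reduced to the same keyword counts, folded in with `project`, and compared with the training speakers' coordinates, for example by cosine similarity.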
17 Research Infrastructure (1/2)
Sinica COSPRO and Toolkits:
–9 sets of Mandarin Chinese fluent speech corpora collected
–Platform developed
–Each corpus was designed to bring out different prosody features involved in fluent speech.
–Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multiple-phrase speech paragraph.
–Framework constructed to bring out speech paragraphs and the cross-phrase prosodic relationships characteristic of narrative or discourse organization.
18 Research Infrastructure (2/2)
Tree-Bank Speech Database
–Uttered by a single female speaker
–Short paragraphs, 110,000 syllables
–Sentence-based syntactic trees annotated manually
–Pitch contour and syllable segmentation corrected manually
19 Future Direction (1/5)
Automatic prosodic labeling of Mandarin speech corpus
Analysis of prosodic phrase structure
Model-based tone recognition
High performance TTS
Speech recognition/language modeling using prosodic cues
Prosodic modeling-based robust speaker recognition
20 Future Direction (2/5)
Automatic prosodic labeling of Mandarin speech corpus
–Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and to use it for automatic labeling of various acoustic cues:
  Prosodic phrase boundary detection
  Inter-syllable/inter-word coarticulation classification
  Full/half/sandhi tone labeling for Tone 3
  Syllable pronunciation clustering
  Homograph determination
  The grouping of monosyllabic words with their neighboring words
21 Future Direction (3/5)
Analysis of prosodic phrase structure
–4-level prosody hierarchy: PW, PPh, BG, PG
–Issues to be studied
  Detection and classification of prosodic phrases
  Relation between syntactic phrase structure and prosodic phrase structure
  Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech
Model-based tone recognition
–Current approach
  Acoustic feature normalization
  Context-dependent tone modeling
–Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour
22 Future Direction (4/5)
High performance TTS
–Applying the sophisticated prosody models
  Modular model of fluent speech prosody
  Latent factor analysis-based modeling
–Main idea: With important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.
  Consider both linguistic information and acoustic cues
  Treat monosyllabic words specially
–Use the above prosody-syntax models to assist in the generation of prosodic information
23 Future Direction (5/5)
Speech recognition/language modeling using prosodic cues
–Automatic prosodic state labeling
–Prosodic state-dependent acoustic modeling
–Prosodic state-dependent language modeling
Prosodic modeling-based robust speaker recognition
–Automatic prosodic cue labeling
–N-gram language model to learn the prosodic behavior of speakers
–Applying principal component analysis (PCA) to the N-gram model to find a compact prosodic speaker space
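The last two bullets could be prototyped roughly as follows. This is a minimal sketch, assuming each speaker is represented by a sequence of integer prosodic state labels, that a bigram (N = 2) relative-frequency table stands in for the N-gram model, and that PCA is computed from the SVD of the mean-centered speaker matrix. All names, the number of states and the label sequences are hypothetical.

```python
import numpy as np


def bigram_vector(states, n_states):
    """Flattened table of bigram relative frequencies for one speaker."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    total = counts.sum()
    if total > 0:
        counts /= total
    return counts.ravel()


def pca(X, n_components):
    """PCA via SVD of the mean-centered data; rows of X are speakers."""
    mean = X.mean(axis=0)
    centered = X - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    scores = centered @ components.T
    return mean, components, scores


# Hypothetical prosodic state sequences for three speakers (3 states: 0, 1, 2).
sequences = [
    [0, 1, 0, 1, 2, 0, 1],
    [2, 2, 1, 2, 2, 0, 2],
    [0, 0, 1, 1, 2, 2, 0],
]

X = np.stack([bigram_vector(seq, n_states=3) for seq in sequences])
mean, components, scores = pca(X, n_components=2)
# Each row of `scores` places one speaker in the compact prosodic speaker space.
```

A new speaker's bigram vector would be centered with `mean` and projected onto `components` in the same way before comparison.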