Sub-Project I: Prosody, Tones and Text-To-Speech Synthesis. Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang.

Presentation transcript:

Sub-Project I: Prosody, Tones and Text-To-Speech Synthesis
Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

Outline
– Members
– Theme of Sub-Project I
– Research Roadmap
– Current Achievements
– Research Infrastructure
– Future Direction

Members
– Sin-Horng Chen, Professor (PI), NCTU
– Yih-Ru Wang, Associate Professor (Co-PI), NCTU
– Lin-shan Lee, Professor, NTU
– Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
– Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
– Hsin-min Wang, Associate Research Fellow, Academia Sinica

Theme of Sub-Project I
[Slide diagram] Prosody analysis and modeling (tone behavior and modeling, tone sandhi, hierarchical modeling of fluent prosody, and a latent factor-based pitch contour model with mean and shape sub-models) feeds two application areas: text-to-speech synthesis (high-performance TTS) and speech/speaker recognition (a prosodic model-based tone recognizer and speaker recognition). The diagram also contrasts fast vs. slow speakers and speakers with more vs. fewer breaks.

Research Focus
How to analyze and model fluent speech prosody:
– Approach 1: Hierarchical modeling of fluent speech prosody
  – Develop a hierarchical prosody framework of fluent speech
  – Construct modular acoustic models for (1) F0 contours, (2) duration patterns, (3) intensity distribution, and (4) boundary breaks
– Approach 2: Latent factor analysis-based modeling
  – Assume there are some latent affecting factors
  – Latent factor analysis for syllable duration, pitch contour, energy, and inter-syllable coarticulation
  – Explore the relation between latent factors and syntactic information
How to integrate these two approaches and apply them to:
– Text-to-speech synthesis
– Speech/tone/speaker recognition

Research Roadmap
[Slide diagram mapping current achievements to future directions; recoverable elements:]
– Automatic prosodic labeling and prosodic phrase analysis
– High-performance TTS for Mandarin, Southern Min, and Hakka; from corpus-based TTS toward model-based TTS
– RNN/VQ-based prosodic modeling; COSPRO corpora and toolkits
– Hierarchical modeling of fluent speech prosody
– Latent factor analysis of duration, pitch mean, pitch shape, and inter-syllable coarticulation
– Tone modeling and recognition with MLP/RNN and HMM; model-based tone recognizer
– Language modeling with pause and punctuation-mark cues; prosodic cues-dependent LM
– Eigen-prosody analysis-based speaker recognition; prosodic model-based speaker recognition
– Investigation of prosody organization: F0 range and reset, naturalness and its measurement, voice quality

Hierarchical Prosody Framework of Fluent Speech (1/4)
Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs:
– Hierarchical cross-phrase patterns and contributions are found in all four acoustic dimensions.
– Acoustic templates are derived for each prosody level:
  – F0 templates
  – Syllable duration templates and temporal allocation patterns
  – Intensity distribution patterns
  – Boundary break patterns

Hierarchical Prosody Framework of Fluent Speech (2/4)
The prosody hierarchy with prosodic boundaries
[Slide diagram] Prosodic words (PW) group into prosodic phrases (initial, middle, and final PP), prosodic phrases into breath groups, and breath groups into a prosodic group; break indices B2 to B5 mark the increasingly strong boundaries between these units.
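As a reading aid only, here is a minimal sketch of the four-level hierarchy above as nested containers with a break label at each boundary; the class and field names are illustrative assumptions, not identifiers from the project.

import dataclasses
from typing import List

@dataclasses.dataclass
class ProsodicWord:            # PW: a small group of syllables
    syllables: List[str]
    boundary: str = "B2"       # break label assumed at the end of the PW

@dataclasses.dataclass
class ProsodicPhrase:          # PPh: initial, middle, or final within its group
    words: List[ProsodicWord]
    position: str = "middle"
    boundary: str = "B3"

@dataclasses.dataclass
class BreathGroup:             # BG: phrases spoken on one breath
    phrases: List[ProsodicPhrase]
    boundary: str = "B4"

@dataclasses.dataclass
class ProsodicGroup:           # PG: the multi-phrase speech paragraph
    breath_groups: List[BreathGroup]
    boundary: str = "B5"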

Hierarchical Prosody Framework of Fluent Speech (3/4)
[Slide figures] F0 cadence of a multi-phrase PG (prosodic phrase group): "tide over wave and ripple". Syllable duration cadence of a multi-phrase PG at the PW level and the PPh level, shown for PG-initial, PG-medial, and PG-final PPh.

Hierarchical Prosody Framework of Fluent Speech (4/4)
[Slide audio demos] Duration re-synthesis and F0 re-synthesis for speaker F054C (original vs. modified), and cross-speaker synthesis in which Speaker A's duration parameters are manipulated with Speaker B's.

Latent Factor Analysis-based Prosody Modeling (1/3)
Syllable duration model
– Multiplicative model
– Additive model
– Relations between prosodic-state CFs of Initial/Final and the syllable duration models
[Slide figures and equations] Reported statistics: mean 42.3 frames → 43.9 frames; variance 180 frame² → 2.52 frame²; RMSE 1.93 frames (5 ms/frame).
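The model equations on this slide did not survive the transcript. As a hedged sketch only (the exact formulation used in the sub-project may differ), a latent-factor syllable duration model of the two kinds named above decomposes the duration of syllable n into a global mean and companding factors (CFs) for its tone t_n, base-syllable type s_n, and prosodic state q_n:

  \hat{d}_n = \bar{d} \cdot \gamma^{\mathrm{tone}}_{t_n} \cdot \gamma^{\mathrm{syl}}_{s_n} \cdot \gamma^{\mathrm{state}}_{q_n}    (multiplicative model)
  \hat{d}_n = \bar{d} + \delta^{\mathrm{tone}}_{t_n} + \delta^{\mathrm{syl}}_{s_n} + \delta^{\mathrm{state}}_{q_n}    (additive model)

The factors are estimated to minimize the modeling residual; the variance figures quoted on the slide (180 frame² vs. 2.52 frame², at 5 ms per frame) presumably compare the duration variance before and after the factors are accounted for.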

Latent Factor Analysis-based Prosody Modeling (2/3)
Syllable pitch contour model
– Mean model
– Shape model
[Slide figures] The patterns of x-3-3; reconstructed pitch mean.
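The mean- and shape-model equations likewise did not survive the transcript. As an illustrative sketch, under the assumption that each syllable log-F0 contour is summarized by a few orthogonal-expansion coefficients (a common choice in this line of work, though the exact parameterization here is not recoverable), the two models could take the form:

  \log F_0^{(n)}(t) \approx \sum_{j=0}^{J} a_{n,j} \, \phi_j(t)
  a_{n,0} = \mu_0 + \beta^{\mathrm{tone}}_{t_n} + \beta^{\mathrm{state}}_{q_n} + r_{n,0}    (mean model)
  [a_{n,1}, \dots, a_{n,J}] = \boldsymbol{\mu} + \boldsymbol{\beta}^{\mathrm{tone}}_{t_n} + \boldsymbol{\beta}^{\mathrm{state}}_{q_n} + \mathbf{r}_n    (shape model)

where the \phi_j are orthogonal basis functions over the normalized syllable duration, the zeroth coefficient carries the pitch mean, the higher-order coefficients carry the contour shape, and r denotes the residual.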

Latent Factor Analysis-based Prosody Modeling (3/3)
Inter-syllable coarticulation pitch contour model
[Slide figures] The relationship between syllable pitch contours and affecting factors; reconstructed pitch contour.

Mandarin/Taiwanese TTS
[Slide figure and audio demos] Block diagram of the TTS system; TTS samples for model-based TTS and corpus-based TTS (five female voices each) and for Taiwanese.

Tone Behavior Modeling and Recognition with Inter-Syllabic Features
– Gabor-IFAS-based pitch detection
– Four inter-syllabic features (a sketch of their computation follows this slide):
  – Ratio of durations of adjacent syllables
  – Averaged pitch value over a syllable
  – Maximum pitch difference within a syllable
  – Averaged slope of the pitch contour over a syllable
– Context-dependent tone behavior modeling
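A minimal sketch of how the four features listed above could be computed from framewise F0 tracks, assuming a 5 ms frame shift and syllable-level segmentation; the function name and interfaces are hypothetical, not the project's code.

import numpy as np

def inter_syllabic_tone_features(f0_curr, dur_prev, dur_curr, frame_shift=0.005):
    """Hypothetical helper: f0_curr is the framewise F0 (Hz, 0 = unvoiced) of the
    current syllable; dur_prev and dur_curr are adjacent syllable durations (s)."""
    voiced = np.asarray(f0_curr, dtype=float)
    voiced = voiced[voiced > 0]                        # keep voiced frames only
    dur_ratio = dur_curr / max(dur_prev, 1e-6)         # 1) duration ratio of adjacent syllables
    if voiced.size < 2:                                # unvoiced syllable: return neutral values
        return dur_ratio, 0.0, 0.0, 0.0
    pitch_mean = float(voiced.mean())                  # 2) averaged pitch over the syllable
    pitch_range = float(voiced.max() - voiced.min())   # 3) maximum pitch difference within the syllable
    t = np.arange(voiced.size) * frame_shift
    slope = float(np.polyfit(t, voiced, 1)[0])         # 4) averaged slope of the pitch contour (Hz/s)
    return dur_ratio, pitch_mean, pitch_range, slope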

Eigen-Prosody Analysis-based Robust Speaker Recognition
Use latent semantic analysis (LSA) to efficiently extract useful speaker cues that resist handset mismatch from few training/test data:
– Step 1: Automatic prosodic state labeling and speaker-keyword statistics: VQ-based prosody modeling maps prosodic features to sequences of prosody states, which are parsed into prosody keywords against a keyword dictionary.
– Step 2: Eigen-prosody space construction using latent semantic analysis: the speaker-by-keyword co-occurrence matrix A is decomposed by SVD (A = U S V^T), mapping speakers from a high-dimensional prosody space into a low-dimensional eigen-prosody space (the slide illustrates fast vs. slow speakers and more vs. fewer breaks as separating directions). A toy numerical sketch of this step follows the slide.
Experimental results on the HTIMIT corpus:
– Ten different handsets
– 302 speakers
– 7/3 utterances per speaker for training/test, respectively
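A toy numerical sketch of Step 2, assuming the co-occurrence matrix A counts how often each prosody keyword is observed for each training speaker; the numbers below are invented for illustration and are not data from the project.

import numpy as np

# Toy speaker-by-prosody-keyword count matrix (rows: speakers, columns: keywords).
A = np.array([[12., 0., 3., 1.],
              [ 1., 9., 0., 7.],
              [10., 1., 2., 0.],
              [ 0., 8., 1., 9.]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)       # A = U diag(S) Vt
k = 2                                                  # keep the k leading eigen-prosody axes
speaker_coords = U[:, :k] * S[:k]                      # enrolled speakers in the eigen-prosody space

# Fold a test utterance's keyword counts q into the same space and score by cosine similarity.
q = np.array([11., 0., 2., 1.])
q_coords = (q @ Vt[:k].T) / S[:k]
scores = speaker_coords @ q_coords / (
    np.linalg.norm(speaker_coords, axis=1) * np.linalg.norm(q_coords))
print("closest enrolled speaker:", int(scores.argmax()))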

Research Infrastructure (1/2)
Sinica COSPRO corpora and toolkits:
– Nine sets of Mandarin Chinese fluent speech corpora collected
– Platform developed
– Each corpus was designed to bring out different prosodic features of fluent speech.
– Annotation includes labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multi-phrase speech paragraph.
– A framework was constructed to bring out speech paragraphs and the cross-phrase prosodic relationships characteristic of narrative or discourse organization.

Research Infrastructure (2/2)
Tree-Bank Speech Database:
– Uttered by a single female speaker
– Short paragraphs, 110,000 syllables
– Sentence-based syntactic trees annotated manually
– Pitch contours and syllable segmentation corrected manually

Future Direction (1/5)
– Automatic prosodic labeling of a Mandarin speech corpus
– Analysis of prosodic phrase structure
– Model-based tone recognition
– High-performance TTS
– Speech recognition/language modeling using prosodic cues
– Prosodic modeling-based robust speaker recognition

Future Direction (2/5)
Automatic prosodic labeling of a Mandarin speech corpus
– Goal: construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and use it for automatic labeling of various acoustic cues:
  – Prosodic phrase boundary detection
  – Inter-syllable/inter-word coarticulation classification
  – Full/half/sandhi tone labeling for Tone 3
  – Syllable pronunciation clustering
  – Homograph determination
  – Grouping of monosyllabic words with their neighboring words

Future Direction (3/5)
Analysis of prosodic phrase structure
– Four-level prosody hierarchy: PW, PPh, BG, PG
– Issues to be studied:
  – Detection and classification of prosodic phrases
  – Relation between syntactic phrase structure and prosodic phrase structure
  – Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech
Model-based tone recognition
– Current approach:
  – Acoustic feature normalization
  – Context-dependent tone modeling
– Main idea: use the statistics-based prosody models above to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour

Future Direction (4/5)
High-performance TTS
– Apply the sophisticated prosody models:
  – Modular model of fluent speech prosody
  – Latent factor analysis-based modeling
– Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis-unit sequence in a large database can be more efficient (a toy sketch of such a search follows this slide):
  – Consider both linguistic information and acoustic cues
  – Treat monosyllabic words specially
– Use the prosody-syntax models above to assist in generating prosodic information
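A toy sketch of that search, assuming per-position candidate unit lists plus user-supplied target and concatenation cost functions, which is where the labeled prosodic cues and linguistic features would enter; none of these names come from the project.

import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """Dynamic-programming (Viterbi) search for the lowest-cost unit sequence.
    candidates[i] is the list of database units proposed for target position i;
    target_cost(i, u) and concat_cost(prev_u, u) score prosodic/linguistic match."""
    if not candidates:
        return []
    n = len(candidates)
    best = [[target_cost(0, u) for u in candidates[0]]]    # accumulated cost per candidate
    back = [[None] * len(candidates[0])]                   # best-predecessor indices
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][j] + concat_cost(v, u)
                     for j, v in enumerate(candidates[i - 1])]
            j = int(np.argmin(costs))
            row.append(costs[j] + target_cost(i, u))
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    path = [int(np.argmin(best[-1]))]                      # backtrack from the cheapest end state
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return [candidates[i][j] for i, j in enumerate(reversed(path))]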

Future Direction (5/5)
Speech recognition/language modeling using prosodic cues
– Automatic prosodic state labeling
– Prosodic state-dependent acoustic modeling
– Prosodic state-dependent language modeling
Prosodic modeling-based robust speaker recognition
– Automatic prosodic cue labeling
– N-gram language models to learn the prosodic behavior of speakers
– Apply principal component analysis (PCA) to the N-gram statistics to find a compact prosodic speaker space (a toy sketch follows)
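A toy sketch of the last bullet, assuming each speaker is summarized by the relative frequencies of prosodic-state bigrams taken from automatically labeled state sequences; the state inventory size and the sequences below are invented for illustration.

import numpy as np

def bigram_vector(states, n_states):
    """Relative frequencies of prosodic-state bigrams for one speaker."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    total = counts.sum()
    return (counts / total).ravel() if total else counts.ravel()

n_states = 4
speaker_state_sequences = [[0, 1, 1, 2, 3, 1, 0],          # toy labeled sequences, one per speaker
                           [2, 2, 3, 3, 1, 0, 0],
                           [0, 0, 1, 2, 2, 3, 3]]
X = np.stack([bigram_vector(s, n_states) for s in speaker_state_sequences])

X_centered = X - X.mean(axis=0)                            # PCA via SVD of the centered matrix
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
speaker_space = X_centered @ Vt[:k].T                      # compact prosodic speaker coordinates
print(speaker_space.shape)                                 # (num_speakers, k)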