Segmental encoding of prosodic categories: A perception study through speech synthesis Kyuchul Yoon, Mary Beckman & Chris Brew.



August 19, 2005. Segmental encoding of prosodic categories.

Slide 2: Contents
An overview
  Allophonic variations
  Segmental positions
  Word-initial vs. word-internal positions in K-ToBI framework
  Allophonic variations: an extended view
  Production studies on Korean and other languages
  Need for a perception study, but how?
A diphone-based speech synthesis system for Korean
  Conventional diphones vs. prosodically-sensitive ones
A listening experiment
  Design and synthesis of test stimuli
  Results & conclusion

Slide 3 (An overview): Allophonic variations
Defined mostly in terms of neighboring segments.
e.g. allophones of /t/ in English:
  [t] “stop”
  [tʰ] “top”
  [ʔ] “kitten”
  [ɾ] “little”

Slide 4 (An overview): Segmental positions
Determined in most cases within a word by its
  1. neighboring segments and
  2. word boundaries, i.e. word-initial/final

Slide 5 (An overview): Word-initial vs. word-internal positions in K(orean)-ToBI (Tone & Break Indices; prosody labeling conventions)
  IP: Intonational Phrase
  AP: Accentual Phrase
  W: Prosodic Word (PW)
  σ: syllable
  H: high tone
  L: low tone
  T: tone (could be H or L)
  %: boundary tone (e.g. H%, L%, HL%, etc.)

Slide 6 (An overview): Word-initial positions in K-ToBI

Slide 7 (An overview): Word-initial positions in K-ToBI
[Figure labels: word-initial, word-final]

Slide 8 (An overview): Word-initial positions in K-ToBI
  (1) IP-initial (also AP-initial and PW-initial)
  (2) AP-initial (also PW-initial)
  (3) PW-initial
  (4) PW-medial
Three types of word-initial positions in K-ToBI!

Slide 9 (An overview): Allophonic variations: an extended view
Defined mostly in terms of neighboring segments.
Need to be examined with respect to their prosodic constituency in K-ToBI.

Slide 10 (An overview): Production studies on Korean and other languages
Korean
  Jun (’93, ’98): lenis stop voicing, obstruent nasalization, VOT of /pʰ/
  Cho & Keating (’01): segmental properties of /t, tʰ, t*, n/
  Kim (’01): segmental properties of /sʰ, s*/
  Yoon (’03): subsegmental durations of /sʰ, s*/
Other languages
  Smith (’97): American /z/
  Pierrehumbert & Talkin (’92), Pierrehumbert (’95): English /h/ and /ʔ/
  Fougeron (’01): French segments /t, k, s, l, n, i, a/
  Keating et al. (’98): /t, n/ of Korean, English, French & Taiwanese

Slide 11 (An overview): Production studies on Korean and other languages: summary of results
Korean
  AP is the domain of lenis stop voicing and post-obstruent tensing (Jun).
  IP is the domain of obstruent nasalization (Jun).
  VOT of /pʰ/: AP-initial > PW-initial > PW-medial (Jun).
  Consonants initial to higher prosodic domains are ‘stronger’ (Cho, Keating, Kim).
  Non-uniform variations in durations of subsegmental units (Yoon).
Other languages
  American English /z/ is devoiced differently in different positions (Smith).
  English /h/ and /ʔ/ are produced differently under different word-/phrase-level prosody (Pierrehumbert & Talkin).
  Articulation of initial segments varies with the prosodic level of the constituent, i.e. initial to an IP, AP, W or syllable (Fougeron).
  There is phrasal/prosodic conditioning of articulation across the four languages (Keating et al.).

Slide 12 (An overview): Need for a perception study, but how?
As the production studies show, Korean speakers seem to encode prosodic categories, i.e. IP, AP, PW, etc., in domain-initial segments.
Then what about listeners? Do they decode the encodings? Are the encodings perceptible? How do we test it?
One way to test it is to use a concatenative TTS system so that one can synthesize sentences by manipulating phone-sized units, i.e. diphones.

Slide 13 (An overview): Need for a perception study, but how?
Key idea: Synthesize a set of two sentences, differing only in terms of their domain-initial segment compositions.
  (1) IP-initial
  (2) AP-initial
  (3) PW-initial
  (4) PW-medial

Slide 14 (An overview): Need for a perception study, but how?
Test stimuli:
  1st set:
    good AP: composed of prosodically appropriate synthetic units
    bad AP: composed of prosodically inappropriate units (replace (2) with (4))
  2nd set:
    good PW: composed of prosodically appropriate synthetic units
    bad PW: composed of prosodically inappropriate units (replace (3) with (4))
Legend: (1) IP-initial, (2) AP-initial, (3) PW-initial, (4) PW-medial

Slide 15 (A diphone-based speech synthesis system): Diphones
  Text-to-speech (TTS) synthesis systems
  Diphones, prosodically sensitive
  Festival Speech Synthesis System (University of Edinburgh)

Slide 16 (A diphone-based speech synthesis system): Text-to-speech (TTS) synthesis systems
A system that automatically generates speech given a particular natural language text; the speech produced should be both comprehensible and natural sounding.
Two main components:
  NLP module (natural language processing)
  DSP module (digital signal processing)
NLP module: an elaborate text analysis system; input text → sequences of phones + prosodic organization.
DSP module: symbolic input from NLP → natural sounding speech.

Slide 17 (A diphone-based speech synthesis system): Diphones
Phone-sized synthesis units.
Parametric representations of short chunks of audio signal (usually extending from the middle of one phone to the middle of the immediately following one), extracted from a cache of recorded sentences, that can be re-combined by a TTS system to produce a novel synthesized word or sentence.
Avoid the need for modeling phone-to-phone transitions of natural speech signal.
For example, a diphone i-u contains the second half of [i] and the first half of [u]. A p-a diphone contains the second half of [p] and the first half of [a].
Prosodically sensitive diphones: Each diphone is stored as four different versions, i.e. three versions initial to an IP, AP or PW, and one version medial to a PW. (NB: A conventional diphone is stored as one version.)
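As a minimal sketch of the diphone naming described on this slide, assuming phones are given as plain strings and that `#` stands for silence at the utterance edges (the function name `to_diphones` is hypothetical and is not part of Festival), a phone sequence can be mapped to the conventional diphones it needs:

```python
def to_diphones(phones, silence="#"):
    """Return the conventional diphone names needed for a phone sequence.

    Each diphone "a-b" spans the second half of phone a and the first
    half of phone b. The silence symbol pads both utterance edges.
    """
    padded = [silence] + list(phones) + [silence]
    # Pair every phone with its immediate successor.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["p", "a", "t", "a"]))
# ['#-p', 'p-a', 'a-t', 't-a', 'a-#']
```

A sequence of n phones therefore needs n + 1 diphones, one per phone-to-phone (or silence-to-phone) transition.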

Slide 18 (A diphone-based speech synthesis system): Diphones
  (1) IP-initial: <p-a
  (2) AP-initial: [p-a
  (3) PW-initial: {p-a
  (4) PW-medial: p-a
6,503 prosodic diphones needed to synthesize any Korean utterance.
e.g. … #-<ㅂ, <ㅂ-ㅏ, ㅏ-ㄷ, ㄷ-ㅏ, ㅏ-ㄹ, ㄹ-ㅗ], ㅗ]-[ㅂ, [ㅂ-ㅏ, …
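The labelling scheme on this slide can be illustrated with a short sketch. It assumes an utterance is one IP given as a nested list (APs containing PWs containing romanized phones), and it models only the three domain-initial markers `<`, `[` and `{`. The domain-final mark `]` that appears in the slide's example is left out, and the function names are hypothetical rather than taken from the authors' system.

```python
def label_phones(ip):
    """Prefix each domain-initial phone with its highest prosodic marker.

    ip is a list of APs; each AP is a list of PWs; each PW is a list of phones.
    """
    labelled = []
    for ap_index, ap in enumerate(ip):
        for pw_index, pw in enumerate(ap):
            for phone_index, phone in enumerate(pw):
                if phone_index > 0:
                    prefix = ""   # PW-medial: unmarked
                elif ap_index == 0 and pw_index == 0:
                    prefix = "<"  # IP-initial (also AP- and PW-initial)
                elif pw_index == 0:
                    prefix = "["  # AP-initial (also PW-initial)
                else:
                    prefix = "{"  # PW-initial only
                labelled.append(prefix + phone)
    return labelled


def prosodic_diphones(ip, silence="#"):
    """Return prosodically sensitive diphone names for one IP."""
    padded = [silence] + label_phones(ip) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# One IP with two APs; the first AP holds two PWs.
utterance = [[["p", "a"], ["t", "a"]], [["p", "a"]]]
print(prosodic_diphones(utterance))
# ['#-<p', '<p-a', 'a-{t', '{t-a', 'a-[p', '[p-a', 'a-#']
```

Marking the phone rather than the diphone means both diphones that contain a domain-initial segment (for example `a-[p` and `[p-a`) come from the prosodically matching recordings.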

Slide 19 (A diphone-based speech synthesis system): Festival Speech Synthesis System (University of Edinburgh)
A free software multi-lingual speech synthesis workbench.
An open architecture for research in speech synthesis.
Primarily developed under Unix/Linux/FreeBSD/Solaris and ported to Windows.
Developed for conventional diphone-based systems, but can be modified to accommodate our prosodically sensitive diphones.
Consult Yoon (’05) for how we created a prototype system.

Slide 20 (A listening experiment): Design & synthesis of test stimuli
96 stimuli (phrases) synthesized from the Festival system (durations and F0 contours copied from natural utterances).
All were composed of either two APs or two PWs.
All contained one target site, where an AP/PW-initial segment was replaced with a PW-medial segment.
  24 good AP: phrases with intact diphones
  24 bad AP: phrases whose target site segment (AP-initial segment) was replaced with a PW-medial segment
  24 good PW: phrases with intact diphones
  24 bad PW: phrases whose target site segment (PW-initial segment) was replaced with a PW-medial segment
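The good/bad manipulation can be sketched as follows. This is only an illustration under stated assumptions: phones carry the domain-initial prefixes `<` (IP), `[` (AP) and `{` (PW) from the diphone slide, a PW-medial phone is unprefixed, and the names `make_bad_version` and `to_diphones` are hypothetical.

```python
def make_bad_version(labelled_phones, target_index):
    """Replace the AP- or PW-initial phone at the target site with its
    PW-medial (unmarked) counterpart, leaving everything else intact."""
    phone = labelled_phones[target_index]
    if not phone.startswith(("[", "{")):
        raise ValueError("target site must be an AP-initial or PW-initial phone")
    bad = list(labelled_phones)      # copy, so the good version is untouched
    bad[target_index] = phone[1:]    # drop the one-character marker
    return bad


def to_diphones(phones, silence="#"):
    """Diphone names for a (labelled) phone sequence, padded with silence."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


good_ap = ["<p", "a", "[p", "a"]     # two APs; target site is index 2
bad_ap = make_bad_version(good_ap, 2)

print(to_diphones(good_ap))  # ['#-<p', '<p-a', 'a-[p', '[p-a', 'a-#']
print(to_diphones(bad_ap))   # ['#-<p', '<p-a', 'a-p', 'p-a', 'a-#']
```

Only the two diphones that contain the target segment differ between the versions, so any difference in listener responses can be attributed to the target site.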

Slide 21 (A listening experiment): Design & synthesis of test stimuli
Synthesis of a sample stimulus (Praat script)
  natural utterance
  diphone sequences from Festival
  fundamental frequency (F0) contour and segmental durations copied from natural utterance
  intensity contour copied from natural utterance
Prototype system lacks a duration & F0 generation module → get help from natural utterances.
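The slide does not show the Praat script itself. As a rough illustration of the arithmetic behind copying segmental durations, the sketch below computes one stretch factor per segment (natural duration divided by synthetic duration). The durations are made-up values in milliseconds, the function names are hypothetical, and the actual resynthesis step (done in Praat by the authors) is not reproduced here.

```python
def duration_factors(natural_ms, synthetic_ms):
    """Per-segment factors that map synthetic durations onto natural ones."""
    if len(natural_ms) != len(synthetic_ms):
        raise ValueError("both utterances must have the same number of segments")
    if any(duration <= 0 for duration in synthetic_ms):
        raise ValueError("synthetic durations must be positive")
    return [nat / syn for nat, syn in zip(natural_ms, synthetic_ms)]


def apply_factors(durations_ms, factors):
    """Scale each segment duration by its factor."""
    return [duration * factor for duration, factor in zip(durations_ms, factors)]


natural = [80, 120, 60]      # hypothetical segment durations from a natural utterance
synthetic = [100, 100, 75]   # hypothetical durations of the concatenated diphones

factors = duration_factors(natural, synthetic)
print(factors)                            # values below 1 shorten, above 1 lengthen
print(apply_factors(synthetic, factors))  # matches the natural durations (up to rounding)
```

The same segment alignment would also be needed for the F0 and intensity contours, since each contour has to be time-warped onto the resynthesized segments.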

Slide 22 (A listening experiment): Design & synthesis of test stimuli
Sample stimuli
Target site segment: /p/

Slide 23 (A listening experiment): Design & synthesis of test stimuli
More sample stimuli
Table columns: target segment | good AP | bad AP | good PW | bad PW
Target segments (rows): /p/, /t/, /k/, /pʰ/, /tʰ/, /t*/, /tʃ/, /tʃʰ/, /sʰ/

Slide 24 (A listening experiment): Results & conclusion
80 listeners (37 women and 43 men): native speakers of Korean, average age of 30.6, grew up in Korea until at least 18 years old.
Two types of tests in three tasks:
  Intelligibility: dictation task → wrote down what they heard in hangul
  Naturalness: rating & preference tasks → rated one version with respect to the other and chose one over the other
Statistical analyses showed that listeners performed better in the dictation task with “good” versions of the stimuli. They also preferred the “good” versions and rated them higher.
Segmental encoding of prosodic domains/categories is perceptible to Korean listeners.