Published by Job Stephens. Modified over 9 years ago.
2
Segmental encoding of prosodic categories: A perception study through speech synthesis
Kyuchul Yoon, Mary Beckman & Chris Brew
3
August 19, 2005. Segmental encoding of prosodic categories
Contents
An overview
  Allophonic variations
  Segmental positions
  Word-initial vs. word-internal positions in the K-ToBI framework
  Allophonic variations: an extended view
  Production studies on Korean and other languages
  Need for a perception study, but how?
A diphone-based speech synthesis system for Korean
  Conventional diphones vs. prosodically sensitive ones
A listening experiment
  Design and synthesis of test stimuli
  Results & conclusion
4
Allophonic variations
Defined mostly in terms of neighboring segments.
e.g. allophones of /t/ in English:
  [t] “stop”
  [tʰ] “top”
  [ʔ] “kitten”
  [ɾ] “little”
Section: An overview
5
Segmental positions
Determined in most cases within a word by
  1. its neighboring segments and
  2. word boundaries, i.e. word-initial or word-final position.
6
Word-initial vs. word-internal positions in K-ToBI
K-ToBI: Korean Tone & Break Indices (prosody labeling conventions)
  IP: Intonational Phrase
  AP: Accentual Phrase
  W: Prosodic Word (PW)
  σ: syllable
  H: high tone
  L: low tone
  T: tone (could be H or L)
  %: boundary tone (e.g. H%, L%, HL%)
7
Word-initial positions in K-ToBI
8
Word-initial positions in K-ToBI
Figure labels: word-initial, word-final
9
Word-initial positions in K-ToBI
Figure labels: PW-initial, AP-initial, IP-initial, PW-initial, AP-initial, PW-initial, PW-medial
Three types of word-initial positions in K-ToBI!
10
Allophonic variations: an extended view
Defined mostly in terms of neighboring segments.
They also need to be examined with respect to their prosodic constituency in K-ToBI.
11
Production studies on Korean and other languages
Korean
  Jun (’93, ’98): lenis stop voicing, obstruent nasalization, VOT of /pʰ/
  Cho & Keating (’01): segmental properties of /t, tʰ, t*, n/
  Kim (’01): segmental properties of /sʰ, s*/
  Yoon (’03): subsegmental durations of /sʰ, s*/
Other languages
  Smith (’97): American English /z/
  Pierrehumbert & Talkin (’92), Pierrehumbert (’95): English /h/ and /ʔ/
  Fougeron (’01): French segments /t, k, s, l, n, i, a/
  Keating et al. (’98): /t, n/ of Korean, English, French & Taiwanese
12
Production studies on Korean and other languages: summary of results
Korean
  The AP is the domain of lenis stop voicing and post-obstruent tensing (Jun).
  The IP is the domain of obstruent nasalization (Jun).
  VOT of /pʰ/: AP-initial > PW-initial > PW-medial (Jun).
  Consonants initial to higher prosodic domains are ‘stronger’ (Cho, Keating, Kim).
  Non-uniform variations in durations of subsegmental units (Yoon).
Other languages
  American English /z/ is devoiced differently in different positions (Smith).
  English /h/ and /ʔ/ are produced differently under different word- and phrase-level prosody (Pierrehumbert & Talkin).
  Articulation of initial segments varies with the prosodic level of the constituent, i.e. initial to an IP, AP, W or syllable (Fougeron).
  There is phrasal/prosodic conditioning of articulation across the four languages (Keating et al.).
13
Need for a perception study, but how?
As the production studies show, Korean speakers seem to encode prosodic categories (IP, AP, PW, etc.) in domain-initial segments.
Then what about listeners?
  Do they decode the encodings?
  Are the encodings perceptible?
  How do we test it?
One way to test this is to use a concatenative TTS system, so that sentences can be synthesized by manipulating phone-sized units, i.e. diphones.
14
Need for a perception study, but how?
Key idea: synthesize a pair of sentences that differ only in the composition of their domain-initial segments.
Figure labels: IP-initial, AP-initial, PW-initial, PW-medial
15
Need for a perception study, but how?
Test stimuli:
1st set:
  good AP: composed of prosodically appropriate synthetic units
  bad AP: composed of prosodically inappropriate units (the AP-initial segment is replaced with a PW-medial one)
2nd set:
  good PW: composed of prosodically appropriate synthetic units
  bad PW: composed of prosodically inappropriate units (the PW-initial segment is replaced with a PW-medial one)
Figure labels: IP-initial, AP-initial, PW-initial, PW-medial
16
Diphones
  Text-to-speech (TTS) synthesis systems
  Diphones, prosodically sensitive
  Festival Speech Synthesis System (University of Edinburgh)
Section: A diphone-based speech synthesis system
17
Text-to-speech (TTS) synthesis systems
A system that automatically generates speech from a given natural language text; the speech produced should be both comprehensible and natural sounding.
Two main components:
  NLP module (natural language processing)
  DSP module (digital signal processing)
NLP module: an elaborate text analysis system
  input text → sequences of phones + prosodic organization
DSP module:
  symbolic input from NLP → natural-sounding speech
18
Diphones
Phone-sized synthesis units.
Parametric representations of short chunks of audio signal (usually extending from the middle of one phone to the middle of the immediately following one), extracted from a cache of recorded sentences, that a TTS system can recombine to produce a novel synthesized word or sentence.
They avoid the need to model the phone-to-phone transitions of the natural speech signal.
For example, an i-u diphone contains the second half of [i] and the first half of [u]; a p-a diphone contains the second half of [p] and the first half of [a].
Prosodically sensitive diphones: each diphone is stored in four versions, i.e. three versions initial to an IP, AP or PW, and one version medial to a PW. (NB: a conventional diphone is stored in one version.)
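The mid-phone-to-mid-phone segmentation described on this slide can be illustrated with a minimal Python sketch. The function name `to_diphones` and the use of `#` as a pause symbol are assumptions made for illustration; they are not taken from Festival or from the authors' system.

```python
def to_diphones(phones):
    """Name the diphone spanning each pair of adjacent phones.

    Each name "x-y" stands for the unit running from the middle of
    phone x to the middle of the following phone y.
    """
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]


# Hypothetical input: the syllable [pa] between two pauses.
print(to_diphones(["#", "p", "a", "#"]))  # ['#-p', 'p-a', 'a-#']
```

A sequence of n phones yields n - 1 diphones, which is why a conventional inventory only needs one stored unit per phone pair.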
19
Diphones
  IP-initial: <p-a
  AP-initial: [p-a
  PW-initial: {p-a
  PW-medial: p-a
6,503 prosodic diphones are needed to synthesize any Korean utterance.
e.g. … #-<ㅂ, <ㅂ-ㅏ, ㅏ-ㄷ, ㄷ-ㅏ, ㅏ-ㄹ, ㄹ-ㅗ], ㅗ]-[ㅂ, [ㅂ-ㅏ, …
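A rough Python sketch of how the marker notation on this slide could be generated from a phone sequence annotated with prosodic positions. The function name `prosodic_diphones`, the position keys, and the sample phrase are hypothetical. The closing `]` marker that appears in the Korean example is not modeled here, so this is an illustration of the naming scheme only, not the authors' implementation.

```python
# Marker prefixes as shown on the slide; PW-medial segments carry no marker.
MARKERS = {"IP": "<", "AP": "[", "PW": "{", "medial": ""}


def prosodic_diphones(segments):
    """Build prosodically marked diphone names.

    segments: list of (phone, position) pairs, where position is one of
    the keys of MARKERS.
    """
    tokens = [MARKERS[position] + phone for phone, position in segments]
    return [f"{left}-{right}" for left, right in zip(tokens, tokens[1:])]


# Hypothetical phrase: an IP-initial /p/ followed later by an AP-initial /p/.
phrase = [("p", "IP"), ("a", "medial"), ("t", "medial"),
          ("a", "medial"), ("p", "AP"), ("a", "medial")]
print(prosodic_diphones(phrase))
# ['<p-a', 'a-t', 't-a', 'a-[p', '[p-a']
```

Because the marker is part of the unit name, the same phone pair p-a maps to a different stored unit depending on the prosodic position of /p/.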
20
Festival Speech Synthesis System (University of Edinburgh, http://www.festvox.org)
A free-software, multilingual speech synthesis workbench.
An open architecture for research in speech synthesis.
Primarily developed under Unix/Linux/FreeBSD/Solaris and ported to Windows.
Developed for conventional diphone-based systems, but can be modified to accommodate our prosodically sensitive diphones.
Consult Yoon (’05) for how we created a prototype system.
21
Design & synthesis of test stimuli
96 stimuli (phrases) were synthesized with the Festival system (durations and F0 contours copied from natural utterances).
All were composed of either two APs or two PWs.
All contained one target site; in the “bad” versions, the AP/PW-initial segment at that site was replaced with a PW-medial segment.
  24 good AP: phrases with intact diphones
  24 bad AP: phrases whose target-site segment (AP-initial segment) was replaced with a PW-medial segment
  24 good PW: phrases with intact diphones
  24 bad PW: phrases whose target-site segment (PW-initial segment) was replaced with a PW-medial segment
Section: A listening experiment
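The good/bad manipulation and the 4 x 24 design can be sketched in Python as follows. The token notation reuses the deck's `<`, `[`, `{` markers; the function names, the sample phrase, and the target index are hypothetical and serve only to illustrate the substitution at the target site.

```python
MARKER_CHARS = "<[{"  # IP-, AP- and PW-initial prefixes from the diphone notation


def make_bad(tokens, target_index):
    """Swap the domain-initial token at the target site for its
    PW-medial (unmarked) counterpart."""
    bad = list(tokens)
    bad[target_index] = bad[target_index].lstrip(MARKER_CHARS)
    return bad


def to_diphones(tokens):
    """Name the diphone spanning each pair of adjacent tokens."""
    return [f"{left}-{right}" for left, right in zip(tokens, tokens[1:])]


# Hypothetical two-AP phrase; the second AP-initial /p/ is the target site.
good_ap = ["[p", "a", "t", "a", "[p", "a"]
bad_ap = make_bad(good_ap, target_index=4)

print(to_diphones(good_ap))  # ['[p-a', 'a-t', 't-a', 'a-[p', '[p-a']
print(to_diphones(bad_ap))   # ['[p-a', 'a-t', 't-a', 'a-p', 'p-a']

# Four conditions with 24 phrases each give the 96 stimuli on the slide.
conditions = [(domain, quality)
              for domain in ("AP", "PW")
              for quality in ("good", "bad")]
total_stimuli = len(conditions) * 24
print(total_stimuli)  # 96
```

Only the two diphones that touch the target segment change between the good and bad versions; everything else in the phrase is identical.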
22
Design & synthesis of test stimuli
Synthesis of a sample stimulus (Praat script)
Figure labels:
  natural utterance
  diphone sequences from Festival
  fundamental frequency (F0) contour and segmental durations copied from natural utterance
  intensity contour copied from natural utterance
The prototype system lacks a duration and F0 generation module, so these are taken from natural utterances.
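The slide states that segmental durations were copied from natural utterances with a Praat script, which is not reproduced here. As a loose illustration of the arithmetic involved, the Python sketch below computes per-segment scaling factors that would map synthetic durations onto natural ones. The function name and all duration values are hypothetical, and the sketch does not claim to mirror the authors' script.

```python
def duration_scale_factors(natural_ms, synthetic_ms):
    """Return one factor per segment: natural duration / synthetic duration.

    A factor below 1 means the synthetic segment must be shortened,
    a factor above 1 means it must be lengthened.
    """
    if len(natural_ms) != len(synthetic_ms):
        raise ValueError("both utterances must have the same number of segments")
    return [nat / syn for nat, syn in zip(natural_ms, synthetic_ms)]


# Hypothetical segment durations in milliseconds.
natural = [80.0, 120.0, 60.0]
synthetic = [100.0, 100.0, 75.0]
factors = duration_scale_factors(natural, synthetic)
print(factors)
```

The sketch assumes both utterances are segmented into the same sequence of phones, so that segments can be paired one to one.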
23
Design & synthesis of test stimuli
Sample stimuli
Target-site segment: /p/
24
Design & synthesis of test stimuli
More sample stimuli
Table columns: target segment | good AP | bad AP | good PW | bad PW
Target segments (rows): /p/, /t/, /k/, /pʰ/, /tʰ/, /t*/, /tʃ/, /tʃʰ/, /sʰ/
25
Results & conclusion
80 listeners (37 women and 43 men): native speakers of Korean, average age 30.6, who grew up in Korea until at least 18 years old.
Two types of tests in three tasks:
  Intelligibility: dictation task (listeners wrote down what they heard in hangul)
  Naturalness: rating and preference tasks (listeners rated one version with respect to the other and chose one over the other)
Statistical analyses showed that listeners performed better in the dictation task with the “good” versions of the stimuli. They also rated the “good” versions higher and preferred them.
Segmental encoding of prosodic domains/categories is perceptible to Korean listeners.