PREDICTION AND SYNTHESIS OF PROSODIC EFFECTS ON SPECTRAL BALANCE OF VOWELS
Jan P.H. van Santen and Xiaochuan Niu
Center for Spoken Language Understanding
OGI School of Science & Technology at OHSU
OVERVIEW
1. IMPORTANCE OF SPECTRAL BALANCE
2. MEASUREMENT OF SPECTRAL BALANCE
3. ANALYSIS METHODS
4. RESULTS
5. SYNTHESIS
6. CONCLUSIONS
1. IMPORTANCE OF SPECTRAL BALANCE
Linguistic control factors:
– Stress-like factors
– Positional factors
– Phonemic factors
Acoustic correlates:
– Traditionally TTS-controlled: pitch, timing, amplitude
– Demonstrated in natural speech, but usually not TTS-controlled: spectral tilt and balance, formant dynamics, …
2. MEASUREMENT OF SPECTRAL BALANCE
Data:
– 472 greedily selected sentences
  – Genre: newspaper
  – Greedy features: linguistic control factors
– One female speaker
– Manual segmentation
– Accent: independent rating by 3 judges (0-3 score)
2. MEASUREMENT OF SPECTRAL BALANCE
Energy in 5 formant-range frequency bands:
– B0: 100-300 Hz [~F0]
– B1: 300-800 Hz [~F1]
– B2: 800-2500 Hz [~F2]
– B3: 2500-3500 Hz [~F3]
– B4: 3500 Hz to max [~fricative noise]
In other words, a multidimensional measure.
Processing:
– Filter bank
– Square
– Average [1 ms rect. window]
– 20 log10(Bi)
– Subtract estimated per-utterance means
(A minimal code sketch of this pipeline follows below.)
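To make the measurement pipeline above concrete, here is a minimal Python sketch of the band-energy computation: filter bank, squaring, 1 ms rectangular averaging, 20 log10, and per-utterance mean subtraction. The band edges and the processing steps are from the slide; the Butterworth filters, the filter order, and the scipy-based implementation are assumptions, not the authors' code.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Band edges (Hz) from the slide: B0..B4 (~F0, F1, F2, F3, fricative noise).
# The upper edge of the last band is open (up to the Nyquist frequency).
BANDS = [(100, 300), (300, 800), (800, 2500), (2500, 3500), (3500, None)]

def band_energies_db(x, fs, frame_ms=1.0):
    """Per-frame band energies (dB) for one utterance; minimal sketch only."""
    frame_len = max(1, int(fs * frame_ms / 1000.0))
    feats = []
    for lo, hi in BANDS:
        # Band-pass filter per band; high-pass for the open-ended top band.
        # 4th-order Butterworth is an assumption, not the authors' design.
        if hi is None:
            sos = butter(4, lo, btype="highpass", fs=fs, output="sos")
        else:
            sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        y = sosfiltfilt(sos, x)
        # Square, then average over a short rectangular window (1 ms on the slide).
        n_frames = len(y) // frame_len
        e = (y[: n_frames * frame_len] ** 2).reshape(n_frames, frame_len).mean(axis=1)
        # dB conversion follows the slide's "20 log10(Bi)"; epsilon avoids log(0).
        feats.append(20.0 * np.log10(e + 1e-12))
    feats = np.stack(feats, axis=1)  # shape: (frames, 5 bands)
    # Subtract estimated per-utterance means, as on the slide.
    return feats - feats.mean(axis=0, keepdims=True)
```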
2. MEASUREMENT OF SPECTRAL BALANCE
Details:
– Confounding with F0: measure both pitch-corrected and raw
  – For certain wave shapes, pitch is directly related to fixed-frame energy
  – Why do both: wave shapes may change in unknown ways
– F0 is not confined to B0 [female speech]
– Vowel formants are not quite confined to their bands [e.g., F1 for /EE/ and F3 for /ER/]
2. MEASUREMENT OF SPECTRAL BALANCE
Why not more or different bands?
– Multiple interacting linguistic control factors: we need measurements that minimize interactions
– With 5 bands, different vowels "behave similarly", so vowels can be modeled as a class
Why not simply spectral tilt?
– 5 bands carry more information than a single measure
– They supply more information for synthesis
3. ANALYSIS METHODS
These measures are likely to behave like segmental duration:
– Multiple interacting, confounded factors
  – Interaction: the magnitude of one factor's effect may depend on the other factors
  – Confounding: unequal frequencies of control-factor combinations
– "Directional invariance": the direction of one factor's effect is independent of the other factors
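One way to write directional invariance formally (the notation is an assumption; the slide states it only in words):

```latex
% Directional invariance (assumed formalization): for each factor i and any two
% levels a, b of that factor, the sign of its effect does not depend on the
% remaining factors, collected here as f_{-i}.
\[
  \operatorname{sign}\!\bigl(y(a, f_{-i}) - y(b, f_{-i})\bigr)
  \quad\text{is the same for every context } f_{-i}.
\]
```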
3. ANALYSIS METHODS
Need a method that
– can handle multiple interacting, confounded factors, and
– takes advantage of directional invariance.
Used: the sums-of-products model [equation shown on the slide; a reconstruction follows below].
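The model equation itself appears only as an image on the slide. A reconstruction consistent with the special cases listed on the next slide, in assumed notation, is:

```latex
% Sums-of-products model (reconstruction): K indexes product terms, I_k is the
% set of factors entering term k, and S_{k,i} maps level f_i of factor i to a
% real-valued parameter.
\[
  y(f_0, \dots, f_n) \;=\; \sum_{k \in K} \; \prod_{i \in I_k} S_{k,i}(f_i)
\]
```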
3. ANALYSIS METHODS
Special cases:
– Multiplicative model: K = {1}, I_1 = {0,…,n}
– Additive model: K = {0,…,n}, I_i = {i}
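Under the general form sketched above, the two special cases work out as follows (again a reconstruction, not the slide's own rendering):

```latex
% Multiplicative model: a single product term over all factors.
\[
  K = \{1\},\; I_1 = \{0,\dots,n\}
  \;\Longrightarrow\;
  y(f_0,\dots,f_n) = \prod_{i=0}^{n} S_{1,i}(f_i)
\]
% Additive model: one single-factor term per factor.
\[
  K = \{0,\dots,n\},\; I_k = \{k\}
  \;\Longrightarrow\;
  y(f_0,\dots,f_n) = \sum_{k=0}^{n} S_{k,k}(f_k)
\]
```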
3. ANALYSIS METHODS
Used the additive model.
Note: the parameter estimates are
– estimates of the marginal means …
– … as obtained in a balanced design [equation shown on the slide].
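The slide's equation is again only in the image. A hedged rendering of the stated relationship, assuming the additive form above, is that each additive parameter estimate equals a marginal mean up to an overall constant:

```latex
% Assumed rendering, not the slide's own equation: in a balanced design, the
% estimated additive parameter for level a of factor k equals the marginal mean
% of y over all cells with f_k = a, up to an overall constant c.
\[
  \hat{S}_k(a) \;=\; \frac{1}{\lvert\{f : f_k = a\}\rvert}
  \sum_{f : f_k = a} y(f) \;-\; c
\]
```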
3. ANALYSIS METHODS
Pitch correction: because of the confounding with F0, results are shown both raw and pitch-corrected [correction formula shown on the slide].
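The correction formula itself is only on the slide image. Given the earlier remark that, for certain wave shapes, fixed-frame energy is directly related to pitch, one plausible correction (hypothetical, not the authors' formula) removes the component of each band energy predicted by log F0:

```latex
% Hypothetical pitch correction: subtract the part of band energy B_i (in dB)
% predicted by log F0, e.g. via a per-band regression coefficient beta_i.
\[
  B_i^{\mathrm{corr}} \;=\; B_i \;-\; \beta_i
  \bigl(\log F_0 - \overline{\log F_0}\bigr)
\]
```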
4. RESULTS: (A) POSITIONAL EFFECTS
[Figure: 5 bands, not pitch-corrected. Solid: right position, dashed: left position. Y-axis: corrected mean.]
4. RESULTS: (A) POSITIONAL EFFECTS
[Figure: 5 bands, pitch-corrected.]
4. RESULTS: (A) POSITIONAL EFFECTS
[Figure: 4 bands, not pitch-corrected.]
4. RESULTS: (A) POSITIONAL EFFECTS
[Figure: 4 bands, pitch-corrected.]
4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure: 5 bands, not pitch-corrected. Solid: stressed syllable, dashed: unstressed. Y-axis: corrected mean.]
4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure: 5 bands, pitch-corrected.]
4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure: 4 bands, not pitch-corrected.]
4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure: 4 bands, pitch-corrected.]
4. RESULTS: (C) TILT EFFECTS
[Figure]
5. SYNTHESIS
Use the ABS/OLA sinusoidal model:
– s[n] = sum of overlapped short-time signal frames s_k[n]
– s_k[n] = sum of quasi-harmonic sinusoidal components: s_k[n] = Σ_l A_{k,l} cos(ω_{k,l} n + φ_{k,l})
– Each frame of a unit is represented by a set of quasi-harmonic sinusoidal parameters.
– Given the desired F0 contour, a pitch shift is applied to the unit's sinusoidal parameters to obtain the target parameters A_{k,l}.
(A minimal code sketch of the frame model and overlap-add follows below.)
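As an illustration of the quasi-harmonic frame model and the overlap-add step described above, here is a minimal Python sketch. The Hann window, frame length, hop size, and variable names are assumptions; this is not the authors' ABS/OLA implementation.

```python
import numpy as np

def synth_frame(amps, freqs_hz, phases, frame_len, fs):
    """One quasi-harmonic frame: s_k[n] = sum_l A_{k,l} cos(w_{k,l} n + phi_{k,l})."""
    n = np.arange(frame_len)
    w = 2.0 * np.pi * np.asarray(freqs_hz) / fs   # radians per sample
    return np.sum(
        np.asarray(amps)[:, None] * np.cos(np.outer(w, n) + np.asarray(phases)[:, None]),
        axis=0,
    )

def overlap_add(frames, hop):
    """s[n] as a sum of overlapped, windowed short-time frames.

    A Hann window with hop = frame_len // 2 keeps the overlap-add roughly
    constant; the windowing choice is an assumption of this sketch."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    win = np.hanning(frame_len)
    for k, fr in enumerate(frames):
        out[k * hop : k * hop + frame_len] += win * fr
    return out
```

In this sketch, the pitch shift mentioned on the slide would amount to moving `freqs_hz` onto the target harmonic grid (and adjusting the amplitudes accordingly) before calling `synth_frame`.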
5. SYNTHESIS
– Considering the differences in prosodic factors between the original and target unit, compute the band differences [equation shown on the slide].
– Transform each band difference into a weight applied to the sinusoidal parameters [equation shown on the slide], where the weight for the j-th harmonic is that of the i-th band in which the harmonic is located.
– Apply spectral smoothing across unit boundaries.
(A hedged code sketch of the weighting step follows below.)
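A hedged sketch of how per-band dB differences might be applied as weights on the harmonic amplitudes. The 10^(ΔB_i/20) mapping and the band-lookup helper are assumptions, since the slide's formula is only in the image.

```python
import numpy as np

# Lower band edges in Hz, matching B0..B4 defined earlier (top band open-ended).
BAND_EDGES = [100, 300, 800, 2500, 3500]

def band_index(freq_hz):
    """Index i of the band containing a harmonic frequency (assumed lookup).
    Harmonics below 100 Hz are folded into B0 for simplicity."""
    i = np.searchsorted(BAND_EDGES, freq_hz, side="right") - 1
    return int(np.clip(i, 0, len(BAND_EDGES) - 1))

def apply_band_weights(amps, freqs_hz, band_diff_db):
    """Scale each harmonic amplitude by the weight of the band containing it.

    band_diff_db[i]: target minus original energy of band i, in dB (assumed
    convention). The amplitude weight 10**(dB/20) is an assumption."""
    weights = [10.0 ** (band_diff_db[band_index(f)] / 20.0) for f in freqs_hz]
    return np.asarray(amps) * np.asarray(weights)
```

A usage example under these assumptions: `apply_band_weights(amps, freqs_hz, target_db - original_db)` before resynthesizing the frame.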
5. SYNTHESIS
[Figure: 5-band modification example, vowel [i:].]
CONCLUSIONS
– Described simple methods for predicting and synthesizing spectral balance.
– But: spectral balance is only one "non-standard acoustic correlate".
– Others remain to be addressed:
  – Spectral dynamics
  – Phase