On the Correlation between Energy and Pitch Accent in Read English Speech
Andrew Rosenberg, Julia Hirschberg
Columbia University
Interspeech /14/06
9/18/2006
Talk Outline
Introduction to Pitch Accent
Previous Work
Contribution and Approach
Corpus
Results and Discussion
Conclusion
Future Work
Introduction
Pitch accent is the way a word is made to “stand out” from its surrounding utterance.
  – As opposed to lexical stress, which refers to the most prominent syllable within a word.
Accurate detection of pitch accent is particularly important to many NLU tasks.
  – Identification of salient or “important” words.
  – Indication of information status.
  – Disambiguation of syntax/semantics.
Pitch (f0), duration, and energy are all known correlates of pitch accent.
[Figure: deaccented vs. accented]
Previous Work
(Sluijter and van Heuven 96, 97): Accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband above 500 Hz.
(Heldner et al. 99, 01) and (Fant et al. 00) found that high-frequency emphasis, or spectral tilt, strongly correlates with accent in Swedish.
Much research attention has been given to the automatic identification of prominent or accented words.
  – (Tamburini 03, 05) used the energy component of the 500 Hz-2000 Hz band.
  – (Tepperman 05) used the RMS energy from the 60 Hz-400 Hz band.
  – And many more...
Contribution and Approach
There is no agreement as to the best (most discriminative) frequency subband from which to extract energy information.
We set up a battery of analysis-by-classification experiments varying:
  – The frequency band:
      – lower bound ranged from 0 to 19 bark
      – bandwidth ranged from 1 to 20 bark
      – upper bound was limited to 20 bark by the 8 kHz Nyquist frequency
      – we also analyzed the first and/or second formants
  – The region of analysis: full word, only vowels, longest syllable, longest vowel
  – Speaker: each of the 4 speakers separately, and all together
We performed the experiments using J48, a Java implementation of C4.5.
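The band search described on this slide can be sketched as a simple enumeration. This is a minimal illustration, not the authors' code: it assumes integer-Bark band edges (the slide does not state the step size) and uses Traunmüller's approximation to map Bark to Hz, which is one common choice and not necessarily the one used in the experiments. The names `bark_to_hz`, `hz_to_bark`, and `enumerate_subbands` are hypothetical.

```python
from typing import List, Tuple


def hz_to_bark(f: float) -> float:
    """Traunmueller's approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f / (1960.0 + f) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def enumerate_subbands(max_bark: int = 20) -> List[Tuple[int, int]]:
    """List (lower, upper) Bark edges, assuming integer steps.

    The lower bound runs from 0 to max_bark - 1 and the bandwidth from 1
    upward, with the upper edge never exceeding max_bark.
    """
    bands = []
    for lower in range(0, max_bark):
        for width in range(1, max_bark - lower + 1):
            bands.append((lower, lower + width))
    return bands


bands = enumerate_subbands()
print(len(bands))  # 210 candidate bands under the integer-step assumption

# Approximate edges in Hz of the 2-20 bark band
print(round(bark_to_hz(2)), round(bark_to_hz(20)))
```

Under these assumptions each (lower, upper) pair defines one energy-extraction configuration, which would then be crossed with the region of analysis and the speaker.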
Contribution and Approach
Local features:
  – Minimum, maximum, mean, standard deviation, and RMS of energy
  – z-score ((x - mean) / std. dev.) of the maximum energy within the word
Context-based features:
  – Using 6 windows:
  – The max and mean energy were normalized by z-score ((x - mean) / std. dev.) and by the energy range within the window (x / (max - min))
[Figure: context windows over word i-2, word i-1, word i, word i+1, word i+2]
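The local features and the two normalizations on this slide can be written out as follows. This is a sketch under stated assumptions: energy is taken to be a per-frame sequence for one word, the standard deviation is the population form, and a zero denominator yields 0.0. None of these details are given on the slide, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def local_energy_features(frames: Sequence[float]) -> Dict[str, float]:
    """Min, max, mean, std. dev., RMS, and z-score of the max for one word."""
    n = len(frames)
    mean = sum(frames) / n
    # Population standard deviation (an assumption)
    std = math.sqrt(sum((x - mean) ** 2 for x in frames) / n)
    rms = math.sqrt(sum(x * x for x in frames) / n)
    peak = max(frames)
    return {
        "min": min(frames),
        "max": peak,
        "mean": mean,
        "std": std,
        "rms": rms,
        "z_max": (peak - mean) / std if std > 0 else 0.0,
    }


def z_norm(x: float, window: Sequence[float]) -> float:
    """(x - mean) / std. dev., with statistics taken over the window."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    return (x - mean) / std if std > 0 else 0.0


def range_norm(x: float, window: Sequence[float]) -> float:
    """x / (max - min), with the range taken over the window."""
    span = max(window) - min(window)
    return x / span if span > 0 else 0.0


features = local_energy_features([1.0, 2.0, 3.0, 4.0])
print(features["mean"], features["max"])  # 2.5 4.0
```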
Corpus
Boston Directions Corpus (BDC) [Hirschberg & Nakatani 96]
  – Speech elicited from a direction-giving task
  – Used only the read portion
  – 50 minutes
  – Fully ToBI labeled
  – 10825 words, manually segmented
  – 4 speakers: 3 male, 1 female
Variation across subbands
Energy from different frequency regions predicts pitch accent differently.
  – Across experiment configurations, the mean relative improvement of the best region over the worst was 14.8%.
The most predictive subband
The single most predictive subband for all speakers was 3-18 bark over full words.
  – Classification accuracy: 76% (P=71.6, R=73.4)
      – 57.6% majority class baseline (no accent)
  – However, it performs significantly worse than the best subband when analyzing the speech of one speaker in particular.
      – Speaker h2, not the female speaker
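For reference, the accuracy, precision, recall, and majority-class baseline quoted on this slide are standard classification measures. The sketch below assumes that P and R are measured on the accented class, which the slide does not state. The function name and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def accent_metrics(
    gold: Sequence[bool], pred: Sequence[bool]
) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, majority baseline).

    True marks an accented word. Precision and recall are computed for
    the accented class (an assumption).
    """
    n = len(gold)
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    accented = sum(1 for g in gold if g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Baseline: always predict the more frequent class
    baseline = max(accented, n - accented) / n
    return correct / n, precision, recall, baseline


gold = [True, True, False, False, False]
pred = [True, False, True, False, False]
print(accent_metrics(gold, pred))  # (0.6, 0.5, 0.5, 0.6)
```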
The most robust subband
The subband from 2-20 bark performs as well as the most discriminative subband in all but one configuration [h1-longest vowel].
  – Accuracy: 75.5% (P=70.5, R=72.5)
  – Due to its robustness, we consider this band the “best”.
The formant-based energy features perform worse than fixed bands.
  – 6.4% mean accuracy reduction from 2-20 bark
  – Attributable to:
      – Errors in the formant tracking algorithm
      – The presence of discriminative information in higher formants
Contextual windows
The most predictive features were z-score normalized maximum energy relative to three contextual windows:
  – 1 previous and 1 following word
  – 2 previous and 1 following word
  – 2 previous and 2 following words
[Figure: context windows over word i-2, word i-1, word i, word i+1, word i+2]
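One way to compute the z-score normalized maximum energy for the three windows listed on this slide is sketched below. It assumes that each word is a sequence of per-frame energy values, that the window statistics are taken over all frames of the words in the window (including the current word), and that windows are clipped at utterance boundaries. These choices are illustrative and are not confirmed by the slides. `WINDOWS` and `context_zmax` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (previous words, following words) for the three windows on the slide
WINDOWS: List[Tuple[int, int]] = [(1, 1), (2, 1), (2, 2)]


def context_zmax(words: Sequence[Sequence[float]], i: int) -> Dict[str, float]:
    """z-score of word i's maximum energy relative to each context window."""
    peak = max(words[i])
    feats: Dict[str, float] = {}
    for prev, nxt in WINDOWS:
        start = max(0, i - prev)            # clip at utterance start
        stop = min(len(words), i + nxt + 1)  # clip at utterance end
        frames = [x for w in words[start:stop] for x in w]
        mean = sum(frames) / len(frames)
        std = math.sqrt(sum((x - mean) ** 2 for x in frames) / len(frames))
        feats[f"zmax_p{prev}_n{nxt}"] = (peak - mean) / std if std > 0 else 0.0
    return feats


# Toy utterance of five words; the middle word carries the energy peak
utterance = [[1.0, 1.0], [2.0, 2.0], [5.0, 9.0], [2.0, 2.0], [1.0, 1.0]]
print(context_zmax(utterance, 2))
```

A word whose peak energy stands well above its neighbors gets a large positive value in every window, which is the kind of cue a classifier could use for accent.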
Combining predictions
There is a relatively small intersection of correct predictions, even among similar subbands.
  – … of words were correctly classified by at least one classifier.
Using a majority voting scheme:
  – Accuracy: 81.9% (P=76.7, R=82.5)
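A majority vote over per-subband classifiers, together with the "correct in at least one classifier" measure, can be sketched as below. The slide does not describe how ties are handled, so this sketch assumes ties fall to "no accent" (the majority class). The function names and toy data are hypothetical.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine per-classifier predictions word by word.

    predictions[c][w] is classifier c's decision for word w (True means
    accented). A word is labeled accented only if strictly more than half
    of the classifiers say so; ties go to "no accent" (an assumption).
    """
    n_classifiers = len(predictions)
    n_words = len(predictions[0])
    voted = []
    for w in range(n_words):
        votes = sum(1 for clf in predictions if clf[w])
        voted.append(2 * votes > n_classifiers)
    return voted


def oracle_coverage(
    predictions: Sequence[Sequence[bool]], gold: Sequence[bool]
) -> float:
    """Fraction of words that at least one classifier labels correctly."""
    hits = sum(
        1 for w in range(len(gold)) if any(clf[w] == gold[w] for clf in predictions)
    )
    return hits / len(gold)


preds = [
    [True, False, True],
    [True, True, False],
    [False, False, True],
]
print(majority_vote(preds))  # [True, False, True]
```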
Region of analysis
How do the regioning strategies perform?
  – Full word > only vowels > longest syllable ~ longest vowel
Why does analysis of the full word outperform other regioning strategies?
  – Syllable/vowel segmentation algorithms are imperfect
  – Pitch accents are not neatly placed
  – Duration is a crude measure of lexical stress
Conclusion
Using an analysis-by-classification approach, we showed:
  – Energy from different frequency bands correlates with pitch accent differently.
  – The “best” (high accuracy, most robust) frequency region is 2-20 bark (>2 bark?).
  – A voting classifier based exclusively on energy can predict accent reliably.
Future Work
Can we automatically identify which bands will predict accent best for a given word?
We plan to incorporate these findings into a general pitch accent classifier with pitch and duration features.
We plan to repeat these experiments on spontaneous speech data.
Thank you {amaxwell,