Automatic Prosodic Event Detection
Julia Hirschberg
GALE PI Meeting, March 23, 2007
Introduction
  Acoustic/prosodic features improve speech distillation performance (Maskey & Hirschberg, 2005)
  Can categorical features make a contribution?
    Pitch accent
    Intonational phrase (IP) boundaries
Material
  20 minutes of manually annotated material from TDT4
    20010131_1830_1900_ABC_WNT
  25 hypothesized speaker IDs
  3326 words
    1658 (49.8%) accented
    556 (16.7%) precede IP boundaries
Pitch Accent: Energy-Based Voting Classifier
  Extract energy features from a set of 210 frequency regions
    Frequency ranges from 0 to 19 bark
    Bandwidth ranges from 1 to 20 bark
  Construct 210 pitch accent decision tree classifiers using only energy features
  Voting yields 83.7% accuracy
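The count of 210 regions is consistent with one simple reading of the slide: every band with an integer starting frequency of 0 to 19 bark and an integer bandwidth of 1 to 20 bark whose upper edge stays at or below 20 bark. The sketch below only checks that arithmetic. The 20 bark ceiling and the name `bark_bands` are assumptions, not something the slides state.

```python
def bark_bands(max_bark=20):
    """Enumerate (start, bandwidth) pairs, in bark, with start + bandwidth <= max_bark.

    Assumption: integer starts and bandwidths, and an upper limit of 20 bark.
    """
    return [
        (start, width)
        for start in range(max_bark)                   # starts 0..19
        for width in range(1, max_bark - start + 1)    # widths that fit below the ceiling
    ]


bands = bark_bands()
print(len(bands))  # number of candidate frequency regions
```

Under this reading, each pair would define one frequency region and therefore one energy-based classifier.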
Pitch Accent: “Corrected” Voting Classifier
  Classify each energy-based prediction as ‘correct’ or ‘incorrect’ using pitch and duration features
    Requires 210 “correcting” decision tree classifiers
  If an energy prediction is hypothesized to be ‘incorrect’, invert it
  “Corrected” voting yields 88.5% accuracy
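The correction step can be sketched for a single word as follows. This is a minimal illustration, assuming an unweighted majority vote in which ties count as unaccented. The slides do not say how votes are weighted or how ties are broken, and the function names are hypothetical.

```python
def plain_vote(energy_preds):
    """Majority vote over per-band accent predictions (True = accented)."""
    return sum(energy_preds) * 2 > len(energy_preds)


def corrected_vote(energy_preds, flagged_incorrect):
    """Invert each prediction its correcting classifier flags, then vote.

    energy_preds[i]      : prediction of the i-th energy classifier
    flagged_incorrect[i] : True if the i-th correcting classifier
                           hypothesizes that prediction is wrong
    """
    corrected = [pred != flag for pred, flag in zip(energy_preds, flagged_incorrect)]
    return plain_vote(corrected)


# Toy example with three bands instead of 210.
preds = [True, False, False]
flags = [False, True, True]
print(plain_vote(preds), corrected_vote(preds, flags))
```

In the toy example the uncorrected vote says unaccented, while the corrected vote flips two predictions and says accented.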
Intonational Phrase Boundary Detection
  Decision tree classifier with pitch, intensity and duration features
  89.1% accuracy (68.3% precision, 64.7% recall)
  Most predictive features:
    Long following pause length
    Descending change in energy over the final 3/4 of the word
    Lower minimum energy relative to the 2 preceding words
    Decreased standard deviation of pitch
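To see how accuracy, precision and recall relate when only 16.7% of words precede a boundary, the sketch below computes all three from a confusion matrix. The four counts are back-calculated so that they reproduce the reported figures on 3326 words with 556 boundaries. They are an illustrative assumption, not numbers reported in the slides.

```python
def boundary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) for the boundary class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Assumed counts, chosen to match the reported percentages.
tp, fp, fn, tn = 360, 167, 196, 2603
accuracy, precision, recall = boundary_metrics(tp, fp, fn, tn)

# Accuracy of always predicting "no boundary" on the same data.
majority_baseline = (fp + tn) / (tp + fp + fn + tn)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(majority_baseline, 3))
```

Because boundaries are the minority class, always predicting "no boundary" would already score about 83% accuracy on this material, which is why precision and recall are the more informative figures.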
Conclusion and Future Work
  We can detect pitch accent with high accuracy, but can this information be used to improve distillation?
  We do not detect IP boundaries with very high accuracy, but can even noisy IP boundaries be used to segment broadcast news (BN) for extractive summarization?
  Are hypothesized IP boundaries useful candidate story segmentation points?
Thank You
Energy Features
  Min, max, mean, stdev, rms of energy
  For IP boundary detection only:
  Min, max, mean, stdev, rms of energy
  The above extracted over the final 3/4 of the word
  The above extracted over the final 200 ms
  Range and Z-score normalized max and mean raw energy by contextual window
    All combinations of 2, 1, 0 previous words and 2, 1, 0 following words
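A minimal sketch of the per-word statistics and the contextual normalization follows. Several details are assumptions: population standard deviation, range normalization taken as min-max scaling within the window, windows clipped at the edges of the word sequence, and all function names.

```python
import math
import statistics
from itertools import product


def energy_stats(frames):
    """Min, max, mean, stdev and rms of one word's per-frame energy values."""
    return {
        "min": min(frames),
        "max": max(frames),
        "mean": statistics.fmean(frames),
        "stdev": statistics.pstdev(frames),  # population stdev (assumption)
        "rms": math.sqrt(sum(x * x for x in frames) / len(frames)),
    }


# All combinations of 2, 1, 0 previous words and 2, 1, 0 following words.
CONTEXT_WINDOWS = list(product((2, 1, 0), repeat=2))


def context_normalize(values, i, n_prev, n_next):
    """Z-score and range-normalize values[i] within a window of neighboring words."""
    lo = max(0, i - n_prev)
    hi = min(len(values), i + n_next + 1)
    window = values[lo:hi]
    mean = statistics.fmean(window)
    sd = statistics.pstdev(window)
    span = max(window) - min(window)
    z = (values[i] - mean) / sd if sd > 0 else 0.0
    r = (values[i] - min(window)) / span if span > 0 else 0.0
    return z, r


word_max_energy = [1.0, 2.0, 3.0]  # one toy value per word
print(len(CONTEXT_WINDOWS), context_normalize(word_max_energy, 1, 1, 1))
```

Applying `context_normalize` once per entry of `CONTEXT_WINDOWS` gives nine normalized variants of each per-word statistic.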
Pitch Features
  Min, max, mean, stdev, rms of raw and speaker normalized F0 and F0
  For IP boundary detection only:
  The above extracted over the final 3/4 of the word
  The above extracted over the final 200 ms
  Pitch reset following the current word
    Difference between the mean of the last 10 pitch points (10 ms frame) of the current word and the first 10 of the following
  Range and Z-score normalized max and mean raw and speaker normalized F0 by contextual window
    All combinations of 2, 1, 0 previous words and 2, 1, 0 following words
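The pitch reset feature can be sketched directly from its description. The sign convention (following word minus current word), the handling of words with fewer than 10 points, and the name `pitch_reset` are assumptions. The inputs are taken to be lists of voiced F0 values, one per 10 ms frame.

```python
import statistics


def pitch_reset(current_f0, following_f0, n_points=10):
    """Mean of the first n_points of the following word minus
    mean of the last n_points of the current word.

    Returns None when either word has no pitch points.
    Words with fewer than n_points simply use what is available (assumption).
    """
    tail = current_f0[-n_points:]
    head = following_f0[:n_points]
    if not tail or not head:
        return None
    return statistics.fmean(head) - statistics.fmean(tail)


current = [float(hz) for hz in range(1, 21)]  # toy contour, 20 frames
following = [200.0] * 12                      # toy contour, 12 frames
print(pitch_reset(current, following))
```

A positive value under this convention means the following word starts higher than the current word ends.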
Duration Features
  Length of word (in seconds)
  Pause preceding the word
  Pause following the word
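Given word-level time alignments, the three duration features reduce to simple differences. The sketch below assumes a hypothetical input format of `(token, start_seconds, end_seconds)` tuples in time order, treats overlapping alignments as a zero-length pause, and returns `None` for the pause before the first word and after the last one.

```python
def duration_features(words):
    """Compute word length and neighboring pauses from (token, start, end) tuples."""
    feats = []
    for i, (token, start, end) in enumerate(words):
        prev_end = words[i - 1][2] if i > 0 else None
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        feats.append({
            "word": token,
            "length": end - start,
            "pause_before": None if prev_end is None else max(0.0, start - prev_end),
            "pause_after": None if next_start is None else max(0.0, next_start - end),
        })
    return feats


# Toy alignment, times in seconds.
aligned = [("good", 0.0, 0.5), ("evening", 0.5, 1.0), ("tonight", 1.5, 2.25)]
for row in duration_features(aligned):
    print(row)
```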