1
Automatic Prosodic Event Detection
Julia Hirschberg
GALE PI Meeting, March 23, 2007
2
Introduction
Acoustic/prosodic features improve speech distillation performance (Maskey & Hirschberg, 2005)
Can categorical features make a contribution?
  Pitch accent
  Intonational phrase (IP) boundaries
3
Material
20 minutes of manually annotated material from TDT4
_1830_1900_ABC_WNT
  25 hypothesized speaker IDs
  3326 words
  1658 (49.8%) accented
  556 (16.7%) precede IP boundaries
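The percentages on this slide can be checked from the raw counts. The sketch below also derives a majority-class baseline for each task. The baseline is not on the slide; it is included only as an assumed point of comparison for the accuracies reported later.

```python
# Counts as reported on the slide.
TOTAL_WORDS = 3326
ACCENTED = 1658
PRE_BOUNDARY = 556


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


def majority_baseline(count, total):
    """Accuracy (in percent) of always predicting the more frequent class.

    Not reported on the slide; computed here as an assumed reference point.
    """
    return share(max(count, total - count), total)


accent_share = share(ACCENTED, TOTAL_WORDS)
boundary_share = share(PRE_BOUNDARY, TOTAL_WORDS)
accent_baseline = majority_baseline(ACCENTED, TOTAL_WORDS)
boundary_baseline = majority_baseline(PRE_BOUNDARY, TOTAL_WORDS)

print(accent_share, boundary_share, accent_baseline, boundary_baseline)
```

Because only about one word in six precedes an IP boundary, accuracy alone says little for that task, which is presumably why precision and recall are also reported for it.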
4
Pitch Accent: Energy Based Voting Classifier
Extract energy features from a set of 210 frequency regions
  Frequency ranges from 0-19 bark
  Bandwidth ranges from 1 to 20 bark
Construct 210 pitch accent decision tree classifiers using only energy features
Voting yields 83.7% accuracy
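The slide does not spell out how the 210 regions are defined or how votes are combined. One reading, assumed here, is that each region is a band with an integer lower edge from 0 to 19 bark and an integer width from 1 to 20 bark, with the upper edge capped at 20 bark; that enumeration happens to yield exactly 210 bands. The function names and the tie-breaking rule below are hypothetical.

```python
def bark_regions(max_bark=20):
    """Enumerate (low, high) bands in bark with integer edges.

    Assumption: the lower edge runs over 0..19 and the width over 1..20,
    keeping only bands whose upper edge stays at or below max_bark.
    """
    regions = []
    for low in range(max_bark):
        for width in range(1, max_bark - low + 1):
            regions.append((low, low + width))
    return regions


def majority_vote(predictions):
    """Combine per-classifier labels (1 = accented, 0 = unaccented).

    Assumption: simple unweighted majority; a tie counts as unaccented.
    """
    accented_votes = sum(predictions)
    return 1 if 2 * accented_votes > len(predictions) else 0


regions = bark_regions()
print(len(regions))               # number of bands under the assumed reading
print(majority_vote([1, 1, 0]))   # two of three classifiers say "accented"
```

With an even number of classifiers, ties are possible, so any real implementation would need an explicit tie-breaking rule; the one used here is only a placeholder.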
5
Pitch Accent: “Corrected” Voting Classifier
Classify each prediction as ‘correct’ or ‘incorrect’ using pitch and duration features
  Requires 210 “correcting” decision tree classifiers
If an energy prediction is hypothesized to be ‘incorrect’, invert it
“Corrected” voting yields 88.5% accuracy
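The correction step can be sketched as follows. The sketch assumes binary labels, one correcting judgment per energy classifier, and an unweighted majority vote over the corrected predictions; none of these details are stated on the slide, and the function name is hypothetical.

```python
def corrected_vote(energy_preds, judged_incorrect):
    """Invert each energy-based prediction flagged as incorrect, then vote.

    energy_preds:     list of 0/1 labels, one per energy classifier.
    judged_incorrect: list of booleans, one per correcting classifier
                      (True means the paired prediction is believed wrong).
    Assumption: unweighted majority; a tie counts as unaccented (0).
    """
    if len(energy_preds) != len(judged_incorrect):
        raise ValueError("need one correcting judgment per prediction")
    corrected = [
        1 - pred if wrong else pred
        for pred, wrong in zip(energy_preds, judged_incorrect)
    ]
    return 1 if 2 * sum(corrected) > len(corrected) else 0


# Two "accented" votes are flagged as wrong and flipped, so the word
# ends up labeled unaccented.
print(corrected_vote([1, 1, 0], [True, True, False]))
```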
6
Intonational Phrase Boundary Detection
Decision tree classifier with pitch, intensity and duration features
89.1% accuracy (68.3% precision, 64.7% recall)
Most predictive features:
  Long following pause length
  Descending change in energy over the final 3/4 of the word
  Lower minimum energy relative to the 2 preceding words
  Decreased standard deviation of pitch
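For reference, the three reported metrics relate to a confusion matrix as sketched below. The toy labels are invented purely for illustration, and the F-measure is not reported on the slide; it is derived here from the stated precision and recall.

```python
def boundary_metrics(true_labels, predicted_labels):
    """Accuracy, precision and recall for binary labels (1 = boundary)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only.
toy = boundary_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])

# F-measure implied by the precision and recall stated on the slide.
reported_f = f_measure(0.683, 0.647)
print(toy, round(reported_f, 3))
```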
7
Conclusion and Future Work
We can detect pitch accent with high accuracy, but can this information be used to improve distillation?
While we do not detect them with very high accuracy, can even noisy IP boundaries be used to segment BN for extractive summarization?
Are hypothesized IP boundaries useful candidate story segmentation points?
8
Thank You
9
Energy Features
Min, max, mean, stdev, rms of energy
For IP boundary detection only:
  The above extracted over the final 3/4 of the word
  The above extracted over the final 200ms
  Range and Z-score normalized max and mean raw energy by contextual window
    All combinations of 2,1,0 previous words, and 2,1,0 following
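A minimal sketch of the word-level energy statistics and the two sub-word regions follows. It assumes energy is available as one value per 10ms frame within a word (the frame rate is borrowed from the pitch slide and is an assumption for energy) and that "stdev" means the population standard deviation. All function names are hypothetical.

```python
import math
import statistics


def summary_stats(values):
    """Min, max, mean, stdev and rms of a non-empty sequence of frame values.

    Assumption: population standard deviation; the slide does not say which.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
        "rms": math.sqrt(sum(v * v for v in values) / len(values)),
    }


def final_fraction(values, fraction=0.75):
    """Return the frames covering the final `fraction` of the word."""
    keep = math.ceil(len(values) * fraction)
    return values[len(values) - keep:]


def final_window(values, window_ms=200, frame_ms=10):
    """Return the frames covering the final `window_ms` of the word.

    Words shorter than the window are returned whole.
    """
    n_frames = window_ms // frame_ms
    return values[-n_frames:]


frames = [1.0, 2.0, 3.0, 4.0]
stats = summary_stats(frames)
tail_stats = summary_stats(final_fraction(frames))
print(stats, tail_stats)
```

The same `summary_stats` call can be applied to the output of `final_window` to obtain the final-200ms variants.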
10
Pitch Features
Min, max, mean, stdev, rms of raw and speaker normalized F0 and F0
For IP boundary detection only:
  The above extracted over the final 3/4 of the word
  The above extracted over the final 200ms
  Pitch reset following the current word
    Difference between the mean of the last 10 pitch points (10ms frame) of the current word and the first 10 of the following
  Range and Z-score normalized max and mean raw and speaker normalized F0 by contextual window
    All combinations of 2,1,0 previous words, and 2,1,0 following
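The pitch reset and contextual normalization can be sketched as below. Several details are assumptions rather than facts from the slide: F0 tracks are lists of voiced pitch points only, the reset is taken as current-word mean minus following-word mean, and the Z-score uses the population standard deviation of whatever values fall inside the context window. Function names are hypothetical.

```python
import statistics


def pitch_reset(current_f0, following_f0, n_points=10):
    """Mean of the last n pitch points of the current word minus the mean
    of the first n pitch points of the following word.

    Assumption: the sign convention (current minus following).
    """
    tail = current_f0[-n_points:]
    head = following_f0[:n_points]
    return statistics.fmean(tail) - statistics.fmean(head)


def zscore_in_context(value, context_values):
    """Z-score of `value` against the values in its context window.

    Returns 0.0 when the window has no variance.
    """
    mean = statistics.fmean(context_values)
    sd = statistics.pstdev(context_values)
    return 0.0 if sd == 0 else (value - mean) / sd


def range_in_context(value, context_values):
    """Position of `value` within the min-max range of the window."""
    low, high = min(context_values), max(context_values)
    return 0.0 if high == low else (value - low) / (high - low)


def context_windows():
    """All (previous, following) word counts drawn from 2, 1, 0."""
    return [(prev, foll) for prev in (2, 1, 0) for foll in (2, 1, 0)]


reset = pitch_reset([100.0] * 10 + [200.0] * 10, [150.0] * 10 + [90.0] * 5)
windows = context_windows()
print(reset, len(windows))
```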
11
Duration Features
Length of word (in seconds)
Pause preceding the word
Pause following the word
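Given word-level time alignments, the three duration features can be computed as in this sketch. It assumes each word is a (start, end) pair in seconds, in time order, and that the first and last words have zero preceding and following pause respectively; the slide does not say how edges are handled.

```python
def duration_features(words):
    """Compute length and surrounding pauses for each aligned word.

    words: list of (start_sec, end_sec) tuples in time order.
    Assumption: pauses at the edges of the recording are set to 0.0,
    and overlapping alignments are clipped to a pause of 0.0.
    """
    feats = []
    for i, (start, end) in enumerate(words):
        prev_end = words[i - 1][1] if i > 0 else start
        next_start = words[i + 1][0] if i + 1 < len(words) else end
        feats.append({
            "length": end - start,
            "pause_before": max(0.0, start - prev_end),
            "pause_after": max(0.0, next_start - end),
        })
    return feats


# Three words; a half-second pause separates the second and third.
feats = duration_features([(0.0, 0.5), (0.5, 1.0), (1.5, 2.0)])
print(feats)
```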