Automatic Prosodic Event Detection

Julia Hirschberg GALE PI Meeting March 23, 2007

Introduction
Acoustic/prosodic features improve speech distillation performance (Maskey & Hirschberg, 2005)
Can categorical features make a contribution?
    Pitch accent
    Intonational Phrase (IP) boundaries

Material
20 minutes of manually annotated material from TDT4
_1830_1900_ABC_WNT
25 hypothesized speaker IDs
3326 words
1658 (49.8%) accented
556 (16.7%) precede IP boundaries

Pitch Accent: Energy Based Voting Classifier
Extract energy features from a set of 210 frequency regions
    Frequency ranges from 0 to 19 bark
    Bandwidth ranges from 1 to 20 bark
Construct 210 pitch accent decision tree classifiers using only energy features
Voting yields 83.7% accuracy
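A minimal sketch of how the 210 regions and the vote could be set up. It assumes that "frequency" means the lower edge of a band and that a band may not extend past 20 bark; that reading is not stated on the slide, but it does yield exactly 210 bands. The function names and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def bark_regions(max_bark: int = 20) -> List[Tuple[int, int]]:
    """Enumerate (start, bandwidth) pairs in bark that fit inside 0..max_bark.

    Assumption: 'frequency' on the slide is the lower band edge.
    """
    return [
        (start, width)
        for start in range(max_bark)                 # lower edges 0..19 bark
        for width in range(1, max_bark - start + 1)  # bandwidths 1..20 bark
    ]


def majority_vote(predictions: List[bool]) -> bool:
    """Return True (accented) if strictly more than half the classifiers agree.

    Assumption: ties count as 'not accented'; the slide does not say.
    """
    return sum(predictions) * 2 > len(predictions)


regions = bark_regions()
print(len(regions))  # one energy-only classifier would be trained per region
```

Each region would feed one energy-only decision tree, and `majority_vote` would combine the per-region predictions for a word.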

Pitch Accent: “Corrected” Voting Classifier
Classify each prediction as ‘correct’ or ‘incorrect’ using pitch and duration features
Requires 210 “correcting” decision tree classifiers
If an energy prediction is hypothesized to be ‘incorrect’, invert it
“Corrected” voting yields 88.5% accuracy
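The correction step can be sketched as follows, assuming each of the 210 energy classifiers is paired with one correcting classifier whose output is a boolean "this prediction looks incorrect". The function name, the input layout, and the tie-breaking rule are hypothetical.

```python
from typing import List


def corrected_vote(energy_preds: List[bool], judged_incorrect: List[bool]) -> bool:
    """Invert every energy prediction flagged as incorrect, then majority-vote.

    energy_preds[i]     -- accent decision of the i-th energy classifier
    judged_incorrect[i] -- True if the i-th correcting classifier (pitch and
                           duration features) hypothesizes that decision is wrong
    """
    if len(energy_preds) != len(judged_incorrect):
        raise ValueError("one correcting decision is needed per energy prediction")
    # XOR: a prediction flips exactly when it is judged incorrect.
    corrected = [pred != wrong for pred, wrong in zip(energy_preds, judged_incorrect)]
    # Assumption: ties count as 'not accented'.
    return sum(corrected) * 2 > len(corrected)


# Three classifiers for brevity; the second prediction is flagged and flipped.
print(corrected_vote([True, False, False], [False, True, False]))
```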

Intonational Phrase Boundary Detection
Decision tree classifier with pitch, intensity and duration features
89.1% accuracy (68.3% precision, 64.7% recall)
Most predictive features:
    Long following pause length
    Descending change in energy over the final 3/4 of the word
    Lower minimum energy relative to the 2 preceding words
    Decreased standard deviation of pitch
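To put these figures in context, the short calculation below derives two numbers that are not on the slide: the F1 score implied by the reported precision and recall, and the accuracy of always predicting "no boundary", using the word counts from the Material slide. It assumes the evaluation data has the same boundary rate as that material.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported on the slides.
precision, recall = 0.683, 0.647
total_words, boundary_words = 3326, 556

# Derived here, not reported by the authors.
f1 = f1_score(precision, recall)
majority_baseline = 1 - boundary_words / total_words  # always say "no boundary"

print(f"F1 = {f1:.3f}")
print(f"majority-class baseline accuracy = {majority_baseline:.3f}")
```

Under these assumptions the F1 is roughly 66.5% and the majority-class baseline is about 83.3%, so precision and recall are more informative than accuracy alone for a class that covers only 16.7% of words.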

Conclusion and Future Work
We can detect pitch accent with high accuracy, but can this information be used to improve distillation?
While we do not detect them with very high accuracy, can even noisy IP boundaries be used to segment broadcast news (BN) for extractive summarization?
Are hypothesized IP boundaries useful candidate story segmentation points?

Thank You

Energy Features
Min, max, mean, stdev, rms of energy
For IP boundary detection only:
    The above extracted over the final 3/4 of the word
    The above extracted over the final 200ms
    Range and Z-score normalized max and mean raw energy by contextual window
        All combinations of 2,1,0 previous words, and 2,1,0 following
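A sketch of normalization by contextual window, assuming one scalar per word (for example its mean energy) and a window made of the current word plus up to 2 previous and 2 following words. Truncating the window at the edges of the sequence and returning 0.0 for degenerate windows are assumptions made here, as are the function names.

```python
from itertools import product
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def context_windows() -> List[Tuple[int, int]]:
    """All (previous, following) word counts drawn from {0, 1, 2}: 9 windows."""
    return list(product((0, 1, 2), repeat=2))


def normalize_in_window(
    values: Sequence[float], i: int, prev: int, foll: int
) -> Tuple[float, float]:
    """Return (z_score, range_normalized) of values[i] within its context window."""
    lo = max(0, i - prev)                # truncate at the start of the sequence
    hi = min(len(values), i + foll + 1)  # truncate at the end of the sequence
    window = list(values[lo:hi])
    sigma = pstdev(window)
    spread = max(window) - min(window)
    # Degenerate windows (e.g. the word alone) carry no contrast: return 0.0.
    z = (values[i] - mean(window)) / sigma if sigma > 0 else 0.0
    r = (values[i] - min(window)) / spread if spread > 0 else 0.0
    return z, r


word_energy = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up per-word values
for prev, foll in context_windows():
    print(prev, foll, normalize_in_window(word_energy, 2, prev, foll))
```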

Pitch Features
Min, max, mean, stdev, rms of raw and speaker normalized F0 and F0
For IP boundary detection only:
    The above extracted over the final 3/4 of the word
    The above extracted over the final 200ms
    Pitch reset following the current word
        Difference between the mean of the last 10 pitch points (10ms frame) of the current word and the first 10 of the following
    Range and Z-score normalized max and mean raw and speaker normalized F0 by contextual window
        All combinations of 2,1,0 previous words, and 2,1,0 following
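The pitch reset feature can be sketched as below, assuming each word comes with a list of voiced F0 values sampled at 10ms frames. The sign convention (following word minus current word) and the handling of words with no voiced frames are assumptions; the slide specifies neither.

```python
from statistics import mean
from typing import Optional, Sequence


def pitch_reset(
    current_f0: Sequence[float],
    following_f0: Sequence[float],
    n_points: int = 10,
) -> Optional[float]:
    """Mean F0 of the first n_points of the following word minus the mean F0
    of the last n_points of the current word.

    Returns None when either word has no pitch points. Words with fewer than
    n_points simply contribute all the points they have.
    """
    tail = list(current_f0[-n_points:])
    head = list(following_f0[:n_points])
    if not tail or not head:
        return None
    return mean(head) - mean(tail)


# Made-up contours: the current word falls to 100 Hz, the next starts at 150 Hz.
print(pitch_reset([180.0] * 5 + [100.0] * 10, [150.0] * 12))
```

With this sign convention a positive value indicates that pitch jumps up at the start of the following word.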

Duration Features
Length of word (in seconds)
Pause preceding the word
Pause following the word
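These three features could be read off word-level time alignments roughly as follows. The `Word` record and the choice of 0.0 for a missing neighbor are hypothetical; the slides do not describe the alignment format.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def duration_features(words: List[Word], i: int) -> Dict[str, float]:
    """Word length plus the silent gaps before and after words[i], in seconds."""
    word = words[i]
    # Assumption: the first and last words get a pause of 0.0 on the open side.
    pause_before = word.start - words[i - 1].end if i > 0 else 0.0
    pause_after = words[i + 1].start - word.end if i + 1 < len(words) else 0.0
    return {
        "length": word.end - word.start,
        "pause_before": pause_before,
        "pause_after": pause_after,
    }


# Made-up alignment for illustration.
aligned = [Word("good", 0.00, 0.25), Word("evening", 0.30, 0.80), Word("tonight", 1.40, 1.90)]
print(duration_features(aligned, 1))
```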
