On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings
Jáchym Kolář (1,2), Elizabeth Shriberg (1,3), Yang Liu (1,4)
(1) International Computer Science Institute, Berkeley, USA
(2) University of West Bohemia in Pilsen, Czech Republic
(3) SRI International, USA
(4) University of Texas at Dallas, USA
09/20/2006 Kolář et al.: On Speaker-Specific Prosodic Models for...
Why automatic DA segmentation?
- Standard STT systems output a raw stream of words, leaving out structural information such as sentence and Dialog Act (DA) boundaries
  - Problems for human readability
  - Problems when applying downstream natural language processing techniques that require formatted input
Goal and Task Definition
- Goal: Dialog Act (DA) segmentation of meetings
- Task definition: 2-way classification in which each inter-word boundary is labeled as either a within-DA boundary or a boundary between DAs
  - e.g. “no jobs are still running ok” contains 3 DAs: “No.” + “Jobs are still running.” + “OK.”
- Evaluation metric: “Boundary error rate”
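The task definition can be illustrated with a short sketch. It assumes that every word is followed by one boundary candidate (including the last word) and that the boundary error rate is the fraction of candidates labeled incorrectly; the slide names the metric but does not define it, so both points are assumptions. The function names and the hypothesis labels are made up for illustration.

```python
def boundary_labels(dialog_acts):
    """Flatten DAs into words plus one label per word: 1 if a DA boundary
    follows the word, 0 if the following boundary is within a DA."""
    words, labels = [], []
    for da in dialog_acts:
        for position, word in enumerate(da):
            words.append(word)
            labels.append(1 if position == len(da) - 1 else 0)
    return words, labels


def boundary_error_rate(reference, hypothesis):
    """Fraction of boundary candidates whose label is wrong (assumed definition)."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same boundaries")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)


# The slide's example: 3 DAs over 6 words.
words, reference = boundary_labels(
    [["no"], ["jobs", "are", "still", "running"], ["ok"]]
)
# Hypothetical system output that misses the boundary after "no".
hypothesis = [0, 0, 0, 0, 1, 1]
ber = boundary_error_rate(reference, hypothesis)
print(words, reference, ber)
```

With one missed boundary out of six candidates, the error rate in this toy case is 1/6.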
Approach: Explore Speaker-Specific Prosody
- Past work has used both lexical and prosodic features, but collapsed over speakers
- Speakers appear to differ, however, in both feature types, especially in spontaneous speech
- Meeting applications: the speaker is often known, or at least recorded on one channel, and often participates in ongoing meetings, which gives a good opportunity for modeling
- Speaker adaptation has been used successfully in the cepstral domain for ASR
- This study takes a first look specifically at prosodic features for the DA boundary task
Three Questions
1) Do individual speakers benefit from modeling more than simply pause information?
2) Do individual speakers differ enough from the overall speaker model to benefit from a prosodic model trained on only their speech?
3) How do speakers differ in terms of prosody usage in marking DA boundaries?
Data and Experimental Setup
- ICSI meeting corpus: multichannel conversational speech annotated for DAs
- Baseline speaker-independent model trained on 567k words
- For speaker-specific experiments: the 20 most frequent speakers in terms of total words (7.5k – 165k words)
  - 17 males, 3 females
  - 12 natives, 8 nonnatives
Data and Experimental Setup II.
- Each speaker’s data: ~70% training, ~30% testing
- Jackknife instead of a separate development set: the 1st half of the test data is used to tune weights for the 2nd half, and vice versa
- Tested on forced alignments rather than on ASR hypotheses
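The jackknife tuning can be sketched as follows. The sketch assumes that the tuned quantity is a single weight `lam` that linearly interpolates SI and SD boundary posteriors, that a boundary is hypothesized when the combined posterior reaches 0.5, and that the weight is chosen from a small grid. None of these details is stated on the slide, and the posteriors below are invented toy values.

```python
def count_errors(p_si, p_sd, labels, lam, threshold=0.5):
    """Count misclassified boundaries for one interpolation weight."""
    errors = 0
    for a, b, y in zip(p_si, p_sd, labels):
        p = lam * a + (1 - lam) * b  # assumed linear interpolation
        errors += int((p >= threshold) != bool(y))
    return errors


def tune_weight(p_si, p_sd, labels, grid):
    """Pick the grid weight with the fewest errors (first one wins on ties)."""
    return min(grid, key=lambda lam: count_errors(p_si, p_sd, labels, lam))


def jackknife_error_rate(p_si, p_sd, labels, grid):
    """Tune on one half of the test data, score the other half, then swap."""
    mid = len(labels) // 2
    first, second = slice(0, mid), slice(mid, len(labels))
    errors = 0
    for tune, test in ((first, second), (second, first)):
        lam = tune_weight(p_si[tune], p_sd[tune], labels[tune], grid)
        errors += count_errors(p_si[test], p_sd[test], labels[test], lam)
    return errors / len(labels)


# Toy posteriors for 8 boundaries (made up for illustration).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
p_si = [0.9, 0.2, 0.3, 0.1, 0.8, 0.3, 0.4, 0.2]
p_sd = [0.7, 0.7, 0.8, 0.3, 0.6, 0.9, 0.9, 0.1]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

lam_first = tune_weight(p_si[:4], p_sd[:4], labels[:4], grid)
lam_second = tune_weight(p_si[4:], p_sd[4:], labels[4:], grid)
rate = jackknife_error_rate(p_si, p_sd, labels, grid)
print(lam_first, lam_second, rate)
```

Because each half is scored with a weight tuned on the other half, no test boundary influences its own weight, which is the point of using a jackknife in place of a held-out development set.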
Prosodic Features and Classifiers
- Features: 32 for each inter-word boundary
  - Pause (after the current, previous, and following word)
  - Duration (phone-normalized durations of vowels, final rhymes, and words; no raw durations)
  - Pitch (F0 min, max, mean, slopes, and differences and ratios across word boundaries; raw values + piecewise-linear (PWL) stylized contour)
  - Energy (max, min, and mean frame-level RMS values, both raw and normalized)
- Classifiers: CART-style decision trees with ensemble bagging
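As a rough illustration of bagging, the sketch below trains one-split trees (stumps) on bootstrap resamples and averages their leaf posteriors. It is a deliberate simplification: the classifiers named on the slide are full CART-style trees over 32 features, whereas this toy uses two invented features (pause length and a pitch-reset score), made-up data, and hypothetical function names.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(rows, labels):
    """Fit a one-split tree: choose the (feature, threshold) with the lowest
    size-weighted Gini impurity; each leaf stores its boundary fraction."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no usable split: fall back to a single leaf
        p = sum(labels) / len(labels)
        return (0, float("inf"), p, p)
    return best[1:]


def predict_stump(stump, row):
    f, t, p_left, p_right = stump
    return p_left if row[f] <= t else p_right


def fit_bagged(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble


def predict_bagged(ensemble, row):
    """Average the boundary posteriors of all trees in the ensemble."""
    return sum(predict_stump(s, row) for s in ensemble) / len(ensemble)


# Toy data: [pause in seconds, pitch-reset score]; label 1 = DA boundary.
rows = [[0.02, 0.1], [0.05, 0.0], [0.03, 0.2], [0.10, 0.1],
        [0.45, 0.9], [0.60, 0.7], [0.38, 0.8], [0.80, 1.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = fit_bagged(rows, labels)
p_long = predict_bagged(ensemble, [0.70, 0.9])
p_short = predict_bagged(ensemble, [0.01, 0.0])
print(p_long, p_short)
```

Averaging over resampled trees smooths the posterior estimates, which matters when posteriors are later combined across models.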
Pause-only vs. Richer Set of Prosodic Features
- Compare the speaker-independent (SI) model using pause features only (SI-Pau) with the SI model using all 32 prosodic features (SI-All)
- SI-All is significantly better for 19 of 20 speakers
- The relative error rate reduction from prosody is not correlated with the amount of training data
Pause-only vs. Rich Prosody: Relative Error Reduction
[Figure not preserved in this text]
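For reference, relative error reduction is conventionally computed as the drop in error divided by the baseline error. The sketch below assumes that convention, with SI-Pau as the baseline; the two error rates are made-up numbers, not results from the study.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new model."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Made-up error rates for one speaker (not taken from the slides).
si_pau_error = 0.10
si_all_error = 0.08
reduction = relative_error_reduction(si_pau_error, si_all_error)
print(f"{100 * reduction:.1f}% relative reduction")
```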
Speaker-Independent (SI) vs. Speaker-Dependent (SD) Models
- We compare SI, SD, and interpolated SI+SD models
- SI+SD defined as: [formula not preserved in this text]
- A significantly improved result would suggest that the speaker's prosodic marking of boundaries differs from the baseline SI model
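The defining formula for SI+SD did not survive in this text. One common way to interpolate two classifiers, given here only as an assumption and not as the authors' definition, is a linear combination of their boundary posteriors, where $B$ is the event "DA boundary", $x$ the prosodic feature vector at an inter-word boundary, and $\lambda$ a hypothetical weight tuned on held-out data:

```latex
P_{\mathrm{SI+SD}}(B \mid x) \;=\; \lambda \, P_{\mathrm{SI}}(B \mid x) \;+\; (1 - \lambda)\, P_{\mathrm{SD}}(B \mid x),
\qquad 0 \le \lambda \le 1
```

Under this form, $\lambda = 1$ recovers the SI model and $\lambda = 0$ the SD model, so the interpolated model can fall back to either one.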
Effects of Adding SD Information
- SD models are much smaller than the SI model; as expected, SI is better than SD alone for most subjects (though for some, SD is better!)
- For many subjects, there is no gain from adding SD information (no SD information, or not enough data?)
- For 7 of 20 speakers, however, SD or SI+SD is better than SI; 5 of these improvements are statistically significant
- Improvement from SD is not correlated with amount of data, error rate, chance error, proficiency in English, or gender
- SD often helps in “unusual” prosody situations: hesitation, lip smack, long pause, emotions
- SD helps more in preventing false alarms than misses
Audio Examples: SD Helps
- Example of preventing a FALSE ALARM: “and another thing that we did also is that |FA| we have all this training data … ”
  - SD does not false alarm after the 2nd “that” because it ‘knows’ this nonnative speaker has a limited F0 range and often falls in pitch before hesitations
- Example of preventing a MISS: “this is one |.| and I think that's just fine |.|”
  - SD finds the DA boundary after “one”, despite the short pause, probably based on the speaker’s prototypical pitch reset
Feature Usage, Natives vs. Nonnatives
- Feature usage: how many times a feature is queried in the tree, weighted by the number of samples it affects
- 5 groups of features:
  - Pause at boundary
  - Near pause
  - Duration
  - Pitch
  - Energy
- Compare the SD feature usage of improved speakers with the SI distribution
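The feature usage statistic can be sketched on a toy tree. The sketch assumes that each internal node contributes the number of training samples reaching it to the feature it queries, and that the totals are normalized to sum to 1 before being pooled into the five groups. The normalization, the nested-dict tree format, and the feature names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical feature names mapped onto the five groups from the slide.
FEATURE_GROUPS = {
    "pause_after": "Pause at boundary",
    "pause_prev": "Near pause",
    "vowel_dur": "Duration",
    "f0_diff": "Pitch",
    "rms_mean": "Energy",
}


def raw_usage(node, counts=None):
    """Sum, per feature, the samples reaching every node that queries it."""
    if counts is None:
        counts = defaultdict(int)
    if "feature" in node:  # internal node; leaves carry no "feature" key
        counts[node["feature"]] += node["n_samples"]
        raw_usage(node["left"], counts)
        raw_usage(node["right"], counts)
    return counts


def group_usage(tree):
    """Normalize raw usage to sum to 1 and pool it by feature group."""
    counts = raw_usage(tree)
    total = sum(counts.values())
    usage = {group: 0.0 for group in FEATURE_GROUPS.values()}
    if total == 0:  # a tree that is a single leaf queries no feature
        return usage
    for feature, count in counts.items():
        usage[FEATURE_GROUPS[feature]] += count / total
    return usage


# Toy tree: the root queries the pause, its children query pitch and duration.
toy_tree = {
    "feature": "pause_after", "n_samples": 1000,
    "left": {"feature": "f0_diff", "n_samples": 600,
             "left": {"n_samples": 450}, "right": {"n_samples": 150}},
    "right": {"feature": "vowel_dur", "n_samples": 400,
              "left": {"n_samples": 250}, "right": {"n_samples": 150}},
}
usage = group_usage(toy_tree)
print(usage)
```

For a bagged ensemble, the same statistic could be averaged over all trees, giving one distribution per model that can then be compared between SD and SI models.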
Feature Usage: Natives vs. Nonnatives
[Figure not preserved in this text]
Summary
- Prosodic features beyond pause provide improvement for 19 of 20 frequent speakers
- For ~30% of the speakers studied, simply interpolating the large SI prosodic model with a small SD model yielded improvement
- Amount of data, error rate, chance error, proficiency in English, and gender are not correlated with improvement from SD
- Some interesting observations: nonnative speakers differ from native speakers in feature usage patterns; SD information helps in “unusual” prosody situations and in preventing false alarms
Conclusions and Future Work
- Results are interesting and suggestive, but as yet inconclusive
- SD prosody modeling significantly benefits some speakers, but predicting who they will be is still an open question
- Many issues remain to be addressed, especially joint modeling with lexical features and a better integration approach
- The approach is interesting to explore for other domains such as broadcast news, where segmentation is important and some speakers occur repeatedly