On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings. Jáchym Kolář 1,2, Elizabeth Shriberg 1,3, Yang Liu 1,4.


On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings
Jáchym Kolář 1,2, Elizabeth Shriberg 1,3, Yang Liu 1,4
1 International Computer Science Institute, Berkeley, USA
2 University of West Bohemia in Pilsen, Czech Republic
3 SRI International, USA
4 University of Texas at Dallas, USA

09/20/2006, Kolář et al.: On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings

Why automatic DA segmentation?
- Standard STT systems output a raw stream of words, leaving out structural information such as sentence and Dialog Act (DA) boundaries
- Problems for human readability
- Problems when applying downstream natural language processing techniques requiring formatted input

Goal and Task Definition
- Goal: Dialog Act (DA) segmentation of meetings
- Task definition: 2-way classification in which each inter-word boundary is labeled as a within-DA boundary or a boundary between DAs
- Example: "no jobs are still running ok" contains 3 DAs: "No." + "Jobs are still running." + "OK."
- Evaluation metric: "boundary error rate"
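The "boundary error rate" metric named on this slide can be sketched as the fraction of inter-word boundaries whose predicted label disagrees with the reference. This is a minimal illustration assuming that definition; the function name and the 0/1 label encoding are mine, not from the original system.

```python
# Sketch of the boundary error rate: fraction of inter-word boundaries
# (1 = DA boundary, 0 = within-DA) labeled differently than the reference.
def boundary_error_rate(reference, hypothesis):
    assert len(reference) == len(hypothesis)
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

# "no jobs are still running ok" has 5 inter-word boundaries; the reference
# places DA boundaries after "no" and after "running".
ref = [1, 0, 0, 0, 1]
hyp = [0, 0, 0, 0, 1]   # a system that misses the boundary after "no"
print(boundary_error_rate(ref, hyp))  # -> 0.2 (one error out of five)
```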

Approach: Explore Speaker-Specific Prosody
- Past work has used both lexical and prosodic features, but collapsing over speakers
- Speakers appear to differ, however, in both feature types, especially in spontaneous speech
- Meeting applications: the speaker is often known or at least recorded on one channel, and often participates in ongoing meetings -> a good opportunity for modeling
- Speaker adaptation has been used successfully in the cepstral domain for ASR
- This study takes a first look specifically at prosodic features for the DA boundary task

Three Questions
1) Do individual speakers benefit from modeling more than simply pause information?
2) Do individual speakers differ enough from the overall speaker model to benefit from a prosodic model trained on only their speech?
3) How do speakers differ in terms of prosody usage in marking DA boundaries?

Data and Experimental Setup
- ICSI Meeting Corpus: multichannel conversational speech annotated for DAs
- Baseline speaker-independent model trained on 567k words
- For speaker-specific experiments: the 20 most frequent speakers in terms of total words (7.5k - 165k words)
- 17 males, 3 females; 12 natives, 8 nonnatives

Data and Experimental Setup II
- Each speaker's data: ~70% training, ~30% testing
- Jackknife instead of a separate development set: the 1st half of the test data is used to tune weights for the 2nd half, and vice versa
- Tested on forced alignments rather than on ASR hypotheses

Prosodic Features and Classifiers
Features: 32 for each inter-word boundary
- Pause: after the current, previous, and following word
- Duration: phone-normalized durations of vowels, final rhymes, and words; no raw durations
- Pitch: F0 min, max, mean, slopes, and differences and ratios across word boundaries; raw values plus PWL-stylized contour
- Energy: max, min, and mean frame-level RMS values, both raw and normalized
Classifiers: CART-style decision trees with ensemble bagging
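The classifier setup named above (CART-style decision trees with bagging) can be sketched with scikit-learn, whose BaggingClassifier uses decision trees as its default base estimator. The two "prosodic" features below are synthetic stand-ins invented for illustration, not the paper's 32-feature set.

```python
# Minimal sketch of bagged decision trees for boundary classification.
# Synthetic data: feature 0 mimics pause duration, feature 1 a pitch measure;
# DA boundaries (y = 1) are given longer pauses on average.
import numpy as np
from sklearn.ensemble import BaggingClassifier  # defaults to decision trees

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)        # 1 = DA boundary, 0 = within-DA
X = rng.normal(size=(n, 2))
X[y == 1, 0] += 3.0                   # boundaries tend to follow long pauses

clf = BaggingClassifier(n_estimators=50, random_state=0)
clf.fit(X[:300], y[:300])
acc = clf.score(X[300:], y[300:])
print(f"held-out accuracy: {acc:.2f}")
```

With the pause feature this cleanly separated, the bagged ensemble should recover most boundaries; the real task is of course far harder.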

Pause-Only vs. Richer Set of Prosodic Features
- Compare the speaker-independent (SI) model with pause only (SI-Pau) against the SI model with all 32 prosodic features (SI-All)
- SI-All significantly better for 19 of 20 speakers
- Relative error rate reduction by prosody not correlated with the amount of training data

Pause-Only vs. Rich Prosody: Relative Error Reduction

Speaker-Independent (SI) vs. Speaker-Dependent (SD) Models
- We compare SI, SD, and interpolated SI+SD models
- SI+SD defined as a linear interpolation of the two models' posteriors: P_SI+SD(b | f) = lambda * P_SI(b | f) + (1 - lambda) * P_SD(b | f)
- A significantly improved result would suggest that the speaker's prosodic marking of boundaries differs from the baseline SI model
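The SI+SD interpolation, together with the jackknife weight tuning from the setup slide (tune lambda on one half of the test data, apply it to the other, and vice versa), can be sketched as follows. All posteriors here are synthetic; names and the grid search are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: interpolate SI and SD boundary posteriors, tuning the weight
# lambda by jackknife over the two halves of the test data.
import numpy as np

def interpolate(p_si, p_sd, lam):
    # P_SI+SD(b|f) = lam * P_SI(b|f) + (1 - lam) * P_SD(b|f)
    return lam * p_si + (1 - lam) * p_sd

def error_rate(posteriors, labels, threshold=0.5):
    return float(np.mean((posteriors > threshold).astype(int) != labels))

def tune_lambda(p_si, p_sd, labels, grid=np.linspace(0, 1, 21)):
    return min(grid, key=lambda lam: error_rate(interpolate(p_si, p_sd, lam), labels))

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 200)
p_si = np.clip(labels + rng.normal(0, 0.4, 200), 0, 1)   # decent SI posteriors
p_sd = np.clip(labels + rng.normal(0, 0.6, 200), 0, 1)   # noisier SD posteriors

# Jackknife: tune on the 1st half, score the 2nd half, and vice versa.
lam1 = tune_lambda(p_si[:100], p_sd[:100], labels[:100])
lam2 = tune_lambda(p_si[100:], p_sd[100:], labels[100:])
err = 0.5 * (error_rate(interpolate(p_si[100:], p_sd[100:], lam1), labels[100:])
           + error_rate(interpolate(p_si[:100], p_sd[:100], lam2), labels[:100]))
print(f"jackknifed boundary error rate: {err:.3f}")
```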

Effects of Adding SD Information
- SD models much smaller than the SI model; as expected, SI better than SD alone for most subjects (though for some, SD better!)
- For many subjects, no gain from adding SD information (no SD information, or not enough data?)
- For 7 of 20 speakers, however, SD or SI+SD is better than SI; 5 improvements statistically significant
- Improvement by SD not correlated with amount of data, error rate, chance error, proficiency in English, or gender
- SD often helps in "unusual" prosody situations: hesitations, lip smacks, long pauses, emotions
- SD helps more in preventing false alarms than misses

Audio Examples: SD Helps
- Preventing a FALSE ALARM: "and another thing that we did also is that |FA| we have all this training data ..." SD does not false-alarm after the 2nd "that" because it 'knows' this nonnative speaker has a limited F0 range and often falls in pitch before hesitations
- Preventing a MISS: "this is one |.| and I think that's just fine |.|" SD finds the DA boundary after "one", despite the short pause, probably based on the speaker's prototypical pitch reset

Feature Usage, Natives vs. Nonnatives
- Feature usage: how many times a feature is queried in the tree, weighted by the number of samples it affects
- 5 groups of features: pause at boundary, near pause, duration, pitch, energy
- Compare the SD feature usage of the improved speakers with the SI distribution
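The feature-usage statistic defined above (queries in the tree, weighted by the samples affected) can be sketched from a fitted scikit-learn decision tree, whose internal arrays expose each node's split feature and sample count. The two-feature synthetic data is mine; only feature 0 is made informative, so its usage should dominate.

```python
# Sketch of "feature usage": for each feature, sum the sample counts of the
# internal nodes that query it, then normalize into a distribution.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0.5).astype(int)            # only feature 0 is informative

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = tree.tree_
is_split = t.feature >= 0                  # leaves are marked with -2
usage = np.bincount(t.feature[is_split],
                    weights=t.n_node_samples[is_split],
                    minlength=X.shape[1])
usage = usage / usage.sum()
print(dict(enumerate(usage)))              # feature 0 should dominate
```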

Feature Usage: Natives vs. Nonnatives

Summary
- Prosodic features beyond pause provide improvement for 19 of 20 frequent speakers
- For ~30% of the speakers studied, simply interpolating the large SI prosodic model with a small SD model yielded improvement
- Amount of data, error rate, chance error, proficiency in English, and gender not correlated with improvement by SD
- Some interesting observations: nonnative speakers differ from natives in feature usage patterns; SD information helps in "unusual" prosody situations and in preventing false alarms

Conclusions and Future Work
- Results are interesting and suggestive, but as yet inconclusive
- SD prosody modeling significantly benefits some speakers, but predicting who they will be is still an open question
- Many issues still to address, especially joint modeling with lexical features and a better integration approach
- Approach interesting to explore for other domains such as broadcast news, where segmentation is important and some speakers occur repeatedly