Harnessing Speech Prosody for Human-Computer Interaction

Elizabeth Shriberg and Andreas Stolcke
Speech Technology and Research Laboratory, SRI International, Menlo Park, CA
International Computer Science Institute, Berkeley, CA
NASA IS-HCC Meeting, Feb.

Collaborators
- Lee Stone (NASA); Beth Ann Hockey, John Dowding, Jim Hieronymus (RIACS)
- Luciana Ferrer (SRI postdoc)
- Harry Bratt, Kemal Sonmez (SRI)
- Jeremy Ang (ICSI/UC Berkeley)
- Emotion labelers: Raj Dhillon, Ashley Krupski, Kai Filion, Mercedes Carter, Kattya Baltodano

Introduction
- Prosody = the melody, rhythm, and "tone" of speech
- Not what words are said, but how they are said
- Human languages use prosody to convey:
  - phrasing and structure (e.g., sentence boundaries)
  - disfluencies (e.g., false starts, repairs, fillers)
  - sentence mode (statement vs. question)
  - emotional attitudes (urgency, surprise, anger)
- Currently largely unused in speech systems

Talk Outline
- Project goal and impact for NASA
- Sample research tasks:
  - Task 1: Endpointing
  - Task 2: Emotion classification
- General method:
  - language model, prosodic model, combination
  - data and annotations
- Results:
  - Endpointing: error trade-offs and user waiting time
  - Emotion: error trade-offs and class definition effects
- Conclusions and future directions

Project Goal
- Most dialog systems don't use prosody in their input; they view speech simply as "noisy" text.
- Our goal: add prosodic information to the system input.

[Diagram: Speech → ASR → Text → Dialog System → Text → Synthesis → Speech, with prosody feeding directly into the dialog system alongside the text]

Today: Two Sample Tasks
- Task 1: Endpointing (detecting end of input)
  - current ASR systems rely on pause duration
  - "measure temperature at... cargo bay..."
  - causes premature cut-off during hesitations
  - wastes time waiting after actual boundaries
- Task 2: Emotion detection
  - word transcripts don't indicate user state
  - "measure the -- STOP!! GO BACK!!"
  - alert the computer to immediately change course
  - alert other humans to danger, fatigue, etc.

Other Tasks in Project
- Automatic sentence punctuation:
  - "Don't go to flight deck!" vs. "Don't! Go to flight deck!" (DO go to flight deck)
- Detection of utterance mode:
  - Computer: "Confirm opening of hatch number 2"
  - Human: "Number 2." / "Number 2?" (confirmation or question?)
- Detection of disfluencies:
  - "Item three one five one two" (item or 512?)

Method: Prosodic Modeling
1. Pitch is extracted from the acoustic signal.
2. A speech recognizer identifies phones, words, and their durations.
3. Pitch and duration information is combined to compute distinctive prosodic features (e.g., was there a pitch fall/rise in the last word?).
4. Decision trees are trained to detect the desired events from these features.
5. A separate test set is used to evaluate classifier performance.

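A minimal sketch of this pipeline in Python, assuming the prosodic features have already been extracted into a table; the feature layout, the random stand-in data, and the use of scikit-learn are illustrative assumptions, not the original SRI implementation:

    # Minimal sketch: train a decision tree to detect an event (e.g., an
    # endpoint) from precomputed prosodic features.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Each row: prosodic features at one word boundary, e.g.
    # [pitch fall/rise in last word, log(F0/baseline), normalized vowel duration]
    X = np.random.rand(1000, 3)        # stand-in for real extracted features
    y = np.random.randint(0, 2, 1000)  # 1 = event (endpoint), 0 = no event

    # Hold out a separate test set to evaluate classifier performance
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
    p_event = tree.predict_proba(X_test)[:, 1]  # P(event | prosodic features)
    print("held-out accuracy:", tree.score(X_test, y_test))
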
Method: Language Models
- Words can also predict events of interest, using N-gram language models.
- Endpointing: predict endpoint probability from the last two words:
  P(endpoint | word_{-1}, word_{-2})
- Emotion detection: predict from all words in the sentence:
  P(word_1, word_2, ..., word_n | emotion)
- If P > threshold, the system detects the event.
- Prosodic classifier and LM predictions can be combined for better results (multiply the predictions).

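A sketch of the LM side and the combination step, under simple assumptions: a bigram endpoint model estimated from counts with add-0.5 smoothing. Function names, smoothing, and the threshold are hypothetical choices for illustration:

    # Sketch: estimate P(endpoint | last two words) from counts, then
    # combine with the prosodic classifier by multiplying predictions.
    from collections import Counter

    context_counts = Counter()   # times a word pair (w_-2, w_-1) was seen
    endpoint_counts = Counter()  # times that pair was followed by an endpoint

    def train(utterances):
        """utterances: lists of words; a true endpoint follows each last word."""
        for words in utterances:
            for i in range(2, len(words) + 1):
                ctx = (words[i - 2], words[i - 1])
                context_counts[ctx] += 1
                if i == len(words):       # endpoint after the final word
                    endpoint_counts[ctx] += 1

    def p_endpoint_lm(w_prev2, w_prev1, smooth=0.5):
        ctx = (w_prev2, w_prev1)
        return (endpoint_counts[ctx] + smooth) / (context_counts[ctx] + 2 * smooth)

    def detect_event(p_lm, p_prosody, threshold=0.5):
        return p_lm * p_prosody > threshold  # combined score vs. threshold
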
Task 1: Endpointing in ATIS
- ATIS (Air Travel Information System) = dialog task defined by DARPA to drive research in spoken dialog systems
- Users talk to a (simulated) air travel system
- Endpointing simulated "after the fact"
- About 18,000 utterances, ~10 words/utterance
- Test set of 1,974 unseen utterances
- 5.9% word error rate on the test set

Endpointing Algorithms
- Baseline algorithm:
  - Pick a pause duration threshold for the decision
  - Detect an endpoint when pause duration > threshold
- Endpointing with prosody and/or LM (sketched in code below):
  - Pick a probability threshold for the decision
  - Train separate classifiers for pause values > 0.03, 0.06, 0.09, 0.12, 0.25, 0.50, 0.80 seconds
  - For each pause threshold:
    - Detect an endpoint if the classifier predicts probability > threshold
    - Otherwise wait until the next higher pause threshold is reached
  - Detect an endpoint when the pause exceeds 1 s

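A sketch of that decision rule; `classifiers` is a hypothetical mapping from pause thresholds to trained models with a scikit-learn-style `predict_proba` interface:

    # Sketch of the prosody/LM endpointing rule above: as the pause grows
    # past each threshold, consult the classifier trained for that pause
    # length; if none fires, declare an endpoint at 1 s of silence.
    PAUSE_THRESHOLDS = [0.03, 0.06, 0.09, 0.12, 0.25, 0.50, 0.80]  # seconds

    def endpoint_delay(classifiers, features, prob_threshold):
        """Return the pause duration at which an endpoint is declared.

        classifiers: dict mapping each pause threshold to a model trained
        for pauses exceeding that threshold (assumed interface)."""
        for t in PAUSE_THRESHOLDS:
            if classifiers[t].predict_proba([features])[0][1] > prob_threshold:
                return t   # classifier predicts an endpoint at this pause
        return 1.0         # fallback: endpoint once the pause exceeds 1 s
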
Endpointing Metrics
- Performance metrics:
  - False alarms: system detects a false endpoint
  - Misses: system fails to detect a true endpoint
  - Recall = % of true endpoints detected = 1 - miss rate
- Error trade-off:
  - The system can be set more or less "trigger-happy": fewer false negatives means more false positives, and vice versa
  - Equal error rate (EER): error rate at which false alarms = misses
  - ROC curve: graphs recall vs. false alarms

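These quantities can be computed directly from classifier scores by sweeping the decision threshold; a small illustrative sketch (not the evaluation code used in the project):

    # Sketch: sweep the decision threshold over classifier scores to trace
    # the miss / false-alarm trade-off, and return the equal error rate.
    import numpy as np

    def equal_error_rate(scores, labels):
        """scores: predicted endpoint probabilities; labels: 1 = true endpoint."""
        scores, labels = np.asarray(scores), np.asarray(labels)
        best_gap, eer = 1.0, None
        for t in np.unique(scores):
            miss = np.mean(scores[labels == 1] < t)   # missed true endpoints
            fa = np.mean(scores[labels == 0] >= t)    # false alarms
            if abs(miss - fa) < best_gap:
                best_gap, eer = abs(miss - fa), (miss + fa) / 2
        return eer
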
ATIS Endpointing Results
- Endpointer used automatic recognition output (5.9% WER; note: the LM degrades with WER).
- Equal error rates:
    Baseline: pause threshold      8.9%
    Prosodic decision tree only    6.7%
    Language model only            6.0%
    Prosody + LM combined          5.3%
- Prosody alone beats the baseline
- Combined classifier is better than the LM alone

ROC for Endpointing in ATIS
[Figure: ROC curves for the ATIS endpointing models]

ATIS Examples
- "Do you have a flight between Philadelphia and San Francisco?"
  - The baseline makes false endpoints at the hesitation pauses (and so would cut the speaker off prematurely)
  - The prosody model waits, despite the pause, because the pitch doesn't move much and stays high (hesitation)
- "I would like to find the cheapest one-way fare from Philadelphia to Denver."
  - Prosody mistakenly predicts an endpoint (question-like "?" rise)
  - The combined prosody + LM endpointer avoids the false endpoint (it is rare to end an utterance on "cheapest")

Prosodic Cues for Endpointing
- Pitch range:
  - how close the speaker is to his/her estimated F0 "baseline" or "topline" (log-ratio of the fitted F0 in the previous word to that measure)
  - baseline/topline estimated by an LTM model of the speaker's pitch
- Phone and syllable durations:
  - whether the last vowel or syllable rhyme is extended
  - normalized for both the segmental content (intrinsic duration) and the speaker

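A minimal sketch of these two cue families, assuming the speaker's F0 baseline/topline and the per-phone intrinsic-duration statistics have been estimated elsewhere (e.g., by the LTM pitch fit); function names and the exact normalization scheme are assumptions:

    # Sketch of the two endpointing cue families described above.
    import math

    def pitch_range_feature(f0_prev_word, f0_baseline):
        """Log-ratio of fitted F0 in the previous word to the speaker's
        estimated F0 baseline; values near 0 mean the speaker is close
        to the baseline (a cue that the utterance is finished)."""
        return math.log(f0_prev_word / f0_baseline)

    def normalized_final_duration(dur, phone_mean, phone_std, speaker_rate):
        """Final vowel/rhyme duration, normalized for segmental content
        (intrinsic duration) and for the speaker's overall rate."""
        return (dur / speaker_rate - phone_mean) / phone_std
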
Endpointing Speed-Up
- User waiting time (UWT) = average pause delay needed for the system to detect true endpoints
- In addition to preventing false alarms, prosody reduces UWT at any given false alarm rate:
  [Table: UWT for Baseline, Tree only, and Tree + LM at false alarm rates of 2%, 4%, 6%, 8%, 10%]
- Result: zippier interaction with the system

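UWT itself is just the mean detection delay over true endpoints at a fixed operating point; a one-line sketch, reusing the hypothetical `endpoint_delay` function from the earlier sketch:

    # Sketch: user waiting time = mean pause delay over correctly detected
    # endpoints, at a fixed false-alarm operating point.
    def user_waiting_time(delays):
        """delays: detection delay in seconds for each true endpoint,
        e.g., collected by running endpoint_delay() over a test set."""
        return sum(delays) / len(delays)
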
Endpointing in a NASA Domain
- Personal Satellite Assistant (PSA): dialog system controlling a (simulated) on-board robot
  - developed at NASA Ames/RIACS
  - data courtesy of Beth Ann Hockey
- Endpointer trained on ATIS, tested on 3,200 utterances recorded at RIACS
- Used transcribed words
- "Blind test": no training on PSA data!

Endpointing in PSA Data
- The ATIS language model is not applicable, so it was not used for endpointing
- The PSA data had no utterance-internal pauses:
  - baseline and prosodic model had the same EER = 3.1% (no opportunity for false alarms)
- However, prosody still saves time. UWT (in seconds) at a 2% false positive rate:
    Baseline         0.170
    Prosodic tree    0.135
- The prosodic model is portable to new domains!

PSA Example
- "Move to commander's seat and measure radiation"
- Baseline and prosody systems were both configured (via the decision threshold) for a 2% false alarm rate
- As noted earlier, there are no error differences for this corpus
- But the baseline system takes 0.17 s to endpoint after the last word; the prosody system takes only 0.04 s!

Task 2: Emotion Detection
- Issue of data: we used a corpus of human-computer telephone dialogs labeled for emotion for a DARPA project
  - would like more realistic data, with fear, etc.
  - in the DARPA data, the main emotion is frustration
- Each dialog was labeled by 2+ people independently
- A second "consensus" pass resolved all disagreements, performed by two of the same labelers

Labeled Classes
- Emotion: neutral, annoyed, frustrated, tired/disappointed, amused/surprised
- Speaking style: hyperarticulation, perceived pausing between words or syllables, raised voice
- Repeats and corrections: repeat/rephrase, repeat/rephrase with correction, correction only
- Miscellaneous useful events: self-talk, noise, non-native speaker, speaker switches, etc.

Emotion Samples
    Neutral               "July 30", "Yes"
    Disappointed/tired    "No"
    Amused/surprised      "No"
    Annoyed               "Yes", "Late morning" (HYP)
    Frustrated            "Yes", "No", "No, I am …" (HYP), "There is no Manila"
(HYP = hyperarticulated)

Results: Annoyed/Frustrated vs. All Others
[Figure: classification results, annoyed/frustrated vs. all other classes]

Results (cont.)
- Human-human (H-H) labels agree 72% of the time; this is a complex decision task:
  - inherent continuum
  - speaker differences
  - relative vs. absolute judgements
- Human labels agree 84% with the "consensus" labels (a biased comparison)
- The tree model agrees 76% with consensus -- better than the original labelers agree with each other
- The prosodic model makes use of a dialog state feature, but even without it, it still beats H-H agreement
- Language model features alone are not good predictors (the dialog feature alone is better)

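The agreement figures above are presumably simple percent agreement between two label sequences; a minimal sketch of that computation (an assumption about the exact metric, shown for concreteness):

    # Sketch: percent agreement between two label sequences, as used for
    # the labeler-labeler and model-consensus comparisons above.
    def percent_agreement(labels_a, labels_b):
        matches = sum(a == b for a, b in zip(labels_a, labels_b))
        return 100.0 * matches / len(labels_a)
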
Baseline Prosodic Tree
(classes: N = neutral, AF = annoyed/frustrated; features include duration, pitch, and other types; some numeric split thresholds omitted, shown as "…")

    REPCO in {ec2, rr1, rr2, rex2, inc, ec1, rex1}: AF
    |  MAXF0_IN_MAXV_N < …: N
    |  MAXF0_IN_MAXV_N >= …: AF
    |  |  MAXPHDUR_N < …: AF
    |  |  |  UTTPOS < 5.5: N
    |  |  |  UTTPOS >= 5.5: AF
    |  |  MAXPHDUR_N >= …: AF
    REPCO in {0}: N
    |  UTTPOS < 7.5: N
    |  UTTPOS >= 7.5: N
    |  |  VOWELDUR_DNORM_E_5 < …: N
    |  |  |  MINF0TIME < 0.875: N
    |  |  |  MINF0TIME >= 0.875: AF
    |  |  |  |  SYLRATE < …: AF
    |  |  |  |  |  MAXF0_TOPLN < …: N
    |  |  |  |  |  MAXF0_TOPLN >= …: AF
    |  |  |  |  SYLRATE >= …: N
    |  |  VOWELDUR_DNORM_E_5 >= …: AF
    |  |  |  MAXPHDUR_N < …: N
    |  |  |  |  MINF0TIME < 0.435: N
    |  |  |  |  MINF0TIME >= 0.435: N
    |  |  |  |  |  RISERATIO_DNORM_E_5 < …: N
    |  |  |  |  |  RISERATIO_DNORM_E_5 >= …: AF
    |  |  |  MAXPHDUR_N >= …: AF

Predictors of Annoyed/Frustrated
- Prosodic pitch features:
  - high maximum fitted F0 in the longest normalized vowel
  - high ratio of F0 rises to falls, speaker-normalized over the first 5 utterances
  - maximum F0 close to the speaker's estimated F0 "topline"
  - minimum fitted F0 occurs late in the utterance (no question-like "?" intonation)
- Prosodic duration and speaking-rate features:
  - long maximum phone-normalized phone duration
  - long maximum vowel duration, phone- and speaker-normalized (first 5 utterances)
  - low syllable rate (slower speech)
- Other:
  - utterance is a repeat, rephrase, or explicit correction
  - utterance occurs after the 5th-7th utterance in the dialog

Effect of Class Definition
- For less ambiguous or more extreme tokens, performance is significantly better than our baseline.

Error Trade-offs (ROC)
[Figure: ROC curves for emotion detection]

Results Summary
- Prosody allows significantly more accurate (fewer false cut-offs) and faster endpointing of spoken input to dialog systems.
- The prosodic endpointer is portable to new applications. (Note: the language model is not!)
- Prosody significantly improves detection of frustration over a (cheating) language model.
- Prosody adds further value when combined with lexical information, regardless of which model is better on its own.

Impact and Future Work
- Prosody enables more accurate spoken language processing by capturing information "beyond the words".
- Prosody creates new capabilities for systems (e.g., emotion detection).
- Prosody can speed up HCI (e.g., endpointing).
- Prosody presents potential for fusion with other communication modalities, such as vision.

Thank You