Spoken Dialogue Systems

Julia Hirschberg
CS 4706

Issues
Error avoidance
Error detection
- From the system side: how likely is it that the system made an error?
- From the user side: what cues does the user provide to indicate an error?
Error handling: what can the system do when it thinks an error has occurred?
Evaluation: how do you know what needs fixing most?

Avoiding Misunderstandings
By imitating human performance
Timing and grounding (Clark '03)

Recognizing Problematic Dialogues
Hastie et al., "What's the Trouble?", ACL 2002

Recognizing Problematic Utterances (Hirschberg et al. '99 onward)
Collect a corpus from an interactive voice response system
Identify speaker 'turns' that are:
- incorrectly recognized
- where speakers first become aware of an error
- corrections of misrecognitions
Identify prosodic features of turns in each category and compare them to other turns
Use machine learning techniques to train a classifier to make these distinctions automatically
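
The original experiments used a rule learner (RIPPER); the following is only a rough sketch of the same setup using an off-the-shelf interpretable classifier, with invented per-turn prosodic feature values standing in for the real corpus:

```python
# Sketch only: predicting misrecognized turns from prosodic features.
# Feature names and data values are illustrative placeholders, not the
# authors' corpus or code.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-turn features: f0 max (Hz), f0 mean (Hz), RMS max,
# turn duration (s), preceding pause (s), speaking rate (syll/s).
X = np.array([
    [310.0, 220.5, 0.71, 2.4, 0.30, 4.1],
    [255.0, 180.2, 0.52, 1.1, 0.85, 3.2],
    [402.0, 260.9, 0.88, 3.6, 0.10, 5.0],
    [230.0, 170.4, 0.47, 0.9, 1.20, 2.8],
])
y = np.array([1, 0, 1, 0])  # 1 = misrecognized turn, 0 = correctly recognized

# A shallow decision tree is a comparably interpretable stand-in for a
# rule learner like RIPPER.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=2)
print("cross-validated accuracy:", scores.mean())
```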

Turn Types
Here are examples of the three turn types we focus on:
TOOT: Hi. This is the AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site and correction]

Results
Reduced error in predicting misrecognized turns to 8.64%
Error in predicting 'awares': 12%
Error in predicting corrections: 18-21%

Evidence from Human Performance
Users provide explicit positive and negative feedback
Corpus-based vs. laboratory experiments: do these tell us different things? (Bell & Gustafson '00)
- What do we learn from this? What functions does feedback serve?
Krahmer et al.: 'go on' and 'go back' signals in grounding situations (implicit/explicit verification)
Signalling whether information is grounded or not (Clark & Wilkes-Gibbs '86, Clark & Schaefer '89): presentation/acceptance

Notes: 120 dialogues for Dutch train information, one version using explicit verification and one implicit; 20 users given 3 tasks; 443 verification question/answer pairs analyzed. Predicted that responses to correct verifications would be shorter, with unmarked word order, not repeating or correcting information but presenting new information (positive cues) -- the principle of least effort. Findings: where there are problems, subjects use more words (or say nothing), use marked word order (especially after implicit verifications), and produce more disconfirmations, with more repeated and corrected information. Machine learning experiments (memory-based learning) show 97% correct prediction from these features (>8 words or marked word order or corrects info -> 92%). Krahmer et al. '99b predicted additional prosodic cues for negative signals: high boundary tone, high pitch range, long duration of 'nee' and of the entire utterance, long pause after 'nee', long delay before 'no'; based on 109 negative answers to yes/no questions from 7 speakers.
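
As a toy illustration of that last decision rule (my sketch, not code from Krahmer et al.; the feature extraction is assumed to happen upstream):

```python
# Flag a user response to a verification question as signalling a
# problem if it is long, uses marked word order, or corrects previously
# given information -- the three high-coverage features reported above.
def signals_problem(n_words: int, marked_word_order: bool,
                    corrects_info: bool) -> bool:
    """Return True if the response likely disconfirms the verification."""
    return n_words > 8 or marked_word_order or corrects_info

# Usage: a short unmarked confirmation vs. a long correction.
print(signals_problem(1, False, False))   # "Yes."           -> False
print(signals_problem(11, True, True))    # long correction  -> True
```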

Hypotheses Supported, but…
Positive cues: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
Negative cues: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
But:
- Can these cues be identified automatically?
- How might they affect the design of SDS?

Error Handling Strategies
Goldberg et al. '03: how should systems best inform the user that they don't understand?
- System rephrasing vs. repetition vs. a statement of not understanding
- Apologies
What behaviors might these produce?
- Hyperarticulation
- User frustration
- User repetition or rephrasing
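
To make the design space concrete, here is a hedged sketch of one way an SDS might move through these strategies as consecutive recognition failures accumulate; the escalation order is an illustrative assumption, not a recommendation from Goldberg et al.:

```python
# Hypothetical escalation policy over the strategies listed above:
# repeat first, then rephrase, then apologize and state non-understanding.
def reprompt(failures: int, original: str, rephrased: str) -> str:
    if failures == 1:
        return original                          # simple repetition
    if failures == 2:
        return rephrased                         # system rephrasing
    # statement of not understanding, plus an apology
    return "I'm sorry, I'm having trouble understanding. " + rephrased

print(reprompt(1, "Which city?", "What city are you leaving from?"))
print(reprompt(3, "Which city?", "What city are you leaving from?"))
```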

What lessons do we learn?
- What produces the least frustration?
- The best-recognized input?

Evaluating Dialogue Systems
PARADISE framework (Walker et al. '00)
"Performance" of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished:
- Maximize Task Success
- Minimize Costs: Efficiency Measures, Qualitative Measures

Task Success
Task goals seen as an Attribute-Value Matrix (AVM)
ELVIS e-mail retrieval task (Walker et al. '97): "Find the time and place of your meeting with Kim."

Attribute             Value
Selection Criterion   Kim or Meeting
Time                  10:30 a.m.
Place                 2D516

Task success defined by the match between the AVM values at the end of the dialogue and the "true" values for the AVM
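
Concretely, the scoring step might look like the following sketch (my illustration of the AVM-matching idea; PARADISE proper corrects agreement for chance using the kappa statistic, while plain proportion matched is shown here):

```python
# The "true" AVM comes from the task scenario; the observed AVM from the
# dialogue log. Values below mirror the ELVIS example on the slide.
true_avm = {"Selection Criterion": "Kim or Meeting",
            "Time": "10:30 a.m.",
            "Place": "2D516"}
observed_avm = {"Selection Criterion": "Kim or Meeting",
                "Time": "10:30 a.m.",
                "Place": "2D519"}  # hypothetical: wrong room extracted

matched = sum(true_avm[a] == observed_avm.get(a) for a in true_avm)
task_success = matched / len(true_avm)
print(f"task success: {task_success:.2f}")  # 0.67
```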

Metrics
Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
Quality of the Interaction: ASR Rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
User Satisfaction
Task Success: perceived completion, information extracted

Experimental Procedures
Subjects given specified tasks
Spoken dialogues recorded
Cost factors, states, and dialog acts automatically logged; ASR accuracy and barge-in hand-labeled
Users specify task solution via web page
Users complete User Satisfaction surveys
Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors (see the sketch below)
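
A minimal sketch of that final modeling step, assuming fabricated per-dialogue measures; PARADISE normalizes each predictor to a Z-score before fitting, as done here:

```python
# Fit a multiple linear regression predicting User Satisfaction from
# task success and cost measures. All data values are invented.
import numpy as np
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression

# Columns: COMP (perceived completion), MRS (mean recognition score),
# ET (elapsed time, s), for a handful of hypothetical dialogues.
X = np.array([[1, 0.92, 210],
              [1, 0.85, 340],
              [0, 0.60, 510],
              [1, 0.88, 260],
              [0, 0.55, 620]], dtype=float)
user_sat = np.array([33.0, 29.0, 18.0, 31.0, 14.0])  # survey sums

model = LinearRegression().fit(zscore(X, axis=0), user_sat)
print("coefficients (COMP, MRS, ET):", model.coef_)
```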

User Satisfaction: Sum of Many Measures
Was Annie easy to understand in this conversation? (TTS Performance)
In this conversation, did Annie understand what you said? (ASR Performance)
In this conversation, was it easy to find the message you wanted? (Task Ease)
Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
Did Annie work the way you expected her to in this conversation? (Expected Behavior)
From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)

Performance Functions from Three Systems
ELVIS: User Sat. = .21*COMP + .47*MRS - .15*ET
TOOT:  User Sat. = .35*COMP + .45*MRS - .14*ET
ANNIE: User Sat. = .33*COMP + .25*MRS + .33*Help
COMP: user perception of task completion (task success)
MRS: mean recognition accuracy (cost)
ET: elapsed time (cost)
Help: help requests (cost)
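
Once fitted, such a function can score new dialogues. For example, applying ELVIS's coefficients to hypothetical Z-score-normalized measures:

```python
# Apply the ELVIS performance function from the slide above.
def elvis_user_sat(comp: float, mrs: float, et: float) -> float:
    # Inputs are assumed to be Z-score-normalized, as in PARADISE.
    return 0.21 * comp + 0.47 * mrs - 0.15 * et

# A dialogue with completed task, good recognition, short elapsed time.
print(elvis_user_sat(0.8, 1.1, -0.3))
```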

Performance Model
Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction
Performance model useful for system development:
- Making predictions about system modifications
- Distinguishing 'good' dialogues from 'bad' dialogues
But can we also tell on-line when a dialogue is 'going wrong'?

Next Week
Speech summarization and data mining