Download presentation
Presentation is loading. Please wait.
1
Spoken Dialogue Systems
Julia Hirschberg CS 4706 9/22/2018
2
Issues Error avoidance Error detection
From the system side: how likely is it the system made an error? From the user side: what cues does the user provide to indicate an error? Error handling: what can the system do when it thinks an error has occurred? Evaluation: how do you know what needs fixing most? 9/22/2018
3
Avoiding misunderstandings
By imitating human performance Timing and grounding (Clark ’03) 9/22/2018
4
Recognizing Problematic Dialogues
Hastie et al, “What’s the Trouble?” ACL 2002. 9/22/2018
5
Recognizing Problematic Utterances (Hirschberg et al ’99--)
Collect corpus from interactive voice response system Identify speaker ‘turns’ incorrectly recognized where speakers first aware of error that correct misrecognitions Identify prosodic features of turns in each category and compare to other turns Use Machine Learning techniques to train a classifier to make these distinctions automatically 9/22/2018
6
Turn Types TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you? User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. TOOT: Which city do you want to go to? User: New York. misrecognition Here are examples of the 3 turn types we focus on. correction aware site 9/22/2018
7
Results Reduced error in predicting misrecognized turns to 8.64%
Error in predicting ‘awares’ (12%) Error in predicting corrections (18-21%) 9/22/2018
8
Evidence from Human Performance
Users provide explicit positive and negative feedback Corpus-based vs. laboratory experiments – do these tell us different things? Bell & Gustafson ’00 What do we learn from this? What functions does feedback serve? Krahmer et al ‘go on’ and ‘go back’ signals in grounding situations (implicit/explicit verification) Signalling whether information is grounded or not (Clark & Wilkes-Gibbs ‘86, Clark & Schaeffer ‘89): presentation/acceptance 120 dialogue for Dutch train info; one version uses explicit verification and oneimplicit; 20 users given 3 tasks; analyzed 443 verification q/a pairs predicted that responses to correct verifications would be shorter, with unmarked word order, not repeating or correcting information but presenting new information (positive cues) -- principle of least effort findings: where problems, subjects use more words (or say nothing), use marked word order (especially after implicit verifs), contain more disconfirmations (duh), with more repeated and corrected info ML experiments (memory based learning) show 97% correct prediction from these features (>8 words or marked word order or corrects info -> 92%) Krahmer et al ‘99b predicted additional prosodic cues for neg signals: high boundary tone, high pitch range, long duration of ‘nee’ and entire utterance, long pause after ‘nee’, long delay before ‘no’, from 109 negative answers to ynqs of 7 speakers; hyp 9/22/2018
9
Hypotheses supported but…
Pos: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info Neg: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info Hypotheses supported but… Can these cues be identified automatically? How might they affect the design of SDS? 9/22/2018
10
Error Handling Strategies
Goldberg et al ’03: how should systems best inform the user that they don’t understand? System rephrasing vs. repetitions vs. statement of not understanding Apologies What behaviors might these produce? Hyperarticulation User frustration User repetition or *rephrasing 9/22/2018
11
What lessons do we learn? What produces least frustration?
Best recognized input? 9/22/2018
12
Evaluating Dialogue Systems
PARADISE framework (Walker et al ’00) “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and how it gets accomplished Maximize Task Success Minimize Costs Efficiency Measures Qualitative Measures 9/22/2018
13
Task Success Task goals seen as Attribute-Value Matrix
ELVIS retrieval task (Walker et al ‘97) “Find the time and place of your meeting with Kim.” Attribute Value Selection Criterion Kim or Meeting Time 10:30 a.m. Place 2D516 Task success defined by match between AVM values at end of with “true” values for AVM 9/22/2018
14
Metrics Efficiency of the Interaction:User Turns, System Turns, Elapsed Time Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests User Satisfaction Task Success: perceived completion, information extracted 9/22/2018
15
Experimental Procedures
Subjects given specified tasks Spoken dialogues recorded Cost factors, states, dialog acts automatically logged; ASR accuracy,barge-in hand-labeled Users specify task solution via web page Users complete User Satisfaction surveys Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors 9/22/2018
16
User Satisfaction: Sum of Many Measures
Was Annie easy to understand in this conversation? (TTS Performance) In this conversation, did Annie understand what you said? (ASR Performance) In this conversation, was it easy to find the message you wanted? (Task Ease) Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace) In this conversation, did you know what you could say at each point of the dialog? (User Expertise) How often was Annie sluggish and slow to reply to you in this conversation? (System Response) Did Annie work the way you expected her to in this conversation? (Expected Behavior) From your current experience with using Annie to get your , do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use) 9/22/2018
17
Performance Functions from Three Systems
ELVIS User Sat.= .21* COMP * MRS * ET TOOT User Sat.= .35* COMP + .45* MRS - .14*ET ANNIE User Sat.= .33*COMP + .25* MRS +.33* Help COMP: User perception of task completion (task success) MRS: Mean recognition accuracy (cost) ET: Elapsed time (cost) Help: Help requests (cost) 9/22/2018
18
Performance Model Perceived task completion and mean recognition score are consistently significant predictors of User Satisfaction Performance model useful for system development Making predictions about system modifications Distinguishing ‘good’ dialogues from ‘bad’ dialogues But can we also tell on-line when a dialogue is ‘going wrong’ 9/22/2018
19
Next Week Speech summarization and data mining 9/22/2018
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.