Building & Evaluating Spoken Dialogue Systems
Discourse & Dialogue
CS 359
November 27, 2001
Agenda
How to get started
–System bootstrapping
–“Wizard-of-Oz” design: strengths & limitations
How to tell if you succeeded
–System evaluation: what you do & how you do it
–Performance = Task success - Task cost
System Bootstrapping
Question: How should we design a system?
–What should it be able to understand?
–Key: How would people talk to it?
Suggestion 1: Like people talk to each other?
–Collect human-human interactions on the same task
–But computers are NOT like people, and people act differently toward them
–Politeness, assumed knowledge, style, complexity
–Speakers adapt to the needs of the hearer
–Balance the need for understanding against reduced effort
“Wizard-of-Oz” Studies
Suggestion 2: Like people talk to a computer!
–Get application/domain-specific language
But the system is NOT built yet!
–Simulate the system, mediated through a human wizard
–Fast, rigid/consistent, no small errors/typos
–Structured simulations: automate as much as possible
–E.g. a response editor: hierarchical menus/templates, access to different apps, a query creator, time-stamped logging (see the sketch below)
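To make the "structured simulation" idea concrete, here is a minimal Python sketch of a wizard's response editor. Everything in it (the WizardConsole class, the flight-information templates, the JSON log format) is invented for illustration; the slide only calls for canned, consistently worded responses and time-stamped logging.

```python
# Hypothetical sketch of a wizard's response editor: the wizard picks a canned
# template instead of typing free text, and every turn is logged with a timestamp.
import json
import time

# Assumed template inventory for an imaginary flight-information domain.
TEMPLATES = {
    "greet": "Welcome to flight information. How can I help you?",
    "ask_city": "Which city are you flying to?",
    "confirm": "You want a flight to {city} on {date}, is that correct?",
    "not_understood": "Sorry, I did not understand. Please rephrase.",
}

class WizardConsole:
    def __init__(self, log_path="woz_session.log"):
        self.log_path = log_path

    def log(self, speaker, text):
        """Append a time-stamped turn to the session log."""
        entry = {"time": time.time(), "speaker": speaker, "text": text}
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def respond(self, template_id, **slots):
        """Send a canned, consistently worded response (no typos, fixed style)."""
        text = TEMPLATES[template_id].format(**slots)
        self.log("system", text)
        return text

# Example turn: the wizard logs the user's utterance, then picks a template.
wizard = WizardConsole()
wizard.log("user", "I need to get to Boston tomorrow")
print(wizard.respond("confirm", city="Boston", date="tomorrow"))
```

Restricting the wizard to templates is what keeps responses fast and uniform across subjects, which is the point of the structured setup.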
Good Wizard Studies
Requirements:
–Background system: fully implemented or simulated; allows some user initiative
–Task: a somewhat open “scenario”; not too complex or private
–Must be piloted: both the task scenario and the simulation
Comparing Styles
Human-human versus human-computer
–H-H: more complex; H-C: simpler structure
–Domain variability greater than individual variability
–Vocabulary choice
–Use of anaphora
Question: Should you lie to the user?
–Only way to get realistic behavior
–Debrief afterwards: explain the protocol, offer to destroy the data
System Evaluation
Question: Which design is better?
Approach 1: Content-based measures
–Task completion
–Concept accuracy
–Reference answer: query result versus key (see the sketch below)
Limited: evaluates only one strategy
–Many alternatives exist
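A minimal sketch of the content-based measures, assuming the task is captured as attribute-value pairs. The attribute names and values are hypothetical; the slide only names task completion, concept accuracy, and comparison of the query result against a key.

```python
# Illustrative sketch of content-based evaluation: compare the system's
# interpretation against a hand-labelled reference key.

def concept_accuracy(hypothesis: dict, key: dict) -> float:
    """Fraction of reference concepts (attribute-value pairs) the system got right."""
    correct = sum(1 for attr, value in key.items() if hypothesis.get(attr) == value)
    return correct / len(key)

def task_completed(hypothesis: dict, key: dict) -> bool:
    """Binary task completion: every attribute in the key matches exactly."""
    return all(hypothesis.get(attr) == value for attr, value in key.items())

key = {"dest": "Boston", "date": "11/28", "class": "economy"}
hyp = {"dest": "Boston", "date": "11/27", "class": "economy"}

print(concept_accuracy(hyp, key))  # 0.666... : 2 of 3 concepts correct
print(task_completed(hyp, key))    # False    : the date was misrecognized
```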
System Evaluation (cont’d)
Not just accuracy, but efficiency
Approach 2: Cost-based measures (see the sketch below)
–Time to completion: # of utterances, # of turns, duration in seconds
–Error measures: # of corrections, # of repetitions
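A small sketch of how the cost-based measures might be computed from a logged dialogue. The turn format and the "correction" flag are assumptions for illustration, not something the slides specify.

```python
# Illustrative sketch of cost-based measures over a logged dialogue.
turns = [
    {"speaker": "user",   "text": "flight to boston",          "start": 0.0, "end": 2.1},
    {"speaker": "system", "text": "To Austin, is that right?",  "start": 2.5, "end": 4.0},
    {"speaker": "user",   "text": "no, I said Boston",          "start": 4.4, "end": 6.0,
     "correction": True},
    {"speaker": "system", "text": "Boston, when?",              "start": 6.3, "end": 7.2},
]

num_utterances  = len(turns)
num_user_turns  = sum(1 for t in turns if t["speaker"] == "user")
duration_sec    = turns[-1]["end"] - turns[0]["start"]
num_corrections = sum(1 for t in turns if t.get("correction"))

print(num_utterances, num_user_turns, duration_sec, num_corrections)
```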
Combining Measures
Issues:
–Generalization: factors affecting performance
–Sub-dialogues: not just the WHOLE task
PARADISE:
–Separate what the agent does from how it does it
–Performance = task success & dialogue costs (see the formula below)
–Performance => Usability => User satisfaction
–Task success: operationalized as the K (kappa) coefficient
–Costs: efficiency and qualitative measures
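For reference, the PARADISE performance function is usually written as a weighted combination of normalized task success and normalized costs, roughly:

```latex
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
```

where kappa is the task-success coefficient, the c_i are the cost measures, N(.) is z-score normalization, and alpha and the w_i are weights estimated by regression (see the later slides).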
Measuring Task Success
AVM: Attribute-Value Matrix
–Captures the info to be exchanged between user & system
–“Key”: the AVM instantiation for the scenario
K coefficient calculated from a confusion matrix (see the sketch below)
–On-diagonal: matches the key; off-diagonal: misunderstood
–K = (P(A) - P(E)) / (1 - P(E))
–P(A): proportion of actual agreement; P(E): proportion of agreement expected by chance
–I.e., actual agreement corrected for chance agreement
Pros: corrects for chance; allows comparison across tasks
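A minimal computational sketch of the kappa calculation, assuming a toy confusion matrix over one AVM attribute. Here P(E) is estimated from the row and column marginals, one common way to compute chance agreement; the numbers are invented.

```python
# Kappa from a confusion matrix: rows = value in the key,
# columns = value the system understood.
import numpy as np

confusion = np.array([
    [20,  2,  1],
    [ 3, 15,  2],
    [ 1,  1, 18],
])

total = confusion.sum()
p_agree = np.trace(confusion) / total                # P(A): on-diagonal proportion
row_marginals = confusion.sum(axis=1) / total
col_marginals = confusion.sum(axis=0) / total
p_expected = np.sum(row_marginals * col_marginals)   # P(E): agreement expected by chance

kappa = (p_agree - p_expected) / (1 - p_expected)    # K = (P(A) - P(E)) / (1 - P(E))
print(round(kappa, 3))
```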
Measuring Task Costs
Define cost measures:
–E.g. # of utterances, # of repairs
Can compute across sub-dialogues (see the sketch below)
–Match each segment to its purpose
–Hierarchical structure: link segments to subtasks
–Tag segments by AVM info goals
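A small sketch of aggregating costs per sub-dialogue, assuming each segment has already been tagged with the AVM information goal it serves. Segment boundaries, tags, and numbers are invented for the example.

```python
# Sum cost measures per subtask, keyed by the AVM goal each segment serves.
from collections import defaultdict

segments = [
    {"avm_goal": "dest_city",   "utterances": 4, "repairs": 1},
    {"avm_goal": "dest_city",   "utterances": 2, "repairs": 0},
    {"avm_goal": "depart_date", "utterances": 3, "repairs": 2},
    {"avm_goal": "fare_class",  "utterances": 2, "repairs": 0},
]

costs_per_goal = defaultdict(lambda: {"utterances": 0, "repairs": 0})
for seg in segments:
    costs_per_goal[seg["avm_goal"]]["utterances"] += seg["utterances"]
    costs_per_goal[seg["avm_goal"]]["repairs"] += seg["repairs"]

for goal, costs in costs_per_goal.items():
    print(goal, costs)
```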
Estimating the Performance Function
Predicted measure: performance
–User satisfaction rating: 1-6 on a single question, or the average over several questions
Predictor measures: success & costs
–Normalize each to a z-score, to handle varying scales
–Apply multiple linear regression to compute the weights (see the sketch below)
Can also calculate for a sub-dialogue: restrict K and the costs to that segment
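A minimal sketch of the estimation step: z-score the predictors (kappa plus the cost measures) and fit the user-satisfaction ratings with ordinary least squares. The dialogue data are made up; only numpy is assumed.

```python
import numpy as np

# One row per dialogue: [kappa, num_utterances, num_repairs], plus a satisfaction score.
predictors = np.array([
    [0.90, 12, 1],
    [0.75, 20, 3],
    [0.95, 10, 0],
    [0.60, 25, 5],
    [0.85, 15, 2],
], dtype=float)
satisfaction = np.array([5.5, 3.5, 6.0, 2.0, 4.5])

# Normalize each predictor to a z-score so the weights are comparable across scales.
z = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)

# Multiple linear regression (least squares) with an intercept column.
X = np.column_stack([np.ones(len(z)), z])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(weights)  # [intercept, w_kappa, w_utterances, w_repairs]
```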
Evaluation
Applied to multiple tasks
–Travel, reservation/purchase, Circuit-Fix-It
–Define new AVM attributes to match the discourse structure
Compare dialogue strategies
–Explicit vs. implicit confirmation
–System, user, or mixed initiative
Summary
Building for HCI
–Human-human versus human-computer
–Acquire vocabulary, structure, style
–Base the design on “Wizard-of-Oz” simulation
Evaluating strategies
–Performance = task success - dialogue cost
–Task success: agreement between response & key; the success level compensates for chance
–Costs: number of repairs, number of utterances