ITCS 6010 VUI Evaluation: PARADISE & SUM
PARADISE Paradigm for Dialogue System Evaluation
  Goal: Maximize User Satisfaction
PARADISE Paradigm for Dialogue System Evaluation
  Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures. The weights are computed by correlating user satisfaction with performance.
  Dialogue tasks are represented as an Attribute Value Matrix (AVM) of attribute-value pairs.
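As a rough illustration of the weighted function described above, PARADISE performance is usually stated as alpha * N(kappa) minus the sum of w_i * N(c_i), where kappa is the task-success measure, the c_i are dialogue costs, and N is a Z-score normalization. The sketch below assumes that form; the function names, the cost name "turns", and the weight values are hypothetical placeholders. In a real study the weights would be fitted against collected user satisfaction ratings rather than chosen by hand.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Z-score normalize a list of numbers (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no dialogue differs from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def paradise_performance(kappas, costs, alpha, weights):
    """Per-dialogue performance: alpha * N(kappa) - sum_i w_i * N(cost_i).

    kappas:  task-success value for each dialogue
    costs:   dict mapping a cost name to its per-dialogue values
    alpha:   weight on task success (assumed, not fitted here)
    weights: dict mapping each cost name to its weight (assumed, not fitted here)
    """
    n_kappa = z_normalize(kappas)
    n_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    return [
        alpha * n_kappa[d] - sum(weights[name] * n_costs[name][d] for name in costs)
        for d in range(len(kappas))
    ]


# Hypothetical data: three dialogues, one cost measure (number of turns).
scores = paradise_performance(
    kappas=[0.2, 0.5, 0.8],
    costs={"turns": [12, 8, 4]},
    alpha=0.5,
    weights={"turns": 0.3},
)
```

Higher task success and lower cost both raise the score, so the third hypothetical dialogue ranks best.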
Attribute Value Matrix (AVM)
  An AVM represents the dialogue task
  It captures the information exchanged between agent and user
  It is represented as ordered pairs of attributes and their possible values
  The AVM instantiation of a task is the same regardless of the dialogue
AVM (cont’d)
  Each utterance is tagged with the attributes it concerns: DC = depart-city, AC = arrival-city, DR = depart-range, DT = depart-time.

  Dialogue 1 (Agent A)
    A1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need?  [DC, AC, DR, DT]
    U1: I want to go from Torino to Milano.  [DC, AC]
    A2: Do you want to go from Trento to Milano? Yes or No?  [DC, AC]
    U2: No.  [DC, AC]
    A3: Do you want to leave from Trento? Yes or No?  [DC]
    U3: No.  [DC]
    A4: Where do you want to leave from?  [DC]
    U4: I want to leave from Torino.  [DC]
    A5: Do you want to leave from Torino? Yes or No?  [DC]
    U5: Yes.  [DC]
    A6: Do you want to go to Milano?  [AC]
    U6: Yes.  [AC]
    A7: At which time do you want to leave?  [DR]
    U7: I want to travel in the evening.  [DR]
    A8: Do you want to leave between 6 and 9pm? Yes or No?  [DR]
    U8: Yes.  [DR]
    A9: There is a train leaving at 8:00pm.  [DT]
    ….

  Dialogue 2 (Agent B)
    B1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need?  [DC, AC, DR, DT]
    U1: I want to travel from Torino to Milano.  [DC, AC]
    B2: At which time do you want to leave from Merano to Milano?  [DC, AC, DR]
    U2: No, I want to leave from Torino in the evening.  [DC, DR]
    B3: Do you want to leave from Torino between 6 and 11pm? Please answer Yes or No.  [DC, DR]
    U3: Yes.  [DC, DR]
    B4: A train leaves at 8:00pm.  [DT]
    ….

  AVM instantiation for this task
    attribute       actual value
    depart-city     Torino
    arrival-city    Milano
    depart-range    Evening
    depart-time     8pm
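The attribute table for the train-enquiry task can be written down as a simple mapping and compared with what a system actually captured in a dialogue. The sketch below is a simplification: it reports the plain fraction of matching attributes, whereas PARADISE itself derives task success from a Kappa statistic over a confusion matrix. The function name and the "observed" example are hypothetical.

```python
# Attribute-value key for the train-enquiry scenario, taken from the table above.
AVM_KEY = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8pm",
}


def avm_agreement(key, observed):
    """Return the fraction of key attributes whose observed value matches."""
    matches = sum(1 for attr, value in key.items() if observed.get(attr) == value)
    return matches / len(key)


# Hypothetical outcome: the system misrecognized the departure city as Trento.
observed = dict(AVM_KEY, **{"depart-city": "Trento"})
agreement = avm_agreement(AVM_KEY, observed)  # 3 of 4 attributes match
```

Because the key depends only on the task, the same AVM_KEY can score both Agent A's and Agent B's dialogues, which is what makes the two agents comparable.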
PARADISE Paradigm for Dialogue System Evaluation
  Advantages
    PARADISE approach addresses performance and user satisfaction
  Disadvantages
    Too complex to compute
    Need a large sample size up front
Alternative Approaches
  What’s important?
    Maximize User Satisfaction
    Maximize Task Success
User Satisfaction
  How do we measure user satisfaction?
    Questionnaires
    Interviews
    Focus Groups
Task Success
  How do we measure task success?
    Logging Actual Use
    Performance Measurement
    Walkthroughs
    Pilot Testing
Task Success
  For each dialogue and for the entire conversation, establish AVMs.
  Measure task success with respect to:
    Task completion time
    Accuracy or errors (e.g. misinterpretations)
Conclusions
  PARADISE is good, but too complex!
  Measure user satisfaction and task success.
  What if user satisfaction is not the most relevant aspect?
Speech Usability Metric (SUM)
  Uses 3 metrics:
    User satisfaction
    Accuracy
    Task completion time
  Removes the restriction of relying on a single factor to determine usability
Speech Usability Metric (SUM)
  SUM = X * User Satisfaction + Y * Accuracy + Z * Completion Time
  X + Y + Z = 1
  X, Y, Z > 0
  Weights determined by evaluator
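A minimal sketch of the SUM formula follows, with the two weight constraints checked before the score is computed. The slides do not state how the three metrics are scaled, so this example assumes each one has already been converted to a score between 0 and 1 where higher is better; the function name and the sample values are hypothetical.

```python
import math


def sum_score(satisfaction, accuracy, completion_time, x, y, z):
    """Compute SUM = x * satisfaction + y * accuracy + z * completion_time.

    Assumes all three metrics are pre-scaled to [0, 1], higher is better.
    The evaluator-chosen weights must be strictly positive and sum to 1.
    """
    if min(x, y, z) <= 0:
        raise ValueError("weights X, Y, Z must all be greater than 0")
    if not math.isclose(x + y + z, 1.0):
        raise ValueError("weights X, Y, Z must sum to 1")
    return x * satisfaction + y * accuracy + z * completion_time


# Hypothetical evaluation that weights user satisfaction most heavily.
score = sum_score(satisfaction=0.8, accuracy=0.9, completion_time=0.5,
                  x=0.5, y=0.3, z=0.2)
```

Changing the weights lets an evaluator emphasize whichever aspect matters most for a given application without changing how the individual metrics are collected.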
User Satisfaction
  Surveys
  Questionnaires
  Interviews
Accuracy
  Misinterpretations
    System recognizes wrong word
  Out-of-vocabulary errors
    Words not in system grammar
  Wrong choice
    Correct word recognized, wrong path chosen
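One way to turn the three error categories above into a number is to label each user turn in a logged session and count the errors. The sketch below assumes that labeling scheme and defines accuracy as the share of error-free turns; both the label names and that definition are assumptions made for illustration, not something the slides specify.

```python
from collections import Counter

# Labels for the three error categories listed above (names are hypothetical).
ERROR_TYPES = ("misinterpretation", "out_of_vocabulary", "wrong_choice")


def accuracy_report(turn_labels):
    """Summarize a session given one label per user turn.

    Each label is "ok" or one of ERROR_TYPES. Returns (accuracy, error_counts),
    where accuracy is the fraction of turns that were handled without error.
    """
    if not turn_labels:
        raise ValueError("at least one turn is required")
    counts = Counter(turn_labels)
    unknown = set(counts) - set(ERROR_TYPES) - {"ok"}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    errors = sum(counts[e] for e in ERROR_TYPES)
    accuracy = 1 - errors / len(turn_labels)
    return accuracy, {e: counts[e] for e in ERROR_TYPES}


# Hypothetical session: 10 turns, one misrecognized word and one wrong path.
labels = ["ok"] * 8 + ["misinterpretation", "wrong_choice"]
accuracy, by_type = accuracy_report(labels)
```

Keeping the per-category counts alongside the overall figure shows whether problems come from recognition, grammar coverage, or dialogue flow.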
Task Completion Time
  Time to complete task
  Time for expert to complete task (ETCT)
  Maximum time to complete task (MTCT)
  Expected time to complete task (ExTCT)
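The slides list the timing quantities but not how they are combined into a score. As one possible approach, and purely as an assumption, the sketch below maps a measured completion time onto a 0 to 1 scale using ETCT as the best case and MTCT as the worst case. The linear scaling and the function name are hypothetical, and ExTCT is not used here.

```python
def time_score(actual, etct, mtct):
    """Scale a completion time to [0, 1] (assumed linear mapping).

    Returns 1.0 when the user is as fast as the expert (actual <= etct),
    0.0 when the maximum time is reached or exceeded (actual >= mtct),
    and a proportional value in between. All times share the same unit.
    """
    if not 0 < etct < mtct:
        raise ValueError("expected 0 < ETCT < MTCT")
    score = (mtct - actual) / (mtct - etct)
    # Clamp so unusually fast or slow sessions stay within [0, 1].
    return max(0.0, min(1.0, score))


# Hypothetical task: expert needs 30 s, the cutoff is 90 s, the user took 60 s.
halfway = time_score(actual=60, etct=30, mtct=90)
```

A score produced this way is higher for faster sessions, so it can be weighted alongside user satisfaction and accuracy on a common scale.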
Conclusion
  SUM determines usability of a speech application
  Utilizes 3 pre-defined metrics
  Allows for greater flexibility