Evaluation of SDS
Svetlana Stoyanchev
3/2/2015

Goal of dialogue evaluation
– Assess system performance
Challenges of evaluating SDS systems:
– The SDS developer designs rules, but dialogues are not predictable
– System actions depend on user input
– User input is unrestricted

Stakeholders
– Developers
– Business operators
– End users

Criteria for evaluation
Key criteria:
– Performance of SDS components: ASR (word error rate, WER; sketch below), NLU (concept error rate), DM/NLG (is the response appropriate?)
– Interaction time
– User engagement
Criteria may vary with the application:
– Information access/query: minimize interaction time
– Museum browsing guide: maximize user engagement
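A minimal sketch, not from the slides, of how WER is typically computed: word-level Levenshtein distance between the reference transcript and the ASR hypothesis, divided by the reference length.

```python
# Word error rate: WER = (substitutions + deletions + insertions) / |reference|,
# computed via edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("show me flights to boston", "show flights to austin"))  # 0.4
```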

Evaluation measures/methods
Evaluation measures:
– Turn correction ratio
– Concept accuracy (sketch below)
– Transaction success
Evaluation methods:
– Recruit and pay human subjects to perform tasks in a lab
Disadvantages of human evaluation:
– High cost
– Unrealistic subject behavior
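As an illustration (the slot names and values below are invented, not from the slides), concept accuracy can be scored as the fraction of reference slot-value pairs that the system's understanding got right:

```python
# Concept accuracy: fraction of reference semantic concepts (slot=value
# pairs) that the NLU hypothesis matches. Illustrative sketch only.
def concept_accuracy(reference: dict, hypothesis: dict) -> float:
    correct = sum(1 for slot, value in reference.items()
                  if hypothesis.get(slot) == value)
    return correct / len(reference)

ref = {"origin": "newark", "destination": "boston", "date": "monday"}
hyp = {"origin": "newark", "destination": "austin", "date": "monday"}
print(concept_accuracy(ref, hyp))  # 0.666... (2 of 3 concepts correct)
```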

A typical questionnaire

PARADISE framework
PARAdigm for DIalogue System Evaluation
Framework goal: predict user satisfaction using system features
Performance measures:
– User satisfaction
– Task success
– Dialogue cost
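For reference, the PARADISE performance function from Walker et al. (1997), not spelled out on the slide: performance is a weighted combination of normalized task success and dialogue costs.

```latex
% PARADISE performance function (Walker et al., 1997).
% \kappa is task success, c_i are dialogue costs; N is z-score
% normalization, making weights comparable across measures
% with different scales and units.
\[
  \mathrm{Performance} = \alpha\,\mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i\,\mathcal{N}(c_i),
  \qquad
  \mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}
\]
```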

Applying the PARADISE framework (Walker, Kamm, Litman)
1. Collect data from users in a controlled experiment (subjective satisfaction ratings)
   – Manually mark or automatically log system measures
2. Apply multivariate linear regression
   – User satisfaction is the dependent variable
   – The logged measures are the independent variables
3. Predict user satisfaction from simpler metrics that can be collected automatically in a live system

Data collection for the PARADISE framework
Systems:
– ANNIE: voice dialing, employee directory look-up, and voice message access (novice/expert users)
– ELVIS: accessing email by voice (novice/expert users)
– TOOT: finding a train with specified constraints

Automatically logged variables
Efficiency:
– System turns
– User turns
Dialogue quality:
– Timeouts (the user did not respond)
– Rejects (system confidence is low, leading to "I am sorry, I did not understand")
– Help: number of times the system believes the user said "help"
– Cancel: number of times the system believes the user said "cancel"
– Barge-in

Method
– Train models using multivariate regression
– Test across different systems, measuring how much of the variance the model explains (R²); see the sketch below
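A minimal sketch of this methodology, assuming invented logged features and satisfaction scores (the data and feature names are placeholders, not the Walker et al. datasets): fit a linear regression on one system's dialogues, then report within-system and cross-system R².

```python
# PARADISE-style methodology sketch: fit user satisfaction from
# z-scored logged dialogue measures on system A, then test how well
# the model transfers to dialogues from system B.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def zscore(X):
    # PARADISE normalizes each measure so weights are comparable.
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Placeholder data: rows are dialogues, columns are logged measures
# (e.g. task success, mean recognition score, elapsed time, reject %).
rng = np.random.default_rng(0)
X_a, X_b = rng.normal(size=(50, 4)), rng.normal(size=(30, 4))
sat_a = 2.0 * X_a[:, 0] - 1.0 * X_a[:, 2] + rng.normal(scale=0.5, size=50)
sat_b = 2.0 * X_b[:, 0] - 1.0 * X_b[:, 2] + rng.normal(scale=0.5, size=30)

model = LinearRegression().fit(zscore(X_a), sat_a)
print("within-system R^2:", model.score(zscore(X_a), sat_a))
print("cross-system R^2: ", r2_score(sat_b, model.predict(zscore(X_b))))
```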

Results: train and test on the same system

Results: train and test on all

Results: cross-system train/test

Results: cross-dialogue type

Which features were useful?
– Comp: task success / dialogue completion
– Mrs: mean recognition score
– Et: elapsed time
– Reject%: percentage of utterances in a dialogue rejected by the system

Applying the PARADISE framework: DARPA Communicator, 2000–2001
– 9 participating sites, each developing an air reservation system
"SDS in the wild":
– Over 6 months, recruited users called to make airline reservations
– Recruited frequent travellers

Communicator results

Discussion
Consistent contributors to user satisfaction:
– Negative effect of task duration
– Negative effect of sentence errors
Task success vs. user satisfaction:
– Not always the same
Commercial vs. research systems:
– Different goals
Difficult to generalize across different system types

Next: other methods of evaluation
– F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, and S. Young. Real User Evaluation of Spoken Dialogue Systems Using Amazon Mechanical Turk. In Proceedings of Interspeech, 2011. [presenter: Mandi Wang]
– K. Georgila, J. Henderson, and O. Lemon. Learning User Simulations for Information State Update Dialogue Systems. In Proceedings of Interspeech. [presenter: Xiaoqian Ma]