ITCS 6010 VUI Evaluation: PARADISE & SUM

Presentation transcript:

ITCS 6010 VUI Evaluation: PARADISE & SUM

PARADISE: PARAdigm for DIalogue System Evaluation
Goal: maximize user satisfaction

PARADISE: PARAdigm for DIalogue System Evaluation
Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures, where the weights are computed by correlating user satisfaction with performance. Dialogue tasks are represented as attribute-value pairs in an Attribute Value Matrix (AVM).
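The slides don't give the formula, but the weighted function described above can be sketched as: normalize the success measure and each cost measure, then combine them with regression weights fit against user satisfaction. The weight values and cost names below are hypothetical illustrations, not part of the slides.

```python
import statistics

def z_norm(value, sample):
    """Z-score normalize a raw measure against its corpus distribution,
    so success and cost measures are on a comparable scale."""
    mean = statistics.mean(sample)
    stdev = statistics.pstdev(sample)
    return (value - mean) / stdev if stdev else 0.0

def paradise_performance(success, costs, all_success, all_costs, alpha, weights):
    """Performance = alpha * N(success) - sum_i w_i * N(cost_i).

    success     -- task-based success measure for this dialogue
    costs       -- dict of cost measures, e.g. {"turns": 9}
    all_success -- corpus-wide success values (for normalization)
    all_costs   -- dict mapping cost name -> list of corpus values
    alpha, weights -- weights obtained by correlating user
                      satisfaction with performance
    """
    score = alpha * z_norm(success, all_success)
    for name, value in costs.items():
        score -= weights[name] * z_norm(value, all_costs[name])
    return score
```

A dialogue with high task success and low cost (few turns) then scores higher than one with low success and high cost, matching the intent of the model.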

Attribute Value Matrix (AVM)
The AVM represents the dialogue task: the information exchanged between agent and user.
Represented as ordered pairs of attributes and their possible values.
The AVM instantiation of a task is the same regardless of how the dialogue unfolds.

AVM (cont'd)
Two dialogues for the same task, with each turn tagged by the attributes it addresses (DC = depart-city, AC = arrival-city, DR = depart-range, DT = depart-time):

Dialogue A:
A1: Hello, this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to go from Torino to Milano. [DC, AC]
A2: Do you want to go from Trento to Milano? Yes or no? [DC, AC]
U2: No. [DC, AC]
A3: Do you want to leave from Trento? Yes or no? [DC]
U3: No. [DC]
A4: Where do you want to leave from? [DC]
U4: I want to leave from Torino. [DC]
A5: Do you want to leave from Torino? Yes or no? [DC]
U5: Yes. [DC]
A6: Do you want to go to Milano? [AC]
U6: Yes. [AC]
A7: At which time do you want to leave? [DR]
U7: I want to travel in the evening. [DR]
A8: Do you want to leave between 6 and 9 pm? Yes or no? [DR]
U8: Yes. [DR]
A9: There is a train leaving at 8:00 pm. [DT]
...

Dialogue B:
B1: Hello, this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to travel from Torino to Milano. [DC, AC]
B2: At which time do you want to leave from Merano to Milano? [DC, AC, DR]
U2: No, I want to leave from Torino in the evening. [DC, DR]
B3: Do you want to leave from Torino between 6 and 11 pm? Please answer yes or no. [DC, DR]
U3: Yes. [DC, DR]
B4: A train leaves at 8:00 pm. [DT]
...

AVM for this task:
attribute     | actual value
depart-city   | Torino
arrival-city  | Milano
depart-range  | evening
depart-time   | 8 pm
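Since the AVM instantiation is the same regardless of the dialogue, task success can be scored by comparing the values the system ended up with against the scenario key. Full PARADISE actually uses the kappa statistic over a confusion matrix to correct for chance agreement; the exact-match sketch below is a simplification, and the `scenario` dict is just the example task from the slide.

```python
# Scenario key: the AVM values the task asked for (from the example above).
scenario = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "evening",
    "depart-time": "8pm",
}

def avm_match(scenario, achieved):
    """Fraction of attributes whose final value matches the scenario key.
    Simplified stand-in for PARADISE's kappa-based task success."""
    hits = sum(1 for attr, val in scenario.items()
               if achieved.get(attr) == val)
    return hits / len(scenario)

# Dialogue A eventually confirmed every attribute correctly:
achieved_a = {"depart-city": "Torino", "arrival-city": "Milano",
              "depart-range": "evening", "depart-time": "8pm"}
```

Both example dialogues reach the same final AVM, so they score the same task success even though Dialogue A needed far more turns; the cost measures capture that difference.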

PARADISE: PARAdigm for DIalogue System Evaluation
Advantages:
- Addresses both performance and user satisfaction.
Disadvantages:
- Complex to compute.
- Needs a large sample size up front.

Alternative Approaches
What's important?
- Maximize user satisfaction
- Maximize task success

User Satisfaction
How do we measure user satisfaction?
- Questionnaires
- Interviews
- Focus groups

Task Success
How do we measure task success?
- Logging actual use
- Performance measurement
- Walkthroughs
- Pilot testing

Task Success
Establish AVMs for each dialogue and for the entire conversation. Measure task success with respect to:
- Task completion time
- Accuracy or errors (e.g., misinterpretations)

Conclusions
- PARADISE is good, but too complex.
- Measure user satisfaction and task success.
- But what if user satisfaction is not the most relevant aspect?

Speech Usability Metric (SUM)
Uses three metrics:
- User satisfaction
- Accuracy
- Task completion time
Eliminates the restriction of a single factor determining usability.

Speech Usability Metric (SUM)
SUM = X * UserSatisfaction + Y * Accuracy + Z * CompletionTime
where X + Y + Z = 1 and X, Y, Z > 0.
The weights are determined by the evaluator.
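The weighted sum above is straightforward to compute. A minimal sketch, assuming all three inputs have already been normalized to [0, 1] with higher meaning better (so completion time must first be converted to a score, as discussed under Task Completion Time); the example weights are illustrative, since the slides leave them to the evaluator:

```python
def sum_score(user_satisfaction, accuracy, completion_time, x, y, z):
    """SUM = X * UserSatisfaction + Y * Accuracy + Z * CompletionTime.

    Inputs are assumed pre-normalized to [0, 1], higher is better.
    Weights must be positive and sum to 1, per the SUM constraints.
    """
    if abs(x + y + z - 1) > 1e-9 or min(x, y, z) <= 0:
        raise ValueError("weights must be positive and sum to 1")
    return (x * user_satisfaction
            + y * accuracy
            + z * completion_time)

# Hypothetical evaluator choice: weight all three metrics equally.
score = sum_score(0.8, 0.9, 0.7, 1/3, 1/3, 1/3)
```

An evaluator who cares mostly about recognition quality could instead pick, say, X = 0.2, Y = 0.6, Z = 0.2; that freedom is exactly what SUM means by the weights being evaluator-determined.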

User Satisfaction
- Surveys
- Questionnaires
- Interviews

Accuracy
- Misinterpretations: the system recognizes the wrong word.
- Out-of-vocabulary errors: the user says words not in the system grammar.
- Wrong choice: the correct word is recognized, but the wrong dialogue path is chosen.
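In a test session these error categories can simply be tallied per user turn and folded into a single accuracy figure. The slides don't prescribe a formula; a minimal sketch with a hypothetical error log:

```python
from collections import Counter

# Hypothetical per-turn error annotations from one test session.
errors = ["misinterpretation", "out-of-vocabulary",
          "misinterpretation", "wrong-choice"]

def accuracy(n_user_turns, errors):
    """Fraction of user turns handled without any recognition error."""
    return 1 - len(errors) / n_user_turns

by_type = Counter(errors)  # breakdown by error category
```

The per-category counts in `by_type` are useful beyond the single accuracy number: many out-of-vocabulary errors suggest fixing the grammar, while many wrong-choice errors point at the dialogue logic.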

Task Completion Time
Time to complete the task, compared against:
- Expert time to complete task (ETCT)
- Maximum time to complete task (MTCT)
- Expected time to complete task (ExTCT)
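To feed completion time into the SUM formula it must be mapped onto the same higher-is-better scale as the other two metrics. The slides list the reference times but not the mapping; one plausible sketch (an assumption, not part of the slides) scores 1 at the expert time and 0 at the maximum time, linearly in between:

```python
def time_score(actual, etct, mtct):
    """Map a completion time onto [0, 1]:
    1.0 at or below the expert time (ETCT),
    0.0 at or above the maximum time (MTCT),
    linear in between. This normalization is an assumption.
    """
    if actual <= etct:
        return 1.0
    if actual >= mtct:
        return 0.0
    return (mtct - actual) / (mtct - etct)
```

A variant could anchor the midpoint at the expected time (ExTCT) instead of interpolating linearly; the choice is up to the evaluator, like the SUM weights themselves.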

Conclusion
- SUM determines the usability of a speech application.
- Utilizes three pre-defined metrics.
- Allows for greater flexibility than a single-factor measure.