Evaluating Spoken Dialogue Systems
Julia Hirschberg
CS 4706

Dialogue System Evaluation
A key point about spoken language processing: whenever we design a new algorithm or build a new application, we need to evaluate it.
Two kinds of evaluation:
– Extrinsic: the system is embedded in some external task
– Intrinsic: the system is evaluated more locally, on its own
How do we evaluate a dialogue system? What constitutes success or failure for a dialogue system?

Dialogue System Evaluation
We need an evaluation metric because:
– 1) A metric lets us compare different implementations: we can't improve a system if we don't know where it fails, and we can't decide between two algorithms without a goodness measure.
– 2) A metric for "how well a dialogue went" can serve as input to reinforcement learning: automatically improve the conversational agent's performance via learning.

Evaluating Dialogue Systems
PARADISE framework (Walker et al. '00)
"Performance" of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent, and by how it gets accomplished:
– Maximize Task Success
– Minimize Costs (Efficiency Measures and Qualitative Measures)

Task Success
% of subtasks completed
Correctness of each question/answer/error message
Correctness of the total solution
– Attribute-Value Matrix (AVM)
– Kappa coefficient
Users' perception of whether the task was completed

Task Success
Task goals can be seen as an Attribute-Value Matrix (AVM).
ELVIS email retrieval task (Walker et al. '97): "Find the time and place of your meeting with Kim."

Attribute             Value
Selection Criterion   Kim or Meeting
Time                  10:30 a.m.
Place                 2D516

Task success can be defined by the match between the AVM values at the end of the task and the "true" values for the AVM.
Slide from Julia Hirschberg
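As a concrete illustration (not from the original slides), here is a minimal Python sketch of AVM-based task success: compare the AVM the dialogue ends with against the scenario's reference AVM and score the fraction of matching attributes. The dictionaries and the mismatched time are made-up values; the kappa coefficient mentioned above is essentially this raw agreement corrected for chance.

def avm_task_success(observed: dict, reference: dict) -> float:
    """Fraction of reference attributes whose observed value matches."""
    matches = sum(1 for attr, value in reference.items()
                  if observed.get(attr) == value)
    return matches / len(reference)

reference = {"Selection Criterion": "Kim or Meeting",
             "Time": "10:30 a.m.",
             "Place": "2D516"}
observed  = {"Selection Criterion": "Kim or Meeting",
             "Time": "11:30 a.m.",   # the system recorded the wrong time
             "Place": "2D516"}

print(avm_task_success(observed, reference))   # 2 of 3 attributes match -> 0.666...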

Efficiency Cost
Polifroni et al. (1992), Danieli and Gerbino (1995), Hirschman and Pao (1993)
Total elapsed time in seconds or turns
Number of queries
Turn correction ratio:
– Number of system or user turns used solely to correct errors, divided by the total number of turns
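The turn correction ratio is easy to compute once each turn in the log is flagged as a correction or not; a minimal sketch follows, assuming a hypothetical per-turn is_correction flag (the flag name and the toy dialogue are illustrative):

def turn_correction_ratio(turns):
    """turns: list of dicts with a boolean 'is_correction' flag per turn."""
    corrections = sum(1 for t in turns if t["is_correction"])
    return corrections / len(turns)

dialogue = [
    {"speaker": "user",   "is_correction": False},  # initial request
    {"speaker": "system", "is_correction": False},  # answer based on a misheard city
    {"speaker": "user",   "is_correction": True},   # "No, I said Austin"
    {"speaker": "system", "is_correction": True},   # re-prompt to repair the error
]

print(turn_correction_ratio(dialogue))  # 2 of 4 turns -> 0.5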

Quality Cost
# of times the ASR system failed to return any sentence
# of ASR rejection prompts
# of times the user had to barge in
# of time-out prompts
Inappropriateness (verbose, ambiguous) of the system's questions, answers, and error messages

Another Key Quality Cost
"Concept accuracy" or "concept error rate": the % of semantic concepts that the NLU component returns correctly.
Example: "I want to arrive in Austin at 5:00"
– DESTCITY: Boston
– TIME: 5:00
Concept accuracy = 50% (one of two slots correct)
Average this across the entire dialogue: "How many of the sentences did the system understand correctly?"
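In code, per-utterance concept accuracy is just slot-level agreement between the NLU hypothesis and the reference; this minimal sketch reuses the Austin/Boston example above (the slot names are illustrative):

def concept_accuracy(hypothesis: dict, reference: dict) -> float:
    """Fraction of reference slots the NLU hypothesis got right."""
    correct = sum(1 for slot, value in reference.items()
                  if hypothesis.get(slot) == value)
    return correct / len(reference)

reference  = {"DESTCITY": "Austin", "TIME": "5:00"}
hypothesis = {"DESTCITY": "Boston", "TIME": "5:00"}   # city slot is wrong

print(concept_accuracy(hypothesis, reference))  # 1 of 2 slots -> 0.5
# Dialogue-level concept accuracy is the mean of this value over all utterances.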

PARADISE: Regress against User Satisfaction

Regressing against User Satisfaction
A questionnaire assigns each dialogue a "user satisfaction rating": the dependent measure
Cost and success factors: the independent measures
Use regression to train weights for each factor

Experimental Procedures
Subjects are given specified tasks
Spoken dialogues are recorded
Cost factors, states, and dialogue acts are automatically logged; ASR accuracy and barge-ins are hand-labeled
Users specify the task solution via a web page
Users complete User Satisfaction surveys
Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors
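A minimal sketch of this regression step (not the original PARADISE code): fit User Satisfaction from task success and cost factors with ordinary least squares via scikit-learn's LinearRegression. The feature columns and toy values are illustrative, and PARADISE additionally normalizes the factors before fitting.

import numpy as np
from sklearn.linear_model import LinearRegression

# One row per dialogue: [perceived task completion, concept accuracy, elapsed time (s)]
X = np.array([[1, 0.95, 120],
              [1, 0.80, 300],
              [0, 0.55, 420],
              [1, 0.90, 180],
              [0, 0.40, 500]], dtype=float)
# Summed User Satisfaction survey score for each dialogue
y = np.array([22, 18, 9, 20, 7], dtype=float)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)     # learned weight for each factor
print(model.predict([[1, 0.85, 240]]))   # predicted satisfaction for a new dialogue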

User Satisfaction: Sum of Many Measures
Was the system easy to understand? (TTS Performance)
Did the system understand what you said? (ASR Performance)
Was it easy to find the message/plane/train you wanted? (Task Ease)
Was the pace of interaction with the system appropriate? (Interaction Pace)
Did you know what you could say at each point of the dialogue? (User Expertise)
How often was the system sluggish and slow to reply to you? (System Response)
Did the system work the way you expected it to in this conversation? (Expected Behavior)
Do you think you'd use the system regularly in the future? (Future Use)

Performance Functions from Three Systems
ELVIS: User Sat. = .21*COMP + .47*MRS - .15*ET
TOOT:  User Sat. = .35*COMP + .45*MRS - .14*ET
ANNIE: User Sat. = .33*COMP + .25*MRS + .33*Help
– COMP: user perception of task completion (task success)
– MRS: mean (concept) recognition accuracy (cost)
– ET: elapsed time (cost)
– Help: help requests (cost)
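Once the weights are fitted, predicting satisfaction for a new dialogue is just a weighted sum; a minimal sketch applying the ELVIS weights above (the input values are made up and would normally be normalized factor scores):

def elvis_user_sat(comp, mrs, et):
    """ELVIS performance function: .21*COMP + .47*MRS - .15*ET."""
    return 0.21 * comp + 0.47 * mrs - 0.15 * et

# Example with made-up (normalized) factor values for one dialogue
print(elvis_user_sat(comp=1.2, mrs=0.8, et=-0.5))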

Performance Model
Perceived task completion and mean recognition score (concept accuracy) are consistently significant predictors of User Satisfaction.
The performance model is useful for system development:
– Making predictions about system modifications
– Distinguishing 'good' dialogues from 'bad' dialogues
– Serving as part of a learning model

Now That We Have a Success Metric
Could we use it to help drive automatic learning?
– Methods for automatically evaluating system performance
– A way of obtaining training data for further system development

Recognizing 'Problematic' Dialogues
Hastie et al., "What's the Trouble?", ACL 2002
Motivation: build a Problematic Dialogue Identifier (PDI) to classify dialogues
What is a problematic dialogue?
– The task is not completed
– User satisfaction is low
Results:
– Dialogues in which the task was not completed identified with 85% accuracy
– Dialogues with low user satisfaction identified with 89% accuracy

Corpus
1242 recorded dialogues from the DARPA Communicator Corpus
– Logfiles with events for each user turn
– ASR and hand transcriptions
– User information: dialect
– User Satisfaction surveys
– Task Completion labels
Goal is to predict:
– User Satisfaction (5-25 points)
– Task Completion (0, 1, 2): none, airline task, airline + ground task
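A minimal sketch of what such a Problematic Dialogue Identifier can look like, assuming per-dialogue feature vectors and binary task-completion labels; the feature names, toy values, and logistic-regression classifier are illustrative stand-ins, not the actual features or learner from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One row per dialogue: [mean ASR confidence, # turns, # reprompts] (illustrative)
X = np.array([[0.92, 14, 1],
              [0.55, 40, 6],
              [0.88, 18, 2],
              [0.48, 35, 8],
              [0.90, 12, 0],
              [0.60, 45, 5]])
y = np.array([0, 1, 0, 1, 0, 1])   # 1 = problematic (task not completed)

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=3).mean())   # cross-validated accuracy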

DATE Dialogue Act Extraction

Features Used in Prediction

Results

Summary
Intrinsic vs. extrinsic evaluation methods
– Key question: can we find intrinsic evaluation metrics that correlate with extrinsic results?
What other sorts of measures can you think of?