Cost of Misunderstandings
Modeling the Cost of Misunderstanding Errors in the CMU Communicator Dialog System
Presented by: Dan Bohus
Work by: Dan Bohus, Alex Rudnicky
Carnegie Mellon University, 2001

Slide 2: Outline
- Quick overview of previous utterance-level confidence annotation work.
- Modeling the cost of misunderstandings in spoken dialog systems.
- Experiments and results.
- Further analysis.
- Summary, further work, conclusion.

Slide 3: Utterance-Level Confidence Annotation Overview
- Confidence annotation = data-driven classification.
- Corpus: 2 months, 131 dialogs, 4550 utterances.
- Features: 12 features from the decoder, parsing, and dialog management levels.
- Classifiers: Decision Tree, ANN, BayesNet, AdaBoost, Naive Bayes, SVM, plus a Logistic Regression model (added later).

Slide 4: Confidence Annotator Performance
- Baseline error rate: 32%
- Garble baseline: 25%
- Classifier performance: 16%
- Differences between the classifiers are statistically insignificant, except for Naive Bayes.
- On a soft metric, the logistic regression model clearly outperformed the others.
- But is this the right way to evaluate performance?

Slide 5: Judging Performance
- Classification error rate (FP + FN) implicitly assumes that FP and FN errors have the same cost.
- But the cost of a misunderstanding in a dialog system is presumably different for FPs and FNs.
- Instead, build an error function that takes these costs into account, and optimize for that.
- The costs also depend on:
  - the domain/system ~ not a problem
  - the dialog state

Slide 6: Problem Formulation
1. Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
2. Use the costs to pick the optimal tradeoff point on the classifier's ROC.

Slide 7: The Cost Model
- Model the impact of the FPs and FNs on system performance:
  - Identify a suitable performance metric P.
  - Build a statistical regression model at the dialog session level: P = f(FPs, FNs).
  - With linear regression: P = k + Cost_FP * FP + Cost_FN * FN.
- We can then plot f, and implicitly optimize for P.
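The slides do not show how such a session-level regression would be fit; below is a minimal sketch under the stated setup, with the per-session FP/FN counts and performance scores invented purely for illustration:

```python
# Sketch (not from the original slides): estimating the cost model
# P = k + Cost_FP * FP + Cost_FN * FN with ordinary least squares.
# All per-session counts and performance scores below are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

fp = np.array([3, 1, 5, 0, 2])                    # false acceptances per session
fn = np.array([1, 2, 0, 1, 3])                    # false rejections per session
perf = np.array([0.55, 0.70, 0.40, 0.85, 0.50])   # performance P, e.g. CTC per turn

X = np.column_stack([fp, fn])
model = LinearRegression().fit(X, perf)
k = model.intercept_
cost_fp, cost_fn = model.coef_   # expected to be negative: errors hurt P

print("k =", round(k, 2), " Cost_FP =", round(cost_fp, 2), " Cost_FN =", round(cost_fn, 2))
print("R^2 =", round(model.score(X, perf), 2))
```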

Slide 8: Measuring Performance
- User satisfaction (e.g., on a 5-point scale):
  - hard to get;
  - very subjective ~ hard to make consistent across users.
- Concept transfer efficiency:
  - CTC: correctly transferred concepts per turn.
  - ITC: incorrectly transferred concepts per turn.
- Completion.

Slide 9: Detour: The Dataset
- 134 dialogs (2561 utterances), collected using 4 scenarios.
- Satisfaction scores available for only 35 dialogs.
- Corpus manually labeled at the concept and utterance level, with 4 labels: OK / RBAD / PBAD / OOD.
- Aggregate utterance labels generated from the concept labels.
- Confidence annotator decisions logged.
- Counts of FPs, FNs, CTCs, and ITCs computed for each session.

Slide 10: Example
- U: I want to fly from Pittsburgh to Boston
- S: I want to fly from Pittsburgh to Austin
- C: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]
- Only 2 relevantly expressed concepts.
- If accepted: CTC = 1, ITC = 1.
- If rejected: CTC = 0, ITC = 0.
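A small sketch of how the per-turn CTC/ITC counts in this example could be derived from the concept labels; this is not the original scoring code, and treating [I_want] as the one concept that is not "relevantly expressed" is my assumption based on the slide:

```python
# Sketch: per-turn CTC/ITC from concept labels (OK / RBAD / PBAD / OOD).
# Assumption: grammar-only concepts like I_want do not count as relevant.
def ctc_itc(concepts, accept):
    """concepts: list of (name, label) pairs for one turn."""
    relevant = [(n, l) for n, l in concepts if n != "I_want"]
    if not accept:              # a rejected turn transfers no concepts
        return 0, 0
    ctc = sum(1 for _, l in relevant if l == "OK")   # correctly transferred
    itc = sum(1 for _, l in relevant if l != "OK")   # incorrectly transferred
    return ctc, itc

turn = [("I_want", "OK"), ("Depart_Loc", "OK"), ("Arrive_Loc", "RBAD")]
print(ctc_itc(turn, accept=True))    # (1, 1)
print(ctc_itc(turn, accept=False))   # (0, 0)
```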

Slide 11: Targeting Efficiency: Model 1
- 3 successively refined models.
- Model 1: CTC = FP + FN + TN + k (shorthand: CTC regressed linearly on the per-session FP, FN, and TN counts).
  - CTC: correctly transferred concepts per turn.
  - TN: true negatives.
- [Table: R² on all / train / test data for Model 1]

Slide 12: Targeting Efficiency: Model 2
- Model 2: CTC - ITC = (REC +) FP + FN + TN + k
  - ITC: incorrectly transferred concepts per turn.
  - REC: relevantly expressed concepts.
- [Table: R² on all / train / test data for CTC = FP+FN+TN, CTC-ITC = FP+FN+TN, and CTC-ITC = REC+FP+FN+TN]

Slide 13: Targeting Efficiency: Model 3
- Model 3: CTC - ITC = REC + FPC + FPNC + FN + TN + k
- 2 types of FPs:
  - FPC: false positives with concepts.
  - FPNC: false positives without concepts.
- [Table: R² on all / train / test data for all four models]

Slide 14: Model 3 Results
- CTC - ITC = REC + FPC + FPNC + FN + TN + k
- Estimated coefficients:
  - k = 0.41
  - C_REC = 0.62
  - C_FPNC = -0.48
  - C_FPC = -2.12
  - C_FN = -1.33
  - C_TN = -0.56

Slide 15: Other Models
- Completion (binary):
  - Logistic regression model.
  - The estimated model does not indicate a good fit.
- User satisfaction (5-point scale):
  - Based on only 35 dialogs.
  - R² = 0.61 (similar to the literature, e.g., Walker et al.).
  - Explanation: subjectivity of the metric plus the limited dataset.
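For concreteness, a sketch of the kind of completion model the slide mentions (which it reports as fitting poorly); the per-session error counts and completion flags here are invented:

```python
# Sketch: logistic regression of binary task completion on per-session
# error counts, as mentioned on the slide. Data below is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[3, 1], [1, 2], [5, 0], [0, 1], [2, 3], [0, 0]])  # [FP, FN] per session
completed = np.array([0, 1, 0, 1, 0, 1])                        # 1 = task completed

clf = LogisticRegression().fit(X, completed)
print(clf.intercept_, clf.coef_)       # intercept and error costs on the log-odds scale
print(clf.predict_proba([[1, 1]]))     # P(completion) for a new session
```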

Slide 16: Problem Formulation (revisited)
1. Develop a cost model which allows us to quantitatively assess the costs of FP and FN errors.
2. Use the costs to pick the optimal tradeoff point on the classifier's ROC.

Slide 17: Tuning the Confidence Annotator
- Using Model 3: CTC - ITC = REC + FPNC + FPC + FN + TN + k.
- Drop k and REC, and plug in the estimated values:
  Cost = 0.48·FPNC + 2.12·FPC + 1.33·FN + 0.56·TN
- Minimize Cost instead of classification error rate (FP + FN); this implicitly maximizes concept transfer efficiency.
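A minimal sketch of this tuning step: sweep the rejection threshold and pick the one minimizing the Model 3 cost rather than the classification error rate. The confidence scores, ground-truth labels, and has-concepts flags below are hypothetical:

```python
# Sketch (not the original tuning code): threshold selection that
# minimizes Cost = 0.48*FPNC + 2.12*FPC + 1.33*FN + 0.56*TN.
import numpy as np

conf = np.array([0.9, 0.2, 0.6, 0.4, 0.8, 0.1])        # annotator confidence (made up)
bad = np.array([0, 1, 1, 0, 0, 1], dtype=bool)         # truly misunderstood utterance?
has_concepts = np.array([1, 1, 0, 1, 1, 0], dtype=bool)

def cost(threshold):
    accept = conf >= threshold
    fpc = np.sum(accept & bad & has_concepts)    # false accepts with concepts
    fpnc = np.sum(accept & bad & ~has_concepts)  # false accepts without concepts
    fn = np.sum(~accept & ~bad)                  # false rejections of good utterances
    tn = np.sum(~accept & bad)                   # correct rejections (still lose a turn)
    return 0.48 * fpnc + 2.12 * fpc + 1.33 * fn + 0.56 * tn

thresholds = np.linspace(0, 1, 101)
best = min(thresholds, key=cost)
print(f"best threshold = {best:.2f}, cost = {cost(best):.2f}")
```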

Slide 18: Operating Characteristic
[Figure: total cost across the classifier's operating characteristic]

Slide 19: Further Analysis
- Is CTC - ITC really modeling dialog performance?
- Mean = 0.71, std. dev. = 0.28.
- Mean for completed dialogs: 0.82; mean for uncompleted dialogs: 0.57.
- The difference between the means is significant at a very high level of confidence: p-value = 7.23×10⁻⁹ (t-test).
- So it looks like CTC - ITC is okay, right?
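A sketch of the significance test described here, comparing the mean CTC-ITC of completed and uncompleted dialogs; the per-dialog scores below are invented, so the resulting p-value will not match the slide's:

```python
# Sketch: two-sample t-test on per-dialog CTC-ITC scores (data made up).
from scipy import stats

completed = [0.85, 0.90, 0.78, 0.82, 0.88]     # CTC-ITC, completed dialogs
uncompleted = [0.55, 0.60, 0.50, 0.62, 0.58]   # CTC-ITC, uncompleted dialogs

t, p = stats.ttest_ind(completed, uncompleted)
print(f"t = {t:.2f}, p = {p:.2g}")   # a small p-value => the means differ significantly
```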

Slide 20: Further Analysis (cont'd)
- Can we reliably extrapolate to other areas of the operating characteristic?

Slide 21: Further Analysis (cont'd)
- Can we reliably extrapolate to other areas of the operating characteristic?
- Yes: look at the distribution of the FP and FN ratios across dialogs.

Slide 22: Further Analysis (cont'd)
- What is the impact of the baseline error rate?
- We compared models constructed from high and low error rate data:
  - At the low error rate, the cost curve becomes monotonically increasing.
  - This clearly indicates that "trust everything / have no confidence annotation" is the way to go in that setting.

Slide 23: Our Explanation So Far
- Incorrectly captured information is easy to overwrite in the CMU Communicator.
- Error rates are relatively low.
- The likelihood of repeated misrecognition is low.

Slide 24: Conclusion
- A data-driven approach to quantitatively assessing the costs of various types of misunderstandings.
- Models based on efficiency fit the data well; the obtained costs confirm intuition.
- For the CMU Communicator, the model predicts that total cost stays the same across a large range of the classifier's operating characteristic.

Slide 25: Further Experiments
- But, of course, we can verify the predictions experimentally:
  - Collect new data with the system running at a very low rejection threshold.
  - 55 dialogs collected so far.
  - Thanks to those who have participated in these experiments; "help if you have the time" to the others …
  - Re-estimate the models and verify the predictions.

Slide 26: Confusion Matrix

                    OK    BAD
  System says OK    TP    FP
  System says BAD   FN    TN

- FP = false acceptance; FN = false detection/rejection.
- Fallout = FP / (FP + TN) = FP / N_BAD
- CDR = 1 - Fallout = 1 - FP / N_BAD
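A tiny helper implementing the two metrics exactly as defined on the slide; the counts in the usage line are made up:

```python
# Sketch: fallout and correct detection rate (CDR) from confusion-matrix
# counts, following the slide's definitions.
def fallout_cdr(fp, tn):
    n_bad = fp + tn               # all truly BAD utterances
    fallout = fp / n_bad          # fraction of BAD utterances falsely accepted
    return fallout, 1 - fallout   # CDR = 1 - Fallout

print(fallout_cdr(fp=12, tn=48))  # (0.2, 0.8) with these made-up counts
```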