Modeling the Cost of Misunderstandings in the CMU Communicator System

Dan Bohus, Alex Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Abstract

We present a data-driven approach that allows us to quantitatively assess the costs of the various types of errors that a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal tradeoff point between these errors and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false positives and false negatives are in accordance with our intuitions. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.

1. Motivation. Problem Formulation

Intro

In previous work [1], we cast the problem of utterance-level confidence annotation as a binary classification task and trained multiple classifiers for this purpose:
- Training corpus: 131 dialogs, 4550 utterances
- 12 features from the recognition, parsing and dialog levels
- 7 classifiers: Decision Tree, ANN, Bayesian Net, AdaBoost, Naïve Bayes, SVM, Logistic Regression

Results (mean classification error rates in 10-fold cross-validation)
* Most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes). The logistic regression model obtained much better performance on a soft metric.

Question: Is Classification Error Rate the Right Way to Evaluate Performance?

Classification error rate (CER) as a measure of performance implicitly assumes that false positives (FP) and false negatives (FN) cost the same. Intuitively, this assumption does not hold in most dialog systems:
- On a FP, the system incorporates and acts on invalid information.
- On a FN, the system rejects a valid user utterance.
Ideally, then, we want to build an error function that takes these costs into account, and optimize for it.

Problem Formulation
1. Develop a cost model that allows us to quantitatively assess the costs of FP and FN errors.
2. Use these costs to pick an optimal point on the classifier operating characteristic.

Mean classification error rates:
- Random baseline: 32%
- Previous “Garble” baseline: 25%
- Classifiers*: 16%

2. Cost Models: The Approach

The Approach

To model the impact of FPs and FNs on system performance, we:
- Identify a suitable dialog performance metric (P) that we want to optimize for.
- Build a statistical regression model on whole sessions, using P as the response variable and the counts of FPs and FNs as predictors:
  - $P = f(FPs, FNs)$
  - $P = k + Cost_{FP} \cdot FP + Cost_{FN} \cdot FN$ (linear regression)

Performance metrics:
- User satisfaction (5-point scale): subjective, hard to obtain
- Completion (binary): too coarse
- Concept transmission efficiency:
  - CTC = correctly transferred concepts/turn
  - ITC = incorrectly transferred concepts/turn
  - REC = relevantly expressed concepts/turn

The Dataset
- 134 dialogs, collected using mostly 4 different scenarios
- [count missing] utterances
- User satisfaction scores obtained for only 35 dialogs
- Corpus manually labeled at the concept level:
  - 4 labels: OK/RBAD/PBAD/OOD
  - Aggregate utterance labels generated
- Confidence annotator decisions available in the logs
- We could therefore compute the counts of FPs, FNs, CTCs and ITCs for each session

An Example

User: I want to fly from Pittsburgh to Boston
Decoder: I want to fly from Pittsburgh to Austin
Parse: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/OK]

Only 2 relevantly expressed concepts:
- If Accept: CTC=1, ITC=1, REC=2
- If Reject: CTC=0, ITC=0, REC=2
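The linear cost model P = k + Cost_FP * FP + Cost_FN * FN can be illustrated with a minimal least-squares sketch. This is not the authors' implementation: the function names (`solve_linear_system`, `fit_cost_model`) and the session data are hypothetical, and the data is made up rather than taken from the Communicator corpus. Note that in this form the costs appear as regression coefficients, so an error type that hurts performance gets a negative coefficient.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_cost_model(sessions):
    """Fit P = k + cost_fp * FP + cost_fn * FN by ordinary least squares.

    sessions: list of (fp_count, fn_count, performance) tuples, one per dialog.
    Returns (k, cost_fp, cost_fn).
    """
    rows = [[1.0, float(fp), float(fn)] for fp, fn, _ in sessions]
    ys = [float(p) for _, _, p in sessions]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Made-up, noise-free sessions (NOT the Communicator data): performance is
# generated from known coefficients so the fit can be checked by eye.
sessions = [(fp, fn, 1.0 - 0.10 * fp - 0.05 * fn)
            for fp in range(5) for fn in range(5)]
k, cost_fp, cost_fn = fit_cost_model(sessions)
print(k, cost_fp, cost_fn)
```

On real sessions the fit would not be exact, and a goodness-of-fit measure such as R^2, ideally under cross-validation, would be needed to judge the model.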
5. Further Analysis

Is CTC-ITC an Adequate Metric?
- Mean = 0.71; standard deviation = 0.28
- Mean for completed dialogs = 0.82
- Mean for uncompleted dialogs = 0.57
- The differences are statistically significant at a very high level of confidence (p = [missing])

Can We Reliably Extrapolate the Model to Other Areas of the ROC?

The distribution of FPs and FNs across dialogs indicates that, although the data was obtained with the confidence annotator running at a threshold of 0.5, we have enough samples to reliably estimate the other areas of the ROC.

How About the Impact of the Baseline Error Rate?

Cost models constructed from sessions with a low baseline error rate indicate that the optimal point is at a threshold of 0 (no confidence annotator). Explanation:
- Ability to easily overwrite incorrectly captured information in the CMU Communicator
- Relatively low baseline error rates

School of Computer Science, Carnegie Mellon University, 2001, Pittsburgh, PA

Cost Models: The Results

Cost Models Targeting Efficiency

Three successively refined cost models were developed, targeting efficiency as the response variable. The goodness of fit for these models (indicated by $R^2$), both on the training data and in 10-fold cross-validation, is shown in the table below.

Model 1: CTC = FP + FN + TN + k

Model 2: CTC - ITC = REC + FP + FN + TN + k
- Added the ITC term so that we also minimize the number of incorrectly transferred concepts.
- REC captures a prior on the verbosity of the user.
- Both changes further improve performance.

Model 3: CTC - ITC = REC + FPC + FPNC + FN + TN + k

The FP term was split in two, since there are 2 different types of false positives in the system, which intuitively should have very different costs:
- FPC = false positives with relevant concepts
- FPNC = false positives without relevant concepts

Goodness of fit (columns: Model, $R^2$ all, $R^2$ train, $R^2$ test; values missing):
- CTC = FP + FN + TN
- CTC - ITC = FP + FN + TN
- CTC - ITC = REC + FP + FN + TN
- CTC - ITC = REC + FPC + FPNC + FN + TN

The resulting coefficients for model 3 are given below, together with their 95% confidence intervals:
- k = 0.41
- C_REC = 0.62
- C_FPNC, C_FPC, C_FN, C_TN: values missing

Other Models

Targeting Completion (binary)
- Logistic regression model
- The estimated model does not indicate a good fit

Targeting User Satisfaction (5-point scale)
- Based on only 35 dialogs
- $R^2$ = 0.61, similar to the literature (Walker et al.)
- Explanation: subjectivity of the metric + limited dataset

Fine-tuning the annotator

We want to find the optimal trade-off point on the operating characteristic of the classifier. Implicitly, we have been minimizing the classification error rate (FP + FN). The problem therefore translates to locating the point on the operating characteristic (by moving the classification threshold) that minimizes the total cost, and thus implicitly maximizes the chosen performance metric, rather than the classification error rate. The cost, according to model 3, is:

Cost = 0.48 · FPNC + (terms in FPC, FN and TN; coefficients missing)

The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same, regardless of the ratio of FPs to FNs that the system makes.

6. Conclusions
- Proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
- Models based on efficiency fit the data well; the obtained costs confirm the intuition.
- For the CMU Communicator, the models predict that the total cost stays about the same across a large range of the operating characteristic of the confidence annotator.
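Fine-tuning the annotator, that is, moving the classification threshold to minimize total cost rather than classification error rate, can be sketched as a simple sweep over candidate thresholds. This is an illustrative sketch only: the function names, the utterance scores and the cost weights are all hypothetical. In particular, the weights are not the model 3 coefficients, and the sketch uses a single FP term instead of the FPC/FPNC split.

```python
def count_outcomes(utterances, threshold):
    """Count outcomes when accepting utterances with confidence >= threshold.

    utterances: list of (confidence, is_correct) pairs.
    FP = accepted but incorrect, FN = rejected but correct,
    TN = rejected and incorrect, TP = accepted and correct.
    """
    counts = {"FP": 0, "FN": 0, "TN": 0, "TP": 0}
    for confidence, is_correct in utterances:
        accepted = confidence >= threshold
        if accepted:
            counts["TP" if is_correct else "FP"] += 1
        else:
            counts["FN" if is_correct else "TN"] += 1
    return counts


def total_cost(counts, weights):
    """Linear cost: sum of weight * count over outcome types (missing weight = 0)."""
    return sum(weights.get(kind, 0.0) * n for kind, n in counts.items())


def best_threshold(utterances, weights, thresholds):
    """Return the candidate threshold with the lowest total cost."""
    return min(thresholds,
               key=lambda t: total_cost(count_outcomes(utterances, t), weights))


# Made-up annotator scores and HYPOTHETICAL cost weights (not from the poster).
utterances = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
              (0.4, False), (0.3, True), (0.2, False), (0.1, False)]
weights = {"FP": 2.0, "FN": 1.0, "TN": 0.5}
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

for t in thresholds:
    print(t, total_cost(count_outcomes(utterances, t), weights))
print("best:", best_threshold(utterances, weights, thresholds))
```

With these made-up weights the cost varies clearly with the threshold. A nearly flat cost curve, as reported for the CMU Communicator, would instead show similar totals across most candidate thresholds.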