Cost of Misunderstandings Modeling the Cost of Misunderstanding Errors in the CMU Communicator Dialog System Presented by: Dan Bohus Work by: Dan Bohus, Alex Rudnicky Carnegie Mellon University, 2001
Outline: Quick overview of previous utterance-level confidence annotation work. Modeling the cost of misunderstandings in spoken dialog systems. Experiments & results. Further analysis. Summary, further work, conclusion.
Utterance-Level Confidence Annotation Overview: Confidence annotation = data-driven classification. Corpus: 2 months, 131 dialogs, 4550 utterances. Features: 12 features from the decoder, parsing, and dialog management levels. Classifiers: Decision Tree, ANN, BayesNet, AdaBoost, NaiveBayes, SVM, plus a Logistic Regression model (added later on).
Confidence annotator performance: Baseline error rate: 32%. Garble baseline: 25%. Classifier performance: 16%. Differences between classifiers are not statistically significant, except for Naïve Bayes. On a soft metric, the logistic regression model clearly outperformed the others. But is this the right way to evaluate performance?
Judging Performance: Classification Error Rate (FP+FN) implicitly assumes that FP and FN errors have the same cost. But the cost of a misunderstanding in a dialog system is presumably different for FPs and FNs. Build an error function that takes these costs into account, and optimize for that. Cost also depends on the domain/system (not a problem) and on the dialog state.
Problem Formulation: (1) Develop a cost model that allows us to quantitatively assess the costs of FP and FN errors. (2) Use the costs to pick the optimal tradeoff point on the classifier ROC.
The Cost Model: Model the impact of the FPs and FNs on system performance. Identify a suitable performance metric P. Build a statistical regression model at the dialog session level: P = f(FPs, FNs); P = k + Cost_FP * FP + Cost_FN * FN (linear regression). Then we can plot f, and implicitly optimize for P.
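The session-level linear regression described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of per-session FP and FN rates, and the synthetic data (generated from made-up coefficients) are all assumptions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_cost_model(sessions):
    """Least-squares fit of P = k + cost_fp * FP + cost_fn * FN.

    `sessions` is a list of (fp, fn, p) tuples, one per dialog session.
    Returns (k, cost_fp, cost_fn).
    """
    rows = [[1.0, fp, fn] for fp, fn, _ in sessions]
    targets = [p for _, _, p in sessions]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Synthetic sessions (NOT real data) generated from the made-up model
# P = 0.9 - 0.2 * FP - 0.1 * FN, so the fit should recover those numbers.
synthetic = [(fp, fn, 0.9 - 0.2 * fp - 0.1 * fn)
             for fp in (0.0, 0.1, 0.2, 0.3) for fn in (0.0, 0.05, 0.15)]
k, cost_fp, cost_fn = fit_cost_model(synthetic)
```

With real data, each tuple would hold the FP and FN figures and the chosen performance metric P for one session, and the fitted coefficients would play the role of Cost_FP and Cost_FN.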
Measuring Performance: User satisfaction (5-point scale): hard to get; very subjective, so hard to make consistent across users. Concept transfer efficiency: CTC = correctly transferred concepts per turn; ITC = incorrectly transferred concepts per turn. Completion.
Detour: The Dataset. 134 dialogs (2561 utterances), collected using 4 scenarios. Satisfaction scores for only 35 dialogs. Corpus manually labeled at the concept level with 4 labels: OK / RBAD / PBAD / OOD. Aggregate utterance labels generated. Confidence annotator decisions logged. Computed counts of FPs, FNs, CTCs, ITCs for each session.
Example:
U: I want to fly from Pittsburgh to Boston
S: I want to fly from Pittsburgh to Austin
C: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]
Only 2 relevantly expressed concepts. If Accept: CTC = 1, ITC = 1. If Reject: CTC = 0, ITC = 0.
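The accept/reject bookkeeping in this example can be sketched as below. The `relevant` flag and the rule that any non-OK label on a relevant concept counts as incorrectly transferred are assumptions made to reproduce the numbers on the slide; the slides do not spell out the exact counting rules.

```python
def transfer_counts(concepts, accepted):
    """Return (CTC, ITC) for a single utterance.

    `concepts` is a list of (name, label, relevant) tuples. The `relevant`
    flag is a hypothetical field marking relevantly expressed concepts.
    A rejected utterance transfers nothing, so both counts are zero.
    """
    if not accepted:
        return 0, 0
    relevant_labels = [label for _, label, relevant in concepts if relevant]
    ctc = sum(1 for label in relevant_labels if label == "OK")
    # Assumption: every relevant concept that is not OK is transferred wrongly.
    itc = len(relevant_labels) - ctc
    return ctc, itc


# The utterance from the example: only the two location concepts are relevant.
utterance = [
    ("I_want", "OK", False),
    ("Depart_Loc", "OK", True),
    ("Arrive_Loc", "RBAD", True),
]
on_accept = transfer_counts(utterance, accepted=True)
on_reject = transfer_counts(utterance, accepted=False)
```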
Targeting Efficiency: Model 1. 3 successively refined models. CTC = FP + FN + TN + k, where CTC = correctly transferred concepts / turn and TN = true negatives. Results table (columns: Model, R² all, R² train, R² test; values not preserved): CTC = FP+FN+TN.
Targeting Efficiency: Model 2. CTC - ITC = (REC +) FP + FN + TN + k, where ITC = incorrectly transferred concepts / turn and REC = relevantly expressed concepts. Results table (columns: Model, R² all, R² train, R² test; values not preserved): CTC = FP+FN+TN; CTC-ITC = FP+FN+TN; CTC-ITC = REC+FP+FN+TN.
Targeting Efficiency: Model 3. CTC-ITC = REC+FPC+FPNC+FN+TN+k. 2 types of FPs: with concepts (FPC) and without concepts (FPNC). Results table (columns: Model, R² all, R² train, R² test; values not preserved): CTC = FP+FN+TN; CTC-ITC = FP+FN+TN; CTC-ITC = REC+FP+FN+TN; CTC-ITC = REC+FPC+FPNC+FN+TN.
Model 3 - Results: CTC-ITC = REC+FPC+FPNC+FN+TN+k. Coefficients: k = 0.41; C_REC = 0.62; C_FPNC, C_FPC, C_FN, C_TN (values not preserved).
Other models: Completion (binary): logistic regression model; the estimated model does not indicate a good fit. User satisfaction (5-point scale): based on only 35 dialogs; R² = 0.61 (similar to the literature, Walker et al.). Explanation: subjectivity of the metric + limited dataset.
Problem Formulation: (1) Develop a cost model that allows us to quantitatively assess the costs of FP and FN errors. (2) Use the costs to pick the optimal tradeoff point on the classifier ROC.
Tuning the Confidence Annotator: Using Model 3: CTC-ITC = REC+FPNC+FPC+FN+TN+k. Drop k & REC, plug in the values: Cost = 0.48*FPNC + 2.12*FPC + 1.33*FN + 0.56*TN. Minimize Cost instead of Classification Error Rate (FP+FN), and we'll implicitly maximize concept transfer efficiency.
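One way to picture this tuning step is a simple threshold sweep that scores each candidate threshold with the Cost expression above. The sketch below is illustrative only: the data layout, the function names, the "accept when score >= threshold" convention, and the toy utterances are assumptions, and only the four weights come from the slide.

```python
# Weights taken from the Cost expression on the slide.
COSTS = {"FPNC": 0.48, "FPC": 2.12, "FN": 1.33, "TN": 0.56}


def total_cost(utterances, threshold):
    """Weighted cost of running the annotator at a given threshold.

    `utterances` is a list of (score, is_ok, has_concepts) tuples
    (a hypothetical layout). An utterance is accepted when its
    confidence score is at or above the threshold.
    """
    counts = {"FPNC": 0, "FPC": 0, "FN": 0, "TN": 0}
    for score, is_ok, has_concepts in utterances:
        accepted = score >= threshold
        if accepted and not is_ok:
            # False acceptance, split by whether concepts were carried.
            counts["FPC" if has_concepts else "FPNC"] += 1
        elif not accepted and is_ok:
            counts["FN"] += 1  # false rejection
        elif not accepted and not is_ok:
            counts["TN"] += 1  # correct rejection
    return sum(COSTS[name] * n for name, n in counts.items())


def best_threshold(utterances, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(utterances, t))


# Toy utterances (NOT real data): (confidence score, truly OK?, has concepts?).
toy = [
    (0.9, True, True), (0.8, True, True), (0.6, False, True),
    (0.4, True, False), (0.3, False, False), (0.1, False, True),
]
chosen = best_threshold(toy, [0.0, 0.5, 0.7, 1.0])
```

On this toy set the sweep picks 0.7, where the only costs come from one false rejection and three correct rejections. Minimizing the total and minimizing the per-utterance average give the same threshold, since the number of utterances is fixed.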
Operating Characteristic
Further Analysis: Is CTC-ITC really modeling dialog performance? Mean = 0.71, Std. Dev. = 0.28. Mean for completed dialogs = 0.82. Mean for uncompleted dialogs = 0.57. The difference between the means is significant at a very high level of confidence: p-value = 7.23*10^-9 (t-test). So, it looks like CTC-ITC is okay, right?
Further Analysis (cont’d): Can we reliably extrapolate to other areas of the operating characteristic? Yes, look at the distribution of the FP and FN ratios across dialogs.
Further Analysis (cont’d): Impact of baseline error rate? Compared models constructed from high and low error rates: for the low error rate, the curve becomes monotonically increasing. This clearly indicates that “trust everything / have no confidence” is the way to go in this setting.
Our explanation so far… Ability to easily overwrite incorrectly captured information in the CMU Communicator. Relatively low error rates. Likelihood of repeated misrecognition is low.
Conclusion: Data-driven approach to quantitatively assess the costs of various types of misunderstandings. Models based on efficiency fit the data well; the obtained costs confirm intuition. For the CMU Communicator, the model predicts that total cost stays the same across a large range of the classifier's operating characteristic.
Further Experiments: But, of course, we can verify the predictions experimentally. Collect new data with the system running at a very low threshold. 55 dialogs collected so far. Thanks to those who have participated in these experiments. “Help if you have the time” to the others … Re-estimate the models, verify the predictions.
Confusion Matrix:
                   OK    BAD
System says OK     TP    FP
System says BAD    FN    TN
FP = False acceptance. FN = False detection/rejection. Fallout = FP/(FP+TN) = FP/NBAD. CDR = 1-Fallout = 1-(FP/NBAD).
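The two ratios defined on this slide translate directly into code. A small sketch, with hypothetical function names and made-up counts:

```python
def fallout(fp, tn):
    """Fallout = FP / (FP + TN), i.e. FP / NBAD.

    NBAD is the number of truly BAD utterances, which are either
    falsely accepted (FP) or correctly rejected (TN).
    """
    n_bad = fp + tn
    if n_bad == 0:
        raise ValueError("fallout is undefined when there are no BAD utterances")
    return fp / n_bad


def cdr(fp, tn):
    """CDR = 1 - Fallout."""
    return 1.0 - fallout(fp, tn)


# Made-up counts: 10 BAD utterances accepted, 30 BAD utterances rejected.
example_fallout = fallout(10, 30)
example_cdr = cdr(10, 30)
```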