a principled approach for rejection threshold optimization
Dan Bohus (www.cs.cmu.edu/~dbohus)
Alexander I. Rudnicky (www.cs.cmu.edu/~air)
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 15217
2 understanding errors and rejection
- systems often misunderstand
- use confidence scores
- common design pattern: compare the input confidence against a threshold; reject the utterance if the confidence is too low
- may lead to false rejections
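The design pattern on this slide can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the function names and the four outcome labels are assumptions introduced here, and the only thing taken from the slide is the rule "reject if confidence is below the threshold".

```python
def should_reject(confidence: float, threshold: float) -> bool:
    """Reject the utterance when its confidence falls below the threshold."""
    return confidence < threshold


def classify_outcome(confidence: float, is_correct: bool, threshold: float) -> str:
    """Label one utterance with the outcome of the accept/reject decision.

    `is_correct` is a hypothetical ground-truth label saying whether the
    recognized input was actually understood correctly.
    """
    if should_reject(confidence, threshold):
        # Rejecting a correctly understood input is a false rejection.
        return "false rejection" if is_correct else "correct rejection"
    # Accepting an incorrectly understood input is a misunderstanding.
    return "correct acceptance" if is_correct else "misunderstanding"
```

Raising the threshold moves errors from the "misunderstanding" bucket into the "false rejection" bucket, which is the tradeoff the following slides examine.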
rejection tradeoff
[figure: misunderstandings vs. false rejections as the rejection threshold varies]
rejection tradeoff
[figure: misunderstandings vs. false rejections, and correctly vs. incorrectly transferred concepts per turn, as the rejection threshold varies]
5 question
given this trade-off, how can we optimize the rejection threshold in a principled fashion?
6 outline
- current solutions
- proposed approach
- data
- results
- conclusion
7 current solutions
- follow the ASR manual [Nuance documentation]
- acknowledge the tradeoff and postulate costs: “misunderstandings are X times more costly than false rejections” [Raymond et al., 2004; Kawahara et al., 2000; Cuayahuitl et al., 2002]
- costs are likely to differ
  - across domains / systems
  - across dialog states within a system
8 proposed approach
derive costs in a principled fashion
1. identify a set of variables involved in the tradeoff: correctly and incorrectly transferred concepts per turn (CTC, ITC)
2. choose a dialog performance metric: task completion (binary, kappa), TC
3. build a regression model: $\mathrm{logit}(TC) \leftarrow C_0 + C_{CTC} \cdot CTC + C_{ITC} \cdot ITC$
4. optimize the threshold to maximize performance: $th^* = \arg\max_{th} \left( C_{CTC} \cdot CTC + C_{ITC} \cdot ITC \right)$
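Step 4 can be sketched as a simple sweep over candidate thresholds. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Turn` layout, the function names, the candidate grid, and the toy data are all hypothetical, and a turn is assumed to transfer its concepts only when its confidence is at or above the threshold.

```python
from typing import Sequence, Tuple

# One user turn: (confidence, correct concepts, incorrect concepts).
# This tuple layout is an assumption made for the sketch.
Turn = Tuple[float, int, int]


def concepts_per_turn(turns: Sequence[Turn], threshold: float) -> Tuple[float, float]:
    """Return (CTC, ITC) per turn when turns below the threshold are rejected.

    Assumes `turns` is non-empty. Rejected turns transfer no concepts.
    """
    accepted = [t for t in turns if t[0] >= threshold]
    ctc = sum(t[1] for t in accepted) / len(turns)
    itc = sum(t[2] for t in accepted) / len(turns)
    return ctc, itc


def best_threshold(
    turns: Sequence[Turn],
    c_ctc: float,
    c_itc: float,
    candidates: Sequence[float],
) -> float:
    """Pick the candidate threshold maximizing C_CTC * CTC + C_ITC * ITC."""

    def utility(threshold: float) -> float:
        ctc, itc = concepts_per_turn(turns, threshold)
        return c_ctc * ctc + c_itc * itc

    # max() returns the first candidate on ties.
    return max(candidates, key=utility)


# Toy data, invented for illustration only.
toy_turns = [(0.9, 2, 0), (0.6, 1, 0), (0.3, 0, 1), (0.1, 0, 2)]
toy_best = best_threshold(toy_turns, c_ctc=1.0, c_itc=-1.0,
                          candidates=[0.0, 0.2, 0.5, 0.8])
```

The intercept C_0 is left out of the utility because it does not depend on the threshold and so does not change the argmax. The cost of incorrectly transferred concepts enters as a negative `c_itc`.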
9 state-specific costs
- costs are different in different dialog states
- compute CTC and ITC on a per-state basis:
  $\mathrm{logit}(TC) \leftarrow C_0 + C_{CTC,state_1} \cdot CTC_{state_1} + C_{ITC,state_1} \cdot ITC_{state_1} + C_{CTC,state_2} \cdot CTC_{state_2} + C_{ITC,state_2} \cdot ITC_{state_2} + C_{CTC,state_3} \cdot CTC_{state_3} + C_{ITC,state_3} \cdot ITC_{state_3} + \dots$
- optimize a separate threshold for each state:
  $th^*_{state_x} = \arg\max \left( C_{CTC,state_x} \cdot CTC_{state_x} + C_{ITC,state_x} \cdot ITC_{state_x} \right)$
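The per-state optimization can be sketched by grouping turns by dialog state and sweeping the threshold separately for each group. This is an illustrative sketch only: the tuple layout, function names, candidate grid, and toy turns are assumptions, while the two coefficient pairs are the ones quoted in the utility functions on the results slides.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# One user turn: (state, confidence, correct concepts, incorrect concepts).
# This layout is an assumption made for the sketch.
StateTurn = Tuple[str, float, int, int]


def state_utility(turns: Sequence[StateTurn], threshold: float,
                  c_ctc: float, c_itc: float) -> float:
    """Utility C_CTC * CTC + C_ITC * ITC for the turns of a single state."""
    accepted = [t for t in turns if t[1] >= threshold]
    ctc = sum(t[2] for t in accepted) / len(turns)
    itc = sum(t[3] for t in accepted) / len(turns)
    return c_ctc * ctc + c_itc * itc


def per_state_thresholds(
    turns: Sequence[StateTurn],
    costs: Dict[str, Tuple[float, float]],
    candidates: Sequence[float],
) -> Dict[str, float]:
    """Optimize one rejection threshold per dialog state."""
    by_state: Dict[str, List[StateTurn]] = defaultdict(list)
    for turn in turns:
        by_state[turn[0]].append(turn)

    best: Dict[str, float] = {}
    for state, state_turns in by_state.items():
        c_ctc, c_itc = costs[state]
        best[state] = max(
            candidates,
            key=lambda th: state_utility(state_turns, th, c_ctc, c_itc),
        )
    return best


# Toy turns, invented for illustration; both states see identical inputs.
toy_turns = [
    (state, conf, good, bad)
    for state in ("open-request", "request(non-bool)")
    for conf, good, bad in [(0.8, 1, 0), (0.4, 1, 1), (0.2, 0, 1)]
]
toy_costs = {
    "open-request": (0.55, -0.40),
    "request(non-bool)": (2.55, -3.44),
}
toy_thresholds = per_state_thresholds(toy_turns, toy_costs, [0.0, 0.3, 0.6])
```

On this toy input the state with the more expensive incorrect concepts ends up with the higher threshold, even though both states receive the same turns.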
11 data
- collected using RoomLine: a phone-based, mixed-initiative spoken dialog system for conference room reservations
- sphinx-2
- utterance-level confidence annotator [0-1]
- 46 participants (first-time users)
- 10 scenario-driven interactions
- corpus: 449 dialog sessions, 8278 user turns
- decoded concept “correctness” labeled manually
12 roomline states
- 71 “dialog states” in total, clustered into 3 classes
- open-request: “How may I help you?”
- request(bool): “Would you like a reservation for this room?”, “Would you like a room with a projector?”
- request(non-bool): “For what time would you like to reserve the room?”
13 results: task success model
- model predicting binary task success
- HARD: 17.62% (Baseline), 11.66% (Train), 11.75% (Cross-V)
- AVG-LL: [values not recoverable]
- [table: cost coefficients with columns Variable, Coeff, p, se for Const, CTC / open-request, ITC / open-request, CTC / request(bool), ITC / request(bool), CTC / request(non-bool), ITC / request(non-bool); values not recoverable]
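A model of this shape, logit(TC) as a linear function of CTC and ITC, can be fit with ordinary logistic regression. The sketch below uses plain batch gradient descent in pure Python. That fitting procedure, the function names, the learning rate, the epoch count, and the toy sessions are all assumptions for illustration; the slides do not say how the model was estimated.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> List[float]:
    """Fit logit(y) = w0 + w1*x1 + ... by batch gradient descent on log-loss.

    Returns [w0, w1, ...], where w0 plays the role of the constant C_0.
    """
    n_features = len(features[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(features, labels):
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            err = sigmoid(z) - y  # derivative of log-loss w.r.t. z
            grads[0] += err
            for j, xi in enumerate(x):
                grads[j + 1] += err * xi
        weights = [w - lr * g / len(features) for w, g in zip(weights, grads)]
    return weights


# Toy sessions, invented for illustration: (CTC, ITC) per session and
# whether the task succeeded (1) or not (0).
toy_features = [(1.0, 0.1), (0.9, 0.2), (0.8, 0.1),
                (0.3, 0.6), (0.2, 0.7), (0.4, 0.5)]
toy_labels = [1, 1, 1, 0, 0, 0]
c_0, c_ctc, c_itc = fit_logistic(toy_features, toy_labels)
```

On data like this the fitted CTC coefficient comes out positive and the ITC coefficient negative. The state-specific model would use one (CTC, ITC) pair of features per dialog state instead of a single pair.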
14 results: threshold optimization
- open-request: utility = 0.55 x CTC - 0.40 x ITC
- [figure: correctly and incorrectly transferred concepts per turn for open-request]
15 results: threshold optimization
- open-request: utility = 0.55 x CTC - 0.40 x ITC
- request(bool): utility = 3.31 x CTC - 0.60 x ITC
- request(non-bool): utility = 2.55 x CTC - 3.44 x ITC
- [figure: correctly and incorrectly transferred concepts per turn for each state]
- utility profiles are different across the three states
- task duration models lead to similar results
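One way to read the three utility functions is to ask how many correctly transferred concepts are needed to offset one incorrectly transferred concept in each state. The sketch below computes that ratio from the coefficients on the slide; the function names and the "break-even ratio" framing are assumptions added for illustration.

```python
from typing import Dict, Tuple

# (C_CTC, C_ITC) per state class, taken from the utility functions above.
UTILITY_COEFFS: Dict[str, Tuple[float, float]] = {
    "open-request": (0.55, -0.40),
    "request(bool)": (3.31, -0.60),
    "request(non-bool)": (2.55, -3.44),
}


def utility(state: str, ctc: float, itc: float) -> float:
    """Utility of a (CTC, ITC) operating point in the given state class."""
    c_ctc, c_itc = UTILITY_COEFFS[state]
    return c_ctc * ctc + c_itc * itc


def breakeven_ratio(state: str) -> float:
    """Correct concepts needed to offset one incorrect concept."""
    c_ctc, c_itc = UTILITY_COEFFS[state]
    return -c_itc / c_ctc


ratios = {state: breakeven_ratio(state) for state in UTILITY_COEFFS}
```

The ratio is below 1 for open-request and request(bool) and above 1 for request(non-bool), which is one concrete sense in which the utility profiles differ across the three states.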
16 conclusion
- principled method for optimizing the rejection threshold
- determines costs for various types of understanding errors
- data-driven approach
- can derive state-specific costs
- bridges mismatches between off-the-shelf confidence annotators and the domain
17 thank you
18 fit for task success model
19 expected changes in task success
- [table: Current, New Estimate, and Delta for CTC and ITC in Open-request, Request bool, and Request non-bool; values not recoverable]
- Task success: 82.75% (Current), 87.16% (New Estimate), +4.41% (Delta)
- Remains to be seen …
20 task duration model
[table: columns Variable, Coeff, p, se for Const, CTC / oreq, ITC / oreq, CTC / req(bool), ITC / req(bool), CTC / req(non-bool), ITC / req(non-bool); values not recoverable]
21 Model 2: Resulting fit and coefficients
R^2 = 0.56