Slide 1: "Sorry, I didn't catch that …" Non-understandings and Recovery in Spoken Dialog Systems
Part I: Issues, Data Collection, Rejection Tuning
Dan Bohus
Sphinx Lunch Talk, Carnegie Mellon University, March 2005
Slide 2: ASR Errors & Spoken Dialog
- Call RoomLine!
- Call Let's Go!
Slide 3: Non-understandings and Misunderstandings
Recognition errors can lead to two types of problems in a spoken dialog system:

NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]

MIS-understanding: the system extracts incorrect information from the user's turn.
  S: What city are you leaving from?
  U: Birmingham [BERLIN PM]
Slide 4: Non-understandings and Misunderstandings
NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]

How can we prevent non-understandings? How can we recover from them?
- Detection
- Set of recovery strategies
- Policy for choosing between them
Slide 5: Current State of Affairs
Detection / Diagnosis
- Systems know when a non-understanding happens: there was a detected user turn, but no meaningful information was extracted.
- Systems decide to reject because of low confidence.
- Not much exists in terms of diagnosis.
Set of recovery strategies
- Repeat the question
- "Can you repeat that?"
- "Sorry, I didn't catch that …"
Policy for choosing between them
- Traditionally, simple heuristics are used.
Slide 6: Questions Under Investigation
Detection / Diagnosis
- What are the main causes (sources) of non-understandings?
- What is their impact on global performance?
- Can we diagnose non-understandings at run-time?
- Can we optimize the rejection process in a more principled way?
Set of recovery strategies
- What is the relative performance of different recovery strategies?
- Can we refine current strategies and find new ones?
Policy for choosing between them
- Can we improve performance by making smarter choices?
- If so, can we learn how to make these smarter choices?
Next: the data collection experiment.
Slide 7: Data Collection: Experimental Design
- Subjects interacted over the telephone with RoomLine.
- Each performed 10 scenario-based tasks.
- Between-subjects experiment, 2 groups:
  - Control: the system uses a random (uniform) policy for engaging the non-understanding recovery strategies (a minimal sketch of this policy follows the strategy list on the next slide).
  - Wizard: the policy is determined at runtime by a human wizard.
- 46 subjects, balanced across gender and native / non-native speakers.
Slide 8: Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
1. MOVE-ON: Sorry, I didn't catch that. For which day do you need the room?
2. YOU CAN SAY (YCS): Sorry, I didn't catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
3. TERSE YOU CAN SAY (TYCS): Sorry, I didn't catch that. You can say something like tomorrow at 10 am …
4. FULL HELP (HELP): Sorry, I didn't catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
5. ASK REPEAT (AREP): Could you please repeat that?
6. ASK REPHRASE (ARPH): Could you please try to rephrase that?
7. NOTIFY (NTFY): Sorry, I didn't catch that.
8. YIELD TURN (YLD): …
9. REPROMPT (RP): For when do you need the conference room?
10. DETAILED REPROMPT (DRP): Right now I need to know the date and time for when you need the reservation …
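The control condition's uniform random policy (slide 7) reduces to a one-liner over these ten strategies. A minimal sketch; the abbreviations are taken from the slide, and nothing here is claimed about how RoomLine actually implements it:

```python
import random

# The ten recovery strategies listed above, by their slide abbreviations.
STRATEGIES = ["MOVE-ON", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "RP", "DRP"]

def control_policy() -> str:
    """Control condition: choose a recovery strategy uniformly at random."""
    return random.choice(STRATEGIES)
```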
Slide 9: Experimental Design: Scenarios
- 10 scenarios, in fixed order.
- Presented graphically (explained during briefing).
Slide 10: Experimental Design: Evaluation
- Participants filled in a SASSI evaluation questionnaire: 35 questions on a 1-7 Likert scale, grouped into 6 factors: response accuracy, likeability, cognitive demand, annoyance, habitability, speed.
- Overall user satisfaction score: 1-7.
- Open-ended questions: What did you like best / least? What would you change first?
Slide 11: Corpus Statistics / Characteristics
- 46 users; 449 sessions; 8278 user turns.
- User utterances transcribed & checked.
- Annotated with:
  - Concept transfer & misunderstandings: correctly transferred, incorrectly transferred, deleted, and substituted concepts; correct concept values at each turn.
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID, PART.
  - Corrections.
  - User response to non-understanding recovery: Repeat, Rephrase, Contradict, Change, Other.
Slide 12: Corpus
Slide 13: General Corpus Statistics

                                Total     Natives   Non-Natives
# Users                         46
# Sessions                      449
# User turns                    8278
Average turn length (# words)
WER                             25.61%    19.60%    39.54%
CER                             35.73%    26.30%    57.58%
OOV rate                        0.35%     0.33%     0.39%
% Non-understandings            16.96%    13.38%    25.25%
% Misunderstandings             13.53%    9.67%     22.48%
Task success                    75.06%    85.21%    44.14%
User satisfaction (1-7 Likert)

[remaining cell values appeared on the slide; not captured in this transcript]
Slide 14: Back to the Issues
Data Collection
Detection / Diagnosis
- What are the main causes (sources) of non-understandings?
- What is their impact on global performance?
- Can we diagnose non-understandings at run-time?
- Can we optimize the rejection process in a more principled way?
Set of recovery strategies
- What is the relative performance of different recovery strategies?
- Can we refine current strategies and find new ones?
Policy for choosing between them
- Can we improve performance by making smarter choices?
- If so, can we learn how to make these smarter choices?
Slide 15: Next …
(the same outline as slide 14; next up: "Can we optimize the rejection process in a more principled way?")
Slide 16: Utterance Rejection
- Systems use confidence scores to assess the reliability of inputs.
- A widely used design pattern: if confidence is very low (i.e., below a certain threshold), reject the utterance altogether.
- Non-understandings = genuine non-understandings + rejections.
- This creates a tradeoff between non-understandings and misunderstandings.
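A minimal sketch of this design pattern, assuming a hypothetical parse() stand-in and a placeholder threshold on a 0-1 confidence scale (the actual scale and value are system-specific):

```python
REJECT_THRESHOLD = 0.30  # placeholder; choosing this value is the subject of the next slides

def parse(hypothesis: str) -> dict:
    """Stand-in for the system's concept extraction."""
    return {"concepts": hypothesis.split()}

def handle_turn(hypothesis: str, confidence: float):
    """Reject low-confidence recognitions rather than risk a misunderstanding."""
    if confidence < REJECT_THRESHOLD:
        return None  # counted as a non-understanding; a recovery strategy fires
    return parse(hypothesis)
```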
Slide 17: Non- / Misunderstanding Tradeoff
[Figure: the non-understanding vs. misunderstanding tradeoff as a function of the rejection threshold]
Slide 18: An Alternative, More Informative View
[Figure: number of concepts transferred correctly (CTC) or incorrectly (ITC), as a function of the rejection threshold]
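These curves can be computed directly from the annotated corpus by sweeping the threshold. A sketch with synthetic per-turn data (the confidence scale and counts are made up):

```python
import numpy as np

# One row per user turn: (ASR confidence, concepts transferred correctly,
# concepts transferred incorrectly). Synthetic values for illustration.
turns = np.array([
    (0.90, 2, 0),
    (0.70, 1, 0),
    (0.55, 1, 1),
    (0.20, 0, 2),
])

def ctc_itc(threshold: float) -> tuple[int, int]:
    """CTC(th) and ITC(th): concept counts when turns below `threshold` are rejected."""
    accepted = turns[turns[:, 0] >= threshold]
    return int(accepted[:, 1].sum()), int(accepted[:, 2].sum())

for th in (0.0, 0.5, 0.8):
    print(th, ctc_itc(th))
```

Raising the threshold trades incorrectly transferred concepts (fewer misunderstandings) for correctly transferred ones lost to rejection.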
Slide 19: Current Solutions
- Set the threshold the way the ASR manual says. In all likelihood there is a mismatch: the vendor's confidence optimization probably targets WER, while the tradeoff between misunderstandings and rejections varies across domains, and even across dialog states.
- Go for the break-even point.
- Acknowledge the tradeoff, and solve it by postulating costs (e.g., "misunderstandings cost twice as much as rejections").
Slide 20: Proposed Approach
Use a data-driven approach to establish the costs, then optimize the threshold:
1. Identify a set of variables involved in the tradeoff: CTC(th) vs. ITC(th).
2. Choose a dialog performance metric: TC = task completion (binary, kappa); TD = task duration (# turns); US = user satisfaction.
3. Build a regression model:
   logit(TC) <- c_0 + c_CTC * CTC + c_ITC * ITC
4. Optimize the threshold to maximize performance:
   th* = argmax_th ( c_CTC * CTC(th) + c_ITC * ITC(th) )
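A sketch of steps 3 and 4, reusing the ctc_itc() sweep from the previous sketch. All numbers are synthetic, and statsmodels is just one convenient way to fit the logit; nothing here reflects the actual corpus:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-session concept counts and binary task completion.
CTC = np.array([12, 8, 15, 3, 10, 2, 14, 6])
ITC = np.array([1, 4, 0, 6, 2, 7, 1, 5])
TC  = np.array([1, 0, 1, 0, 1, 1, 1, 0])

# Step 3: logit(TC) <- c_0 + c_CTC * CTC + c_ITC * ITC
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.Logit(TC, X).fit(disp=0)
c0, c_ctc, c_itc = fit.params  # expect c_ctc > 0 (benefit), c_itc < 0 (cost)

# Step 4: th* = argmax_th ( c_CTC * CTC(th) + c_ITC * ITC(th) )
turns = np.array([(0.90, 2, 0), (0.70, 1, 0), (0.55, 1, 1), (0.20, 0, 2)])

def utility(th: float) -> float:
    accepted = turns[turns[:, 0] >= th]
    return c_ctc * accepted[:, 1].sum() + c_itc * accepted[:, 2].sum()

th_star = max(np.linspace(0.0, 1.0, 101), key=utility)
print(f"th* = {th_star:.2f}")
```

The fitted coefficients play the role of the "postulated" costs from slide 19, but are now estimated from data.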
Slide 21: State-Specific Costs & Thresholds
- The costs are potentially different at different points in the dialog.
- Count CTC and ITC at different states with different variables:
  logit(TC) <- c_0 + c_CTC,s1 * CTC_s1 + c_ITC,s1 * ITC_s1
                   + c_CTC,s2 * CTC_s2 + c_ITC,s2 * ITC_s2
                   + c_CTC,s3 * CTC_s3 + c_ITC,s3 * ITC_s3 + …
- Optimize a separate threshold for each state:
  th*_x = argmax_th ( c_CTC,x * CTC_x(th) + c_ITC,x * ITC_x(th) )
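The same machinery applies per state. In the sketch below the coefficients stand in for a state-specific logistic fit, and the turn data is randomly generated; all values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-state (c_CTC, c_ITC) pairs from the state-specific fit.
costs = {
    "open_request":    (+0.9, -1.5),
    "request_bool":    (+0.4, -0.6),
    "request_nonbool": (+0.7, -2.1),
}

# Synthetic per-state turns: (confidence, #correct concepts, #incorrect concepts).
turns_by_state = {
    s: np.column_stack([rng.random(50), rng.poisson(1.0, 50), rng.poisson(0.5, 50)])
    for s in costs
}

def ctc_itc(turns: np.ndarray, th: float):
    accepted = turns[turns[:, 0] >= th]
    return accepted[:, 1].sum(), accepted[:, 2].sum()

# One threshold per state: th*_x = argmax_th ( c_CTC,x*CTC_x(th) + c_ITC,x*ITC_x(th) )
thresholds = {}
for s, (c_ctc, c_itc) in costs.items():
    def u(th, turns=turns_by_state[s], a=c_ctc, b=c_itc):
        ctc, itc = ctc_itc(turns, th)
        return a * ctc + b * itc
    thresholds[s] = max(np.linspace(0.0, 1.0, 101), key=u)

print(thresholds)
```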
Slide 22: States Considered
- Open request: "How may I help you?"
- Request(bool): "Did you want a reservation for this room?"
- Request(non-bool): "Starting at what time do you need the room?"
- Finer granularity is desirable, and can be achieved given more data.
Slide 23: Model 1: Resulting Fit and Coefficients

Variable              Coeff   p   se
Const
CTC / oreq
ITC / oreq
CTC / req(bool)
ITC / req(bool)
CTC / req(non-bool)
ITC / req(non-bool)
[coefficient values appeared in the slide table; not captured in this transcript]

          Baseline   Train    Cross-V
AVG-LL
HARD      17.62%     11.66%   11.75%
Slide 24: Model 1: Threshold Optimization
[Figure: fitted utility vs. rejection threshold, one panel each for Open-request, Request(bool), and Request(non-bool)]
Optimized thresholds:
- Open-request: 0.00
- Req(bool): 0.00
- Req(non-bool): 61.00
Slide 25: Results Do Confirm Expectations
- Anecdotal evidence from the data collection indicated too many false rejections on open requests.
- The data analysis confirms this view.
Slide 26: What Would Change?

                          Current   New Estimate   Delta
Open-request CTC
Open-request ITC
Request(bool) CTC
Request(bool) ITC
Request(non-bool) CTC
Request(non-bool) ITC
[per-state values appeared on the slide; not captured in this transcript]

               Current   New Estimate   Delta
Task success   82.75%    87.16%         +4.41%

Remains to be seen …
Slide 27: Model 2: Description
- Global performance metric: task duration (successful tasks), TD, measured in # turns (a Poisson variable).
- Generalized linear model / Poisson:
  log(TD) <- c_0 + c_CTC * CTC + c_ITC * ITC
- But different tasks have different durations, so one would want to normalize by the typical duration TD̄_x of each task x:
  log(TD_x / TD̄_x) <- c_0 + c_CTC * CTC + c_ITC * ITC
- Instead, use regression offsets (coefficient fixed to 1):
  log(TD_x) <- 1 * log(TD̄_x) + c_0 + c_CTC * CTC + c_ITC * ITC
- Tradeoff variables: same as before.
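A sketch of Model 2 using statsmodels' Poisson GLM, with the log of a per-task typical duration as the offset; the durations and counts below are synthetic placeholders:

```python
import numpy as np
import statsmodels.api as sm

TD = np.array([14, 22, 11, 30, 16, 25])           # observed task duration, in turns
typical_TD = np.array([12, 12, 10, 18, 14, 18])   # typical duration of each task (TD̄_x)
CTC = np.array([8, 4, 9, 2, 7, 3])
ITC = np.array([0, 3, 0, 5, 1, 4])

# log(TD_x) <- 1 * log(TD̄_x) + c_0 + c_CTC * CTC + c_ITC * ITC
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.GLM(TD, X, family=sm.families.Poisson(),
             offset=np.log(typical_TD)).fit()
print(fit.params)  # expect c_CTC < 0 (shorter dialogs), c_ITC > 0 (longer dialogs)
```

Passing log(TD̄_x) as an offset fixes its coefficient at 1, which is exactly the normalization the slide describes.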
Slides 28-29: Model 2: Resulting Fit and Coefficients

Variable              Coeff   p   se
Const
CTC / oreq
ITC / oreq
CTC / req(bool)
ITC / req(bool)
CTC / req(non-bool)
ITC / req(non-bool)
[coefficient values appeared in the slide table; not captured in this transcript]

R^2 = 0.56
Slide 31: Conclusion
- A model for tuning rejection that is genuinely data-driven.
- Relates state-specific costs of rejection to global dialog performance.
- Bridges the mismatch between an off-the-shelf confidence annotation scheme and the particular characteristics of the system's domain.
- More data would allow even finer-grained distinctions.
- The expected performance improvements remain to be verified.
Slide 32: Next Time …
(the outline from slide 14 again; still open are the recovery-strategy questions and the policy questions)