Slide 1: "Sorry, I didn't catch that ..." Non-understandings and Recovery in Spoken Dialog Systems
Part I: Issues, Data Collection, Rejection Tuning
Dan Bohus
Sphinx Lunch Talk, Carnegie Mellon University, March 2005
Slide 2: ASR Errors & Spoken Dialog
Call RoomLine! 1-412-268-1084
Call Let's Go! 1-412-268-1185
Slide 3: Non-understandings and Misunderstandings
Recognition errors can lead to two types of problems in a spoken dialog system:
- NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [recognized: OKAY IN THAT SAME PAY]
- MIS-understanding: the system extracts incorrect information from the user's turn.
  S: What city are you leaving from?
  U: Birmingham [recognized: BERLIN PM]
Slide 4: Non-understandings and Misunderstandings
- NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [recognized: OKAY IN THAT SAME PAY]
How can we prevent non-understandings? How can we recover from them?
- Detection
- Set of recovery strategies
- Policy for choosing between them
Slide 5: Current State of Affairs
Detection / Diagnosis
- Systems know when a non-understanding happens: either there was a detected user turn but no meaningful information, or the system decides to reject because of low confidence.
- Not much exists in terms of diagnosis.
Set of recovery strategies
- Repeat the question; "Can you repeat that?"; "Sorry, I didn't catch that ..."
Policy for choosing between them
- Traditionally, simple heuristics are used.
Slide 6: Questions Under Investigation
Detection / Diagnosis
- What are the main causes (sources) of non-understandings?
- What is their impact on global performance?
- Can we diagnose non-understandings at run-time?
- Can we optimize the rejection process in a more principled way?
Set of recovery strategies
- What is the relative performance of different recovery strategies?
- Can we refine current strategies and find new ones?
Policy for choosing between them
- Can we improve performance by making smarter choices?
- If so, can we learn how to make these smarter choices?
Next: the data collection experiment.
Slide 7: Data Collection: Experimental Design
- Subjects interacted over the telephone with RoomLine.
- Each performed 10 scenario-based tasks.
- Between-subjects experiment with 2 groups:
  - Control: the system uses a random (uniform) policy for engaging the non-understanding recovery strategies (a minimal sketch of this policy appears after the strategy list below).
  - Wizard: the policy is determined at runtime by a human wizard.
- 46 subjects, balanced by gender and native / non-native status.
Slide 8: Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
1. MOVE-ON: "Sorry, I didn't catch that. For which day do you need the room?"
2. YOU CAN SAY (YCS): "Sorry, I didn't catch that. For when do you need the conference room? You can say something like tomorrow at 10 am ..."
3. TERSE YOU CAN SAY (TYCS): "Sorry, I didn't catch that. You can say something like tomorrow at 10 am ..."
4. FULL HELP (HELP): "Sorry, I didn't catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am ..."
5. ASK REPEAT (AREP): "Could you please repeat that?"
6. ASK REPHRASE (ARPH): "Could you please try to rephrase that?"
7. NOTIFY (NTFY): "Sorry, I didn't catch that ..."
8. YIELD TURN (YLD): ...
9. REPROMPT (RP): "For when do you need the conference room?"
10. DETAILED REPROMPT (DRP): "Right now I need to know the date and time for when you need the reservation ..."
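For concreteness, here is a minimal Python sketch of the control condition's uniform policy over the ten strategies above. The strategy abbreviations come from the slide; everything else (function name, use of Python's random module) is illustrative, not the RoomLine implementation.

```python
import random

# The ten recovery strategies from the slide, keyed by their abbreviations.
STRATEGIES = [
    "MOVE-ON", "YCS", "TYCS", "HELP", "AREP",
    "ARPH", "NTFY", "YLD", "RP", "DRP",
]

def control_policy():
    """Control condition: pick a recovery strategy uniformly at random
    whenever a non-understanding occurs."""
    return random.choice(STRATEGIES)
```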
Slide 9: Experimental Design: Scenarios
- 10 scenarios, presented in a fixed order.
- Presented graphically (explained during briefing).
Slide 10: Experimental Design: Evaluation
- Participants filled in a SASSI evaluation questionnaire: 35 questions on a 1-7 Likert scale, covering 6 factors: response accuracy, likeability, cognitive demand, annoyance, habitability, speed.
- Overall user satisfaction score: 1-7.
- Open-ended questions: What did you like best / least? What would you change first?
Slide 11: Corpus Statistics / Characteristics
- 46 users; 449 sessions; 8278 user turns.
- User utterances transcribed & checked.
- Annotated with:
  - Concept transfer & misunderstandings: correctly transferred, incorrectly transferred, deleted, and substituted concepts; correct concept values at each turn.
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID, PART.
  - Corrections.
  - User responses to non-understanding recovery: Repeat, Rephrase, Contradict, Change, Other.
Slide 12: Corpus
Slide 13: General Corpus Statistics

                                  Total     Natives   Non-Natives
# Users                           46        34        12
# Sessions                        449       338       111
# User turns                      8278      5783      2495
Average turn length (# words)     3.05      2.91      3.22
WER                               25.61%    19.60%    39.54%
CER                               35.73%    26.30%    57.58%
OOV rate                          0.35%     0.33%     0.39%
% Non-understandings              16.96%    13.38%    25.25%
% Misunderstandings               13.53%    9.67%     22.48%
Task success                      75.06%    85.21%    44.14%
User satisfaction (1-7 Likert)    3.93      4.38      2.67
Slide 14: Back to the Issues
(Recap of the questions under investigation from Slide 6, with the data collection experiment now in place.)

Slide 15: Next ...
(Same outline; the talk now turns to optimizing the rejection process in a more principled way.)
Slide 16: Utterance Rejection
- Systems use confidence scores to assess the reliability of their inputs.
- A widely used design pattern: if confidence is very low (i.e., below a certain threshold), reject the utterance altogether.
- Non-understandings then comprise genuine non-understandings plus rejections.
- This creates a tradeoff between non-understandings and misunderstandings.
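A rough sketch of this design pattern follows; the threshold value and function shape are assumptions for illustration, not the RoomLine implementation.

```python
def interpret(hypothesis, confidence, threshold=0.30):
    """Widely used rejection pattern: treat low-confidence input as a
    non-understanding rather than risk acting on a misunderstanding."""
    if confidence < threshold:
        return None    # reject: counted as a non-understanding
    return hypothesis  # accept: concepts are transferred to the dialog state
```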
Slide 17: Non- / Mis-understanding Tradeoff
[Figure: the non-understanding vs. mis-understanding tradeoff, as a function of the rejection threshold.]
Slide 18: An Alternative, More Informative View
[Figure: number of concepts transferred correctly (CTC) or incorrectly (ITC), as a function of the rejection threshold.]
Slide 19: Current Solutions
- Set the threshold as the ASR manual says. In all likelihood this is a mismatch: ASR confidence is probably optimized for WER, and the tradeoff between misunderstandings and rejections probably varies across domains, and even across dialog states.
- Go for the break-even point.
- Acknowledge the tradeoff and solve it by postulating costs, e.g. misunderstandings cost twice as much as rejections.
Slide 20: Proposed Approach
Use a data-driven approach to establish the costs, then optimize the threshold:
1. Identify a set of variables involved in the tradeoff: CTC(th) vs. ITC(th).
2. Choose a dialog performance metric: TC, task completion (binary, kappa); TD, task duration (# turns); US, user satisfaction.
3. Build a regression model m:
   logit(TC) <- C_0 + C_CTC * CTC + C_ITC * ITC
4. Optimize the threshold to maximize performance:
   th* = argmax_th (C_CTC * CTC(th) + C_ITC * ITC(th))
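A minimal Python sketch of steps 3 and 4 using statsmodels; the per-session counts are made up, and the ctc_at / itc_at helpers (which would re-score the corpus at each candidate threshold) are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative per-session data: concepts transferred correctly (CTC),
# incorrectly (ITC), and binary task completion (TC).
CTC = np.array([12, 9, 8, 8, 4, 11, 7, 5])
ITC = np.array([1, 3, 2, 2, 5, 2, 4, 3])
TC  = np.array([1, 1, 1, 0, 0, 1, 0, 0])

# Step 3: logit(TC) <- C_0 + C_CTC*CTC + C_ITC*ITC
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.Logit(TC, X).fit(disp=False)
c0, c_ctc, c_itc = fit.params

# Step 4: th* = argmax_th (C_CTC*CTC(th) + C_ITC*ITC(th)).
# ctc_at(th) / itc_at(th) would be measured by re-running understanding
# over the corpus with rejection threshold th (hypothetical helpers).
def optimal_threshold(thresholds, ctc_at, itc_at):
    return max(thresholds,
               key=lambda th: c_ctc * ctc_at(th) + c_itc * itc_at(th))
```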
Slide 21: State-Specific Costs & Thresholds
- The costs are potentially different at different points in the dialog.
- Count CTC and ITC at different states, with different variables:
  logit(TC) <- C_0 + C_CTC,state1 * CTC_state1 + C_ITC,state1 * ITC_state1
                   + C_CTC,state2 * CTC_state2 + C_ITC,state2 * ITC_state2
                   + C_CTC,state3 * CTC_state3 + C_ITC,state3 * ITC_state3 + ...
- Optimize a separate threshold for each state:
  th*_state_x = argmax_th (C_CTC,state_x * CTC_state_x(th) + C_ITC,state_x * ITC_state_x(th))
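One way to realize the state-specific variables is to count CTC and ITC separately per dialog state when building each session's regression row. This sketch assumes a simple (state, n_correct, n_incorrect) per-turn record, which is an illustrative format, not the corpus's actual annotation scheme.

```python
STATES = ("open_request", "request_bool", "request_nonbool")

def session_row(turns):
    """One regression row per session: CTC and ITC counted separately
    for each dialog state, matching the expanded logit model above."""
    counts = {(v, s): 0 for s in STATES for v in ("ctc", "itc")}
    for state, n_correct, n_incorrect in turns:
        counts[("ctc", state)] += n_correct
        counts[("itc", state)] += n_incorrect
    return [counts[(v, s)] for s in STATES for v in ("ctc", "itc")]
```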
Slide 22: States Considered
- Open request: "How may I help you?"
- Request (boolean): "Did you want a reservation for this room?"
- Request (non-boolean): "Starting at what time do you need the room?"
Finer granularity is desirable and can be achieved given more data.
Slide 23: Model 1: Resulting Fit and Coefficients

Variable               Coeff      p         se
Const                  -2.3442    0.0416    1.1504
CTC / oreq             0.5518     0.0619    0.2955
ITC / oreq             -0.4067    0.3801    0.4634
CTC / req(bool)        3.3127     0.0010    1.0076
ITC / req(bool)        -0.5959    0.6491    1.3098
CTC / req(non-bool)    2.5514     0.0017    0.8137
ITC / req(non-bool)    -3.4410    0.0018    1.1046

          Baseline    Train      Cross-V
AVG-LL    -0.4655     -0.2952    -0.3059
HARD      17.62%      11.66%     11.75%
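To make the fitted model concrete, here is how it would be evaluated for one session. The coefficients are taken from the table above; the concept counts in the example call are invented.

```python
import math

# Coefficients from the Model 1 table (Slide 23).
COEF = {
    "const": -2.3442,
    "ctc_oreq": 0.5518,    "itc_oreq": -0.4067,
    "ctc_bool": 3.3127,    "itc_bool": -0.5959,
    "ctc_nonbool": 2.5514, "itc_nonbool": -3.4410,
}

def p_task_success(counts):
    """Invert the logit: P(TC = 1) = 1 / (1 + exp(-(C_0 + sum_i C_i * x_i)))."""
    z = COEF["const"] + sum(COEF[k] * v for k, v in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# One correct concept per state plus one incorrect non-boolean concept
# (made-up counts): z = 0.6307, so P(TC = 1) is about 0.65.
print(p_task_success({"ctc_oreq": 1, "ctc_bool": 1,
                      "ctc_nonbool": 1, "itc_nonbool": 1}))
```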
Slide 24: Model 1: Threshold Optimization
[Figure: threshold optimization curves for the three states: open-request, request(bool), request(non-bool).]
Resulting thresholds:
- Open-request: 0.00
- Request(bool): 0.00
- Request(non-bool): 61.00
Slide 25: Results Do Confirm Expectations
- Anecdotal evidence from the data collection indicated too many false rejections on open requests.
- The data analysis confirms this view.
Slide 26: What Would Change?

                        Current    New Estimate    Delta
Open-request     CTC    0.54       0.89            +0.35
                 ITC    0.16       0.31            +0.15
Request bool     CTC    0.84       0.86            +0.02
                 ITC    0.09       0.12            +0.03
Request          CTC    0.72       0.66            -0.06
non-bool         ITC    0.25       0.17            -0.08

                 Current    New Estimate    Delta
Task success     82.75%     87.16%          +4.41%

Remains to be seen ...
Slide 27: Model 2: Description
- Global performance metric: task duration on successful tasks, in # turns (a Poisson variable).
- Generalized linear model / Poisson:
  log(TD) <- C_0 + C_CTC * CTC + C_ITC * ITC
- But different tasks have different durations, so you would want to normalize:
  log(TD_x / mean(TD_x)) <- C_0 + C_CTC * CTC + C_ITC * ITC
  where mean(TD_x) is the average duration of task x.
- Instead, use regression offsets:
  log(TD_x) <- 1 * log(mean(TD_x)) + C_0 + C_CTC * CTC + C_ITC * ITC
- Tradeoff variables: same as before.
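A sketch of the offset formulation with statsmodels' Poisson GLM; all numbers are invented, and TD_MEAN stands in for the per-task average duration used as the offset.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative per-session data: task duration in turns, the task's
# average duration (used as the offset), and concept-transfer counts.
TD      = np.array([14, 22, 9, 30, 12, 18])
TD_MEAN = np.array([12, 15, 10, 20, 12, 15])
CTC     = np.array([10, 6, 12, 3, 9, 7])
ITC     = np.array([1, 4, 0, 6, 2, 3])

# log(TD) <- 1*log(TD_MEAN) + C_0 + C_CTC*CTC + C_ITC*ITC
# The fixed log(TD_MEAN) term enters as a regression offset, which
# normalizes for tasks having different intrinsic lengths.
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.GLM(TD, X, family=sm.families.Poisson(),
             offset=np.log(TD_MEAN)).fit()
print(fit.params)  # [C_0, C_CTC, C_ITC]
```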
Slide 28: Model 2: Resulting Fit and Coefficients

Variable               Coeff      p         se
Const                  1.2750     0.0000    0.1019
CTC / oreq             -0.1769    0.0000    0.0187
ITC / oreq             -0.1567    0.0001    0.0401
CTC / req(bool)        -0.7865    0.0000    0.0869
ITC / req(bool)        -0.6389    0.0000    0.1297
CTC / req(non-bool)    -0.5127    0.0000    0.0440
ITC / req(non-bool)    0.4256     0.0000    0.0851

R^2 = 0.56
Slide 31: Conclusion
- A model for tuning rejection that is genuinely data-driven.
- Relates state-specific costs of rejection to global dialog performance.
- Bridges the mismatch between an off-the-shelf confidence annotation scheme and the particular characteristics of the system's domain.
- More data would allow even finer-grained distinctions.
- Expected performance improvements remain to be verified.
Slide 32: Next Time ...
(Recap of the questions under investigation; the remaining talks address the set of recovery strategies and the policy for choosing between them.)