Slide 1: "Sorry, I didn't catch that ..." Non-understandings and Recovery in Spoken Dialog Systems
Part I: Issues, Data Collection, Rejection Tuning
Dan Bohus
Sphinx Lunch Talk, Carnegie Mellon University, March 2005
Slide 2: ASR Errors & Spoken Dialog
Call RoomLine! 1-412-268-1084
Call Let's Go! 1-412-268-1185
Slide 3: Non-understandings and Misunderstandings
Recognition errors can lead to two types of problems in a spoken dialog system:
- NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [recognized: OKAY IN THAT SAME PAY]
- MIS-understanding: the system extracts incorrect information from the user's turn.
  S: What city are you leaving from?
  U: Birmingham [recognized: BERLIN PM]
Slide 4: Non-understandings and Misunderstandings
- NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [recognized: OKAY IN THAT SAME PAY]
How can we prevent non-understandings? How can we recover from them?
- Detection
- Set of recovery strategies
- Policy for choosing between them
Slide 5: Current State of Affairs
Detection / Diagnosis
- Systems know when a non-understanding happens: either there was a detected user turn but no meaningful information, or the system decides to reject because of low confidence.
- Not much exists in terms of diagnosis.
Set of recovery strategies
- Repeat the question; "Can you repeat that?"; "Sorry, I didn't catch that ..."
Policy for choosing between them
- Traditionally, simple heuristics are used.
Slide 6: Questions Under Investigation
Detection / Diagnosis
- What are the main causes (sources) of non-understandings?
- What is their impact on global performance?
- Can we diagnose non-understandings at run-time?
- Can we optimize the rejection process in a more principled way?
Set of recovery strategies
- What is the relative performance of different recovery strategies?
- Can we refine current strategies and find new ones?
Policy for choosing between them
- Can we improve performance by making smarter choices?
- If so, can we learn how to make these smarter choices?
Next: the data collection experiment.
Slide 7: Data Collection: Experimental Design
- Subjects interacted over the telephone with RoomLine.
- Each performed 10 scenario-based tasks.
- Between-subjects experiment with 2 groups:
  - Control: the system uses a random (uniform) policy for engaging the non-understanding recovery strategies (a minimal sketch of this policy appears after the strategy list below).
  - Wizard: the policy is determined at runtime by a human wizard.
- 46 subjects, balanced by gender and native / non-native status.
Slide 8: Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
1. MOVE-ON: "Sorry, I didn't catch that. For which day do you need the room?"
2. YOU CAN SAY (YCS): "Sorry, I didn't catch that. For when do you need the conference room? You can say something like tomorrow at 10 am ..."
3. TERSE YOU CAN SAY (TYCS): "Sorry, I didn't catch that. You can say something like tomorrow at 10 am ..."
4. FULL HELP (HELP): "Sorry, I didn't catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am ..."
5. ASK REPEAT (AREP): "Could you please repeat that?"
6. ASK REPHRASE (ARPH): "Could you please try to rephrase that?"
7. NOTIFY (NTFY): "Sorry, I didn't catch that ..."
8. YIELD TURN (YLD): ...
9. REPROMPT (RP): "For when do you need the conference room?"
10. DETAILED REPROMPT (DRP): "Right now I need to know the date and time for when you need the reservation ..."
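For concreteness, here is a minimal Python sketch of the control condition's uniform policy over the ten strategies above. The strategy abbreviations come from the slide; everything else (function name, use of Python's random module) is illustrative, not the RoomLine implementation.

```python
import random

# The ten recovery strategies from the slide, keyed by their abbreviations.
STRATEGIES = [
    "MOVE-ON", "YCS", "TYCS", "HELP", "AREP",
    "ARPH", "NTFY", "YLD", "RP", "DRP",
]

def control_policy():
    """Control condition: pick a recovery strategy uniformly at random
    whenever a non-understanding occurs."""
    return random.choice(STRATEGIES)
```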
Slide 9: Experimental Design: Scenarios
- 10 scenarios, presented in a fixed order.
- Presented graphically (explained during briefing).
Slide 10: Experimental Design: Evaluation
- Participants filled in a SASSI evaluation questionnaire: 35 questions on a 1-7 Likert scale, covering 6 factors: response accuracy, likeability, cognitive demand, annoyance, habitability, speed.
- Overall user satisfaction score: 1-7.
- Open-ended questions: What did you like best / least? What would you change first?
Slide 11: Corpus Statistics / Characteristics
- 46 users; 449 sessions; 8278 user turns.
- User utterances transcribed & checked.
- Annotated with:
  - Concept transfer & misunderstandings: correctly transferred, incorrectly transferred, deleted, and substituted concepts; correct concept values at each turn.
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID, PART.
  - Corrections.
  - User responses to non-understanding recovery: Repeat, Rephrase, Contradict, Change, Other.
Slide 12: Corpus
Slide 13: General Corpus Statistics

                                  Total     Natives   Non-Natives
# Users                           46        34        12
# Sessions                        449       338       111
# User turns                      8278      5783      2495
Average turn length (# words)     3.05      2.91      3.22
WER                               25.61%    19.60%    39.54%
CER                               35.73%    26.30%    57.58%
OOV rate                          0.35%     0.33%     0.39%
% Non-understandings              16.96%    13.38%    25.25%
% Misunderstandings               13.53%    9.67%     22.48%
Task success                      75.06%    85.21%    44.14%
User satisfaction (1-7 Likert)    3.93      4.38      2.67
Slide 14: Back to the Issues
(Recap of the questions under investigation from Slide 6, with the data collection experiment now in place.)

Slide 15: Next ...
(Same outline; the talk now turns to optimizing the rejection process in a more principled way.)
Slide 16: Utterance Rejection
- Systems use confidence scores to assess the reliability of their inputs.
- A widely used design pattern: if confidence is very low (i.e., below a certain threshold), reject the utterance altogether.
- Non-understandings then comprise genuine non-understandings plus rejections.
- This creates a tradeoff between non-understandings and misunderstandings.
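A rough sketch of this design pattern follows; the threshold value and function shape are assumptions for illustration, not the RoomLine implementation.

```python
def interpret(hypothesis, confidence, threshold=0.30):
    """Widely used rejection pattern: treat low-confidence input as a
    non-understanding rather than risk acting on a misunderstanding."""
    if confidence < threshold:
        return None    # reject: counted as a non-understanding
    return hypothesis  # accept: concepts are transferred to the dialog state
```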
Slide 17: Non- / Mis-understanding Tradeoff
[Figure: the non-understanding vs. mis-understanding tradeoff, as a function of the rejection threshold.]
Slide 18: An Alternative, More Informative View
[Figure: number of concepts transferred correctly (CTC) or incorrectly (ITC), as a function of the rejection threshold.]
Slide 19: Current Solutions
- Set the threshold as the ASR manual says. In all likelihood this is a mismatch: ASR confidence is probably optimized for WER, and the tradeoff between misunderstandings and rejections probably varies across domains, and even across dialog states.
- Go for the break-even point.
- Acknowledge the tradeoff and solve it by postulating costs, e.g. misunderstandings cost twice as much as rejections.
Slide 20: Proposed Approach
Use a data-driven approach to establish the costs, then optimize the threshold:
1. Identify a set of variables involved in the tradeoff: CTC(th) vs. ITC(th).
2. Choose a dialog performance metric: TC, task completion (binary, kappa); TD, task duration (# turns); US, user satisfaction.
3. Build a regression model m:
   logit(TC) <- C_0 + C_CTC * CTC + C_ITC * ITC
4. Optimize the threshold to maximize performance:
   th* = argmax_th (C_CTC * CTC(th) + C_ITC * ITC(th))
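A minimal Python sketch of steps 3 and 4 using statsmodels; the per-session counts are made up, and the ctc_at / itc_at helpers (which would re-score the corpus at each candidate threshold) are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative per-session data: concepts transferred correctly (CTC),
# incorrectly (ITC), and binary task completion (TC).
CTC = np.array([12, 9, 8, 8, 4, 11, 7, 5])
ITC = np.array([1, 3, 2, 2, 5, 2, 4, 3])
TC  = np.array([1, 1, 1, 0, 0, 1, 0, 0])

# Step 3: logit(TC) <- C_0 + C_CTC*CTC + C_ITC*ITC
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.Logit(TC, X).fit(disp=False)
c0, c_ctc, c_itc = fit.params

# Step 4: th* = argmax_th (C_CTC*CTC(th) + C_ITC*ITC(th)).
# ctc_at(th) / itc_at(th) would be measured by re-running understanding
# over the corpus with rejection threshold th (hypothetical helpers).
def optimal_threshold(thresholds, ctc_at, itc_at):
    return max(thresholds,
               key=lambda th: c_ctc * ctc_at(th) + c_itc * itc_at(th))
```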
Slide 21: State-Specific Costs & Thresholds
- The costs are potentially different at different points in the dialog.
- Count CTC and ITC at different states, with different variables:
  logit(TC) <- C_0 + C_CTC,state1 * CTC_state1 + C_ITC,state1 * ITC_state1
                   + C_CTC,state2 * CTC_state2 + C_ITC,state2 * ITC_state2
                   + C_CTC,state3 * CTC_state3 + C_ITC,state3 * ITC_state3 + ...
- Optimize a separate threshold for each state:
  th*_state_x = argmax_th (C_CTC,state_x * CTC_state_x(th) + C_ITC,state_x * ITC_state_x(th))
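One way to realize the state-specific variables is to count CTC and ITC separately per dialog state when building each session's regression row. This sketch assumes a simple (state, n_correct, n_incorrect) per-turn record, which is an illustrative format, not the corpus's actual annotation scheme.

```python
STATES = ("open_request", "request_bool", "request_nonbool")

def session_row(turns):
    """One regression row per session: CTC and ITC counted separately
    for each dialog state, matching the expanded logit model above."""
    counts = {(v, s): 0 for s in STATES for v in ("ctc", "itc")}
    for state, n_correct, n_incorrect in turns:
        counts[("ctc", state)] += n_correct
        counts[("itc", state)] += n_incorrect
    return [counts[(v, s)] for s in STATES for v in ("ctc", "itc")]
```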
Slide 22: States Considered
- Open request: "How may I help you?"
- Request (boolean): "Did you want a reservation for this room?"
- Request (non-boolean): "Starting at what time do you need the room?"
Finer granularity is desirable and can be achieved given more data.
Slide 23: Model 1: Resulting Fit and Coefficients

Variable               Coeff      p         se
Const                  -2.3442    0.0416    1.1504
CTC / oreq             0.5518     0.0619    0.2955
ITC / oreq             -0.4067    0.3801    0.4634
CTC / req(bool)        3.3127     0.0010    1.0076
ITC / req(bool)        -0.5959    0.6491    1.3098
CTC / req(non-bool)    2.5514     0.0017    0.8137
ITC / req(non-bool)    -3.4410    0.0018    1.1046

          Baseline    Train      Cross-V
AVG-LL    -0.4655     -0.2952    -0.3059
HARD      17.62%      11.66%     11.75%
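To make the fitted model concrete, here is how it would be evaluated for one session. The coefficients are taken from the table above; the concept counts in the example call are invented.

```python
import math

# Coefficients from the Model 1 table (Slide 23).
COEF = {
    "const": -2.3442,
    "ctc_oreq": 0.5518,    "itc_oreq": -0.4067,
    "ctc_bool": 3.3127,    "itc_bool": -0.5959,
    "ctc_nonbool": 2.5514, "itc_nonbool": -3.4410,
}

def p_task_success(counts):
    """Invert the logit: P(TC = 1) = 1 / (1 + exp(-(C_0 + sum_i C_i * x_i)))."""
    z = COEF["const"] + sum(COEF[k] * v for k, v in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# One correct concept per state plus one incorrect non-boolean concept
# (made-up counts): z = 0.6307, so P(TC = 1) is about 0.65.
print(p_task_success({"ctc_oreq": 1, "ctc_bool": 1,
                      "ctc_nonbool": 1, "itc_nonbool": 1}))
```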
Slide 24: Model 1: Threshold Optimization
[Figure: threshold optimization curves for the three states: open-request, request(bool), request(non-bool).]
Resulting thresholds:
- Open-request: 0.00
- Request(bool): 0.00
- Request(non-bool): 61.00
Slide 25: Results Do Confirm Expectations
- Anecdotal evidence from the data collection indicated too many false rejections on open requests.
- The data analysis confirms this view.
Slide 26: What Would Change?

                        Current    New Estimate    Delta
Open-request     CTC    0.54       0.89            +0.35
                 ITC    0.16       0.31            +0.15
Request bool     CTC    0.84       0.86            +0.02
                 ITC    0.09       0.12            +0.03
Request          CTC    0.72       0.66            -0.06
non-bool         ITC    0.25       0.17            -0.08

                 Current    New Estimate    Delta
Task success     82.75%     87.16%          +4.41%

Remains to be seen ...
Slide 27: Model 2: Description
- Global performance metric: task duration on successful tasks, in # turns (a Poisson variable).
- Generalized linear model / Poisson:
  log(TD) <- C_0 + C_CTC * CTC + C_ITC * ITC
- But different tasks have different durations, so you would want to normalize:
  log(TD_x / mean(TD_x)) <- C_0 + C_CTC * CTC + C_ITC * ITC
  where mean(TD_x) is the average duration of task x.
- Instead, use regression offsets:
  log(TD_x) <- 1 * log(mean(TD_x)) + C_0 + C_CTC * CTC + C_ITC * ITC
- Tradeoff variables: same as before.
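A sketch of the offset formulation with statsmodels' Poisson GLM; all numbers are invented, and TD_MEAN stands in for the per-task average duration used as the offset.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative per-session data: task duration in turns, the task's
# average duration (used as the offset), and concept-transfer counts.
TD      = np.array([14, 22, 9, 30, 12, 18])
TD_MEAN = np.array([12, 15, 10, 20, 12, 15])
CTC     = np.array([10, 6, 12, 3, 9, 7])
ITC     = np.array([1, 4, 0, 6, 2, 3])

# log(TD) <- 1*log(TD_MEAN) + C_0 + C_CTC*CTC + C_ITC*ITC
# The fixed log(TD_MEAN) term enters as a regression offset, which
# normalizes for tasks having different intrinsic lengths.
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.GLM(TD, X, family=sm.families.Poisson(),
             offset=np.log(TD_MEAN)).fit()
print(fit.params)  # [C_0, C_CTC, C_ITC]
```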
Slide 28: Model 2: Resulting Fit and Coefficients

Variable               Coeff      p         se
Const                  1.2750     0.0000    0.1019
CTC / oreq             -0.1769    0.0000    0.0187
ITC / oreq             -0.1567    0.0001    0.0401
CTC / req(bool)        -0.7865    0.0000    0.0869
ITC / req(bool)        -0.6389    0.0000    0.1297
CTC / req(non-bool)    -0.5127    0.0000    0.0440
ITC / req(non-bool)    0.4256     0.0000    0.0851

R^2 = 0.56
Slide 31: Conclusion
- A model for tuning rejection that is genuinely data-driven.
- Relates state-specific costs of rejection to global dialog performance.
- Bridges the mismatch between an off-the-shelf confidence annotation scheme and the particular characteristics of the system's domain.
- More data would allow even finer-grained distinctions.
- Expected performance improvements remain to be verified.
Slide 32: Next Time ...
(Recap of the questions under investigation; the remaining talks address the set of recovery strategies and the policy for choosing between them.)