An Investigation into Recovering from Non-understanding Errors Dan Bohus Dialogs on Dialogs Reading Group Talk Carnegie Mellon University, October 2004.

2 Non-understandings
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
- System knows there was a user turn, but
  - There is no relevant semantic information in the input
  - Confidence is too low to trust any semantic information in the input
- 10-30% of turns in a mixed-initiative system
- GOAL: Do a better job at recovering from non-understandings
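The two conditions on this slide can be read as a simple rejection test. The sketch below is only an illustration of that reading: the `RecognitionResult` container, its field names, and the threshold value are assumptions, not part of the system described in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionResult:
    """Hypothetical container for one decoded user turn."""
    text: str                                      # recognizer output
    concepts: dict = field(default_factory=dict)   # parsed semantic slots
    confidence: float = 0.0                        # score in [0, 1]


# Assumed value for illustration; the talk does not give a threshold.
REJECTION_THRESHOLD = 0.3


def is_non_understanding(result, threshold=REJECTION_THRESHOLD):
    """Return True when the turn carries no usable semantic information."""
    if not result.concepts:
        # No relevant semantic information in the input.
        return True
    # Semantic information is present but not trustworthy enough.
    return result.confidence < threshold
```

With this reading, the misrecognized "OKAY IN THAT SAME PAY" turn above would be flagged because no concept was parsed from it.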

3 Recovery Ingredients
- Detection
- Set of strategies (actions)
- Policy (method for choosing between actions)

4 Recovery Ingredients – Non-understandings
- Detection
  - Generally, system knows when a non-understanding happened
- Set of strategies (actions)
  - Notify non-understanding, repeat question, ask repeat/rephrase, provide help, etc.
- Policy (method for choosing between actions)
  - Traditionally fixed heuristic

5 Issues under Investigation
- Detection
  - Analysis of error types, blame assignment, impact on task performance
  - Detection of error type
  - Adaptation of rejection threshold
- Set of strategies
  - Investigate individual strategy performance
  - Identify potential new strategies
- Policy
  - Impact of a “smarter” policy on performance
  - Building a policy from data

7 Experimental Design - Overview
- Subjects interact over the telephone with RoomLine
- Perform a number of scenario-based tasks
- Between-subjects experiment
  - Control: system uses a random (uniform) policy for engaging the non-understanding recovery strategies
  - Wizard: policy is determined at runtime by a human (wizard)
- 46 subjects, balanced by gender x native/non-native
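As a rough illustration of the control condition, a uniform policy simply draws one recovery strategy at random each time a non-understanding occurs. The sketch below assumes the strategy set is the ten abbreviations listed on the strategies slide; the function name and the optional `rng` parameter are hypothetical.

```python
import random

# Strategy abbreviations as listed on the "Non-understanding Strategies" slide.
STRATEGIES = ["FAIL", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "REPP", "EXPL"]


def uniform_policy(rng=random):
    """Pick a recovery strategy with equal probability (control condition)."""
    return rng.choice(STRATEGIES)
```

In the wizard condition the same choice would be made by a human at runtime instead of by `uniform_policy`.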

8 Non-understanding Strategies
Groups: MOVE-ON, HELP, SIGNAL
S: For when do you need the room?
U: [non-understanding]
- FAIL: Sorry, I didn’t catch that. Tell me for what day you need the room
- YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
- TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
- FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
- ASK REPEAT (AREP): Could you please repeat that?
- ASK REPHRASE (ARPH): Could you please try to rephrase that?
- NOTIFY (NTFY): Sorry, I don’t think I understood you correctly…
- YIELD TURN (YLD): …
- REPEAT SYSTEM PROMPT (REPP): For when do you need the conference room?
- EXPLAIN MORE (EXPL): Right now I need to know the date and time for when you need the reservation …
Verb.: V T A T T T T A T
Prompt.: Y N Y N N N N Y Y

9 Experimental Design: Scenarios  Presented graphically (explained during briefing)

10 Corpus Statistics / Characteristics
- 46 users; 484 sessions; ~9000 turns
- Transcribed
- Annotated with:
  - Misunderstandings & deletions
  - Non-understandings
  - Concept transfer accuracy
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID
  - Correct concept values in each turn – [ongoing]

11 Back to the Issues
- Detection
  - Analysis of error types, blame assignment, impact on task performance
  - Detection of error type
  - Adaptation of rejection threshold
- Set of strategies
  - Investigate individual strategy performance
  - Identify potential new strategies
- Policy
  - Impact of a “smarter” policy on performance
  - Building a policy from data

12 Impact of Policy on Performance
- General picture
  - Significant improvements for non-natives, especially after non-understandings
- Global
  - Task success: significant improvements (x1.77) for non-natives
  - SASSI scores: nothing detectable
- Local
  - WER: significant improvements across the board
  - Understanding error metrics (CT, CER, NONU, MIS): significant improvement for non-natives
  - Recovery: nothing detectable (?); faster on the wizard side

13 Impact of Policy on Performance
- … Weird stuff
- Conclusion?

14 Back to the Issues
- Detection
  - Analysis of error types, blame assignment, impact on task performance
  - Detection of error type
  - Adaptation of rejection threshold
- Set of strategies
  - Investigate individual strategy performance
  - Identify potential new strategies
- Policy
  - Impact of a “smarter” policy on performance
  - Building a policy from data

15 Impact on task performance
- Models for predicting task success from various types of errors
  - [show in Matlab]
- Can shed more light on:
  - Effect of the policy
  - Native / non-native differences
  - Costs of various types of errors
- Currently analyzing it. Issues:
  - Build (state-)conditioned cost models
  - Robustness

17 Individual strategy performance
- Under “random”/uniform conditions (control)
  - All-way comparison: Matlab, summary file (rank analysis?)
- First conclusions:
  - Moving-on helps
  - Help helps
  - Just signaling is not so good, YLD is pretty bad
- Compare with wizard:
  - Ask Repeat boosted (significantly, x1.58)
  - Wizard reverse engineering (?)
  - HELP / FAIL behavior in non-natives (?)
- Predicting success: when to help, when to ask repeat?

20 Identify Potential New Strategies
- Better informed by the error-type / blame assignment analysis (top of my stack)
- So far
  - Ask user to speak shorter
  - Ask user to speak louder
  - Speculative execution

21 Speculative execution
- A lot of small recognition errors appear repeatedly
  - YES > THIS, NEXT
  - GUEST > YES
  - GUEST USER > TUESDAY
  - Etc…
- Learn from experience how to avoid these errors
- Example:
  S: Did you say you wanted a room for Tuesday?
  U: YES [THIS]
  S: Sorry, I didn’t catch that. Did you say you wanted a room for Tuesday?
  U: YES [YES]
- Learn that “THIS” actually means “YES”

22 Speculative execution - components
- Learn mapping
  - Learner with high precision (no false positives)
- Apply mapping
  - Learner with high recall
- Precision / Recall tradeoff
- How much can this method really buy us?
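One way to put numbers on the tradeoff is to track, over a labeled set of non-understood turns, how often a learned rule fires and how often the fired rule is right. The sketch below is an assumed bookkeeping scheme, not the evaluation used in the talk; the function name and input format are hypothetical.

```python
def precision_and_coverage(applications, total_non_understandings):
    """Summarize how a set of learned rules performed.

    applications: list of (predicted_concept, true_concept) pairs, one per
        non-understood turn on which some rule fired.
    total_non_understandings: number of non-understood turns considered.
    Returns (precision, coverage), both in [0, 1].
    """
    fired = len(applications)
    correct = sum(1 for predicted, actual in applications if predicted == actual)
    precision = correct / fired if fired else 0.0
    coverage = fired / total_non_understandings if total_non_understandings else 0.0
    return precision, coverage
```

A stricter learner should push precision up at the cost of coverage, and a more permissive applier should do the opposite.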

23 Speculative Execution – 0th cut
- Conservative Learner
  - Learns from non-understanding segments where
    - Dialogue state is the same throughout (mapping is state-specific)
    - Final response is in focus, contains only one concept and has high confidence
- Conservative Applier
  - Apply only when dialogue state matches and non-understood input matches perfectly at the state level
- Going through the whole dataset, learning as you go, results:
  - 10% application at the end, does not asymptote yet
  - Precision? (480 rules learned)
- How does this look to you?
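The conservative learner and applier described on this slide might be sketched as follows. This is an interpretation under stated assumptions: the `Turn` fields, the `HIGH_CONFIDENCE` cutoff, and the rule table keyed by (dialogue state, recognized string) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.8  # assumed cutoff; the talk does not give a value


@dataclass(frozen=True)
class Turn:
    """Hypothetical record of one user turn."""
    state: str                 # dialogue state when the user spoke
    hypothesis: str            # recognizer output, e.g. "THIS"
    concepts: tuple = ()       # concepts parsed from the hypothesis
    confidence: float = 0.0
    understood: bool = False
    in_focus: bool = False     # response addresses the current question


def learn(segment, rules):
    """Learn state-specific mappings from one non-understanding segment.

    segment: non-understood turns followed by the final, understood turn.
    rules: dict mapping (state, hypothesis) -> concept, updated in place.
    """
    if len(segment) < 2:
        return rules
    *failed, final = segment
    same_state = all(t.state == final.state for t in segment)
    clean_final = (final.understood and final.in_focus
                   and len(final.concepts) == 1
                   and final.confidence >= HIGH_CONFIDENCE)
    if same_state and clean_final and not any(t.understood for t in failed):
        for t in failed:
            rules[(t.state, t.hypothesis)] = final.concepts[0]
    return rules


def apply_rules(turn, rules):
    """Return the learned concept only on an exact state and input match."""
    return rules.get((turn.state, turn.hypothesis))
```

Run over the "YES [THIS]" example from the earlier slide, `learn` would store a rule mapping "THIS" to "YES" for that dialogue state, and `apply_rules` would return "YES" the next time "THIS" is recognized in the same state.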

24 Speculative execution
- Of course much more to dig in here …
  - Learners which generalize more
  - Confidence score on the rules
  - Active learning: appliers with confidence, and feedback into learning
  - Potentially use it in other cases (not only non-understandings, but potential misunderstandings)

26 Building a Policy from Data
- Experiment showed that the wizard boosted the performance of Ask Repeat
- Can we predict likelihood of success for each strategy from features available online?
  - Identify informative features
    - Might be better informed by error-type / blame-assignment analysis
  - Try simple classifiers
- MDP (?)
- Can also formulate problem as a decision boundary or classification problem… (?)
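As a very simple baseline for "predict likelihood of success for each strategy", one could estimate smoothed recovery rates from logged turns and pick the strategy with the highest estimate. The sketch below assumes a single discrete feature per turn and a log of (strategy, feature, recovered) triples; these names, the log format, and the add-one smoothing are illustrative assumptions, not the method proposed in the talk.

```python
from collections import defaultdict


def fit_success_rates(log, alpha=1.0):
    """Estimate P(recovery | strategy, feature) with additive smoothing.

    log: iterable of (strategy, feature, recovered) triples, where
        recovered is True if the next user turn was understood.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, trials]
    for strategy, feature, recovered in log:
        entry = counts[(strategy, feature)]
        entry[0] += int(recovered)
        entry[1] += 1
    return {key: (wins + alpha) / (trials + 2 * alpha)
            for key, (wins, trials) in counts.items()}


def choose_strategy(rates, feature, strategies, default=0.5):
    """Pick the strategy with the highest estimated success for this feature."""
    return max(strategies, key=lambda s: rates.get((s, feature), default))
```

Richer feature sets would call for an actual classifier per strategy; the table-based estimate only shows the shape of the decision.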

27 Thank you!

28 Experimental Design: Control vs Wizard Conditions
- Control: random (uniform) policy
- Wizard: human with access to audio & system state
- Performance
  - Random (uniform) policy
  - Manually designed policy
  - Data-driven designed policy
  - Human wizard with access to audio (?)
  - Human wizard with access to only system state (?)
