An Investigation into Recovering from Non-understanding Errors Dan Bohus Dialogs on Dialogs Reading Group Talk Carnegie Mellon University, October 2004.


2 Non-understandings
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
 System knows there was a user turn, but
   There is no relevant semantic information in the input, or
   Confidence is too low to trust any semantic information in the input
 10 – 30% of turns in a mixed-initiative system
 GOAL: Do a better job at recovering from non-understandings
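The two-part test above (no semantic content, or none trustworthy) can be sketched as a small predicate. This is an illustrative reconstruction, not RoomLine's actual code; the `Concept` type and the `REJECTION_THRESHOLD` value are assumptions.

```python
# Hedged sketch of the non-understanding test described on this slide:
# a turn is a non-understanding when decoding yields no concepts, or
# when every concept's confidence falls below the rejection threshold.
from dataclasses import dataclass

REJECTION_THRESHOLD = 0.3  # assumed value; the talk discusses tuning it


@dataclass
class Concept:
    name: str
    value: str
    confidence: float


def is_nonunderstanding(concepts: list[Concept]) -> bool:
    """True if the turn carries no usable semantic information."""
    if not concepts:  # nothing parsed at all
        return True
    # confidence too low to trust any semantic information in the input
    return all(c.confidence < REJECTION_THRESHOLD for c in concepts)
```

Note that this treats detection as essentially free, which is why the next slides focus on the strategy set and the policy rather than on detection itself.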

3 Recovery Ingredients
 Detection
 Set of strategies (actions)
 Policy (method for choosing between actions)

4 Recovery Ingredients – Non-understandings
 Detection
   Generally, the system knows when a non-understanding happened
 Set of strategies (actions)
   Notify non-understanding, repeat question, ask repeat/rephrase, provide help, etc.
 Policy (method for choosing between actions)
   Traditionally a fixed heuristic

5 Issues under Investigation
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

6 Issues under Investigation
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

7 Experimental Design – Overview
 Subjects interact over the telephone with RoomLine
 Perform a number of scenario-based tasks
 Between-subjects experiment
   Control: system uses a random (uniform) policy for engaging the non-understanding recovery strategies
   Wizard: policy is determined at runtime by a human (wizard)
 46 subjects, balanced for Gender x Native

8 Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
 FAIL: Sorry, I didn’t catch that. Tell me for what day you need the room
 YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
 TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
 FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
 ASK REPEAT (AREP): Could you please repeat that?
 ASK REPHRASE (ARPH): Could you please try to rephrase that?
 NOTIFY (NTFY): Sorry, I don’t think I understood you correctly …
 YIELD TURN (YLD): …
 REPEAT SYSTEM PROMPT (REPP): For when do you need the conference room?
 EXPLAIN MORE (EXPL): Right now I need to know the date and time for when you need the reservation …
[Slide table: strategies grouped as MOVE-ON / HELP / SIGNAL, with per-strategy verbosity (Verb.) and re-prompt (Prompt.) attributes]
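The control condition described on the previous slide engages these ten strategies uniformly at random. A minimal sketch of that baseline policy, using the slide's strategy labels (the function name and seeding convention are illustrative, not from RoomLine):

```python
# Sketch of the control condition's policy: on each non-understanding,
# one of the ten recovery strategies is chosen uniformly at random.
import random

STRATEGIES = ["FAIL", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "REPP", "EXPL"]


def random_policy(rng: random.Random) -> str:
    """Control condition: pick a recovery strategy uniformly at random."""
    return rng.choice(STRATEGIES)
```

Using a uniform policy in the control condition gives each strategy roughly equal exposure, which is what makes the per-strategy performance comparison on later slides possible.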

9 Experimental Design: Scenarios
 Presented graphically (explained during briefing)

10 Corpus Statistics / Characteristics
 46 users; 484 sessions; ~9000 turns
 Transcribed
 Annotated with:
   Misunderstandings & deletions
   Non-understandings
   Concept transfer accuracy
   Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID
   Correct concept values in each turn – [ongoing]

11 Back to the Issues
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

12 Impact of Policy on Performance
 General picture
   Significant improvements for non-natives, especially after non-understandings
 Global
   Task success: significant improvements (x1.77) for non-natives
   SASSI scores: nothing detectable
 Local
   WER: significant improvements across the board
   Understanding error metrics (CT, CER, NONU, MIS): significant improvement for non-natives
   Recovery: nothing detectable (?); faster on the wizard side

13 Impact of Policy on Performance
 … Weird stuff
 Conclusion?

14 Back to the Issues
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

15 Impact on Task Performance
 Models for predicting task success from various types of errors
   [show in Matlab]
 Can shed more light on:
   Effect of the policy
   Native / non-native differences
   Costs of various types of errors
 Currently analyzing it. Issues:
   Build (state-)conditioned cost models
   Robustness

16 Back to the Issues
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

17 Individual Strategy Performance
 Under “random”/uniform conditions (control)
   All-way comparison: Matlab, summary file (rank analysis?)
 First conclusions:
   Moving on helps
   Help helps
   Just signaling is not so good; YLD is pretty bad
 Compare with wizard:
   Ask Repeat boosted (significantly, x1.58)
   Wizard reverse engineering (?)
   HELP / FAIL behavior in non-natives (?)
 Predicting success: when to help, when to ask repeat?

18 Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
 FAIL: Sorry, I didn’t catch that. Tell me for what day you need the room
 YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
 TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
 FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
 ASK REPEAT (AREP): Could you please repeat that?
 ASK REPHRASE (ARPH): Could you please try to rephrase that?
 NOTIFY (NTFY): Sorry, I don’t think I understood you correctly …
 YIELD TURN (YLD): …
 REPEAT SYSTEM PROMPT (REPP): For when do you need the conference room?
 EXPLAIN MORE (EXPL): Right now I need to know the date and time for when you need the reservation …
[Slide table: strategies grouped as MOVE-ON / HELP / SIGNAL, with per-strategy verbosity (Verb.) and re-prompt (Prompt.) attributes]

19 Back to the Issues
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

20 Identify Potential New Strategies
 Better informed by the error-type / blame-assignment analysis (top of my stack)
 So far:
   Ask the user to speak in shorter utterances
   Ask the user to speak louder
   Speculative execution

21 Speculative Execution
 A lot of small recognition errors appear repeatedly:
   YES > THIS, NEXT
   GUEST > YES
   GUEST USER > TUESDAY
   Etc.
 Learn from experience how to avoid these errors
 Example:
S: Did you say you wanted a room for Tuesday?
U: YES [THIS]
S: Sorry, I didn’t catch that. Did you say you wanted a room for Tuesday?
U: YES [YES]
 Learn that “THIS” actually means “YES”
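The learning step in the example above amounts to counting how often a rejected hypothesis is followed by a successfully understood repeat, and promoting frequent pairs to rewrite rules. A deliberately simple sketch; the function name, input shape, and `min_count` cutoff are all assumptions for illustration:

```python
# Illustrative sketch of learning correction mappings from recovery
# segments: count (rejected hypothesis, understood repeat) pairs and
# keep only the pairs seen often enough to trust.
from collections import Counter


def learn_mappings(segments, min_count=2):
    """segments: (rejected_hypothesis, understood_transcript) pairs
    drawn from non-understanding recovery episodes.
    Returns a dict mapping misrecognitions to their likely intents."""
    counts = Counter(segments)
    return {bad: good for (bad, good), n in counts.items() if n >= min_count}
```

The `min_count` threshold is the crude precision knob here; the next slide frames this more carefully as a precision/recall tradeoff between the learner and the applier.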

22 Speculative Execution – Components
 Learn mapping
   Learner with high precision (no false positives)
 Apply mapping
   Applier with high recall
 Precision / recall tradeoff
 How much can this method really buy us?

23 Speculative Execution – 0th Cut
 Conservative learner
   Learns from non-understanding segments where:
     Dialogue state is the same throughout (mapping is state-specific)
     Final response is in focus, contains only one concept, and has high confidence
 Conservative applier
   Applies only when the dialogue state matches and the non-understood input matches perfectly at the state level
 Going through the whole dataset, learning as you go, results:
   10% application rate at the end; does not asymptote yet
   Precision? (480 rules learned)
 How does this look to you?
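The conservative applier above can be sketched as an exact lookup: rules are keyed by dialogue state, and a rewrite fires only when both the state and the rejected input match a learned rule exactly. All names here are illustrative assumptions, not RoomLine internals:

```python
# Sketch of the conservative applier: state-specific rules, applied
# only on an exact match of (dialogue state, rejected input).
def apply_mapping(rules, state, rejected_input):
    """rules: {(state, rejected_input): corrected_input}.
    Returns the corrected input, or None when no rule matches exactly."""
    return rules.get((state, rejected_input))
```

Requiring an exact state match is what keeps precision high at the cost of recall, which is consistent with the ~10% application rate reported on the slide.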

24 Speculative Execution
 Of course, much more to dig into here …
   Learners which generalize more
   Confidence scores on the rules
   Active learning: appliers with confidence, and feedback into learning
   Potentially use it in other cases (not only non-understandings, but also potential misunderstandings)

25 Back to the Issues
 Detection
   Analysis of error types, blame assignment, impact on task performance
   Detection of error type
   Adaptation of rejection threshold
 Set of strategies
   Investigate individual strategy performance
   Identify potential new strategies
 Policy
   Impact of a “smarter” policy on performance
   Building a policy from data

26 Building a Policy from Data
 The experiment showed that the wizard boosted the performance of Ask Repeat
 Can we predict the likelihood of success for each strategy from features available online?
   Identify informative features
     Might be better informed by the error-type / blame-assignment analysis
   Try simple classifiers
 MDP (?)
 Can also formulate the problem as a decision-boundary or classification problem … (?)
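One simple reading of "building a policy from data" is to estimate each strategy's empirical success rate per feature bucket from logged outcomes, then pick the strategy with the highest estimate at runtime. This is a hedged sketch of that idea, not the approach the talk actually settled on; all names and the input shape are assumptions:

```python
# Sketch of a data-driven policy: estimate per-(bucket, strategy)
# success rates from logs, then greedily pick the best strategy.
from collections import defaultdict


def estimate_success(log):
    """log: iterable of (feature_bucket, strategy, succeeded) tuples."""
    wins = defaultdict(int)
    tries = defaultdict(int)
    for bucket, strategy, ok in log:
        tries[(bucket, strategy)] += 1
        wins[(bucket, strategy)] += int(ok)
    return {k: wins[k] / tries[k] for k in tries}


def learned_policy(rates, bucket, strategies):
    """Pick the strategy with the best estimated success in this bucket."""
    return max(strategies, key=lambda s: rates.get((bucket, s), 0.0))
```

This greedy, single-step view ignores the sequential effects an MDP formulation would capture, which is presumably why the slide lists both options.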

27 Thank you!

28 Experimental Design: Control vs. Wizard Conditions
 Control: random (uniform) policy
 Wizard: human with access to audio & system state
[Slide figure: a performance scale ordering a random (uniform) policy, a manually designed policy, and a data-driven policy, with the positions of a human wizard with access to audio and a human wizard with access to only system state marked “?”]