sorry, I didn't catch that! – an investigation of non-understandings and recovery strategies
Dan Bohus (www.cs.cmu.edu/~dbohus), Alexander I. Rudnicky (www.cs.cmu.edu/~air)
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213

2 systems often do not understand correctly

Two types of understanding errors: non-understandings and misunderstandings.

NON-understanding: the system cannot extract any meaningful information from the user's turn.
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

MIS-understanding: the system extracts incorrect information from the user's turn.
S: What city are you leaving from?
U: Birmingham [BERLIN PM]

3 systems often do not understand correctly

NON-understanding: the system cannot extract any meaningful information from the user's turn.
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

Handling non-understandings raises three problems:
- detection: typically trivial, although diagnosis is not
- strategies: a large space of strategies; tradeoffs between them not well understood
- policy (knowing how to engage the strategies): simple heuristics such as "incremental prompting" (sketched below)
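To make the incremental-prompting heuristic concrete, here is a minimal Python sketch: escalate to a more informative prompt with each consecutive non-understanding. The escalation ladder and function names are hypothetical, not the implementation from the talk.

```python
# Minimal sketch of "incremental prompting": with each consecutive
# non-understanding, escalate to a more informative recovery prompt.
# The ladder below is illustrative, not the actual system's prompts.
ESCALATION = [
    "Sorry, I didn't catch that.",                       # 1st failure: notify
    "Could you please repeat that?",                     # 2nd failure: ask repeat
    "You can say something like 'tomorrow at 10 a.m.'",  # 3rd+ failure: give help
]

def recovery_prompt(consecutive_nonunderstandings: int) -> str:
    """Pick a prompt based on how many non-understandings occurred in a row."""
    index = min(consecutive_nonunderstandings, len(ESCALATION)) - 1
    return ESCALATION[index]

print(recovery_prompt(1))  # -> "Sorry, I didn't catch that."
print(recovery_prompt(5))  # -> help-style prompt (the ladder saturates)
```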

4 questions under investigation

- what are the main causes of non-understandings?
- how large is their impact on performance?
- how do various recovery strategies compare to each other?
- what are the relationships between strategies and user behaviors?
- can we improve global dialog performance by using a smarter policy?
- if yes, can we learn a better policy from data?

→ data

5 data collection

- RoomLine: a phone-based, mixed-initiative system for conference room reservations
- experimental design:
  - control group: uninformed recovery policy
  - wizard group: recovery policy implemented by a wizard
- 46 participants, all first-time users
- tasks & experimental procedure: up to 10 scenario-driven interactions

6 non-understanding recovery strategies

S: For when do you need the conference room?
1. ASK REPEAT: Could you please repeat that?
2. ASK REPHRASE: Could you please try to rephrase that?
3. NOTIFY (NTFY): Sorry, I didn't catch that.
4. YIELD TURN (YLD): …
5. REPROMPT (RP): For when do you need the conference room?
6. DETAILED REPROMPT (DRP): Right now I need to know the date and time for when you need the reservation …
7. MOVE-ON: Sorry, I didn't catch that. For which day do you need the room?
8. YOU CAN SAY (YCS): Sorry, I didn't catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
9. TERSE YOU CAN SAY (TYCS): Sorry, I didn't catch that. You can say something like tomorrow at 10 am …
10. FULL HELP (HELP): Sorry, I didn't catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
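For illustration only, the ten strategies could be encoded as an enum, with the control group's "uninformed" policy read as a context-free random choice; both the encoding and that reading are assumptions of this sketch, not the RoomLine implementation.

```python
import random
from enum import Enum, auto

class RecoveryStrategy(Enum):
    """The ten non-understanding recovery strategies listed above."""
    ASK_REPEAT = auto()
    ASK_REPHRASE = auto()
    NOTIFY = auto()
    YIELD_TURN = auto()
    REPROMPT = auto()
    DETAILED_REPROMPT = auto()
    MOVE_ON = auto()
    YOU_CAN_SAY = auto()
    TERSE_YOU_CAN_SAY = auto()
    FULL_HELP = auto()

def uninformed_policy() -> RecoveryStrategy:
    """Pick a strategy uniformly at random, ignoring dialog context."""
    return random.choice(list(RecoveryStrategy))
```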

7 corpus statistics

- 449 sessions, 8278 user turns
- utterances transcribed and checked
- manual annotations:
  - misunderstandings
  - correct concept values at each turn
  - sources of understanding errors
  - user response-types to recovery strategies

8 questions under investigation

- data
- what are the main causes of non-understandings?
- how large is their impact on performance?
- how do various recovery strategies compare to each other?
- what are the relationships between strategies and user behaviors?

9 causes of non-understandings

[Figure: the user-to-system communication pipeline across four levels (channel, signal, intention, conversation). User side: goal → semantics → text → audio; system side: end-pointing → recognition → parsing → interpretation.]

10 causes of non-understandings

[Figure: breakdown of non-understanding sources across the four levels]
- ASR error: 62%
- out-of-grammar: 16%
- out-of-application: 16%
- endpointer error

11 questions under investigation
(roadmap: data · causes of non-understandings · impact on performance · strategy comparison · user behaviors)

12 modeling impact on performance

logistic regression:
P(Task Success) = 1 / (1 + e^(-(α + β·FNON)))
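A minimal sketch of fitting this model with statsmodels; the session data below is made up for illustration (FNON is the per-session frequency of non-understandings, task success is binary):

```python
import numpy as np
import statsmodels.api as sm

# Made-up sessions: per-session frequency of non-understandings (FNON)
# and whether the task succeeded.
fnon = np.array([0.05, 0.10, 0.30, 0.45, 0.60, 0.70])
success = np.array([1, 1, 1, 0, 1, 0])

X = sm.add_constant(fnon)                 # adds the intercept term (alpha)
model = sm.Logit(success, X).fit(disp=0)  # P(success) = 1 / (1 + e^-(alpha + beta*FNON))
alpha, beta = model.params
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```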

13 questions under investigation
(roadmap: data · causes of non-understandings · impact on performance · strategy comparison · user behaviors)

14 strategy performance – recovery rate

- overall logistic ANOVA: significant differences in mean recovery rates
- all-pairs comparison (corrected using FDR; a sketch of the correction follows below)

[Chart: recovery rate by strategy: MoveOn, Help, TerseYouCanSay, RePrompt, YouCanSay, AskRephrase, DetailedReprompt, Notify, AskRepeat, Yield]
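The FDR step might look like the following sketch, using statsmodels' Benjamini-Hochberg procedure; the strategy pairs and raw p-values are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Invented raw p-values from per-pair tests of mean recovery rates.
pairs = [("MoveOn", "Yield"), ("MoveOn", "AskRepeat"), ("Help", "Notify")]
p_values = [0.001, 0.004, 0.030]

reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for pair, p, sig in zip(pairs, p_adj, reject):
    print(pair, f"adjusted p = {p:.3f}", "significant" if sig else "n.s.")
```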

15 questions under investigation
(roadmap: data · causes of non-understandings · impact on performance · strategy comparison · user behaviors)

16 user response types

- tagging scheme by Shin et al. (also used by Choularton, Raux)
- 5 categories: repeat, rephrase, contradict, change, other

17 50% 40% 30% 20% 10% response types after non-understaning 0% rephraserepeat contradictchangeother Pizza (choularton & dale) Communicator (Shin et al.) Roomline (this study)

18 user response types by strategy

[Chart, 0–100%: proportion of rephrase, change, repeat, and other responses for each strategy: MoveOn, Help, TerseYouCanSay, RePrompt, YouCanSay, AskRephrase, DetailedReprompt, Notify, AskRepeat, Yield]

19 summary

- sources of non-understandings: ASR, but also "language" errors → more shaping strategies …
- impact on performance: regression model allows better quantitative assessment
- strategy comparison: help, "move-on" → further investigate "move-on"
- user responses: margin for improving control over user responses
- can we improve global dialog performance by using a smarter policy? yes
- can we learn a better policy from data? preliminary results promising …

20 thank you! questions …

21 rejections

[Figure 3: misunderstandings and non-understandings before and after the rejection mechanism, with rejections split into false rejections and correct rejections]

22 strategy performance assessment

- recovery rate
- recovery utility: weighted sum of correctly and incorrectly acquired concepts; weights are determined in a data-driven fashion
- recovery efficiency: also takes time to recovery into account (all three are sketched in code below)
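One plausible reading of these metrics in code; the episode record fields, the placeholder utility weights (the slide says they are fitted from data), and the exact form of the efficiency discount are all assumptions of this sketch:

```python
def recovery_rate(episodes):
    """Fraction of non-understandings whose next turn is correctly understood."""
    return sum(e["next_turn_understood"] for e in episodes) / len(episodes)

def recovery_utility(episodes, w_correct=1.0, w_incorrect=-1.0):
    """Mean weighted sum of correctly / incorrectly acquired concepts.
    The weights here are placeholders; the study derives them from data."""
    return sum(w_correct * e["correct_concepts"] +
               w_incorrect * e["incorrect_concepts"] for e in episodes) / len(episodes)

def recovery_efficiency(episodes):
    """Utility discounted by the number of turns needed to recover
    (one guess at how 'time to recovery' enters the metric)."""
    return sum((e["correct_concepts"] - e["incorrect_concepts"]) / e["turns_to_recover"]
               for e in episodes) / len(episodes)
```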

23 experimental design: scenarios

- 10 scenarios, fixed order
- presented graphically (explained during briefing)

24 strategy pair-wise comparison

- recovery performance ranked list, based on pair-wise t-tests:
  1. MOVE, 2. HELP, 3. TYCS, 4. RP, 5. YCS, 6. ARPH, then DRP, NTFY, AREP, YLD (ranks marked "?" on the slide)
- CER evaluation shows similar results

25 recovery for various response-types

[Chart: recovery performance broken down by user response type]


27 impact of recovery rate on performance

- recovery = next turn is correctly understood
- logistic regression:
  P(Task Success) = 1 / (1 + e^(-(α + β·RecoveryRate)))
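As a worked example with made-up coefficients (the slide does not report fitted values), α = -1 and β = 4 map a recovery rate of 0.5 to a task success probability of about 0.73:

```python
import math

alpha, beta, recovery_rate = -1.0, 4.0, 0.5  # hypothetical coefficients
p_success = 1 / (1 + math.exp(-(alpha + beta * recovery_rate)))
print(f"P(Task Success) = {p_success:.2f}")  # -> 0.73
```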