Slide 1: "Sorry, I didn't catch that …" Non-understandings and Recovery in Spoken Dialog Systems
Part I: Issues, Data Collection, Rejection Tuning
Dan Bohus
Sphinx Lunch Talk, Carnegie Mellon University, March 2005
Slide 2: ASR Errors & Spoken Dialog
- Call RoomLine!
- Call Let's Go!
Slide 3: Non-understandings and Misunderstandings
Recognition errors can lead to two types of problems in a spoken dialog system:

NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]

MIS-understanding: the system extracts incorrect information from the user's turn.
  S: What city are you leaving from?
  U: Birmingham [BERLIN PM]
Slide 4: Non-understandings and Misunderstandings
NON-understanding: the system cannot extract any meaningful information from the user's turn.
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]

How can we prevent non-understandings? How can we recover from them?
- Detection
- Set of recovery strategies
- Policy for choosing between them
Slide 5: Current State of Affairs
Detection / Diagnosis
- Systems know when a non-understanding happens: there was a detected user turn, but no meaningful information was extracted.
- Systems decide to reject because of low confidence.
- Not much exists in terms of diagnosis.
Set of recovery strategies
- Repeat the question
- "Can you repeat that?"
- "Sorry, I didn't catch that …"
Policy for choosing between them
- Traditionally, simple heuristics are used.
Slide 6: Questions Under Investigation
Detection / Diagnosis
- What are the main causes (sources) of non-understandings?
- What is their impact on global performance?
- Can we diagnose non-understandings at run-time?
- Can we optimize the rejection process in a more principled way?
Set of recovery strategies
- What is the relative performance of different recovery strategies?
- Can we refine current strategies and find new ones?
Policy for choosing between them
- Can we improve performance by making smarter choices?
- If so, can we learn how to make these smarter choices?
Next: the data collection experiment.
Slide 7: Data Collection: Experimental Design
- Subjects interacted over the telephone with RoomLine.
- Each performed 10 scenario-based tasks.
- Between-subjects experiment, 2 groups:
  - Control: the system uses a random (uniform) policy for engaging the non-understanding recovery strategies (a minimal sketch of this policy follows the strategy list on the next slide).
  - Wizard: the policy is determined at runtime by a human wizard.
- 46 subjects, balanced across gender and native / non-native speakers.
Slide 8: Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
1. MOVE-ON: Sorry, I didn't catch that. For which day do you need the room?
2. YOU CAN SAY (YCS): Sorry, I didn't catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
3. TERSE YOU CAN SAY (TYCS): Sorry, I didn't catch that. You can say something like tomorrow at 10 am …
4. FULL HELP (HELP): Sorry, I didn't catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
5. ASK REPEAT (AREP): Could you please repeat that?
6. ASK REPHRASE (ARPH): Could you please try to rephrase that?
7. NOTIFY (NTFY): Sorry, I didn't catch that.
8. YIELD TURN (YLD): …
9. REPROMPT (RP): For when do you need the conference room?
10. DETAILED REPROMPT (DRP): Right now I need to know the date and time for when you need the reservation …
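The control condition's uniform random policy (slide 7) reduces to a one-liner over these ten strategies. A minimal sketch; the abbreviations are taken from the slide, and nothing here is claimed about how RoomLine actually implements it:

```python
import random

# The ten recovery strategies listed above, by their slide abbreviations.
STRATEGIES = ["MOVE-ON", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "RP", "DRP"]

def control_policy() -> str:
    """Control condition: choose a recovery strategy uniformly at random."""
    return random.choice(STRATEGIES)
```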
Slide 9: Experimental Design: Scenarios
- 10 scenarios, in fixed order.
- Presented graphically (explained during briefing).
Slide 10: Experimental Design: Evaluation
- Participants filled in a SASSI evaluation questionnaire: 35 questions on a 1-7 Likert scale, grouped into 6 factors: response accuracy, likeability, cognitive demand, annoyance, habitability, speed.
- Overall user satisfaction score: 1-7.
- Open-ended questions: What did you like best / least? What would you change first?
Slide 11: Corpus Statistics / Characteristics
- 46 users; 449 sessions; 8278 user turns.
- User utterances transcribed & checked.
- Annotated with:
  - Concept transfer & misunderstandings: correctly transferred, incorrectly transferred, deleted, and substituted concepts; correct concept values at each turn.
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID, PART.
  - Corrections.
  - User response to non-understanding recovery: Repeat, Rephrase, Contradict, Change, Other.
Slide 12: Corpus
Slide 13: General Corpus Statistics

                                Total     Natives   Non-Natives
# Users                         46
# Sessions                      449
# User turns                    8278
Average turn length (# words)
WER                             25.61%    19.60%    39.54%
CER                             35.73%    26.30%    57.58%
OOV rate                        0.35%     0.33%     0.39%
% Non-understandings            16.96%    13.38%    25.25%
% Misunderstandings             13.53%    9.67%     22.48%
Task success                    75.06%    85.21%    44.14%
User satisfaction (1-7 Likert)

[remaining cell values appeared on the slide; not captured in this transcript]
Slide 14: Back to the Issues
Data Collection
Detection / Diagnosis
- What are the main causes (sources) of non-understandings?
- What is their impact on global performance?
- Can we diagnose non-understandings at run-time?
- Can we optimize the rejection process in a more principled way?
Set of recovery strategies
- What is the relative performance of different recovery strategies?
- Can we refine current strategies and find new ones?
Policy for choosing between them
- Can we improve performance by making smarter choices?
- If so, can we learn how to make these smarter choices?
Slide 15: Next …
(the same outline as slide 14; next up: "Can we optimize the rejection process in a more principled way?")
Slide 16: Utterance Rejection
- Systems use confidence scores to assess the reliability of inputs.
- A widely used design pattern: if confidence is very low (i.e., below a certain threshold), reject the utterance altogether.
- Non-understandings = genuine non-understandings + rejections.
- This creates a tradeoff between non-understandings and misunderstandings.
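A minimal sketch of this design pattern, assuming a hypothetical parse() stand-in and a placeholder threshold on a 0-1 confidence scale (the actual scale and value are system-specific):

```python
REJECT_THRESHOLD = 0.30  # placeholder; choosing this value is the subject of the next slides

def parse(hypothesis: str) -> dict:
    """Stand-in for the system's concept extraction."""
    return {"concepts": hypothesis.split()}

def handle_turn(hypothesis: str, confidence: float):
    """Reject low-confidence recognitions rather than risk a misunderstanding."""
    if confidence < REJECT_THRESHOLD:
        return None  # counted as a non-understanding; a recovery strategy fires
    return parse(hypothesis)
```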
Slide 17: Non- / Misunderstanding Tradeoff
[Figure: the non-understanding vs. misunderstanding tradeoff as a function of the rejection threshold]
Slide 18: An Alternative, More Informative View
[Figure: number of concepts transferred correctly (CTC) or incorrectly (ITC), as a function of the rejection threshold]
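These curves can be computed directly from the annotated corpus by sweeping the threshold. A sketch with synthetic per-turn data (the confidence scale and counts are made up):

```python
import numpy as np

# One row per user turn: (ASR confidence, concepts transferred correctly,
# concepts transferred incorrectly). Synthetic values for illustration.
turns = np.array([
    (0.90, 2, 0),
    (0.70, 1, 0),
    (0.55, 1, 1),
    (0.20, 0, 2),
])

def ctc_itc(threshold: float) -> tuple[int, int]:
    """CTC(th) and ITC(th): concept counts when turns below `threshold` are rejected."""
    accepted = turns[turns[:, 0] >= threshold]
    return int(accepted[:, 1].sum()), int(accepted[:, 2].sum())

for th in (0.0, 0.5, 0.8):
    print(th, ctc_itc(th))
```

Raising the threshold trades incorrectly transferred concepts (fewer misunderstandings) for correctly transferred ones lost to rejection.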
Slide 19: Current Solutions
- Set the threshold the way the ASR manual says. In all likelihood there is a mismatch: the vendor's confidence optimization probably targets WER, while the tradeoff between misunderstandings and rejections varies across domains, and even across dialog states.
- Go for the break-even point.
- Acknowledge the tradeoff, and solve it by postulating costs (e.g., "misunderstandings cost twice as much as rejections").
Slide 20: Proposed Approach
Use a data-driven approach to establish the costs, then optimize the threshold:
1. Identify a set of variables involved in the tradeoff: CTC(th) vs. ITC(th).
2. Choose a dialog performance metric: TC = task completion (binary, kappa); TD = task duration (# turns); US = user satisfaction.
3. Build a regression model:
   logit(TC) <- c_0 + c_CTC * CTC + c_ITC * ITC
4. Optimize the threshold to maximize performance:
   th* = argmax_th ( c_CTC * CTC(th) + c_ITC * ITC(th) )
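A sketch of steps 3 and 4, reusing the ctc_itc() sweep from the previous sketch. All numbers are synthetic, and statsmodels is just one convenient way to fit the logit; nothing here reflects the actual corpus:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-session concept counts and binary task completion.
CTC = np.array([12, 8, 15, 3, 10, 2, 14, 6])
ITC = np.array([1, 4, 0, 6, 2, 7, 1, 5])
TC  = np.array([1, 0, 1, 0, 1, 1, 1, 0])

# Step 3: logit(TC) <- c_0 + c_CTC * CTC + c_ITC * ITC
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.Logit(TC, X).fit(disp=0)
c0, c_ctc, c_itc = fit.params  # expect c_ctc > 0 (benefit), c_itc < 0 (cost)

# Step 4: th* = argmax_th ( c_CTC * CTC(th) + c_ITC * ITC(th) )
turns = np.array([(0.90, 2, 0), (0.70, 1, 0), (0.55, 1, 1), (0.20, 0, 2)])

def utility(th: float) -> float:
    accepted = turns[turns[:, 0] >= th]
    return c_ctc * accepted[:, 1].sum() + c_itc * accepted[:, 2].sum()

th_star = max(np.linspace(0.0, 1.0, 101), key=utility)
print(f"th* = {th_star:.2f}")
```

The fitted coefficients play the role of the "postulated" costs from slide 19, but are now estimated from data.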
Slide 21: State-Specific Costs & Thresholds
- The costs are potentially different at different points in the dialog.
- Count CTC and ITC at different states with different variables:
  logit(TC) <- c_0 + c_CTC,s1 * CTC_s1 + c_ITC,s1 * ITC_s1
                   + c_CTC,s2 * CTC_s2 + c_ITC,s2 * ITC_s2
                   + c_CTC,s3 * CTC_s3 + c_ITC,s3 * ITC_s3 + …
- Optimize a separate threshold for each state:
  th*_x = argmax_th ( c_CTC,x * CTC_x(th) + c_ITC,x * ITC_x(th) )
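The same machinery applies per state. In the sketch below the coefficients stand in for a state-specific logistic fit, and the turn data is randomly generated; all values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-state (c_CTC, c_ITC) pairs from the state-specific fit.
costs = {
    "open_request":    (+0.9, -1.5),
    "request_bool":    (+0.4, -0.6),
    "request_nonbool": (+0.7, -2.1),
}

# Synthetic per-state turns: (confidence, #correct concepts, #incorrect concepts).
turns_by_state = {
    s: np.column_stack([rng.random(50), rng.poisson(1.0, 50), rng.poisson(0.5, 50)])
    for s in costs
}

def ctc_itc(turns: np.ndarray, th: float):
    accepted = turns[turns[:, 0] >= th]
    return accepted[:, 1].sum(), accepted[:, 2].sum()

# One threshold per state: th*_x = argmax_th ( c_CTC,x*CTC_x(th) + c_ITC,x*ITC_x(th) )
thresholds = {}
for s, (c_ctc, c_itc) in costs.items():
    def u(th, turns=turns_by_state[s], a=c_ctc, b=c_itc):
        ctc, itc = ctc_itc(turns, th)
        return a * ctc + b * itc
    thresholds[s] = max(np.linspace(0.0, 1.0, 101), key=u)

print(thresholds)
```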
Slide 22: States Considered
- Open request: "How may I help you?"
- Request(bool): "Did you want a reservation for this room?"
- Request(non-bool): "Starting at what time do you need the room?"
- Finer granularity is desirable, and can be achieved given more data.
Slide 23: Model 1: Resulting Fit and Coefficients

Variable              Coeff   p   se
Const
CTC / oreq
ITC / oreq
CTC / req(bool)
ITC / req(bool)
CTC / req(non-bool)
ITC / req(non-bool)
[coefficient values appeared in the slide table; not captured in this transcript]

          Baseline   Train    Cross-V
AVG-LL
HARD      17.62%     11.66%   11.75%
Slide 24: Model 1: Threshold Optimization
[Figure: fitted utility vs. rejection threshold, one panel each for Open-request, Request(bool), and Request(non-bool)]
Optimized thresholds:
- Open-request: 0.00
- Req(bool): 0.00
- Req(non-bool): 61.00
Slide 25: Results Do Confirm Expectations
- Anecdotal evidence from the data collection indicated too many false rejections on open requests.
- The data analysis confirms this view.
Slide 26: What Would Change?

                          Current   New Estimate   Delta
Open-request CTC
Open-request ITC
Request(bool) CTC
Request(bool) ITC
Request(non-bool) CTC
Request(non-bool) ITC
[per-state values appeared on the slide; not captured in this transcript]

               Current   New Estimate   Delta
Task success   82.75%    87.16%         +4.41%

Remains to be seen …
Slide 27: Model 2: Description
- Global performance metric: task duration (successful tasks), TD, measured in # turns (a Poisson variable).
- Generalized linear model / Poisson:
  log(TD) <- c_0 + c_CTC * CTC + c_ITC * ITC
- But different tasks have different durations, so one would want to normalize by the typical duration TD̄_x of each task x:
  log(TD_x / TD̄_x) <- c_0 + c_CTC * CTC + c_ITC * ITC
- Instead, use regression offsets (coefficient fixed to 1):
  log(TD_x) <- 1 * log(TD̄_x) + c_0 + c_CTC * CTC + c_ITC * ITC
- Tradeoff variables: same as before.
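A sketch of Model 2 using statsmodels' Poisson GLM, with the log of a per-task typical duration as the offset; the durations and counts below are synthetic placeholders:

```python
import numpy as np
import statsmodels.api as sm

TD = np.array([14, 22, 11, 30, 16, 25])           # observed task duration, in turns
typical_TD = np.array([12, 12, 10, 18, 14, 18])   # typical duration of each task (TD̄_x)
CTC = np.array([8, 4, 9, 2, 7, 3])
ITC = np.array([0, 3, 0, 5, 1, 4])

# log(TD_x) <- 1 * log(TD̄_x) + c_0 + c_CTC * CTC + c_ITC * ITC
X = sm.add_constant(np.column_stack([CTC, ITC]))
fit = sm.GLM(TD, X, family=sm.families.Poisson(),
             offset=np.log(typical_TD)).fit()
print(fit.params)  # expect c_CTC < 0 (shorter dialogs), c_ITC > 0 (longer dialogs)
```

Passing log(TD̄_x) as an offset fixes its coefficient at 1, which is exactly the normalization the slide describes.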
Slides 28-29: Model 2: Resulting Fit and Coefficients

Variable              Coeff   p   se
Const
CTC / oreq
ITC / oreq
CTC / req(bool)
ITC / req(bool)
CTC / req(non-bool)
ITC / req(non-bool)
[coefficient values appeared in the slide table; not captured in this transcript]

R^2 = 0.56
Slide 31: Conclusion
- A model for tuning rejection that is genuinely data-driven.
- Relates state-specific costs of rejection to global dialog performance.
- Bridges the mismatch between an off-the-shelf confidence annotation scheme and the particular characteristics of the system's domain.
- More data would allow even finer-grained distinctions.
- The expected performance improvements remain to be verified.
Slide 32: Next Time …
(the outline from slide 14 again; still open are the recovery-strategy questions and the policy questions)