constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus, Computer Science Department, Carnegie Mellon University
www.cs.cmu.edu/~dbohus

Presentation transcript:

constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213

2. problem
spoken language interfaces lack robustness when faced with understanding errors
- errors stem mostly from speech recognition
- typical word error rates: 20-30%
- significant negative impact on interactions

3. more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm, arrives Seoul at …

4. two types of understanding errors
(same dialog as the previous slide, annotated)
- NON-understanding: the system fails to obtain any interpretation of the user's turn (e.g., the two "Sorry, I'm not sure I understood what you said" exchanges)
- MIS-understanding: the system obtains an incorrect interpretation (e.g., the user says "Huntsville", the system hears [SEOUL] and proceeds with "traveling to Seoul")

5. approaches for increasing robustness
- improve recognition
- gracefully handle errors through interaction:
  1. detect the problems
  2. develop a set of recovery strategies
  3. know how to choose between them (policy)

6. six not-so-easy pieces …
a 2 × 3 grid: {detection, strategies, policy} × {misunderstandings, non-understandings}

7. today's talk …
construct more accurate beliefs by integrating information over multiple turns in a conversation (the detection piece, for misunderstandings)
S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}

8. belief updating: problem statement
S: traveling to Seoul. What day did you need to travel?
U: [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {seoul/0.65} → destination = {?}
given:
- an initial belief P_initial(C) over concept C
- a system action SA
- a user response R
construct an updated belief:
P_updated(C) ← f(P_initial(C), SA, R) (interface sketched below)
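A minimal sketch of this interface in Python (all names here are illustrative, not from the talk; the point is only the shape of f):

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    # hypothesized values for one concept, e.g. {"seoul": 0.65};
    # any remaining probability mass is implicitly "other"
    hypotheses: dict = field(default_factory=dict)

def update_belief(initial: Belief, system_action: str, user_response: dict) -> Belief:
    """P_updated(C) = f(P_initial(C), SA, R); the talk learns f from data."""
    raise NotImplementedError  # placeholder for the learned model described later
```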

9. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

10. current solutions
- most systems only track values, not beliefs: new values overwrite old values
- use confidence scores
- explicit confirm:
  - yes → trust hypothesis
  - no → delete hypothesis
  - "other" → non-understanding
- implicit confirm: not much (heuristic sketched below)
"users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al, 2002]
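The heuristic described on this slide could be sketched roughly as follows (a simplification with illustrative names; actual systems differ):

```python
def heuristic_update(top_value, confidence, action, response_type):
    # typical hand-crafted rule: only explicit confirmations really
    # change the belief; implicit confirmations are mostly ignored
    if action == "explicit_confirm":
        if response_type == "yes":
            return top_value, 1.0        # trust the hypothesis
        if response_type == "no":
            return None, 0.0             # delete the hypothesis
        return top_value, confidence     # "other": treat as a non-understanding
    return top_value, confidence         # implicit confirm: not much happens
```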

11. confidence / detecting misunderstandings
- traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar, and many others]
- recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
- machine learning approach: binary classification
  - in-domain, labeled dataset
  - features from different knowledge sources: acoustic, language model, parsing, dialog management
  - ~50% relative reduction in classification error

12. detecting corrections
- detect whether the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
- machine learning approach: binary classification
  - in-domain, labeled dataset
  - features from different knowledge sources: acoustic, prosody, language model, parsing, dialog management
  - ~50% relative reduction in classification error

13. integration
- confidence annotation and correction detection are useful tools
- but separately, neither solves the problem
- bridge them together in a unified approach to accurately track beliefs

14. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

15. belief updating: general form
given:
- an initial belief P_initial(C) over concept C
- a system action SA
- a user response R
construct an updated belief:
P_updated(C) ← f(P_initial(C), SA, R)

16. two simplifications
1. belief representation
- the system is unlikely to "hear" more than 3 or 4 values for a concept within a dialog session
- in our data (considering only the top hypothesis from recognition): at most 3 conflicting values were heard, and more than 1 value was heard in only 6.9% of cases
- compressed beliefs: top-K concept hypotheses + other (for now, K=1); see the sketch below
2. updates following system confirmation actions
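A small sketch of the compressed "top-K + other" representation (illustrative code, assuming hypothesis scores are probabilities summing to at most 1):

```python
def compress_belief(hypotheses, k=1):
    # keep the k most probable values; lump all remaining mass into "other"
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:k])
    top["<other>"] = max(0.0, 1.0 - sum(top.values()))
    return top

# e.g. compress_belief({"boston": 0.65, "austin": 0.11}) ->
#      {"boston": 0.65, "<other>": 0.35}
```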

17. belief updating: reduced version
example: {boston/0.65; austin/0.11; …} + ExplicitConfirm(Boston) + user response [NOW] → {boston/?}
given:
- an initial confidence score Conf_init(th_C) for the current top hypothesis of concept C
- a system confirmation action SA
- a user response R
construct an updated confidence score for that hypothesis:
Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)

18. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

19. data
- collected with RoomLine
  - a phone-based, mixed-initiative spoken dialog system for conference room reservation
  - "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?"
- explicit and implicit confirmations
- confidence threshold model (+ some exploration)
- unplanned implicit confirmations

20. corpus
- user study
  - 46 participants (naïve users)
  - 10 scenario-based interactions each
  - compensated per task success
- corpus
  - 449 sessions, 8848 user turns
  - orthographically transcribed
  - manually annotated: misunderstandings, corrections, correct concept values

21. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

22. user response types
- following [Krahmer and Swerts, 2000], a study on a Dutch train-table information system
- 3 user response types:
  - YES: yes, right, that's right, correct, etc.
  - NO: no, wrong, etc.
  - OTHER
- cross-tabulated against the correctness of system confirmations

23. user responses to explicit confirmations

            YES          NO           Other
CORRECT     94% [93%]    0% [0%]      5% [7%]
INCORRECT   1% [6%]      72% [57%]    27% [37%]   (~10%)

[numbers in brackets from Krahmer & Swerts]

24. other responses to explicit confirmations
- ~70% of users repeat the correct value
- ~15% of users don't address the question (attempt to shift the conversation focus)
- how often do users correct the system?

            User does not correct    User corrects
CORRECT     1159                     0
INCORRECT   29 [10% of incor]        250 [90% of incor]

25. user responses to implicit confirmations

            YES         NO          Other
CORRECT     30% [0%]    7% [0%]     63% [100%]
INCORRECT   6% [0%]     33% [15%]   61% [85%]

[numbers in brackets from Krahmer & Swerts]

26. ignoring errors in implicit confirmations
- how often do users correct the system?

            User does not correct    User corrects
CORRECT     552                      2
INCORRECT   118 [51% of incor]       111 [49% of incor]

- explanation:
  - users correct later (40% of the 118)
  - users interact strategically / correct only if essential

            ~correct later    correct later
~critical   55                2
critical    14                47

27. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

28. machine learning approach
- problem: Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)
- need good probability outputs: low cross-entropy between model predictions and reality
- logistic regression
  - sample efficient
  - stepwise approach → feature selection
- one logistic model tree for each action, with the root splitting on response type (sketched below)
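A rough sketch of this setup (scikit-learn based; class and variable names are illustrative, and it hand-codes only the root split on response type rather than inducing a full logistic model tree):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ResponseSplitModel:
    """One such model would be trained per system action."""
    def __init__(self):
        self.leaves = {}  # response type ("yes"/"no"/"other") -> logistic model

    def fit(self, X, y, response_types):
        # y[i] = 1 if the top hypothesis turned out to be correct on turn i
        for r in np.unique(response_types):
            mask = response_types == r
            self.leaves[r] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        return self

    def updated_confidence(self, x, response_type):
        # the model's probability output serves as the updated confidence score
        return self.leaves[response_type].predict_proba(x.reshape(1, -1))[0, 1]
```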

29. features and target
- Initial: initial confidence score of the top hypothesis, # of initial hypotheses, concept type (bool / non-bool), concept identity
- System action: indicators describing other system actions taken in conjunction with the current confirmation
- User response:
  - acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
  - lexical: number of words, lexical terms highly correlated with corrections (MI)
  - grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
  - dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in
- target: was the top hypothesis correct?

30. baselines
- initial baseline: accuracy of system beliefs before the update
- heuristic baseline: accuracy of the heuristic update rule used by the system
- oracle baseline: accuracy if we knew exactly what the user said

31. results: explicit confirmation
[bar chart: hard error (%) and soft error (%), y-axis 0-20%, comparing initial, heuristic, logistic model tree, and oracle]

32. results: implicit confirmation
[bar chart: hard error (%) and soft error (%), y-axis 0-20%, comparing initial, heuristic, logistic model tree, and oracle]

33. results: unplanned implicit confirmation
[bar chart: hard error (%) and soft error (%), y-axis 0-10%, comparing initial, heuristic, logistic model tree, and oracle]

34. informative features
- initial confidence score
- prosody features
- barge-in
- expectation match
- repeated grammar slots
- concept identity

35. summary
- a data-driven approach for constructing accurate system beliefs
  - integrates information across multiple turns
  - bridges together the detection of misunderstandings and corrections
  - performs better than current heuristics
- user response analysis
  - users don't correct unless the error is critical

36. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

37. current extensions

                        so far                    extensions
belief representation   top hypothesis + other    k hyps + other
model                   logistic regression       multinomial GLM
system action           confirmation actions      all actions: confirmation (expl/impl), request, unexpected
features                                          added priors

38 10% 20% 30% 2 hypotheses + other 4% 0% 8% 12% 9.49% 6.08% 98.14% 20% 0% 40% 45.03% 19.23% 80.00% 10% 0% 20% 30% 16.17% 5.52% 30.83% 0% 26.16% 17.56% 30.46% 0% 12% 6.06% 21.45% explicit confirmation implicit confirmation request unexpected update initial heuristic lmt(basic) lmt(basic+concept) oracle 4% 8% 15.15% 10.72% 15.49% 14.02% unplanned impl. conf. 7.86% 22.69% 12.95% 9.64% 25.66% related work : restricted version : data : user response analysis : results : current and future work

39. other work
(across detection, strategies, and policy, for misunderstandings and non-understandings)
- costs for errors
- rejection threshold adaptation
- impact of non-understandings on performance [Interspeech-05]
- comparative analysis of 10 recovery strategies [SIGdial-05]
- impact of policy on performance
- towards learning non-understanding recovery policies [SIGdial-05]
- belief updating [ASRU-05]
- transferring confidence annotators across domains [in progress]
- RavenClaw: dialog management for task-oriented systems (RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime) [EuroSpeech-03, HLT-05]

40. thank you! questions …

41. a more subtle caveat
- distribution of training data: confidence annotator + heuristic update rules
- distribution of run-time data: confidence annotator + learned model
- always a problem when interacting with the world!
- hopefully, the distribution shift will not cause a large degradation in performance
  - remains to be validated empirically
  - maybe a bootstrap approach?

42. KL-divergence & cross-entropy
- KL divergence: D(p||q) = Σ_x p(x) log ( p(x) / q(x) )
- cross-entropy: CH(p, q) = H(p) + D(p||q) = −Σ_x p(x) log q(x)
- minimizing cross-entropy against the empirical label distribution is the same as minimizing the negative log likelihood of the data
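A quick numeric sketch of these quantities (natural log; assumes strictly positive p and q):

```python
import numpy as np

def kl(p, q):
    # D(p||q) = sum_x p(x) log(p(x)/q(x))
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(p, q):
    # CH(p,q) = H(p) + D(p||q) = -sum_x p(x) log q(x)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log(q)))

# with empirical 0/1 labels, cross-entropy reduces to the average
# negative log likelihood that the model is trained to minimize
```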

43. logistic regression
- regression model for binomial (binary) dependent variables
- fit a model using maximum likelihood (average log-likelihood); any stats package will do it for you
- no R² measure; test fit using the "likelihood ratio" test
- stepwise logistic regression (sketched below):
  - keep adding variables while the data likelihood increases significantly
  - use the Bayesian information criterion to avoid overfitting
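A sketch of forward stepwise selection with a BIC stopping rule (statsmodels-based; names are illustrative, and the talk's exact selection procedure may differ in detail):

```python
import numpy as np
import statsmodels.api as sm

def stepwise_logit(X, y, feature_names):
    selected, best_bic = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # try adding each remaining feature; keep the one that lowers BIC most
        trials = []
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            trials.append((sm.Logit(y, design).fit(disp=0).bic, j))
        bic, j = min(trials)
        if bic >= best_bic:   # no improvement under BIC: stop (guards overfitting)
            break
        best_bic = bic
        selected.append(j)
        remaining.remove(j)
    return [feature_names[j] for j in selected]
```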

44. logistic regression
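(The body of this slide did not survive the transcript; presumably it showed the model itself, which in its standard form, stated here as a general fact rather than recovered slide content, is P(y = 1 | x) = 1 / (1 + exp(−(b0 + b·x))).)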

45. logistic model tree
- a regression tree, but with logistic models on the leaves
[diagram: a root split on feature f (f=0 / f=1), a further split on feature g (g<=10 / g>10), with a logistic model at each leaf]

46. user study
- 46 participants, 1st-time users
- 10 scenarios, fixed order
- presented graphically (explained during briefing)
- participants compensated per task success