misunderstandings, corrections and beliefs in spoken language interfaces Dan Bohus Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213

2 problem spoken language interfaces lack robustness when faced with understanding errors  stems mostly from speech recognition  spans most domains and interaction types  exacerbated by operating conditions

3 more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………

4 some statistics …
 corrections [Krahmer, Swerts, Litman, Levow]
  30% of utterances correct system mistakes
  2-3 times more likely to be misrecognized
 semantic error rates: ~25-35%
  SpeechActs [SRI]             25%
  CU Communicator [CU]         27%
  Jupiter [MIT]                28%
  CMU Communicator [CMU]       32%
  How May I Help You? [AT&T]   36%

5 two types of understanding errors
 NON-understanding: system cannot extract any meaningful information from the user's turn
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]
 MIS-understanding: system extracts incorrect information from the user's turn
  S: What city are you leaving from?
  U: Birmingham [BERLIN PM]

6 misunderstandings
 MIS-understanding: system extracts incorrect information from the user's turn
  S: What city are you leaving from?
  U: Birmingham [BERLIN PM]
 detect potential misunderstandings; do something about them
 fix recognition

7 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]

8 detecting misunderstandings  recognition confidence scores S: What city are you leaving from? U: Birmingham [BERLIN PM] conf=0.63  traditionally [Bansal, Chase, Cox, Kemp, many others]  speech recognition confidence scores  use acoustic, language model and search info  frame, phoneme, word-level

9 “semantic” confidence scores  we’re interested in semantics, not words  YES = YEAH, NO = NO WAY  use machine learning to build confidence annotators  in-domain, manually labeled data utterance: [BERLIN PM] Birmingham labels:correct / misunderstood  features from different knowledge sources  binary classification problem  probability of misunderstanding: regression problem
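To make the recipe concrete, here is a minimal sketch of such a confidence annotator, assuming scikit-learn and a hypothetical labeled feature file; the feature set and file name are illustrative, not the ones used in the systems above.

```python
# Sketch: training a "semantic" confidence annotator as a binary classifier.
# The CSV layout and feature names are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Each row = one concept hypothesis extracted from a user turn,
# labeled 1 if correctly understood, 0 if misunderstood.
data = np.loadtxt("labeled_turns.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

annotator = LogisticRegression(max_iter=1000)
annotator.fit(X_train, y_train)

# predict_proba gives P(correct), i.e. a semantic confidence score in [0, 1]
confidence = annotator.predict_proba(X_test)[:, 1]
```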

10 a typical result  Identifying User Corrections Automatically in a Spoken Dialog System [Walker, Wright, Langkilde]  HowMayIHelpYou corpus: call routing for phone services  turns  features  ASR: recog, numwords, duration, dtmf, rg-grammar, tempo …  understanding: confidence, context-shift, top-task, diff-conf, …  dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …  binary classification task  majority baseline (error): 36.5%  RIPPER (error): 14%

11 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]

12 detect user corrections  is the user trying to correct the system? S: Where would you like to go? U: Huntsville [SEOUL] S: traveling to Seoul. What day did you need to travel? U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M] user correction misunderstanding  same story: use machine learning  in-domain, manually labeled data  features from different knowledge sources  binary classification problem  probability of correction: regression problem
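The same recipe applies here; a brief sketch, with hypothetical prosodic and dialog-history features, of a detector whose output is a correction probability (the COR score attached to user turns in the later examples).

```python
# Sketch: a user-correction detector, trained like the confidence annotator
# but over prosodic and dialog-history features (names are hypothetical).
from sklearn.linear_model import LogisticRegression

def train_correction_detector(X_train, y_train):
    """X_train: per-turn features (f0 max, energy, duration, preceding system
    action, number of reprompts, ...); y_train: 1 if the turn corrects the
    system, 0 otherwise."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At runtime, detector.predict_proba(x)[:, 1] is P(correction), i.e. the COR
# score attached to user turns in the belief updating examples that follow.
```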

13 typical result  Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]  TOOT corpus: access to train information  2328 turns, 152 dialogs  features  prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo …  ASR: gram, str, conf, ynstr, …  dialog position: diadist  dialog history: preturn, prepreturn, pmeanf  binary classification task  majority baseline: 29%  RIPPER: 15.7%

14 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]

15 belief updating problem: an easy case
S: on which day would you like to travel?
U: on September 3rd [AN DECEMBER THIRD] {CONF=0.25}
   departure_date = {Dec-03/0.25}
S: did you say you wanted to leave on December 3rd?
U: no [NO] {CONF=0.88}
   departure_date = {Ø}

16 belief updating problem: a trickier case
S: Where would you like to go?
U: Huntsville [SEOUL] {CONF=0.65}
   destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
   destination = {?}

17 belief updating problem formalized
 given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
 construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)

S: traveling to Seoul. What day did you need to travel?
   destination = {seoul/0.65}
[THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
   destination = {?}

18 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]  current solutions  a restricted version  data  user response analysis  experiments and results  discussion. caveats. future work

19 belief updating: current solutions  most systems only track values, not beliefs  new values overwrite old values  explicit confirm + yes → trust hypothesis  explicit confirm + no → kill hypothesis  explicit confirm + “other” → non-understanding  implicit confirm: not much “users who discover errors through incorrect implicit confirmations have a harder time getting back on track” [Shin et al, 2002]
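For reference, the heuristic behavior described above fits in a few lines; this is a sketch of that common baseline, not the actual code of any particular system, with a belief represented as a single (value, confidence) pair.

```python
# Sketch of the heuristic update rules most systems use (the baseline the
# talk argues against); names and representation are illustrative.
def heuristic_update(belief, system_action, response_type):
    """belief: (value, confidence) or None; response_type: 'yes', 'no' or 'other'."""
    if system_action == "explicit_confirm":
        if response_type == "yes":
            return (belief[0], 1.0)      # trust the hypothesis
        if response_type == "no":
            return None                  # kill the hypothesis
        return belief                    # 'other' -> treated as a non-understanding
    if system_action == "implicit_confirm":
        # implicit confirmations are typically not used to update the belief at all;
        # new values simply overwrite old ones elsewhere in the system
        return belief
    return belief
```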

20 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]  current solutions  a restricted version  data  user response analysis  experiments and results  discussion. caveats. future work

21 belief updating: general form
 given:
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
 construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)

22 restricted version: 2 simplifications
1. compact belief
  system unlikely to “hear” more than 3 or 4 values (single vs. multiple recognition results)
  in our data: max = 3 values, only 6.9% have >1 value
  confidence score of top hypothesis
2. updates after confirmation actions
 reduced problem: ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)
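In code, the restricted problem reduces to a scalar-to-scalar mapping conditioned on the system action and on features of the user response; a sketch of that interface, assuming one fitted probabilistic classifier per confirmation action (all names are illustrative).

```python
# Sketch of the restricted update: only the confidence of the top hypothesis
# is tracked, and it is revised after a confirmation action. "models" maps
# each confirmation action to a fitted classifier (e.g. logistic regression)
# whose positive-class probability is the updated confidence.
def update_top_confidence(conf_initial, system_action, response_features, models):
    x = [[conf_initial] + list(response_features.values())]
    return models[system_action].predict_proba(x)[0, 1]
```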

23 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]  current solutions  a restricted version  data  user response analysis  experiments and results  discussion. caveats. future work

24 data
 collected with RoomLine
  a phone-based mixed-initiative spoken dialog system
  conference room reservation search and negotiation
  e.g. “I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?”
 explicit and implicit confirmations
  confidence threshold model (+ some exploration)
 implicit confirmation task

25 user study  46 participants, 1st-time users  10 scenarios, fixed order  presented graphically (explained during briefing)  compensated per task success

26 corpus statistics  449 sessions, 8848 user turns  orthographically transcribed  manually annotated  misunderstandings (concept-level)  non-understandings  user corrections  correct concept values

27 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]  current solutions  a restricted version  data  user response analysis  experiments and results  discussion. caveats. future work

28 user response types  following Krahmer and Swerts  study on Dutch train-table information system  3 user response types  YES: yes, right, that’s right, correct, etc.  NO: no, wrong, etc.  OTHER  cross-tabulated against correctness of confirmations

29 user responses to explicit confirmations
 from transcripts [numbers in brackets from Krahmer & Swerts]

                 YES          NO           Other
   CORRECT       94% [93%]    0% [0%]      5% [7%]
   INCORRECT     1% [6%]      72% [57%]    27% [37%]     ~10%

 from decoded

                 YES          NO           Other
   CORRECT       87%          1%           12%
   INCORRECT     1%           61%          38%

30 other responses to explicit confirmations
 ~70% users repeat the correct value
 ~15% users don't address the question
  attempt to shift conversation focus

                 User does not correct    User corrects
   CORRECT       115                      90
   INCORRECT     29 [10% of incor]        250 [90% of incor]

31 user responses to implicit confirmations
 transcripts [numbers in brackets from Krahmer & Swerts]

                 YES          NO           Other
   CORRECT       30% [0%]     7% [0%]      63% [100%]
   INCORRECT     6% [0%]      33% [15%]    61% [85%]

 decoded

                 YES          NO           Other
   CORRECT       28%          5%           67%
   INCORRECT     7%           27%          66%

32 ignoring errors in implicit confirmations

                 User does not correct    User corrects
   CORRECT       552                      2
   INCORRECT     118 [51% of incor]       111 [49% of incor]

 users correct later (40% of 118)
 users interact strategically
  correct only if essential

                 ~correct later    correct later
   ~critical     55                2
   critical      14                47

33 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]  current solutions  a restricted version  data  user response analysis  experiments and results  discussion. caveats. future work

34 machine learning approach  need good probability outputs  low cross-entropy between model predictions and reality  cross-entropy = negative average log posterior  logistic regression  sample efficient  stepwise approach → feature selection  logistic model tree for each action  root splits on response-type
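A compressed sketch of this setup, assuming scikit-learn: the data for each system action is first split on the response type (the root of the logistic model tree), a logistic regression is fit on each branch, and models are compared by cross-entropy. Names and the splitting scheme shown are illustrative.

```python
# Sketch: one logistic model tree per system action, with the root split on
# the user's response type and logistic regressions at the leaves.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_action_model(X, y, response_type):
    """Split on response type (yes/no/other), then fit a logistic regression
    on each branch; returns a dict of leaf models."""
    leaves = {}
    for rtype in ("yes", "no", "other"):
        mask = response_type == rtype
        leaves[rtype] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return leaves

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Negative average log posterior of the correct outcome."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```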

35 features. target.  initial situation  initial confidence score  concept identity, dialog state, turn number  system action  other actions performed in parallel  features of the user response  acoustic / prosodic features  lexical features  grammatical features  dialog-level features  target: was the value correct?

36 baselines  initial baseline  accuracy of system beliefs before the update  heuristic baseline  accuracy of heuristic rule currently used in the system  oracle baseline  accuracy if we knew exactly when the user is correcting the system

37 results: explicit confirmation [chart: hard error (%) and soft error]

38 results: implicit confirmation [chart: hard error (%) and soft error]

39 results: unplanned implicit confirmation [chart: hard error (%) and soft error]

40 informative features  initial confidence score  prosody features  barge-in  expectation match  repeated grammar slots  concept id  priors on concept values [not included in these results]

41 outline  detecting misunderstandings  detecting user corrections [late-detection of misunderstandings]  belief updating [construct accurate beliefs by integrating information from multiple turns]  current solutions  a restricted version  data  user response analysis  experiments and results  discussion. caveats. future work

42 discussion  evaluation  does it make sense?  what would be a better evaluation?  current limitation: belief compression  extending models to N hypotheses + other  current limitation: system actions  extending models to cover all system actions

43 thank you!

44 a more subtle caveat  distribution of training data  confidence annotator + heuristic update rules  distribution of run-time data  confidence annotator + learned model  always a problem when interacting with the world!  hopefully, distribution shift will not cause large degradation in performance  remains to validate empirically  maybe a bootstrap approach?

45 KL-divergence & cross-entropy  KL divergence: D(p||q) = Σ_x p(x) log ( p(x) / q(x) )  cross-entropy: CH(p, q) = H(p) + D(p||q) = −Σ_x p(x) log q(x)  when p is the empirical distribution of the data, cross-entropy is the negative average log likelihood of the data under q
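A tiny numeric illustration of these identities (the distributions are made up):

```python
# Check that cross-entropy = entropy + KL divergence on two toy distributions.
import numpy as np

p = np.array([0.7, 0.2, 0.1])          # "reality" (empirical distribution)
q = np.array([0.6, 0.3, 0.1])          # model predictions

kl = np.sum(p * np.log(p / q))         # D(p||q)
h_p = -np.sum(p * np.log(p))           # entropy H(p)
cross_entropy = h_p + kl               # equals -sum(p * log q)

assert np.isclose(cross_entropy, -np.sum(p * np.log(q)))
```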

46 logistic regression  regression model for binomial (binary) dependent variables  fit a model using max likelihood (avg log-likelihood)  any stats package will do it for you  no R² measure  test fit using “likelihood ratio” test  stepwise logistic regression  keep adding variables while data likelihood increases significantly  use the Bayesian information criterion to avoid overfitting
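A sketch of the stepwise procedure, assuming statsmodels and a pandas DataFrame of candidate features; greedy forward selection with BIC as the stopping criterion (names and data are placeholders).

```python
# Sketch: forward stepwise logistic regression guarded by BIC.
import numpy as np
import statsmodels.api as sm

def forward_stepwise_bic(X, y, candidate_columns):
    """Greedily add the feature that most improves BIC; stop when none helps.
    X: pandas DataFrame of candidate features, y: binary target."""
    selected, best_bic = [], np.inf
    while True:
        best_candidate = None
        for col in candidate_columns:
            if col in selected:
                continue
            design = sm.add_constant(X[selected + [col]])
            bic = sm.Logit(y, design).fit(disp=0).bic
            if bic < best_bic:
                best_bic, best_candidate = bic, col
        if best_candidate is None:
            return selected
        selected.append(best_candidate)
```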

47 logistic regression  model form: P(y = 1 | x) = 1 / (1 + exp(−(β₀ + β·x)))

48 logistic model tree  regression tree, but with logistic models on the leaves  [diagram: tree splitting on f (f=0 / f=1) and on g (g<=10 / g>10), with a logistic model at each leaf]
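A hand-written sketch of evaluating the tree in the diagram, with hypothetical leaf names and fitted logistic models supplied by the caller:

```python
# Sketch: evaluating a small logistic model tree by hand. The split structure
# mirrors the diagram above; leaf names and features are hypothetical.
def lmt_predict(x, leaf_models):
    """x: feature dict containing a binary 'f' and a numeric 'g';
    leaf_models: dict mapping a leaf name to a fitted logistic regression."""
    if x["f"] == 0:
        leaf = "f0"
    elif x["g"] <= 10:
        leaf = "f1_g_low"
    else:
        leaf = "f1_g_high"
    features = [[x["f"], x["g"]]]
    return leaf_models[leaf].predict_proba(features)[0, 1]
```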