misunderstandings, corrections and beliefs in spoken language interfaces

Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus | dbohus@cs.cmu.edu

problem

spoken language interfaces lack robustness when faced with understanding errors
 stems mostly from speech recognition
 spans most domains and interaction types
 exacerbated by operating conditions

more concretely …

S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of August [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th… I have a flight departing Chicago at 1:40pm, arrives Seoul at …

some statistics …

 corrections [Krahmer, Swerts, Litman, Levow]
 30% of utterances correct system mistakes
 corrections are 2-3 times more likely to be misrecognized
 semantic error rates: ~25-35%

  SpeechActs [SRI]             25%
  CU Communicator [CU]         27%
  Jupiter [MIT]                28%
  CMU Communicator [CMU]       32%
  How May I Help You? [AT&T]   36%

two types of understanding errors

 NON-understanding: the system cannot extract any meaningful information from the user’s turn
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

 MIS-understanding: the system extracts incorrect information from the user’s turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]

misunderstandings

S: What city are you leaving from?
U: Birmingham [BERLIN PM]

 MIS-understanding: the system extracts incorrect information from the user’s turn
 detect potential misunderstandings; do something about them
 fix recognition

outline

 detecting misunderstandings
 detecting user corrections [late detection of misunderstandings]
 belief updating [construct accurate beliefs by integrating information from multiple turns]

detecting misunderstandings

 recognition confidence scores
S: What city are you leaving from?
U: Birmingham [BERLIN PM] conf=0.63
 traditionally [Bansal, Chase, Cox, Kemp, many others]
 speech recognition confidence scores
 use acoustic, language model and search information
 computed at the frame, phoneme, or word level

“semantic” confidence scores

 we’re interested in semantics, not words
 YES = YEAH, NO = NO WAY
 use machine learning to build confidence annotators
 in-domain, manually labeled data
   utterance: [BERLIN PM] Birmingham
   labels: correct / misunderstood
 features from different knowledge sources
 binary classification problem
 probability of misunderstanding: regression problem
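To make the classification setup concrete, here is a minimal sketch of such a confidence annotator: a logistic model over a few features of the turn. The feature names and weights are invented for illustration; a real annotator would learn them from the in-domain labeled data described above.

```python
import math

def misunderstanding_probability(features, weights, bias):
    """Logistic model: P(misunderstood | features) = sigmoid(w.x + b)."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights: low ASR confidence raises the probability that the
# extracted concept value is wrong; long, barged-in turns also raise it.
WEIGHTS = {"asr_confidence": -4.0, "num_words": 0.2, "barge_in": 0.8}
BIAS = 1.0

p_low = misunderstanding_probability(
    {"asr_confidence": 0.3, "num_words": 5, "barge_in": 1}, WEIGHTS, BIAS)
p_high = misunderstanding_probability(
    {"asr_confidence": 0.95, "num_words": 5, "barge_in": 1}, WEIGHTS, BIAS)
assert p_low > p_high  # higher ASR confidence -> lower misunderstanding probability
```

The same model serves both framings on the slide: threshold the output for the binary classification problem, or use the probability directly for the regression problem.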

a typical result

 Identifying User Corrections Automatically in a Spoken Dialog System [Walker, Wright, Langkilde]
 How May I Help You? corpus: call routing for phone services
 11,787 turns
 features
 ASR: recog, numwords, duration, dtmf, rg-grammar, tempo, …
 understanding: confidence, context-shift, top-task, diff-conf, …
 dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …
 binary classification task
 majority baseline (error): 36.5%
 RIPPER (error): 14%

detect user corrections

 is the user trying to correct the system?
S: Where would you like to go?
U: Huntsville [SEOUL]                                                ← misunderstanding
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]   ← user correction
 same story: use machine learning
 in-domain, manually labeled data
 features from different knowledge sources
 binary classification problem
 probability of correction: regression problem
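Some of the cues a correction detector feeds on can be extracted with simple lexical heuristics. The sketch below is illustrative only: the marker list and feature names are assumptions for this example, not the feature sets used in the studies cited on these slides.

```python
CORRECTION_MARKERS = {"no", "wrong", "not"}  # hypothetical lexical cue list

def correction_features(user_turn, last_confirmed_value):
    """Extract simple lexical cues that a turn is correcting the system."""
    words = user_turn.lower().split()
    return {
        # an explicit negation ("no no I'm traveling to ...")
        "has_negation": any(w in CORRECTION_MARKERS for w in words),
        # immediate word repetition, a common correction marker
        "repeats_marker": any(words[i] == words[i + 1] for i in range(len(words) - 1)),
        # the user is not echoing the value the system just confirmed
        "re_mentions_slot": last_confirmed_value.lower() not in user_turn.lower(),
    }

feats = correction_features("no no I'm traveling to Birmingham", "Seoul")
# all three cues fire on the example turn above
```

In practice these lexical cues would be combined with the prosodic and dialog-history features listed on the next slide and fed to a learned classifier.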

a typical result

 Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]
 TOOT corpus: access to train information
 2,328 turns, 152 dialogs
 features
 prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo, …
 ASR: gram, str, conf, ynstr, …
 dialog position: diadist
 dialog history: preturn, prepreturn, pmeanf
 binary classification task
 majority baseline: 29%
 RIPPER: 15.7%

belief updating problem: an easy case

S: on which day would you like to travel?
U: on September 3rd [AN DECEMBER THIRD] {CONF=0.25}
   departure_date = {Dec-03/0.25}
S: did you say you wanted to leave on December 3rd?
U: no [NO] {CONF=0.88}
   departure_date = {Ø}

belief updating problem: a trickier case

S: Where would you like to go?
U: Huntsville [SEOUL] {CONF=0.65}
   destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
   destination = {?}

belief updating problem, formalized

 given:
 an initial belief P_initial(C) over concept C
 a system action SA
 a user response R
 construct an updated belief:
 P_updated(C) ← f(P_initial(C), SA, R)

S: traveling to Seoul. What day did you need to travel?
   destination = {seoul/0.65}
U: [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
   destination = {?}

outline

 detecting misunderstandings
 detecting user corrections [late detection of misunderstandings]
 belief updating [construct accurate beliefs by integrating information from multiple turns]
 current solutions
 a restricted version
 data
 user response analysis
 experiments and results
 discussion. caveats. future work

belief updating: current solutions

 most systems only track values, not beliefs
 new values overwrite old values
 explicit confirm + yes → trust hypothesis
 explicit confirm + no → kill hypothesis
 explicit confirm + “other” → non-understanding
 implicit confirm: not much
“users who discover errors through incorrect implicit confirmations have a harder time getting back on track” [Shin et al., 2002]
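These heuristic rules amount to a tiny update function. A sketch of that logic (the value/confidence tuple representation and the response-type labels are assumptions for illustration):

```python
def heuristic_update(belief, response_type):
    """Heuristic belief update after an explicit confirmation.

    belief: (value, confidence) or None; response_type: 'yes' / 'no' / 'other'.
    """
    if belief is None:
        return None
    value, _conf = belief
    if response_type == "yes":   # trust the hypothesis outright
        return (value, 1.0)
    if response_type == "no":    # kill the hypothesis
        return None
    return belief                # 'other': treated as a non-understanding

assert heuristic_update(("seoul", 0.65), "yes") == ("seoul", 1.0)
assert heuristic_update(("seoul", 0.65), "no") is None
```

Note what this rule set cannot express: graded evidence. A "no" spoken with 0.88 confidence and one decoded with 0.30 confidence both kill the hypothesis outright, which is exactly the gap the learned belief-updating models on the following slides address.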

restricted version: 2 simplifications

1. compact belief
 system unlikely to “hear” more than 3 or 4 values (single vs. multiple recognition results)
 in our data: max = 3 values; only 6.9% have >1 value
 track the confidence score of the top hypothesis
2. updates after confirmation actions
 reduced problem: ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)

data

 collected with RoomLine
 a phone-based mixed-initiative spoken dialog system for conference room reservation: search and negotiation
 “I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?”
 explicit and implicit confirmations
 confidence threshold model (+ some exploration)
 implicit confirmation task

user study

 46 participants, first-time users
 10 scenarios, fixed order
 scenarios presented graphically (explained during briefing)
 participants compensated per task success

corpus statistics

 449 sessions, 8,848 user turns
 orthographically transcribed
 manually annotated
 misunderstandings (concept-level)
 non-understandings
 user corrections
 correct concept values

user response types

 following Krahmer and Swerts
 study on a Dutch train-timetable information system
 3 user response types
 YES: yes, right, that’s right, correct, etc.
 NO: no, wrong, etc.
 OTHER
 cross-tabulated against correctness of confirmations

user responses to explicit confirmations

 from transcripts [numbers in brackets from Krahmer & Swerts]:

                     YES         NO          Other
  CORRECT            94% [93%]    0% [0%]     5% [7%]
  INCORRECT (~10%)    1% [6%]    72% [57%]   27% [37%]

 from decoded output:

               YES    NO     Other
  CORRECT      87%     1%    12%
  INCORRECT     1%    61%    38%

other responses to explicit confirmations

 ~70% of users repeat the correct value
 ~15% of users don’t address the question
 attempt to shift conversation focus

               User does not correct    User corrects
  CORRECT        1159                      0
  INCORRECT        29 [10% of incor]     250 [90% of incor]

user responses to implicit confirmations

 transcripts [numbers in brackets from Krahmer & Swerts]:

               YES        NO          Other
  CORRECT      30% [0%]    7% [0%]    63% [100%]
  INCORRECT     6% [0%]   33% [15%]   61% [85%]

 decoded:

               YES    NO     Other
  CORRECT      28%     5%    67%
  INCORRECT     7%    27%    66%

ignoring errors in implicit confirmations

               User does not correct    User corrects
  CORRECT        552                      2
  INCORRECT      118 [51% of incor]     111 [49% of incor]

 users correct later (40% of the 118)
 users interact strategically: correct only if essential

               ~correct later    correct later
  ~critical        55                 2
  critical         14                47

machine learning approach

 need good probability outputs
 low cross-entropy between model predictions and reality
 cross-entropy = negative average log posterior
 logistic regression
 sample efficient
 stepwise approach → feature selection
 logistic model tree for each action
 root splits on response type
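The evaluation metric is easy to state in code; here is a small sketch of cross-entropy as the negative average log posterior the model assigns to the true labels:

```python
import math

def cross_entropy(labels, probs):
    """Negative average log posterior: lower is better-calibrated."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -total / len(labels)

# A model that always says 0.5 scores ln(2) ~ 0.693 nats per example.
assert abs(cross_entropy([1, 0], [0.5, 0.5]) - math.log(2)) < 1e-12
# Confident, correct predictions drive cross-entropy toward 0.
assert cross_entropy([1, 0], [0.99, 0.01]) < 0.02
```

This is why the slides insist on good probability outputs rather than just classification accuracy: cross-entropy penalizes a confident wrong prediction far more than an uncertain one.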

features. target.

 initial situation
 initial confidence score
 concept identity, dialog state, turn number
 system action
 other actions performed in parallel
 features of the user response
 acoustic / prosodic features
 lexical features
 grammatical features
 dialog-level features
 target: was the value correct?

baselines

 initial baseline: accuracy of system beliefs before the update
 heuristic baseline: accuracy of the heuristic rule currently used in the system
 oracle baseline: accuracy if we knew exactly when the user is correcting the system

results: explicit confirmation

[chart: hard error (%) and soft error]

results: implicit confirmation

[chart: hard error (%) and soft error]

results: unplanned implicit confirmation

[chart: hard error (%) and soft error]

informative features

 initial confidence score
 prosody features
 barge-in
 expectation match
 repeated grammar slots
 concept identity
 priors on concept values [not included in these results]

discussion

 evaluation
 does it make sense?
 what would be a better evaluation?
 current limitation: belief compression
 extending the models to N hypotheses + other
 current limitation: system actions
 extending the models to cover all system actions

thank you!

a more subtle caveat

 distribution of training data
 confidence annotator + heuristic update rules
 distribution of run-time data
 confidence annotator + learned model
 always a problem when interacting with the world!
 hopefully, the distribution shift will not cause a large degradation in performance
 remains to be validated empirically
 maybe a bootstrap approach?

KL divergence & cross-entropy

 KL divergence: D(p||q) = Σ_x p(x) log(p(x)/q(x))
 cross-entropy: CH(p, q) = H(p) + D(p||q) = −Σ_x p(x) log q(x)
 against the empirical distribution, cross-entropy equals the negative average log-likelihood
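The identity CH(p, q) = H(p) + D(p||q) can be checked numerically; a short sketch with a pair of example two-outcome distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    # D(p||q): extra nats paid for coding p-distributed data with q's code
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.3]
q = [0.5, 0.5]
# CH(p, q) = H(p) + D(p||q), term by term
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```

Since H(p) is fixed by the data, minimizing cross-entropy against the labels is the same as minimizing the KL divergence between the model and reality.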

logistic regression

 regression model for binomial (binary) dependent variables
 fit the model by maximum likelihood (average log-likelihood)
 any stats package will do it for you
 no R² measure; test fit using the likelihood-ratio test
 stepwise logistic regression
 keep adding variables while data likelihood increases significantly
 use the Bayesian information criterion (BIC) to avoid overfitting
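The stepwise criterion itself is simple: accept a candidate feature only if it lowers BIC = −2·logL + k·ln(n), so a feature must improve the likelihood enough to pay for its extra parameter. A sketch of the selection step over precomputed fits (the log-likelihood values and feature names below are made-up numbers for illustration):

```python
import math

def bic(log_likelihood, num_params, num_samples):
    return -2.0 * log_likelihood + num_params * math.log(num_samples)

# Hypothetical fits on n=1000 samples: log-likelihood of the model
# after adding each candidate feature to the current one-parameter model.
n = 1000
current_ll, current_k = -600.0, 1
candidates = {"confidence": -520.0, "turn_number": -598.0}

selected = []
for name, ll in candidates.items():
    if bic(ll, current_k + 1, n) < bic(current_ll, current_k, n):
        selected.append(name)        # feature pays for its extra parameter
        current_ll, current_k = ll, current_k + 1

assert selected == ["confidence"]  # turn_number's tiny gain doesn't cover ln(n)
```

A full stepwise procedure would refit the model for every candidate at every step; only the acceptance rule is shown here.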

logistic regression (cont’d)

[figure: the logistic model, P(y = 1 | x) = 1 / (1 + e^−(w·x))]

logistic model tree

 a regression tree, but with logistic models at the leaves
[figure: example tree splitting on features f (f=0 / f=1) and g (g ≤ 10 / g > 10), with logistic models at the leaves]
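A minimal sketch of the idea as used here: route each example through the root split on response type, then apply that leaf's logistic model. The leaf weights and the single remaining feature are invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One root split on response type; each leaf holds its own logistic
# model over the remaining (hypothetical) features.
LEAVES = {
    "yes":   {"bias": 2.0,  "asr_confidence": 2.0},
    "no":    {"bias": -2.0, "asr_confidence": 1.0},
    "other": {"bias": 0.0,  "asr_confidence": 3.0},
}

def p_correct(response_type, asr_confidence):
    leaf = LEAVES[response_type]
    return sigmoid(leaf["bias"] + leaf["asr_confidence"] * asr_confidence)

# A 'yes' answer supports the confirmed value far more than a 'no',
# but within each leaf the estimate still varies with the evidence.
assert p_correct("yes", 0.6) > p_correct("no", 0.6)
```

Splitting on response type first lets each leaf learn how the continuous features behave within that response class, instead of forcing one linear model across very different situations.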

