
constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus, Computer Science Department, Carnegie Mellon University
www.cs.cmu.edu/~dbohus



Presentation transcript:


Slide 2: constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus | dbohus@cs.cmu.edu

Slide 3: problem
- spoken language interfaces lack robustness when faced with understanding errors
- errors stem mostly from speech recognition
- typical word error rates: 20-30%
- significant negative impact on interactions

Slide 4: more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of August [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th… I have a flight departing Chicago at 1:40pm, arrives Seoul at …

Slide 5: two types of understanding errors
(same transcript as slide 4, now annotated)
- NON-understanding: the system obtains no interpretation of the user's turn (e.g., the two rejected "Urbana Champaign" turns)
- MIS-understanding: the system obtains an incorrect interpretation (e.g., "Huntsville" recognized as [SEOUL])

Slide 6: approaches for increasing robustness
- improve recognition
- gracefully handle errors through interaction:
  1. detect the problems
  2. develop a set of recovery strategies
  3. know how to choose between them (policy)

Slide 7: six not-so-easy pieces …
(a 3×2 grid: detection, strategies, policy × misunderstandings, non-understandings)

Slide 8: today's talk …
- construct more accurate beliefs by integrating information over multiple turns in a conversation (the detection × misunderstandings piece)

S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}

Slide 9: belief updating: problem statement
S: traveling to Seoul. What day did you need to travel?
U: [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {seoul/0.65} → destination = {?}
- given:
  - an initial belief P_initial(C) over concept C
  - a system action SA
  - a user response R
- construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)
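One plausible concrete form for f (an illustration, not the learned model from the talk) treats the prior confidence in a hypothesis and a fresh recognition confidence for the same value as independent evidence and combines them on the odds scale. A minimal sketch; the function name `combine_evidence` is invented here:

```python
def combine_evidence(p_prior: float, p_obs: float) -> float:
    """Naive-Bayes-style combination of two independent confidence scores
    for the same hypothesis: multiply the odds, convert back to a
    probability. Illustrative only; the talk learns f from data instead."""
    odds = (p_prior / (1.0 - p_prior)) * (p_obs / (1.0 - p_obs))
    return odds / (1.0 + odds)
```

For example, combining the 0.65 prior for {seoul} with a hypothetical 0.60-confidence repetition of the same value raises the belief to roughly 0.74.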

Slide 10: outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

Slide 11: current solutions
- most systems only track values, not beliefs; new values overwrite old values
- use confidence scores
- explicit confirm: yes → trust hypothesis; no → delete hypothesis; "other" → non-understanding
- implicit confirm: not much
- "users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al., 2002]
(slide footer: related work : restricted version : data : user response analysis : results : current and future work)
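The explicit-confirmation heuristic described on this slide can be sketched in a few lines. The `{value: confidence}` dict representation of a belief is an assumption made here for illustration:

```python
def heuristic_update(belief: dict, confirmed: str, response: str) -> dict:
    """Heuristic update after an explicit confirmation, per the slide:
    'yes' -> trust the hypothesis, 'no' -> delete it, anything else ->
    treat the turn as a non-understanding and leave the belief untouched."""
    updated = dict(belief)
    if response == "yes":
        return {confirmed: 1.0}       # trust: commit fully to the hypothesis
    if response == "no":
        updated.pop(confirmed, None)  # delete the confirmed hypothesis
    return updated                    # "other": no update
```

The brittleness is visible in the code: a misrecognized "no" wipes out a correct hypothesis, and a misrecognized "yes" locks in a wrong one, which is what motivates the learned approach later in the talk.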

Slide 12: confidence / detecting misunderstandings
- traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar, and many others]
- recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
- machine learning approach: binary classification
  - in-domain, labeled dataset
  - features from different knowledge sources: acoustic, language model, parsing, dialog management
  - ~50% relative reduction in classification error

Slide 13: detecting corrections
- detect whether the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
- machine learning approach: binary classification
  - in-domain, labeled dataset
  - features from different knowledge sources: acoustic, prosody, language model, parsing, dialog management
  - ~50% relative reduction in classification error

Slide 14: integration
- confidence annotation and correction detection are useful tools
- but separately, neither solves the problem
- bring the two together in a unified approach that accurately tracks beliefs

Slide 15: outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

Slide 16: belief updating: general form
- given:
  - an initial belief P_initial(C) over concept C
  - a system action SA
  - a user response R
- construct an updated belief:
  P_updated(C) ← f(P_initial(C), SA, R)

Slide 17: two simplifications
1. belief representation
   - the system is unlikely to "hear" more than 3 or 4 values for a concept within a dialog session
   - in our data [considering only the top hypothesis from recognition]: at most 3 conflicting values were heard, and in only 6.9% of cases was more than 1 value heard
   - compressed beliefs: top-K concept hypotheses + other (for now, K=1)
2. only model updates following system confirmation actions
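The compressed belief representation from simplification 1 can be sketched as follows; the dict layout and function name are illustrative assumptions:

```python
def compress_belief(belief: dict, k: int = 1) -> dict:
    """Compress a belief over concept values to its top-k hypotheses plus
    an aggregated 'other' mass, as described on this slide (k = 1 in the
    restricted version of the problem)."""
    top = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)[:k]
    compressed = dict(top)
    # everything not explicitly kept is lumped into the 'other' bucket
    compressed["other"] = max(0.0, 1.0 - sum(compressed.values()))
    return compressed
```

With k=1, a belief like {boston/0.65; austin/0.11; …} collapses to {boston/0.65; other/0.35}.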

Slide 18: belief updating: reduced version
{boston/0.65; austin/0.11; …} + ExplicitConfirm(Boston) + [NOW] → {boston/?}
- given:
  - an initial confidence score Conf_init(th_C) for the current top hypothesis of concept C
  - a system confirmation action SA
  - a user response R
- construct an updated confidence score for that hypothesis:
  Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)

Slide 19: outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

Slide 20: data
- collected with RoomLine, a phone-based mixed-initiative spoken dialog system for conference room reservation
  "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?"
- explicit and implicit confirmations
- confidence threshold model (+ some exploration)
- unplanned implicit confirmations

Slide 21: corpus
- user study: 46 participants (naïve users), 10 scenario-based interactions each, compensated per task success
- corpus: 449 sessions, 8848 user turns
  - orthographically transcribed
  - manually annotated: misunderstandings, corrections, correct concept values

Slide 22: outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

Slide 23: user response types
- following [Krahmer and Swerts, 2000], a study on a Dutch train-table information system
- 3 user response types:
  - YES: yes, right, that's right, correct, etc.
  - NO: no, wrong, etc.
  - OTHER
- cross-tabulated against correctness of system confirmations

Slide 24: user responses to explicit confirmations

                          YES         NO          Other
  CORRECT                 94% [93%]   0% [0%]     5% [7%]
  INCORRECT (~10% of cases)  1% [6%]  72% [57%]   27% [37%]

[numbers in brackets from Krahmer & Swerts]

Slide 25: other responses to explicit confirmations
- ~70% of users repeat the correct value
- ~15% of users don't address the question (attempt to shift the conversation focus)
- how often do users correct the system?

              User does not correct    User corrects
  CORRECT     1159                     0
  INCORRECT   29 [10% of incorrect]    250 [90% of incorrect]

Slide 26: user responses to implicit confirmations

              YES        NO          Other
  CORRECT     30% [0%]   7% [0%]     63% [100%]
  INCORRECT   6% [0%]    33% [15%]   61% [85%]

[numbers in brackets from Krahmer & Swerts]

Slide 27: ignoring errors in implicit confirmations
- how often do users correct the system?

              User does not correct    User corrects
  CORRECT     552                      2
  INCORRECT   118 [51% of incorrect]   111 [49% of incorrect]

- explanation:
  - users correct later (40% of the 118)
  - users interact strategically / correct only if essential:

                 doesn't correct later   corrects later
  not critical   55                      2
  critical       14                      47

Slide 28: outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

Slide 29: machine learning approach
- problem: Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)
- need good probability outputs: low cross-entropy between model predictions and reality
- logistic regression: sample efficient; stepwise approach → feature selection
- a logistic model tree for each action; the root splits on response type
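A toy version of the regression step above: plain batch gradient descent on the cross-entropy (negative log-likelihood) with a single feature, standing in for the stepwise, multi-feature procedure the talk actually uses:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit w, b for P(correct | x) = sigmoid(w*x + b) by batch gradient
    descent on the cross-entropy. Toy single-feature sketch; real stats
    packages fit this by (quasi-)Newton maximum likelihood."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient of cross-entropy
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b
```

Here x would be something like the initial confidence score and y the binary label "top hypothesis was correct"; the model's output is then directly usable as the updated confidence.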

Slide 30: features. target.
- initial: initial confidence score of top hypothesis, # of initial hypotheses, concept type (bool / non-bool), concept identity
- system action: indicators describing other system actions in conjunction with the current confirmation
- user response:
  - acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std. dev., min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
  - lexical: number of words, lexical terms highly correlated with corrections (MI)
  - grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
  - dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in
- target: was the top hypothesis correct?
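A small sketch of how a few of these features might be assembled from a user turn; the turn-dict keys and feature names are illustrative stand-ins, not the real RoomLine feature set:

```python
def extract_features(turn: dict) -> dict:
    """Assemble a (partial) feature vector from the knowledge sources
    listed on this slide. Keys are invented for illustration."""
    return {
        "initial_confidence": turn["confidence"],                    # initial
        "num_words": len(turn["hypothesis"].split()),                # lexical
        "barge_in": float(turn.get("barge_in", False)),              # dialog
        "expectation_match": float(turn.get("expectation_match", False)),
    }
```

Each feature vector, paired with the binary target "was the top hypothesis correct?", forms one training example for the logistic models.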

Slide 31: baselines
- initial baseline: accuracy of system beliefs before the update
- heuristic baseline: accuracy of the heuristic update rule used by the system
- oracle baseline: accuracy if we knew exactly what the user said

Slide 32: results: explicit confirmation

                        Hard error (%)   Soft error
  initial                   31.15          0.51
  heuristic                  8.41          0.19
  logistic model tree        3.57          0.12
  oracle                     2.71          (not shown)

Slide 33: results: implicit confirmation

                        Hard error (%)   Soft error
  initial                   30.40          0.61
  heuristic                 23.37          0.67
  logistic model tree       16.15          0.43
  oracle                    15.33          (not shown)

Slide 34: results: unplanned implicit confirmation

                        Hard error (%)   Soft error
  initial                   15.40          0.43
  heuristic                 14.36          0.46
  logistic model tree       12.64          0.34
  oracle                    10.37          (not shown)

Slide 35: informative features
- initial confidence score
- prosody features
- barge-in
- expectation match
- repeated grammar slots
- concept identity

Slide 36: summary
- a data-driven approach for constructing accurate system beliefs
  - integrates information across multiple turns
  - brings together detection of misunderstandings and detection of corrections
  - performs better than current heuristics
- user response analysis: users don't correct unless the error is critical

Slide 37: outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work

Slide 38: current extensions

                         restricted version        extended version
  belief representation  top hypothesis + other    k hypotheses + other
  model                  logistic regression       multinomial GLM
  system action          confirmation actions      all actions: confirmation (expl/impl), request, unexpected
  features                                         added priors

Slide 39: results: 2 hypotheses + other
[chart: hard error rates for the extended model across five action types (explicit confirmation, implicit confirmation, unplanned implicit confirmation, request, unexpected update), comparing initial, heuristic, lmt(basic), lmt(basic+concept), and oracle]

Slide 40: other work
(the six-pieces grid again: detection, strategies, policy × misunderstandings, non-understandings)
- misunderstandings: belief updating [ASRU-05]; transferring confidence annotators across domains [in progress]; costs for errors; rejection threshold adaptation
- non-understandings: impact on performance [Interspeech-05]; comparative analysis of 10 recovery strategies [SIGdial-05]; impact of policy on performance; towards learning non-understanding recovery policies [SIGdial-05]
- RavenClaw: dialog management for task-oriented systems (RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime) [EuroSpeech-03, HLT-05]

Slide 41: thank you! questions …

Slide 42: a more subtle caveat
- distribution of training data: confidence annotator + heuristic update rules
- distribution of run-time data: confidence annotator + learned model
- always a problem when interacting with the world!
- hopefully the distribution shift will not cause a large degradation in performance; this remains to be validated empirically
- maybe a bootstrap approach?

Slide 43: KL-divergence & cross-entropy
- KL divergence: D(p||q) = Σ_i p_i log(p_i / q_i)
- cross-entropy: CH(p, q) = H(p) + D(p||q) = -Σ_i p_i log q_i
- when p is the empirical label distribution, minimizing cross-entropy is equivalent to minimizing the negative log-likelihood
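The two definitions on this slide, written out directly (with the usual convention 0·log 0 = 0):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """CH(p, q) = -sum_i p_i * log(q_i) = H(p) + D(p||q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

For a one-hot p (a single labeled outcome), H(p) = 0, so cross-entropy and KL divergence coincide and both equal the negative log-probability the model assigned to the true outcome.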

Slide 44: logistic regression
- a regression model for binomial (binary) dependent variables
- fit by maximum likelihood (average log-likelihood); any stats package will do it for you
- no R² measure; test fit using the likelihood-ratio test
- stepwise logistic regression:
  - keep adding variables while the data likelihood increases significantly
  - use the Bayesian information criterion to avoid overfitting

Slide 45: logistic regression
[formula slide: P(y=1 | x) = 1 / (1 + e^-(w·x + b))]

Slide 46: logistic model tree
- a regression tree, but with logistic models at the leaves
- example tree: the root splits on f = 0 vs. f = 1, a child splits on g <= 10 vs. g > 10
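A minimal sketch of prediction with such a tree, using the shape from the talk (root split on a categorical feature such as the user response type, one logistic model per leaf); the `(weights, bias)` leaf representation is an assumption made here:

```python
import math

def lmt_predict(features, response_type, leaf_models):
    """Logistic model tree with a single categorical root split: route the
    example to the leaf for its response type, then apply that leaf's own
    logistic model. Illustrative sketch, not Weka's LMT algorithm."""
    w, b = leaf_models[response_type]            # pick the leaf's model
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))            # logistic output
```

The point of the structure is that a "yes" and a "no" response can map the same initial confidence to very different updated confidences.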

Slide 47: user study
- 46 participants, first-time users
- 10 scenarios, fixed order
- scenarios presented graphically (explained during briefing)
- participants compensated per task success

