constructing accurate beliefs in task-oriented spoken dialog systems
Dan Bohus
Computer Science Department, Carnegie Mellon University
Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
2 problem
spoken language interfaces lack robustness when faced with understanding errors
errors stem mostly from speech recognition; typical word error rates: 20-30%
significant negative impact on interactions
3 more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th… I have a flight departing Chicago at 1:40pm, arrives Seoul at …
4 two types of understanding errors
NON-understandings: the system fails to obtain any interpretation of the user's turn (e.g., the two "Urbana Champaign" turns above, which trigger "Sorry, I'm not sure I understood what you said")
MIS-understandings: the system obtains an incorrect interpretation (e.g., "Huntsville" recognized as Seoul, and the Birmingham correction recognized as "traveling in the afternoon")
5 approaches for increasing robustness
improve recognition
gracefully handle errors through interaction:
1. detect the problems
2. develop a set of recovery strategies
3. know how to choose between them (policy)
6 six not-so-easy pieces …
detection, strategies, and policy, each for misunderstandings and for non-understandings
7 today's talk …
construct more accurate beliefs by integrating information over multiple turns in a conversation (the detection x misunderstandings piece)
S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]   →   destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]   →   destination = {?}
8 belief updating: problem statement
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
given
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  P_updated(C) ← f(P_initial(C), SA, R)
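The interface, sketched as Python (illustrative names, not from the RoomLine implementation; the body is left open because f is exactly what the rest of the talk learns from data):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Belief:
    """System belief over one concept, e.g. destination -> {"seoul": 0.65}."""
    concept: str
    hypotheses: Dict[str, float]   # value -> probability mass

def update_belief(initial: Belief, system_action: str, user_response: dict) -> Belief:
    """P_updated(C) <- f(P_initial(C), SA, R); f is learned from data in this talk."""
    raise NotImplementedError
```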
9 outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
10 current solutions
most systems only track values, not beliefs: new values overwrite old values
use confidence scores
explicit confirm + yes → trust hypothesis
explicit confirm + no → delete hypothesis
explicit confirm + "other" → non-understanding
implicit confirm: not much (these rules are sketched in code below)
"users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al, 2002]
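A minimal sketch of those heuristic rules, with hypothetical names, just to make the baseline behavior concrete:

```python
def heuristic_update(hypothesis, confidence, action, response_type, new_value=None):
    """Typical value-overwriting heuristics; not the learned model from this talk."""
    if action == "explicit_confirm":
        if response_type == "yes":        # yes -> trust the hypothesis
            return hypothesis, 1.0
        if response_type == "no":         # no -> delete the hypothesis
            return None, 0.0
        return hypothesis, confidence     # "other" -> treated as a non-understanding
    # otherwise: a newly recognized value simply overwrites the old one
    return (new_value, confidence) if new_value is not None else (hypothesis, confidence)
```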
11 confidence / detecting misunderstandings
traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar, and many others]
recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
machine learning approach: binary classification
  in-domain, labeled dataset
  features from different knowledge sources: acoustic, language model, parsing, dialog management
~50% relative reduction in classification error
12 detecting corrections
detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
machine learning approach: binary classification
  in-domain, labeled dataset
  features from different knowledge sources: acoustic, prosody, language model, parsing, dialog management
~50% relative reduction in classification error
13 integration
confidence annotation and correction detection are useful tools, but separately, neither solves the problem
bridge them together in a unified approach to accurately track beliefs
14 outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
15 belief updating: general form
given
  an initial belief P_initial(C) over concept C
  a system action SA
  a user response R
construct an updated belief
  P_updated(C) ← f(P_initial(C), SA, R)
16 two simplifications
1. belief representation
   the system is unlikely to "hear" more than 3 or 4 values for a concept within a dialog session
   in our data [considering only the top hypothesis from recognition]: max = 3 conflicting values heard; in only 6.9% of cases was more than 1 value heard
   compressed beliefs: top-K concept hypotheses + other (for now, K=1; see the sketch below)
2. only updates following system confirmation actions
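A small sketch of the compressed representation (illustrative code; K=1 as in the talk, with the remaining probability mass lumped into "other"):

```python
def compress_belief(hypotheses: dict, k: int = 1) -> dict:
    """Keep the top-k hypotheses; lump all remaining probability mass into 'other'."""
    top = dict(sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)[:k])
    top["other"] = max(0.0, 1.0 - sum(top.values()))
    return top

# e.g. compress_belief({"boston": 0.65, "austin": 0.11}) -> {"boston": 0.65, "other": 0.35}
```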
17 belief updating: reduced version
given
  an initial confidence score Conf_init(th_C) for the current top hypothesis of concept C
  a system confirmation action SA
  a user response R
construct an updated confidence score for that hypothesis
  Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)
example: {boston/0.65; austin/0.11; …} + ExplicitConfirm(Boston) + [NOW] → {boston/?}
18 outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
19 data
collected with RoomLine, a phone-based mixed-initiative spoken dialog system for conference room reservation
  "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?"
explicit and implicit confirmations
confidence threshold model (+ some exploration)
unplanned implicit confirmations
20 corpus
user study: 46 participants (naïve users), 10 scenario-based interactions each, compensated per task success
corpus: 449 sessions, 8848 user turns
orthographically transcribed; manually annotated for misunderstandings, corrections, and correct concept values
21 outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
22 user response types
following [Krahmer and Swerts, 2000], a study on a Dutch train-table information system
3 user response types (a toy version is sketched in code below):
  YES: yes, right, that's right, correct, etc.
  NO: no, wrong, etc.
  OTHER
cross-tabulated against correctness of system confirmations
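A toy version of this typing (the corpus itself was annotated manually; the marker words follow the slide, with "yeah" and "nope" added as assumptions):

```python
def response_type(utterance: str) -> str:
    """Very rough YES / NO / OTHER typing based on the leading word(s)."""
    text = utterance.lower().strip()
    first = text.split()[0] if text else ""
    if first in {"yes", "yeah", "right", "correct"} or text.startswith("that's right"):
        return "YES"
    if first in {"no", "nope", "wrong"}:
        return "NO"
    return "OTHER"

# e.g. response_type("no no I'm traveling to Birmingham") -> "NO"
```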
23 user responses to explicit confirmations [numbers in brackets from Krahmer & Swerts]
CORRECT confirmation: YES 94% [93%], NO 0% [0%], Other 5% [7%]
INCORRECT confirmation (~10% of explicit confirmations): YES 1% [6%], NO 72% [57%], Other 27% [37%]
24 other responses to explicit confirmations
~70%: users repeat the correct value
~15%: users don't address the question, attempting to shift the conversation focus
how often do users correct the system?
CORRECT confirmation: 1159 do not correct, 0 correct
INCORRECT confirmation: 29 do not correct [10% of incorrect], 250 correct [90% of incorrect]
25 user responses to implicit confirmations [numbers in brackets from Krahmer & Swerts]
CORRECT confirmation: YES 30% [0%], NO 7% [0%], Other 63% [100%]
INCORRECT confirmation: YES 6% [0%], NO 33% [15%], Other 61% [85%]
26 ignoring errors in implicit confirmations
how often do users correct the system?
CORRECT confirmation: 552 do not correct, 2 correct
INCORRECT confirmation: 118 do not correct [51% of incorrect], 111 correct [49% of incorrect]
explanation for the uncorrected errors:
  users correct later (~40% of the 118)
  users interact strategically, correcting only if the error is critical:
    not critical: 55 do not correct later, 2 correct later
    critical: 14 do not correct later, 47 correct later
27 outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
28 machine learning approach
problem: Conf_upd(th_C) ← f(Conf_init(th_C), SA, R)
need good probability outputs: low cross-entropy between model predictions and reality
logistic regression: sample efficient; stepwise approach → feature selection
logistic model tree: one for each system action, with the root splitting on response type
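A rough sketch of the per-action, response-type-split idea, using scikit-learn as an assumed stand-in for the original stepwise logistic regression (all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ResponseSplitModel:
    """One logistic model per (system action, user response type) branch."""
    def __init__(self):
        self.models = {}

    def fit(self, X, y, actions, resp_types):
        # assumes both outcomes occur in every branch (fine for a sketch)
        X, y = np.asarray(X), np.asarray(y)
        keys = list(zip(actions, resp_types))
        for key in set(keys):
            idx = [i for i, k in enumerate(keys) if k == key]
            self.models[key] = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

    def updated_confidence(self, features, action, resp_type):
        """Returns P(top hypothesis is correct | features) for this branch."""
        return self.models[(action, resp_type)].predict_proba([features])[0, 1]
```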
29 features. target.
Initial: initial confidence score of top hypothesis, # of initial hypotheses, concept type (bool / non-bool), concept identity
System action: indicators describing other system actions taken in conjunction with the current confirmation
User response:
  Acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std.dev, min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
  Lexical: number of words, lexical terms highly correlated with corrections (MI)
  Grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
  Dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in
target: was the top hypothesis correct?
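For concreteness, one user response might be summarized along these lines (feature names and values are invented for illustration, not taken from the corpus):

```python
example_features = {
    # initial
    "initial_confidence": 0.65, "num_initial_hypotheses": 1, "concept_is_boolean": False,
    # user response: acoustic / prosodic
    "acoustic_score": -3.2, "duration_sec": 1.4, "pitch_mean": 182.0, "initial_pause_sec": 0.3,
    # lexical / grammatical
    "num_words": 6, "new_slots": 1, "repeated_slots": 0, "parse_fragments": 1,
    # dialog
    "turn_number": 7, "expectation_match": True, "barge_in": False,
}
label_top_hypothesis_correct = False   # the binary target
```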
30 baselines
initial baseline: accuracy of system beliefs before the update
heuristic baseline: accuracy of the heuristic update rule used by the system
oracle baseline: accuracy if we knew exactly what the user said
31 results: explicit confirmation (bar chart)
hard error (%): initial 31.15, heuristic 8.41, logistic model tree 3.57, oracle 2.71
soft error: initial 0.51, heuristic 0.19, logistic model tree 0.12
32 results: implicit confirmation (bar chart)
hard error (%): initial 30.40, heuristic 23.37, logistic model tree 16.15, oracle 15.33
soft error: initial 0.61, heuristic 0.67, logistic model tree 0.43
33 results: unplanned implicit confirmation (bar chart)
hard error (%): initial 15.40, heuristic 14.36, logistic model tree 12.64, oracle 10.37
soft error: initial 0.43, heuristic 0.46, logistic model tree 0.34
34 informative features
initial confidence score
prosody features
barge-in
expectation match
repeated grammar slots
concept identity
35 summary
a data-driven approach for constructing accurate system beliefs:
  integrates information across multiple turns
  bridges together the detection of misunderstandings and of corrections
  performs better than current heuristics
user response analysis: users don't correct unless the error is critical
36 outline
related work
a restricted version
data
user response analysis
experiments and results
current and future work
37 current extensions
belief representation: top hypothesis + other → k hypotheses + other
model: logistic regression → multinomial GLM
system actions: confirmation actions (explicit / implicit) → all actions (confirmation, request, unexpected update)
features: added priors
38 results with 2 hypotheses + other
[bar charts of update error for explicit confirmation, implicit confirmation, request, unexpected update, and unplanned implicit confirmation; conditions compared: initial, heuristic, lmt(basic), lmt(basic+concept), oracle]
39 other work (across the detection / strategies / policy x misunderstandings / non-understandings space)
costs for errors; rejection threshold adaptation; impact of non-understandings on performance [Interspeech-05]
comparative analysis of 10 recovery strategies [SIGdial-05]
impact of policy on performance; towards learning non-understanding recovery policies [SIGdial-05]
belief updating [ASRU-05]; transferring confidence annotators across domains [in progress]
RavenClaw: dialog management for task-oriented systems - RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime [EuroSpeech-03, HLT-05]
40 thank you! questions …
41 a more subtle caveat
the distribution of the training data comes from the confidence annotator + the heuristic update rules
the distribution of the run-time data comes from the confidence annotator + the learned model
always a problem when interacting with the world!
hopefully, the distribution shift will not cause a large degradation in performance; this remains to be validated empirically
maybe a bootstrap approach?
42 KL-divergence & cross-entropy
KL divergence: D(p||q)
cross-entropy: CH(p, q) = H(p) + D(p||q)
equivalently: the negative log-likelihood of the data under the model
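Written out (standard definitions, added here for reference):

```latex
D(p \,\|\, q) \;=\; \sum_x p(x)\,\log\frac{p(x)}{q(x)}
\qquad
CH(p,q) \;=\; H(p) + D(p \,\|\, q) \;=\; -\sum_x p(x)\,\log q(x)
```

For an empirical test set, the cross-entropy reduces to the average negative log-likelihood the model assigns to the observed outcomes, which is presumably the "soft error" reported in the results.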
43 logistic regression
regression model for binomial (binary) dependent variables
fit a model using maximum likelihood (average log-likelihood); any stats package will do it for you
no R² measure; test fit using the likelihood-ratio test
stepwise logistic regression: keep adding variables while the data likelihood increases significantly; use the Bayesian information criterion to avoid overfitting
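A rough sketch of forward stepwise selection with a BIC stopping rule, with scikit-learn assumed as the stats package (illustrative, not the original implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bic(model, X, y):
    """BIC = -2 * log-likelihood + k * log(n), computed from predicted probabilities."""
    y = np.asarray(y)
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1                         # coefficients + intercept
    return -2 * loglik + k * np.log(len(y))

def forward_stepwise(X, y):
    """Greedily add the feature that most improves BIC; stop when none helps."""
    X = np.asarray(X)
    selected, best = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scored = []
        for f in remaining:
            cols = selected + [f]
            m = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
            scored.append((bic(m, X[:, cols], y), f))
        score, f = min(scored)
        if score >= best:                      # BIC no longer improves -> stop
            break
        best, selected = score, selected + [f]
        remaining.remove(f)
    return selected
```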
44 logistic regression
45 logistic model tree
a regression tree, but with logistic models on the leaves
[figure: a tree that first splits on f (f=0 vs. f=1), then on g (g<=10 vs. g>10), with a logistic model at each leaf]
46 user study
46 participants, first-time users
10 scenarios, fixed order, presented graphically (explained during briefing)
participants compensated per task success