2
misunderstandings, corrections and beliefs in spoken language interfaces
Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
www.cs.cmu.edu/~dbohus
dbohus@cs.cmu.edu
3
2 problem
  spoken language interfaces lack robustness when faced with understanding errors
    stems mostly from speech recognition
    spans most domains and interaction types
    exacerbated by operating conditions
4
3 more concretely …
  S: What city are you leaving from?
  U: Urbana Champaign [OKAY IN THAT SAME PAY]
  S: Sorry I’m not sure I understood what you said. What city are you leaving from?
  U: Urbana Champaign [FOR MINUTE SINCE HEY]
  S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
  U: Chicago [CHICAGO]
  S: traveling from Chicago. Where would you like to go?
  U: Huntsville [SEOUL]
  S: traveling to Seoul. What day did you need to travel?
  U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
  S: traveling in the afternoon. Okay what day would you be departing Chicago?
  U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
  S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
  U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
  S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
5
4 some statistics …
  corrections [Krahmer, Swerts, Litman, Levow]
    30% of utterances correct system mistakes
    2-3 times more likely to be misrecognized
  semantic error rates: ~25-35%
    SpeechActs [SRI]              25%
    CU Communicator [CU]          27%
    Jupiter [MIT]                 28%
    CMU Communicator [CMU]        32%
    How May I Help You? [AT&T]    36%
6
5 two types of understanding errors
  NON-understanding
    S: What city are you leaving from?
    U: Urbana Champaign [OKAY IN THAT SAME PAY]
    System cannot extract any meaningful information from the user’s turn
  MIS-understanding
    S: What city are you leaving from?
    U: Birmingham [BERLIN PM]
    System extracts incorrect information from the user’s turn
7
6 misunderstandings
  MIS-understanding
    S: What city are you leaving from?
    U: Birmingham [BERLIN PM]
    System extracts incorrect information from the user’s turn
  detect potential misunderstandings; do something about them
  fix recognition
8
7 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns]
9
8 detecting misunderstandings
  recognition confidence scores
    S: What city are you leaving from?
    U: Birmingham [BERLIN PM] conf=0.63
  traditionally [Bansal, Chase, Cox, Kemp, many others]
    speech recognition confidence scores
    use acoustic, language model and search info
    frame, phoneme, word-level
10
9 “semantic” confidence scores
  we’re interested in semantics, not words
    YES = YEAH, NO = NO WAY
  use machine learning to build confidence annotators
    in-domain, manually labeled data
      utterance: [BERLIN PM] Birmingham
      labels: correct / misunderstood
    features from different knowledge sources
    binary classification problem
    probability of misunderstanding: regression problem
11
10 a typical result
  Using Natural Language Processing and Discourse Features to Identify Understanding Errors in a Spoken Dialogue System [Walker, Wright, Langkilde]
  HowMayIHelpYou corpus: call routing for phone services
    11787 turns
  features
    ASR: recog, numwords, duration, dtmf, rg-grammar, tempo …
    understanding: confidence, context-shift, top-task, diff-conf, …
    dialog & history: sys-label, confirmation, num-reprompts, num-confirms, num-subdials, …
  binary classification task
    majority baseline (error): 36.5%
    RIPPER (error): 14%
12
11 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns]
13
12 detect user corrections
  is the user trying to correct the system?
    S: Where would you like to go?
    U: Huntsville [SEOUL]  (misunderstanding)
    S: traveling to Seoul. What day did you need to travel?
    U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]  (user correction)
  same story: use machine learning
    in-domain, manually labeled data
    features from different knowledge sources
    binary classification problem
    probability of correction: regression problem
14
13 typical result
  Identifying User Corrections Automatically in a Spoken Dialog System [Hirschberg, Litman, Swerts]
  TOOT corpus: access to train information
    2328 turns, 152 dialogs
  features
    prosodic: f0max, f0mn, rmsmax, dur, ppau, tempo …
    ASR: gram, str, conf, ynstr, …
    dialog position: diadist
    dialog history: preturn, prepreturn, pmeanf
  binary classification task
    majority baseline: 29%
    RIPPER: 15.7%
15
14 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns]
16
15 belief updating problem: an easy case
  S: on which day would you like to travel?
  U: on September 3rd [AN DECEMBER THIRD] {CONF=0.25}
    departure_date = {Dec-03/0.25}
  S: did you say you wanted to leave on December 3rd?
  U: no [NO] {CONF=0.88}
    departure_date = {Ø}
17
16 belief updating problem: a trickier case
  S: Where would you like to go?
  U: Huntsville [SEOUL] {CONF=0.65}
    destination = {seoul/0.65}
  S: traveling to Seoul. What day did you need to travel?
  U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
    destination = {?}
18
17 belief updating problem formalized
  given:
    an initial belief P_initial(C) over concept C
    a system action SA
    a user response R
  construct an updated belief:
    P_updated(C) ← f(P_initial(C), SA, R)
  example:
    destination = {seoul/0.65}
    S: traveling to Seoul. What day did you need to travel?
    [THE TRAVELING TO BERLIN P_M] {CONF=0.60} {COR=0.35}
    destination = {?}
19
18 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work
20
19 belief updating: current solutions
  most systems only track values, not beliefs
    new values overwrite old values
  explicit confirm + yes → trust hypothesis
  explicit confirm + no → kill hypothesis
  explicit confirm + “other” → non-understanding
  implicit confirm: not much
  “users who discover errors through incorrect implicit confirmations have a harder time getting back on track” [Shin et al, 2002]
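The heuristic rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the string labels for actions and responses, the trust value of 1.0, and the choice to leave the belief unchanged on an “other” response or an implicit confirmation are all assumptions, not the code of any actual system.

```python
from typing import Optional, Tuple

# A belief here is a single (value, confidence) hypothesis, or None when empty.
Belief = Optional[Tuple[str, float]]


def heuristic_update(belief: Belief, action: str, response: str) -> Belief:
    """Update a single-hypothesis belief after a confirmation action.

    action:   "explicit_confirm" or "implicit_confirm" (assumed labels)
    response: "yes", "no" or "other" (assumed labels)
    """
    if belief is None:
        return None
    value, _conf = belief
    if action == "explicit_confirm":
        if response == "yes":
            return (value, 1.0)  # trust the hypothesis (1.0 is an assumed value)
        if response == "no":
            return None          # kill the hypothesis
        return belief            # "other": non-understanding, belief kept (assumption)
    # implicit confirmation: "not much" is done, so the belief is kept as-is
    return belief
```

With the easy case from slide 15, `heuristic_update(("Dec-03", 0.25), "explicit_confirm", "no")` returns an empty belief. In the trickier case from slide 16 the rule leaves `destination = {seoul/0.65}` untouched, which is the gap the learned models are meant to close.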
21
20 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work
22
21 belief updating: general form
  given:
    an initial belief P_initial(C) over concept C
    a system action SA
    a user response R
  construct an updated belief:
    P_updated(C) ← f(P_initial(C), SA, R)
23
22 restricted version: 2 simplifications
  1. compact belief
    system unlikely to “hear” more than 3 or 4 values
    single vs. multiple recognition results
    in our data: max = 3 values, only 6.9% have >1 value
    confidence score of top hypothesis
  2. updates after confirmation actions
  reduced problem
    ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)
24
23 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work
25
24 data
  collected with RoomLine
    a phone-based mixed-initiative spoken dialog system
    conference room reservation
      search and negotiation
  explicit and implicit confirmations
    confidence threshold model (+ some exploration)
  example system turn
    S: I found 10 rooms for Friday between 1 and 3 p.m. [implicit confirmation] Would you like a small room or a large one? [task]
26
25 user study
  46 participants, first-time users
  10 scenarios, fixed order
    presented graphically (explained during briefing)
  compensated per task success
27
26 corpus statistics
  449 sessions, 8848 user turns
  orthographically transcribed
  manually annotated
    misunderstandings (concept-level)
    non-understandings
    user corrections
    correct concept values
28
27 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work
29
28 user response types
  following Krahmer and Swerts
    study on Dutch train-table information system
  3 user response types
    YES: yes, right, that’s right, correct, etc.
    NO: no, wrong, etc.
    OTHER
  cross-tabulated against correctness of confirmations
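A rough sketch of how the three response types could be assigned and cross-tabulated follows. The keyword sets only cover the example words listed on the slide (plus “yeah”), the rule that NO takes precedence over YES is an assumption, and the function names are hypothetical; the actual study relied on manual annotation.

```python
import re
from collections import Counter

# Assumed lexicons, taken from the examples on the slide; real ones would be larger.
YES_WORDS = {"yes", "yeah", "right", "correct"}
NO_WORDS = {"no", "wrong"}


def response_type(utterance: str) -> str:
    """Label a user response as YES, NO or OTHER by keyword spotting."""
    text = utterance.lower().replace("\u2019", "'")  # normalize curly apostrophes
    tokens = set(re.findall(r"[a-z']+", text))
    if tokens & NO_WORDS:   # assumption: a NO marker wins over a YES marker
        return "NO"
    if tokens & YES_WORDS:
        return "YES"
    return "OTHER"


def cross_tabulate(samples):
    """Count response types against confirmation correctness.

    samples: iterable of (confirmation_was_correct, user_utterance) pairs.
    """
    table = Counter()
    for was_correct, utterance in samples:
        row = "CORRECT" if was_correct else "INCORRECT"
        table[(row, response_type(utterance))] += 1
    return table
```

Keyword spotting like this will mislabel turns such as “right now”, which is one reason the percentages computed from decoded (recognized) responses differ from those computed from transcripts.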
30
29 user responses to explicit confirmations
  from transcripts [numbers in brackets from Krahmer & Swerts]
                 YES          NO           Other
    CORRECT      94% [93%]    0% [0%]      5% [7%]
    INCORRECT    1% [6%]      72% [57%]    27% [37%]
  ~10%
  from decoded
                 YES          NO           Other
    CORRECT      87%          1%           12%
    INCORRECT    1%           61%          38%
31
30 other responses to explicit confirmations
  ~70% users repeat the correct value
  ~15% users don’t address the question
    attempt to shift conversation focus
                 User does not correct    User corrects
    CORRECT      1159                     0
    INCORRECT    29 [10% of incor]        250 [90% of incor]
32
31 user responses to implicit confirmations
  transcripts [numbers in brackets from Krahmer & Swerts]
                 YES          NO           Other
    CORRECT      30% [0%]     7% [0%]      63% [100%]
    INCORRECT    6% [0%]      33% [15%]    61% [85%]
  decoded
                 YES          NO           Other
    CORRECT      28%          5%           67%
    INCORRECT    7%           27%          66%
33
32 ignoring errors in implicit confirmations
                 User does not correct    User corrects
    CORRECT      552                      2
    INCORRECT    118 [51% of incor]       111 [49% of incor]
  users correct later (40% of 118)
  users interact strategically
    correct only if essential
                 ~correct later    correct later
    ~critical    55                2
    critical     14                47
34
33 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work
35
34 machine learning approach
  need good probability outputs
    low cross-entropy between model predictions and reality
    cross-entropy = negative average log posterior
  logistic regression
    sample efficient
    stepwise approach → feature selection
  logistic model tree for each action
    root splits on response-type
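The cross-entropy criterion named on this slide can be written out directly. The sketch below computes the negative average log posterior of the true label and, for comparison, a thresholded classification error. Reading these as the “soft error” and “hard error” reported on the results slides is an assumption, as are the function names and the 0.5 threshold.

```python
import math


def soft_error(probs, labels, eps=1e-12):
    """Cross-entropy: negative average log posterior of the true label.

    probs:  predicted probability that the value is correct, one per example
    labels: 1 if the value was actually correct, else 0
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return -total / len(labels)


def hard_error(probs, labels, threshold=0.5):
    """Fraction of examples misclassified after thresholding the probability."""
    wrong = sum((p >= threshold) != bool(y) for p, y in zip(probs, labels))
    return wrong / len(labels)
```

Two models can share the same hard error while one is far better calibrated, and only the soft error reflects that difference. This is why good probability outputs are asked for here.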
36
35 features. target.
  initial situation
    initial confidence score
    concept identity, dialog state, turn number
  system action
    other actions performed in parallel
  features of the user response
    acoustic / prosodic features
    lexical features
    grammatical features
    dialog-level features
  target: was the value correct?
37
36 baselines
  initial baseline
    accuracy of system beliefs before the update
  heuristic baseline
    accuracy of heuristic rule currently used in the system
  oracle baseline
    accuracy if we knew exactly when the user is correcting the system
38
37 results: explicit confirmation [figure: hard error (%) and soft error]
39
38 results: implicit confirmation [figure: hard error (%) and soft error]
40
39 results: unplanned implicit confirmation [figure: hard error (%) and soft error]
41
40 informative features
  initial confidence score
  prosody features
  barge-in
  expectation match
  repeated grammar slots
  concept id
  priors on concept values [not included in these results]
42
41 outline detecting misunderstandings detecting user corrections [late-detection of misunderstandings] belief updating [construct accurate beliefs by integrating information from multiple turns] current solutions a restricted version data user response analysis experiments and results discussion. caveats. future work
43
42 discussion
  evaluation
    does it make sense?
    what would be a better evaluation?
  current limitation: belief compression
    extending models to N hypotheses + other
  current limitation: system actions
    extending models to cover all system actions
44
43 thank you!
45
44 a more subtle caveat
  distribution of training data
    confidence annotator + heuristic update rules
  distribution of run-time data
    confidence annotator + learned model
  always a problem when interacting with the world!
  hopefully, distribution shift will not cause large degradation in performance
    remains to validate empirically
    maybe a bootstrap approach?
46
45 KL-divergence & cross-entropy
  KL divergence: D(p||q)
  cross-entropy: CH(p, q) = H(p) + D(p||q)
    negative log likelihood
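The formulas did not survive on this slide. The standard definitions are written out below, in the slide's notation, for discrete distributions p (reality) and q (model). The last line assumes p is replaced by the empirical distribution of N labeled examples (x_i, y_i), which is where the “negative log likelihood” reading comes from.

```latex
\begin{align*}
H(p) &= -\sum_x p(x)\,\log p(x) \\
D(p\,\|\,q) &= \sum_x p(x)\,\log\frac{p(x)}{q(x)} \\
CH(p, q) &= -\sum_x p(x)\,\log q(x) = H(p) + D(p\,\|\,q) \\
CH(\hat{p}, q) &= -\frac{1}{N}\sum_{i=1}^{N} \log q(y_i \mid x_i)
\end{align*}
```

Since H(p) does not depend on the model, minimizing the cross-entropy over q is the same as minimizing the KL divergence from p to q.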
47
46 logistic regression
  regression model for binomial (binary) dependent variables
  fit a model using max likelihood (avg log-likelihood)
    any stats package will do it for you
  no R² measure
    test fit using “likelihood ratio” test
  stepwise logistic regression
    keep adding variables while data likelihood increases significantly
    use Bayesian information criterion to avoid overfitting
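A minimal pure-Python sketch of forward stepwise logistic regression follows. It is a simplification under stated assumptions: the model is fit by plain batch gradient ascent with a fixed step size and iteration count, and a variable is added only if it lowers the Bayesian information criterion, BIC = k ln(n) - 2 ln(L). The slide also mentions a likelihood-significance check, which is not modeled here. All function names are hypothetical, and a stats package would normally be used instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, steps=1000, lr=0.5):
    """Fit weights (bias first) by gradient ascent on the average log-likelihood."""
    n = len(X)
    d = len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def log_likelihood(w, X, y):
    """Total log-likelihood of labels y under the fitted model."""
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll


def bic(ll, k, n):
    """Bayesian information criterion for k parameters and n examples."""
    return k * math.log(n) - 2.0 * ll


def forward_stepwise(X, y):
    """Greedily add the feature column that lowers BIC most; stop when none does."""
    n, d = len(X), len(X[0])

    def score(cols):
        Xs = [[row[c] for c in cols] for row in X]
        w = fit_logistic(Xs, y)
        return bic(log_likelihood(w, Xs, y), len(cols) + 1, n)

    chosen = []
    best = score(chosen)  # bias-only model
    while len(chosen) < d:
        s, c = min((score(chosen + [c]), c) for c in range(d) if c not in chosen)
        if s >= best:
            break
        chosen.append(c)
        best = s
    return chosen
```

The ln(n) penalty per parameter means an uninformative feature has to buy a noticeable gain in likelihood before it is kept, which is the overfitting guard the slide refers to.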
48
47 logistic regression
49
48 logistic model tree
  regression tree, but with logistic models on leaves
  [figure: tree with a split on f (f=0 / f=1) and a split on g (g>10 / g<=10)]
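The idea of a tree with logistic models on its leaves can be sketched as below. The class names, the feature name `conf`, and every weight are made up for illustration. Placing the g split under the f=1 branch is also an assumption, since the layout of the original figure is not recoverable.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    """A leaf holding its own logistic model."""
    bias: float
    weights: Dict[str, float]

    def predict(self, x: dict) -> float:
        z = self.bias + sum(w * x[name] for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Split:
    """An internal node that routes an example to one of two subtrees."""
    test: Callable[[dict], bool]
    if_true: "Node"
    if_false: "Node"

    def predict(self, x: dict) -> float:
        branch = self.if_true if self.test(x) else self.if_false
        return branch.predict(x)


Node = Union[Leaf, Split]

# Hypothetical tree: split on f first, then on g within the f=1 branch.
example_tree = Split(
    test=lambda x: x["f"] == 0,
    if_true=Leaf(bias=-1.0, weights={"conf": 2.0}),
    if_false=Split(
        test=lambda x: x["g"] > 10,
        if_true=Leaf(bias=0.5, weights={"conf": 1.0}),
        if_false=Leaf(bias=-0.5, weights={"conf": 3.0}),
    ),
)
```

For example, `example_tree.predict({"f": 0, "g": 0, "conf": 0.5})` is routed to the first leaf. In the setup described on slide 34, the root test would be on the user response type, so each response type gets its own logistic model.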