Error Handling in the RavenClaw Dialog Management Framework Dan Bohus, Alexander I. Rudnicky Computer Science Department, Carnegie Mellon University ( infrastructure & architecture ) 1 RavenClaw dialog management information access Let’s Go! Bus Information, RoomLine guidance through procedures LARRI, IPA taskable agent Vera command-and-control TeamTalk systems built conference room reservations live schedules for 13 rooms in 2 buildings on campus size, location, a/v equipment recognition: sphinx-2 [3-gram] parsing: phoenix synthesis: cepstral theta demo: Roomline ( current research ) 3 belief updating problem: confidence scores provide an initial assessment for the reliability of the information obtained from the user. However, a system should leverage information available in subsequent user responses in order to update and improve the accuracy of its beliefs. goal: bridge confidence annotation and correction detection in a unified framework for belief updating in task oriented spoken dialog system approach: - machine learning (generalized linear models) - integrate features from multiple knowledge sources in the system - work with compressed beliefs - top hypothesis + other [ ASRU-2005 paper] - k hypotheses + other [ work in progress] results : data driven approach significantly outperforms common heuristics sample problem S: where would you like to fly from? U: [Boston/0.45]; [Austin/0.30] S: sorry, did you say you wanted to fly from Boston? U: [No/0.37] + [Aspen / 0.7] Updated belief = ? [Boston/?; Austin/?; Aspen/?] 4 rejection threshold optimization Bohus and Rudnicky - “Constructing Accurate Beliefs in Spoken Dialog Systems”, in ASRU-2005 transfer of confidence annotators across domains data-driven approach for tuning state-specific rejection thresholds in a spoken dialog system Bohus and Rudnicky - “A Principled Approach for Rejection Threshold Optimization in Spoken Dialog Systems”, to be presented at Interspeech work in progress, in collaboration with Antoine Raux Explicit Confirmation Did you say you wanted a room on Friday? Implicit Confirmation a room on Friday … for what time? misunderstanding recovery strategies AskRepeat Can you please repeat that? AskRephrase Could you please try to rephrase that? Reprompt Would you like a small or a large room? DetailedReprompt Sorry, I’m not sure I understood you correctly. Right now I need to know if you would prefer a small or a large room. Notify Sorry, I didn’t catch that … Yield Ø MoveOn Sorry, I did’t catch that. Once choice would be Wean Hall Would you like a reserva- tion for this room? YouCanSay Sorry, I didn’t catch that. Right now I’m trying to find out if you would prefer a small room or a large one. You can say ‘I want a small room’ or ‘I want a large room’. If the size of the room doesn’t matter to you, just say ‘I don’t care’. TerseYouCanSay Full-Help Sorry, I didn’t catch that. So far I found 5 rooms matching your constraints. Right now I’m trying to find out if you would prefer a small room or a large one. You can say ‘I want a small room’ or ‘I want a large room’. If the size of the room doesn’t matter to you, just say ‘I don’t care’. non-understanding recovery strategies updates following explicit confirmation updates following implicit confirmation initial heuristic proposed oracle Bohus and Rudnicky - “Sorry, I didn’t Catch That! – an Investigation of Non-understanding Errors and Recovery Strategies”, in SIGdial % 20% 30% 10% 20% 30% error rates migrate (adapt) a confidence annotator trained with data from domain A to domain B, without any labeled data in the new domain (B). question: can dialog performance be improved by using a better, more informed policy for engaging non-understanding recovery strategies? approach: a between-groups experiment - control group: system chooses a non-understanding strategy randomly (i.e. in an uninformed fashion); - wizard group: a human wizard chooses which strategy should be used whenever a non-understanding happens; - 23 participants in each condition - first-time users, balanced by gender x native language - each attempted a maximum of 10 scenario-based interactions - evaluated global dialog performance (task success) and various local non-understanding recovery performance metrics (see side panel) results : wizard policy outperforms uninformed recovery policy on a number of global and local metrics avg. task success rate 20% 40% 60% 80% 100% non-nativesnatives * avg. recovery WER 20% 40% 60% 80% non-nativesnatives * wizard policy uninformed policy avg. recovery concept utility non-nativesnatives * avg. recovery efficiency non-nativesnatives * * 2 RavenClaw error handling architecture RavenClaw: dialog management framework for complex, task-oriented domains - Dialog Task Specification (DTS): hierarchical plan which captures the domain-specific dialog control logic; - Dialog Engine “executes” a given DTS. platform for research - error handling [this poster] - multi-participant dialog (Thomas Harris) - turn-taking (Antoine Raux) error handling decision process RoomLine start_time: [start_time] [time] dialog engine dialog task specification GetQuery GetStartTime ExplConf(start_time) dialog stack date: [date] start_time: [start_time] [time] end_time: [end_time] [time] date: [date] start_time: [start_time] [time] end_time: [end_time] [time] location: [location] network: [with_network] → true [without_network] → false expectation agenda For when do you need the room? Let’s try two to four p.m. [time](two) [end_time](four) Did you say you wanted the room starting at two p.m.? System: User: Parse: System: error handling strategies RoomLine i:WelcomeGetQuery r:GetDate r:GetStartTime r:GetEndTime x: DoQueryDiscussResults learning policies for recovering fromnon-understandings question: can we learn a better policy from data? decision theoretic approach: - learn to predict likelihood of success for each strategy - use features available at runtime - stepwise logistic regression (good class posterior probabilities) - compute expected utility for each strategy - choose strategy with maximum expected utility - preliminary results promising, a new experiment needed for validation predicting likelihood of success Reprompt49.2% → 32.8% YouCanSay48.6% → 34.3% TerseYouCanSay43.5% → 32.6% MoveOn35.6% → 30.0% DetailedReprompt37.7% → 34.4% for 5 of 10 strategies models perform better than a majority baseline, on both soft and hard error majority baseline error → cross-validation error error handling strategies are implemented as library dialog agents new strategies can be plugged in as they are developed goal: task-independent, adaptive and scalable error handling architecture approach: - error handling strategies and error handling decision process are decoupled from the dialog task → reusability, uniformity, plug-and-play strategies, lessens development effort - error handling decision process implemented in a distributed fashion - local concept error handling decision process (handles potential misunderstandings) - local request error handling decision process (handles non-understandings) - currently implemented as POMDPs (for concepts) and MDPs (for request agents) date start_time end_time concept error handling MDP request error handling MDP concept error handling MDP request error handling MDP 5 6