Integrating Learning of Dialog Strategies and Semantic Parsing Aishwarya Padmakumar, Jesse Thomason, Raymond Mooney Department of Computer Science The University of Texas at Austin
Natural Language Interaction with Robots With the advances in AI, robots and various automated and semi-automated agents are increasingly becoming a part of our lives. For most users, the eaiest way to communicate with them, would be as they do with each other, in natural language.
Understanding Commands Go to Bob Brown’s office Bring coffee for Alice One of the basic things we would like to be able to do is to give high level commands such as “Bring coffee for Alice” or “Go to Bob Brown’s office”, and have such a robot be able to understand what we want it to do.
Goal of this work Dialog Management Semantic Parsing The goal of this work is to improve understanding of such commands by combining work in semantic parsing and dialog policy learning.
Semantic Parsing go to Alice ‘s office Semantic parsing transforms the natural language utterance into a logical form that the agent is capable of interpreting. go to Alice ‘s office
Advantages of semantic parsing Complete command Bring coffee for Bob Brown Action Patient Recipient Bob Brown’s office Noun Phrase Yes, you’re right Confirmation It can be used to identify the action and semantic roles in a complete command, the same component can understand other types of responses such as noun phrases or confimations.
Advantages of semantic parsing Alice Alice Ashcraft Ms Alice Ashcraft Alice’s office Alice Ashcraft’s office Ms Alice Ashcraft’s office It also exploits the compositionality of language to avoid learning too many unnecessary lexical entries
Clarification Dialogs Obtain specific missing information (Thomason et al., 2015) Weak supervision for improving parser (Artzi et al., 2011; Thomason et al., 2015) For understanding a command, there is no need for an agent to do it in one step. Prior work has shown that engaging in a clarification dialog where the agent is allowed to ask for specific missing information improves command understanding performance. Further, such dialogs can provide weak supervision to continuously update the parser..
Clarification Dialogs Obtain specific missing information (Thomason et al., 2015) Weak supervision for improving parser (Artzi et al., 2011; Thomason et al., 2015) For understanding a command, there is no need for an agent to do it in one step. Prior work has shown that engaging in a clarification dialog where the agent is allowed to ask for specific missing information improves command understanding performance. Further, such dialogs can provide weak supervision to continuously update the parser..
Dialog policy Ask for full command Ask for parameter Confirm Command In such clarification dialogs the agent can perform different actions such as asking for a full command, asking for a specific parameter value, or confirming a complete or partial command.
Dialog policy Mapping from dialog states to dialog actions Information from the dialog so far Probability of each command being what the user intends Static policy Difficult to update as parser changes Optimal policy may not be obvious to hand-code Deciding which of these to do when is done by a dialogue policy. Prior work on clarification dialogs with semantic parsing does this using a set of if-then rules. For example, you can imagine using a threshold on parser confidence scores to decide whether the the robot should confirm a command or directly execute it. The problem with this is that if the parser is being updated from dialogues, this threshold should ideally also be updated. But it would be difficult to keep doing that manually. Also, we might want to include other types of dialogue actions such as asking for new information to be added to the knowledge base, for example, where is Bob’s office, in which case it becomes less obvious what a good policy is.
Background: Reinforcement Learning Reinforcement learning - learning for sequential tasks with only implicit feedback Partially Observable Markov Decision Process - Agent (bt-1) Agent (bt) at ot Environment (st) Environment (st) Environment (st+1) Just like with the parser, we would like our policy also to adapt continuously based on interactions with users. For this, we turn to Reinforcement Learning. Specifically, the model we use is called a Partially Observable Markov Decision Process or POMDP. There is an agent interacting with an environment. At time t, the agent is in state s_t, but this is hidden from it. What it receives is a noisy version of s_t, called the observation o_t. Based on observations, the agent maintains a probability distribution over all states it could be in at time t, called the belief b_t. Based on b_t, the agent takes an action a_t, which causes it to transition to state s_{t+1}
Clarification Dialogs as POMDPs States: What the user is trying to convey via the dialog (commands) Observations: Output of the natural language understanding component Actions: Dialog actions (confirming, requesting a parameter value)
Why is policy learning challenging? Agent (bt-1) Agent (bt) Assumption: at ot Constant probability distribution Environment (st) Environment (st) Environment (st+1) Agent (bt-1) Agent (bt) Our system: at ot Variable probability distribution Environment (st) Environment (st) Environment (st+1)
Why is policy learning challenging? Agent (bt-1) Agent (bt) Assumption: at ot Constant probability distribution Environment (st) Environment (st) Environment (st+1) Non-stationary Environment Agent (bt-1) Agent (bt) Our system: at ot Variable probability distribution Environment (st) Environment (st) Environment (st+1)
Policy learning Kalman Temporal Differences (KTD) Q Learning (Geist and Pietquin, 2010) used to learn policy in non-stationary environment Dialog state tracking - HIS Model (Young et al., 2010) Policy learned over features extracted from belief state Because of this, we need to choose our policy learning algorithm carefully. We use the Kalman Temporal Differences Q Learning algorithm for this, as it allows us to effectively learn a policy in a non-stationary environment, from dialogs collected offline. This allows us to obtain training dialogs in batches via crowdsourcing
Experiments - Mechanical Turk We conducted experiments on Mechanical Turk. Our interface looked like this. In the prompt provided to users, we changed the suggested name (sometimes using nicknames or designations), and used visual prompts for the objects to avoid lexical priming of users. We had a verification step to ensure that we had human users who could understand the task, and asked users to fill a survey of their experience after completing the dialogue.
Hypotheses Combined parser and dialog learning is more useful than either alone. Changes in the parser need to be seen by the policy learner.
Systems Parser Learning Collect Dialogs Collect Dialogs Initial parser Update parser Final parser Initial policy Initial policy Initial policy Dialog Learning Initial parser Initial parser Initial parser Collect Dialogs Collect Dialogs Final policy Initial policy Update policy
Systems Initial policy Initial parser Parser and Dialog Learning - Full Collect Dialogs Final parser Final policy Update parser Update policy Initial policy Collect Dialogs Update parser Initial parser Final parser Parser and Dialog Learning - Batchwise Update policy Final policy
Experiments - Metrics Successful dialog - Objective metrics - User does not exit dialog prematurely Robot performs the correct action Objective metrics - % successful dialogs Dialog length - Average number of turns in successful dialogs Subjective metrics - The robot understood me The robot asked sensible questions
Results - Dialog Success 75 59 72 78 Higher is better Parser learning is mostly responsible for improvement in dialog success rate Best system: parser and dialog learning - batchwise We compared 4 systems - only parser learning, only dialog learning and two settings with both parser and dialog learning. The “batchwise” setting is where both components are updated after each batch of dialogs. In the “full” setting, we collect all 4 training batches at once with the initial components and perform one update using all these dialogues before testing. We can see that parse75 r learning contributes much more towards dialog success but dialog policy learning is more effective at reducing the dialog length. We also see that training both components is more useful, if done in a batchwise manner. The poor performance of the “full” setting is indicative of the fact that the effect of a non stationary environment is non-trivial.
Results - Dialog Length 12.43 11.73 12.76 10.61 Lower is better Dialog learning is mostly responsible for lowering dialog length Best system: parser and dialog learning - batchwise We compared 4 systems - only parser learning, only dialog learning and two settings with both parser and dialog learning. The “batchwise” setting is where both components are updated after each batch of dialogs. In the “full” setting, we collect all 4 training batches at once with the initial components and perform one update using all these dialogues before testing. We can see that parse75 r learning contributes much more towards dialog success but dialog policy learning is more effective at reducing the dialog length. We also see that training both components is more useful, if done in a batchwise manner. The poor performance of the “full” setting is indicative of the fact that the effect of a non stationary environment is non-trivial.
Results - user survey responses 3.28 Higher is better Best system: parser and dialog learning - full Can be conflated with number of questions the system asked 3.17 2.91 2.79 Survey responses were on a 0-4 Likert scale. Again we see that performing both parser and dialog policy learning results in better scores, but the “full” system is not always better than the “batchwise” system.
Results - user survey responses 2.93 2.55 2.79 3.30 Higher is better Best system: parser and dialog learning - batchwise Survey responses were on a 0-4 Likert scale. Again we see that performing both parser and dialog policy learning results in better scores, but the “full” system is not always better than the “batchwise” system.
Conclusion Simultaneous training of a semantic parser and dialog policy using implicit feedback from dialogs. Both batchwise > Parser learning, Dialog learning Changes in other components must be propagated to the policy via implicit feedback to reduce effects of non- stationary environment. Both batchwise > Both full
Integrating Learning of Dialog Strategies and Semantic Parsing Aishwarya Padmakumar, Jesse Thomason, Raymond Mooney Department of Computer Science The University of Texas at Austin