Learning a Policy for Opportunistic Active Learning Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney Department of Computer Science The University of Texas at Austin
Natural Language Interaction with Robots With the advances in AI, robots are increasingly becoming a part of human environments. For most users, the easiest way to communicate with them would be as they do with each other, in natural language.
Objects in Human Environments Human environments are typically filled with a lot of small objects that robots will likely have to handle. These present a number of challenges.
Objects in Human Environments Diverse It is difficult to catalog all the possible objects one would see in a home or office, and some objects such as decorations and souvenirs are especially hard to describe.
Objects in Human Environments Diverse Transient Objects may be transient. This person will probably throw away the coffee cup after finishing it. So we can’t rely on memorization.
Objects in Human Environments Diverse Transient Described using diverse perceptual properties “light empty yellow spiky container” “a yellow pineapple” “a full neon green water bottle” “a green water bottle that's heavy” People typically refer to objects in terms of their properties, rather than with unique names. Understanding such descriptions is important for robots to communicate with humans.
Understanding Object Descriptions Robots need to be able to Ground language in perception. Handle novel perceptual concepts during operation. Since the objects and properties used will be diverse, robots need to be able to ground such descriptions using perception. They also need to be able to handle the use of novel perceptual concepts during operation.
Opportunistic Active Learning (Thomason et al., CoRL 2017) A framework for incorporating active learning queries into test time interactions. Demonstrated improvement in learning novel perceptual concepts to ground natural language descriptions of objects. Prior work introduced the framework of opportunistic active learning - where an agent asks locally convenient questions during an interaction that may not be immediately relevant, but are expected to improve performance in future tasks. They demonstrated that this helped improve a robot’s performance in understanding natural language descriptions of objects.
Goal of this Work Learning a dialog policy for an interactive object retrieval task. Opportunistic Active Learning Grounded Language Learning This Work Reinforcement Learning We extend this work by learning a policy for an interactive task of grounding object descriptions using reinforcement learning. Our goal is to learn to trade off learning new perceptual concepts through opportunistic active learning against completing dialogs quickly with successful object retrieval.
Task Setup Based on task from prior work (Thomason et al., CoRL 2017) Our contribution - Setting it up in simulation using the Visual Genome dataset.
Task Setup Active Training Set Dialog Active Test Set Test_1 Test_2 Our task setup looks like this. In each set of interactions, there is a set of objects called an active test set.
Task Setup Robot: Describe the object I should find. Active Training Set Dialog Active Test Set Test_1 Test_2 Robot: Describe the object I should find. Human: A white umbrella. Test_3 Target Description Test_4 Target The robot is given a target description of one of the objects in this set. The robot needs to identify the object being described.
Task Setup Active Training Set Dialog Active Test Set Robot: Describe the object I should find. Human: A white umbrella. Test_1 Test_2 Train_2 Test_3 Train_5 Train_3 Train_8 Test_4 Train_6 Train_7 There is another set of objects called the active training set.
Task Setup Active Training Set Dialog Active Test Set Train_1 Train_4 Robot: Describe the object I should find. Human: A white umbrella. Test_1 Test_2 Train_2 Robot: Is there something in Train_6 that can be described as yellow? Human: No. Test_3 Train_5 Train_3 Train_8 Test_4 Label Query Train_6 Train_7 The robot can ask questions about objects in the active training set before it makes a guess. Label queries are about whether a concept applies to an object
Task Setup Active Training Set Dialog Active Test Set Train_1 Train_4 Robot: Describe the object I should find. Human: A white umbrella. Test_1 Test_2 Train_2 Robot: Is there something in Train_6 that can be described as yellow? Human: No. Test_3 Train_5 Train_3 Train_8 Test_4 Opportunistic Query Train_6 Train_7 A query is said to be opportunistic if it is not directly relevant to the current guess, for example asking about yellow when you have to find a white umbrella.
Task Setup Active Training Set Dialog Active Test Set Train_1 Train_4 Robot: Describe the object I should find. Human: A white umbrella. Robot: Is there something in Train_6 that can be described as yellow? Human: No. Test_1 Test_2 Train_2 Test_3 Train_5 Robot: Can you show me an image with something that can be described as white? Human: Train_1 Train_3 Train_8 Test_4 Train_6 Train_7 Example queries ask for an example of a label in the training set. Example Query
Task Setup Active Training Set Dialog Active Test Set Train_1 Train_4 Robot: Describe the object I should find. Human: A white umbrella. Robot: Is there something in Train_6 that can be described as yellow? Human: No. Robot: Can you show me an image with something that can be described as white? Human: Train_1 Test_1 Test_2 Train_2 Test_3 Train_5 Train_3 Robot: My guess is Test_4 Human: Correct Train_8 Test_4 Train_6 Train_7 Guess
Goal of the Task Learn to maximize the fraction of successful guesses across interactions by Learning when to ask queries, and when to stop and guess. Learning to choose between different possible queries. The goal is to learn to maximize the fraction of successful guesses across interactions by learning when to query, which queries to ask, and when to stop and guess.
Active Learning We want the robot to use active learning to identify questions that would be most useful to improve its model.
Why Opportunistic Queries? Robot may have good models for on-topic concepts. No useful on-topic queries. Some off-topic concepts may be more important because they are used in more interactions. Why would we want to ask off-topic questions? The robot may have good models for blue and mug, or there may be nothing around it for which it is uncertain whether the object is blue or a mug. Or it might have seen that a lot of people ask for tall objects, so that is more important to learn.
Opportunistic Active Learning - Challenges Some other object might be a better candidate for the question Purple? But the setting is challenging for many reasons. It is only considering the objects currently available in the context of the interaction, and there may be a better choice of query in the future.
Opportunistic Active Learning - Challenges The question interrupts another task and may be seen as unnatural Find a green bottle. Would you use the word “metallic” to describe this object? The question could irritate the user because it appears irrelevant.
Opportunistic Active Learning - Challenges The information needs to be useful for a future task. Red? Or it might learn a concept, say red, because it has seen many people ask for red things. But after it learns this, other users may not ask for red things.
Grounding Model A white umbrella {white, umbrella} SVM white/ not white Pretrained CNN SVM umbrella/ not umbrella We use the following model to ground object descriptions. Given a language description, we assume all words except stopwords are perceptual predicates. For each of these we learn a binary SVM over images, that uses a feature representation from VGGNet pretrained on ImageNet.
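The step from a description to its perceptual predicates can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stopword list and the function name extract_predicates are hypothetical, since the talk does not say which stopword list or tokenizer was used.

```python
# Hypothetical stopword list; the talk does not specify the one actually used.
STOPWORDS = {"a", "an", "the", "of", "that", "is", "with", "and"}


def extract_predicates(description):
    """Return the non-stopword tokens of a description, in order, without repeats."""
    predicates = []
    for token in description.lower().split():
        word = token.strip(".,!?")  # drop surrounding punctuation only
        if word and word not in STOPWORDS and word not in predicates:
            predicates.append(word)
    return predicates


# Each returned predicate would get its own binary classifier.
print(extract_predicates("A white umbrella."))
```

Every predicate returned here corresponds to one binary classifier (white / not white, umbrella / not umbrella) in the slide's diagram.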
Grounding Model Active Test Set To guess, we choose the object in the active test set that maximizes this score
Grounding Model Classifier decision Active Test Set in {-1, 1} The score that weights classifier decisions d(p_i, o)
Classifier confidence Grounding Model Classifier decision in {-1, 1} Classifier confidence in (0, 1) Active Test Set By the confidence of the classifier C(p_i), which is F1 estimated by cross validation
Grounding Model Classifier decision in {-1, 1} Classifier confidence Active Test Set Summed over predicates in description And this is summed over all perceptual predicates in the description
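Put together, the slides describe a score of the form score(o) = sum over i of d(p_i, o) * C(p_i), and the guess is the object in the active test set with the highest score. A minimal sketch, assuming decisions and confidences are already available as dictionaries (the data below is made up for illustration), and assuming that predicates without a trained classifier simply contribute nothing:

```python
def score(obj, predicates, decisions, confidence):
    """Sum of classifier decisions in {-1, +1}, each weighted by classifier confidence."""
    total = 0.0
    for p in predicates:
        # Assumption: a predicate with no classifier yet is skipped.
        if p in confidence and (p, obj) in decisions:
            total += decisions[(p, obj)] * confidence[p]
    return total


def guess(active_test_set, predicates, decisions, confidence):
    """Return the object in the active test set that maximizes the score."""
    return max(active_test_set, key=lambda o: score(o, predicates, decisions, confidence))


# Illustrative values only: estimated F1 per classifier and decisions per object.
confidence = {"white": 0.8, "umbrella": 0.6}
decisions = {
    ("white", "Test_1"): 1, ("umbrella", "Test_1"): -1,
    ("white", "Test_2"): -1, ("umbrella", "Test_2"): -1,
    ("white", "Test_3"): -1, ("umbrella", "Test_3"): 1,
    ("white", "Test_4"): 1, ("umbrella", "Test_4"): 1,
}
objects = ["Test_1", "Test_2", "Test_3", "Test_4"]
print(guess(objects, ["white", "umbrella"], decisions, confidence))
```

Weighting by confidence means a decision from a poorly trained classifier moves the score less than one from a reliable classifier.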
Active Learning Agent starts with no classifiers. Labeled examples are acquired through questions and used to train the classifiers. Agent needs to learn a policy to balance active learning with task completion.
Modelling the Dialog as an MDP Dialog Agent and User. State: Target description, active train and test objects, agent’s perceptual classifiers. Action: Label query, example query, guess. Reward: +100 for correct guess, -100 for incorrect guess, -1 per turn to shorten dialogs. We model the dialog as an MDP where the state consists of the target description, the train and test objects, and the agent’s perceptual classifiers. The possible actions are guessing, and the currently possible label and example queries, which are any predicate it knows for any object in the active training set. We set up a reward function to maximize the number of correct guesses while also keeping dialogs as short as possible.
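The reward structure on this slide can be illustrated with a short sketch. The values +100, -100 and -1 come from the slide; the assumptions here are that the -1 penalty is charged once per query and that the final guess itself carries no extra penalty. The function name dialog_return is hypothetical.

```python
CORRECT_GUESS_REWARD = 100
INCORRECT_GUESS_REWARD = -100
QUERY_PENALTY = -1  # assumption: charged once for every query asked


def dialog_return(num_queries, guess_correct):
    """Total undiscounted reward of a dialog that asks num_queries questions, then guesses."""
    guess_reward = CORRECT_GUESS_REWARD if guess_correct else INCORRECT_GUESS_REWARD
    return num_queries * QUERY_PENALTY + guess_reward


# Two queries followed by a correct guess, as in the example dialog.
print(dialog_return(2, True))
```

Under these assumptions an extra query is worth asking only if it raises the chance of a correct guess, now or in later dialogs, by enough to offset its cost.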
Challenges What information about classifiers should be represented? Features based on active learning metrics. Variable number of queries and classifiers: create features for state-action pairs. Large action space: sample a beam of promising queries.
Feature Groups Query features - Active learning metrics used to determine whether a query is useful. Examples - Current estimated F1 of classifier Margin of object for classifier (for label query) We have two main groups of features. Query features are based on active learning metrics such as uncertainty sampling and density weighting, and are used to determine whether a query would be useful to the agent. Guess features use the predictions and confidences of classifiers to determine whether a guess would be correct.
Feature Groups Guess features - Features that use the predictions and confidences of classifiers to determine whether a guess will be correct. Examples - Highest score among regions in the active test set. Average estimated F1 of classifiers of concepts in description
Experiment Setup Policy learning using REINFORCE. Baseline - A hand-coded dialog policy that asks a fixed number of questions selected using the same sampling distribution. In our experiments we simulated dialogs using the Visual Genome dataset. We used the annotated region descriptions to provide target descriptions, and annotated objects and attributes to answer queries. We used the REINFORCE algorithm for policy learning, and compared to a baseline static policy that asks a fixed number of questions in every dialog. The questions asked were sampled from the same distribution as that used by the policy.
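For readers unfamiliar with REINFORCE, the following is a generic sketch of one update for a softmax policy that is linear in state-action features. The talk does not specify the policy parameterization, learning rate, or any baseline term, so the linear softmax form, the learning rate, and the function names are all assumptions made for illustration.

```python
import math


def action_probs(theta, candidates):
    """Softmax over linear scores theta . phi(s, a) for each candidate action."""
    logits = [sum(w * x for w, x in zip(theta, phi)) for phi in candidates]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def reinforce_update(theta, episode, lr=0.01):
    """One undiscounted REINFORCE step.

    episode is a list of (candidates, chosen_index, reward) tuples, where
    candidates holds one feature vector per action available at that step.
    """
    # Return-to-go for each step: sum of rewards from that step onwards.
    returns, running = [], 0.0
    for _, _, reward in reversed(episode):
        running += reward
        returns.append(running)
    returns.reverse()

    new_theta = list(theta)
    for (candidates, chosen, _), g in zip(episode, returns):
        probs = action_probs(theta, candidates)
        for j in range(len(theta)):
            expected = sum(p * phi[j] for p, phi in zip(probs, candidates))
            # Gradient of log pi(chosen) w.r.t. theta[j] is phi_chosen[j] - E[phi[j]].
            new_theta[j] += lr * g * (candidates[chosen][j] - expected)
    return new_theta


# One-step episode: two candidate actions, the first is chosen and earns +100.
theta = reinforce_update([0.0, 0.0], [([[1.0, 0.0], [0.0, 1.0]], 0, 100.0)])
print(action_probs(theta, [[1.0, 0.0], [0.0, 1.0]]))
```

Scoring each state-action pair with its own feature vector lets the same weights handle a varying number of candidate queries from one dialog turn to the next.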
Experiment Phases Initialization - Collect experience using the baseline to initialize the policy. Training - Improve the policy from on-policy experience. Testing - Policy weights are fixed, and we run a new set of interactions, starting with no classifiers, over an independent test set with different predicates. We initialized the policy by collecting experience from the baseline, and then trained it with on-policy experience. Each of these consisted of 10 batches of 100 dialogs each. Following this, we froze the policy weights, reset the system to start with no classifiers, and ran 10 batches of interactions on a test set having novel predicates in descriptions. We report results comparing the last batch of this interaction with a similar batch using the static policy. We also report results ablating the two major groups of features.
Results Systems evaluated on dialog success rate and average dialog length. We prefer high success rate and low dialog length (top left corner).
Results Static
Results Learned policy is more successful than the baseline, while also using shorter dialogs on average. Learned Static
Results If we ablate either group of features, the success rate drops considerably but dialogs are also much shorter. In both cases, the system chooses to ask very few queries. Learned - Query - Guess Static
Summary We can learn a dialog policy that learns to acquire knowledge of predicates through opportunistic active learning. The learned policy is more successful at object retrieval than a static baseline, using fewer dialog turns on average. In summary, we demonstrate that we can learn a policy for an interactive object retrieval task involving opportunistic active learning. The learned policy is more successful at object retrieval than a static baseline, using fewer dialog turns on average.
Policy Representation
Query Features Does the concept have a classifier? Current estimated F1 of the classifier Fraction of previous dialogs in which the predicate has been used, and the agent’s success rate in these. Is the query opportunistic?
Query Features Margin of object Density of object Fraction of k nearest neighbours of the object which are unlabelled
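As an illustration of the last feature on this slide, the sketch below computes the fraction of an object's k nearest neighbours that are unlabelled. It assumes Euclidean distance over feature vectors and a set of objects already labelled for the predicate in question. The talk does not state the distance metric or the value of k, and the function name and data are hypothetical.

```python
import math


def knn_unlabelled_fraction(target, features, labelled, k):
    """Fraction of the k nearest neighbours of target that have no label yet."""
    others = [name for name in features if name != target]
    # Assumption: neighbours are ranked by Euclidean distance in feature space.
    others.sort(key=lambda name: math.dist(features[target], features[name]))
    nearest = others[:k]
    if not nearest:
        return 0.0
    return sum(1 for name in nearest if name not in labelled) / len(nearest)


# Toy 2-D feature vectors standing in for CNN features.
features = {
    "Train_1": [0.0, 0.0],
    "Train_2": [1.0, 0.0],
    "Train_3": [0.0, 2.0],
    "Train_4": [5.0, 5.0],
}
print(knn_unlabelled_fraction("Train_1", features, labelled={"Train_2"}, k=2))
```

A high value suggests the object sits in a region of feature space the classifier has not yet seen labels for, which is one reason a query about it may be informative.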
Guess Features Lowest, highest, second highest, and average estimated F1 among classifiers of concepts in the description. Highest score among regions in the active test set, and the differences between this and the second highest, and average scores respectively. An indicator of whether the two most confident classifiers agree on the decision of the top scoring region.
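Most of the guess features listed above are simple statistics over two lists: the estimated F1 values of the classifiers for the description, and the scores of the objects in the active test set. A minimal sketch follows, with hypothetical feature names. The agreement indicator from the slide is omitted because it needs per-classifier decisions, and the sketch assumes at least two classifiers and two candidate objects.

```python
def guess_features(classifier_f1s, object_scores):
    """Summary statistics used to judge whether a guess is likely to be correct."""
    if len(classifier_f1s) < 2 or len(object_scores) < 2:
        raise ValueError("need at least two classifiers and two candidate objects")
    f1_sorted = sorted(classifier_f1s, reverse=True)
    score_sorted = sorted(object_scores, reverse=True)
    mean_score = sum(object_scores) / len(object_scores)
    return {
        "f1_lowest": f1_sorted[-1],
        "f1_highest": f1_sorted[0],
        "f1_second_highest": f1_sorted[1],
        "f1_average": sum(classifier_f1s) / len(classifier_f1s),
        "top_score": score_sorted[0],
        "top_minus_second": score_sorted[0] - score_sorted[1],
        "top_minus_average": score_sorted[0] - mean_score,
    }


# Illustrative values: two classifiers, four candidate objects.
for name, value in guess_features([0.8, 0.6], [1.4, 0.2, -0.2, -1.4]).items():
    print(name, round(value, 3))
```

A large gap between the top score and the runner-up indicates that the classifiers single out one object clearly, which is the kind of signal a policy could use to decide that it is time to guess.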
Sampling distribution