Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density
Amy McGovern and Andrew Barto
Abstract
The paper presents a method to automatically discover subgoals.
The method is based on mining a set of behavioral trajectories and looking for commonalities.
These commonalities are assumed to be subgoals (bottlenecks).
Motivating Application
The agent should recognize the doorway as a bottleneck.
The doorway links two strongly connected regions.
By adding an option to reach the doorway, the two rooms become more closely connected.
[Figure: a two-room gridworld environment with start S, goal G, and doorway D]
Multiple Instance Learning
A bottleneck is a region in the agent's observation space that is visited on all successful paths but not on unsuccessful paths.
The problem of finding bottleneck regions is treated as a multiple instance learning problem.
In multiple instance learning, a target concept is identified from bags of instances: a positive bag corresponds to a successful trajectory, and a negative bag corresponds to an unsuccessful trajectory.
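A minimal sketch of how bags might be built from trajectories; the trajectory format and the helper name make_bags are illustrative assumptions, not from the paper.

```python
# Sketch: turning trajectories into bags for multiple instance learning.
# Each instance is an observed state (here an (x, y) grid cell); a bag is the
# set of instances from one trajectory, labeled by whether the run succeeded.

def make_bags(trajectories):
    """trajectories: list of (states, reached_goal) pairs,
    where states is a list of (x, y) tuples."""
    positive_bags, negative_bags = [], []
    for states, reached_goal in trajectories:
        bag = list(states)              # instances = states visited on this run
        if reached_goal:
            positive_bags.append(bag)   # successful trajectory -> positive bag
        else:
            negative_bags.append(bag)   # unsuccessful trajectory -> negative bag
    return positive_bags, negative_bags

# Example: two short trajectories in a two-room gridworld.
trajs = [([(1, 1), (2, 1), (3, 1), (3, 2)], True),
         ([(1, 1), (1, 2), (1, 3)], False)]
pos, neg = make_bags(trajs)
print(len(pos), len(neg))  # 1 1
```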
Diverse Density
The most diversely dense region is the region with instances from the most positive bags and the fewest negative bags.
For a target concept $c$, diverse density is defined as
$DD(c) = \Pr(c \mid B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^-)$,
which, assuming a uniform prior over concepts and independent bags, yields
$DD(c) \propto \prod_{i=1}^{n} \Pr(c \mid B_i^+) \prod_{j=1}^{m} \Pr(c \mid B_j^-)$,
with $\Pr(c \mid B_i^+) = 1 - \prod_{p} \bigl(1 - \Pr(B_{ip}^+ \in c)\bigr)$ and $\Pr(c \mid B_j^-) = \prod_{p} \bigl(1 - \Pr(B_{jp}^- \in c)\bigr)$,
where $\Pr(B_{ip} \in c) = \exp\bigl(-\lVert B_{ip} - c \rVert^2\bigr)$ is defined as a Gaussian based on the distance from the particular instance to the target concept.
Find the concept with the highest DD value.
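A minimal sketch of a diverse-density computation over grid-cell concepts, assuming the noisy-or form above and a unit-width Gaussian; the names diverse_density, instance_prob, and scale are illustrative.

```python
import math

def instance_prob(instance, concept, scale=1.0):
    """Pr(instance belongs to concept): Gaussian in the distance between them."""
    dist2 = sum((a - b) ** 2 for a, b in zip(instance, concept))
    return math.exp(-dist2 / scale)

def diverse_density(concept, positive_bags, negative_bags):
    """Noisy-or diverse density of a candidate concept (a grid cell here)."""
    dd = 1.0
    for bag in positive_bags:
        # The concept should be "hit" by at least one instance in each positive bag.
        dd *= 1.0 - math.prod(1.0 - instance_prob(x, concept) for x in bag)
    for bag in negative_bags:
        # The concept should be hit by no instance in any negative bag.
        dd *= math.prod(1.0 - instance_prob(x, concept) for x in bag)
    return dd

# Pick the candidate cell with the highest DD value.
candidates = [(x, y) for x in range(5) for y in range(5)]
pos = [[(1, 1), (2, 2), (3, 2)], [(0, 1), (2, 2), (4, 2)]]
neg = [[(0, 0), (0, 1), (0, 2)]]
best = max(candidates, key=lambda c: diverse_density(c, pos, neg))
print(best)
```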
Options Framework
An option is a macro-action which, when chosen, executes until a termination condition is satisfied.
An option is defined as a triple $\langle I, \pi, \beta \rangle$, where $I$ is the option's input set of states, $\pi$ is the option's policy, and $\beta$ is the termination condition.
An option bases its policy on its own internal value function.
An option is a way of reaching a subgoal.
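A minimal sketch of the option triple as a data structure; the dataclass layout and field names are assumptions for illustration, not an API from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]  # a grid cell

@dataclass
class Option:
    """An option as the triple <I, pi, beta>."""
    input_set: Set[State]           # I: states in which the option may be taken
    policy: Dict[State, int]        # pi: action chosen in each state (backed by its own value function)
    beta: Callable[[State], float]  # beta(s): probability of terminating in state s

def make_subgoal_option(input_set, subgoal):
    """Option that terminates at the subgoal or on leaving its input set."""
    def beta(s):
        return 1.0 if (s == subgoal or s not in input_set) else 0.0
    return Option(input_set=set(input_set), policy={}, beta=beta)
```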
Online subgoal discovery
At the end of each run, the agent creates a new bag and searches for diverse-density peaks (concepts with high DD values).
Bottlenecks (diverse-density peaks) appear in the initial stages of learning and persist throughout learning.
A running average of how often a concept $c$ appears as a peak is used to filter out transient peaks; at the end of each trajectory the running average is updated.
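A hedged sketch of the peak-persistence bookkeeping; the exponentially weighted form and the decay parameter lam are assumptions, since the slide does not reproduce the exact update.

```python
# Sketch: track how persistently each concept shows up as a DD peak.
# The exponential-moving-average form and the decay lam are assumed.

def update_running_averages(averages, peaks_this_trial, all_concepts, lam=0.9):
    """averages: dict concept -> running average of 'appeared as a peak'."""
    for c in all_concepts:
        appeared = 1.0 if c in peaks_this_trial else 0.0
        averages[c] = lam * averages.get(c, 0.0) + (1.0 - lam) * appeared
    return averages

averages = {}
update_running_averages(averages, peaks_this_trial={(3, 2)},
                        all_concepts=[(3, 2), (1, 1)])
print(averages)  # approximately {(3, 2): 0.1, (1, 1): 0.0}
```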
Forming new options
An option is created for a subgoal found at time step $t$ in the trajectory.
The option's input set $I$ can be initialized with the set of states visited by the agent from time $(t - n)$ to $t$, where $n$ is a parameter.
The termination condition $\beta$ is set to 1 when the subgoal is reached or when the agent is no longer in $I$, and is set to 0 otherwise.
The reward function for learning the option's policy gives $-1$ on each step and 0 when the option terminates; the agent is rewarded negatively for leaving the input set.
The option's policy uses the same state space as the overall problem.
The option's value function is learned using experience replay with the saved trajectories.
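A minimal sketch of this construction; the window size n and the exact penalty for leaving the input set are assumptions for illustration.

```python
# Sketch: forming a new option for a subgoal discovered at step t of a trajectory.

def create_option(trajectory, t, subgoal, n=10):
    """trajectory: list of states; the subgoal was found at step t."""
    input_set = set(trajectory[max(0, t - n):t + 1])   # states visited from t-n to t

    def beta(s):
        # Terminate when the subgoal is reached or the agent leaves I.
        return 1.0 if (s == subgoal or s not in input_set) else 0.0

    def reward(s_next, terminated):
        # -1 per step, 0 on terminating at the subgoal,
        # and a penalty (assumed value) for leaving the input set.
        if terminated and s_next == subgoal:
            return 0.0
        if s_next not in input_set:
            return -10.0
        return -1.0

    return input_set, beta, reward
```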
Pseudocode for subgoal discovery
Initialize the full trajectory database to the empty set
For each trial:
    Interact with the environment / learn using RL
    Add the observed full trajectory to the database
    Create a positive or negative bag from the state trajectory
    Search for diverse-density peaks
    For each peak concept c found:
        Update the running average for c
        If c is above threshold and passes the static filter:
            Create a new option o = ⟨I, π, β⟩ for reaching concept c
            Initialize I by examining the trajectory database
            Set β and initialize the policy π using experience replay
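A hedged end-to-end sketch of the loop above, reusing make_bags, diverse_density, update_running_averages, and create_option from the earlier sketches; run_episode, static_filter, and the threshold value are placeholders, not from the paper.

```python
# Sketch of the subgoal-discovery loop. Reuses the helper sketches from the
# previous slides; run_episode and static_filter are supplied by the caller.

def discover_subgoals(run_episode, static_filter, n_trials, candidates, threshold=0.5):
    database = []    # full trajectory database, initially empty
    averages = {}    # running average of how often each concept is a DD peak
    options = {}     # subgoal concept -> option pieces (I, beta, reward)
    for _ in range(n_trials):
        states, reached_goal = run_episode()        # interact with env / learn using RL
        database.append((states, reached_goal))     # add the observed full trajectory
        pos, neg = make_bags(database)              # positive / negative bags
        # Diverse-density peak: here simply the best-scoring candidate concept.
        peak = max(candidates, key=lambda c: diverse_density(c, pos, neg))
        update_running_averages(averages, {peak}, candidates)
        for c, avg in averages.items():
            if avg > threshold and static_filter(c) and c not in options and c in states:
                # New option for reaching concept c: init I from the trajectory;
                # its policy would then be trained by experience replay.
                options[c] = create_option(states, states.index(c), c)
    return options
```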
Macro Q-learning
Q-learning update for primitive actions:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \bigr]$
Macro Q-learning update for an option $o$ that executes for $n$ steps:
$Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \bigl[ r + \gamma^{n} \max_{a'} Q(s_{t+n}, a') - Q(s_t, o) \bigr]$, where $r = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i}$ is the discounted reward accumulated while the option executes.
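A minimal tabular sketch of the two updates above; the dictionary representation of Q and the values of alpha and gamma are illustrative assumptions.

```python
# Sketch of the two update rules for a tabular agent. Q is a dict keyed by
# (state, action_or_option).

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update for a primitive action."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def macro_q_update(Q, s, option, rewards, s_next, actions, alpha=0.1, gamma=0.9):
    """Macro Q-learning update: the option ran for n = len(rewards) steps."""
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))  # accumulated reward
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, option)] = Q.get((s, option), 0.0) + alpha * (
        discounted + gamma ** n * best_next - Q.get((s, option), 0.0))
```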
Experimental Results
Experiments were run in a two-room gridworld and a four-room gridworld.
No negative bags were created in these experiments.
The agent was limited to creating only one option per run.
Learning with the discovered options is compared against learning with an appropriate multi-step policy.