Advancing Motivated Learning with Goal Creation
James Graham¹, Janusz A. Starzyk¹·², Zhen Ni³ and Haibo He³
¹School of Electrical Engineering and Computer Science, Ohio University, Athens, OH, USA
²University of Information Technology and Management, Rzeszow, Poland
³Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA
Overview
- Introduction
- Enhancements to Motivated Learning
  - Bias calculation
  - Use of desirability and availability
  - Probabilistic goal selection
- Desired Resource Levels
  - Resource level as an optimization problem
  - Resource dependencies
  - Desirability calculations
- Comparison to RL algorithms
- Conclusions
Motivated Learning
- Controlled by underlying "primitive" motivations
- Builds on these motivations to create additional "abstract" motivations, organized in a motivation hierarchy (figure: hierarchy of intrinsic and extrinsic motivations)
- Unlike in RL, the focus is not on maximizing externally set rewards, but on intrinsic rewards and on creating mission-related new goals and motivations
- Motivated Learning is part of machine learning and is intended for general-purpose autonomous learning (e.g., rovers)
Improvements to ML
- Bias/pain calculations
- Resource availability
- Learning to select actions
- Probabilistic goal selection
- Determining desired resource levels
Significance of bias signals
- Initially the agent has only primitive needs (no biases)
- Bias is the foundation for the creation of new needs
- Bias is a preference for, or aversion to, something (a resource or an action)
- Bias results from an existing need being helped or hurt by a resource or action
- The level of bias is measured relative to the availability of a resource or the likelihood of an action
- Measures are presented on the next slide
Bias based on availability and desirability
- Availability-based bias: a bias signal triggers an abstract pain and is defined depending on the type of perceived situation
- Bias reflects the level of need associated with the resource/action
- Bias based on availability is more general than a bias calculated strictly for an action or a resource; it can also be adjusted for desirability
- Rd is a desired resource value (at sensory input si)
- Rc is a current resource value
- A is the availability calculation
- dc is the current distance to another agent
- dd is a desired (comfortable) distance to another agent
Bias based on availability and desirability (continued)
- Separate bias formulas are used for: a desired resource, a desired action, an undesired action, and an undesired resource
- The chosen equation is linear (but does not diverge to infinity)
- For a desired resource we want a non-zero bias both when there is too little and when there is too much of the resource
- We also prefer "exponential" (nonlinear) growth of bias for desired situations, while a linear calculation suffices for undesired situations (see the sketch below)
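The exact bias equations are not reproduced on this slide, so the sketch below only illustrates the qualitative shape described above: nonzero bias for both a deficit and an excess of a desired resource, nonlinear growth for desired situations, and linear growth for undesired ones. The function names and the availability ratio Rc/(Rd + ε) are illustrative assumptions, not the paper's definitions.

```python
def availability(r_current, r_desired, eps=1e-6):
    # Illustrative availability measure: how well the current level covers the desired level.
    return r_current / (r_desired + eps)

def bias_desired_resource(r_current, r_desired, gamma=1.0):
    # Nonzero bias for both too little and too much of a desired resource,
    # growing nonlinearly with the mismatch (assumed quadratic here for illustration).
    a = availability(r_current, r_desired)
    return gamma * (a - 1.0) ** 2

def bias_undesired_resource(r_current, r_desired, gamma=1.0):
    # Linear growth with availability for undesired situations (bounded, no division blow-up).
    return gamma * availability(r_current, r_desired)

if __name__ == "__main__":
    for rc in (0.0, 5.0, 10.0, 20.0):
        print(rc, bias_desired_resource(rc, 10.0), bias_undesired_resource(rc, 10.0))
```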
Probabilistic goal selection
- How do we go from bias to action selection? Bias is used to calculate pain, and changes in pain reflect how an action affects the agent
- We therefore update the weights between pains and goals, and use these weights to select actions/goals
- Normalized wPG weights are used to select actions probabilistically; the previous, deterministic selection settled on the first valid action found, while the probabilistic version explores a wider range of possible actions
- The previous wPG calculation could lead to weight saturation at αg, so we use an update whose weights saturate at (3/π)·atan(ds/dŝ)
- The ratio ds/dŝ measures how useful an action is at restoring a resource; λp denotes the pain reduction
- Goals themselves are learned through the pain-goal weight updates (see "Learning and selecting actions")
- A minimal sketch of selecting a goal from normalized weights follows below
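A minimal sketch of the probabilistic selection step, assuming w_pg holds the non-negative pain-goal weights for the currently dominant pain. Only the normalize-and-sample step is shown; the weight update rule that saturates at (3/π)·atan(ds/dŝ) is not reproduced here. The uniform fallback for all-zero weights is an added assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_goal(w_pg):
    """Sample a goal index with probability proportional to its wPG weight."""
    w = np.clip(np.asarray(w_pg, dtype=float), 0.0, None)
    if w.sum() == 0.0:
        return rng.integers(len(w))            # no learned preference yet: explore uniformly
    return rng.choice(len(w), p=w / w.sum())   # otherwise sample proportionally to the weights

# Example: three candidate goals for one pain signal.
print(select_goal([0.1, 0.6, 0.3]))
```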
Probabilistic goal selection – wPG weights
- Weights either saturate at a level determined by ds/dŝ or tend toward zero
- Figure 1 shows wPG weights without probabilistic selection
- Figure 2 shows wPG weights when there are 3 valid actions for a specific pain
- The figures show that probabilistic ML tests all actions and learns three different actions that can reduce the pain, compared to only one action learned with the non-probabilistic approach
Probabilistic goal selection – wBP weights
- Here we show how wBP weights are affected by the different goal selection approaches (left panel: without probabilistic selection; right panel: with probabilistic selection)
- While significantly noisier, due to probabilistically choosing more "incorrect" actions, the wBP weights of Fig. 7 indicate that the agent discovers the usefulness of resources significantly earlier in the simulation when using probabilistic goal selection
- wBP weight levels remain similar (the most significant resources remain high)
- Object 8 is not discovered at all on the left, while on the right it is discovered around step 2500
- When actions are not taken according to the policy, there is higher randomness in the wBP weight adjustment (more "invalid" actions for the discovered needs)
Determining desired resource levels
- Having covered action selection, the agent must also decide how much of each resource it needs: desired and current resource levels affect the bias calculation
- Desired values should be set according to the agent's needs
- To begin, the agent is given the initial "primitive" resource level, Rdp
- The agent must learn the rate at which "desired" resources are used (∆p)
- The agent can use its knowledge of the environment to set the desired resource levels
- Resource levels are established only for resources that the agent cares about
- The frequency of performing tasks cannot be too great, since the agent's time is limited and the agent also needs time to learn
- A sketch of estimating the usage rate ∆p from observations follows below
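A minimal sketch of estimating a resource's usage rate ∆p from successive observations, assuming the agent can read the current resource level at each step. The exponential-moving-average estimator is an illustrative choice, not the paper's method.

```python
class UsageRateEstimator:
    def __init__(self, smoothing=0.1):
        self.smoothing = smoothing
        self.rate = 0.0          # estimated units consumed per step (Delta_p)
        self.prev_level = None

    def update(self, current_level):
        if self.prev_level is not None:
            consumed = max(self.prev_level - current_level, 0.0)   # ignore restorations
            self.rate += self.smoothing * (consumed - self.rate)   # smoothed usage rate
        self.prev_level = current_level
        return self.rate

est = UsageRateEstimator()
for level in (10.0, 9.5, 9.0, 8.6, 8.1):
    print(est.update(level))
```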
Determining desired resource levels – optimization
- To establish the optimum levels of desired resources, we solve an optimization problem subject to constraints, with the additional requirement that the sum of all restoration frequencies is less than 1
- αi is the coefficient for resource i
- fŝ, the node frequency, is the resource coefficient αŝ times the sum of all lower-level frequencies that depend on it
- A sketch of propagating these frequencies through resource dependencies and checking the time-budget constraint follows below
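A minimal sketch of propagating restoration frequencies through resource dependencies and checking the time-budget constraint (sum of frequencies less than 1). The dependency structure, the alpha coefficients, and the leaf frequencies below are illustrative assumptions; the slide only states that a node's frequency is its alpha times the sum of the lower-level frequencies that depend on it.

```python
def node_frequency(resource, alpha, depends_on, leaf_freq):
    """Frequency of restoring `resource`: alpha_s times the sum of dependent lower-level frequencies."""
    children = depends_on.get(resource, [])
    if not children:
        return leaf_freq[resource]
    return alpha[resource] * sum(
        node_frequency(c, alpha, depends_on, leaf_freq) for c in children
    )

# Hypothetical three-level hierarchy: 'food' is primitive, 'money' buys food, 'work' earns money.
alpha = {"money": 0.5, "work": 0.25}
depends_on = {"money": ["food"], "work": ["money"]}
leaf_freq = {"food": 0.2}            # primitive restoration frequency per step

freqs = {r: node_frequency(r, alpha, depends_on, leaf_freq) for r in ("food", "money", "work")}
print(freqs, "feasible:", sum(freqs.values()) < 1.0)
```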
Determining desired resource levels – example
- The agent starts with the levels for multiple resources set to the initially observed environment state
- As it learns to use specific resources, it adjusts the levels at which it wants to maintain them
- Each resource equilibrates to a different level
- The initial setting was not optimal, so we can observe some shuffling in the level order
Reinforcement Learning
- Maximizes an external reward
- Learns by approximating value functions (usually a single function)
- May include "subgoal" generation and "curiosity"
- Primarily reactive
- Objectives are set by the designer
Motivated Learning
- Controlled by underlying motivations
- Uses existing motivations to create additional "abstract" motivations
- The ML focus is not on maximizing externally set objectives (as in RL), but on learning new motivations and building and supporting its internal reward system
- Minimax: minimize the dominant pain
- Primarily deliberative
Comparison to other RL algorithms
Algorithms tested:
- Q-learning
- SARSA
- Hierarchical RL (MAXQ)
- Neural Fitted Q Iteration (NFQ)
- TD-FALCON
A minimal sketch of the tabular Q-learning baseline is shown below.
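A minimal tabular Q-learning baseline of the kind used in the comparison. The environment interface (reset(), step(action), an .actions list) and the hyperparameters are illustrative assumptions, not the exact experimental setup; a matching environment sketch appears with the test-environment slide below.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, max_steps=200, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                          # Q[(state, action)] -> value
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            if random.random() < epsilon:           # epsilon-greedy exploration
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            # Standard Q-learning temporal-difference update.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
            if done:
                break
    return Q
```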
Comparison to other RL algorithms – test environment
- The testing environment is a simplified version of the one we use in NeoAxis
- In NeoAxis we have pains, tasks, resources, triggering pains, and (optionally) NACs
- The comparison test is a "black box" scenario with no NACs, run as a simplified environment that can fit all of the algorithms, making the RL algorithms more compatible and easier to interface
- A sketch of a minimal black-box interface of this kind follows below
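A minimal sketch of the kind of "black box" interface that lets both ML and the RL baselines interact with the simplified environment. The class name, method signatures, and the resource dynamics are illustrative assumptions, not the actual NeoAxis-derived environment.

```python
class BlackBoxEnvironment:
    def __init__(self, initial_resources):
        self.initial_resources = dict(initial_resources)
        self.actions = list(range(len(initial_resources)))   # one restoring action per resource

    def reset(self):
        self.resources = dict(self.initial_resources)
        return self._observe()

    def step(self, action):
        key = list(self.resources)[action]
        self.resources[key] += 1.0                  # the chosen action restores one resource
        for k in self.resources:
            self.resources[k] -= 0.1                # every resource is slowly consumed
        reward = -sum(1 for v in self.resources.values() if v <= 0)   # penalty per depleted resource
        done = all(v <= 0 for v in self.resources.values())
        return self._observe(), reward, done

    def _observe(self):
        # Discretized resource levels serve as the state for tabular methods.
        return tuple(round(v) for v in self.resources.values())

# Usage with the Q-learning sketch above (hypothetical resource names):
# env = BlackBoxEnvironment({"food": 5.0, "water": 5.0})
# Q = q_learning(env)
```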
Comparison to other RL algorithms – results
- Algorithms tested: Q-learning, SARSA, HRL, NFQ, TD-FALCON, and ML
- Plot of normalized average reward (curves: ML; HRL, Q-learning, and SARSA; NFQ; TD-FALCON)
- ML can work in more general environments
NFQ Results
- Note the highlighted lines; observe both when they occur and their general profile
- Some runs perform very well, others poorly
- Due to oscillation, the average is worse than SARSA and the other baselines
Conclusion
- Designed and implemented several enhancements to the Motivated Learning architecture:
  - Bias calculations
  - Goal selection
  - Setting desired resource levels
- Compared ML to several RL algorithms using a basic test environment and a simple reward scenario
- ML achieved a higher average reward faster than the other algorithms tested
Questions?
Bias signal calculation for resources
- A bias signal triggers an abstract pain and is defined depending on the type of perceived situation; the bias reflects the level of need associated with the resource/action
- For resource-related pain:
  - Rd is a desired resource value (at a sensory input si)
  - Rc is a current resource value
  - ε is a small positive number
  - γ regulates how quickly the pain increases
  - δr = 1 when the resource is desired, δr = -1 when it is undesired, and δr = 0 otherwise
- The shown equation was chosen because it is linear (but does not diverge to infinity)
- An illustrative sketch consistent with these properties follows below
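An illustrative sketch of a resource-related bias consistent with the properties listed above: linear in the resource gap, sign set by δr, rate controlled by γ, and ε keeping the value finite. The specific form below is an assumption for illustration, not the exact equation from the slides.

```python
def resource_bias(r_desired, r_current, delta_r, gamma=1.0, eps=1e-3):
    """Bias grows linearly with the gap between desired and current resource levels (assumed form)."""
    return delta_r * gamma * (r_desired - r_current) / (r_desired + eps)

# Example: a desired resource (delta_r = +1) at half of its desired level.
print(resource_bias(r_desired=10.0, r_current=5.0, delta_r=1))   # positive bias
print(resource_bias(r_desired=10.0, r_current=5.0, delta_r=0))   # neutral resource -> no bias
```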
Learning and selecting actions
- How do we go from bias to action selection? Bias is used to calculate pain, and changes in pain reflect how an action affects the agent, so we update the weights between pains and goals and use these weights to select actions/goals
- Goals are selected based on the pain-goal weights:
  - δp indicates how the associated pain changed
  - ∆a, outside of μg, ensures the weights stay below the ceiling of αg = 1
  - μg determines the rate of change
- A sketch of a weight update with these properties follows below
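A minimal sketch of a pain-goal weight update with the properties described above: the change is scaled by the learning rate μg and by the observed pain change δp, while the (αg - w) factor keeps the weight below the ceiling αg = 1. The exact form of ∆a from the slides is not reproduced; (αg - w) stands in for it here as an assumption, as does the sign convention that δp > 0 means the pain decreased.

```python
def update_pain_goal_weight(w, delta_p, mu_g=0.1, alpha_g=1.0):
    """Strengthen the pain-goal link when the goal reduced the pain, saturating at alpha_g."""
    return w + mu_g * delta_p * (alpha_g - w)

w = 0.0
for _ in range(5):
    w = update_pain_goal_weight(w, delta_p=1.0)   # repeated pain reduction strengthens the link
    print(round(w, 3))                            # approaches but never exceeds alpha_g = 1
```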
Comparing Reinforcement Learning to Motivated Learning

Reinforcement Learning          | Motivated Learning
--------------------------------|--------------------------------------
Single value function           | Multiple value functions
Measurable rewards              | Internal, immeasurable rewards
Predictable                     | Unpredictable
Objectives set by the designer  | Sets its own ("abstract") objectives
Maximizes the reward            | Solves a minimax problem
Potentially unstable            | Always stable
Always active                   | Acts only when needed