Reinforcement Learning and Soar Shelley Nason

Reinforcement Learning
Reinforcement learning: learning how to act so as to maximize the expected cumulative value of a (numeric) reward signal.
- Includes techniques for solving the temporal credit assignment problem
- Well suited to trial-and-error search in the world
- As applied to Soar, provides an alternative for handling tie impasses
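In standard RL notation (an assumption here; the slide gives no formula), the objective is to maximize the expected discounted return:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad \pi^* = \arg\max_\pi \mathbb{E}_\pi[G_t]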

The goal for Soar-RL
Reinforcement learning should be architectural, automatic, and general purpose (like chunking). Ultimately, avoid:
- Task-specific hand-coding of features
- Hand-decomposed task or reward structure
- Programmer tweaking of learning parameters
- And so on

Advantages to Soar from RL
- Non-explanation-based, trial-and-error learning: RL does not require any model of operator effects to improve action choice.
- Ability to handle probabilistic action effects:
  - An action may lead to success sometimes and failure other times. Unless Soar can find a way to distinguish these cases, it cannot correctly decide whether to take the action.
  - RL learns the expected return following an action, so it can trade off potential utility against probability of success.

Representational additions to Soar: Rewards
Learning from rewards instead of in terms of goals makes some tasks easier, especially:
- Taking into account costs and rewards along the path to a goal, and thereby pursuing optimal paths.
- Non-episodic tasks: if learning takes place in a subgoal, the subgoal may never end, or may end too early.

Representational additions to Soar: Rewards
Rewards are numeric values created at a specified place in working memory. The architecture watches this location and collects the rewards placed there. Sources of rewards:
- Productions included in the agent code
- Values written directly to the io-link by the environment

Representational additions to Soar: Numeric preferences
We need the ability to associate numeric values with operator choices.
Symbolic vs. numeric preferences:
- Symbolic: Op 1 is better than Op 2
- Numeric: Op 1 is this much better than Op 2
Why is this useful? Exploration.
- The top-ranked operator may not actually be best.
- Therefore it is useful to keep track of the expected quality of the alternatives.

Representational additions to Soar: Numeric preferences
A numeric preference:

sp {avoid*monster
   (state <s> ^task gridworld ^has-monster <d> ^operator <o>)
   (<o> ^name move ^direction <d>)
-->
   (<s> ^operator <o> = -10)}

New decision phase:
- Process all reject/better/best/etc. preferences
- Compute a value for each remaining candidate operator by summing its numeric preferences
- Choose an operator by Boltzmann softmax
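A minimal sketch of this decision step in Python (the function and data structures are illustrative assumptions, not the Soar implementation):

import math
import random

def select_operator(candidates, numeric_prefs, temperature=1.0):
    # candidates: operators that survived the symbolic preference phase
    # numeric_prefs: operator name -> list of numeric preference values that fired
    # The value of each candidate is the sum of its numeric preferences.
    values = {op: sum(numeric_prefs.get(op, [0.0])) for op in candidates}
    # Boltzmann softmax: higher-valued operators are chosen more often, but
    # every candidate keeps some probability, which provides exploration.
    weights = [math.exp(values[op] / temperature) for op in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: move-east is usually chosen, but move-north is still tried occasionally.
op = select_operator(["move-east", "move-north"],
                     {"move-east": [4.0, -3.0], "move-north": [-10.0]})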

Fitting within the RL framework
The sum over numeric preferences has a natural interpretation as an action value Q(s,a): the expected discounted sum of future rewards, given that the agent takes action a from state s.
- Action a is the operator.
- The representation of state s is working memory (including sensor values, memories, and results of reasoning).
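In standard notation (again an assumption, not taken from the slide), the quantity being estimated is:

Q(s,a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right]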

Q(s,a) as a linear combination of Boolean features
Three example RL rules, each pairing a Boolean feature (its conditions) with a weight (its numeric preference):

(state <s> ^task gridworld ^current_location 5 ^destination_location 14 ^operator <o> +)
(<o> ^name move ^direction east)
-->  (<s> ^operator <o> = 4)

(state <s> ^task gridworld ^has-monster east ^operator <o> +)
(<o> ^name move ^direction east)
-->  (<s> ^operator <o> = -10)

(state <s> ^task gridworld ^previous_cell <c> ^operator <o>)
(<o> ^name move ^direction <d>)
-->  (<s> ^operator <o> = -3)
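A minimal sketch of the same idea in Python (the rule names, the q_value helper, and the toy came_from attribute are illustrative assumptions):

def q_value(rules, state, operator):
    # Q(s, a) is the sum of the weights of the rules whose Boolean
    # conditions match the current working memory and candidate operator.
    return sum(weight for predicate, weight in rules.values()
               if predicate(state, operator))

rules = {
    # feature name: (Boolean predicate over (state, operator), weight)
    "move-toward-destination": (
        lambda s, o: (s["current_location"] == 5
                      and s["destination_location"] == 14
                      and o["direction"] == "east"),
        4.0,
    ),
    "avoid-monster": (
        lambda s, o: s.get("has-monster") == o["direction"],
        -10.0,
    ),
    "avoid-previous-cell": (
        lambda s, o: o["direction"] == s.get("came_from"),
        -3.0,
    ),
}

state = {"current_location": 5, "destination_location": 14,
         "has-monster": "east", "came_from": "east"}
operator = {"name": "move", "direction": "east"}
print(q_value(rules, state, operator))  # 4.0 - 10.0 - 3.0 = -9.0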

Example: Numeric preferences fired for O1

sp {MoveToX
   (state <s> ^task gridworld ^current_location <c> ^destination_location <d> ^operator <o> +)
   (<o> ^name move ^direction <dir>)
-->
   (<s> ^operator <o> = 0)}

sp {AvoidMonster
   (state <s> ^task gridworld ^has-monster east ^operator <o> +)
   (<o> ^name move ^direction east)
-->
   (<s> ^operator <o> = -10)}

Bindings: <c> = 14, <d> = 5, <dir> = east
Q(s,O1) = 0 + (-10) = -10

Example: The next decision cycle
The rules that fired for O1 (instantiated):

sp {MoveToX
   (state <s> ^task gridworld ^current_location 14 ^destination_location 5 ^operator <o> +)
   (<o> ^name move ^direction east)
-->
   (<s> ^operator <o> = 0)}

sp {AvoidMonster
   (state <s> ^task gridworld ^has-monster east ^operator <o> +)
   (<o> ^name move ^direction east)
-->
   (<s> ^operator <o> = -10)}

Q(s,O1) = -10
Reward received for O1: r = -5
Sum of numeric preferences for O2: Q(s',O2) = 2

Example: Updating the value for O1
Sarsa update: Q(s,O1) ← Q(s,O1) + α[r + λQ(s',O2) − Q(s,O1)], where the increment α[r + λQ(s',O2) − Q(s,O1)] = 1.36.
The increment is split equally between the two rules that fired for O1 (0.68 each), giving:

sp {|RL-1|
   (state <s> ^task gridworld ^current_location 14 ^destination_location 5 ^operator <o> +)
   (<o> ^name move ^direction east)
-->
   (<s> ^operator <o> = 0.68)}

sp {AvoidMonster
   (state <s> ^task gridworld ^has-monster east ^operator <o> +)
   (<o> ^name move ^direction east)
-->
   (<s> ^operator <o> = -9.32)}
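A minimal sketch of this update in Python (alpha = 0.2 and lam = 0.9 are assumptions chosen because they reproduce the numbers above; they are not stated in the slides):

def sarsa_update(rule_values, reward, next_q, alpha=0.2, lam=0.9):
    # rule_values: current numeric preferences of the rules that fired for O1
    q_old = sum(rule_values)                         # Q(s, O1)
    delta = alpha * (reward + lam * next_q - q_old)  # scaled TD error
    share = delta / len(rule_values)                 # split equally across the rules
    return [v + share for v in rule_values]

# Values from the example: rules at 0 and -10, reward -5, Q(s', O2) = 2.
print(sarsa_update([0.0, -10.0], reward=-5.0, next_q=2.0))  # approx. [0.68, -9.32]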

Eaters Results

Future tasks
- Automatic feature generation (i.e., the LHS of numeric preferences)
  - Likely to start with over-general features and add conditions if a rule's value doesn't converge
- Improved exploratory behavior
  - Automatically handle the parameter controlling randomness in action choice
  - Locally shift away from exploratory acts when confidence in the numeric preferences is high
- Task decomposition and more sophisticated reward functions
  - Task-independent reward functions

Task decomposition: The need for hierarchy
- Primitive operators: Move-west, Move-north, etc.
- Higher-level operators: Move-to-door(room, door)
- Learning a flat policy over primitive operators is bad because:
  - No subgoals (the agent should be looking for the door)
  - No knowledge reuse if the goal is moved
[Figure: Move-to-door operator decomposing into primitive operators such as Move-west]

Task decomposition: Hierarchical RL with Soar impasses
[Figure: a Soar operator no-change impasse; states S1 and S2, operators O1 through O5, with rewards, the next action, and a subgoal reward]

Task decomposition: How to define subgoals
- Move-to-door(east) should terminate upon leaving the room, by whichever door.
- How to indicate whether the goal has concluded successfully? A pseudo-reward, e.g.:
  - +1 if the agent exits through the east door
  - -1 if it exits through the south door
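A minimal sketch of such a pseudo-reward (Python; the function name and door names are illustrative assumptions):

def subgoal_pseudo_reward(exit_door, target_door="east"):
    # Delivered only when the Move-to-door subgoal terminates,
    # i.e. when the agent has left the room by some door.
    return 1.0 if exit_door == target_door else -1.0

print(subgoal_pseudo_reward("east"))   # +1.0: left through the intended door
print(subgoal_pseudo_reward("south"))  # -1.0: left through a different door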

Task decomposition: Hierarchical RL and subgoal rewards
- The reward may be a complicated function of the particular termination state, reflecting progress toward the ultimate goal.
- But the reward must be given at the time of termination, to separate subtask learning from learning in higher tasks.
- Frequent rewards are good.
- But secondary rewards must be given carefully, so that behavior remains optimal with respect to the primary reward.

Reward Structure
[Figure: timeline with a reward following each primitive action]

Reward Structure
[Figure: timeline with rewards over time, where operators group sequences of actions]

Conclusions
- Compared to last year, the programmer's ability to construct the features with which operator values are associated is much more flexible, making the RL component a more useful tool.
- Much work is left to be done on automating parts of the RL component.