Soar-RL: Reinforcement Learning and Soar
Shelley Nason

Reinforcement Learning
- Reinforcement learning: learning how to act so as to maximize the expected cumulative value of a (numeric) reward signal.
- In Soar terminology, RL learns operator comparison knowledge.
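For reference, the quantity being maximized can be written in standard RL notation (assumed here; the slide does not give the formula), with discount factor gamma:

    G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
    \qquad \text{objective:}\quad \max_{\pi}\; \mathbb{E}_{\pi}\!\left[ G_t \right]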

A learning method for low-knowledge situations
- Non-explanation-based, trial-and-error learning: RL does not require any model of operator effects to improve action choice. The only additional requirement is a reward signal.
- Therefore the RL component should be automatic and general-purpose. Ultimately we want to avoid:
  - Task-specific hand-coding of features
  - Hand-decomposed task or reward structure
  - Programmer tweaking of learning parameters
  - And so on

Q-values
- Q(s,a): the expected discounted sum of future rewards, given that the agent takes action a from state s and follows a particular policy thereafter.
- Given the optimal Q-function, selecting the action with the highest Q-value at each state yields an optimal policy.
[Figure: a chain of state-action transitions, with each successive reward discounted relative to the previous one.]
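In symbols (a standard formulation, assumed here rather than copied from the slide):

    Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right],
    \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)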

Representing the Q-function
- In Soar-RL, the Q-function is stored as productions that test the state and operator and assert a numeric preference (schematic; "number" stands for the rule's stored value):

    sp {RL-rule
       (state <s> ^operator <o> +)
       ...
    -->
       (<s> ^operator <o> = number)}

- During the decision phase, the Q-value of an operator O is taken to be the sum of all numeric preferences asserted for O.
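A minimal sketch of how this representation yields an operator's Q-value (hypothetical Python; the function and argument names are invented, not Soar-RL's API):

    # Sketch only: each RL rule that matches the current state and the candidate
    # operator contributes its stored numeric preference; the Q-value is the sum.
    def q_value(matched_rl_rules, weights):
        """matched_rl_rules: names of RL rules whose conditions matched.
        weights: dict mapping rule name -> numeric preference value."""
        return sum(weights[rule] for rule in matched_rl_rules)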

Learning
- The Q-value is computed as a composition: State x Action -> Set of Features -> Value. The first mapping is built by automatic feature generation; the second is a linear function whose parameters are updated by Q-learning.
- Q-learning: move the value of Q(s_t, a_t) toward r_t + γ * max_a Q(s_{t+1}, a).
- Bootstrapping: update the prediction at one step using the prediction at the next step.
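Written out as an update rule (standard Q-learning with a learning rate alpha, which the slide does not state explicitly):

    Q(s_t, a_t) \leftarrow Q(s_t, a_t)
        + \alpha \left[\, r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,\right]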

Current work: automatic feature generation
- Constructing the rule conditions with which to associate values.
- Since values are stored with RL rules, some RL rule must fire for every state-action pair for bootstrapping to work.
- Sufficient distinctions are required so that the agent will not confuse state-action pairs with significantly different Q-values.
- We want rules that take advantage of opportunities for generalization.

Waterjug task: a reasonable set of rules
One rule for each state-action pair, for instance:

    sp {RL-0003
       :rl
       (state <s> ^jug <j1> ^jug <j2> ^operator <o> +)
       (<j1> ^volume 5 ^contents 5)
       (<j2> ^volume 3 ^contents 0)
       (<o> ^name fill ^jug <j2>)
    -->
       (<s> ^operator <o> = 0)}

46 RL rules in total.

How can rules be generated automatically?
A rule could be built from working memory, for instance:

    sp {RL-1
       :rl
       (state <s> ^name waterjug ^jug <j1> ^jug <j2> ^operator <o> +
                  ^superstate nil ^type state)
       (<j1> ^contents 0 ^free 5 ^volume 5)
       (<j2> ^contents 0 ^free 3 ^volume 3)
       (<o> ^name fill ^jug <j2>)
    -->
       (<s> ^operator <o> = 0)}

But we want generalization…
[Figure: the waterjug state space as jug-content pairs, (0,0), (0,3), (3,0), ..., (5,3), with a +1 reward at the goal state.]

Adaptive representations
- The system constructs the feature set so that there are more distinctions in the parts of the state-action space that require more distinctions.
- Specific-to-general: collect instances and cluster them according to similar values.
- General-to-specific: add distinctions when an area with a single representation appears to contain multiple values.

General-to-specific
Our most general rules are made from operator proposals:

    sp {RL-1
       :rl
       (state <s> ^name waterjug ^jug <j> ^operator <o> +)
       (<j> ^free 3)
       (<o> ^jug <j> ^name fill)
    -->
       (<s> ^operator <o> = 0)}

Such a rule is only generated when no RL rule fires for the selected operator.
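A sketch of that gating step (hypothetical Python; the names and the rule representation are invented stand-ins for what happens inside Soar-RL):

    # Sketch: build a general RL rule from the selected operator's proposal,
    # but only when no existing RL rule fired for that operator.
    def ensure_rl_coverage(proposal_conditions, fired_rl_rules, rule_base):
        """proposal_conditions: conditions copied from the operator's proposal.
        fired_rl_rules: RL rules that asserted a numeric preference for the operator.
        rule_base: list of (conditions, value) pairs standing in for the RL rules."""
        if fired_rl_rules:
            return None                              # operator already covered
        new_rule = (tuple(proposal_conditions), 0.0) # new general rule, initial value 0
        rule_base.append(new_rule)
        return new_rule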

Specialization: an example of an overgeneral representation
"If a jug has contents 3, then pour from that jug into the other jug."
[Figure: the waterjug state space as jug-content pairs, with the +1 reward at the goal state.]

[Chart: predicted Q-values at the state (3,0), with the point where the goal is achieved marked.]

[Chart: predicted Q-value for state (3,0).]

How to fix: add the following rule.
"If there is a jug with volume 3 and contents 3, pour this jug into the other jug."

Designing a specialization procedure
1. How do we decide whether to specialize a given rule?
2. Given that we have chosen to specialize a rule, what conditions should we add to it?
3. (Optional) In what cases, if any, should a rule be eliminated?

Question 2 (what conditions to add to a rule): proposed answer
We are trying an activation-based scheme:
- When an (instantiated) rule R decides to specialize, it finds the most activated WME, w = (ID ATTR VALUE).
- It traces upward through working memory to find a shortest path from ID to some identifier in the rule's instantiation.
- w and the WMEs in the trace are added to the conditions of R to form a new rule R'.
- If R' is not a duplicate of some existing rule, R' is added to the Rete (without removing R).
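A rough sketch of that procedure (hypothetical Python; the WME triples, activation values, and rule representations are simplified stand-ins for Soar's internal data structures):

    from collections import deque

    # Sketch only: extend rule R with its most activated WME plus a shortest chain
    # of WMEs linking that WME's identifier back to an identifier R already tests.
    def specialize(rule_conditions, instantiation_ids, wmes, activation, existing_rules):
        """wmes: list of (id, attr, value) triples currently in working memory.
        instantiation_ids: set of identifiers bound in the rule's instantiation.
        activation: dict mapping a WME triple to its activation level.
        existing_rules: list of condition lists for rules already in the Rete."""
        w = max(wmes, key=lambda wme: activation.get(wme, 0.0))  # most activated WME

        # Breadth-first search "upward": from an identifier, step to any WME whose
        # value is that identifier, and continue from that WME's own identifier.
        parent = {w[0]: None}            # identifier -> WME used to reach it
        frontier = deque([w[0]])
        reached = None
        while frontier:
            ident = frontier.popleft()
            if ident in instantiation_ids:
                reached = ident
                break
            for wme in wmes:
                i, _attr, v = wme
                if v == ident and i not in parent:
                    parent[i] = wme
                    frontier.append(i)
        if reached is None:
            return None                  # no path to the instantiation; give up

        # Collect the WMEs along the shortest path, plus w itself, as new conditions.
        trace, ident = [], reached
        while parent[ident] is not None:
            wme = parent[ident]
            trace.append(wme)
            ident = wme[2]               # step back toward w's identifier

        new_conditions = list(rule_conditions) + trace + [w]
        if any(set(new_conditions) == set(conds) for conds in existing_rules):
            return None                  # duplicate of an existing rule; discard R'
        return new_conditions            # caller adds R' to the Rete, keeping R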

Question 1 (should a given rule be specialized?): proposed answer
- Track the weights (numeric preferences) of rules.
- The weights should converge once the rules are sufficient, so stop specializing when the weights stop moving.
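One way to make "stop specializing when the weights stop moving" concrete (a sketch under assumed window and threshold values, not the criterion from the talk):

    from collections import deque

    # Sketch: a rule is a candidate for specialization only while its weight is
    # still moving; once recent updates are uniformly small, leave it alone.
    class WeightTracker:
        def __init__(self, window=20, threshold=0.01):
            self.history = deque(maxlen=window)   # recent absolute weight changes
            self.threshold = threshold

        def record(self, old_weight, new_weight):
            self.history.append(abs(new_weight - old_weight))

        def should_specialize(self):
            # Not enough evidence yet, or still changing noticeably: keep specializing.
            if len(self.history) < self.history.maxlen:
                return True
            return max(self.history) > self.threshold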

[Chart: tracking the weights of rule RL-3.]

[Chart: # steps in run.]

[Chart: # rules.]

Conclusions
Nuggets:
- It did work (at least on the waterjug task); that is, the agent came to follow the optimal policy.
Coal:
- It makes too many rules:
  1. It specializes rules that do not need specialization.
  2. Specializations are not always useful, since they are chosen heuristically.
  3. It will try to explain non-determinism by making more rules.