Structure in Reinforcement Learning

Structure in Reinforcement Learning
Stuart Russell, Computer Science Division, UC Berkeley
Joint work with Ron Parr, David Andre, Bhaskara Marthi, Andy Zimdars, David Latham, and Carlos Guestrin

Outline
- Structure in behavior
- Temporal abstraction and ALisp
- Temporal decomposition of value functions
- Concurrent behavior and concurrent ALisp
- Threadwise decomposition of value functions
- Putting it all together

Structured behavior
The world is very complicated, but we manage somehow. Behavior is usually very structured and deeply nested: moving my tongue, within shaping this syllable, within saying this word, within saying this sentence, within making a point about nesting, within explaining structured behavior, within giving this talk.
Modularity/Independence: selection of tongue motions proceeds independently of most of the variables relevant to giving this talk, given the choice of word.

Example problem: Stratagus
Challenges:
- Large state space
- Large action space
- Complex policies

Running example
- Peasants can move, pick up, and drop off resources
- Penalty for collision
- Cost of living each step
- Reward for dropping off resources
- Goal: gather 10 gold + 10 wood
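
For concreteness, here is a minimal per-step reward sketch for this running example in plain Common Lisp. The slide only describes the reward structure qualitatively, so the -5 collision penalty and +10 drop-off reward below are assumed magnitudes, not values from the talk.

  ;; Hypothetical per-step reward for the running example.  The -5 and +10
  ;; magnitudes are assumptions; only the structure comes from the slide.
  (defun step-reward (dropped-off-p collided-p)
    (+ -1                        ; cost of living each step
       (if collided-p -5 0)      ; collision penalty (assumed magnitude)
       (if dropped-off-p 10 0))) ; reward for dropping off a resource (assumed magnitude)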

Outline
- Structure in behavior
- Temporal abstraction and ALisp
- Temporal decomposition of value functions
- Concurrent behavior and concurrent ALisp
- Threadwise decomposition of value functions
- Putting it all together

Reinforcement Learning
[Diagram: a learning algorithm interacts with the environment, sending actions a and receiving states s and rewards r; its output is a policy.]

Q-functions
Represent policies implicitly using a Q-function:
Qπ(s,a) = expected total reward if I do action a in environment state s and follow policy π thereafter.
Fragment of an example Q-function:
  s                                               | a              | Q(s,a)
  Peas1Loc=(2,3), Peas2Loc=(4,2), Gold=8, Wood=3  | (East, Pickup) | 7.5
  Peas1Loc=(4,4), Peas2Loc=(6,3), Gold=4, Wood=7  | (West, North)  | -12
  Peas1Loc=(5,1), Peas2Loc=(1,2), Gold=8, Wood=3  | (South, West)  | 1.9
  ...                                             | ...            | ...
Note that this Q-function is defined over environment states and primitive actions; it is large to represent, and it can be decomposed either temporally or threadwise.
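
As a reference point for the decompositions that follow, here is a minimal tabular Q-function sketch in plain Common Lisp (not ALisp). The hash-table representation and the q-update helper are illustrative assumptions; the talk's algorithms operate on the SMDP and decomposed forms introduced later, not on this flat table.

  ;; Minimal sketch: a flat tabular Q-function and one Q-learning backup.
  (defparameter *q-table* (make-hash-table :test #'equal))

  (defun q-value (state action)
    (gethash (list state action) *q-table* 0.0))

  (defun max-q (state actions)
    (reduce #'max (mapcar (lambda (a) (q-value state a)) actions)))

  (defun q-update (state action reward next-state actions
                   &key (alpha 0.1) (gamma 1.0))
    "One backup: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]."
    (let* ((old (q-value state action))
           (target (+ reward (* gamma (max-q next-state actions)))))
      (setf (gethash (list state action) *q-table*)
            (+ old (* alpha (- target old))))))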

Hierarchy in RL
It is natural to think about abstract states such as "at home", "in Soda Hall", etc., arranged in an abstraction hierarchy.
Idea 1: set up a decision process among the abstract states.
Fails! Transitions between abstract states are usually non-Markovian.

Hierarchy in RL
Idea 2: define abstract actions such as "drive to work", "park", etc., and set up a decision process with choice states and abstract actions.
The resulting decision problem is semi-Markov (Forestier & Varaiya, 1978).
Temporal abstraction in RL: HAMs (Parr & Russell), Options (Sutton & Precup), MAXQ (Dietterich).

Partial programs
Idea 3: allow choices only in some states, i.e., impose prior constraints on behavior in the form of partially specified programs.
Partial programming language = programming language + free choices:
- ALisp (Andre & Russell, 2002)
- Concurrent ALisp (Marthi et al., 2005)
- (cf. GoLog and ConGoLog)
Analogy to Bayes nets: domain experts supply the structure (partial programs), yielding a factored representation of the Q-function (Dietterich, 2000).

Hierarchical RL and ALisp
[Diagram: as in the RL diagram, but the learning algorithm additionally takes a partial program as input; it sends actions a to the environment, receives states s and rewards r, and outputs a completion of the partial program rather than an unrestricted policy.]

Single-threaded ALisp program
ALisp is an extension of Lisp that adds a choose operator (free choices to be filled in by learning). In a complete program, the nav-choice below would be a path-planning algorithm.

  (defun top ()
    ;; choose between the two top-level tasks, forever
    (loop do (choose 'top-choice
                     (gather-wood)
                     (gather-gold))))

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice 'mine-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    ;; repeat primitive moves until the peasant reaches dest
    (until (= (pos (get-state)) dest)
      (with-choice 'nav-choice (move '(N S E W NOOP))
        (action move))))

The program state includes the program counter, the call stack, and global variables.

Q-functions
Represent completions using a Q-function over the joint state ω = environment state + program state.
MDP + partial program = SMDP over {ω}.
Qπ(ω,u) = expected total reward if I make choice u in ω and follow completion π thereafter.
Example Q-function:
  ω                                     | u           | Q(ω,u)
  At nav-choice, Pos=(2,3), Dest=(6,5)  | North       | 15
  At resource-choice, Gold=7, Wood=3    | Gather-wood | -42
  ...                                   | ...         | ...

Outline
- Structure in behavior
- Temporal abstraction and ALisp
- Temporal decomposition of value functions
- Concurrent behavior and concurrent ALisp
- Threadwise decomposition of value functions
- Putting it all together

Temporal Q-decomposition
[Diagram: an execution trace through the subroutine call stack (Top, GatherWood, Nav(Forest1), GatherGold, Nav(Mine2)), with the reward segments labelled Qr, Qc, and Qe.]
Temporal decomposition: Q = Qr + Qc + Qe, where
- Qr(ω,u) = reward while doing u
- Qc(ω,u) = reward in the current subroutine after doing u
- Qe(ω,u) = reward after the current subroutine
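
A minimal sketch of how the decomposed value would be evaluated. The three components are stored here in separate hash tables keyed by (ω, u); that representation is an assumption for illustration, not the system's actual data structure.

  ;; Sketch only: three separately learned components whose sum is the Q-value.
  (defparameter *q-r* (make-hash-table :test #'equal))  ; reward while doing u
  (defparameter *q-c* (make-hash-table :test #'equal))  ; reward in current subroutine after u
  (defparameter *q-e* (make-hash-table :test #'equal))  ; reward after current subroutine

  (defun decomposed-q (omega u)
    (let ((key (list omega u)))
      (+ (gethash key *q-r* 0.0)
         (gethash key *q-c* 0.0)
         (gethash key *q-e* 0.0))))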

Q-learning for ALisp
- The temporal decomposition enables state abstraction: e.g., while navigating, Qr does not depend on gold reserves.
- State abstraction means fewer parameters, faster learning, and better generalization.
- An algorithm for learning the temporally decomposed Q-function (Andre & Russell, 2002) yields the optimal completion of the partial program.

Outline
- Structure in behavior
- Temporal abstraction and ALisp
- Temporal decomposition of value functions
- Concurrent behavior and concurrent ALisp
- Threadwise decomposition of value functions
- Putting it all together

Multieffector domains
- Domains where the action is a vector; each component of the action vector corresponds to an effector.
- In Stratagus, each unit/building is an effector.
- The number of joint actions is exponential in the number of effectors (with n effectors and k primitive actions each, there are k^n joint actions).

HRL in multieffector domains
- A single ALisp program: the partial program must essentially implement multiple control stacks, independencies between tasks are not exploited, and the temporal decomposition is lost.
- A separate ALisp program for each effector: hard to achieve coordination among effectors.
- Our approach: a single multithreaded partial program that controls all effectors.

Threads and effectors
- Threads = tasks (e.g., Get-gold, Get-wood, Defend-Base)
- Each effector is assigned to a thread
- Threads can be created/destroyed
- Effectors can be moved between threads
- The set of effectors can change over time

Concurrent ALisp syntax
An extension of ALisp; choose and get-state are as before. New or modified operations:
- (action (e1 a1) ... (en an)) : each effector ei does ai
- (spawn thread fn effectors) : start a new thread that calls function fn, and reassign effectors to it
- (reassign effectors dest-thread) : reassign effectors to thread dest-thread
- (my-effectors) : return the list of effectors assigned to the calling thread
The program should also specify how newly arriving effectors are assigned to threads on each step.

An example Concurrent ALisp program (compared with the single-threaded ALisp program)
Single-threaded ALisp, as before:

  (defun top ()
    (loop do (choose 'top-choice
                     (gather-gold)
                     (gather-wood))))

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun gather-gold ()
    (with-choice 'mine-choice (dest *goldmine-list*)
      (nav dest)
      (action 'get-gold)
      (nav *base-loc*)
      (action 'dropoff)))

  (defun nav (dest)
    (until (= (my-pos) dest)
      (with-choice 'nav-choice (move '(N S E W NOOP))
        (action move))))

Concurrent ALisp: top waits for an effector, then spawns a task thread for it; gather-wood is unchanged:

  (defun top ()
    (loop do
      (until (my-effectors) (choose 'dummy))
      (setf peas (first (my-effectors)))
      (choose 'top-choice
              (spawn gather-wood peas)
              (spawn gather-gold peas))))

  (defun gather-wood ()
    (with-choice 'forest-choice (dest *forest-list*)
      (nav dest)
      (action 'get-wood)
      (nav *base-loc*)
      (action 'dropoff)))

Concurrent ALisp semantics
[Diagram: two peasant threads executing in parallel across environment timesteps 26 and 27. Thread 1 (Peasant 1) runs gather-wood and its nav loop; Thread 2 (Peasant 2) runs gather-gold and its nav loop. Each thread cycles through the states Running, Waiting for joint action, Paused, and Making joint choice; choices are made jointly, in a coordinated fashion, across threads.]

Concurrent ALisp semantics
- Threads execute independently until they hit a choice or an action.
- Wait until all threads are at a choice or an action.
- If all effectors have been assigned an action, perform that joint action in the environment; otherwise, make a joint choice.
- This can be converted to a semi-Markov decision process, assuming the partial program is deadlock-free and has no race conditions.
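
A minimal sketch of the synchronization rule above, in plain Common Lisp. It is not a Concurrent ALisp interpreter: the plist representation of a paused thread and the two stub hooks are assumptions made for illustration.

  ;; Each "thread" is modelled as a plist that has already run until it is
  ;; paused, with :status :at-action or :at-choice and a :pending item.
  (defun execute-joint-action (actions)
    (format t "Joint action in environment: ~a~%" actions))      ; stand-in stub

  (defun make-joint-choice (choosing-threads)
    (format t "Joint choice among ~a thread(s)~%" (length choosing-threads)))  ; stand-in stub

  (defun step-threads (threads)
    (if (every (lambda (th) (eq (getf th :status) :at-action)) threads)
        ;; every effector has an action pending: do the joint action
        (execute-joint-action (mapcar (lambda (th) (getf th :pending)) threads))
        ;; otherwise, make a joint choice among the threads that are choosing
        (make-joint-choice (remove-if-not
                            (lambda (th) (eq (getf th :status) :at-choice))
                            threads))))

For example, (step-threads '((:status :at-action :pending north) (:status :at-choice :pending nil))) would trigger a joint choice, since not all effectors have actions assigned.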

Q-functions
To complete the partial program, at each choice state ω we need to specify choices for all choosing threads; so Q(ω,u) is as before, except that u is a joint choice.
Example entry: ω = (Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14), u = (Peas1: East, Peas3: Forest2), Q(ω,u) = 15.7.

Making joint choices at runtime
- Large state space and large choice space: at state ω we want to choose argmax_u Q(ω,u), but the number of joint choices is exponential in the number of choosing threads.
- Use relational linear value function approximation with coordination graphs (Guestrin et al., 2003) to represent and maximize Q efficiently.
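
To make the exponential blow-up concrete, here is a brute-force joint maximization sketch in plain Common Lisp. The function names and arguments are illustrative; the system in the talk avoids exactly this enumeration by exploiting coordination-graph structure.

  ;; Sketch only: enumerate every joint choice and pick the best.
  (defun all-joint-choices (choice-sets)
    "Cartesian product of the per-thread choice sets."
    (if (null choice-sets)
        (list nil)
        (loop for c in (first choice-sets)
              append (mapcar (lambda (tail) (cons c tail))
                             (all-joint-choices (rest choice-sets))))))

  (defun best-joint-choice (omega choice-sets q-fn)
    "Return the joint choice u maximizing (funcall q-fn omega u)."
    (let ((best nil)
          (best-val nil))
      (dolist (u (all-joint-choices choice-sets) best)
        (let ((v (funcall q-fn omega u)))
          (when (or (null best-val) (> v best-val))
            (setf best u
                  best-val v))))))

With 15 choosing threads and 5 choices each, all-joint-choices would enumerate 5^15 ≈ 3 × 10^10 joint choices, which is what the coordination graph is there to avoid.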

Basic learning procedure
- Apply Q-learning on the equivalent SMDP.
- Boltzmann exploration; the maximization at each step is done efficiently using the coordination graph.
- Theorem: in the tabular case, this converges to the optimal completion of the partial program.
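
A minimal sketch of Boltzmann (softmax) exploration over a flat list of choices, again in plain Common Lisp. The q-fn argument is an assumed interface and no numerical safeguards are included; in the real system the choices are joint choices handled through the coordination graph rather than a flat list.

  ;; Sample a choice u with probability proportional to exp(Q(omega,u)/temperature).
  (defun boltzmann-choice (omega choices q-fn &key (temperature 1.0))
    (let* ((weights (mapcar (lambda (u)
                              (exp (/ (funcall q-fn omega u) temperature)))
                            choices))
           (total (reduce #'+ weights))
           (r (random total)))           ; uniform draw in [0, total)
      (loop for u in choices
            for w in weights
            do (decf r w)
            when (<= r 0) return u
            finally (return (car (last choices))))))

Lowering the temperature makes the choice greedier; raising it makes exploration more uniform.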

Outline
- Structure in behavior
- Temporal abstraction and ALisp
- Temporal decomposition of value functions
- Concurrent behavior and concurrent ALisp
- Threadwise decomposition of value functions
- Putting it all together

Problems with basic learning method
- The temporal decomposition of the Q-function is lost.
- There is no credit assignment among threads: suppose peasant 1 drops off some gold at the base while peasant 2 wanders aimlessly; peasant 2 thinks he has done very well!
- This significantly slows learning as the number of peasants increases.

Threadwise decomposition
Idea: decompose the reward among threads (Russell & Zimdars, 2003). E.g., thread j receives reward only when peasant j drops off resources or collides with other peasants.
Qj(ω,u) = expected total reward received by thread j if we make joint choice u and make globally optimal choices thereafter.
Threadwise Q-decomposition: Q = Q1 + … + Qn.
This solves both problems, and the decomposition can also be applied hierarchically.

Learning threadwise decomposition
[Diagram: at each action state, the environment reward r is decomposed into per-thread rewards r1, r2, r3, which are routed to the Peasant 1, Peasant 2, and Peasant 3 threads (under a Top thread), while the joint action a is sent to the environment.]

Learning threadwise decomposition
[Diagram: at each choice state ω, each peasant thread performs its own Q-update on its component Qj(ω,·); the components Q1(ω,·) + Q2(ω,·) + Q3(ω,·) are summed, and the argmax of the sum gives the joint choice u.]
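
A minimal sketch of a threadwise-decomposed update in the spirit of Russell & Zimdars (2003). The table-per-thread representation, argument names, and the externally supplied u-star (the joint choice maximizing the sum of components, e.g. via the coordination graph) are assumptions; the full algorithm also incorporates the temporal decomposition and SMDP timing, which this omits.

  ;; Each thread j keeps its own table Q_j and is updated with its own
  ;; reward r_j, but the bootstrap value is taken at the jointly greedy
  ;; choice U-STAR in the next state.
  (defun threadwise-update (q-tables omega u rewards omega-next u-star
                            &key (alpha 0.1) (gamma 1.0))
    "Q-TABLES: one hash table per thread, keyed by (omega u).  REWARDS: per-thread rewards."
    (loop for qj in q-tables
          for rj in rewards
          do (let* ((old (gethash (list omega u) qj 0.0))
                    (target (+ rj (* gamma (gethash (list omega-next u-star) qj 0.0)))))
               (setf (gethash (list omega u) qj)
                     (+ old (* alpha (- target old)))))))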

Outline
- Structure in behavior
- Temporal abstraction and ALisp
- Temporal decomposition of value functions
- Concurrent behavior and concurrent ALisp
- Threadwise decomposition of value functions
- Putting it all together

Threadwise and temporal decomposition
[Diagram: a Top thread plus Peasant 1, 2, and 3 threads, each contributing its own temporally decomposed value.]
Qj = Qj,r + Qj,c + Qj,e, where
- Qj,r(ω,u) = expected reward gained by thread j while doing u
- Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
- Qj,e(ω,u) = expected reward gained by thread j after the current subroutine
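
A short sketch of how the two decompositions combine when the value is evaluated: a sum over threads of each thread's three temporal components. The thread-components argument (a list of (q-r q-c q-e) function triples, one per thread) is a hypothetical packaging chosen only for illustration.

  ;; Sketch: overall Q-value = sum over threads j of Q_{j,r} + Q_{j,c} + Q_{j,e}.
  (defun combined-q (omega u thread-components)
    (loop for (qr qc qe) in thread-components
          sum (+ (funcall qr omega u)
                 (funcall qc omega u)
                 (funcall qe omega u))))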

Resource gathering with 15 peasants
[Learning curves: reward of the learnt policy vs. number of learning steps (× 1000), comparing four learners: Flat, Undecomposed, Threadwise, and Threadwise + Temporal.]

Conclusion
Temporal abstraction + temporal value decomposition + concurrency + threadwise value decomposition + relational function approximation + coordination graphs + reward shaping => somewhat effective model-free learning.
Current directions:
- Model-based learning and lookahead
- Temporal logic as a behavioral constraint language
- Partial observability (belief state = program state)
- Complex motor control tasks