Download presentation
Presentation is loading. Please wait.
1
Concurrent Hierarchical Reinforcement Learning
11/19/2018 9:29:53 PM Concurrent Hierarchical Reinforcement Learning Bhaskara Marthi, Stuart Russell, David Latham UC Berkeley Carlos Guestrin CMU
2
Example problem: Stratagus
11/19/2018 9:29:53 PM Example problem: Stratagus Challenges Large state space Large action space Complex policies
3
Writing agents for Stratagus
11/19/2018 9:29:53 PM Writing agents for Stratagus Game company : N person-years to write “AI” for game Hierarchical RL : concurrent partial program + learning optimal completion Remind audience that games need AI to allow people to play against computer Standard RL – learn from tabula rasa
4
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
5
Running example Peasants can move, pickup and dropoff
11/19/2018 9:29:53 PM Running example Peasants can move, pickup and dropoff Penalty for collision Cost-of-living each step Reward for dropping off resources Goal : gather 10 gold + 10 wood
6
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
7
Reinforcement Learning
11/19/2018 9:29:53 PM Reinforcement Learning a Learning algorithm s,r Policy
8
Q-functions Represent policies implicitly using a Q-function
11/19/2018 9:29:53 PM Q-functions s a Q(s,a) ... Peas1Loc=(2,3),Peas2Loc=(4,2),Gold=8,Wood=3 (East,Pickup) 7.5 Peas1Loc=(4,4), Peas2Loc=(6,3), Gold=4, Wood=7 (West, North) -12 Peas1Loc=(5,1), Peas2Loc=(1,2), Gold=8, Wood=3 (South, West) 1.9 Fragment of example Q-function Represent policies implicitly using a Q-function Qπ(s,a) = “Expected total reward if I do action a in environment state s and follow policy π thereafter” Talk about Q-function being large to represent, and that it can be decomposed either temporally or threadwise Mention that Q-function refers to environment state and primitive actions
9
Hierarchical RL and ALisp
11/19/2018 9:29:53 PM Hierarchical RL and ALisp Partial program a Learning algorithm s,r Completion
10
Single-threaded Alisp program
11/19/2018 9:29:53 PM (defun top () (loop do (choose ‘top-choice (gather-wood) (gather-gold)))) (defun gather-wood () (with-choice ‘forest-choice (dest *forest-list*) (nav dest) (action ‘get-wood) (nav *base-loc*) (action ‘dropoff))) (defun gather-gold () (with-choice ‘mine-choice (dest *goldmine-list*) (nav dest)) (action ‘get-gold) (nav *base-loc*)) (action ‘dropoff))) (defun nav (dest) (until (= (pos (get-state)) dest) (with-choice ‘nav-choice (move ‘(N S E W NOOP)) (action move)))) Animation, fade outs… Program state Motivate Q(\omega,…) here Before walking through program, Alisp is an extension of Lisp that adds a choose operator Mention nav-choice wd have to be path-planning alg in complete program Program state includes Program counter Call stack Global variables
11
Q-functions Represent completions using Q-function
11/19/2018 9:29:53 PM Q-functions ω u Q(ω,u) ... At nav-choice, Pos=(2,3), Dest=(6,5) North 15 At resource-choice, Gold=7, Wood=3 Gather-wood -42 Example Q-function Represent completions using Q-function Joint state ω = env state + program state MDP + partial program = SMDP over {ω} Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter”
12
Temporal Q-decomposition
11/19/2018 9:29:53 PM Top Top Top GatherWood GatherGold Nav(Forest1) Nav(Mine2) Qr Qc Qe Q Temporal decomposition Q = Qr+Qc+Qe where Qr(ω,u) = reward while doing u Qc(ω,u) = reward in current subroutine after doing u Qe(ω,u) = reward after current subroutine
13
Q-learning for ALisp Temporal decomposition => state abstraction
E.g., while navigating, Qr doesn’t depend on gold reserves State abstraction => fewer parameters, faster learning, better generalization Algorithm for learning temporally decomposed Q-function (Andre+Russell 02) Yields optimal completion of partial program
14
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
15
Multieffector domains
11/19/2018 9:29:53 PM Multieffector domains Domains where the action is a vector Each component of action vector corresponds to an effector Stratagus: each unit/building is an effector #actions exponential in #effectors Define multieffector domains, remove ALISP
16
HRL in multieffector domains
11/19/2018 9:29:53 PM HRL in multieffector domains Single Alisp program Partial program must essentially implement multiple control stacks Independencies between tasks not used Temporal decomposition is lost Separate Alisp program for each effector Hard to achieve coordination among effectors Our approach - single multithreaded partial program to control all effectors
17
Threads and effectors Threads = tasks
Get-gold Get-wood Threads = tasks Each effector assigned to a thread Threads can be created/destroyed Effectors can be moved between threads Set of effectors can change over time Defend-Base Get-wood
18
Concurrent Alisp syntax
11/19/2018 9:29:53 PM Extension of ALisp New/modified operations : (action (e1 a1) ... (en an)) each effector ei does ai (spawn thread fn effectors) start new thread that calls function fn, and reassign effectors to it (reassign effectors dest-thread) reassign effectors to thread dest-thread (my-effectors) return list of effectors assigned to calling thread Program should also specify how to assign new effectors on each step to a thread Say choose, get-state are as before
19
Concurrent Alisp syntax
11/19/2018 9:29:53 PM Extension of ALisp New/modified operations : (action (e1 a1) ... (en an)) each effector ei does ai (spawn thread fn effectors) start new thread that calls function fn, and reassign effectors to it (reassign effectors dest-thread) reassign effectors to thread dest-thread (my-effectors) return list of effectors assigned to calling thread Program should also specify how to assign new effectors on each step to a thread Say choose, get-state are as before
20
Concurrent Alisp syntax
11/19/2018 9:29:53 PM Extension of ALisp New/modified operations : (action (e1 a1) ... (en an)) each effector ei does ai (spawn thread fn effectors) start new thread that calls function fn, and reassign effectors to it (reassign effectors dest-thread) reassign effectors to thread dest-thread (my-effectors) return list of effectors assigned to calling thread Program should also specify how to assign new effectors on each step to a thread Say choose, get-state are as before
21
Concurrent Alisp syntax
11/19/2018 9:29:53 PM Extension of ALisp New/modified operations : (action (e1 a1) ... (en an)) each effector ei does ai (spawn thread fn effectors) start new thread that calls function fn, and reassign effectors to it (reassign effectors dest-thread) reassign effectors to thread dest-thread (my-effectors) return list of effectors assigned to calling thread Program should also specify how to assign new effectors on each step to a thread Say choose, get-state are as before
22
Concurrent Alisp syntax
11/19/2018 9:29:53 PM Extension of ALisp New/modified operations : (action (e1 a1) ... (en an)) each effector ei does ai (spawn thread fn effectors) start new thread that calls function fn, and reassign effectors to it (reassign effectors dest-thread) reassign effectors to thread dest-thread (my-effectors) return list of effectors assigned to calling thread Program should also specify how to assign new effectors on each step to a thread Picture on the side explaining concepts Language constructs are an important contribution, should add more emphasis to this part of the talk Say choose, get-state are as before
23
An example Concurrent ALisp program
An example single-threaded ALisp program 11/19/2018 9:29:53 PM (defun gather-gold () (with-choice ‘mine-choice (dest *goldmine-list*) (nav dest) (action ‘get-gold) (nav *base-loc*) (action ‘dropoff))) (defun nav (dest) (until (= (my-pos) dest) (with-choice ‘nav-choice (move ‘(N S E W NOOP)) (action move)))) (defun top () (loop do (choose ‘top-choice (gather-gold) (gather-wood)))) (defun gather-wood () (with-choice ‘forest-choice (dest *forest-list*) (nav dest) (action ‘get-wood) (nav *base-loc*) (action ‘dropoff))) (defun top () (loop do (until (my-effectors) (choose ‘dummy)) (setf peas (first (my-effectors)) (choose ‘top-choice (spawn gather-wood peas) (spawn gather-gold peas)))) (defun gather-wood () (with-choice ‘forest-choice (dest *forest-list*) (nav dest) (action ‘get-wood) (nav *base-loc*) (action ‘dropoff)))
24
Concurrent Alisp semantics
11/19/2018 9:29:53 PM Peasant 1 defun gather-wood () (setf dest (choose *forests*)) (nav dest) (action ‘get-wood) (nav *home-base-loc*) (action ‘put-wood)) (defun nav (Wood1) (loop until (at-pos Wood1) (setf d (choose ‘(N S E W R))) do (action d) )) Thread 1 Making joint choice Running Waiting for joint action Paused (defun nav (Gold2) (loop until (at-pos Gold2) (setf d (choose ‘(N S E W R))) do (action d) )) (defun gather-gold () (setf dest (choose *goldmines*)) (nav dest) (action ‘get-gold) (nav *home-base-loc*) (action ‘put-gold)) Show text for Wait for other agent Coordinated choice Explain why joint choices are necessary Peasant 2 Thread 2 Waiting for joint action Running Paused Making joint choice Environment timestep 26 Environment timestep 27
25
Concurrent Alisp semantics
11/19/2018 9:29:53 PM Threads execute independently until they hit a choice or action Wait until all threads are at a choice or action If all effectors have been assigned an action, do that joint action in environment Otherwise, make joint choice Can be converted to a semi-Markov decision process assuming partial program is deadlock-free and has no race conditions Perhaps
26
Q-functions 11/19/2018 9:29:53 PM ω u Q(ω,u) ... Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14 (Peas1:East, Peas3:Forest2) 15.7 Example Q-function To complete partial program, at each choice state ω, need to specify choices for all choosing threads So Q(ω,u) as before, except u is a joint choice
27
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
28
Making joint choices at runtime
Large state space Large choice space At state ω, want to choose argmaxu Q(ω,u) # joint choices exponential in # choosing threads Use relational linear value function approximation with coordination graphs (Guestrin et al 2003) to represent, maximize Q efficiently
29
Basic learning procedure
11/19/2018 9:29:53 PM Apply Q-learning on equivalent SMDP Boltzmann exploration, maximization on each step done efficiently using coordination graph Theorem: in tabular case, converges to the optimal completion of the partial program
30
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
31
Problems with basic learning method
Temporal decomposition of Q-function lost No credit assignment among threads Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly Peasant 2 thinks he’s done very well!! Significantly slows learning as number of peasants increases
32
Threadwise decomposition
11/19/2018 9:29:53 PM Threadwise decomposition Idea : decompose reward among threads (Russell+Zimdars, 2003) E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants Qj(ω,u) = “Expected total reward received by thread j if we make joint choice u and make globally optimal choices thereafter” Threadwise Q-decomposition Q = Q1+…Qn This solves both problems Mention that this can be hierarchical
33
Learning threadwise decomposition
11/19/2018 9:29:53 PM Peasant 3 thread Peasant 2 thread Peasant 1 thread r2 r1 r3 Top thread Remention that it can be hierarchical r decomp a Action state
34
Learning threadwise decomposition
11/19/2018 9:29:53 PM Peasant 3 thread Q-update Peasant 2 thread Q-update Peasant 1 thread Q2(ω,·) Q-update Q1(ω,·) Q3(ω,·) Top thread + argmax Q (ω,·) = u Choice state ω
35
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
36
Threadwise and temporal decomposition
11/19/2018 9:29:53 PM Peasant 3 thread Peasant 2 thread Top thread Peasant 1 thread Qj = Qj,r + Qj,c + Qj,e where Qj,r (ω,u) = Expected reward gained by thread j while doing u Qj,c (ω,u) = Expected reward gained by thread j after u but before leaving current subroutine Qj,e (ω,u) = Expected reward gained by thread j after current subroutine
37
Outline Running example Background Our contributions Results
11/19/2018 9:29:53 PM Outline Running example Background Our contributions Concurrent Alisp language for partial programs Basic learning method … with threadwise decomposition … and temporal decomposition Results
38
Resource gathering with 15 peasants
11/19/2018 9:29:53 PM Resource gathering with 15 peasants Reward of learnt policy Zoom subgraph Appear one at a time Colors Words on the side Flat Num steps learning (x 1000)
39
Resource gathering with 15 peasants
11/19/2018 9:29:53 PM Resource gathering with 15 peasants Undecomposed Reward of learnt policy Zoom subgraph Appear one at a time Colors Words on the side Flat Num steps learning (x 1000)
40
Resource gathering with 15 peasants
11/19/2018 9:29:53 PM Resource gathering with 15 peasants Threadwise Undecomposed Reward of learnt policy Zoom subgraph Appear one at a time Colors Words on the side Flat Num steps learning (x 1000)
41
Resource gathering with 15 peasants
11/19/2018 9:29:53 PM Resource gathering with 15 peasants Threadwise + Temporal Threadwise Undecomposed Reward of learnt policy Zoom subgraph Appear one at a time Colors Words on the side Flat Num steps learning (x 1000)
42
Conclusion Contributions
Concurrent ALisp language for partial programs Algorithms for choice selection, learning Threadwise and temporal decomposition of Q-function Encode prior knowledge and use structure in Q-function for fast learning in very complex multieffector domains Questions?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.