RL methods in practice Alekh Agarwal
Recap
In the last lecture we saw:
- Markov Decision Processes
- Value iteration methods for known MDPs
- The RMAX algorithm for unknown MDPs
Crucial assumption: the number of states is small.
From theory to practice
- Agent receives high-dimensional observations
- Want policies: observations ↦ actions
- Observation as state does not scale
- Want policies and value functions to generalize across related observations
Markov Decision Processes (MDPs)
$x_1 \sim \Gamma_1$. Take action $a_1$, observe $r_1(a_1)$. New state $x_2 \sim \Gamma(x_1, a_1)$.
Challenge: allow a large number of unique states $x$.
From observations to states
- State is sufficient for the history, observations are arbitrary
  - First-person view does not include what's behind you
  - Might not be sufficient to capture rewards and transitions
- Typically, $x$ is a modeling choice, can use multiple observations
  - Last 4 observations (i.e. frames) form $x$ in Atari games
  - Current observation + landmarks seen along the way
  - Bird's-eye view instead of first-person view
- We will call $x$ the context instead of the observation to emphasize the difference
- Assuming a sensible choice of $x$ is made, we will focus on learning
Two broad paradigms
- Value function approximation
- Policy improvement/search
Value Function Approximation
- Agent receives context $x$. Find a function of $(x,a)$ similar to the optimal value function $Q^*$
- Intuition: pick $Q$ such that similar $x$'s get similar predictions
- Function approximation: given a class $\mathcal{Q}$ of mappings $Q: (x,a) \mapsto \mathbb{R}$, find the best approximation for $Q^*$
Quality of approximation
Recall the Bellman optimality equations for $Q^*$:
$$Q^*_t(s,a) = \mathbb{E}_{r, s'}\left[\, r + V^*_{t+1}(s') \mid s, a \,\right] = \mathbb{E}_{r, s'}\left[\, r + \max_{a'} Q^*_{t+1}(s', a') \mid s, a \,\right]$$
- We will rewrite the equations with the context $x$ instead of the state $s$
- Drop the subscript $t$, as it can be baked into the context
- Find a function $Q \in \mathcal{Q}$ which satisfies the equations approximately
$Q$-learning
Given a dataset of $H$-step trajectories $\{(x^i_1, a^i_1, r^i_1, \ldots, x^i_H, a^i_H, r^i_H)\}_{i=1}^n$.
Let $Q_{old}$ be our current approximation to $Q^*$. Want $Q_{new}$ to satisfy:
$$\sum_{i=1}^n \sum_{t=1}^H \left( Q_{new}(x^i_t, a^i_t) - r^i_t - \max_a Q_{new}(x^i_{t+1}, a) \right) = 0$$
$Q$-learning
Given a dataset of $H$-step trajectories $\{(x^i_1, a^i_1, r^i_1, \ldots, x^i_H, a^i_H, r^i_H)\}_{i=1}^n$.
Let $Q_{old}$ be our current approximation to $Q^*$. Find $Q_{new}$ to minimize:
$$Q_{new} = \arg\min_{Q \in \mathcal{Q}} \sum_{i=1}^n \sum_{t=1}^H \left( Q(x^i_t, a^i_t) - r^i_t - \max_a Q_{old}(x^i_{t+1}, a) \right)^2$$
Reasonable if $Q_{old}$ and $Q_{new}$ are similar.
$Q$-learning
Let $Q_{old}$ be our current approximation to $Q^*$. Want $Q_{new}$ to approximately minimize:
$$\sum_{i=1}^n \sum_{t=1}^H \left( Q(x^i_t, a^i_t) - r^i_t - \max_a Q_{old}(x^i_{t+1}, a) \right)^2$$
Reasonable if $Q_{old}$ and $Q_{new}$ are similar. In tabular settings, iterate:
$$Q_{new}(x_t, a_t) = Q_{old}(x_t, a_t) - \delta \left( Q_{old}(x_t, a_t) - r_t - \max_a Q_{old}(x_{t+1}, a) \right)$$
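To make the tabular iterate concrete, here is a minimal Python sketch (not from the slides). The environment interface (`env.reset()` and `env.step(a)` returning `(next_state, reward, done)`), the step size `delta`, and the ε-greedy behaviour policy used to gather data are illustrative assumptions.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500, delta=0.1, epsilon=0.1, seed=0):
    """Tabular iterate: Q(x_t, a_t) <- Q(x_t, a_t) - delta * (Q(x_t, a_t) - r_t - max_a Q(x_{t+1}, a))."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy (exploration is discussed a few slides below)
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[x]))
            x_next, r, done = env.step(a)
            target = r if done else r + np.max(Q[x_next])
            # move Q(x, a) toward the one-step Bellman backup with step size delta
            Q[x, a] = Q[x, a] - delta * (Q[x, a] - target)
            x = x_next
    return Q
```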
$Q$-learning with function approximation
Suppose $Q(x,a) = w^\top \phi(x,a)$:
$$w_{new} = w_{old} - \delta \left( w_{old}^\top \phi(x_t, a_t) - r_t - \max_a w_{old}^\top \phi(x_{t+1}, a) \right) \phi(x_t, a_t)$$
Suppose $Q(x,a) = Q_\theta(x,a)$ for some parameters $\theta$:
$$\theta_{new} = \theta_{old} - \delta\, \nabla_\theta \left( Q_\theta(x_t, a_t) - r_t - \max_a Q_{old}(x_{t+1}, a) \right)^2$$
$Q$-learning with function approximation
Suppose $Q(x,a) = Q_\theta(x,a)$ for some parameters $\theta$:
$$\theta_{new} = \theta_{old} - \delta\, \nabla_\theta \left( Q_\theta(x_t, a_t) - r_t - \max_a Q_{old}(x_{t+1}, a) \right)^2$$
This is (stochastic) gradient descent on
$$\sum_{i=1}^n \sum_{t=1}^H \left( Q(x^i_t, a^i_t) - r^i_t - \max_a Q_{old}(x^i_{t+1}, a) \right)^2$$
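For the linear parameterization $Q(x,a) = w^\top \phi(x,a)$, one stochastic-gradient step looks as follows. This is a sketch: the feature map `phi`, the action set `actions`, and the transition tuple are assumed inputs, and the bootstrapped target is held fixed as on the slide.

```python
import numpy as np

def linear_q_update(w, phi, x, a, r, x_next, actions, delta=0.01, done=False):
    """One step of w_new = w_old - delta * (w^T phi(x,a) - r - max_a' w^T phi(x',a')) * phi(x,a)."""
    target = r
    if not done:
        target += max(w @ phi(x_next, b) for b in actions)
    td_error = w @ phi(x, a) - target
    # the target is treated as a constant, so the update only differentiates w^T phi(x, a)
    return w - delta * td_error * phi(x, a)
```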
Fitted $Q$-Iteration (FQI)
Given a dataset of $H$-step trajectories $\{(x^i_1, a^i_1, r^i_1, \ldots, x^i_H, a^i_H, r^i_H)\}_{i=1}^n$.
Repeat until convergence:
- Find $Q$ to minimize $\sum_{i=1}^n \sum_{t=1}^H \left( Q(x^i_t, a^i_t) - r^i_t - \max_a Q_{old}(x^i_{t+1}, a) \right)^2$
- Set $Q_{old} = Q$
Can use any regression method in the inner loop (sketched below).
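A sketch of the FQI loop with an off-the-shelf regressor standing in for "any regression method"; the `(x, a, r, x_next, done)` transition layout and the one-hot action featurization are illustrative choices, not part of the algorithm statement above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(transitions, n_actions, n_iters=20):
    """transitions: list of (x, a, r, x_next, done) tuples, with x and x_next as 1-D feature vectors."""
    def featurize(x, a):
        onehot = np.zeros(n_actions)
        onehot[a] = 1.0
        return np.concatenate([x, onehot])

    X = np.array([featurize(x, a) for x, a, _, _, _ in transitions])
    q_old = None
    for _ in range(n_iters):
        targets = []
        for x, a, r, x_next, done in transitions:
            target = r
            if q_old is not None and not done:
                # r + max_a' Q_old(x', a'), using the regressor from the previous iteration
                target += max(q_old.predict(featurize(x_next, b)[None])[0] for b in range(n_actions))
            targets.append(target)
        q_new = GradientBoostingRegressor().fit(X, np.array(targets))
        q_old = q_new  # set Q_old = Q and repeat
    return q_old
```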
$Q$-learning properties
- Generalizes across related $x$, similar to supervised learning
- Computationally, amounts to solving a sequence of regression problems
- Convergence can be slow, or may never happen
- FQI converges to $Q^*$ given good exploration and conditions on $\mathcal{Q}$
- No prescription for exploration and data collection
How to explore
$\epsilon$-greedy exploration: let $Q$ be the current approximation for $Q^*$.
- $a_t = \arg\max_a Q(x_t, a)$ with probability $1-\epsilon$, uniform at random with probability $\epsilon$
Softmax/Boltzmann exploration:
- $P(a_t = a) \propto \exp\left(\lambda\, Q(x_t, a)\right)$
Both are reasonable when the horizon is small.
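Both rules are a few lines in practice. A sketch, assuming `q_values` holds the vector $Q(x_t, \cdot)$ for the current context and `lam` is the softmax temperature parameter (the $\lambda$ above):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Greedy action with probability 1 - epsilon, uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, lam, rng):
    """Sample a_t with P(a_t = a) proportional to exp(lam * Q(x_t, a))."""
    logits = lam * np.asarray(q_values, dtype=float)
    probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```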
Ensemble exploration
Structure of optimal exploration in contextual bandits:
- Create an ensemble of candidate optimal policies
- Randomize amongst their chosen actions
- e.g.: EXP4, ILTCB, Thompson Sampling, ...
- Adapts to arbitrary policy/value-function classes
Can we extend this idea to RL?
Bootstrapped $Q$-learning (Osband et al., 2016)
Given a dataset of $H$-step trajectories $\{(x^i_1, a^i_1, r^i_1, \ldots, x^i_H, a^i_H, r^i_H)\}_{i=1}^n$:
- Bootstrap resample $K$ datasets, each of size $n$
- Use $Q$-learning or FQI to fit $Q^k$ to dataset $k$
- Build the next dataset by picking a $Q^k$ randomly for each trajectory
- Randomize across, not within, trajectories
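A sketch of the data-management side of this scheme; `fit_q` stands in for any $Q$-learning/FQI routine and `greedy_action` for greedy action selection under a fitted $Q$, both assumed rather than specified by the slide.

```python
import numpy as np

def bootstrapped_q_ensemble(trajectories, fit_q, K=10, seed=0):
    """Resample K bootstrap datasets of trajectories and fit one Q-function per dataset."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    return [fit_q([trajectories[i] for i in rng.integers(n, size=n)]) for _ in range(K)]

def collect_trajectory(env, ensemble, greedy_action, rng):
    """Pick one ensemble member per trajectory and follow it throughout:
    randomize across, not within, trajectories."""
    q = ensemble[rng.integers(len(ensemble))]
    x, done, traj = env.reset(), False, []
    while not done:
        a = greedy_action(q, x)
        x_next, r, done = env.step(a)
        traj.append((x, a, r))
        x = x_next
    return traj
```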
Exploration in RL
- No formal guarantees for these methods with high-dimensional contexts
- With many unique contexts, no method can learn in $\mathrm{poly}(|A|, H, \log|\mathcal{Q}|)$ samples
- Recent methods do well under certain assumptions
- Still an active and ongoing research area
Two broad paradigms
- Value function approximation
- Policy improvement/search
Policy search
We know how to define the value of a policy:
$$V(\pi, \theta) = \mathbb{E}_{(s_1, a_1, r_1, \ldots, s_H, a_H, r_H) \sim \pi}\left[\, r_1 + \cdots + r_H \,\right] = \sum_{t=1}^H \mathbb{E}_{s \sim P^\pi_t}\left[\, \sum_a \pi(s,a)\, Q^\pi_t(s,a) \,\right]$$
Here $P^\pi_t$ is the distribution over states at time step $t$ if we choose all actions according to a stochastic policy $\pi$ and transitions follow the MDP.
Can we directly optimize $V(\pi, \theta)$ over the parameters of $\pi$?
Policy Gradient (Sutton et al., 2000)
$$\frac{\partial V(\pi, \theta)}{\partial \theta} = \sum_{t=1}^H \mathbb{E}_{s \sim P^\pi_t}\left[\, \sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a) \,\right]$$
- The gradient only involves the dependence of $\pi$ on $\theta$, not that of the state distributions $P^\pi_t$
- If we can evaluate it, optimize $\pi$ by gradient ascent over the parameters
- No explicit reliance on a small number of unique states!
Proof sketch
$$\begin{aligned}
\frac{\partial V^\pi_t(s, \theta)}{\partial \theta}
&= \frac{\partial}{\partial \theta} \sum_a \pi(s,a)\, Q^\pi_t(s,a,\theta) \\
&= \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a,\theta) + \pi(s,a)\, \frac{\partial Q^\pi_t(s,a,\theta)}{\partial \theta} \right] \\
&= \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a,\theta) + \pi(s,a)\, \frac{\partial\, \mathbb{E}\left[ r_t + V^\pi_{t+1}(s', \theta) \right]}{\partial \theta} \right] \\
&= \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a,\theta) + \pi(s,a)\, \frac{\partial\, \mathbb{E}_{s'}\left[ V^\pi_{t+1}(s', \theta) \right]}{\partial \theta} \right]
\end{aligned}$$
Proof sketch (contd.)
$$\frac{\partial V^\pi_t(s, \theta)}{\partial \theta} = \sum_a \left[ \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a,\theta) + \pi(s,a)\, \frac{\partial\, \mathbb{E}_{s'}\left[ V^\pi_{t+1}(s', \theta) \right]}{\partial \theta} \right]$$
Taking expectations over $s \sim P^\pi_t$:
$$\mathbb{E}_{s \sim P^\pi_t}\left[ \frac{\partial V^\pi_t(s, \theta)}{\partial \theta} \right] = \mathbb{E}_{s \sim P^\pi_t}\left[ \sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a,\theta) \right] + \mathbb{E}_{s \sim P^\pi_{t+1}}\left[ \frac{\partial V^\pi_{t+1}(s, \theta)}{\partial \theta} \right]$$
Unfolding this recursion completes the proof.
Evaluating policy gradients
$$\frac{\partial V(\pi, \theta)}{\partial \theta} = \sum_{t=1}^H \mathbb{E}_{s \sim P^\pi_t}\left[\, \sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi_t(s,a) \,\right]$$
- Take $t$ steps according to $\pi$; this gives a state $s \sim P^\pi_t$
- Choose a random action $a$ in state $s$. Compute the derivative of $\pi(s,a)$
- Choose all subsequent actions according to $\pi$ and compute the cumulative reward from $t$ onwards. This gives an unbiased estimate of $Q^\pi_t(s,a)$
- Repeat for each $t = 1, 2, \ldots, H$
A generic scheme for unbiased gradients of $V(\pi, \theta)$. Optimize by stochastic gradient ascent (sketched below).
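A sketch of this estimator for a tabular softmax policy $\pi_\theta(s,a) \propto \exp(\theta[s,a])$; the environment interface and the fixed-horizon roll-outs are illustrative assumptions. One detail the slide leaves implicit: because the deviation action is drawn uniformly, each term is reweighted by the number of actions so that it is unbiased for the sum over $a$.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(s, .) as a softmax over the per-(state, action) parameters theta[s]."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def dpi_dtheta(theta, s, a):
    """Gradient of pi_theta(s, a) with respect to theta (nonzero only in row s)."""
    grad = np.zeros_like(theta)
    p = softmax_policy(theta, s)
    grad[s] = -p[a] * p            # d pi(s,a)/d theta[s,b] = pi(s,a) * (1[a=b] - pi(s,b))
    grad[s, a] += p[a]
    return grad

def policy_gradient_estimate(env, theta, H, rng):
    """One unbiased estimate of dV/dtheta via roll-in, random deviation, and roll-out."""
    n_actions = theta.shape[1]
    grad = np.zeros_like(theta)
    for t in range(H):
        s = env.reset()
        for _ in range(t):                                   # roll in t steps with pi
            s, _, _ = env.step(rng.choice(n_actions, p=softmax_policy(theta, s)))
        a = int(rng.integers(n_actions))                     # random deviation action
        s_roll, r, _ = env.step(a)
        q_hat = r                                            # cumulative reward from t onwards
        for _ in range(t + 1, H):                            # roll out with pi
            a_roll = rng.choice(n_actions, p=softmax_policy(theta, s_roll))
            s_roll, r_roll, _ = env.step(a_roll)
            q_hat += r_roll
        # reweight by the number of actions so the uniform draw is unbiased for the sum over a
        grad += n_actions * dpi_dtheta(theta, s, a) * q_hat
    return grad
```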
Policy gradient properties
- Convergence to a local optimum for decaying step sizes
- No assumptions on the number of states or actions
- Works with arbitrary differentiable policy classes
- Gradients can have high variance
  - Doubly robust-style corrections; actor-critic methods
- Convergence can be very slow
- Exploration is determined by $\pi$; might not visit good states early on
Policy Improvement
- Policy gradient struggles when started from a poor initialization
- Suppose we have access to an expert at training time:
  - The algorithm can query the expert's actions at any state/context
  - It wants to find a policy at least as good as the expert
  - It is evaluated without the expert's help at test time
A general template
Roll-in = Roll-out = $\pi$ for policy gradient.
Given an expert policy $\pi^*$ at training time, what are better choices?
[Figure: a trajectory from $t=1$ to $t=H$; a roll-in policy chooses actions up to time $t$, a deviation is taken at time $t$, and a roll-out policy chooses the remaining actions up to $t=H$.]
Behavior cloning
Roll-in = Roll-out = $\pi^*$.
Train a policy to minimize $\mathbb{E}_{s \sim d^{\pi^*}}\left[\, \mathbf{1}\left( \pi(s) \neq \pi^*(s) \right) \,\right]$.
Behavior cloning
Train a policy to minimize $\mathbb{E}_{s \sim d^{\pi^*}}\left[\, \mathbf{1}\left( \pi(s) \neq \pi^*(s) \right) \,\right]$.
- Can be done with just demonstrations, without access to the expert
- Policy improvement = multiclass classification (sketched below)
- Leads to compounding errors if no $\pi$ can imitate $\pi^*$ well
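Since behavior cloning is just multiclass classification, a sketch is only a few lines with any classifier; the array layout of the demonstrations (feature-vector states, integer actions) is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def behavior_cloning(expert_states, expert_actions):
    """Fit a classifier that predicts the expert's action from the state/context features."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(expert_states), np.asarray(expert_actions))
    return clf

# Using the result as a policy: pi(s) = clf.predict(np.asarray(s)[None])[0]
```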
Compounding errors
$\pi^*$ takes the actions shown in red. Rewards are only in the leaf nodes.
If $\pi$ makes a mistake at the root, there is no information on how to act in $s_2$.
[Figure: a binary tree with states $s_1, \ldots, s_7$; rewards of 1 sit at two of the leaves.]
AggreVaTe (Ross and Bagnell, 2014)
Roll-in = $\pi$, Roll-out = $\pi^*$.
Train a policy to minimize $\mathbb{E}_{s \sim d^{\pi}}\left[\, Q^{\pi^*}\left(s, \pi(s)\right) \,\right]$.
AggreVaTe (Ross and Bagnell, 2014)
Roll-in = $\pi$, Roll-out = $\pi^*$. Train a policy to minimize $\mathbb{E}_{s \sim d^{\pi}}\left[\, Q^{\pi^*}\left(s, \pi(s)\right) \,\right]$.
- We have $s \sim d^{\pi}$ as we roll in with $\pi$, so no compounding errors
- Policy improvement = cost-sensitive classification
- Estimate $Q^{\pi^*}$ using (multiple) roll-outs by $\pi^*$ (sketched below)
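One way the cost-sensitive examples might be collected, shown as a rough sketch. The `env.snapshot()` method for re-simulating from the roll-in state, `learner_policy`, `expert_policy`, and `env.n_actions` are all hypothetical interfaces introduced only for illustration.

```python
import numpy as np

def collect_aggrevate_example(env, learner_policy, expert_policy, H, n_rollouts, rng):
    """Roll in with the learner to a random time t, then estimate the expert's cost-to-go
    for every deviation action by rolling out with the expert."""
    t = int(rng.integers(1, H + 1))            # random deviation time
    s = env.reset()
    for _ in range(t - 1):                     # roll in with the current learner policy
        s, _, _ = env.step(learner_policy(s))
    costs = []
    for a in range(env.n_actions):             # expert cost-to-go after each deviation action
        returns = []
        for _ in range(n_rollouts):
            sim = env.snapshot()               # hypothetical: a copy of the environment at state s
            s_roll, r, done = sim.step(a)
            total, step = r, t
            while not done and step < H:       # roll out with the expert from t+1 to H
                s_roll, r, done = sim.step(expert_policy(s_roll))
                total += r
                step += 1
            returns.append(total)
        costs.append(-float(np.mean(returns))) # cost = negative reward-to-go
    return s, np.array(costs)                  # one cost-sensitive classification example
```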
Compounding errors revisited
$\pi^*$ takes the actions shown in red. Rewards are only in the leaf nodes.
If $\pi$ minimizes the expert's cost-to-go, it is indifferent at the root. Rolling in with $\pi^*$ means no training data in $s_2$. This is fixed by rolling in with $\pi$.
[Figure: the same binary tree with states $s_1, \ldots, s_7$ and rewards of 1 at two leaves.]
AggreVaTe theory
Suppose we have a good cost-sensitive classifier.
- Do at least $O(H)$ rounds of AggreVaTe
- Value of the policy returned by AggreVaTe ≈ the expert policy's value, assuming such a policy exists in our class
- Can even improve upon the expert sometimes!
- Further improvements in the literature
Policy search overview
- No expert: policy gradient, TRPO, actor-critic variants
- Expert dataset: behavior cloning
- Expert policy: AggreVaTe and improvements
Other ways of using an expert
- Assuming the expert is optimal, find a reward function
  - Can be done with trajectories alone; called inverse RL
- Access to the expert policy, but no reward information
  - DAgger: like behavior cloning, but roll in with $\pi$ to minimize $\mathbb{E}_{s \sim d^{\pi}}\left[\, \mathbf{1}\left( \pi(s) \neq \pi^*(s) \right) \,\right]$ (sketched below)
- Using any domain knowledge to build an expert is always preferred to policy search from scratch
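A sketch of a DAgger-style loop, assuming an `expert_action(s)` oracle, feature-vector states, and the same environment interface as the earlier sketches; acting with the expert on the very first iteration is one common initialization rather than something the slide prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(env, expert_action, H, n_iters=10, trajs_per_iter=20):
    """DAgger: like behavior cloning, but the states come from rolling in with the learner's policy,
    and every visited state is labeled with the expert's action."""
    states, actions, clf = [], [], None
    for _ in range(n_iters):
        for _ in range(trajs_per_iter):
            s = env.reset()
            for _ in range(H):
                # no learner exists yet on iteration 0, so roll in with the expert there
                a = expert_action(s) if clf is None else int(clf.predict(np.asarray(s)[None])[0])
                states.append(np.asarray(s))
                actions.append(expert_action(s))   # always query the expert at the visited state
                s, _, done = env.step(a)
                if done:
                    break
        clf = LogisticRegression(max_iter=1000).fit(np.stack(states), np.array(actions))
    return clf
```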
Some practical issues
- All algorithms we saw today require exponentially many trajectories for hard problems
- Typical data requirements are still quite large, unless there is a strong expert
- Better exploration will help
- Long-horizon problems are still data intensive
Partial Observability
- We assumed a Markovian state or context
- Suppose we only have the agent's first-person view
  - Rewards and dynamics can depend on the whole trajectory!
- Typically modeled as a Partially Observable MDP (POMDP)
- Key difference: $Q$-functions, $V$-functions and policies all depend on the whole trajectory instead of just the observed state
- Typically much harder statistically and computationally
Hierarchical RL
Long-horizon problems are hard.
- Can benefit if trajectories have repeated sub-patterns or sub-tasks
- First learn how to do the sub-tasks well, then compose them to solve the original problem
- Many formalisms: options, general value functions, RL with sub-goals