
1 RL methods in practice Alekh Agarwal

2 Recap
In the last lecture we saw:
- Markov Decision Processes
- Value iteration methods for known MDPs
- The RMAX algorithm for unknown MDPs
Crucial assumption: the number of states is small.

3 From theory to practice
- The agent receives high-dimensional observations
- We want policies mapping observations to actions
- Treating each raw observation as a distinct state does not scale
- We want policies and value functions to generalize across related observations

4 Markov Decision Processes (MDPs)
Initial state $x_1 \sim \Gamma_1$. Take action $a_1$, observe reward $r_1(a_1)$. New state $x_2 \sim \Gamma(x_1, a_1)$, and so on.
Challenge: allow a large number of unique states $x$.

5 From observations to states
State is sufficient for the history; observations can be arbitrary:
- A first-person view does not include what's behind you
- It might not be sufficient to capture rewards and transitions
Typically, $x$ is a modeling choice and can combine multiple observations:
- The last 4 observations (i.e. frames) form $x$ in Atari games
- Current observation + landmarks seen along the way
- Bird's-eye view instead of first-person view
We will call $x$ a context instead of an observation to emphasize the difference. Assuming a sensible choice of $x$ is made, we will focus on learning.

6 Two broad paradigms
- Value function approximation
- Policy improvement/search

7 Value Function Approximation
The agent receives context $x$. Find a function of $(x, a)$ similar to the optimal value function $Q^\star$.
Intuition: pick $Q$ such that similar $x$'s get similar predictions.
Function approximation: given a class $\mathcal{Q}$ of mappings $Q : (x, a) \mapsto \mathbb{R}$, find the best approximation to $Q^\star$.

8 Quality of approximation
Recall the Bellman optimality equations for $Q^\star$:
$$Q^\star_t(s,a) = \mathbb{E}_{r,s'}\left[\, r + V^\star_{t+1}(s') \mid s, a \,\right] = \mathbb{E}_{r,s'}\left[\, r + \max_{a'} Q^\star_{t+1}(s', a') \mid s, a \,\right]$$
We will rewrite the equations with the context $x$ instead of the state $s$, and drop the subscript $t$, as it can be baked into the context. Goal: find a function $Q \in \mathcal{Q}$ which satisfies the equations approximately.

9 Q-learning
Given a dataset of $H$-step trajectories $\{(x_1^i, a_1^i, r_1^i, \ldots, x_H^i, a_H^i, r_H^i)\}_{i=1}^n$, let $Q_{old}$ be our current approximation of $Q^\star$. We want $Q_{new}$ to satisfy
$$\sum_{i=1}^n \sum_{t=1}^H \left( Q_{new}(x_t^i, a_t^i) - r_t^i - \max_a Q_{new}(x_{t+1}^i, a) \right) = 0$$

10 Q-learning
Given a dataset of $H$-step trajectories $\{(x_1^i, a_1^i, r_1^i, \ldots, x_H^i, a_H^i, r_H^i)\}_{i=1}^n$, let $Q_{old}$ be our current approximation of $Q^\star$. Find $Q_{new}$ to minimize:
$$Q_{new} = \arg\min_{Q \in \mathcal{Q}} \sum_{i=1}^n \sum_{t=1}^H \left( Q(x_t^i, a_t^i) - r_t^i - \max_a Q_{old}(x_{t+1}^i, a) \right)^2$$

11 Q-learning
(Same setup and objective as slide 10.) This update is reasonable if $Q_{old}$ and $Q_{new}$ are similar.

12 Q-learning
Let $Q_{old}$ be our current approximation of $Q^\star$ and let $Q_{new}$ minimize the squared-error objective above; this is reasonable if $Q_{old}$ and $Q_{new}$ are similar. In tabular settings, iterate:
$$Q_{new}(x_t, a_t) = Q_{old}(x_t, a_t) - \delta \left( Q_{old}(x_t, a_t) - r_t - \max_a Q_{old}(x_{t+1}, a) \right)$$
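A minimal sketch of the tabular update, assuming small discrete state and action spaces and a batch of observed transitions (the dataset layout and step size `delta` are illustrative choices):

```python
import numpy as np

def tabular_q_learning(transitions, n_states, n_actions, delta=0.1, n_passes=10):
    """Tabular Q-learning on a fixed batch of (x, a, r, x_next) transitions.

    Implements Q(x,a) <- Q(x,a) - delta * (Q(x,a) - r - max_a' Q(x_next, a')).
    Terminal transitions can be encoded with x_next = None.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_passes):
        for (x, a, r, x_next) in transitions:
            target = r if x_next is None else r + Q[x_next].max()
            Q[x, a] -= delta * (Q[x, a] - target)
    return Q
```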

13 Q-learning with function approximation
Suppose $Q(x, a) = w^\top \phi(x, a)$. Then the update is
$$w_{new} = w_{old} - \delta \left( w_{old}^\top \phi(x_t, a_t) - r_t - \max_a w_{old}^\top \phi(x_{t+1}, a) \right) \phi(x_t, a_t)$$
More generally, suppose $Q(x, a) = Q_\theta(x, a)$ for some parameters $\theta$. Then
$$\theta_{new} = \theta_{old} - \delta \, \frac{\partial}{\partial \theta} \left( Q_\theta(x_t, a_t) - r_t - \max_a Q_{old}(x_{t+1}, a) \right)^2$$

14 Q-learning with function approximation
Suppose $Q(x, a) = Q_\theta(x, a)$ for some parameters $\theta$. Then
$$\theta_{new} = \theta_{old} - \delta \, \frac{\partial}{\partial \theta} \left( Q_\theta(x_t, a_t) - r_t - \max_a Q_{old}(x_{t+1}, a) \right)^2$$
This is (stochastic) gradient descent on
$$\sum_{i=1}^n \sum_{t=1}^H \left( Q(x_t^i, a_t^i) - r_t^i - \max_a Q_{old}(x_{t+1}^i, a) \right)^2$$

15 Fitted Q-Iteration (FQI)
Given a dataset of $H$-step trajectories $\{(x_1^i, a_1^i, r_1^i, \ldots, x_H^i, a_H^i, r_H^i)\}_{i=1}^n$, repeat until convergence:
- Find $Q$ to minimize $\sum_{i=1}^n \sum_{t=1}^H \left( Q(x_t^i, a_t^i) - r_t^i - \max_a Q_{old}(x_{t+1}^i, a) \right)^2$
- Set $Q_{old} = Q$
Can use any regression method in the inner loop.
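A sketch of the FQI loop, assuming a black-box `fit_regressor(X, y)` that returns a callable predictor over $(x, a)$ features; any regression method can play this role, and all helper names are illustrative:

```python
import numpy as np

def fitted_q_iteration(trajectories, actions, featurize, fit_regressor, n_iters=20):
    """Fitted Q-Iteration: repeatedly regress onto bootstrapped targets.

    trajectories        : list of H-step lists of (x, a, r) tuples
    featurize(x, a)     : feature vector for the regressor
    fit_regressor(X, y) : returns predict, a function mapping a feature matrix to predictions
    """
    q_old = None
    for _ in range(n_iters):
        X, y = [], []
        for traj in trajectories:
            for t, (x, a, r) in enumerate(traj):
                target = r
                if q_old is not None and t + 1 < len(traj):
                    x_next = traj[t + 1][0]
                    feats = np.array([featurize(x_next, b) for b in actions])
                    target = r + q_old(feats).max()   # bootstrap with max_a Q_old(x', a)
                X.append(featurize(x, a))
                y.append(target)
        q_old = fit_regressor(np.array(X), np.array(y))  # becomes Q_old for the next iteration
    return q_old
```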

16 Q-learning properties
- Generalizes across related $x$, similar to supervised learning
- Computationally, we are effectively solving a sequence of regression problems
- Convergence can be slow, or may never happen
- FQI converges to $Q^\star$ given good exploration and conditions on $\mathcal{Q}$
- No prescription for exploration and data collection

17 How to explore
$\epsilon$-greedy exploration: let $Q$ be the current approximation of $Q^\star$. Choose $a_t = \arg\max_a Q(x_t, a)$ with probability $1 - \epsilon$, and uniformly at random with probability $\epsilon$.
Softmax/Boltzmann exploration: $P(a_t = a) \propto \exp(\lambda Q(x_t, a))$.
Both are reasonable when the horizon is small.
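A minimal sketch of both action-selection rules, assuming `q_values` holds the vector $Q(x_t, \cdot)$ over a discrete action set (function names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Greedy action with probability 1 - epsilon, uniform random with probability epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, lam, rng=None):
    """Sample an action with P(a) proportional to exp(lambda * Q(x, a))."""
    if rng is None:
        rng = np.random.default_rng()
    logits = lam * np.asarray(q_values)
    probs = np.exp(logits - logits.max())       # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```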

18 Ensemble exploration
Structure of optimal exploration in contextual bandits:
- Create an ensemble of candidate optimal policies
- Randomize amongst their chosen actions
- e.g. EXP4, ILTCB, Thompson Sampling, ...
- Adapts to arbitrary policy/value-function classes
Can we extend this idea to RL?

19 Bootstrapped Q-learning (Osband et al., 2016)
Given a dataset of $H$-step trajectories $\{(x_1^i, a_1^i, r_1^i, \ldots, x_H^i, a_H^i, r_H^i)\}_{i=1}^n$:
- Bootstrap-resample $K$ datasets, each of size $n$
- Use Q-learning or FQI to fit $Q^k$ to dataset $k$
- Build the next dataset by picking a $Q^k$ uniformly at random for each trajectory
- Randomize across, not within, trajectories
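A sketch of the bootstrap-and-randomize scheme, assuming `fit_q(dataset)` runs Q-learning/FQI as above and returns an action-value function, and `run_episode_with(q)` collects one trajectory acting according to `q`; both are illustrative placeholders:

```python
import numpy as np

def bootstrapped_q_collection(dataset, fit_q, run_episode_with, K=10, n_new_trajs=100, rng=None):
    """Fit K Q-functions on bootstrap resamples, then collect new trajectories,
    choosing one member of the ensemble per trajectory (not per step)."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(dataset)
    ensemble = []
    for _ in range(K):
        idx = rng.integers(n, size=n)                 # bootstrap resample of size n
        ensemble.append(fit_q([dataset[i] for i in idx]))
    new_data = []
    for _ in range(n_new_trajs):
        q_k = ensemble[rng.integers(K)]               # randomize across trajectories
        new_data.append(run_episode_with(q_k))        # act w.r.t. q_k for the whole trajectory
    return ensemble, new_data
```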

20 Exploration in RL
- No formal guarantees for these methods with high-dimensional contexts
- With many unique contexts, no method can learn in $\mathrm{poly}(|A|, H, \log|\mathcal{Q}|)$ samples
- Recent methods do well under certain assumptions
- Still an active and ongoing research area

21 Two broad paradigms
- Value function approximation
- Policy improvement/search

22 Policy search
We know how to define the value of a policy:
$$V(\pi, M) = \mathbb{E}_{s_1, a_1, r_1, \ldots, s_H, a_H, r_H \sim M, \pi}\left[ r_1 + \ldots + r_H \right] = \sum_{t=1}^H \mathbb{E}_{s \sim d_t^\pi}\left[ \sum_a \pi(s, a)\, r(s, a) \right]$$
Here $d_t^\pi$ is the distribution over states at time step $t$ if we choose all actions according to the stochastic policy $\pi$ and transitions follow $M$.
Can we directly optimize $V(\pi, M)$ over the parameters of $\pi$?
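A sketch of estimating $V(\pi, M)$ by Monte Carlo rollouts, using the same hypothetical `env_reset`/`env_step`/`policy` interface as the earlier interaction-loop sketch:

```python
import numpy as np

def estimate_policy_value(env_reset, env_step, policy, H, n_rollouts=1000):
    """Monte Carlo estimate of V(pi, M) = E[r_1 + ... + r_H] under policy pi."""
    returns = []
    for _ in range(n_rollouts):
        x, total = env_reset(), 0.0
        for t in range(1, H + 1):
            a = policy(x, t)
            r, x = env_step(x, a)
            total += r
        returns.append(total)
    return float(np.mean(returns))
```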

23 Policy Gradient (Sutton et al., 2000)
$$\frac{\partial V(\pi, M)}{\partial \theta} = \sum_{t=1}^H \mathbb{E}_{s \sim d_t^\pi}\left[ \sum_a \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a) \right]$$
- The gradient only involves the dependence of $\pi$ on $\theta$, not that of the state distributions $d_t^\pi$
- If we can evaluate it, we can optimize $\pi$ by gradient ascent over its parameters
- No explicit reliance on a small number of unique states!

24 Proof sketch
$$\frac{\partial V_t^\pi(s, M)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_a \pi(s, a)\, Q_t^\pi(s, a, M)$$
$$= \sum_a \left[ \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a, M) + \pi(s, a)\, \frac{\partial Q_t^\pi(s, a, M)}{\partial \theta} \right]$$
$$= \sum_a \left[ \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a, M) + \pi(s, a)\, \frac{\partial\, \mathbb{E}[r_t + V_{t+1}^\pi(s', M)]}{\partial \theta} \right]$$
$$= \sum_a \left[ \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a, M) + \pi(s, a)\, \frac{\partial\, \mathbb{E}_{s'}[V_{t+1}^\pi(s', M)]}{\partial \theta} \right]$$

25 Proof sketch (contd.)
$$\frac{\partial V_t^\pi(s, M)}{\partial \theta} = \sum_a \left[ \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a, M) + \pi(s, a)\, \frac{\partial\, \mathbb{E}_{s'}[V_{t+1}^\pi(s', M)]}{\partial \theta} \right]$$
$$\mathbb{E}_{s \sim d_t^\pi}\left[ \frac{\partial V_t^\pi(s, M)}{\partial \theta} \right] = \mathbb{E}_{s \sim d_t^\pi}\left[ \sum_a \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a, M) \right] + \mathbb{E}_{s \sim d_{t+1}^\pi}\left[ \frac{\partial V_{t+1}^\pi(s, M)}{\partial \theta} \right]$$
Unfolding this recursion completes the proof.

26 Evaluating policy gradients
$$\frac{\partial V(\pi, M)}{\partial \theta} = \sum_{t=1}^H \mathbb{E}_{s \sim d_t^\pi}\left[ \sum_a \frac{\partial \pi(s, a)}{\partial \theta}\, Q_t^\pi(s, a) \right]$$
- Take $t$ steps according to $\pi$; this gives a state $s \sim d_t^\pi$

27 Evaluating policy gradients
(Same expression as above.)
- Take $t$ steps according to $\pi$; this gives a state $s \sim d_t^\pi$
- Choose a random action $a$ in state $s$ and compute the derivative of $\pi(s, a)$

28 Evaluating policy gradients
(Same expression as above.)
- Take $t$ steps according to $\pi$; this gives a state $s \sim d_t^\pi$
- Choose a random action $a$ in state $s$ and compute the derivative of $\pi(s, a)$
- Choose all subsequent actions according to $\pi$ and compute the cumulative reward from $t$ onwards; this gives an unbiased estimate of $Q_t^\pi(s, a)$

29 Evaluating policy gradients
(Same expression as above.)
- Take $t$ steps according to $\pi$; this gives a state $s \sim d_t^\pi$
- Choose a random action $a$ in state $s$ and compute the derivative of $\pi(s, a)$
- Choose all subsequent actions according to $\pi$ and compute the cumulative reward from $t$ onwards; this gives an unbiased estimate of $Q_t^\pi(s, a)$
- Repeat for each $t = 1, 2, \ldots, H$
This is a generic scheme for unbiased gradients of $V(\pi, M)$; optimize by stochastic gradient ascent.

30 Policy gradient properties
- Converges to a local optimum with decaying step sizes
- No assumptions on the number of states or actions
- Works with arbitrary differentiable policy classes
- Gradients can have high variance; doubly-robust-style corrections and actor-critic methods address this
- Convergence can be very slow
- Exploration is determined by $\pi$, which might not visit good states early on

31 Policy Improvement
Policy gradient struggles from a poor initialization. Suppose instead we have access to an expert at training time:
- The algorithm can query the expert's actions at any state/context
- We want to find a policy at least as good as the expert
- The learned policy is evaluated without the expert's help at test time

32 A general template
Roll-in = Roll-out = $\pi$ for policy gradient. Given an expert policy $\mu$ at training time, what are better choices?
[Diagram: roll in from $t=1$ to a state $s$ at time $t$, take a deviation action at $t$, then roll out from $t$ to $t=H$.]

33 Behavior cloning
Roll-in = Roll-out = $\mu$. Train a policy to minimize $\mathbb{E}_{s \sim d^\mu}\left[ \mathbf{1}\{\pi(s) \neq \mu(s)\} \right]$.
[Diagram: the same roll-in/deviation/roll-out picture, with both segments following $\mu$.]

34 Behavior cloning
Train a policy to minimize $\mathbb{E}_{s \sim d^\mu}\left[ \mathbf{1}\{\pi(s) \neq \mu(s)\} \right]$.
- Can be done with just demonstrations, without interactive access to the expert
- Policy improvement = multiclass classification
- Leads to compounding errors if no $\pi$ can imitate $\mu$ well
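A sketch of behavior cloning as multiclass classification, assuming a dataset of expert (state, action) pairs and a black-box `fit_classifier(X, y)` that returns a predictor (illustrative names; in practice this could be any standard multiclass learner, such as logistic regression):

```python
import numpy as np

def behavior_cloning(expert_states, expert_actions, fit_classifier):
    """Behavior cloning: reduce imitation to multiclass classification.

    expert_states       : array of shape (N, d), states/contexts visited by the expert mu
    expert_actions      : array of shape (N,), actions mu took in those states
    fit_classifier(X, y): returns predict, trained to drive E_{s~d^mu}[1{pi(s) != mu(s)}]
                          down (via a surrogate loss in practice)
    Returns the learned policy pi(s) -> action.
    """
    predict = fit_classifier(np.asarray(expert_states), np.asarray(expert_actions))
    return lambda s: predict(np.asarray(s).reshape(1, -1))[0]
```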

35 Compounding errors
$\mu$ takes the actions shown in red; rewards appear only at the leaf nodes. If $\pi$ makes a mistake at the root, it gets no information on how to act in $s_2$.
[Diagram: a binary tree with root $s_1$, internal nodes $s_2, s_3$, and leaves $s_4, \ldots, s_7$; rewards of 1 at certain leaves.]

36 AggreVaTe (Ross and Bagnell, 2014)
Roll-in = $\pi$, Roll-out = $\mu$. Train a policy to minimize $\mathbb{E}_{s \sim d^\pi}\left[ Q^\mu(s, \pi(s)) \right]$.
[Diagram: roll in with $\pi$ to time $t$, deviate at $t$, then roll out with $\mu$ to $t=H$.]

37 AggreVaTe (Ross and Bagnell, 2014)
Roll-in = $\pi$, Roll-out = $\mu$. Train a policy to minimize $\mathbb{E}_{s \sim d^\pi}\left[ Q^\mu(s, \pi(s)) \right]$.
- We have $s \sim d^\pi$ since we roll in with $\pi$, so there are no compounding errors
- Policy improvement = cost-sensitive classification
- Estimate $Q^\mu$ using (multiple) roll-outs by $\mu$

38 Compounding errors revisited
$\mu$ takes the actions shown in red; rewards appear only at the leaf nodes. If $\pi$ minimizes cost, it is indifferent at the root. Rolling in with $\mu$ then means no training data for $s_2$; this is fixed by rolling in with $\pi$.
[Diagram: the same binary tree with states $s_1, \ldots, s_7$.]

39 AggreVaTe theory
Suppose we have a good cost-sensitive classifier and run at least $O(H)$ rounds of AggreVaTe. Then:
- The value of the policy returned by AggreVaTe $\approx$ the expert policy's value, assuming such a policy exists in our class
- It can sometimes even improve upon the expert!
- Further improvements exist in the literature

40 Policy search overview
- No expert: policy gradient, TRPO, actor-critic variants
- Expert dataset: behavior cloning
- Expert policy: AggreVaTe and improvements

41 Other ways of using an expert
- Assuming the expert is optimal, find a reward function. This can be done from trajectories alone and is called inverse RL.
- Access to the expert policy but no reward information: DAgger, which is like behavior cloning but rolls in with $\pi$ to minimize $\mathbb{E}_{s \sim d^\pi}\left[ \mathbf{1}\{\pi(s) \neq \mu(s)\} \right]$ (a sketch follows below).
- Using any domain knowledge to build an expert is always preferred to policy search from scratch.
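A sketch of a DAgger-style round, assuming an interactive expert `mu` that can be queried for its action in any visited state, with the same classification reduction as behavior cloning (helper names illustrative):

```python
import numpy as np

def dagger_round(env_reset, env_step, pi, mu, H, n_trajs, dataset, fit_classifier):
    """One DAgger-style round: roll in with the current policy pi, label the visited
    states with the expert's actions mu(s), aggregate, and retrain the classifier."""
    for _ in range(n_trajs):
        s = env_reset()
        for _ in range(H):
            dataset.append((s, mu(s)))      # expert label on a state visited by pi
            _, s = env_step(s, pi(s))       # but follow pi to generate the states
    X = np.asarray([s for s, _ in dataset])
    y = np.asarray([a for _, a in dataset])
    predict = fit_classifier(X, y)          # same reduction to classification as behavior cloning
    new_pi = lambda state: predict(np.asarray(state).reshape(1, -1))[0]
    return new_pi, dataset
```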

42 Some practical issues
- All of the algorithms we saw today require exponentially many trajectories on hard problems
- Typical data requirements are still quite large unless the expert is strong
- Better exploration will help
- Long-horizon problems are still data intensive

43 Partial Observability
- We assumed a Markovian state or context
- Suppose we only have the agent's first-person view: rewards and dynamics can then depend on the whole trajectory!
- Typically modeled as a Partially Observable MDP (POMDP)
- Key difference: $Q$-functions, $V$-functions, and policies all depend on the whole trajectory instead of just the observed state
- Typically much harder, both statistically and computationally

44 Hierarchical RL
Long-horizon problems are hard. We can benefit if trajectories have repeated sub-patterns or sub-tasks: first learn how to do the sub-tasks well, then compose them to solve the original problem. Many formalisms exist:
- Options
- General value functions
- RL with sub-goals

