1
Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng, Stanford University
2
Motivation Typical RL setting
Given: system model and reward function.
Return: a policy optimal with respect to the given model and reward function.
But the reward function might be hard to specify exactly. E.g., driving well on a highway requires trading off distance, speed, and lane preference.
12/7/2018 NIPS 2003, Workshop
3
Apprenticeship Learning
= the task of learning from observing an expert/teacher.
Previous work: mostly tries to mimic the teacher by learning the mapping from states to actions directly; lacks strong performance guarantees.
Our approach: returns a policy with performance as good as the expert's, as measured according to the expert's unknown reward function, and reduces the problem to solving the control problem with a given reward. The algorithm is inspired by Inverse Reinforcement Learning (Ng and Russell, 2000).
4
Preliminaries
Markov Decision Process (S, A, T, γ, D, R).
Reward is linear in features: R(s) = wᵀφ(s), with φ : S → [0,1]^k a k-dimensional feature vector.
Value of a policy π:
U_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t wᵀφ(s_t) | π] = wᵀ E[Σ_t γ^t φ(s_t) | π].
Feature distribution μ(π):
μ(π) = E[Σ_t γ^t φ(s_t) | π] ∈ (1/(1−γ)) · [0,1]^k.
So U_w(π) = wᵀμ(π).
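The two definitions above can be sketched in a few lines of numpy (a minimal illustration, not the authors' code; the rollout format and the `phi` callback are assumptions):

```python
import numpy as np

def feature_distribution(rollouts, phi, gamma):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi]:
    average the discounted feature sums over sampled trajectories."""
    mu = np.zeros_like(phi(rollouts[0][0]), dtype=float)
    for traj in rollouts:
        for t, s in enumerate(traj):
            mu += gamma ** t * phi(s)
    return mu / len(rollouts)

def policy_value(w, mu):
    """U_w(pi) = w^T mu(pi): the value is linear in the feature distribution."""
    return float(w @ mu)
```

Because U_w(π) depends on π only through μ(π), matching the expert's feature distribution suffices to match the expert's value, whatever w is.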
5
Algorithm
For i = 1, 2, …
IRL step: estimate the expert's reward function R(s) = wᵀφ(s) by solving the following QP:
max_{z,w} z
s.t. U_w(π_E) − U_w(π_j) ≥ z for j = 0, …, i−1 (linear constraints in w, since U_w(π) = wᵀμ(π))
||w||₂ ≤ 1
RL step: compute the optimal policy π_i for this reward w.
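The IRL step's QP can be sketched with a generic constrained solver (an illustrative reimplementation using scipy's SLSQP, not the talk's implementation; `mu_E` and `mus` are the expert's and the previous policies' feature distributions):

```python
import numpy as np
from scipy.optimize import minimize

def irl_step(mu_E, mus):
    """Solve  max_{z,w} z  s.t.  w @ (mu_E - mu_j) >= z  for every
    previous policy j, and ||w||_2 <= 1 (the QP from the slide)."""
    k = len(mu_E)
    x0 = np.zeros(k + 1)                      # decision vector x = [w, z]
    cons = [{"type": "ineq",
             "fun": lambda x, d=mu_E - np.asarray(m): x[:k] @ d - x[k]}
            for m in mus]
    cons.append({"type": "ineq", "fun": lambda x: 1.0 - x[:k] @ x[:k]})
    res = minimize(lambda x: -x[k], x0, constraints=cons, method="SLSQP")
    return res.x[:k], res.x[k]                # reward weights w, margin z
```

The returned margin z is the gap U_w(π_E) − max_j U_w(π_j); when it drops below a threshold ε, every reward consistent with the constraints makes some computed policy nearly as good as the expert, so the loop can stop.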
6
Algorithm (2)
[Figure: the iterations in feature-distribution space (axes φ₁, φ₂): the feature distributions μ(π(0)), μ(π(1)), μ(π(2)) approach the expert's μ(π_E), guided by the weight vectors w(1), w(2) produced by successive IRL steps.]
7
Feature Distribution Closeness and Performance
If we can find a policy π such that ||μ(π) − μ_E||₂ ≤ ε, then we have, for any underlying reward R*(s) = w*ᵀφ(s) (with ||w*||₂ ≤ 1):
|U_w*(π) − U_w*(π_E)| = |w*ᵀμ(π) − w*ᵀμ_E| ≤ ||w*||₂ ||μ(π) − μ_E||₂ ≤ ε,
by the Cauchy-Schwarz inequality.
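The key step is just Cauchy-Schwarz; a quick numerical sanity check (purely illustrative, with random weight vectors of norm at most 1):

```python
import numpy as np

def bound_holds(mu_gap, n_trials=1000, seed=0):
    """Check |w @ mu_gap| <= ||mu_gap||_2 for random w with ||w||_2 <= 1,
    i.e. the Cauchy-Schwarz step behind the performance bound
    (mu_gap stands for mu(pi) - mu_E)."""
    rng = np.random.default_rng(seed)
    eps = np.linalg.norm(mu_gap)
    for _ in range(n_trials):
        w = rng.normal(size=mu_gap.shape)
        w /= max(1.0, np.linalg.norm(w))      # enforce ||w||_2 <= 1
        if abs(w @ mu_gap) > eps + 1e-12:
            return False
    return True
```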
8
Theoretical Results: Convergence
Let an MDP\R and a k-dimensional feature vector φ be given. Then after at most O(poly(k, 1/ε)) iterations the algorithm outputs a policy π that performs nearly as well as the teacher, as evaluated on the unknown reward function R*(s) = w*ᵀφ(s):
U_w*(π) ≥ U_w*(π_E) − ε.
9
Theoretical Results: Sampling
In practice, we have to use sampling estimates of the expert's feature distribution. We still obtain ε-optimal performance with high probability when the number of samples is O(poly(k, 1/ε)).
10
Experiments: Gridworld (ctd)
128×128 gridworld; 4 actions (the 4 compass directions), each succeeding with probability 0.7 (otherwise the agent moves to a random other neighbouring square).
Features: the 64 non-overlapping 16×16 regions of cells, a small number of which have non-zero (positive) reward.
Expert: optimal with respect to some weight vector w*.
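The setup above can be sketched as follows (an illustrative reconstruction; the number of rewarded regions, the seed, and the stay-in-place boundary behaviour are assumptions, not taken from the talk):

```python
import numpy as np

def make_gridworld(n=128, region=16, n_rewarded=6, seed=0):
    """n x n grid whose features are indicators of the non-overlapping
    region x region macro-cells; a few macro-cells get positive weight
    in w* (n_rewarded and seed are illustrative choices)."""
    rng = np.random.default_rng(seed)
    m = n // region                          # macro-cells per side
    k = m * m                                # number of features
    w_star = np.zeros(k)
    hot = rng.choice(k, size=n_rewarded, replace=False)
    w_star[hot] = rng.uniform(0.0, 1.0, n_rewarded)

    def phi(s):                              # s = (row, col) -> one-hot
        f = np.zeros(k)
        f[(s[0] // region) * m + s[1] // region] = 1.0
        return f

    return phi, (lambda s: float(w_star @ phi(s))), w_star

def step(s, a, n=128, p_success=0.7, rng=None):
    """Move in compass direction a with probability p_success, otherwise
    in a random other direction; stay in place at the grid boundary."""
    if rng is None:
        rng = np.random.default_rng()
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E
    if rng.random() >= p_success:
        a = int(rng.choice([i for i in range(4) if i != a]))
    dr, dc = moves[a]
    return (min(max(s[0] + dr, 0), n - 1), min(max(s[1] + dc, 0), n - 1))
```

With `phi` in hand, expert rollouts generated by `step` give the sampled feature distribution that the algorithm matches.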
11
Experiments: Car Driving
12
Car Driving Results
[Table: for each of five target driving styles, three rows over the features Collision, Offroad Left, Left Lane, Middle Lane, Right Lane, Offroad Right: the expert's feature distribution, the learned policy's feature distribution, and the learned reward weights. For style 1, for example, the learned feature distribution (5.0e-05, 0.0004, 0.0904, 0.2286, 0.604, 0.0764) closely matches the expert's lane occupancies (0.1325, 0.2033, 0.5983, 0.0658).]
13
Conclusion
Our algorithm returns a policy whose performance is as good as the expert's, as evaluated according to the expert's unknown reward function.
It reduces the problem to solving the control problem with a given reward.
The algorithm is guaranteed to converge in poly(k, 1/ε) iterations, and its sample complexity is poly(k, 1/ε).
14
Appendix: Different View
Bellman LP for solving MDPs:
min_V c'V
s.t. ∀ s, a: V(s) ≥ R(s,a) + γ Σ_{s'} P(s,a,s') V(s')
Dual LP:
max_λ Σ_{s,a} λ(s,a) R(s,a)
s.t. ∀ s: c(s) − Σ_a λ(s,a) + γ Σ_{s',a} P(s',a,s) λ(s',a) = 0, λ ≥ 0
Apprenticeship learning as a QP:
min_λ Σ_i (μ_{E,i} − Σ_{s,a} λ(s,a) φ_i(s))²
s.t. λ satisfies the dual LP constraints above.
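The dual LP can be solved directly for a tiny MDP (an illustrative sketch using scipy's `linprog`; the indexing convention P[s, a, s'] is an assumption):

```python
import numpy as np
from scipy.optimize import linprog

def solve_dual_lp(P, R, c, gamma):
    """Solve  max_lam sum_{s,a} lam(s,a) R(s,a)
       s.t.  c(s) - sum_a lam(s,a) + gamma * sum_{s',a} P[s',a,s] lam(s',a) = 0,
             lam >= 0.
    lam(s,a) is the discounted state-action visitation frequency under the
    initial-state distribution c; the optimal policy takes, in each state,
    an action with lam(s,a) > 0."""
    S, A, _ = P.shape
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for a in range(A):
            A_eq[s, s * A + a] += 1.0                 # outflow sum_a lam(s,a)
        for sp in range(S):
            for a in range(A):
                A_eq[s, sp * A + a] -= gamma * P[sp, a, s]   # discounted inflow
    res = linprog(-R.reshape(-1), A_eq=A_eq, b_eq=c, bounds=(0, None))
    return res.x.reshape(S, A)
```

The QP on this slide replaces the dual's linear objective with the squared distance between the expert's feature distribution and the one induced by λ, while keeping the same feasible set.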
15
Different View (ctd.)
Our algorithm is equivalent to iteratively:
linearizing the QP at the current point (IRL step), then
solving the resulting LP (RL step).
Why not solve the QP directly? That is typically possible only for very small toy problems (curse of dimensionality).