1
Apprenticeship Learning by Inverse Reinforcement Learning
Pieter Abbeel, Andrew Y. Ng
Stanford University
2
Motivation
Typical control setting
–Given: system model, reward function
–Return: controller optimal with respect to the given model and reward function
Reward function might be hard to specify exactly
E.g. driving well on a highway: need to trade off
–Distance, speed, lane preference
3
Apprenticeship Learning = task of learning from an expert/teacher
Previous work:
–Mostly tries to mimic the teacher by directly learning the mapping from states to actions
–Lacks strong performance guarantees
Our approach
–Returns a policy with performance as good as the expert's on the expert's unknown reward function
–Reduces the problem to solving the control problem with a given reward
–Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)
4
Preliminaries
Markov Decision Process (MDP) $(S, A, T, \gamma, D, R)$
$R(s) = w^T \phi(s)$: reward function, $\|w\|_2 \le 1$
$\phi : S \to [0,1]^k$: k-dimensional feature vector
Policy $\pi : S \to A$
$U_w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi] = E[\sum_t \gamma^t w^T \phi(s_t) \mid \pi] = w^T E[\sum_t \gamma^t \phi(s_t) \mid \pi]$
Note: $U_w(\pi)$ is linear in $w$
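The linearity of $U_w(\pi)$ in $w$ can be checked numerically. The following is a minimal sketch, assuming integer state labels and trajectories given as plain lists of states; the names `discounted_feature_sum`, `feature_expectations`, and `utility` are hypothetical and do not come from the slides.

```python
from typing import Callable, List, Sequence

State = int
Features = List[float]


def discounted_feature_sum(states: Sequence[State],
                           phi: Callable[[State], Features],
                           gamma: float) -> Features:
    """Return sum_t gamma^t * phi(s_t) for a single trajectory."""
    total = [0.0] * len(phi(states[0]))
    discount = 1.0
    for s in states:
        for i, f in enumerate(phi(s)):
            total[i] += discount * f
        discount *= gamma
    return total


def feature_expectations(trajectories: Sequence[Sequence[State]],
                         phi: Callable[[State], Features],
                         gamma: float) -> Features:
    """Monte Carlo estimate of E[sum_t gamma^t phi(s_t) | pi]."""
    sums = [discounted_feature_sum(tr, phi, gamma) for tr in trajectories]
    k = len(sums[0])
    return [sum(s[i] for s in sums) / len(sums) for i in range(k)]


def utility(w: Sequence[float], mu: Sequence[float]) -> float:
    """U_w(pi) = w^T mu(pi): a plain dot product, hence linear in w."""
    return sum(wi * mi for wi, mi in zip(w, mu))
```

Averaging the discounted rewards of the same trajectories directly gives the same number as `utility(w, feature_expectations(...))`, which is the identity on the slide applied to sample averages.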
5
Algorithm
Iterate
–IRL step: estimate the expert's reward function $R(s) = w^T \phi(s)$ by solving the following QP
$\max_{t,w} t$
such that $U_w(\pi_E) - U_w(\pi_j) \ge t$ for $j = 0, \dots, i-1$ (linear constraint in $w$)
$\|w\|_2 \le 1$
–RL step: compute the optimal policy $\pi_i$ for this reward $w$.
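A minimal sketch of the IRL step follows. It does not call a QP solver. Instead it uses a swapped-in technique: since $U_w(\pi) = w^T \mu(\pi)$ with $\mu(\pi) = E[\sum_t \gamma^t \phi(s_t) \mid \pi]$, the max-margin problem is equivalent (by a minimax argument) to finding the minimum-norm point in the convex hull of the differences $\mu(\pi_E) - \mu(\pi_j)$, which is approximated here with Frank-Wolfe iterations and exact line search. The function name `irl_step`, the tolerance, and the iteration cap are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def irl_step(mu_expert: Sequence[float],
             mu_policies: Sequence[Sequence[float]],
             tol: float = 1e-9,
             max_iter: int = 10000) -> Tuple[Vector, float]:
    """Approximate max_{t,w} t  s.t.  w.(mu_E - mu_j) >= t,  ||w||_2 <= 1.

    Returns (w, t). t equals the norm of the minimum-norm point p in the
    convex hull of the differences d_j = mu_E - mu_j, and w = p / ||p||.
    """
    diffs = [[e - m for e, m in zip(mu_expert, mu)] for mu in mu_policies]
    p = list(diffs[0])
    for _ in range(max_iter):
        # Vertex of the hull that currently violates the margin the most.
        j = min(range(len(diffs)), key=lambda i: dot(p, diffs[i]))
        gap = dot(p, p) - dot(p, diffs[j])
        if gap <= tol:
            break
        direction = [d - x for d, x in zip(diffs[j], p)]
        # Exact line search on ||p + step * direction||^2, clipped to [0, 1].
        step = min(1.0, gap / dot(direction, direction))
        p = [x + step * d for x, d in zip(p, direction)]
    t = math.sqrt(dot(p, p))
    if t < 1e-12:
        # Expert's feature expectations lie in the hull: no positive margin.
        return [0.0] * len(p), 0.0
    return [x / t for x in p], t
```

The RL step is not shown: any planner that returns an optimal policy for the reward $w^T \phi(s)$ could be used, after which the new policy's feature expectations would be appended to `mu_policies`.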
6
Theoretical Results: Convergence
Let an MDP\R and a k-dimensional feature vector $\phi$ be given. Then after at most
$O\left( \frac{k}{[(1-\gamma)\epsilon]^2} \log \frac{k}{(1-\gamma)\epsilon} \right) = O(\mathrm{poly}(k, 1/\epsilon))$
iterations the algorithm outputs a policy $\pi$ that performs nearly as well as the teacher, as evaluated on the unknown reward function $R^*(s) = {w^*}^T \phi(s)$:
$U_{w^*}(\pi) \ge U_{w^*}(\pi_E) - \epsilon$.
7
Theoretical Results: Sampling
In practice, we have to use sampling estimates for the feature distribution of the expert. We still have $\epsilon$-optimal performance with probability $(1-\delta)$ for a number of samples
$m \ge \frac{9k}{2[(1-\gamma)\epsilon]^2} \log \frac{2k}{\delta} = \mathrm{poly}(k, 1/\epsilon, 1/\delta)$
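To get a feel for the size of this bound, it can be evaluated directly. This is a small sketch, assuming the logarithm is the natural log (the slide does not say); the function name `expert_samples_needed` is hypothetical.

```python
import math


def expert_samples_needed(k: int, gamma: float,
                          epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= 9k / (2[(1-gamma)eps]^2) * log(2k/delta).

    Assumes a natural logarithm; the slide leaves the base unspecified.
    """
    scale = (1.0 - gamma) * epsilon
    return math.ceil(9.0 * k / (2.0 * scale ** 2) * math.log(2.0 * k / delta))
```

For instance, `expert_samples_needed(64, 0.9, 0.1, 0.05)` evaluates the bound for 64 features with illustrative values of the discount, accuracy, and confidence parameters. The bound grows quadratically in $1/\epsilon$ and $1/(1-\gamma)$ but only logarithmically in $1/\delta$.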
8
Experiments: Gridworld
128x128 gridworld, 4 actions (4 compass directions), 70% success (otherwise random among other neighbouring squares)
Non-overlapping regions of 16x16 cells are the features. A small number have non-zero (positive) rewards.
Expert optimal w.r.t. some weights $w^*$
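The gridworld setup can be sketched as follows. Two details are assumptions, since the slide does not spell them out: the 30% failure mass is split uniformly over the other three compass moves, and a move that would leave the grid keeps the agent in place. All names (`phi`, `move`, `transition_probs`) are hypothetical.

```python
from typing import Dict, List, Tuple

GRID = 128                             # cells per side
REGION = 16                            # region side length
REGIONS_PER_SIDE = GRID // REGION      # 8 regions per side
NUM_FEATURES = REGIONS_PER_SIDE ** 2   # 64 indicator features

ACTIONS: Dict[str, Tuple[int, int]] = {
    "N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1),
}


def phi(row: int, col: int) -> List[float]:
    """One-hot indicator of the 16x16 region containing cell (row, col)."""
    features = [0.0] * NUM_FEATURES
    features[(row // REGION) * REGIONS_PER_SIDE + col // REGION] = 1.0
    return features


def move(row: int, col: int, action: str) -> Tuple[int, int]:
    """Deterministic move; stays in place at the boundary (assumption)."""
    dr, dc = ACTIONS[action]
    r, c = row + dr, col + dc
    if 0 <= r < GRID and 0 <= c < GRID:
        return r, c
    return row, col


def transition_probs(row: int, col: int, action: str,
                     success: float = 0.7) -> Dict[Tuple[int, int], float]:
    """Next-cell distribution: `success` on the intended move, the rest
    split uniformly over the other three moves (assumption)."""
    probs: Dict[Tuple[int, int], float] = {}
    for a in ACTIONS:
        p = success if a == action else (1.0 - success) / (len(ACTIONS) - 1)
        nxt = move(row, col, a)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

With these indicator features, a reward $w^T \phi(s)$ assigns one value per region, so a weight vector with a few positive entries matches the slide's description of a small number of rewarding regions.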
9
Experiments: Gridworld (ctd)
10
Experiments: Car Driving
Illustrate how different driving styles can be learned (videos)
11
Conclusion
Our algorithm returns a policy with performance as good as the expert's on the expert's unknown reward function
Reduced the problem to solving the control problem with a given reward
Algorithm guaranteed to converge in $\mathrm{poly}(k, 1/\epsilon)$ iterations
Sample complexity $\mathrm{poly}(k, 1/\epsilon, 1/\delta)$