Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng
Stanford University
Motivation
Typical control setting:
– Given: system model, reward function
– Return: controller optimal with respect to the given model and reward function
The reward function can be hard to specify exactly. E.g., driving well on a highway requires trading off distance, speed, lane preference, …
Apprenticeship Learning
Apprenticeship learning = the task of learning from an expert/teacher.
Previous work:
– Mostly tries to mimic the teacher directly, by learning the mapping from states to actions
– Lacks strong performance guarantees
Our approach:
– Returns a policy that performs as well as the expert on the expert's unknown reward function
– Reduces the problem to solving the control problem with a given reward
– Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)
Preliminaries
Markov Decision Process (MDP): $(S, A, T, \gamma, D, R)$
– $S$: finite set of states
– $A$: set of actions
– $T = \{P_{sa}\}$: state transition probabilities
– $\gamma \in [0, 1)$: discount factor
– $D$: initial state distribution
– $R(s) = w^T \phi(s)$: reward function, where $\phi : S \to [0,1]^k$ is a $k$-dimensional feature vector
Policy: $\pi : S \to A$
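A minimal sketch of this tuple as a Python container; the class name `LinearRewardMDP` and its layout are illustrative choices, not from the paper:

```python
# A minimal container for the MDP (S, A, T, gamma, D, R) with a linear
# reward R(s) = w^T phi(s); names and structure are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class LinearRewardMDP:
    n_states: int      # |S|
    n_actions: int     # |A|
    P: np.ndarray      # T = {P_sa}: shape (|A|, |S|, |S|), P[a, s, s'] = P(s' | s, a)
    gamma: float       # discount factor in [0, 1)
    D: np.ndarray      # initial state distribution, shape (|S|,)
    phi: np.ndarray    # feature matrix, shape (|S|, k), rows phi(s) in [0, 1]^k

    def reward(self, w: np.ndarray) -> np.ndarray:
        """Per-state reward R(s) = w^T phi(s) for a given weight vector w."""
        return self.phi @ w
```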
Value of a Policy
$U_w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi] = E[\sum_t \gamma^t w^T \phi(s_t) \mid \pi] = w^T E[\sum_t \gamma^t \phi(s_t) \mid \pi]$
Define the feature distribution $\mu(\pi)$:
– $\mu(\pi) = E[\sum_t \gamma^t \phi(s_t) \mid \pi] \in \frac{1}{1-\gamma}[0,1]^k$
So $U_w(\pi) = w^T \mu(\pi)$.
Optimal policy: $\pi^* = \arg\max_\pi U_w(\pi)$
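A sketch of estimating $\mu(\pi)$ by Monte Carlo rollouts, assuming the `LinearRewardMDP` container above and a deterministic policy stored as a state-to-action array (both assumptions, not from the slides); the infinite discounted sum is truncated at a finite horizon:

```python
import numpy as np

def feature_expectations(mdp, policy, n_rollouts=1000, horizon=200, rng=None):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi].

    `policy` maps each state index to an action index; truncating at
    `horizon` introduces an error of order gamma^horizon / (1 - gamma).
    """
    rng = np.random.default_rng(rng)
    k = mdp.phi.shape[1]
    mu = np.zeros(k)
    for _ in range(n_rollouts):
        s = rng.choice(mdp.n_states, p=mdp.D)   # s_0 ~ D
        discount = 1.0
        for _ in range(horizon):
            mu += discount * mdp.phi[s]          # accumulate gamma^t phi(s_t)
            a = policy[s]
            s = rng.choice(mdp.n_states, p=mdp.P[a, s])
            discount *= mdp.gamma
    return mu / n_rollouts
```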
Feature Distribution Closeness and Performance
Assume the feature distribution $\mu_E = \mu(\pi_E)$ of the expert/teacher $\pi_E$ is given.
If we can find a policy $\pi$ such that $\|\mu(\pi) - \mu_E\|_2 \le \varepsilon$, then for any underlying reward $R^*(s) = w^{*T}\phi(s)$ with $\|w^*\|_1 \le 1$:
$|U_{w^*}(\pi) - U_{w^*}(\pi_E)| = |w^{*T}\mu(\pi) - w^{*T}\mu_E| \le \|w^*\|_2 \, \|\mu(\pi) - \mu_E\|_2 \le \varepsilon$
(by Cauchy–Schwarz, since $\|w^*\|_2 \le \|w^*\|_1 \le 1$).
Algorithm
Input: MDP\R, $\mu_E$
1: Randomly pick a policy $\pi_0$, compute $\mu(\pi_0)$, set $i = 1$
2: Compute $t_i = \max_{t,\, w : \|w\|_2 \le 1} t$ such that $w^T(\mu_E - \mu(\pi_j)) \ge t$ for $j = 0, \dots, i-1$; let $w_i$ be the maximizing $w$
3: If $t_i \le \varepsilon$, terminate
4: Compute $\pi_i = \arg\max_\pi U_{w_i}(\pi)$
5: Compute $\mu(\pi_i)$
6: Set $i = i + 1$, go to step 2
Return: the set of policies $\{\pi_j\}$; for some $j$ we have $w^{*T}\mu(\pi_j) \ge w^{*T}\mu_E - \varepsilon$
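A sketch of this loop in Python, with the step-2 max-margin program solved via cvxpy; `solve_mdp` and `estimate_mu` are hypothetical stand-ins for any RL solver (e.g. value iteration on the reward $w^T\phi$) and for the Monte Carlo estimator sketched earlier:

```python
import numpy as np
import cvxpy as cp

def apprenticeship_learning(mu_E, mu_0, solve_mdp, estimate_mu, eps, max_iters=100):
    """Max-margin apprenticeship learning loop (steps 1-6 on the slide).

    mu_E       : expert feature distribution (estimated from demonstrations)
    mu_0       : feature distribution of an arbitrary initial policy
    solve_mdp  : stand-in RL solver, w -> optimal policy for reward w^T phi(s)
    estimate_mu: stand-in estimator, policy -> mu(pi)
    """
    mus = [mu_0]
    policies = []
    k = len(mu_E)
    for _ in range(max_iters):
        # Step 2: max-margin separation of mu_E from the mu(pi_j) found so far.
        w, t = cp.Variable(k), cp.Variable()
        constraints = [cp.norm(w, 2) <= 1]
        constraints += [w @ (mu_E - mu_j) >= t for mu_j in mus]
        cp.Problem(cp.Maximize(t), constraints).solve()
        # Step 3: terminate once mu_E is within eps of the hull of {mu(pi_j)}.
        if t.value <= eps:
            break
        # Steps 4-5: best response to the current reward guess, then its mu.
        pi_i = solve_mdp(w.value)
        policies.append(pi_i)
        mus.append(estimate_mu(pi_i))
    return policies, mus
```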
Theoretical Results: Convergence
Let an MDP\R and a $k$-dimensional feature vector $\phi$ be given. Then the algorithm terminates with $t_i \le \varepsilon$ after at most $O\!\left(\frac{k}{[(1-\gamma)\varepsilon]^2} \log \frac{k}{(1-\gamma)\varepsilon}\right)$ iterations.
Theoretical Results: Sampling
In practice, we have to use sampling estimates of the expert's feature distribution. We still obtain $\varepsilon$-optimal performance with probability $1 - \delta$ for a number of samples $m \ge \frac{9k}{2[(1-\gamma)\varepsilon]^2} \log \frac{2k}{\delta}$.
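To make the bound concrete, one can plug in illustrative values (the numbers below are assumptions for illustration, not from the slides):

```python
import numpy as np

# Plugging illustrative values into the slide's sample bound
# m >= 9k / (2 [(1 - gamma) eps]^2) * log(2k / delta).
k, eps, gamma, delta = 64, 0.1, 0.9, 0.05        # assumed example values
m = 9 * k / (2 * ((1 - gamma) * eps) ** 2) * np.log(2 * k / delta)
print(f"samples needed: m >= {m:.2e}")           # ~2.3e7 for these values
```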
Experiments: Gridworld
128×128 gridworld, 4 actions (the 4 compass directions); each action succeeds with probability 70% (otherwise the agent moves randomly to one of the other neighbouring squares). The features are non-overlapping 16×16 macrocells; a small number of them have non-zero (positive) rewards. The expert is optimal w.r.t. some weight vector $w^*$ (a sketch of the feature construction follows).
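A minimal numpy sketch of this feature construction, with an illustrative sparse non-negative weight vector standing in for the unknown $w^*$:

```python
import numpy as np

# Indicator features for the 128x128 gridworld: the grid is tiled by
# non-overlapping 16x16 macrocells, giving k = (128/16)^2 = 64 features,
# with phi(s) the one-hot indicator of the macrocell containing state s.
GRID, CELL = 128, 16
n_states = GRID * GRID
k = (GRID // CELL) ** 2                            # 64 features
rows, cols = np.divmod(np.arange(n_states), GRID)  # (row, col) of each state
macrocell = (rows // CELL) * (GRID // CELL) + (cols // CELL)
phi = np.eye(k)[macrocell]                         # shape (n_states, k), rows in {0,1}^k

# R*(s) = w*^T phi(s) with a sparse, non-negative w*; these weights are
# illustrative placeholders, not the values used in the paper.
rng = np.random.default_rng(0)
w_star = np.where(rng.random(k) < 0.1, rng.random(k), 0.0)
```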
Experiments: Gridworld (continued)
Experiments: Car Driving
Illustrates how different driving styles can be learned (shown in videos).
Conclusion
– Returns a policy that performs as well as the expert on the expert's unknown reward function
– Reduces the problem to solving the control problem with a given reward
– The algorithm is guaranteed to converge in a polynomial number of iterations
– Sample complexity is $\mathrm{poly}(k,\, 1/(1-\gamma),\, 1/\varepsilon)$