Apprenticeship Learning via Inverse Reinforcement Learning. Pieter Abbeel and Andrew Y. Ng, Stanford University.

Presentation transcript:

Apprenticeship Learning via Inverse Reinforcement Learning. Pieter Abbeel and Andrew Y. Ng, Stanford University.

Motivation. Reinforcement learning (RL) gives powerful tools for solving MDPs, but it can be difficult to specify the reward function. Example: highway driving.

Apprenticeship Learning. Learning from observing an expert.
Previous work:
– Learn to predict the expert's actions as a function of states.
– Usually lacks strong performance guarantees.
– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …)
Our approach:
– Based on inverse reinforcement learning (Ng & Russell, 2000).
– Returns a policy with performance as good as the expert's, as measured according to the expert's unknown reward function.

Preliminaries. Markov decision process (S, A, T, γ, D, R).
– R(s) = w^T φ(s), where φ: S → [0,1]^k is a k-dimensional feature vector; w.l.o.g. we assume ||w||_2 ≤ 1.
– Policy π: S → A.
– Utility of a policy π for reward R = w^T φ: U_w(π) = E[Σ_t γ^t R(s_t) | π].
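To make the notation concrete, here is a minimal sketch (my own illustration, not from the slides) of computing the discounted feature sum Σ_t γ^t φ(s_t) for one trajectory and the corresponding utility w^T μ; the feature map, trajectory, and weights below are toy placeholders.

```python
import numpy as np

def discounted_feature_expectation(states, phi, gamma):
    """Return sum_t gamma^t * phi(s_t) for a single trajectory."""
    mu = np.zeros_like(phi(states[0]), dtype=float)
    for t, s in enumerate(states):
        mu += (gamma ** t) * phi(s)
    return mu

# Toy example: 3 states, phi(s) = one-hot state indicator, assumed weights w.
phi = lambda s: np.eye(3)[s]
trajectory = [0, 1, 2, 2, 2]
mu = discounted_feature_expectation(trajectory, phi, gamma=0.9)
w = np.array([0.0, 0.2, 0.8])            # ||w||_2 <= 1, as required above
print("mu(pi) =", mu, " U_w(pi) =", w @ mu)
```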

Algorithm. For t = 1, 2, …
– Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
– RL step: compute the optimal policy π_t for the estimated reward w.
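The loop below is a hedged sketch of this alternation, not the authors' implementation; rl_solver, feature_expectations, and irl_step are assumed helper functions (an MDP solver for a given reward weight vector, a feature-expectation estimator, and the margin-maximization step described on the next slide).

```python
import numpy as np

def apprenticeship_learning(mu_expert, rl_solver, feature_expectations,
                            irl_step, eps=1e-2, max_iters=100):
    """Alternate inverse RL and RL steps until no reward w makes the expert
    outperform every found policy by more than eps (sketch only)."""
    policies, mus = [], []
    w = mu_expert / (np.linalg.norm(mu_expert) + 1e-12)   # arbitrary initial weights
    for _ in range(max_iters):
        pi_t = rl_solver(w)                     # RL step: optimal policy for current w
        policies.append(pi_t)
        mus.append(feature_expectations(pi_t))
        w, margin = irl_step(mu_expert, mus)    # Inverse RL step: new reward estimate
        if margin <= eps:                       # expert is (nearly) matched
            break
    return policies, w
```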

Algorithm: IRL step. Maximize ε over w with ||w||_2 ≤ 1, subject to U_w(π_E) ≥ U_w(π_i) + ε for i = 1, …, t−1.
– ε is the margin of the expert's performance over the performance of the previously found policies.
– U_w(π) = E[Σ_t γ^t R(s_t) | π] = E[Σ_t γ^t w^T φ(s_t) | π] = w^T E[Σ_t γ^t φ(s_t) | π] = w^T μ(π), where μ(π) = E[Σ_t γ^t φ(s_t) | π] are the "feature expectations".
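A possible implementation of this max-margin step, assuming the cvxpy library (the slides only state the optimization problem; the toy feature expectations below are made up):

```python
import numpy as np
import cvxpy as cp

def irl_step(mu_expert, mu_list):
    """Max-margin step: maximize eps s.t. w.mu_E >= w.mu_i + eps, ||w||_2 <= 1."""
    k = mu_expert.shape[0]
    w, eps = cp.Variable(k), cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu_i + eps for mu_i in mu_list]
    cp.Problem(cp.Maximize(eps), constraints).solve()
    return w.value, eps.value

# Toy usage with made-up 2-D feature expectations.
mu_E = np.array([1.0, 0.5])
mus = [np.array([0.2, 0.9]), np.array([0.7, 0.1])]
w, margin = irl_step(mu_E, mus)
print("w =", w, " margin =", margin)
```

This returns a (w, margin) pair, so it can be dropped into the loop sketch shown after the Algorithm slide.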

Feature Expectation Closeness and Performance. If we can find a policy π such that ||μ(π_E) − μ(π)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) we have |U_{w*}(π_E) − U_{w*}(π)| = |w*^T μ(π_E) − w*^T μ(π)| ≤ ||w*||_2 · ||μ(π_E) − μ(π)||_2 ≤ ε.
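A quick numeric check of this Cauchy-Schwarz argument (my own illustration with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
w_star = rng.normal(size=k)
w_star /= np.linalg.norm(w_star)           # ||w*||_2 = 1
mu_E = rng.normal(size=k)
mu_pi = mu_E + 0.05 * rng.normal(size=k)   # a policy with close feature expectations

gap = abs(w_star @ mu_E - w_star @ mu_pi)            # |U_{w*}(pi_E) - U_{w*}(pi)|
bound = np.linalg.norm(w_star) * np.linalg.norm(mu_E - mu_pi)
assert gap <= bound + 1e-12
print(f"utility gap {gap:.4f} <= bound {bound:.4f}")
```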

Algorithm. [Figure: the iterations illustrated in feature-expectation space with axes φ_1 and φ_2, showing μ(π_E), the iterates μ(π^(0)), μ(π^(1)), μ(π^(2)) and the weight vectors w^(1), w^(2), w^(3); U_w(π) = w^T μ(π).]

Theoretical Results: Convergence. Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ and the expert's feature expectations μ(π_E) be given. Then after at most k/[(1−γ)ε]^2 iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e., U_{w*}(π) ≥ U_{w*}(π_E) − ε.
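For a sense of scale, the iteration bound can be evaluated directly (the numbers below are illustrative only, not from the paper):

```python
# Illustrative numbers only: k = 64 features, gamma = 0.9, eps = 0.1.
k, gamma, eps = 64, 0.9, 0.1
iterations = k / ((1 - gamma) * eps) ** 2
print(f"at most {iterations:.0f} iterations")   # 64 / (0.01)^2 = 640000
```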

Theoretical Results: Sampling. In practice, we have to use sampling to estimate the feature expectations of the expert. We still get ε-optimal performance with high probability if the number of observed samples is at least O(poly(k, 1/ε)). Note: the bound has no dependence on the "complexity" of the policy.
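A sketch of the Monte Carlo estimate of the expert's feature expectations from sampled trajectories (the feature map and demonstrations below are toy assumptions):

```python
import numpy as np

def estimate_mu_expert(trajectories, phi, gamma):
    """Average of sum_t gamma^t * phi(s_t) over the demonstrated trajectories."""
    k = phi(trajectories[0][0]).shape[0]
    mu_hat = np.zeros(k)
    for states in trajectories:
        mu_hat += sum((gamma ** t) * phi(s) for t, s in enumerate(states))
    return mu_hat / len(trajectories)

# Toy usage: one-hot features over 4 states, three demonstrated trajectories.
phi = lambda s: np.eye(4)[s]
demos = [[0, 1, 2, 3], [0, 2, 2, 3], [0, 1, 3, 3]]
print(estimate_mu_expert(demos, phi, gamma=0.95))
```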

Gridworld Experiments. The reward function is piecewise constant over small regions. The features φ for IRL are indicators of these small regions. 128×128 grid, small regions of size 16×16.
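A sketch of this feature construction, under the assumption that φ(s) is a one-hot indicator of the 16×16 region containing s (so k = 64):

```python
import numpy as np

GRID, REGION = 128, 16
REGIONS_PER_SIDE = GRID // REGION        # 8
N_FEATURES = REGIONS_PER_SIDE ** 2       # 64 region-indicator features

def phi(state):
    """state = (row, col) on the 128x128 grid; one-hot indicator of its 16x16 region."""
    r, c = state
    idx = (r // REGION) * REGIONS_PER_SIDE + (c // REGION)
    feat = np.zeros(N_FEATURES)
    feat[idx] = 1.0
    return feat

# A reward that is piecewise constant over the regions is linear in these features.
w_true = np.random.RandomState(0).uniform(size=N_FEATURES)
print(w_true @ phi((3, 17)))             # reward of a state in region (row 0, col 1)
```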

Gridworld Experiments. [Figure-only slides with experimental results; no text recovered.]

Case study: Highway driving. The only input to the learning algorithm was the driving demonstration (left panel); no reward function was provided. [Video: input, driving demonstration; output, learned behavior.]

More driving examples. In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.

Car driving results. [Table: for each of three demonstrated driving styles, the feature expectations μ of the expert, the feature expectations μ of the learned policy, and the learned weights w, over the features Collision, Left Shoulder, Left Lane, Middle Lane, Right Lane, Right Shoulder; numeric values not recovered.]

Conclusions.
– Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
– The algorithm is guaranteed to converge in poly(k, 1/ε) iterations.
– Sample complexity is poly(k, 1/ε).
– The algorithm exploits reward "simplicity" (vs. policy "simplicity" in previous approaches).
– [Poster: dual formulation; cheaper inverse RL step without the optimization.]

Additional slides for the poster. (The slides that follow are additional material not included in the talk, in particular: the projection (vs. QP) version of the inverse RL step; another formulation of the apprenticeship learning problem and its relation to our algorithm.)

Simplification of the inverse RL step: QP → Euclidean projection. In the inverse RL step:
– set μ̄^(i−1) = the orthogonal projection of μ_E onto the line through μ̄^(i−2) and μ(π^(i−1));
– set w^(i) = μ_E − μ̄^(i−1).
Note: the theoretical results on convergence and sample complexity hold unchanged for the simpler algorithm.
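A small numpy sketch of this projection update (the function and variable names, and the toy usage, are mine):

```python
import numpy as np

def projection_update(mu_E, mu_bar_prev, mu_new):
    """Project mu_E onto the line through mu_bar_prev and mu_new; return the
    new running point mu_bar and the new reward direction w = mu_E - mu_bar."""
    d = mu_new - mu_bar_prev
    alpha = d @ (mu_E - mu_bar_prev) / (d @ d)
    mu_bar = mu_bar_prev + alpha * d
    return mu_bar, mu_E - mu_bar

# Toy usage in 2-D: projecting (1, 1) onto the line through (0, 0) and (2, 0).
mu_bar, w = projection_update(np.array([1.0, 1.0]),
                              np.array([0.0, 0.0]),
                              np.array([2.0, 0.0]))
print(mu_bar, w)    # mu_bar = (1, 0), so w points toward (0, 1)
```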

Algorithm (projection version). [Three figure slides: the projection version illustrated step by step in feature-expectation space with axes φ_1 and φ_2, showing μ_E, the iterates μ(π^(0)), μ(π^(1)), μ(π^(2)), the projected points μ̄^(1), μ̄^(2), and the weight vectors w^(1), w^(2), w^(3).]

Appendix: Different View.
Bellman LP for solving MDPs:
  min_V c'V  s.t.  ∀ s, a:  V(s) ≥ R(s,a) + γ Σ_{s'} P(s,a,s') V(s').
Dual LP:
  max_λ Σ_{s,a} λ(s,a) R(s,a)  s.t.  ∀ s:  c(s) − Σ_a λ(s,a) + γ Σ_{s',a} P(s',a,s) λ(s',a) = 0.
Apprenticeship learning as a QP:
  min_λ Σ_i (μ_{E,i} − Σ_{s,a} λ(s,a) φ_i(s))^2  s.t.  ∀ s:  c(s) − Σ_a λ(s,a) + γ Σ_{s',a} P(s',a,s) λ(s',a) = 0.
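A hedged sketch of the "apprenticeship learning as QP" formulation above, using cvxpy on a tiny random MDP (both the library choice and the toy MDP are assumptions; the slides only state the optimization problem over the occupancy measure λ(s,a)):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
nS, nA, k, gamma = 4, 2, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, :] = next-state distribution
c = np.ones(nS) / nS                            # initial state distribution
Phi = rng.uniform(size=(nS, k))                 # Phi[s, i] = phi_i(s)
mu_E = rng.uniform(size=k)                      # toy expert feature expectations

lam = cp.Variable((nS, nA), nonneg=True)        # occupancy measure lambda(s, a)
# Flow constraints: for all s,
#   c(s) - sum_a lam(s, a) + gamma * sum_{s', a} P(s', a, s) lam(s', a) = 0.
flow = [c[s] - cp.sum(lam[s, :]) + gamma * cp.sum(cp.multiply(P[:, :, s], lam)) == 0
        for s in range(nS)]
# Objective: match the expert's feature expectations.
mu_lam = Phi.T @ cp.sum(lam, axis=1)            # sum_{s,a} lam(s, a) * phi(s)
problem = cp.Problem(cp.Minimize(cp.sum_squares(mu_E - mu_lam)), flow)
problem.solve()
print("feature-matching error:", problem.value)
```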

Different View (ctd.). Our algorithm is equivalent to iteratively (i) linearizing the QP at the current point (inverse RL step) and (ii) solving the resulting LP (RL step). Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality). [Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]

Slides that are different for the poster. (The slides that follow are slightly different versions, for the poster, of slides that already appeared earlier.)

Algorithm (QP version). [Three figure slides: the QP version illustrated step by step in feature-expectation space with axes φ_1 and φ_2, showing μ(π_E), the iterates μ(π^(0)), μ(π^(1)), μ(π^(2)) and the weight vectors w^(1), w^(2), w^(3); U_w(π) = w^T μ(π).]

Gridworld Experiments. [Figure-only slide; no text recovered.]

Case study: Highway driving. (Videos available.) Input: driving demonstration. Output: learned behavior.

More driving examples. (Videos available.)

Car driving results (more detail). [Table: for each of five demonstrated driving styles, the expert's feature distribution, the learned feature distribution, and the learned weights over the features Collision, Offroad Left, Left Lane, Middle Lane, Right Lane, Offroad Right; numeric values not recovered.]

Apprenticeship Learning via Inverse Reinforcement Learning. Pieter Abbeel and Andrew Y. Ng, Stanford University.