Apprenticeship Learning via Inverse Reinforcement Learning

Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng, Stanford University
NIPS 2003 Workshop

Motivation
- Typical RL setting. Given: a system model and a reward function. Return: a policy that is optimal with respect to the given model and reward function.
- The reward function might be hard to specify exactly. E.g., driving well on a highway requires trading off distance, speed, and lane preference.

Apprenticeship Learning
- Apprenticeship learning = the task of learning from observing an expert/teacher.
- Previous work: mostly tries to mimic the teacher by learning the mapping from states to actions directly; it lacks strong performance guarantees.
- Our approach:
  - Returns a policy with performance as good as the expert's, as measured according to the expert's unknown reward function.
  - Reduces the problem to solving the control problem with a given reward.
  - The algorithm is inspired by Inverse Reinforcement Learning (Ng and Russell, 2000).

Preliminaries
- Markov Decision Process $(S, A, T, \gamma, D, R)$.
- Reward linear in features: $R(s) = w^\top \phi(s)$, where $\phi : S \to [0,1]^k$ is a $k$-dimensional feature vector.
- Value of a policy $\pi$:
  $U_w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi] = E[\sum_t \gamma^t w^\top \phi(s_t) \mid \pi] = w^\top E[\sum_t \gamma^t \phi(s_t) \mid \pi]$
- Feature distribution of $\pi$: $\mu(\pi) = E[\sum_t \gamma^t \phi(s_t) \mid \pi] \in \frac{1}{1-\gamma}[0,1]^k$
- Hence $U_w(\pi) = w^\top \mu(\pi)$.
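To make these quantities concrete, here is a minimal Python sketch (not from the slides) that estimates a policy's feature distribution by Monte Carlo rollouts; the environment interface (reset() returning a state, step() returning a next state and a done flag) and the feature map phi are assumptions for illustration.

import numpy as np

def feature_expectations(env, policy, phi, gamma=0.95, n_rollouts=100, horizon=400):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ]."""
    mu = None
    for _ in range(n_rollouts):
        s = env.reset()                      # assumed interface: returns the initial state
        discount = 1.0
        total = np.zeros_like(phi(s), dtype=float)
        for _ in range(horizon):
            total += discount * np.asarray(phi(s), dtype=float)
            s, done = env.step(policy(s))    # assumed interface: returns (next_state, done)
            discount *= gamma
            if done:
                break
        mu = total if mu is None else mu + total
    return mu / n_rollouts

def policy_value(w, mu):
    """U_w(pi) = w^T mu(pi) for the linear reward R(s) = w^T phi(s)."""
    return float(np.dot(w, mu))

With such an estimate in hand, comparing policies under any candidate reward $w$ reduces to the dot product $U_w(\pi) = w^\top \mu(\pi)$.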

Algorithm
For $t = 1, 2, \ldots$:
- IRL step: estimate the expert's reward function $R(s) = w^\top \phi(s)$ by solving the following QP:
  $\max_{z, w} \; z$
  s.t. $U_w(\pi_E) - U_w(\pi_j) \geq z$ for $j = 0, \ldots, t-1$ (linear constraints in $w$), and $\|w\|_2 \leq 1$.
- RL step: compute the optimal policy $\pi_t$ for this reward $w$.
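Below is a minimal sketch of the IRL step as stated above, i.e. maximizing the worst-case margin $z$ subject to $\|w\|_2 \leq 1$. Using SciPy's SLSQP solver instead of a dedicated QP solver, and the function name irl_step, are my own choices for illustration; the outer loop would alternate this step with an RL solver that computes an optimal policy for the returned $w$.

import numpy as np
from scipy.optimize import minimize

def irl_step(mu_expert, mu_list):
    """One IRL step: find w with ||w||_2 <= 1 maximizing the margin z such that
    w . (mu_expert - mu_j) >= z for every previously computed policy j.
    Assumes mu_list contains at least one previously computed feature distribution."""
    k = len(mu_expert)
    diffs = [np.asarray(mu_expert) - np.asarray(mu_j) for mu_j in mu_list]

    # Decision variable x = [w_1, ..., w_k, z]; maximize z by minimizing -z.
    constraints = [{"type": "ineq", "fun": (lambda x, d=d: np.dot(x[:k], d) - x[-1])}
                   for d in diffs]
    constraints.append({"type": "ineq", "fun": lambda x: 1.0 - np.dot(x[:k], x[:k])})

    res = minimize(lambda x: -x[-1], np.zeros(k + 1),
                   constraints=constraints, method="SLSQP")
    w, z = res.x[:k], res.x[-1]
    return w, z   # z is the achieved margin between the expert and all previous policies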

Algorithm (2) 2 E (2) (1) W(2) W(1) (0) 1 12/7/2018 NIPS 2003, Workshop

Feature Distribution Closeness and Performance
If we can find a policy $\pi$ such that $\|\mu(\pi) - \mu_E\|_2 \leq \epsilon$, then for any underlying reward $R^*(s) = {w^*}^\top \phi(s)$ (with $\|w^*\|_2 \leq 1$) we have
$|U_{w^*}(\pi) - U_{w^*}(\pi_E)| = |{w^*}^\top \mu(\pi) - {w^*}^\top \mu_E| \leq \epsilon$.
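Spelling out the one-line justification, which is the Cauchy-Schwarz inequality together with $\|w^*\|_2 \leq 1$:

\[
|U_{w^*}(\pi) - U_{w^*}(\pi_E)|
  = |{w^*}^\top (\mu(\pi) - \mu_E)|
  \leq \|w^*\|_2 \, \|\mu(\pi) - \mu_E\|_2
  \leq 1 \cdot \epsilon = \epsilon .
\]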

Theoretical Results: Convergence
Let an MDP\R and a $k$-dimensional feature vector $\phi$ be given. Then after at most $O(\mathrm{poly}(k, 1/\epsilon))$ iterations, the algorithm outputs a policy $\pi$ that performs nearly as well as the teacher, as evaluated on the unknown reward function $R^*(s) = {w^*}^\top \phi(s)$:
$U_{w^*}(\pi) \geq U_{w^*}(\pi_E) - \epsilon$.

Theoretical Results: Sampling
In practice we have to use sampling estimates of the expert's feature distribution. We still obtain $\epsilon$-optimal performance with high probability when the number of samples is $O(\mathrm{poly}(k, 1/\epsilon))$.
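A minimal sketch of the sampled estimate of $\mu_E$ (the trajectory format, a list of state sequences demonstrated by the expert, is an assumption for illustration):

import numpy as np

def estimate_expert_mu(expert_trajectories, phi, gamma=0.95):
    """Sample-based estimate of mu_E: average the discounted feature sums of the
    expert's demonstrated trajectories (each trajectory is a list of states)."""
    sums = []
    for states in expert_trajectories:
        sums.append(sum(gamma ** t * np.asarray(phi(s), dtype=float)
                        for t, s in enumerate(states)))
    return np.mean(sums, axis=0)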

Experiments: Gridworld (ctd.)
- 128x128 gridworld, 4 actions (the 4 compass directions); each action succeeds with probability 70% (otherwise the agent moves randomly to one of the other neighbouring squares).
- The features are the non-overlapping 16x16 regions of cells; a small number of regions have non-zero (positive) rewards.
- The expert is optimal with respect to some weight vector $w^*$.
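A minimal sketch of the feature map this setup implies: with a 128x128 grid and non-overlapping 16x16 regions there are $k = 64$ indicator features, one per region. The function name and the (row, col) state encoding are assumptions for illustration.

import numpy as np

GRID_SIZE, REGION_SIZE = 128, 16
REGIONS_PER_SIDE = GRID_SIZE // REGION_SIZE          # 8
K = REGIONS_PER_SIDE ** 2                            # 64 features, one per 16x16 region

def phi(state):
    """Indicator features for the gridworld: state = (row, col); the entry for the
    16x16 region containing the state is 1, all other entries are 0."""
    row, col = state
    region = (row // REGION_SIZE) * REGIONS_PER_SIDE + (col // REGION_SIZE)
    features = np.zeros(K)
    features[region] = 1.0
    return features

# The true reward is R(s) = w* . phi(s) with a sparse, non-negative w*.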

Experiments: Car Driving

Car Driving Results
Feature columns: Collision, Offroad Left, Left Lane, Middle Lane, Right Lane, Offroad Right.

Style 1
- Feature distr. (expert; left lane, middle lane, right lane, offroad right): 0.1325, 0.2033, 0.5983, 0.0658
- Feature distr. (learned; in column order): 5.00E-05, 0.0004, 0.0904, 0.2286, 0.604, 0.0764
- Weights (learned; in column order): -0.0767, -0.0439, 0.0077, 0.0078, 0.0318, -0.0035

Style 2
- Feature distr. (expert; non-empty cells): 0.1167, 0.0633, 0.4667, 0.47
- Feature distr. (learned; non-empty cells): 0.1332, 0.1045, 0.3196, 0.5759
- Weights (learned; in column order): 0.234, -0.1098, 0.0092, 0.0487, 0.0576, -0.0056

Style 3
- Feature distr. (expert; non-empty cells): 0.0033, 0.7058, 0.2908
- Feature distr. (learned; non-empty cells): 0.7447, 0.2554
- Weights (learned; in column order): -0.1056, -0.0051, -0.0573, -0.0386, 0.0929, 0.0081

Style 4
- Feature distr. (expert and learned; non-empty cells): 0.06, 0.0569, 0.2666, 0.7334
- Weights (learned; in column order): 0.1079, -0.0001, -0.0487, -0.0666, 0.059, 0.0564

Style 5
- Feature distr. (expert and learned; non-empty cells): 0.0542
- Weights (learned; in column order): 0.0094, -0.0108, -0.2765, 0.8126, -0.51, -0.0153

Conclusion
- Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function.
- We reduced the problem to solving the control problem with a given reward.
- The algorithm is guaranteed to converge in $\mathrm{poly}(k, 1/\epsilon)$ iterations.
- Sample complexity: $\mathrm{poly}(k, 1/\epsilon)$.

Appendix: Different View
Bellman LP for solving MDPs:
$\min_V \; c^\top V \quad \text{s.t.} \quad \forall s,a: \; V(s) \geq R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s')$
Dual LP:
$\max_\lambda \; \sum_{s,a} \lambda(s,a) R(s,a) \quad \text{s.t.} \quad \forall s: \; c(s) - \sum_a \lambda(s,a) + \gamma \sum_{s',a} P(s',a,s)\, \lambda(s',a) = 0, \quad \lambda \geq 0$
Apprenticeship learning as a QP:
$\min_\lambda \; \sum_i \big( \mu_{E,i} - \sum_{s,a} \lambda(s,a)\, \phi_i(s) \big)^2 \quad \text{s.t. the same constraints on } \lambda \text{ as in the dual LP}$
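A minimal sketch of the primal Bellman LP using scipy.optimize.linprog; the dense (S, A, S) layout for the transition tensor P and the uniform default for the state-weight vector c are assumptions for illustration.

import numpy as np
from scipy.optimize import linprog

def solve_bellman_lp(P, R, gamma=0.95, c=None):
    """Bellman LP: min_V c^T V  s.t.  V(s) >= R(s,a) + gamma * sum_s' P[s,a,s'] V(s')
    for all (s, a).  P has shape (S, A, S), R has shape (S, A), c is a positive
    state-weight vector (uniform by default)."""
    S, A, _ = P.shape
    c = np.ones(S) if c is None else np.asarray(c, dtype=float)
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * P[s, a, :].astype(float)
            row[s] -= 1.0                    # gamma * P[s,a,:] . V - V(s) <= -R(s,a)
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -R[s, a]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * S, method="highs")
    return res.x                             # the optimal value function V*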

Different View (ctd.)
- Our algorithm is equivalent to iteratively: linearizing the QP at the current point (IRL step), then solving the resulting LP (RL step).
- Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality).