Apprenticeship Learning by Inverse Reinforcement Learning Pieter Abbeel Andrew Y. Ng Stanford University.



Motivation
Typical control setting:
– Given: system model, reward function
– Return: controller optimal with respect to the given model and reward function
The reward function might be hard to specify exactly.
E.g. driving well on a highway requires trading off:
– Distance, speed, lane preference

Apprenticeship Learning = task of learning from an expert/teacher
Previous work:
– Mostly tries to mimic the teacher directly by learning the mapping from states to actions
– Lacks strong performance guarantees
Our approach:
– Returns a policy with performance as good as the expert's on the expert's unknown reward function
– Reduces the problem to solving the control problem with a given reward
– Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)

Preliminaries
Markov Decision Process (MDP): $(S, A, T, \gamma, D, R)$
$R(s) = w^T \phi(s)$: reward function, with $\|w\|_2 \le 1$
$\phi : S \to [0,1]^k$: k-dimensional feature vector
Policy $\pi : S \to A$
$U_w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi] = E[\sum_t \gamma^t w^T \phi(s_t) \mid \pi] = w^T E[\sum_t \gamma^t \phi(s_t) \mid \pi]$
Note: $U_w(\pi)$ is linear in $w$.
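The linearity noted on this slide can be checked numerically. The sketch below is only an illustration: it estimates the discounted feature sum E[Σ_t γ^t φ(s_t) | π] by averaging over sampled state sequences, then evaluates the utility as a dot product with w. The function names, the two-state toy example, and the use of plain Monte Carlo averaging are assumptions made for this example and do not come from the slides.

```python
from typing import Callable, List, Sequence


def feature_expectations(
    trajectories: Sequence[Sequence[int]],
    phi: Callable[[int], List[float]],
    gamma: float,
) -> List[float]:
    """Monte Carlo estimate of E[sum_t gamma^t phi(s_t) | pi].

    Each trajectory is a sequence of states visited under the policy.
    """
    k = len(phi(trajectories[0][0]))
    mu = [0.0] * k
    for states in trajectories:
        discount = 1.0
        for s in states:
            f = phi(s)
            for i in range(k):
                mu[i] += discount * f[i]
            discount *= gamma
    return [v / len(trajectories) for v in mu]


def utility(w: Sequence[float], mu: Sequence[float]) -> float:
    """U_w(pi) = w^T mu(pi): linear in w once mu is known."""
    return sum(wi * mi for wi, mi in zip(w, mu))


# Hypothetical toy problem: two states with one-hot features.
def one_hot(s: int) -> List[float]:
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]


toy_trajectories = [[0, 1, 1], [0, 0, 1]]
mu_toy = feature_expectations(toy_trajectories, one_hot, gamma=0.5)
u_toy = utility([0.6, 0.8], mu_toy)
```

Because the feature sum does not depend on w, it can be estimated once per policy and reused for every candidate reward weight vector.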

Algorithm
Iterate:
– IRL step: estimate the expert's reward function $R(s) = w^T \phi(s)$ by solving the following QP:
  $\max_{t,w} t$ such that $U_w(\pi_E) - U_w(\pi_j) \ge t$ for $j = 0, \dots, i-1$ (linear constraint in $w$) and $\|w\|_2 \le 1$
– RL step: compute the optimal policy $\pi_i$ for this reward $w$.
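As a rough illustration of the IRL step only, the sketch below avoids a general QP solver. It relies on the fact that maximizing the margin t over the unit ball is equivalent to finding the minimum-norm point in the convex hull of the differences between the expert's feature expectations and those of each earlier policy, and it finds that point with a simple Frank-Wolfe iteration. This is a swapped-in technique chosen to keep the example dependency-free, not the solver used by the authors. The function name, tolerance, and iteration cap are assumptions, and the RL step is omitted.

```python
import math
from typing import List, Sequence, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def max_margin_step(
    mu_expert: Sequence[float],
    mu_policies: Sequence[Sequence[float]],
    tol: float = 1e-9,
    max_iter: int = 10000,
) -> Tuple[List[float], float]:
    """Approximately solve: max t  s.t.  w.(mu_E - mu_j) >= t,  ||w||_2 <= 1.

    Returns (w, t). A value of t == 0 means the expert's feature
    expectations lie in the convex hull of the earlier policies'.
    """
    diffs = [[e - m for e, m in zip(mu_expert, mu)] for mu in mu_policies]
    p = list(diffs[0])  # current point inside the convex hull
    for _ in range(max_iter):
        # Vertex that most decreases the squared norm along the line from p.
        d = min(diffs, key=lambda v: _dot(p, v))
        gap = _dot(p, p) - _dot(p, d)
        if gap <= tol:
            break
        direction = [di - pi for di, pi in zip(d, p)]
        # Exact line search on ||p + alpha * direction||^2, clipped to [0, 1].
        alpha = min(1.0, gap / _dot(direction, direction))
        p = [pi + alpha * si for pi, si in zip(p, direction)]
    t = math.sqrt(_dot(p, p))
    if t == 0.0:
        return [0.0] * len(p), 0.0
    return [pi / t for pi in p], t


# Hypothetical feature expectations for an expert and two earlier policies.
w_toy, t_toy = max_margin_step([1.0, 1.0], [[0.0, 1.0], [1.0, 0.0]])
```

In a full loop, the returned w would define the reward handed to the RL step, and the resulting policy's feature expectations would be appended to `mu_policies` before the next iteration.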

Theoretical Results: Convergence
Let an MDP\R and a k-dimensional feature vector $\phi$ be given. Then after at most
$O\left( \frac{k}{[(1-\gamma)\epsilon]^2} \log \frac{k}{(1-\gamma)\epsilon} \right) = O(\mathrm{poly}(k, 1/\epsilon))$
iterations, the algorithm outputs a policy $\pi$ that performs nearly as well as the teacher, as evaluated on the unknown reward function $R^*(s) = w^{*T} \phi(s)$:
$U_{w^*}(\pi) \ge U_{w^*}(\pi_E) - \epsilon$.
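To get a feel for how the bound scales, the expression inside the O(·) can be evaluated directly. The sketch below is only indicative: big-O notation hides a constant factor, so the returned number is not an iteration count. The function name is hypothetical, and reading "log" as the natural logarithm is an assumption.

```python
import math


def iteration_bound_term(k: int, gamma: float, eps: float) -> float:
    """Evaluate k / ((1 - gamma) * eps)^2 * log(k / ((1 - gamma) * eps)).

    This is the term inside the O(.) of the convergence bound. The hidden
    constant is not included, and natural log is assumed.
    """
    scale = (1.0 - gamma) * eps
    return (k / scale ** 2) * math.log(k / scale)


# Halving eps multiplies the term by a bit more than 4 (quadratic in 1/eps
# times a log factor).
coarse = iteration_bound_term(k=64, gamma=0.9, eps=0.2)
fine = iteration_bound_term(k=64, gamma=0.9, eps=0.1)
```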

Theoretical Results: Sampling
In practice, we have to use sampling estimates for the feature distribution of the expert. We still have $\epsilon$-optimal performance with probability $1-\delta$ when the number of samples satisfies
$m \ge \frac{9k}{2[(1-\gamma)\epsilon]^2} \log \frac{2k}{\delta} = \mathrm{poly}(k, 1/\epsilon, 1/\delta)$.
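The sample bound on this slide is explicit, so it can be evaluated for concrete settings. The sketch below simply plugs values into the stated expression and rounds up. The function name is hypothetical, and interpreting "log" as the natural logarithm is an assumption.

```python
import math


def expert_samples_needed(k: int, gamma: float, eps: float, delta: float) -> int:
    """Smallest integer m with m >= 9k / (2 ((1 - gamma) eps)^2) * log(2k / delta).

    Natural log is assumed; the formula is taken from the slide as stated.
    """
    scale = (1.0 - gamma) * eps
    return math.ceil(9.0 * k / (2.0 * scale ** 2) * math.log(2.0 * k / delta))


# The dependence on delta is only logarithmic, while eps enters quadratically.
m_loose = expert_samples_needed(k=64, gamma=0.9, eps=0.1, delta=0.1)
m_tight = expert_samples_needed(k=64, gamma=0.9, eps=0.1, delta=0.01)
```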

Experiments: Gridworld
128x128 gridworld, 4 actions (the 4 compass directions), 70% success (otherwise a random move to one of the other neighbouring squares)
Non-overlapping regions of 16x16 cells are the features. A small number have non-zero (positive) rewards.
Expert is optimal w.r.t. some weights $w^*$.
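A minimal sketch of this setup is given below, under stated assumptions. The grid size, region size, action set, and 70% success rate come from the slide. Several details are guesses made for illustration: the coordinate convention, the row-major ordering of regions, choosing uniformly among the three other directions on failure, and clamping moves at the grid boundary.

```python
import random
from typing import List, Tuple

GRID = 128                    # cells per side (from the slide)
REGION = 16                   # region side length (from the slide)
PER_SIDE = GRID // REGION     # 8 regions per side
K = PER_SIDE * PER_SIDE       # 64 indicator features
P_SUCCESS = 0.7               # from the slide
MOVES = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}


def phi(x: int, y: int) -> List[float]:
    """One-hot feature vector: which 16x16 region contains cell (x, y)."""
    region = (y // REGION) * PER_SIDE + (x // REGION)  # row-major (assumed)
    features = [0.0] * K
    features[region] = 1.0
    return features


def step(x: int, y: int, action: str, rng: random.Random) -> Tuple[int, int]:
    """Sample the next cell for the given action.

    With probability 0.3 the action is replaced by one of the other three
    directions, chosen uniformly (assumed). Moves off the grid are clamped
    (assumed).
    """
    if rng.random() >= P_SUCCESS:
        action = rng.choice([a for a in MOVES if a != action])
    dx, dy = MOVES[action]
    nx = min(max(x + dx, 0), GRID - 1)
    ny = min(max(y + dy, 0), GRID - 1)
    return nx, ny
```

With these features, a reward of the form w^T φ(s) assigns one value per region, so a weight vector with only a few positive entries gives the sparse reward structure described on the slide.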

Experiments: Gridworld (ctd)

Experiments: Car Driving Illustrate how different driving styles can be learned (videos)

Conclusion
Our algorithm returns a policy with performance as good as the expert's on the expert's unknown reward function.
It reduces the problem to solving the control problem with a given reward.
The algorithm is guaranteed to converge in $\mathrm{poly}(k, 1/\epsilon)$ iterations.
Sample complexity is $\mathrm{poly}(k, 1/\epsilon, 1/\delta)$.