Apprenticeship Learning by Inverse Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng
Stanford University

Motivation
Typical control setting:
– Given: a system model and a reward function
– Return: a controller that is optimal with respect to the given model and reward function
The reward function can be hard to specify exactly. For example, driving well on a highway requires trading off:
– distance, speed, lane preference, …

Apprenticeship Learning
Apprenticeship learning = the task of learning from an expert/teacher.
Previous work:
– Mostly tries to mimic the teacher directly by learning the mapping from states to actions
– Lacks strong performance guarantees
Our approach:
– Returns a policy whose performance is as good as the expert's on the expert's unknown reward function
– Reduces the problem to solving the control problem with a given reward
– Algorithm inspired by Inverse Reinforcement Learning (Ng and Russell, 2000)

Preliminaries
Markov Decision Process (MDP): $(S, A, T, \gamma, D, R)$
– $S$: finite set of states
– $A$: set of actions
– $T = \{P_{sa}\}$: state transition probabilities
– $\gamma \in [0,1)$: discount factor
– $D$: initial state distribution
– $R(s) = w^\top \phi(s)$: reward function, where $\phi : S \to [0,1]^k$ is a $k$-dimensional feature vector
A policy is a mapping $\pi : S \to A$.
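
For concreteness, here is a minimal Python sketch of the linear-in-features reward $R(s) = w^\top \phi(s)$. The toy feature matrix and weight vector are purely illustrative, not taken from the paper:

```python
import numpy as np

# Toy illustration of a linear-in-features reward R(s) = w^T phi(s).
# Features and weights below are made up for the example.
PHI = np.array([
    [1.0, 0.0],   # phi(s_0)
    [0.0, 1.0],   # phi(s_1)
    [0.5, 0.5],   # phi(s_2)
])
w = np.array([0.8, -0.2])   # reward weights (unknown in the apprenticeship setting)

def reward(s: int) -> float:
    """R(s) = w^T phi(s)."""
    return float(w @ PHI[s])

print(reward(2))   # 0.8*0.5 + (-0.2)*0.5 = 0.3 (up to float rounding)
```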

Value of a Policy
$U_w(\pi) = E\bigl[\sum_t \gamma^t R(s_t) \mid \pi\bigr] = E\bigl[\sum_t \gamma^t w^\top \phi(s_t) \mid \pi\bigr] = w^\top E\bigl[\sum_t \gamma^t \phi(s_t) \mid \pi\bigr]$
Define the feature distribution (feature expectations) $\mu(\pi)$:
– $\mu(\pi) = E\bigl[\sum_t \gamma^t \phi(s_t) \mid \pi\bigr] \in \frac{1}{1-\gamma}[0,1]^k$
So $U_w(\pi) = w^\top \mu(\pi)$.
Optimal policy: $\pi^* = \arg\max_\pi U_w(\pi)$.
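
The feature expectations $\mu(\pi)$ can be estimated by rolling out the policy in a simulator. A minimal sketch, where `policy`, `env`, and `phi` are hypothetical interfaces standing in for the policy, the environment, and the feature map:

```python
import numpy as np

def estimate_feature_expectations(policy, env, phi, gamma=0.95,
                                  n_rollouts=100, horizon=400):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ].

    Assumed (hypothetical) interfaces:
      policy(s)   -> action
      env.reset() -> initial state drawn from D
      env.step(a) -> (next_state, done)
      phi(s)      -> k-dimensional feature vector in [0,1]^k
    """
    mu = 0.0
    for _ in range(n_rollouts):
        s = env.reset()
        discount = 1.0
        for _ in range(horizon):
            mu = mu + discount * phi(s)   # accumulate discounted features
            s, done = env.step(policy(s))
            discount *= gamma
            if done:
                break
    return mu / n_rollouts
```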

Feature Distribution Closeness and Performance
Assume the feature distribution of the expert/teacher, $\mu_E$, is given.
If we can find a policy $\pi$ such that $\|\mu(\pi) - \mu_E\|_2 \le \epsilon$, then for any underlying reward $R^*(s) = {w^*}^\top \phi(s)$ with $\|w^*\|_1 \le 1$:
$\bigl|U_{w^*}(\pi) - U_{w^*}(\pi_E)\bigr| = \bigl|{w^*}^\top \mu(\pi) - {w^*}^\top \mu_E\bigr| \le \epsilon$
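
The inequality is a one-line consequence of Hölder's inequality together with $\|\cdot\|_\infty \le \|\cdot\|_2$ (this step is not spelled out on the slide):

```latex
\bigl|U_{w^*}(\pi) - U_{w^*}(\pi_E)\bigr|
  = \bigl|{w^*}^\top\!\bigl(\mu(\pi) - \mu_E\bigr)\bigr|
  \le \|w^*\|_1 \,\|\mu(\pi) - \mu_E\|_\infty
  \le \|w^*\|_1 \,\|\mu(\pi) - \mu_E\|_2
  \le \epsilon .
```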

Algorithm
Input: MDP\R, $\mu_E$
1: Randomly pick a policy $\pi_0$, set $i = 1$
2: Compute $t_i = \max_{t,w} t$ such that $w^\top(\mu_E - \mu(\pi_j)) \ge t$ for $j = 0, \ldots, i-1$, with $\|w\|_2 \le 1$
3: If $t_i \le \epsilon$, terminate
4: Compute $\pi_i = \arg\max_\pi U_w(\pi)$
5: Compute $\mu(\pi_i)$
6: Set $i = i+1$, go to step 2
Return: the set of policies $\{\pi_j\}$; the guarantee is that there is some $\pi_j$ with ${w^*}^\top \mu(\pi_j) \ge {w^*}^\top \mu_E - \epsilon$.
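
Step 2 is a small convex program. A sketch of that step in Python, using cvxpy purely as a convenient solver choice (not part of the original work):

```python
import cvxpy as cp
import numpy as np

def max_margin_step(mu_E, mus):
    """Step 2: maximize t s.t. w^T (mu_E - mu(pi_j)) >= t for all previous
    policies pi_j, subject to ||w||_2 <= 1.

    mu_E : (k,) expert feature expectations
    mus  : list of (k,) feature expectations of pi_0 .. pi_{i-1}
    """
    k = mu_E.shape[0]
    w = cp.Variable(k)
    t = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ (mu_E - mu_j) >= t for mu_j in mus]
    problem = cp.Problem(cp.Maximize(t), constraints)
    problem.solve()
    return float(t.value), np.asarray(w.value)
```

Step 4 then hands the resulting weight vector $w$ to any standard MDP solver (value iteration, policy iteration, or an RL algorithm) with reward $R(s) = w^\top \phi(s)$ to obtain $\pi_i$.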

Theoretical Results: Convergence
Let an MDP\R and a $k$-dimensional feature vector $\phi$ be given. Then the algorithm terminates with $t_i \le \epsilon$ after at most
$O\!\left(\frac{k}{[(1-\gamma)\epsilon]^2} \log \frac{k}{(1-\gamma)\epsilon}\right)$
iterations.

Theoretical Results: Sampling
In practice, we have to use a sampling estimate of the expert's feature distribution. We still obtain $\epsilon$-optimal performance with probability $1-\delta$ for a number of samples
$m \ge \frac{9k}{2[(1-\gamma)\epsilon]^2} \log \frac{2k}{\delta}$
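
Correspondingly, a sketch of the empirical estimate of $\mu_E$ from $m$ demonstrated expert trajectories, reusing the hypothetical `phi` feature map from above:

```python
import numpy as np

def estimate_expert_feature_expectations(trajectories, phi, gamma=0.95):
    """Empirical mu_E: average discounted feature sum over m expert trajectories.

    trajectories : list of state sequences [s_0, s_1, ...] demonstrated by the expert
    phi          : feature map, phi(s) -> k-vector in [0,1]^k (assumed interface)
    """
    total = None
    for traj in trajectories:
        acc = sum(gamma**t * phi(s) for t, s in enumerate(traj))
        total = acc if total is None else total + acc
    return total / len(trajectories)
```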

Experiments: Gridworld
128x128 gridworld, 4 actions (the 4 compass directions), 70% chance of success (otherwise the agent moves randomly to one of the other neighbouring squares).
Features: non-overlapping 16x16 regions of cells; a small number of regions have non-zero (positive) reward.
The expert is optimal with respect to some weight vector $w^*$.
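
To illustrate the feature design, here is a sketch of macrocell indicator features for this setup; the 128x128 grid and 16x16 regions follow the slide, everything else is an assumed implementation detail:

```python
import numpy as np

GRID = 128                           # 128 x 128 gridworld
REGION = 16                          # non-overlapping 16 x 16 macrocells
N_REGIONS = (GRID // REGION) ** 2    # 8*8 = 64 features

def phi(state):
    """Indicator feature vector: which macrocell does (row, col) lie in?"""
    row, col = state
    region = (row // REGION) * (GRID // REGION) + (col // REGION)
    feats = np.zeros(N_REGIONS)
    feats[region] = 1.0
    return feats

# Example: state (20, 100) lies in macrocell row 1, column 6 -> index 1*8 + 6 = 14.
print(np.argmax(phi((20, 100))))   # -> 14
```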

Experiments: Gridworld (ctd)

Experiments: Car Driving
Illustrates how different driving styles can be learned (shown via videos).

Conclusion
– Returns a policy with performance as good as the expert's on the expert's unknown reward function
– Reduces the problem to solving the control problem with a given reward
– The algorithm is guaranteed to converge in a polynomial number of iterations
– Sample complexity is $\mathrm{poly}\bigl(k,\ 1/[(1-\gamma)\epsilon],\ 1/\delta\bigr)$