Exploration and Apprenticeship Learning in Reinforcement Learning
Pieter Abbeel and Andrew Y. Ng, Stanford University



Overview
Reinforcement learning in systems with unknown dynamics. Algorithms such as E^3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies. Aggressive exploration is dangerous for many systems. We show that in apprenticeship learning, when we have a teacher demonstration of the task, this explicit exploration step is unnecessary and instead we can just use exploitation policies.

Reinforcement learning formalism
Markov Decision Process (MDP): (S, A, P_sa, H, s_0, R). Policy π: S → A. Utility of a policy π: U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ]. Goal: find the policy π that maximizes U(π).
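
To make the formalism concrete, here is a minimal Python sketch (not from the paper; the environment interface and the rollout count are assumptions) that estimates U(π) by Monte Carlo rollouts over horizon H:

def estimate_utility(policy, sample_next_state, reward, s0, H, n_rollouts=1000):
    """Monte Carlo estimate of U(pi) = E[ sum_{t=0}^{H} R(s_t) | pi ]."""
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        ret = reward(s)                   # R(s_0)
        for _ in range(H):
            a = policy(s)                 # pi: S -> A
            s = sample_next_state(s, a)   # draw the next state from P_sa
            ret += reward(s)
        total += ret
    return total / n_rollouts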

Motivating example
[Diagram: two routes to an accurate dynamics model P_sa: a textbook model / specification, or collecting flight data and learning the model from the data.]
How do we fly the helicopter for data collection? How do we ensure that the entire flight envelope is covered by the data collection process?

Learning the dynamical model
State of the art: the E^3 algorithm (Kearns and Singh, 2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). [Flowchart: Have a good model of the dynamics? NO → "Explore"; YES → "Exploit".]

Aggressive manual exploration

Learning the dynamical model
State of the art: the E^3 algorithm (Kearns and Singh, 2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). [Flowchart: Have a good model of the dynamics? NO → "Explore"; YES → "Exploit".] Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?
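
To illustrate the explore-or-exploit test above, here is a minimal Python sketch of the E^3-style idea (an illustration under assumed interfaces and threshold, not the authors' code): a state-action pair counts as "known" once it has been visited enough times, and the agent exploits only when its current state is well modeled.

from collections import defaultdict

class ExploreExploitAgent:
    """Illustrative E^3-style decision rule: explore until the local model is good."""

    def __init__(self, m_known=50):
        self.m_known = m_known           # visits needed before (s, a) counts as "known"
        self.visits = defaultdict(int)   # (state, action) -> visit count

    def record(self, s, a):
        self.visits[(s, a)] += 1

    def is_known(self, s, a):
        return self.visits[(s, a)] >= self.m_known

    def choose_mode(self, s, actions):
        # "Have a good model of the dynamics?"  YES -> exploit, NO -> explore.
        if all(self.is_known(s, a) for a in actions):
            return "exploit"             # plan for reward within the well-modeled region
        return "explore"                 # plan to reach a poorly modeled (s, a) quickly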

Apprenticeship learning of the model
[Diagram of the loop: expert human pilot flight yields data (a_1, s_1, a_2, s_2, a_3, s_3, …) → learn P_sa → dynamics model P_sa → reinforcement learning, max E[R(s_0) + … + R(s_H)] → control policy π → autonomous flight yields new data (a_1, s_1, a_2, s_2, a_3, s_3, …) → learn P_sa → repeat.]
Duration? Performance? Number of iterations?
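
A minimal Python sketch of this loop (illustrative only; fit_model, plan_policy, and execute_policy are assumed placeholders for the model learner, the RL/optimal-control step, and the autonomous flight, not the authors' implementation):

def apprenticeship_learning(teacher_data, fit_model, plan_policy, execute_policy,
                            n_iterations=10):
    data = list(teacher_data)                 # (s, a, s') triples from the expert pilot
    policy = None
    for _ in range(n_iterations):
        model = fit_model(data)               # learn P_sa from all data collected so far
        policy = plan_policy(model)           # maximize E[R(s_0) + ... + R(s_H)] in the model
        trajectory = execute_policy(policy)   # autonomous flight under the new policy
        data.extend(trajectory)               # the new (s, a, s') triples improve the model
    return policy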

Typical scenario
Initially, all state-action pairs are inaccurately modeled. [Figure: the state-action space; the legend distinguishes accurately modeled from inaccurately modeled state-action pairs.]

Typical scenario (2)
Teacher demonstration. [Figure: the region frequently visited by the teacher's policy is now accurately modeled; the region not frequently visited by the teacher's policy remains inaccurately modeled.]

Typical scenario (3)
First exploitation policy. [Figure: the region frequently visited by the first exploitation policy becomes accurately modeled, in addition to the region frequently visited by the teacher's policy.]

Typical scenario (4)
Second exploitation policy. [Figure: the region frequently visited by the second exploitation policy becomes accurately modeled as well.]

Typical scenario (5)
Third exploitation policy. [Figure: the region frequently visited by the third exploitation policy is now accurately modeled too.] The model is accurate for the exploitation policy and for the teacher's policy, and the exploitation policy is better than the teacher in the model, hence also better than the teacher in the real world. Done.

Two dynamics models
Discrete dynamics: finite S and A; the dynamics P_sa are described by state transition probabilities P(s'|s,a); learn the dynamics from data using maximum likelihood.
Continuous, linear dynamics: continuous-valued states and actions (S = R^{n_S}, A = R^{n_A}); s_{t+1} = G φ(s_t) + H a_t + w_t; estimate G, H from the data using linear regression.
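
The two estimators can be sketched as follows (a minimal illustration with assumed data formats, not the authors' code): maximum-likelihood transition frequencies for the discrete case, and least squares for the continuous, linearly parameterized case.

import numpy as np
from collections import defaultdict

def fit_discrete_model(transitions):
    """Maximum-likelihood estimate of P(s'|s,a) from observed (s, a, s') triples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: c / sum(cs.values()) for s2, c in cs.items()}
            for sa, cs in counts.items()}

def fit_linear_model(phi_s, actions, next_states):
    """Least-squares estimate of (G, H) in s_{t+1} = G phi(s_t) + H a_t + w_t.

    phi_s:       array of shape (T, d_phi), rows phi(s_t)
    actions:     array of shape (T, n_A),   rows a_t
    next_states: array of shape (T, n_S),   rows s_{t+1}
    """
    X = np.hstack([phi_s, actions])
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    G = theta[:phi_s.shape[1]].T   # (n_S, d_phi)
    H = theta[phi_s.shape[1]:].T   # (n_S, n_A)
    return G, H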

Performance guarantees
Let any ε, δ > 0 be given. Theorem: to obtain U(π) ≥ U(π_T) - ε within N = O(poly(1/ε, 1/δ, H, R_max, ·)) iterations with probability 1 - δ, it suffices that N_teacher = Ω(poly(1/ε, 1/δ, H, R_max, ·)) and N_exploit = Ω(poly(1/ε, 1/δ, H, R_max, ·)). Here · stands for |S|, |A| in the discrete case and for n_S, n_A, ||G||_Fro, ||H||_Fro in the continuous case. In other words, to perform as well as the teacher, it suffices to have a polynomial number of iterations, a polynomial number of teacher demonstrations, and a polynomial number of trials with each exploitation policy. Take-home message: so long as a demonstration is available, it is not necessary to explicitly explore; it suffices to only exploit.

Proof idea
From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the state-action space (s,a) visited by the pilot. Our model/simulator will correctly predict the helicopter's behavior under the pilot's policy π_T. Consequently, there is at least one policy (namely π_T) that looks capable of flying the helicopter well in our simulation. Thus, each time we solve the MDP using the current model/simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa. If, on the actual helicopter, this policy fails to fly the helicopter, despite the model P_sa predicting that it should, then it must be visiting parts of the state space that are inaccurately modeled. Hence, we get useful training data to improve the model. This can happen only a small number of times.

Learning with non-IID samples
(IID = independent and identically distributed.) In our algorithm, all future states depend on the current state; the exploitation policies depend on the states visited; the states visited depend on past exploitation policies; hence exploitation policies depend on past exploitation policies. This is a very complicated non-IID sample generating process, so standard learning theory convergence bounds (e.g., Hoeffding inequalities) cannot be used in our setting. Instead we use martingales, Azuma's inequality, and the optional stopping theorem.
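
For reference, a standard form of Azuma's inequality (the general textbook statement, not the specific bound derived in the paper): if Z^(0), Z^(1), … is a martingale with bounded differences |Z^(k) - Z^(k-1)| ≤ c_k for all k, then for every t > 0,

\[
\Pr\!\big( |Z^{(n)} - Z^{(0)}| \ge t \big) \;\le\; 2 \exp\!\left( \frac{-t^2}{2 \sum_{k=1}^{n} c_k^2} \right).
\]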

Related work
Schaal & Atkeson, 1994: open-loop policy as a starting point for devil-sticking, slow exploration of the state space. Smart & Kaelbling, 2000: model-free Q-learning, initial updates based on a teacher. Supervised learning of a policy from demonstration, e.g., Sammut et al. (1992); Pomerleau (1989); Kuniyoshi et al. (1994); Amit & Mataric (2002). Apprenticeship learning for an unknown reward function (Abbeel & Ng, 2004).

Conclusion
Reinforcement learning in systems with unknown dynamics. Algorithms such as E^3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems. We show that this explicit exploration step is unnecessary in apprenticeship learning, when we have an initial teacher demonstration of the task. We attain near-optimal performance (compared to the teacher) simply by repeatedly executing "exploitation policies" that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linearly parameterized dynamical systems, it scales polynomially in the dimension of the state space.

End of talk; additional slides for the poster follow.

Samples from teacher
Dynamics model: s_{t+1} = G φ(s_t) + H a_t + w_t. Parameter estimates after k samples: (G^(k), H^(k)) = arg min_{G,H} loss^(k)(G,H) = arg min_{G,H} Σ_{t=0}^{k} ( s_{t+1} - (G φ(s_t) + H a_t) )^2. Consider Z^(k) = loss^(k)(G,H) - E[loss^(k)(G,H)]. Then E[Z^(k) | history up to time k-1] = Z^(k-1). Thus Z^(0), Z^(1), … is a martingale sequence. Using Azuma's inequality (a standard martingale result), we prove convergence.

Samples from exploitation policies
Consider Z^(k) = exp( loss^(k)(G*, H*) - loss^(k)(G, H) ). Then E[Z^(k) | history up to time k-1] = Z^(k-1). Thus Z^(0), Z^(1), … is a martingale sequence. Using the optional stopping theorem (a standard martingale result), we prove that the true parameters G*, H* outperform G, H with high probability, for all k = 0, 1, …
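
For reference, one standard form of the optional stopping theorem invoked here (the general textbook statement, not the paper's specific application): if Z^(0), Z^(1), … is a martingale and τ is an almost surely bounded stopping time, then

\[
\mathbb{E}\big[ Z^{(\tau)} \big] \;=\; \mathbb{E}\big[ Z^{(0)} \big].
\]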