An Application of Reinforcement Learning to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng
Stanford University

Overview
Autonomous helicopter flight is widely accepted to be a highly challenging control/reinforcement learning (RL) problem.
Human expert pilots significantly outperform autonomous helicopters.
Apprenticeship learning algorithms use expert demonstrations to obtain good controllers.
Our experimental results significantly extend the state of the art in autonomous helicopter aerobatics.

Apprenticeship learning and RL
Pipeline: dynamics model P_sa + reward function R → reinforcement learning → control policy π.
Unknown dynamics: flight data is required to obtain an accurate model.
Hard to specify the reward function for complex tasks such as helicopter aerobatics.
Apprenticeship learning: uses an expert demonstration to help select the model and the reward function.
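To make the pipeline concrete, the following is a minimal sketch (not the paper's implementation, which works in a continuous state/action space) of how a dynamics model P_sa and a reward function R produce a policy π, using plain value iteration on a small discrete MDP:

    import numpy as np

    def value_iteration(P, R, gamma=0.95, iters=500):
        """Discrete-MDP sketch of the RL box: P[s, a, s'] is the model P_sa, R[s] the reward."""
        n_states, n_actions, _ = P.shape
        Q = np.zeros((n_states, n_actions))
        V = np.zeros(n_states)
        for _ in range(iters):
            # Q[s, a] = R[s] + gamma * sum_s' P[s, a, s'] * V[s']
            Q = R[:, None] + gamma * P @ V
            V = Q.max(axis=1)
        return Q.argmax(axis=1)  # control policy pi: maps each state to an action

    # Toy usage with made-up numbers: 3 states, 2 actions, random row-stochastic dynamics.
    rng = np.random.default_rng(0)
    P = rng.random((3, 2, 3))
    P /= P.sum(axis=2, keepdims=True)
    R = np.array([0.0, 0.0, 1.0])
    pi = value_iteration(P, R)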

Learning the dynamical model
State of the art: the E^3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
E^3 repeatedly asks: do we have a good model of the dynamics? If NO, "explore"; if YES, "exploit".
Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?
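The heart of E^3-style methods is a "known state" test, sketched below; the visit-count threshold and the least-tried exploration rule are illustrative stand-ins, not the exact construction from Kearns and Singh (2002):

    from collections import defaultdict

    M_KNOWN = 20                      # illustrative threshold: visits before a pair counts as "known"
    visit_counts = defaultdict(int)   # counts for (state, action) pairs

    def choose_action(state, actions, exploit_policy):
        """E^3-flavoured decision: exploit where the model is accurate, otherwise explore.

        exploit_policy is assumed to come from planning in the learned model;
        the exploration rule here simply picks the least-tried action.
        """
        counts = {a: visit_counts[(state, a)] for a in actions}
        if min(counts.values()) >= M_KNOWN:
            action = exploit_policy(state)          # "Exploit": model is good here
        else:
            action = min(counts, key=counts.get)    # "Explore": model is uncertain here
        visit_counts[(state, action)] += 1
        return action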

Aggressive manual exploration

Apprenticeship learning of the model [Abbeel & Ng, 2005]
1. Record the expert human pilot's flights (s_1, a_1, s_2, a_2, s_3, a_3, …) and learn a dynamics model P_sa.
2. Run reinforcement learning with the learned dynamics model P_sa and the reward function R to obtain a control policy π.
3. Fly autonomously with π, record the new flight data (s_1, a_1, s_2, a_2, s_3, a_3, …), re-learn P_sa, and repeat from step 2.
Theorem. The described procedure returns a policy as good as the expert's policy after a polynomial number of iterations.
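In code form the procedure is a short loop; the callables fit_dynamics_model, solve_rl and fly_and_record below are hypothetical placeholders for the paper's model-fitting, controller-design and flight steps, so this is a sketch of the structure only:

    def apprenticeship_model_learning(expert_data, reward,
                                      fit_dynamics_model, solve_rl, fly_and_record,
                                      n_iters=10):
        """Sketch of the loop above; expert_data is a list of (s, a, s') transitions."""
        data = list(expert_data)
        policy = None
        for _ in range(n_iters):
            P_sa = fit_dynamics_model(data)   # learn P_sa from all flight data collected so far
            policy = solve_rl(P_sa, reward)   # compute a controller in the learned model
            data += fly_and_record(policy)    # fly autonomously and add the new transitions
        return policy                         # note: no explicit exploration step anywhere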

Learning the dynamics model
Details of the algorithm for learning the dynamics model:
Gravity subtraction [Abbeel, Ganapathi & Ng, 2005]
Lagged criterion [Abbeel & Ng, 2004]
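A rough sketch of the gravity-subtraction idea (an assumed interpretation; the exact parameterization in [Abbeel, Ganapathi & Ng, 2005] differs in detail): rotate the known gravity vector into the body frame and remove it from the acceleration targets before fitting the model, so the learned dynamics only have to explain what gravity does not.

    import numpy as np

    G_WORLD = np.array([0.0, 0.0, -9.81])   # gravity in the world frame, m/s^2

    def gravity_subtracted_targets(accels_body, R_world_to_body):
        """Remove the known gravity contribution before fitting the dynamics model.

        accels_body:      (T, 3) measured accelerations in the body frame
        R_world_to_body:  (T, 3, 3) rotations from world frame to body frame
        Returns the residual accelerations the learned model must predict.
        """
        g_body = np.einsum('tij,j->ti', R_world_to_body, G_WORLD)
        return accels_body - g_body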

Autonomous nose-in funnel

Autonomous tail-in funnel

Apprenticeship learning: reward
(Pipeline recap: dynamics model P_sa + reward function R → reinforcement learning → control policy π.)
Hard to specify the reward function for complex tasks such as helicopter aerobatics.

Example task: flip
Ideal flip: rotate 360 degrees around the horizontal axis going right to left through the helicopter.
(Figure: vectors g and T shown at successive orientations through the flip.)

Example task: flip (2)
Specify the flip task as: an idealized trajectory, plus a reward function that penalizes deviation from it.
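A minimal sketch of such a reward (weights are made up; the paper's reward over position, orientation, velocity and angular rate is more detailed): the negative weighted squared deviation of the current state from the idealized trajectory's target state at that time step.

    import numpy as np

    def tracking_reward(state, target_state, weights):
        """Reward = negative weighted squared deviation from the idealized trajectory."""
        err = state - target_state
        return -float(np.dot(weights * err, err))

    # Example with made-up numbers: penalize position error more than velocity error.
    w = np.array([10.0, 10.0, 10.0, 1.0, 1.0, 1.0])
    r = tracking_reward(np.zeros(6), 0.1 * np.ones(6), w)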

Example of a bad reward function

Apprenticeship learning for the reward function
Our approach: observe the expert's demonstration of the task and infer a reward function from the demonstration [see also Ng & Russell, 2000].
Algorithm: iterate for t = 1, 2, …
Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert outperforms all previously found policies {π_i}.
RL step: compute the optimal policy π_t for the estimated reward function.
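A sketch of this iteration, written as the projection variant of the algorithm from [Abbeel & Ng, 2004]; solve_rl and feature_expectations are caller-supplied placeholders for the RL solver and for estimating a policy's expected discounted feature counts:

    import numpy as np

    def apprenticeship_irl(mu_expert, initial_policy, solve_rl, feature_expectations,
                           n_iters=20, eps=1e-3):
        """Iterate the inverse RL step and the RL step (projection variant).

        mu_expert:            expert's feature expectations E[sum_t gamma^t phi(s_t)]
        solve_rl(w):          returns an optimal policy for reward R(s) = w . phi(s)
        feature_expectations: maps a policy to its feature expectations
        """
        mu_bar = feature_expectations(initial_policy)
        policies = [initial_policy]
        for _ in range(n_iters):
            w = mu_expert - mu_bar            # inverse RL step: reward weights under which the
            if np.linalg.norm(w) <= eps:      # expert outperforms all previously found policies
                break
            policy = solve_rl(w)              # RL step: optimal policy for the estimated reward
            mu = feature_expectations(policy)
            policies.append(policy)
            d = mu - mu_bar                   # projection update of the running estimate mu_bar
            mu_bar = mu_bar + ((mu_expert - mu_bar) @ d) / (d @ d) * d
        return policies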

Theoretical Results: Convergence
Theorem. After a number of iterations polynomial in the number of features and the horizon, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s). [Abbeel & Ng, 2004]

Overview
(Pipeline recap: dynamics model P_sa + reward function R → reinforcement learning → control policy π.)

Optimal control algorithm
Differential dynamic programming (DDP) [Jacobson & Mayne, 1970; Anderson & Moore, 1989]: an efficient algorithm to (locally) optimize a policy for continuous state/action spaces.
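DDP repeatedly linearizes the dynamics and quadratizes the cost around the current nominal trajectory, then solves the resulting time-varying LQR problem. A minimal sketch of that inner LQR backward pass (standard Riccati recursion, not the paper's exact DDP code) follows; A, B, Q, R_cost are the linearized and quadratized matrices along the trajectory:

    import numpy as np

    def lqr_backward_pass(A, B, Q, R_cost, Q_terminal):
        """Finite-horizon LQR: returns time-varying gains K_t with u_t = -K_t x_t.

        A, B:      lists of linearized dynamics matrices, x_{t+1} ≈ A_t x_t + B_t u_t
        Q, R_cost: lists of quadratic state and control cost matrices
        DDP wraps a recursion like this around a nominal trajectory and iterates.
        """
        T = len(A)
        P = Q_terminal                       # cost-to-go at the final time
        gains = [None] * T
        for t in reversed(range(T)):
            K = np.linalg.solve(R_cost[t] + B[t].T @ P @ B[t], B[t].T @ P @ A[t])
            gains[t] = K
            A_cl = A[t] - B[t] @ K           # closed-loop dynamics under u = -K x
            P = Q[t] + K.T @ R_cost[t] @ K + A_cl.T @ P @ A_cl
        return gains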

DDP design choices and lessons learned
Simplest reward function: penalize deviation from the target states at each time. Insufficient: the resulting controllers perform very poorly.
Penalizing high-frequency control inputs significantly improves the controllers.
To allow aggressive maneuvering, we use a two-step procedure: make a plan off-line, then penalize high-frequency deviations from the planned inputs.
Penalize integrated orientation error. [See paper for details.]
Process noise has little influence on the controllers' performance; observation noise and delay in observations greatly affect it.
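A sketch of a per-time-step cost reflecting those lessons (terms and weights are illustrative, not the paper's exact cost): penalize deviation from the planned state, changes in the controls between consecutive steps (the high-frequency penalty), and the integrated orientation error.

    import numpy as np

    def step_cost(x, x_ref, u, u_prev, orient_err_integral,
                  w_state=1.0, w_du=10.0, w_orient=5.0):
        """Illustrative cost combining the design choices above (made-up weights)."""
        state_term = w_state * np.sum((x - x_ref) ** 2)    # track the planned trajectory
        smooth_term = w_du * np.sum((u - u_prev) ** 2)     # penalize high-frequency control inputs
        orient_term = w_orient * orient_err_integral ** 2  # penalize integrated orientation error
        return state_term + smooth_term + orient_term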

Autonomous stationary flips

Autonomous stationary rolls

Related work
Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron.
The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.

Conclusion
Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
The procedure based on inverse RL for the reward function gives performance similar to human pilots.
Our results significantly extend the state of the art in autonomous helicopter flight: first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.

Acknowledgments
Ben Tse, Garett Oku, Antonio Genova.
Mark Woodward, Tim Worley.

Continuous flips

Funnel results
First high-speed autonomous funnels.
Speed: 5 m/s. Nominal pitch angle: 30 degrees.