An Application of Reinforcement Learning to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng, Stanford University

Presentation transcript:

Overview
- Challenges in reinforcement learning for complex physical systems such as helicopters:
  - Data collection: aggressive exploration is dangerous.
  - It is difficult to specify the proper reward function for a given task.
- We present apprenticeship learning algorithms which use an expert demonstration and which:
  - Do not require explicit exploration.
  - Do not require an explicit reward function specification.
- Experimental results:
  - Demonstrate the effectiveness of the algorithms on a highly challenging control problem.
  - Significantly extend the state of the art in autonomous helicopter flight: in particular, the first autonomous completion of stationary forward flips, stationary sideways rolls, nose-in funnels and tail-in funnels.

Apprenticeship Learning for the Dynamics Model
- Key question: How should we fly the helicopter for data collection? How do we ensure that the entire flight envelope is covered by the data collection process?
- State of the art: the E3 algorithm of Kearns and Singh (2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). These algorithms repeatedly ask: do we have a good model of the dynamics? If not, "explore"; if so, "exploit".
- Can we avoid explicit exploration? Our procedure (see the first sketch below):
  1. Collect expert human pilot flight data (a1, s1, a2, s2, a3, s3, ...) and learn the dynamics model P_sa.
  2. Run reinforcement learning with the dynamics model P_sa and reward function R to obtain a control policy π.
  3. Fly autonomously with π, add the resulting flight data (a1, s1, a2, s2, a3, s3, ...) to the data set, re-learn P_sa, and repeat.
- Take-away message: in the apprenticeship learning setting, i.e., when we have an expert demonstration, we do not need explicit exploration to perform as well as the expert.
- Theorem. Given a polynomial number of teacher demonstrations, the apprenticeship learning algorithm returns a policy that performs as well as the teacher within a polynomial number of iterations. [See Abbeel & Ng, 2005 for details.]

Apprenticeship Learning for the Reward Function
- The reward function can be very difficult to specify. E.g., for our helicopter control problem we have:
  R(s) = c1*(position error)^2 + c2*(orientation error)^2 + c3*(velocity error)^2 + c4*(angular rate error)^2 + ... + c25*(inputs)^2.
- Can we avoid the need to specify the reward function?
- Our approach [Abbeel & Ng, 2004] is based on inverse reinforcement learning [Ng & Russell, 2000]. It returns a policy whose performance is as good as the expert's, as measured by the expert's unknown reward function, within a polynomial number of iterations.
- Inverse RL algorithm (see the second sketch below):
  For t = 1, 2, ...
  - Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
  - RL step: compute the optimal policy π_t for the estimated reward w.
- Related work:
  - Imitation learning: learn to predict the expert's actions as a function of the state; usually lacks strong performance guarantees (e.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997).
  - Max-margin planning (Ratliff et al.).
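As a concrete illustration of the data-collection loop described above, here is a minimal Python sketch. The callables (fit_dynamics, rl_solver, fly) and the overall code structure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def apprenticeship_dynamics_loop(expert_trajectories, reward_fn, rl_solver,
                                 fit_dynamics, fly, n_iters=10):
    """Sketch of apprenticeship learning for the dynamics model.

    expert_trajectories: list of (state, action) sequences flown by the expert.
    reward_fn:    reward function R(s), assumed given in this sketch.
    rl_solver:    planner/RL routine mapping (dynamics model, reward) -> policy.
    fit_dynamics: regression routine mapping trajectory data -> model P_sa.
    fly:          runs a policy on the helicopter (or simulator) and returns
                  the resulting trajectory.
    All of these callables are placeholders for illustration.
    """
    data = list(expert_trajectories)           # start from the expert's flights only
    policy = None
    for _ in range(n_iters):
        P_sa = fit_dynamics(data)              # learn dynamics from all data so far
        policy = rl_solver(P_sa, reward_fn)    # RL / optimal control in the learned model
        trajectory = fly(policy)               # autonomous flight, no explicit exploration
        data.append(trajectory)                # fold the new data back in and repeat
    return policy
```

Under the theorem quoted above, iterating this loop on the expert's demonstrations plus the policy's own flight data is enough to match the expert's performance without a separate exploration phase.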
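The inverse RL loop can be made concrete with the projection variant described in Abbeel & Ng (2004), which only needs the feature expectations μ(π) = E[Σ_t γ^t φ(s_t)] of each policy. The helper callables below (rl_solver, estimate_mu) are placeholders for a planner and a Monte Carlo estimator; this is a sketch under those assumptions, not the flight code.

```python
import numpy as np


def irl_projection(mu_expert, rl_solver, estimate_mu, pi0, eps=1e-3, max_iters=50):
    """Projection variant of apprenticeship learning via inverse RL
    (Abbeel & Ng, 2004).

    mu_expert:   expert's feature expectations, estimated from demonstrations.
    rl_solver:   maps reward weights w (reward R(s) = w . phi(s)) to an optimal policy.
    estimate_mu: maps a policy to its feature expectations mu(pi).
    pi0:         arbitrary initial policy.
    """
    policies = [pi0]
    mu = estimate_mu(pi0)               # feature expectations of the initial policy
    mu_bar = mu.copy()
    w = mu_expert - mu_bar
    for _ in range(max_iters):
        w = mu_expert - mu_bar          # reward weights under which the expert beats
        if np.linalg.norm(w) <= eps:    # all previously found policies
            break
        pi = rl_solver(w)               # RL step: optimal policy for R(s) = w . phi(s)
        policies.append(pi)
        mu = estimate_mu(pi)
        d = mu - mu_bar                 # project mu_expert onto the segment direction
        mu_bar = mu_bar + ((d @ (mu_expert - mu_bar)) / (d @ d)) * d
    return policies, w
```

The guarantee rests on a simple fact: if a policy's feature expectations are within ε of μ_expert, its value is within ||w||·ε of the expert's value for every reward of the form R(s) = w^T φ(s) with bounded w, including the expert's unknown one.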

Experimental Results
- Maneuvers flown autonomously (video available): stationary rolls, stationary flips, tail-in funnels, nose-in funnels.

Conclusion
- Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
- Our procedure based on apprenticeship learning (inverse RL) for the reward function gives performance similar to that of human pilots.
- Our results significantly extend the state of the art in autonomous helicopter flight: the first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.

RL/Optimal Control
- We use differential dynamic programming.
- We penalize high-frequency controls.
- We include integrated orientation errors in the cost.
- (See the paper for more details; a sketch of such a cost follows below.)
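To make the cost-shaping bullets concrete, here is a minimal sketch of a quadratic tracking cost of the kind a DDP/iLQR-style optimizer could use, with a penalty on control differences to discourage high-frequency inputs and a term on an integrated orientation-error state. The weights, state layout, and function name are illustrative assumptions, not the values used on the helicopter.

```python
import numpy as np


def tracking_cost(x, u, u_prev, x_ref, q_state=1.0, q_int_orient=0.5,
                  r_input=0.1, r_smooth=1.0):
    """Per-time-step cost for a trajectory-following problem.

    x, x_ref : current and reference state vectors; here we assume the last
               entry of x is an integrated orientation error accumulated by
               an augmented dynamics model (illustrative assumption).
    u, u_prev: current and previous control inputs.
    """
    err = x - x_ref
    cost = q_state * float(err[:-1] @ err[:-1])              # position/orientation/velocity tracking
    cost += q_int_orient * float(err[-1] ** 2)               # integrated orientation error term
    cost += r_input * float(u @ u)                           # penalize large control inputs
    cost += r_smooth * float((u - u_prev) @ (u - u_prev))    # penalize high-frequency controls
    return cost
```

A differential dynamic programming solver would sum this cost over the planning horizon and use its local quadratic approximations around the current trajectory to compute the control updates.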