An Application of Reinforcement Learning to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng (Stanford University)

Overview

Challenges in reinforcement learning for complex physical systems such as helicopters:
- Data collection: aggressive exploration is dangerous.
- It is difficult to specify the proper reward function for a given task.

We present apprenticeship learning algorithms, which use an expert demonstration and:
- do not require explicit exploration;
- do not require an explicit reward function specification.

Experimental results:
- demonstrate the effectiveness of the algorithms on a highly challenging control problem;
- significantly extend the state of the art in autonomous helicopter flight: in particular, the first completion of autonomous stationary forward flips, stationary sideways rolls, nose-in funnels and tail-in funnels.

Reinforcement Learning and Apprenticeship Learning

Standard reinforcement learning takes two inputs, a dynamics model P_sa and a reward function R, and computes a control policy:

  Dynamics Model P_sa + Reward Function R  ->  Reinforcement Learning  ->  Control policy

Both inputs are problematic for complex physical systems:
- Data collection for the dynamics model: aggressive exploration is dangerous.
- Complex tasks: hard to specify the reward function.

Apprenticeship Learning for the Dynamics Model

Key question: How do we fly the helicopter for data collection? How do we ensure that the entire flight envelope is covered by the data collection process?

State of the art: the E^3 algorithm, Kearns and Singh (2002), and its variants/extensions (Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). These algorithms repeatedly ask: do we have a good model of the dynamics? If no, "explore"; if yes, "exploit".

Can we avoid explicit exploration? Our approach alternates model learning and flight (a code sketch of this loop appears at the end of this slide):
- Expert human pilot flight (a_1, s_1, a_2, s_2, a_3, s_3, ...)  ->  learn P_sa.
- Run reinforcement learning with P_sa and R to obtain a control policy.
- Autonomous flight with that policy (a_1, s_1, a_2, s_2, a_3, s_3, ...)  ->  relearn P_sa, and repeat.

Take-away message: in the apprenticeship learning setting, i.e., when we have an expert demonstration, we do not need explicit exploration to perform as well as the expert.

Theorem. Given a polynomial number of teacher demonstrations, the apprenticeship learning algorithm returns a policy that performs as well as the teacher within a polynomial number of iterations. [See Abbeel & Ng, 2005 for details.]

Apprenticeship Learning for the Reward Function

The reward function can be very difficult to specify. E.g., for our helicopter control problem we have:

  R(s) = c_1 (position error)^2 + c_2 (orientation error)^2 + c_3 (velocity error)^2 + c_4 (angular rate error)^2 + ... + c_25 (inputs)^2.

Can we avoid the need to specify the reward function?

Our approach [Abbeel & Ng, 2004] is based on inverse reinforcement learning [Ng & Russell, 2000]. It returns a policy whose performance is as good as the expert's, as measured by the expert's unknown reward function, within a polynomial number of iterations.

Inverse RL algorithm (a code sketch follows at the end of this slide). For t = 1, 2, ...
- Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
- RL step: compute the optimal policy π_t for the estimated reward w.

Related work:
- Imitation learning: learn to predict the expert's actions as a function of states; usually lacks strong performance guarantees. [E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997.]
- Max margin planning, Ratliff et al., 2006.
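The loop in the Apprenticeship Learning for the Dynamics Model section can be written compactly in code. The following Python sketch is purely illustrative: the callables fit_dynamics_model, run_rl and fly_autonomously are hypothetical placeholders supplied by the caller, not part of the poster.

```python
# Illustrative sketch only: the three callables are hypothetical placeholders,
# not the authors' implementation.

def apprenticeship_dynamics_loop(expert_trajectories, reward_fn,
                                 fit_dynamics_model, run_rl, fly_autonomously,
                                 n_iters=10):
    """Alternate between (re)learning the dynamics model P_sa from all flight
    data collected so far and flying the policy obtained by RL on that model.
    There is no explicit exploration step: coverage of the flight envelope is
    assumed to come from the expert demonstration."""
    data = list(expert_trajectories)           # (s, a, s') transitions from the human pilot
    policy = None
    for _ in range(n_iters):
        P_sa = fit_dynamics_model(data)        # learn P_sa from all data so far
        policy = run_rl(P_sa, reward_fn)       # RL / optimal control on the learned model
        new_data = fly_autonomously(policy)    # autonomous flight: more (s, a, s') data
        data.extend(new_data)                  # add the new data and repeat
    return policy
```

This is the contrast with E^3-style algorithms drawn on the poster: no separate "explore" policy is ever flown.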
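The Inverse RL Algorithm box alternates an inverse RL step with an RL step. Below is a minimal Python sketch of one concrete way to realize it, the projection-style update from Abbeel & Ng (2004), which chooses w so that the expert's feature expectations beat those of all previously found policies. The helpers estimate_mu and rl_step, the tolerance tol, and the iteration cap are assumptions of this sketch, not details from the poster.

```python
import numpy as np

def apprenticeship_irl(mu_expert, estimate_mu, rl_step, n_iters=20, tol=1e-3):
    """mu_expert    : expert feature expectations E[sum_t gamma^t phi(s_t)] (length-k vector).
    estimate_mu(pi) : feature expectations of a policy (e.g. a Monte Carlo estimate).
    rl_step(w)      : (approximately) optimal policy for the reward R(s) = w . phi(s)."""
    pi = rl_step(np.zeros_like(mu_expert))     # arbitrary initial policy
    mu_bar = estimate_mu(pi)                   # feature expectations of the best mixture so far
    policies = [pi]
    w = mu_expert - mu_bar
    for _ in range(n_iters):
        if np.linalg.norm(w) <= tol:           # expert (nearly) matched: stop
            break
        pi = rl_step(w)                        # RL step: optimal policy pi_t for reward w
        policies.append(pi)
        mu = estimate_mu(pi)
        # Inverse RL step (projection form): move mu_bar toward mu_expert along mu - mu_bar,
        # then point w from the mixture toward the expert, so that under R(s) = w . phi(s)
        # the expert scores higher than every policy found so far.
        d = mu - mu_bar
        mu_bar = mu_bar + (np.dot(d, mu_expert - mu_bar) / np.dot(d, d)) * d
        w = mu_expert - mu_bar
    return policies, w
```

For the helicopter tasks on this poster, the RL step corresponds to the differential dynamic programming controller described under RL/Optimal Control on the next slide.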
Experimental Results

[Image panels: Stationary Flips, Stationary Rolls, Tail-In Funnels, Nose-In Funnels.] Video available.

Conclusion

- Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
- The procedure based on apprenticeship learning (inverse RL) for the reward function gives performance similar to human pilots.
- Our results significantly extend the state of the art in autonomous helicopter flight: the first autonomous completion of stationary flips and rolls, tail-in funnels and nose-in funnels.

ADDITIONAL REFERENCES (SPECIFIC TO AUTONOMOUS HELICOPTER FLIGHT)

[1] J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In International Conference on Robotics and Automation. IEEE, 2001.
[2] V. Gavrilets, I. Martinos, B. Mettler, and E. Feron. Control logic for automated aerobatic flight of miniature helicopter. In AIAA Guidance, Navigation and Control Conference, 2002.
[3] M. La Civita. Integrated Modeling and Robust Control for Full-Envelope Flight of Robotic Helicopters. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2003.
[4] M. La Civita, G. Papageorgiou, W. C. Messner, and T. Kanade. Design and flight testing of a high-bandwidth H-infinity loop shaping controller for a robotic helicopter. Journal of Guidance, Control, and Dynamics, 29(2):485-494, March-April 2006.
[5] B. Mettler, M. Tischler, and T. Kanade. System identification of small-size unmanned helicopter dynamics. In American Helicopter Society, 55th Forum, 1999.
[6] J. M. Roberts, P. I. Corke, and G. Buskey. Low-cost flight control system for a small autonomous helicopter. In IEEE Int'l Conf. on Robotics and Automation, 2003.
[7] S. Saripalli, J. Montgomery, and G. Sukhatme. Visually-guided landing of an unmanned aerial vehicle, 2003.

RL/Optimal Control

We use differential dynamic programming. We penalize high-frequency controls. We include integrated orientation errors in the cost. (See the paper for more details; an illustrative cost sketch follows below.)
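The RL/Optimal Control box lists three ingredients of the cost optimized by differential dynamic programming: the quadratic tracking and input penalties from R(s) on slide 1, a penalty on high-frequency controls, and an integrated orientation error term. The sketch below shows one plausible way to assemble such a per-time-step cost in Python; the weight names, the use of a control-difference term for the high-frequency penalty, and the exact grouping of terms are assumptions of this sketch, not the authors' coefficients.

```python
import numpy as np

def per_step_cost(pos_err, orient_err, vel_err, ang_rate_err,
                  u, u_prev, orient_err_integral, c):
    """Illustrative per-time-step cost: weighted squared tracking errors and inputs
    (as in R(s) on slide 1), plus a control-difference penalty standing in for the
    high-frequency control penalty, plus an integrated-orientation-error penalty.
    `c` is a dict of hypothetical weights, e.g. {"pos": 1.0, "orient": 1.0, ...}."""
    return (c["pos"]        * np.sum(pos_err ** 2)
            + c["orient"]   * np.sum(orient_err ** 2)
            + c["vel"]      * np.sum(vel_err ** 2)
            + c["ang_rate"] * np.sum(ang_rate_err ** 2)
            + c["inputs"]   * np.sum(u ** 2)
            + c["du"]       * np.sum((u - u_prev) ** 2)            # high-frequency control penalty
            + c["orient_i"] * np.sum(orient_err_integral ** 2))    # integrated orientation error
```

Differential dynamic programming then minimizes the sum of such per-step costs over the planning horizon under the learned dynamics model P_sa.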