Learning for Control from Multiple Demonstrations
Adam Coates, Pieter Abbeel, and Andrew Y. Ng
Stanford University, ICML 2008
heli.stanford.edu

Motivating example
How do we specify a task like this?

Introduction
We want a robot to follow a desired trajectory.
(Pipeline diagram: Data → Dynamics Model; Trajectory + Penalty Function → Reward Function; Dynamics Model and Reward Function feed Reinforcement Learning, which outputs a Policy.)

Key difficulties
- It is often very difficult to specify the trajectory by hand: it is hard to articulate exactly how a task is performed, and the trajectory should obey the system dynamics.
- Idea: use an expert demonstration as the trajectory. But getting perfect demonstrations is hard.
- So: use multiple suboptimal demonstrations.

Outline
- A generative model for multiple suboptimal demonstrations.
- A learning algorithm that extracts the intended trajectory and a high-accuracy dynamics model.
- Experimental results: this enabled us to fly autonomous helicopter aerobatics well beyond the capabilities of any other autonomous helicopter.

Expert demonstrations: Airshow

Graphical model
- The intended trajectory satisfies the dynamics.
- The expert trajectory is a noisy observation of one of the hidden states, but we don't know exactly which one.
(Figure: graphical model linking the intended trajectory, the expert demonstrations, and the time indices.)
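In equations, the model on this slide can be sketched as follows; the symbols z, y, τ and the Gaussian noise terms are our own notation and assumptions consistent with the slide, not taken verbatim from the paper:

```latex
% Sketch of the generative model (our notation, assumed Gaussian noise):
%   z_t        : hidden intended-trajectory state at time t
%   y^{(i)}_k  : k-th sample of expert demonstration i
%   tau^{(i)}_k: hidden time index matching that sample to the intended trajectory
\begin{align*}
  z_{t+1}   &= f(z_t) + \omega_t,              & \omega_t    &\sim \mathcal{N}(0, \Sigma_z)\\
  y^{(i)}_k &= z_{\tau^{(i)}_k} + \nu^{(i)}_k, & \nu^{(i)}_k &\sim \mathcal{N}(0, \Sigma_y)\\
  \tau^{(i)}_{k+1} &\ge \tau^{(i)}_k           &             &\text{(time indices advance monotonically)}
\end{align*}
```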

Learning algorithm
Similar models appear in speech processing and genetic sequence alignment; see, e.g., Listgarten et al., 2005.
Maximize the likelihood of the demonstration data over:
- the intended trajectory states,
- the time index values,
- the variance parameters for the noise terms,
- the time index distribution parameters.

Learning algorithm
If τ is unknown, inference is hard; if τ is known, we have a standard HMM. So, make an initial guess for τ and alternate between:
- Fix τ; run EM on the resulting HMM.
- Choose a new τ using dynamic programming.
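To make the alternation concrete, here is a minimal, self-contained sketch in Python. It substitutes a DTW-style squared-error alignment for the paper's HMM likelihood and a simple average for the EM smoothing step, so the function names and modeling choices are illustrative assumptions rather than the paper's exact algorithm:

```python
import numpy as np

def align_demo(intended, demo):
    """Dynamic-programming step: map each intended-trajectory step t to a
    demo time index tau[t], monotonically non-decreasing (a DTW-style
    stand-in for choosing tau by dynamic programming)."""
    T, K = len(intended), len(demo)
    d = np.array([[np.sum((intended[t] - demo[k]) ** 2) for k in range(K)]
                  for t in range(T)])                       # pointwise costs
    cost = np.full((T, K), np.inf)
    back = np.zeros((T, K), dtype=int)
    cost[0] = d[0]
    for t in range(1, T):
        for k in range(K):
            lo = max(k - 2, 0)                              # tau may advance by 0, 1, or 2
            j = lo + int(np.argmin(cost[t - 1, lo:k + 1]))
            cost[t, k] = d[t, k] + cost[t - 1, j]
            back[t, k] = j
    tau = np.zeros(T, dtype=int)
    tau[-1] = int(np.argmin(cost[-1]))
    for t in range(T - 2, -1, -1):
        tau[t] = back[t + 1, tau[t + 1]]
    return tau

def reestimate_intended(demos, taus):
    """Simplified 'EM' step: with the alignments fixed, re-estimate the
    intended trajectory as the average of the aligned demo states (the paper
    instead runs EM / smoothing under the dynamics model)."""
    T = len(taus[0])
    return np.stack([np.mean([demo[tau[t]] for demo, tau in zip(demos, taus)], axis=0)
                     for t in range(T)])

def learn_trajectory(demos, n_iters=10):
    """Alternate: fix the trajectory and update tau (DP); fix tau and update
    the trajectory."""
    intended = demos[0].copy()                              # initial guess
    for _ in range(n_iters):
        taus = [align_demo(intended, demo) for demo in demos]
        intended = reestimate_intended(demos, taus)
    return intended, taus
```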

Details: Incorporating prior knowledge
We might have some limited knowledge about how the trajectory should look:
- Flips and rolls should stay in place.
- Vertical loops should lie in a vertical plane.
- The pilot tends to "drift" away from the intended trajectory.

Results: Time-aligned demonstrations
The white helicopter is the inferred "intended" trajectory.

Results: Loops
Even without prior knowledge, the inferred trajectory is much closer to an ideal loop.

Recap
(Same pipeline diagram as before: Data → Dynamics Model; Trajectory + Penalty Function → Reward Function; Reinforcement Learning → Policy.)

Standard modeling approach
- Collect data: the pilot attempts to cover all flight regimes.
- Build a global model of the dynamics.
Problem: 3G error!

Errors aligned over time
The errors observed in the "crude" model are clearly consistent after aligning the demonstrations.

Model improvement
Key observation: if we fly the same trajectory repeatedly, the errors are consistent over time once we align the data.
- There are many hidden variables that we can't expect to model accurately: air (!), rotor speed, actuator delays, etc.
- If we fly the same trajectory repeatedly, these hidden variables tend to be the same each time.

Trajectory-specific local models
- Learn a locally weighted model from the aligned demonstration data.
- Since the data is aligned in time, we can weight by time to exploit the repeatability of the hidden variables. For the model at time t: W(t') = exp(-(t - t')^2 / σ^2).
- This suggests an algorithm that alternates between: learning the trajectory from the demonstrations, and building new models from the aligned data.
- In fact, an improved model can be inferred jointly during trajectory learning.
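For illustration, here is a minimal sketch of fitting such a time-indexed, locally weighted dynamics model from aligned data; the linear form x_{t+1} ≈ A x_t + B u_t + c, the ridge term, and all variable names are assumptions for the sketch, not the paper's exact model class:

```python
import numpy as np

def local_model_at(t, X, U, Xnext, times, sigma=5.0, reg=1e-3):
    """Weighted ridge regression for the model at time t:
    x_{t+1} ~ A x_t + B u_t + c, with weights w(t') = exp(-(t - t')^2 / sigma^2)."""
    w = np.exp(-((times - t) ** 2) / sigma ** 2)            # time-based weights
    Phi = np.hstack([X, U, np.ones((len(X), 1))])           # design matrix [x, u, 1]
    Theta = np.linalg.solve(Phi.T @ (w[:, None] * Phi) + reg * np.eye(Phi.shape[1]),
                            Phi.T @ (w[:, None] * Xnext))   # shape (dx+du+1, dx)
    dx, du = X.shape[1], U.shape[1]
    A, B, c = Theta[:dx].T, Theta[dx:dx + du].T, Theta[-1]
    return A, B, c
```

Here X, U, Xnext, and times would hold the time-aligned states, controls, next states, and time indices pooled across demonstrations; fitting one such model per time step along the trajectory gives the trajectory-specific local models.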

Experiment setup
- The expert demonstrates an aerobatic sequence several times.
- The inference algorithm extracts the intended trajectory and the local models used for control.
- We use a receding-horizon DDP controller, which generates a sequence of closed-loop feedback controllers given a trajectory and a quadratic penalty.
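As a rough illustration of receding-horizon control around the reference trajectory, the sketch below uses a finite-horizon, time-varying LQR backward pass on linearized dynamics as a stand-in for the paper's DDP; the linearize function, the cost matrices, and the horizon length are illustrative assumptions:

```python
import numpy as np

def tv_lqr(A_seq, B_seq, Q, R, Qf):
    """Finite-horizon, time-varying LQR: backward Riccati recursion for
    x_{t+1} = A_t x_t + B_t u_t with stage cost x'Qx + u'Ru and terminal x'Qf x."""
    P = Qf
    gains = []
    for A, B in zip(reversed(A_seq), reversed(B_seq)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # feedback gain at this step
        P = Q + A.T @ P @ (A - B @ K)                       # Riccati update
        gains.append(K)
    return gains[::-1]                                      # K_0, ..., K_{H-1}

def receding_horizon_step(x, t, x_ref, u_ref, linearize, Q, R, horizon=20):
    """One control step: linearize the local models about the reference over a
    short horizon, compute feedback gains, and apply only the first one."""
    A_seq, B_seq = zip(*(linearize(x_ref[t + k], u_ref[t + k]) for k in range(horizon)))
    K = tv_lqr(list(A_seq), list(B_seq), Q, R, Qf=Q)
    return u_ref[t] - K[0] @ (x - x_ref[t])                 # feedback around the reference
```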

Related work
Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry, 2004a (2001); Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al.; Gavrilets, Martinos, Mettler & Feron, 2002; Ng et al., 2004b; Abbeel, Coates, Quigley & Ng.
The maneuvers presented here are significantly more challenging and more diverse than those performed by any other autonomous helicopter.

Results: Autonomous airshow

Results: Flight accuracy

Conclusion
Our algorithm leverages multiple expert demonstrations to:
- infer the intended trajectory, and
- learn better models along the trajectory for control.
This is the first autonomous helicopter to perform extreme aerobatics at the level of an expert human pilot.

Discussion

Challenges
The expert often takes suboptimal paths (e.g., loops).

Challenges
The timing of each demonstration is different.

Learning algorithm
Step 1: Find the time indices and the distributional parameters. We use EM and a dynamic programming algorithm to optimize over the different parameters in alternation.
Step 2: Find the most likely intended trajectory.

Example: prior knowledge
Incorporating prior knowledge allows us to improve the trajectory.

Results: Time alignment
Time alignment removes variations in the expert's timing.