
1 Apprenticeship Learning for Robotic Control Pieter Abbeel Stanford University Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley

2 Motivation for apprenticeship learning

3 Outline. Preliminaries: reinforcement learning. Apprenticeship learning algorithms. Experimental results on various robotic platforms.

4 Reinforcement learning (RL). Starting from state s_0, the system evolves under the dynamics P_sa: taking action a_0 leads to state s_1, action a_1 leads to s_2, and so on through a_{T-1} and s_T. Each visited state earns a reward, for a total of R(s_0) + R(s_1) + R(s_2) + … + R(s_{T-1}) + R(s_T). Example reward function: R(s) = -||s - s*||. Goal: pick actions over time so as to maximize the expected score E[R(s_0) + R(s_1) + … + R(s_T)]. Solution: a policy π which specifies an action for each possible state, for all times t = 0, 1, …, T.
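To make the notation above concrete, here is a minimal sketch (not from the slides) of estimating E[R(s_0) + … + R(s_T)] for a given policy by Monte Carlo rollouts in a toy discrete MDP; the transition table P, the reward R, and the policy are illustrative stand-ins.

```python
import numpy as np

# Toy finite-horizon MDP with discrete states and actions.
# P[s, a] is a probability distribution over next states (the dynamics P_sa).
rng = np.random.default_rng(0)
n_states, n_actions, T = 5, 2, 10
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
s_star = 0                                # target state
R = lambda s: -abs(s - s_star)            # example reward R(s) = -||s - s*||

def expected_return(policy, n_rollouts=1000, s0=4):
    """Monte Carlo estimate of E[R(s_0) + R(s_1) + ... + R(s_T)] under the policy."""
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        score = R(s)
        for t in range(T):
            a = policy(s, t)                      # the policy picks an action for each state and time
            s = rng.choice(n_states, p=P[s, a])   # sample the next state from P_sa
            score += R(s)
        total += score
    return total / n_rollouts

print(expected_return(lambda s, t: 0))            # value of the policy that always takes action 0
```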

5 Model-based reinforcement learning. Run the RL algorithm in a simulator to obtain a control policy π.

6 Reinforcement learning (RL). Apprenticeship learning algorithms use a demonstration to help us find a good dynamics model, a good reward function, and a good control policy. [Diagram: Dynamics Model P_sa and Reward Function R feed into Reinforcement Learning, which outputs a Control policy π.]

7 Apprenticeship learning for the dynamics model. [Diagram: Dynamics Model P_sa and Reward Function R feed into Reinforcement Learning, which outputs a Control policy π.]

8 Motivating example: obtaining an accurate dynamics model P_sa. One route starts from a textbook model / specification; another collects flight data and learns the model from the data. Key questions: How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?

9 Learning the dynamical model. State of the art: the E3 algorithm, Kearns and Singh (2002) (and its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002). [Flowchart: have a good model of the dynamics? NO → "explore"; YES → "exploit".]

10 Learning the dynamical model (continued). Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?

11 Apprenticeship learning of the model [ICML 2005]. Teacher: a human pilot flight provides data (a_1, s_1, a_2, s_2, a_3, s_3, …), from which we learn P_sa. Running Reinforcement Learning with the learned Dynamics Model P_sa and the Reward Function R yields a Control policy π; autonomous flight with that policy provides more data (a_1, s_1, a_2, s_2, a_3, s_3, …), from which P_sa is learned again. No explicit exploration: always try to fly as well as possible.
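A minimal sketch of the loop just described, with hypothetical helper functions (`fit_dynamics`, `run_rl`, `fly`) standing in for the actual model class, RL solver, and helicopter; the point is only that every flight is flown with the best current policy (no explicit exploration) and is then folded back into the model estimate.

```python
# Illustrative sketch only; fit_dynamics, run_rl, and fly are hypothetical helpers.
def apprenticeship_model_learning(teacher_data, reward, n_iters=5):
    data = list(teacher_data)            # (s, a, s') triples from the human pilot flight
    policy = None
    for i in range(n_iters):
        P_sa = fit_dynamics(data)        # learn the dynamics model from all data so far
        policy = run_rl(P_sa, reward)    # RL in the learned model: try to fly as well as possible
        data += fly(policy)              # autonomous flight collects more (s, a, s') data
    return policy
```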

12 Theorem. Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability 1 − δ: E[sum of rewards | policy returned by algorithm] ≥ E[sum of rewards | teacher's policy] − ε. Here, polynomial is with respect to 1/ε, 1/δ, the horizon T, the maximum reward R, and the size of the state space.

13 Learning the dynamics model. Details of the algorithm for learning the dynamics model: exploiting structure from physics; a lagged learning criterion. [NIPS 2005, 2006]

14 Helicopter flight results. First high-speed autonomous funnels. Speed: 5 m/s. Nominal pitch angle: 30 degrees.

15 Autonomous nose-in funnel

16 Accuracy

17 Autonomous tail-in funnel

18 Key points. Unlike exploration methods, our algorithm concentrates on the task of interest. Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.

19

20 Apprenticeship learning: reward. [Diagram: Dynamics Model P_sa and Reward Function R feed into Reinforcement Learning, which outputs a Control policy π.]

21 Example task: driving

22 Related work. Previous work: learn to predict the teacher's actions as a function of states (e.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …). This assumes "policy simplicity." Our approach: assumes "reward simplicity" and is based on inverse reinforcement learning (Ng & Russell, 2000). Similar work since: Ratliff et al., 2006, 2007.

23 Inverse reinforcement learning. Find R such that R is consistent with the teacher's policy π* being optimal. Find R s.t.: E[Σ_t R(s_t) | π*] ≥ E[Σ_t R(s_t) | π] for all policies π. With the reward linear in features, R(s) = wᵀφ(s), this becomes: find w s.t. wᵀ E[Σ_t φ(s_t) | π*] ≥ wᵀ E[Σ_t φ(s_t) | π] for all π. Linear constraints in w, quadratic objective → QP. Very large number of constraints.
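A hedged sketch of one way the "find w" step could be posed as a QP, assuming the reward is linear in features so each policy is summarized by its feature expectations μ(π) = E[Σ_t φ(s_t) | π]; the solver choice (cvxpy) and the margin formulation are my assumptions, not the slide's.

```python
import cvxpy as cp

def irl_qp(mu_teacher, mu_list, margin=1.0):
    """Find reward weights w under which the teacher's feature expectations
    beat every candidate policy's by a margin: quadratic objective, linear
    constraints -> QP. mu_teacher: (k,) array; mu_list: list of (k,) arrays."""
    k = len(mu_teacher)
    w = cp.Variable(k)
    constraints = [w @ mu_teacher >= w @ mu + margin for mu in mu_list]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value
```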

24 Algorithm. For i = 1, 2, …: Inverse RL step: estimate reward weights w under which the teacher outperforms all previously computed policies π_1, …, π_{i-1}. RL step (= constraint generation): compute the optimal policy π_i for the estimated reward R_w.
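A sketch of the alternation on this slide, reusing the hypothetical `irl_qp` above; `run_rl` and `feature_expectations` are assumed helpers (an RL solver in the model P_sa and policy evaluation of the features). The RL step acts as constraint generation: each new policy π_i adds one constraint to the next inverse-RL QP.

```python
# Illustrative sketch only; run_rl and feature_expectations are hypothetical helpers.
def apprenticeship_irl(mu_teacher, P_sa, phi, pi_0, n_iters=20):
    mu_list = [feature_expectations(pi_0, P_sa, phi)]   # start from an arbitrary policy
    for i in range(n_iters):
        # Inverse RL step: weights under which the teacher beats all policies so far.
        w = irl_qp(mu_teacher, mu_list)
        # RL step (= constraint generation): optimal policy pi_i for reward R_w(s) = w . phi(s).
        pi_i = run_rl(P_sa, reward=lambda s: w @ phi(s))
        mu_list.append(feature_expectations(pi_i, P_sa, phi))
    return w, pi_i
```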

25 Theoretical guarantees [ICML 2004]. Theorem. After at most n T²/ε² iterations, our algorithm returns a policy π that performs as well as the teacher according to the teacher's unknown reward function, i.e., E[sum of rewards | π] ≥ E[sum of rewards | π*] − ε. Note: our algorithm does not necessarily recover the teacher's reward function R*, which is impossible to recover.

26 Performance guarantee intuition. Intuition by example: let the reward be a linear combination of two features, R(s) = w_1 φ_1(s) + w_2 φ_2(s). If the returned policy π satisfies E[Σ_t φ_1(s_t) | π] = E[Σ_t φ_1(s_t) | π*] and E[Σ_t φ_2(s_t) | π] = E[Σ_t φ_2(s_t) | π*], then no matter what the values of w_1 and w_2 are, the policy π performs as well as the teacher's policy π*.

27 Case study: highway driving. Input: driving demonstration. Output: learned behavior. The only input to the learning algorithm was the driving demonstration (left panel); no reward function was provided.

28 More driving examples. In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.

29 Helicopter [NIPS 2007]. Reward Function R: 25 features. Reinforcement Learning: differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989]. [Diagram: Dynamics Model P_sa and Reward Function R feed into Reinforcement Learning, which outputs a Control policy π.]
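The slide names differential dynamic programming but does not spell it out; as a hedged illustration of the core machinery, here is a minimal finite-horizon LQR backward pass (the linear-quadratic subproblem that DDP solves repeatedly around a nominal trajectory), assuming given dynamics matrices A, B and cost matrices Q, R_cost.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R_cost, T):
    """Finite-horizon LQR: returns feedback gains K_t such that u_t = -K_t x_t
    minimizes sum_t (x_t' Q x_t + u_t' R_cost u_t) under x_{t+1} = A x_t + B u_t."""
    V = Q.copy()                     # value function at the final step
    gains = []
    for t in reversed(range(T)):
        K = np.linalg.solve(R_cost + B.T @ V @ B, B.T @ V @ A)   # Riccati gain
        V = Q + A.T @ V @ (A - B @ K)                            # propagate value backward
        gains.append(K)
    return list(reversed(gains))
```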

30 Autonomous aerobatics [Show helicopter movie in Media Player.]

31 Quadruped

32 Quadruped. The reward function trades off: height differential of terrain; gradient of terrain around each foot; height differential between feet; … (25 features total for our setup).
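A toy sketch of how a learned linear trade-off over such terrain features might score a candidate foot placement; the feature helpers below are hypothetical stand-ins, not the actual 25 features.

```python
import numpy as np

def footstep_reward(w, terrain, foot_xy, other_feet_xy):
    """Score a candidate foot placement as a learned linear combination of
    hand-designed terrain features (illustrative stand-ins for the real set)."""
    features = np.array([
        terrain_height_diff(terrain, foot_xy),                        # height differential of terrain
        terrain_gradient(terrain, foot_xy),                           # gradient of terrain around the foot
        height_diff_between_feet(terrain, foot_xy, other_feet_xy),    # height differential between feet
    ])
    return w @ features   # reward = w . phi(footstep)
```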

33 Teacher demonstration for quadruped. Full teacher demonstration = sequence of footsteps. Much simpler to "teach hierarchically": specify a body path; specify the best footstep in a small area.

34 Hierarchical inverse RL. Quadratic programming problem (QP): quadratic objective, linear constraints. Constraint generation for path constraints.

35 Experimental setup. Training: have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements. Around each foot placement, label the best foot placement (about 20 labels). Label the best body path for the training board. Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels. Testing on held-out terrains: plan a path across the test board.

36 Quadruped on test board [Show movie in Media Player.]

37

38 Apprenticeship learning: RL algorithm. Inputs: a (sloppy) demonstration, a (crude) model, and a small number of real-life trials. [Diagram: Dynamics Model P_sa and Reward Function R feed into Reinforcement Learning, which outputs a Control policy π.]
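The slide only lists the ingredients; the following is a speculative sketch (my assumption, not the authors' stated algorithm) of how a crude model and a small number of real-life trials could be combined: each real trial locally corrects the model, and the policy is then improved in the corrected model.

```python
# Speculative sketch; run_on_real_system, correct_model_with_trajectory, and
# improve_policy_in_model are hypothetical helpers standing in for the real system,
# a model-correction step, and a policy optimizer.
def rl_with_crude_model(crude_model, policy, reward, n_trials=5):
    for i in range(n_trials):
        traj = run_on_real_system(policy)                          # one real-life trial
        model = correct_model_with_trajectory(crude_model, traj)   # make the model match what actually happened
        policy = improve_policy_in_model(model, policy, reward)    # policy improvement in the corrected model
    return policy
```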

39 Experiments. Two systems: an RC car and a fixed-wing flight simulator. Control actions: throttle and steering.

40 RC Car: Circle

41 RC Car: Figure-8 Maneuver

42 Conclusion. Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations. Our current work exploits teacher demonstrations to find a good dynamics model, a good reward function, and a good control policy.

43 Acknowledgments. J. Zico Kolter, Andrew Y. Ng; Morgan Quigley, Andrew Y. Ng; Andrew Y. Ng; Adam Coates, Morgan Quigley, Andrew Y. Ng.

