Reinforcement Learning Applications in Robotics Gerhard Neumann, Seminar A, SS 2006
Overview Policy Gradient Algorithms RL for Quadruped Locomotion PEGASUS Algorithm Autonomous Helicopter Flight High Speed Obstacle Avoidance RL for Biped Locomotion Poincare-Map RL Dynamic Planning Hierarchical Approach RL for Acquisition of Robot Stand-Up Behavior
RL for Quadruped Locomotion [Kohl04] A simple policy-gradient example: optimize the gait of the Sony Aibo robot. Use a parameterized policy with 12 parameters: front and rear locus (height, x-pos, y-pos), height of the front and the rear of the body, …
Quadruped Locomotion Policy: no notion of state – open-loop control! Start with an initial policy, generate t = 15 random policies R_i by randomly perturbing the current parameters, evaluate the value of each policy on the real robot, estimate the gradient for each parameter, and update the policy in the direction of the gradient.
Quadruped Locomotion The estimation of the walking speed of a policy is an automated process on the Aibos. Each policy is evaluated 3 times; one iteration (3 x 15 evaluations) takes 7.5 minutes.
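A minimal sketch of one iteration of the finite-difference policy-gradient loop described above; the perturbation size, step size, and the evaluate callback (average walking speed over 3 runs) are illustrative assumptions, not values stated on the slides.

```python
import numpy as np

def finite_difference_gait_update(theta, evaluate, epsilon=0.05, t=15, step_size=0.1):
    """One iteration in the spirit of [Kohl04]: perturb each of the 12 gait
    parameters randomly by -eps, 0, or +eps, evaluate the perturbed policies
    (e.g. average walking speed on the robot), and step along the estimated gradient."""
    n = len(theta)
    perturbations = np.random.choice([-1, 0, 1], size=(t, n)) * epsilon
    scores = np.array([evaluate(theta + d) for d in perturbations])

    gradient = np.zeros(n)
    for j in range(n):
        plus  = scores[perturbations[:, j] > 0]
        zero  = scores[perturbations[:, j] == 0]
        minus = scores[perturbations[:, j] < 0]
        avg_plus  = plus.mean()  if plus.size  else 0.0
        avg_zero  = zero.mean()  if zero.size  else 0.0
        avg_minus = minus.mean() if minus.size else 0.0
        # Leave the parameter unchanged if perturbing it only hurt performance.
        if avg_zero > avg_plus and avg_zero > avg_minus:
            gradient[j] = 0.0
        else:
            gradient[j] = avg_plus - avg_minus

    norm = np.linalg.norm(gradient)
    if norm > 0:
        theta = theta + step_size * gradient / norm   # fixed-size step along the gradient
    return theta
```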
Quadruped Gait: Results Better than the best known gait for AIBO!
PEGASUS [Ng00] Policy gradient algorithms: use a finite time horizon and evaluate the value of the policy. The value of a policy in a stochastic environment is hard to estimate => stochastic optimization process. PEGASUS: for all policy evaluation trials use a fixed set of start states (scenarios) and use "fixed randomization" for policy evaluation (only works in simulation!). The same conditions hold for each evaluation trial => deterministic optimization process! It can be solved by any optimization method; commonly used: gradient ascent, random hill climbing.
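To make "fixed randomization" concrete, here is a small sketch; the simulator interface, seeds, and the hill-climbing optimizer are assumptions for illustration. Because every evaluation reuses the same scenarios (start state + random seed), the estimated value becomes a deterministic function of the policy parameters.

```python
import numpy as np

def pegasus_value(theta, simulate, start_states, seeds, horizon=100):
    """Deterministic PEGASUS-style value estimate: each scenario has a fixed
    start state and a fixed random number stream, so repeated evaluations of
    the same policy parameters return exactly the same number."""
    total = 0.0
    for s0, seed in zip(start_states, seeds):
        rng = np.random.default_rng(seed)      # fixed randomization per scenario
        total += simulate(theta, s0, rng, horizon)
    return total / len(start_states)

def hill_climb(theta, simulate, start_states, seeds, iters=200, sigma=0.01):
    """Any generic optimizer can now be applied; simple random hill climbing shown."""
    best = pegasus_value(theta, simulate, start_states, seeds)
    for _ in range(iters):
        candidate = theta + sigma * np.random.randn(*theta.shape)
        value = pegasus_value(candidate, simulate, start_states, seeds)
        if value > best:
            theta, best = candidate, value
    return theta
```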
Autonomous Helicopter Flight [Ng04a, Ng04b] Autonomously learn to fly an unmanned helicopter – an expensive platform, so exploration failures are catastrophic! Learn the dynamics from observation of a human pilot. Use PEGASUS to learn to hover, to fly complex maneuvers, and for inverted helicopter flight.
Helicopter Flight: Model Identification 12-dimensional state space: world coordinates (position + rotation) + velocities. 4-dimensional actions: two rotor-plane pitch controls, rotor blade tilt, and tail rotor tilt. Actions are selected every 20 ms.
Helicopter Flight: Model Identification A human pilot flies the helicopter and the data is logged: 391 s of training data, reduced to 8 dimensions (the position can be estimated from the velocities). Learn the transition probabilities P(s_{t+1} | s_t, a_t) by supervised learning with locally weighted linear regression; Gaussian noise is modeled to obtain a stochastic model. A simulator was implemented for model validation.
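A minimal sketch of locally weighted linear regression for one-step model learning; the Gaussian kernel bandwidth, the data layout (X holds logged state-action pairs, Y the next states), and the noise handling are illustrative assumptions rather than the exact setup of [Ng04a].

```python
import numpy as np

def lwlr_predict(query, X, Y, bandwidth=1.0):
    """Predict the next state for `query` = (state, action) by fitting a linear
    model whose data points are weighted by their distance to the query."""
    dists = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-dists / (2.0 * bandwidth ** 2))          # Gaussian kernel weights
    Xb = np.hstack([X, np.ones((len(X), 1))])            # add a bias column
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb + 1e-6 * np.eye(Xb.shape[1]),
                           Xb.T @ W @ Y)                  # weighted least squares
    return np.append(query, 1.0) @ beta

def stochastic_step(query, X, Y, noise_std):
    """Stochastic model: deterministic LWLR prediction plus Gaussian noise,
    with the noise level taken from the regression residuals (assumed)."""
    return lwlr_predict(query, X, Y) + noise_std * np.random.randn(Y.shape[1])
```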
Helicopter Flight: Hover Control Hover at a desired position. A very simple policy class is used; its edges are obtained from human prior knowledge, so the controller essentially learns more or less linear gains. Quadratic reward function: punishment for deviation from the desired position and orientation.
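The quadratic reward can be sketched as follows; the slides only state that deviations from the desired position and orientation are punished, so the stacking of the state and the weight values are assumptions.

```python
import numpy as np

def hover_reward(state, target, weights):
    """Quadratic penalty for deviating from the desired hover pose.
    `state` and `target` stack position and orientation terms; `weights`
    sets how strongly each component is punished (assumed values)."""
    err = state - target
    return -np.sum(weights * err ** 2)

# Example: punish position errors more strongly than orientation errors.
# r = hover_reward(state, target, weights=np.array([1.0, 1.0, 1.0, 0.2, 0.2, 0.2]))
```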
Helicopter Flight: Hover Control Results: better performance than the human expert (red).
Helicopter Flight: Flying Maneuvers Fly 3 maneuvers from the most difficult RC helicopter competition class. Trajectory following: punish the distance from the projected point on the trajectory, with an additional reward for making progress along the trajectory.
Helicopter Flight: Results Videos: Video1, Video2
Helicopter Flight: Inverted Flight Very difficult for humans – unstable! Data was re-collected for inverted flight and the same methods as before were used. Learned in 4 days, from data collection to flight experiment: a stable inverted flight controller that sustains its position. Video
High Speed Obstacle Avoidance [Michels05] Obstacle avoidance with an RC car in unstructured environments: estimate depth information from monocular cues and learn a controller for obstacle avoidance with PEGASUS. Training uses a graphical simulation: does it work in the real environment?
Estimate Depth Information: Supervised Learning Divide the image into 16 vertical stripes and use the features of a stripe and its neighboring stripes as input vectors. Target values (the shortest distance within a stripe) come either from the simulation or from laser range finders; linear regression is used. Output of the vision system: the angle of the stripe with the largest distance, and that distance.
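A rough sketch of this pipeline; the feature extraction, the field-of-view mapping from stripe index to angle, and the helper names are assumptions for illustration. Per-stripe features are regressed onto the logged nearest-obstacle distances, and the controller steers towards the stripe predicted to be farthest away.

```python
import numpy as np

def train_depth_regressor(features, distances):
    """Fit one linear model mapping stripe features (own + neighboring stripes)
    to the shortest distance within the stripe; the targets come from the
    simulator or from laser range finders."""
    Xb = np.hstack([features, np.ones((len(features), 1))])
    beta, *_ = np.linalg.lstsq(Xb, distances, rcond=None)
    return beta

def pick_steering_stripe(stripe_features, beta, num_stripes=16, fov_deg=60.0):
    """Predict a distance for each stripe and return the steering angle of the
    stripe with the largest predicted distance, plus that distance."""
    Xb = np.hstack([stripe_features, np.ones((num_stripes, 1))])
    pred = Xb @ beta
    best = int(np.argmax(pred))
    angle = (best + 0.5) / num_stripes * fov_deg - fov_deg / 2.0  # assumed FOV mapping
    return angle, pred[best]
```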
Obstacle Avoidance: Control Policy: 6 parameters – again, a very simple policy is used. Reward: deviation from the desired speed and the number of crashes.
Obstacle Avoidance: Results Using a graphical simulation to train the vision system also works in outdoor environments. Video
RL for Biped Robots RL is often used only for simplified planar models: Poincare-map based RL [Morimoto04], dynamic planning [Stilman05]. Other examples of RL on real robots strongly simplify the problem: [Zhou03]
Poincare Map-Based RL Improve walking controllers with RL. Poincare map: the intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane. Predict the state of the biped a half cycle ahead, at two fixed phases of the gait cycle.
Poincare Map Learn the mapping. Input space: x = (d, d'), the distance between stance foot and body and its derivative. Action space: modulate the via-points of the joint trajectories. Function approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid.
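A compact sketch of prediction with a fixed grid of receptive fields in the spirit of RFWR; the grid layout, kernel width, and local-model storage are assumptions, and the real RFWR additionally updates each local linear model incrementally with weighted recursive least squares.

```python
import numpy as np

class FixedGridRFWR:
    """Each receptive field k holds a Gaussian kernel (center c_k, shared width)
    and a local linear model (slope B_k, offset b0_k); the prediction is the
    normalized, kernel-weighted blend of all local models."""
    def __init__(self, centers, width, dim_out=1):
        self.centers = centers                     # fixed grid of kernel centers
        self.width = width
        self.B = np.zeros((len(centers), centers.shape[1], dim_out))   # local slopes
        self.b0 = np.zeros((len(centers), dim_out))                    # local offsets

    def activations(self, x):
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def predict(self, x):
        w = self.activations(x)
        local = np.einsum('kio,i->ko', self.B, x) + self.b0   # each local linear model
        return (w[:, None] * local).sum(axis=0) / (w.sum() + 1e-12)
```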
Via-Points Nominal trajectories are taken from human walking patterns. The control output is used to modulate the hand-selected via-points (marked with a circle); the via-points of one joint are all incremented by the same amount.
Learning the Value Function Reward function: 0.1 if the height of the robot is > 0.35 m, -0.1 otherwise. Standard semi-MDP update rules; the value function only needs to be learned at the two Poincare-section phases. Model-based actor-critic approach; actor update rule: …
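The exact update equations were given as formulas on the slide; the following is only a generic TD-style actor-critic update at the Poincare sections under the stated reward. The approximator interfaces (predict/update), learning rates, and Gaussian exploration are assumptions, not the rule from [Morimoto04].

```python
import numpy as np

def poincare_actor_critic_step(V, policy, x, x_next, elapsed, height_next,
                               alpha_v=0.2, alpha_a=0.05, gamma=0.98, sigma=0.1):
    """One update per half cycle: reward +/-0.1 depending on the body height,
    semi-MDP TD error on the section-to-section transition, and a
    TD-error-weighted actor update on the via-point modulation."""
    r = 0.1 if height_next > 0.35 else -0.1
    td = r + (gamma ** elapsed) * V.predict(x_next) - V.predict(x)

    V.update(x, alpha_v * td)                       # critic: move V(x) towards the target

    a_mean = policy.predict(x)
    a_taken = a_mean + sigma * np.random.randn(*np.shape(a_mean))
    policy.update(x, alpha_a * td * (a_taken - a_mean))   # actor: reinforce exploration
    return a_taken
```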
Results: Stable walking performance after 80 trials. (Videos: beginning of learning, end of learning)
Dynamic Programming for Biped Locomotion [Stilman05] 4-link planar robot; dynamic programming in reduced dimensional spaces. Manual temporal decomposition of the problem into phases of single and double support, and intuitive reductions of the state space for both phases.
State-Increment Dynamic Programming 8-dimensional state space: discretize the state space by a coarse grid and use dynamic programming. The interval ε is defined as the minimum time interval required for any state index to change.
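A schematic of dynamic programming on the coarse grid; the tabular transition and reward arrays and the discount are placeholders, and state-increment DP additionally obtains next_state and reward by integrating the dynamics only for the minimum interval ε needed to change a state index.

```python
import numpy as np

def value_iteration(next_state, reward, num_states, num_actions,
                    gamma=0.99, tol=1e-6):
    """Tabular value iteration over the discretized grid: next_state[s, a] is the
    grid cell reached from cell s under action a, and reward[s, a] is the reward
    accumulated over that interval."""
    V = np.zeros(num_states)
    while True:
        Q = reward + gamma * V[next_state]          # shape (num_states, num_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)          # value function and greedy policy
        V = V_new
```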
State Space Considerations Decompose the problem into 2 state space components (DS + SS); there are important distinctions between the dynamics of DS and SS. Because the system is periodic, DP cannot be applied separately to the state space components: a mapping between the components has to be established for the DS and SS transitions.
State Space Reduction Double Support: constant step length (d_f) – it cannot change during DS, only after the robot completes SS. DS is equivalent to a 5-bar linkage model, so the entire configuration can be described by 2 DoF (the knee angles k_1 and k_2). 5-D state space, 10x16x16x12x12 grid => 368,640 states.
State Space Reduction Single Support: compass 2-link model, assuming k_1 and k_2 are constant. The stance knee angle k_1 has a small range in human walking; the swing knee k_2 has a strong effect on d_f, but can be prescribed in accordance with h_2 with little effect on the robot's CoM. 4-D state space, 35x35x18x18 grid => 396,900 states.
State-Space Reduction Phase Transitions: The DS-to-SS transition occurs when the rear foot leaves the ground (mapping: …). The SS-to-DS transition occurs when the swing leg makes contact (mapping: …).
Action Space, Rewards Use discretized torques. DS: the hip and both knee joints can accelerate the CoM; the hip action is fixed to zero to gain better resolution for the knee joints, and the 2-D knee-torque action space (in Nm) is discretized into 7x7 intervals. SS: only the hip torque is chosen, with 17 intervals (in Nm). States x actions: … x 17 = … cells (!!). Reward: …
Results 11 hours of computation The computed policy locates a limit cycle through the space.
Performance under Error Different properties of the robot are altered in simulation without relearning the policy. A wide range of disturbances is tolerated, even if the dynamics model used is incorrect! The wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle.
RL for a CPG-Driven Biped Robot CPG controller: a recurrent neural-oscillator network with its own state dynamics, sensory input, and torque output. State of the system: neural state v + physical state x. The weights of the CPG have already been optimized [Taga01].
CPG Actor-Critic 2 modules: actor + CPG. The weights W_ij^act and A_jk are trained by RL; the rest of the system is fixed.
CPG Actor-Critic The actor outputs an indirect control u to the CPG-coupled system (CPG + physical system). The actor is a linear controller without any feedback connections. Learning is done with the Natural Actor-Critic algorithm.
CPG Actor-Critic: Experiments CPG-Driven Biped Robot
Learning of a Stand-Up Behavior [Morimoto00] Learning to stand up with a 3-link planar robot; 6-D state space (angles + velocities). Hierarchical reinforcement learning with task decomposition by sub-goals: decompose the task into a non-linear problem in a lower-dimensional space and a nearly-linear problem in a higher-dimensional space.
Upper-Level Learning Coarse discretization of the postures, with no velocity information in the state space (3-D state space). Actions: select the next sub-goal posture.
Upper-Level Learning Reward function: reward the success of the stand-up, and also reward the success of a sub-goal, so that sub-goals which are easier to reach from the current state are preferred. Q(λ) learning is used to learn the sequence of sub-goals.
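A small tabular Watkins-style Q(λ) learner for the sub-goal level; the state/sub-goal encoding, learning constants, and exploration scheme are assumptions, since the slides only state that Q(λ) learning is used over the coarse posture discretization.

```python
import numpy as np

class SubgoalQLambda:
    """Tabular Q(lambda) over coarse postures; the actions are candidate sub-goals."""
    def __init__(self, n_states, n_subgoals, alpha=0.1, gamma=0.95, lam=0.8):
        self.Q = np.zeros((n_states, n_subgoals))
        self.e = np.zeros_like(self.Q)              # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, s, a, r, s_next, greedy_taken):
        """One step: r combines the stand-up and sub-goal success rewards."""
        td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.e[s, a] += 1.0
        self.Q += self.alpha * td * self.e
        if greedy_taken:                            # Watkins: cut traces after exploration
            self.e *= self.gamma * self.lam
        else:
            self.e[:] = 0.0

    def select(self, s, epsilon=0.1):
        if np.random.rand() < epsilon:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[s]))
```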
Lower-Level Learning The lower level is free to choose at which speed to reach a sub-goal (desired posture); 6-D state space. Incremental Normalized Gaussian networks (ING-nets) are used as function approximator: an RBF network with a rule for allocating new RBF centers. Action space: the torque vector.
Lower-Level Learning Reward: -1.5 if the robot falls down. Continuous-time actor-critic learning [Doya99]; actor and critic are learned with ING-nets. Control output: a combination of a linear servo controller and a non-linear feedback controller.
Results: In simulation, the hierarchical architecture learns 2x faster than the plain architecture. Real robot (videos: before learning, during learning, after learning): the task was learned on average in 749 trials (7/10 learning runs succeeded), using 4.3 sub-goals on average.
The End For people who are interested in using RL: the RL-Toolbox. Thank you!
Literature [Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2005. [Ng00] PEGASUS: A Policy Search Method for Large MDPs and POMDPs, A. Ng and M. Jordan, 2000. [Ng04a] Autonomous Inverted Helicopter Flight via Reinforcement Learning, A. Ng et al., 2004. [Ng04b] Autonomous Helicopter Flight via Reinforcement Learning, A. Ng et al., 2004. [Michels05] High Speed Obstacle Avoidance Using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005. [Morimoto04] A Simple Reinforcement Learning Algorithm for Biped Walking, J. Morimoto and C. Atkeson, 2004.
Literature [Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005. [Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot Using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000. [Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998. [Zhou03] Dynamic Balance of a Biped Robot Using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003. [Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999.