Reinforcement Learning Applications in Robotics Gerhard Neumann, Seminar A, SS 2006
Overview Policy Gradient Algorithms RL for Quadruped Locomotion PEGASUS Algorithm Autonomous Helicopter Flight High Speed Obstacle Avoidance RL for Biped Locomotion Poincare-Map RL Dynamic Planning Hierarchical Approach RL for Acquisition of Robot Stand-Up Behavior
RL for Quadruped Locomotion [Kohl04] A simple policy-gradient example: optimize the gait of the Sony Aibo robot. Use a parameterized policy with 12 parameters: front and rear locus (height, x-pos, y-pos), height of the front and the rear of the body, …
Quadruped Locomotion Policy: no notion of state – open-loop control! Start with an initial policy, generate t = 15 random policies R_i by randomly perturbing the current parameters, evaluate the value of each policy on the real robot, estimate the gradient for each parameter, and update the policy in the direction of the gradient.
Quadruped Locomotion The estimation of the walking speed of a policy is an automated process on the Aibos. Each policy is evaluated 3 times; one iteration (3 x 15 evaluations) takes 7.5 minutes.
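A minimal sketch of one iteration of the finite-difference policy-gradient loop described above; the perturbation size, step size, and the evaluate callback (average walking speed over 3 runs) are illustrative assumptions, not values stated on the slides.

```python
import numpy as np

def finite_difference_gait_update(theta, evaluate, epsilon=0.05, t=15, step_size=0.1):
    """One iteration in the spirit of [Kohl04]: perturb each of the 12 gait
    parameters randomly by -eps, 0, or +eps, evaluate the perturbed policies
    (e.g. average walking speed on the robot), and step along the estimated gradient."""
    n = len(theta)
    perturbations = np.random.choice([-1, 0, 1], size=(t, n)) * epsilon
    scores = np.array([evaluate(theta + d) for d in perturbations])

    gradient = np.zeros(n)
    for j in range(n):
        plus  = scores[perturbations[:, j] > 0]
        zero  = scores[perturbations[:, j] == 0]
        minus = scores[perturbations[:, j] < 0]
        avg_plus  = plus.mean()  if plus.size  else 0.0
        avg_zero  = zero.mean()  if zero.size  else 0.0
        avg_minus = minus.mean() if minus.size else 0.0
        # Leave the parameter unchanged if perturbing it only hurt performance.
        if avg_zero > avg_plus and avg_zero > avg_minus:
            gradient[j] = 0.0
        else:
            gradient[j] = avg_plus - avg_minus

    norm = np.linalg.norm(gradient)
    if norm > 0:
        theta = theta + step_size * gradient / norm   # fixed-size step along the gradient
    return theta
```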
Quadruped Gait: Results Better than the best known gait for AIBO!
PEGASUS [Ng00] Policy gradient algorithms: use a finite time horizon and evaluate the value of the policy. The value of a policy in a stochastic environment is hard to estimate => stochastic optimization process. PEGASUS: for all policy evaluation trials use a fixed set of start states (scenarios) and use "fixed randomization" for policy evaluation (only works in simulation!). The same conditions hold for each evaluation trial => deterministic optimization process! It can be solved by any optimization method; commonly used: gradient ascent, random hill climbing.
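To make "fixed randomization" concrete, here is a small sketch; the simulator interface, seeds, and the hill-climbing optimizer are assumptions for illustration. Because every evaluation reuses the same scenarios (start state + random seed), the estimated value becomes a deterministic function of the policy parameters.

```python
import numpy as np

def pegasus_value(theta, simulate, start_states, seeds, horizon=100):
    """Deterministic PEGASUS-style value estimate: each scenario has a fixed
    start state and a fixed random number stream, so repeated evaluations of
    the same policy parameters return exactly the same number."""
    total = 0.0
    for s0, seed in zip(start_states, seeds):
        rng = np.random.default_rng(seed)      # fixed randomization per scenario
        total += simulate(theta, s0, rng, horizon)
    return total / len(start_states)

def hill_climb(theta, simulate, start_states, seeds, iters=200, sigma=0.01):
    """Any generic optimizer can now be applied; simple random hill climbing shown."""
    best = pegasus_value(theta, simulate, start_states, seeds)
    for _ in range(iters):
        candidate = theta + sigma * np.random.randn(*theta.shape)
        value = pegasus_value(candidate, simulate, start_states, seeds)
        if value > best:
            theta, best = candidate, value
    return theta
```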
Autonomous Helicopter Flight [Ng04a, Ng04b] Autonomously learn to fly an unmanned helicopter – an expensive platform, so exploration failures are catastrophic! Learn the dynamics from observation of a human pilot. Use PEGASUS to learn to hover, to fly complex maneuvers, and for inverted helicopter flight.
Helicopter Flight: Model Identification 12-dimensional state space: world coordinates (position + rotation) + velocities. 4-dimensional actions: two rotor-plane pitch controls, rotor blade tilt, and tail rotor tilt. Actions are selected every 20 ms.
Helicopter Flight: Model Identification A human pilot flies the helicopter and the data is logged: 391 s of training data, reduced to 8 dimensions (the position can be estimated from the velocities). Learn the transition probabilities P(s_{t+1} | s_t, a_t) by supervised learning with locally weighted linear regression; Gaussian noise is modeled to obtain a stochastic model. A simulator was implemented for model validation.
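A minimal sketch of locally weighted linear regression for one-step model learning; the Gaussian kernel bandwidth, the data layout (X holds logged state-action pairs, Y the next states), and the noise handling are illustrative assumptions rather than the exact setup of [Ng04a].

```python
import numpy as np

def lwlr_predict(query, X, Y, bandwidth=1.0):
    """Predict the next state for `query` = (state, action) by fitting a linear
    model whose data points are weighted by their distance to the query."""
    dists = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-dists / (2.0 * bandwidth ** 2))          # Gaussian kernel weights
    Xb = np.hstack([X, np.ones((len(X), 1))])            # add a bias column
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb + 1e-6 * np.eye(Xb.shape[1]),
                           Xb.T @ W @ Y)                  # weighted least squares
    return np.append(query, 1.0) @ beta

def stochastic_step(query, X, Y, noise_std):
    """Stochastic model: deterministic LWLR prediction plus Gaussian noise,
    with the noise level taken from the regression residuals (assumed)."""
    return lwlr_predict(query, X, Y) + noise_std * np.random.randn(Y.shape[1])
```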
Helicopter Flight: Hover Control Hover at a desired position. A very simple policy class is used; its edges are obtained from human prior knowledge, so the controller essentially learns more or less linear gains. Quadratic reward function: punishment for deviation from the desired position and orientation.
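The quadratic reward can be sketched as follows; the slides only state that deviations from the desired position and orientation are punished, so the stacking of the state and the weight values are assumptions.

```python
import numpy as np

def hover_reward(state, target, weights):
    """Quadratic penalty for deviating from the desired hover pose.
    `state` and `target` stack position and orientation terms; `weights`
    sets how strongly each component is punished (assumed values)."""
    err = state - target
    return -np.sum(weights * err ** 2)

# Example: punish position errors more strongly than orientation errors.
# r = hover_reward(state, target, weights=np.array([1.0, 1.0, 1.0, 0.2, 0.2, 0.2]))
```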
Helicopter Flight: Hover Control Results: better performance than the human expert (red).
Helicopter Flight: Flying Maneuvers Fly 3 maneuvers from the most difficult RC helicopter competition class. Trajectory following: punish the distance from the projected point on the trajectory, with an additional reward for making progress along the trajectory.
Helicopter Flight: Results Videos: Video1, Video2
Helicopter Flight: Inverted Flight Very difficult for humans – unstable! Data was re-collected for inverted flight and the same methods as before were used. Learned in 4 days, from data collection to flight experiment: a stable inverted flight controller that sustains its position. Video
High Speed Obstacle Avoidance [Michels05] Obstacle avoidance with an RC car in unstructured environments: estimate depth information from monocular cues and learn a controller for obstacle avoidance with PEGASUS. Training uses a graphical simulation: does it work in the real environment?
Estimate Depth Information: Supervised Learning Divide the image into 16 vertical stripes and use the features of a stripe and its neighboring stripes as input vectors. Target values (the shortest distance within a stripe) come either from the simulation or from laser range finders; linear regression is used. Output of the vision system: the angle of the stripe with the largest distance, and that distance.
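A rough sketch of this pipeline; the feature extraction, the field-of-view mapping from stripe index to angle, and the helper names are assumptions for illustration. Per-stripe features are regressed onto the logged nearest-obstacle distances, and the controller steers towards the stripe predicted to be farthest away.

```python
import numpy as np

def train_depth_regressor(features, distances):
    """Fit one linear model mapping stripe features (own + neighboring stripes)
    to the shortest distance within the stripe; the targets come from the
    simulator or from laser range finders."""
    Xb = np.hstack([features, np.ones((len(features), 1))])
    beta, *_ = np.linalg.lstsq(Xb, distances, rcond=None)
    return beta

def pick_steering_stripe(stripe_features, beta, num_stripes=16, fov_deg=60.0):
    """Predict a distance for each stripe and return the steering angle of the
    stripe with the largest predicted distance, plus that distance."""
    Xb = np.hstack([stripe_features, np.ones((num_stripes, 1))])
    pred = Xb @ beta
    best = int(np.argmax(pred))
    angle = (best + 0.5) / num_stripes * fov_deg - fov_deg / 2.0  # assumed FOV mapping
    return angle, pred[best]
```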
Obstacle Avoidance: Control Policy: 6 parameters – again, a very simple policy is used. Reward: deviation from the desired speed and the number of crashes.
Obstacle Avoidance: Results Using a graphical simulation to train the vision system also works in outdoor environments. Video
RL for Biped Robots RL is often used only for simplified planar models: Poincare-map based RL [Morimoto04], dynamic planning [Stilman05]. Other examples of RL on real robots strongly simplify the problem: [Zhou03]
Poincare Map-Based RL Improve walking controllers with RL. Poincare map: the intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane. Predict the state of the biped a half cycle ahead, at two fixed phases of the gait cycle.
Poincare Map Learn the mapping. Input space: x = (d, d'), the distance between stance foot and body and its derivative. Action space: modulate the via-points of the joint trajectories. Function approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid.
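A compact sketch of prediction with a fixed grid of receptive fields in the spirit of RFWR; the grid layout, kernel width, and local-model storage are assumptions, and the real RFWR additionally updates each local linear model incrementally with weighted recursive least squares.

```python
import numpy as np

class FixedGridRFWR:
    """Each receptive field k holds a Gaussian kernel (center c_k, shared width)
    and a local linear model (slope B_k, offset b0_k); the prediction is the
    normalized, kernel-weighted blend of all local models."""
    def __init__(self, centers, width, dim_out=1):
        self.centers = centers                     # fixed grid of kernel centers
        self.width = width
        self.B = np.zeros((len(centers), centers.shape[1], dim_out))   # local slopes
        self.b0 = np.zeros((len(centers), dim_out))                    # local offsets

    def activations(self, x):
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def predict(self, x):
        w = self.activations(x)
        local = np.einsum('kio,i->ko', self.B, x) + self.b0   # each local linear model
        return (w[:, None] * local).sum(axis=0) / (w.sum() + 1e-12)
```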
Via-Points Nominal trajectories are taken from human walking patterns. The control output is used to modulate the hand-selected via-points (marked with a circle); the via-points of one joint are all incremented by the same amount.
Learning the Value Function Reward function: 0.1 if the height of the robot is > 0.35 m, -0.1 otherwise. Standard semi-MDP update rules; the value function only needs to be learned at the two Poincare-section phases. Model-based actor-critic approach; actor update rule: …
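The exact update equations were given as formulas on the slide; the following is only a generic TD-style actor-critic update at the Poincare sections under the stated reward. The approximator interfaces (predict/update), learning rates, and Gaussian exploration are assumptions, not the rule from [Morimoto04].

```python
import numpy as np

def poincare_actor_critic_step(V, policy, x, x_next, elapsed, height_next,
                               alpha_v=0.2, alpha_a=0.05, gamma=0.98, sigma=0.1):
    """One update per half cycle: reward +/-0.1 depending on the body height,
    semi-MDP TD error on the section-to-section transition, and a
    TD-error-weighted actor update on the via-point modulation."""
    r = 0.1 if height_next > 0.35 else -0.1
    td = r + (gamma ** elapsed) * V.predict(x_next) - V.predict(x)

    V.update(x, alpha_v * td)                       # critic: move V(x) towards the target

    a_mean = policy.predict(x)
    a_taken = a_mean + sigma * np.random.randn(*np.shape(a_mean))
    policy.update(x, alpha_a * td * (a_taken - a_mean))   # actor: reinforce exploration
    return a_taken
```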
Results: Stable walking performance after 80 trials. (Videos: beginning of learning, end of learning)
Dynamic Programming for Biped Locomotion [Stilman05] 4-link planar robot; dynamic programming in reduced dimensional spaces. Manual temporal decomposition of the problem into phases of single and double support, and intuitive reductions of the state space for both phases.
State-Increment Dynamic Programming 8-dimensional state space: discretize the state space by a coarse grid and use dynamic programming. The interval ε is defined as the minimum time interval required for any state index to change.
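A schematic of dynamic programming on the coarse grid; the tabular transition and reward arrays and the discount are placeholders, and state-increment DP additionally obtains next_state and reward by integrating the dynamics only for the minimum interval ε needed to change a state index.

```python
import numpy as np

def value_iteration(next_state, reward, num_states, num_actions,
                    gamma=0.99, tol=1e-6):
    """Tabular value iteration over the discretized grid: next_state[s, a] is the
    grid cell reached from cell s under action a, and reward[s, a] is the reward
    accumulated over that interval."""
    V = np.zeros(num_states)
    while True:
        Q = reward + gamma * V[next_state]          # shape (num_states, num_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)          # value function and greedy policy
        V = V_new
```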
State Space Considerations Decompose the problem into 2 state space components (DS + SS); there are important distinctions between the dynamics of DS and SS. Because the system is periodic, DP cannot be applied separately to the state space components: a mapping between the components has to be established for the DS and SS transitions.
State Space Reduction Double Support: constant step length (d_f) – it cannot change during DS, only after the robot completes SS. DS is equivalent to a 5-bar linkage model, so the entire configuration can be described by 2 DoF (the knee angles k_1 and k_2). 5-D state space, 10x16x16x12x12 grid => 368,640 states.
State Space Reduction Single Support: compass 2-link model, assuming k_1 and k_2 are constant. The stance knee angle k_1 has a small range in human walking; the swing knee k_2 has a strong effect on d_f, but can be prescribed in accordance with h_2 with little effect on the robot's CoM. 4-D state space, 35x35x18x18 grid => 396,900 states.
State-Space Reduction Phase Transitions: The DS-to-SS transition occurs when the rear foot leaves the ground (mapping: …). The SS-to-DS transition occurs when the swing leg makes contact (mapping: …).
Action Space, Rewards Use discretized torques. DS: the hip and both knee joints can accelerate the CoM; the hip action is fixed to zero to gain better resolution for the knee joints, and the 2-D knee-torque action space (in Nm) is discretized into 7x7 intervals. SS: only the hip torque is chosen, with 17 intervals (in Nm). States x actions: … x 17 = … cells (!!). Reward: …
Results 11 hours of computation The computed policy locates a limit cycle through the space.
Performance under Error Different properties of the robot are altered in simulation without relearning the policy. A wide range of disturbances is tolerated, even if the dynamics model used is incorrect! The wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle.
RL for a CPG-Driven Biped Robot CPG controller: a recurrent neural-oscillator network with its own state dynamics, sensory input, and torque output. State of the system: neural state v + physical state x. The weights of the CPG have already been optimized [Taga01].
CPG Actor-Critic 2 modules: actor + CPG. The weights W_ij^act and A_jk are trained by RL; the rest of the system is fixed.
CPG Actor-Critic The actor outputs an indirect control u to the CPG-coupled system (CPG + physical system). The actor is a linear controller without any feedback connections. Learning is done with the Natural Actor-Critic algorithm.
CPG Actor-Critic: Experiments CPG-Driven Biped Robot
Learning of a Stand-Up Behavior [Morimoto00] Learning to stand up with a 3-link planar robot; 6-D state space (angles + velocities). Hierarchical reinforcement learning with task decomposition by sub-goals: decompose the task into a non-linear problem in a lower-dimensional space and a nearly-linear problem in a higher-dimensional space.
Upper-Level Learning Coarse discretization of the postures, with no velocity information in the state space (3-D state space). Actions: select the next sub-goal posture.
Upper-Level Learning Reward function: reward the success of the stand-up, and also reward the success of a sub-goal, so that sub-goals which are easier to reach from the current state are preferred. Q(λ) learning is used to learn the sequence of sub-goals.
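A small tabular Watkins-style Q(λ) learner for the sub-goal level; the state/sub-goal encoding, learning constants, and exploration scheme are assumptions, since the slides only state that Q(λ) learning is used over the coarse posture discretization.

```python
import numpy as np

class SubgoalQLambda:
    """Tabular Q(lambda) over coarse postures; the actions are candidate sub-goals."""
    def __init__(self, n_states, n_subgoals, alpha=0.1, gamma=0.95, lam=0.8):
        self.Q = np.zeros((n_states, n_subgoals))
        self.e = np.zeros_like(self.Q)              # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, s, a, r, s_next, greedy_taken):
        """One step: r combines the stand-up and sub-goal success rewards."""
        td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.e[s, a] += 1.0
        self.Q += self.alpha * td * self.e
        if greedy_taken:                            # Watkins: cut traces after exploration
            self.e *= self.gamma * self.lam
        else:
            self.e[:] = 0.0

    def select(self, s, epsilon=0.1):
        if np.random.rand() < epsilon:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[s]))
```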
Lower-Level Learning The lower level is free to choose at which speed to reach a sub-goal (desired posture); 6-D state space. Incremental Normalized Gaussian networks (ING-nets) are used as function approximator: an RBF network with a rule for allocating new RBF centers. Action space: the torque vector.
Lower-Level Learning Reward: -1.5 if the robot falls down. Continuous-time actor-critic learning [Doya99]; actor and critic are learned with ING-nets. Control output: a combination of a linear servo controller and a non-linear feedback controller.
Results: In simulation, the hierarchical architecture learns 2x faster than the plain architecture. Real robot (videos: before learning, during learning, after learning): the task was learned on average in 749 trials (7/10 learning runs succeeded), using 4.3 sub-goals on average.
The End For people who are interested in using RL: the RL-Toolbox. Thank you!
Literature [Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2005. [Ng00] PEGASUS: A Policy Search Method for Large MDPs and POMDPs, A. Ng and M. Jordan, 2000. [Ng04a] Autonomous Inverted Helicopter Flight via Reinforcement Learning, A. Ng et al., 2004. [Ng04b] Autonomous Helicopter Flight via Reinforcement Learning, A. Ng et al., 2004. [Michels05] High Speed Obstacle Avoidance Using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005. [Morimoto04] A Simple Reinforcement Learning Algorithm for Biped Walking, J. Morimoto and C. Atkeson, 2004.
Literature [Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005. [Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot Using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000. [Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998. [Zhou03] Dynamic Balance of a Biped Robot Using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003. [Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999.