Natural Actor-Critic
Authors: Jan Peters and Stefan Schaal, Neurocomputing, 2008
Cognitive Robotics 2008/2009
Wouter Klijn
Content
- Content / Introduction
- Actor-Critic
- Natural gradient
- Applications
- Conclusion
- References
Actor-Critic
- Separate memory structures for the policy (Actor) and the value function (Critic).
- After each action the critic evaluates the new state and returns an error.
- The actor and the critic are both updated using this error.
Figure: The Actor-Critic architecture [2]
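To make the loop concrete, here is a minimal sketch of a generic actor-critic update in Python, assuming a toy environment interface (env.reset, env.step), linear features, and a softmax actor; the TD error plays the role of the critic's error signal. This illustrates the architecture above, not the specific algorithm of [1].

    import numpy as np

    def actor_critic_episode(env, theta, w, features,
                             alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
        """Run one episode, updating actor parameters theta and critic weights w."""
        state = env.reset()
        done = False
        while not done:
            phi = features(state)                      # feature vector of the state
            prefs = theta @ phi                        # actor: linear preferences
            probs = np.exp(prefs - prefs.max())
            probs /= probs.sum()                       # softmax policy
            action = np.random.choice(len(probs), p=probs)

            next_state, reward, done = env.step(action)

            # Critic: TD error of a linear value function V(s) = w . phi(s)
            v = w @ phi
            v_next = 0.0 if done else w @ features(next_state)
            td_error = reward + gamma * v_next - v

            w += alpha_critic * td_error * phi         # critic update
            # Actor update: score function of the softmax, scaled by the critic's error
            grad_log_pi = np.outer(np.eye(len(probs))[action] - probs, phi)
            theta += alpha_actor * td_error * grad_log_pi

            state = next_state
        return theta, w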
Actor-Critic: Notation
The model in the article is loosely based on an MDP:
- Discrete time
- Continuous state set X ⊆ R^n
- Continuous action set U ⊆ R^m
The system:
- Start state x_0, drawn from a start-state distribution p(x_0)
- At any state x the actor chooses an action u according to the policy π(u | x)
- The system transfers to a new state x' according to the transition probabilities p(x' | x, u)
- The system yields a reward r(x, u) after each action
Actor-Critic: Functions
The goal of the 'system' is to find an optimal policy π.
This goal is reached by optimizing the normalized expected return J as a function of the policy parameters, using its differential (the policy gradient).
Problem: the meat and bones of the article get lost in convoluted functions.
Solution: use a (presumably) known model/system that can be improved using the same method [4].
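For reference, the quantities named on this slide, written out along the lines of the formulation in [1] (a paraphrase of the standard policy-gradient setup, not a verbatim copy of the paper's equations):

    J(\theta) = \int_{X} d^{\pi}(x) \int_{U} \pi_{\theta}(u \mid x)\, r(x,u)\, du\, dx,
    \qquad d^{\pi}(x) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p(x_t = x)

    \nabla_{\theta} J(\theta) = \int_{X} d^{\pi}(x) \int_{U} \nabla_{\theta}\, \pi_{\theta}(u \mid x)\, \big( Q^{\pi}(x,u) - b(x) \big)\, du\, dx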
Actor-Critic: Simplified model
Actor:
- Universal function approximator, e.g. a Multi-Layer Perceptron (MLP).
- Gets its error from the critic.
- Gradient descent!
Critic:
- A baseline (based on example data, or a constant) combined with a function containing learned information and the reward (see the sketch below).
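A tiny sketch of the critic signal described above, assuming the episodic case where the error is the discounted return of an episode measured against a baseline (the function name and the discounting are illustrative, not from [1]):

    import numpy as np

    def critic_error(episode_rewards, baseline, gamma=0.99):
        """Discounted return of one episode minus a (constant or learned) baseline."""
        discounts = gamma ** np.arange(len(episode_rewards))
        episode_return = float(np.sum(discounts * np.asarray(episode_rewards)))
        return episode_return - baseline   # this error drives the actor's gradient step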
Natural Gradient: Vanilla Gradient Descent
The critic returns an error which, in combination with the function approximator, can be used to create an error function.
The partial derivatives of this error function, the gradient, can now be used to update the internal variables of the function approximator (and of the critic).
Figure: Gradient descent [3]
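A minimal illustration of the 'vanilla' update rule on a toy quadratic error surface; the error function here stands in for the critic's error, and the loop is plain gradient descent on the approximator's internal variables:

    import numpy as np

    def error(theta):
        return 0.5 * np.sum(theta ** 2)          # toy error surface

    def error_gradient(theta):
        return theta                              # its partial derivatives

    theta = np.array([2.0, -3.0])                 # internal variables of the approximator
    learning_rate = 0.1
    for step in range(100):
        theta = theta - learning_rate * error_gradient(theta)
    print(error(theta))                           # approaches the minimum at 0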
Natural Gradient: Definition
An 'alternative' gradient to update the function approximator.
Definition of the natural gradient: ∇̃J(θ) = F(θ)^-1 ∇J(θ), where F(θ)^-1 denotes the inverse of the Fisher Information Matrix (FIM).
The FIM is a statistical construct that captures how strongly the distribution of the data changes when the parameters change.
Applied to the vanilla gradient, the inverse FIM gives the direction of steepest descent with respect to this underlying distribution rather than to the raw parameters [4].
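A minimal sketch of what the definition means in practice, assuming the FIM is estimated from sampled score vectors (grad log π); the damping term is an illustrative numerical safeguard, not part of the definition:

    import numpy as np

    def natural_gradient(vanilla_grad, score_samples, damping=1e-3):
        """Return F(theta)^-1 times the vanilla gradient.

        score_samples: array of shape (N, d) holding grad_theta log pi(u|x)
        for N sampled state-action pairs; their outer products estimate the FIM.
        """
        n, d = score_samples.shape
        fisher = score_samples.T @ score_samples / n   # empirical Fisher matrix
        fisher += damping * np.eye(d)                  # keep it invertible
        return np.linalg.solve(fisher, vanilla_grad)

    # Parameter update with the natural instead of the vanilla gradient:
    # theta = theta + learning_rate * natural_gradient(grad_J, scores)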
Natural Gradient: Properties
- The natural gradient is a linearly reweighted version of the normal (vanilla) gradient.
- Convergence to a local minimum is guaranteed.
- By choosing a more direct path to the optimal solution it converges faster and avoids premature convergence.
- Covariant: independent of the chosen coordinate frame.
- Averages out the stochasticity of the samples, so smaller datasets suffice for estimating the gradient.
Figure: Gradient landscape for the 'vanilla' and the natural gradient. Adapted from [1]
Natural Gradient: Plateaus
The natural gradient is a solution for escaping from plateaus in the gradient landscape.
Plateaus are regions where the gradients of a function are extremely small.
It takes considerable time to traverse them, and they are a well-known 'feature' of gradient-descent methods.
Figure: Example function landscape showing multiple plateaus, and the resulting error while traversing it with normal gradient steps (iterations) [5]
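A small, self-contained demonstration of a plateau (not taken from [5]): a single saturated sigmoid unit, where the error surface is nearly flat and vanilla gradient steps barely move the parameter:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def loss(w, x=1.0, target=0.0):
        return 0.5 * (sigmoid(w * x) - target) ** 2

    def grad(w, x=1.0, target=0.0):
        y = sigmoid(w * x)
        return (y - target) * y * (1.0 - y) * x   # chain rule; ~0 once y saturates

    w = 8.0                                        # start deep in the saturated region
    for step in range(1000):
        w -= 0.5 * grad(w)
    print(w, loss(w))                              # w has hardly moved: a plateau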
Applications: Cart-Pole Balancing
- Well-known benchmark for reinforcement learning [1].
- Unstable non-linear system that can be simulated.
- State: cart position and velocity, pole angle and angular velocity.
- Action: force applied to the cart.
- Reward: based on the current state, with a constant baseline (episodic Actor-Critic); a sketch of these quantities follows below.
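A sketch of the quantities on this slide. The state and action variables are the standard cart-pole ones; the quadratic reward and its weights are illustrative common choices, not necessarily the exact values used in [1]:

    import numpy as np

    # State: cart position x, cart velocity x_dot, pole angle phi, angular velocity phi_dot
    state = np.array([0.0, 0.0, 0.05, 0.0])

    # Action: a force applied to the cart, drawn from a Gaussian policy
    def sample_action(policy_mean, policy_std=1.0):
        return np.random.normal(policy_mean, policy_std)

    # Reward: penalizes deviation from the upright, centred position (illustrative weights);
    # the episodic critic later measures returns against a constant baseline
    def reward(state, action):
        x, x_dot, phi, phi_dot = state
        return -(x ** 2 + 10.0 * phi ** 2 + 0.1 * action ** 2)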
Applications: Cart-Pole Balancing
Simulated experiment with a sample rate of 60 Hz, comparing the natural- and vanilla-gradient Actor-Critic algorithms.
Results:
- The natural gradient implementation takes on average ten minutes to find an optimal solution.
- The vanilla gradient takes on average two hours.
Figure: Expected return of the policy, averaged over 100 simulated runs
Applications: Baseball
- Optimizing nonlinear dynamic motor primitives for robotics. In plain English: teaching a robot to hit a ball.
- Shows the use of a rich baseline for the critic: a teacher manipulating the robot (LSTD-Q(λ) Actor-Critic).
- State, action and reward are not explicitly given, but are based on the motor primitives (and presumably a camera input).
Figure: Optimal (red), POMDP (dashed) and Actor-Critic motor primitives.
Applications: Baseball
The task of the robot is to hit the ball so that it flies as far as possible. The robot has seven degrees of freedom.
Initially the robot is taught by supervised learning and fails; subsequently the performance is improved by the Natural Actor-Critic.
Applications: Baseball
Both learning methods eventually learn their own version of the best solution.
However, the POMDP approach requires 10^6 learning steps, compared to 10^3 for the Natural Actor-Critic.
Remarkably, the Natural Actor-Critic's solution is subjectively closer to the teacher's (optimal) solution.
Conclusions
A novel policy-gradient reinforcement learning method, in two distinct flavors:
- Episodic, with a constant as the baseline function in the critic
- LSTD-Q(λ), with a rich baseline (teacher) function
The improved performance can be traced back to the use of the natural gradient, which exploits statistical information about the data to direct the changes in the learned functions.
Conclusions
Preliminary versions of the method have been implemented in a wide range of real-world applications:
- Humanoid robots
- Traffic light optimization
- Multi-robot systems
- Gait optimization in robot locomotion
References
[1] J. Peters and S. Schaal, "Natural Actor-Critic", Neurocomputing, 2008.
[2] R.S. Sutton and A.G. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, 1998. Web version:
[3]
[4] S. Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 10, 251-276, 1998.
[5] K. Fukumizu and S. Amari, "Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons", Neural Networks, 2000.