Natural Actor-Critic
Authors: Jan Peters and Stefan Schaal, Neurocomputing, 2008
Cognitive Robotics 2008/2009
Wouter Klijn
Content
- Content / Introduction
- Actor-Critic
- Natural gradient
- Applications
- Conclusion
- References
Actor-Critic
- Separate memory structures for the policy (Actor) and the value function (Critic).
- After each action the critic evaluates the new state and returns an error.
- The actor and the critic are both updated using this error.
Figure: The Actor-Critic architecture [2]
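To make the loop concrete, here is a minimal sketch of a generic actor-critic update in Python, assuming a toy environment interface (env.reset, env.step), linear features, and a softmax actor; the TD error plays the role of the critic's error signal. This illustrates the architecture above, not the specific algorithm of [1].

    import numpy as np

    def actor_critic_episode(env, theta, w, features,
                             alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
        """Run one episode, updating actor parameters theta and critic weights w."""
        state = env.reset()
        done = False
        while not done:
            phi = features(state)                      # feature vector of the state
            prefs = theta @ phi                        # actor: linear preferences
            probs = np.exp(prefs - prefs.max())
            probs /= probs.sum()                       # softmax policy
            action = np.random.choice(len(probs), p=probs)

            next_state, reward, done = env.step(action)

            # Critic: TD error of a linear value function V(s) = w . phi(s)
            v = w @ phi
            v_next = 0.0 if done else w @ features(next_state)
            td_error = reward + gamma * v_next - v

            w += alpha_critic * td_error * phi         # critic update
            # Actor update: score function of the softmax, scaled by the critic's error
            grad_log_pi = np.outer(np.eye(len(probs))[action] - probs, phi)
            theta += alpha_actor * td_error * grad_log_pi

            state = next_state
        return theta, w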
Actor-Critic: Notation
The model in the article is loosely based on an MDP:
- Discrete time
- Continuous state set X ⊆ R^n
- Continuous action set U ⊆ R^m
The system:
- Start state x_0, drawn from a start-state distribution p(x_0)
- At any state x the actor chooses an action u according to the policy π(u | x)
- The system transfers to a new state x' according to the transition probabilities p(x' | x, u)
- The system yields a reward r(x, u) after each action
Actor-Critic: Functions
The goal of the 'system' is to find an optimal policy π.
This goal is reached by optimizing the normalized expected return J as a function of the policy parameters, using its differential (the policy gradient).
Problem: the meat and bones of the article get lost in convoluted functions.
Solution: use a (presumably) known model/system that can be improved using the same method [4].
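For reference, the quantities named on this slide, written out along the lines of the formulation in [1] (a paraphrase of the standard policy-gradient setup, not a verbatim copy of the paper's equations):

    J(\theta) = \int_{X} d^{\pi}(x) \int_{U} \pi_{\theta}(u \mid x)\, r(x,u)\, du\, dx,
    \qquad d^{\pi}(x) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p(x_t = x)

    \nabla_{\theta} J(\theta) = \int_{X} d^{\pi}(x) \int_{U} \nabla_{\theta}\, \pi_{\theta}(u \mid x)\, \big( Q^{\pi}(x,u) - b(x) \big)\, du\, dx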
Actor-Critic: Simplified model
Actor:
- Universal function approximator, e.g. a Multi-Layer Perceptron (MLP).
- Gets its error from the critic.
- Gradient descent!
Critic:
- A baseline (based on example data, or a constant) combined with a function containing learned information and the reward (see the sketch below).
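A tiny sketch of the critic signal described above, assuming the episodic case where the error is the discounted return of an episode measured against a baseline (the function name and the discounting are illustrative, not from [1]):

    import numpy as np

    def critic_error(episode_rewards, baseline, gamma=0.99):
        """Discounted return of one episode minus a (constant or learned) baseline."""
        discounts = gamma ** np.arange(len(episode_rewards))
        episode_return = float(np.sum(discounts * np.asarray(episode_rewards)))
        return episode_return - baseline   # this error drives the actor's gradient step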
Natural Gradient: Vanilla Gradient Descent
The critic returns an error which, in combination with the function approximator, can be used to create an error function.
The partial derivatives of this error function, the gradient, can now be used to update the internal variables of the function approximator (and of the critic).
Figure: Gradient descent [3]
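A minimal illustration of the 'vanilla' update rule on a toy quadratic error surface; the error function here stands in for the critic's error, and the loop is plain gradient descent on the approximator's internal variables:

    import numpy as np

    def error(theta):
        return 0.5 * np.sum(theta ** 2)          # toy error surface

    def error_gradient(theta):
        return theta                              # its partial derivatives

    theta = np.array([2.0, -3.0])                 # internal variables of the approximator
    learning_rate = 0.1
    for step in range(100):
        theta = theta - learning_rate * error_gradient(theta)
    print(error(theta))                           # approaches the minimum at 0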
Natural Gradient: Definition
An 'alternative' gradient to update the function approximator.
Definition of the natural gradient: ∇̃J(θ) = F(θ)^-1 ∇J(θ), where F(θ)^-1 denotes the inverse of the Fisher Information Matrix (FIM).
The FIM is a statistical construct that captures how strongly the distribution of the data changes when the parameters change.
Applied to the vanilla gradient, the inverse FIM gives the direction of steepest descent with respect to this underlying distribution rather than to the raw parameters [4].
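A minimal sketch of what the definition means in practice, assuming the FIM is estimated from sampled score vectors (grad log π); the damping term is an illustrative numerical safeguard, not part of the definition:

    import numpy as np

    def natural_gradient(vanilla_grad, score_samples, damping=1e-3):
        """Return F(theta)^-1 times the vanilla gradient.

        score_samples: array of shape (N, d) holding grad_theta log pi(u|x)
        for N sampled state-action pairs; their outer products estimate the FIM.
        """
        n, d = score_samples.shape
        fisher = score_samples.T @ score_samples / n   # empirical Fisher matrix
        fisher += damping * np.eye(d)                  # keep it invertible
        return np.linalg.solve(fisher, vanilla_grad)

    # Parameter update with the natural instead of the vanilla gradient:
    # theta = theta + learning_rate * natural_gradient(grad_J, scores)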
Natural Gradient: Properties
- The natural gradient is a linearly reweighted version of the normal (vanilla) gradient.
- Convergence to a local minimum is guaranteed.
- By choosing a more direct path to the optimal solution it converges faster and avoids premature convergence.
- Covariant: independent of the chosen coordinate frame.
- Averages out the stochasticity of the samples, so smaller datasets suffice for estimating the gradient.
Figure: Gradient landscape for the 'vanilla' and the natural gradient. Adapted from [1]
Natural Gradient: Plateaus
The natural gradient is a solution for escaping from plateaus in the gradient landscape.
Plateaus are regions where the gradients of a function are extremely small.
It takes considerable time to traverse them, and they are a well-known 'feature' of gradient-descent methods.
Figure: Example function landscape showing multiple plateaus, and the resulting error while traversing it with normal gradient steps (iterations) [5]
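A small, self-contained demonstration of a plateau (not taken from [5]): a single saturated sigmoid unit, where the error surface is nearly flat and vanilla gradient steps barely move the parameter:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def loss(w, x=1.0, target=0.0):
        return 0.5 * (sigmoid(w * x) - target) ** 2

    def grad(w, x=1.0, target=0.0):
        y = sigmoid(w * x)
        return (y - target) * y * (1.0 - y) * x   # chain rule; ~0 once y saturates

    w = 8.0                                        # start deep in the saturated region
    for step in range(1000):
        w -= 0.5 * grad(w)
    print(w, loss(w))                              # w has hardly moved: a plateau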
Applications: Cart-Pole Balancing
- Well-known benchmark for reinforcement learning [1].
- Unstable non-linear system that can be simulated.
- State: cart position and velocity, pole angle and angular velocity.
- Action: force applied to the cart.
- Reward: based on the current state, with a constant baseline (episodic Actor-Critic); a sketch of these quantities follows below.
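A sketch of the quantities on this slide. The state and action variables are the standard cart-pole ones; the quadratic reward and its weights are illustrative common choices, not necessarily the exact values used in [1]:

    import numpy as np

    # State: cart position x, cart velocity x_dot, pole angle phi, angular velocity phi_dot
    state = np.array([0.0, 0.0, 0.05, 0.0])

    # Action: a force applied to the cart, drawn from a Gaussian policy
    def sample_action(policy_mean, policy_std=1.0):
        return np.random.normal(policy_mean, policy_std)

    # Reward: penalizes deviation from the upright, centred position (illustrative weights);
    # the episodic critic later measures returns against a constant baseline
    def reward(state, action):
        x, x_dot, phi, phi_dot = state
        return -(x ** 2 + 10.0 * phi ** 2 + 0.1 * action ** 2)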
Applications: Cart-Pole Balancing
Simulated experiment with a sample rate of 60 Hz, comparing the natural- and vanilla-gradient Actor-Critic algorithms.
Results:
- The natural gradient implementation takes on average ten minutes to find an optimal solution.
- The vanilla gradient takes on average two hours.
Figure: Expected return of the policy, averaged over 100 simulated runs
Applications: Baseball
- Optimizing nonlinear dynamic motor primitives for robotics. In plain English: teaching a robot to hit a ball.
- Shows the use of a rich baseline for the critic: a teacher manipulating the robot (LSTD-Q(λ) Actor-Critic).
- State, action and reward are not explicitly given, but are based on the motor primitives (and presumably a camera input).
Figure: Optimal (red), POMDP (dashed) and Actor-Critic motor primitives.
Applications: Baseball
The task of the robot is to hit the ball so that it flies as far as possible. The robot has seven degrees of freedom.
Initially the robot is taught by supervised learning and fails; subsequently the performance is improved by the Natural Actor-Critic.
Applications: Baseball
Both learning methods eventually learn their own version of the best solution.
However, the POMDP approach requires 10^6 learning steps, compared to 10^3 for the Natural Actor-Critic.
Remarkably, the Natural Actor-Critic's solution is subjectively closer to the teacher's (optimal) solution.
Conclusions
A novel policy-gradient reinforcement learning method, in two distinct flavors:
- Episodic, with a constant as the baseline function in the critic
- LSTD-Q(λ), with a rich baseline (teacher) function
The improved performance can be traced back to the use of the natural gradient, which exploits statistical information about the data to direct the changes in the learned functions.
Conclusions
Preliminary versions of the method have been implemented in a wide range of real-world applications:
- Humanoid robots
- Traffic light optimization
- Multi-robot systems
- Gait optimization in robot locomotion
References
[1] J. Peters and S. Schaal, "Natural Actor-Critic", Neurocomputing, 2008.
[2] R.S. Sutton and A.G. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, 1998. Web version:
[3]
[4] S. Amari, "Natural Gradient Works Efficiently in Learning", Neural Computation 10, 251-276, 1998.
[5] K. Fukumizu and S. Amari, "Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons", Neural Networks, 2000.