The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks. Gerhard Neumann, Master Thesis 2005, Institut für Grundlagen der Informationsverarbeitung.

Presentation transcript:

The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks. Gerhard Neumann, Master Thesis 2005, Institut für Grundlagen der Informationsverarbeitung (IGI). www.igi.tu-graz.ac.at/ril-toolbox

Master Thesis: Reinforcement Learning Toolbox. A general software tool for reinforcement learning, plus benchmark tests of reinforcement learning algorithms on three optimal control problems: Pendulum Swing-Up, Cart-Pole Swing-Up, Acro-Bot Swing-Up. www.igi.tu-graz.ac.at/ril-toolbox

RL Toolbox Motivation: A tool for beginners and professionals to work on RL problems. With no additional work needed on the learning algorithm, we can concentrate on the learning problem. Universal interface for reinforcement learning algorithms: once we have defined our learning problem, we can easily test different RL algorithms. www.igi.tu-graz.ac.at/ril-toolbox

RL Toolbox: Features. Software: C++ class system, open source / non-commercial. Homepage: www.igi.tu-graz.ac.at/ril-toolbox (class reference, manual). Runs under Linux and Windows. > 40,000 lines of code, > 250 classes. www.igi.tu-graz.ac.at/ril-toolbox

RL Toolbox: Features
Learning in discrete or continuous state spaces
Learning in discrete or continuous action spaces
Different kinds of learning algorithms: TD(λ) learning; actor-critic learning; dynamic programming, model-based learning and planning methods; continuous-time RL; policy search algorithms; residual / residual-gradient algorithms
Different function approximators: RBF networks, linear interpolation, CMAC tile coding, feed-forward neural networks
Learning from other (self-coded) controllers
Hierarchical reinforcement learning
www.igi.tu-graz.ac.at/ril-toolbox

Structure of the Learning System: The agent and the environment. The agent tells the environment which action to execute; the environment performs the internal state transitions. The environment defines the learning problem. www.igi.tu-graz.ac.at/ril-toolbox

Structure of the learning system: Linkage to the learning algorithms. All algorithms need the transition <s_t, a_t, s_{t+1}> for learning. The algorithms are implemented as listeners and adapt the agent's controller to learn the optimal policy. The agent informs all listeners about each step and about the start of a new episode (see the interface sketch below). www.igi.tu-graz.ac.at/ril-toolbox
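
The listener linkage can be sketched as a small C++ interface. This is a minimal sketch of the idea only, not the Toolbox's actual class hierarchy; the names StepListener, onStep and onNewEpisode are illustrative assumptions.

```cpp
#include <vector>

// Illustrative stand-ins for the Toolbox's state/action types.
struct State  { std::vector<double> values; };
struct Action { int index; };

// Anything that wants to learn from experience registers as a listener.
class StepListener {
public:
    virtual ~StepListener() = default;
    // Called by the agent after every transition <s_t, a_t, s_{t+1}>.
    virtual void onStep(const State& s, const Action& a,
                        double reward, const State& sNext) = 0;
    // Called when a new episode starts.
    virtual void onNewEpisode() = 0;
};

class Agent {
public:
    void addListener(StepListener* l) { listeners.push_back(l); }

    // After executing an action in the environment, the agent
    // broadcasts the observed transition to all registered learners.
    void notifyStep(const State& s, const Action& a,
                    double r, const State& sNext) {
        for (StepListener* l : listeners) l->onStep(s, a, r, sNext);
    }
    void notifyNewEpisode() {
        for (StepListener* l : listeners) l->onNewEpisode();
    }

private:
    std::vector<StepListener*> listeners;
};
```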

Reinforcement Learning: The agent operates on a state space S and an action space A with a transition function. The agent has to optimize the future discounted reward. There are many possibilities to solve this optimization task: value-based approaches, genetic search, and other optimization algorithms. www.igi.tu-graz.ac.at/ril-toolbox
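
The "future discounted reward" the agent optimizes is the standard expected discounted return; written out (γ is the discount factor, r_t the reward at step t):

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \,\middle|\, s_{0} = s \right],
\qquad 0 \le \gamma < 1 .
```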

Short overview of the algorithms: Value-based algorithms calculate the goodness (value) of each state. Policy-search algorithms represent the policy directly and search in the policy parameter space. Hybrid methods: actor-critic learning. www.igi.tu-graz.ac.at/ril-toolbox

Value-Based Algorithms: Calculate either the action-value function (Q-function), which is used directly for action selection, or the value function (V-function), which needs the transition function for action selection (e.g. predict the next state or use the derivative of the transition function). The representation of the V- or Q-function is in most cases independent of the learning algorithm, so we can use any function approximator for the value function (independent V-function and Q-function interfaces). Different algorithms: TD-learning, Advantage Learning, Continuous-Time RL. www.igi.tu-graz.ac.at/ril-toolbox
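
The difference in action selection can be written compactly; here f denotes the (possibly learned) transition function and r(s, a) the immediate reward, symbols that are not spelled out on the slide:

```latex
a^{*} = \arg\max_{a} Q(s,a)
\qquad \text{vs.} \qquad
a^{*} = \arg\max_{a} \big[\, r(s,a) + \gamma\, V\!\big(f(s,a)\big) \,\big].
```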

Value-Based Algorithms:
TD(λ) Learning: Update the value function with the temporal difference TD = r_t + γ · Q(s_{t+1}, a') − Q(s_t, a_t). Use eligibility traces to measure the influence of preceding states. Variants: Q-Learning, SARSA learning, and V-learning (prediction of the next state using the (learned) transition function, which also makes it possible to predict further into the future).
Advantage Learning [Baird, 1999]: Calculates the advantage of each action in the current state, i.e. the difference to the optimal action; the advantage is scaled by the time step dt.
Continuous-Time RL [Doya, 1999]: Approximates the value function through differential equations in continuous time and uses the transition function for action selection.
www.igi.tu-graz.ac.at/ril-toolbox
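
As a concrete instance of the TD(λ) update above, a tabular sketch (the data layout and names are illustrative, not Toolbox code; aNext is the greedy action in s' for Q-learning, or the actually selected next action for SARSA(λ)):

```cpp
#include <cstddef>
#include <vector>

// One tabular Q(lambda) update for an observed transition <s, a, r, s'>.
void qLambdaUpdate(std::vector<std::vector<double>>& Q,   // Q[state][action]
                   std::vector<std::vector<double>>& e,   // eligibility traces
                   int s, int a, double r, int sNext, int aNext,
                   double alpha, double gamma, double lambda) {
    // Temporal difference: TD = r + gamma * Q(s', a') - Q(s, a)
    double td = r + gamma * Q[sNext][aNext] - Q[s][a];

    e[s][a] += 1.0;  // accumulate the trace of the visited pair
    for (std::size_t i = 0; i < Q.size(); ++i) {
        for (std::size_t j = 0; j < Q[i].size(); ++j) {
            Q[i][j] += alpha * td * e[i][j];   // credit preceding states
            e[i][j] *= gamma * lambda;         // decay all traces
        }
    }
}
```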

Approximating the Value Function: Gradient descent on the Bellman error function E = [V_new − V_old]² = [r_t + γ V(s_{t+1}) − V(s_t)]² = TD². Different methods of calculating the gradient: the direct method (no guaranteed convergence), the residual-gradient method (very slow, but guaranteed convergence), and the residual method (a linear combination of both approaches). These value function approximation methods are supported for all value-based algorithms. www.igi.tu-graz.ac.at/ril-toolbox
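
Written out for a parameterized value function V_w, the three variants differ only in how the successor term is treated. A sketch following Baird's residual algorithms; α is the learning rate and β the residual mixing weight, symbols introduced here for illustration:

```latex
\begin{aligned}
\mathrm{TD} &= r_t + \gamma\, V_w(s_{t+1}) - V_w(s_t), \qquad E = \mathrm{TD}^2 \\
\text{direct:}\quad \Delta w &= \alpha\, \mathrm{TD}\; \nabla_w V_w(s_t) \\
\text{residual gradient:}\quad \Delta w &= \alpha\, \mathrm{TD}\, \big[\, \nabla_w V_w(s_t) - \gamma\, \nabla_w V_w(s_{t+1}) \,\big] \\
\text{residual } (0 \le \beta \le 1):\quad \Delta w &= \alpha\, \mathrm{TD}\, \big[\, \nabla_w V_w(s_t) - \beta\,\gamma\, \nabla_w V_w(s_{t+1}) \,\big]
\end{aligned}
```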

Policy Search / Policy Gradient Algorithms: Directly climb the value of a parameterized policy. Calculate the values of N given initial states per simulation (PEGASUS, [Ng, 2000]). Use standard optimization techniques such as gradient ascent, simulated annealing or genetic algorithms; gradient ascent is used in the Toolbox. www.igi.tu-graz.ac.at/ril-toolbox
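
In the gradient-ascent case this amounts to repeatedly stepping the policy parameters θ along the estimated policy value (a generic sketch; α denotes the step size):

```latex
\theta \;\leftarrow\; \theta + \alpha\, \nabla_{\theta}\, \hat V(\pi_{\theta}).
```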

PEGASUS: Gradient Estimation. The value of a policy is estimated over the N fixed initial states. The gradient can be estimated numerically or analytically; the analytical variant is the first algorithm that calculates the gradient analytically. www.igi.tu-graz.ac.at/ril-toolbox
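
A sketch of the two estimators: the policy value is averaged over the N fixed start states (with fixed random seeds, the key PEGASUS idea), and the numerical gradient perturbs each parameter θ_i by a small ε along the unit vector e_i; the analytical variant instead differentiates the value estimate through the model by the chain rule. The symbols below are illustrative notation, not the thesis's exact formulas:

```latex
\hat V(\pi_{\theta}) = \frac{1}{N} \sum_{i=1}^{N} V^{\pi_{\theta}}\!\big(s_{0}^{(i)}\big),
\qquad
\frac{\partial \hat V}{\partial \theta_i} \approx
\frac{\hat V(\pi_{\theta + \epsilon e_i}) - \hat V(\pi_{\theta})}{\epsilon}.
```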

Actor-Critic Methods: Learn the value function and an extra policy representation. Discrete actor-critic: stochastic policies that directly represent the action selection probabilities; similar to TD-Q learning. Continuous actor-critic: directly outputs the continuous control vector; the policy can be represented by any function approximator. Algorithms: the Stochastic Real-Valued (SRV) algorithm [Gullapalli, 1992] and the Policy-Gradient Actor-Critic (PGAC) algorithm. www.igi.tu-graz.ac.at/ril-toolbox
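
A minimal sketch of the discrete actor-critic case, assuming tabular action preferences P[s][a] (actor) and a tabular critic V[s]; the names and layout are illustrative, not the Toolbox API:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// One discrete actor-critic update for a transition <s, a, r, s'>.
void actorCriticUpdate(std::vector<std::vector<double>>& P,  // preferences
                       std::vector<double>& V,               // critic
                       int s, int a, double r, int sNext,
                       double alphaActor, double alphaCritic, double gamma) {
    double td = r + gamma * V[sNext] - V[s];  // critic's TD error
    V[s]    += alphaCritic * td;              // critic update
    P[s][a] += alphaActor  * td;              // actor: reinforce a if td > 0
}

// Softmax (Gibbs) action selection over the preferences of one state.
int selectAction(const std::vector<double>& prefs) {
    std::vector<double> probs(prefs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < prefs.size(); ++i) {
        probs[i] = std::exp(prefs[i]);
        sum += probs[i];
    }
    double u = sum * (std::rand() / (RAND_MAX + 1.0));  // sample in [0, sum)
    double acc = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        acc += probs[i];
        if (u < acc) return static_cast<int>(i);
    }
    return static_cast<int>(probs.size()) - 1;
}
```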

Policy-Gradient Actor-Critic Algorithm: Learn the V-function with a standard algorithm. Calculate the gradient of the value within a certain time window (k steps in the past, l steps in the future); the gradient is then estimated from this window. Again, the exact model is needed. www.igi.tu-graz.ac.at/ril-toolbox

Second Part: Benchmark Tests. Pendulum Swing-Up (easy task), Cart-Pole Swing-Up (medium task), Acro-Bot Swing-Up (hard task). www.igi.tu-graz.ac.at/ril-toolbox

Benchmark Problems: Common problems in non-linear control; the goal is to reach an unstable fixed point. 2 or 4 continuous state variables, 1 continuous control variable. Reward: height of the end point at each time step. www.igi.tu-graz.ac.at/ril-toolbox
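
For the pendulum, for example, the tip height is a simple function of the pole angle θ measured from the upright position; this particular form is an illustrative assumption, not necessarily the exact reward used in the thesis:

```latex
r_t \;\propto\; l \cos(\theta_t), \qquad \theta_t = 0 \text{ at the upright position}.
```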

Benchmark Tests:
Test the algorithms on the benchmark problems with different parameter settings and compare their sensitivity to the parameter settings.
Use different function approximators (FAs): linear FAs (e.g. RBF networks) with a typical local representation, which suffer from the curse of dimensionality (a sketch of such an approximator follows below); and non-linear FAs (e.g. feed-forward neural networks), which have no exponential dependency on the input state dimension but are harder to learn (no local representation).
Compare the algorithms with respect to their features and requirements: Is the exact transition function needed? Can the algorithm produce continuous actions? How much computation time is needed?
Use hierarchical RL, directed exploration strategies or planning methods to boost learning.
www.igi.tu-graz.ac.at/ril-toolbox
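
A minimal sketch of the linear-FA case: a value function represented as a linear combination of fixed Gaussian basis functions. This illustrates the local representation and the linear-weight gradient; it is not the Toolbox's RBF class.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct RBFValueFunction {
    std::vector<std::vector<double>> centers;  // one center per basis function
    std::vector<double> widths;                // Gaussian width per basis function
    std::vector<double> weights;               // learned linear weights

    // Local activation: only basis functions near the state respond noticeably.
    double activation(std::size_t i, const std::vector<double>& s) const {
        double d2 = 0.0;
        for (std::size_t k = 0; k < s.size(); ++k) {
            double d = s[k] - centers[i][k];
            d2 += d * d;
        }
        return std::exp(-d2 / (2.0 * widths[i] * widths[i]));
    }

    double value(const std::vector<double>& s) const {
        double v = 0.0;
        for (std::size_t i = 0; i < weights.size(); ++i)
            v += weights[i] * activation(i, s);
        return v;
    }

    // Direct-gradient TD update: for linear weights the gradient of V
    // w.r.t. each weight is simply the activation of its basis function.
    void update(const std::vector<double>& s, double td, double alpha) {
        for (std::size_t i = 0; i < weights.size(); ++i)
            weights[i] += alpha * td * activation(i, s);
    }
};
```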

Benchmark Tests: Cart-Pole Task, RBF network. Planning boosts performance significantly but is very time intensive (at search depth 5, roughly 120 times longer computation time). The PG-AC approach can compete with the standard V-learning approach, but cannot represent sharp decision boundaries. www.igi.tu-graz.ac.at/ril-toolbox

Benchmark: PG-AC vs. V-Planning, Feed-Forward NN. Learning with a FF-NN using the standard planning approach is almost impossible (very unstable performance). PG-AC with an RBF critic (time window = 7 time steps) manages to learn the task in roughly 1/10 of the episodes needed by the standard planning approach. www.igi.tu-graz.ac.at/ril-toolbox

V-Planning, Cart-Pole Task: Higher search depths could improve performance significantly, but at an exponentially growing cost in computation time (see the lookahead sketch below). www.igi.tu-graz.ac.at/ril-toolbox
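
The exponential cost is visible in a naive depth-d lookahead that branches over all (discretized) actions at every level. A sketch assuming a deterministic model callable as model(s, a) that returns the successor state and immediate reward; Transition and the callables are hypothetical helpers, not Toolbox classes.

```cpp
#include <limits>
#include <vector>

struct Transition { std::vector<double> sNext; double reward; };

// Best d-step lookahead value of state s under a known model, falling back
// to the learned value function V at the leaves. Cost grows as |A|^depth.
template <typename Model, typename ValueFn>
double lookahead(const std::vector<double>& s, int depth,
                 const Model& model, const ValueFn& V,
                 const std::vector<int>& actions, double gamma) {
    if (depth == 0) return V(s);               // leaf: use learned V
    double best = -std::numeric_limits<double>::infinity();
    for (int a : actions) {                    // branching factor = |A|
        Transition t = model(s, a);
        double q = t.reward + gamma * lookahead(t.sNext, depth - 1,
                                                model, V, actions, gamma);
        if (q > best) best = q;
    }
    return best;
}
```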

Hierarchical RL, Cart-Pole Task: The hierarchical sub-goal approach (α = 0.6) outperforms the flat approach (α = 1.0). www.igi.tu-graz.ac.at/ril-toolbox

Other general results:
The Acro-Bot task could not be learned with the flat architecture; the hierarchical architecture manages to swing up, but could not stay on top.
Nearly all algorithms managed to learn the first two tasks with linear function approximation (RBF networks).
Non-linear function approximators are very hard to learn: feed-forward NNs have very poor performance (no locality) and require very restrictive parameter settings, but can be used for larger state spaces.
Approaches that use the transition function typically outperform the model-free approaches.
The policy-gradient algorithm (PEGASUS) only worked with the linear FAs; with non-linear FAs it could not escape local maxima.
www.igi.tu-graz.ac.at/ril-toolbox

Literature
[Sutt_1999] R. Sutton and A. Barto: Reinforcement Learning: An Introduction. MIT Press.
[Ng_2000] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs.
[Doya_1999] K. Doya: Reinforcement learning in continuous time and space.
[Baxter_1999] J. Baxter: Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments.
[Baird_1999] L. Baird: Reinforcement Learning Through Gradient Descent. PhD thesis.
[Gulla_1992] V. Gullapalli: Reinforcement learning and its application to control.
[Coulom_2000] R. Coulom: Reinforcement Learning Using Neural Networks. PhD thesis.
www.igi.tu-graz.ac.at/ril-toolbox