An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING
Andrew G. Barto, Department of Computer Science, University of Massachusetts – Amherst
Lecture 3
Autonomous Learning Laboratory – Department of Computer Science
A. G. Barto, Barcelona Lectures. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press.

The Overall Plan
- Lecture 1: What is Computational Reinforcement Learning?
  - Learning from evaluative feedback
  - Markov decision processes
- Lecture 2:
  - Dynamic Programming
  - Basic Monte Carlo methods
  - Temporal Difference methods
  - A unified perspective
  - Connections to neuroscience
- Lecture 3:
  - Function approximation
  - Model-based methods
  - Dimensions of Reinforcement Learning

Lecture 3, Part 1: Generalization and Function Approximation
Objectives of this part:
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part
- Overview of function approximation (FA) methods and how they can be adapted to RL

Value Prediction with FA
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
In earlier chapters, value functions were stored in lookup tables.

Adapt Supervised Learning Algorithms
Training info = desired (target) outputs
Error = (target output − actual output)
Training example = {input, target output}
[Diagram: inputs → supervised learning system → outputs]

Backups as Training Examples
As a training example: input = (a description of) the state s_t; target output = the backed-up value for s_t, e.g., r_{t+1} + γ V(s_{t+1}) for a TD(0) backup.
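A minimal sketch (not from the slides) of turning one TD(0) backup into a supervised (input, target) pair; the feature function, linear parameter vector theta, and the sampled transition are illustrative assumptions.

```python
import numpy as np

def td0_training_example(features, theta, s, r, s_next, gamma=0.9):
    """Return (input, target) where the target is the one-step backed-up value."""
    x = features(s)                          # input: feature description of s_t
    v_next = float(np.dot(theta, features(s_next)))   # current estimate V(s_{t+1})
    target = r + gamma * v_next              # backed-up value r_{t+1} + gamma * V(s_{t+1})
    return x, target
```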

Any FA Method?
- In principle, yes: artificial neural networks, decision trees, multivariate regression methods, etc.
- But RL has some special requirements:
  - usually want to learn while interacting
  - ability to handle nonstationarity
  - other?

Gradient Descent Methods
Assume V_t is a (sufficiently smooth) differentiable function of the parameter vector θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T (the superscript T denotes transpose).

Performance Measures
- Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:
  MSE(θ_t) = Σ_s P(s) [V^π(s) − V_t(s)]²
- Why P? Why minimize MSE?
- Let us assume that P is always the distribution of states with which backups are done.
- The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

Gradient Descent
Iteratively move down the gradient:
  θ_{t+1} = θ_t − α ∇_{θ_t} f(θ_t), where f is the function being minimized.

Gradient Descent Cont.
For the MSE given above and using the chain rule:
  θ_{t+1} = θ_t − (1/2) α ∇_{θ_t} MSE(θ_t) = θ_t + α Σ_s P(s) [V^π(s) − V_t(s)] ∇_{θ_t} V_t(s)

Gradient Descent Cont.
Use just the sample gradient instead:
  θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_{θ_t} V_t(s_t)
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t.
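A minimal sketch of this sample-gradient update for a linear value function, using Monte Carlo returns as targets; the episode generator, feature function, and step size are illustrative assumptions, not from the slides.

```python
import numpy as np

def gradient_mc_prediction(generate_episode, features, n_features,
                           gamma=1.0, alpha=0.01, n_episodes=1000):
    """Gradient-descent value prediction with Monte Carlo returns as targets.

    generate_episode() is assumed to return a list of (state, reward) pairs
    for one episode under the policy being evaluated, where the reward is the
    one received after leaving that state.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        episode = generate_episode()
        G = 0.0
        # Work backwards so G accumulates the return from each state onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            x = features(state)              # gradient of a linear V_t is the feature vector
            v = theta @ x
            theta += alpha * (G - v) * x     # theta <- theta + alpha [G - V(s)] grad V(s)
    return theta
```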

But We Don't Have These Targets

What about TD(λ) Targets?

On-Line Gradient-Descent TD(λ)
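The algorithm box from this slide is not reproduced in the transcript; below is a hedged sketch of on-line gradient-descent TD(λ) for a linear value function with accumulating eligibility traces. The env interface (reset/step), the feature function, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def linear_td_lambda(env, features, n_features, policy,
                     gamma=0.99, lam=0.9, alpha=0.05, n_episodes=500):
    """On-line gradient-descent TD(lambda) with a linear value function.

    env is assumed to expose reset() -> state and step(a) -> (state, reward, done);
    policy(state) returns the action of the policy being evaluated.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)                     # eligibility trace vector
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            v_next = 0.0 if done else theta @ features(s_next)
            delta = r + gamma * v_next - theta @ x   # TD error
            e = gamma * lam * e + x                  # accumulate traces (grad of linear V is x)
            theta += alpha * delta * e
            s = s_next
    return theta
```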

Linear Methods
Represent the value function as a linear function of a feature vector φ_s of the state:
  V_t(s) = θ_t^T φ_s = Σ_{i=1}^{n} θ_t(i) φ_s(i)

Nice Properties of Linear FA Methods
- The gradient is very simple: ∇_{θ_t} V_t(s) = φ_s
- For MSE, the error surface is simple: a quadratic surface with a single minimum.
- Linear gradient-descent TD(λ) converges:
  - step size decreases appropriately
  - on-line sampling (states sampled from the on-policy distribution)
  - converges to a parameter vector θ_∞ with MSE(θ_∞) ≤ ((1 − γλ)/(1 − γ)) MSE(θ*), where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)

Coarse Coding

Learning and Coarse Coding

Tile Coding
- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- Easy to compute the indices of the features present
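A hedged sketch of a simplified tile coder with uniformly offset grid tilings (real implementations often add irregular tilings or hashing, as the next slide notes); the tiling counts, bounds, and index layout are illustrative assumptions.

```python
import numpy as np

def tile_indices(x, low, high, n_tilings=8, tiles_per_dim=8):
    """Simplified tile coding: uniform grid tilings, each offset by a fraction
    of a tile width.  Returns one active (binary) feature index per tiling.

    x, low, high are 1-D arrays describing the point and the bounds of the space.
    """
    x, low, high = map(np.asarray, (x, low, high))
    tile_width = (high - low) / tiles_per_dim
    active = []
    for t in range(n_tilings):
        offset = (t / n_tilings) * tile_width              # shift each tiling slightly
        coords = np.floor((x - low + offset) / tile_width).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        idx = t                                            # flatten (tiling, coords) to one index
        for c in coords:
            idx = idx * tiles_per_dim + int(c)
        active.append(idx)
    return active                                          # exactly n_tilings features active

# With binary features, the approximate value is just a sum of active weights:
# V(s) = sum(theta[i] for i in tile_indices(s, low, high))
```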

Tile Coding Cont.
- Irregular tilings
- Hashing
- CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)

Radial Basis Functions (RBFs)
e.g., Gaussians:
  φ_s(i) = exp( − ||s − c_i||² / (2 σ_i²) )
where c_i is the feature's center and σ_i its width.
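A minimal sketch of Gaussian RBF features over a set of prototype centers; the grid of centers, the shared width sigma, and the example point are illustrative.

```python
import numpy as np

def rbf_features(s, centers, sigma=0.5):
    """Gaussian radial basis features: one feature per prototype center.

    centers is an (n_features, dim) array of prototype states; sigma is a
    shared width (per-feature widths would work the same way).
    """
    s = np.asarray(s, dtype=float)
    d2 = np.sum((centers - s) ** 2, axis=1)       # squared distances to each center
    return np.exp(-d2 / (2.0 * sigma ** 2))       # phi_s(i) = exp(-||s - c_i||^2 / 2 sigma^2)

# Example: a 5x5 grid of centers over the unit square.
centers = np.array([[i / 4, j / 4] for i in range(5) for j in range(5)])
phi = rbf_features([0.3, 0.7], centers)
```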

Can you beat the "curse of dimensionality"?
- Can you keep the number of features from going up exponentially with the dimension?
- Function complexity, not dimensionality, is the problem.
- Kanerva coding:
  - select a bunch of binary prototypes
  - use Hamming distance as the distance measure
  - dimensionality is no longer a problem, only complexity
- "Lazy learning" schemes:
  - remember all the data
  - to get a new value, find nearest neighbors and interpolate
  - e.g., locally weighted regression
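A minimal sketch of Kanerva-style features: binary prototypes plus a Hamming-distance activation radius; the prototype count, code length, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_prototypes(n_prototypes, n_bits):
    """Random binary prototypes; the number of features is n_prototypes,
    independent of the dimensionality of the original state."""
    return rng.integers(0, 2, size=(n_prototypes, n_bits))

def kanerva_features(state_bits, prototypes, threshold):
    """Binary feature i is active when prototype i is within `threshold`
    Hamming distance of the state's binary code."""
    dists = np.sum(prototypes != np.asarray(state_bits), axis=1)   # Hamming distances
    return (dists <= threshold).astype(float)

# Example: 50 prototypes over 16-bit state codes, activation radius 4 bits.
prototypes = make_prototypes(50, 16)
phi = kanerva_features(rng.integers(0, 2, size=16), prototypes, threshold=4)
```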

Control with FA
- Learning state-action values
- Training examples of the form: {description of (s_t, a_t), v_t}
- The general gradient-descent rule:
  θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_{θ_t} Q_t(s_t, a_t)
- Gradient-descent Sarsa(λ) (backward view):
  θ_{t+1} = θ_t + α δ_t e_t,
  where δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t) and e_t = γλ e_{t−1} + ∇_{θ_t} Q_t(s_t, a_t)

Linear Gradient-Descent Sarsa(λ)
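The algorithm box for this slide is not in the transcript; below is a hedged sketch that ties the Sarsa(λ) update from the previous slide to binary features (e.g., tile-coding indices) with ε-greedy action selection. The env interface, active_features function, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def linear_sarsa_lambda(env, active_features, n_features, n_actions,
                        alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                        n_episodes=500, seed=0):
    """Linear gradient-descent Sarsa(lambda) with binary features.

    active_features(s) is assumed to return the indices of the active features of s;
    env is assumed to expose reset() -> state and step(a) -> (state, reward, done).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, n_features))

    def q(s_idx, a):
        return theta[a, s_idx].sum()

    def eps_greedy(s_idx):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s_idx, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        e = np.zeros_like(theta)                 # eligibility traces
        s_idx = active_features(env.reset())
        a = eps_greedy(s_idx)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s_idx, a)
            e[a, s_idx] += 1.0                   # accumulating traces on the active features
            if not done:
                s_next_idx = active_features(s_next)
                a_next = eps_greedy(s_next_idx)
                delta += gamma * q(s_next_idx, a_next)
            theta += alpha * delta * e           # theta <- theta + alpha * delta * e
            e *= gamma * lam
            if not done:
                s_idx, a = s_next_idx, a_next
    return theta
```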

GPI Linear Gradient-Descent Watkins' Q(λ)

Mountain-Car Task

Mountain-Car Results

Baird's Counterexample

Baird's Counterexample Cont.

Should We Bootstrap?

Summary
- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods
  - radial basis functions
  - tile coding
  - Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

The Overall Plan
- Lecture 1: What is Computational Reinforcement Learning?
  - Learning from evaluative feedback
  - Markov decision processes
- Lecture 2:
  - Dynamic Programming
  - Basic Monte Carlo methods
  - Temporal Difference methods
  - A unified perspective
  - Connections to neuroscience
- Lecture 3:
  - Function approximation
  - Model-based methods
  - Dimensions of Reinforcement Learning

Lecture 3, Part 2: Model-Based Methods
Objectives of this part:
- Use of environment models
- Integration of planning and learning methods

Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: description of all possibilities and their probabilities, e.g., the transition probabilities P^a_{ss'} and expected rewards R^a_{ss'}
- Sample model: produces sample experiences, e.g., a simulation model
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to come by
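A toy illustration of the two kinds of model; the two-state MDP, its probabilities, and rewards are invented for the example.

```python
import random

# A made-up distribution model: for each (state, action) it lists every possible
# (next_state, reward) with its probability -- what DP methods need.
distribution_model = {
    ("s0", "a"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],   # (next_state, reward, prob)
    ("s1", "a"): [("s1", 0.0, 1.0)],
}

def sample_model(state, action):
    """A sample model built on top of the distribution model: it returns one
    sampled (next_state, reward), which is all Dyna-style planning needs."""
    outcomes = distribution_model[(state, action)]
    r = random.random()
    cumulative = 0.0
    for next_state, reward, prob in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][1]

next_state, reward = sample_model("s0", "a")
```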

Planning
- Planning: any computational process that uses a model to create or improve a policy
- Planning in AI:
  - state-space planning
  - plan-space planning (e.g., partial-order planners)
- We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience

Planning Cont.
- Classical DP methods are state-space planning methods
- Heuristic search methods are state-space planning methods
- A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (sketched below)
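The algorithm box for Random-Sample One-Step Tabular Q-Planning is not in the transcript; this is a minimal sketch under the assumption of a sample_model(s, a) -> (next_state, reward) function and an enumerable list of state-action pairs (both illustrative).

```python
import random
from collections import defaultdict

def q_planning(sample_model, state_action_pairs, n_steps=10000,
               alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning.

    Repeatedly: pick a (state, action) at random, ask the sample model for
    (next_state, reward), and apply a one-step Q-learning backup.
    """
    Q = defaultdict(float)
    actions_in = defaultdict(list)
    for s, a in state_action_pairs:
        actions_in[s].append(a)

    for _ in range(n_steps):
        s, a = random.choice(state_action_pairs)          # 1. sample a state-action pair
        s_next, r = sample_model(s, a)                    # 2. query the sample model
        best_next = max((Q[(s_next, a2)] for a2 in actions_in.get(s_next, [])),
                        default=0.0)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # 3. Q-learning backup
    return Q
```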

Learning, Planning, and Acting
- Two uses of real experience:
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

Direct vs. Indirect RL
- Indirect (model-based) methods:
  - make fuller use of experience: get a better policy with fewer environment interactions
- Direct methods:
  - simpler
  - not affected by bad models
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

The Dyna Architecture (Sutton 1990)

The Dyna-Q Algorithm
[Algorithm box with steps labeled: direct RL, model learning, planning]
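The Dyna-Q algorithm box is not reproduced in the transcript; a hedged tabular sketch follows, with the three labeled pieces marked in comments. The env interface, the deterministic-model simplification, and the hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning from the learned model.

    env is assumed to expose reset() -> state and step(a) -> (state, reward, done).
    """
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (next_state, reward)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)

            # (a) direct RL: one-step Q-learning backup from real experience
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            # (b) model learning: assume a deterministic environment, record last outcome
            model[(s, a)] = (s_next, r)

            # (c) planning: n backups from randomly sampled remembered pairs
            for _ in range(n_planning):
                (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
                p_best = max(Q[(ps_next, a2)] for a2 in range(n_actions))
                Q[(ps, pa)] += alpha * (pr + gamma * p_best - Q[(ps, pa)])

            s = s_next
    return Q
```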

Dyna-Q on a Simple Maze
Rewards are 0 until the goal is reached, when the reward is 1.

Dyna-Q Snapshots: Midway in 2nd Episode

When the Model is Wrong: Blocking Maze
The changed environment is harder.

Shortcut Maze
The changed environment is easier.

What is Dyna-Q+?
- Uses an "exploration bonus":
  - keeps track of the time since each state-action pair was tried for real
  - an extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
  - the agent actually "plans" how to visit long-unvisited states
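A small sketch of how the planning reward would change in Dyna-Q+ relative to the Dyna-Q sketch above, using a bonus of the form κ√τ; the function name, the κ default, and the time bookkeeping are illustrative.

```python
import math

# Dyna-Q+ modifies the planning backup of Dyna-Q: simulated rewards get a bonus
# that grows with the time since the state-action pair was last tried for real.

def planning_reward(model_reward, t, last_tried_step, kappa=0.001):
    tau = t - last_tried_step                  # time since (s, a) was last tried for real
    return model_reward + kappa * math.sqrt(tau)
```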

Prioritized Sweeping
- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  - when a new backup occurs, insert predecessors according to their priorities
  - always perform backups from the first pair in the queue
- Moore and Atkeson, 1993; Peng and Williams, 1993
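A hedged sketch of prioritized sweeping for a deterministic tabular model, using Python's heapq as the priority queue; the model and predecessor bookkeeping, the priority threshold, and the per-step planning budget are illustrative assumptions.

```python
import heapq
import itertools

_tie = itertools.count()    # tie-breaker so heap entries never compare states directly

def prioritized_sweeping_step(Q, model, predecessors, s, a, n_actions,
                              n_planning=5, alpha=0.1, gamma=0.95, theta=1e-4,
                              pqueue=None):
    """One round of prioritized-sweeping planning after a real transition from (s, a).

    model[(s, a)] = (next_state, reward) for a deterministic model;
    predecessors[s] = set of (prev_state, prev_action) known to lead to s;
    Q is a defaultdict(float).
    """
    if pqueue is None:
        pqueue = []

    def priority(state, action):
        s_next, r = model[(state, action)]
        best = max(Q[(s_next, a2)] for a2 in range(n_actions))
        return abs(r + gamma * best - Q[(state, action)])

    p = priority(s, a)
    if p > theta:
        heapq.heappush(pqueue, (-p, next(_tie), (s, a)))   # max-priority via negated key

    for _ in range(n_planning):
        if not pqueue:
            break
        _, _, (ps, pa) = heapq.heappop(pqueue)             # back up the highest-priority pair
        ps_next, pr = model[(ps, pa)]
        best = max(Q[(ps_next, a2)] for a2 in range(n_actions))
        Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
        for pred_s, pred_a in predecessors.get(ps, ()):    # queue predecessors that may change
            pp = priority(pred_s, pred_a)
            if pp > theta:
                heapq.heappush(pqueue, (-pp, next(_tie), (pred_s, pred_a)))
    return pqueue
```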

Prioritized Sweeping

Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction.

Rod Maneuvering (Moore and Atkeson, 1993)

Full and Sample (One-Step) Backups

Full vs. Sample Backups
b successor states, equally likely; initial error = 1; assume all next states' values are correct.

Trajectory Sampling
- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
- Advantages when function approximation is used
- Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored
[Figure: state space showing initial states, the states reachable under optimal control, and irrelevant states]

Trajectory Sampling Experiment
- One-step full tabular backups
- Uniform: cycled through all state-action pairs
- On-policy: backed up along simulated trajectories
- 200 randomly generated undiscounted episodic tasks
- 2 actions for each state, each with b equally likely next states
- 0.1 probability of transition to the terminal state
- Expected reward on each transition drawn from a mean-0, variance-1 Gaussian

Heuristic Search
- Used for action selection, not for changing a value function (= heuristic evaluation function)
- Backed-up values are computed, but typically discarded
- Extension of the idea of a greedy policy, only deeper
- Also suggests ways to select states to back up: smart focusing

Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning
- Distribution of backups: focus of the computation
  - trajectory sampling: backup along trajectories
  - prioritized sweeping
  - heuristic search
- Size of backups: full vs. sample; deep vs. shallow

The Overall Plan
- Lecture 1: What is Computational Reinforcement Learning?
  - Learning from evaluative feedback
  - Markov decision processes
- Lecture 2:
  - Dynamic Programming
  - Basic Monte Carlo methods
  - Temporal Difference methods
  - A unified perspective
  - Connections to neuroscience
- Lecture 3:
  - Function approximation
  - Model-based methods
  - Dimensions of Reinforcement Learning

Lecture 3, Part 3: Dimensions of Reinforcement Learning
Objectives of this part:
- Review the treatment of RL taken in this course
- What have we left out?
- What are the hot research areas?

Three Common Ideas
- Estimation of value functions
- Backing up values along real or simulated trajectories
- Generalized Policy Iteration: maintain an approximate optimal value function and an approximate optimal policy, and use each to improve the other

Backup Dimensions

Other Dimensions
- Function approximation:
  - tables
  - aggregation
  - other linear methods
  - many nonlinear methods
- On-policy/off-policy:
  - on-policy: learn the value function of the policy being followed
  - off-policy: try to learn the value function for the best policy, irrespective of what policy is being followed

Still More Dimensions
- Definition of return: episodic, continuing, discounted, etc.
- Action values vs. state values vs. afterstate values
- Action selection/exploration: ε-greedy, softmax, more sophisticated methods (see the sketch below)
- Synchronous vs. asynchronous
- Replacing vs. accumulating traces
- Real vs. simulated experience
- Location of backups (search control)
- Timing of backups: part of selecting actions, or only afterward?
- Memory for backups: how long should backed-up values be retained?
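A small sketch of the two simplest action-selection rules mentioned here, ε-greedy and softmax; the ε value, temperature, and random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```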

Frontier Dimensions
- Prove convergence for bootstrapping control methods
- Trajectory sampling
- Non-Markov case:
  - Partially Observable MDPs (POMDPs)
    - Bayesian approach: belief states
    - construct state from the sequence of observations
  - try to do the best you can with non-Markov states
- Modularity and hierarchies:
  - learning and planning at several different levels
    - theory of options

More Frontier Dimensions
- Using more structure:
  - factored state spaces: dynamic Bayes nets
  - factored action spaces

Still More Frontier Dimensions
- Incorporating prior knowledge:
  - advice and hints
  - trainers and teachers
  - shaping
  - Lyapunov functions
  - etc.