
Overcoming the Curse of Dimensionality with Reinforcement Learning
Rich Sutton, AT&T Labs
With thanks to Doina Precup, Peter Stone, Satinder Singh, David McAllester, Sanjoy Dasgupta

Computers have gotten faster and bigger
Analytic solutions are less important
Computer-based approximate solutions
–Neural networks
–Genetic algorithms
Machines take on more of the work
More general solutions to more general problems
–Non-linear systems
–Stochastic systems
–Larger systems
Exponential methods are still exponential… but compute-intensive methods increasingly winning

New Computers have led to a New Artificial Intelligence
More general problems and algorithms, automation
–Data intensive methods, learning methods
Less handcrafted solutions, expert systems
More probability, numbers
Less logic, symbols, human understandability
More real-time decision-making
States, Actions, Goals, Probability => Markov Decision Processes

Markov Decision Processes
State space S (finite)
Action space A (finite)
Discrete time t = 0, 1, 2, …
Episodes
Transition probabilities
Expected rewards
Policy
Return (with discount rate $\gamma$)
Value
Optimal policy
PREDICTION problem (evaluate a fixed policy) vs CONTROL problem (find an optimal policy)
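
The defining equations on this slide did not survive extraction; the standard definitions, in the notation used later in the talk, are roughly as follows (a reconstruction, not the slide's exact rendering):

```latex
\begin{align*}
  p^a_{ss'} &= \Pr\{s_{t+1}=s' \mid s_t=s,\ a_t=a\}                 && \text{transition probabilities}\\
  r^a_{ss'} &= E\{r_{t+1} \mid s_t=s,\ a_t=a,\ s_{t+1}=s'\}         && \text{expected rewards}\\
  \pi(s,a)  &= \Pr\{a_t=a \mid s_t=s\}                              && \text{policy}\\
  R_t       &= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots && \text{return, with discount rate } \gamma\\
  V^\pi(s)  &= E_\pi\{R_t \mid s_t=s\}                              && \text{value (the PREDICTION problem)}\\
  \pi^*: V^{\pi^*}(s) &\ge V^\pi(s)\ \ \forall \pi, s               && \text{optimal policy (the CONTROL problem)}
\end{align*}
```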

Key Distinctions
Control vs Prediction
Bootstrapping/Truncation vs Full Returns
Sampling vs Enumeration
Function approximation vs Table lookup
Off-policy vs On-policy
(In each pair, the first option is the harder, more challenging and interesting one; the second is easier and conceptually simpler.)

Full Depth Search
Full returns: $r + \gamma r' + \gamma^2 r'' + \cdots$
Computing $\hat V(s)$ by searching over all futures to full depth is of exponential complexity $B^D$ (branching factor $B$, depth $D$)

Truncated Search
Truncated returns: $r + \gamma \hat V(s')$
Search truncated after one ply
Approximate values used at the stubs
Values computed from their own estimates! -- "Bootstrapping"

Dynamic Programming is Bootstrapping
Truncated returns, backed up from all possible successor states
E.g., DP policy evaluation:
$\hat V(s) \leftarrow E_\pi\{r + \gamma \hat V(s')\} \qquad \forall s \in S$
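
A minimal tabular sketch of this backup, assuming the MDP is given explicitly (the names P, R, and pi and their layout are illustrative, not from the talk):

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, sweeps=100):
    """P[s][a]: list of (prob, next_state); R[s][a]: expected reward; pi[s][a]: action probability."""
    n_states = len(P)
    V = np.zeros(n_states)
    for _ in range(sweeps):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a, pi_sa in enumerate(pi[s]):
                for prob, s_next in P[s][a]:
                    # Bootstrapping: the backup uses the current estimate V[s_next]
                    V_new[s] += pi_sa * prob * (R[s][a] + gamma * V[s_next])
        V = V_new
    return V
```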

Bootstrapping seems to Speed Learning

Bootstrapping/Truncation
Replacing possible futures with estimates of value
Can reduce computation and variance
A powerful idea, but…
Requires stored estimates of value for each state

The Curse of Dimensionality (Bellman, 1961)
The number of states grows exponentially with dimensionality -- the number of state variables
Thus, on large problems:
–Can't complete even one sweep of DP policy evaluation: can't enumerate states, need sampling!
–Can't store separate values for each state: can't store values in tables, need function approximation!

DP Policy Evaluation
$\hat V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} p^a_{ss'}\big[r^a_{ss'} + \gamma \hat V_k(s')\big] \qquad \forall s \in S$
A second form weights the update to each state s by $d(s)$, some distribution over states, possibly uniform
TD(λ) samples the possibilities rather than enumerating and explicitly considering all of them

These terms can be replaced by sampling
In DP policy evaluation,
$\hat V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} p^a_{ss'}\big[r^a_{ss'} + \gamma \hat V_k(s')\big] \qquad \forall s \in S$
the enumerations over states ($\forall s \in S$, or the distribution $d(s)$), over actions ($\sum_a \pi(s,a)$), and over successor states ($\sum_{s'} p^a_{ss'}$) can all be replaced by sampling

Sampling vs Enumeration: Tabular TD(λ) (Sutton, 1988; Witten, 1974)
For each sample transition $s, a \rightarrow s', r$:
$\hat V(s) \leftarrow \hat V(s) + \alpha \big[ r + \gamma \hat V(s') - \hat V(s) \big]$
(shown here for the λ = 0 case; general λ adds eligibility traces)
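
A minimal sketch of this sampled update in code, shown for the λ = 0 case (TD(0)); the environment interface (reset/step) and the names are assumptions, not from the talk:

```python
import numpy as np

def tabular_td0(env, policy, n_states, alpha=0.1, gamma=0.9, episodes=1000):
    """env.reset() -> s; env.step(a) -> (s_next, r, done); policy(s) -> a (assumed interface)."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            # Sample backup: only the single observed transition s -> s_next is used,
            # in contrast to the DP sweep that enumerates all actions and successors.
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```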

Sample Returns can also be either Full or Truncated
Full: $r + \gamma r' + \gamma^2 r'' + \cdots$
Truncated: $r + \gamma \hat V(s')$
As in the general TD(λ) algorithm

Function Approximation
Store values in a parameterized form $\hat V(s) \approx V(s, \theta)$
Update θ, e.g., by gradient descent
cf. DP policy evaluation rewritten to include a step-size α
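
The update equations on this slide were lost in extraction; a standard gradient-descent value update of this kind, and the step-size form of DP policy evaluation it is compared to, look roughly like the following (a reconstruction, not the slide's exact formulas):

```latex
\begin{align*}
  \theta &\leftarrow \theta + \alpha \big[ v_t - \hat V(s_t) \big] \nabla_{\theta} \hat V(s_t)
      && \text{gradient step toward a target return } v_t \\
  \hat V_{k+1}(s) &\leftarrow \hat V_k(s) + \alpha \Big[ \textstyle\sum_a \pi(s,a) \sum_{s'} p^a_{ss'}
      \big( r^a_{ss'} + \gamma \hat V_k(s') \big) - \hat V_k(s) \Big]
      && \text{DP policy evaluation with step size } \alpha
\end{align*}
```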

Linear Function Approximation
Each state s is represented by a feature vector $\phi_s$, with approximate values $\hat V(s) = \theta^\top \phi_s$
Or represent a state-action pair with a feature vector $\phi_{sa}$ and approximate action values $\hat Q(s,a) = \theta^\top \phi_{sa}$

Linear TD(λ) (e.g., Sutton, 1988)
After each episode, for each time t of the episode:
$\theta \leftarrow \theta + \alpha \big[ R^\lambda_t - \theta^\top \phi_{s_t} \big]\, \phi_{s_t}$
where the "λ-return" $R^\lambda_t$ is a weighted average of "n-step returns":
$R^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n \theta^\top \phi_{s_{t+n}}$
$R^\lambda_t = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^{(n)}_t + \lambda^{T-t-1} R_t$
(with $R_t$ the complete return to the end of the episode at time T)
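
A minimal sketch of this forward-view update in code: after an episode, every visited state's estimate is moved toward its λ-return. The episode representation (lists of feature vectors and rewards) is an assumption for illustration.

```python
import numpy as np

def lambda_returns(phis, rewards, theta, gamma=1.0, lam=0.9):
    """Lambda-returns for one episode; phis[t] is the feature vector of s_t, rewards[t] is r_{t+1}."""
    T = len(rewards)
    values = [theta @ phi for phi in phis] + [0.0]   # terminal state has value 0
    G, returns = 0.0, np.zeros(T)
    for t in reversed(range(T)):
        # Recursive form: R_t^lam = r_{t+1} + gamma * [(1-lam) * V(s_{t+1}) + lam * R_{t+1}^lam]
        G = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G)
        returns[t] = G
    return returns

def linear_td_lambda_episode(phis, rewards, theta, alpha=0.01, gamma=1.0, lam=0.9):
    targets = lambda_returns(phis, rewards, theta, gamma, lam)
    for phi, G in zip(phis, targets):
        theta = theta + alpha * (G - theta @ phi) * phi   # gradient step toward the lambda-return
    return theta
```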

RoboCup
An international AI and Robotics research initiative
Use soccer as a rich and realistic testbed
Robotic and simulation leagues
–Open-source simulator (Noda)
Research Challenges:
–Multiple teammates with a common goal
–Multiple adversaries – not known in advance
–Real-time decision making necessary
–Noisy sensors and actuators
–Enormous state space, more than 10^9 states

RoboCup Feature Vectors
Full soccer state => 13 continuous state variables => sparse, coarse tile coding => huge binary feature vector $\phi_s$ (about 400 1's and 40,000 0's) => linear map θ => action values

13 Continuous State Variables (for 3 vs 2)
–11 distances among the players, the ball, and the center of the field
–2 angles to takers along passing lanes

Sparse, Coarse, Tile-Coding (CMACs) 32 tilings per group of state variables
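
A rough illustration of the idea in code (not the exact CMAC configuration used in the keepaway experiments): a tile coder maps a small group of continuous variables to one active binary feature per tiling. The number of tiles per dimension and the single normalization range are illustrative assumptions.

```python
import numpy as np

def tile_indices(x, n_tilings=32, tiles_per_dim=8, lo=0.0, hi=1.0):
    """Return one active feature index per tiling for a small group of state variables x."""
    x = (np.asarray(x, dtype=float) - lo) / (hi - lo)      # normalize to roughly [0, 1]
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)           # each tiling is slightly shifted
        coords = np.clip(np.floor((x + offset) * tiles_per_dim).astype(int), 0, tiles_per_dim)
        idx = t                                            # flatten (tiling, coords) into one index
        for c in coords:
            idx = idx * (tiles_per_dim + 1) + int(c)
        active.append(idx)
    return active

# With a weight vector theta of size n_tilings * (tiles_per_dim + 1) ** len(x),
# the linear value estimate is just a sum over the active (1-valued) features:
#   v = sum(theta[i] for i in tile_indices(state_vars))
```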

Learning Keepaway Results (Stone & Sutton, 2001)
3v2 against handcrafted takers
Multiple, independent runs of TD(λ)

Key Distinctions
Control vs Prediction
Bootstrapping/Truncation vs Full Returns
Function approximation vs Table lookup
Sampling vs Enumeration
Off-policy vs On-policy
–The distribution d(s)

Off-Policy Instability (Baird, 1995; Gordon, 1995; Bertsekas & Tsitsiklis, 1996)
Examples of diverging $\theta_k$ are known for
–Linear FA
–Bootstrapping
Even for
–Prediction
–Enumeration
–Uniform d(s)
In particular, linear Q-learning can diverge

Baird's Counterexample
Markov chain (no actions)
All states updated equally often, synchronously
Exact solution exists: $\theta = 0$
Initial $\theta_0 = (1, 1, 1, 1, 1, 10, 1)^\top$

On-Policy Stability (Tsitsiklis & Van Roy, 1997; Tadic, 2000)
If d(s) is the stationary distribution of the MDP under policy π (the on-policy distribution)
Then convergence is guaranteed for
–Linear FA
–Bootstrapping
–Sampling
–Prediction
Furthermore, the asymptotic mean square error is a bounded expansion of the minimal MSE.
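
The bound itself was an image on the slide; the Tsitsiklis & Van Roy (1997) result, with $\Phi\theta$ the linear approximation and $\|\cdot\|_d$ the norm weighted by the on-policy distribution d, is roughly:

```latex
\[
  \big\| \Phi\theta_\infty - V^\pi \big\|_d \;\le\;
  \frac{1-\gamma\lambda}{1-\gamma}\,
  \min_\theta \big\| \Phi\theta - V^\pi \big\|_d
\]
```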

Value Function Space
A schematic of the space of value functions: value functions consistent with the parameterization vs. inadmissible ones, the true V*, the region of π* (the best admissible policy), and the best admissible value function.
–Original naïve hope: guaranteed convergence to a good policy
–Residual gradient et al.: guaranteed convergence to a less desirable policy
–Sarsa, TD(λ) & other on-policy methods: chattering, without divergence or guaranteed convergence
–Q-learning, DP & other off-policy methods: divergence possible

There are Two Different Problems:
Chattering
–Is due to Control + FA
–Bootstrapping not involved
–Not necessarily a problem
–Being addressed with policy-based methods
–Argmax-ing is to blame
Instability
–Is due to Bootstrapping + FA + Off-Policy
–Control not involved
–Off-policy is to blame

Yet we need Off-Policy Learning
Off-policy learning is needed in all the frameworks that have been proposed to raise reinforcement learning to a higher level
–Macro-actions, options, HAMs, MAXQ
–Temporal abstraction, hierarchy, modularity
–Subgoals, goal-and-action-oriented perception
The key idea: we can only follow one policy, but we would like to learn about many policies, in parallel
–To do this requires off-policy learning

On-Policy Policy Evaluation Problem
Use data (episodes) generated by π to learn $V^\pi$
Off-Policy Policy Evaluation Problem
Use data (episodes) generated by π' to learn $V^\pi$
–π : target policy (the policy we want to evaluate)
–π' : behavior policy (the policy that generates the data)

Naïve Importance-Sampled TD(λ)
Weight the whole episode's TD(λ) update by the product $\rho_1 \rho_2 \rho_3 \cdots \rho_{T-1}$, where $\rho_t$ is the importance-sampling correction ratio for time t
–The product is the relative probability of the episode under π and π'
We expect this to have relatively high variance
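
Concretely, with π the target policy and π' the behavior policy, the per-step ratio and the naïve whole-episode correction are roughly (a reconstruction of the lost equations):

```latex
\[
  \rho_t \;=\; \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)},
  \qquad
  \Delta\theta^{\text{naive}} \;=\; \Big( \prod_{t=1}^{T-1} \rho_t \Big)\, \Delta\theta^{TD(\lambda)}
\]
```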

Per-Decision Importance-Sampled TD(λ)
The correction applied at time t is only $\rho_1 \rho_2 \rho_3 \cdots \rho_t$, the ratios accumulated so far
The update is like the conventional TD(λ) update, except defined in terms of importance-corrected returns
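
The defining equations are lost here; as a rough illustration of the per-decision idea (each reward is weighted only by the correction ratios of the action choices that preceded it), the importance-corrected return can be written recursively:

```latex
\[
  \tilde{R}_t \;=\; \rho_t \big( r_{t+1} + \gamma\, \tilde{R}_{t+1} \big),
  \qquad \tilde{R}_T = 0,
  \qquad \rho_t = \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)}
\]
```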

Per-Decision Theorem (Precup, Sutton & Singh, 2000)
New Result for the Linear PD Algorithm (Precup, Sutton & Dasgupta, 2001):
the expected total change in θ over an episode under the new algorithm equals the expected total change under conventional TD(λ).
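
In symbols, the new result is roughly (the slide's exact statement is lost): the total per-episode update of the per-decision (PD) algorithm, in expectation under the behavior policy, matches that of conventional TD(λ) under the target policy,

```latex
\[
  E_{\pi'}\!\big\{ \Delta\theta^{PD} \big\}
  \;=\;
  E_{\pi}\!\big\{ \Delta\theta^{TD(\lambda)} \big\}
\]
```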

Convergence Theorem
Under natural assumptions:
–S and A are finite
–All s, a are visited under π'
–π and π' are proper (terminate w.p. 1)
–Bounded rewards
–Usual stochastic-approximation conditions on the step sizes $\alpha_k$
And one annoying assumption (a variance condition, satisfied e.g. by bounded episode length),
Then the off-policy linear PD algorithm converges to the same $\theta_\infty$ as on-policy TD(λ).

The variance assumption is restrictive, but can often be satisfied with "artificial" terminations
Consider a modified MDP with bounded episode length
–We have data for this MDP
–Our result assures good convergence for this
–This solution can be made close to the solution of the original problem
–By choosing the episode bound long relative to γ or the mixing time
Consider application to macro-actions
–Here it is the macro-action that terminates
–Termination is artificial; the real process is unaffected
–Yet all results directly apply to learning about macro-actions
–We can choose macro-action termination to satisfy the variance condition

Empirical Illustration
Agent always starts at S
Terminal states marked G
Deterministic actions
Behavior policy chooses up/down with prob. …
Target policy chooses up/down with prob. …
If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one

Trajectories of Two Components of θ
λ = 0.9, α decreased over time; θ appears to converge as advertised
[Figure: the components labeled leftmost,down and rightmost,down plotted against episodes (×100,000), approaching their asymptotic (starred) values]

Comparison of Naïve and Per-Decision IS Algorithms (Precup, Sutton & Dasgupta, 2001)
[Figure: root mean squared error of the Naïve IS and Per-Decision IS algorithms as a function of log2 α, with λ = 0.9 and α held constant, after 100,000 episodes, averaged over 50 runs]

Can Weighted IS help the variance?
Return to the tabular case and consider two estimators, built from $R_i$, the i-th return following s,a, and $w_i$, the IS correction product $\rho_{t+1} \rho_{t+2} \rho_{t+3} \cdots \rho_{T-1}$ (where s,a occurs at time t):
–The ordinary IS estimator converges with finite variance iff the $w_i$ have finite variance
–The weighted IS estimator converges with finite variance even if the $w_i$ have infinite variance
Can this be extended to the FA case?
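
The two estimators themselves were images on the slide; in their standard form they are (a reconstruction):

```latex
\[
  Q_{\text{ordinary IS}}(s,a) \;=\; \frac{1}{n}\sum_{i=1}^{n} w_i R_i,
  \qquad
  Q_{\text{weighted IS}}(s,a) \;=\; \frac{\sum_{i=1}^{n} w_i R_i}{\sum_{i=1}^{n} w_i}
\]
```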

Restarting within an Episode
We can consider episodes to start at any time
This alters the weighting of states
–But we still converge
–And to near the best answer (for the new weighting)

Incremental Implementation
At the start of each episode: …
On each step: …
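
The update rules themselves did not survive extraction. As a rough sketch of an incremental, eligibility-trace implementation of importance-sampling-corrected linear TD(λ), illustrating the per-episode/per-step pattern of this slide (the precise trace and correction rules of the linear PD algorithm in Precup, Sutton & Dasgupta (2001) may differ in detail):

```python
import numpy as np

def off_policy_linear_td_lambda(episodes, theta, alpha=0.01, gamma=1.0, lam=0.9):
    """`episodes` yields lists of (phi, rho, reward, phi_next, done) tuples, where
    rho = pi(a|s) / pi'(a|s) is the per-step importance-sampling ratio (assumed format)."""
    for episode in episodes:
        e = np.zeros_like(theta)                 # at the start of each episode: clear the traces
        for phi, rho, r, phi_next, done in episode:
            v = theta @ phi
            v_next = 0.0 if done else theta @ phi_next
            delta = r + gamma * v_next - v       # TD error for the observed transition
            e = rho * (gamma * lam * e + phi)    # decay and importance-correct the trace (one variant)
            theta = theta + alpha * delta * e    # on each step: update the weights
    return theta
```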

Key Distinctions
Control vs Prediction
Bootstrapping/Truncation vs Full Returns
Sampling vs Enumeration
Function approximation vs Table lookup
Off-policy vs On-policy
(In each pair, the first option is the harder, more challenging and interesting one; the second is easier and conceptually simpler.)

Conclusions
RL is beating the Curse of Dimensionality
–FA and Sampling
There is a broad frontier, with many open questions
MDPs (states, decisions, goals, and probability) are a rich area for mathematics and experimentation