Sutton & Barto, Chapter 4 Dynamic Programming

Policy Improvement Theorem
Let π and π’ be any pair of deterministic policies such that, for all s in S,
Q_π(s, π’(s)) ≥ V_π(s).
Then π’ must be as good as, or better than, π:
V_π’(s) ≥ V_π(s) for all s in S.
If the first inequality is strict for any state, then the second inequality must be strict for at least one state.
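For reference, a sketch of the standard argument behind the theorem (the usual textbook expansion of the assumed inequality, written out here; not copied from the slides):

```latex
\begin{aligned}
V_\pi(s) &\le Q_\pi(s, \pi'(s)) \\
  &= \mathbb{E}\bigl[ R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \bigr] \\
  &\le \mathbb{E}\bigl[ R_{t+1} + \gamma Q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s, A_t = \pi'(s) \bigr] \\
  &\le \cdots \le \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \bigr] = V_{\pi'}(s).
\end{aligned}
```

Acting greedily with respect to Q_π therefore never makes the policy worse, which is what licenses the improvement step in policy iteration.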

Policy Iteration
Value Iteration
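Both schemes fit in a few lines. Below is a minimal tabular sketch of policy iteration, assuming a toy model representation P[s][a] = [(prob, next_state), ...] with expected immediate rewards R[s][a]; the names and representation are illustrative, not the book's pseudocode.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular policy iteration over a known model.

    P[s][a] is a list of (prob, next_state) pairs and R[s][a] is the
    expected immediate reward for taking action a in state s.
    """
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)

    def q(s, a):
        # One-step lookahead: expected reward plus discounted next-state value.
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    while True:
        # Policy evaluation: sweep the Bellman expectation backup for pi.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V.
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q(s, a))
            if best_a != pi[s]:
                pi[s], stable = best_a, False
        if stable:
            return V, pi
```

Value iteration is the special case in which the inner evaluation loop is truncated to a single sweep applying the max backup V(s) ← max_a q(s, a) directly; a concrete instance appears with the gambler's problem below.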

Gambler’s problem
Series of coin flips. Heads: win as many dollars as staked. Tails: lose the stake.
On each flip, decide how many whole dollars of capital to stake.
The episode ends on reaching $0 or $100.
Example: probability of heads = 0.4

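A minimal value-iteration sketch for this example, using the common convention of giving the $100 state a fixed value of 1 in place of an explicit terminal reward (illustrative only, not the figure-generating code from the book):

```python
import numpy as np

def gamblers_value_iteration(p_heads=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem (undiscounted, episodic)."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0  # reaching the goal is worth 1; $0 is worth 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Legal stakes: at least 1, at most enough to reach 0 or the goal.
            stakes = range(1, min(s, goal - s) + 1)
            best = max(p_heads * V[s + a] + (1 - p_heads) * V[s - a]
                       for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy; rounding breaks near-ties toward the smallest stake,
    # which reproduces the characteristic spiky policy plot.
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = list(range(1, min(s, goal - s) + 1))
        returns = [p_heads * V[s + a] + (1 - p_heads) * V[s - a] for a in stakes]
        policy[s] = stakes[int(np.argmax(np.round(returns, 6)))]
    return V, policy
```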

Generalized Policy Iteration

V stabilizes when consistent with π
π stabilizes when greedy with respect to V

Asynchronous DP
Can back up states in any order, but must continue to back up all values, eventually.
Where should we focus our attention? On changes to V, changes to Q, and changes to π.
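One way to act on that idea is a prioritized-sweeping-style sweep: back up whichever state currently has the largest Bellman error, then re-check its predecessors. The sketch below assumes the same toy model representation as the policy-iteration sketch above and is a rough illustration rather than the canonical algorithm.

```python
import heapq

def prioritized_value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Asynchronous DP: always back up the state with the largest Bellman error."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states

    # Predecessor map: which states can transition into s2?
    preds = {s2: set() for s2 in range(n_states)}
    for s in range(n_states):
        for a in range(n_actions):
            for p, s2 in P[s][a]:
                if p > 0:
                    preds[s2].add(s)

    def bellman(s):
        best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                   for a in range(n_actions))
        return abs(best - V[s]), best

    heap = [(-bellman(s)[0], s) for s in range(n_states)]
    heapq.heapify(heap)

    while heap:
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            break                      # every recorded error is below the threshold
        err, best = bellman(s)         # recompute: queued priorities can be stale
        if err < theta:
            continue
        V[s] = best
        for sp in preds[s]:            # this backup may change the error of predecessors
            e, _ = bellman(sp)
            if e >= theta:
                heapq.heappush(heap, (-e, sp))
    return V
```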

Aside: variable discretization
– When to split: track Bellman error and look for plateaus
– Where to split (which direction?):
  Value criterion: maximize the difference in values of subtiles, i.e. minimize error in V
  Policy criterion: would the split change the policy?

Optimal value function satisfies the fixed-point equation for all states x:
V*(x) = max_a [ r(x, a) + γ Σ_y P(y | x, a) V*(y) ]
Define the Bellman optimality operator T*:
(T*V)(x) = max_a [ r(x, a) + γ Σ_y P(y | x, a) V(y) ]
Now, V* can be defined compactly: V* = T*V*.

T* is a maximum-norm contraction, so the fixed-point equation T*V = V has a unique solution.
Or, let T* map real-valued functions on X×A to the same, and define the action-value form:
(T*Q)(x, a) = r(x, a) + γ Σ_y P(y | x, a) max_b Q(y, b)
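A one-line sketch of why T* contracts in the max norm, using |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)| (standard argument, not from the slides):

```latex
\lvert (T^*V_1)(x) - (T^*V_2)(x) \rvert
  \le \max_a \gamma \sum_y P(y \mid x, a)\,\lvert V_1(y) - V_2(y) \rvert
  \le \gamma \lVert V_1 - V_2 \rVert_\infty ,
```

so ||T*V_1 − T*V_2||_∞ ≤ γ ||V_1 − V_2||_∞, and the Banach fixed-point theorem gives existence and uniqueness of V*.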

Value iteration: V_{k+1} = T*V_k, k ≥ 0; likewise Q_{k+1} = T*Q_k, k ≥ 0.
This “Q-iteration” converges asymptotically: ||Q_k − Q*||_∞ → 0 as k → ∞ (uniform norm / supremum norm / Chebyshev norm).

Also, for Q-iteration, we can bound the suboptimality after a fixed number of iterations in terms of the discount factor and the maximum possible reward.
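One standard form of such a bound, assuming rewards bounded in magnitude by R_max, discount γ < 1, and the initialization Q_0 ≡ 0 (symbols introduced here for illustration):

```latex
\lVert Q_K - Q^* \rVert_\infty
  \;\le\; \gamma^K \lVert Q_0 - Q^* \rVert_\infty
  \;\le\; \frac{\gamma^K R_{\max}}{1 - \gamma},
```

so reaching accuracy ε requires on the order of K ≈ log( R_max / (ε(1−γ)) ) / log(1/γ) iterations.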

Monte Carlo
Administrative area in the Principality of Monaco. Founded in 1866, named “Mount Charles” after the then-reigning prince, Charles III of Monaco.
The first casino opened in 1862; it was not a success.

Monte Carlo Methods
Computational algorithms based on repeated random sampling.
Useful when it is difficult or impossible to obtain a closed-form expression, or when deterministic algorithms won’t cut it.
Modern version: developed in the 1940s for the nuclear weapons program at Los Alamos (radiation shielding and neutron penetration). “Monte Carlo” was the code name (a mathematician’s uncle gambled there). Soon after, John von Neumann programmed ENIAC to carry out these calculations.
Application areas: physical sciences, engineering, applied statistics, computer graphics, AI, finance/business, etc.
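A toy illustration of the core idea, estimating π from uniform random points in the unit square (illustrative only, not from the slides):

```python
import random

def estimate_pi(n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform points in the unit
    square that fall inside the quarter circle converges to pi / 4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n_samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n_samples

print(estimate_pi())  # ~3.14; the error shrinks like 1 / sqrt(n_samples)
```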

Rollouts
Exploration vs. exploitation
On- vs. off-policy
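A minimal sketch of a Monte Carlo rollout estimate of an action value under a fixed policy. The environment interface env.sample(state, action) -> (next_state, reward, done) and the policy callable are hypothetical names assumed for this sketch:

```python
def rollout_q(env, policy, state, action, gamma=1.0, n_rollouts=100, max_steps=200):
    """Estimate Q(state, action): take `action` once, then follow `policy`
    until the episode ends (or max_steps), and average the sampled returns."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(max_steps):
            s, r, done = env.sample(s, a)   # hypothetical generative model
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)
        total += ret
    return total / n_rollouts
```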