Sutton & Barto, Chapter 4 Dynamic Programming

Programming Assignments? Course Discussions?

Review:
– V, V*
– Q, Q*
– π, π*
Bellman Equation vs. Update
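For reference, the distinction behind "Bellman Equation vs. Update", written with an expected-reward shorthand r(s,a,s') (the book also sums over reward outcomes): the equation is a consistency condition that v_π satisfies, while the update uses the same right-hand side as an assignment applied to an estimate.

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, v_\pi(s') \right] \quad \text{(equation: a fixed-point condition)}

V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, V_k(s') \right] \quad \text{(update: applied to an estimate)}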

Solutions Given a Model
Finite MDPs
Exploration / Exploitation?
Where would Dynamic Programming be used?

“DP may not be practical for very large problems...” & “For the largest problems, only DP methods are feasible”
– Curse of Dimensionality
– Bootstrapping
– Memoization (Dmitry)

Value Function
Existence and uniqueness guaranteed when
– γ < 1, or
– eventual termination is guaranteed from all states under π
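A one-line justification for the γ < 1 case: with rewards bounded by R_max, the return is a convergent geometric sum, and the Bellman operator for π is a γ-contraction, so its fixed point v_π exists and is unique.

v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right], \qquad \left| \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \right| \le \frac{R_{\max}}{1 - \gamma}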

Policy Evaluation: Value Iteration
Iterative Policy Evaluation
– Full backup: go through each state and consider each possible subsequent state
– Two copies of V or “in place” backup?

Policy Evaluation: Algorithm
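Since the boxed pseudocode did not survive the transcript, here is a minimal Python sketch of iterative policy evaluation. The tabular interface (S, A, P[s][a] as a list of (prob, next_state, reward) triples, pi[s][a] as an action probability) is an assumption for illustration, not from the slides; the in_place flag covers the "two copies of V" question from the previous slide.

def policy_evaluation(S, A, P, pi, gamma=0.9, theta=1e-8, in_place=True):
    # Assumed interface: P[s][a] -> list of (prob, next_state, reward); pi[s][a] -> probability.
    # Terminal states are assumed absorbing with zero reward, so their value stays 0.
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        source = V if in_place else dict(V)      # in-place backup vs. a frozen copy of V
        for s in S:                              # full backup: sweep every state...
            v = sum(pi[s][a] * sum(p * (r + gamma * source[s2])   # ...and every successor
                                   for p, s2, r in P[s][a])
                    for a in A)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:                        # stop when the largest change in a sweep is tiny
            return V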

Policy Improvement
Update policy so that action for each state maximizes V(s’)
For each state, π’(s) = ?
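One way to fill in the blank, using the chapter's one-step greedy lookahead (the policy improvement theorem then guarantees the new policy is at least as good as the old one):

\pi'(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, V(s') \right]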

Policy Improvement: Examples (Actions Deterministic, Goal state(s) shaded)

Policy Improvement: Examples
4.1: If the policy is equiprobable random actions, what is the action-value Q(11, down)? What about Q(7, down)?
4.2a: A state (15) is added just below state 13. Its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Transitions from the original states are unchanged. What is V(15) for the equiprobable random policy?
4.2b: Now assume the dynamics of state 13 are also changed so that down takes the agent to 15. Now what is V(15)?
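The identity these exercises lean on, with Example 4.1's gridworld conventions (reward of -1 on every transition, undiscounted, deterministic moves, terminal value 0):

q_\pi(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, v_\pi(s') \right] = -1 + v_\pi(s') \quad \text{here, where } s' \text{ is the state that action } a \text{ leads to.}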

Policy Iteration
Convergence in limit vs. EM? (Dmitry & Chris)
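A minimal policy iteration sketch over an assumed tabular interface (S a list of states, A a list of actions, P[s][a] a list of (prob, next_state, reward) triples; none of these names come from the slides), alternating full evaluation with greedy improvement until the policy stops changing. One relevant contrast with EM: for a finite MDP there are only finitely many deterministic policies and each improvement step is a strict improvement until optimality, so this loop terminates at an optimal policy after finitely many improvements rather than only converging in the limit.

def policy_iteration(S, A, P, gamma=0.9, theta=1e-8):
    # Assumed interface: P[s][a] -> list of (prob, next_state, reward) triples.
    V = {s: 0.0 for s in S}
    pi = {s: A[0] for s in S}                    # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation of the current deterministic pi (in-place backups).
        while True:
            delta = 0.0
            for s in S:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2. Policy improvement: greedy one-step lookahead on the evaluated V.
        stable = True
        for s in S:
            best = max(A, key=lambda a: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[s][a]))
            if best != pi[s]:
                stable = False
                pi[s] = best
        if stable:
            return pi, V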

Value Iteration: Update
Turn the Bellman optimality equation into an update rule
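The resulting update, i.e. the Bellman optimality equation with its right-hand side used as an assignment to the estimate:

V_{k+1}(s) \leftarrow \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, V_k(s') \right]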

Value Iteration: Algorithm
Why focus on deterministic policies?
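A minimal value iteration sketch over the same assumed tabular interface as above (not the book's boxed pseudocode verbatim). The greedy readout at the end is one answer to the question on this slide: some deterministic policy that is greedy with respect to the optimal values is itself optimal, so nothing is lost by restricting attention to deterministic policies.

def value_iteration(S, A, P, gamma=0.9, theta=1e-8):
    # Assumed interface: P[s][a] -> list of (prob, next_state, reward) triples.
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in A)
            delta = max(delta, abs(v - V[s]))
            V[s] = v                             # in-place backup with the max over actions
        if delta < theta:
            break
    # Read out a deterministic policy that is greedy with respect to the final V.
    pi = {s: max(A, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in S}
    return pi, V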

Gambler’s Problem
Series of coin flips. Heads: win as many dollars as staked. Tails: lose the stake.
On each flip, decide what portion of capital to stake, in integer dollars.
Ends at $0 or $100.
Example: probability of heads = 0.4
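A minimal value iteration sketch for this example, assuming the standard setup (undiscounted; the +1 reward for reaching $100 is folded into the terminal value V[100], so V(s) ends up being the probability of reaching the goal from capital s). With p_h = 0.4 this should reproduce the value curve and the spiky optimal stakes shown in the book for this example.

import numpy as np

p_h, goal = 0.4, 100              # probability of heads, target capital
V = np.zeros(goal + 1)
V[goal] = 1.0                     # the +1 reward for reaching $100, folded into the terminal value

def action_values(s, V):
    # stakes are integer dollars from 1 up to min(s, goal - s)
    return [p_h * V[s + a] + (1 - p_h) * V[s - a]
            for a in range(1, min(s, goal - s) + 1)]

theta = 1e-10
while True:
    delta = 0.0
    for s in range(1, goal):      # non-terminal capitals $1..$99
        best = max(action_values(s, V))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

policy = {s: 1 + int(np.argmax(action_values(s, V))) for s in range(1, goal)}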

Asynchronous DP
Can back up states in any order
But must continue to back up all values, eventually
Where should we focus our attention?
– Changes to V
– Changes to Q
– Changes to π
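A hedged sketch of one answer to "where should we focus our attention?": prioritize states whose successors' values just changed the most, in the spirit of prioritized sweeping. The interface (P, predecessors) is assumed for illustration, and a real scheme would still need to guarantee that every state keeps getting backed up eventually.

import heapq
from itertools import count

def async_dp(S, A, P, predecessors, gamma=0.9, max_backups=100000, eps=1e-6):
    # Assumed interface: P[s][a] -> list of (prob, next_state, reward);
    # predecessors[s] -> states with some action that can lead to s.
    V = {s: 0.0 for s in S}
    tie = count()                                # tie-breaker so states never get compared directly
    pq = [(0.0, next(tie), s) for s in S]        # seed: back up every state at least once
    heapq.heapify(pq)
    for _ in range(max_backups):
        if not pq:
            break
        _, _, s = heapq.heappop(pq)
        backup = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in A)
        change = abs(backup - V[s])
        V[s] = backup
        if change > eps:                         # this value moved, so its predecessors may be stale
            for sp in predecessors[s]:
                heapq.heappush(pq, (-change, next(tie), sp))
    return V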

Generalized Policy Iteration

V stabilizes when consistent with π
π stabilizes when greedy with respect to V
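Both conditions holding at once is exactly the Bellman optimality equation, which is why the point where GPI stabilizes is optimal:

v(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, v(s') \right] \quad \text{and} \quad \pi \ \text{greedy w.r.t.}\ v \;\Longrightarrow\; v(s) = \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s,a,s') + \gamma\, v(s') \right]

so v = v* and π is an optimal policy.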