Reinforcement Learning: Dynamic Programming I
Subramanian Ramamoorthy, School of Informatics
31 January 2012

Continuing from last time… MDPs and Formulation of the RL Problem

Returns
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. The return is the sum of rewards, R_t = r_{t+1} + r_{t+2} + … + r_T, where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks
Continuing tasks: interaction does not break naturally into episodes. Instead we use the discounted return: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}, where 0 ≤ γ ≤ 1 is the discount rate.
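As a quick illustration (my own sketch, not part of the slides), here is how the discounted return of a finite reward sequence can be computed; the function name, rewards and γ value are made up for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    G = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: three steps of reward 1 followed by a terminal reward of 10.
print(discounted_return([1, 1, 1, 10], gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729*10 = 10.0
```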

Example – Setting Up Rewards (Pole Balancing)
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track. As an episodic task where the episode ends upon failure: reward = +1 for every step before failure, so the return is the number of steps before failure. As a continuing task with discounted return: reward = −1 upon failure and 0 otherwise, so the return is −γ^K for K steps before failure. In either case, return is maximized by avoiding failure for as long as possible.

Another Example
Get to the top of the hill as quickly as possible: reward = −1 for each step the car is not at the top of the hill. Return is maximized by minimizing the number of steps taken to reach the top of the hill.

A Unified Notation
In episodic tasks, we number the time steps of each episode starting from zero. We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j. Think of each episode as ending in an absorbing state that always produces a reward of zero. We can then cover both episodic and continuing cases by writing R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}, where γ = 1 is allowed only if a zero-reward absorbing state is always reached.

Markov Decision Processes
If a reinforcement learning task has the Markov property, it is essentially a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
– state and action sets
– one-step “dynamics” defined by transition probabilities: P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
– expected rewards: R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
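To make this concrete, here is a minimal sketch (not from the lecture) of one way a finite MDP could be represented in Python: a dictionary mapping (state, action) pairs to lists of (probability, next_state, expected_reward) triples. All names and the helper function are illustrative assumptions.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# P[(s, a)] is a list of (probability, next_state, expected_reward) triples,
# jointly encoding P^a_{ss'} and R^a_{ss'} of a finite MDP.
Transitions = Dict[Tuple[State, Action], List[Tuple[float, State, float]]]

def check_mdp(P: Transitions, tol: float = 1e-9) -> None:
    """Sanity-check that the transition probabilities sum to 1 for every (s, a)."""
    for (s, a), outcomes in P.items():
        total = sum(p for p, _, _ in outcomes)
        assert abs(total - 1.0) < tol, f"Probabilities for ({s}, {a}) sum to {total}"
```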

The RL Problem
Main elements: states s; actions a; state transition dynamics (often stochastic and unknown); reward (r) process (possibly stochastic). Objective: a policy π_t(s, a), a probability distribution over actions given the current state. Assumption: the environment defines a finite-state MDP.

Recycling Robot: An Example Finite MDP
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.

Recycling Robot MDP (figure: state-transition graph over the states high and low)

Enumerated in Tabular Form (table of transition probabilities and expected rewards for each state–action pair)
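The table itself is an image in the original slides. As a hedged reconstruction, the sketch below encodes the standard Sutton & Barto parameterisation of the recycling robot (search-success probabilities α and β, rewards r_search and r_wait, and a −3 penalty for being rescued); the exact numbers used in the lecture are not recoverable from the transcript, so the parameter values here are placeholders. It reuses the (prob, next_state, reward) format and check_mdp helper sketched above.

```python
# Recycling robot as a finite MDP. alpha, beta, r_search, r_wait are placeholders.
alpha, beta = 0.8, 0.6        # assumed probabilities of staying at the same energy level while searching
r_search, r_wait = 2.0, 1.0   # assumed expected rewards for searching / waiting

recycling_robot = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # ran flat, rescued
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "wait"):     [(1.0, "low", r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

check_mdp(recycling_robot)  # probabilities for each (state, action) sum to 1
```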

Given such an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions? We know we want to maximize the expected return, E{ R_t } = E{ Σ_{k=0}^∞ γ^k r_{t+k+1} }. So, what must one do?

The Shortest Path Problem

Finite-State Systems and Shortest Paths
– the state space s_k is a finite set for each stage k
– an action a_k takes you from s_k to f_k(s_k, a_k) at a cost g_k(s_k, a_k)
– viewed as a graph: length ≈ cost ≈ sum of the lengths of the arcs along a path
Solving the problem stage by stage, working backwards from the end: J_k(i) = min_j [ a^k_{ij} + J_{k+1}(j) ], where a^k_{ij} is the cost of the arc from node i at stage k to node j at stage k+1.
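A minimal backward-induction sketch of this recursion (my own illustration, not code from the lecture): arc costs are given per stage, and J is computed from the final stage backwards. The node names and costs are made up.

```python
import math

def shortest_path_dp(stages, arc_cost, terminal_cost):
    """Backward induction: J_k(i) = min_j [ cost_k(i, j) + J_{k+1}(j) ].

    stages: list of lists of node names, one list per stage (last = terminal stage).
    arc_cost: dict mapping (k, i, j) -> cost of the arc from node i at stage k
              to node j at stage k+1 (missing entries mean 'no arc').
    terminal_cost: dict mapping each terminal node to its final cost.
    """
    J = {node: terminal_cost[node] for node in stages[-1]}
    policy = {}
    for k in range(len(stages) - 2, -1, -1):
        J_next, J = J, {}
        for i in stages[k]:
            best_cost, best_j = math.inf, None
            for j in stages[k + 1]:
                c = arc_cost.get((k, i, j))
                if c is not None and c + J_next[j] < best_cost:
                    best_cost, best_j = c + J_next[j], j
            J[i], policy[(k, i)] = best_cost, best_j
    return J, policy

# Tiny made-up example: start node 's', intermediate nodes 'a'/'b', terminal node 't'.
stages = [["s"], ["a", "b"], ["t"]]
arc_cost = {(0, "s", "a"): 1, (0, "s", "b"): 4, (1, "a", "t"): 5, (1, "b", "t"): 1}
J, policy = shortest_path_dp(stages, arc_cost, {"t": 0})
print(J["s"], policy[(0, "s")])  # 5 'b'  (s->b->t costs 4+1, s->a->t costs 1+5)
```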

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent’s policy: V^π(s) = E_π{ R_t | s_t = s }. The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π: Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a }.

Recursive Equation for Value
The basic idea: the return decomposes recursively, R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = r_{t+1} + γ R_{t+1}. So: V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }.

Optimality in MDPs – Bellman Equation
For a fixed policy π, the value function satisfies the Bellman equation V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ].

More on the Bellman Equation
This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution. Backup diagrams for V^π and Q^π illustrate the one-step lookahead that the equation encodes.
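Because the Bellman equation for V^π is linear, it can be solved directly, but it is often solved by turning the equation into an iterative update (iterative policy evaluation). The sketch below is my own illustration using the (prob, next_state, reward) MDP format and the recycling-robot dictionary assumed in the earlier sketches.

```python
def policy_evaluation(P, states, actions, policy, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- sum_a pi(s,a) sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')].

    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                if (s, a) not in P:
                    continue  # action not available in this state
                for prob, s_next, reward in P[(s, a)]:
                    v_new += policy[s].get(a, 0.0) * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: evaluate the "always wait" policy on the recycling robot sketched above.
states, actions = ["high", "low"], ["search", "wait", "recharge"]
always_wait = {s: {"wait": 1.0} for s in states}
print(policy_evaluation(recycling_robot, states, actions, always_wait))  # both values -> r_wait/(1-gamma)
```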

Bellman Equation for Q
For a fixed policy π, the action-value function satisfies Q^π(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ Σ_{a'} π(s', a') Q^π(s', a') ].

Gridworld
Actions: north, south, east, west; deterministic. If an action would take the agent off the grid: no move, but reward = −1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown (in the Sutton & Barto example, leaving A yields +10 and teleports to A', leaving B yields +5 and teleports to B'). The slide shows the state-value function for the equiprobable random policy, with γ = 0.9.

Golf
State is the ball location. Reward of −1 for each stroke until the ball is in the hole. Value of a state? Actions:
– putt (use putter)
– driver (use driver)
The putt succeeds from anywhere on the green.

Optimal Value Functions
For finite MDPs, policies can be partially ordered: π ≥ π' if and only if V^π(s) ≥ V^{π'}(s) for all s. There are always one or more policies that are better than or equal to all the others; these are the optimal policies, and we denote them all π*. Optimal policies share the same optimal state-value function, V*(s) = max_π V^π(s) for all s. Optimal policies also share the same optimal action-value function, Q*(s, a) = max_π Q^π(s, a) for all s and a; this is the expected return for taking action a in state s and thereafter following an optimal policy.

Optimal Value Function for Golf
We can hit the ball farther with the driver than with the putter, but with less accuracy. Q*(s, driver) gives the value of using the driver first, then using whichever actions are best thereafter.

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state: V*(s) = max_a Q^{π*}(s, a) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]. The relevant backup diagram has a max over actions at the root. V* is the unique solution of this system of nonlinear equations.
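Turning the Bellman optimality equation into an update rule gives value iteration, previewed here as my own sketch (dynamic programming methods are the topic of this lecture series, but this particular code is not from the slides). It reuses the (prob, next_state, reward) MDP format and the recycling-robot example assumed earlier.

```python
def value_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- max_a sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')] to convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = []
            for a in actions:
                if (s, a) not in P:
                    continue  # action not available in this state
                q = sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[(s, a)])
                q_values.append(q)
            v_new = max(q_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: optimal state values of the recycling robot sketched earlier.
print(value_iteration(recycling_robot, ["high", "low"], ["search", "wait", "recharge"]))
```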

Recursive Form of V*

Bellman Optimality Equation for Q*
Q*(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]. The relevant backup diagram has a max over actions at the successor states. Q* is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy. Therefore, given V*, a one-step-ahead search produces the long-term optimal actions. E.g., back to the gridworld (from the Sutton & Barto book): the slide shows V* together with a corresponding optimal policy π*.
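A sketch of this one-step lookahead (my own illustration, reusing the earlier MDP format): given V*, the greedy action in each state maximises the expected one-step reward plus the discounted V* of the successor state.

```python
def greedy_policy(P, states, actions, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to a value function V."""
    pi = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in P:
                continue
            q = sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        pi[s] = best_a
    return pi

# Example: extract an optimal policy for the recycling robot from V*.
V_star = value_iteration(recycling_robot, ["high", "low"], ["search", "wait", "recharge"])
print(greedy_policy(recycling_robot, ["high", "low"], ["search", "wait", "recharge"], V_star))
```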

What About Optimal Action-Value Functions?
Given Q*, the agent does not even have to do a one-step-ahead search: it simply acts according to π*(s) = argmax_a Q*(s, a).

Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman optimality equation requires the following:
– accurate knowledge of the environment dynamics;
– enough space and time to do the computation;
– the Markov property.
How much space and time do we need?
– polynomial in the number of states (via dynamic programming),
– BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.