Planning in MDPs S&B: Sec 3.6; Ch. 4

Administrivia
- Reminder: Final project proposal due this Friday
- If you haven't talked to me yet, you still have the chance!
- HW2, R2 returned today

MDPs defined
Full definition: A Markov decision process (MDP), M, is a model of a stochastic, dynamic, controllable, rewarding process given by M = 〈S, A, T, R〉:
- S: state space
- A: action space
- T: transition function
- R: reward function
For most of RL, we'll assume the agent is living in an MDP.
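To make the tuple concrete, here is a minimal Python sketch of a toy two-state MDP; the state names, action names, transition probabilities, and rewards are invented for illustration and are not from the lecture.

    import random

    # Toy MDP -- every name and number here is made up for illustration.
    S = ["s0", "s1"]            # state space
    A = ["stay", "go"]          # action space

    # T[s][a] maps a next state s' to Pr(s' | s, a); each inner dict sums to 1.
    T = {
        "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
        "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.7, "s1": 0.3}},
    }

    R = {"s0": 0.0, "s1": 1.0}  # state-based reward R(s)

    def step(s, a):
        """Sample one transition: return (next state, reward at the next state)."""
        next_states, probs = zip(*T[s][a].items())
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        return s_next, R[s_next]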

Exercise
Given the following tasks, describe the corresponding MDP -- what are the state space, action space, transition function, and reward function? How many states/actions are there? How many policies are possible?
- Flying an airplane and trying to get from point A to B
- Flying an airplane and trying to emulate recorded human behaviors
- Delivering a set of packages to buildings on UNM campus
- Winning at the stock market

The true meaning of Value
- First we said: value is the sum of reward experienced along a history, under a fixed policy (possibly with discounting for infinite processes)
- Now we know: a fixed policy + a stochastic environment => a distribution over histories
- So what's value?

The true meaning of Value
Definition: the expected value, V, is the expected aggregate reward, averaged across all possible histories weighted by their probabilities:

    V^π(s) = Σ_h Pr(h | q_1 = s, π) V(h)

- Note: this may be an infinite sum
- V(h) itself may also involve an infinite sum (bummer...)
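For reference, the per-history value being averaged here is, in the discounted formulation from the reading (discount factor γ):

    V(h) = R(s_1) + γ R(s_2) + γ^2 R(s_3) + ... = Σ_{t≥1} γ^{t-1} R(s_t)

which is why V(h) can itself be an infinite sum when histories are infinite.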

Histories & dynamics
(Figure: possible histories under a fixed policy π, starting from q_1 = s_1. Two examples:
Pr({s_2, s_5, s_8} | q_1 = s_1, π) = 0.0138, with V({s_2, s_5, s_8}) = +1.9;
Pr({s_4, s_11, s_9} | q_1 = s_1, π) = ..., with V({s_4, s_11, s_9}) = -2.3.)

Maximizing reward
Which policy should the agent prefer?
Definition: The principle of maximum expected utility says that an agent should act so as to obtain the maximum reward over time, on average.
- Because the dynamics of the environment are stochastic, no specific reward can be guaranteed
- Note: other optimality criteria are possible (e.g., risk aversion) -- this is the easiest to work with
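Stated symbolically (a standard formulation, not verbatim from the slide): the agent should follow the policy

    π* = argmax_π V^π(s)

for the states s it may start in, where V^π is the expected value defined above.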

The RL problem
Now the RL question is concrete: given an agent acting in an MDP, M, how can the agent learn π* (or a good approximation of it) as quickly as possible?
(More generally: while receiving as much reward along the way as possible)

Planning & Learning
Need to distinguish when the agent knows the full MDP, M, vs. has to learn M from experience:
Definition: In the reinforcement learning problem, the agent is acting in an initially unknown MDP
- Needs experience tuples from the environment
- Equivalent to: learn class models (PDFs), then figure out the best decision surface
Definition: In the planning problem, the agent is acting in a fully known MDP
- No experience needed! Can figure out π* purely by thinking about it
- Equivalent to: given known class models, figure out the best decision surface

Planning
Planning is a useful subroutine for learning, so we'll do it first.
Goal: given a known MDP, M = 〈S, A, T, R〉, find π*
Two sub-problems:
- Evaluate a single policy
- Search through the space of all possible policies

How good is that plan?
Policy evaluation: given M = 〈S, A, T, R〉 and π, find V^π(s_i) for all i
Recall: V is the average over all histories:

    V^π(s_i) = Σ_h Pr(h | q_1 = s_i, π) V(h)

How to calculate this? From the definition of V(h):

Policy evaluation
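A reconstructed sketch of the derivation, assuming the discounted, state-based-reward setting above and the Markov property of the transition function:

    V(h) = R(s_1) + γ R(s_2) + γ^2 R(s_3) + ...
         = R(s_1) + γ V(h')                          where h' is the tail of h starting at s_2

    V^π(s) = Σ_h Pr(h | q_1 = s, π) V(h)
           = R(s) + γ Σ_{s'} T(s, π(s), s') Σ_{h'} Pr(h' | q_1 = s', π) V(h')
           = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')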

The Bellman equation
The final recursive equation is known as the Bellman equation:

    V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')

The unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M = 〈S, A, T, R〉.
When the state/action spaces are discrete, we can think of V and R as vectors and T^π as a matrix, and get the matrix equation:

    V^π = R + γ T^π V^π
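For small discrete MDPs, the matrix form can be solved directly as a linear system. A minimal numpy sketch, assuming T_pi is the |S|×|S| matrix with T_pi[i, j] = Pr(s_j | s_i, π(s_i)), R is the reward vector, and γ < 1 (the example numbers are made up):

    import numpy as np

    def evaluate_policy(T_pi, R, gamma):
        """Solve the linear system V = R + gamma * T_pi @ V for V."""
        n = len(R)
        return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

    # Made-up two-state example under some fixed policy pi.
    T_pi = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
    R = np.array([0.0, 1.0])
    print(evaluate_policy(T_pi, R, gamma=0.9))   # V^pi for each state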

Exercise
- Solve the matrix Bellman equation (i.e., find V)
- I formulated the Bellman equations for "state-based" rewards, R(s). Formulate & solve the B.E. for "state-action" rewards (R(s,a)) and "state-action-state" rewards (R(s,a,s'))