Markov Decision Process (MDP)

Presentation transcript:

Markov Decision Process (MDP)
Ruti Glick, Bar-Ilan University

Example
The agent is situated in a 4x3 grid environment.
At each step it has to choose an action. Possible actions: Up, Down, Left, Right.
The process terminates when a goal state is reached (there might be more than one goal state), and each goal state has a weight (+1 or -1).
The environment is fully observable: the agent always knows where it is.
[Figure: the 4x3 grid with the START cell and the +1 and -1 terminal cells.]

Example (cont.)
In a deterministic environment the solution is easy, e.g.:
[Up, Up, Right, Right, Right]
[Right, Right, Up, Up, Right]
[Figure: the two action sequences traced on the grid.]

Example (cont.)
But... actions are unreliable:
The intended effect is achieved only with probability 0.8.
With probability 0.1 the agent slips to the right of the intended direction, and with probability 0.1 to the left.
If the agent bumps into a wall, it stays in place.
[Figure: the 0.8 / 0.1 / 0.1 slip model.]

Example (cont.)
Executing the sequence [Up, Up, Right, Right, Right]:
Chance of following the desired path: 0.8 * 0.8 * 0.8 * 0.8 * 0.8 = 0.8^5 = 0.32768.
Chance of accidentally reaching the goal along the other path: 0.1 * 0.1 * 0.1 * 0.1 * 0.8 = 0.1^4 * 0.8 = 0.00008.
Total probability of reaching the desired goal: only 0.32776.
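A quick arithmetic check of these numbers (a minimal Python sketch; the action sequence and the 0.8/0.1/0.1 slip probabilities are taken from the slides):

```python
# Probability that the fixed sequence [Up, Up, Right, Right, Right]
# reaches the +1 goal under the 0.8 / 0.1 / 0.1 slip model.

p_intended = 0.8 ** 5             # every action has its intended effect, ~0.32768
p_sideways = (0.1 ** 4) * 0.8     # four slips plus one intended move, ~0.00008
print(p_intended)
print(p_sideways)
print(p_intended + p_sideways)    # ~0.32776: total chance of reaching the goal
```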

Transition Model
Specification of the outcome probabilities of each action in each possible state:
T(s, a, s') = probability of reaching s' if action a is done in state s.
It can be described as a 3-dimensional table.
Markov assumption: the next state's conditional probability depends only on the immediately preceding state (and the action taken there), not on the earlier history.
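To make the transition model concrete, here is a sketch of T(s, a, s') for this grid world. The grid size and slip model follow the slides; the blocked cell at (2,2) is the usual obstacle in this example (assumed, since the figure is not reproduced here), and the helper names are mine.

```python
# Sketch of the 4x3 grid-world transition model T(s, a, s').
# States are (x, y) cells with x in 1..4, y in 1..3; (2,2) is assumed blocked.

WALL = {(2, 2)}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
LEFT_OF = {"Up": "Left", "Left": "Down", "Down": "Right", "Right": "Up"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def step(s, direction):
    """Result of physically moving one cell; bumping a wall leaves s unchanged."""
    x, y = s
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return s
    return nxt

def T(s, a, s_prime):
    """P(s' | s, a): intended move with prob 0.8, sideways slips with prob 0.1 each."""
    p = 0.0
    for direction, prob in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
        if step(s, direction) == s_prime:
            p += prob
    return p
```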

Reward
A positive or negative reward that the agent receives in state s.
Sometimes the reward is associated only with the state: R(s).
Sometimes it is associated with a state and an action: R(s, a).
Sometimes it is associated with a state, an action and the destination state: R(s, a, s').
In our example:
R([4,3]) = +1
R([4,2]) = -1
R(s) = -0.04 for every s ≠ [4,3], [4,2]
The small negative step reward can be seen as expressing how desirable it is for the agent to keep playing.
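A matching sketch of the state-only reward function from the slides:

```python
def R(s):
    """Reward for being in state s: +1 and -1 at the terminal cells, -0.04 elsewhere."""
    if s == (4, 3):
        return 1.0
    if s == (4, 2):
        return -1.0
    return -0.04
```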

Environment History
The decision problem is sequential: the utility function depends on a sequence of states.
Here the utility of a history is the sum of the rewards received.
In our example: if the agent reaches (4,3) after 10 steps, the total utility is 1 + 10 * (-0.04) = 0.6.

Markov Decision Process (MDP)
The specification of a sequential decision problem for a fully observable environment that satisfies the Markov assumption and yields additive rewards.
Defined as a tuple <S, A, P, R>:
S: set of states
A: set of actions
P: transition function, a table P(s' | s, a) = probability of s' given action a in state s
R: reward function, R(s) = cost or reward of being in state s

In our example...
S: the positions of the agent on the grid; a cell is denoted by (x, y).
A: the actions of the agent, i.e. Up, Down, Left, Right.
P: the transition function, a table P(s' | s, a) = probability of s' given action a in state s.
E.g., P((4,3) | (3,3), Down) = 0.1
E.g., P((3,2) | (3,3), Down) = 0.8
R: the reward function.
R(3,3) = -0.04
R(4,3) = +1
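Putting the pieces together, the whole tuple <S, A, P, R> for this example can be written as a small Python structure. This is a sketch that reuses the hypothetical T and R helpers above; the names STATES, ACTIONS, TERMINALS are mine.

```python
# The 4x3 example as an explicit MDP tuple <S, A, P, R>.
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
ACTIONS = ["Up", "Down", "Left", "Right"]
TERMINALS = {(4, 3), (4, 2)}

# P as a nested table: P[s][a][s'] = probability of reaching s' from s with action a.
P = {s: {a: {s2: T(s, a, s2) for s2 in STATES if T(s, a, s2) > 0}
         for a in ACTIONS}
     for s in STATES}

REWARDS = {s: R(s) for s in STATES}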

Solution to an MDP
In a deterministic process the solution is a plan.
In an observable stochastic process the solution is a policy.
A policy is a mapping from S to A; its quality is measured by its expected utility (EU).
Notation:
π ≡ a policy
π(s) ≡ the recommended action in state s
π* ≡ the optimal policy (maximum expected utility)

Following a Policy
Procedure:
1. Determine the current state s
2. Execute action π(s)
3. Repeat 1-2
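In code, following a policy is just this loop. A minimal sketch built on the hypothetical P, R and TERMINALS structures above; the start cell (1,1) and the sampling helper are my assumptions, not something defined in the slides.

```python
import random

def sample_next_state(s, a):
    """Draw s' ~ P(. | s, a) from the transition table built above."""
    states, probs = zip(*P[s][a].items())
    return random.choices(states, weights=probs)[0]

def follow_policy(policy, s=(1, 1)):
    """1. observe the state, 2. execute policy[s], 3. repeat until a terminal state."""
    total = R(s)
    while s not in TERMINALS:
        s = sample_next_state(s, policy[s])
        total += R(s)
    return total          # additive (undiscounted) utility of the run
```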

Optimal Policy for Our Example
The step reward (-0.04) is small relative to -1, so the optimal policy prefers to go the long way around rather than risk falling into -1.
R(4,3) = +1, R(4,2) = -1, R(s) = -0.04 elsewhere.
[Figure: the optimal policy arrows on the grid.]

Optimal Policies in Our Example
R(s) < -1.6284: life is so painful that the agent runs to the nearest exit, even the -1 state.
-1.6284 < R(s) < -0.4278: life is unpleasant; the agent heads for +1 and is willing to risk falling into -1.
[Figure: the corresponding policies on the grid.]

Optimal Policies in Our Example (cont.)
-0.0221 < R(s) < 0: life is nice; the agent takes no risks at all.
R(s) > 0: the agent wants to stay in the game and never heads for a terminal state.
[Figure: the corresponding policies on the grid.]

Decision Epoch
Finite horizon:
After a fixed time N the game is over and nothing afterwards matters:
U_h([s0, s1, ..., s_(N+k)]) = U_h([s0, s1, ..., s_N]) for all k > 0.
The optimal action might change over time, depending on how much time is left.
Infinite horizon:
No fixed deadline, so there is no reason to behave differently in the same state at different times.
We will discuss this case.

Example
Suppose the agent is in (3,1). What will it do?
With a finite horizon of N = 3 it will go up, heading straight for +1 and risking -1.
With an infinite horizon the choice depends on the other parameters.
[Figure: the agent's position on the grid.]

Assigning Utility to Sequences
Additive rewards:
U_h([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
This is the method used in our example above.
Discounted rewards:
U_h([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ^2 R(s2) + ..., with 0 < γ < 1.
γ can be interpreted as the chance that the world continues to exist at each step.
We will assume discounted rewards.
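Both ways of assigning utility to a sequence fit in one small function (a sketch; γ = 1 recovers the additive case, and the example rewards are the ones from the earlier slide):

```python
def utility(rewards, gamma=1.0):
    """U_h([s0, s1, ...]) = sum over t of gamma^t * R(s_t)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The earlier example: ten -0.04 steps and then the +1 goal, undiscounted.
print(utility([-0.04] * 10 + [1.0]))          # ~0.6
# The same sequence is worth less once discounted.
print(utility([-0.04] * 10 + [1.0], 0.9))     # ~0.088
```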

Problems with Additive Rewards
Problem: with an infinite horizon, if the agent never reaches a terminal state the additive utility is infinite, and we cannot compare +∞ with +∞.
Solutions:
With discounted rewards the utility of an infinite sequence is finite:
U_h([s0, s1, s2, ...]) = sum over t ≥ 0 of γ^t R(s_t) ≤ R_max / (1 - γ), where R_max bounds the rewards.
Restrict attention to proper policies: policies that are guaranteed to reach a terminal state.
Compare infinite sequences in terms of average reward per step.
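The finiteness claim is just the geometric series bound; a quick numerical check (the R_max and γ values here are illustrative):

```python
# If every reward is at most R_max, the discounted utility of any infinite
# sequence is bounded by R_max / (1 - gamma).
R_max, gamma = 1.0, 0.9
bound = R_max / (1 - gamma)                       # 10.0
partial = sum(gamma ** t * R_max for t in range(1000))
print(partial, "<=", bound)                       # ~10.0 <= 10.0
```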

Conclusion
Discounted rewards are the best solution for the infinite-horizon case.
A policy π induces a set of possible state sequences, each occurring with a specific probability.
The value of a policy is the expected (discounted) sum of rewards over all possible state sequences.
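That expectation can be estimated directly by sampling sequences. A Monte Carlo sketch built on the hypothetical helpers above; the start state, γ, episode count and step cap are illustrative assumptions, and exact methods (value/policy iteration) are the usual way to compute this in practice.

```python
def policy_value(policy, s0=(1, 1), gamma=0.9, episodes=10_000, max_steps=1_000):
    """Estimate V(s0) under the given policy: the expected discounted sum of
    rewards over the state sequences the policy induces from s0."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, R(s0), gamma
        for _ in range(max_steps):
            if s in TERMINALS:
                break
            s = sample_next_state(s, policy[s])
            g += discount * R(s)
            discount *= gamma
        total += g
    return total / episodes
```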