Markov Decision Process (MDP)

Presentation transcript:

Markov Decision Process (MDP)
Ruti Glick, Bar-Ilan University

Policy
A policy is similar to a plan, generated ahead of time.
Unlike a traditional plan, it is not a sequence of actions that the agent must execute; if there are failures in execution, the agent can simply continue to follow the policy.
A policy prescribes an action for every state.
It maximizes expected reward, rather than just reaching a goal state.

Utility and Policy
Utility: compute for every state "What is the usefulness (utility) of this state for the overall task?"
Policy: a complete mapping from states to actions, "In which state should I perform which action?"
policy: state → action

The Optimal Policy
π*(s) = argmax_a Σ_s' T(s, a, s') U(s')
T(s, a, s') = probability of reaching state s' from state s by taking action a
U(s') = utility of state s'
If we know the utilities, we can easily compute the optimal policy. The problem is to compute the correct utilities for all states.

Finding π* Value iteration Policy iteration

Value Iteration
Process: calculate the utility of each state, then use these values to select an optimal action in every state.

Bellman Equation
U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
For example, in the 4x3 gridworld (the agent starts at START = (1,1); the terminal states have rewards +1 and -1):
U(1,1) = -0.04 + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
                        0.9U(1,1) + 0.1U(1,2),               (Left)
                        0.9U(1,1) + 0.1U(2,1),               (Down)
                        0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }  (Right)
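To make the backup concrete, here is a minimal Python sketch of a single Bellman backup for one state. The function name, the dictionary layout of T and R, and the placeholder utilities are assumptions made for this illustration, not part of the original slides.

```python
def bellman_backup(s, U, T, R, gamma):
    """One Bellman backup: R(s) + gamma * max_a sum_s' T(s, a, s') U(s').
    T[s][a] is a list of (probability, next_state) pairs."""
    best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in T[s])
    return R[s] + gamma * best

# Transitions for state (1,1) of the gridworld on the slide: the intended move
# succeeds with probability 0.8 and slips to each perpendicular direction with 0.1.
T = {(1, 1): {"Up":    [(0.8, (1, 2)), (0.1, (2, 1)), (0.1, (1, 1))],
              "Left":  [(0.9, (1, 1)), (0.1, (1, 2))],
              "Down":  [(0.9, (1, 1)), (0.1, (2, 1))],
              "Right": [(0.8, (2, 1)), (0.1, (1, 2)), (0.1, (1, 1))]}}
R = {(1, 1): -0.04}
U = {(1, 1): 0.0, (1, 2): 0.0, (2, 1): 0.0}        # placeholder utilities
print(bellman_backup((1, 1), U, T, R, gamma=1.0))  # prints -0.04 when all utilities are 0
```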

Bellman Equation Properties
U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
This gives n equations, one for each state, in n unknowns (the utilities).
Problem: max is not a linear operator, so the equations are non-linear and cannot be solved directly by linear algebra.
Solution: an iterative approach.

Value Iteration Algorithm
Start with arbitrary initial values for the utilities.
Update the utility of each state from its neighbors:
U_{i+1}(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
This iteration step is called a Bellman update.
Repeat until the values converge.

Value Iteration Properties
The equilibrium of the updates is a unique solution.
One can prove that the value iteration process converges to it.
We do not need exact utility values: approximately correct utilities are enough to extract a good policy.

Convergence
The value iteration update is a contraction:
it is a function of one argument that, when applied to two different inputs, produces values that are "closer together";
it has only one fixed point;
each application of the update brings the current values closer to that fixed point.
We are not going to prove the last point here; the consequence is that value iteration converges to the correct utilities.

Value Iteration Algorithm
function VALUE_ITERATION(mdp) returns a utility function
  inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
  local variables: U, U', vectors of utilities for the states in S, initially identical to R
  repeat
    U ← U'
    for each state s in S do
      U'[s] ← R[s] + γ max_a Σ_s' T(s, a, s') U[s']
  until close-enough(U, U')
  return U
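A Python rendering of this pseudocode might look like the sketch below. The dictionary-based representation of T and R, the actions(s) callable, and the simple max-change stopping test are assumptions of this sketch rather than part of the original slides.

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-4):
    """Value iteration for an MDP.

    T[(s, a)] is a list of (probability, next_state) pairs, R[s] is the reward
    of state s, actions(s) returns the actions available in s (empty for a
    terminal state), and gamma is the discount factor."""
    U1 = {s: R[s] for s in states}       # start from the rewards, as in the pseudocode
    while True:
        U = dict(U1)
        delta = 0.0
        for s in states:
            qs = [sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions(s)]
            U1[s] = R[s] + gamma * (max(qs) if qs else 0.0)   # Bellman update
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < eps:                  # close-enough: largest change below eps
            return U1                    # (for gamma < 1, eps*(1-gamma)/gamma gives an error bound)
```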

Example
A small version of our main example: a 2x2 world.
The agent is placed in (1,1).
States (2,1) and (2,2) are terminal (goal) states, with rewards -1 and +1 respectively.
If the agent is blocked by a wall, it stays in place.
The rewards are written on the board: R(1,1) = R(1,2) = -0.04, R(2,1) = -1, R(2,2) = +1.

Example (cont.) – First iteration
Initial utilities (equal to the rewards): U(1,1) = -0.04, U(1,2) = -0.04, U(2,1) = -1, U(2,2) = +1.
U'(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                          0.9U(1,1) + 0.1U(1,2),
                          0.9U(1,1) + 0.1U(2,1),
                          0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
        = -0.04 + 1 × max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
                           0.9×(-0.04) + 0.1×(-0.04),
                           0.9×(-0.04) + 0.1×(-1),
                           0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
        = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08
U'(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                          0.9U(1,2) + 0.1U(1,1),
                          0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                          0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
        = -0.04 + 1 × max{ 0.9×(-0.04) + 0.1×1,
                           0.9×(-0.04) + 0.1×(-0.04),
                           0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
                           0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
        = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752
The goal states remain unchanged.
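The arithmetic above can be checked mechanically. The sketch below encodes the 2x2 world's transition model (read off the equations on this slide) and performs three synchronous Bellman sweeps; the printed values correspond to this iteration and the two that follow on the next slides. The data layout is an assumption of the sketch.

```python
# Transition model of the 2x2 example: the intended move succeeds with probability 0.8,
# the agent slips to each perpendicular direction with probability 0.1, and bumping
# into a wall means staying in place. States (2,1) and (2,2) are terminal.
T = {
    (1, 1): {"Up":    [(0.8, (1, 2)), (0.1, (1, 1)), (0.1, (2, 1))],
             "Left":  [(0.9, (1, 1)), (0.1, (1, 2))],
             "Down":  [(0.9, (1, 1)), (0.1, (2, 1))],
             "Right": [(0.8, (2, 1)), (0.1, (1, 1)), (0.1, (1, 2))]},
    (1, 2): {"Up":    [(0.9, (1, 2)), (0.1, (2, 2))],
             "Left":  [(0.9, (1, 2)), (0.1, (1, 1))],
             "Down":  [(0.8, (1, 1)), (0.1, (2, 2)), (0.1, (1, 2))],
             "Right": [(0.8, (2, 2)), (0.1, (1, 2)), (0.1, (1, 1))]},
}
R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): 1.0}
gamma = 1.0

U = dict(R)                       # initial utilities equal the rewards
for i in range(1, 4):             # three synchronous sweeps, as in the worked example
    U_new = dict(U)
    for s in T:                   # only the non-terminal states are updated
        U_new[s] = R[s] + gamma * max(
            sum(p * U[s2] for p, s2 in outcomes) for outcomes in T[s].values())
    U = U_new
    print(f"iteration {i}: U(1,1) = {U[(1, 1)]:.4f}, U(1,2) = {U[(1, 2)]:.4f}")
```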

Example (cont.) – Second iteration
Current utilities: U(1,1) = -0.08, U(1,2) = 0.752, U(2,1) = -1, U(2,2) = +1.
U'(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                          0.9U(1,1) + 0.1U(1,2),
                          0.9U(1,1) + 0.1U(2,1),
                          0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
        = -0.04 + 1 × max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
                           0.9×(-0.08) + 0.1×0.752,
                           0.9×(-0.08) + 0.1×(-1),
                           0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
        = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 } = 0.4536
U'(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                          0.9U(1,2) + 0.1U(1,1),
                          0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                          0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
        = -0.04 + 1 × max{ 0.9×0.752 + 0.1×1,
                           0.9×0.752 + 0.1×(-0.08),
                           0.8×(-0.08) + 0.1×1 + 0.1×0.752,
                           0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
        = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272

Example (cont.) – Third iteration
Current utilities: U(1,1) = 0.4536, U(1,2) = 0.8272, U(2,1) = -1, U(2,2) = +1.
U'(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                          0.9U(1,1) + 0.1U(1,2),
                          0.9U(1,1) + 0.1U(2,1),
                          0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
        = -0.04 + 1 × max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
                           0.9×0.4536 + 0.1×0.8272,
                           0.9×0.4536 + 0.1×(-1),
                           0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
        = -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } = 0.5671
U'(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                          0.9U(1,2) + 0.1U(1,1),
                          0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                          0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
        = -0.04 + 1 × max{ 0.9×0.8272 + 0.1×1,
                           0.9×0.8272 + 0.1×0.4536,
                           0.8×0.4536 + 0.1×1 + 0.1×0.8272,
                           0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
        = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 } = 0.8881

Example (cont.)
Continue to the next iteration; finish when the values are "close enough."
Here the largest change in the last iteration was about 0.11, which we take to be close enough.
Current utilities: U(1,1) = 0.5671, U(1,2) = 0.8881, U(2,1) = -1, U(2,2) = +1.

"Close enough"
We will not go deeply into this issue. There are different ways to detect convergence:
RMS error: the root mean square error of the utility values, compared with the correct values; require RMS(U, U') < ε, where ε is the maximum error allowed in the utility of any state in an iteration.

"Close enough" (cont.)
Maximum-norm criterion: || U_{i+1} - U_i || < ε(1-γ)/γ
Policy loss: the difference between the expected utility obtained by following the current policy and the expected utility obtained by the optimal policy.
where: ||U|| = max_s |U(s)|, ε is the maximum error allowed in the utility of any state in an iteration, and γ is the discount factor.
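As a sketch, the maximum-norm test can be written as follows; the function and variable names are illustrative, not from the slides.

```python
def max_norm(U1, U2):
    """||U1 - U2|| = max over states of |U1(s) - U2(s)|."""
    return max(abs(U1[s] - U2[s]) for s in U1)

def close_enough(U_old, U_new, eps, gamma):
    """Stop when ||U_new - U_old|| < eps * (1 - gamma) / gamma (for gamma < 1),
    which bounds the error of U_new relative to the true utilities by eps.
    For gamma == 1 (as in the running example) fall back to a plain threshold."""
    bound = eps * (1 - gamma) / gamma if gamma < 1 else eps
    return max_norm(U_new, U_old) < bound
```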

Finding the Policy
Once the true utilities have been found, we extract the optimal policy:
for each s in S do
  π[s] ← argmax_a Σ_s' T(s, a, s') U(s')
return π
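A direct Python sketch of this extraction step, using the same dictionary-based transition model as the earlier snippets (the names are illustrative):

```python
def extract_policy(states, actions, T, U):
    """Greedy extraction: pi[s] = argmax_a sum_s' T(s, a, s') U(s')."""
    pi = {}
    for s in states:
        acts = actions(s)
        if not acts:                          # terminal state: nothing to choose
            pi[s] = None
            continue
        pi[s] = max(acts, key=lambda a: sum(p * U[s2] for p, s2 in T[(s, a)]))
    return pi
```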

Example (cont.) – Finding the optimal policy
Utilities: U(1,1) = 0.5671, U(1,2) = 0.8881, U(2,1) = -1, U(2,2) = +1.
π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),      //Up
                    0.9U(1,1) + 0.1U(1,2),                  //Left
                    0.9U(1,1) + 0.1U(2,1),                  //Down
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }     //Right
       = argmax_a { 0.8×0.8881 + 0.1×0.5671 + 0.1×(-1),
                    0.9×0.5671 + 0.1×0.8881,
                    0.9×0.5671 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.5671 + 0.1×0.8881 }
       = argmax_a { 0.6672, 0.5992, 0.4104, -0.6545 } = Up
π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                  //Up
                    0.9U(1,2) + 0.1U(1,1),                  //Left
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),      //Down
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }     //Right
       = argmax_a { 0.9×0.8881 + 0.1×1,
                    0.9×0.8881 + 0.1×0.5671,
                    0.8×0.5671 + 0.1×1 + 0.1×0.8881,
                    0.8×1 + 0.1×0.8881 + 0.1×0.5671 }
       = argmax_a { 0.8993, 0.856, 0.6425, 0.9455 } = Right

Summary – value iteration
1. Take the given environment (the 4x3 gridworld with terminal states +1 and -1).
2. Calculate the utilities of all states with value iteration. For this world the Bellman equations form 11 equations in 11 unknowns; because of the max operator they are non-linear, so they are solved iteratively rather than by a direct linear-algebra method.
3. Extract the optimal policy.
4. Execute actions.
[Figure: the grid annotated with the resulting utilities 0.912, 0.868, 0.812, 0.762, 0.705, 0.660, 0.655, 0.611, 0.388 and the terminals +1, -1.]

Example – convergence
[Figure: convergence of the value iteration example as a function of the error allowed.]

Policy Iteration
Pick a policy, then calculate the utility of each state given that policy (policy evaluation, similar to a value iteration step).
Update the policy at each state using the utilities of the successor states.
Repeat until the policy stabilizes.

Policy Iteration
At each step, for each state:
Policy evaluation: given policy π_i, calculate U_i = U^π_i, the utility of each state if π_i were to be executed.
Policy improvement: calculate a new policy π_{i+1} based on U^π_i:
π_{i+1}[s] ← argmax_a Σ_s' T(s, a, s') U^π_i(s')

Policy Iteration Algorithm
function POLICY_ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, transition model T
  local variables: U, a vector of utilities for the states in S, initially identical to R
                   π, a policy, a vector indexed by states, initially random
  repeat
    U ← Policy-Evaluation(π, mdp)
    unchanged? ← true
    for each state s in S do
      if max_a Σ_s' T(s, a, s') U[s'] > Σ_s' T(s, π[s], s') U[s'] then
        π[s] ← argmax_a Σ_s' T(s, a, s') U[s']
        unchanged? ← false
    end
  until unchanged?
  return π
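A Python sketch of the same loop, using exact policy evaluation by solving the linear system U = R + γ T_π U with NumPy. The state indexing, the actions(s) callable, and the data layout are assumptions of this sketch.

```python
import numpy as np

def policy_evaluation(pi, states, T, R, gamma):
    """Exactly solve U = R + gamma * T_pi * U for the fixed policy pi.
    Terminal states are marked by pi[s] is None and keep U(s) = R(s)."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R[s] for s in states], dtype=float)
    for s in states:
        if pi[s] is None:                      # terminal: row stays U(s) = R(s)
            continue
        for p, s2 in T[(s, pi[s])]:
            A[idx[s], idx[s2]] -= gamma * p    # build (I - gamma * T_pi)
    U = np.linalg.solve(A, b)
    return {s: U[idx[s]] for s in states}

def policy_iteration(states, actions, T, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    pi = {s: (actions(s)[0] if actions(s) else None) for s in states}  # arbitrary start
    while True:
        U = policy_evaluation(pi, states, T, R, gamma)
        unchanged = True
        for s in states:
            acts = actions(s)
            if not acts:
                continue
            q = {a: sum(p * U[s2] for p, s2 in T[(s, a)]) for a in acts}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]]:             # strict improvement, as in the pseudocode
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U
```

On the 2x2 example that follows, this loop terminates with π(1,1) = Up and π(1,2) = Right.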

Example
Back to our last example: the 2x2 world.
The agent is placed in (1,1).
States (2,1) and (2,2) are terminal (goal) states, with rewards -1 and +1 respectively.
If the agent is blocked by a wall, it stays in place.
The rewards are written on the board: R(1,1) = R(1,2) = -0.04.
Initial policy: Up (for every state).

Example (cont.) – First iteration: policy evaluation
Current policy: π(1,1) = Up, π(1,2) = Up.
U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ (0.9U(1,2) + 0.1U(2,2))
U(2,1) = R(2,1)
U(2,2) = R(2,2)
With γ = 1:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
U(2,1) = -1
U(2,2) = 1
As a linear system:
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 = 0U(1,1) - 0.1U(1,2) + 0U(2,1) + 0.1U(2,2)
-1 = 0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
1 = 0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
Solution: U(1,1) = 0.3778, U(1,2) = 0.6, U(2,1) = -1, U(2,2) = 1.
(On solving linear systems: http://www.math.ncsu.edu/ma114/PDF/1.4.pdf)
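The same linear system can be solved mechanically. A minimal NumPy sketch (the state ordering is chosen here for illustration) reproduces U(1,1) ≈ 0.3778 and U(1,2) = 0.6:

```python
import numpy as np

# Unknowns ordered as [U(1,1), U(1,2), U(2,1), U(2,2)].
# Rows encode U = R + T_pi U for the non-terminal states under policy "Up",
# and U = R for the two terminal states.
A = np.array([
    [0.9, -0.8, -0.1,  0.0],   # U(1,1) - (0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)) = -0.04
    [0.0,  0.1,  0.0, -0.1],   # U(1,2) - (0.9 U(1,2) + 0.1 U(2,2))             = -0.04
    [0.0,  0.0,  1.0,  0.0],   # U(2,1) = -1
    [0.0,  0.0,  0.0,  1.0],   # U(2,2) = +1
])
b = np.array([-0.04, -0.04, -1.0, 1.0])
U = np.linalg.solve(A, b)
print(U)   # approximately [0.3778, 0.6, -1.0, 1.0]
```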

Example (cont.) – First iteration: policy improvement
Utilities: U(1,1) = 0.3778, U(1,2) = 0.6, U(2,1) = -1, U(2,2) = +1. Current policy: π(1,1) = Up, π(1,2) = Up.
π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),      //Up
                    0.9U(1,1) + 0.1U(1,2),                  //Left
                    0.9U(1,1) + 0.1U(2,1),                  //Down
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }     //Right
       = argmax_a { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
                    0.9×0.3778 + 0.1×0.6,
                    0.9×0.3778 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
       = argmax_a { 0.4178, 0.4, 0.24, -0.7022 } = Up  →  no update needed
π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                  //Up
                    0.9U(1,2) + 0.1U(1,1),                  //Left
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),      //Down
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }     //Right
       = argmax_a { 0.9×0.6 + 0.1×1,
                    0.9×0.6 + 0.1×0.3778,
                    0.8×0.3778 + 0.1×1 + 0.1×0.6,
                    0.8×1 + 0.1×0.6 + 0.1×0.3778 }
       = argmax_a { 0.64, 0.5778, 0.4622, 0.8978 } = Right  →  update π(1,2) to Right

Example (cont.) – Second iteration: policy evaluation
Current policy: π(1,1) = Up, π(1,2) = Right.
U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)
With γ = 1:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = 1
As a linear system:
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 = 0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
-1 = 0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
1 = 0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
Solution: U(1,1) ≈ 0.6603, U(1,2) ≈ 0.9178, U(2,1) = -1, U(2,2) = 1.

Example (cont.) – Second iteration: policy improvement
Utilities: U(1,1) = 0.6603, U(1,2) = 0.9178, U(2,1) = -1, U(2,2) = +1. Current policy: π(1,1) = Up, π(1,2) = Right.
π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),      //Up
                    0.9U(1,1) + 0.1U(1,2),                  //Left
                    0.9U(1,1) + 0.1U(2,1),                  //Down
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }     //Right
       = argmax_a { 0.8×0.9178 + 0.1×0.6603 + 0.1×(-1),
                    0.9×0.6603 + 0.1×0.9178,
                    0.9×0.6603 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.6603 + 0.1×0.9178 }
       = argmax_a { 0.7003, 0.686, 0.4943, -0.6422 } = Up  →  no update needed
π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                  //Up
                    0.9U(1,2) + 0.1U(1,1),                  //Left
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),      //Down
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }     //Right
       = argmax_a { 0.9×0.9178 + 0.1×1,
                    0.9×0.9178 + 0.1×0.6603,
                    0.8×0.6603 + 0.1×1 + 0.1×0.9178,
                    0.8×1 + 0.1×0.9178 + 0.1×0.6603 }
       = argmax_a { 0.926, 0.892, 0.72, 0.9578 } = Right  →  no update needed

Example (cont.)
No change in the policy was found, so we finish.
The optimal policy: π(1,1) = Up, π(1,2) = Right.
Policy iteration must terminate, since the number of possible policies is finite and the policy never gets worse from one iteration to the next.

Simplified Policy Iteration
We can focus on a subset of states at each step and perform either:
simplified value iteration (evaluation with the policy held fixed):
U_{i+1}(s) = R(s) + γ Σ_s' T(s, π(s), s') U_i(s')
or policy improvement.
This is guaranteed to converge under certain conditions on the initial policy and utility values.
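A sketch of this simplified evaluation step, applying the fixed-policy update only to a chosen subset of states (names and data layout follow the earlier illustrative sketches):

```python
def simplified_evaluation_sweep(subset, pi, T, R, U, gamma):
    """One sweep of simplified value iteration over a chosen subset of states:
    U_{i+1}(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U_i(s')."""
    U_new = dict(U)
    for s in subset:
        if pi[s] is None:                     # terminal state: utility stays at its reward
            continue
        U_new[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])])
    return U_new
```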

Policy Iteration Properties
Policy evaluation involves only linear equations, which are easy to solve.
Fast convergence in practice (few iterations).
Proven to converge to an optimal policy.

Value vs. Policy Iteration
Which to use? Policy iteration is more expensive per iteration, but in practice it usually requires fewer iterations.

Further reading: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. A Bradford Book, The MIT Press, Cambridge, Massachusetts; London, England. http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html