Planning to Maximize Reward: Markov Decision Processes


1 Planning to Maximize Reward: Markov Decision Processes
Brian C. Williams, 16.412J/6.835J, October 23rd, 2002. Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU.

2 Markov Decision Processes
Motivation
What are Markov Decision Processes?
Computing Action Policies From a Model
Summary

3 Reading and Assignments
Markov Decision Processes: read AIMA Chapter 17, Sections 1-4 (or the equivalent in the new text).
Reinforcement Learning: read AIMA Chapter 20 (or the equivalent in the new text).
Next homework: involves coding MDPs, RL, and HMM belief update.
Lecture based on the development in "Machine Learning" by Tom Mitchell, Chapter 13: Reinforcement Learning.

4 How Might a Mouse Search a Maze for Cheese?
State space search?
As a constraint satisfaction problem?
Goal-directed planning?
As a rule-based or production system?
What is missing?

5 Ideas in this lecture
The objective is to accumulate reward, rather than to reach goal states.
Objectives are achieved along the way, rather than only at the end.
The task is to generate a policy for how to act in every situation, rather than a plan for a single starting situation.
Policies fall out of value functions, which describe the greatest lifetime reward achievable from every state.
Value functions are approximated iteratively.

6 Markov Decision Processes
Motivation
What are Markov Decision Processes (MDPs)?
  Models
  Lifetime Reward
  Policies
Computing Policies From a Model
Summary

7 MDP Problem
[Figure: agent-environment loop. At each step the agent observes state st, chooses action at, and receives reward rt from the environment.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + …

8 MDP Problem: Model
[Figure: the same agent-environment loop, with the environment model highlighted.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + …

9 Markov Decision Processes (MDPs)
Model:
Finite set of states, S
Finite set of actions, A
Probabilistic state transitions, δ(s,a)
Reward for each state and action, r(s,a)
Process:
Observe state st in S
Choose action at in A
Receive immediate reward rt
State changes to st+1
Example: [Figure: transition diagram over states s0…s3 with goal state G; only legal transitions are shown, the transition into G has reward 10, and the reward on unlabeled transitions is 0.]
A minimal encoding of such a deterministic model is sketched below.
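Not part of the original slides: a minimal Python sketch of one way such a deterministic model could be encoded. The state and action names and the reward-10 transition into G are assumptions patterned on the example figure, not taken from the slide itself.

```python
# A minimal sketch of a deterministic MDP model (assumed encoding, not from the slides).
# States and actions are plain strings; delta and reward are dictionaries keyed by (state, action).

STATES = ["s0", "s1", "s2", "s3", "G"]      # hypothetical state names
ACTIONS = ["a0", "a1", "a2"]                # hypothetical action names

# delta(s, a): deterministic successor state for each legal (state, action) pair.
DELTA = {
    ("s0", "a0"): "s1",
    ("s1", "a1"): "s2",
    ("s2", "a2"): "s3",
    ("s3", "a0"): "G",    # assumed transition into the goal state
}

# r(s, a): immediate reward; unlabeled transitions default to 0.
REWARD = {
    ("s3", "a0"): 10,     # reward 10 on the transition into G, as in the example
}

def step(state, action):
    """One step of the deterministic MDP: returns (next_state, reward)."""
    next_state = DELTA[(state, action)]
    return next_state, REWARD.get((state, action), 0)
```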

10 MDP Environment Assumptions
Markov assumption: the next state and reward are functions only of the current state and action:
st+1 = δ(st, at)
rt = r(st, at)
Uncertain and unknown environment: δ and r may be nondeterministic and unknown.
Today: deterministic case only.

11 MDP Problem: Model
[Figure: the agent-environment loop again, focusing on the model.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + …

12 MDP Problem: Lifetime Reward
[Figure: the agent-environment loop, with the lifetime reward highlighted.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + …

13 Lifetime Reward
Finite horizon: rewards accumulate for a fixed period.
$100K + $100K + $100K = $300K
Infinite horizon: assume reward accumulates forever.
$100K + $100K + … = infinity
Discounting: future rewards are not worth as much (a bird in the hand …). Introduce discount factor γ:
$100K + γ·$100K + γ²·$100K + … converges.
This will make the math work. A worked version of this sum appears below.
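Not on the slide, but for concreteness: with a constant reward R per step and 0 ≤ γ < 1, the discounted sum is a geometric series, which is why it converges.

```latex
% Discounted lifetime reward with a constant reward R per step (geometric series)
V = \sum_{t=0}^{\infty} \gamma^{t} R = \frac{R}{1-\gamma},
\qquad \text{e.g. } R = \$100\text{K},\ \gamma = 0.9 \;\Rightarrow\; V = \$1\text{M}.
```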

14 MDP Problem
[Figure: the agent-environment loop and the lifetime reward V = r0 + γ r1 + γ² r2 + …]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + …

15 MDP Problem: Policy
[Figure: the agent-environment loop, with the agent's policy highlighted.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + …

16 Assume a deterministic world.
A policy π : S → A selects an action for each state.
An optimal policy π* : S → A selects, for each state, the action that maximizes lifetime reward.
[Figure: two copies of the maze with goal G and reward 10, one labeled with a policy π, the other with an optimal policy π*.]
In code, a deterministic policy is just a lookup from states to actions; see the sketch below.
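Not from the slides: one natural encoding of a deterministic policy is a plain dictionary from states to actions. The entries below are hypothetical, reusing the example state and action names from the earlier model sketch.

```python
# A deterministic policy pi : S -> A represented as a dictionary (hypothetical entries,
# reusing the example state/action names; the absorbing goal state G needs no action).
policy = {
    "s0": "a0",
    "s1": "a1",
    "s2": "a2",
    "s3": "a0",   # assumed to be the action that moves into G
}

# Applying the policy for one step with the step() helper from the model sketch:
# next_state, reward = step("s3", policy["s3"])   # -> ("G", 10) under the assumed model
```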

17 There are many policies; not all are necessarily optimal.
There may be several optimal policies.
[Figure: the maze with goal G and reward 10.]

18 Markov Decision Processes
Motivation
Markov Decision Processes
Computing Policies From a Model
  Value Functions
  Mapping Value Functions to Policies
  Computing Value Functions
Summary

19 Value Function Vπ for a Given Policy π
Vπ(st) is the accumulated lifetime reward resulting from starting in state st and repeatedly executing policy π:
Vπ(st) = rt + γ rt+1 + γ² rt+2 + …
Vπ(st) = Σi γ^i rt+i
where rt, rt+1, rt+2, … are generated by following π starting at st.
[Figure: the maze with goal G and reward 10, with policy π and Vπ labeled at each state for γ = 0.9; states one step from the reward have Vπ = 10, and states two steps away have Vπ = 0 + 0.9 × 10 = 9.]
A direct way to approximate Vπ by rolling out the policy is sketched below.
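Not from the slides: a short sketch of approximating Vπ by simply following a deterministic policy and summing discounted rewards, reusing the hypothetical step() function from the model sketch above. Truncating after n steps leaves only a small discounted tail.

```python
def evaluate_policy(policy, start_state, gamma=0.9, n_steps=50):
    """Approximate V^pi(start_state) by following a deterministic policy.

    policy: dict mapping state -> action (the hypothetical encoding above).
    Truncates the infinite sum after n_steps; the neglected tail is bounded by
    gamma**n_steps * max_reward / (1 - gamma).
    """
    state, value, discount = start_state, 0.0, 1.0
    for _ in range(n_steps):
        action = policy.get(state)
        if action is None:                 # e.g. an absorbing goal state with no action
            break
        state, reward = step(state, action)
        value += discount * reward
        discount *= gamma
    return value
```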

20 An Optimal Policy π* Given the Value Function V*
Idea: given state s,
Examine the possible actions ai in state s.
Select the action ai with the greatest lifetime reward.
The lifetime reward Q(s, ai) is the immediate reward of taking the action, r(s, ai), plus the lifetime reward starting in the target state, V(δ(s, ai)), discounted by γ:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Requires:
Value function
Environment model: δ : S × A → S and r : S × A → ℝ
[Figure: the maze with goal G and reward 10, with V* values 10 and 9 shown along the optimal policy.]
A one-step-lookahead sketch of this policy extraction appears below.
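Not from the slides: a minimal sketch of this one-step lookahead, again assuming the hypothetical DELTA and REWARD dictionaries introduced earlier as the model δ and r.

```python
def greedy_policy(value, states, actions, gamma=0.9):
    """Extract pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ] for each state.

    value: dict mapping every state (including the goal) to its V* estimate.
    Only actions that are legal in s (i.e. present in DELTA) are considered.
    """
    policy = {}
    for s in states:
        best_action, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in DELTA:        # skip actions that are illegal in this state
                continue
            q = REWARD.get((s, a), 0) + gamma * value[DELTA[(s, a)]]
            if q > best_q:
                best_action, best_q = a, q
        if best_action is not None:
            policy[s] = best_action
    return policy
```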

21 Example: Mapping Value Function to Policy
The agent selects the optimal action from V: π(s) = argmax_a [r(s,a) + γ V(δ(s,a))]
Model + V, with γ = 0.9.
[Figure: the grid world with goal G, labeled with V values 90, 81, and 100 and with rewards of 100 on the transitions into G.]

22 Example: Mapping Value Function to Policy
The agent selects the optimal action from V: π(s) = argmax_a [r(s,a) + γ V(δ(s,a))]
Model + V, with γ = 0.9. For the state being considered, with candidate actions a and b:
a: 0 + 0.9 × 100 = 90
b: 0 + 0.9 × 81 = 72.9
Select a.
[Figure: the grid world with V values 81, 90, 100 and the resulting policy arrow for this state.]

23 Example: Mapping Value Function to Policy
The agent selects the optimal action from V: π(s) = argmax_a [r(s,a) + γ V(δ(s,a))]
Model + V, with γ = 0.9. For the state next to G, with candidate actions a (into G) and b:
a: 100 + 0.9 × 0 = 100
b: 0 + 0.9 × 90 = 81
Select a.
[Figure: the grid world with V values 81, 90, 100 and the policy arrows chosen so far.]

24 Example: Mapping Value Function to Policy
The agent selects the optimal action from V: π(s) = argmax_a [r(s,a) + γ V(δ(s,a))]
Model + V, with γ = 0.9. Exercise: for the state with candidate actions a, b, and c, compute a: ?, b: ?, c: ?, and select the best.
[Figure: the grid world with V values 81, 90, 100 and the partially built policy π.]

25 Markov Decision Processes
Motivation
Markov Decision Processes
Computing Policies From a Model
  Value Functions
  Mapping Value Functions to Policies
  Computing Value Functions
Summary

26 Value Function V* for an Optimal Policy π*
Example: [Figure: a two-state example with states SA and SB, actions A and B, and rewards RA and RB; the diagram expands V*1(SA) and V*1(SB) as a max over the available actions.]
Optimal value function for a one-step horizon:
V*1(s) = max_ai [r(s, ai)]

27 Value Function V* for an Optimal Policy π*
Example: [Figure: the same two-state example; the two-step value V*2 of each state is built from the one-step values V*1 of its successor states.]
Optimal value function for a one-step horizon:
V*1(s) = max_ai [r(s, ai)]
Optimal value function for a two-step horizon:
V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
This is an instance of the dynamic programming principle: reuse shared sub-results for an exponential saving.

28 Value Function V* for an Optimal Policy π*
Example: the same two-state figure as on the previous slides.
Optimal value function for a one-step horizon:
V*1(s) = max_ai [r(s, ai)]
Optimal value function for a two-step horizon:
V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
Optimal value function for an n-step horizon:
V*n(s) = max_ai [r(s, ai) + γ V*n-1(δ(s, ai))]
A memoized version of this recursion is sketched below.
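Not from the slides: a small memoized version of this n-step recursion, which is exactly the reuse of shared sub-results that the dynamic-programming remark on slide 27 refers to. It assumes the hypothetical ACTIONS, DELTA, and REWARD encoding from the first sketch.

```python
from functools import lru_cache

GAMMA = 0.9

@lru_cache(maxsize=None)
def v_star(state, horizon):
    """V*_n(s): best achievable discounted reward from `state` with `horizon` steps to go.

    Memoization (lru_cache) ensures each (state, horizon) pair is computed once,
    turning an exponential tree of action sequences into a small table.
    """
    if horizon == 0:
        return 0.0
    legal = [a for a in ACTIONS if (state, a) in DELTA]
    if not legal:                          # e.g. an absorbing goal state
        return 0.0
    return max(
        REWARD.get((state, a), 0) + GAMMA * v_star(DELTA[(state, a)], horizon - 1)
        for a in legal
    )
```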

29 Value Function V* for an Optimal Policy π*
Example: the same two-state figure as on the previous slides.
Optimal value function for a one-step horizon:
V*1(s) = max_ai [r(s, ai)]
Optimal value function for a two-step horizon:
V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
Optimal value function for an n-step horizon:
V*n(s) = max_ai [r(s, ai) + γ V*n-1(δ(s, ai))]
Optimal value function for an infinite horizon:
V*(s) = max_ai [r(s, ai) + γ V*(δ(s, ai))]

30 Solving MDPs by Value Iteration
Insight: optimal values can be calculated iteratively using dynamic programming.
Algorithm: iteratively update values using Bellman's equation:
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))]
Terminate when values are "close enough":
|V*t+1(s) - V*t(s)| < ε for all s
The agent then selects the optimal action by one-step lookahead on V*:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
A compact sketch of this loop follows.
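Not from the slides: a compact value-iteration sketch for the deterministic case, using the same hypothetical DELTA and REWARD encoding as the earlier sketches.

```python
def value_iteration(states, actions, gamma=0.9, epsilon=1e-3):
    """Compute V*(s) for a deterministic MDP by repeated Bellman backups.

    Returns a dict mapping state -> estimated V*(state). Terminates when the
    largest per-state change in one sweep falls below epsilon.
    """
    V = {s: 0.0 for s in states}
    while True:
        V_next, max_delta = {}, 0.0
        for s in states:
            legal = [a for a in actions if (s, a) in DELTA]
            if legal:
                V_next[s] = max(
                    REWARD.get((s, a), 0) + gamma * V[DELTA[(s, a)]]
                    for a in legal
                )
            else:                          # absorbing state: value stays 0
                V_next[s] = 0.0
            max_delta = max(max_delta, abs(V_next[s] - V[s]))
        V = V_next
        if max_delta < epsilon:
            return V
```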

31 Convergence of Value Iteration
Algorithm: iteratively update values using Bellman's equation:
Vt+1(s) ← max_a [r(s,a) + γ Vt(δ(s,a))]
Terminate when values are "close enough":
|Vt+1(s) - Vt(s)| < ε for all s
Then:
max_{s in S} |Vt+1(s) - V*(s)| < 2εγ / (1 - γ)
Note: convergence is guaranteed even if the updates are performed in any order, as long as every state is updated infinitely often.

32 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: the grid world before (V*t, all zeros) and after (V*t+1) one backup. For the state being updated, actions a and b both lead to value-0 states with reward 0: a: 0 + 0.9 × 0 = 0, b: 0 + 0.9 × 0 = 0, so max = 0.]

33 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: the state next to G is updated. Its actions: a (into G): 100 + 0.9 × 0 = 100, b: 0 + 0.9 × 0 = 0, c: 0 + 0.9 × 0 = 0, so its new value is max = 100.]

34 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: the sweep continues to another state; its only listed action gives 0 + 0.9 × 0 = 0, so its value remains 0.]

35 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: V*t and V*t+1 shown on the grid; at this point the states adjacent to G have value 100 and the remaining states are still 0.]

36 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: another backup over the grid; the values adjacent to G stay at 100.]

37 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: the grid after a further backup; values of 100 adjacent to G, with the remaining states still being updated.]

38 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: in the next sweep, a state whose best action leads into a value-100 state is updated to 0 + 0.9 × 100 = 90.]

39 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: the following update gives a farther state 0 + 0.9 × 90 = 81; the grid now shows values 81, 90, and 100.]

40 Example of Value Iteration
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))], with γ = 0.9.
[Figure: the converged value function, with values 90, 100, 81, 90, and 100 over the non-goal states and 0 at the absorbing goal G, matching the value function used in the earlier policy-extraction example.]
A small script that reproduces these converged values is sketched after this slide.
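Not from the slides: a usage sketch that runs the value_iteration function from the slide-30 sketch on a 2×3 grid world reconstructed from the figures (reward 100 on the two transitions into an absorbing goal G, reward 0 elsewhere, γ = 0.9). The exact wiring of the grid is an assumption, so treat the printed values as what this reconstruction produces rather than as the slide's own output.

```python
# Hypothetical 2x3 grid world reconstructed from the figures (an assumption, not the slide itself).
# Cells: tl tm G      (top-left, top-middle, goal)
#        bl bm br     (bottom-left, bottom-middle, bottom-right)
GRID_STATES = ["tl", "tm", "G", "bl", "bm", "br"]
GRID_ACTIONS = ["up", "down", "left", "right"]

GRID_DELTA = {
    ("tl", "right"): "tm", ("tl", "down"): "bl",
    ("tm", "left"): "tl",  ("tm", "right"): "G",  ("tm", "down"): "bm",
    ("bl", "right"): "bm", ("bl", "up"): "tl",
    ("bm", "left"): "bl",  ("bm", "right"): "br", ("bm", "up"): "tm",
    ("br", "left"): "bm",  ("br", "up"): "G",
    # G is absorbing: no outgoing actions.
}
GRID_REWARD = {("tm", "right"): 100, ("br", "up"): 100}  # reward 100 only on entering G

# Reuse value_iteration from the earlier sketch; it reads the module-level DELTA/REWARD,
# so point those names at this grid model before calling it.
DELTA, REWARD = GRID_DELTA, GRID_REWARD
V = value_iteration(GRID_STATES, GRID_ACTIONS, gamma=0.9, epsilon=1e-3)
print(V)  # this reconstruction yields roughly: tl 90, tm 100, G 0, bl 81, bm 90, br 100
```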

41 Markov Decision Processes
Motivation
Markov Decision Processes
Computing Policies From a Model
Summary

42 Markov Decision Processes (MDPs)
Model:
Finite set of states, S
Finite set of actions, A
Probabilistic state transitions, δ(s,a)
Reward for each state and action, r(s,a)
Process:
Observe state st in S
Choose action at in A
Receive immediate reward rt
State changes to st+1
Deterministic example: [Figure: the maze with goal G and reward 10 on the transition into G.]

43 Crib Sheet: MDPs by Value Iteration
Insight: optimal values can be calculated iteratively using dynamic programming.
Algorithm: iteratively update values using Bellman's equation:
V*t+1(s) ← max_a [r(s,a) + γ V*t(δ(s,a))]
Terminate when values are "close enough":
|V*t+1(s) - V*t(s)| < ε for all s
The agent selects the optimal action by one-step lookahead on V*:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]

44 How Might a Mouse Search a Maze for Cheese?
By Value Iteration? What is missing?

