1
MDP Presentation
CS594 Automated Optimal Decision Making
Sohail M Yousof
Advanced Artificial Intelligence
2
Topic: Planning and Control in Stochastic Domains with Imperfect Information
3
Objective
Markov Decision Processes (sequences of decisions)
– Introduction to MDPs
– Computing optimal policies for MDPs
4
Markov Decision Process (MDP)
Sequential decision problems under uncertainty
– Not just the immediate utility, but the longer-term utility as well
– Uncertainty in outcomes
Roots in operations research
Also used in economics, communications engineering, ecology, performance modeling and, of course, AI!
– Also referred to as stochastic dynamic programs
5
Markov Decision Process (MDP)
Defined as a tuple ⟨S, A, P, R⟩:
– S: states
– A: actions
– P: transition function; a table P(s' | s, a), the probability of s' given action a in state s
– R: reward; R(s, a) is the cost or reward of taking action a in state s
Choose a sequence of actions (not just one decision or one action)
– Utility is based on a sequence of decisions
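As a concrete illustration of this tuple (not part of the original slides), the components might be held in a small Python container like the sketch below; the class name and field layout are assumptions made here for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]      # e.g. a grid cell (x, y)
Action = str                 # e.g. "N", "E", "S", "W"

@dataclass
class MDP:
    states: List[State]                                        # S
    actions: List[Action]                                      # A
    # transitions[(s, a)] maps each successor state s' to P(s' | s, a)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    # rewards[(s, a)] is the immediate reward for taking a in s
    rewards: Dict[Tuple[State, Action], float]

    def P(self, s_next: State, s: State, a: Action) -> float:
        """Probability of reaching s_next when action a is taken in state s."""
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s: State, a: Action) -> float:
        """Immediate reward for taking action a in state s."""
        return self.rewards.get((s, a), 0.0)
```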
6
Example: What SEQUENCE of actions should our agent take?
[Figure: a 4×3 grid world with a Start cell, a blocked cell, a +1 reward cell and a penalty cell; an intended move succeeds with probability 0.8 and slips sideways with probability 0.1.]
– Each action costs –1/25
– The agent can take actions N, E, S, W
– It faces uncertainty in every state
7
MDP Tuple:
S: state of the agent on the grid, e.g., (4,3)
– Note that a cell is denoted by (x, y)
A: actions of the agent, i.e., N, E, S, W
P: transition function
– Table P(s' | s, a): probability of s' given action a in state s
– E.g., P((4,3) | (3,3), N) = 0.1
– E.g., P((3,2) | (3,3), N) = 0.8
– (Robot movement, uncertainty about another agent's actions, …)
R: reward (more comments on the reward function later)
– R((3,3), N) = –1/25
– R(4,1) = +1
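A hedged sketch of how the grid-world entries quoted above could be encoded as plain dictionaries (these could equally populate the MDP container sketched earlier). Only the values quoted on the slide are filled in; the assumption that the remaining 0.1 of probability mass slips to the other sideways neighbour is mine, and a full model would need every (state, action) pair.

```python
# P[(s, a)][s'] = P(s' | s, a)
P = {
    ((3, 3), "N"): {(3, 2): 0.8, (4, 3): 0.1, (2, 3): 0.1},  # (2,3) share is an assumption
}

# R[(s, a)]: cost of each move; the +1 is associated with cell (4, 1) on the slide
R = {
    ((3, 3), "N"): -1 / 25,
}
state_reward = {(4, 1): +1}

print(P[((3, 3), "N")][(4, 3)])   # -> 0.1, matching P((4,3) | (3,3), N) on the slide
```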
8
Terminology
Before describing policies, let's go through some terminology that will be useful throughout this set of lectures
Policy: a complete mapping from states to actions
9
MDP Basics and Terminology
An agent must make a decision or control a probabilistic system
The goal is to choose a sequence of actions for optimality
Defined as MDP models:
– Finite horizon: maximize the expected reward for the next n steps
– Infinite horizon: maximize the expected discounted reward
– Transition model: maximize the average expected reward per transition
– Goal state: maximize the expected reward (minimize the expected cost) to some target state G
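A small illustrative sketch (not from the slides) contrasting the finite-horizon and infinite-horizon discounted criteria on a toy reward sequence; the reward values and the discount factor gamma = 0.9 are assumptions made here.

```python
def finite_horizon_return(rewards, n):
    """Total expected reward over the next n steps."""
    return sum(rewards[:n])

def discounted_return(rewards, gamma):
    """Discounted reward: sum_t gamma^t * r_t (truncated to the given sequence)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-1 / 25, -1 / 25, -1 / 25, 1.0]    # three step costs, then a +1 reward (assumed)
print(finite_horizon_return(rewards, n=4))    # finite-horizon criterion
print(discounted_return(rewards, gamma=0.9))  # discounted criterion with assumed gamma = 0.9
```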
10
Reward Function
According to Chapter 2, the reward is directly associated with the state
– Denoted R(I)
– Simplifies computations seen later in the algorithms presented
Sometimes the reward is assumed to be associated with a (state, action) pair
– R(S, A)
– We could also assume a mix of R(S, A) and R(S)
Sometimes the reward is associated with a (state, action, destination-state) triple
– R(S, A, J)
– R(S, A) = Σ_J P(J | S, A) · R(S, A, J)
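A minimal sketch (assuming dictionary-based containers, not from the slides) of the last identity, marginalizing a destination-dependent reward R(S, A, J) into R(S, A) using the transition probabilities:

```python
def expected_reward(s, a, P, R_saj):
    """R(s, a) = sum over J of P(J | s, a) * R(s, a, J).

    P[(s, a)] maps successor states J to P(J | s, a);
    R_saj[(s, a, J)] is the destination-dependent reward.
    """
    return sum(p * R_saj[(s, a, j)] for j, p in P[(s, a)].items())

# Toy example with assumed numbers:
P = {("s0", "go"): {"s1": 0.8, "s2": 0.2}}
R_saj = {("s0", "go", "s1"): 10.0, ("s0", "go", "s2"): -5.0}
print(expected_reward("s0", "go", P, R_saj))   # 0.8*10 + 0.2*(-5) = 7.0
```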
11
Markov Assumption
Markov Assumption: transition probabilities (and rewards) from any given state depend only on that state and not on the previous history
Where you end up after an action depends only on the current state
– Named after the Russian mathematician A. A. Markov (1856–1922)
– (He did not come up with Markov decision processes, however)
– Transitions from state (1,2) do not depend on the prior state (1,1) or (1,2)
12
MDPs vs POMDPs
Accessibility: the agent's percepts in any given state identify the state that it is in, e.g., state (4,3) vs (3,3)
– Given the observations, the state can be uniquely determined
– Hence, we will not explicitly consider observations, only states
Inaccessibility: the agent's percepts in any given state DO NOT identify the state that it is in, e.g., it may be in (4,3) or (3,3)
– Given the observations, the state cannot be uniquely determined
– POMDP: partially observable MDP, for inaccessible environments
We will focus on MDPs in this presentation.
13
MDP vs POMDP
[Figure: in an MDP the agent observes the world's states and issues actions; in a POMDP the agent receives observations, a state estimator (SE) maintains a belief state b, and the policy (P) maps beliefs to actions.]
14
Stationary and Deterministic Policies
Policy denoted by the symbol π
15
Policy
A policy is like a plan, but not quite
– Certainly generated ahead of time, like a plan
Unlike traditional plans, it is not a sequence of actions that an agent must execute
– If there are failures in execution, the agent can continue to execute a policy
Prescribes an action for all the states
Maximizes expected reward, rather than just reaching a goal state
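To make this concrete (an illustrative sketch, not from the slides), a stationary deterministic policy for a grid world can be represented simply as a table from states to actions; the particular entries below are arbitrary placeholders, not the optimal policy.

```python
# A stationary, deterministic policy: one action prescribed for each state.
policy = {
    (1, 1): "N",
    (2, 1): "E",
    (3, 3): "N",
    (4, 2): "W",
}

def act(state):
    """A policy prescribes an action in whatever state the agent finds itself."""
    return policy[state]

print(act((3, 3)))   # -> "N"
```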
16
The MDP Problem
The MDP problem consists of:
– Finding the optimal control policy for all possible states;
– Finding the sequence of optimal control functions for a specific initial state;
– Finding the best control action (decision) for a specific state.
17
Non-Optimal vs Optimal Policy
[Figure: the 4×3 grid world with the +1 cell and the Start cell, overlaid with candidate policies drawn in red, yellow and blue.]
Choose the red policy or the yellow policy? Choose the red policy or the blue policy? Which is optimal (if any)?
Value iteration: one popular algorithm for determining the optimal policy
18
Value Iteration: Key Idea
Iterate: update the utility of state I using the old utilities of neighbor states J, given actions A
– U_t+1(I) = max_A [ R(I,A) + Σ_J P(J|I,A) · U_t(J) ]
– P(J|I,A): probability of J if A is taken in state I
– max_A F(A) returns the highest F(A)
– Immediate reward and longer-term reward are both taken into account
19
Value Iteration: Algorithm
Initialize: U_0(I) = 0
Iterate: U_t+1(I) = max_A [ R(I,A) + Σ_J P(J|I,A) · U_t(J) ]
– until close-enough(U_t+1, U_t)
At the end of the iteration, calculate the optimal policy:
Policy(I) = argmax_A [ R(I,A) + Σ_J P(J|I,A) · U_t+1(J) ]
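A minimal runnable sketch of this algorithm (illustrative, not the original presenter's implementation), assuming dictionary-based P and R tables with an entry for every (state, action) pair. A discount factor gamma is added as a parameter here; the slide's formula corresponds to gamma = 1, which converges when absorbing goal states are reachable, while gamma < 1 guarantees convergence in general.

```python
def value_iteration(states, actions, P, R, gamma=0.9, epsilon=1e-4):
    """U_{t+1}(I) = max_A [ R(I,A) + gamma * sum_J P(J|I,A) * U_t(J) ].

    P[(s, a)] maps successor states to probabilities; R[(s, a)] is the immediate reward.
    Returns the utilities and the greedy policy extracted from them.
    """
    U = {s: 0.0 for s in states}                     # Initialize: U_0(I) = 0
    while True:
        U_next = {
            s: max(R[(s, a)] + gamma * sum(p * U[j] for j, p in P[(s, a)].items())
                   for a in actions)
            for s in states
        }
        if max(abs(U_next[s] - U[s]) for s in states) < epsilon:  # close-enough(U_t+1, U_t)
            U = U_next
            break
        U = U_next

    # Policy(I) = argmax_A [ R(I,A) + gamma * sum_J P(J|I,A) * U(J) ]
    policy = {
        s: max(actions,
               key=lambda a: R[(s, a)] + gamma * sum(p * U[j] for j, p in P[(s, a)].items()))
        for s in states
    }
    return U, policy

# Toy usage with assumed numbers: "go" moves toward the rewarding state s1.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 0.9, "s0": 0.1},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): -0.1,
     ("s1", "stay"): 1.0, ("s1", "go"): 1.0}
U, pi = value_iteration(states, actions, P, R)
print(pi)   # -> {'s0': 'go', 's1': 'stay'} with these assumed numbers
```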
20
Forward Method for Solving MDP: Decision Tree
21
Markov Chain
Given a fixed policy, you get a Markov chain from the MDP
– Markov chain: the next state depends only on the previous state
– Next state: not dependent on the action (there is only one action per state)
– Next state: history dependence only via the previous state
– P(S_t+1 | S_t, S_t-1, S_t-2, …) = P(S_t+1 | S_t)
How do we evaluate the Markov chain? Could we try simulations? Are there other, more sophisticated methods around?
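One way to make "could we try simulations?" concrete (an illustrative sketch, assuming the same dictionary-based P and R tables and an assumed discount factor): roll out the Markov chain induced by the fixed policy and average the accumulated discounted reward.

```python
import random

def simulate_policy(start, policy, P, R, steps=50, episodes=1000, gamma=0.9):
    """Estimate the discounted value of `start` under a fixed policy by Monte Carlo rollout.

    With the policy fixed, the MDP collapses to a Markov chain: the next state
    depends only on the current state (via the single prescribed action).
    """
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = start, 0.0, 1.0
        for _ in range(steps):
            a = policy[s]                          # the one action the chain allows in s
            ret += discount * R[(s, a)]
            successors = list(P[(s, a)].items())
            s = random.choices([j for j, _ in successors],
                               weights=[p for _, p in successors])[0]
            discount *= gamma
        total += ret
    return total / episodes
```

With enough episodes this estimate approaches the chain's expected discounted return; solving the linear system U = R + γPU exactly is one of the "more sophisticated" alternatives the slide alludes to.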
22
Influence Diagram
23
Expanded Influence Diagram
24
Relation between time & steps-to-go
25
Decision Tree
26
Dynamic Construction of the Decision Tree

Incremental-expansion(MDP, γ, s_I, ε, V_L, V_U)
  initialize tree T with s_I and ubound(s_I), lbound(s_I) using V_L, V_U;
  repeat until (a single action remains for s_I, or ubound(s_I) - lbound(s_I) <= ε)
    call Improve-tree(T, MDP, γ, V_L, V_U);
  return the action with the greatest lower bound as the result;

Improve-tree(T, MDP, γ, V_L, V_U)
  if root(T) is a leaf then
    expand root(T);
    set bounds lbound, ubound of the new leaves using V_L, V_U;
  else
    for all decision subtrees T' of T do
      call Improve-tree(T', MDP, γ, V_L, V_U);
  recompute bounds lbound(root(T)), ubound(root(T)) for root(T);
  when root(T) is a decision node, prune suboptimal action branches from T;
  return;
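To give a flavour of the bound bookkeeping inside Improve-tree (a simplified sketch under assumed dictionary-based tables, not the full algorithm from the paper): a one-step backup of lower and upper bounds at a decision node, together with the pruning test for suboptimal actions.

```python
def backup_and_prune(s, actions, P, R, lbound, ubound, gamma):
    """One-step bound backup at a decision node over state s.

    For each action a:  Q_low(a)  = R(s,a) + gamma * sum_j P(j|s,a) * lbound[j]
                        Q_high(a) = R(s,a) + gamma * sum_j P(j|s,a) * ubound[j]
    An action is suboptimal (prunable) if its upper bound falls below the best
    lower bound of any action.
    """
    q_low, q_high = {}, {}
    for a in actions:
        q_low[a] = R[(s, a)] + gamma * sum(p * lbound[j] for j, p in P[(s, a)].items())
        q_high[a] = R[(s, a)] + gamma * sum(p * ubound[j] for j, p in P[(s, a)].items())

    best_low = max(q_low.values())
    kept = [a for a in actions if q_high[a] >= best_low]   # prune dominated actions
    # New bounds on the decision node itself:
    return best_low, max(q_high.values()), kept
```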
27
Incremental-expansion function: basic method for the dynamic construction of the decision tree
[Flowchart: start with (MDP, γ, s_I, ε, V_L, V_U); initialize the leaf node of the partially built decision tree; repeatedly call Improve-tree(T, MDP, γ, V_L, V_U) until a single action remains for s_I or ubound(s_I) - lbound(s_I) <= ε; then return the chosen action and terminate.]
28
Computing Decisions Using Bound Iteration
(Same Incremental-expansion and Improve-tree procedure as on slide 26.)
29
Incremental-expansion function: basic method for the dynamic construction of the decision tree
[Flowchart repeated from slide 27.]
30
Solving Large MDP Problems
31
If You Want to Read More on MDPs
Book:
– Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Statistics
– Available on Amazon.com