Published by Julia McLaughlin. Modified over 8 years ago.
1
CPSC 7373: Artificial Intelligence
Lecture 10: Planning with Uncertainty
Jiang Bian, Fall 2012
University of Arkansas at Little Rock
2
Planning under Uncertainty
[Diagram labels: Planning, Uncertainty, Learning; MDPs, PO-MDPs, RL]
3
Planning Agent Tasks: Characteristics
                        DETERMINISTIC            STOCHASTIC
FULLY OBSERVABLE        A*, depth-first, etc.    MDP
PARTIALLY OBSERVABLE                             POMDP
A stochastic environment is one where the outcome of an action is somewhat random, whereas in a deterministic environment the outcome of an action is predictable and always the same. An environment is fully observable if you can see its state, which means you can make all decisions based on the momentary sensory input. If you need memory to decide, the environment is partially observable.
4
Markov Decision Process (MDP)
[Figure: two diagrams over states S1, S2, S3 with actions a1, a2. One is labeled "Finite State Machine", the other "Markov Decision Process"; the MDP adds randomness, with a transition labeled 50%.]
5
Markov Decision Process (MDP)
[Figure: states S1, S2, S3 with actions a1, a2; one transition is labeled 50%.]
States: S1…Sn
Actions: a1…ak
State transition matrix: T(S, a, S') = P(S'|a, S)
Reward function: R(S)
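One way to hold the four ingredients listed on this slide in code is a pair of dictionaries. This is a minimal sketch: the specific transitions and rewards below are hypothetical numbers chosen for illustration (the slide's diagram only shows a 50% branch), and the helper names `transition_prob` and `is_valid` are not from the lecture.

```python
from typing import Dict, Tuple

State = str
Action = str

# T[(S, a)] maps each successor S' to P(S' | a, S).
# Hypothetical values for illustration; only the 50% branch echoes the slide.
T: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("S1", "a1"): {"S2": 0.5, "S3": 0.5},
    ("S1", "a2"): {"S1": 1.0},
    ("S2", "a1"): {"S3": 1.0},
    ("S2", "a2"): {"S1": 1.0},
    ("S3", "a1"): {"S1": 1.0},
}

# R[S] is the reward function R(S); again hypothetical values.
R: Dict[State, float] = {"S1": 0.0, "S2": 0.0, "S3": 1.0}


def transition_prob(s: State, a: Action, s_next: State) -> float:
    """T(S, a, S') = P(S' | a, S); unlisted successors have probability 0."""
    return T.get((s, a), {}).get(s_next, 0.0)


def is_valid(table: Dict[Tuple[State, Action], Dict[State, float]]) -> bool:
    """Every (state, action) row must be a probability distribution."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())


print(transition_prob("S1", "a1", "S2"), is_valid(T))
```

A finite state machine is the special case where every row puts probability 1.0 on a single successor.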
6
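MDP Grid World
     1        2        3        4
a                               +100
b                               -100
c    START
Absorbing states: a4 (+100) and b4 (-100).
Stochastic actions: an action has its intended outcome with probability 80%; with probability 10% each, the agent instead moves off to one side.
Policy: π(s) → A, a mapping that assigns an action to every state.
The planning problem becomes one of finding the optimal policy.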
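The 80%/10%/10% action model can be sketched as a function that returns a distribution over next states. Several details here are assumptions rather than statements from the slide: that the two 10% outcomes are the directions perpendicular to the intended one, that bumping into the border leaves the agent in place, and that b2 is an obstacle (it carries no value in the later value tables). The names `step` and `outcomes` are illustrative.

```python
ROWS, COLS = "abc", "1234"
BLOCKED = {"b2"}  # assumption: b2 is an obstacle
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
# Assumption: the two 10% slips are perpendicular to the intended direction.
SIDES = {"N": ("W", "E"), "S": ("W", "E"), "W": ("N", "S"), "E": ("N", "S")}


def step(state, direction):
    """Deterministic move; hitting the border or the obstacle stays put."""
    row, col = ROWS.index(state[0]), COLS.index(state[1])
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < len(ROWS) and 0 <= new_col < len(COLS):
        target = ROWS[new_row] + COLS[new_col]
        if target not in BLOCKED:
            return target
    return state


def outcomes(state, action, p=0.8):
    """Return {next_state: probability} for taking `action` in `state`."""
    slip = (1.0 - p) / 2.0
    left, right = SIDES[action]
    dist = {}
    for direction, prob in [(action, p), (left, slip), (right, slip)]:
        nxt = step(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


# A policy maps states to actions; this partial one is an arbitrary placeholder.
policy = {"c1": "N", "b1": "N", "a1": "E"}
print(outcomes("c1", policy["c1"]))
```

Because outcomes that bounce back are merged into the current state, a move into a corner can leave the agent in place with high probability.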
7
Stochastic Environments: Conventional Planning
[Figure: the grid world with a search tree rooted at START (c1). Each action N, S, W, E branches into its possible outcome states, such as b1, c1, and c2.]
Problems:
1) Branching factor: 4 action choices with 3 outcomes each, so at least 12 branches we need to follow.
2) Depth of the search tree (e.g., loops).
3) Many states are visited more than once (i.e., states may recur), whereas in A* we ensure that each state is visited only once.
8
Policy
[Figure: the grid world with +100 at a4, -100 at b4, and START in row c.]
Goal: find an optimal policy for all these states that leads, with maximum probability, to the +100 absorbing state.
Quiz: What is the optimal action?
a1: N, S, W, E ???
c1: N, S, W, E ???
c4: N, S, W, E ???
b3: N, S, W, E ???
9
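MDP and Costs
[Figure: the grid world with +100 at a4 and -100 at b4.]
R(s) = +100 for a4; -100 for b4; -3 for all other states (this gives us an incentive to shorten our action sequence).
γ = discount factor, e.g., γ = 0.9 (i.e., decay of future rewards).
Objective of MDP: maximize the expected discounted sum of rewards, $E\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$.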
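To see how the discount factor decays future rewards, the discounted sum for one concrete reward sequence can be computed directly. This is a small sketch; the function name `discounted_return` and the example sequence (two -3 steps, then the +100 state) are illustrative choices, not taken from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Two ordinary steps at -3 each, then arrival in the +100 state.
episode = [-3.0, -3.0, 100.0]
print(discounted_return(episode, gamma=0.9))
print(discounted_return(episode, gamma=1.0))
```

With γ = 1 the rewards simply add up; with γ = 0.9 the +100 that arrives two steps later counts for only 81.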
10
Value Function
[Figure: the grid world with +100 at a4, -100 at b4, and -3 in the other states.]
Value function: the value of a state S is the expected sum of future discounted rewards, provided that we start in S and execute policy π: $V^{\pi}(S) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = S\right]$
Planning = iteratively calculating value functions.
11
Value Iteration
Initial values:
     1     2     3     4
a    0     0     0     +100
b    0           0     -100
c    0     0     0     0
After running value iteration to convergence:
     1     2     3     4
a    85    89    93    +100
b    81          68    -100
c    77    73    70    47
12
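Value Iteration - 2
     1     2     3     4
a    85    89    93    +100
b    81          68    -100
c    77    73    70    47
If S is a terminal state: $V(S) \leftarrow R(S)$
Back-up equation: $V(S) \leftarrow \left[\max_a \gamma \sum_{S'} P(S'|a,S)\, V(S')\right] + R(S)$
After convergence (Bellman equality): $V(S) = \left[\max_a \gamma \sum_{S'} P(S'|a,S)\, V(S')\right] + R(S)$
The converged value function gives the optimal trade-off between future cost and reward that you can achieve if you act optimally in any given state.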
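Value iteration on this grid can be sketched in a few dozen lines, assuming the back-up has the form V(S) = R(S) + γ max_a Σ P(S'|a,S) V(S') and terminal states keep their reward as their value. The grid details are assumptions inferred from the slides rather than stated on them: b2 is treated as an obstacle, the 10% slips go perpendicular to the intended direction, and bumping into a wall leaves the agent in place. All function names are illustrative.

```python
ROWS, COLS = "abc", "1234"
TERMINALS = {"a4": 100.0, "b4": -100.0}  # absorbing states
BLOCKED = {"b2"}                          # assumption: b2 is an obstacle
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
SIDES = {"N": "WE", "S": "WE", "W": "NS", "E": "NS"}  # assumed slip directions
STATES = [r + c for r in ROWS for c in COLS if r + c not in BLOCKED]


def step(state, direction):
    """Deterministic move; hitting the border or the obstacle stays put."""
    row, col = ROWS.index(state[0]), COLS.index(state[1])
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < len(ROWS) and 0 <= new_col < len(COLS):
        target = ROWS[new_row] + COLS[new_col]
        if target not in BLOCKED:
            return target
    return state


def backup(values, state, action, reward, gamma, p):
    """R(S) + gamma * sum over S' of P(S'|a,S) * V(S') for one action."""
    slip = (1.0 - p) / 2.0
    left, right = SIDES[action]
    expected = (p * values[step(state, action)]
                + slip * values[step(state, left)]
                + slip * values[step(state, right)])
    return reward + gamma * expected


def value_iteration(reward=-3.0, gamma=1.0, p=0.8, tol=1e-6, max_sweeps=10000):
    """Synchronous sweeps from V = 0 until the largest change is below tol."""
    values = {s: TERMINALS.get(s, 0.0) for s in STATES}
    for _ in range(max_sweeps):
        new_values = dict(values)
        for s in STATES:
            if s not in TERMINALS:
                new_values[s] = max(
                    backup(values, s, a, reward, gamma, p) for a in MOVES
                )
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            break
    return values


converged = value_iteration()
for r in ROWS:
    print(r, [round(converged[r + c]) if r + c in converged else None for c in COLS])
```

Calling `value_iteration(p=1.0)` gives the deterministic case, and `max_sweeps=1` shows the values after a single sweep from the all-zero start.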
13
Quiz: DETERMINISTIC
[Figure: the grid world with +100 at a4 and -100 at b4.]
V(a3) = ???
DETERMINISTIC, γ = 1, R(S) = -3
14
Quiz - 1
[Figure: the grid world with V(a3) = 97 filled in, +100 at a4, and -100 at b4.]
V(a3) = 97
V(b3) = ???
DETERMINISTIC, γ = 1, R(S) = -3
15
Quiz - 1
[Figure: the grid world with V(a3) = 97 and V(b3) = 94 filled in, +100 at a4, and -100 at b4.]
V(a3) = 97
V(b3) = 94
V(c1) = ???
DETERMINISTIC, γ = 1, R(S) = -3
16
Quiz - 1
     1     2     3     4
a    91    94    97    +100
b    88          94    -100
c    85    88    91    88
V(a3) = 97
V(b3) = 94
V(c1) = 85
DETERMINISTIC, γ = 1, R(S) = -3
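The deterministic answers follow from stepping backward from the +100 state and subtracting the step cost of 3 each time. The chain below is a worked sketch along one shortest route, c1, c2, c3, b3, a3, a4; the route itself is an illustrative choice, and it assumes the back-up V(S) = R(S) + γ max_a V(S') with γ = 1.

```latex
\begin{align*}
V(a3) &= R + V(a4) = -3 + 100 = 97 \\
V(b3) &= R + V(a3) = -3 + 97 = 94 \\
V(c3) &= R + V(b3) = -3 + 94 = 91 \\
V(c2) &= R + V(c3) = -3 + 91 = 88 \\
V(c1) &= R + V(c2) = -3 + 88 = 85
\end{align*}
```

Each value is 100 minus 3 times the number of moves needed to reach a4, which is why c1, five moves away, ends up at 85.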
17
Quiz: STOCHASTIC
[Figure: the grid world with +100 at a4 and -100 at b4.]
V(a3) = ???
STOCHASTIC, γ = 1, R(S) = -3, P = 0.8
18
Quiz: STOCHASTIC
[Figure: the grid world with V(a3) = 77 filled in, +100 at a4, and -100 at b4.]
V(a3) = 77
V(b3) = ???
STOCHASTIC, γ = 1, R(S) = -3, P = 0.8
19
Quiz: STOCHASTIC
[Figure: the grid world with V(a3) = 77 and V(b3) = 48.6 filled in, +100 at a4, and -100 at b4.]
V(a3) = 77
V(b3) = 48.6
STOCHASTIC, γ = 1, R(S) = -3, P = 0.8
N: 0.8 * 77 + 0.1 * (-100) + 0.1 * 0 - 3 = 48.6
W: 0.1 * 77 + 0.8 * 0 + 0.1 * 0 - 3 = 4.7
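The two back-ups on this slide can be checked with a short script. The N and W rows use the numbers from the slide; the S and E rows are derived here under the same assumptions (perpendicular 10% slips, staying in place when a move is blocked, unvisited states still at 0) and are not given in the lecture. The helper name `backup` is illustrative.

```python
def backup(outcomes, reward=-3.0, gamma=1.0):
    """R(S) + gamma * sum of probability * value over the listed outcomes."""
    return reward + gamma * sum(prob * value for prob, value in outcomes)


# Values so far: V(a3) = 77, V(b4) = -100, everything else still 0.
candidates = {
    "N": backup([(0.8, 77.0), (0.1, -100.0), (0.1, 0.0)]),  # from the slide
    "W": backup([(0.8, 0.0), (0.1, 77.0), (0.1, 0.0)]),     # from the slide
    "S": backup([(0.8, 0.0), (0.1, -100.0), (0.1, 0.0)]),   # derived, assumption
    "E": backup([(0.8, -100.0), (0.1, 77.0), (0.1, 0.0)]),  # derived, assumption
}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 1))
```

Taking the maximum over the actions is what turns the individual back-ups into the new value V(b3).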
20
Value Iteration and Policy - 1
What is the optimal policy?
21
Value Iteration and Policy - 2
     1     2     3     4
a    85    89    93    +100
b    81          68    -100
c    77    73    70    47
STOCHASTIC, γ = 1, R(S) = -3, P = 0.8
What is the optimal policy?
[Figure: the grid world with +100 at a4 and -100 at b4, used to show the policy.]
This is a situation where the risk of falling into the -100 state is balanced against the time spent going around.
22
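Value Iteration and Policy - 3
[Figure: value grid in which the non-terminal values shown are all 100, with +100 at a4 and -100 at b4.]
STOCHASTIC, γ = 1, R(S) = 0, P = 0.8
What is the optimal policy?
[Figure: the grid world with +100 at a4 and -100 at b4, used to show the policy.]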
23
Value Iteration and Policy - 4
     1        2       3       4
a    -704     -423    -173    +100
b    -954             -357    -100
c    -1082    -847    -597    -377
STOCHASTIC, γ = 1, R(S) = -100, P = 0.8
What is the optimal policy?
[Figure: the grid world with +100 at a4 and -100 at b4, used to show the policy.]
24
Markov Decision Processes
Fully observable: S1, …, Sn; a1, …, ak
Stochastic: P(S'|a, S)
Reward: R(S)
Objective: maximize the expected discounted sum of future rewards
Value iteration: V(S)
Converges: π = argmax…
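Once the values have converged, a policy can be read off by picking, in each state, the action with the highest expected next-state value. The sketch below does this for the rounded values printed on the earlier slides (R(S) = -3, P = 0.8). It assumes that b2 is an obstacle, that the 10% slips are perpendicular to the intended direction, and that blocked moves leave the agent in place; the function names are illustrative. Because the inputs are rounded, near-ties between actions could come out differently with exact values.

```python
ROWS, COLS = "abc", "1234"
BLOCKED = {"b2"}           # assumption: b2 is an obstacle
TERMINALS = {"a4", "b4"}   # absorbing states need no action
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
SIDES = {"N": "WE", "S": "WE", "W": "NS", "E": "NS"}  # assumed slip directions

# Converged values as printed on the slides (rounded).
V = {"a1": 85, "a2": 89, "a3": 93, "a4": 100,
     "b1": 81, "b3": 68, "b4": -100,
     "c1": 77, "c2": 73, "c3": 70, "c4": 47}


def step(state, direction):
    """Deterministic move; hitting the border or the obstacle stays put."""
    row, col = ROWS.index(state[0]), COLS.index(state[1])
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < len(ROWS) and 0 <= new_col < len(COLS):
        target = ROWS[new_row] + COLS[new_col]
        if target not in BLOCKED:
            return target
    return state


def expected_value(state, action, p=0.8):
    """Sum over S' of P(S'|a,S) * V(S') for one action."""
    slip = (1.0 - p) / 2.0
    left, right = SIDES[action]
    return (p * V[step(state, action)]
            + slip * V[step(state, left)]
            + slip * V[step(state, right)])


def greedy_policy(p=0.8):
    """For each non-terminal state, the action maximizing the expected value."""
    return {s: max(MOVES, key=lambda a: expected_value(s, a, p))
            for s in V if s not in TERMINALS}


policy = greedy_policy()
for r in ROWS:
    print(r, [policy.get(r + c, "-") for c in COLS])
```

R(S) and γ are the same for every action in a given state, so comparing the expected next-state values alone is enough to find the argmax.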
25
Partial Observability
[Figure: two panels showing the +100 and -100 outcomes, titled "Fully Observable, Deterministic" and "Fully Observable, Stochastic".]
26
Partial Observability
[Figure: panel titled "Partially Observable, Stochastic", with the two outcomes unknown (shown as "?") and a SIGN.]
MDP vs POMDP: a POMDP calls for an optimal balance of exploration versus exploitation, where some actions might be information-gathering actions whereas others might be goal-driven actions.
27
Partial Observability
Information Space (Belief Space)
[Figure: two unknown outcomes (shown as "?"), a +100, and a 50% label.]