1 KI2 - 10 Kunstmatige Intelligentie (Artificial Intelligence) / RuG: Markov Decision Processes (AIMA, Chapter 17)
2 Markov Decision Problem
How to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the payoffs will not be obtained until several (or many) actions have passed.
3 The Solution
Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every state the agent might reach. This formulation is a Markov Decision Process (MDP).
4 Example
[Figure: the 4x3 grid world, columns 1-4 and rows 1-3. The agent begins in the start square and one terminal square carries reward +1. Actions have uncertain consequences: the intended move succeeds with probability 0.8, and with probability 0.1 each the agent slips to one of the two perpendicular directions.]
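As a concrete illustration (not taken from the slides), the grid world can be encoded as a set of states, a reward function, and a transition model T(s, a, s'). The helper names below, the -1 terminal at (4,2), the blocked square at (2,2), and the -0.04 step reward are assumptions borrowed from the AIMA version of this example:

    # Minimal sketch of the 4x3 grid world as an MDP (illustrative assumptions;
    # the slide only shows the +1 terminal and the 0.8 / 0.1 / 0.1 probabilities).
    COLS, ROWS = 4, 3
    WALL = {(2, 2)}                       # assumed blocked square
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # assumed terminal rewards
    STEP_REWARD = -0.04                   # assumed per-step reward

    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def states():
        return [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)
                if (c, r) not in WALL]

    def reward(s):
        return TERMINALS.get(s, STEP_REWARD)

    def move(s, delta):
        # Deterministic move; bumping into the boundary or the blocked square stays put.
        c, r = s[0] + delta[0], s[1] + delta[1]
        if not (1 <= c <= COLS and 1 <= r <= ROWS) or (c, r) in WALL:
            return s
        return (c, r)

    def transitions(s, a):
        # Return a list of (probability, next_state) pairs:
        # 0.8 for the intended direction, 0.1 for each perpendicular direction.
        if s in TERMINALS:
            return [(1.0, s)]
        dc, dr = ACTIONS[a]
        left_of, right_of = (-dr, dc), (dr, -dc)
        return [(0.8, move(s, (dc, dr))),
                (0.1, move(s, left_of)),
                (0.1, move(s, right_of))]

The algorithm sketches further below reuse these helpers.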
11 Utility of a State Sequence
– Additive rewards
– Discounted rewards
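In the usual AIMA notation, the two models assign a state sequence [s_0, s_1, s_2, ...] the following utilities; additive rewards are the special case of discounting with γ = 1:

    U([s_0, s_1, s_2, \dots]) = R(s_0) + R(s_1) + R(s_2) + \cdots                      (additive)
    U([s_0, s_1, s_2, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots      (discounted, 0 < \gamma \le 1)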
13 Utility of a State
The utility of each state under a policy π is the expected sum of discounted rewards if the agent executes π. The true utility of a state corresponds to the optimal policy π*.
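Written out in the standard notation, with U^π the utility under policy π and π* the optimal policy:

    U^{\pi}(s) = E\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\Big|\; \pi,\ s_0 = s \Big], \qquad U(s) \equiv U^{\pi^{*}}(s)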
15 Algorithms for Calculating the Optimal Policy
– Value iteration
– Policy iteration
16 Value Iteration
Calculate the utility of each state, then use the state utilities to select an optimal action in each state.
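In symbols, the action chosen in state s is the one that maximizes the expected utility of the resulting state (T is the transition model):

    \pi^{*}(s) = \operatorname{argmax}_{a} \sum_{s'} T(s, a, s')\, U(s')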
17 Value Iteration Algorithm
function VALUE-ITERATION(MDP) returns a utility function
  local variables: U, U', utility functions, initially identical to R
  repeat
    U ← U'
    for each state s do
      U'[s] ← R(s) + γ max_a Σ_{s'} T(s, a, s') U[s']    (Bellman update)
    end
  until CLOSE-ENOUGH(U, U')
  return U
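A minimal Python sketch of the same algorithm, reusing the assumed grid-world helpers (states, reward, transitions, ACTIONS, TERMINALS) from the example above; the stopping test is a simple threshold on the largest utility change:

    def value_iteration(gamma=1.0, epsilon=1e-4):
        # Return a dict mapping each state to its estimated utility.
        U = {s: 0.0 for s in states()}
        while True:
            delta = 0.0
            U_new = {}
            for s in states():
                if s in TERMINALS:
                    U_new[s] = reward(s)
                else:
                    # Bellman update: R(s) + gamma * max_a sum_{s'} T(s,a,s') U(s')
                    best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                               for a in ACTIONS)
                    U_new[s] = reward(s) + gamma * best
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < epsilon:
                return U

    def best_policy(U):
        # Greedy action selection with respect to the computed utilities.
        return {s: max(ACTIONS, key=lambda a: sum(p * U[s2]
                                                  for p, s2 in transitions(s, a)))
                for s in states() if s not in TERMINALS}

Calling value_iteration() and then best_policy(U) yields utilities and a greedy policy for the example world.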
18 The Utilities of the States Obtained After Value Iteration
[Figure: the 4x3 grid world with the utility of each state (columns 1-4, rows 1-3):
  row 3:  0.812   0.868   0.912    +1
  row 2:  0.762     –     0.660     –
  row 1:  0.705   0.655   0.611   0.388 ]
19 Policy Iteration
– Pick a policy, then calculate the utility of each state given that policy (the value determination step)
– Update the policy at each state using the utilities of the successor states
– Repeat until the policy stabilizes
20 Policy Iteration Algorithm
function POLICY-ITERATION(MDP) returns a policy
  local variables: U, a utility function; π, a policy (initially random)
  repeat
    U ← VALUE-DETERMINATION(π, U, MDP, R)
    unchanged? ← true
    for each state s do
      if max_a Σ_{s'} T(s, a, s') U[s'] > Σ_{s'} T(s, π[s], s') U[s'] then
        π[s] ← argmax_a Σ_{s'} T(s, a, s') U[s']
        unchanged? ← false
    end
  until unchanged?
  return π
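A matching Python sketch (same assumed grid-world helpers as before). Here value determination is approximated by repeated backups under the fixed policy, a common simplification; the exact linear-algebra version is sketched after slide 22:

    import random

    def value_determination(pi, U, gamma=1.0, iters=30):
        # Approximate U^pi by repeated backups under the fixed policy pi.
        for _ in range(iters):
            U = {s: reward(s) if s in TERMINALS
                    else reward(s) + gamma * sum(p * U[s2]
                                                 for p, s2 in transitions(s, pi[s]))
                 for s in states()}
        return U

    def policy_iteration(gamma=1.0):
        U = {s: 0.0 for s in states()}
        pi = {s: random.choice(list(ACTIONS)) for s in states() if s not in TERMINALS}
        while True:
            U = value_determination(pi, U, gamma)
            unchanged = True
            for s in pi:
                def expected(a):
                    return sum(p * U[s2] for p, s2 in transitions(s, a))
                best = max(ACTIONS, key=expected)
                if expected(best) > expected(pi[s]):
                    pi[s], unchanged = best, False
            if unchanged:
                return pi, U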
21 Value Determination
– A simplification of the value iteration algorithm, because the policy is fixed
– The equations are linear, because the max() operator has been removed
– Solve exactly for the utilities using standard linear algebra
22 Optimal Policy (policy iteration with 11 linear equations)
For the 4x3 world, value determination under the fixed policy gives one linear equation per state, e.g.
  u(1,1) = 0.8 u(1,2) + 0.1 u(2,1) + 0.1 u(1,1)
  u(1,2) = 0.8 u(1,3) + 0.2 u(1,2)
  …
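A sketch of this exact value-determination step using standard linear algebra (numpy assumed): for a fixed policy π, collect the equations into the system (I - γ T_π) U = R and solve it directly. Unlike the equations shown on the slide, this sketch keeps the per-step reward term R(s) explicit:

    import numpy as np

    def value_determination_exact(pi, gamma=1.0):
        # Solve (I - gamma * T_pi) U = R for the fixed policy pi.
        # With gamma = 1 this requires that pi reach a terminal state with
        # probability 1 from every state; with gamma < 1 it is always solvable.
        S = states()
        index = {s: i for i, s in enumerate(S)}
        n = len(S)
        T = np.zeros((n, n))
        R = np.array([reward(s) for s in S])
        for s in S:
            if s in TERMINALS:
                continue  # terminal rows stay zero, so U(terminal) = R(terminal)
            for p, s2 in transitions(s, pi[s]):
                T[index[s], index[s2]] += p
        U = np.linalg.solve(np.eye(n) - gamma * T, R)
        return {s: U[index[s]] for s in S}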
23 Partially Observable MDP (POMDP)
In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probability.
A POMDP consists of:
–State transition function: P(s_{t+1} | s_t, a_t)
–Observation function: P(o_t | s_t, a_t)
–Reward function: E(r_t | s_t, a_t)
Approach
–Calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution (see the update rule below).
Difficulty
–Actions cause the agent to obtain new percepts, which cause the agent's beliefs to change in complex ways.
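The belief update implied by this approach, in the standard POMDP notation (α is a normalizing constant and b the current belief state over the possible states):

    b'(s') = \alpha\, P(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)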