1
Assignment 1 Solutions
2
Problem 1
States: the positions of detective D1, detective D2, and the criminal C, each a cell of the 3x3 grid below.
Actions: the detectives' moves; a single MDP controls both detectives.
[Figure: 3x3 grid of cells (0)-(8); D1 in cell (0), the criminal C in cell (2), D2 in cell (3).]
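A minimal sketch of the resulting state and action spaces, assuming the state is the tuple (d1, d2, c) of cell indices and each detective chooses one of five moves (the move names are illustrative, not prescribed by the assignment):

```python
from itertools import product

CELLS = range(9)                              # 3x3 grid, cells numbered 0..8 row by row
MOVES = ["stay", "north", "south", "east", "west"]

# A state is (d1, d2, c); a joint action is one move per detective.
STATES = list(product(CELLS, CELLS, CELLS))   # 9 * 9 * 9 = 729 states
ACTIONS = list(product(MOVES, MOVES))         # 5 * 5 = 25 joint actions

print(len(STATES), len(ACTIONS))              # 729 25
```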
3
Problem 1 contd.
Transitions: explained by example.
–For the example action, the resulting state transitions are:
–0.9 for one successor state (0.8 for the "stay where you are" outcome, 0.05 for north, 0.05 for east)
–0.05 for another successor state (0.05 for south)
–0.05 for another successor state (0.05 for west)
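A sketch of the per-detective move noise implied by the numbers above, assuming the chosen move succeeds with probability 0.8, each other outcome occurs with probability 0.05, and a move off the grid leaves the detective in place (this noise model is reconstructed from the 0.8/0.05 figures, not stated explicitly on the slide):

```python
def step(cell, move):
    """Deterministic effect of one move on the 3x3 grid (row-major cells 0..8)."""
    row, col = divmod(cell, 3)
    if move == "north":
        row = max(row - 1, 0)
    elif move == "south":
        row = min(row + 1, 2)
    elif move == "west":
        col = max(col - 1, 0)
    elif move == "east":
        col = min(col + 1, 2)
    return row * 3 + col

def move_distribution(cell, intended):
    """Distribution over a detective's next cell: 0.8 intended, 0.05 each other outcome."""
    dist = {}
    for m in ["stay", "north", "south", "east", "west"]:
        nxt = step(cell, m)
        dist[nxt] = dist.get(nxt, 0.0) + (0.8 if m == intended else 0.05)
    return dist

# A detective in the top-right corner (cell 2) who chooses "stay" stays with
# probability 0.9, since the blocked north and east moves collapse onto staying.
print(move_distribution(2, "stay"))   # approx. {2: 0.9, 5: 0.05, 1: 0.05}
```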
4
Problem 1 contd.
Goal states: states where at least one detective has the same position as the criminal.
–Ex: (1,2,1), (5,1,1), etc.
The reward function will vary from person to person, but one possible reward function is:
–R(goal state) = 100
–R(goal state, *) = 0
–R(!(goal state), *) = -2
–Example: R([1,2,1]) = 100; R([1,2,1], *) = 0; R([1,2,3], *) = -2
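A sketch of the goal test and this reward function, assuming the state is (d1, d2, c) and treating R(s) with no action argument as the reward for arriving in s:

```python
def is_goal(state):
    d1, d2, c = state
    return d1 == c or d2 == c                  # at least one detective is on the criminal

def reward(state, action=None):
    if is_goal(state):
        return 100 if action is None else 0    # R(goal) = 100, R(goal, *) = 0
    return -2                                  # R(!(goal), *) = -2

print(is_goal((1, 2, 1)), reward((1, 2, 1)), reward((1, 2, 1), ("stay", "stay")))  # True 100 0
print(reward((1, 2, 3), ("stay", "stay")))                                         # -2
```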
5
Problem 2
Implement value iteration.
Provide policies given only the start state.
–For example, for (a) the start state is the configuration shown below. The best action from that state needs to be provided for T = 1; with the above reward function, value iteration gives this action.
–At T = 2, the best action is provided for each reachable state; for a goal state any action is fine.
–At T = 3, likewise: a goal state needs no particular action, and the best action is provided for the remaining states.
[Figure: 3x3 grid of cells (0)-(8); D1 and D2 both in cell (0), the criminal C in cell (2).]
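A minimal finite-horizon value iteration sketch for Problem 2. It assumes a transition(s, a) function that returns a {next_state: probability} dict (with goal states made absorbing) and the reward(s, a) above; it returns, for each number of remaining steps t, the best action from every state:

```python
def value_iteration(states, actions, transition, reward, horizon):
    V = {s: 0.0 for s in states}          # value with 0 steps remaining
    policy = {}                           # policy[t][s] = best action with t steps remaining
    for t in range(1, horizon + 1):
        newV, pi = {}, {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = reward(s, a) + sum(p * V[s2] for s2, p in transition(s, a).items())
                if q > best_q:
                    best_a, best_q = a, q
            newV[s], pi[s] = best_q, best_a
        V, policy[t] = newV, pi
    return V, policy

# Usage sketch: V, policy = value_iteration(STATES, ACTIONS, transition, reward, horizon=3)
# policy[1][s0] is then the reported best action from the start state s0 at T = 1, and so on.
```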
6
Problem 3
Calculate all paths (for the criminal) of length 5.
Find the average number of moves the detectives need to catch the criminal over the paths enumerated above.
–In the above MDP, the average was 2.4.
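A sketch of the path enumeration for Problem 3, assuming the criminal starts in cell 2 and may stay or move to a grid-adjacent cell each step (the start cell and move set are assumptions, not given on the slide):

```python
def neighbors(cell):
    """Cells reachable in one criminal move: the cell itself plus adjacent cells."""
    row, col = divmod(cell, 3)
    out = [cell]
    if row > 0: out.append(cell - 3)    # north
    if row < 2: out.append(cell + 3)    # south
    if col > 0: out.append(cell - 1)    # west
    if col < 2: out.append(cell + 1)    # east
    return out

def criminal_paths(start=2, length=5):
    """All cell sequences of `length` moves starting from `start`."""
    paths = [[start]]
    for _ in range(length):
        paths = [p + [nxt] for p in paths for nxt in neighbors(p[-1])]
    return paths

paths = criminal_paths()
print(len(paths))   # number of criminal paths of length 5 from cell 2

# For the average: run the detectives' policy against each enumerated path,
# record how many moves they need to catch the criminal, and take the mean
# (2.4 for the MDP above).
```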
7
Problem 4
It is not possible to define the reward (to accommodate the rule at T = 4) with the above state space.
The state space needs to be modified to include time.
–Without the additional state feature for time, the problem does not have the Markov property.
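A sketch of the suggested state-space change, adding the time step as an explicit state feature so the reward can depend on it (the horizon value 4 below is taken from the T = 4 rule and is otherwise an assumption):

```python
from itertools import product

CELLS, HORIZON = range(9), 4
# A state is now (d1, d2, c, t); the time feature restores the Markov property
# for the time-dependent rule, since R((d1, d2, c, t), a) can depend on t.
STATES_WITH_TIME = list(product(CELLS, CELLS, CELLS, range(HORIZON + 1)))
print(len(STATES_WITH_TIME))   # 9*9*9 * 5 = 3645 states
```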