Reinforcement Learning (II.) Exercise Solutions
Ata Kaban, A.Kaban@cs.bham.ac.uk
School of Computer Science, University of Birmingham
Exercise
The diagram below depicts an MDP model of a fierce battle.
You can move between two locations, L1 and L2, one of which is closer to the adversary. If you attack from the closer location, you have a better chance (90%) of succeeding (compared to only 70% from the farther location); however, you can also be detected (with 80% chance) and killed (whereas the chance of being detected at the farther location is 50%). You can only be detected if you stay in the same location. You need to come up with an action plan for this situation.
The arrows represent the possible actions:
– ‘move’ (M) is a deterministic action;
– ‘attack’ (A) and ‘stay’ (S) are stochastic.
For the stochastic actions, the probabilities of transitioning to the next state are indicated on the arrows. All rewards are 0, except in the terminal states, where your success is represented by a reward of +50 and your adversary’s success by a reward of -50 for you. Employing a discount factor of 0.9, compute an optimal policy (action plan).
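The transition diagram itself is not reproduced in this transcript, so the sketch below encodes one plausible reading of the MDP in Python. In particular, it assumes that a failed attack, and an undetected ‘stay’, both leave you in your current location with zero reward; those details would come from the diagram, not from the text above, so treat them as assumptions.

```python
# Hypothetical encoding of the battle MDP (the exact transition structure
# depends on the diagram, which is not reproduced here).
# States: 'L1' (closer to the adversary), 'L2' (farther), plus the two
# terminal outcomes 'WIN' (reward +50) and 'DEAD' (reward -50).

GAMMA = 0.9

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    'L1': {
        'M': [(1.0, 'L2', 0)],                      # move is deterministic
        'A': [(0.9, 'WIN', 50), (0.1, 'L1', 0)],    # assumption: a failed attack leaves you in place
        'S': [(0.8, 'DEAD', -50), (0.2, 'L1', 0)],  # detected with 80% chance when staying close
    },
    'L2': {
        'M': [(1.0, 'L1', 0)],
        'A': [(0.7, 'WIN', 50), (0.3, 'L2', 0)],
        'S': [(0.5, 'DEAD', -50), (0.5, 'L2', 0)],
    },
}

TERMINAL = {'WIN', 'DEAD'}
```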
Solution
The action-values for all states and actions need to be computed. Denote by Q(s, a) the value of taking action a in state s, and by V(s) = max_a Q(s, a) the value of state s.
In value iteration, we start with the initial estimates V_0(s) = 0 for all non-terminal states. Then we update all action values according to the update rule:
Q_{k+1}(s, a) = \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V_k(s') ],
where V_k(s) = \max_a Q_k(s, a) and \gamma = 0.9.
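A minimal sketch of this update, applied to the transitions dictionary above: Q and V are plain dictionaries, and the terminal outcomes contribute a value of 0 because their reward is collected on the transition into them.

```python
def value_of(V, state):
    """Value of a state under the current estimates; terminals contribute 0
    because their reward is paid on the transition into them."""
    return 0.0 if state in TERMINAL else V[state]

def sweep(V):
    """One iteration of value iteration: recompute every Q(s, a) from the
    current state values, then take the max over actions."""
    Q = {s: {a: sum(p * (r + GAMMA * value_of(V, s2)) for p, s2, r in outcomes)
             for a, outcomes in actions.items()}
         for s, actions in transitions.items()}
    V_new = {s: max(Q[s].values()) for s in transitions}
    return Q, V_new
```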
Here the Q table is updated only after each full iteration (once all state-action pairs have been visited). In the first iteration of the algorithm we compute Q_1(s, a) for every state-action pair. The values for the ‘move’ action stay the same (at 0), since moving earns no reward and the initial state values are 0. After this iteration, the values of the two states are the maxima of these action values, and they correspond to the action of ‘attacking’ in both states.
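Under the assumed dynamics from the earlier sketch, the first sweep can be reproduced directly; the numbers in the comments follow from that assumption, not from the original slide.

```python
V0 = {'L1': 0.0, 'L2': 0.0}     # initial estimates for the non-terminal states
Q1, V1 = sweep(V0)
# With the assumed dynamics:
#   Q1['L1'] == {'M': 0.0, 'A': 45.0, 'S': -40.0}
#   Q1['L2'] == {'M': 0.0, 'A': 35.0, 'S': -25.0}
# so V1 == {'L1': 45.0, 'L2': 35.0}, achieved by 'attack' in both states.
```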
The next iteration repeats the same update with the new state values. The new V-values are obtained by taking the max over the action values; these again correspond to the ‘attack’ action in both states.
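Continuing under the same assumed dynamics, a second sweep gives the following (again, the specific numbers are a consequence of those assumptions):

```python
Q2, V2 = sweep(V1)
# With the assumed dynamics:
#   Q2['L1'] ≈ {'M': 31.5,  'A': 49.05, 'S': -31.9}
#   Q2['L2'] ≈ {'M': 40.5,  'A': 44.45, 'S': -9.25}
# so V2 ≈ {'L1': 49.05, 'L2': 44.45}, again achieved by 'attack' in both states.
```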
This process can continue until the values do not change much between successive iterations (a short sketch of such a loop is given below). From what we can see at this point, the best action plan seems to be to attack all the time.
Note:
– Designing the parameter setting for a situation according to its conditions is up to the human, not the machine…
– In this exercise all parameters were given, but in real applications, in order to use RL successfully and make a robot learn to do what you want it to, you need to choose the reward values appropriately.
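Putting the pieces together, a short loop (still under the same assumed dynamics) runs sweeps until the state values settle and then reads off the greedy policy; it reproduces the conclusion above, i.e. ‘attack’ in both locations.

```python
V = {'L1': 0.0, 'L2': 0.0}      # initial estimates for the non-terminal states
for _ in range(100):
    Q, V_new = sweep(V)
    converged = all(abs(V_new[s] - V[s]) < 1e-6 for s in V)
    V = V_new
    if converged:
        break

policy = {s: max(Q[s], key=Q[s].get) for s in Q}   # greedy action in each state
print(V)        # converged state values
print(policy)   # expected under the assumed dynamics: {'L1': 'A', 'L2': 'A'}
```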