Framework: Agent in State Space

1 Framework: Agent in State Space
Remark: no terminal states. Example: XYZ-World.
[Figure: XYZ-World state-transition graph with states 1-10; state rewards R=+5 (state 3), R=+3 (state 5), R=-9 (state 6), R=+4 (state 8), R=-6 (state 9); deterministic moves labelled n, s, e, w, ne, nw, sw, plus a stochastic action x with outcome probabilities 0.7 and 0.3.]
Problem: What actions should an agent choose to maximize its rewards?
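For concreteness, here is a minimal sketch of how XYZ-World might be encoded for the exercises on the following slides. Only the rewards are taken from the figure above; the transition entries shown are placeholders (the full graph is only given pictorially), and the choice of reward 0 for unlabelled states is my assumption.

# Sketch of an XYZ-World encoding (illustrative, not the course's code).
# Rewards follow the slide; unlabelled states are assumed to have reward 0.
REWARDS = {1: 0, 2: 0, 3: +5, 4: 0, 5: +3, 6: -9, 7: 0, 8: +4, 9: -6, 10: 0}

# T[state][action] = list of (probability, successor) pairs.
# Deterministic moves have one entry; the stochastic action x has two.
T = {
    1: {"e": [(1.0, 2)]},                 # placeholder edge
    7: {"x": [(0.7, 8), (0.3, 9)]},       # placeholder stochastic edge
    # ... remaining states/actions follow the graph in the figure above.
}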

2 XYZ-World: Discussion Problem 12a
Bellman Update, g=0.2
[Figure: XYZ-World graph annotated with the converged utilities: U(1)=0.145, U(2)=0.72, U(3)=0.58, U(4)=0.03, U(5)=3.63, U(7)=0.001, U(8)=3.17, U(10)=0.63.]
Discussion on using the Bellman update for Problem 12:
- No convergence for g=1.0; the utility values seem to run away!
- The Bellman update converged very fast for g=0.2.
- State 3 has utility 0.58 although it gives a reward of +5, because it lies on a "bad" path; we were able to detect that.
- Did anybody run the algorithm for other g values (e.g. 0.4 or 0.6)? If yes, did it converge to the same utility values?
- The speed of convergence seems to depend on the value of g (lower is faster).
- The Bellman update suggests a utility value of 3.6 for state 5; what does this tell us about the optimal policy? E.g., is ... optimal?
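As a reference point, this is a minimal sketch of the Bellman update loop as it is commonly implemented for such an exercise. The function and variable names are my own, and `T`/`R` are assumed to be encoded as in the sketch under slide 1; this is not the course's code. Note that with g=1.0 and no terminal states the updated utilities keep accumulating reward, which matches the "run away" behaviour noted above.

# Value iteration via repeated Bellman updates:
#   U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) * U(s')
def bellman_update(T, R, gamma=0.2, eps=1e-6, max_iters=10_000):
    U = {s: 0.0 for s in R}                      # start from all-zero utilities
    for _ in range(max_iters):
        delta, new_U = 0.0, {}
        for s in R:
            best = max(
                sum(p * U[s2] for p, s2 in outcomes)
                for outcomes in T[s].values()    # one entry per applicable action
            )
            new_U[s] = R[s] + gamma * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:                          # largest change below tolerance
            break
    return U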

3 XYZ-World: Discussion Problem 12b
[Figure: XYZ-World graph annotated, for selected states, with pairs (TD utility, TD utility with inverted rewards): (0.57, -0.65), (2.98, -2.99), (-0.50, 0.47), and (-0.18, -0.12) for state 10; the policy P is indicated on the edges.]
Other observations:
- TD reversed the utility values quite neatly when the rewards were inverted; each x becomes -x+u with u in [-0.08, 0.08].
- The "correctness" of the learnt utility values depends on the chosen policy P (see the next transparency).
- Anything else that is interesting to observe?
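A minimal TD(0) sketch under a fixed policy, for reference. Everything here is illustrative: `policy(s)` stands for whatever rule P prescribes in state s, `sample_next(s, a)` stands for the XYZ-World simulator used in the assignment, and alpha and the step count are my own choices. The slide attaches rewards to states, so the current state's reward is used in the update; that convention is an assumption.

import random

def td0(policy, sample_next, rewards, gamma=0.2, alpha=0.1, n_steps=100_000):
    U = {s: 0.0 for s in rewards}
    s = random.choice(list(rewards))            # no terminal states: one long run
    for _ in range(n_steps):
        a = policy(s)                           # action P prescribes in state s
        s_next = sample_next(s, a)              # environment's (possibly stochastic) move
        # TD(0): move U(s) toward the one-step sample R(s) + gamma * U(s')
        U[s] += alpha * (rewards[s] + gamma * U[s_next] - U[s])
        s = s_next
    return U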

4 XYZ-World: Discussion Problem 12a/b
[Figure: XYZ-World graph annotated with pairs (TD utility under P, Bellman utility) for selected states: (3.3, 0.5), (3.2, -0.5), (0.6, -0.2); the policy P is indicated on the edges.]
I tried hard, but: any better explanations?
Explanation of the discrepancies between TD for P and the Bellman update:
- The most significant discrepancies are in states 3 and 8; a minor one in state 10; mostly agreement for the other states.
- P chooses the worst successor of state 8; it should apply operator x instead.
- P should apply w in state 6, but does so only in 2/3 of the cases, which affects the utility of state 3.
- The low utility value of state 8 in TD seems to lower the utility value of state 10, giving only a minor discrepancy.
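The discrepancies are easier to see once one recalls that TD under a fixed policy P estimates U^P (the value of following P), whereas the Bellman update computes the optimal utilities U*. A minimal policy-evaluation sketch makes that comparison explicit; it assumes the same hypothetical T/R encoding as above and a deterministic policy dict, so the 2/3-stochastic behaviour of P in state 6 is not modelled here.

# Iterative policy evaluation: solves U^P(s) = R(s) + gamma * E[U^P(s') | s, P(s)].
# Comparing its output with bellman_update(...) above shows where following P
# loses utility (e.g. in states 3 and 8 on this slide).
def evaluate_policy(T, R, policy, gamma=0.2, eps=1e-6):
    U = {s: 0.0 for s in R}
    while True:
        delta, new_U = 0.0, {}
        for s in R:
            new_U[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T[s][policy[s]])
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:
            return U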

