István Szita & András Lőrincz


1 Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs
István Szita & András Lőrincz
University of Alberta, Canada; Eötvös Loránd University, Hungary

2 Outline
Factored MDPs: motivation, definitions, planning in FMDPs
Optimism
Optimism & FMDPs & model-based learning
(The motivation has little to do with the technical part, but I think it is important to understand why the problem matters.)

3 Reinforcement learning
the agent makes decisions … in an unknown world
makes some observations (including rewards)
tries to maximize collected reward

4 What kind of observation?
???
structured observations
structure is unclear

5 How to “solve an RL task”?
a model is useful:
  can reuse experience from previous trials
  can learn offline
observations are structured
structure is unknown
structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)

6 Factored MDPs
ordinary MDPs, but everything is factored:
  states
  rewards
  transition probabilities
  (value functions)
what’s the use? later…

7 Factored state space
all functions depend on a few variables only
don’t forget: we are in factored MDPs!

8 Factored dynamics

9 Factored rewards
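
The formulas on the three slides above did not survive extraction. As a rough sketch of the standard FMDP definitions they refer to, the transition probability is a product of per-variable terms, each depending only on a few parent variables, and the reward is a sum of local terms with small scopes. The parent sets, the probabilities, and the reward factors below are made-up illustrations, not taken from the talk.

```python
from itertools import product

# Hypothetical 3-variable binary FMDP; the parent sets are illustrative only.
PARENTS = {0: (0, 2), 1: (1,), 2: (2,)}

def make_cpts():
    """Build one small conditional table per variable: (parent values, action) -> (P(0), P(1))."""
    cpts = {}
    for i, parents in PARENTS.items():
        cpts[i] = {}
        for key in product((0, 1), repeat=len(parents)):
            for a in (0, 1):
                p1 = 0.9 if (sum(key) + a) % 2 else 0.1  # arbitrary illustrative rule
                cpts[i][(key, a)] = (1.0 - p1, p1)
    return cpts

def transition_prob(x, a, x_next, cpts):
    """P(x' | x, a) as a product of per-variable factors over their parents."""
    p = 1.0
    for i, parents in PARENTS.items():
        key = tuple(x[j] for j in parents)
        p *= cpts[i][(key, a)][x_next[i]]
    return p

def reward(x, a, reward_factors):
    """R(x, a) as a sum of local reward factors, each reading a small scope of x."""
    return sum(f(tuple(x[j] for j in scope), a) for scope, f in reward_factors)

cpts = make_cpts()
reward_factors = [
    ((0,), lambda v, a: float(v[0])),            # depends on x0 only
    ((1, 2), lambda v, a: 0.5 * (v[0] == v[1])), # depends on x1 and x2 only
]
total = sum(transition_prob((0, 1, 1), 1, xn, cpts)
            for xn in product((0, 1), repeat=3))
r = reward((1, 0, 0), 0, reward_factors)
```

The point of the representation is size: each table here has at most 8 rows, while a flat transition matrix over n binary variables would need 2^n rows per action.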

10 (Factored value functions)
V* is not factored in general
we will make an approximation error

11 Solving a known FMDP
NP-hard: either exponential-time or non-optimal…
exponential-time worst case:
  flattening the FMDP
  approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
non-optimal solution (approximating the value function in a factored form):
  approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  ALP + policy iteration [Guestrin et al., 2002]
  factored value iteration [Szita & Lőrincz, 2008]

12 Factored value iteration
H := matrix of basis functions
N(H^T) := row-normalization of H^T
the iteration […] converges to a fixed point w^×
can be computed quickly for FMDPs
let V^× = H w^×; then V^× has bounded error: […]
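
The iteration formula itself is missing from the transcript. The sketch below assumes one plausible form, w ← N(H^T) · max_a (r_a + γ P_a H w), and runs it on a tiny flat MDP, so it shows only the projection step and not the efficient factored computation the slide mentions. The 4-state MDP, the state-aggregation basis H, and all function names are illustrative assumptions.

```python
import numpy as np

def row_normalize(M):
    """Scale each row of M so that its absolute values sum to 1."""
    return M / np.abs(M).sum(axis=1, keepdims=True)

def projected_value_iteration(P, R, H, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iterate w <- N(H^T) max_a (R[a] + gamma * P[a] H w) until w stops changing.

    P: (A, S, S) transition matrices, R: (A, S) rewards, H: (S, K) basis functions.
    """
    G = row_normalize(H.T)                      # (K, S) projection back to weights
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        backup = (R + gamma * (P @ (H @ w))).max(axis=0)   # Bellman backup, shape (S,)
        w_next = G @ backup
        if np.max(np.abs(w_next - w)) < tol:
            return w_next
        w = w_next
    return w

# Toy problem: 4 states on a cycle; action 0 stays, action 1 moves to the next state.
P = np.stack([np.eye(4), np.roll(np.eye(4), 1, axis=1)])
R = np.zeros((2, 4))
R[:, 3] = 1.0                                   # reward 1 whenever the agent is in state 3
# Basis: indicator of {0, 1} and indicator of {2, 3} (state aggregation).
H = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

w = projected_value_iteration(P, R, H)
V_approx = H @ w                                # approximate value function
```

With this basis every row of H sums to 1 and N(H^T) averages over each aggregate, so the combined update is a contraction in max-norm and the loop reaches a fixed point. All entries of H are nonnegative, which is the condition the proof slides later require of the projection.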

13 Learning in unknown FMDPs
unknown factor decompositions (structure)
unknown rewards
unknown transitions (dynamics)
…actually, we will deal with the third one only

16 Learning in an unknown FMDP, a.k.a. “Explore or exploit?”
after trying a few action sequences…
… try to discover better ones?
… do the best thing according to current knowledge?

17 Be Optimistic! (when facing uncertainty)

18 either you get experience…

19 or you get reward!

21 Factored Initial Model
component x1, parents: (x1, x3); extracted column header: 1; all cells empty
  rows (parent values, action): (0,0), a1; (0,0), a2; (0,1), a1; (0,1), a2; (1,0), a1; (1,0), a2; (1,1), a1; (1,1), a2
component x2, parent: (x2); extracted column header: 1; all cells empty
  rows (parent value, action): (0), a1; (0), a2; (1), a1; (1), a2

22 Factored Optimistic Initial Model
“Garden of Eden” (GOE): +$10000 reward (or something very high)
component x1, parents: (x1, x3); extracted column headers: 1, GOE; all cells empty
  rows (parent values, action): (0,0), a1; (0,0), a2; (0,1), a1; (0,1), a2; (1,0), a1; (1,0), a2; (1,1), a1; (1,1), a2
component x2, parent: (x2); extracted column headers: 1, GOE; all cells empty
  rows (parent value, action): (0), a1; (0), a2; (1), a1; (1), a2

23 Later on…
… according to the initial model, all states have value [formula missing]
in frequently visited states, the model becomes more realistic → reward expectations get lower → agent explores other areas
component x1, parents: (x1, x3); extracted column headers: 1, GOE; visit counts per row (column alignment not recoverable)
  (0,0), a1: 25 30
  (0,0), a2: 42 12
  (0,1), a1: 3
  (0,1), a2: 2 5
  (1,0), a1: 11 9
  (1,0), a2: 29
  (1,1), a1: 56 63
  (1,1), a2: 98 -
component x2, parent: (x2); extracted column headers: 1, GOE
  (0), a1: 42 34
  (0), a2: 25 27
  (1), a1: 7
  (1), a2: 3 6
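
A minimal sketch of how such per-component count tables could be kept, using the two components from the slides (x1 with parents (x1, x3), x2 with parent (x2)). It assumes, without confirmation from the transcript, that the optimistic initialization is a single fictitious transition into the Garden of Eden value in every row; the function names are hypothetical.

```python
from itertools import product

GOE = "GOE"                 # extra "Garden of Eden" value added to each variable's domain
ACTIONS = ("a1", "a2")

def init_optimistic_table(n_parents, values=(0, 1)):
    """Count table for one component: (parent values, action) -> {next value: count}."""
    table = {}
    for parents in product(values, repeat=n_parents):
        for action in ACTIONS:
            row = {v: 0 for v in values}
            row[GOE] = 1    # assumed: one fictitious transition into GOE
            table[(parents, action)] = row
    return table

def update(table, parents, action, next_value):
    """Record one observed transition of this component."""
    table[(parents, action)][next_value] += 1

def estimate(table, parents, action):
    """Empirical next-value distribution for one row, including the GOE column."""
    row = table[(parents, action)]
    total = sum(row.values())
    return {v: c / total for v, c in row.items()}

x1 = init_optimistic_table(n_parents=2)   # parents (x1, x3)
x2 = init_optimistic_table(n_parents=1)   # parent (x2)

for _ in range(9):                        # nine real observations of one row
    update(x1, (0, 0), "a1", 1)

p_visited = estimate(x1, (0, 0), "a1")
p_unvisited = estimate(x1, (1, 1), "a2")
```

Under this assumption a row observed n times assigns probability 1/(n + 1) to GOE, which is how frequently visited parts of the model become more realistic while unvisited rows still promise the high reward.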

24 Factored optimistic initial model
initialize model (optimistically)
for each time step t:
  solve approximate model using factored value iteration
  take greedy action, observe next state
  update model
number of non-near-optimal steps (w.r.t. V^×) is polynomial with probability ≈ 1
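
The loop above can be sketched on a deliberately tiny flat MDP. This is a simplification under stated assumptions, not the factored algorithm from the talk: plain value iteration stands in for factored value iteration, rewards are treated as known, and the chain environment, the constants, and the function names are all illustrative.

```python
import numpy as np

N, A, GAMMA, R_E = 3, 2, 0.9, 10.0   # chain length, actions, discount, GOE reward (illustrative)
GOE = N                               # index of the extra Garden-of-Eden state

def env_step(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right."""
    return max(s - 1, 0) if a == 0 else min(s + 1, N - 1)

# Known rewards: 1 for moving right in the last state, R_E for every step spent in GOE.
R = np.zeros((N + 1, A))
R[N - 1, 1] = 1.0
R[GOE, :] = R_E

def solve(counts, sweeps=300):
    """Plain value iteration on the current empirical model; returns Q-values."""
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    V = np.zeros(N + 1)
    for _ in range(sweeps):
        V = (R + GAMMA * (P_hat @ V)).max(axis=1)
    return R + GAMMA * (P_hat @ V)            # shape (N + 1, A)

def run(steps=50):
    counts = np.zeros((N + 1, A, N + 1))
    counts[:, :, GOE] = 1.0                   # optimistic init: one fictitious jump to GOE
    visits = np.zeros((N, A), dtype=int)
    s = 0
    for _ in range(steps):
        a = int(np.argmax(solve(counts)[s]))  # greedy action, no explicit exploration
        s_next = env_step(s, a)
        counts[s, a, s_next] += 1.0           # update the model with the real transition
        visits[s, a] += 1
        s = s_next
    return visits

visits = run()
```

There is no exploration parameter anywhere in the loop: untried state-action pairs still put all their probability mass on GOE, so the greedy policy is drawn to them until real observations dilute the bonus.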

25 elements of proof: some standard stuff
if […], then […]
if […] for all i, then […]
let m_i be the number of visits to […]
if m_i is large, then […] for all y_i
more precisely: […] with prob. […] (Hoeffding/Azuma inequality)
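
The slide's exact statement is lost. As a hedged illustration of what "if m_i is large" can mean, the standard two-sided Hoeffding bound for one empirical probability estimated from m independent samples is P(|p̂ − p| ≥ ε) ≤ 2 exp(−2 m ε²), so m ≥ ln(2/δ) / (2 ε²) samples suffice for accuracy ε with probability at least 1 − δ. The function name and the example numbers are illustrative; the constants used in the actual proof may differ, and covering all values y_i at once needs an additional union bound.

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest m with 2 * exp(-2 * m * epsilon**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

m = hoeffding_sample_size(epsilon=0.1, delta=0.05)
failure_bound = 2.0 * math.exp(-2.0 * m * 0.1 ** 2)
```

The required number of visits grows only logarithmically in 1/δ and quadratically in 1/ε, which is what keeps the overall sample complexity polynomial.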

26 elements of proof: main lemma
for any […], approximate Bellman updates will be more optimistic than the real ones: […]
  (formula annotations: “lower bound by Azuma’s inequality”; “bonus promised by Garden of Eden state”)
if V_E is large enough, the bonus term dominates for a long time
if all elements of H are nonnegative, projection preserves optimism

27 elements of proof: wrap up
for a long time, V_t is optimistic enough to boost exploration
at most polynomially many exploration steps can be made
except for those, the agent must be near-V^×-optimal

28 Previous approaches
extensions of E3, Rmax, MBIE to FMDPs
using current model, make smart plan (explore or exploit)
  explore: make model more accurate
  exploit: collect near-optimal reward
unspecified planners
  requirement: output plan is close-to-optimal
  …e.g., solve the flat MDP
polynomial sample complexity
exponential amounts of computation!

29 Unknown rewards?
“To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.”
false.
problem: cannot observe reward components, only their sum
→ UAI poster [Walsh, Szita, Diuk, Littman, 2009]

30 Unknown structure?
can be learnt in polynomial time
SLF-Rmax [Strehl, Diuk, Littman, 2007]
Met-Rmax [Diuk, Li, Littman, 2009]

31 Take-home message
if your model starts out optimistically enough, you get efficient exploration for free!
(even if your planner is non-optimal, as long as it is monotonic)

32 Thank you for your attention!

33 Optimistic initial model for FMDPs
add “garden of Eden” value to each state variable
add reward factors for each state variable
init transition model

