István Szita & András Lőrincz Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs István Szita & András Lőrincz University of Alberta Canada Eötvös Loránd University Hungary
Outline
- Factored MDPs & model-based learning: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs
(The motivation has little to do with the technical part, but it is important for understanding why the problem matters.)
Reinforcement learning
- the agent makes decisions in an unknown world
- makes some observations (including rewards)
- tries to maximize the collected reward
What kind of observation?
- structured observations
- but the structure is unclear
How to “solve an RL task”?
- a model is useful: can reuse experience from previous trials, can learn offline
- observations are structured, but the structure is unknown
- structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)
Factored MDPs
- ordinary MDPs, but everything is factored: states, rewards, transition probabilities, (value functions)
- what’s the use? later…
Factored state space
- the state is described by a vector of variables
- all functions depend on a few variables only
- don’t forget: we are in factored MDPs!
Factored dynamics
Factored rewards
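To make the two preceding slides concrete, here is the standard DBN-style factorization they refer to (the parent/scope sets Γ_i and Z_j are our notation, not the slides’):

$$P(x' \mid x, a) \;=\; \prod_{i=1}^{m} P_i\bigl(x'_i \mid x[\Gamma_i],\, a\bigr),
\qquad
R(x, a) \;=\; \sum_{j} r_j\bigl(x[Z_j],\, a\bigr),$$

where $x[\Gamma_i]$ denotes the restriction of the state vector $x$ to the small parent set $\Gamma_i$.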
(Factored value functions)
- V* is not factored in general
- we will make an approximation error
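Correspondingly, the approximation used in the rest of the talk keeps the value function in a factored, linear form (the scope sets C_k are again our notation):

$$V(x) \;\approx\; \sum_{k=1}^{K} w_k\, h_k\bigl(x[C_k]\bigr) \;=\; (Hw)(x),$$

where each basis function $h_k$ depends on only a few state variables, so $V^*$ generally lies outside this class and some approximation error is unavoidable.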
Solving a known FMDP
- NP-hard: either exponential time or a non-optimal solution…
- exponential-time worst case:
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solution (approximating the value function in a factored form):
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]
Factored value iteration
- H := matrix of basis functions, N(Hᵀ) := row-normalization of Hᵀ
- the projected value-iteration update converges to a fixed point w_FVI and can be computed quickly for FMDPs
- let V_FVI = H·w_FVI; then V_FVI has a bounded error w.r.t. the optimal value function
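A minimal sketch of the iteration described above. For readability it uses a flat representation; in a real FMDP the Bellman backup and the products with H are computed factor-by-factor. The L1 row-normalization below is a simple stand-in for the slides’ N(Hᵀ), and `bellman_backup` and the function name are ours:

import numpy as np

def factored_value_iteration(H, bellman_backup, n_iters=200):
    """Projected value iteration in the span of the basis matrix H (sketch).

    H              : (num_states x K) basis-function matrix, written out flat here.
    bellman_backup : function mapping a value vector V to its Bellman backup T V.
    """
    # Projection: row-normalized H^T (here: each row divided by its L1 norm),
    # standing in for the normalization N(H^T) chosen in the FVI paper to make
    # the projected iteration convergent.
    G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)

    w = np.zeros(H.shape[1])
    for _ in range(n_iters):
        V = H @ w                  # current approximate value function H w_t
        w = G @ bellman_backup(V)  # w_{t+1} = N(H^T) T(H w_t)
    return w                       # approximate fixed point w_FVI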
Learning in unknown FMDPs
- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)
- …actually, we will deal with the third one only
Outline
- Factored MDPs & model-based learning
- Optimism
- Optimism & FMDPs
Learning in an unknown FMDP, a.k.a. “Explore or exploit?”
- after trying a few action sequences…
- …try to discover better ones?
- …or do the best thing according to current knowledge?
Be Optimistic! (when facing uncertainty)
either you get experience…
…or you get reward!
Outline
- Factored MDPs & model-based learning
- Optimism
- Optimism & FMDPs
Factored Initial Model
- one count table per state-variable component, indexed by (parent configuration, action)
- (example on slide: component x1 with parents (x1, x3) and component x2 with parent (x2); all entries are empty before learning)
Factored Optimistic Initial Model
- add a “Garden of Eden” (GOE) value with +$10000 reward (or something very high)
- (same example tables as before, now with an extra GOE column; entries still empty)
Later on…
- according to the initial model, all states have very high value
- in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas
- (same example tables, now filled with observed transition counts)
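A minimal sketch of the kind of per-factor count table the three slides above describe. All identifiers are ours, and initializing with exactly one fictitious Garden-of-Eden observation per entry is an assumption about the precise initialization:

from collections import defaultdict

GOE = "GOE"  # the fictitious, high-reward "Garden of Eden" value

class FactorModel:
    """Transition counts for one state variable, indexed by (parent config, action)."""
    def __init__(self, values, parent_indices):
        self.values = list(values) + [GOE]
        self.parent_indices = parent_indices  # which state variables this factor depends on
        # Optimistic initialization: every (parents, action) entry starts with a
        # single fictitious observation of the Garden of Eden value.
        self.counts = defaultdict(lambda: {v: (1 if v == GOE else 0) for v in self.values})

    def parents_of(self, state):
        return tuple(state[j] for j in self.parent_indices)

    def update(self, state, action, next_value):
        self.counts[(self.parents_of(state), action)][next_value] += 1

    def prob(self, state, action, next_value):
        row = self.counts[(self.parents_of(state), action)]
        # Unvisited entries predict GOE with probability 1; as real observations
        # accumulate, the GOE mass (and hence the optimism) fades like 1/(m+1).
        return row[next_value] / sum(row.values())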
Factored optimistic initial model
- initialize the model (optimistically)
- for each time step t:
  - solve the approximate model using factored value iteration
  - take the greedy action, observe the next state
  - update the model
- the number of non-near-optimal steps (w.r.t. V_FVI) is polynomial, with probability ≈ 1
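A sketch of the learning loop on the slide, reusing the hypothetical `FactorModel` above; `plan_with_fvi` stands in for factored value iteration on the current approximate model, and the environment interface (`reset`/`step` returning only the next factored state) is a simplifying assumption:

def foim_loop(env, factor_models, plan_with_fvi, n_steps):
    """Learning loop with an optimistically initialized factored model (sketch).

    env           : environment with reset() / step(action) -> next factored state
    factor_models : one FactorModel per state variable, optimistically initialized
    plan_with_fvi : planner returning a greedy action for the current model & state
    """
    state = env.reset()
    for t in range(n_steps):
        # Solve the current approximate (still optimistic) model.
        action = plan_with_fvi(factor_models, state)
        # Act greedily with respect to that solution and observe the next state.
        next_state = env.step(action)
        # Update each factor's count table with the observed transition.
        for i, model in enumerate(factor_models):
            model.update(state, action, next_state[i])
        state = next_state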
elements of proof: some standard stuff
- if the approximate model is close to the true one, then the corresponding value functions are close
- if each factor’s transition model is close for all i, then the full (product) transition model is close
- let m_i be the number of visits to a given (parent configuration, action) pair of factor i
- if m_i is large, then the estimated factor probabilities are close to the true ones for all y_i; more precisely, the error is O(1/√m_i) with high probability (Hoeffding/Azuma inequality)
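A standard form of the bound the last bullet alludes to (the confidence level δ and the exact constants are ours; the Azuma/martingale version replaces the i.i.d. assumption on the visits):

$$\Pr\left(\;\bigl|\hat P_i(y_i \mid x[\Gamma_i], a) - P_i(y_i \mid x[\Gamma_i], a)\bigr| \;\ge\; \sqrt{\tfrac{1}{2 m_i}\ln\tfrac{2}{\delta}}\;\right) \;\le\; \delta.$$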
elements of proof: main lemma
- for a long time, the approximate Bellman updates will be more optimistic than the real ones:
  - the estimation error is lower-bounded via Azuma’s inequality
  - it is counteracted by the bonus promised by the Garden of Eden state
  - if V_E is large enough, the bonus term dominates for a long time
  - if all elements of H are nonnegative, the projection preserves optimism
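A hedged schematic of the lemma’s shape (notation ours: T̃ is the Bellman operator of the optimistic learned model, T that of the true FMDP, x_E the Garden-of-Eden state, V_E its value):

$$\tilde T V \;\ge\; T V \;-\; \underbrace{\varepsilon_{\mathrm{Azuma}}}_{\text{estimation error}} \;+\; \underbrace{\gamma\, \hat P(x_E \mid \cdot\,)\, V_E}_{\text{Garden-of-Eden bonus}} \;\ge\; T V \qquad \text{(componentwise, while } V_E \text{ dominates)},$$

and since all entries of $H$, hence of $N(H^\top)$, are nonnegative, the projection preserves the inequality: $N(H^\top)\,\tilde T V \ge N(H^\top)\, T V$.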
elements of proof: wrap-up
- for a long time, V_t is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- outside those steps, the agent must be near-V_FVI-optimal
Previous approaches
- extensions of E³, Rmax, MBIE to FMDPs
- using the current model, make a smart plan (explore or exploit)
  - explore: make the model more accurate
  - exploit: collect near-optimal reward
- unspecified planners; requirement: the output plan is close-to-optimal …e.g., solve the flat MDP
- polynomial sample complexity, but exponential amounts of computation!
Unknown rewards?
- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.”
- false. problem: we cannot observe the reward components, only their sum
- → UAI poster [Walsh, Szita, Diuk & Littman, 2009]
Unknown structure?
- can be learnt in polynomial time
- SLF-Rmax [Strehl, Diuk & Littman, 2007]
- Met-Rmax [Diuk, Li & Littman, 2009]
Take-home message
- if your model starts out optimistically enough, you get efficient exploration for free!
- (even if your planner is non-optimal, as long as it is monotonic)
Thank you for your attention!
Optimistic initial model for FMDPs
- add the “Garden of Eden” value to each state variable
- add reward factors for each state variable
- initialize the transition model