Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs
István Szita & András Lőrincz
University of Alberta, Canada; Eötvös Loránd University, Hungary

Outline
- Factored MDPs
  - motivation
  - definitions
  - planning in FMDPs
- Optimism
- Optimism & FMDPs & Model-based learning
(Speaker note: the motivation has little to do with the technical part, but I think it is important to understand why the problem matters.)
Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

Reinforcement learning
- the agent makes decisions …
- … in an unknown world
- makes some observations (including rewards)
- tries to maximize collected reward

What kind of observation?
- ???
- structured observations
- structure is unclear

How to “solve an RL task”?
- a model is useful
  - can reuse experience from previous trials
  - can learn offline
- observations are structured
  - structure is unknown
- structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)

Factored MDPs
- ordinary MDPs
- everything is factored
  - states
  - rewards
  - transition probabilities
  - (value functions)
- what's the use? later…

Factored state space
- all functions depend on a few variables only
- don't forget: we are in factored MDPs!

Factored dynamics
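
The formula on this slide did not survive in the transcript. As a hedged illustration of the usual FMDP definition, the sketch below computes P(y | x, a) as a product of per-variable conditional probabilities, each depending only on that variable's parents. The names `parents`, `cpt`, `make_cpt` and `transition_prob`, and all numbers, are hypothetical and not taken from the talk.

```python
from itertools import product

# Hypothetical 3-variable binary FMDP with a single action "a1".
# parents[i] lists the indices of the variables that component i depends on.
parents = {0: (0, 2), 1: (1,), 2: (2,)}

def make_cpt(scope_size, p_one):
    """Toy conditional table: P(y_i = 1 | parent values, action) = p_one in every row."""
    return {(pv, "a1"): {0: 1.0 - p_one, 1: p_one}
            for pv in product((0, 1), repeat=scope_size)}

cpt = {0: make_cpt(2, 0.9), 1: make_cpt(1, 0.5), 2: make_cpt(1, 0.2)}

def transition_prob(x, a, y):
    """P(y | x, a) as a product of local factors, one per state variable."""
    prob = 1.0
    for i, scope in parents.items():
        parent_values = tuple(x[j] for j in scope)
        prob *= cpt[i][(parent_values, a)][y[i]]
    return prob

# Sanity check: the factored probabilities sum to 1 over all next states.
total = sum(transition_prob((0, 1, 0), "a1", y) for y in product((0, 1), repeat=3))
```

The point of the representation is size: the three small tables above replace a flat 8 x 8 transition matrix per action.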

Factored rewards
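
The slide's formula is missing from the transcript. In the standard FMDP setting the reward is a sum of local terms, each depending on a small set of variables; the symbols Z_j (scope of the j-th term), R_j and J below are notation assumed here for illustration.

```latex
R(\mathbf{x}, a) \;=\; \sum_{j=1}^{J} R_j\bigl(\mathbf{x}[Z_j],\, a\bigr)
```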

(Factored value functions)
- V* is not factored in general
- we will make an approximation error

Solving a known FMDP
- NP-hard
- either exponential-time or non-optimal…
- exponential-time worst case
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solution (approximating value function in a factored form)
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]

Factored value iteration
- H := matrix of basis functions
- N(H^T) := row-normalization of H^T
- the iteration […] converges to fixed point w^×
- can be computed quickly for FMDPs
- Let V^× = H w^×. Then V^× has bounded error: […]
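
The update rule itself is missing from the transcript. The sketch below assumes it has the form w ← N(H^T) · max_a [r + γ P^a H w], which is one reading of the definitions above and not a confirmed copy of the slide. It runs on a tiny flat MDP, so it illustrates the projection and the fixed point only, not the fast computation for FMDPs. The MDP, the basis matrix and all names are hypothetical.

```python
# Tiny flat MDP: 4 states on a line, deterministic left/right moves.
# Reward 1 for being in the rightmost state. All numbers are illustrative.
GAMMA = 0.9
N_STATES = 4
ACTIONS = ("left", "right")
REWARD = [0.0, 0.0, 0.0, 1.0]

def next_state(s, a):
    return max(s - 1, 0) if a == "left" else min(s + 1, N_STATES - 1)

# H: indicator basis functions that aggregate neighbouring states in pairs,
# so every row of H sums to 1.
H = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

def row_normalize_transpose(h):
    """N(H^T): transpose H, then scale each row so its absolute values sum to 1."""
    ht = [list(col) for col in zip(*h)]
    return [[v / sum(abs(u) for u in row) for v in row] for row in ht]

G = row_normalize_transpose(H)

def fvi_step(w):
    """One assumed update: w <- N(H^T) * max_a [ r + gamma * P^a * H * w ]."""
    v = [sum(hk * wk for hk, wk in zip(H[s], w)) for s in range(N_STATES)]  # V = H w
    backup = [max(REWARD[s] + GAMMA * v[next_state(s, a)] for a in ACTIONS)
              for s in range(N_STATES)]
    return [sum(g * b for g, b in zip(row, backup)) for row in G]

w = [0.0, 0.0]
for _ in range(500):
    w_new = fvi_step(w)
    converged = max(abs(a - b) for a, b in zip(w, w_new)) < 1e-10
    w = w_new
    if converged:
        break

v_approx = [sum(hk * wk for hk, wk in zip(H[s], w)) for s in range(N_STATES)]
```

With this choice of H the projection simply averages the backed-up values inside each pair of states, which keeps the combined update a contraction in the max norm.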

Learning in unknown FMDPs
- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)
- …actually, we will deal with the third one only

Learning in an unknown FMDP a.k.a. “Explore or exploit?”
- after trying a few action sequences…
- … try to discover better ones?
- … do the best thing according to current knowledge?

Be Optimistic! (when facing uncertainty)

either you get experience…

or you get reward!

Factored Initial Model
component x1, parents: (x1, x3); column: 1
  (0,0), a1    -
  (0,0), a2
  (0,1), a1
  (0,1), a2
  (1,0), a1
  (1,0), a2
  (1,1), a1
  (1,1), a2
component x2, parent: (x2); column: 1
  (0), a1    -
  (0), a2
  (1), a1
  (1), a2
…

Factored Optimistic Initial Model
- “Garden of Eden”: +$10000 reward (or something very high)
component x1, parents: (x1, x3); columns: 1, GOE
  (0,0), a1    -
  (0,0), a2
  (0,1), a1
  (0,1), a2
  (1,0), a1
  (1,0), a2
  (1,1), a1
  (1,1), a2
component x2, parent: (x2); columns: 1, GOE
  (0), a1    -
  (0), a2
  (1), a1
  (1), a2
…

Later on…
- … according to initial model, all states have value […]
- in frequently visited states, model becomes more realistic → reward expectations get lower → agent explores other areas
component x1, parents: (x1, x3); columns: 1, GOE (counts listed in the order they appear in the transcript)
  (0,0), a1    25  30
  (0,0), a2    42  12
  (0,1), a1    3
  (0,1), a2    2   5
  (1,0), a1    11  9
  (1,0), a2    29
  (1,1), a1    56  63
  (1,1), a2    98  -
component x2, parent: (x2); columns: 1, GOE
  (0), a1    42  34
  (0), a2    25  27
  (1), a1    7
  (1), a2    3   6
…
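
A minimal sketch of why reward expectations fall in frequently visited states, assuming each table row starts with a single fictitious transition to the Garden of Eden (GOE) value and the model uses plain relative frequencies. Both points are assumptions made for illustration; `estimated_distribution` and the counts are hypothetical.

```python
def estimated_distribution(counts):
    """Relative-frequency estimate of P(next value | parent values, action) for one table row."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# One row of a hypothetical table for a binary variable: a single fictitious
# transition to GOE and no real observations yet.
row = {0: 0, 1: 0, "GOE": 1}
before = estimated_distribution(row)

# After 54 real observations (illustrative numbers).
row[0] += 25
row[1] += 29
after = estimated_distribution(row)
```

Before any real data the row predicts GOE with certainty; after 54 observations the GOE estimate has shrunk to 1/55, so the promised bonus matters less and less for well-explored rows.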

Factored optimistic initial model
- initialize model (optimistically)
- for each time step t,
  - solve approximate model using factored value iteration
  - take greedy action, observe next state
  - update model
- number of non-near-optimal steps (w.r.t. V^×) is polynomial with probability ≈ 1
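
The loop above can be sketched on a tiny example. This is a tabular (non-factored) simplification, and ordinary value iteration is swapped in for factored value iteration, so it illustrates only the "optimistic model + always greedy" pattern. The environment `true_step`, the constant `R_GOE`, and every other name and number are hypothetical.

```python
import random

GAMMA, R_GOE, GOE = 0.9, 10.0, "GOE"
STATES, ACTIONS = [0, 1, 2], [0, 1]

def true_step(s, a, rng):
    """Hidden environment (illustrative): action 1 usually moves right, otherwise reset to 0."""
    if a == 1 and rng.random() < 0.8:
        s_next = min(s + 1, 2)
    else:
        s_next = 0
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward

# Optimistic initial model: one fictitious transition to GOE for every (state, action).
counts = {(s, a): {GOE: 1} for s in STATES for a in ACTIONS}
reward_sum = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def plan(n_sweeps=100):
    """Value iteration on the current empirical model (stand-in for the planner)."""
    v = {s: 0.0 for s in STATES}
    v[GOE] = R_GOE / (1.0 - GAMMA)      # GOE is absorbing and pays R_GOE forever
    q = {}
    for _ in range(n_sweeps):
        for s in STATES:
            for a in ACTIONS:
                n = sum(counts[(s, a)].values())
                r_hat = reward_sum[(s, a)] / n
                q[(s, a)] = r_hat + GAMMA * sum(c / n * v[t]
                                                for t, c in counts[(s, a)].items())
            v[s] = max(q[(s, a)] for a in ACTIONS)
    return q

rng = random.Random(0)
state = 0
for t in range(300):
    q = plan()                                            # solve the current model
    action = max(ACTIONS, key=lambda a: q[(state, a)])    # always greedy
    next_state, reward = true_step(state, action, rng)    # observe
    row = counts[(state, action)]
    row[next_state] = row.get(next_state, 0) + 1          # update model
    reward_sum[(state, action)] += reward
    state = next_state

# Real visits per (state, action), excluding the fictitious GOE transition.
visits = {sa: sum(c.values()) - 1 for sa, c in counts.items()}
```

There is no exploration bonus or random action anywhere in the loop: untried actions look attractive only because the model still believes they may lead to GOE.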

elements of proof: some standard stuff
- if […], then […]
- if […] for all i, then […]
- let m_i be the number of visits to […]
- if m_i is large, then […] for all y_i
- more precisely: […] with prob. […] (Hoeffding/Azuma inequality)
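
The exact statements are missing from the transcript. As a hedged illustration of the kind of bound meant by "if m_i is large", the sketch below evaluates the textbook Hoeffding confidence radius for an empirical frequency from m independent samples. The proof itself cites Azuma's inequality, which covers dependent samples, so this is an analogy and not the slide's formula; `hoeffding_radius` is a hypothetical helper.

```python
import math

def hoeffding_radius(m, delta):
    """Half-width eps with |p_hat - p| <= eps holding with probability >= 1 - delta,
    for the mean of m independent samples taking values in [0, 1]."""
    if m <= 0:
        raise ValueError("need at least one sample")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

# The radius shrinks like 1 / sqrt(m) as the visit count grows.
radii = [hoeffding_radius(m, 0.05) for m in (10, 100, 1000)]
```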

elements of proof: main lemma
- for any […], approximate Bellman updates will be more optimistic than the real ones: […]
- if V_E is large enough, the bonus term dominates for a long time
- if all elements of H are nonnegative, projection preserves optimism
- lower bound by Azuma's inequality
- bonus promised by Garden of Eden state

elements of proof: wrap up
- for a long time, V_t is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- except those, the agent must be near-V^×-optimal

Previous approaches
- extensions of E3, Rmax, MBIE to FMDPs
- using current model, make smart plan (explore or exploit)
  - explore: make model more accurate
  - exploit: collect near-optimal reward
- unspecified planners
  - requirement: output plan is close-to-optimal
  - …e.g., solve the flat MDP
- polynomial sample complexity
- exponential amounts of computation!

Unknown rewards?
- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.”
- false.
- problem: cannot observe reward components, only their sum
- → UAI poster [Walsh, Szita, Diuk, Littman, 2009]

Unknown structure?
- can be learnt in polynomial time
  - SLF-Rmax [Strehl, Diuk, Littman, 2007]
  - Met-Rmax [Diuk, Li, Littman, 2009]

Take-home message
- if your model starts out optimistically enough, you get efficient exploration for free!
- (even if your planner is non-optimal, as long as it is monotonic)

Thank you for your attention!

Optimistic initial model for FMDPs
- add “Garden of Eden” value to each state variable
- add reward factors for each state variable
- init transition model
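
The three construction steps above can be sketched as follows. The sketch assumes that "init transition model" means one fictitious transition to GOE in every table row and that each added reward factor pays a fixed bonus when its variable is at GOE. Both are assumptions, and `optimistic_initial_model` and its arguments are hypothetical names.

```python
from itertools import product

GOE = "GOE"

def optimistic_initial_model(domains, parents, actions, r_goe):
    """Build count tables and GOE reward factors for a factored model.

    domains[i] -- ordinary values of state variable i
    parents[i] -- indices of the variables that component i depends on
    """
    # 1. Add the Garden of Eden value to every state variable.
    extended = {i: list(vals) + [GOE] for i, vals in domains.items()}
    # 2. One extra reward factor per variable: pays r_goe when that variable is at GOE.
    reward_factors = {i: (lambda x_i, r=r_goe: r if x_i == GOE else 0.0)
                      for i in domains}
    # 3. Transition counts: a single fictitious transition to GOE per table row.
    counts = {}
    for i, scope in parents.items():
        for parent_values in product(*(extended[j] for j in scope)):
            for a in actions:
                counts[(i, parent_values, a)] = {GOE: 1}
    return extended, reward_factors, counts

domains = {0: [0, 1], 1: [0, 1], 2: [0, 1]}
parents = {0: (0, 2), 1: (1,), 2: (2,)}
extended, reward_factors, counts = optimistic_initial_model(
    domains, parents, ["a1", "a2"], 10000.0)
```

The number of table rows grows with the size of each parent set rather than with the full state space, which is what keeps the optimistic model compact.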

