István Szita & András Lőrincz Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs István Szita & András Lőrincz University of Alberta Canada Eötvös Loránd University Hungary
Outline
- Factored MDPs & model-based learning: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs
(The motivation has little to do with the technical part, but it is important for understanding why the problem matters.)
Reinforcement learning
- the agent makes decisions in an unknown world
- makes some observations (including rewards)
- tries to maximize the collected reward
What kind of observation?
- structured observations
- but the structure is unclear
How to “solve an RL task”?
- a model is useful: can reuse experience from previous trials, can learn offline
- observations are structured, but the structure is unknown
- structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)
Factored MDPs
- ordinary MDPs, but everything is factored: states, rewards, transition probabilities, (value functions)
- what’s the use? later…
Factored state space
- the state is described by a vector of variables
- all functions depend on a few variables only
- don’t forget: we are in factored MDPs!
Factored dynamics
Factored rewards
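To make the two preceding slides concrete, here is the standard DBN-style factorization they refer to (the parent/scope sets Γ_i and Z_j are our notation, not the slides’):

$$P(x' \mid x, a) \;=\; \prod_{i=1}^{m} P_i\bigl(x'_i \mid x[\Gamma_i],\, a\bigr),
\qquad
R(x, a) \;=\; \sum_{j} r_j\bigl(x[Z_j],\, a\bigr),$$

where $x[\Gamma_i]$ denotes the restriction of the state vector $x$ to the small parent set $\Gamma_i$.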
(Factored value functions)
- V* is not factored in general
- we will make an approximation error
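Correspondingly, the approximation used in the rest of the talk keeps the value function in a factored, linear form (the scope sets C_k are again our notation):

$$V(x) \;\approx\; \sum_{k=1}^{K} w_k\, h_k\bigl(x[C_k]\bigr) \;=\; (Hw)(x),$$

where each basis function $h_k$ depends on only a few state variables, so $V^*$ generally lies outside this class and some approximation error is unavoidable.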
Solving a known FMDP
- NP-hard: either exponential time or a non-optimal solution…
- exponential-time worst case:
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solution (approximating the value function in a factored form):
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]
Factored value iteration
- H := matrix of basis functions, N(Hᵀ) := row-normalization of Hᵀ
- the projected value-iteration update converges to a fixed point w_FVI and can be computed quickly for FMDPs
- let V_FVI = H·w_FVI; then V_FVI has a bounded error w.r.t. the optimal value function
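A minimal sketch of the iteration described above. For readability it uses a flat representation; in a real FMDP the Bellman backup and the products with H are computed factor-by-factor. The L1 row-normalization below is a simple stand-in for the slides’ N(Hᵀ), and `bellman_backup` and the function name are ours:

import numpy as np

def factored_value_iteration(H, bellman_backup, n_iters=200):
    """Projected value iteration in the span of the basis matrix H (sketch).

    H              : (num_states x K) basis-function matrix, written out flat here.
    bellman_backup : function mapping a value vector V to its Bellman backup T V.
    """
    # Projection: row-normalized H^T (here: each row divided by its L1 norm),
    # standing in for the normalization N(H^T) chosen in the FVI paper to make
    # the projected iteration convergent.
    G = H.T / np.abs(H.T).sum(axis=1, keepdims=True)

    w = np.zeros(H.shape[1])
    for _ in range(n_iters):
        V = H @ w                  # current approximate value function H w_t
        w = G @ bellman_backup(V)  # w_{t+1} = N(H^T) T(H w_t)
    return w                       # approximate fixed point w_FVI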
Learning in unknown FMDPs
- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)
- …actually, we will deal with the third one only
Outline
- Factored MDPs & model-based learning
- Optimism
- Optimism & FMDPs
Learning in an unknown FMDP, a.k.a. “Explore or exploit?”
- after trying a few action sequences…
- …try to discover better ones?
- …or do the best thing according to current knowledge?
Be Optimistic! (when facing uncertainty)
either you get experience…
…or you get reward!
Outline
- Factored MDPs & model-based learning
- Optimism
- Optimism & FMDPs
Factored Initial Model
- one count table per state-variable component, indexed by (parent configuration, action)
- (example on slide: component x1 with parents (x1, x3) and component x2 with parent (x2); all entries are empty before learning)
Factored Optimistic Initial Model
- add a “Garden of Eden” (GOE) value with +$10000 reward (or something very high)
- (same example tables as before, now with an extra GOE column; entries still empty)
Later on…
- according to the initial model, all states have very high value
- in frequently visited states, the model becomes more realistic → reward expectations get lower → the agent explores other areas
- (same example tables, now filled with observed transition counts)
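A minimal sketch of the kind of per-factor count table the three slides above describe. All identifiers are ours, and initializing with exactly one fictitious Garden-of-Eden observation per entry is an assumption about the precise initialization:

from collections import defaultdict

GOE = "GOE"  # the fictitious, high-reward "Garden of Eden" value

class FactorModel:
    """Transition counts for one state variable, indexed by (parent config, action)."""
    def __init__(self, values, parent_indices):
        self.values = list(values) + [GOE]
        self.parent_indices = parent_indices  # which state variables this factor depends on
        # Optimistic initialization: every (parents, action) entry starts with a
        # single fictitious observation of the Garden of Eden value.
        self.counts = defaultdict(lambda: {v: (1 if v == GOE else 0) for v in self.values})

    def parents_of(self, state):
        return tuple(state[j] for j in self.parent_indices)

    def update(self, state, action, next_value):
        self.counts[(self.parents_of(state), action)][next_value] += 1

    def prob(self, state, action, next_value):
        row = self.counts[(self.parents_of(state), action)]
        # Unvisited entries predict GOE with probability 1; as real observations
        # accumulate, the GOE mass (and hence the optimism) fades like 1/(m+1).
        return row[next_value] / sum(row.values())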
Factored optimistic initial model
- initialize the model (optimistically)
- for each time step t:
  - solve the approximate model using factored value iteration
  - take the greedy action, observe the next state
  - update the model
- the number of non-near-optimal steps (w.r.t. V_FVI) is polynomial, with probability ≈ 1
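A sketch of the learning loop on the slide, reusing the hypothetical `FactorModel` above; `plan_with_fvi` stands in for factored value iteration on the current approximate model, and the environment interface (`reset`/`step` returning only the next factored state) is a simplifying assumption:

def foim_loop(env, factor_models, plan_with_fvi, n_steps):
    """Learning loop with an optimistically initialized factored model (sketch).

    env           : environment with reset() / step(action) -> next factored state
    factor_models : one FactorModel per state variable, optimistically initialized
    plan_with_fvi : planner returning a greedy action for the current model & state
    """
    state = env.reset()
    for t in range(n_steps):
        # Solve the current approximate (still optimistic) model.
        action = plan_with_fvi(factor_models, state)
        # Act greedily with respect to that solution and observe the next state.
        next_state = env.step(action)
        # Update each factor's count table with the observed transition.
        for i, model in enumerate(factor_models):
            model.update(state, action, next_state[i])
        state = next_state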
elements of proof: some standard stuff
- if the approximate model is close to the true one, then the corresponding value functions are close
- if each factor’s transition model is close for all i, then the full (product) transition model is close
- let m_i be the number of visits to a given (parent configuration, action) pair of factor i
- if m_i is large, then the estimated factor probabilities are close to the true ones for all y_i; more precisely, the error is O(1/√m_i) with high probability (Hoeffding/Azuma inequality)
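A standard form of the bound the last bullet alludes to (the confidence level δ and the exact constants are ours; the Azuma/martingale version replaces the i.i.d. assumption on the visits):

$$\Pr\left(\;\bigl|\hat P_i(y_i \mid x[\Gamma_i], a) - P_i(y_i \mid x[\Gamma_i], a)\bigr| \;\ge\; \sqrt{\tfrac{1}{2 m_i}\ln\tfrac{2}{\delta}}\;\right) \;\le\; \delta.$$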
elements of proof: main lemma
- for a long time, the approximate Bellman updates will be more optimistic than the real ones:
  - the estimation error is lower-bounded via Azuma’s inequality
  - it is counteracted by the bonus promised by the Garden of Eden state
  - if V_E is large enough, the bonus term dominates for a long time
  - if all elements of H are nonnegative, the projection preserves optimism
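A hedged schematic of the lemma’s shape (notation ours: T̃ is the Bellman operator of the optimistic learned model, T that of the true FMDP, x_E the Garden-of-Eden state, V_E its value):

$$\tilde T V \;\ge\; T V \;-\; \underbrace{\varepsilon_{\mathrm{Azuma}}}_{\text{estimation error}} \;+\; \underbrace{\gamma\, \hat P(x_E \mid \cdot\,)\, V_E}_{\text{Garden-of-Eden bonus}} \;\ge\; T V \qquad \text{(componentwise, while } V_E \text{ dominates)},$$

and since all entries of $H$, hence of $N(H^\top)$, are nonnegative, the projection preserves the inequality: $N(H^\top)\,\tilde T V \ge N(H^\top)\, T V$.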
elements of proof: wrap-up
- for a long time, V_t is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- outside those steps, the agent must be near-V_FVI-optimal
Previous approaches
- extensions of E³, Rmax, MBIE to FMDPs
- using the current model, make a smart plan (explore or exploit)
  - explore: make the model more accurate
  - exploit: collect near-optimal reward
- unspecified planners; requirement: the output plan is close-to-optimal …e.g., solve the flat MDP
- polynomial sample complexity, but exponential amounts of computation!
Unknown rewards?
- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.”
- false. problem: we cannot observe the reward components, only their sum
- → UAI poster [Walsh, Szita, Diuk & Littman, 2009]
Unknown structure?
- can be learnt in polynomial time
- SLF-Rmax [Strehl, Diuk & Littman, 2007]
- Met-Rmax [Diuk, Li & Littman, 2009]
Take-home message
- if your model starts out optimistically enough, you get efficient exploration for free!
- (even if your planner is non-optimal, as long as it is monotonic)
Thank you for your attention!
Optimistic initial model for FMDPs
- add the “Garden of Eden” value to each state variable
- add reward factors for each state variable
- initialize the transition model