
Model Minimization in Hierarchical Reinforcement Learning Balaraman Ravindran Andrew G. Barto Autonomous Learning Laboratory Department of Computer Science University of Massachusetts, Amherst

Autonomous Learning Laboratory 2 Abstraction Ignore information irrelevant for the task at hand. Minimization – finding the smallest equivalent model. [Figure: an example model with elements labeled A–E and its smaller equivalent model]

Autonomous Learning Laboratory 3 Outline Minimization –Notion of equivalence –Modeling symmetries Extensions –Partial equivalence –Hierarchies – relativized options –Approximate equivalence

Autonomous Learning Laboratory 4 Markov Decision Processes (Puterman ’94) An MDP M is the tuple ⟨S, A, Ψ, P, R⟩: – S : set of states – A : set of actions – Ψ ⊆ S × A : set of admissible state-action pairs – P : Ψ × S → [0, 1] : probability of transition – R : Ψ → ℝ : expected immediate reward. Policy π : Ψ → [0, 1]. Maximize the expected return.
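For concreteness, a minimal Python sketch of this tuple as a data structure; the field names mirror the slide, and the toy contents are purely illustrative, not from the talk:

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    """A finite MDP <S, A, Psi, P, R> as described on the slide."""
    S: List[State]                               # states
    A: List[Action]                              # actions
    Psi: List[Tuple[State, Action]]              # admissible state-action pairs
    P: Dict[Tuple[State, Action, State], float]  # P[(s, a, s')] = transition probability
    R: Dict[Tuple[State, Action], float]         # R[(s, a)] = expected immediate reward

# A toy two-state example, only to show the shape of the data:
toy = MDP(
    S=["s0", "s1"],
    A=["a"],
    Psi=[("s0", "a"), ("s1", "a")],
    P={("s0", "a", "s1"): 1.0, ("s1", "a", "s1"): 1.0},
    R={("s0", "a"): 0.0, ("s1", "a"): 1.0},
)
```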

Autonomous Learning Laboratory 5 Equivalence in MDPs [Figure: gridworld example with actions N, E, S, W]

Autonomous Learning Laboratory 6 Modeling Equivalence Model equivalence using homomorphisms; extend the notion to MDPs. [Figure: a map h aggregating the original model into a smaller one]

Autonomous Learning Laboratory 7 Modeling Equivalence (cont.) Let h be a homomorphism from M = ⟨S, A, Ψ, P, R⟩ to M′ = ⟨S′, A′, Ψ′, P′, R′⟩ – a map from Ψ onto Ψ′ of the form h(s, a) = (f(s), g_s(a)), s.t. P′(f(s), g_s(a), f(s′)) = Σ_{s″ : f(s″) = f(s′)} P(s, a, s″) and R′(f(s), g_s(a)) = R(s, a). M′ is a homomorphic image of M.
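A hedged sketch of how these two conditions could be checked numerically for finite models, assuming the MDP structure sketched earlier, with the homomorphism supplied as a state map `f` and per-state action maps `g[s]` (names are illustrative):

```python
from collections import defaultdict

def is_homomorphism(M, M2, f, g, tol=1e-9):
    """Check the conditions on the slide for every admissible (s, a):
    R'(f(s), g_s(a)) = R(s, a), and
    P'(f(s), g_s(a), f(s')) = sum of P(s, a, s'') over all s'' with f(s'') = f(s')."""
    for (s, a) in M.Psi:
        fs, ga = f[s], g[s][a]
        if (fs, ga) not in M2.R:
            return False                       # image pair not admissible in M2
        # Reward condition
        if abs(M2.R[(fs, ga)] - M.R[(s, a)]) > tol:
            return False
        # Transition condition: aggregate probability over each block f^-1(s2_img)
        block_prob = defaultdict(float)
        for s2 in M.S:
            block_prob[f[s2]] += M.P.get((s, a, s2), 0.0)
        for s2_img, p in block_prob.items():
            if abs(M2.P.get((fs, ga, s2_img), 0.0) - p) > tol:
                return False
    return True
```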

Autonomous Learning Laboratory 8 Model Minimization Finding reduced models that preserve some aspects of the original model Various modeling paradigms –Finite State Automata (Hartmanis and Stearns ’66) Machine homomorphisms –Model Checking (Emerson and Sistla ’96, Lee and Yannakakis ’92) Correctness of system models –Markov Chains (Kemeny and Snell ’60) Lumpability –MDPs (Dean and Givan ’97, ’01) Simpler notion of equivalence

Autonomous Learning Laboratory 9 Symmetry A symmetric system is one that is invariant under certain transformations onto itself. – The gridworld in the earlier example is invariant under reflection along the diagonal. [Figure: the gridworld and its diagonal reflection, with actions N, E, S, W]

Autonomous Learning Laboratory 10 Symmetry example – Towers of Hanoi. [Figure: start and goal configurations] Such a transformation that preserves the system properties is an automorphism. The group of all automorphisms is known as the symmetry group of the system.
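To make the idea concrete, a small hypothetical example (not from the talk): in a gridworld whose dynamics are symmetric about the main diagonal, reflecting each state (x, y) to (y, x) while swapping N with E and S with W maps the system onto itself. The sketch below only encodes the transformation; whether it is truly an automorphism depends on the particular gridworld's P and R.

```python
# Reflection about the main diagonal of a gridworld, together with the action
# permutation it induces (N <-> E, S <-> W).  Applying it twice is the identity,
# as expected of a reflection.

ACTION_SWAP = {"N": "E", "E": "N", "S": "W", "W": "S"}

def reflect(state, action):
    x, y = state
    return (y, x), ACTION_SWAP[action]

assert reflect(*reflect((2, 5), "N")) == ((2, 5), "N")

# For this to be an automorphism of an MDP M, the transformation must preserve
# dynamics and rewards, e.g. P[((x, y), a, (x2, y2))] equals P of the reflected
# triple and R[((x, y), a)] equals R of the reflected pair, for all admissible pairs.
```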

Autonomous Learning Laboratory 11 Symmetries in Minimization Any subgroup of a symmetry group can be employed to define symmetric equivalence Induces a reduced homomorphic image –Greater reduction in problem size –Possibly more efficient algorithms Related work: Zinkevich and Balch ’01, Popplestone and Grupen ’00.

Autonomous Learning Laboratory 12 Partial Equivalence Equivalence holds only over parts of the state-action space. Context-dependent equivalence. [Figure: fully reduced vs. partially reduced models]

Autonomous Learning Laboratory 13 Abstraction in Hierarchical RL Options (Sutton, Precup and Singh ’99, Precup ’00) – e.g. go-to-door1, drive-to-work, pick-up-red-ball. An option is given by ⟨I, π, β⟩: - Initiation set I - Option policy π - Termination criterion β
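A minimal sketch of how an option ⟨I, π, β⟩ is often represented in code; the go-to-door1 example below is hypothetical and only illustrates the interface:

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option <I, pi, beta> in the sense of Sutton, Precup and Singh (1999)."""
    initiation_set: Set      # I: states where the option may be invoked
    policy: Callable         # pi(state) -> primitive action
    termination: Callable    # beta(state) -> probability of terminating in that state

    def can_start(self, state) -> bool:
        return state in self.initiation_set

# Hypothetical go-to-door1 option for a 5x5 gridworld room:
go_to_door1 = Option(
    initiation_set={(x, y) for x in range(5) for y in range(5)},  # anywhere in the room
    policy=lambda state: "E",                                     # placeholder policy: head east
    termination=lambda state: 1.0 if state == (4, 2) else 0.0,    # terminate at the doorway
)
```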

Autonomous Learning Laboratory 14 Option specific minimization Equivalence holds in the domain of the option Special class –Markov subgoal options Results in relativized options –Represents a family of options –Terminology: Iba ’89

Autonomous Learning Laboratory 15 Rooms world task The task is to collect all objects in the world. 5 options – one for each room (Markov, subgoal options). Single relativized option – get-object-exit-room – employ suitable transformations for each room.

Autonomous Learning Laboratory 16 Relativized Options Relativized option: - Option homomorphism - Option MDP (reduced representation of the MDP) - Initiation set - Termination criterion. [Figure: architecture diagram with labels: env, percept, reduced state, option, action, top-level actions]
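One plausible reading of this architecture in code (a sketch under the assumption that the homomorphism is supplied as lookup tables `f` and `lift`; not the authors' implementation): the current state is projected through the option homomorphism into the option MDP, the single shared option policy acts in that reduced space, and the chosen reduced action is translated back into an actual action.

```python
class RelativizedOption:
    """Sketch of a relativized option: homomorphism, option MDP policy,
    initiation set and termination criterion.

    f maps environment states to option-MDP states; lift[s] maps an option-MDP
    action back to a primitive action admissible in s (the inverse of g_s)."""

    def __init__(self, f, lift, option_policy, initiation_set, termination):
        self.f = f                          # state part of the homomorphism
        self.lift = lift                    # lift[s][reduced_action] -> primitive action
        self.option_policy = option_policy  # policy defined over the option MDP
        self.initiation_set = initiation_set
        self.termination = termination

    def act(self, state):
        reduced_state = self.f[state]                  # project into the option MDP
        reduced_action = self.option_policy(reduced_state)
        return self.lift[state][reduced_action]        # translate back to a real action
```

With one such object and a different pair of tables (f, lift) per room, the same reduced policy serves every room, which is what makes the single get-object-exit-room option possible.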

Autonomous Learning Laboratory 17 Relativized options are especially useful when learning the option policy: – Speed-up – Knowledge transfer. [Figure: the rooms world task]

Autonomous Learning Laboratory 18 Experimental Setup Regular Agent –5 options, one for each room –Option reward of +1 on exiting room with object Relativized Agent –1 relativized option, known homomorphism –Same option reward Global reward of +1 on completing task Actions fail with probability 0.1

Autonomous Learning Laboratory 19 Reinforcement Learning (Sutton and Barto ’98) Trial-and-error learning. Maintain a “value” of performing action a in state s. Update values based on the immediate reward and the current estimate of value. Q-learning at the option level (Watkins ’89); SMDP Q-learning at the higher level (Bradtke and Duff ’95).
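A hedged sketch of the SMDP Q-learning update used at the higher level, where option o launched in state s ran for k primitive steps, accumulated discounted reward r, and terminated in s'; the function and parameter names are illustrative:

```python
def smdp_q_update(Q, s, o, r, s_next, k, options, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update in the style of Bradtke and Duff '95:
    Q(s, o) <- Q(s, o) + alpha * [r + gamma**k * max_o' Q(s', o') - Q(s, o)],
    where r is the discounted reward accumulated while o ran for k steps."""
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in options)
    q_so = Q.get((s, o), 0.0)
    Q[(s, o)] = q_so + alpha * (r + gamma ** k * best_next - q_so)
```

At the option level, the ordinary one-step Q-learning update is the special case k = 1 with primitive actions in place of options.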

Autonomous Learning Laboratory 20 Results Average over 100 runs

Autonomous Learning Laboratory 21 Modified problem Exact equivalence does not always arise Vary stochasticity of actions in each room

Autonomous Learning Laboratory 22 Asymmetric Testbed

Autonomous Learning Laboratory 23 Results – Asymmetric Testbed Still significant speed up in initial learning Asymptotic performance slightly worse


Autonomous Learning Laboratory 25 Approximate Equivalence Model as a map onto a Bounded-parameter MDP –Transition probabilities and rewards given by bounded intervals (Givan, Leach and Dean ’00) –Interval Value Iteration –Bound loss in performance of policy learned
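A sketch of the core step of interval value iteration for a bounded-parameter MDP, assuming the interval parameters are supplied as dictionaries; this illustrates the idea rather than reproducing the algorithm of Givan, Leach and Dean verbatim. Within the transition-probability intervals, probability mass is pushed toward the lowest-valued successors to obtain a pessimistic backup.

```python
def pessimistic_backup(succ_lower, succ_upper, V_low, r_lower, gamma=0.9):
    """One pessimistic Bellman backup for a single (s, a) of a bounded-parameter MDP.

    succ_lower / succ_upper: dicts mapping successor state -> lower / upper bound
    on the transition probability; V_low: current lower value bounds; r_lower:
    lower bound on the immediate reward.  Assumes the intervals are consistent
    (lower bounds sum to at most 1, upper bounds to at least 1)."""
    probs = dict(succ_lower)
    remaining = 1.0 - sum(probs.values())
    # Allocate the remaining probability mass to the worst (lowest-valued) successors first.
    for s2 in sorted(probs, key=lambda s: V_low.get(s, 0.0)):
        extra = min(succ_upper[s2] - probs[s2], remaining)
        probs[s2] += extra
        remaining -= extra
    return r_lower + gamma * sum(p * V_low.get(s2, 0.0) for s2, p in probs.items())
```

The optimistic backup is symmetric (allocate mass to the highest-valued successors using the upper bounds); iterating both to convergence yields interval value bounds, which is what allows the loss of the learned policy to be bounded.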

Autonomous Learning Laboratory 26 Summary Model minimization framework Considers state-action equivalence Accommodates symmetries Partial equivalence Approximate equivalence

Autonomous Learning Laboratory 27 Summary (cont.) Options in a relative frame of reference –Knowledge transfer across symmetrically equivalent situations –Speed up in initial learning Model minimization ideas used to formalize notion –Sufficient conditions for safe state abstraction (Dietterich ’00) –Bound loss when approximating

Autonomous Learning Laboratory 28 Future Work Symmetric minimization algorithms Online minimization Adapt minimization algorithms to hierarchical frameworks –Search for suitable transformations Apply to other hierarchical frameworks Combine with option discovery algorithms

Autonomous Learning Laboratory 29 Issues Design better representations Partial observability –Deictic representation Connections to symbolic representations Connections to other MDP abstraction frameworks –Esp. Boutilier and Dearden ’94, Boutilier et al. ’95, Boutilier et al. ’01