Using Hierarchical Reinforcement Learning to Balance Conflicting Sub-problems
By: Stephen Robertson
Supervisor: Phil Sterne


Presentation Outline
- Project Motivation
- Project Aim
- Rules of the Gridworld
- Flat Reinforcement Learning
- Feudal Reinforcement Learning
- State Variable Combination Approach

Project Motivation
- Reinforcement Learning is an attractive form of machine learning, but the curse of dimensionality makes it inefficient on complex problems.
- Hierarchical Reinforcement Learning is a method for dealing with this curse of dimensionality.

Project Aim
- Apply various Hierarchical Reinforcement Learning algorithms to a complex gridworld problem.
- Compare these algorithms to each other and to flat Reinforcement Learning.

Rules of the Gridworld
- Possible actions: Left, Right, Up, Down and Rest.
- Collecting food and drink increases nourishment and hydration respectively.
- After landing on the tree, the creature carries wood, which it can use to repair its shelter.

Rules of the Gridworld
- Resting in a repaired shelter increases health in proportion to the shelter condition.
- Landing on the lion decreases health and results in a direct punishment.
- After every 4 steps, nourishment, hydration, and shelter condition decrease by 1.
- After every 10 steps, health decreases by 1.
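
A minimal sketch of these dynamics is given below. The variable names, cap values, and counters are illustrative assumptions rather than details taken from the original project.

    # Hypothetical bookkeeping for the gridworld rules above.
    # Names and cap values are assumptions for illustration.
    class Creature:
        def __init__(self):
            self.nourishment = 5
            self.hydration = 5
            self.health = 5
            self.shelter_condition = 5
            self.carrying_wood = False
            self.steps = 0

        def step(self, cell, action):
            if cell == "food":
                self.nourishment = 5            # collecting food restores nourishment
            elif cell == "drink":
                self.hydration = 5              # collecting drink restores hydration
            elif cell == "tree":
                self.carrying_wood = True       # wood can later repair the shelter
            elif cell == "shelter" and action == "Rest":
                self.health += self.shelter_condition   # gain proportional to shelter condition
            elif cell == "lion":
                self.health -= 1                # plus a direct punishment signal to the learner

            self.steps += 1
            if self.steps % 4 == 0:             # periodic decay of resources
                self.nourishment -= 1
                self.hydration -= 1
                self.shelter_condition -= 1
            if self.steps % 10 == 0:
                self.health -= 1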

Flat Reinforcement Learning
- Sarsa with eligibility traces was used.
- To get flat Reinforcement Learning working, the task needed to be simplified slightly:
  - limited to a 6x6 gridworld;
  - nourishment, hydration, health and shelter condition reduced to 5 discrete levels each.
- Total states: 6 x 6 x 5 x 5 x 5 x 5 x 2 = 45 000, which is manageable.
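
The slides state that Sarsa with eligibility traces was used; a minimal tabular Sarsa(lambda) update is sketched below under that assumption. The hyperparameter values and dictionary-based tables are illustrative choices, not details from the original project.

    # Sketch of a tabular Sarsa(lambda) update with accumulating traces.
    import collections

    ALPHA, GAMMA, LAMBDA = 0.1, 0.99, 0.9       # illustrative hyperparameters

    Q = collections.defaultdict(float)           # Q[(state, action)]
    E = collections.defaultdict(float)           # eligibility traces

    def sarsa_lambda_update(s, a, reward, s_next, a_next):
        delta = reward + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
        E[(s, a)] += 1.0                         # accumulating trace for the visited pair
        for key in list(E):
            Q[key] += ALPHA * delta * E[key]     # propagate the TD error along the trace
            E[key] *= GAMMA * LAMBDA             # decay all traces each step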

Flat Reinforcement Learning
- The given task requires a large amount of exploration in order to find the optimal solution.
- Exploration is total at first, decreasing gradually until finally the agent only exploits.
- Optimistic initialisation of the tables to the maximum possible reward of 6400 encourages efficient exploration.
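
A sketch of this exploration scheme is shown below: an epsilon-greedy policy whose epsilon decays from 1 to 0 over training, with the table optimistically initialised to the maximum reward of 6400 mentioned above. The linear, episode-based decay schedule is an assumption.

    # Optimistic initialisation plus decaying epsilon-greedy action selection.
    import random
    import collections

    MAX_REWARD = 6400                            # optimistic initial value from the slides
    Q = collections.defaultdict(lambda: float(MAX_REWARD))

    def epsilon_greedy(state, actions, episode, total_episodes):
        epsilon = max(0.0, 1.0 - episode / total_episodes)   # 1.0 -> 0.0 over training
        if random.random() < epsilon:
            return random.choice(actions)                     # explore
        return max(actions, key=lambda a: Q[(state, a)])      # exploit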

Flat Reinforcement Learning Results

Feudal Reinforcement Learning
- Needs to be modified for the given problem.
- In the simple maze problem, state variables change independently and don't change by more than 1.
- In the simple maze problem, high-level actions can be defined as the same as the low-level actions.

Feudal Reinforcement Learning
- The main difficulty with the complex problem is that the high-level actions are hard to define.
- State variables can change simultaneously and by more than one, e.g. the creature can move to the left and fully satisfy hunger in one step, changing two state variables at once.
- High-level actions are therefore defined as desired high-level states.
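
The idea of treating a desired high-level state as the manager's action can be sketched as below; the function names and the worker's internal reward are assumptions used for illustration.

    # The manager's "action" is a target high-level state; the worker is rewarded
    # internally for reaching it.
    def manager_choose_goal(high_state, goal_q, candidate_goals):
        # Pick the desired high-level state with the highest learned value.
        return max(candidate_goals, key=lambda g: goal_q[(high_state, g)])

    def worker_reward(current_high_state, goal_high_state):
        # Internal reward only when the manager's desired state is achieved.
        return 1.0 if current_high_state == goal_high_state else 0.0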

Feudal Reinforcement Learning Results
- Feudal Reinforcement Learning failed horribly.

State Variable Combination Approach
- In a problem with conflicting sub-problems, each sub-problem tends to be defined by a limited set of state variables.
- Sub-agents are created, each in charge of a limited set of state variables.
- Some sub-agents will be inherently equipped to solve a sub-problem.
- Some sub-agents will not hold any useful information.
- By incorporating all possible combinations, we minimise the amount of designer intervention.
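
A sketch of enumerating every combination of state variables to build the sub-agents is shown below; the list of variables and the agent factory are assumptions. Note that with seven state variables this enumeration yields 2^7 - 1 = 127 sub-agents.

    # Create one sub-agent per non-empty subset of state variables.
    from itertools import combinations

    STATE_VARS = ["x", "y", "nourishment", "hydration", "health",
                  "shelter_condition", "carrying_wood"]   # assumed variable names

    def make_subagents(agent_factory):
        subagents = []
        for r in range(1, len(STATE_VARS) + 1):
            for subset in combinations(STATE_VARS, r):
                # Each sub-agent only ever observes its own subset of the state.
                subagents.append(agent_factory(state_vars=subset))
        return subagents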

Examples of Sub-agents

Choosing Between Sub-agents
- If the sub-agent that predicts the highest reward for a given state is obeyed, the best action should be chosen.
- The problem is that some sub-agents which hold no useful information might falsely predict a high reward.
- The reliability of sub-agents therefore also needs to be taken into account.
- This is achieved by keeping track of the variance of each sub-agent's predicted rewards.
- High variance = unreliable prediction; low variance = reliable prediction.
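
One way to realise this variance-based arbitration is sketched below, assuming each sub-agent exposes a predict(state, action) value and a running variance of its predictions is tracked with Welford's method; penalising a prediction by its variance is an illustrative choice, not necessarily the scheme used in the project.

    # Variance-aware choice between sub-agents' predictions.
    import math

    class RunningStats:
        """Welford's online mean/variance of one sub-agent's predictions."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else float("inf")

    def choose_action(subagents, stats, state, actions):
        best_action, best_score = None, -math.inf
        for agent, st in zip(subagents, stats):
            for a in actions:
                # Unreliable (high-variance) sub-agents are penalised.
                score = agent.predict(state, a) - st.variance()
                if score > best_score:
                    best_action, best_score = a, score
        return best_action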

Results

Questions?