Reinforcement Learning in a Multi-Robot Domain

Presentation transcript:

Reinforcement Learning in a Multi-Robot Domain. Author: Maja J Mataric. Presenter: Drew Bagnell, Robotics Institute. Some of the material I will discuss comes from Mataric's paper "Learning in Multi-Robot Systems," in Adaptation and Learning in Multi-Agent Systems.

Motivation: why do we care? Reinforcement learning potentially provides a method to construct sophisticated behaviors in single and multiple robots with little programming in the classical sense. It offers declarative task specification: what, not how. It has produced good results in many domains (see the RL survey by Kaelbling, Littman, and Moore).

Problems in RL? Standard RL algorithms and analyses are inapplicable in the multi-agent environments Mataric is interested in: the MDP is an inaccurate model in situated robotics, and the traditional algorithms do not take advantage of domain knowledge to enable or accelerate learning. There are two main classes of problems: managing state-space complexity, and structuring and assigning reinforcement. Mataric takes a very broad view of the reinforcement learning problem; it's refreshing in that it's not as theoretically biased. Specifically, the MDP fails for a number of reasons, detailed below.

State Representation. Combinatorial explosion: the state space is exponential in the number of features. Continuous state: the high-level filters used in simulation may be unrealistic. Mataric suggests, in her case, using behaviors' pre-conditions as a "state" space. As an early consideration, it is important to note that Mataric's definition of state is unclear; I take it to mean any observation space the learning algorithm can work in. It is also important to note that at this point Mataric is essentially throwing up her hands on finding optimal or near-optimal policies; she has given up any notion of state, and in doing so we can conclude that finding even the best memoryless policy is NP-hard.
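To make the "pre-conditions as state" idea concrete, here is a minimal sketch of collapsing raw sensing into a tuple of boolean conditions; the condition names (have_puck, at_home, near_robot, night_time) and the sensors dictionary are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: behavior pre-conditions as a discrete observation space.
# Condition names and the `sensors` dictionary are illustrative assumptions.
from itertools import product

CONDITIONS = ("have_puck", "at_home", "near_robot", "night_time")

def observe_conditions(sensors):
    """Collapse raw sensor readings into a tuple of boolean pre-conditions."""
    return tuple(bool(sensors[c]) for c in CONDITIONS)

# With n boolean conditions there are only 2**n observations -- tiny compared
# to the underlying continuous state of a real robot.
print(len(list(product((False, True), repeat=len(CONDITIONS)))))  # 16
```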

Transitions/Events. The world and the agent are asynchronous, and the world is largely uncontrolled. Noise and uncertainty have "specific, usually complex properties that cannot be modeled." Building a predictive model can be very slow, so she goes model-free instead. The statement about noise reiterates her thoughts on state.
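For contrast, a model-free learner never estimates transition probabilities; it backs up values directly from samples. The tabular Q-learning update below is a generic illustration of that point, not the algorithm Mataric uses, and the step-size and discount values are arbitrary.

```python
# Generic tabular Q-learning update: no predictive model is ever built.
from collections import defaultdict

Q = defaultdict(float)      # Q[(observation, behavior)] -> value estimate
alpha, gamma = 0.1, 0.95    # illustrative step size and discount

def q_update(obs, behavior, reward, next_obs, behaviors):
    """One sample-based backup from an observed transition."""
    best_next = max(Q[(next_obs, b)] for b in behaviors)
    Q[(obs, behavior)] += alpha * (reward + gamma * best_next - Q[(obs, behavior)])
```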

Reinforcement vs. "Shaping". Monolithic reinforcement functions; multiple goals and their ordering; the more immediate the reinforcement, the better. Domain knowledge can give us progress estimates that are informative. "RL methods hide (domain knowledge) in the reinforcement function, which often employs some ad hoc embedding of the domain semantics." "Shaping" is a term Mataric borrows from the psychology literature; it is a shortening of "shaping by successive approximations." A robot or animal is shaped when it is conditioned in steps toward an ultimate goal. Traditional RL techniques without domain knowledge would have to reach the goal accidentally, and the further away the goal is, the harder it is to associate good actions with reaching it.
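The difference between a monolithic reward and a shaped one can be sketched as two reward functions; the subgoal events and their values below are illustrative, not the paper's exact reinforcement.

```python
# Monolithic vs. shaped ("heterogeneous") reinforcement -- illustrative values.

def monolithic_reward(event):
    # Only the final achievement pays off; everything before it earns nothing.
    return 1.0 if event == "delivered_puck_home" else 0.0

def shaped_reward(event):
    # Immediate payoffs for subgoals condition the robot in steps toward the goal.
    subgoal_values = {
        "grasped_puck": 0.2,
        "reached_home": 0.3,
        "delivered_puck_home": 1.0,
        "dropped_puck": -0.2,
    }
    return subgoal_values.get(event, 0.0)
```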

Progress Estimators. Progress estimators are partial internal critics: functions that provide positive or negative reinforcement with respect to the current goal (a behavior's post-conditions). They encourage exploration in the sense that, as long as the behavior makes progress, we do not switch behaviors at every discrete time step. They also allow us to catch thrashing within a single behavior.
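A progress estimator for a homing-style behavior might look like the sketch below: it rewards motion toward the goal, punishes stalling, and terminates the behavior when it detects thrashing. The class, thresholds, and reinforcement magnitudes are assumptions for illustration.

```python
class HomingProgressEstimator:
    """Partial internal critic for a homing-style behavior (illustrative sketch)."""

    def __init__(self, patience=10):
        self.patience = patience   # steps without progress before giving up
        self.stalled = 0
        self.prev_dist = None

    def step(self, dist_to_home):
        """Return (reinforcement, terminate) for the current time step."""
        made_progress = self.prev_dist is not None and dist_to_home < self.prev_dist
        self.prev_dist = dist_to_home
        if made_progress:
            self.stalled = 0
            return +0.1, False     # keep running the behavior
        self.stalled += 1
        if self.stalled >= self.patience:
            return -0.5, True      # thrashing detected: terminate the behavior
        return -0.1, False
```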

The Learning Task. Mataric assumes behaviors are known, and the task is to learn a switching function between behaviors. The observation space is shifted to be defined by the operating conditions of each behavioral module. Behaviors are run until the next event occurs or a progress estimator declares a failure.
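The overall control loop implied by this setup might be sketched as follows; the robot and behavior interfaces (observe_conditions, running, run_until_event_or_failure) are hypothetical scaffolding, not an API from the paper.

```python
# Sketch of the behavior-switching loop: run the chosen behavior until an event
# fires or its progress estimator gives up, then select again and learn.

def control_loop(robot, behaviors, select_behavior, learn):
    obs = robot.observe_conditions()
    while robot.running():
        behavior = select_behavior(obs, behaviors)    # learned switching function
        reward = behavior.run_until_event_or_failure(robot)
        next_obs = robot.observe_conditions()
        learn(obs, behavior, reward, next_obs)        # update the A(c, b) values
        obs = next_obs
```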

The Learning Algorithm. Find a total ordering on behaviors using values A(c, b), where A(c, b) is a weighted sum of immediate ("heterogeneous") reinforcement for subgoals achieved and of progress estimators. Learning is continuous. No bootstrapping occurs between (c, b) pairs; there is no flow of information between values as dynamic-programming algorithms use.
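A minimal sketch of the A(c, b) bookkeeping, assuming a simple fixed weighting w_r and w_p between event reinforcement and progress-estimator feedback (the paper's exact weighting is not reproduced here):

```python
from collections import defaultdict

A = defaultdict(float)    # A[(condition, behavior)] -> accumulated value
w_r, w_p = 1.0, 0.5       # illustrative weights on event and progress reinforcement

def update(condition, behavior, event_reward, progress_reward):
    """Credit only the active (c, b) pair; no value flows between pairs."""
    A[(condition, behavior)] += w_r * event_reward + w_p * progress_reward

def select_behavior(condition, behaviors):
    """Switching function induced by the learned ordering over behaviors."""
    return max(behaviors, key=lambda b: A[(condition, b)])
```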

Experimental Setup. The task is foraging, which is "complex and biologically inspired." The behaviors are known, including utility behaviors: safe-wandering, dispersion, resting, and homing. The space of behavior pre-conditions serves as the observation space.
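One way to picture the setup is a table of known behaviors gated by pre-conditions, as in the sketch below; the particular pre-condition assignments are guesses for illustration, not the paper's definitions.

```python
# Known behaviors gated by (illustrative) pre-conditions over the condition space.
BEHAVIORS = {
    "safe_wandering": lambda c: not c["have_puck"],
    "dispersion":     lambda c: c["near_robot"],
    "resting":        lambda c: c["night_time"],
    "homing":         lambda c: c["have_puck"],
}

def applicable_behaviors(condition):
    """Only behaviors whose pre-conditions hold compete for selection."""
    return [name for name, pre in BEHAVIORS.items() if pre(condition)]
```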

Experimental Results. Monolithic Q-learning does poorly; in particular, it has problems with other robots and only performs tasks that gain it immediate reward. Heterogeneous rewards do better. The "shaping" reward structure does the best of all.

Uncovered Interesting Issues. Interaction and credit assignment: the task is essentially a single-robot task. Hidden environmental state: the problems it induces are hardly discussed. The same goes for hidden state introduced by interaction.

Objections. Use of the word "state" implies a sufficient statistic. Features need not expand the observation space exponentially if we know a priori that they are only relevant given certain other features. We can build progress estimators into an initial value function using that observation, and then the estimators tune themselves. Semi-Markov processes deal with discrete-event systems instead of discrete time. Finally, it almost certainly took longer to craft the reward system and the progress estimators than to write the "empirically derived" optimal switcher.
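The suggestion to fold progress estimators into an initial value function could be sketched as seeding the table from a domain-knowledge potential; the phi function here is a hypothetical stand-in for that knowledge.

```python
from collections import defaultdict

def initialize_values(conditions, behaviors, phi):
    """Seed the value table from domain knowledge instead of the reward stream."""
    A0 = defaultdict(float)
    for c in conditions:
        for b in behaviors:
            A0[(c, b)] = phi(c, b)   # prior estimate of progress toward the goal
    return A0
```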

Conclusions. Mataric makes important and valid points regarding the difficulty of learning on real robots, and she shows that her methods can improve performance using domain knowledge. However, the methods described are ad hoc, and very little can be said even about convergence to any policy. It is not clear that any work is saved in this approach, nor that what we have when learning is done is a good or optimal policy.