Integrating POMDP and RL for a Two Layer Simulated Robot Architecture Presented by Alp Sardağ.

Two Layer Architecture  The lower layer provides fast, short-horizon decisions.  The lower layer is designed to keep the robot out of trouble.  The upper layer ensures that the robot continually works toward its target task or goal.

Advantages  Offers reliability.  Reliability: the robot must be able to deal with failures of sensors and actuators.  Without this, a hardware failure becomes a mission failure.  Examples: robots operating outside direct human control, such as space exploration and office robots.

The System  It has two levels of control: The lower level controls the actuators that move the robot around and provides a set of behaviors that can be used by the higher level of control. The upper level, the planning system, plans a sequence of actions to move the robot from its current location to the goal.
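
A minimal sketch of this two-level control loop; all class and method names (Planner, Behavior, robot.execute, and so on) are invented here for illustration, not taken from the paper:

```python
# Hypothetical sketch of the two-level architecture described above.
# All names (Planner, Behavior, robot, ...) are illustrative, not from the paper.

class Behavior:
    """A low-level, RL-trained behavior (e.g. move forward)."""
    def act(self, sensor_reading):
        raise NotImplementedError

class Planner:
    """Upper-level POMDP planner: maps a belief state to a behavior choice."""
    def select_behavior(self, belief):
        raise NotImplementedError
    def update_belief(self, belief, behavior_name, observation):
        raise NotImplementedError

def control_loop(planner, behaviors, robot, belief):
    """Upper layer picks a behavior (long horizon); lower layer picks actions (short horizon)."""
    while not robot.at_goal():
        name = planner.select_behavior(belief)               # slow, goal-directed choice
        action = behaviors[name].act(robot.read_sensors())   # fast, reactive choice
        robot.execute(action)
        belief = planner.update_belief(belief, name, robot.read_sensors())
```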

The Architecture  The bottom level is implemented with RL: As an incremental learning method, RL is able to learn online. RL can adapt to changes in the environment. RL reduces programmer intervention.

The Architecture  The higher level is a POMDP planner: The POMDP planner operates quickly once a policy is generated. The POMDP planner can provide the reinforcement needed by the lower-level behaviors.

The Test  For testing, the Khepera robot simulator is used. Khepera has limited sensors. It has a well-defined environment. The simulator can run much faster than real time. The simulator does not require human intervention for low-battery conditions and sensor failures.

Methods for Low-Level Behaviors  Subsumption  Learning from examples.  Behavioral cloning.

Methods for Low-Level Behaviors  Neural systems tend to be robust to noise and perturbation in the environment.  GeSAM is a neural-network-based robot hand control system; it uses an adaptive neural network.  Neural networks often require long training periods and large amounts of data.

Methods for Low-Level Behaviors  RL can learn continuously.  RL provides adaptation to sensor drift and changes in actuators.  Even in many extreme cases of sensor or actuator failure, RL can adapt enough to allow the robot to accomplish the mission.

Planning at the Top  POMDPs deal with uncertainty. For Khepera, with its limited sensors, determining the exact state is very difficult. Also, the effects of the actuators may not be deterministic.

Planning at the Top  Some rewards are associated with the goal state.  Some rewards are associated with performing a certain action in a certain state.  This makes it possible to define complex, compound goals.
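
One way such a compound reward specification could look, as a hedged sketch: the goal state 13 comes from the evaluation slides, while the other values and the action name are invented for illustration.

```python
# Hypothetical compound reward specification: rewards tied to states plus
# rewards tied to (state, action) pairs. Values and the action name are invented;
# state 13 is the goal state used later in the evaluation.
GOAL_STATE = 13
STATE_REWARDS = {GOAL_STATE: 10.0}                 # reward for reaching a state
STATE_ACTION_REWARDS = {(5, "turn_left"): 2.0}     # reward for doing an action in a state

def reward(state, action):
    """Total reward: state component plus state-action component."""
    return STATE_REWARDS.get(state, 0.0) + STATE_ACTION_REWARDS.get((state, action), 0.0)
```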

Drawback  The current POMDP solution method: Does not scale well with the size of the state space. Exact solutions are only feasible for very small POMDP planning problems. Requires that the robot be given a map, which is not always feasible.

What is Gained?  By combining RL and POMDP, the system is robust to changes.  RL will learn how to use the damaged sensors and actuators.  Continuous learning has some drawbacks when using backpropagation neural networks, such as over-training.  The POMDP adapts to sensor and actuator failures by adjusting the transition probabilities.

The Simulator  Pulse encoders are not used in this work.  The simulation results can be successfully transferred to a real robot.  The sensor model includes stochastic modeling of noise and responds similarly to the real sensors.  The simulation environment includes some stochastic modeling of wheel slippage and acceleration.  Hooks are added into the simulator to allow sensor failures to be simulated.  Effector failures are simulated in the code.

RL Behaviors  Three basic behaviors: move forward, turn right, and turn left.  The robot is always moving or performing an action.  RL is responsible for dealing with obstacles and with adjusting to sensor or actuator malfunctions.

RL Behaviors  The goal of the RL modules is to maximize the reward given to them by the POMDP planner.  The reward is a function of how long it took to make a desired state transition.  Each behavior has its own RL module.  Only one RL module can be active at a given time.  Q-learning with table lookup is used to approximate the value function.  Fortunately, the problem is so far small enough for table lookup.
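
A minimal sketch of tabular Q-learning of the kind described here; the learning rate, discount factor, exploration rate, and table layout are illustrative assumptions, not values from the paper.

```python
import random
from collections import defaultdict

class QLearningBehavior:
    """One behavior's RL module: tabular Q-learning with an epsilon-greedy policy.
    alpha, gamma, and epsilon are illustrative values, not taken from the paper."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # table lookup: (state, action) -> estimated value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:                            # explore
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])    # exploit

    def update(self, state, action, reward, next_state):
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```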

POMDP Planning  Since robots can rarely determine their state from sensor observations, completely observable MDPs (COMDPs) do not work well in many real-world robot planning tasks.  It is more adequate to use a probability distribution over states, updated using the transition and observation probabilities.
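
A sketch of the belief-state update this refers to, for a discrete POMDP; the T[s][a][s2] and O[a][s2][o] layouts are assumed conventions, not the paper's code.

```python
def update_belief(belief, action, observation, T, O):
    """Bayes-filter update of a discrete belief state.

    belief: dict state -> probability
    T[s][a][s2]: transition probability P(s2 | s, a)
    O[a][s2][o]: observation probability P(o | s2, a)
    """
    new_belief = {}
    for s2 in belief:
        prior = sum(T[s][action][s2] * belief[s] for s in belief)
        new_belief[s2] = O[action][s2][observation] * prior
    total = sum(new_belief.values())
    # Normalize; fall back to the old belief if the observation had zero probability.
    return {s: p / total for s, p in new_belief.items()} if total > 0 else belief
```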

Sensor Grouping  Khepera has 8 sensors that report bounded distance values.  The observations are reduced to 16: The sensors are grouped in pairs to make 4 pseudo-sensors, and thresholding is applied to the sensor outputs.  The POMDP planner is now robust to single sensor failures.
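
A sketch of the grouping and thresholding step; the pairing order and the threshold value are assumptions, since the slides only state that 8 sensors become 4 thresholded pseudo-sensors.

```python
THRESHOLD = 512  # illustrative threshold; the slides only say thresholding is applied

def observation_from_sensors(readings):
    """Collapse 8 raw distance readings into a 4-bit observation (0-15).

    Adjacent sensors are paired into 4 pseudo-sensors (an assumed pairing);
    taking the max of each pair means one dead sensor usually does not
    change the pair's thresholded bit.
    """
    assert len(readings) == 8
    bits = 0
    for i in range(4):
        pair_max = max(readings[2 * i], readings[2 * i + 1])
        if pair_max > THRESHOLD:
            bits |= 1 << i
    return bits
```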

Solving a POMDP  The Witness algorithm is used to compute the optimal policy for the POMDP.  Witness does not scale well with the size of the state space.

Environment and State Space  64 possible states for the robot: 16 discrete positions, with the robot's heading discretized into the four compass directions.  Sensor information was reduced to 4 bits by combining the sensors in pairs and thresholding.  The LP solution required several days on a Sun Ultra 2 workstation.
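
The 64 states factor as 16 grid positions times 4 headings; a purely illustrative indexing is sketched below (the actual state numbering used in the paper is not given).

```python
HEADINGS = ["N", "E", "S", "W"]   # the four compass directions

def state_index(position, heading):
    """Map (position 0-15, heading) to one of the 64 POMDP states."""
    return position * len(HEADINGS) + HEADINGS.index(heading)

def state_from_index(index):
    """Inverse mapping: state index -> (position, heading)."""
    return index // len(HEADINGS), HEADINGS[index % len(HEADINGS)]

# Round-trip check (13 happens to be the goal state used in the evaluation).
assert state_index(*state_from_index(13)) == 13
```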

Environment and State Space

Interface Between Layers  The POMDP uses the current belief state to select which low-level behavior to activate.  The implementation tracks the state with the highest probability: the most likely current state.  If the most likely current state changes to the state the POMDP wants, a reward of 1 is generated; otherwise, a reward of –1.
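
A sketch of this interface, assuming a reward of 0 while the most likely state has not changed (the slides only specify the +1/–1 cases); all names are hypothetical.

```python
def most_likely_state(belief):
    """State with the highest probability in the current belief."""
    return max(belief, key=belief.get)

def low_level_reward(previous_mls, current_mls, desired_state):
    """Reward passed from the planner to the active RL behavior:
    +1 when the most likely state changes to the state the planner wants,
    -1 when it changes to some other state."""
    if current_mls == previous_mls:
        return 0  # assumption: no reward until the tracked state actually changes
    return 1 if current_mls == desired_state else -1
```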

Hypothesis  Since the RL + POMDP system is adaptive, the overall performance is expected to degrade gracefully as sensors and actuators gradually fail.

Evaluation  State 13 is the goal state.  The POMDP state transition and observation probabilities are obtained by placing the robot in each of the 64 states and taking each action ten times.  With the policy in place, the RL modules are trained in the same way.  For each system configuration (RL or hand-coded), the simulation is started from every position and orientation, and performance is recorded.
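
A sketch of how such sampling-based model estimation could be coded; the simulator methods place_robot and step are hypothetical names, not the Khepera simulator's API.

```python
from collections import defaultdict

def estimate_model(simulator, states, actions, trials_per_pair=10):
    """Estimate transition (T) and observation (O) probabilities by counting:
    place the robot in each state and take each action a fixed number of times."""
    t_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s2: count}
    o_counts = defaultdict(lambda: defaultdict(int))   # (a, s2) -> {obs: count}
    for s in states:
        for a in actions:
            for _ in range(trials_per_pair):
                simulator.place_robot(s)       # hypothetical simulator hook
                s2, obs = simulator.step(a)    # hypothetical: returns next state and observation
                t_counts[(s, a)][s2] += 1
                o_counts[(a, s2)][obs] += 1
    T = {k: {s2: c / trials_per_pair for s2, c in v.items()} for k, v in t_counts.items()}
    O = {k: {o: c / sum(v.values()) for o, c in v.items()} for k, v in o_counts.items()}
    return T, O
```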

Metrics  Failures during a trial evaluate reliability.  Average steps to goal assess efficiency.

Gradual Sensor Failure  Battery power is used up and dust accumulates on the sensors.

Intermittent Actuator Failure  Right motor control signal failed.

Conclusion  The RL + POMDP system exhibits robust behavior in the presence of sensor and actuator degradation.  Future work: scaling up the problem.  To overcome the scaling problem of RL table lookup, neural nets could be used (with a learn/forget cycle).  To increase the size of the state space the POMDP can handle, non-optimal solution algorithms are being investigated.  New behaviors will be added.