1 Integrating POMDP and RL for a Two Layer Simulated Robot Architecture Presented by Alp Sardağ

2 Two Layer Architecture  The lower layer provides fast, short-horizon decisions.  The lower layer is designed to keep the robot out of trouble.  The upper layer ensures that the robot continually works toward its target task or goal.

3 Advantages  Offers reliability: the robot must be able to deal with the failure of sensors and actuators.  Otherwise, hardware failure = mission failure.  Examples of robots operating out of direct human control: space exploration, office robots.

4 The System  It has two levels of control: The lower level controls the actuators that move the robot around and provides a set of behaviors that can be used by the higher level of control. The upper level, the planning system, plans a sequence of actions in order to move the robot from its current location to the goal.
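
A minimal sketch of this two-level control loop, written in Python. All function names, the behavior set, and the motor speeds below are illustrative assumptions, not code or values from the paper:

def plan_behavior(belief):
    """Upper level: choose a low-level behavior from the current belief state."""
    return "move_forward"  # stand-in for the POMDP policy described later

def run_behavior(behavior, sensor_readings):
    """Lower level: turn the active behavior into motor speeds
    (a real module would also use the sensor readings)."""
    speeds = {"move_forward": (5, 5), "turn_left": (-5, 5), "turn_right": (5, -5)}
    return speeds[behavior]

belief = {s: 1.0 / 64 for s in range(64)}   # uniform initial belief over 64 states
left, right = run_behavior(plan_behavior(belief), sensor_readings=[0] * 8)
print(left, right)   # -> 5 5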

5 The Architecture  The bottom level is implemented with RL: As an incremental learning method, RL is able to learn online. RL can adapt to changes in the environment. RL reduces programmer intervention.

6 The Architecture  The higher level is a POMDP planner: The POMDP planner operates quickly once a policy is generated. The POMDP planner can provide the reinforcement needed by the lower-level behaviors.

7 The Test  For testing, the Kephera robot simulator is used: Kephera has limited sensors. It has a well-defined environment. The simulator can run much faster than real time. The simulator does not require human intervention for low-battery conditions and sensor failures.

8 Methods for Low-Level Behaviors  Subsumption.  Learning from examples.  Behavioral cloning.

9 Methods for Low-Level Behaviors  Neural systems tend to be robust to noise and perturbation in the environment.  GeSAM is a neural-network-based robot hand control system; it uses an adaptive neural network.  Neural networks often require long training periods and large amounts of data.

10 Methods for Low-Level Behaviors  RL can learn continuously.  RL provides adaptation to sensor drift and changes in actuators.  Even in many extreme cases of sensor or actuator failure, RL can adapt enough to allow the robot to accomplish the mission.

11 Planning at the Top  POMDP deals with the uncertainty: For Kephera, with its limited sensors, determining the exact state is very difficult. Also, the effects of the actuators may not be deterministic.

12 Planning at the Top  Some rewards are associated with the goal state.  Some rewards are associated with performing some action in a certain state.  Thus, this allows complex, compound goals to be defined.

13 Drawback  The current POMDP solution method: Does not scale well with the size of the state space. Exact solutions are only feasible for very small POMDP planning problems. Requires that the robot be given a map, which is not always feasible.

14 What is Gained?  By combining RL and POMDP, the system is robust to changes.  RL will learn how to use the damaged sensors and actuators.  Continuous learning has some drawbacks when using backpropagation neural networks, such as over-training.  The POMDP adapts to sensor and actuator failures by adjusting the transition probabilities.

15 The Simulator  Pulse encoders are not used in this work.  The simulation results can be successfully transferred to a real robot.  The sensor model includes stochastic modeling of noise and responds similarly to the real sensors.  The simulation environment includes some stochastic modeling of wheel slippage and acceleration.  Hooks are added into the simulator to allow sensor failures to be simulated.  Effector failures are simulated in the code.

16 RL Behaviors  Three basic behaviors: move forward, turn right, and turn left.  The robot is always moving or performing an action.  RL is responsible for dealing with obstacles and with adjusting to sensor or actuator malfunctions.

17 RL Behaviors  The goal of the RL modules is to maximize the reward given to them by the POMDP planner.  The reward is a function of how long it took to make a desired state transition.  Each behavior has its own RL module.  Only one RL module can be active at a given time.  Q-learning with table lookup is used to approximate the value function.  Fortunately, the problem is so far small enough for table lookup.
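
A minimal sketch of one such tabular Q-learning module. The slides state only "Q-learning with table lookup" and a time-dependent reward; the action set, the learning parameters (alpha, gamma, epsilon), and the exact reward shape below are assumptions:

import random
from collections import defaultdict

ACTIONS = ["speed_up_left", "speed_up_right", "both_forward"]  # hypothetical motor-level actions
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1                          # assumed learning parameters
Q = defaultdict(float)                                         # table lookup: Q[(state, action)]

def choose_action(state):
    """Epsilon-greedy action selection over the Q table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Standard one-step Q-learning backup."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def transition_reward(steps_taken):
    """Reward as a decreasing function of how long the desired state
    transition took; the exact shape is an assumption."""
    return 1.0 / max(steps_taken, 1)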

18 POMDP Planning  Since robots can rarely determine their state from sensor observations, completely observable MDPs (COMDPs) do not work well in many real-world robot planning tasks.  It is more adequate to maintain a probability distribution over states, updated using the transition and observation probabilities.
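
The update referred to here is the standard Bayes filter, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). A minimal Python sketch, assuming the transition and observation probabilities are stored in dictionaries:

def update_belief(belief, action, observation, T, O, states):
    """belief: {state: probability}; T[(s, a, s2)] = P(s2 | s, a); O[(s2, a, o)] = P(o | s2, a)."""
    new_belief = {}
    for s2 in states:
        prior = sum(T[(s, action, s2)] * belief[s] for s in states)  # prediction step
        new_belief[s2] = O[(s2, action, observation)] * prior        # correction step
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else dict(belief)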

19 Sensor Grouping  Kephera has 8 sensors that report distance values between 0 and 1024.  The observations are reduced to 16: The sensors are grouped in pairs to make 4 pseudo-sensors, and thresholding is applied to their outputs.  The POMDP planner is now robust to single sensor failures.
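
A sketch of this observation reduction: eight distance readings are paired into four pseudo-sensors and thresholded to one bit each, giving 2^4 = 16 observations. The pairing scheme, the use of max to combine a pair, and the threshold value are assumptions; the slides do not give them:

THRESHOLD = 512  # assumed cut-off on the 0-1024 range

def reduce_observation(readings):
    """Map 8 raw distance readings to one of 16 discrete observations."""
    assert len(readings) == 8
    pseudo = [max(readings[2 * i], readings[2 * i + 1]) for i in range(4)]  # pair adjacent sensors
    bits = [1 if p > THRESHOLD else 0 for p in pseudo]                      # threshold each pair
    return bits[0] * 8 + bits[1] * 4 + bits[2] * 2 + bits[3]                # observation index 0..15

print(reduce_observation([0, 100, 900, 50, 0, 0, 700, 800]))                # -> 5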

20 Solving a POMDP  The Witness algorithm is used to compute the optimal policy for the POMDP.  Witness does not scale well with the size of the state space.

21 Environment and State Space  64 possible states for the robot: 16 discrete positions. The robot's heading is discretized into the four compass directions.  Sensor information was reduced to 4 bits by combining the sensors in pairs and thresholding.  The LP solution required several days on a Sun Ultra 2 workstation.
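
A small sketch of the resulting state encoding (16 positions × 4 headings = 64 states); the particular index convention is an assumption:

HEADINGS = ["N", "E", "S", "W"]  # the four compass directions

def state_index(position, heading):
    """position in 0..15, heading in HEADINGS -> state index 0..63."""
    return position * 4 + HEADINGS.index(heading)

assert state_index(0, "N") == 0 and state_index(15, "W") == 63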

22 Environment and State Space

23 Interface Between Layers  The POMDP uses the current belief state to select the low-level behavior to activate.  The implementation tracks the state with the highest probability: the most likely current state.  If the most likely current state changes to the state the POMDP wants, a reward of 1 is generated; otherwise, –1.
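
A minimal sketch of this interface. The helper names are illustrative, and returning a zero reward while the most likely state has not yet changed is an assumption beyond what the slide states:

def most_likely_state(belief):
    """Track the state with the highest probability in the belief."""
    return max(belief, key=belief.get)

def behavior_reward(previous_mls, new_mls, desired_state):
    """+1 when the most likely state changes to the state the planner wanted,
    -1 when it changes to some other state, 0 while unchanged (assumption)."""
    if new_mls == previous_mls:
        return 0.0
    return 1.0 if new_mls == desired_state else -1.0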

24 Hypothesis  Since the RL + POMDP system is adaptive, the authors expect that overall performance should degrade gracefully as sensors and actuators gradually fail.

25 Evaluation  State 13 is the goal state.  The POMDP state transition and observation probabilities are obtained by placing the robot in each of the 64 states and taking each action ten times.  With the policy in place, the RL modules are trained in the same way.  For each system configuration (RL or hand-coded), the simulation is started from every position and orientation and performance is recorded.
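
A sketch of how the transition and observation tables could be estimated by this procedure. simulate_step() is a hypothetical stand-in for placing the robot in a state, executing an action in the simulator, and reading back the resulting state and observation:

from collections import defaultdict

def estimate_model(states, actions, simulate_step, trials=10):
    """Estimate T(s'|s,a) and O(o|s',a) from repeated trials in the simulator."""
    t_counts = defaultdict(int)  # (s, a, s') -> count
    o_counts = defaultdict(int)  # (s', a, o) -> count
    for s in states:
        for a in actions:
            for _ in range(trials):
                s_next, obs = simulate_step(s, a)  # place robot in s, take action a
                t_counts[(s, a, s_next)] += 1
                o_counts[(s_next, a, obs)] += 1
    T = {key: count / trials for key, count in t_counts.items()}
    O = {}
    for (s_next, a, obs), count in o_counts.items():
        denom = sum(c for (s2, a2, _), c in o_counts.items() if s2 == s_next and a2 == a)
        O[(s_next, a, obs)] = count / denom
    return T, O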

26 Metrics  Failures during a trial evaluate reliability.  Average steps to goal assesses efficiency.

27 Gradual Sensor Failure  Battery power is used up, and dust accumulates on the sensors.

28 Intermittent Actuator Failure  The right motor control signal failed intermittently.

29 Conclusion  The RL + POMDP system exhibits robust behavior in the presence of sensor and actuator degradation.  Future work includes scaling up the problem.  To overcome the scaling problem of RL table lookup, neural nets can be used (learn/forget cycle).  To increase the size of the state space for the POMDP, non-optimal solution algorithms are being investigated.  New behaviors will be added.

