Effective Reinforcement Learning for Mobile Robots
Smart, W.D. and Kaelbling, L.P.
Content
– Background
– Review of Q-learning
– Reinforcement learning on mobile robots
– Learning framework
– Experimental results
– Conclusion
– Discussion
Background
– Coding robot behaviour by hand is difficult to do efficiently and correctly
– Reinforcement learning: tell the robot what to do, not how to do it
– How well suited is reinforcement learning for mobile robots?
Review of Q-learning
– Discrete states s and actions a
– Learn the value function by observing rewards
– Optimal value function: Q*(s,a) = E[R(s,a) + γ max_a' Q*(s',a')]
– Learn by the update: Q(s_t, a_t) ← (1 - α) Q(s_t, a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1}, a'))
– The sample distribution has no effect on the learned policy
– Optimal policy: π*(s) = argmax_a Q*(s,a)
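A minimal Python sketch of the tabular update above, for reference only: the paper itself works with continuous states through the HEDGER approximator (next slide), so the discrete lookup table, the action list, and the helper names here are illustrative assumptions; the ALPHA and GAMMA constants simply reuse the values quoted later in the experimental setup.

from collections import defaultdict

ALPHA = 0.2   # learning rate alpha (value reused from the experimental setup slide)
GAMMA = 0.99  # discount factor gamma

Q = defaultdict(float)  # Q[(state, action)] -> current value estimate, default 0

def q_update(s, a, r, s_next, actions):
    """One Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

def greedy_policy(s, actions):
    """pi*(s) = argmax_a Q(s,a); ties are broken arbitrarily."""
    return max(actions, key=lambda a: Q[(s, a)])

With the sparse reward described on the next slide, these updates propagate nothing useful until a rewarding state has actually been visited, which is exactly the problem the teaching phase addresses.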
Reinforcement learning on mobile robots
– Sparse reward function: R(s,a) is almost always zero, non-zero only on success or failure
– Continuous environment: HEDGER is used as a function approximator
– Function approximation is safe only if it never extrapolates beyond the training data
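To make "sparse" concrete, a reward of this shape might be written as below; reached_goal and failed are hypothetical stand-ins for the task-specific success and failure checks, and the reward magnitudes are arbitrary, not taken from the paper.

def sparse_reward(state):
    """Zero almost everywhere; non-zero only at terminal success or failure.
    reached_goal() and failed() are hypothetical task-specific predicates."""
    if reached_goal(state):   # e.g. robot reached the end of the corridor
        return 1.0            # success (magnitude is illustrative)
    if failed(state):         # e.g. collision
        return -1.0           # failure (magnitude is illustrative)
    return 0.0                # everywhere else: no learning signal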
Reinforcement learning on mobile robots
– Q-learning can only succeed once a state with positive reward has been found
– A sparse reward function in a continuous environment makes reward states hard to find by trial and error
– Solution: show the robot how to find the reward states
Learning framework
– Learning is split into two phases:
– Phase one: actions are chosen by an external teacher; the learning algorithm only observes passively
– Phase two: the learning algorithm takes control and learns the optimal policy
– By 'showing' the robot where the interesting states are, learning should be quicker
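A rough sketch of the two phases, reusing q_update and greedy_policy from the earlier sketch; teacher_action, env_step, and terminal are hypothetical helpers, and the tabular Q stands in for the HEDGER-based value function used in the actual system.

def phase_one(episodes, start_state, actions):
    """Phase one: a teacher (human or hand-coded controller) chooses every action;
    the learner only observes the resulting transitions and updates Q passively."""
    for _ in range(episodes):
        s = start_state
        while not terminal(s):
            a = teacher_action(s)              # chosen by the teacher, not the learner
            s_next, r = env_step(s, a)         # hypothetical environment interface
            q_update(s, a, r, s_next, actions)
            s = s_next

def phase_two(episodes, start_state, actions):
    """Phase two: the learned policy takes control and keeps improving."""
    for _ in range(episodes):
        s = start_state
        while not terminal(s):
            a = greedy_policy(s, actions)      # learner in control
            s_next, r = env_step(s, a)
            q_update(s, a, r, s_next, actions)
            s = s_next

Because phase one feeds the learner trajectories that actually reach the reward states, the value estimates are already informative by the time the learner takes over in phase two.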
Experimental setup
– Two experiments on a B21r mobile robot
– Movement speed is fixed externally; rotation speed has to be learned
– Settings: α = 0.2, γ = 0.99 or 0.90
– Performance is measured after every 5 runs; the robot does not learn from these test runs
– Starting position and orientation are similar across runs, but not identical
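One way the test protocol might be wired up, again using the hypothetical helpers from the sketches above; only the 5-run interval and the fact that test runs do not update the learner come from the slide, everything else is assumption.

def run_episode(start_state, actions, policy, learn=True):
    """Run one episode with the given policy; update Q only when learn=True."""
    s, total_reward = start_state, 0.0
    while not terminal(s):
        a = policy(s, actions)
        s_next, r = env_step(s, a)
        if learn:
            q_update(s, a, r, s_next, actions)
        total_reward += r
        s = s_next
    return total_reward

def train_with_periodic_tests(num_runs, start_state, actions, policy):
    """Interleave learning runs with a no-learning test run after every 5th run."""
    for run in range(1, num_runs + 1):
        run_episode(start_state, actions, policy, learn=True)
        if run % 5 == 0:
            score = run_episode(start_state, actions, greedy_policy, learn=False)
            print(f"run {run}: test return {score:.2f}")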
Experimental Results: Corridor Following Task
State space:
– distance to the end of the corridor
– distance to the left wall as a fraction of the corridor width
– angle to the target point
Experimental Results: Corridor Following Task
Computer-controlled teacher
– Rotation speed is a fraction of the angle to the target point
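A sketch of what such a hand-coded teacher could look like; the gain value and the exact layout of the state tuple are assumptions, only the "rotation speed is a fraction of the angle" rule comes from the slide.

K_TURN = 0.3  # fraction of the angle used as the rotation command (illustrative value)

def teacher_action(state):
    """Hand-coded corridor teacher: turn at a rate proportional to the angle error.
    Assumed state layout: (dist_to_end, frac_dist_to_left_wall, angle_to_target)."""
    _, _, angle_to_target = state
    return K_TURN * angle_to_target   # rotation speed; translation speed is fixed externally

This is one possible instance of the teacher_action helper used in the phase-one sketch earlier.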
Experimental Results: Corridor Following Task
Human-controlled teacher
– Trained in a different corridor than the computer-controlled teacher
Experimental Results: Corridor Following Task
Results
– Decrease in performance right after training: phase 2 supplies more novel experiences
– The sloppy human controller causes faster convergence than the rigid computer controller
– Fewer phase 1 and phase 2 runs are needed
– The human controller supplies more varied data
Experimental Results: Corridor Following Task
Results
– Simulated performance without the advantage of teacher examples
Experimental Results: Obstacle Avoidance Task
State space:
– direction and distance to obstacles
– direction and distance to the target
Experimental Results: Obstacle Avoidance Task Results
Human-controlled teacher
– Robot starts 3 m from the target with a random orientation
Experimental Results: Obstacle Avoidance Task Results
Simulation without teacher examples
– No obstacles present; the robot only has to reach the goal
– Simulated robot starts in the right orientation, 3 meters from the target
– 18.7% reached the target within one week of simulated time, taking 6.54 hours on average
Conclusion
– Passive observation of appropriate state-action behaviour can speed up Q-learning
– Knowledge about the robot or the learning algorithm is not necessary
– Any example solution will work; the teacher does not need to provide a good one
Discussion