Effective Reinforcement Learning for Mobile Robots
Smart, W.D. and Kaelbling, L.P.
Content
– Background
– Review of Q-learning
– Reinforcement learning on mobile robots
– Learning framework
– Experimental results
– Conclusion
– Discussion
Background
– Coding robot behaviour by hand is difficult to do efficiently and correctly
– Reinforcement learning: tell the robot what to do, not how to do it
– How well suited is reinforcement learning for mobile robots?
Review of Q-learning
– Discrete states s and actions a
– Learn the value function by observing rewards
– Optimal value function: Q*(s,a) = E[R(s,a) + γ max_a' Q*(s',a')]
– Learn by the update: Q(s_t, a_t) ← (1 - α) Q(s_t, a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1}, a'))
– The sample distribution has no effect on the learned policy
– Optimal policy: π*(s) = argmax_a Q*(s,a)
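A minimal Python sketch of the tabular update above, for reference only: the paper itself works with continuous states through the HEDGER approximator (next slide), so the discrete lookup table, the action list, and the helper names here are illustrative assumptions; the ALPHA and GAMMA constants simply reuse the values quoted later in the experimental setup.

from collections import defaultdict

ALPHA = 0.2   # learning rate alpha (value reused from the experimental setup slide)
GAMMA = 0.99  # discount factor gamma

Q = defaultdict(float)  # Q[(state, action)] -> current value estimate, default 0

def q_update(s, a, r, s_next, actions):
    """One Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

def greedy_policy(s, actions):
    """pi*(s) = argmax_a Q(s,a); ties are broken arbitrarily."""
    return max(actions, key=lambda a: Q[(s, a)])

With the sparse reward described on the next slide, these updates propagate nothing useful until a rewarding state has actually been visited, which is exactly the problem the teaching phase addresses.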
Reinforcement learning on mobile robots
– Sparse reward function: R(s,a) is almost always zero, non-zero only on success or failure
– Continuous environment: HEDGER is used as a function approximator
– Function approximation is safe only if it never extrapolates beyond the training data
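To make "sparse" concrete, a reward of this shape might be written as below; reached_goal and failed are hypothetical stand-ins for the task-specific success and failure checks, and the reward magnitudes are arbitrary, not taken from the paper.

def sparse_reward(state):
    """Zero almost everywhere; non-zero only at terminal success or failure.
    reached_goal() and failed() are hypothetical task-specific predicates."""
    if reached_goal(state):   # e.g. robot reached the end of the corridor
        return 1.0            # success (magnitude is illustrative)
    if failed(state):         # e.g. collision
        return -1.0           # failure (magnitude is illustrative)
    return 0.0                # everywhere else: no learning signal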
Reinforcement learning on mobile robots
– Q-learning can only succeed once a state with positive reward has been found
– A sparse reward function in a continuous environment makes reward states hard to find by trial and error
– Solution: show the robot how to find the reward states
Learning framework
– Learning is split into two phases:
– Phase one: actions are chosen by an external teacher; the learning algorithm only observes passively
– Phase two: the learning algorithm takes control and learns the optimal policy
– By 'showing' the robot where the interesting states are, learning should be quicker
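A rough sketch of the two phases, reusing q_update and greedy_policy from the earlier sketch; teacher_action, env_step, and terminal are hypothetical helpers, and the tabular Q stands in for the HEDGER-based value function used in the actual system.

def phase_one(episodes, start_state, actions):
    """Phase one: a teacher (human or hand-coded controller) chooses every action;
    the learner only observes the resulting transitions and updates Q passively."""
    for _ in range(episodes):
        s = start_state
        while not terminal(s):
            a = teacher_action(s)              # chosen by the teacher, not the learner
            s_next, r = env_step(s, a)         # hypothetical environment interface
            q_update(s, a, r, s_next, actions)
            s = s_next

def phase_two(episodes, start_state, actions):
    """Phase two: the learned policy takes control and keeps improving."""
    for _ in range(episodes):
        s = start_state
        while not terminal(s):
            a = greedy_policy(s, actions)      # learner in control
            s_next, r = env_step(s, a)
            q_update(s, a, r, s_next, actions)
            s = s_next

Because phase one feeds the learner trajectories that actually reach the reward states, the value estimates are already informative by the time the learner takes over in phase two.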
Experimental setup
– Two experiments on a B21r mobile robot
– Movement speed is fixed externally; rotation speed has to be learned
– Settings: α = 0.2, γ = 0.99 or 0.90
– Performance is measured after every 5 runs; the robot does not learn from these test runs
– Starting position and orientation are similar across runs, but not identical
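One way the test protocol might be wired up, again using the hypothetical helpers from the sketches above; only the 5-run interval and the fact that test runs do not update the learner come from the slide, everything else is assumption.

def run_episode(start_state, actions, policy, learn=True):
    """Run one episode with the given policy; update Q only when learn=True."""
    s, total_reward = start_state, 0.0
    while not terminal(s):
        a = policy(s, actions)
        s_next, r = env_step(s, a)
        if learn:
            q_update(s, a, r, s_next, actions)
        total_reward += r
        s = s_next
    return total_reward

def train_with_periodic_tests(num_runs, start_state, actions, policy):
    """Interleave learning runs with a no-learning test run after every 5th run."""
    for run in range(1, num_runs + 1):
        run_episode(start_state, actions, policy, learn=True)
        if run % 5 == 0:
            score = run_episode(start_state, actions, greedy_policy, learn=False)
            print(f"run {run}: test return {score:.2f}")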
Experimental Results: Corridor Following Task
State space:
– distance to the end of the corridor
– distance to the left wall as a fraction of the corridor width
– angle to the target point
Experimental Results: Corridor Following Task
Computer-controlled teacher
– Rotation speed is a fraction of the angle to the target point
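A sketch of what such a hand-coded teacher could look like; the gain value and the exact layout of the state tuple are assumptions, only the "rotation speed is a fraction of the angle" rule comes from the slide.

K_TURN = 0.3  # fraction of the angle used as the rotation command (illustrative value)

def teacher_action(state):
    """Hand-coded corridor teacher: turn at a rate proportional to the angle error.
    Assumed state layout: (dist_to_end, frac_dist_to_left_wall, angle_to_target)."""
    _, _, angle_to_target = state
    return K_TURN * angle_to_target   # rotation speed; translation speed is fixed externally

This is one possible instance of the teacher_action helper used in the phase-one sketch earlier.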
Experimental Results: Corridor Following Task
Human-controlled teacher
– Trained in a different corridor than the computer-controlled teacher
Experimental Results: Corridor Following Task
Results
– Decrease in performance right after training: phase 2 supplies more novel experiences
– The sloppy human controller causes faster convergence than the rigid computer controller
– Fewer phase 1 and phase 2 runs are needed
– The human controller supplies more varied data
Experimental Results: Corridor Following Task
Results
– Simulated performance without the advantage of teacher examples
Experimental Results: Obstacle Avoidance Task
State space:
– direction and distance to obstacles
– direction and distance to the target
Experimental Results: Obstacle Avoidance Task Results
Human-controlled teacher
– Robot starts 3 m from the target with a random orientation
Experimental Results: Obstacle Avoidance Task Results
Simulation without teacher examples
– No obstacles present; the robot only has to reach the goal
– Simulated robot starts in the right orientation, 3 meters from the target
– 18.7% reached the target within one week of simulated time, taking 6.54 hours on average
Conclusion
– Passive observation of appropriate state-action behaviour can speed up Q-learning
– Knowledge about the robot or the learning algorithm is not necessary
– Any example solution will work; the teacher does not need to provide a good one
Discussion