1
Encoding Robotic Sensor States for Q-Learning using the Self-Organizing Map
Gabriel J. Ferrer
Department of Computer Science, Hendrix College
2
Outline
- Statement of Problem
- Q-Learning
- Self-Organizing Maps
- Experiments
- Discussion
3
Statement of Problem
- Goal
  - Make robots do what we want
  - Minimize/eliminate programming
- Proposed solution: reinforcement learning
  - Specify desired behavior using rewards
  - Express rewards in terms of sensor states
  - Use machine learning to induce the desired actions
- Target platform: Lego Mindstorms NXT
4
Robotic Platform
5
Experimental Task
- Drive forward
- Avoid hitting things
6
Q-Learning
- Table of expected rewards ("Q-values"), indexed by state and action
- Algorithm steps
  - Calculate the state index from the sensor values
  - Calculate the reward
  - Update the previous Q-value
  - Select and perform an action
- Update rule: Q(s,a) = (1 - α) Q(s,a) + α (r + γ max over a' of Q(s',a'))
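To make the update rule concrete, here is a minimal Java sketch of a tabular Q-value store with the update described above. All names (QTable, numStates, and so on) are illustrative assumptions, not the presentation's actual code; the free-standing static helper methods in the later sketches are likewise assumed to live together in one hypothetical helper class.

    // Minimal tabular Q-learning sketch (illustrative names, not the talk's code).
    public class QTable {
        private final double[][] q;      // q[state][action], the expected rewards
        private final double gamma;      // discount factor

        public QTable(int numStates, int numActions, double gamma) {
            this.q = new double[numStates][numActions];
            this.gamma = gamma;
        }

        // Q(s,a) = (1 - alpha) Q(s,a) + alpha (r + gamma * max over a' of Q(s',a'))
        public void update(int s, int a, double reward, int sPrime, double alpha) {
            double best = q[sPrime][0];
            for (int ap = 1; ap < q[sPrime].length; ap++) {
                best = Math.max(best, q[sPrime][ap]);
            }
            q[s][a] = (1 - alpha) * q[s][a] + alpha * (reward + gamma * best);
        }

        // Greedy action for a state; exploration is handled separately
        // (see the parameters slide below).
        public int bestAction(int s) {
            int best = 0;
            for (int a = 1; a < q[s].length; a++) {
                if (q[s][a] > q[s][best]) best = a;
            }
            return best;
        }
    }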
7
Q-Learning and Robots
- Certain sensors provide continuous values
  - Sonar
  - Motor encoders
- Q-Learning requires discrete inputs
  - Group continuous values into discrete "buckets" [Mahadevan and Connell, 1992]
- Q-Learning produces discrete actions
  - Forward
  - Back-left / Back-right
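As a small illustration of bucketing, the sketch below maps a continuous sonar reading onto the 0-19 / 20-39 / 40+ cm buckets used by the Qa control described later; the method name is an assumption.

    // Quantize a sonar distance (in cm) into three buckets: 0-19, 20-39, 40+.
    static int sonarBucket(double distanceCm) {
        if (distanceCm < 20) return 0;
        if (distanceCm < 40) return 1;
        return 2;
    }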
8
Creating Discrete Inputs
- Basic approach
  - Discretize continuous values into sets
  - Combine each discretized tuple into a single index
- Another approach: Self-Organizing Map
  - Induces a discretization of continuous values [Touzet 1997] [Smith 2002]
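One standard way to combine a discretized tuple into a single index is mixed-radix encoding; the helper below is an assumed illustration of that step, not code from the talk.

    // Fold a tuple of discretized values into one state index (mixed-radix).
    // values[i] must lie in [0, sizes[i]).
    static int tupleToIndex(int[] values, int[] sizes) {
        int index = 0;
        for (int i = 0; i < values.length; i++) {
            index = index * sizes[i] + values[i];
        }
        return index;
    }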
9
Self-Organizing Map (SOM)
- 2D grid of output nodes
  - Each output corresponds to an ideal input value
  - Inputs can be anything with a distance function
- Activating an output
  - Present an input to the network
  - The output with the closest ideal input is the "winner"
10
Applying the SOM
- Each input is a vector of sensor values
  - Sonar
  - Left/right bump sensors
  - Left/right motor speeds
- Distance function is sum-of-squared-differences
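A minimal sketch of winner selection under the sum-of-squared-differences distance; the array layout (one weight vector per output node) and the names are assumptions.

    // Find the output node whose ideal input (weight vector) is closest to the
    // current sensor vector, using sum-of-squared-differences.
    static int findWinner(double[][] weights, double[] input) {
        int winner = 0;
        double bestDist = Double.MAX_VALUE;
        for (int node = 0; node < weights.length; node++) {
            double dist = 0;
            for (int j = 0; j < input.length; j++) {
                double diff = input[j] - weights[node][j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                winner = node;
            }
        }
        return winner;
    }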
11
SOM Unsupervised Learning
- Present an input to the network
- Find the winning output node
- Update the ideal input for the winner and its neighbors
  - weight_ij = weight_ij + α * (input_j - weight_ij)
- Neighborhood function
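A sketch of one training step under this rule, with the neighborhood function left as a pluggable hook; the interface and names are assumptions, and the two neighborhood variants used in the experiments are sketched after the SOM formulation slides.

    // Move the winner's weight vector, and those of its neighbors (scaled by
    // the neighborhood function), toward the presented input.
    static void trainStep(double[][] weights, double[] input, int winner,
                          double alpha, NeighborhoodFunction neighborhood) {
        for (int node = 0; node < weights.length; node++) {
            double h = neighborhood.strength(winner, node);  // how strongly this node follows the input
            if (h == 0) continue;
            for (int j = 0; j < input.length; j++) {
                weights[node][j] += alpha * h * (input[j] - weights[node][j]);
            }
        }
    }

    interface NeighborhoodFunction {
        double strength(int winner, int node);
    }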
12
Experiments
- Implemented in Java (LeJOS 0.85)
- Each experiment: 240 seconds (800 Q-Learning iterations)
- 36 states
- Three actions
  - Both motors forward
  - Left motor backward, right motor stopped
  - Left motor stopped, right motor backward
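For concreteness, a hedged sketch of dispatching the three actions through the LeJOS NXJ Motor class; which motor ports drive the left and right wheels (A and C below) is an assumption.

    import lejos.nxt.Motor;

    // Send one of the three discrete actions to the drive motors.
    // Assumes the left wheel is on port A and the right wheel on port C.
    static void performAction(int action) {
        switch (action) {
            case 0:                      // both motors forward
                Motor.A.forward();
                Motor.C.forward();
                break;
            case 1:                      // left motor backward, right motor stopped
                Motor.A.backward();
                Motor.C.stop();
                break;
            case 2:                      // left motor stopped, right motor backward
                Motor.A.stop();
                Motor.C.backward();
                break;
        }
    }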
13
Rewards
- Either bump sensor pressed: 0.0
- Base reward
  - 1.0 if both motors are going forward
  - 0.5 otherwise
- Multiplier
  - Sonar value greater than 20 cm: 1
  - Otherwise, (sonar value) / 20
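Putting those pieces together, a minimal sketch of the reward computation as described on this slide (parameter names are assumptions):

    // Reward: 0 on any bump; otherwise a base reward (1.0 when both motors
    // drive forward, 0.5 otherwise) scaled down when the sonar reads under 20 cm.
    static double reward(boolean leftBump, boolean rightBump,
                         boolean bothMotorsForward, double sonarCm) {
        if (leftBump || rightBump) return 0.0;
        double base = bothMotorsForward ? 1.0 : 0.5;
        double multiplier = (sonarCm > 20) ? 1.0 : sonarCm / 20.0;
        return base * multiplier;
    }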
14
Parameters
- Discount (γ): 0.5
- Learning rate (α): 1/(1 + (t/100)), where t is the current iteration (time step)
  - Used for both the SOM and Q-Learning [Smith 2002]
- Exploration/exploitation
  - Epsilon = α/4: probability of a random action
  - Selected using a weighted distribution
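A minimal sketch of these schedules. The slide does not spell out how the "weighted distribution" weights the candidate actions, so the uniform random pick below is only a stand-in for it.

    // Schedules from the slide: alpha(t) = 1 / (1 + t/100), epsilon(t) = alpha(t) / 4.
    static double alpha(int t) {
        return 1.0 / (1.0 + (t / 100.0));
    }

    static double epsilon(int t) {
        return alpha(t) / 4.0;
    }

    // With probability epsilon take an exploratory action (the talk uses a
    // weighted distribution; a uniform pick stands in for it here), otherwise
    // take the greedy action from the Q-table sketched earlier.
    static int selectAction(QTable q, int state, int t, java.util.Random rng) {
        if (rng.nextDouble() < epsilon(t)) {
            return rng.nextInt(3);       // three actions in these experiments
        }
        return q.bestAction(state);
    }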
15
Experimental Controls
- Q-Learning without SOM
- Qa states
  - Current action (1-3)
  - Current bumper states
  - Quantized sonar values (0-19 cm; 20-39; 40+)
- Qb states
  - Current bumper states
  - Quantized sonar values (9) (0-11 cm…; 84-95; 96+)
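The Qa components multiply out to the 36 states mentioned on the experiments slide (3 actions × 4 bumper combinations × 3 sonar buckets), as do Qb's (4 × 9). The sketch below shows one plausible Qa encoding, reusing the hypothetical sonarBucket and tupleToIndex helpers from above.

    // Qa state: 3 actions x 4 bumper combinations x 3 sonar buckets = 36 states.
    // Assumes actions are numbered 0-2.
    static int qaState(int lastAction, boolean leftBump, boolean rightBump,
                       double sonarCm) {
        int bumpers = (leftBump ? 2 : 0) + (rightBump ? 1 : 0);
        int[] values = { lastAction, bumpers, sonarBucket(sonarCm) };
        int[] sizes  = { 3, 4, 3 };
        return tupleToIndex(values, sizes);
    }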
16
SOM Formulations
- 36 output nodes
- Category "a": length-5 input vectors
  - Motor speeds, bumper values, sonar value
- Category "b": length-3 input vectors
  - Bumper values, sonar value
- All sensor values normalized to [0-100]
17
SOM Formulations
- QSOM: based on [Smith 2002]
  - Gaussian neighborhood
  - Neighborhood size is one-half the SOM width
- QT: based on [Touzet 1997]
  - Learning rate is fixed at 0.9
  - Neighborhood is the immediate Manhattan neighbors
  - Neighbor learning rate is 0.4
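Both schemes fit the NeighborhoodFunction hook sketched earlier. The grid-coordinate arithmetic below (nodes laid out row by row on a width-wide grid) is an assumption, and the slide does not specify the Gaussian width schedule beyond "one-half the SOM width".

    // QSOM-style: Gaussian falloff with distance between grid positions.
    // Node indices map to grid coordinates as (index % width, index / width).
    static double gaussianNeighborhood(int winner, int node, int width, double sigma) {
        int dx = (winner % width) - (node % width);
        int dy = (winner / width) - (node / width);
        double distSq = dx * dx + dy * dy;
        return Math.exp(-distSq / (2 * sigma * sigma));
    }

    // QT-style: only the winner and its immediate Manhattan neighbors update.
    // With the fixed rates from the slide (0.9 winner, 0.4 neighbor), trainStep
    // can be called with alpha = 1.0 so these strengths act as absolute rates.
    static double manhattanNeighborhood(int winner, int node, int width) {
        int dist = Math.abs((winner % width) - (node % width))
                 + Math.abs((winner / width) - (node / width));
        if (dist == 0) return 0.9;
        if (dist == 1) return 0.4;
        return 0.0;
    }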
18
Quantitative Results (reward totals per 240-second run; "/It" rows are per iteration, over 800 iterations)

              Qa       Qb       QSOMa    QSOMb    QTa      QTb
    Mean      607.97   578.91   468.86   534.49   456.19   545.61
    StDv       81.92    76.95    39.39   160.41    85.07    57.98
    Median    608.75   667.5    485.11   587.64   442.62   560.77
    Min       506.47   528.67   410.2    354.25   378.72   481.55
    Max       723      540.55   495      661.59   547.22   594.5
    Mean/It     0.76     0.72     0.59     0.67     0.57     0.68
    StDv/It     0.1      0.1      0.05     0.2      0.11     0.07
19
Qualitative Results
- QSOMa
  - Motor speeds ranged from 2% to 50%
  - Sonar values stuck between 90% and 94%
- QSOMb
  - Sonar values ranged from 40% to 95%
  - Best two runs arguably the best of the bunch
- Very smooth SOM values in both cases
20
Qualitative Results
- QTa
  - Sonar values ranged from 10% to 100%
  - Still a weak performer on average
  - Best performer similar to QTb
- QTb
  - Developed bump-sensor-oriented behavior
  - Made little use of the sonar
- Highly uneven SOM values in both cases
21
Experimental Area
22
First Movie
- QSOMb
- Strong performer (reward: 661.89)
- Minimum sonar value: 43.35% (110 cm)
23
Second Movie
- Also QSOMb
- Typical bad performer (reward: 451.6)
  - Learns to avoid obstacles by always driving backwards
  - Baseline "not-forward" reward: 400.0
- Minimum sonar value: 57.51% (146 cm)
- Hindered by the small filming area
24
Discussion
- Use of a SOM on the NXT can be effective
  - More research needed to address shortcomings
- Heterogeneity of sensors is a problem
  - Need to try NXT experiments with multiple sonars
  - Previous work involved homogeneous sensors
- Approachable by undergraduate students
  - Technique taught in a junior/senior AI course