1
Policy Compression for MDPs
AA 228 December 5th, 2016 Kyle Julian
2
Motivation
MDPs can solve a variety of problems
Many MDP solutions result in large lookup tables
Implementing an MDP solution on limited hardware may be intractable
Need to compress the solution without loss of performance
3
Outline
Policy Compression for Aircraft Collision Avoidance Systems
  Problem formulation
  Neural network compression
  Results
Neural Network Guidance for UAVs
  Neural network compression
Conclusions
4
Background – ACAS Xu
ACAS X: aircraft collision avoidance system optimized through Markov decision processes
ACAS Xu: UAV version of ACAS X, issuing horizontal advisories
Seven discretized state dimensions: ρ, θ, ψ, v_own, v_int, τ, and the previous advisory
120 million possible states
Five possible actions (heading rates in deg/s): ±3, ±1.5, Clear-of-Conflict (COC)
MDP solution: table of score values (Q) for each state-action pair
Scores represent the cost of taking each action in a given state; the system takes the action with the lowest cost
M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust airborne collision avoidance through dynamic programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.
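To make the table-lookup policy concrete, here is a minimal sketch of action selection from a score table; the toy state grid and random scores are placeholders, not the actual ACAS Xu table.

```python
import numpy as np

# Toy stand-in for the score table: one row of five costs per discrete state.
# The real ACAS Xu table spans ~120 million states over seven dimensions.
ACTIONS = ["-3 deg/s", "-1.5 deg/s", "COC", "+1.5 deg/s", "+3 deg/s"]

rng = np.random.default_rng(0)
q_table = rng.random((1000, len(ACTIONS)))   # hypothetical 1000-state table

def best_advisory(state_index: int) -> str:
    """Return the advisory with the lowest cost for a discretized state."""
    costs = q_table[state_index]             # five scores for this state
    return ACTIONS[int(np.argmin(costs))]    # system takes the lowest-cost action

print(best_advisory(42))
```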
5
Problem Formulation
The seven-dimensional table holds 600 million Q values
2.4 GB of floats: too large for many certified avionics systems
Must compress the table: neural network compression
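The 2.4 GB figure follows directly from the table size, assuming each score is stored as a 4-byte float:

```latex
\[
6 \times 10^{8}\ \text{values} \times 4\ \tfrac{\text{bytes}}{\text{value}}
  = 2.4 \times 10^{9}\ \text{bytes} \approx 2.4\ \text{GB}
\]
```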
6
Neural Network Compression - Overview
Table representation → neural network representation: Q = f(state)
Input: state variables; output: score estimates
Only the parameters of the network need to be stored, reducing the required storage
7
Neural Network Compression – Neural Networks
1) Initialize weights: random, Gaussian
2) Forward pass: feed inputs and compute outputs
3) Loss function: error between network output and truth
4) Back-propagate error: gradient descent methods
5) Update weights: repeat the process
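The five steps above are the standard supervised-learning loop. As a generic illustration (plain NumPy, independent of the Keras setup used later in this work), a toy regression network can be trained like this:

```python
import numpy as np

# Minimal sketch of the five steps on a toy one-hidden-layer network;
# the data, sizes, and learning rate are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 7))                 # toy inputs (7 state variables)
Y = rng.normal(size=(256, 5))                 # toy targets (5 score values)

# 1) Initialize weights randomly (Gaussian)
W1 = rng.normal(scale=0.1, size=(7, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 5)); b2 = np.zeros(5)
lr = 1e-2

for step in range(1000):
    # 2) Forward pass: feed inputs and compute outputs
    h = np.maximum(0.0, X @ W1 + b1)          # ReLU hidden layer
    pred = h @ W2 + b2
    # 3) Loss function: error between network output and truth (MSE here)
    err = pred - Y
    loss = np.mean(err ** 2)
    # 4) Back-propagate the error to get gradients
    d_pred = 2.0 * err / err.size
    dW2 = h.T @ d_pred;            db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (h > 0)
    dW1 = X.T @ d_h;               db1 = d_h.sum(axis=0)
    # 5) Update weights by gradient descent and repeat
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```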
8
Neural Network Compression - Key Decisions
State variables as inputs
Size of network: ~600,000 total parameters; more than 6 hidden layers gives no extra benefit
Optimizer: tried five different optimizers; AdaMax performed best
Architecture:
  Fully connected layer: 128x7, ReLU activation
  Fully connected layer: 512x128, ReLU activation
  Fully connected layer: 512x512, ReLU activation
  Fully connected layer: 128x512, ReLU activation
  Fully connected layer: 128x128, ReLU activation
  Fully connected layer: 128x128, ReLU activation
  Output layer: 5x128 → Q values
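In Keras (the library named later in this deck), the layer stack listed above can be written as a short Sequential model; the optimizer string and placeholder loss are illustrative, with the actual loss discussed on the next slide.

```python
from keras.models import Sequential
from keras.layers import Dense

# Sketch of the architecture listed on this slide: 7 inputs, 6 ReLU hidden
# layers, 5 outputs (one Q estimate per advisory).
model = Sequential([
    Dense(128, activation='relu', input_dim=7),
    Dense(512, activation='relu'),
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(5),                       # Q-value estimates for the 5 actions
])
model.compile(optimizer='adamax',   # AdaMax performed best of the optimizers tried
              loss='mse')           # placeholder; the asymmetric loss is covered next
model.summary()
```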
9
Neural Network Compression - Loss Function
Need accurate Q estimates while maintaining optimal actions
MSE: fails to maintain optimal actions
Categorical cross-entropy: fails to maintain Q values
Solution: asymmetric MSE, which encourages separation between the optimal action and suboptimal actions
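The slide does not give the exact form of the asymmetric MSE, but the idea can be sketched as follows: since scores are costs, errors that make the optimal (lowest-cost) action look worse, or a suboptimal action look better, are weighted more heavily than harmless errors. The penalty factor below is an arbitrary illustrative choice.

```python
import numpy as np

def asymmetric_mse(q_true, q_pred, penalty=4.0):
    """Sketch of an asymmetric MSE; the weighting in the original work may differ."""
    err = q_pred - q_true
    optimal = q_true == q_true.min(axis=1, keepdims=True)   # lowest-cost action per state
    # Harmful errors: over-estimating the optimal action's cost or
    # under-estimating a suboptimal action's cost (either can flip the policy).
    harmful = (optimal & (err > 0)) | (~optimal & (err < 0))
    weights = np.where(harmful, penalty, 1.0)
    return float(np.mean(weights * err ** 2))
```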
10
Neural Network Compression – Implementation
Implemented in Python using Keras* with Theano†
Trained on TITAN X GPUs
Training data is normalized and shuffled
Batch size: 2^16
Trained for 1200 epochs, requiring about 4 days
* F. Chollet. (2016). Keras: Deep learning library for Theano and TensorFlow. [Online]. Available: keras.io
† Theano Development Team. (2016). Theano: A Python framework for fast computation of mathematical expressions.
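A training call with the settings listed above might look like the sketch below; it assumes the `model` object from the earlier architecture sketch, and the random arrays stand in for the real (state, score) pairs extracted from the table.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(100_000, 7)).astype('float32')    # placeholder training states
q_values = rng.normal(size=(100_000, 5)).astype('float32')  # placeholder table scores

# Normalize inputs before training
states = (states - states.mean(axis=0)) / states.std(axis=0)

model.fit(states, q_values,
          batch_size=2**16,   # very large batches, trained on TITAN X GPUs
          epochs=1200,        # the 2016 Keras 1.x API called this nb_epoch
          shuffle=True)
```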
11
Results – Policy Plots
Top-down view of a 90° encounter, τ = 0 s, previous advisory = -1.5 deg/s
The MDP table uses nearest-neighbor interpolation; the neural network is a smooth representation
12
Results – Policy Plots
Top-down view of a head-on encounter, τ = 60 s, previous advisory = COC
The neural network represents the original table well
13
Results – Simulation
Simulated on a set of 1.5 million encounters
p(NMAC): probability of a near midair collision
p(Alert): probability the system will give an alert
p(Reversal): probability of reversing the advisory direction
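Given per-encounter outcome flags from the simulator, the three metrics are just empirical frequencies; the record format below is a hypothetical stand-in for the actual simulation output.

```python
def summarize(results):
    """Estimate the three metrics from a list of per-encounter outcome flags."""
    n = len(results)
    return {
        'p(NMAC)':     sum(r['nmac'] for r in results) / n,      # near midair collision occurred
        'p(Alert)':    sum(r['alert'] for r in results) / n,     # any advisory other than COC issued
        'p(Reversal)': sum(r['reversal'] for r in results) / n,  # advisory direction was reversed
    }

# Toy usage with made-up outcomes
print(summarize([{'nmac': False, 'alert': True,  'reversal': False},
                 {'nmac': False, 'alert': False, 'reversal': False}]))
```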
14
Results - Example Encounter
The network does not need to interpolate Q values
The neural network alerts earlier than the table; this small difference grows larger over time
Able to avoid the intruder aircraft
15
Neural Network Guidance for UAVs - Background
Want to navigate to a waypoint and arrive with a specified heading and bank angle
Five discretized state dimensions (position, heading, speed, bank angle); redundancy in the states is reduced
Two possible actions: Δφ = ±5°
Reward: -1 if not at the waypoint with the desired heading and bank angle, 0 otherwise
Transitions: assume steady, level flight; actions taken at 10 Hz; position propagated assuming Gaussian noise in velocity and bank angle; if the UAV is at the waypoint with the desired heading and bank angle, it does not move
Solution: 26 million state-action values
M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust airborne collision avoidance through dynamic programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.
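A rough sketch of the reward and the noisy steady-level-flight transition described above is given below; the noise scales, goal handling, and exact state layout are illustrative guesses rather than the project's actual model.

```python
import numpy as np

DT = 0.1       # actions are taken at 10 Hz
G = 9.81       # m/s^2

def reward(at_goal: bool) -> float:
    # -1 every step until the UAV is at the waypoint with the desired
    # heading and bank angle; 0 afterwards (the goal state is absorbing).
    return 0.0 if at_goal else -1.0

def step(x, y, heading, v, bank, d_bank, rng):
    """Propagate one 0.1 s step with Gaussian noise in speed and bank angle."""
    bank = bank + d_bank                                   # apply the +/-5 degree action
    v_noisy = v + rng.normal(scale=0.5)                    # assumed speed noise
    bank_noisy = bank + rng.normal(scale=np.radians(1.0))  # assumed bank-angle noise
    turn_rate = G * np.tan(bank_noisy) / v_noisy           # steady, level coordinated turn
    heading = heading + turn_rate * DT
    x = x + v_noisy * np.cos(heading) * DT
    y = y + v_noisy * np.sin(heading) * DT
    return x, y, heading, v, bank

# Toy usage: one step of a +5 degree bank command
print(step(0.0, 0.0, 0.0, 20.0, 0.0, np.radians(5.0), np.random.default_rng(0)))
```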
16
MDP Solution
Can start from any position
Smooth, flyable trajectories to the waypoint
Complex trajectories are parameterized by waypoints
17
Problem Formulation
The table requires 112 MB of memory
The 3DR Pixhawk has 256 KB of memory, of which only about 10 KB is actually available
Need a compression factor of over 10,000
Train a neural network to predict the best action: classification, not regression
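The required compression factor follows from the two memory figures on this slide:

```latex
\[
\frac{112\ \text{MB}}{10\ \text{KB}}
  = \frac{112{,}000\ \text{KB}}{10\ \text{KB}}
  \approx 1.1 \times 10^{4}
\]
```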
18
Neural Network Compression
State variables as inputs
Size of network: ~1,400 total parameters, ~5.6 KB in memory
Tried different combinations of layer counts and layer sizes: 4 hidden layers of 20 perceptrons each
Cross-entropy loss; softmax converts outputs to probabilities; training labels are one-hot vectors [0,1] or [1,0]
Optimizer: AdaMax
Architecture:
  Fully connected layer: 20x5, ReLU activation
  Fully connected layer: 20x20, ReLU activation
  Fully connected layer: 20x20, ReLU activation
  Fully connected layer: 20x20, ReLU activation
  Fully connected layer: 20x2, softmax
  Output: 2 values, the probability that each action is optimal
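The small classifier described above can be sketched in Keras in a few lines; apart from the layer sizes, activations, loss, and optimizer listed on the slide, the details are illustrative.

```python
from keras.models import Sequential
from keras.layers import Dense

# 5 state inputs, 4 ReLU hidden layers of 20 units, softmax over the 2 actions.
policy_net = Sequential([
    Dense(20, activation='relu', input_dim=5),
    Dense(20, activation='relu'),
    Dense(20, activation='relu'),
    Dense(20, activation='relu'),
    Dense(2, activation='softmax'),          # P(each bank-angle action is optimal)
])
policy_net.compile(optimizer='adamax',
                   loss='categorical_crossentropy')   # one-hot training labels
policy_net.summary()    # ~1,400 parameters, only a few KB of weights
```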
19
Results – Policy Plots
The neural network policy matches the original MDP policy very well
20
Neural Network Trajectories
Simulated trajectories of the neural network are almost identical to the MDP trajectories
21
Performance
Implemented neural network guidance on a custom UAV
Flies well in calm or windy conditions
The experimental flight in calm conditions is 1.3% slower than the simulated flight
22
Conclusions
Implementing MDP solutions on real systems may require compressed policies
Neural networks can be trained to represent state-action values or policies
Compression by factors in the thousands without performance loss
Neural networks can be incorporated within limited-memory systems
23
Questions? Kyle Julian