1
Policy Compression for MDPs
AA 228 December 5th, 2016 Kyle Julian
2
Motivation
MDPs can solve a variety of problems
Many MDP solutions result in large lookup tables
Implementing an MDP solution on limited hardware may be intractable
Need to compress the solution without loss of performance
3
Outline
Policy Compression for Aircraft Collision Avoidance Systems
  Problem formulation
  Neural network compression
  Results
Neural Network Guidance for UAVs
  Neural network compression
Conclusions
4
Background – ACAS Xu
ACAS X: aircraft collision avoidance system optimized through Markov decision processes
ACAS Xu: UAV version of ACAS X, issuing horizontal advisories
Seven discretized state dimensions: ρ, θ, ψ, v_own, v_int, τ, and the previous advisory
120 million possible states
Five possible actions (heading rates in deg/s): ±3, ±1.5, Clear-of-Conflict (COC)
MDP solution: table of score values (Q) for each state-action pair
Scores represent the cost of taking each action in a given state; the system takes the action with the lowest cost
M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust airborne collision avoidance through dynamic programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.
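To make the table-lookup policy concrete, here is a minimal sketch of action selection from a score table; the toy state grid and random scores are placeholders, not the actual ACAS Xu table.

```python
import numpy as np

# Toy stand-in for the score table: one row of five costs per discrete state.
# The real ACAS Xu table spans ~120 million states over seven dimensions.
ACTIONS = ["-3 deg/s", "-1.5 deg/s", "COC", "+1.5 deg/s", "+3 deg/s"]

rng = np.random.default_rng(0)
q_table = rng.random((1000, len(ACTIONS)))   # hypothetical 1000-state table

def best_advisory(state_index: int) -> str:
    """Return the advisory with the lowest cost for a discretized state."""
    costs = q_table[state_index]             # five scores for this state
    return ACTIONS[int(np.argmin(costs))]    # system takes the lowest-cost action

print(best_advisory(42))
```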
5
Problem Formulation
The seven-dimensional table holds 600 million Q values
2.4 GB of floats: too large for many certified avionics systems
Must compress the table: neural network compression
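The 2.4 GB figure follows directly from the table size, assuming each score is stored as a 4-byte float:

```latex
\[
6 \times 10^{8}\ \text{values} \times 4\ \tfrac{\text{bytes}}{\text{value}}
  = 2.4 \times 10^{9}\ \text{bytes} \approx 2.4\ \text{GB}
\]
```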
6
Neural Network Compression - Overview
Table representation → neural network representation: Q = f(state)
Input: state variables; output: score estimates
Only the parameters of the network need to be stored, reducing the required storage
7
Neural Network Compression – Neural Networks
1) Initialize weights: random, Gaussian
2) Forward pass: feed inputs and compute outputs
3) Loss function: error between network output and truth
4) Back-propagate error: gradient descent methods
5) Update weights: repeat the process
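The five steps above are the standard supervised-learning loop. As a generic illustration (plain NumPy, independent of the Keras setup used later in this work), a toy regression network can be trained like this:

```python
import numpy as np

# Minimal sketch of the five steps on a toy one-hidden-layer network;
# the data, sizes, and learning rate are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 7))                 # toy inputs (7 state variables)
Y = rng.normal(size=(256, 5))                 # toy targets (5 score values)

# 1) Initialize weights randomly (Gaussian)
W1 = rng.normal(scale=0.1, size=(7, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 5)); b2 = np.zeros(5)
lr = 1e-2

for step in range(1000):
    # 2) Forward pass: feed inputs and compute outputs
    h = np.maximum(0.0, X @ W1 + b1)          # ReLU hidden layer
    pred = h @ W2 + b2
    # 3) Loss function: error between network output and truth (MSE here)
    err = pred - Y
    loss = np.mean(err ** 2)
    # 4) Back-propagate the error to get gradients
    d_pred = 2.0 * err / err.size
    dW2 = h.T @ d_pred;            db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (h > 0)
    dW1 = X.T @ d_h;               db1 = d_h.sum(axis=0)
    # 5) Update weights by gradient descent and repeat
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```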
8
Neural Network Compression - Key Decisions
State variables as inputs
Size of network: ~600,000 total parameters; more than 6 hidden layers gives no extra benefit
Optimizer: tried five different optimizers; AdaMax performed best
Architecture:
  Fully connected layer: 128x7, ReLU activation
  Fully connected layer: 512x128, ReLU activation
  Fully connected layer: 512x512, ReLU activation
  Fully connected layer: 128x512, ReLU activation
  Fully connected layer: 128x128, ReLU activation
  Fully connected layer: 128x128, ReLU activation
  Output layer: 5x128 → Q values
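In Keras (the library named later in this deck), the layer stack listed above can be written as a short Sequential model; the optimizer string and placeholder loss are illustrative, with the actual loss discussed on the next slide.

```python
from keras.models import Sequential
from keras.layers import Dense

# Sketch of the architecture listed on this slide: 7 inputs, 6 ReLU hidden
# layers, 5 outputs (one Q estimate per advisory).
model = Sequential([
    Dense(128, activation='relu', input_dim=7),
    Dense(512, activation='relu'),
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(5),                       # Q-value estimates for the 5 actions
])
model.compile(optimizer='adamax',   # AdaMax performed best of the optimizers tried
              loss='mse')           # placeholder; the asymmetric loss is covered next
model.summary()
```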
9
Neural Network Compression - Loss Function
Need accurate Q estimates while maintaining optimal actions
MSE: fails to maintain optimal actions
Categorical cross-entropy: fails to maintain Q values
Solution: asymmetric MSE, which encourages separation between the optimal action and suboptimal actions
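The slide does not give the exact form of the asymmetric MSE, but the idea can be sketched as follows: since scores are costs, errors that make the optimal (lowest-cost) action look worse, or a suboptimal action look better, are weighted more heavily than harmless errors. The penalty factor below is an arbitrary illustrative choice.

```python
import numpy as np

def asymmetric_mse(q_true, q_pred, penalty=4.0):
    """Sketch of an asymmetric MSE; the weighting in the original work may differ."""
    err = q_pred - q_true
    optimal = q_true == q_true.min(axis=1, keepdims=True)   # lowest-cost action per state
    # Harmful errors: over-estimating the optimal action's cost or
    # under-estimating a suboptimal action's cost (either can flip the policy).
    harmful = (optimal & (err > 0)) | (~optimal & (err < 0))
    weights = np.where(harmful, penalty, 1.0)
    return float(np.mean(weights * err ** 2))
```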
10
Neural Network Compression – Implementation
Implemented in Python using Keras* with Theano†
Trained on TITAN X GPUs
Training data is normalized and shuffled
Batch size: 2^16
Trained for 1200 epochs, requiring about 4 days
* F. Chollet. (2016). Keras: Deep learning library for Theano and TensorFlow. [Online]. Available: keras.io
† Theano Development Team. (2016). Theano: A Python framework for fast computation of mathematical expressions.
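A training call with the settings listed above might look like the sketch below; it assumes the `model` object from the earlier architecture sketch, and the random arrays stand in for the real (state, score) pairs extracted from the table.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(100_000, 7)).astype('float32')    # placeholder training states
q_values = rng.normal(size=(100_000, 5)).astype('float32')  # placeholder table scores

# Normalize inputs before training
states = (states - states.mean(axis=0)) / states.std(axis=0)

model.fit(states, q_values,
          batch_size=2**16,   # very large batches, trained on TITAN X GPUs
          epochs=1200,        # the 2016 Keras 1.x API called this nb_epoch
          shuffle=True)
```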
11
Results – Policy Plots
Top-down view of a 90° encounter, τ = 0 s, previous advisory = -1.5 deg/s
The MDP table uses nearest-neighbor interpolation; the neural network is a smooth representation
12
Results – Policy Plots
Top-down view of a head-on encounter, τ = 60 s, previous advisory = COC
The neural network represents the original table well
13
Results – Simulation
Simulated on a set of 1.5 million encounters
p(NMAC): probability of a near midair collision
p(Alert): probability the system will give an alert
p(Reversal): probability of reversing the advisory direction
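Given per-encounter outcome flags from the simulator, the three metrics are just empirical frequencies; the record format below is a hypothetical stand-in for the actual simulation output.

```python
def summarize(results):
    """Estimate the three metrics from a list of per-encounter outcome flags."""
    n = len(results)
    return {
        'p(NMAC)':     sum(r['nmac'] for r in results) / n,      # near midair collision occurred
        'p(Alert)':    sum(r['alert'] for r in results) / n,     # any advisory other than COC issued
        'p(Reversal)': sum(r['reversal'] for r in results) / n,  # advisory direction was reversed
    }

# Toy usage with made-up outcomes
print(summarize([{'nmac': False, 'alert': True,  'reversal': False},
                 {'nmac': False, 'alert': False, 'reversal': False}]))
```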
14
Results - Example Encounter
The network does not need to interpolate Q values
The neural network alerts earlier than the table; this small difference grows larger over time
Able to avoid the intruder aircraft
15
Neural Network Guidance for UAVs - Background
Want to navigate to a waypoint and arrive with a specified heading and bank angle
Five discretized state dimensions (position, heading, speed, bank angle); redundancy in the states is reduced
Two possible actions: Δφ = ±5°
Reward: -1 if not at the waypoint with the desired heading and bank angle, 0 otherwise
Transitions: assume steady, level flight; actions taken at 10 Hz; position propagated assuming Gaussian noise in velocity and bank angle; if the UAV is at the waypoint with the desired heading and bank angle, it does not move
Solution: 26 million state-action values
M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust airborne collision avoidance through dynamic programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.
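A rough sketch of the reward and the noisy steady-level-flight transition described above is given below; the noise scales, goal handling, and exact state layout are illustrative guesses rather than the project's actual model.

```python
import numpy as np

DT = 0.1       # actions are taken at 10 Hz
G = 9.81       # m/s^2

def reward(at_goal: bool) -> float:
    # -1 every step until the UAV is at the waypoint with the desired
    # heading and bank angle; 0 afterwards (the goal state is absorbing).
    return 0.0 if at_goal else -1.0

def step(x, y, heading, v, bank, d_bank, rng):
    """Propagate one 0.1 s step with Gaussian noise in speed and bank angle."""
    bank = bank + d_bank                                   # apply the +/-5 degree action
    v_noisy = v + rng.normal(scale=0.5)                    # assumed speed noise
    bank_noisy = bank + rng.normal(scale=np.radians(1.0))  # assumed bank-angle noise
    turn_rate = G * np.tan(bank_noisy) / v_noisy           # steady, level coordinated turn
    heading = heading + turn_rate * DT
    x = x + v_noisy * np.cos(heading) * DT
    y = y + v_noisy * np.sin(heading) * DT
    return x, y, heading, v, bank

# Toy usage: one step of a +5 degree bank command
print(step(0.0, 0.0, 0.0, 20.0, 0.0, np.radians(5.0), np.random.default_rng(0)))
```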
16
MDP Solution
Can start from any position
Smooth, flyable trajectories to the waypoint
Complex trajectories are parameterized by waypoints
17
Problem Formulation
The table requires 112 MB of memory
The 3DR Pixhawk has 256 KB of memory, of which only about 10 KB is actually available
Need a compression factor of over 10,000
Train a neural network to predict the best action: classification, not regression
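The required compression factor follows from the two memory figures on this slide:

```latex
\[
\frac{112\ \text{MB}}{10\ \text{KB}}
  = \frac{112{,}000\ \text{KB}}{10\ \text{KB}}
  \approx 1.1 \times 10^{4}
\]
```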
18
Neural Network Compression
State variables as inputs
Size of network: ~1,400 total parameters, ~5.6 KB in memory
Tried different combinations of layer counts and layer sizes: 4 hidden layers of 20 perceptrons each
Cross-entropy loss; softmax converts outputs to probabilities; training labels are one-hot vectors [0,1] or [1,0]
Optimizer: AdaMax
Architecture:
  Fully connected layer: 20x5, ReLU activation
  Fully connected layer: 20x20, ReLU activation
  Fully connected layer: 20x20, ReLU activation
  Fully connected layer: 20x20, ReLU activation
  Fully connected layer: 20x2, softmax
  Output: 2 values, the probability that each action is optimal
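The small classifier described above can be sketched in Keras in a few lines; apart from the layer sizes, activations, loss, and optimizer listed on the slide, the details are illustrative.

```python
from keras.models import Sequential
from keras.layers import Dense

# 5 state inputs, 4 ReLU hidden layers of 20 units, softmax over the 2 actions.
policy_net = Sequential([
    Dense(20, activation='relu', input_dim=5),
    Dense(20, activation='relu'),
    Dense(20, activation='relu'),
    Dense(20, activation='relu'),
    Dense(2, activation='softmax'),          # P(each bank-angle action is optimal)
])
policy_net.compile(optimizer='adamax',
                   loss='categorical_crossentropy')   # one-hot training labels
policy_net.summary()    # ~1,400 parameters, only a few KB of weights
```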
19
Results – Policy Plots
The neural network policy matches the original MDP policy very well
20
Neural Network Trajectories
Simulated trajectories of the neural network are almost identical to the MDP trajectories
21
Performance
Implemented neural network guidance on a custom UAV
Flies well in calm or windy conditions
The experimental flight in calm conditions is 1.3% slower than the simulated flight
22
Conclusions
Implementing MDP solutions on real systems may require compressed policies
Neural networks can be trained to represent state-action values or policies
Compression by factors in the thousands without performance loss
Neural networks can be incorporated within limited-memory systems
23
Questions? Kyle Julian