Policy Compression for MDPs


1 Policy Compression for MDPs
AA 228 December 5th, 2016 Kyle Julian

2 Motivation MDPs can solve a variety of problems
Many MDP solutions result in large lookup tables
Implementing an MDP solution on limited hardware might be intractable
Need to compress the solution without loss of performance

3 Outline Policy Compression for Aircraft Collision Avoidance Systems
Problem formulation
Neural network compression
Results
Neural Network Guidance for UAVs
Neural network compression
Conclusions

4 Background – ACAS Xu
ACAS X: aircraft collision avoidance system optimized through Markov decision processes
ACAS Xu: UAV version of ACAS X with horizontal advisories
Seven discretized state dimensions: ρ, θ, ψ, v_own, v_int, τ, a_prev
120 million possible states
Five possible actions a (heading rates in deg/s): ±3, ±1.5, Clear-of-Conflict (COC)
MDP solution: table of score values (Q) for each state-action pair
Scores represent the cost of taking that action for a given state; the system takes the action with lowest cost (see the sketch after this list)
M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust airborne collision avoidance through dynamic programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.
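As a toy illustration of that lowest-cost lookup (a minimal sketch; the table shape and flat state indexing below are placeholders, not the real ACAS Xu layout):

```python
import numpy as np

# Hypothetical Q table: one row per discretized state, one column per action.
# Scores are costs, so the best advisory is the one with the LOWEST score.
ACTIONS = ["COC", "+1.5 deg/s", "-1.5 deg/s", "+3.0 deg/s", "-3.0 deg/s"]
q_table = np.random.rand(1_000, len(ACTIONS))   # placeholder scores

def best_action(state_index: int) -> str:
    """Return the lowest-cost advisory for a discretized state."""
    return ACTIONS[int(np.argmin(q_table[state_index]))]

print(best_action(42))
```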

5 Problem Formulation
Seven-dimensional table has 600 million Q values
2.4 GB of floats
Too large for many certified avionics systems
Must compress the table → Neural Network Compression
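The 2.4 GB figure follows directly from the table size, assuming 32-bit floats:

$$ 600 \times 10^{6}\ \text{values} \times 4\ \text{bytes/value} = 2.4 \times 10^{9}\ \text{bytes} \approx 2.4\ \text{GB}. $$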

6 Neural Network Compression - Overview
Replace the table representation with a neural network representation: Q = f(state)
Input: state variables; Output: score estimates
Only the parameters of the network need to be stored, reducing required storage

7 Neural Network Compression – Neural Networks
1) Initialize weights: random, Gaussian
2) Forward pass: feed inputs and compute outputs
3) Loss function: error between network output and truth
4) Back-propagate error: gradient descent methods
5) Update weights: repeat the process
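As a minimal sketch of those five steps (plain NumPy, a toy regression target, and made-up layer sizes; nothing here is specific to ACAS Xu):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sum of the inputs (stand-in for Q-value regression).
X = rng.normal(size=(256, 7))
y = X.sum(axis=1, keepdims=True)

# 1) Initialize weights (random, Gaussian).
W1 = rng.normal(scale=0.1, size=(7, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 1))
b2 = np.zeros(1)

lr = 1e-2
for epoch in range(500):
    # 2) Forward pass: feed inputs and compute outputs.
    h = np.maximum(0.0, X @ W1 + b1)          # ReLU hidden layer
    y_hat = h @ W2 + b2

    # 3) Loss function: error between network output and truth (MSE here).
    loss = np.mean((y_hat - y) ** 2)

    # 4) Back-propagate the error (gradients of the MSE).
    d_out = 2.0 * (y_hat - y) / len(X)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (h > 0)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # 5) Update weights with gradient descent, then repeat.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final MSE: {loss:.4f}")
```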

8 Neural Network Compression - Key Decisions
Input: state variables
Size of network: total parameters ~600,000; more than 6 hidden layers gives no extra benefit
Optimizer: tried five different optimizers; AdaMax performed best
Architecture (a Keras sketch follows this list):
Fully Connected Layer: 128x7, Activation: ReLU
Fully Connected Layer: 512x128, Activation: ReLU
Fully Connected Layer: 512x512, Activation: ReLU
Fully Connected Layer: 128x512, Activation: ReLU
Fully Connected Layer: 128x128, Activation: ReLU
Fully Connected Layer: 128x128, Activation: ReLU
Output Layer: 5x128, Q values
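A minimal sketch of that layer stack in present-day Keras (the original work used Keras with the Theano backend and a custom loss; the plain MSE below is only a placeholder, and the next slide covers the asymmetric loss actually used):

```python
from tensorflow import keras
from tensorflow.keras import layers

# 7 state inputs -> six ReLU hidden layers -> 5 Q-value outputs,
# matching the layer sizes listed above.
model = keras.Sequential([
    keras.Input(shape=(7,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(5),                      # linear output: one score per action
])
model.compile(optimizer=keras.optimizers.Adamax(), loss="mse")
model.summary()
```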

9 Neural Network Compression - Loss Function
Need accurate Q estimates while maintaining optimal actions
MSE: fails to maintain optimal actions
Categorical cross-entropy: fails to maintain Q values
Solution: asymmetric MSE, which encourages separation between the optimal action and suboptimal actions (one possible form is sketched below)
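The slide does not give the exact weighting, so the NumPy function below is only one plausible form of an asymmetric MSE, assuming a single `penalty` factor that up-weights errors capable of flipping the chosen action (Q values are costs, lower is better):

```python
import numpy as np

def asymmetric_mse(q_true, q_pred, penalty=40.0):
    """Illustrative asymmetric MSE; not necessarily the exact form used.

    Errors that could change the selected (lowest-cost) action are
    up-weighted by `penalty`:
      - overestimating the cost of the optimal action
      - underestimating the cost of a suboptimal action
    """
    err = q_pred - q_true                               # (batch, n_actions)
    opt = np.argmin(q_true, axis=1)                     # optimal action per row
    is_opt = np.zeros_like(q_true, dtype=bool)
    is_opt[np.arange(len(q_true)), opt] = True

    weights = np.ones_like(err)
    weights[is_opt & (err > 0)] = penalty               # optimal action made to look worse
    weights[~is_opt & (err < 0)] = penalty              # suboptimal action made to look better
    return np.mean(weights * err ** 2)

q_true = np.array([[1.0, 2.0, 3.0]])
q_pred = np.array([[1.5, 1.8, 3.1]])
print(asymmetric_mse(q_true, q_pred))
```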

10 Neural Network Compression – Implementation
Implemented in Python using Keras* with Theano†
Trained on TITAN X GPUs
Training data is normalized and shuffled
Batch size: 2^16
Trained for 1200 training epochs; requires 4 days
* F. Chollet. (2016). Keras: Deep learning library for Theano and TensorFlow, [Online]. Available: keras.io
† Theano Development Team. (2016). Theano: A Python framework for fast computation of mathematical expressions
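A hedged sketch of what a training call with those hyperparameters could look like in present-day Keras (the data, model size, and epoch count below are placeholders so the snippet runs quickly; the real run used the full table and 1200 epochs):

```python
import numpy as np
from tensorflow import keras

# Placeholder data standing in for the normalized, shuffled training set.
X_train = np.random.rand(10_000, 7).astype("float32")   # state variables
Q_train = np.random.rand(10_000, 5).astype("float32")   # table scores

# Small stand-in model; the real architecture is on the earlier slide.
model = keras.Sequential([
    keras.Input(shape=(7,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(5),
])
model.compile(optimizer=keras.optimizers.Adamax(), loss="mse")

# Batch size 2**16 as on the slide; epochs shortened for illustration.
model.fit(X_train, Q_train, batch_size=2**16, epochs=3, shuffle=True)
```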

11 Results – Policy Plots
Top-down view of a 90° encounter, τ = 0 sec, a_prev = -1.5 deg/s
Policy plots: nearest-neighbor interpolation of the MDP table vs. the neural network
The neural network is a smooth representation

12 Results – Policy Plots
Top-down view of a head-on encounter, τ = 60 sec, a_prev = COC
The neural network represents the original table well

13 Results - Simulation
Simulated on a set of 1.5 million encounters
p(NMAC): probability of a near midair collision
p(Alert): probability the system will give an alert
p(Reversal): probability of reversing the advisory direction

14 Results - Example Encounter
Network does not need to interpolate Q values
Neural network alerts earlier than the table
Small difference grows larger over time
Able to avoid the intruder aircraft

15 Neural Network Guidance for UAVs - Background
Want to navigate to a waypoint and arrive with a desired heading and bank angle
Five discretized state dimensions: ρ, θ, ψ, v, φ (reduce redundancy in states)
Two possible actions: Δφ = ±5°
Reward: -1 if not at the waypoint with the desired heading and bank angle, 0 otherwise
Transitions (sketched below): assume steady-level flight; take an action at 10 Hz; propagate position assuming Gaussian noise in velocity and bank angle; if the UAV is at the waypoint with the desired heading and bank angle, don't move
Solution: 26 million state-action values
M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust airborne collision avoidance through dynamic programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.
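A rough sketch of those transition and reward definitions (state names follow the slide; the noise scales and the turn model are assumptions for illustration only):

```python
import numpy as np

DT = 0.1                        # actions are taken at 10 Hz
PHI_STEP = np.deg2rad(5.0)      # two actions: bank-angle change of +/- 5 degrees

def reward(at_goal: bool) -> float:
    # -1 per step until the waypoint is reached with the desired
    # heading and bank angle, 0 afterwards.
    return 0.0 if at_goal else -1.0

def step(x, y, psi, v, phi, action, rng, g=9.81):
    """Propagate one 0.1 s step under steady-level flight.

    `action` is +1 or -1 (bank-angle change of +/- 5 degrees). The Gaussian
    noise scales on velocity and bank angle are placeholders.
    """
    phi = phi + action * PHI_STEP + rng.normal(scale=np.deg2rad(1.0))
    v_noisy = v + rng.normal(scale=0.5)
    psi = psi + (g / v_noisy) * np.tan(phi) * DT    # coordinated-turn heading rate
    x = x + v_noisy * np.cos(psi) * DT
    y = y + v_noisy * np.sin(psi) * DT
    return x, y, psi, v, phi

rng = np.random.default_rng(0)
state = (0.0, 0.0, 0.0, 15.0, 0.0)                  # x, y, heading, speed, bank
state = step(*state, action=+1, rng=rng)
print(state, reward(at_goal=False))
```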

16 MDP Solution Can start from any position
Smooth, flyable trajectories to the waypoint
Complex trajectories are parameterized by waypoints

17 Problem Formulation Table requires 112 MB in memory
3DR Pixhawk has 256 KB of memory, but only about 10 KB is actually available
Need compression by a factor of over 10,000 (see the arithmetic below)
Train a neural network to predict the best action: classification, not regression
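The required factor follows directly from the memory budget, and the network described on the next slide (about 5.6 KB) comfortably exceeds it:

$$ \frac{112\ \text{MB}}{10\ \text{KB}} \approx \frac{1.12 \times 10^{8}\ \text{bytes}}{10^{4}\ \text{bytes}} = 11{,}200, \qquad \frac{112\ \text{MB}}{5.6\ \text{KB}} = 20{,}000. $$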

18 Neural Network Compression
Input: state variables
Size of network: total parameters ~1400 (~5.6 KB in memory)
Tried different combinations of numbers of layers and layer sizes; settled on 4 hidden layers of 20 perceptrons each
Cross-entropy loss; softmax converts outputs to probabilities; training labels are one-hot vectors, [0,1] or [1,0]
Optimizer: AdaMax
Architecture (a Keras sketch follows this list):
Fully Connected Layer: 20x5, Activation: ReLU
Fully Connected Layer: 20x20, Activation: ReLU
Fully Connected Layer: 20x20, Activation: ReLU
Fully Connected Layer: 20x20, Activation: ReLU
Fully Connected Layer: 20x2, Softmax
Output: 2 values, the probability that each action is optimal
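A minimal sketch of that classification network in present-day Keras (the original used Keras/Theano; only the layer sizes, softmax output, loss, and optimizer follow the list above, the rest is assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

# 5 state inputs -> 4 hidden layers of 20 ReLU units -> softmax over 2 actions.
policy_net = keras.Sequential([
    keras.Input(shape=(5,)),
    layers.Dense(20, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(2, activation="softmax"),   # P(each action is optimal)
])
policy_net.compile(
    optimizer=keras.optimizers.Adamax(),
    loss="categorical_crossentropy",         # labels are one-hot vectors
)
policy_net.summary()
```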

19 Results – Policy Plots
Neural network policy matches the original MDP policy very well

20 Neural Network Trajectories
Simulated trajectories of the neural network are almost identical to the MDP trajectories

21 Performance
Implemented neural network guidance in a custom UAV
Flies well in calm or windy conditions
Experimental flight in calm conditions is 1.3% slower than simulated flight

22 Conclusions
Implementing MDP solutions in real systems may require compressed policies
Neural networks can be trained to represent state-action values or policies
Compression by three to four orders of magnitude without performance loss
Neural networks can be incorporated within limited-memory systems

23 Questions? Kyle Julian

