Power-Efficient Machine Learning using FPGAs on POWER Systems
Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx
Join the Conversation: #OpenPOWERSummit
Top-5 Accuracy: Image Classification
[Chart: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*) top-5 accuracy by year**; human level ~95%***]
- CNNs far outperform non-AI methods
- CNNs deliver super-human accuracy
* http://image-net.org/challenges/LSVRC/
** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
*** Russakovsky et al., 2014, http://arxiv.org/pdf/1409.0575.pdf
CNNs Explained
The Computation
Convolution Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights
Convolution Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights
Convolution Continue along the row ...
Convolution Before moving down to the next row
Convolution The first output feature map is complete
Convolution Move onto the next output feature map by switching weights, and repeat
Convolution The pattern repeats as before: same input sub-volumes, different weights
Convolution Complete the second output feature map plane
Convolution Finally, after all 256 weight sets have been used, the full output volume of 256 feature maps is complete
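A minimal loop-nest sketch of the convolution just walked through (illustrative sizes matching the slides: 3x3 kernels over 384 input feature maps, 256 output maps; the 13x13 spatial size is an assumption, and this is plain NumPy, not the FPGA implementation):

    import numpy as np

    # Illustrative dimensions from the walkthrough: 384 input feature maps,
    # 3x3 kernels, 256 output feature maps; 13x13 input chosen arbitrarily.
    C_IN, K, C_OUT, H, W = 384, 3, 256, 13, 13

    x = np.random.randn(C_IN, H, W).astype(np.float32)
    wts = np.random.randn(C_OUT, C_IN, K, K).astype(np.float32)
    out = np.zeros((C_OUT, H - K + 1, W - K + 1), dtype=np.float32)

    for oc in range(C_OUT):              # switch weight sets per output map
        for oy in range(H - K + 1):      # move down to the next row
            for ox in range(W - K + 1):  # continue along the row
                # each output pixel consumes a 3x3x384 input sub-volume
                window = x[:, oy:oy + K, ox:ox + K]
                out[oc, oy, ox] = np.sum(window * wts[oc])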
Fully Connected Layers
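For contrast with convolution, a fully connected layer is a single matrix-vector product in which every weight is read exactly once per input vector; a minimal sketch with illustrative AlexNet-like sizes:

    import numpy as np

    # Illustrative AlexNet-like FC layer: 4096 inputs -> 4096 outputs.
    x = np.random.randn(4096).astype(np.float32)
    w = np.random.randn(4096, 4096).astype(np.float32)  # ~16.8M weights
    b = np.zeros(4096, dtype=np.float32)

    y = w @ x + b  # every weight is read exactly once per input vector
    # ~16.8M MACs but also ~16.8M weight reads: memory-bound at batch size 1

With no weight reuse at batch size 1, FC layers are limited by memory bandwidth rather than compute, which is the point of the next slide.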
CNN Properties
- Compute: dominated by convolution (CONV) layers
- Memory bandwidth: dominated by fully-connected (FC) layers
[Charts: GOPs per layer (compute) and G reads per layer (memory access)]
Source: Yu Wang, Tsinghua University, Feb 2016
Humans vs Machines
Humans are six orders of magnitude more energy-efficient than machines*
* vs. IBM Watson, ca. 2012
Source: Yu Wang, Tsinghua University, Feb 2016
Cost of Computation
- Stay in on-chip memory (1/100x power)
- Use smaller multipliers (8-bit vs 32-bit: 1/16x power)
- Fixed-point vs float (don't waste bits on dynamic range)
Source: William Dally, "High Performance Hardware for Machine Learning," Cadence ENN Summit, 2/9/2016
Improving Machine Efficiency
- Model pruning
- Right-sizing precision
- Custom CNN processor architecture
Pruning Elements
- Remove low-contribution weights (synapses)
- Retrain remaining weights
Source: Han et al., "Learning both Weights and Connections for Efficient Neural Networks," http://arxiv.org/pdf/1506.02626v3.pdf
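A minimal sketch of magnitude-based pruning in the spirit of Han et al. (illustrative NumPy; the paper's full retraining loop is reduced here to a single masked update):

    import numpy as np

    def prune_by_magnitude(w, sparsity):
        """Zero the smallest-magnitude weights; return pruned weights and mask."""
        thresh = np.quantile(np.abs(w), sparsity)
        mask = (np.abs(w) > thresh).astype(w.dtype)
        return w * mask, mask

    w = np.random.randn(256, 256).astype(np.float32)
    w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)  # keep ~10% of weights

    # Retrain: mask the gradient so pruned connections stay at zero.
    grad = np.random.randn(*w.shape).astype(np.float32)   # stand-in gradient
    w_pruned -= 0.01 * grad * mask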
Pruning Results: AlexNet
- 9x reduction in number of weights
- Most reduction in FC layers
- < 0.1% accuracy loss
Source: Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," http://arxiv.org/pdf/1510.00149.pdf
Inference with Integer Quantization
Right-Sizing Precision
Network: VGG16

  Data bits        | Single-float | 16    | 8     | 8     | 8
  Weight bits      | Single-float | 16    | 8     | 8     | 8 or 4
  Data precision   | N/A          | 2^-2  | 2^-5  | 2^-1  | Dynamic
  Weight precision | N/A          | 2^-15 | 2^-7  | 2^-7  | Dynamic
  Top-1 accuracy   | 68.1%        | 68.0% | 53.0% | 28.2% | 67.0%
  Top-5 accuracy   | 88.0%        | 87.9% | 76.6% | 49.7% | 87.6%

Dynamic: variable-format fixed-point, chosen per layer; < 1% accuracy loss
Source: Yu Wang, Tsinghua University, Feb 2016
Right-Sizing Precision
- Fixed-point is sufficient for deployment (INT16, INT8)
- No significant loss in accuracy (< 1%)
- > 10x compute energy efficiency, OPs/J (INT8 vs FP32)
- 4x memory energy efficiency, transfers/J (INT8 vs FP32)
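A minimal sketch of per-layer dynamic fixed-point quantization in the spirit of the table above (illustrative NumPy, not the Tsinghua toolflow): pick the fractional-bit count n of the Qm.n format that covers the tensor's range, then round to INT8.

    import numpy as np

    def quantize_dynamic(x, bits=8):
        """Choose fractional bits n so Qm.n covers x's range; round to INT8."""
        n = bits - 1 - int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12)))
        max_int = 2 ** (bits - 1) - 1
        q = np.clip(np.round(x * 2.0 ** n), -max_int - 1, max_int)
        return q.astype(np.int8), n

    w = np.random.randn(512).astype(np.float32) * 0.05  # small-range weights
    q, n = quantize_dynamic(w)                          # n adapts to the range
    w_hat = q.astype(np.float32) / 2.0 ** n             # dequantize
    print(n, float(np.max(np.abs(w - w_hat))))          # worst-case error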
Improving Machine Efficiency
CNN model → (model pruning) → pruned floating-point model → (data/weight quantization) → pruned fixed-point model → (compilation) → instructions → run on FPGA-based neural network processor
Modified from: Yu Wang, Tsinghua University, Feb 2016
Xilinx Kintex® UltraScale™ KU115 (20 nm)
- 5,520 DSP cores, up to 500 MHz
- 5.5 TOPS INT16 (peak)
- 4 GB DDR4-2400, 38 GB/s
- 55 W TDP, 100 GOPS/W
- Single-slot, low-profile form factor
- OpenPOWER CAPI
- Alpha Data ADM-PCIE-8K5
FPGA Architecture
[Diagram: 2D array of RAM, CLB, and DSP columns repeated across the die]
- 2D array architecture (scales with Moore's Law)
- Memory-proximate computing (minimize data moves)
- Broadcast-capable interconnect (data sharing/reuse)
FPGA Arithmetic & Memory Resources
[Diagram: custom-width memory feeding a 16-bit multiplier and 48-bit accumulator; weights Wij multiply data Dj and accumulate into output Oi]
- Native 16-bit multiplier (or reduced-power 8-bit)
- On-chip RAMs store INT4, INT8, INT16, INT32, FP16, FP32, ...
- Custom quantization formatting (Qm.n, e.g. Q8.8, Q2.14)
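A small model of this datapath (assuming Q8.8 data and Q2.14 weights as labeled on the slide; the arbitrary-precision Python integer stands in for the 48-bit DSP accumulator so no precision is lost mid-sum):

    import numpy as np

    D_FRAC, W_FRAC = 8, 14                      # Q8.8 data, Q2.14 weights

    d = np.round(np.random.randn(9) * 2 ** D_FRAC).astype(np.int16)        # Dj
    w = np.round(np.random.randn(9) * 0.1 * 2 ** W_FRAC).astype(np.int16)  # Wij

    acc = 0                                     # Python int: no overflow,
    for dj, wij in zip(d, w):                   # like the 48-bit accumulator
        acc += int(dj) * int(wij)               # 16x16-bit -> 32-bit product

    o = acc >> W_FRAC                           # rescale Oi back to 8 fractional bits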
Convolver Unit
[Diagram: data buffer and weight buffer feed a 3x3 multiplier array (9 data inputs, 9 weight inputs) through n- and m-stage delay lines; products are summed in an adder tree]
Source: Yu Wang, Tsinghua University, Feb 2016

Memory Proximate Compute
- Serial-to-parallel conversion, ping/pong buffering
- Data reuse: 8/9 of window operands come from on-chip delay lines
- 2D parallel memory feeds a 2D operator array (INT16)
Source: Yu Wang, Tsinghua University, Feb 2016
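An illustrative software model of the serial-to-parallel structure above (the stream_3x3_windows helper is hypothetical): two line buffers plus a 3x3 register window mean each new pixel supplies one operand while the other eight are reused on-chip, giving the 8/9 reuse.

    def stream_3x3_windows(pixels, width):
        """Software model of the line buffers: one pixel enters per cycle;
        8 of the 9 window operands are reused from on-chip storage."""
        lb1 = [0] * width                  # previous row (y-1)
        lb2 = [0] * width                  # row above that (y-2)
        win = [[0] * 3 for _ in range(3)]  # 3x3 register window
        for i, p in enumerate(pixels):
            x, y = i % width, i // width
            col = (lb2[x], lb1[x], p)      # the single new column this cycle
            for r in range(3):             # shift the window left by one
                win[r] = [win[r][1], win[r][2], col[r]]
            lb2[x], lb1[x] = lb1[x], p     # update delay lines
            if y >= 2 and x >= 2:          # a full 3x3 window is valid
                yield [row[:] for row in win]

    # Example: a 5x5 raster-scan input yields nine 3x3 windows.
    for w in stream_3x3_windows(range(25), width=5):
        pass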
Processing Engine (PE)
[Diagram: input buffer (data, bias, weights) → convolver complex → adder tree with bias shift → non-linearity (NL) → pooling → output buffer; intermediate data loops back; a controller sequences the PE]
- Custom quantization
- Memory sharing
- Broadcast weights
Source: Yu Wang, Tsinghua University, Feb 2016
Top Level
[Diagram: Power CPU and external memory in the processing system; DMA with compression connects over the data & instruction bus to the input buffer, PE computing complex, output buffer, FIFO, and controller in the programmable logic; a configuration bus sets up the PEs]
- SW-scheduled dataflow
- Decompress weights on the fly
- Ping/pong buffers: transfers overlap with compute
- Multiple PEs: block-level parallelism
Source: Yu Wang, Tsinghua University, Feb 2016
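A minimal software model of the ping/pong schedule (dma_load and compute are hypothetical stand-ins for the DMA engine and PE complex): while the PEs work on buffer k, the DMA fills buffer k+1.

    import threading

    def run_tiles(tiles, dma_load, compute):
        """Ping-pong schedule: DMA for tile k+1 overlaps compute on tile k."""
        bufs = [None, None]                   # ping/pong on-chip buffers
        bufs[0] = dma_load(tiles[0])          # prime the pipeline
        out = []
        for k in range(len(tiles)):
            t = None
            if k + 1 < len(tiles):            # start the next transfer early
                def prefetch(i=k + 1):
                    bufs[i % 2] = dma_load(tiles[i])
                t = threading.Thread(target=prefetch)
                t.start()
            out.append(compute(bufs[k % 2]))  # PE complex works meanwhile
            if t is not None:
                t.join()                      # next tile is now resident
        return out

    # Stand-in example:
    # run_tiles([0, 1, 2], dma_load=lambda t: t, compute=lambda b: b)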
FPGA Neural Net Processor
- Tiled architecture (parallelism & scaling)
- Semi-static dataflow (pre-scheduled data transfers)
- Memory reuse (data sharing across convolvers)
OpenPOWER CAPI
[Diagram: the FPGA's PSL attached to the POWER8 CAPP unit]
- Shared virtual memory
- System-wide memory coherency
- Low-latency control messages
- Peer programming model and interaction efficiency
OpenPOWER CAPI
[Diagram: Caffe, TensorFlow, etc. on the POWER host load the CNN model and call the AuvizDNN library; the AuvizDNN kernel runs on the Xilinx FPGA, attached via CAPI (PSL ↔ POWER8 CAPP unit)]
- Scalable & fully parameterized
- Plug-and-play library
OpenPOWER CAPI
- 14 images/s/W (AlexNet)
- Batch size 1
- Low-profile TDP
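For scale: if that efficiency held at the card's full 55 W TDP, 14 images/s/W would correspond to roughly 14 × 55 ≈ 770 AlexNet images/s at batch size 1 (a back-of-envelope figure, not a measured result).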
Takeaways
- FPGA: ideal dataflow CNN processor
- POWER/CAPI: elevates accelerators to peers of CPUs
- FPGA CNN libraries
Thank You!