1
Power-Efficient Machine Learning using FPGAs on POWER Systems
Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx. Join the conversation: #OpenPOWERSummit
2
Top-5 Accuracy Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
[Chart: top-5 accuracy by year; human-level accuracy ~95% (Russakovsky et al., 2014)]
3
Top-5 Accuracy Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
CNNs far outperform non-AI methods and deliver super-human accuracy
[Chart: as above, with human-level accuracy ~95% (Russakovsky et al., 2014)]
4
CNNs Explained
5
The Computation
6
The Computation
7
Convolution Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights
8
Convolution Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights
9
Convolution Continue along the row ...
10
Convolution Before moving down to the next row
11
Convolution The first output feature map is complete
12
Convolution Move on to the next output feature map by switching weight sets, and repeat
13
Convolution The pattern repeats as before: same input sub-volumes, different weight set
14
Convolution Complete the second output feature map plane
15
Convolution Finally, after all 256 weight sets have been used, the full set of output feature maps is complete
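A minimal NumPy sketch of the loop nest these slides animate, assuming the example's sizes (3x3 kernels over a 384-channel input, 256 output feature maps), stride 1, no padding; all names are illustrative:

```python
import numpy as np

def conv_layer(inputs, weights):
    """inputs: (H, W, 384) feature volume; weights: (256, 3, 3, 384)."""
    H, W, C = inputs.shape
    M = weights.shape[0]                     # 256 output feature maps
    out = np.zeros((H - 2, W - 2, M))
    for m in range(M):                       # switch weight sets per map
        for y in range(H - 2):               # then move down a row
            for x in range(W - 2):           # slide along the row
                sub = inputs[y:y+3, x:x+3, :]            # 3x3x384 sub-volume
                out[y, x, m] = np.sum(sub * weights[m])  # one output pixel
    return out
```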
16
Fully Connected Layers
17
Fully Connected Layers
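Each fully connected layer is a dense matrix-vector product. A minimal sketch, with shapes borrowed from AlexNet's first FC layer (9216 inputs to 4096 outputs) purely as an illustration:

```python
import numpy as np

def fc_layer(x, W, b):
    # Every output neuron connects to every input, so there is one weight
    # per (input, output) pair: weights, not activations, dominate traffic.
    return W @ x + b

x = np.random.rand(9216)          # flattened activations from the last CONV
W = np.random.rand(4096, 9216)    # ~37.7M weights for this one layer
b = np.zeros(4096)
y = fc_layer(x, W, b)
```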
18
CNN Properties
Compute: dominated by convolution (CONV) layers (GOPs per layer)
Memory bandwidth: dominated by fully-connected (FC) layers (G reads per layer)
Source: Yu Wang, Tsinghua University, Feb 2016
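A back-of-envelope check of this split, using AlexNet-style layer shapes (illustrative numbers, not taken from the slide's chart):

```python
# CONV2-style layer: 256 output maps of 27x27, each pixel a 5x5x96 dot
# product (2 ops per multiply-accumulate) -- lots of compute, few weights.
conv_ops = 2 * 256 * 27 * 27 * (5 * 5 * 96)   # ~0.9 GOPs
conv_weights = 256 * 5 * 5 * 96               # ~0.6M weights

# FC6-style layer: 9216 -> 4096; every weight is read once per image,
# so memory traffic scales with the weight count, not the compute.
fc_ops = 2 * 4096 * 9216                      # ~0.08 GOPs
fc_weights = 4096 * 9216                      # ~37.7M weights

print(f"CONV: {conv_ops/1e9:.2f} GOPs, {conv_weights/1e6:.1f}M weights")
print(f"FC:   {fc_ops/1e9:.2f} GOPs, {fc_weights/1e6:.1f}M weights")
```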
19
Humans vs Machines
Humans are six orders of magnitude more efficient* (*IBM Watson, ca 2012)
Source: Yu Wang, Tsinghua University, Feb 2016
20
Cost of Computation Source: William Dally, “High Performance Hardware for Machine Learning” Cadence ENN Summit, 2/9/2016.
21
Cost of Computation
Stay in on-chip memory (~1/100x power)
Use smaller multipliers (8-bit vs 32-bit: ~1/16x power)
Fixed point vs float (don't waste bits on dynamic range)
Source: William Dally, "High Performance Hardware for Machine Learning," Cadence ENN Summit, 2/9/2016
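These ratios line up with published per-operation energy estimates (45 nm figures after Horowitz, of the kind quoted in Dally's talk; treat them as order-of-magnitude, not exact):

```python
# Approximate energy per operation in picojoules (45 nm, after Horowitz).
energy_pj = {
    "int8_mult":  0.2,    # 8-bit integer multiply
    "int32_mult": 3.1,    # 32-bit integer multiply
    "sram_read":  5.0,    # 32-bit read from small on-chip SRAM
    "dram_read":  640.0,  # 32-bit read from off-chip DRAM
}
ratio_mult = energy_pj["int32_mult"] / energy_pj["int8_mult"]
ratio_mem = energy_pj["dram_read"] / energy_pj["sram_read"]
print(f"8b vs 32b multiply: {ratio_mult:.0f}x cheaper")   # ~16x
print(f"on-chip vs DRAM:    {ratio_mem:.0f}x cheaper")    # ~128x
```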
22
Improving Machine Efficiency
Model pruning
Right-sizing precision
Custom CNN processor architecture
23
Pruning Elements
Remove low-contribution weights (synapses)
Retrain the remaining weights
Source: Han et al., "Learning both Weights and Connections for Efficient Neural Networks"
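A minimal sketch of magnitude pruning in the spirit of Han et al. (the threshold rule and retraining loop here are simplified illustrations, not their exact procedure):

```python
import numpy as np

def prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights; return pruned W and a mask."""
    thresh = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= thresh
    return W * mask, mask

# During retraining, gradients are masked so pruned synapses stay at zero:
#   W -= lr * grad * mask
W, mask = prune(np.random.randn(4096, 9216))
```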
24
Pruning Results: AlexNet
9x reduction in the number of weights; most of the reduction is in the FC layers
Source: Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding"
25
Pruning Results: AlexNet
< 0.1% accuracy loss
Source: Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding"
26
Inference with Integer Quantization
27
Right-Sizing Precision
Network: VGG16

Data bits    | Weight bits  | Data precision | Weight precision | Top-1 | Top-5
Single-float | Single-float | N/A            | N/A              | 68.1% | 88.0%
16           | 16           | 2^-2           | 2^-15            | 68.0% | 87.9%
8            | 8            | 2^-2           | 2^-7             | 53.0% | 76.6%
8            | 8            | 2^-5/2^-1      | 2^-7              | 28.2% | 49.7%
8            | 8 or 4       | Dynamic        | Dynamic          | 67.0% | 87.6%

Dynamic: variable-format fixed point, chosen per layer; < 1% accuracy loss
Source: Yu Wang, Tsinghua University, Feb 2016
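A minimal sketch of the "dynamic" scheme: per layer, pick the fixed-point format Qm.n whose range just covers the data, then round and clip to that grid (an illustrative rule, not necessarily Wang's exact one):

```python
import numpy as np

def quantize_dynamic(x, bits=8):
    """Pick a per-layer fractional length, then round/clip to Qm.n."""
    # Integer bits needed to cover the largest magnitude; the rest go
    # to the fraction -- this is what varies from layer to layer.
    frac = bits - 1 - int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12)))
    step = 2.0 ** -frac
    q = np.clip(np.round(x / step), -2**(bits - 1), 2**(bits - 1) - 1)
    return q * step, frac
```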
28
Right-Sizing Precision
Fixed point is sufficient for deployment (INT16, INT8)
No significant loss in accuracy (< 1%)
>10x compute energy efficiency in OPs/J (INT8 vs FP32)
4x memory energy efficiency in transfers/J (INT8 vs FP32)
29
Improving Machine Efficiency
CNN model → (model pruning) → pruned floating-point model → (data/weight quantization) → pruned fixed-point model → (compilation) → instructions → run on the FPGA-based neural network processor
Modified from: Yu Wang, Tsinghua University, Feb 2016
30
Xilinx Kintex® UltraScale™ KU115 (20nm)
5,520 DSP slices, up to 500 MHz
5.5 TOPs INT16 (peak)
4 GB DDR, 38 GB/s
55 W TDP, 100 GOPs/W
Single-slot, low-profile form factor
OpenPOWER CAPI
Alpha Data ADM-PCIE-8K5
31
FPGA Architecture
[Diagram: 2D array of interleaved CLB, DSP, and RAM blocks]
2D array architecture (scales with Moore's Law)
Memory-proximate computing (minimizes data movement)
Broadcast-capable interconnect (data sharing/reuse)
32
FPGA Arithmetic & Memory Resources
[Diagram: custom-width on-chip memory feeding a 16-bit multiplier and 48-bit accumulator]
Native 16-bit multiplier (or reduced-power 8-bit)
On-chip RAMs store custom widths: INT4, INT8, INT16, INT32, FP16, FP32
Custom quantization formatting (Qm.n, e.g. Q8.8, Q2.14)
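A behavioral sketch of this DSP-style MAC: 16-bit operands multiply into a wide product that accumulates in a 48-bit register, so long dot products never overflow or round (the Qm.n bookkeeping is noted in comments; this models behavior, not the hardware):

```python
def mac48(data_q, weight_q):
    """Accumulate int16 x int16 products in a (modeled) 48-bit register."""
    acc = 0
    for d, w in zip(data_q, weight_q):
        acc += d * w                       # 16x16 -> up to 32-bit product
        assert -2**47 <= acc < 2**47       # must fit the 48-bit accumulator
    # If data is Qa.b and weights are Qc.d, the sum is in Q(a+c).(b+d);
    # a final right shift returns the result to the desired Qm.n format.
    return acc
```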
33
Convolver Unit
[Diagram: data and weight buffers feed nine data inputs and nine weight inputs through n- and m-stage delay lines to a multiplier array and adder tree]
Source: Yu Wang, Tsinghua University, Feb 2016
34
Memory Proximate Compute
[Diagram: convolver unit as above]
Serial-to-parallel input with ping/pong buffering
Data reuse: 8/9 (eight of the nine multiplier inputs are reused each cycle)
2D parallel memory feeding a 2D operator array (INT16)
Source: Yu Wang, Tsinghua University, Feb 2016
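A behavioral sketch of the delay-line reuse: the pixel stream enters one word per cycle, and a buffer two rows deep supplies all nine window inputs, so eight of nine operands come from on-chip storage (row-boundary edge handling omitted; names illustrative):

```python
from collections import deque

def window_stream(pixels, row_width):
    """Yield 3x3 windows from a row-major pixel stream, one new pixel each."""
    buf = deque(maxlen=2 * row_width + 3)   # two full rows plus three pixels
    for p in pixels:                        # exactly one new fetch per cycle
        buf.append(p)
        if len(buf) == buf.maxlen:
            r = list(buf)
            yield [r[0:3],                               # oldest row
                   r[row_width:row_width + 3],           # middle row
                   r[2 * row_width:2 * row_width + 3]]   # newest row
```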
35
Processing Engine (PE)
[Diagram: PE with input buffers for data, bias, and weights; a convolver complex; an adder tree; non-linearity (NL) and pooling units; a bias shifter; an output buffer for intermediate data; and a controller]
Source: Yu Wang, Tsinghua University, Feb 2016
36
Processing Engine (PE)
[Diagram: PE as above]
Custom quantization
Memory sharing
Broadcast weights
Source: Yu Wang, Tsinghua University, Feb 2016
37
Top Level
[Diagram: POWER CPU and external memory connect through a DMA with compression to the data & instruction bus; in the programmable logic, input buffers feed the PE computing complex and output buffers, with FIFOs, a controller, and a configuration bus]
Source: Yu Wang, Tsinghua University, Feb 2016
38
Top Level
[Diagram: as above]
SW-scheduled dataflow
Weights decompressed on the fly
Ping/pong buffers: transfers overlap with compute
Multiple PEs: block-level parallelism
Source: Yu Wang, Tsinghua University, Feb 2016
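A host-side sketch of the ping/pong overlap: while the compute complex works on one buffer, the DMA fills the other (dma_load and compute are placeholders standing in for the real transfer and PE invocations):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tiles(tiles, dma_load, compute):
    """Double-buffer: prefetch tile i+1 while computing on tile i."""
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_load, tiles[0])     # fill the "ping" buffer
        for i in range(len(tiles)):
            buf = pending.result()                   # wait for the transfer
            if i + 1 < len(tiles):
                pending = dma.submit(dma_load, tiles[i + 1])  # fill "pong"
            compute(buf)                             # overlaps the next DMA
```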
39
FPGA Neural Net Processor
Tiled architecture (parallelism & scaling)
Semi-static dataflow (pre-scheduled data transfers)
Memory reuse (data sharing across convolvers)
40
OpenPOWER CAPI
[Diagram: POWER8 CAPP unit coupled to the PSL on the FPGA]
Shared virtual memory
System-wide memory coherency
Low-latency control messages
Peer programming model and interaction efficiency
41
OpenPOWER CAPI
[Diagram: a POWER host running Caffe, TensorFlow, etc. loads the CNN model and calls the AuvizDNN library, which invokes the AuvizDNN kernel on the Xilinx FPGA]
Scalable & fully parameterized
Plug-and-play library
42
OpenPOWER CAPI
14 images/s/W (AlexNet), batch size 1, low-profile TDP
43
Takeaways
FPGA: an ideal dataflow CNN processor
POWER/CAPI: elevates accelerators to peers of CPUs
FPGA CNN libraries (e.g. AuvizDNN)
44
Thank You!