1
Power-Efficient Machine Learning using FPGAs on POWER Systems
Ralph Wittig, Distinguished Engineer, Office of the CTO, Xilinx. Join the conversation: #OpenPOWERSummit
2
Top-5 Accuracy Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
[Chart: top-5 accuracy by year; human-level accuracy ~95% (Russakovsky et al., 2014)]
3
Top-5 Accuracy Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
CNNs far outperform non-AI methods and deliver super-human accuracy
[Chart: as above, with human-level accuracy ~95% (Russakovsky et al., 2014)]
4
CNNs Explained
5
The Computation
6
The Computation
7
Convolution Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights
8
Convolution Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights
9
Convolution Continue along the row ...
10
Convolution Before moving down to the next row
11
Convolution The first output feature map is complete
12
Convolution Move on to the next output feature map by switching weight sets, and repeat
13
Convolution The pattern repeats as before: same input sub-volumes, different weight set
14
Convolution Complete the second output feature map plane
15
Convolution Finally, after all 256 weight sets have been used, the full set of output feature maps is complete
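A minimal NumPy sketch of the loop nest these slides animate, assuming the example's sizes (3x3 kernels over a 384-channel input, 256 output feature maps), stride 1, no padding; all names are illustrative:

```python
import numpy as np

def conv_layer(inputs, weights):
    """inputs: (H, W, 384) feature volume; weights: (256, 3, 3, 384)."""
    H, W, C = inputs.shape
    M = weights.shape[0]                     # 256 output feature maps
    out = np.zeros((H - 2, W - 2, M))
    for m in range(M):                       # switch weight sets per map
        for y in range(H - 2):               # then move down a row
            for x in range(W - 2):           # slide along the row
                sub = inputs[y:y+3, x:x+3, :]            # 3x3x384 sub-volume
                out[y, x, m] = np.sum(sub * weights[m])  # one output pixel
    return out
```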
16
Fully Connected Layers
17
Fully Connected Layers
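Each fully connected layer is a dense matrix-vector product. A minimal sketch, with shapes borrowed from AlexNet's first FC layer (9216 inputs to 4096 outputs) purely as an illustration:

```python
import numpy as np

def fc_layer(x, W, b):
    # Every output neuron connects to every input, so there is one weight
    # per (input, output) pair: weights, not activations, dominate traffic.
    return W @ x + b

x = np.random.rand(9216)          # flattened activations from the last CONV
W = np.random.rand(4096, 9216)    # ~37.7M weights for this one layer
b = np.zeros(4096)
y = fc_layer(x, W, b)
```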
18
CNN Properties
Compute: dominated by convolution (CONV) layers (GOPs per layer)
Memory bandwidth: dominated by fully-connected (FC) layers (G reads per layer)
Source: Yu Wang, Tsinghua University, Feb 2016
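A back-of-envelope check of this split, using AlexNet-style layer shapes (illustrative numbers, not taken from the slide's chart):

```python
# CONV2-style layer: 256 output maps of 27x27, each pixel a 5x5x96 dot
# product (2 ops per multiply-accumulate) -- lots of compute, few weights.
conv_ops = 2 * 256 * 27 * 27 * (5 * 5 * 96)   # ~0.9 GOPs
conv_weights = 256 * 5 * 5 * 96               # ~0.6M weights

# FC6-style layer: 9216 -> 4096; every weight is read once per image,
# so memory traffic scales with the weight count, not the compute.
fc_ops = 2 * 4096 * 9216                      # ~0.08 GOPs
fc_weights = 4096 * 9216                      # ~37.7M weights

print(f"CONV: {conv_ops/1e9:.2f} GOPs, {conv_weights/1e6:.1f}M weights")
print(f"FC:   {fc_ops/1e9:.2f} GOPs, {fc_weights/1e6:.1f}M weights")
```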
19
Humans vs Machines
Humans are six orders of magnitude more efficient* (*IBM Watson, ca 2012)
Source: Yu Wang, Tsinghua University, Feb 2016
20
Cost of Computation Source: William Dally, “High Performance Hardware for Machine Learning” Cadence ENN Summit, 2/9/2016.
21
Cost of Computation
Stay in on-chip memory (~1/100x power)
Use smaller multipliers (8-bit vs 32-bit: ~1/16x power)
Fixed point vs float (don't waste bits on dynamic range)
Source: William Dally, "High Performance Hardware for Machine Learning," Cadence ENN Summit, 2/9/2016
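These ratios line up with published per-operation energy estimates (45 nm figures after Horowitz, of the kind quoted in Dally's talk; treat them as order-of-magnitude, not exact):

```python
# Approximate energy per operation in picojoules (45 nm, after Horowitz).
energy_pj = {
    "int8_mult":  0.2,    # 8-bit integer multiply
    "int32_mult": 3.1,    # 32-bit integer multiply
    "sram_read":  5.0,    # 32-bit read from small on-chip SRAM
    "dram_read":  640.0,  # 32-bit read from off-chip DRAM
}
ratio_mult = energy_pj["int32_mult"] / energy_pj["int8_mult"]
ratio_mem = energy_pj["dram_read"] / energy_pj["sram_read"]
print(f"8b vs 32b multiply: {ratio_mult:.0f}x cheaper")   # ~16x
print(f"on-chip vs DRAM:    {ratio_mem:.0f}x cheaper")    # ~128x
```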
22
Improving Machine Efficiency
Model pruning
Right-sizing precision
Custom CNN processor architecture
23
Pruning Elements
Remove low-contribution weights (synapses)
Retrain the remaining weights
Source: Han et al., "Learning both Weights and Connections for Efficient Neural Networks"
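A minimal sketch of magnitude pruning in the spirit of Han et al. (the threshold rule and retraining loop here are simplified illustrations, not their exact procedure):

```python
import numpy as np

def prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights; return pruned W and a mask."""
    thresh = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= thresh
    return W * mask, mask

# During retraining, gradients are masked so pruned synapses stay at zero:
#   W -= lr * grad * mask
W, mask = prune(np.random.randn(4096, 9216))
```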
24
Pruning Results: AlexNet
9x reduction in the number of weights; most of the reduction is in the FC layers
Source: Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding"
25
Pruning Results: AlexNet
< 0.1% accuracy loss
Source: Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding"
26
Inference with Integer Quantization
27
Right-Sizing Precision
Network: VGG16

Data bits    | Weight bits  | Data precision | Weight precision | Top-1 | Top-5
Single-float | Single-float | N/A            | N/A              | 68.1% | 88.0%
16           | 16           | 2^-2           | 2^-15            | 68.0% | 87.9%
8            | 8            | 2^-2           | 2^-7             | 53.0% | 76.6%
8            | 8            | 2^-5/2^-1      | 2^-7              | 28.2% | 49.7%
8            | 8 or 4       | Dynamic        | Dynamic          | 67.0% | 87.6%

Dynamic: variable-format fixed point, chosen per layer; < 1% accuracy loss
Source: Yu Wang, Tsinghua University, Feb 2016
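A minimal sketch of the "dynamic" scheme: per layer, pick the fixed-point format Qm.n whose range just covers the data, then round and clip to that grid (an illustrative rule, not necessarily Wang's exact one):

```python
import numpy as np

def quantize_dynamic(x, bits=8):
    """Pick a per-layer fractional length, then round/clip to Qm.n."""
    # Integer bits needed to cover the largest magnitude; the rest go
    # to the fraction -- this is what varies from layer to layer.
    frac = bits - 1 - int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12)))
    step = 2.0 ** -frac
    q = np.clip(np.round(x / step), -2**(bits - 1), 2**(bits - 1) - 1)
    return q * step, frac
```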
28
Right-Sizing Precision
Fixed point is sufficient for deployment (INT16, INT8)
No significant loss in accuracy (< 1%)
>10x compute energy efficiency in OPs/J (INT8 vs FP32)
4x memory energy efficiency in transfers/J (INT8 vs FP32)
29
Improving Machine Efficiency
CNN model → (model pruning) → pruned floating-point model → (data/weight quantization) → pruned fixed-point model → (compilation) → instructions → run on the FPGA-based neural network processor
Modified from: Yu Wang, Tsinghua University, Feb 2016
30
Xilinx Kintex® UltraScale™ KU115 (20nm)
5,520 DSP slices, up to 500 MHz
5.5 TOPs INT16 (peak)
4 GB DDR, 38 GB/s
55 W TDP, 100 GOPs/W
Single-slot, low-profile form factor
OpenPOWER CAPI
Alpha Data ADM-PCIE-8K5
31
FPGA Architecture
[Diagram: 2D array of interleaved CLB, DSP, and RAM blocks]
2D array architecture (scales with Moore's Law)
Memory-proximate computing (minimizes data movement)
Broadcast-capable interconnect (data sharing/reuse)
32
FPGA Arithmetic & Memory Resources
[Diagram: custom-width on-chip memory feeding a 16-bit multiplier and 48-bit accumulator]
Native 16-bit multiplier (or reduced-power 8-bit)
On-chip RAMs store custom widths: INT4, INT8, INT16, INT32, FP16, FP32
Custom quantization formatting (Qm.n, e.g. Q8.8, Q2.14)
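A behavioral sketch of this DSP-style MAC: 16-bit operands multiply into a wide product that accumulates in a 48-bit register, so long dot products never overflow or round (the Qm.n bookkeeping is noted in comments; this models behavior, not the hardware):

```python
def mac48(data_q, weight_q):
    """Accumulate int16 x int16 products in a (modeled) 48-bit register."""
    acc = 0
    for d, w in zip(data_q, weight_q):
        acc += d * w                       # 16x16 -> up to 32-bit product
        assert -2**47 <= acc < 2**47       # must fit the 48-bit accumulator
    # If data is Qa.b and weights are Qc.d, the sum is in Q(a+c).(b+d);
    # a final right shift returns the result to the desired Qm.n format.
    return acc
```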
33
Convolver Unit
[Diagram: data and weight buffers feed nine data inputs and nine weight inputs through n- and m-stage delay lines to a multiplier array and adder tree]
Source: Yu Wang, Tsinghua University, Feb 2016
34
Memory Proximate Compute
[Diagram: convolver unit as above]
Serial-to-parallel input with ping/pong buffering
Data reuse: 8/9 (eight of the nine multiplier inputs are reused each cycle)
2D parallel memory feeding a 2D operator array (INT16)
Source: Yu Wang, Tsinghua University, Feb 2016
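A behavioral sketch of the delay-line reuse: the pixel stream enters one word per cycle, and a buffer two rows deep supplies all nine window inputs, so eight of nine operands come from on-chip storage (row-boundary edge handling omitted; names illustrative):

```python
from collections import deque

def window_stream(pixels, row_width):
    """Yield 3x3 windows from a row-major pixel stream, one new pixel each."""
    buf = deque(maxlen=2 * row_width + 3)   # two full rows plus three pixels
    for p in pixels:                        # exactly one new fetch per cycle
        buf.append(p)
        if len(buf) == buf.maxlen:
            r = list(buf)
            yield [r[0:3],                               # oldest row
                   r[row_width:row_width + 3],           # middle row
                   r[2 * row_width:2 * row_width + 3]]   # newest row
```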
35
Processing Engine (PE)
[Diagram: PE with input buffers for data, bias, and weights; a convolver complex; an adder tree; non-linearity (NL) and pooling units; a bias shifter; an output buffer for intermediate data; and a controller]
Source: Yu Wang, Tsinghua University, Feb 2016
36
Processing Engine (PE)
[Diagram: PE as above]
Custom quantization
Memory sharing
Broadcast weights
Source: Yu Wang, Tsinghua University, Feb 2016
37
Top Level
[Diagram: POWER CPU and external memory connect through a DMA with compression to the data & instruction bus; in the programmable logic, input buffers feed the PE computing complex and output buffers, with FIFOs, a controller, and a configuration bus]
Source: Yu Wang, Tsinghua University, Feb 2016
38
Top Level
[Diagram: as above]
SW-scheduled dataflow
Weights decompressed on the fly
Ping/pong buffers: transfers overlap with compute
Multiple PEs: block-level parallelism
Source: Yu Wang, Tsinghua University, Feb 2016
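A host-side sketch of the ping/pong overlap: while the compute complex works on one buffer, the DMA fills the other (dma_load and compute are placeholders standing in for the real transfer and PE invocations):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tiles(tiles, dma_load, compute):
    """Double-buffer: prefetch tile i+1 while computing on tile i."""
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_load, tiles[0])     # fill the "ping" buffer
        for i in range(len(tiles)):
            buf = pending.result()                   # wait for the transfer
            if i + 1 < len(tiles):
                pending = dma.submit(dma_load, tiles[i + 1])  # fill "pong"
            compute(buf)                             # overlaps the next DMA
```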
39
FPGA Neural Net Processor
Tiled architecture (parallelism & scaling)
Semi-static dataflow (pre-scheduled data transfers)
Memory reuse (data sharing across convolvers)
40
OpenPOWER CAPI
[Diagram: POWER8 CAPP unit coupled to the PSL on the FPGA]
Shared virtual memory
System-wide memory coherency
Low-latency control messages
Peer programming model and interaction efficiency
41
OpenPOWER CAPI
[Diagram: a POWER host running Caffe, TensorFlow, etc. loads the CNN model and calls the AuvizDNN library, which invokes the AuvizDNN kernel on the Xilinx FPGA]
Scalable & fully parameterized
Plug-and-play library
42
OpenPOWER CAPI
14 images/s/W (AlexNet), batch size 1, low-profile TDP
43
Takeaways
FPGA: an ideal dataflow CNN processor
POWER/CAPI: elevates accelerators to peers of CPUs
FPGA CNN libraries (e.g. AuvizDNN)
44
Thank You!