Hardware design considerations of implementing neural-network algorithms Presenter: Nir Hasidim
Hardware design considerations of implementing neural-network algorithms
Agenda:
- Introduction
- FPGA
- Choosing a hardware platform: considerations, pros and cons of each approach
- CNN implementation methods: cuDNN, cuFFT, Winograd algorithm
- Benchmark
What is an FPGA?
- A field-programmable gate array (FPGA) is an integrated circuit.
- FPGAs contain an array of programmable logic blocks (gates/LUTs) and a hierarchy of reconfigurable interconnects.
- FPGAs contain memory elements: flip-flops or more complex memory blocks.
- The FPGA configuration is generally specified using a hardware description language (HDL).
- May contain other blocks: MACs, oscillators, PLLs…
- Large diversity of interfaces.
Or in Other Words… An FPGA is "flexible" hardware that lets us configure it according to our considerations (area / power consumption / speed-up…) and constraints in an (almost) optimal way. Unlike an MCU, where multi-threading consumes resources (overhead) and is sometimes not true parallelism, in an FPGA parallelism is very natural.
Diversity – state-of-the-art FPGA complexity, performance, and interfaces
- Altera and Xilinx are the market leaders
- Some blocks can handle clocks of more than 10 GHz
- 1000-3000 pins
- 2000 DSP blocks
- Built-in quad-core 64-bit ARM Cortex-A53 processor
Diversity – state-of-the-art FPGA size and (absolute) low power consumption: the Lattice iCE40 UltraLite family
- The smallest: 1.4 × 1.4 mm²
- 30 µA static power
- Up to 8000 logic elements
- Up to 0.5 GHz main clock
GPU vs FPGA Performance Comparison (feature / winner / analysis)
- Floating-point processing – GPU: the total floating-point operations per second of the best GPUs are higher than those of the FPGAs with the maximum DSP capabilities.
- Timing latency – FPGA: algorithms implemented in an FPGA provide deterministic timing, with latencies one order of magnitude lower than GPUs'.
- Processing / Watt – FPGA: measuring GFLOPS per watt, FPGAs are 3-4 times better; although still far behind, the latest GPU products are dramatically improving their power consumption.
- Interfaces – FPGA: GPUs interface via PCIe, while FPGA flexibility allows connection to any other device via (almost) any standard or custom physical interface.
- Development – GPU: many algorithms are designed directly for GPUs, and FPGA developers are difficult and expensive to hire.
- Reuse – GPU: an FPGA lacks the flexibility to modify the hardware implementation of the synthesized code, which is a non-issue for GPU developers.
- Size – FPGA: the FPGA's lower power consumption requires fewer thermal-dissipation countermeasures, so the solution fits in smaller dimensions.
- Processing / € – GPU: mid-class devices can be compared within the same order of magnitude, but the GPU wins when considering money per GFLOP.
GPU vs FPGA Qualitative Comparison
GPU vs FPGA Performance Comparison
CNN – the most computationally intensive stage https://github.com/soumith/convnet-benchmarks Usually the most "expensive" logic resource in terms of area and power is the multiplier, especially in an FPGA, where implementing multipliers out of LUTs is inefficient. Recently a few multiply-reducing algorithms were implemented and tested on GPUs and FPGAs in the context of CNNs.
Convolution Theorem Convolution in the time/space domain corresponds to pointwise multiplication in the frequency domain: F{f * g} = F{f} · F{g}. Example:
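To make this concrete, a minimal NumPy sketch (not from the original deck) checking that circular convolution of two vectors matches the inverse FFT of the product of their FFTs:

```python
# Convolution theorem demo: circular convolution in the "time" domain
# equals pointwise multiplication of the DFTs in the frequency domain.
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(8)
g = rng.standard_normal(8)

# Direct circular convolution: (f * g)[n] = sum_k f[k] g[(n - k) mod N]
direct = np.array([sum(f[k] * g[(n - k) % 8] for k in range(8))
                   for n in range(8)])

# Frequency-domain route: FFT -> pointwise multiply -> inverse FFT
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

assert np.allclose(direct, via_fft)  # the two routes agree
```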
Alternative approach – Freq domain multiply From “Fast Training of Convolutional Networks through FFTs” NYU 2014
Alternative approach – Freq domain multiply There are some variants of this approach that reduce the complexity (fbfft, cuFFT) From "Fast Training of Convolutional Networks through FFTs" NYU 2014
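As a rough illustration of the approach (not the paper's code; the array names and sizes are arbitrary assumptions), here is a NumPy sketch of a convolutional layer computed in the frequency domain: one FFT per feature map and per filter, after which every (output channel, input channel) pairing costs only a pointwise product, amortizing the transforms:

```python
# Frequency-domain convolutional layer sketch: pad the K x K filters to the
# image size, take 2-D FFTs once per map/filter, multiply pointwise, and
# accumulate over input channels. Computes convolution (flipped kernel).
import numpy as np

B, C_in, C_out, H, W, K = 2, 3, 4, 16, 16, 5   # batch, channels, image, filter

x = np.random.randn(B, C_in, H, W)
w = np.random.randn(C_out, C_in, K, K)

X = np.fft.rfft2(x, s=(H, W))     # one FFT per input feature map
F = np.fft.rfft2(w, s=(H, W))     # one FFT per zero-padded filter

# Pointwise products summed over input channels (b=batch, i=in, o=out).
Y = np.einsum('bihw,oihw->bohw', X, F)
y = np.fft.irfft2(Y, s=(H, W))    # circular convolution per output map

# Entries with index >= K-1 are free of circular wraparound, so this slice
# is the (H-K+1) x (W-K+1) "valid" convolution output.
valid = y[:, :, K - 1:, K - 1:]
```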
Alternative approach – Freq domain multiply "FAST CONVOLUTIONAL NETS WITH fbfft: A GPU PERFORMANCE EVALUATION" – Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research
Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes; the chart regions mark where cuDNN is faster and where cuFFT is faster. From: "FAST CONVOLUTIONAL NETS WITH fbfft: A GPU PERFORMANCE EVALUATION"
From “Fast Training of Convolutional Networks through FFTs”
Winograd Algorithm (after Shmuel Winograd)
Winograd Convolution Algorithm
Winograd Convolution Algorithm Computing two outputs of a 3-tap filter directly takes 6 multiplications; Winograd's minimal algorithm F(2,3) needs only 4, a reduction of 6 MULs / 4 MULs = 1.5×.
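Written out (following the notation of Lavin & Gray's "Fast Algorithms for Convolutional Neural Networks"), F(2,3) produces the two outputs from four products:

```latex
\[
\begin{aligned}
m_1 &= (d_0 - d_2)\,g_0, &
m_2 &= (d_1 + d_2)\,\frac{g_0 + g_1 + g_2}{2},\\
m_3 &= (d_2 - d_1)\,\frac{g_0 - g_1 + g_2}{2}, &
m_4 &= (d_1 - d_3)\,g_2,\\
y_0 &= m_1 + m_2 + m_3, &
y_1 &= m_2 - m_3 - m_4.
\end{aligned}
\]
```

The factors involving only the filter g can be precomputed once per filter, so at run time only the four data-dependent multiplications remain.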
Winograd algorithm Fast filtering algorithms can be written in matrix form as: Y = Aᵀ[(G g) ⊙ (Bᵀ d)], where g is the filter, d is the input tile, ⊙ denotes elementwise multiplication, and Bᵀ, G, Aᵀ are constant transform matrices.
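A minimal NumPy sketch of this matrix form for F(2,3), using the constant matrices Bᵀ, G, Aᵀ given by Lavin & Gray for that case (the random test data is illustrative):

```python
# Winograd F(2,3) in matrix form: Y = A^T [(G g) . (B^T d)].
# The elementwise product in the middle is where the 4 multiplies happen.
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)   # data transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                 # filter transform
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)    # output transform

g = np.random.randn(3)   # 3-tap filter
d = np.random.randn(4)   # 4-sample input tile -> 2 outputs

y = A_T @ ((G @ g) * (B_T @ d))   # only 4 data-dependent multiplications

# Winograd computes FIR filtering (correlation), hence the flipped filter
# in the NumPy reference below.
assert np.allclose(y, np.convolve(d, g[::-1], mode='valid'))
```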
Evaluation - Accuracy
Evaluation - FP32 Speedup
Evaluation - FP16 Speedup
Evaluation - Results
For further Winograd algorithm design on FPGA… September 2016
Parallelism strategies
- Data parallelism – splitting the data across different execution threads, but using the same model (see the sketch below).
- Model parallelism – splitting the model across different execution threads, but using the same data.
- Pipeline parallelism – operating different dependent steps of the computation concurrently on different threads, so that the output of one step is streamed as input to the next while execution of the steps overlaps (mainly suited to feed-forward networks).
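As a toy illustration of the first strategy only (the model function and shapes are hypothetical), a data-parallel sketch in Python: the same model is applied to disjoint shards of one batch on separate threads:

```python
# Data parallelism: split the batch, not the model; every worker runs the
# identical model function on its own shard, and results are reassembled.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def model_forward(shard: np.ndarray) -> np.ndarray:
    """Stand-in for a full forward pass; here just an elementwise op."""
    return np.tanh(shard)

batch = np.random.randn(64, 32)
shards = np.array_split(batch, 4)        # 4 shards for 4 worker threads

with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(model_forward, shards))

result = np.concatenate(outputs)         # full batch output, original order
```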
Programming Model The SDAccel OpenCL environment involves both host and kernel code. The host code is used for programming the FPGA, passing data between the host's memory and the FPGA's global memory, and launching the kernel on the FPGA.
FPGA programming The FPGA is segmented into two regions: the static region and the programmable region. Static region – programmed upon power-up; it contains the interfaces to global memory and PCIe. Programmable region – contains the kernel, i.e. the computation to be accelerated. The kernel code is synthesized into hardware and configured into the programmable region of the FPGA.
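A hedged host-side sketch of this flow using pyopencl (the .xclbin path and the kernel name vadd are illustrative assumptions; SDAccel host code is typically written in C/C++ against the same OpenCL API):

```python
# Host-side flow: load a precompiled kernel binary into the FPGA's
# programmable region, move data to the FPGA's global memory, launch
# the kernel, and read the result back.
import numpy as np
import pyopencl as cl

platform = cl.get_platforms()[0]          # e.g. the vendor's FPGA platform
device = platform.get_devices()[0]
ctx = cl.Context([device])
queue = cl.CommandQueue(ctx)

with open("kernel.xclbin", "rb") as f:    # kernel compiled offline
    binary = f.read()
prg = cl.Program(ctx, [device], [binary]).build()

n = 1024
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.empty_like(a)

mf = cl.mem_flags                         # host memory -> FPGA global memory
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
o_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

prg.vadd(queue, (n,), None, a_buf, b_buf, o_buf, np.int32(n))  # launch
cl.enqueue_copy(queue, out, o_buf)        # read the result back to the host
queue.finish()
```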
Questions?