Hardware design considerations of implementing neural-network algorithms
Presenter: Nir Hasidim
Hardware design considerations of implementing neural-network algorithms
Introduction
FPGA
Choosing a hardware platform: considerations, pros and cons of each approach
CNN implementation methods: cuDNN, cuFFT, Winograd algorithm
Benchmark
What is an FPGA?
A field-programmable gate array (FPGA) is an integrated circuit.
An FPGA contains an array of programmable logic blocks (gates/LUTs) and a hierarchy of reconfigurable interconnects.
An FPGA contains memory elements: flip-flops or more complex memory blocks.
The FPGA configuration is generally specified using a hardware description language (HDL).
It may contain other blocks: MACs, oscillators, PLLs…
Large diversity of interfaces.
Or in Other Words…
An FPGA is "flexible" hardware that we can configure according to our considerations (area / power consumption / speed-up…) and constraints in an (almost) optimal way. Unlike an MCU, where multi-threading consumes resources (overhead) and is sometimes not truly concurrent, in an FPGA parallelism is very natural.
Diversity: state-of-the-art FPGA complexity, performance, and interfaces
Altera and Xilinx are the market leaders
Some blocks can handle clock rates above 10 GHz on their pins
2000 DSP blocks
Built-in quad-core 64-bit ARM Cortex-A53 processor
Diversity: state-of-the-art FPGA size and low absolute power consumption - the Lattice iCE40 UltraLite family
The smallest: 1.4 × 1.4 mm²
30 µA static power
Up to 8000 logic elements
Up to 0.5 GHz main clock
GPU vs FPGA Performance Comparison
Feature | Winner | Analysis
Floating-point processing | GPU | The total floating-point operations per second of the best GPUs are higher than those of the FPGAs with the maximum DSP capabilities.
Timing latency | FPGA | Algorithms implemented in an FPGA provide deterministic timing, with latencies one order of magnitude lower than GPUs'.
Processing / Watt | FPGA | Measured in GFLOPS per watt, FPGAs are 3-4 times better. Although still far behind, the latest GPU products are dramatically improving their power consumption.
Interfaces | FPGA | GPUs interface via PCIe, while FPGA flexibility allows connection to any other device via almost any standard or custom physical interface.
Development | GPU | Many algorithms are designed directly for GPUs, and FPGA developers are difficult and expensive to hire.
Reuse | GPU | An FPGA lacks the flexibility to modify the hardware implementation of the synthesized code, which is a non-issue for GPU developers.
Size | FPGA | The FPGA's lower power consumption requires fewer thermal-dissipation countermeasures, allowing the solution to be implemented in smaller dimensions.
Processing / € | GPU | Mid-class devices are comparable within the same order of magnitude, but the GPU wins when considering money per GFLOP.
GPU vs FPGA Qualitative Comparison
CNN – the most computationally intensive stage
In an FPGA, implementing a multiplier out of LUTs/gates is inefficient. The multiplier is usually the most "expensive" logic resource in terms of area and power (especially in an FPGA). Recently, a few multiplication-reducing algorithms were implemented and tested on GPUs and FPGAs in the context of CNNs.
Convolution Theorem Example:
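The worked example on this slide did not survive extraction; for reference, the convolution theorem it illustrates states that convolution in the signal domain is point-wise multiplication in the frequency domain:

```latex
% Convolution theorem (\mathcal{F} denotes the Fourier transform):
\[
\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}
\qquad\Longrightarrow\qquad
f * g = \mathcal{F}^{-1}\!\left\{ \mathcal{F}\{f\} \cdot \mathcal{F}\{g\} \right\}
\]
```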
Alternative approach – frequency-domain multiplication
From “Fast Training of Convolutional Networks through FFTs” NYU 2014
Alternative approach – frequency-domain multiplication
There are some variants of this approach that reduce the complexity (fbfft, cuFFT). From "Fast Training of Convolutional Networks through FFTs", NYU 2014
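To make the idea concrete, here is a minimal NumPy sketch of frequency-domain convolution (illustrative only; the papers above use CUDA implementations such as cuFFT and fbfft, and real CNN layers batch this across many channels and filters):

```python
# A minimal sketch of FFT-based convolution: pad, transform,
# multiply point-wise, inverse transform.
import numpy as np
from scipy.signal import convolve2d  # direct convolution, for reference

image  = np.random.rand(32, 32)
kernel = np.random.rand(5, 5)

# Pad both operands to the full output size before transforming.
out_shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
F_image  = np.fft.rfft2(image, out_shape)
F_kernel = np.fft.rfft2(kernel, out_shape)

# Convolution theorem: point-wise product in the frequency domain.
fft_conv = np.fft.irfft2(F_image * F_kernel, out_shape)

# Matches direct convolution up to floating-point error.
direct = convolve2d(image, kernel, mode='full')
assert np.allclose(fft_conv, direct)
```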
Alternative approach – frequency-domain multiplication
"Fast Convolutional Nets with fbfft: A GPU Performance Evaluation" – Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research, 770 Broadway, New York, NY 10003, USA
Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes; the charts mark the regions where cuDNN is faster and where cuFFT is faster. From: "Fast Convolutional Nets with fbfft: A GPU Performance Evaluation"
From “Fast Training of Convolutional Networks through FFTs”
Winograd Algorithm (Shmuel Winograd)
Winograd Convolution Algorithm
Direct computation of two outputs of a 3-tap filter, F(2,3), takes 6 multiplications; the Winograd algorithm needs only 4, i.e. 6 MULs / 4 MULs = 1.5× fewer multiplies.
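A minimal Python sketch of this F(2,3) case, using the standard Winograd minimal-filtering formulas (the function and variable names are illustrative):

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap FIR filter over a
    4-sample input tile, using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # The 4 multiplications.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.rand(4)   # input tile
g = np.random.rand(3)   # filter taps
# Direct computation for reference: 6 multiplications.
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```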
Winograd algorithm
Fast filtering algorithms can be written in matrix form as:
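The equation itself did not survive extraction; in the standard notation (as in Lavin & Gray, "Fast Algorithms for Convolutional Neural Networks"), the minimal filtering algorithm F(m, r) and its 2D nesting are:

```latex
% 1D: m outputs of an r-tap filter g applied to an input tile d
\[ Y = A^{T}\left[\,(G g) \odot (B^{T} d)\,\right] \]
% 2D nesting, as used for CNN layers; \odot is element-wise
% multiplication and A, B, G are the transform matrices of F(m, r)
\[ Y = A^{T}\left[\,(G g G^{T}) \odot (B^{T} d B)\,\right] A \]
```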
Evaluation - Accuracy
Evaluation - FP32 Speedup
Evaluation - FP16 Speedup
Evaluation - Results
For further Winograd algorithm design on FPGA… (September 2016)
Parallelism strategies
Data parallelism - splitting the data across different execution threads, but using the same model (see the sketch after this list).
Model parallelism - splitting the model across different execution threads, but using the same data.
Pipeline parallelism - running different dependent computation steps concurrently on different threads, so that output from one step is streamed as input to the next while the execution of the steps overlaps (mainly suited to feed-forward networks).
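A minimal single-process NumPy sketch of the first two strategies for one fully connected layer (illustrative only; a real system would place each shard on a different device or thread):

```python
import numpy as np

batch, d_in, d_out = 8, 16, 32
x = np.random.rand(batch, d_in)    # input activations
W = np.random.rand(d_in, d_out)    # layer weights
reference = x @ W

# Data parallelism: each "thread" gets a slice of the batch
# and a full copy of the model.
x_shards = np.split(x, 2, axis=0)
data_parallel = np.concatenate([shard @ W for shard in x_shards], axis=0)

# Model parallelism: each "thread" gets all of the data but only
# a slice of the model (here, half of the output columns).
W_shards = np.split(W, 2, axis=1)
model_parallel = np.concatenate([x @ shard for shard in W_shards], axis=1)

assert np.allclose(data_parallel, reference)
assert np.allclose(model_parallel, reference)
```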
Programming Model
The SDAccel OpenCL environment involves both host and kernel code. The host code is used for programming the FPGA, passing data between the host's memory and the FPGA's global memory, and launching the kernel on the FPGA.
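As an illustration of that host-side flow, here is a minimal sketch using generic OpenCL through PyOpenCL (an assumption for illustration: SDAccel host code is typically written in C/C++ and loads a precompiled .xclbin kernel binary, rather than building from source as done here):

```python
import numpy as np
import pyopencl as cl

# Kernel code: in SDAccel this would be synthesized into hardware and
# loaded as a precompiled binary; here we build from source for brevity.
KERNEL_SRC = """
__kernel void scale(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    out[i] = 2.0f * in[i];
}
"""

host_in = np.arange(16, dtype=np.float32)

ctx = cl.create_some_context()       # pick a device (FPGA/GPU/CPU)
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# Move data from host memory to the device's global memory.
dev_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_in)
dev_out = cl.Buffer(ctx, mf.WRITE_ONLY, host_in.nbytes)

# "Program" the device and launch the kernel.
program = cl.Program(ctx, KERNEL_SRC).build()
program.scale(queue, host_in.shape, None, dev_in, dev_out)

# Copy the result back to host memory.
host_out = np.empty_like(host_in)
cl.enqueue_copy(queue, host_out, dev_out)
assert np.allclose(host_out, 2 * host_in)
```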
FPGA programming
The FPGA is segmented into two regions: the programmable region and the static region.
Static region - programmed upon power-up; contains the interfaces to global memory and PCIe.
Programmable region - contains the kernel, the computation to be accelerated. The kernel code is synthesized into hardware and configured into the programmable region of the FPGA.
Questions?