
1 Hardware design considerations of implementing neural-networks algorithms
Presenter: Nir Hasidim

2 Hardware design considerations of implementing neural-networks algorithms
Introduction: FPGA
Choosing a hardware platform: considerations, pros and cons of each approach
CNN implementation methods: cuDNN, cuFFT, the Winograd algorithm
Benchmarks

3 What is an FPGA?
A field-programmable gate array (FPGA) is an integrated circuit.
FPGAs contain an array of programmable logic blocks (gates/LUTs) and a hierarchy of reconfigurable interconnects.
FPGAs contain memory elements: flip-flops or more complex memory blocks.
The FPGA configuration is generally specified using a hardware description language (HDL).
May contain other blocks: MACs, oscillators, PLLs…
Large diversity of interfaces.

4 Or in Other Words…
An FPGA is "flexible" hardware that allows us to configure it according to our considerations (area / power consumption / speed-up…) and constraints in an (almost) optimal way. Unlike an MCU, where multithreading consumes resources (overhead) and is sometimes not truly parallel, in an FPGA parallelism is very natural.

5 Diversity: state-of-the-art FPGA complexity, performance, and interfaces
Altera and Xilinx are the market leaders. High-end devices offer:
Blocks that can handle clocks of more than 10 GHz
Large numbers of I/O pins
2000 DSP blocks
A built-in quad-core 64-bit ARM Cortex-A53 processor

6 Diversity: state-of-the-art FPGA size and absolute low power consumption - the Lattice iCE40 UltraLite family
The smallest: 1.4 × 1.4 mm²
30 µA static power
Up to 8000 logic elements
Up to a 0.5 GHz main clock

7 GPU vs FPGA Performance Comparison
Feature | Winner | Analysis
Floating-point processing | GPU | The total floating-point operations per second of the best GPUs are higher than those of the FPGAs with the maximum DSP capabilities.
Timing latency | FPGA | Algorithms implemented in an FPGA provide deterministic timing, with latencies one order of magnitude less than GPUs'.
Processing / Watt | FPGA | Measuring GFLOPS per watt, FPGAs are 3-4 times better, although the latest GPU products are dramatically improving their power consumption.
Interfaces | FPGA | GPUs interface via PCIe, while FPGA flexibility allows connection to almost any other device via almost any physical standard or custom interface.
Development | GPU | Many algorithms are designed directly for GPUs, and FPGA developers are difficult and expensive to hire.
Reuse | GPU | The FPGA lacks flexibility to modify the hardware implementation of the synthesized code, which is a non-issue for GPU developers.
Size | FPGA | The FPGA's lower power consumption requires fewer thermal-dissipation countermeasures, so the solution can be implemented in smaller dimensions.
Processing / € | GPU | Mid-class devices can be compared within the same order of magnitude, but the GPU wins when considering money per GFLOP.

8 GPU vs FPGA Qualitative Comparison

9 GPU vs FPGA Qualitative Comparison

10 GPU vs FPGA Performance Comparison

11 GPU vs FPGA Performance Comparison

12 CNN – the most computationally intensive stage
In an FPGA, implementing a multiplier out of LUTs/gates is inefficient; the most "expensive" logic resource in terms of area and power is usually the multiplier (especially in an FPGA). Recently, several multiplication-reducing algorithms have been implemented and tested on GPUs and FPGAs in the context of CNNs.

13

14 Convolution Theorem Example:
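The worked example on the original slide is an image. The theorem it illustrates states that convolution in the time/spatial domain corresponds to pointwise multiplication in the frequency domain: F{f * g} = F{f} · F{g}. A minimal NumPy check of this identity (the signal and filter values here are arbitrary, not from the slide):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # signal
h = np.array([1.0, 0.5, 0.25])       # filter

# Direct (full) linear convolution: output length len(x) + len(h) - 1.
direct = np.convolve(x, h)

# Convolution theorem: zero-pad both inputs to the full output length,
# multiply pointwise in the frequency domain, then transform back.
n = len(x) + len(h) - 1
via_fft = np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)))

assert np.allclose(direct, via_fft)  # both methods agree
```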

15 Alternative approach – frequency-domain multiplication
From “Fast Training of Convolutional Networks through FFTs” NYU 2014

16 Alternative approach – frequency-domain multiplication
Transforming to the frequency domain replaces spatial convolution with elementwise multiplication; there are variants of this approach that reduce the complexity further (fbfft, cuFFT). From "Fast Training of Convolutional Networks through FFTs", NYU 2014

17 Alternative approach – frequency-domain multiplication
"Fast Convolutional Nets with fbfft: A GPU Performance Evaluation" – Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research, 770 Broadway, New York, NY 10003, USA

18 Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes (the plots mark the regions where cuDNN is faster and where cuFFT is faster). From: "Fast Convolutional Nets with fbfft: A GPU Performance Evaluation"

19

20 From “Fast Training of Convolutional Networks through FFTs”

21 The Winograd Algorithm (named after Shmuel Winograd)

22 Winograd Convolution Algorithm

23 Winograd Convolution Algorithm
6 MULs / 4 MULs = a 1.5× reduction in multiplications
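The derivation on the slide is an image; as a stand-in, here is a sketch of the standard F(2,3) minimal-filtering identity (the formulation used in Lavin & Gray, "Fast Algorithms for Convolutional Neural Networks"), which produces 2 outputs of a 3-tap filter with 4 multiplications instead of the 6 required directly - the 1.5× ratio above:

```python
import numpy as np

def winograd_f23(d, g):
    """F(2,3): 2 outputs of a 3-tap filter g over 4 inputs d, 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d, g = np.random.rand(4), np.random.rand(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],   # 6 multiplications
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

Note that the filter-side combinations such as (g0 + g1 + g2)/2 can be precomputed once per filter, so at run time only the 4 data-dependent multiplications remain.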

24 Winograd algorithm Fast filtering algorithms can be written in matrix form as:
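The matrix expression itself is an image on the slide. Assuming the standard Lavin & Gray formulation, the 1D and 2D minimal-filtering forms are:

```latex
% g: filter, d: input tile; A^T, G, B^T are constant transform matrices.
Y = A^T \left[ (G g) \odot (B^T d) \right]

% 2D case (square tiles), e.g. F(2x2, 3x3) for CNN layers:
Y = A^T \left[ (G g G^T) \odot (B^T d B) \right] A
```

where ⊙ denotes elementwise multiplication.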

25

26

27 Evaluation - Accuracy

28 Evaluation - FP32 Speedup

29 Evaluation - FP16 Speedup

30 Evaluation - Results

31 For further Winograd algorithm design on FPGA… (September 2016)

32 Parallelism strategies
Data parallelism – splitting the data across different execution threads while using the same model.
Model parallelism – splitting the model across different execution threads while using the same data.
Pipeline parallelism – running different dependent steps of the computation concurrently on different threads, so that the output of one step is streamed as input to the next while the execution of the steps overlaps (mainly suited to feed-forward networks). A toy data-parallelism sketch follows this list.
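To make the first strategy concrete, here is a toy sketch of data parallelism (the linear model, shard count, and learning rate are hypothetical, purely for illustration): the same weights are applied to every shard of the batch, and the per-shard gradients are averaged before the update.

```python
import numpy as np

w = np.random.rand(8)          # one shared model (weights)
X = np.random.rand(32, 8)      # a batch of 32 samples
y = np.random.rand(32)

def shard_grad(w, Xs, ys):
    """Mean-squared-error gradient of a linear model on one data shard."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

# "Workers" = equal shards of the batch; real systems run these concurrently.
shards = np.array_split(np.arange(32), 4)
grads = [shard_grad(w, X[s], y[s]) for s in shards]
w -= 0.01 * np.mean(grads, axis=0)   # step with the averaged gradient
```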

33 Programming Model
The SDAccel OpenCL environment involves both host and kernel code. The host code programs the FPGA, passes data between the host's memory and the FPGA's global memory, and launches the kernel on the FPGA (see the sketch below).
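A minimal pyopencl-flavored sketch of that host-side flow. The kernel name ("vadd"), the .xclbin path, and the buffer sizes are hypothetical; a real SDAccel host program uses the Xilinx OpenCL runtime, but it follows the same three steps: program the FPGA, move data, launch the kernel.

```python
import numpy as np
import pyopencl as cl

a = np.arange(1024, dtype=np.float32)
out = np.empty_like(a)

dev = cl.get_platforms()[0].get_devices()[0]
ctx = cl.Context(devices=[dev])
queue = cl.CommandQueue(ctx)

# Step 1: program the FPGA with a precompiled kernel binary (.xclbin).
with open("vadd.xclbin", "rb") as f:
    prg = cl.Program(ctx, [dev], [f.read()]).build()

# Step 2: move data between host memory and the FPGA's global memory.
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

# Step 3: launch the kernel on the FPGA, then read the result back.
prg.vadd(queue, (a.size,), None, a_buf, out_buf)
cl.enqueue_copy(queue, out, out_buf)
```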

34 FPGA programming
The FPGA is segmented into two regions: the static region and the programmable region.
Static region – programmed upon power-up; contains the interfaces to global memory and PCIe.
Programmable region – contains the kernel, i.e., the computation to be accelerated. The kernel code is synthesized into hardware and configured into the programmable region of the FPGA.

35 Questions?

