Hardware design considerations of implementing neural-networks algorithms Presenter: Nir Hasidim.

Presentation transcript:

Hardware design considerations of implementing neural-networks algorithms Presenter: Nir Hasidim

Hardware design considerations of implementing neural-networks algorithms – Outline
Introduction
FPGA
Choosing a hardware platform: considerations; pros and cons of each approach
CNN implementation methods: cuDNN, cuFFT, Winograd algorithm
Benchmark

What is an FPGA? A field-programmable gate array (FPGA) is an integrated circuit. An FPGA contains an array of programmable logic blocks (gates/LUTs) and a hierarchy of reconfigurable interconnects. FPGAs also contain memory elements: flip-flops or more complex memory blocks. The FPGA configuration is generally specified using a hardware description language (HDL). Devices may contain other blocks as well (MACs, oscillators, PLLs, ...) and offer a large diversity of interfaces.

Or in Other Words… An FPGA is "flexible" hardware that we can configure according to our considerations (area, power consumption, speed-up, ...) and constraints in an (almost) optimal way. Unlike an MCU, where multithreading consumes resources (overhead) and is sometimes not "real" parallelism, in an FPGA parallelism is very natural.

Diversity – state-of-the-art FPGA complexity, performance, and interfaces
Altera and Xilinx are the market leaders. Top devices offer:
Blocks that can handle clocks of more than 10 GHz
1000-3000 pins
~2000 DSP blocks
A built-in quad-core 64-bit ARM Cortex-A53 processor

Diversity – state-of-the-art FPGA size and (absolute) low power consumption: the Lattice iCE40 UltraLite family
The smallest package: 1.4 x 1.4 mm^2
30 uA static power
Up to 8000 logic elements
Up to a 0.5 GHz main clock

GPU vs FPGA Performance Comparison (feature – winner – analysis)
Floating-point processing – GPU: the total floating-point operations per second of the best GPUs are higher than those of the FPGAs with the maximum DSP capabilities.
Timing latency – FPGA: algorithms implemented in an FPGA provide deterministic timing, with latencies one order of magnitude lower than GPUs.
Processing / Watt – FPGA: measuring GFLOPS per watt, FPGAs are 3-4 times better; although still far behind, the latest GPU products are dramatically improving their power consumption.
Interfaces – FPGA: GPUs interface via PCIe, while FPGA flexibility allows connection to any other device via almost any physical standard or custom interface.
Development – GPU: many algorithms are designed directly for GPUs, and FPGA developers are difficult and expensive to hire.
Reuse – GPU: an FPGA lacks flexibility to modify the hardware implementation of the synthesized code, which is a non-issue for GPU developers.
Size – FPGA: the FPGA's lower power consumption requires fewer thermal-dissipation countermeasures, so the solution fits in smaller dimensions.
Processing / € – GPU: mid-class devices can be compared within the same order of magnitude, but the GPU wins when considering money per GFLOP.

GPU vs FPGA Qualitative Comparison

GPU vs FPGA Qualitative Comparison

GPU vs FPGA Performance Comparison

GPU vs FPGA Performance Comparison

CNN – the most computationally intensive stage (https://github.com/soumith/convnet-benchmarks). In an FPGA, implementing a multiplier out of LUTs is inefficient. Usually the most "expensive" logic resource in terms of area and power is the multiplier (especially in an FPGA). Recently, a few multiplication-reducing algorithms have been implemented and tested on GPUs and FPGAs in the context of CNNs.

Convolution Theorem Example:
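The example itself is not reproduced in the transcript; as a minimal statement of the standard result the slide illustrates: convolution in the time/spatial domain becomes a pointwise product in the frequency domain, which is what makes FFT-based convolution attractive.

\[
  \mathcal{F}\{f * g\} \;=\; \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}
  \qquad\Longleftrightarrow\qquad
  f * g \;=\; \mathcal{F}^{-1}\!\left\{ \mathcal{F}\{f\} \cdot \mathcal{F}\{g\} \right\}
\]

Here \(\mathcal{F}\) denotes the Fourier transform, \(*\) convolution, and \(\cdot\) the pointwise product; the discrete and 2D cases are analogous.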

Alternative approach – frequency-domain multiplication. From "Fast Training of Convolutional Networks through FFTs", NYU, 2014.
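A small numerical sketch of the idea (my own illustration with NumPy, not code from the paper): multiplying the FFTs of the signal and the kernel and transforming back gives the same result as direct convolution.

import numpy as np

# Illustrative sketch of frequency-domain convolution; not the fbfft/cuFFT implementation.
signal = np.random.rand(64).astype(np.float32)
kernel = np.random.rand(5).astype(np.float32)

# Direct (full) linear convolution: O(N*K) multiplications.
direct = np.convolve(signal, kernel, mode="full")

# Frequency domain: zero-pad both to the full output length,
# multiply the spectra pointwise, then inverse-transform.
n = len(signal) + len(kernel) - 1
spectrum = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)
via_fft = np.fft.irfft(spectrum, n)

assert np.allclose(direct, via_fft, atol=1e-4)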

Alternative approach – frequency-domain multiplication. There are some variants of this approach that reduce the complexity (fbfft, cuFFT). From "Fast Training of Convolutional Networks through FFTs", NYU, 2014.

Alternative approach – frequency-domain multiplication. "Fast Convolutional Nets with fbfft: A GPU Performance Evaluation" – Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research.

Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes; the plots mark the regions where cuDNN is faster and where cuFFT is faster. From: "Fast Convolutional Nets with fbfft: A GPU Performance Evaluation".

From “Fast Training of Convolutional Networks through FFTs”

Winograd Algorithm – named after Shmuel Winograd

Winograd Convolution Algorithm

Winograd Convolution Algorithm: computing two outputs of a 3-tap filter directly takes 6 multiplications, while the Winograd minimal-filtering form F(2,3) takes only 4, i.e. 6 MULs / 4 MULs = a 1.5x reduction in multiplications.
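A small sketch of that count (my own illustration following the standard F(2,3) minimal-filtering identities, not code from the slides): the same two outputs of a 3-tap filter are produced with 4 multiplications instead of 6.

import numpy as np

def direct_f23(d, g):
    # Direct computation of two outputs of a 3-tap filter: 6 multiplications.
    # (Cross-correlation form, as commonly used for CNN "convolution".)
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def winograd_f23(d, g):
    # Winograd minimal filtering F(2,3): the same two outputs with 4 multiplications.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.rand(4)   # 4 input samples
g = np.random.rand(3)   # 3 filter taps
assert np.allclose(direct_f23(d, g), winograd_f23(d, g))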

Winograd algorithm Fast filtering algorithms can be written in matrix form as:
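The matrix form itself is not reproduced in the transcript; assuming the slide used the standard Winograd formulation (the notation of Lavin & Gray, "Fast Algorithms for Convolutional Neural Networks"), it reads:

\[
  Y \;=\; A^{T}\!\left[\, (G g) \odot (B^{T} d) \,\right]
  \qquad\text{(1D)}
\]
\[
  Y \;=\; A^{T}\!\left[\, (G g G^{T}) \odot (B^{T} d B) \,\right] A
  \qquad\text{(2D)}
\]

where \(g\) is the filter, \(d\) is the input tile, \(\odot\) is the element-wise product, and \(G\), \(B^{T}\), \(A^{T}\) are the filter, data, and output transform matrices.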

Evaluation - Accuracy

Evaluation - FP32 Speedup

Evaluation - FP16 Speedup

Evaluation - Results

For further reading on Winograd algorithm design on FPGAs… (September 2016)

Parallelism strategies
Data parallelism – splitting the data across different execution threads, but using the same model.
Model parallelism – splitting the model across different execution threads, but using the same data.
Pipeline parallelism – operating different dependent steps of the computation concurrently on different threads, so that output from one step is streamed as input to the next while execution of the steps overlaps (mainly suited to feed-forward models).
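A toy sketch of the first two strategies (my own single-process NumPy illustration, not real multi-device code): data parallelism runs the same two-layer model on different slices of the batch, while model parallelism assigns different layers to different "devices".

import numpy as np

# Toy two-layer "model": y = relu(x @ W1) @ W2
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((8, 16)), rng.standard_normal((16, 4))
batch = rng.standard_normal((32, 8))

def forward(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

# Data parallelism: same weights everywhere, each "device" gets a slice of the batch.
chunks = np.array_split(batch, 4)                  # pretend these go to 4 devices
data_parallel_out = np.vstack([forward(c, W1, W2) for c in chunks])

# Model parallelism: the batch is shared, each "device" holds one layer.
hidden = np.maximum(batch @ W1, 0.0)               # "device 0" computes layer 1
model_parallel_out = hidden @ W2                   # "device 1" computes layer 2

assert np.allclose(data_parallel_out, forward(batch, W1, W2))
assert np.allclose(model_parallel_out, forward(batch, W1, W2))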

Programming Model: the SDAccel OpenCL environment involves both host and kernel code. The host code is used for programming the FPGA, passing data between the host's memory and the FPGA's global memory, and launching the kernel on the FPGA.
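As a generic illustration of that host/kernel split (a hedged sketch only: it uses the plain OpenCL API via pyopencl and a hypothetical vector-scale kernel, not the actual SDAccel C/C++ host flow), the host creates a context, moves data to device global memory, launches the kernel, and reads the result back:

import numpy as np
import pyopencl as cl  # generic OpenCL host API; NOT the SDAccel-specific flow

# Host side: context/queue, buffers in device global memory.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

x = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

# Kernel side: the computation to be accelerated (a made-up example;
# under SDAccel this code would be synthesized into FPGA hardware).
program = cl.Program(ctx, """
__kernel void scale(__global const float *x, __global float *y) {
    int i = get_global_id(0);
    y[i] = 2.0f * x[i];
}
""").build()

program.scale(queue, x.shape, None, x_buf, y_buf)   # launch the kernel

y = np.empty_like(x)
cl.enqueue_copy(queue, y, y_buf)                    # copy result back to the host
assert np.allclose(y, 2.0 * x)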

FPGA programming
The FPGA is segmented into two regions: the static region and the programmable region.
Static region – programmed upon power-up; it contains the interfaces to global memory and PCIe.
Programmable region – contains the kernel, i.e. the computation to be accelerated. The kernel code is synthesized into hardware and configured into the programmable region of the FPGA.

Questions?