FPGA vs GPU Performance Comparison on the Implementation of FIR Filters

GPGPU platforms: GPGPU stands for General-Purpose computation using a GPU (Graphics Processing Unit). CUDA and OpenCL are both frameworks for task-based and data-based general-purpose parallel execution, and their architectures are very similar. The key difference between the two is that OpenCL is a cross-platform framework (implemented on CPUs, GPUs, DSPs, and so on), whereas CUDA is supported only by NVIDIA GPUs.

Memory Hierarchy: The memory hierarchy of GPGPU architectures is similar to that of CPUs. At the bottom of the hierarchy resides the slowest but largest-capacity memory, called global memory in CUDA terminology. A typical global memory is 2 or 4 gigabytes in size, sits off the GPU chip, and is usually built from DRAM. Constant memory is another memory type in CUDA devices; it is optimized for broadcast operations, so it can be accessed faster than global memory. Like the caches in a CPU memory hierarchy, CUDA also has a faster but smaller memory type called shared memory. Finally, registers are storage units private to each thread; they have the smallest latency and the highest throughput, but their number is very limited.
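As a quick illustration (this snippet is not from the slides, and all names and sizes in it are hypothetical), here is where each of these memory spaces appears in a CUDA kernel:

```cuda
#define BLOCK 256

__constant__ float coeff[64];   // constant memory: cached and broadcast-friendly

__global__ void memory_spaces(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];                    // shared memory: fast per-block scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // the index i lives in a register
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // load from global memory (off-chip DRAM)
    __syncthreads();                                 // make the tile visible to the whole block
    float r = tile[threadIdx.x] * coeff[0];          // r is a per-thread register
    if (i < n) out[i] = r;                           // store back to global memory
}
```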

Memory Hierarchy [Figure: CUDA memory hierarchy]

Filter Overview (FIR): The FIR filter structure is constructed from its transfer function and from the linear difference equation obtained by taking the inverse Z-transform of that transfer function.
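For reference, these are the standard relations the slide refers to, for an M-tap filter with coefficients b0, ..., bM-1 (the equations themselves are not preserved in the transcript):

```latex
% Transfer function of an M-tap FIR filter
H(z) = \sum_{k=0}^{M-1} b_k \, z^{-k}
% Taking the inverse Z-transform yields the linear difference equation
y(n) = \sum_{k=0}^{M-1} b_k \, x(n-k)
```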

Filter Overview (FIR): The output stream y(n) is calculated by multiplying the input samples [x(n), x(n-1), ..., x(n-M+1)] by the corresponding filter coefficients [b0, b1, ..., bM-1] and adding all of the products together.

Filter Overview (FIR) [Figure: FIR filter structure]

GPU Implementation: Three implementations were designed to compare the performance of GPUs with FPGAs; two use CUDA and the third is an OpenCL kernel. The first CUDA design is a naive, straightforward kernel without any significant optimization. The second CUDA kernel is optimized: it uses shared memory and coalesces global memory accesses. The third design is an OpenCL port of the highly optimized CUDA FIR filter implementation. (Illustrative sketches of both CUDA kernels are given after the next slide.)

Basic CUDA FIR Filter [The original code listing is not preserved in the transcript]
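A minimal sketch of what such a naive kernel typically looks like (my reconstruction, not the authors' code; the kernel name and signature are assumptions): one thread per output sample, with every operand read directly from global memory.

```cuda
// Naive FIR kernel: y(i) = sum_k b[k] * x(i-k), one thread per output sample.
__global__ void fir_naive(const float *x,   // input samples, length n
                          const float *b,   // filter coefficients, length m
                          float *y,         // output samples, length n
                          int n, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < m; ++k)
        if (i - k >= 0)                     // skip taps that reach before the stream start
            acc += b[k] * x[i - k];         // both reads go straight to global memory
    y[i] = acc;
}
```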

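For contrast, a sketch of the optimized variant described on the GPU Implementation slide (again an assumption, not the original listing): the coefficients live in constant memory, and each block stages its input window in shared memory so that global reads are coalesced and then reused by neighboring threads.

```cuda
#define MAX_TAPS 64
#define BLOCK 256

__constant__ float c_b[MAX_TAPS];   // coefficients, uploaded once via cudaMemcpyToSymbol

__global__ void fir_shared(const float *x, float *y, int n, int m) {
    __shared__ float s_x[BLOCK + MAX_TAPS - 1];
    int i = blockIdx.x * BLOCK + threadIdx.x;

    // Coalesced load of this block's window plus the m-1 preceding samples.
    int base = blockIdx.x * BLOCK - (m - 1);
    for (int j = threadIdx.x; j < BLOCK + m - 1; j += BLOCK) {
        int g = base + j;
        s_x[j] = (g >= 0 && g < n) ? x[g] : 0.0f;   // zero-pad outside the stream
    }
    __syncthreads();

    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < m; ++k)
            acc += c_b[k] * s_x[threadIdx.x + (m - 1) - k];   // s_x entry holds x(i-k)
        y[i] = acc;
    }
}
```

Each input sample is now fetched from global memory once per block instead of once per tap, trading a little synchronization overhead for far less global-memory traffic.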
FPGA Implementation: Three different implementation techniques are selected to synthesize the FIR filters on various FPGAs: direct form, symmetric form, and distributed arithmetic. A massive level of parallelism can be achieved by utilizing the multiplier resources of the FPGA. Most Xilinx FPGAs have DSP48 macro blocks embedded in the chip; these slices contain 18x18-bit multiplier units with pre-adders, 48-bit accumulators, and selection multiplexers to speed up DSP operations. The direct-form and symmetric-form FPGA implementations utilize Xilinx's DSP48 macro slices.

FPGA Implementation: The distributed arithmetic (DA) technique is an efficient method for implementing the multiplications without using the FPGA's DSP macro blocks. In the DA technique, the FIR filter coefficients are represented in two's-complement binary form, and all possible sums of the filter coefficients are precomputed and stored in look-up tables (LUTs). Using the classic shift-and-add method, the multiplications can then be performed efficiently without the FPGA's multiplier units. We used the FPGA's 4-input LUTs to implement the DA form of the FIR filter structure. We chose three different FPGAs to compare the performance of the FIR filters; the FPGAs and their properties are given in Table 1. Xilinx ISE v14.1 was used to synthesize the circuits.
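To make the LUT-based shift-and-add scheme concrete, here is a small software model of a 4-tap DA FIR (an illustration of the standard technique, not the FPGA circuit; all names are mine). The 16-entry table holds every possible sum of the coefficients, and each output is assembled by scanning the two's-complement input samples one bit plane at a time:

```cuda
// Host-side model of distributed arithmetic for a 4-tap FIR.
#include <cstdint>
#include <cstdio>

#define TAPS 4     // one 4-input LUT covers 4 taps
#define BITS 16    // input sample width (two's complement)

// lut[i] = sum of b[k] over every k whose bit is set in i (2^TAPS entries).
static void build_lut(const int32_t b[TAPS], int32_t lut[1 << TAPS]) {
    for (int i = 0; i < (1 << TAPS); ++i) {
        int32_t s = 0;
        for (int k = 0; k < TAPS; ++k)
            if (i & (1 << k)) s += b[k];
        lut[i] = s;
    }
}

// One output sample: for each bit plane j, gather the j-th bit of every input
// into a LUT address, look up the coefficient sum, and shift-accumulate.
// The sign-bit plane is subtracted (two's-complement weight of the MSB).
static int64_t da_fir(const int32_t lut[1 << TAPS], const int16_t x[TAPS]) {
    int64_t acc = 0;
    for (int j = 0; j < BITS; ++j) {
        int addr = 0;
        for (int k = 0; k < TAPS; ++k)
            addr |= (((uint16_t)x[k] >> j) & 1) << k;
        int64_t term = (int64_t)lut[addr] << j;
        acc += (j == BITS - 1) ? -term : term;
    }
    return acc;
}

int main() {
    int32_t b[TAPS] = {3, -1, 4, 2};
    int32_t lut[1 << TAPS];
    build_lut(b, lut);
    int16_t x[TAPS] = {100, -50, 25, 7};
    int64_t ref = 0;                               // direct dot product for comparison
    for (int k = 0; k < TAPS; ++k) ref += (int64_t)b[k] * x[k];
    std::printf("DA: %lld  direct: %lld\n", (long long)da_fir(lut, x), (long long)ref);
    return 0;
}
```

In hardware, the inner bit-plane loop becomes BITS clock cycles of a 4-input LUT lookup feeding a shift-accumulator, which is why DA needs no dedicated multipliers.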

Results and Discussions

GPU and CPU Performance Results of the FIR Filter Application, in million samples per second [Table: values not preserved in the transcript]

Conclusions: FIR filter order has a noticeable effect on performance; both the FPGAs and the GPUs achieved better performance on lower-order FIR filters than on higher-order ones. On FPGAs, the main cause of the performance drop is serialization due to a shortage of multiplier units, and the logic resource capacity of the FPGA is another factor limiting high-order FIR filters. FPGAs are relatively cheaper than GPUs, yet GPUs enjoy ease of programmability while FPGAs remain hard to program. In general, FPGA performance exceeds GPU performance when the FIR filter can be fully parallelized on the FPGA device; however, the GPU outperforms the FPGA when the filter must be implemented with serial parts on the FPGA.

Questions?