GPU VSIPL: High Performance VSIPL Implementation for GPUs
Andrew Kerr, Dan Campbell*, Mark Richards, Mike Davis
andrew.kerr@gtri.gatech.edu, dan.campbell@gtri.gatech.edu, mark.richards@ece.gatech.edu, mike.davis@gtri.gatech.edu
High Performance Embedded Computing (HPEC) Workshop, 24 September 2008
This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of the authors.
Distribution Statement (A): Approved for public release; distribution is unlimited.
Signal Processing on Graphics Processors
- GPUs' original role: turn 3-D polygons into 2-D pixels...
- ...which also makes them a cheap and plentiful source of FLOPs
- Leverages volume and competition in the entertainment industry
- Primary workload is highly parallel and very regular
- Typically a <$500 drop-in addition to a standard PC
- Outstripping CPU capacity, and growing more quickly
- Peak theoretical throughput ~1 TFLOP
- Power draw: GTX 280 = 200 W, Q6600 = 100 W
- The driving market applications have ample parallelism, so growth continues
GPU/CPU Performance Growth
GPGPU (Old) Concept of Operations
Now we have compute-centric programming models...
...but they require expertise to fully exploit.
Under the old GPGPU model, "A = B + C" was written as a fragment program:

    void main(float2 tc0 : TEXCOORD0,
              out float4 col : COLOR,
              uniform samplerRECT B,
              uniform samplerRECT C)
    {
        col = texRECT(B, tc0) + texRECT(C, tc0);
    }

- Arrays are stored as textures
- Render a polygon with the same pixel dimensions as the output texture
- Execute the fragment program to perform the desired calculation
- Move data from the output buffer to the desired texture
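For comparison, here is a minimal sketch of the same "A = B + C" expressed in CUDA's compute-centric model (illustrative only; the kernel and variable names are not from GPU-VSIPL):

    // One thread per output element; the bounds check handles n not divisible
    // by the block size.
    __global__ void vector_add(const float *B, const float *C, float *A, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            A[i] = B[i] + C[i];
    }

    // Host side: copy inputs to device memory, launch one thread per element,
    // then copy the result back, e.g.:
    //   vector_add<<<(n + 255) / 256, 256>>>(dB, dC, dA, n);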
VSIPL - Vector Signal Image Processing Library
- Portable API for linear algebra, image and signal processing
- Originally sponsored by DARPA in the mid '90s
- Targeted embedded processors; portability was the primary aim
- Open standard, forum-based; initial API approved April 2000
- Functional coverage: vector, matrix, tensor views; basic math operations, linear algebra, solvers, FFT, FIR/IIR, bookkeeping, etc.
VSIPL & GPU: Well Matched VSIPL is great for exploiting GPUs High level API with good coverage for dense linear algebra Allows non experts to benefit from hero programmers Explicit memory access controls API precision flexibility GPUs are great for VSIPL Improves prototyping by speeding algorithm testing Cheap addition allows more engineers access to HPC Large speedups without needing explicit parallelism at application level
GPU-VSIPL Implementation
- Full, compliant implementation of the VSIPL Core-Lite Profile
- Fully encapsulated CUDA backend; leverages the CUFFT library
- All VSIPL functions accelerated
Core-Lite Profile:
- Single-precision floating point, some basic integer
- Vector & scalar, complex & real support
- Basic elementwise, FFT, FIR, histogram, RNG, and support functions
- Full list: http://www.vsipl.org/coreliteprofile.pdf
- Also some matrix support, including vsip_fftm_f
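For readers new to the API, here is a minimal sketch of Core-Lite-style code (standard VSIPL C calls; the vector length and fill values are arbitrary). Because the CUDA backend is fully encapsulated, the same source also compiles against a CPU implementation such as TASP:

    #include <vsip.h>

    int main(void)
    {
        vsip_init((void *)0);

        vsip_vview_f *a = vsip_vcreate_f(1024, VSIP_MEM_NONE);
        vsip_vview_f *b = vsip_vcreate_f(1024, VSIP_MEM_NONE);
        vsip_vview_f *c = vsip_vcreate_f(1024, VSIP_MEM_NONE);

        vsip_vfill_f(1.0f, a);        /* a[i] = 1       */
        vsip_vramp_f(0.0f, 0.5f, b);  /* b[i] = 0.5 * i */
        vsip_vadd_f(a, b, c);         /* c = a + b      */

        vsip_valldestroy_f(a);
        vsip_valldestroy_f(b);
        vsip_valldestroy_f(c);
        vsip_finalize((void *)0);
        return 0;
    }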
CUDA Programming & Optimization CUDA Programming Model Grid CUDA Optimization Considerations Maximize occupancy to hide memory latency Keep lots of threads in flight Carefully manage memory access to allow coalesce & avoid conflicts Avoid slow operations (e.g integer multiply for indexing) Minimize synch barriers Careful loop unrolling Hoist loop invariants Reduce register use for greater occupancy Block Datapath Shared Memory Register File Block Datapath Shared Memory Register File Block Thread Shared Memory Thread Thread Thread Register File Device Memory “GPU Performance Assessment with the HPEC Challenge” – Thursday PM CPU Dispatch Host Memory
GPU VSIPL Speedup: Unary Operations
nVidia 8800GTX vs. Intel Q6600
[Chart of measured speedups; annotated values include 20X and 320X.]
GPU VSIPL Speedup: Binary Operations
nVidia 8800GTX vs. Intel Q6600
[Chart of measured speedups; annotated values include 25X and 40X.]
GPU VSIPL Speedup: FFT
nVidia 8800GTX vs. Intel Q6600
[Chart of measured speedups; annotated values include 39X and 83X.]
GPU VSIPL Speedup: FIR
nVidia 8800GTX vs. Intel Q6600
[Chart of measured speedups; annotated value of 157X.]
Application Example: Range-Doppler Map
- Simple range/Doppler data visualization application demo
- Intro app for a new VSIPL programmer
- 59x overall speedup moving from TASP VSIPL to GPU-VSIPL with no changes to source code

Section                 9800GX2 Time (ms)   Q6600 Time (ms)   Speedup
Admit                          8.88                --             --
Baseband                      67.77            1872.3             28
Zeropad                       23.18             110.71             5
Fast-time FFT                 47.25            5696.3            121
Multiply                       8.11              33.92             4
Fast-time FFT^-1              48.59            5729.04           118
Slow-time FFT, 2x CT          12.89            3387              263
log10 |.|^2                   22.2              470.15            21
Release                       54.65               --             --
Total                        293.52           17299.42            59
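To give a flavor of the processing chain, here is a sketch of the fast-time matched-filter step for a single pulse using standard VSIPL calls (the demo itself operates on whole matrices via vsip_fftm_f; the view names, lengths, and hints below are illustrative):

    /* Forward and inverse in-place complex FFT objects of length N. */
    vsip_fft_f *fwd = vsip_ccfftip_create_f(N, 1.0f,     VSIP_FFT_FWD, 0, VSIP_ALG_SPACE);
    vsip_fft_f *inv = vsip_ccfftip_create_f(N, 1.0f / N, VSIP_FFT_INV, 0, VSIP_ALG_SPACE);

    vsip_ccfftip_f(fwd, pulse);                /* fast-time FFT, in place        */
    vsip_cvmul_f(pulse, ref_spectrum, pulse);  /* multiply by reference spectrum */
    vsip_ccfftip_f(inv, pulse);                /* inverse FFT back to fast time  */

    vsip_fft_destroy_f(fwd);
    vsip_fft_destroy_f(inv);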
GPU-VSIPL: Future Plans
- Expand matrix support
- Move toward the full Core Profile
- More linear algebra / solvers
- VSIPL++
- Double-precision support
Conclusions
- GPUs are fast, cheap signal processors
- VSIPL is a portable, intuitive means to exploit GPUs
- GPU-VSIPL allows easy access to GPU performance without becoming an expert CUDA/GPU programmer
- 10-100x speed improvements are possible with no code changes
- Not yet released, but unsupported previews may show up at: http://gpu-vsipl.gtri.gatech.edu