GTRI_B-1: GPU Performance Assessment with HPEC Challenge
High Performance Embedded Computing (HPEC) Workshop, September 25, 2008
Andrew Kerr, Dan Campbell, Mark Richards
Distribution Statement (A): Approved for public release; distribution is unlimited.
This work was supported in part by DARPA and AFRL under contracts FA… and FA…C. The opinions expressed are those of the authors.

GTRI_B-2: General Purpose GPU Computing
- Modern GPUs have a unified shader architecture: highly parallel, programmable processing units
- Flexibility extends GPUs beyond rasterized 3D graphics
- New vendor focus on high-performance computing: NVIDIA's CUDA, ATI's CTM
- High theoretical performance (500 GFLOPS or more)
- Leverages volume and competition in the entertainment industry:
  worldwide GPUs: $5B, 10M units per year; U.S. video games: $7.5B, 250M units (2004)
- Holds down unit price, drives advancement
- Outstripping CPU capacity, and growing more quickly


GTRI_B-4: GPU Performance Trends: Unified Shaders
[Figure: floating-point throughput over time for GPUs versus CPUs; labeled data points include the ATI R580, NVIDIA NV40, and dual-core CPUs.]

GTRI_B-5: HPEC Challenge Benchmarks
HPEC Challenge:
- How will a candidate architecture perform in a real application?
- Nine kernel benchmarks and one application benchmark; seven attempted: corner turn, time-domain FIR, frequency-domain FIR, constant false alarm rate (CFAR) detection, pattern matching, graph optimization via genetic algorithm, QR factorization
Experimental system:
- NVIDIA GeForce 8800 GTX
- Intel Core2 Quad CPU
- Windows XP Professional, Visual C++ host compiler
- NVIDIA CUDA 1.1

GTRI_B-6: CUDA Programming Model
- Compute Unified Device Architecture (CUDA): a C-like programming language for executing kernels on the GPU without casting the computation as a 3D graphics operation
- Keywords denote memory placement, grid environment, and thread index
- Built-in functions for synchronization, fast math, and cycle counts
- Runtime API for memory management, launching kernels, and synchronizing with the host
A minimal sketch of these mechanisms follows.
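To make the model concrete (the slides themselves contain no source code), here is a minimal CUDA sketch; the kernel name saxpy and the buffer names are hypothetical, and the host calls use the modern runtime API rather than the CUDA 1.1 vintage:

```cuda
#include <cuda_runtime.h>

// __global__ marks a kernel that executes on the GPU; blockIdx, blockDim,
// and threadIdx give each thread its position in the launch grid.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));        // runtime API: memory management
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));
    cudaMemset(d_y, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);     // grid environment
    saxpy<<<grid, block>>>(n, 2.0f, d_x, d_y);  // kernel launch
    cudaDeviceSynchronize();                    // synchronize with the host

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```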

GTRI_B-7: GPU Architecture (G80)
- Programmable units arranged as 16 "multiprocessors"
- Per multiprocessor: eight datapaths (single-precision floating point and integer), 16 kB scratchpad (shared memory), an 8,192-word register file, and a scheduler
- A 384-bit memory bus handles requests from all threads
- 1.35 GHz datapath clock, 575 MHz core clock (GeForce 8800 GTX)
[Diagram: multiprocessors, each containing datapaths, shared memory, and a register file, connected through a texture cache to global memory.]

GTRI_B-8: CUDA Grids, Threads, and Blocks
- The problem is logically decomposed into "blocks"; the scheduler maps blocks to available multiprocessors for concurrent execution
- Execution order of blocks is undefined, and synchronization between blocks is undefined
- Blocks are partitioned into threads, meant to execute in SIMD fashion on a multiprocessor
- There are more threads than datapaths: the set of active threads is known as a "warp", the scheduler devotes two cycles per "half warp", and a floating-point MADD has a latency of 4 cycles
- When threads stall on memory accesses, another warp is activated to hide the latency

GTRI_B-9: Corner Turn
Benchmark: compute a real-valued transpose, out of place.
Strategies:
- Coalesce the reads and writes of adjacent threads to adjacent global memory locations
- Transpose through shared memory
- Minimize the overhead of address computation
Good match for the GPU (a kernel sketch follows):
- Set 1: 0.30 ms, 8.32x speedup
- Set 2: 4.60 ms, 11.4x speedup
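A sketch of the coalesced shared-memory transpose described above; this is the standard CUDA transpose idiom rather than the authors' exact kernel, and TILE, the row-major layout, and the launch configuration are assumptions:

```cuda
#define TILE 16

// Transpose a rows x cols matrix: each block stages a TILE x TILE tile in
// shared memory so that both the global read and the global write are
// coalesced. The +1 pad avoids shared-memory bank conflicts.
__global__ void transpose(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;       // column in the output
    y = blockIdx.x * TILE + threadIdx.y;       // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

// Launch: transpose<<<dim3(cols / TILE, rows / TILE), dim3(TILE, TILE)>>>(d_in, d_out, rows, cols);
```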

GTRI_B-10: Time-Domain FIR
Benchmark: convolve a set of FIR filters with a set of input vectors.
Strategies:
- Filter coefficients fit in shared memory
- Map each filter to a block, with a large number of threads per block
- Overlap computation with streaming of the input vector
- Unroll loops to improve utilization

  y_block[thread] = h_block[0] * x_block[thread] + h_block[1] * x_block[thread - 1] + h_block[2] * x_block[thread - 2] + ...

Good match for the GPU (a kernel sketch follows):
- Set 1: 2.54 ms, 151x speedup
- Set 2: 0.09 ms, 22.2x speedup
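A sketch of the one-filter-per-block strategy; NTAPS, the data layout, and the grid-stride loop are assumptions, and the unrolling and compute/streaming overlap the slide mentions are omitted for brevity:

```cuda
#define NTAPS 128

// One FIR filter per block: coefficients are staged once in shared memory,
// then each thread produces output samples y[i] = sum_k h[k] * x[i - k].
__global__ void tdfir(const float *x, const float *h, float *y, int len) {
    __shared__ float coeff[NTAPS];
    const float *xb = x + blockIdx.x * len;    // this block's input vector
    const float *hb = h + blockIdx.x * NTAPS;  // this block's filter
    float *yb = y + blockIdx.x * len;

    for (int t = threadIdx.x; t < NTAPS; t += blockDim.x)
        coeff[t] = hb[t];                      // stage coefficients
    __syncthreads();

    for (int i = threadIdx.x; i < len; i += blockDim.x) {
        float acc = 0.0f;
        for (int k = 0; k < NTAPS && k <= i; ++k)
            acc += coeff[k] * xb[i - k];
        yb[i] = acc;
    }
}

// Launch: tdfir<<<numFilters, 256>>>(d_x, d_h, d_y, len);
```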

GTRI_B-11: Frequency-Domain FIR
Benchmark: fast convolution of a set of FIR filters in the frequency domain.
Strategies:
- NVIDIA's CUFFT library provides the Fast Fourier Transform
- A small kernel performs the complex element-wise multiplication
Good match for the GPU; the FFT speedup is greater for large input vectors (a sketch follows):
- Set 1: 3.25 ms, 19.7x speedup
- Set 2: 0.26 ms, 11.5x speedup
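A sketch of CUFFT-based fast convolution. cufftPlan1d and cufftExecC2C are real CUFFT entry points; the cmul kernel and the fastconv wrapper are illustrative, and the filters are assumed already zero-padded to the FFT length:

```cuda
#include <cufft.h>

// Element-wise complex multiply, with 1/n scaling folded in because CUFFT's
// inverse transform is unnormalized.
__global__ void cmul(cufftComplex *a, const cufftComplex *b, int total, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) {
        cufftComplex p;
        p.x = (a[i].x * b[i].x - a[i].y * b[i].y) * scale;
        p.y = (a[i].x * b[i].y + a[i].y * b[i].x) * scale;
        a[i] = p;
    }
}

// y = IFFT(FFT(x) .* FFT(h)), batched over all filters; results land in d_x.
void fastconv(cufftComplex *d_x, cufftComplex *d_h, int n, int batch) {
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);
    cufftExecC2C(plan, d_x, d_x, CUFFT_FORWARD);
    cufftExecC2C(plan, d_h, d_h, CUFFT_FORWARD);
    int total = n * batch, block = 256;
    cmul<<<(total + block - 1) / block, block>>>(d_x, d_h, total, 1.0f / n);
    cufftExecC2C(plan, d_x, d_x, CUFFT_INVERSE);
    cufftDestroy(plan);
}
```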

GTRI_B-12: Constant False Alarm Rate Detection
Benchmark: a data cube of beams x range gates x Doppler bins; normalize each cell by a noise estimate T computed from surrounding cells:

  C(i, j, k) ← T(i, j, k)^-1 * |C(i, j, k)|^2

Strategies:
- Map each (beam, Doppler bin) pair to a block
- Stream over the range gates, updating the noise estimate as the window slides
Good match for the GPU (a kernel sketch follows):
- Set 1: 0.29 ms, 2.3x speedup
- Set 2: 3.5 ms, 166x speedup
- Set 3: 3.4 ms, 46.8x speedup
- Set 4: 2.7 ms, 25.6x speedup
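A sketch of the normalization; for clarity each thread recomputes its window directly rather than streaming a sliding sum as the slide describes, and the window length ncfar, guard size g, and interleaved complex layout are assumptions:

```cuda
// One (beam, Doppler bin) pair per block (gridDim.x = beams * dopplerBins);
// each range cell is normalized by the mean power of ncfar neighbors on each
// side, skipping g guard cells. Assumes a nonzero noise estimate.
__global__ void cfar(const float2 *c, float *out, int ngates, int ncfar, int g) {
    const float2 *row = c + blockIdx.x * ngates;
    float *orow = out + blockIdx.x * ngates;
    for (int i = threadIdx.x; i < ngates; i += blockDim.x) {
        float sum = 0.0f;
        int cnt = 0;
        for (int k = g + 1; k <= g + ncfar; ++k) {
            if (i - k >= 0)     { float2 v = row[i - k]; sum += v.x * v.x + v.y * v.y; ++cnt; }
            if (i + k < ngates) { float2 v = row[i + k]; sum += v.x * v.x + v.y * v.y; ++cnt; }
        }
        float2 v = row[i];
        orow[i] = (v.x * v.x + v.y * v.y) * cnt / sum;   // |C|^2 / T
    }
}
```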

GTRI_B-13: Pattern Matching
Benchmark: compute the mean squared error (MSE) of an input vector against a template library; determine the optimal shift and scale for minimum MSE.

  Pattern Matching {
    for each of K patterns {
      for each of Sr shift values {
        find MSE of input with shifted pattern;
      }
      select shift with least MSE;
      for each of Sm magnitudes {
        find MSE of input with scaled pattern;
      }
      choose gain with least MSE;
    }
    choose gain, shift, pattern with least MSE;
  }

Strategies:
- Process each pattern in parallel (one per block)
- Each thread computes one shift, then one gain
Good match for the GPU (a kernel sketch follows):
- Set 1: 0.24 ms, 12.7x speedup
- Set 2: 1.65 ms, 23.1x speedup
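A sketch of the shift-search stage: one template per block, one candidate shift per thread, then a shared-memory min-reduction. The shift convention, the zero fill outside the template, and all names are assumptions; the gain search is analogous:

```cuda
// Find, for each of gridDim.x templates, the shift with least MSE versus the
// input. blockDim.x must be a power of two >= nshift.
__global__ void best_shift(const float *input, const float *templates,
                           int n, int nshift, int *shift_out) {
    extern __shared__ float mse[];            // blockDim.x floats...
    int *idx = (int *)(mse + blockDim.x);     // ...followed by blockDim.x ints
    const float *t = templates + blockIdx.x * n;
    int s = threadIdx.x;                      // this thread's candidate shift
    float e = 3.0e38f;                        // sentinel for unused threads
    if (s < nshift) {
        e = 0.0f;
        for (int i = 0; i < n; ++i) {
            int j = i - s;                    // template shifted right by s
            float d = input[i] - ((j >= 0) ? t[j] : 0.0f);
            e += d * d;
        }
        e /= n;
    }
    mse[threadIdx.x] = e;
    idx[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride && mse[threadIdx.x + stride] < mse[threadIdx.x]) {
            mse[threadIdx.x] = mse[threadIdx.x + stride];
            idx[threadIdx.x] = idx[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) shift_out[blockIdx.x] = idx[0];
}

// Launch: best_shift<<<K, 256, 256 * (sizeof(float) + sizeof(int))>>>(d_in, d_tmpl, n, Sr, d_best);
```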

GTRI_B-14: Graph Optimization via Genetic Algorithms
Benchmark: use a genetic algorithm to search a problem space.
- Roulette-wheel selection
- Evaluation based on a lookup table
- Elite chromosomes immune to mutation

  Genetic Algorithm {
    Initialization;
    Evaluation;
    while !finished {
      Selection;
      Reproduction;
      Crossover;
      Mutation;
      Evaluation;
    }
  }

Strategies (an RNG sketch follows the results):
- Batch kernel calls to perform each iteration
- Implement a parallel RNG
- Selection and reproduction form a gather operation
- Crossover and mutation are parallel
- Evaluation is parallel
Results:
- Set 1: 0.5 ms, 15.6x speedup
- Set 2: 11.7 ms, 33.3x speedup
- Set 3: 1.0 ms, 21.9x speedup
- Set 4: 4.1 ms, 23.7x speedup
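The "parallel RNG" strategy amounts to giving every thread its own generator state so random draws need no serialization; here is a minimal sketch using one linear congruential generator per thread (the classic Numerical Recipes constants, not necessarily the authors' generator):

```cuda
// Advance one LCG stream per thread and emit a uniform float in [0, 1).
// state[] holds one 32-bit seed per chromosome/thread, initialized by the
// host to distinct values before the GA loop begins.
__global__ void lcg_uniform(unsigned int *state, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int s = 1664525u * state[i] + 1013904223u;
        state[i] = s;
        out[i] = s * (1.0f / 4294967296.0f);
    }
}

// Launched once per GA stage that needs random numbers, batched with the other
// per-generation kernels: lcg_uniform<<<(n + 255) / 256, 256>>>(d_state, d_rand, n);
```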

GTRI_B-15: QR Factorization: Fast Givens
Benchmark: A = QR, with Q^H Q = I and R upper triangular.
Fast Givens:
- Few square roots
- Fine-grain parallelization
- A streaming implementation requires different programs to run on several nodes
GPU characteristics:
- Fine-grain parallelization among the threads of one block
- SIMD execution among threads
- Square roots are inexpensive
- Shared memory capacity is limited

  M = eye(m, m); d = ones(m);
  for j = 1 : n {
    for i = m : -1 : j+1 {
      [alpha, beta, type] = fast.givens(A(i-1:i, j:n), d(i-1:i));
      A(i-1:i, j:n) = G(alpha, beta, type)^T * A(i-1:i, j:n);
      M(j:m, i-1:i) = M(j:m, i-1:i) * G(alpha, beta, type);
    }
  }
  D = diag(d); Q = M * D^(-1/2); R = D^(1/2) * A;

GTRI_B-16: Fast Givens: GPU Strategy

  Fast Givens {
    do {
      // kernel 1 -- one block
      load several columns of A;
      move up the columns, rotating A with threads staggered;
      write rotations to global memory;
      // kernel 2 -- sixteen blocks
      load rotations;
      load columns from the remaining submatrix of A;
      apply rotations to A in order;
      load submatrix of M;
      apply rotations to M in order;
      move the active window right;
    } until all columns zeroed;
  }

[Diagram: kernel 1 zeroes a band of columns of A; kernel 2 sweeps the rotations across the rest of A and M as the active window moves right.]

GTRI_B-17: QR on GPU: Conclusions
Fast Givens is not the greatest match:
- Its parallelism is well suited to a synchronous dataflow architecture
- It avoids square roots, calculations that are in fact fast on the GPU
- 2n^2(m - n/3) flops
Results:
- Set 1: 20 ms, 4.6x speedup
- Set 2: 4.5 ms, 1.5x speedup
- Set 3: 1.8 ms, 5.6x speedup
Other QR methods: Householder reflections
- Compute v such that (I - β v v^T) x = ||x|| e_1, then update A ← A - v (β A^T v)^T
- Serial, parallel, serial, parallel, ...: fast with batched calls
- 2n^2(m - n/3) flops

GTRI_B-18: GPU Limitations
GPU memory architecture:
- The G80 lacks a globally visible, writable cache
- Global memory has high latency
- Shared memory is fast but limited in capacity
Fine-grain parallelism:
- Threads within a block share data directly, with fast synchronization
- Blocks share data only via global memory and multiple kernel invocations
- Atomic memory operations are possible with newer GPUs
Kernel latency:
- CPU-GPU communication is limited by the PCI Express bus
- Newer GPUs (G92) permit DMA while kernels execute (a sketch follows)
- A delay is incurred when calling a kernel and copying results; this is tolerable for large data sizes and batched calls
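On hardware that supports concurrent copy and execution, transfers can be overlapped with kernels using streams; a hypothetical double-buffered pipeline is sketched below (cudaMemcpyAsync and streams are the real CUDA mechanism, while the kernel and sizes are illustrative, and h must be page-locked via cudaMallocHost for the copies to be truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Process nchunks chunks: while chunk i computes in one stream, chunk i+1's
// transfer proceeds in the other, hiding PCI Express latency.
void pipeline(float *h, int nchunks, int chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    float *d;
    cudaMalloc(&d, 2 * chunk * sizeof(float));     // one buffer per stream
    for (int i = 0; i < nchunks; ++i) {
        float *buf = d + (i & 1) * chunk;
        cudaMemcpyAsync(buf, h + i * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i & 1]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i & 1]>>>(buf, chunk);
        cudaMemcpyAsync(h + i * chunk, buf, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i & 1]);
    }
    cudaDeviceSynchronize();
    cudaFree(d);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```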

GTRI_B-19: Conclusions
- GPU speedup is possible for most classes of problems
- The memory hierarchy and threading model drive implementation choices
- High memory bandwidth and high parallelism make the GPU a good implementation of a streaming architecture, but cleverness is required for fast implementations
- Fine-grain parallelism is not a great match: there is no formal synchronization across blocks
- Benchmarks should grant flexibility to the implementation: they should not require obscure algorithms to solve common problems, nor define metrics biased away from coprocessors without necessity

GTRI_B-20: References
- HPEC Challenge Benchmarks
- Golub, G. H. and Van Loan, C. F. Matrix Computations, 3rd edition. Johns Hopkins University Press, 1996.
- NVIDIA CUDA Programming Guide, version 1.1

GTRI_B-21: Questions?