General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.


Outline ▼ Background ▼ NVIDIA’s CUDA ▼ Decomposition & Porting ▼ CUDA Optimizations ▼ GPU Results ▼ Conclusion

Background ▼ Parallel Programming on GPUs  General-Purpose Computation on Graphics Processing Units (GPGPU)  Compute Unified Device Architecture (CUDA)  Open Computing Language (OpenCL™)

Background ▼ GPUs vs. CPUs  GPU and CPU cores are not the same  A CPU core is faster and more robust, but there are far fewer of them  A GPU core is neither as robust nor as fast, but handles repetitive tasks quickly ▼ NVIDIA GeForce GTX 470  448 cores  Memory Bandwidth = GB/sec  GFLOPS DP ▼ Intel Core i7-965  4 cores  Memory Bandwidth = 25.6 GB/sec  GFLOPS DP

CUDA by NVIDIA ▼ Compute Unified Device Architecture  Low- and high-level APIs available  C for CUDA  High-latency memory transfers  Limited cache  Scalable programming model  Requires NVIDIA graphics cards

Decomposition and Porting ▼ Amdahl’s and Gustafson’s Laws ▼ Estimate Speed Up  P is the amount of parallel scaling achieved  γ is the fraction of the algorithm that is serial

Decomposition and Porting ▼ TAU Profile  Determine call paths and consider subroutine calls  Pay attention to large for loops or redundant computations ▼ Visual Studio 2008  Initialize Profile: TAU_PROFILE("StartFor", "Main", TAU_USER);  Place Timers: −TAU_START("FunctionName") −TAU_STOP("FunctionName")

Decomposition and Porting ▼ CUDA Overhead  High latency associated with memory transfers  Can be hidden by overlapping transfers with large amounts of computation  Reduce Device-to-Host memory transfers −Prefer fewer, larger transfers over many small ones −Perform serial tasks on the parallel processors to avoid round trips

CUDA Optimizations ▼ Thread and Block Occupancy  Varies depending on graphics card ▼ Page-Locked Memory  cudaHostAlloc()  A limited resource that should not be overused ▼ Streams  A queue of GPU operations, such as computation "kernels" and memory copies ▼ Asynchronous Memory Calls  Ensure non-blocking calls  cudaMemcpyAsync() or kernel call

Thread Occupancy ▼ Ensure enough threads are operating at the same time  256 threads per block  Max 1024 threads per block  Monitor occupancy

CUDA Optimizations ▼ Page-Locked Host Memory  cudaHostAlloc() vs. malloc() vs. new
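A minimal sketch of what this comparison exercises, using the CUDA runtime API. The buffer size is arbitrary and error handling is elided:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 64 << 20;            /* 64 MB, illustrative         */

    float *pageable = (float *)malloc(N); /* pageable: the driver must
                                             stage copies through an
                                             internal pinned buffer     */
    float *pinned;
    cudaHostAlloc((void **)&pinned, N, cudaHostAllocDefault); /* page-locked */

    float *dev;
    cudaMalloc((void **)&dev, N);
    cudaMemcpy(dev, pageable, N, cudaMemcpyHostToDevice); /* slower path */
    cudaMemcpy(dev, pinned,   N, cudaMemcpyHostToDevice); /* direct DMA;
                                             pinned memory is also required
                                             for cudaMemcpyAsync()      */
    cudaFree(dev);
    cudaFreeHost(pinned);                 /* pair with cudaHostAlloc,
                                             not free()                 */
    free(pageable);
    return 0;
}
```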

CUDA Optimizations ▼ Stream Structure Non-Optimized  Processing time: 49.5 ms

cudaMemcpyAsync(dataA0, stream0, HostToDevice)
cudaMemcpyAsync(dataB0, stream0, HostToDevice)
kernel<<<grid, block, 0, stream0>>>(result0, dataA0, dataB0)
cudaMemcpyAsync(result0, stream0, DeviceToHost)
cudaMemcpyAsync(dataA1, stream1, HostToDevice)
cudaMemcpyAsync(dataB1, stream1, HostToDevice)
kernel<<<grid, block, 0, stream1>>>(result1, dataA1, dataB1)
cudaMemcpyAsync(result1, stream1, DeviceToHost)

CUDA Optimizations ▼ Stream Structure Optimized  Processing time: 49.4 ms

cudaMemcpyAsync(dataA0, stream0, HostToDevice)
cudaMemcpyAsync(dataA1, stream1, HostToDevice)
cudaMemcpyAsync(dataB0, stream0, HostToDevice)
cudaMemcpyAsync(dataB1, stream1, HostToDevice)
kernel<<<grid, block, 0, stream0>>>(result0, dataA0, dataB0)
kernel<<<grid, block, 0, stream1>>>(result1, dataA1, dataB1)
cudaMemcpyAsync(result0, stream0, DeviceToHost)
cudaMemcpyAsync(result1, stream1, DeviceToHost)

CUDA Optimizations ▼ Stream Structure Optimized & Modified  Processing time: 41.1 ms

cudaMemcpyAsync(dataA0, stream0, HostToDevice)
cudaMemcpyAsync(dataA1, stream1, HostToDevice)
cudaMemcpyAsync(dataB0, stream0, HostToDevice)
cudaMemcpyAsync(dataB1, stream1, HostToDevice)
kernel<<<grid, block, 0, stream0>>>(result0, dataA0, dataB0)
cudaMemcpyAsync(result0, stream0, DeviceToHost)
kernel<<<grid, block, 0, stream1>>>(result1, dataA1, dataB1)
cudaMemcpyAsync(result1, stream1, DeviceToHost)

CUDA Optimizations ▼ Stream structure is not always beneficial  Stream overhead can result in a performance reduction  Profile to compare kernel execution time vs. data transfer time −NVIDIA Visual Profiler −cudaEventRecord()
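A sketch of event-based timing with cudaEventRecord(), as suggested above. The empty kernel and launch configuration are placeholders for the application's own work:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel(void) {}           /* placeholder for real work */

int main(void) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);            /* record into default stream */
    kernel<<<1, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           /* wait until the GPU passes stop */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Comparing this kernel time against the transfer time tells you whether stream overlap can pay for its overhead.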

GPU Results ▼ Optimization Stages  0: No Optimizations (65 FPS)  1: Page-Locked Memory (67 FPS)  2: Asynchronous GPU Calls (81 FPS)  3: Non-Optimized Streaming (82 FPS)  4: Optimized Streaming (85 FPS)

GPU Results ▼ ALF CPU vs. GPU Processing

Conclusion ▼ Test various threads-per-block allocations ▼ Use page-locked memory for data transfers  Asynchronous memory transfers and non-blocking calls ▼ Ensure proper coordination of streams  Data parallelism and task parallelism

QUESTIONS?

References ▼ Amdahl, G., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS Spring Joint Computer Conference, 1967. ▼ CUDA C Best Practices Guide, Ver. 4.0, May 2011. ▼ Gustafson, J., "Reevaluating Amdahl's Law." Communications of the ACM, Vol. 31, No. 5, May 1988. ▼ Sanders, J. and Kandrot, E., CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2010. ▼ NVIDIA CUDA Programming Guide, Ver. 4.0, 5/6/2011. ▼ TAU User Guide. Department of Computer and Information Science, University of Oregon, Advanced Computing Laboratory.