CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Presentation transcript:

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA. Yooseong Kim and Aviral Shrivastava, Compiler and Microarchitecture Laboratory, Arizona State University. DAC 2011.

Outline: Introduction; Preliminaries; Motivating examples; CuMAPz approach; Experimental results and conclusions.

Introduction The computational power of Graphics Processing Units (GPUs) has reached the teraFLOP scale. NVIDIA CUDA and OpenCL make GPGPU (General-Purpose computation on GPUs) programming much easier. Because GPGPU applications typically operate on large data sets, overall performance is heavily affected by memory performance.

Introduction (cont.) Shared memory is as fast as registers and is the only fast memory that supports both reads and writes. Many factors affect memory performance: data reuse, global memory access coalescing, shared memory bank conflicts, and channel skew. This work develops CuMAPz (CUDA Memory Access Pattern analyZer) to analyze the memory performance of CUDA programs.

Preliminaries: NVIDIA GPU architecture; comparisons between CPU and GPU; CUDA programming; memory coalescing; execution of GPU threads.

Architecture of the NVIDIA GTX 280 A collection of 30 multiprocessors, each with 8 streaming processors. The 30 multiprocessors share one off-chip global memory (access time: about 300 clock cycles). Each multiprocessor has an on-chip shared memory shared by its 8 streaming processors (access time: 2 clock cycles).

Architecture diagram

Some differences between GPU (NVIDIA GeForce 8800 GTX) and CPU (Intel Pentium 4)
Cores and clock rate: GPU 128 cores at 575 MHz core clock (1.35 GHz shader clock); CPU 1 core at 3.0 GHz.
FLOPS: GPU 345.6 GFLOPS; CPU ~12 GFLOPS.
Memory bandwidth: GPU 86.4 GB/s (900 MHz memory clock, 384-bit interface, 2 issues); CPU 6.4 GB/s (800 MHz memory clock, 32-bit interface, 2 issues).
Access time of global (main) memory: GPU slow (about 500 memory clock cycles); CPU fast (about 5 memory clock cycles).

Abstract comparison of memory between CPU and GPU (cont.)
CPU (Intel Pentium 4)  ->  GPU (NVIDIA GeForce 8800 GTX)
Register  ->  Register
Cache  ->  Texture cache or Constant cache
Main memory  ->  Shared memory
Hard disk  ->  Global memory, Texture memory, Constant memory

Memory coalescing Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations. Because the access time of global memory is relatively large, achieving coalescing is important.
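
As a rough illustration (not taken from the original slides), the two hypothetical kernels below contrast an access pattern that coalesces well with one that does not; the parameter MAX and the in/out arrays are assumed to be provided by the caller.

    // Coalesced: consecutive threads (threadIdx.x) touch consecutive addresses,
    // so one warp's accesses can be merged into a few wide transactions.
    __global__ void copy_coalesced(const float *in, float *out, int MAX) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * MAX + col] = in[row * MAX + col];
    }

    // Uncoalesced: consecutive threads touch addresses MAX elements apart,
    // so each thread's access may require its own memory transaction.
    __global__ void copy_strided(const float *in, float *out, int MAX) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        out[col * MAX + row] = in[col * MAX + row];
    }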

CUDA programming CUDA stands for Compute Unified Device Architecture. The CPU code handles the sequential part. The highly parallel part is usually implemented in GPU code, called a kernel. Calling a GPU function from CPU code is called a kernel launch.
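
For instance, a minimal (hypothetical) kernel and its launch from host code might look like the following; the grid and block sizes are purely illustrative.

    // GPU code: one thread scales one element.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    // CPU (host) code: the kernel launch hands the parallel work to the GPU.
    // d_data is assumed to point to device memory allocated with cudaMalloc.
    void launch_scale(float *d_data, float factor, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, factor, n);
    }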

Execution of GPU threads Threads are grouped into thread blocks. Each thread block is assigned for execution to a streaming multiprocessor (SM), which contains multiple scalar processors (SPs). Threads actually execute on the SPs in groups of 32, called warps. The SPs execute one warp at a time.
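
For example, a block of 16x16 = 256 threads consists of 8 warps of 32 threads each; a thread can determine which warp it belongs to within its block as in this small illustrative helper.

    __device__ int warp_in_block() {
        // Linearize the 2D thread index, then divide by the warp size.
        int tid = threadIdx.y * blockDim.x + threadIdx.x;
        return tid / warpSize;   // warpSize is a built-in CUDA constant, 32 on these GPUs
    }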

Motivating examples What to fetch into shared memory? How to access shared memory? How to access global memory?

What to fetch into shared memory? A simple program that does not use shared memory.

What to fetch into shared memory? (cont.) If we fetch element row*MAX+col+1 into shared memory…
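
The referenced Figure 2 is not reproduced in this transcript. The sketch below only illustrates the idea of the two buffering choices; the names BLKDIM, MAX, s_in and the exact indexing are assumptions based on the slide text, not the paper's actual code.

    #define BLKDIM 16   // assumed block dimension (not given on the slide)

    // Hypothetical reconstruction of the buffering step, NOT the paper's Figure 2.
    __global__ void buffer_into_shared(const float *in, float *out, int MAX) {
        __shared__ float s_in[BLKDIM][BLKDIM];

        int row = blockIdx.y * BLKDIM + threadIdx.y;
        int col = blockIdx.x * BLKDIM + threadIdx.x;

        // Option A: each thread buffers the element it reads itself (aligned, coalesced).
        s_in[threadIdx.y][threadIdx.x] = in[row * MAX + col];

        // Option B (the slide's variant): buffer the neighboring element instead.
        // This can raise data reuse, but it shifts every global read by one element,
        // which can break the alignment needed for fully coalesced transactions
        // on this generation of hardware.
        // s_in[threadIdx.y][threadIdx.x] = in[row * MAX + col + 1];

        __syncthreads();

        out[row * MAX + col] = s_in[threadIdx.y][threadIdx.x];   // placeholder use of the buffer
    }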

What to fetch into shared memory? (cont.) Generally, higher data reuse should imply better performance, but that may not be true here. This counter-intuitive result is mainly caused by global memory access coalescing.

How to access shared memory? In Figure 2, data is accessed in a column-wise manner, as shown at Lines 4, 9, 11, and 16. What if we change to a row-wise manner (i.e., s_in[tIdx.y][tIdx.x]) or skew the access pattern (i.e., __shared__ float s_in[BLKDIM][BLKDIM+1])?

How to access shared memory? (cont.) Shared memory bank conflicts occur if there are multiple requests to different addresses in the same bank. In this case, the requests are serialized.
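
A common way to remove such conflicts for column-wise access is the padding hinted at in the previous slide. The transpose-style kernel below is only a sketch of that technique (BLKDIM and the kernel itself are assumptions, not the paper's code).

    #define BLKDIM 16   // assumed tile size; 16 matches the 16 shared memory banks on GT200-class GPUs

    __global__ void transpose_tile(const float *in, float *out, int MAX) {
        // Padding the second dimension by one element places consecutive rows of a
        // column in different banks, so the column-wise read below is conflict-free.
        // Without the "+ 1", all threads of a warp reading one column would hit the
        // same bank and their requests would be serialized.
        __shared__ float s_in[BLKDIM][BLKDIM + 1];

        int row = blockIdx.y * BLKDIM + threadIdx.y;
        int col = blockIdx.x * BLKDIM + threadIdx.x;
        s_in[threadIdx.y][threadIdx.x] = in[row * MAX + col];     // coalesced global read
        __syncthreads();

        int t_row = blockIdx.x * BLKDIM + threadIdx.y;            // transposed tile origin
        int t_col = blockIdx.y * BLKDIM + threadIdx.x;
        out[t_row * MAX + t_col] = s_in[threadIdx.x][threadIdx.y]; // column-wise shared read, coalesced global write
    }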

How to access global memory? A programmer might have written the global memory write reference at Line 18 in Figure 2 in a column-wise manner, as in out[col*MAX+row]. This causes an unexpected slowdown, which is due to channel skew. Channel skew is the ratio of the number of concurrent accesses to the most used channel to that of the least used channel.

Previous works [20] modeled the amount of parallelism employed in a program and the efficiency of a single kernel execution in a thread, but it did not consider memory performance, and its analysis targets only compute-intensive benchmarks. [8] includes the effect of parallelism in hiding global memory access latency, but it does not take branch divergence into account. [14][15][16][17][18] automate the optimization of GPGPU applications. None of the above works provides a comprehensive performance metric to estimate the efficiency of a memory access pattern.

CuMAPz overview

Data Reuse Profit Estimation CuMAPz maintains a counter for the number of times each shared memory buffer is accessed. The degree of data reuse is expressed by a term, data reuse, as follows:
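
The equation itself is not reproduced in this transcript. Purely as an illustration (not necessarily the paper's exact definition), data reuse for a buffer b could be expressed as

    \mathit{DataReuse}(b) = \frac{\#\,\text{accesses to shared buffer } b}{\#\,\text{elements fetched into } b}

i.e., how many times each element brought into shared memory is actually used.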

Coalesced Access Profit Estimation Due to coalescing, the actual transfer size that consumes bus bandwidth can differ from the size of the data requested by the threads. CuMAPz calculates the bandwidth utilization as follows:
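
The slide's formula is not shown here; a definition consistent with the description, given only as an illustration, is

    \mathit{BandwidthUtilization} = \frac{\sum_{\text{requests}} \text{bytes requested by the threads}}{\sum_{\text{requests}} \text{bytes actually transferred after coalescing}}

A fully coalesced pattern gives a ratio near 1, while scattered accesses waste most of each transaction.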

Channel Skew Cost Estimation Channel skew refers to the case where concurrent memory accesses are not evenly distributed across all channels but are concentrated on only a few of them. When a kernel is launched, thread blocks are assigned to SMs in sequential order, so adjacent blocks are executed on adjacent SMs. The assignment becomes unpredictable after the first round of scheduling, since the order in which thread blocks finish execution cannot be determined [13].

Channel Skew Cost Estimation (cont.) The impact of channel skew can be quantified as the skewness of the mapping of accesses to channels, which can be calculated as follows:
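
Using the ratio definition given earlier, with N(c) denoting the number of concurrent accesses mapped to channel c, the skew can be written as

    \mathit{ChannelSkew} = \frac{\max_{c} N(c)}{\min_{c} N(c)}

A value of 1 means the accesses are perfectly balanced across channels.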

Bank Conflict Cost Estimation Similar to global memory channels, the shared memory space is divided into multiple banks, and each bank can serve one address at a time. The efficiency of shared memory access is modeled as follows:
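
The original equation is not reproduced here; one simple model in the spirit of the description is the ratio of shared memory requests to the serialized bank transactions needed to serve them:

    \mathit{BankEfficiency} = \frac{\#\,\text{shared memory requests}}{\#\,\text{serialized bank transactions}}

With no conflicts this ratio is 1; an n-way conflict reduces it to 1/n.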

Branch Divergence Cost Estimation Branches are introduced when there is an uncovered region that is not buffered in shared memory, as shown at Lines 6 and 13 in Figure 2. When the threads in a warp take different execution paths, all paths are serialized. The impact of branch divergence is modeled simply as follows:
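
The equation is likewise missing from the transcript; a simple illustrative model counts the distinct paths a warp must serialize:

    \mathit{DivergenceCost}(w) = \#\,\text{distinct execution paths taken by the threads of warp } w

A warp whose threads all take the same branch has cost 1.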

Overall Memory Performance Estimation The overall memory performance estimate is calculated by the following formula.
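
The exact combination used by CuMAPz is not shown on this slide. Purely as an illustration, the profit and cost terms above could be combined multiplicatively, e.g.

    \mathit{MemPerf} \approx \frac{\mathit{DataReuse} \times \mathit{BandwidthUtilization} \times \mathit{BankEfficiency}}{\mathit{ChannelSkew} \times \mathit{DivergenceCost}}

so that higher profits and lower costs yield a higher estimate.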

Experimental results Environment: programs written in C; CUDA driver version 3.2 on an NVIDIA Tesla C1060. Benchmarks are taken from the benchmark suite in [6] and the CUDA SDK.

Two experiments Validation: studying the correlation between the memory performance estimation and the measured performance of the benchmarks under different access patterns. Performance optimization: trying to find the best way to access shared and global memory using CuMAPz and the previous technique [8].

Validation

Performance optimization

Runtime Considerations The time complexity of the CuMAPz analysis is O(|W|*|R|*|B|), where W, R, and B are the sets of all warps, global memory references, and shared memory buffers, respectively.

Limitations Compile-time analysis: CuMAPz cannot handle information that can only be determined at run time. Assumes adequate occupancy: occupancy is a measure of how many thread blocks can be scheduled on one SM so that the hardware is kept busy.

Conclusions The GPU is a new platform for high-performance computing. This work develops CuMAPz to analyze the memory performance of CUDA programs, considering many aspects such as data reuse, coalescing, channel skew, bank conflicts, and branch divergence. Experimental results show a very high correlation between the actual execution times and the CuMAPz estimation.