Addressing Software-Managed Cache Development Effort in GPGPUs

Presentation transcript:

Addressing Software-Managed Cache Development Effort in GPGPUs
Ahmad Lashgar, PhD Candidate, ECE Department
Supervisors: Dr. A. Baniasadi, Dr. N. J. Dimopoulos
PhD Defense Session, August 8, 2017

Heterogeneous Computing
Use of different kinds of processing cores for higher efficiency, from high-performance to low-power embedded systems.
High-performance computing style:
- Serial workloads run on multiple fast out-of-order cores (traditional CPUs)
- Parallel workloads run on many slow in-order cores (accelerators: GPUs, FPGAs, Xeon Phi, etc.)

Accelerator Hardware/Software Stack
Scope of this work: OpenACC-over-CUDA for NVIDIA GPUs

Hardware: NVIDIA-like GPGPU Microarchitecture
[Block diagram: the GPU's thread block dispatcher sends ready thread blocks (Block 0 .. Block K-1) to N SMs. Each SM holds cores (Core1 .. CoreN) with a scoreboard and register file, a load/store unit, arithmetic/logic, floating-point, and special-function SIMD units, an L1 cache (data, constant, and texture caches), and shared memory. The SMs reach L2 cache slices and memory controllers (MC1 .. MCM) in front of DRAM through an interconnect.]
- Thread block -> a group of 1 to 1024 loop iterations executed in lock-step on the SIMD units

Software Interface
OpenACC (higher productivity):

  int main(){
    ...
    #pragma acc kernels copyin(a[0:LEN]) copyout(b[0:LEN])
    #pragma acc loop independent
    for(i=0; i<LEN; ++i){
      b[i] = a[i]+10;
    }
  }

CUDA (higher performance):

  __global__ void matrixMul(int *a, int *b, int len){
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    b[i] = a[i]+10;
  }
  int main(){
    ...
    bytes = LEN*sizeof(int);
    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    dim3 gridSize(LEN/16, LEN/16);
    dim3 blockSize(16, 16);
    matrixMul<<<gridSize, blockSize>>>(a_d, b_d, LEN);
    cudaMemcpy(b, b_d, bytes, cudaMemcpyDeviceToHost);
  }

- Not going into the details of the code, just showing the relative code size
- Each loop iteration is executed by one thread

Software-managed Cache in GPUs
OpenACC:

  ...
  #pragma acc cache(a[i:256])
  {
    sum += a[i] + ... + a[i+255];
  }

CUDA:

  __shared__ float smc[256];
  smc[threadIdx.x] = a[i];
  float sum = a[i] + ... + a[i+255];

[Figure: GPU core #1 with ready thread blocks, scoreboard & register file, LDST units & SIMDs, L1$, shared memory (SMC), and other caches.]
- Private per-core space
- Trivial to use in OpenACC
- In CUDA, though, two memory spaces have to be maintained
- Data in shared memory is private and not backed up elsewhere
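To make the CUDA half of this comparison concrete, the fragment above abbreviates a staging pattern that, written out in full, looks roughly like the sketch below. The kernel name, tile size, and array names are illustrative assumptions, not the dissertation's code.

  // Illustrative CUDA kernel: each block stages a 256-element tile of a[]
  // into shared memory before consuming it.
  #define TILE 256

  __global__ void tileSum(const float *a, float *blockSums, int len)
  {
      __shared__ float smc[TILE];                 // software-managed cache (shared memory)
      int i = blockIdx.x * TILE + threadIdx.x;    // global element handled by this thread

      // 1) Explicitly copy the tile from global memory into shared memory.
      smc[threadIdx.x] = (i < len) ? a[i] : 0.0f;
      __syncthreads();                            // make the whole tile visible to the block

      // 2) One thread per block consumes the tile (kept simple on purpose).
      if (threadIdx.x == 0) {
          float sum = 0.0f;
          for (int j = 0; j < TILE; ++j)
              sum += smc[j];
          blockSums[blockIdx.x] = sum;
      }
  }

The point is the extra bookkeeping: the programmer has to size the buffer, copy data in, synchronize, and keep the global and shared copies consistent, which the cache directive hides.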

This Dissertation
[Roadmap figure:
- Approach: compare high-level (OpenACC) and low-level (CUDA) programming models; criteria: productivity and performance
- Infrastructure: IPMACC OpenACC compiler (Chapter 3), with micro-benchmarking to tune the compiler (Chapter 4)
- Observation: the software-managed cache (SMC) is a major performance bottleneck in high-level accelerator programming
- Proposed solutions: implementing the cache directive (Chapter 5), enhancing the cache directive (Chapter 6), TELEPORT (Chapter 7)]

Outline
- Overview
- Chapter 3: IPMACC
- Chapter 4: Micro-benchmarking GPGPUs
- Chapter 5: Implementing the OpenACC cache Directive
- Chapter 6: Enhancing the cache Directive
- Chapter 7: TELEPORT
- Conclusion

Chapter 3: IPMACC
Developed from scratch to run OpenACC codes on vector processors (GPUs and CPUs).
[Compilation flow figure: the OpenACC source goes through the IPMACC translator to CUDA, OpenCL, or ISPC translated source (optionally hand-optimized), which the system compilers (nvcc, gcc, ispc) turn into the machine binary; the OpenACC API is provided by libopenacc.so.]

Chapter 3: IPMACC (2)
Simple usage:
  $ ipmacc sourceacc.c -o gpubinary
  This generates an intermediate CUDA source code.
Features:
- Supported directives: data, kernels, loop, cache
- Supported backends: CUDA, OpenCL, and ISPC
- Supports user-defined datatypes and routines
Available to the public since Dec 6, 2014: https://github.com/lashgar/ipmacc (the repository includes our benchmarks too)

Chapter 3: IPMACC (2)
Investigating the performance gap between CUDA and OpenACC

Chapter 4: Micro-benchmarking GPGPUs
Understanding the micro-architecture through software stress tests.
Purpose: to tune the IPMACC compiler for best performance.
Targets:
- L1 cache memory request handling resources
- Shared memory organization
Available to the public since Jun 9, 2015: https://github.com/lashgar/microbenchmark

Chapter 4: Micro-benchmarking GPGPUs (2)
Findings on the Tesla C2070 L1 cache (MSHR-based):
- 128 MSHR entries, with at most 8 request merges per entry
- MLP per SM of at least: 1024 coalesced (128 x 8), 128 un-coalesced
Findings on the Tesla K20c L1 cache (PRT-based):
- 44 PRT entries, with at most 32 requests per entry (the warp size)
- MLP per SM: 1408 (44 x 32), coalesced or un-coalesced
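As a rough illustration of the stress-test approach (not the dissertation's actual microbenchmark), a kernel like the one below issues a configurable number of independent loads per thread; timing it while sweeping ILP and the stride reveals the point where the per-SM request-handling resources (MSHR or PRT entries) saturate. The names and constants are assumptions.

  #define ILP 8        // independent loads in flight per thread (sweep this)

  __global__ void mlpStress(const int * __restrict__ in, int *out, int stride, int len)
  {
      int base = blockIdx.x * blockDim.x + threadIdx.x;
      int acc = 0;
      #pragma unroll
      for (int k = 0; k < ILP; ++k) {
          int idx = (base + k * stride) % len;   // independent address per load
          acc += in[idx];                        // loads can overlap in the memory system
      }
      out[base] = acc;                           // keeps the loads from being optimized away
  }

With stride = 1 the warp's accesses coalesce into one request per load; with a large stride each lane produces its own request, so the un-coalesced limit is reached much earlier.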

Chapter 4: Micro-benchmarking GPGPUs (3)
Findings on shared memory:
- The layout of 2D arrays allocated in shared memory is the same as that of flattened (row-major) 2D arrays
- Adding a small padding to the allocation largely resolves bank conflicts
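A minimal sketch of the padding trick mentioned above, using the classic shared-memory transpose tile (illustrative; assumes 32x32 thread blocks and a width that is a multiple of 32):

  #define TILE 32

  __global__ void transposeTile(const float *in, float *out, int width)
  {
      // __shared__ float tile[TILE][TILE];     // column reads below would all hit one bank
      __shared__ float tile[TILE][TILE + 1];    // +1 padding shifts successive rows across banks

      int x = blockIdx.x * TILE + threadIdx.x;
      int y = blockIdx.y * TILE + threadIdx.y;
      tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced load
      __syncthreads();

      int tx = blockIdx.y * TILE + threadIdx.x;               // swapped block coordinates
      int ty = blockIdx.x * TILE + threadIdx.y;
      out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free with padding
  }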

This Dissertation
[Roadmap figure repeated, now highlighting Chapter 5: implementing the cache directive.]

Chapter 5: Implementing the cache Directive
First work investigating the implementation of the cache directive.
Example:

  #pragma acc kernels loop
  for(i=1; i<LEN-1; ++i){
    int lower = i-1, upper = i+1;
    float sum = 0;
    #pragma acc cache(a[(i-1):3])
    {
      for(j=lower; j<=upper; ++j)
        sum += a[j];
    }
    b[i] = sum/(upper-lower+1);
  }

[Figure: GPU core #1 with ready thread blocks, scoreboard & register file, LDST units & SIMDs, L1$, shared memory (SMC), and other caches.]

Chapter 5: Implementing the cache Directive (2)
The directive is general, but the implementation is hardware specific; we target NVIDIA GPUs.
Critical optimizations: cache sharing, fetch, write policy, index mapping, occupancy.
Proposed three implementations (a lowering sketch follows below):
- Our best implementation performs close to CUDA while reducing development effort by 29%
- Evaluated various configurations under 3 test cases on 2 GPUs
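For a rough idea of what an implementation of the directive has to generate, the stencil example above could be lowered to shared memory along the following lines. This is a hand-written sketch under assumed block sizes, not one of the dissertation's three implementations.

  // CUDA analogue of the cache(a[(i-1):3]) region: each block stages the
  // window of a[] its iterations touch (blockDim.x + 2 halo elements) into
  // shared memory once, then every thread reads its 3-point stencil from there.
  // Launch as: stencilCached<<<grid, block, (block.x + 2) * sizeof(float)>>>(a_d, b_d, LEN);
  __global__ void stencilCached(const float *a, float *b, int len)
  {
      extern __shared__ float smc[];                      // blockDim.x + 2 floats
      int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // loop starts at i = 1
      int windowStart = blockIdx.x * blockDim.x;          // global index of smc[0]

      // Cooperative fetch of the block's window, including the two halo cells.
      for (int k = threadIdx.x; k < blockDim.x + 2; k += blockDim.x)
          if (windowStart + k < len)
              smc[k] = a[windowStart + k];
      __syncthreads();

      if (i < len - 1) {
          int local = i - windowStart;                    // position of a[i] inside smc
          float sum = smc[local - 1] + smc[local] + smc[local + 1];
          b[i] = sum / 3.0f;
      }
  }

The implementation choices the chapter studies (who fetches, whether the buffer is shared by the whole block, how indices are remapped, and the occupancy cost of the shared-memory allocation) all surface in how this kind of code is generated.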

This Dissertation
[Roadmap figure repeated, now highlighting Chapter 6: enhancing the cache directive.]

Chapter 6: Enhancing the cache Directive
cache directive limitation:
[Figure: the four ways loop iterations (#1 .. #N) can use a software-managed cache buffer: private read-only, private read-write, shared read-only, and shared read-write.]

Chapter 6: Enhancing the cache Directive (2)
Shared read-write SMC is very common in CUDA applications, e.g. Reduction, Pathfinder, Hotspot, Dyadic Convolution (see the CUDA sketch below).
We proposed the fcw directive to address this limitation:
- Shared read-write SMC in OpenACC
- Notation is similar to the cache directive
- Introduces an inter-iteration communication model for sharing intermediate results
Evaluated under six test cases: runtime improvement over baseline OpenACC, at lower development effort than CUDA
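For context, the shared read-write pattern the fcw directive targets looks like the following in plain CUDA: a textbook block-level reduction (not the dissertation's code) in which every thread writes the shared buffer and then repeatedly reads and updates it.

  #define BLOCK 256   // must be a power of two

  __global__ void blockReduce(const float *in, float *blockSums, int len)
  {
      __shared__ float smc[BLOCK];
      int i = blockIdx.x * BLOCK + threadIdx.x;
      smc[threadIdx.x] = (i < len) ? in[i] : 0.0f;
      __syncthreads();

      // Tree reduction inside shared memory: iterations communicate through smc[],
      // i.e. the buffer is both shared and read-write.
      for (int s = BLOCK / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s)
              smc[threadIdx.x] += smc[threadIdx.x + s];
          __syncthreads();
      }

      if (threadIdx.x == 0)
          blockSums[blockIdx.x] = smc[0];
  }

The cache directive cannot express this exchange of intermediate results between iterations, which is the gap the fcw directive is meant to fill.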

This Dissertation
[Roadmap figure repeated, now highlighting Chapter 7: TELEPORT.]

Chapter 7: TELEPORT
Key idea:
- Identify data with sharing opportunities in CUDA kernels
- Fetch the data “automatically” into shared memory
Advantages:
- No extra development effort
- No extra instructions

Chapter 7: TELEPORT (2)
[Figure: classifying the index expressions of the arrays read by a kernel launched as Kernel<<< NTB, blockDim >>>( <args> ):
- Constant: blockDim, threadIdx, blockIdx, tid. The range of tid is a function of blockIdx (min = 0, max = blockDim), so it is known once blockIdx is known.
- Depends on an induction variable (i): known only at runtime.
- Data dependent (id).
Only memory reads are candidates.]
(An illustrative kernel with all three categories follows below.)
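To make the classification concrete, a BFS-style kernel of the kind the nodes and cost arrays on the following slide suggest might look like the sketch below (illustrative code, not the dissertation's benchmark). The comments mark the category of each index.

  struct Node { int start; int degree; };

  __global__ void bfsStep(const Node *nodes, const int *edges,
                          int *cost, int numNodes)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;   // "constant" category
      if (tid >= numNodes) return;

      Node n = nodes[tid];                               // read at tid: preload candidate
      int myCost = cost[tid];                            // read at tid: preload candidate

      for (int i = n.start; i < n.start + n.degree; ++i) // i: induction-variable category
      {
          int id = edges[i];                             // id is data dependent
          if (cost[id] < 0)                              // data-dependent read
              cost[id] = myCost + 1;                     // write: not a candidate
      }
  }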

Chapter 7: TELEPORT (2)
For the arrays indexed by tid, tracker calls register the per-thread-block ranges alongside the kernel launch:

  cudaSetCTATracker( nodes, <tid range>);
  cudaSetCTATracker( cost, <tid range>);
  Kernel<<< NTB, blockDim >>>( <args> );

- Only memory reads are candidates
- The range of tid is known if blockIdx is known

Chapter 7: TELEPORT (2)
[Hardware figure: the baseline GPU (thread block dispatcher, SMs with scoreboard, load/store unit, execution engine, cache, shared memory, interconnect, L2$, memory controllers MC1 .. MCN) extended per SM with a Preloader and a tag store beside the shared memory, plus separate on-hold and ready thread pools. The ranges registered by the cudaSetCTATracker calls drive the Preloader.]

Chapter 7: TELEPORT (3)
Used GPGPU-Sim 3.2.2 for modeling TELEPORT.
Analyzed 16 benchmarks; limited the evaluations to 5 test cases.
Three methods are compared:
- Baseline no-SMC CUDA
- Baseline no-SMC CUDA + TELEPORT
- Hand-optimized CUDA
Evaluations show advantages in development effort, runtime, dynamic instructions, DRAM accesses, and DRAM row locality.
- Note: TELEPORT does not replace all functionality of the CUDA shared memory notation

Conclusion
- Developed and introduced the IPMACC OpenACC compiler
- Derived a set of micro-benchmarks to understand GPUs
- First work investigating the implementation aspects of the cache directive
- Proposed a directive to address the cache directive's limitations
- Proposed a hardware/software alternative to CUDA shared memory programming

List of Publications
- Efficient Implementation of OpenACC cache Directive on NVIDIA GPUs. Ahmad Lashgar and Amirali Baniasadi. To appear in the International Journal of High Performance Computing and Networking (IJHPCN), Special Issue on High-level Programming Approaches for Accelerators.
- OpenACC cache Directive: Opportunities and Optimizations. Ahmad Lashgar and Amirali Baniasadi. In proceedings of WACCPD 2016 (in conjunction with SC 2016), Salt Lake City, Utah, USA, November 14, 2016.
- Employing Compression Solutions under OpenACC. Ebad Salehi, Ahmad Lashgar, and Amirali Baniasadi. In proceedings of HIPC 2016 (in conjunction with IPDPS 2016), Chicago, Illinois, USA, May 23-27, 2016.
- A Case Study in Reverse Engineering GPGPUs: Outstanding Memory Handling Resources. Ahmad Lashgar, Ebad Salehi, and Amirali Baniasadi. ACM SIGARCH Computer Architecture News - HEART '15, Volume 43, Issue 4.
- Employing Software-Managed Caches in OpenACC: Opportunities and Benefits. Ahmad Lashgar and Amirali Baniasadi. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (ACM ToMPECS), Volume 1, Issue 1, March 2016.
- Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities. Ahmad Lashgar and Amirali Baniasadi. In proceedings of the 17th IEEE HPCC (HPCC 2015), New York, NY, USA, August 24-26, 2015.
- Understanding Outstanding Memory Request Handling Resources in GPGPUs. Ahmad Lashgar, Ebad Salehi, and Amirali Baniasadi. In proceedings of the 6th HEART (HEART 2015) [Best Paper Candidate], Boston, MA, USA, June 1-2, 2015.
- IPMACC: Translating OpenACC API to OpenCL. Ahmad Lashgar, Alireza Majidi, and Amirali Baniasadi. Poster session of the 3rd International Workshop on OpenCL (IWOCL), Stanford University, California, USA, May 11-13, 2015.
- A Case Against Small Data Types on GPGPUs. Ahmad Lashgar and Amirali Baniasadi. In proceedings of the 25th IEEE ASAP (ASAP 2014), IBM Research, Zurich, Switzerland, June 18-20, 2014.

Thank You! Questions?