Performance Evaluation of Concurrent Lock-free Data Structures on GPUs

Performance Evaluation of Concurrent Lock-free Data Structures on GPUs
http://slideplayer.com/slide/3953720/
A presentation by Orr Goldman, 11/1/18

Our plan for today
Intro
GPU architecture 101 – so many cores
CUDA in a nutshell – GPU programming
Evaluating lock-free structures
Evaluation results
Conclusions

The following article is brought to you by:
Mainak Chaudhuri – Associate Professor, IIT Kanpur
Prabhakar Misra – Software Engineer, Google (the work is his M.Sc. thesis)
Indian Institute of Technology Kanpur, India, 2012

Why Evaluate on the GPU at all?

Intel Skylake core – not our subject!
[Block diagram of a Skylake core: out-of-order scheduler and execution ports (ALU, FMA, shift, LEA, MUL, DIV, shuffle, branch), load/store buffers, reorder buffer, 32 KB L1 I$ and D$, 256 KB L2$, μop cache and queue, branch prediction unit.]
Source: MAMAS course slides

GPU general scheme
No OOO execution
Usually no more than 2 levels of cache
Each core has less functionality, e.g. no built-in AES unit
Only one ALU per core
More space for cores!
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Nvidia GPUs – Fermi architecture
15 cores, or Streaming Multiprocessors (SMs)
Each SM features 32 CUDA processors
480 CUDA processors in total
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Fermi's memory model
32 K registers per SM
Shared memory for cores in the same SM
L1 cache – per SM
L2 cache – shared by all SMs (768 KB)
No L3 cache
Global memory
Sizes: either 48 KB shared / 16 KB L1 cache, or 16 KB shared / 48 KB L1 cache – the user's choice.
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
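
Not from the slides, but related to the shared/L1 split above: a minimal sketch of how a program can request one of the two Fermi configurations per kernel via the cudaFuncSetCacheConfig runtime call (the kernel name here is a hypothetical placeholder).

    __global__ void myKernel(void) { }   // hypothetical kernel

    int main(void) {
        // Ask for 48 KB shared memory / 16 KB L1 for this kernel;
        // cudaFuncCachePreferL1 would request 16 KB shared / 48 KB L1 instead.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        myKernel<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }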

CUDA – or "How I learned to stop worrying and love the GPU"

What is CUDA?
CUDA Architecture: expose GPU parallelism for general-purpose computing while retaining performance.
CUDA C/C++: based on industry-standard C/C++, with a small set of extensions to enable heterogeneous programming and straightforward APIs to manage devices, memory, etc.
Source: slides by Nvidia©

Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
Source: slides by Nvidia©

Hello World! with Device Code

    #include <stdio.h>

    __global__ void mykernel(void) { }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }

Source: slides by Nvidia© picture by Nvidia©
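
(Not on the slide: a file like this is typically compiled with Nvidia's nvcc compiler, e.g. nvcc hello.cu -o hello, and the <<<1,1>>> launch runs the kernel with one block of one thread.)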

Addition on the Device: add()

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

__global__ kernels (functions) always return void
Allocate memory in advance and pass it to the kernel via pointers
Let's take a look at main()…
Source: slides by Nvidia©

Addition on the Device: main()

    int main(void) {
        int a, b, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Setup input values
        a = 2;
        b = 7;

Source: slides by Nvidia©

Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Source: slides by Nvidia©

SIMD – Single Instruction Multiple Data

    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }

SIMT – Single Instruction Multiple Thread

    for (int i = 0; i < 4; i++) {
        if (i % 2 == 0)
            c[i] = a[i] + b[i];
        else
            c[i] = a[i] * b[i];
    }
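
Not shown on the slide, but as a hedged sketch of how this loop maps to SIMT: each element gets its own thread, and threads in the same warp may take different branches (warp divergence), which the hardware serializes. The kernel name and the d_a/d_b/d_c arguments are illustrative.

    __global__ void add_or_mul(int *a, int *b, int *c) {
        int i = threadIdx.x;              // thread i handles element i
        if (i % 2 == 0)
            c[i] = a[i] + b[i];           // even threads take this path
        else
            c[i] = a[i] * b[i];           // odd threads take this one
    }
    // launched e.g. as: add_or_mul<<<1, 4>>>(d_a, d_b, d_c);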

Addition on the Device: main() (Add SIMD)

    int main(void) {
        int a[N], b[N], c[N];     // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int) * N;

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Setup input values
        a = …; b = …;

Source: slides by Nvidia©

Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Source: slides by Nvidia©
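
The slides leave the input values elided (a = …; b = …;). As a purely hypothetical illustration, the host arrays could be filled element by element before the cudaMemcpy calls, for example:

    for (int i = 0; i < N; i++) {
        a[i] = i;         // arbitrary test values
        b[i] = 2 * i;
    }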

Vector Addition on the Device
Terminology: each parallel invocation of add() is referred to as a block. The set of blocks is referred to as a grid. Each invocation can refer to its block index using blockIdx.x.

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

By using blockIdx.x to index into the array, each block handles a different index:
Block 0: c[0] = a[0] + b[0];    Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];    Block 3: c[3] = a[3] + b[3];
Source: slides by Nvidia©
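
The deck indexes by blockIdx.x only. A common generalization (not on this slide, so treat it as an illustrative sketch) combines blocks and threads into a global index and guards against a partially full last block:

    __global__ void add(int *a, int *b, int *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
        if (i < n)                                       // last block may be partial
            c[i] = a[i] + b[i];
    }
    // e.g. add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);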

CUDA in a nutshell
Fast development for running code on the GPU
SIMT (SIMD) support – big speedup
Implements atomicCAS and atomicInc; both are pretty slow, as expected (see the sketch below)
For more info: course 046278 at the EE faculty
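
As a minimal sketch (assumed, not from the slides) of why atomicCAS is the workhorse for lock-free structures: a retry loop that pushes onto a lock-free stack kept in a preallocated node pool. Here top and next[] are hypothetical arrays of node indices, with -1 meaning empty.

    __device__ void push(int *top, int *next, int node) {
        int old;
        do {
            old = *top;                                 // read the current top
            next[node] = old;                           // link the new node above it
        } while (atomicCAS(top, old, node) != old);     // retry if top moved meanwhile
    }

Every failed CAS forces a retry, which is part of why the slide calls these operations slow under heavy contention.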

Evaluating lock-free structures – which?
Implementing sets with the following:
Lock-free linked list – Harris-Michael construction – presented by Erez (see the sketch below)
Lock-free chained hash table – a simplified variation on the implementation presented by Yoav; 10^5 buckets
Lock-free skip list – presented by Rana; p = 0.5 for the coin toss, max level = 32
Lock-free priority queue – skip-list-based implementation – presented by Michael T.
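
A minimal sketch of the CAS-based insert at the heart of such a linked list, assuming a sorted singly linked list of preallocated nodes. It deliberately omits the deletion marking that the full Harris-Michael construction requires, so it is only safe in the absence of concurrent deletes; it is not the authors' code.

    struct Node { int key; Node *next; };

    __device__ bool insert(Node **head, Node *node) {
        for (;;) {
            Node *pred = NULL, *curr = *head;
            while (curr != NULL && curr->key < node->key) {   // find insertion point
                pred = curr;
                curr = curr->next;
            }
            if (curr != NULL && curr->key == node->key)
                return false;                                  // key already present
            node->next = curr;
            Node **link = (pred == NULL) ? head : &pred->next;
            // Publish the node; if another thread changed the link, start over.
            if (atomicCAS((unsigned long long *)link,
                          (unsigned long long)curr,
                          (unsigned long long)node) == (unsigned long long)curr)
                return true;
        }
    }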

Evaluating lock-free structures – how?
Run on CPU: Intel Xeon server, 24 cores
Run on GPU: Tesla C2070 Fermi GPU, 14 SMs with 32 cores each – that's 448 cores!
Generate a mix of adds, deletes, and finds (for the priority queue, just add and remove-min)
Run on CPU, run on GPU, compare results

Make it solid – vary the data
Different key ranges: [0, 10^i] for 2 ≤ i ≤ 5
Different op counts: from 10,000 to 100,000 in steps of 10,000
Different op mixes:
(add, delete, find) = (20, 20, 60) or (40, 40, 20)
(add, remove-min) = (80, 20) or (50, 50)
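
A hypothetical host-side sketch of how such an op string might be generated for the (20, 20, 60) mix; the function name, enum, and use of rand() are illustrative only, not the authors' harness.

    #include <stdlib.h>

    enum Op { ADD, DELETE, FIND };

    void make_ops(enum Op *ops, int *keys, int n, int key_range) {
        for (int i = 0; i < n; i++) {
            int r = rand() % 100;
            ops[i]  = (r < 20) ? ADD : (r < 40) ? DELETE : FIND;  // 20/20/60 split
            keys[i] = rand() % key_range;                          // key in [0, key_range)
        }
    }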

Make it solid – fine-tune execution
On GPU:
Tune the thread block size: 64 for the linked list, 512 for the rest
Tune the number of thread blocks: 1, 4, 8, or 16 operations per thread
On CPU:
Take the best result from each thread count (no more than 24 threads)
Memory management: all list nodes are pre-allocated, no re-use
In most cases the best configuration is the one in which a thread carries out just one operation in the kernel; this configuration essentially maximizes the number of CUDA threads.
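
A hedged sketch of the launch arithmetic implied by these settings; numOps, opsPerThread, and the kernel name are hypothetical placeholders.

    int blockSize    = 512;   // 64 for the linked list, per the slide
    int opsPerThread = 1;     // 1, 4, 8, or 16 were tried
    int numBlocks    = (numOps + blockSize * opsPerThread - 1)
                     / (blockSize * opsPerThread);   // round up to cover all ops
    // opsKernel<<<numBlocks, blockSize>>>(...);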

Results

Lock-free linked list
Best performance on small key ranges and larger op counts
No major difference between search-heavy and add/delete-heavy op strings
[Chart: speedup for different (Add %, Delete %, Search %) mixes]
Best speedup = 7.3
Source: Writers' slides

Lock-free hash table
Consistent speedup across all key ranges and op mixes
Best speedup = 11.3
Source: Writers' slides

Lock-free skip list
For an identical key range, speedup improves with the number of Add ops
Still good speedup at large key ranges for Add-heavy op strings
Speedup drops with increasing key range
Around 4x speedup
Best speedup = 30.7
Source: Writers' slides

Lock-free priority queue
Trends are similar to the skip list: speedup increases with Add %
[Chart: speedup for different (Add %, DeleteMin %) mixes]
Best speedup = 30.8
Source: Writers' slides

Hash table vs. linked list
The data are shown for the largest key range
On GPU, the hash table is 36x to 538x faster than the linear list
On CPU, the hash table is only 8x to 54x faster than the linear list
The GPU exposes more concurrency in the lock-free hash table
Source: Writers' slides

Skip list vs. linked list
On GPU, the skip list is 2x to 20x faster than the linear list
The GPU still exposes more concurrency than the CPU for the skip list
The hash table shows far better scalability than the skip list
Source: Writers' slides

Throughput of the hash table
The hash table is the best performing data structure among the four evaluated
For the largest key range, on a search-heavy op mix (20, 20, 60), throughput ranges from 28.6 MOPS to 98.9 MOPS on the GPU
For an add/delete-heavy op mix (40, 40, 20), the throughput range is 20.8 MOPS to 72.0 MOPS
Nearly 100 MOPS on a search-heavy op mix
Source: Writers' slides

Summary
First detailed evaluation of four lock-free data structures on a CUDA-enabled GPU
All four data structures offer moderate to high speedup on small to medium key ranges compared to CPU implementations
Benefits are low for large key ranges in linear lists, skip lists, and priority queues, primarily due to CAS overhead and complex control flow in skip lists and priority queues
Hash tables offer consistently good speedup on arbitrary key ranges and op mixes: nearly 100 MOPS throughput for search-heavy op mixes and more than 11x speedup over the CPU
Source: Writers' slides

Questions? Thank you!