
Performance Evaluation of Concurrent Lock-free Data Structures on GPUs


1 Performance Evaluation of Concurrent Lock-free Data Structures on GPUs
A presentation by Orr Goldman 11/1/18

2 Our plan for Today
Intro
GPU architecture 101 – so many cores
CUDA in a nutshell – GPU programming
Evaluating lock-free structures
Evaluation results
Conclusions

3 The following article is brought to you by:
Prabhakar Misra – Software Engineer, Google (the paper grew out of his M.Sc. thesis)
Mainak Chaudhuri – Associate Professor, IIT Kanpur
Indian Institute of Technology Kanpur, India, 2012

4 Why Evaluate on the GPU at all?

5 Intel SkyLake Core – Not our subject!
[Block diagram of an Intel Skylake core: branch prediction, decode, μop cache/queue, out-of-order scheduler, execution ports (ALU, FMA, load/store), load/store/reorder buffers, 32 KB L1 I$/D$, 256 KB L2$. State not relevant to this talk. Source: MAMAS course slides]

6 GPU general scheme
No OOO execution
Usually no more than 2 levels of cache
Each core has less functionality, e.g. no built-in AES unit
Only one ALU per core
More space for cores!
Source: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

7 Nvidia GPUs – Fermi Architecture
15 cores, or Streaming Multiprocessors (SMs)
Each SM features 32 CUDA processors
480 CUDA processors in total
Source: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

8 Fermi’s memory model
32K registers per SM
Shared memory for cores on the same SM
L1 cache – per SM
L2 cache (768 KB) – shared by all SMs
No L3 cache
Global memory
Shared/L1 sizes: either 48 KB shared / 16 KB L1, or 16 KB shared / 48 KB L1 – the user's choice (see the sketch below)
Source: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
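Not on the slide: a minimal sketch of how the shared-memory/L1 split is chosen per kernel through the standard CUDA runtime call cudaFuncSetCacheConfig. The kernel name mykernel is just a placeholder.

    __global__ void mykernel(void) { }

    int main(void) {
        // Prefer 48 KB shared memory / 16 KB L1 for this kernel...
        cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferShared);
        // ...or 16 KB shared / 48 KB L1 instead
        cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferL1);
        mykernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
    }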

9 CUDA
Or “How I learned to stop worrying and love the GPU”

10 What is CUDA?
CUDA Architecture
Exposes GPU parallelism for general-purpose computing
Retains performance
CUDA C/C++
Based on industry-standard C/C++
A small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory, etc.
Source: slides by Nvidia©

11 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
Source: slides by Nvidia©

12 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
Source: slides by Nvidia©

13 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory
Source: slides by Nvidia©

14 Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Source: slides and picture by Nvidia©
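Not shown on the slide: assuming the file is saved as hello.cu (a hypothetical name), it would typically be compiled with Nvidia's nvcc compiler and run like an ordinary program:

    nvcc hello.cu -o hello
    ./hello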

15 Addition on the Device: add()
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

__global__ kernels (functions) always return void
Allocate memory in advance and pass it to the kernel via pointers
Let’s take a look at main()…
Source: slides by Nvidia©

16 Addition on the Device: main()
int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

Source: slides by Nvidia©

17 Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Source: slides by Nvidia©
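Not on the slides: the slide code ignores return values. A minimal sketch of the usual error-checking pattern (assuming stdio.h is included) – every CUDA runtime call returns a cudaError_t that can be checked:

    cudaError_t err = cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }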

18 SIMD – Single Instruction Multiple Data
for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];
}

19 SIMT – Single Instruction Multiple Thread
for (int i = 0; i < 4; i++) {
    if (i % 2 == 0)
        c[i] = a[i] + b[i];
    else
        c[i] = a[i] * b[i];
}
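Not from the slides: a minimal sketch of the same loop written as a CUDA kernel. Each parallel invocation handles one element and may follow its own branch – this per-thread control flow is what SIMT allows beyond plain SIMD. The kernel name add_or_mul is a placeholder.

    __global__ void add_or_mul(int *a, int *b, int *c) {
        int i = blockIdx.x;          // one invocation per element, as in the later slides
        if (i % 2 == 0)
            c[i] = a[i] + b[i];      // even index: add
        else
            c[i] = a[i] * b[i];      // odd index: multiply
    }

    // Launched over 4 elements, mirroring the loop above:
    // add_or_mul<<<4,1>>>(d_a, d_b, d_c);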

20 Addition on the Device: main()
int main(void) {
    int a[N], b[N], c[N];     // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int) * N;

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = …; b = …;

(Same flow as the scalar version, extended to N elements – “Add SIMD”.)
Source: slides by Nvidia©

21 Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Source: slides by Nvidia©

22 Vector Addition on the Device
Terminology: each parallel invocation of add() is referred to as a block
The set of blocks is referred to as a grid
Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

By using blockIdx.x to index into the array, each block handles a different index:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Source: slides by Nvidia©

23 CUDA in a nutshell
Fast development path for running code on the GPU
SIMT (SIMD) support – big speedups
Provides atomicCAS and atomicInc (a usage sketch follows)
Both are pretty slow, as expected
For more info: course at the EE faculty
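To illustrate how atomicCAS is typically used in lock-free structures like the ones evaluated next, here is a hypothetical sketch (not the paper's code) of pushing a pre-allocated node onto a shared list head with a CAS retry loop. Node, head, and push are all assumed names; pointer values are CASed through 64-bit casts.

    struct Node { int key; Node *next; };

    __device__ Node *head;   // shared list head, assumed initialized to NULL beforehand

    __device__ void push(Node *n) {
        unsigned long long old = (unsigned long long)head;   // snapshot the current head
        unsigned long long assumed;
        do {
            assumed = old;
            n->next = (Node *)assumed;   // link the new node in front of the snapshot
            // Install n only if head is still 'assumed'; atomicCAS returns the value
            // it actually found, so a failed attempt hands us the new head to retry with.
            old = atomicCAS((unsigned long long *)&head, assumed, (unsigned long long)n);
        } while (old != assumed);
    }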

24 Evaluating lock-free structures – which?
Implementing sets with the following:
Lock-free linked list – Harris-Michael construction – presented by Erez
Lock-free chained hash table – a simplified variation on the implementation presented by Yoav – 10^5 buckets
Lock-free skip list – presented by Rana – p = 0.5 for the coin toss, max level = 32
Lock-free priority queue – skip-list-based implementation – presented by Michael T.

25 Evaluating lock-free structures – how?
Run on CPU: Intel Xeon server, 24 cores
Run on GPU: Tesla C2070 Fermi GPU – 14 SMs with 32 cores each, that’s 448 cores!
Generate a mix of adds, deletes and finds (for the priority queue, just add and remove-min)
Run on CPU, run on GPU, compare results

26 Make it solid – vary the data
Different key ranges: [0, 10^i] for 2 ≤ i ≤ 5
Different numbers of operations: from … to … in steps of 10,000
Different op mixes:
(add, delete, find) = (20, 20, 60) or (40, 40, 20)
(add, remove-min) = (80, 20) or (50, 50)

27 Make it solid – fine-tune the execution
On GPU:
Thread block size: 64 for the linked list, 512 for the rest
Number of thread blocks chosen so that each thread carries out 1, 4, 8 or 16 operations
On CPU:
Take the best result from each thread count (no more than 24 threads)
Memory management: all list nodes are pre-allocated, no re-use
In most cases the best configuration is the one in which a thread carries out just one operation in the kernel; this configuration essentially maximizes the number of CUDA threads (see the sketch below).
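A hypothetical sketch (not the paper's code) of such a benchmarking kernel in the one-operation-per-thread configuration: the host pre-generates an array of operations and each CUDA thread executes exactly one of them. OpType, Op, run_ops and the set_* functions are assumed names standing in for the real lock-free set implementation.

    enum OpType { OP_ADD, OP_DELETE, OP_SEARCH };
    struct Op { OpType type; int key; };

    // Hypothetical stand-ins for the lock-free set operations under test.
    __device__ bool set_add(int key);
    __device__ bool set_delete(int key);
    __device__ bool set_search(int key);

    __global__ void run_ops(Op *ops, int n_ops) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
        if (i >= n_ops) return;                          // one operation per thread
        switch (ops[i].type) {
            case OP_ADD:    set_add(ops[i].key);    break;
            case OP_DELETE: set_delete(ops[i].key); break;
            case OP_SEARCH: set_search(ops[i].key); break;
        }
    }

    // Host side, with 512 threads per block as the slide suggests for most structures:
    // run_ops<<<(n_ops + 511) / 512, 512>>>(d_ops, n_ops);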

28 Results

29 Lock-free linked list
Best performance on small key ranges and larger op counts
No major difference between search-heavy and add/delete-heavy op strings
Best speedup = 7.3
[Chart: speedup vs. op mix (Add % / Delete % / Search %)]
Source: authors’ slides

30 Lock-free hash table
Consistent speedup across all key ranges and op mixes
Best speedup = 11.3
Source: authors’ slides

31 Lock-free skip list
For an identical key range, speedup improves with the number of Add ops
Speedup drops with increasing key range
Still good speedup (around 4x) at large key ranges for Add-heavy op strings
Best speedup = 30.7
Source: authors’ slides

32 Lock-free priority queue
Trends are similar to the skip list: speedup increases with Add %
Best speedup = 30.8
[Chart: speedup vs. op mix (Add % / DeleteMin %)]
Source: authors’ slides

33 Hash table vs. linked list
The data are shown for the largest key range
On the GPU, the hash table is 36x to 538x faster than the linear list
On the CPU, the hash table is only 8x to 54x faster than the linear list
The GPU exposes more concurrency in the lock-free hash table
Source: authors’ slides

34 Skip list vs. linked list
On the GPU, the skip list is 2x to 20x faster than the linear list
The GPU still exposes more concurrency than the CPU for the skip list
The hash table shows far better scalability than the skip list
Source: authors’ slides

35 Throughput of hash table
The hash table is the best-performing data structure among the four evaluated
For the largest key range, on a search-heavy op mix [20, 20, 60], throughput ranges from 28.6 MOPS to 98.9 MOPS on the GPU
For an add/delete-heavy op mix [40, 40, 20], the throughput range is 20.8 MOPS to 72.0 MOPS
Nearly 100 MOPS on a search-heavy op mix
Source: authors’ slides
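As a sanity check on the units (not from the slides): MOPS means millions of operations per second, so for example 100,000 operations completing in about 1 ms corresponds to 100,000 / 0.001 s = 10^8 ops/s ≈ 100 MOPS.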

36 Summary
First detailed evaluation of four lock-free data structures on a CUDA-enabled GPU
All four data structures offer moderate to high speedup on small to medium key ranges compared to CPU implementations
Benefits are low for large key ranges in linear lists, skip lists, and priority queues – primarily due to CAS overhead and complex control flow in the skip lists and priority queues
Hash tables offer consistently good speedup on arbitrary key ranges and op mixes: nearly 100 MOPS throughput for search-heavy op mixes and more than 11x speedup over the CPU
Source: authors’ slides

37 Questions? Thank you!

