
Performance Evaluation of Concurrent Lock-free Data Structures on GPUs


1 Performance Evaluation of Concurrent Lock-free Data Structures on GPUs
A presentation by Orr Goldman 11/1/18

2 Our plan for Today
Intro
GPU architecture 101 – so many cores
CUDA in a nutshell – GPU programming
Evaluating lock-free structures
Evaluation results
Conclusions

3 The following article is brought to you by:
Prabhakar Misra – Software Engineer, Google (the paper grew out of his M.Sc. thesis)
Mainak Chaudhuri – Associate Professor, IIT Kanpur
Indian Institute of Technology Kanpur, India, 2012

4 Why Evaluate on the GPU at all?

5 Intel SkyLake Core – Not our subject!
[Block diagram of an Intel Skylake core: branch prediction, decode, μop cache/queue, out-of-order scheduler, execution ports (ALU, FMA, load/store), load/store/reorder buffers, 32 KB L1 I$/D$, 256 KB L2$. State not relevant to this talk. Source: MAMAS course slides]

6 GPU general scheme
No OOO execution
Usually no more than 2 levels of cache
Each core has less functionality, e.g. no built-in AES unit
Only one ALU per core
More space for cores!
Source: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

7 Nvidia GPUs – Fermi Architecture
15 cores, or Streaming Multiprocessors (SMs)
Each SM features 32 CUDA processors
480 CUDA processors in total
Source: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

8 Fermi’s memory model
32K registers per SM
Shared memory for cores on the same SM
L1 cache – per SM
L2 cache (768 KB) – shared by all SMs
No L3 cache
Global memory
Shared/L1 sizes: either 48 KB shared / 16 KB L1, or 16 KB shared / 48 KB L1 – the user's choice (see the sketch below)
Source: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
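Not on the slide: a minimal sketch of how the shared-memory/L1 split is chosen per kernel through the standard CUDA runtime call cudaFuncSetCacheConfig. The kernel name mykernel is just a placeholder.

    __global__ void mykernel(void) { }

    int main(void) {
        // Prefer 48 KB shared memory / 16 KB L1 for this kernel...
        cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferShared);
        // ...or 16 KB shared / 48 KB L1 instead
        cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferL1);
        mykernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
    }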

9 CUDA
Or “How I learned to stop worrying and love the GPU”

10 What is CUDA?
CUDA Architecture
Exposes GPU parallelism for general-purpose computing
Retains performance
CUDA C/C++
Based on industry-standard C/C++
A small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory, etc.
Source: slides by Nvidia©

11 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
Source: slides by Nvidia©

12 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
Source: slides by Nvidia©

13 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory
Source: slides by Nvidia©

14 Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Source: slides and picture by Nvidia©
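Not shown on the slide: assuming the file is saved as hello.cu (a hypothetical name), it would typically be compiled with Nvidia's nvcc compiler and run like an ordinary program:

    nvcc hello.cu -o hello
    ./hello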

15 Addition on the Device: add()
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

__global__ kernels (functions) always return void
Allocate memory in advance and pass it to the kernel via pointers
Let’s take a look at main()…
Source: slides by Nvidia©

16 Addition on the Device: main()
int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

Source: slides by Nvidia©

17 Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Source: slides by Nvidia©
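Not on the slides: the slide code ignores return values. A minimal sketch of the usual error-checking pattern (assuming stdio.h is included) – every CUDA runtime call returns a cudaError_t that can be checked:

    cudaError_t err = cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }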

18 SIMD – Single Instruction Multiple Data
for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];
}

19 SIMT – Single Instruction Multiple Thread
for (int i = 0; i < 4; i++) {
    if (i % 2 == 0)
        c[i] = a[i] + b[i];
    else
        c[i] = a[i] * b[i];
}
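Not from the slides: a minimal sketch of the same loop written as a CUDA kernel. Each parallel invocation handles one element and may follow its own branch – this per-thread control flow is what SIMT allows beyond plain SIMD. The kernel name add_or_mul is a placeholder.

    __global__ void add_or_mul(int *a, int *b, int *c) {
        int i = blockIdx.x;          // one invocation per element, as in the later slides
        if (i % 2 == 0)
            c[i] = a[i] + b[i];      // even index: add
        else
            c[i] = a[i] * b[i];      // odd index: multiply
    }

    // Launched over 4 elements, mirroring the loop above:
    // add_or_mul<<<4,1>>>(d_a, d_b, d_c);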

20 Addition on the Device: main()
int main(void) {
    int a[N], b[N], c[N];     // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int) * N;

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = …; b = …;

(Same flow as the scalar version, extended to N elements – “Add SIMD”.)
Source: slides by Nvidia©

21 Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Source: slides by Nvidia©

22 Vector Addition on the Device
Terminology: each parallel invocation of add() is referred to as a block
The set of blocks is referred to as a grid
Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

By using blockIdx.x to index into the array, each block handles a different index:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Source: slides by Nvidia©

23 CUDA in a nutshell
Fast development path for running code on the GPU
SIMT (SIMD) support – big speedups
Provides atomicCAS and atomicInc (a usage sketch follows)
Both are pretty slow, as expected
For more info: course at the EE faculty
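To illustrate how atomicCAS is typically used in lock-free structures like the ones evaluated next, here is a hypothetical sketch (not the paper's code) of pushing a pre-allocated node onto a shared list head with a CAS retry loop. Node, head, and push are all assumed names; pointer values are CASed through 64-bit casts.

    struct Node { int key; Node *next; };

    __device__ Node *head;   // shared list head, assumed initialized to NULL beforehand

    __device__ void push(Node *n) {
        unsigned long long old = (unsigned long long)head;   // snapshot the current head
        unsigned long long assumed;
        do {
            assumed = old;
            n->next = (Node *)assumed;   // link the new node in front of the snapshot
            // Install n only if head is still 'assumed'; atomicCAS returns the value
            // it actually found, so a failed attempt hands us the new head to retry with.
            old = atomicCAS((unsigned long long *)&head, assumed, (unsigned long long)n);
        } while (old != assumed);
    }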

24 Evaluating lock-free structures – which?
Implementing sets with the following:
Lock-free linked list – Harris-Michael construction – presented by Erez
Lock-free chained hash table – a simplified variation on the implementation presented by Yoav – 10^5 buckets
Lock-free skip list – presented by Rana – p = 0.5 for the coin toss, max level = 32
Lock-free priority queue – skip-list-based implementation – presented by Michael T.

25 Evaluating lock-free structures – how?
Run on CPU: Intel Xeon server, 24 cores
Run on GPU: Tesla C2070 Fermi GPU – 14 SMs with 32 cores each, that’s 448 cores!
Generate a mix of adds, deletes and finds (for the priority queue, just add and remove-min)
Run on CPU, run on GPU, compare results

26 Make it solid – vary the data
Different key ranges: [0, 10^i] for 2 ≤ i ≤ 5
Different numbers of operations: from … to … in steps of 10,000
Different op mixes:
(add, delete, find) = (20, 20, 60) or (40, 40, 20)
(add, remove-min) = (80, 20) or (50, 50)

27 Make it solid – fine-tune the execution
On GPU:
Thread block size: 64 for the linked list, 512 for the rest
Number of thread blocks chosen so that each thread carries out 1, 4, 8 or 16 operations
On CPU:
Take the best result from each thread count (no more than 24 threads)
Memory management: all list nodes are pre-allocated, no re-use
In most cases the best configuration is the one in which a thread carries out just one operation in the kernel; this configuration essentially maximizes the number of CUDA threads (see the sketch below).
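A hypothetical sketch (not the paper's code) of such a benchmarking kernel in the one-operation-per-thread configuration: the host pre-generates an array of operations and each CUDA thread executes exactly one of them. OpType, Op, run_ops and the set_* functions are assumed names standing in for the real lock-free set implementation.

    enum OpType { OP_ADD, OP_DELETE, OP_SEARCH };
    struct Op { OpType type; int key; };

    // Hypothetical stand-ins for the lock-free set operations under test.
    __device__ bool set_add(int key);
    __device__ bool set_delete(int key);
    __device__ bool set_search(int key);

    __global__ void run_ops(Op *ops, int n_ops) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
        if (i >= n_ops) return;                          // one operation per thread
        switch (ops[i].type) {
            case OP_ADD:    set_add(ops[i].key);    break;
            case OP_DELETE: set_delete(ops[i].key); break;
            case OP_SEARCH: set_search(ops[i].key); break;
        }
    }

    // Host side, with 512 threads per block as the slide suggests for most structures:
    // run_ops<<<(n_ops + 511) / 512, 512>>>(d_ops, n_ops);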

28 Results

29 Lock-free linked list
Best performance on small key ranges and larger op counts
No major difference between search-heavy and add/delete-heavy op strings
Best speedup = 7.3
[Chart: speedup vs. op mix (Add % / Delete % / Search %)]
Source: authors’ slides

30 Lock-free hash table
Consistent speedup across all key ranges and op mixes
Best speedup = 11.3
Source: authors’ slides

31 Lock-free skip list
For an identical key range, speedup improves with the number of Add ops
Speedup drops with increasing key range
Still good speedup (around 4x) at large key ranges for Add-heavy op strings
Best speedup = 30.7
Source: authors’ slides

32 Lock-free priority queue
Trends are similar to the skip list: speedup increases with Add %
Best speedup = 30.8
[Chart: speedup vs. op mix (Add % / DeleteMin %)]
Source: authors’ slides

33 Hash table vs. linked list
The data are shown for the largest key range
On the GPU, the hash table is 36x to 538x faster than the linear list
On the CPU, the hash table is only 8x to 54x faster than the linear list
The GPU exposes more concurrency in the lock-free hash table
Source: authors’ slides

34 Skip list vs. linked list
On the GPU, the skip list is 2x to 20x faster than the linear list
The GPU still exposes more concurrency than the CPU for the skip list
The hash table shows far better scalability than the skip list
Source: authors’ slides

35 Throughput of hash table
The hash table is the best-performing data structure among the four evaluated
For the largest key range, on a search-heavy op mix [20, 20, 60], throughput ranges from 28.6 MOPS to 98.9 MOPS on the GPU
For an add/delete-heavy op mix [40, 40, 20], the throughput range is 20.8 MOPS to 72.0 MOPS
Nearly 100 MOPS on a search-heavy op mix
Source: authors’ slides
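As a sanity check on the units (not from the slides): MOPS means millions of operations per second, so for example 100,000 operations completing in about 1 ms corresponds to 100,000 / 0.001 s = 10^8 ops/s ≈ 100 MOPS.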

36 Summary
First detailed evaluation of four lock-free data structures on a CUDA-enabled GPU
All four data structures offer moderate to high speedup on small to medium key ranges compared to CPU implementations
Benefits are low for large key ranges in linear lists, skip lists, and priority queues – primarily due to CAS overhead and complex control flow in the skip lists and priority queues
Hash tables offer consistently good speedup on arbitrary key ranges and op mixes: nearly 100 MOPS throughput for search-heavy op mixes and more than 11x speedup over the CPU
Source: authors’ slides

37 Questions? Thank you!

