
1  Harnessing the Power of GPUs for Non-Graphics Applications
Michael Boyer, Department of Computer Science, University of Virginia
Advisor: Kevin Skadron
© 2010 Michael Boyer

2  Outline
GPU architecture
Programming GPUs using CUDA
Case study: Leukocyte Tracking
Current work: CPU-GPU Task Sharing

3  Graphics Processors
Graphics Processing Units (GPUs) are designed specifically for graphics rendering applications. [Image courtesy of GameSpot]

4  Graphics Applications
Graphics applications involve applying the same operation to many pieces of data.
Application characteristics:
– Massively parallel
– Only aggregate performance matters

5  CPU vs. GPU: Architectural Difference 1
[Diagram: a CPU core with fetch/decode, execute units, a register file, out-of-order logic, a branch predictor, a data cache, and a memory prefetcher, versus a GPU core with only fetch/decode, execute units, and a register file]
Avoid structures that only improve single-thread performance.

6  CPU vs. GPU: Architectural Difference 2
[Diagram: the GPU core amortizes a single fetch/decode unit across multiple execution-unit/register-file pairs, which execute a thread group in lockstep]
Amortize the overhead of control logic across multiple execution units (SIMD processing).

7  CPU vs. GPU: Architectural Difference 3
[Diagram: the GPU core multiplexes several thread groups (1 through 4) onto the same execution units]
Use multiple groups of threads to keep execution units busy and hide memory latency.

8  CPU vs. GPU: Architectural Difference 4
[Diagram: a CPU with four large cores versus a GPU with 30 small cores]
Replicate cores to leverage more parallelism.

9  CPU vs. GPU: Architectural Differences
Summary: take advantage of abundant parallelism
– Lots of threads, so focus on aggregate performance
– Parallelism in space: SIMD processing in each core; many independent SIMD cores across the chip
– Parallelism in time: multiple SIMD groups in each core

10  CPU vs. GPU: Peak Performance

Processor type:            CPU                          GPU
Product:                   Intel Xeon W5590 (Nehalem)   AMD Radeon HD 5870
Throughput (GFLOPs):       107                          2,720
Memory bandwidth (GB/s):   32                           154
Cost:                      $1,700                       $450

Note that these are peak numbers. What we really care about is performance on real-world applications.

11  General-Purpose Computing on GPUs
Lots of recent interest in using GPUs to run non-graphics applications (GPGPU).
Why GPUs? Why now?
– Recent increases in performance via parallelism
– Recent increases in programmability
– Ubiquity in multiple market segments
Old approach: graphics languages
New approach: GPGPU languages (OpenCL, CUDA)

12  CUDA
Programming model for running general-purpose applications on NVIDIA GPUs; an extension to the C programming language.
The GPU is a co-processor:
– The main program runs on the CPU
– Large computations (kernels) are offloaded to the GPU
– The CPU and GPU have separate memory, so data must be transferred back and forth

13  CUDA: Typical Program Structure

void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to the CPU
}

__global__ void kernel(…) {
    Code executed on the GPU goes here…
}

[Diagram: CPU with CPU memory connected to GPU with GPU memory]

14  CUDA: Typical Program Transformation

for (i = 0; i < N; i++) {
    Process array element i
}

The body of the loop becomes the body of the kernel:

__global__ void kernel(…) {
    Determine this thread's value of i
    Process array element i
}

15  CUDA Kernel
A scalar program invoked across many threads, typically one thread per data element.
The overall computation is decomposed into a grid of thread blocks:
– Thread blocks are independent and cannot communicate (with some exceptions)
– Threads within the same block can communicate
[Diagram: a grid of thread blocks 1 through 5]

16  Simple Example: Vector Addition
C = A + B
A = [ 1  2  3  4  5  6  7  8 ]
B = [ 9 10 11 12 13 14 15 16 ]
C = [10 12 14 16 18 20 22 24 ]

17  C Code

float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Return the result
    return C;
}

18  CUDA Kernel

// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

19  CUDA Host Code

float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel<<<num_blocks, THREADS_PER_BLOCK>>>(A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
    return C_CPU;
}

20  Example Program Output

./vector_add 50,000,000
GPU:
  Transfer to GPU:   0.236 sec
  Kernel execution:  0.005 sec
  Transfer from GPU: 0.152 sec
  Total:             0.404 sec
CPU:                 0.136 sec

Execution: GPU outperformed CPU by 27.2x
Overall:   CPU outperformed GPU by 2.97x

Vector addition does not do enough work per memory operation to justify offload!

21  Case Study: Leukocyte Tracking

22  Leukocyte Tracking
Important for evaluating inflammatory drugs.
Velocity is measured by tracking leukocytes through multiple frames.
Current approaches:
– Manual analysis: one minute of video in tens of hours
– MATLAB: one minute of video in 5 hours

23  Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds.

24  Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
– OpenMP for multi-core CPU
– CUDA for GPU
Experimental setup:
– CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
– GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)

25  Tracking Algorithm
Inputs: video frame, location of cells in the previous frame
Output: location of cells in the current frame
For each cell:
– Extract a sub-image near the cell's old location
– Compute the MGVF matrix over the sub-image (→ 99.8% of runtime)
– Evolve the active contour using the MGVF matrix

26  Computing the MGVF Matrix
MGVF = Motion Gradient Vector Flow.
The MGVF matrix is approximated via an iterative solution procedure.
[Figure: sub-image near a cell and the corresponding MGVF matrix]

27  MGVF Pseudo-code

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)

28  Naïve CUDA Implementation
The kernel is called ~50,000 times per frame, and the amount of work per call is small.
Runtime is dominated by CUDA overheads:
– Memory allocation, memory copying, and kernel call overhead

29  Kernel Overhead
Kernel calls are not cheap!
– Overhead of one kernel call: 9 microseconds
– Overhead of one CPU function call: 3 nanoseconds
– A kernel call is 3,000 times more expensive
Heaviside kernel:
– 27% of kernel runtime due to computation
– 73% of kernel runtime due to kernel overhead
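
Numbers like these can be measured directly by timing an empty kernel. The sketch below is not from the slides; it is a minimal, hypothetical measurement loop (empty_kernel and measure_launch_overhead are made-up names) using the standard CUDA event API, and the exact figures it reports will vary with hardware and driver.

#include <stdio.h>

__global__ void empty_kernel() { }

void measure_launch_overhead() {
    const int N = 10000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time N synchronous kernel launches (wait after each one)
    cudaEventRecord(start, 0);
    for (int i = 0; i < N; i++) {
        empty_kernel<<<1, 1>>>();
        cudaThreadSynchronize();   // CUDA 2.x-era synchronization call
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average launch overhead: %.1f microseconds\n", (ms * 1000.0f) / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}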

30  Lesson 1: Reduce Kernel Overhead
Increase the amount of work per kernel call:
– Decrease the total number of kernel calls
– Amortize the overhead of each kernel call across more computation
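
As a hedged illustration of the lesson (hypothetical code, not the presentation's kernel): instead of launching a small Heaviside kernel once per matrix, a single launch can cover all eight matrices stored contiguously, amortizing one launch overhead over eight times the work. The Heaviside approximation shown is just one common smooth form, not necessarily the one used in the tracker; heaviside_one, heaviside_all, d_matrix, blocks, and threads are assumed names.

// One small kernel applied to a single matrix of n elements
__global__ void heaviside_one(float *m, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        // A common smooth Heaviside approximation (illustrative only)
        m[i] = 0.5f + (1.0f / 3.14159265f) * atanf(3.14159265f * m[i]);
    }
}

// Before: eight launches per iteration, each paying the full call overhead
//   for (int k = 0; k < 8; k++)
//       heaviside_one<<<blocks, threads>>>(d_matrix[k], n);

// After: one launch covers all eight matrices stored back-to-back
__global__ void heaviside_all(float *matrices, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < 8 * n) {
        matrices[i] = 0.5f + (1.0f / 3.14159265f) * atanf(3.14159265f * matrices[i]);
    }
}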

31  Larger Kernel Implementation

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)

32  Larger Kernel Implementation

33  Memory Allocation Overhead

34  Lesson 2: Reduce Memory Management Overhead
Reduce the number of memory allocations:
– Allocate memory once and reuse it throughout the application
– If the memory size is not known a priori, estimate it and only re-allocate if the estimate is too small
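
A minimal sketch of this idea (hypothetical helper, not the presentation's code): keep a device buffer between calls and call cudaMalloc only when a request exceeds the current capacity.

#include <cuda_runtime.h>

// Hypothetical reusable device buffer: cudaMalloc is only called when the
// requested size exceeds what has already been allocated.
static void *d_buffer = NULL;
static size_t d_capacity = 0;

void *get_device_buffer(size_t bytes) {
    if (bytes > d_capacity) {
        if (d_buffer != NULL) {
            cudaFree(d_buffer);
        }
        // Over-allocate a little so small growth does not trigger reallocation
        d_capacity = bytes + bytes / 4;
        cudaMalloc(&d_buffer, d_capacity);
    }
    return d_buffer;
}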

35  Reduced Allocation Implementation

36  Memory Transfer Overhead

37  Lesson 3: Reduce Memory Transfer Overhead
If the CPU operates on values produced by the GPU:
– Move the operation to the GPU
– This may improve performance even if the operation itself is slower on the GPU
[Timeline diagram: values produced by the GPU → memory transfer → operation on the CPU → memory transfer → values consumed by the GPU, versus performing the operation on the GPU and avoiding both transfers]
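
A rough illustration of moving such an operation to the GPU (a sketch, not the presentation's implementation; sum_reduce and REDUCE_THREADS are assumed names): a block-level sum reduction lets a convergence value be computed on the device, so only a single float, rather than a whole matrix, crosses the PCIe bus per check.

// Hypothetical single-block sum reduction used to compute a convergence
// value on the GPU; launch as sum_reduce<<<1, REDUCE_THREADS>>>(data, n, result).
#define REDUCE_THREADS 256

__global__ void sum_reduce(const float *data, int n, float *result) {
    __shared__ float partial[REDUCE_THREADS];
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the input
    float sum = 0.0f;
    for (int i = tid; i < n; i += REDUCE_THREADS) {
        sum += data[i];
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory
    for (int stride = REDUCE_THREADS / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes the final sum; only this one float is copied back
    if (tid == 0) {
        *result = partial[0];
    }
}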

38  GPU Reduction Implementation
(The convergence criterion is now computed on the GPU.)

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)

39  GPU Reduction Implementation

40  Persistent Thread Block

MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (!converged)

41  Persistent Thread Block
Problem: we need a global memory fence
– Multiple thread blocks compute the MGVF matrix
– Thread blocks cannot communicate with each other
– So each iteration requires a separate kernel call
Solution: compute the entire matrix in one thread block
– An arbitrary number of iterations can be computed in a single kernel call
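
A hedged sketch of the pattern (hypothetical code, not the actual leukocyte kernel; solve_persistent and update_element are made-up names, and the update rule is only a stand-in): with one thread block per matrix, __syncthreads() is the only fence required, so the convergence loop can live entirely inside the kernel.

// Hypothetical per-element update rule (stands in for the MGVF update)
__device__ float update_element(const float *m, int i, int n) {
    float left  = (i > 0)     ? m[i - 1] : m[i];
    float right = (i < n - 1) ? m[i + 1] : m[i];
    return 0.5f * m[i] + 0.25f * (left + right);
}

// Persistent-thread-block kernel: one block iterates a small matrix to
// convergence, using __syncthreads() in place of separate kernel launches.
__global__ void solve_persistent(float *a, float *b, int n, int max_iters) {
    int tid = threadIdx.x;
    __shared__ int converged;
    float *cur = a, *next = b;

    for (int iter = 0; iter < max_iters; iter++) {
        if (tid == 0) converged = 1;
        __syncthreads();

        // Jacobi-style update: read from cur, write to next
        for (int i = tid; i < n; i += blockDim.x) {
            float new_val = update_element(cur, i, n);
            next[i] = new_val;
            if (fabsf(new_val - cur[i]) > 1e-4f) converged = 0;  // benign race: only 0 is written
        }
        __syncthreads();        // block-wide fence replaces a kernel launch

        int done = converged;   // read the flag before it can be reset
        __syncthreads();        // make sure every thread has read it

        if (done) break;        // all threads see the same value and exit together

        // Swap buffers for the next iteration
        // (a real kernel would also record which buffer holds the final result)
        float *tmp = cur; cur = next; next = tmp;
    }
}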

42  Persistent Thread Block: Example
[Figure: a 3×3 MGVF matrix. Canonical CUDA approach (1-to-1 mapping between threads and data elements): the nine elements are assigned to nine different threads spread across thread blocks. Persistent thread block: all nine elements are handled by a single thread block.]

43  Persistent Thread Block: Example
[Figure: in the canonical CUDA approach, the computation for Cell 1 is spread across every SM on the GPU; with persistent thread blocks, each cell (Cell 1, Cell 2, …, Cell 9) is assigned to its own SM, so multiple cells are processed concurrently.]
SM = Streaming Multiprocessor (GPU core)

44  Lesson 4: Avoid Global Memory Fences
Confine dependent computations to a single thread block:
– Execute an iterative algorithm until convergence in a single kernel call
– Only efficient if there are multiple independent computations

45  Persistent Thread Block Implementation (27x)

46  Absolute Performance

47  Video Example

48  Conclusions
CUDA overheads can be significant bottlenecks.
CUDA provides enormous performance improvements for leukocyte tracking:
– 200x over MATLAB
– 27x over OpenMP
Processing time reduced from > 4.5 hours to < 1.5 minutes.
Real-time analysis feasible in the near future.

M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.

49  Current Work: CPU-GPU Task Sharing

50  CPU-GPU Task Sharing
The offloading decision is generally considered to be binary: CPU or GPU?

51  CPU-GPU Task Sharing
The offload decision does not need to be binary!
– Dividing a task between the CPU and GPU can provide improved performance over either device alone

52  Theoretical Performance

53  Research Goal
1. Given an input program written in CUDA or OpenCL, automatically generate a program that can execute on the CPU and GPU concurrently.
2. Automatically determine the best division of work:
– When beneficial, share work between the CPU and GPU
– Otherwise, execute on the CPU or GPU exclusively
– The optimal decision can change at runtime: with different inputs, or with contention
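
A minimal sketch of what such a division might look like for a data-parallel task (hypothetical code, not the proposed framework; shared_add and alpha are made-up names, and kernel refers to the vector-add kernel from slide 18): the index range is split by a fraction alpha, the GPU portion is launched asynchronously, and the CPU processes the remainder while the GPU works.

// Hypothetical static split of a vector-add style task between CPU and GPU.
// 'alpha' is the fraction of elements given to the GPU; choosing it at
// runtime is exactly the kind of decision the proposed system automates.
void shared_add(float *A_CPU, float *B_CPU, float *C_CPU,
                float *A_GPU, float *B_GPU, float *C_GPU,
                int N, float alpha) {
    int n_gpu = (int)(alpha * N);

    // Launch the GPU portion asynchronously (elements [0, n_gpu))
    if (n_gpu > 0) {
        int num_blocks = (n_gpu + 255) / 256;
        kernel<<<num_blocks, 256>>>(A_GPU, B_GPU, C_GPU, n_gpu);
    }

    // Process the CPU portion (elements [n_gpu, N)) while the GPU works
    for (int i = n_gpu; i < N; i++) {
        C_CPU[i] = A_CPU[i] + B_CPU[i];
    }

    // This copy waits for the kernel to finish, then merges the GPU's part
    cudaMemcpy(C_CPU, C_GPU, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
}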

54  Proposed System
OpenCL code → source-to-source translation framework → modified OpenCL code → OpenCL compiler → CPU/GPU binary
Transform all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution.

55  Potential Problems
One version of the kernel for multiple devices:
– Optimizations for the GPU may hurt performance on the CPU and vice versa
It is possible (but rare) for thread blocks to communicate with each other:
– Do we try to support this?
Statically predicting data access patterns can be hard (or even impossible for some applications).

56  Data Sharing
If we cannot predict data access patterns statically, then the CPU and the GPU must have a consistent view of memory.
[Diagram: CPU and GPU alternate between 1) computation and 2) data transfer]

57  Data Sharing (2)
If we can predict data access patterns statically, then we can minimize the data transfer overhead.
[Diagram: CPU and GPU alternate between 1) computation and 2) transfer of only the data that is actually shared]

58  Preliminary Results (HotSpot)

59  Conclusions
GPUs are designed to provide good performance on graphics workloads:
– But they have evolved to support any workload with abundant parallelism
GPUs can provide large performance improvements:
– But we need to take the overheads involved into account to take full advantage
Allowing the CPU and GPU to work together can provide an even larger performance improvement.

60  Acknowledgements
Funding provided by:
– NSF grant IIS-0612049
– SRC grant 1607.001
– NVIDIA research grant
– GRC AMD/Mahboob Kahn Ph.D. fellowship
Equipment donated by NVIDIA

61  BACKUP

62  3D Rendering APIs
High-level abstractions for rendering geometry.
Pipeline: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display
[Courtesy of D. Luebke, NVIDIA]

63  CUDA: Abstractions
1. Kernel function
– Mapped onto a grid of thread blocks
2. Scratchpad memory
– For sharing data within a thread block
3. Barrier synchronization
– For synchronizing within a thread block

64  Kernel Function

__global__ void kernel(int *in, int *out) {
    // Determine this thread's index
    int i = threadIdx.x;

    // Add one to the input value
    out[i] = in[i] + 1;
}

65  Grid of Thread Blocks
Grid: 2-dimensional, ≤ 4.3 billion blocks
Thread block: 3-dimensional, ≤ 512 threads

66  Launching a Kernel

int num_threads = ...;
int threads_per_block = 256;

// Determine how many thread blocks are needed
// (using either of the two methods shown below)
int num_blocks = ceil((float) num_threads / threads_per_block);
int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;

// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);

// Launch the kernel
kernel<<<grid, thread_block>>>(in, out);

67  Scratchpad Memory
Each multiprocessor has 16 KB of software-controlled shared memory.
Variables declared "__shared__" get mapped into this memory.
Values can only be shared among threads within the same thread block.

68  Scratchpad Memory: Example

__global__ void kernel() {
    int i = threadIdx.x;

    // Compute some function
    int v = foo(i);

    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;

    // Use the shared values...
}

69  Barrier Synchronization
The __syncthreads() function:
– Each thread waits for all other threads in the thread block
– All values written by every thread are now visible to all other threads

70  Barrier Synchronization: Example

__global__ void kernel(float *out, int N) {
    int i = threadIdx.x;

    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = foo(i);

    // Wait to ensure all values have been written
    __syncthreads();

    // Compute the average of two values
    out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;
}

71  CUDA Overheads
Driver initialization: 0.14 seconds
Kernel launch: 13 μs
GPU memory allocation and deallocation: orders of magnitude slower than on the CPU
Memory transfer: 15 μs + 1 ms/MB
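
These figures suggest a rough back-of-the-envelope cost model for deciding whether an offload is worthwhile. The helper below is a sketch built only from the numbers quoted above (estimate_overhead_ms is a made-up name); real overheads depend on the hardware and driver. For example, the naïve leukocyte implementation's ~50,000 kernel calls per frame would contribute on the order of half a second of launch overhead per frame by this estimate alone.

// Rough offload-overhead estimate in milliseconds, using the figures above:
// 13 us per kernel launch, and 15 us + ~1 ms/MB per memory transfer.
// Illustrative only.
double estimate_overhead_ms(int num_kernel_calls, int num_transfers,
                            double total_mb_transferred) {
    double launch_ms   = num_kernel_calls * 0.013;     // 13 us per launch
    double transfer_ms = num_transfers * 0.015         // 15 us per transfer
                       + 1.0 * total_mb_transferred;   // ~1 ms per MB
    return launch_ms + transfer_ms;
}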

72  Acceleration using CUDA
Step 1: Determine which code to offload to the GPU as a CUDA kernel
Step 2: Write the CPU-side CUDA code (the program must allocate GPU memory, transfer input data, launch the kernel, transfer results, and free GPU memory)
Step 3: Write and optimize the GPU kernel

73  Performance Issues
Branch divergence
Memory coalescing
Key concept: warp
– A group of threads that execute concurrently
– In current hardware, the warp size is 32 threads

74  Branch Divergence
Remember: the hardware is SIMD.
What if threads in the same warp follow two different paths?
Solution: the entire warp executes both paths
– Unneeded values are simply ignored
– Performance can suffer with many divergent branches
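
A hedged illustration (not from the slides; divergent and uniform are made-up kernel names): the first kernel makes neighboring threads in a warp take different paths, so the warp executes both branches; the second keeps every thread of a given warp on the same path.

// Divergent: even and odd threads within the same warp take different
// branches, so each warp serially executes both paths.
__global__ void divergent(float *data) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}

// Non-divergent (assuming a 32-thread warp and blockDim.x a multiple of 64):
// all threads in a given warp take the same branch.
__global__ void uniform(float *data) {
    int i = threadIdx.x;
    if ((i / 32) % 2 == 0) {
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}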

75  Memory Coalescing
Threads in the same half-warp access memory together.
If all threads access successive memory locations:
– All of the accesses are combined (coalesced)
– Result: significantly improved memory performance
Otherwise:
– Each thread accesses memory separately
– Result: significantly reduced memory performance
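
Another hedged sketch (hypothetical kernels, not the presentation's code): the computation is identical in both kernels below; only the access pattern differs.

// Coalesced: thread k of a half-warp reads element i = base + k, so the
// accesses fall in successive memory locations and are combined.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Non-coalesced: each thread reads a location 'stride' elements away from
// its neighbor's, so the accesses are issued separately.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i * stride < n) {
        out[i] = in[i * stride];
    }
}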

76  Memory Coalescing: Examples
[Figure: coalesced accesses (threads reading consecutive addresses) versus a non-coalesced access pattern]

77  Parallelization Granularity
[Diagram: CPU with CPU memory and GPU with GPU memory]

78  Kernel Overhead Revisited
Overhead depends on the calling pattern:
– One at a time (synchronous): 9 microseconds
– Back-to-back (asynchronous): 3 microseconds
[Diagram: in the synchronous case, each kernel call is followed by a memory transfer, which forces an implicit synchronization; in the asynchronous case, kernel calls are issued back-to-back]

79  Lesson 1 Revisited: Reduce Kernel Overhead
Increase the amount of work per kernel call:
– Decrease the total number of kernel calls
– Amortize the overhead of each kernel call across more computation
Launch kernels back-to-back:
– Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
– Overlap kernel execution on the GPU with driver access on the CPU
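
A hedged sketch of the difference (hypothetical code: step_kernel, d_data, n, num_blocks, threads_per_block, and num_iterations are assumed names): the first loop forces the CPU to wait after every launch, while the second queues all launches back-to-back and synchronizes only once.

// Synchronous pattern: an explicit wait after every launch serializes the
// CPU and GPU, paying the larger per-call overhead each time.
for (int i = 0; i < num_iterations; i++) {
    step_kernel<<<num_blocks, threads_per_block>>>(d_data, n);
    cudaThreadSynchronize();   // forces the CPU to wait (CUDA 2.x-era call)
}

// Asynchronous pattern: launches are queued back-to-back; the GPU drains
// the queue while the CPU keeps issuing work, and we wait only once.
for (int i = 0; i < num_iterations; i++) {
    step_kernel<<<num_blocks, threads_per_block>>>(d_data, n);
}
cudaThreadSynchronize();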

