University of Michigan Electrical Engineering and Computer Science
Paragon: Collaborative Speculative Loop Execution on GPU and CPU
Mehrzad Samadi (1), Amir Hormati (2), Janghaeng Lee (1), and Scott Mahlke (1)
(1) University of Michigan - Ann Arbor
(2) Microsoft Research, Microsoft

Amdahl’s Law
GPGPU may have <100x speedup but...
[Figure: execution timeline split into "NO GPU utilization", "GPU Executable" (50%), and "NO GPU utilization" segments]
Even a 1000x speedup of the GPU-executable 50% does NOT improve overall execution time by more than 2x

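The 2x bound on this slide follows from Amdahl's law. The sketch below is a minimal illustration, not part of the original slides; the function name `amdahl_speedup` and its parameters `p` (accelerated fraction of run time) and `s` (speedup of that fraction) are assumptions introduced here.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the run time
 * is accelerated by a factor s and the rest runs at the old speed.
 * Name and signature are illustrative, not taken from the slides. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With `p = 0.5` and `s = 1000.0` the result is about 1.998, so the overall speedup stays just under 2x however fast the GPU part becomes.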
General Purpose Computing on GPU
Limited to
–Massive data-parallelism
–Linear array access
–NO indirect array access
–NO pointers
Leaves GPUs underutilized
–GPUs are not very general-purpose
[Figure label: GPU Executable]
How can GPUs be more GENERAL?

Motivation – More Generalization
Reduce sections with NO GPU utilization
–Non-linear array access
–Indirect array access
–Array access through pointers
Difficult for programmers to verify
–Loop-carried dependencies

Non-linear array access:
    for (y = 0; y < ny; y++)
        for (x = 0; x < nx; x++) {
            xr = x % squaresize[XUP];
            yr = y % squaresize[YUP];
            i = xr + yr;
            lattice[i].x = x;
            lattice[i].y = y;
        }

Indirect array access:
    for (i = 1; i < m; i++)
        for (j = iaL[i]; j < iaL[i+1] - 1; j++)
            x[i] = x[i] - aL[j] * x[jaL[j]];

Array access through pointers:
    for (int i = 0; i < n; i++) {
        *c = *a + *b;
        a++; b++; c++;
    }

Paragon Execution
[Figure: execution timeline. The program consists of sequential sections and three loops: Loop 1 (DO-ALL), Loop 2 (Possibly-Parallel), and Loop 3 (DO-ALL). Sequential sections run on the CPU; L2 appears on both the CPU and GPU timelines and is followed by a Conflict Check.]

Paragon Execution with Conflict
[Figure: the same timeline (Sequential, Loop 1 DO-ALL, Sequential, Loop 2 Possibly-Parallel, Sequential, Loop 3 DO-ALL, Sequential) for the case where a Conflict is detected for L2.]

Paragon Process Flow
[Figure: process flow. Input: sequential code. Offline compilation: loop classification and instrumentation, producing CUDA + pthread code. Runtime kernel management: profiling, execution without profiling, and the conflict management unit.]

Offline Compilation
Loop classification
–Sequential loops
  Dependence determined at compile time
  Assigned to CPU statically
–DO-ALL loops
  Assigned to GPU statically
–Possibly DO-ALL loops
  Dependence can only be determined at RUNTIME

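The classification above can be read as a mapping from loop class to execution target. The following is a minimal sketch under that reading; the enum names, `TARGET_SPECULATE`, and `assign_target` are hypothetical and do not come from the Paragon compiler.

```c
#include <assert.h>

/* The three loop classes named on the slide. */
enum loop_class { LOOP_SEQUENTIAL, LOOP_DO_ALL, LOOP_POSSIBLY_DO_ALL };

/* Hypothetical targets; TARGET_SPECULATE means the decision is left
 * to the runtime because dependences are unknown at compile time. */
enum target { TARGET_CPU, TARGET_GPU, TARGET_SPECULATE };

enum target assign_target(enum loop_class c)
{
    switch (c) {
    case LOOP_SEQUENTIAL:      return TARGET_CPU;       /* dependence known statically */
    case LOOP_DO_ALL:          return TARGET_GPU;       /* independence known statically */
    case LOOP_POSSIBLY_DO_ALL: return TARGET_SPECULATE; /* resolved at run time */
    }
    assert(0 && "unknown loop class");
    return TARGET_CPU;
}
```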
Runtime Profiling
Spawns threads on CPU
–Sequential execution thread
–Monitoring thread
  Keeps track of memory footprint
Marks loop
–Sequential
  If many conflicts
–Parallelizable
  If no/few conflicts
  Assigned to CPU and GPU

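One way to picture the marking step is to count how often a profiled run writes an address that an earlier iteration already wrote, then compare the count with a threshold. This is only a sketch for a loop of the form `C[idx[i]] = ...`; the slides do not say how "few" is defined, so `max_conflicts` and all names below are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

enum loop_mark { MARK_SEQUENTIAL, MARK_PARALLELIZABLE };

/* Count iterations whose write target idx[i] was already written by
 * an earlier iteration. size is the number of elements in the
 * written array and must be positive. */
int count_write_conflicts(const int *idx, int n, int size)
{
    assert(size > 0);
    unsigned char *seen = calloc((size_t)size, 1);
    assert(seen != NULL);
    int conflicts = 0;
    for (int i = 0; i < n; i++) {
        assert(idx[i] >= 0 && idx[i] < size);
        if (seen[idx[i]])
            conflicts++;
        seen[idx[i]] = 1;
    }
    free(seen);
    return conflicts;
}

/* max_conflicts is a hypothetical tuning knob, not a Paragon setting. */
enum loop_mark mark_loop(int conflicts, int max_conflicts)
{
    return conflicts > max_conflicts ? MARK_SEQUENTIAL : MARK_PARALLELIZABLE;
}
```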
Conflict Detection - Logging
Lazy conflict detection
Allocate log memory when executing kernel
–“write-set” for store
–“read-set” for load

Original loop:
    for (i = 0; i < N; i++) {
        idx = I[i];
        C[idx] = A[idx] + B[idx];
    }

Logs:
    int C_wr_log[sizeof_C];
    bool C_rd_log[sizeof_C];

Kernel loop:
    for (i = tid; i < N; i += ThreadCnt) {
        idx = I[i];
        C[idx] = A[idx] + B[idx];
    }

Logging call added for the store to C[idx]:
    AtomicInc(C_wr_log[idx]);

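The logging scheme can be tried on a CPU by emulating the GPU threads one after another. The sketch below is an illustration only: `logged_kernel` is a hypothetical name, a plain increment stands in for `AtomicInc` because the emulation is single-threaded, and the read log is omitted because this loop never loads from `C`.

```c
#include <assert.h>

/* Emulates ThreadCnt GPU threads sequentially: thread tid handles
 * iterations tid, tid + ThreadCnt, ... as in the kernel loop.
 * C_wr_log must be zero-initialized and as long as C. */
void logged_kernel(const int *I, const int *A, const int *B, int *C,
                   int *C_wr_log, int N, int ThreadCnt)
{
    assert(ThreadCnt > 0);
    for (int tid = 0; tid < ThreadCnt; tid++) {
        for (int i = tid; i < N; i += ThreadCnt) {
            int idx = I[i];
            C[idx] = A[idx] + B[idx];
            C_wr_log[idx]++;  /* stands in for AtomicInc(C_wr_log[idx]) */
        }
    }
}
```

After the call, `C_wr_log[k]` holds the number of iterations that stored to `C[k]`; any value above 1 means two iterations targeted the same element.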
Conflict Detection - Checking
Done in parallel after the kernel finishes
Conflict if
–Address written more than once
–Address both read and written at least once
[Figure: Thread 1 ... Thread N each check one entry [0] ... [N] of C_wr_log and C_rd_log; each entry is reported as OK or Conflict.]

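The two conflict conditions translate directly into a scan over the logs. The slide performs this check in parallel with one thread per log entry; the sketch below is a sequential CPU version for illustration, and the name `has_conflict` and its signature are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if any address was written more than once, or was
 * both read and written. wr_log holds per-address store counts,
 * rd_log holds per-address "was loaded" flags. */
bool has_conflict(const int *wr_log, const bool *rd_log, int size)
{
    assert(size >= 0);
    for (int a = 0; a < size; a++) {
        if (wr_log[a] > 1)
            return true;   /* written more than once */
        if (wr_log[a] >= 1 && rd_log[a])
            return true;   /* read and written */
    }
    return false;
}
```

Each iteration of the loop body depends only on entry `a`, which is why the check maps naturally onto one GPU thread per entry.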
Experimental Setup
CPU
–Intel Core i7
GPU
–NVIDIA GTX 560 with 2GB GDDR5
Benchmarks
–Loops with pointers
  FDTD, Seidel, Jacobi2d, GEMM, TMV
–Indirect/non-linear access
  Saxpy, House, Ipvec, Ger, Gemver, SOR, FWD

Results for Pointer Access
[Figure: results chart for the pointer-access benchmarks; annotated value: 36x]

Results for Indirect Access
[Figure: results chart for the indirect-access benchmarks]

Conclusion
Paragon improves performance
–More GPU utilization
–Speculatively runs possibly-parallel loops on the GPU
No performance penalty on mis-speculation
–The CPU runs the loop sequentially at the same time
–Conflict checking is done on the GPU

Q & A

Overhead Breakdown
[Figure: overhead breakdown chart]