A Library of Data-Copy and Data-Layout Optimizations
Malik Muhammad Zaki Murtaza Khan, Chun Chen, Jacqueline Chame, Mary Hall

Funded by NSF awards CSR, DOE grant DE-FC02-06ER25765, and by a gift from Intel Corporation.

Poster panels: Motivation & Overview; Representation & Implementation; Data copy library with Auto-tuning Compiler; Code Example: Jacobi

Motivation & Overview

♦ Modern architectures are incorporating complex memory hierarchies: heterogeneous memory (DRAM/SRAM, different latency and bandwidth properties, different access policies) and software-managed storage.
♦ Data copy is an important transformation for orchestrating data movement and reorganization.
♦ A compiler framework can incorporate data copy to optimize memory hierarchy utilization.

Data copy library with Auto-tuning Compiler

♦ Data copy forms part of an auto-tuning compiler framework.
♦ While using the library, the auto-tuning compiler can empirically evaluate the different implementations.
♦ Future extensions of the library will be easily incorporated by the auto-tuning framework.

Without framework support (from original iteration space):

  for (t1 = 1; t1 <= n; t1++) {
    s1(t1,1);
    s2(t1,1);
    if (t1 <= 1) {
      s3(1,1);
    }
    for (t2 = 2; t2 <= t1-1; t2++) {
      s2(t1,t2);
    }
    if (t1 >= 2) {
      s2(t1,t1);
      s3(t1,t1);
    }
  }

Data copy library

Data copy dynamically rearranges data layouts by copying data between different memory locations, possibly in different memory structures.

Conventional architecture:
♦ Improves locality and avoids cache/memory-bank conflicts.

Accelerators:
♦ Copying data into the memory of accelerators (e.g., FPGAs and GPUs) is a precursor to execution.
♦ Rearranging data can improve parallelism of memory accesses and increase memory bandwidth.
♦ Automating this process can ease the programmer's job of using the accelerator.

What's the need?
♦ Automatic code generation for a number of different architectures.
♦ Many similar reorganizations are required across platforms, but specific code is not portable.

What's the solution?
♦ A library of different copy implementations, with a shared front-end and architecture-specific code generation.
♦ Compiler technology to correctly rewrite access expressions automatically.

Data copy

♦ A powerful polyhedral model supports rewriting of the affine access expressions.
♦ Multiple data copy implementations form part of the library as callable functions.
♦ A script-level interface lets compilers and programmers use the implementation functions.
♦ Multiple data layouts can be generated for different computations at different stages of optimization.

Polyhedral Model

♦ The polyhedral model provides simple representations of different program structures.
♦ It allows the iteration domains and statement instances to be viewed as objects in a polyhedron.
♦ Omega Library Plus* provides the framework to implement the model.
♦ The advanced Omega code generation tool generates efficient loop code.

*Omega Library Plus is a new version of the old Omega Library, with unique features.

How does it work?

[Diagram labels: Polyhedral Framework; Source Code; Data copy library; Data Copy Implementations; Data Copy Function Calls; Auto-tuning Compiler Framework; Optimized Code]

Source loop nest:

  DO I=1,N-1
    DO J=1,N-1
      A(I,J) = ( B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1) )/4

Polyhedral framework producing a simple representation of iteration domains:

  { Sym=[n] [_t1,_t2] : 1 <= _t2 < n && 1 <= _t1 < n }

Simple relations, based on linear equations, representing indices in the array accesses:

  {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = 1+Out_2 }
  {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && 1+_t2 = Out_2 }
  …….
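To make the iteration domain and access relations above concrete, here is a minimal C sketch. It is not the Omega Library Plus API: the types `access_rel` and `index_pair` and all function names are hypothetical, and each relation is assumed to be a pure constant offset of the form Out_1 = _t1 + d1, Out_2 = _t2 + d2 (which covers the four Jacobi references). For example, the relation with `_t2 = 1+Out_2` corresponds to the offset (0, -1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of an access relation
 * { [t1,t2] -> [Out_1,Out_2] : Out_1 = t1 + d1 && Out_2 = t2 + d2 }. */
typedef struct { int d1, d2; } access_rel;

/* Array indices produced by applying a relation to one iteration. */
typedef struct { int out1, out2; } index_pair;

/* Membership in { [t1,t2] : 1 <= t1 < n && 1 <= t2 < n }. */
int in_domain(int t1, int t2, int n) {
    return 1 <= t1 && t1 < n && 1 <= t2 && t2 < n;
}

/* Map one iteration to the array element it touches. */
index_pair apply_access(access_rel r, int t1, int t2) {
    index_pair p;
    p.out1 = t1 + r.d1;
    p.out2 = t2 + r.d2;
    return p;
}

/* Count the integer points of the domain by scanning its bounding box. */
long domain_size(int n) {
    long count = 0;
    for (int t1 = 0; t1 <= n; t1++)
        for (int t2 = 0; t2 <= n; t2++)
            if (in_domain(t1, t2, n))
                count++;
    return count;
}

/* Return 1 if every access from every iteration stays inside an array
 * indexed 0..n in both dimensions, and 0 otherwise. */
int accesses_in_bounds(const access_rel *rels, size_t nrels, int n) {
    assert(rels != NULL && n >= 1);
    for (int t1 = 1; t1 < n; t1++)
        for (int t2 = 1; t2 < n; t2++)
            for (size_t k = 0; k < nrels; k++) {
                index_pair p = apply_access(rels[k], t1, t2);
                if (p.out1 < 0 || p.out1 > n || p.out2 < 0 || p.out2 > n)
                    return 0;
            }
    return 1;
}
```

Calling `accesses_in_bounds` with the four offsets (-1,0), (1,0), (0,-1), (0,1) checks that the Jacobi stencil never reads outside B(0..N, 0..N). A real polyhedral framework would establish this symbolically for all n instead of enumerating points.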
  int A[34][18];
  int B[34][18];
  for (i = 1; i < 33; i++)
    for (j = 1; j < 17; j++) {
      A[i][j] = (B[i + 1][j] + B[i - 1][j] + B[i][j + 1] + B[i][j - 1]) / 4;
    }

(a) Jacobi unannotated kernel

  for (i = 1; i < 33; i+=2)      /* unroll by 1 */
    for (j = 1; j < 17; j+=2) {  /* unroll by 1 */
      A[i][j]         = (B[i + 1][j]     + B[i - 1][j]     + B[i][j + 1]     + B[i][j - 1])     / 4;
      A[i][j + 1]     = (B[i + 1][j + 1] + B[i - 1][j + 1] + B[i][j + 2]     + B[i][j])         / 4;
      A[i + 1][j]     = (B[i + 2][j]     + B[i][j]         + B[i + 1][j + 1] + B[i + 1][j - 1]) / 4;
      A[i + 1][j + 1] = (B[i + 2][j + 1] + B[i][j + 1]     + B[i + 1][j + 2] + B[i + 1][j])     / 4;
    }

(b) unroll-and-jam

1. Custom Data Layout for Memory Parallelism. Byoungro So, Mary Hall, and Heidi Ziegler. CGO'04, Palo Alto, CA, March 20-24, 2004.

Memory Layout for FPGA

Exploiting memory parallelism offered by multiple memory banks.
♦ Unroll the loops.
♦ Apply scalar replacement.
♦ Do reuse analysis.
♦ Partition array references across different memory banks according to access patterns.

  A(0,0) A(0,2) ...   B(0,0) B(0,2) ...
  A(1,0) A(1,2) ...   B(1,0) B(1,2) ...
  A(0,1) A(0,3) ...   B(0,1) B(0,3) ...
  A(1,1) A(1,3) ...   B(1,1) B(1,3) ...

Array references partitioned in different memory banks.

Memory Layout for GPU

Exploiting parallelism in the memory hierarchy of a GPU.
♦ Split a task into subtasks.
♦ Divide input data into blocks that fit shared memory.
♦ Copy from global memory into shared memory.
♦ Copy results from shared memory back to global memory.

  #define N 16

  __global__ void jacobi_GPU(int* a, int* c)
  {
    __shared__ float b[4][4];
    int thidx = ....;
    int thidy = ....;
    if (blockIdx.x == 0) {
      if (threadIdx.x == 0)
        b[...] = a[...];
      if (threadIdx.x == 0 && (blockIdx.y == threadIdx.y))
        b[...] = a[...];
      if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.y == 0)
        b[...] = a[...];
    }
    ….
    if (thidx > 0 && thidx < N-1 && thidy > 0 && thidy < N-1)
      c[...] = 0.8 * (b[...] + b[...] + b[...] + b[...]);
  }

  int main() {
    .....
    dim3 dimBlock(N/2, N/2);
    dim3 dimGrid(N/4, N/4);
    cudaMalloc((void **)&a_gpu, N*N*sizeof(int));
    cudaMalloc((void **)&c_gpu, N*N*sizeof(int));
    cudaMemcpy(a_gpu, a, N*N*sizeof(int), cudaMemcpyHostToDevice);
    jacobi_GPU<<<dimGrid, dimBlock>>>(a_gpu, c_gpu);
    cudaMemcpy(cr_gpu, c_gpu, N*sizeof(int), cudaMemcpyDeviceToHost);
    ....
    return 0;
  }

[Code callouts: Shared memory space allocation; Copying data to the shared memory]
[Figures: Memory hierarchy in a GPU; Multiple memory banks in an FPGA]

Array access expressions can be modified to implement required optimizations:

  {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 }
  {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 }
  …….

  DO I=1,N-1
    · · · = D0[i];
    · · · = D1[i];
    DO J=1,N-1
      · · · = · · · + B0[i+j]*· · ·;    // u(0,0)
      · · · = · · · + B1[i+j]*· · ·;    // u(0,1)
      · · · = · · · + · · · * · · ·;    // u(1,0)
      · · · = · · · + B0[i+j+1]*· · ·;  // u(1,1)
    }
    D1[i] = · · ·;
    D0[i] = · · ·;
  }

Modified array references or newly created arrays help set up the storage for the data copy optimization.

  source: jacobi.sp2
  procedure: 0
  …
  unroll(...)
  datacopy(...)

(a) Script Interface
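The pairing of a data copy with rewritten array references can be sketched in plain C for the 34x18 Jacobi arrays. This is an illustration only, not output of the library: the helpers `bank_of`, `offset_of`, `copy_to_banks`, `load_banked`, and `layouts_agree` are hypothetical names, and the bank numbering (row parity and column parity selecting one of four banks) is an assumption read off the FPGA bank layout figure.

```c
#include <assert.h>

#define ROWS 34
#define COLS 18
#define BANKS 4
#define BANK_SIZE ((ROWS / 2) * (COLS / 2))

/* Assumed bank numbering: (0,0)->0, (1,0)->1, (0,1)->2, (1,1)->3,
 * i.e. the parity of the row and of the column picks the bank. */
int bank_of(int i, int j) { return (j % 2) * 2 + (i % 2); }

/* Position of element (i,j) inside its bank (row-major over half-size grid). */
int offset_of(int i, int j) { return (i / 2) * (COLS / 2) + (j / 2); }

/* The data copy: scatter the original array into the four banks. */
void copy_to_banks(int src[ROWS][COLS], int banks[BANKS][BANK_SIZE]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            banks[bank_of(i, j)][offset_of(i, j)] = src[i][j];
}

/* The rewritten access expression that replaces B[i][j]. */
int load_banked(int banks[BANKS][BANK_SIZE], int i, int j) {
    assert(0 <= i && i < ROWS && 0 <= j && j < COLS);
    return banks[bank_of(i, j)][offset_of(i, j)];
}

/* Return 1 if the Jacobi stencil computes the same values from the
 * banked copy as from the original layout, and 0 otherwise. */
int layouts_agree(void) {
    static int B[ROWS][COLS];
    static int banks[BANKS][BANK_SIZE];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            B[i][j] = (i * 31 + j * 17) % 101;  /* arbitrary test data */
    copy_to_banks(B, banks);
    for (int i = 1; i < ROWS - 1; i++)
        for (int j = 1; j < COLS - 1; j++) {
            int ref = (B[i + 1][j] + B[i - 1][j] + B[i][j + 1] + B[i][j - 1]) / 4;
            int alt = (load_banked(banks, i + 1, j) + load_banked(banks, i - 1, j)
                     + load_banked(banks, i, j + 1) + load_banked(banks, i, j - 1)) / 4;
            if (ref != alt)
                return 0;
        }
    return 1;
}
```

Because both extents are even, the mapping places each element in exactly one slot of one bank, so the copy is lossless and the rewritten loads return the original values. On real hardware the banks would be separate physical memories, not rows of one C array.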