HiPC 2010 AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS Wenjing Ma Gagan Agrawal The Ohio State University.

GPGPU
General-purpose programming on GPUs (accelerators)
  High performance/price ratio
  High-level language support: CUDA
Performance vs. productivity
  Hard to program
  Memory hierarchy to manage
  ...

Architecture of GPU (figure)

Automatic code generation
Device memory access is expensive
  Using shared memory
  Texture and constant memory
  Coalescing device memory access
  ...
Get high performance from the GPU and make the programming simple!

Features of Shared Memory
Small, fast, like a cache
  16KB on each multiprocessor (no more than 48KB even on the latest GPU)
Read-write
Software controlled
    __shared__ float data[n][n];
Allocating shared memory: similar to register allocation

Problem Formulation for Shared Memory Arrangement
Consider variables and basic blocks in a function
  A variable can be an array element, an array, or a section of an array
  Each variable can have several live ranges in the function
  Access feature of a live range: read, write, read-write, temp
Determine in which basic block a variable is allocated to shared memory
  assign_point[i][k]: 1 if variable i is allocated at basic block k, 0 otherwise

Integer Programming Problem
Integer linear programming
  Objective function: maximize z = c^T x
  Constraints
  Solution: values of x
Special case of linear programming
  All the unknown variables are integers (0-1 in our case)
  Solvable for problems of reasonable size
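As a rough illustration of the 0-1 formulation, the sketch below enumerates every 0-1 vector x, keeps those that satisfy constraints of the assumed form A x <= b, and returns the largest c^T x. The function name `solve_01`, the cap `MAXV`, the constraint form, and the brute-force search are all assumptions made for illustration; the slides do not say which solver or constraint encoding was used.

```c
#include <assert.h>
#include <limits.h>

#define MAXV 16  /* hypothetical cap on unknowns; enumeration costs 2^n */

/* Maximize c^T x subject to A x <= b, with every x[j] in {0, 1}.
 * A is stored row-major with m rows and n columns.
 * Returns the best objective value and writes the maximizer into best_x;
 * returns INT_MIN (best_x untouched) if no assignment is feasible. */
int solve_01(int n, int m, const int *c, const int *A, const int *b,
             int *best_x)
{
    assert(n > 0 && n <= MAXV);
    int best = INT_MIN;
    for (unsigned mask = 0; mask < (1u << n); mask++) {
        int feasible = 1;
        /* Check every constraint row against the candidate assignment. */
        for (int i = 0; i < m && feasible; i++) {
            int lhs = 0;
            for (int j = 0; j < n; j++)
                if (mask & (1u << j))
                    lhs += A[i * n + j];
            if (lhs > b[i])
                feasible = 0;
        }
        if (!feasible)
            continue;
        /* Evaluate the objective for this feasible assignment. */
        int z = 0;
        for (int j = 0; j < n; j++)
            if (mask & (1u << j))
                z += c[j];
        if (z > best) {
            best = z;
            for (int j = 0; j < n; j++)
                best_x[j] = (int)((mask >> j) & 1u);
        }
    }
    return best;
}
```

Exhaustive search is only workable for a handful of unknowns; it is shown here because it makes the objective and constraints explicit, not because it scales.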

Integer Programming for Shared Memory Arrangement
Objective function
  Maximize shared memory usage
  Minimize data transfer between levels of the memory hierarchy

Integer Programming for Shared Memory Arrangement (cont'd)
Objective function

An Example to Show size_alloc
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

Integer Programming for Shared Memory Arrangement (cont'd)
Constraints
  Total allocation does not exceed the limit of shared memory at any time
  At most one assign_point is 1 in each live range
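The two constraints can be sketched as a feasibility check over a candidate assign_point matrix. This is a simplified reading, not the authors' formulation: it assumes one live range per variable (an inclusive interval of basic blocks), that a variable stays resident from its assign point to the end of its live range, and that assign points outside the live range are invalid. `struct live_range`, `feasible`, `NVARS`, and `NBLOCKS` are hypothetical names.

```c
#include <assert.h>

#define NVARS   3  /* hypothetical problem size */
#define NBLOCKS 4

/* One live range per variable, as an inclusive basic-block interval.
 * (The slides allow several live ranges per variable; one is assumed.) */
struct live_range {
    int first_bb;
    int last_bb;
    int size_bytes;
};

/* Returns 1 if assign_point satisfies both constraints, 0 otherwise:
 *  (a) at most one assign point per live range, none outside it;
 *  (b) bytes resident in shared memory never exceed limit in any block. */
int feasible(int assign_point[NVARS][NBLOCKS],
             const struct live_range lr[NVARS], int limit)
{
    int used[NBLOCKS] = {0};
    for (int i = 0; i < NVARS; i++) {
        assert(lr[i].first_bb >= 0 && lr[i].last_bb < NBLOCKS);
        int count = 0;
        for (int k = 0; k < NBLOCKS; k++) {
            if (!assign_point[i][k])
                continue;
            if (k < lr[i].first_bb || k > lr[i].last_bb)
                return 0;              /* assigned outside the live range */
            if (++count > 1)
                return 0;              /* more than one assign point */
            /* Assumed residency: from the assign point to the range end. */
            for (int t = k; t <= lr[i].last_bb; t++)
                used[t] += lr[i].size_bytes;
        }
    }
    for (int k = 0; k < NBLOCKS; k++)
        if (used[k] > limit)
            return 0;                  /* capacity exceeded in block k */
    return 1;
}
```

With the 16KB limit mentioned earlier (16384 bytes), two 8KB variables may overlap in a block, but adding a third 4KB variable to the same block makes the arrangement infeasible.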

Integer Programming for Shared Memory Arrangement (cont'd)
Obtaining parameters
  Using the LLVM compiler framework
  Pass 1: get access features
    Read, write, read-write, temp
  Pass 2: get live ranges, loop information, indices, and all other parameters

Code Generation
According to the shared memory arrangement obtained from the integer programming model
Under the framework from previous work
Move data to cover the gap caused by data evicted from shared memory

An Example
A: n*r, B: m*r, C: r
n: 2048, m: 3, r: 3, NUM_THREADS: 256

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

Integer Programming Solver

    assign_point[0][1] = 1;
    assign_point[1][0] = 1;
    assign_point[2][0] = 1;
    /* all other elements of assign_point are 0 */

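An Example (cont'd)
Generated code:

    __shared__ float s_B[m][r];
    __shared__ float s_C[r*NUM_THREADS];
    __shared__ float s_A[r*NUM_THREADS];

    /* copy all of B into shared memory */
    for (int i = 0; i < m*r; i++)
        s_B[i / r][i % r] = B[i / r][i % r];

    for (int i = 0; i < n; i += NUM_THREADS) {
        /* each thread stages one row of A */
        for (int j = 0; j < r; j++)
            s_A[tid*r + j] = A[tid + i][j];
        /* each thread accumulates into its own r-element slice of s_C */
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                s_C[tid*r + k] += s_A[tid*r + k] - s_B[j][k];
        ......
    }
    /* synchronize and combine the per-thread copies of C */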
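To see why per-thread copies of C followed by a combination step give the same answer as the sequential loop, the host-side sketch below emulates the threads with an ordinary loop over tid. It assumes each thread accumulates into its own r-element slice of the buffer (index tid*r + k) and that the slices are summed at the end; the sizes and the function names `reduce_sequential` and `reduce_emulated` are hypothetical, and nothing here runs on a GPU.

```c
#include <assert.h>

#define EX_N 8            /* hypothetical sizes, much smaller than the slide's */
#define EX_M 3
#define EX_R 3
#define EX_THREADS 4

/* Reference: the sequential triple loop from the example. */
void reduce_sequential(float A[EX_N][EX_R], float B[EX_M][EX_R],
                       float C[EX_R])
{
    for (int k = 0; k < EX_R; k++)
        C[k] = 0.0f;
    for (int i = 0; i < EX_N; i++)
        for (int j = 0; j < EX_M; j++)
            for (int k = 0; k < EX_R; k++)
                C[k] += A[i][k] - B[j][k];
}

/* Emulation: thread tid handles rows tid, tid + EX_THREADS, ... and keeps
 * a private slice of s_C; the slices are combined after the main loop. */
void reduce_emulated(float A[EX_N][EX_R], float B[EX_M][EX_R],
                     float C[EX_R])
{
    float s_C[EX_R * EX_THREADS] = {0.0f};
    for (int i = 0; i < EX_N; i += EX_THREADS)
        for (int tid = 0; tid < EX_THREADS; tid++) {
            if (tid + i >= EX_N)
                continue;               /* guard for a ragged last tile */
            for (int j = 0; j < EX_M; j++)
                for (int k = 0; k < EX_R; k++)
                    s_C[tid * EX_R + k] += A[tid + i][k] - B[j][k];
        }
    /* Combination step: sum the per-thread slices. */
    for (int k = 0; k < EX_R; k++) {
        C[k] = 0.0f;
        for (int tid = 0; tid < EX_THREADS; tid++)
            C[k] += s_C[tid * EX_R + k];
    }
}
```

With small integer-valued inputs every partial sum is exact in float, so the two functions agree exactly; with general floating-point data the different summation order can change the last bits.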

Suggesting Loop Transformation
Original loop:

    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }

Transformed loop:

    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++) {
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
        }
    }
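The transformation is a loop interchange in which the scalar accumulator becomes an array indexed by rc. A small host-side check, sketched below, shows that both orders produce the same value for every rc. The fixed sizes, the function names `dist_original` and `dist_transformed`, and the `out` array (used so the original loop's per-rc result can be compared) are assumptions for illustration; `data_row` stands for the row `data[r]`.

```c
#include <assert.h>

#define NROWCL 3   /* hypothetical sizes */
#define NUMCOL 4
#define NCOLCL 2

/* Original order: rc outer, c inner, one scalar accumulator per rc. */
void dist_original(const float *data_row, float Acomp[NROWCL][NCOLCL],
                   const int *colCL, float out[NROWCL])
{
    for (int rc = 0; rc < NROWCL; rc++) {
        float tempDis = 0.0f;
        for (int c = 0; c < NUMCOL; c++)
            tempDis = tempDis + data_row[c] * Acomp[rc][colCL[c]];
        out[rc] = tempDis;
    }
}

/* Transformed order: c outer, rc inner, accumulator expanded to an array.
 * Each data_row[c] is touched once per outer iteration, which is where a
 * load into shared memory would go. */
void dist_transformed(const float *data_row, float Acomp[NROWCL][NCOLCL],
                      const int *colCL, float tempDis[NROWCL])
{
    for (int rc = 0; rc < NROWCL; rc++)
        tempDis[rc] = 0.0f;
    for (int c = 0; c < NUMCOL; c++)
        for (int rc = 0; rc < NROWCL; rc++)
            tempDis[rc] += data_row[c] * Acomp[rc][colCL[c]];
}
```

For a fixed rc both versions add the same terms in the same order of c, so the results match exactly, not just approximately.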

Experiments
Effectiveness of using shared memory
  Compare with the intuitive approach from previous work
  Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory until the limit of shared memory is reached
Effectiveness of the loop transformation suggested by the integer programming model
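The greedy baseline can be sketched in a few lines: sort by size, then allocate in order while the running total stays within the limit. The struct layout, the names `greedy_alloc` and `by_size`, and the choice to stop at the first variable that does not fit are assumptions; the slides only describe the idea in one sentence.

```c
#include <assert.h>
#include <stdlib.h>

struct var {
    int id;          /* assumed to be in 0..n-1 */
    int size_bytes;
};

/* qsort comparator: increasing size. */
static int by_size(const void *a, const void *b)
{
    const struct var *x = a, *y = b;
    return (x->size_bytes > y->size_bytes) - (x->size_bytes < y->size_bytes);
}

/* Sorts vars in place by increasing size, then sets in_shared[id] = 1 for
 * each variable that fits within limit. Stops at the first variable that
 * does not fit (one reading of "until the limit is reached").
 * Returns the number of bytes allocated. */
int greedy_alloc(struct var *vars, int n, int limit, int *in_shared)
{
    assert(n >= 0 && limit >= 0);
    for (int i = 0; i < n; i++)
        in_shared[vars[i].id] = 0;
    qsort(vars, (size_t)n, sizeof vars[0], by_size);
    int used = 0;
    for (int i = 0; i < n; i++) {
        if (used + vars[i].size_bytes > limit)
            break;
        used += vars[i].size_bytes;
        in_shared[vars[i].id] = 1;
    }
    return used;
}
```

Unlike the integer programming model, this baseline ignores live ranges and access features, so a small but rarely used variable can displace a larger one that is accessed far more often.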

Experiment Results (figure)

Experiment Results (figure panels: K-means, EM)

Experiment Results (cont'd) (figure panels: PCA, Co-clustering)

Effect of Loop Transformation (figure panels: PCA, Co-clustering)

Conclusion and Future Work
Proposed an integer programming model for shared memory arrangement on GPUs
  Considers numeric variables, arrays, and sections of arrays
Suggested loop transformations for optimization
Obtained better results than the intuitive method
Future work: automate the code generation and the selection of loop transformations

Thank you! Questions?
