HiPC 2010 AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS Wenjing Ma Gagan Agrawal The Ohio State University

HiPC 2010 GPGPU
General-Purpose Programming on GPUs (accelerators)
- High performance/price ratio
- Good language support: CUDA
Performance vs. productivity
- Hard to program
- A memory hierarchy to manage explicitly
...

HiPC 2010 Architecture of GPU

HiPC 2010 Automatic Code Generation
Device memory access is expensive
- Using shared memory
- Texture and constant memory
- Coalescing device memory accesses
...
Get high performance from the GPU and make the programming simple!

HiPC 2010 FEATURES OF SHARED MEMORY
- Small and fast, like a cache
- 16KB on each multiprocessor (no more than 48KB even on the latest GPUs)
- Read-write
- Software controlled: __shared__ float data[N][N]; (array dimensions must be compile-time constants)
- Allocating shared memory is similar to register allocation

HiPC 2010 Problem Formulation for Shared Memory Arrangement
- Consider the variables and basic blocks in a function
  - A variable may be an array element, a whole array, or a section of an array
- Each variable can have several live ranges in the function
  - Access feature of a live range: read, write, read-write, or temp
- Determine in which basic block each variable is allocated to shared memory
  - assign_point[i][k]: variable i is assigned to shared memory in basic block k

HiPC 2010 Integer Programming Problem
Integer Linear Programming
- Objective function: maximize z = c^T x
- Constraints: Ax <= b
- Solution: the values of x
A special case of linear programming in which all the unknown variables are integers (0-1 in our case)
- Solvable for problems of reasonable size
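Because every variable is 0-1, small instances of such a problem can even be solved by exhaustive search. A minimal sketch of that idea (the objective vector and constraint matrix below are made-up toy data, not the paper's actual model):

```python
from itertools import product

def solve_binary_ilp(c, A, b):
    """Maximize c^T x subject to A x <= b, with each x_i in {0, 1}.

    Brute force over all 2^n assignments; only viable for small n.
    """
    n = len(c)
    best_x, best_z = None, float("-inf")
    for x in product((0, 1), repeat=n):
        # Check every constraint row: sum_j A[i][j] * x[j] <= b[i]
        if all(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
               for row, b_i in zip(A, b)):
            z = sum(c_j * x_j for c_j, x_j in zip(c, x))
            if z > best_z:
                best_x, best_z = x, z
    return best_x, best_z

# Toy instance: maximize 5*x0 + 4*x1 + 3*x2 subject to 2*x0 + 3*x1 + x2 <= 3
x, z = solve_binary_ilp(c=[5, 4, 3], A=[[2, 3, 1]], b=[3])
```

In practice an ILP solver library would replace the brute-force loop; the point is only that the 0-1 structure keeps the search space finite.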

HiPC 2010 Integer Programming for Shared Memory Arrangement
Objective function
- Maximize shared memory usage
- Minimize data transfer between levels of the memory hierarchy

HiPC 2010 Integer Programming for Shared Memory Arrangement (cont'd)
Objective function [formula shown as an image; not captured in the transcript]

HiPC 2010 An Example to Show size_alloc

for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];
...

HiPC 2010 Integer Programming for Shared Memory Arrangement (cont'd)
Constraints
- The total allocation does not exceed the limit of shared memory at any time
- At most one assign_point is 1 in each live range
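Both constraint families can be checked directly against a candidate assign_point matrix. A sketch of such a check (the occupancy model, sizes, live ranges, and capacity below are illustrative assumptions, not the paper's exact formulation):

```python
def is_feasible(assign_point, size, live_range, capacity):
    """Check a 0-1 assign_point matrix against the two constraint families.

    assign_point[i][k] == 1 means variable i enters shared memory in block k.
    size[i] is the shared-memory footprint of variable i.
    live_range[i] is the set of basic blocks in one live range of variable i.
    Assumes a variable stays resident from its assignment point to the end
    of its live range.
    """
    n_vars = len(assign_point)
    n_blocks = len(assign_point[0])
    # 1) At most one assignment point per live range.
    for i in range(n_vars):
        if sum(assign_point[i][k] for k in live_range[i]) > 1:
            return False
    # 2) Capacity: in every block, the resident variables must fit.
    for k in range(n_blocks):
        resident = sum(
            size[i]
            for i in range(n_vars)
            if k in live_range[i]
            and any(assign_point[i][kk] for kk in live_range[i] if kk <= k)
        )
        if resident > capacity:
            return False
    return True
```

An ILP solver enforces the same conditions as linear inequalities over the assign_point variables rather than checking them after the fact.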

HiPC 2010 Integer Programming for Shared Memory Arrangement (cont'd)
Obtaining the parameters, using the LLVM compiler framework
- Pass 1: get access features (read, write, read-write, temp)
- Pass 2: get live ranges, loop information, indices, and all other parameters

HiPC 2010 Code Generation
- Follows the shared memory arrangement obtained from the integer programming model
- Built on the framework from our previous work
- Moves data to cover the gap caused by data evicted from shared memory

HiPC 2010 An Example

Input code:

for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];

Parameters: A: n*r, B: m*r, C: r; n = 2048, m = 3, r = 3, NUM_THREADS = 256

Output of the integer programming solver:

assign_point[0][1] = 1;
assign_point[1][0] = 1;
assign_point[2][0] = 1;
/* all other elements of assign_point are 0 */

HiPC 2010 An Example (cont'd)
Generated code:

__shared__ float s_B[m * r];
__shared__ float s_C[r * NUM_THREADS];
__shared__ float s_A[r * NUM_THREADS];

for (int i = 0; i < m * r; i++)
    s_B[i] = B[i];
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid * r + j] = A[tid + i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[tid * r + k] += s_A[tid * r + k] - s_B[j * r + k];
}
/* Synchronize and combine the per-thread copies of C */
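The transformation can be sanity-checked on the CPU: each "thread" keeps a private slice of C, and the combined result must equal the naive triple loop. A Python sketch of that check (thread behavior is simulated sequentially, and the sizes are scaled down from the slide's n = 2048, NUM_THREADS = 256 for brevity):

```python
import random

def naive(A, B, n, m, r):
    """Reference version: the original triple loop."""
    C = [0.0] * r
    for i in range(n):
        for j in range(m):
            for k in range(r):
                C[k] += A[i][k] - B[j][k]
    return C

def tiled(A, B, n, m, r, num_threads):
    """Simulated shared-memory version: per-thread C slices, then a reduce."""
    # s_C mirrors s_C[tid*r + k] in the generated kernel.
    s_C = [0.0] * (r * num_threads)
    for i in range(0, n, num_threads):      # one "kernel iteration" per tile
        for tid in range(num_threads):      # threads run sequentially here
            s_A = [A[tid + i][k] for k in range(r)]   # per-thread s_A slice
            for j in range(m):
                for k in range(r):
                    s_C[tid * r + k] += s_A[k] - B[j][k]
    # "Synchronize and combine": reduce the per-thread slices into C
    return [sum(s_C[tid * r + k] for tid in range(num_threads))
            for k in range(r)]

random.seed(0)
n, m, r, num_threads = 16, 3, 3, 4
A = [[random.random() for _ in range(r)] for _ in range(n)]
B = [[random.random() for _ in range(r)] for _ in range(m)]
```

The tiled version performs the same additions in a different order, so results match up to floating-point rounding.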

HiPC 2010 Suggesting Loop Transformation

Original code:

for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After loop interchange (tempDis becomes an array so that each data[r][c] can be staged in shared memory):

for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}
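Loop interchange only reorders the additions, so both versions must produce the same tempDis values. A small Python check of that equivalence (array sizes and contents below are arbitrary test data):

```python
import random

random.seed(1)
nRowCl, numCol = 5, 7
data_r = [random.random() for _ in range(numCol)]    # one row: data[r][c]
Acomp = [[random.random() for _ in range(numCol)] for _ in range(nRowCl)]
colCL = [random.randrange(numCol) for _ in range(numCol)]

# Original order: rc outer, c inner (tempDis is a scalar per rc)
orig = [sum(data_r[c] * Acomp[rc][colCL[c]] for c in range(numCol))
        for rc in range(nRowCl)]

# Interchanged order: c outer, rc inner (tempDis becomes an array;
# each data_r[c] is now touched once per c, so it can sit in shared memory)
inter = [0.0] * nRowCl
for c in range(numCol):
    for rc in range(nRowCl):
        inter[rc] += data_r[c] * Acomp[rc][colCL[c]]
```

The payoff on the GPU is locality, not arithmetic: after the interchange, each element of data is loaded into shared memory once and reused across all rc iterations.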

HiPC 2010 Experiments
- Effectiveness of using shared memory
  - Compared with the intuitive approach in previous work
  - Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory until the limit is reached
- Effectiveness of the loop transformation suggested by the integer programming model
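The greedy-sorting baseline is straightforward to state in code. A sketch (the variable names, footprints, and 16KB budget are illustrative, not measured values from the paper):

```python
def greedy_allocate(var_sizes, capacity):
    """Greedy baseline: take variables in increasing order of size
    until the shared-memory capacity would be exceeded."""
    chosen, used = [], 0
    for name, size in sorted(var_sizes.items(), key=lambda kv: kv[1]):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen, used

# Hypothetical 16KB budget and per-variable footprints in bytes
chosen, used = greedy_allocate({"A": 12000, "B": 36, "C": 6000}, 16 * 1024)
```

Greedy decides by size alone, so it can pick the "wrong" variables when a larger but more frequently accessed array would pay off more; capturing that trade-off is exactly what the integer programming formulation adds.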

HiPC 2010 Experiment Results

HiPC 2010 Experiment Results [Charts for K-means and EM; not captured in the transcript]

HiPC 2010 Experiment Results (cont'd) [Charts for PCA and Co-clustering; not captured in the transcript]

HiPC 2010 Effect of Loop Transformation [Charts for PCA and Co-clustering; not captured in the transcript]

HiPC 2010 Conclusion and Future Work
- Proposed an integer programming model for shared memory arrangement on GPUs
  - Considers numeric variables, arrays, and sections of arrays
- Suggested loop transformations for further optimization
- Obtained better results than the intuitive method
- Future work: automate the code generation and the selection of loop transformations

HiPC 2010 THANK YOU! Questions?