
2 Sep 11, 2009
Automatic Transformation of Applications onto GPUs and GPU Clusters
PhD Candidacy Presentation: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University

3 Outline of Contents
Introduction
– Accelerators, GPGPU, and GPU clusters
– Difficulty of GPU programming
Current Work
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
Future Work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend the FREERIDE framework

4 Introduction
Accelerators, GPGPU, and GPU clusters
– Multi-core architectures are increasingly popular in high-performance computing
– Examples: GPUs, the Cell processor, FPGAs
– GPUs have a good performance/price ratio
Difficulty of programming
– How to program a cluster with accelerators on each node?

5 Our Approach
Provide high-level support for programming emerging high-end configurations
– High-level language as input
– Target code for particular devices as output
Focus on specific application classes
– Simplifies code generation

6 Specific Implementations
Input languages: C/C++, MATLAB, …
Target platforms: GPU, GPU cluster, …

7 Outline of Contents
Introduction
Current Work
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
Future Work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend the FREERIDE framework

8 Translation System for Enabling Data Mining Applications on GPUs
Focus on data mining applications
– Common structure
Translation system
– Simple user input
– Optimization for GPU execution

9 Complications of GPU Programming
– The user has to thoroughly understand the GPU architecture and the programming model of the language
– Has to deal with memory allocation
– Needs to develop and optimize strategies for data movement between memory hierarchies (an example is the shared memory on the GPU)

10 Parallel Data Mining
Common structure of data mining applications (adapted from FREERIDE):

  while (…) { /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;
    }
    Finalize();
  }

11 Basic Idea of Data Mining on GPUs
Parallel shared-memory programming
Three steps in the main loop:
– Reading data
– Computing update
– Writing back update

12 GREENRIDE: Translation System for Enabling Data Mining Applications on GPUs
User input
Code analyzer
– Analysis of variables (variable type and size)
– Analysis of reduction functions (sequential code from the user)
Code generator (generates CUDA code and the C++ code invoking the kernel function)
Optimization

13 Architecture of the System
User input: variable information, reduction functions, optional functions
Code analyzer (in LLVM): variable analyzer and code generator
– Derives variable access patterns and combination operations
Output: host program (data copy and thread grid configuration) and kernel functions, compiled into an executable

14 User Input
– A sequential reduction function
– Optional functions (initialization function, combination function, …)
– Values of variables / sizes of arrays
– Variables to be used in the reduction function

15 Analysis of Sequential Code
– Identify the parallel loop: one major loop is parallelized
– Get the access features of each variable
– Figure out the data to be replicated
– Get the operator for global combination
– Calculate the size of shared memory to use and which data to keep in shared memory

16 Get the Access Features of Each Variable
– Use LLVM to generate the intermediate representation
– Analyze the intermediate representation to collect information about each pointer
– Pointers used in "store" operations are output pointers
– The other pointers in the argument list are input variables
– Pointers that do not appear in the argument list are temporary storage
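The classification above can be sketched on a toy IR. This is a hedged illustration, not the system's actual LLVM pass: the IR is reduced to (opcode, pointer) pairs, and all names (`classify_pointers`, the sample pointers) are hypothetical.

```python
# Toy sketch of the pointer classification described above, assuming a
# simplified IR in which each instruction is an (opcode, pointer) pair.
# The real system analyzes LLVM IR; everything here is illustrative.

def classify_pointers(ir_instructions, argument_list):
    """Split pointers into outputs, inputs, and temporary storage."""
    # Pointers that appear in "store" operations are output pointers.
    stored = {ptr for op, ptr in ir_instructions if op == "store"}
    outputs = [p for p in argument_list if p in stored]
    # The other pointers in the argument list are input variables.
    inputs = [p for p in argument_list if p not in stored]
    # Pointers absent from the argument list are temporary storage.
    used = {ptr for _, ptr in ir_instructions}
    temporaries = sorted(used - set(argument_list))
    return outputs, inputs, temporaries

ir = [("load", "data"), ("store", "scratch"), ("store", "reduc")]
outs, ins, temps = classify_pointers(ir, ["data", "reduc"])
# outs == ["reduc"], ins == ["data"], temps == ["scratch"]
```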

17 Extract Information of Variable Access
The variable analyzer takes the IR from LLVM, the argument list, and the user input, and extracts:
– Output variables
– Read-only variables
– Temporary variables

18 Data to be Replicated
Variables that need to be replicated for each thread:
– Variables to be written, since different threads may access the same object
– Need combination at the end

19 Get the Operator for Local Combination
– Also done by the code analyzer using LLVM
– Only needed for output variables
– Associative and commutative operators: currently supports ADD and MUL; will support MAX, MIN, … in the future

20 Generate CUDA and C++/C Code
Revisit the common structure of data mining applications (adapted from FREERIDE):

  while (…) { /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;
    }
    Finalize();
  }

21 Generate CUDA and C++/C Code
– Memory allocation and copy
– Reduction functions and global combination
– Optimization: using shared memory on the GPU

22 Memory Allocation and Copy
– Device memory allocation, copying between memory hierarchies, replication for threads
– Need replication for each thread
(Figure: the output arrays A and B are replicated into private copies for threads T0 … T63)
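The per-thread replication and final combination on this slide can be modeled in a few lines. A minimal Python sketch; the thread count and the combination operator are chosen for illustration (the deck uses 64 threads per block and the operator found by the code analyzer).

```python
# Minimal sketch of per-thread replication of a reduction array: each
# thread gets a private copy so updates need no synchronization, and the
# copies are combined element-wise at the end. Thread count is illustrative.

NUM_THREADS = 4

def replicate(reduc, num_threads):
    """One private copy of the reduction array per thread."""
    return [list(reduc) for _ in range(num_threads)]

def combine(copies, op):
    """Element-wise combination of all private copies."""
    result = list(copies[0])
    for copy in copies[1:]:
        result = [op(a, b) for a, b in zip(result, copy)]
    return result

copies = replicate([0, 0, 0], NUM_THREADS)
for t in range(NUM_THREADS):   # each "thread" adds its id into slot 0
    copies[t][0] += t
total = combine(copies, lambda a, b: a + b)
# total == [6, 0, 0]  (0 + 1 + 2 + 3 in slot 0)
```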

23 Thread Grid Configuration
– Use 16 blocks in the thread grid
– 1-dimensional configuration for thread blocks

24 Reduction Functions
Global function
– Invokes the kernel reduction and the global combination function
– Invokes user-specified initialization and combination functions, if any
Kernel reduction function
– Generated from the original sequential code
– Divides the main loop by block number and thread number
– Rewrites the access indices
– …
Global combination

25 Program Structure of the Generated CUDA Code

Host program:
  Copy_data(HOST_TO_DEVICE);
  Device_reduct();
  Copy_data(DEVICE_TO_HOST);
  Combination();

Device function:
  Device_reduct() {
    Reduction();
    __syncthreads();
    Combination();
    __syncthreads();
  }

26 Global Combination
– Assumes all updates from each thread are combined with a commutative and associative operator
– An automatically generated global combination function, invoked by one thread
– Will consider cascaded combination (currently unnecessary)

27 Optimizations
– Use shared memory on the GPU (almost a "must"!)
– If provided, use user-specified initialization functions and combination functions
– Specify variables that are allocated once and reused by multiple invocations of the kernel
– …

28 Dealing with Shared Memory
Use a simple heuristic:
– Size = length * sizeof(type) * thread_info
  – length: size of the array
  – type: char, int, or float
  – thread_info: whether the array is copied to each thread
– Sort the arrays according to Size
– Mark each array as shared until the size exceeds the shared memory limit
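The heuristic can be sketched in a few lines. The sort direction (smallest first, so that more arrays fit) and the per-block thread count are assumptions; the slide leaves both unspecified.

```python
THREADS_PER_BLOCK = 64  # assumed; thread_info factor for per-thread data

def shared_memory_assignment(arrays, limit):
    """arrays: (name, length, elem_size, replicated_per_thread) tuples.
    Computes Size = length * sizeof(type) * thread_info, sorts by Size
    (smallest first, an assumption), and marks arrays as shared until
    the shared-memory limit would be exceeded."""
    sized = [(name, length * elem_size *
              (THREADS_PER_BLOCK if per_thread else 1))
             for name, length, elem_size, per_thread in arrays]
    sized.sort(key=lambda entry: entry[1])
    shared, used = [], 0
    for name, size in sized:
        if used + size > limit:
            break
        shared.append(name)
        used += size
    return shared

arrays = [("A", 100, 4, False), ("B", 10, 4, True),
          ("C", 200, 4, False), ("D", 5, 4, False)]
chosen = shared_memory_assignment(arrays, 1500)
# chosen == ["D", "A", "C"]; B (replicated per thread) is too large
```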

29 (Figure: shared memory assignment without sorting; arrays B, A, C, D)

30 (Figure: shared memory assignment with greedy sorting; arrays B, A, C, D)

31 Other Optimizations
– Reduce memory allocation and copy overhead: arrays shared by multiple iterations can be allocated and copied only once
– User-defined combination function: can possibly reduce the amount of data to combine

32 Experiments
Applications: popular data mining applications
– K-means clustering
– EM clustering
– PCA
Compare the performance of manual CUDA code and automatically generated code
– The effect of optimization, especially the use of shared memory

33 Experiment Results (K-means)
Results on GeForce 8800 GTX and GeForce 9800 GX2

34 Experiment Results (EM)

35 Experiment Results (PCA)

36 Outline of Contents
Introduction
Current Work
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
Future Work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend the FREERIDE framework

37 GMAT-DM: Automatic Transformation from MATLAB for GPUs
Pipeline: MATLAB code → OCTAVE parser (in GMAT-DM) → C code → GREENRIDE → CUDA code

38 Main Issues in GMAT-DM
Matrix manipulation
– Internal representation as 1-D arrays
– Type and shape information based on user input
Basic problem: matrix operation translation
– Matrix multiplication chain
– Function combination
– Other optimizations

39 Matrix Multiplication Chain
A * B * C, with A: 6x100, B: 100x2, C: 2x1
1. (A*B)*C: instruction count 1212, intermediate storage 12
2. A*(B*C): instruction count 800, intermediate storage 100
Use weighted metrics to determine the multiplication order
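The two orderings above can be verified with a short script that follows the slide's p*q*r instruction-count metric, with intermediate storage taken as the size of the inner product's result:

```python
def mult_cost(p, q, r):
    """Instruction count for a (p x q) * (q x r) product, following the
    slide's p * q * r metric."""
    return p * q * r

# A: 6x100, B: 100x2, C: 2x1
left_first = mult_cost(6, 100, 2) + mult_cost(6, 2, 1)     # (A*B)*C
left_storage = 6 * 2                                       # A*B is 6x2
right_first = mult_cost(100, 2, 1) + mult_cost(6, 100, 1)  # A*(B*C)
right_storage = 100 * 1                                    # B*C is 100x1
# left_first == 1212, right_first == 800, matching the slide
```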

40 Function Combination

Before (two separate loops; the intermediate array d holds all distances):
  for (i = 0; i < maxRow; i++)
    for (j = 0; j < k; j++) {
      /* code to compute distances */
      d[i][j] = sqrt(distances);
    }
  for (int i1 = 0; i1 < maxRow; i1++) {
    z[i1] = MAX_VALUE;
    g[i1] = 0;
    for (int i2 = 0; i2 < k; i2++) {
      if (z[i1] > d[i1][i2]) {
        z[i1] = d[i1][i2];
        g[i1] = i2;
      }
    }
  }

After (combined, avoiding the intermediate array):
  float temp_0, temp_z;
  int temp_g;
  for (i = 0; i < maxRow; i++) {
    temp_z = MAX_VALUE;
    temp_g = 0;
    for (j = 0; j < k; j++) {
      /* code to compute distances */
      temp_0 = sqrt(distances);
      if (temp_z > temp_0) {
        temp_z = temp_0;
        temp_g = j;
      }
    }
    z[i] = temp_z;
    g[i] = temp_g;
  }

41 Other Optimizations
– Avoid memory allocation overhead: use index transformation
  – Row extraction
  – Matrix transposition
– Rename when a single element is accessed through many iterations

42 Experiment Results (K-means)

43 Experiment Results (EM)
E phase and M phase

44 Experiment Results (PCA)

45 Outline of Contents
Introduction
Current Work
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
Future Work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend the FREERIDE framework

46 AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support
Revisit the common structure of data mining applications (adapted from FREERIDE):

  while (…) { /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;   /* offloaded to the GPU */
    }
    Finalize();
  }

47 Code Generation System

48 Experiment Results (K-means)
1.5 GB data set and 3 GB data set

49 Experiment Results (PCA)
64M rows with 4 components, and 2M rows with 64 components

50 Outline of Contents
Introduction
Current Work
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
Future Work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend the FREERIDE framework

51 Optimize Shared Memory Usage
– Use constant and texture memory (cached and read-only)
– Integer programming model
– Loop transformation

52 Integer Programming Model
Variables: x[i][j] ∈ {0, 1} for 0 ≤ i ≤ m, 0 ≤ j ≤ n, indicating whether variable i is kept in shared memory at loop j
Objective function: maximize z = Σ_{i ∈ 1..m, j ∈ 1..n} (Acc[i][j] − 1) * x[i][j] * Wp[i][j]
Constraints:
– For each i in 1..m: Σ_{j ∈ 1..n} x[i][j] * W[i][j] * Sizep[i][j] ≤ S
– For each j in 1..n, for each i in 1..nLive[j]: x[live[j][i].start][j] = x[live[j][i].start + 1][j] = … = x[live[j][i].end][j]
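For tiny instances the model can be checked by brute force. The sketch below implements the objective and the budget constraint exactly as written on the slide and omits the liveness constraints; the sample instance is invented for illustration.

```python
from itertools import product

def solve_placement(Acc, Wp, W, Sizep, S):
    """Brute-force the 0/1 placement x[i][j]: maximize
    sum (Acc[i][j] - 1) * x[i][j] * Wp[i][j] subject to, for each i,
    sum_j x[i][j] * W[i][j] * Sizep[i][j] <= S (liveness omitted)."""
    m, n = len(Acc), len(Acc[0])
    best_val, best_x = -1, None
    for bits in product([0, 1], repeat=m * n):
        x = [list(bits[i * n:(i + 1) * n]) for i in range(m)]
        # Per-variable shared-memory budget, as written on the slide.
        if any(sum(x[i][j] * W[i][j] * Sizep[i][j] for j in range(n)) > S
               for i in range(m)):
            continue
        val = sum((Acc[i][j] - 1) * x[i][j] * Wp[i][j]
                  for i in range(m) for j in range(n))
        if val > best_val:
            best_val, best_x = val, x
    return best_val, best_x

# Two variables, one loop: variable 1 is too large for the budget S = 8.
best_val, best_x = solve_placement(Acc=[[3], [5]], Wp=[[1], [1]],
                                   W=[[1], [1]], Sizep=[[4], [10]], S=8)
# best_val == 2, best_x == [[1], [0]]
```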

53 Loop Transformation
Loop fusion
– Reduces intermediate memory
– Applies to loops with the same size where variable[k] is temporary storage:
  Loop 1 (size[1] = n): Active[1][k] = 1 …
  Loop 2 (size[2] = n): Active[2][k] = 1 …

54 Loop Transformation
Loop fission
– Fits more data in shared memory
– Should not violate data dependences

  for (int r = 0; r < n; r++) {    /* Loop 1 */
    …
    for (int c = 0; c < m; c++)    /* Loop 2 */
      i[c] …;
    for (int c = 0; c < m; c++)    /* Loop 3 */
      j[c] …;
    …
  }

55 Loop Transformation
Loop switching
– Sort all the loops in decreasing order of iteration count
– Enables better usage of shared memory
– Should not violate dependences

56 Support More Categories of MATLAB Programs
– Include library functions and determine the thread grid configuration
– Deal with sparse matrix computation
– Investigate basic functions in the SPM library and parallelization of ICA

57 Include Library Functions and Determine the Thread Grid Configuration
– 2-D blocking or 1-D blocking
– Use library functions
– Depending on the input data, might need separate kernels to use different configurations
– Need a model to evaluate the overhead of multiple kernels

58 Deal with Sparse Matrix Computation
Representing sparse matrices on the GPU
– Coordinate representation
– Diagonal representation (easy to combine with the computing context)
– Might use a sparse accelerator (evaluate the efficiency of shared memory usage)
Representation conversion
– Better done on the CPU
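The coordinate representation mentioned above stores one (row, column, value) triple per nonzero entry. A minimal sketch of the CPU-side conversion; the function name is illustrative.

```python
def to_coordinate(dense):
    """Convert a dense matrix (list of rows) to coordinate (COO) form:
    one (row, col, value) triple per nonzero entry."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row) if v != 0]

coo = to_coordinate([[0, 3], [5, 0]])
# coo == [(0, 1, 3), (1, 0, 5)]
```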

59 Investigate Basic Functions in the SPM Library and Parallelization of ICA
Enrich GMAT-DM
– Provide GPU support for more functions and operations in MATLAB, especially sparse matrix computation
Study possible optimizations to communication and synchronization for SPM and ICA
– For clusters or GPUs, depending on my progress and time

60 Extend the FREERIDE Framework
Convert MATLAB programs into C++ and CUDA code in the FREERIDE framework
– Based on GMAT-DM and AUTO-GC
Cost model to determine the device configuration:
  cost[i] = (mem[i] * LATENCY + comp[i] + shared[i]) * Iter[i] + shared[i] * LATENCY
  total_cost = Σ_{i = 1..m} cost[i] / number_iterations
  cost_combine = Σ_{i ∈ output} size[i] * total_threads
  If total_cost + cost_combine > T, use the GPU; else use the CPU
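The cost model can be sketched directly from the formulas above. LATENCY, the threshold T, and all input numbers are illustrative; the slide does not fix their values.

```python
LATENCY = 100  # illustrative memory-latency weight

def kernel_cost(mem, comp, shared, iters):
    """cost[i] = (mem[i]*LATENCY + comp[i] + shared[i]) * Iter[i]
                 + shared[i]*LATENCY"""
    return (mem * LATENCY + comp + shared) * iters + shared * LATENCY

def choose_device(kernels, output_sizes, total_threads, num_iterations, T):
    """Use the GPU when the modeled cost exceeds the threshold T."""
    total_cost = sum(kernel_cost(*k) for k in kernels) / num_iterations
    cost_combine = sum(s * total_threads for s in output_sizes)
    return "GPU" if total_cost + cost_combine > T else "CPU"

device = choose_device(kernels=[(1, 10, 2, 3)], output_sizes=[2],
                       total_threads=10, num_iterations=1, T=500)
# cost = (100 + 10 + 2) * 3 + 200 = 536; combine = 20; 556 > 500 -> "GPU"
```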

61 Conclusion
Contributions
– GREENRIDE: translating data mining applications onto GPUs
– GMAT-DM: automatic translation of data mining applications from MATLAB to GPUs
– AUTO-GC: automatic code generation for FREERIDE with GPU support
Proposed work
– Optimize shared memory usage
– Support more categories of MATLAB programs
– Extend the FREERIDE framework

62 Thank you! Questions?

