
1 Automatic Transformation and Optimization of Applications on GPUs and GPU Clusters
PhD Oral Defence: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University

2 Outline of Contents
Motivation
  Accelerators, GPGPU, and GPU clusters
  Difficulty of GPU programming
Framework and Approaches
Code generation for data mining applications
  Translation system for enabling data mining applications on GPUs
  Automatic translation of data mining applications from MATLAB to GPUs
  Automatic code generation for data mining on clusters with GPU support
  Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions
  Auto-tuning approach for tensor contractions on GPUs
  Loop transformation for tensor contraction sequences on multi-level memory architecture

3 Introduction: Accelerators, GPGPU, and GPU Clusters
Multi-core and many-core architectures are increasingly popular in high-performance computing: GPUs, the Cell processor, FPGAs.
GPUs offer a good performance/price ratio.
Difficulty of programming: how do we program a cluster with accelerators on each node?

4 Our Approach
Provide high-level support for programming emerging high-end configurations, with effective and simple optimization strategies.
Focus on specific application classes: data mining applications and tensor contraction expressions.

5 Outline of Contents
Motivation
  Accelerators, GPGPU, and GPU clusters
  Difficulty of GPU programming
Framework and Approaches
Code generation for data mining applications
  Translation system for enabling data mining applications on GPUs
  Automatic translation of data mining applications from MATLAB to GPUs
  Automatic code generation for data mining on clusters with GPU support
  Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions
  Auto-tuning approach for tensor contractions on GPUs
  Loop transformation for tensor contraction sequences on multi-level memory architecture

6 Shared Memory on GPU
Features of shared memory on GPU: small in size, software-controllable, and much faster than device memory.
A strategy is needed to arrange data on shared memory:
  arranging it by hand is time-consuming and rarely optimal;
  previous work offers only intuitive solutions.

7 An Example of Shared Memory Usage

void Kernel_function(float *A, float *C, ...)
{
  __shared__ float s_C[r*NUM_THREADS];
  __shared__ float s_A[r*NUM_THREADS];
  for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
      /* load A from device memory into s_A */
    for (int j = 0; j < m; j++)
      for (int k = 0; k < r; k++)
        /* load C from device memory into s_C */
    ......
  }
  /* load B from device memory into s_A */

8 Problem Formulation for Shared Memory Arrangement
What to consider:
  a kernel function (with a number of basic blocks);
  arrays, sections of arrays, and elements of arrays;
  the live range of each variable.
Goal: determine in which basic block each variable is allocated to shared memory.
assign_point[i][k]: variable i is assigned in basic block k.

9 Integer Linear Programming
Objective function: maximize z = cᵀx
Constraints: Ax ≤ b
Solution: the values of the vector x
ILP is a special case of linear programming in which all unknown variables are integers (restricted to {0, 1} in our case); it is solvable for problems of reasonable size. (A toy instance follows.)
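As a toy illustration of such a 0/1 program (an invented instance, not one from the thesis):

maximize z = 3x1 + 2x2
subject to x1 + x2 ≤ 1, with x1, x2 ∈ {0, 1}

Enumerating the feasible points (0,0), (1,0), and (0,1) gives the optimum x1 = 1, x2 = 0 with z = 3.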

10 Integer Programming for Shared Memory Arrangement (cont'd)
Objective function: maximize shared-memory usage while minimizing data transfer between memory hierarchy levels.
Maximize z = ∑i∈{1..nVar}, k∈{1..nLive[i]} Agg_SMref[i][k] − ∑i∈{1..nVar}, k∈{1..nLive[i]} Total_memcopy[i][k]

11 Integer Programming for Shared Memory Arrangement: Objective Function
Agg_SMref[i][k] = ∑j∈live_blocks[i][k] Is_assigned[i][j] × Refs[i][j] × iters[j]
Total_memcopy[i][k] = Data_trans[i][j] × iters[j]
Data_trans[i][j] = 2 × size_alloc[i][j]  if Access[i][k] = readwrite
                 = 0                     if Access[i][k] = temp
                 = size_alloc[i][j]      otherwise

12 An Example to Show size_alloc

for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];
......

Candidate allocation sizes, depending on whether a whole array, an array section, or a single element is placed in shared memory: size_alloc = r*m, size_alloc = r*m, size_alloc = r, size_alloc = 1.

13 Integer Programming for Shared Memory Arrangement
Constraints:
The total allocation never exceeds the shared-memory limit at any time:
  ∑i∈live_list[j] Is_assigned[i][j] × size_alloc[i][j] ≤ limit
At most one assign_point is 1 in each live range:
  ∑j∈live_blocks[i][k] assign_point[i][j] ≤ 1

14 Integer Programming Solver
An example:

for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];
......

Input to the solver: A: n*r, B: m*r, C: r, with n = 2048, m = 3, r = 3, NUM_THREADS = 256.
Solver output (assign_point[i][j]: variable i, basic block j; variables 0, 1, 2 correspond to A, B, C in the code):
assign_point[0][1] = 1; assign_point[1][0] = 1; assign_point[2][0] = 1;
/* all other elements of assign_point are 0 */

15 An Example (cont'd)
Generated code:

__shared__ float s_B[m][r];
__shared__ float s_C[r*NUM_THREADS];
__shared__ float s_A[r*NUM_THREADS];
/* load B into s_B */
for (int i = 0; i < n; i += NUM_THREADS) {
  for (int j = 0; j < r; j++)
    s_A[tid*r+j] = A[tid+i][j];
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
  ......
}
/* synchronize and combine the per-thread copies of C */

Original sequential loop, for comparison:
for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];
......

16 Suggesting Loop Transformation
Before transformation:

for (int rc = 0; rc < nRowCl; rc++) {
  tempDis = 0;
  for (int c = 0; c < numCol; c++)
    tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After transformation, the c loop is hoisted outside the rc loop, so each data[r][c] can be staged in shared memory once and reused across all rc iterations:

for (int rc = 0; rc < nRowCl; rc++)
  tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
  /* load into shared memory */
  for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}

17 Experiment Results: K-means and EM

18 Experiment Results: PCA and Co-clustering

19 Effect of Loop Transformation: PCA and Co-clustering

20 Outline of Contents
Motivation
  Accelerators, GPGPU, and GPU clusters
  Difficulty of GPU programming
Framework and Approaches
Code generation for data mining applications
  Translation system for enabling data mining applications on GPUs
  Automatic translation of data mining applications from MATLAB to GPUs
  Automatic code generation for data mining on clusters with GPU support
  Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions
  Auto-tuning approach for tensor contractions on GPUs
  Loop transformation for tensor contraction sequences on multi-level memory architecture

21 Tensor Contraction on GPU and Auto-tuning
Tensor contraction expressions: motivated by the CCSD(T) part of NWChem; they take the form of high-dimensional matrix multiplication.
Example: r[h1 h2 p3 p4] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7]
Auto-tuning: compile-time and run-time optimization, selecting the best implementation for a given input problem. (A loop-level sketch of the example follows.)
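To make the expression concrete, here is a minimal sketch of the example contraction as plain C loops; the extents NH and NP and the row-major flattening are assumptions of this sketch, not details from the thesis:

/* Sketch: r[h1,h2,p3,p4] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7] as plain loops.
   NH (hole-index extent) and NP (particle-index extent) are illustrative;
   all arrays are flattened row-major. */
void contract(float *r, const float *t, const float *v, int NH, int NP)
{
  for (int h1 = 0; h1 < NH; h1++)
   for (int h2 = 0; h2 < NH; h2++)
    for (int p3 = 0; p3 < NP; p3++)
     for (int p4 = 0; p4 < NP; p4++) {
      float sum = 0.0f;
      /* h6, h7 are the contracted (summation) indices */
      for (int h6 = 0; h6 < NH; h6++)
       for (int h7 = 0; h7 < NH; h7++)
        sum += t[((h6*NH + h7)*NH + h1)*NH + h2] *
               v[((p3*NP + p4)*NH + h6)*NH + h7];
      r[((h1*NH + h2)*NP + p3)*NP + p4] += sum;
     }
}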

22 Original Algorithm and Optimization
Original algorithm on the T10 GPU: load the input matrices into shared memory; index calculation by flattening and index combination.
Optimizations for Fermi:
  Register tiling: registers serve as a second level of "cache", exploiting the larger shared memory and register file on Fermi (a sketch follows).
  Modified index calculation order: each thread sees a different output/input access ratio.
Example: r[h1 h2 p4 p3] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7]
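A minimal sketch of register tiling on the matrix-multiply form C[a,b] = A[a,c] * B[c,b] that these contractions reduce to; the tile factor TILE, the thread mapping, and the assumption that M is a multiple of TILE are illustrative choices, not the thesis implementation:

#define TILE 4
__global__ void mm_regtile(const float *A, const float *B, float *C,
                           int M, int N, int K)
{
  int col  = blockIdx.x * blockDim.x + threadIdx.x;  /* one column of C per thread */
  int row0 = blockIdx.y * TILE;                      /* TILE rows of C per thread */
  if (col >= N) return;
  float acc[TILE] = {0.0f};                          /* output tile kept in registers */
  for (int k = 0; k < K; k++) {
    float b = B[k * N + col];                        /* loaded once, reused TILE times */
    for (int t = 0; t < TILE; t++)                   /* assumes M % TILE == 0 */
      acc[t] += A[(row0 + t) * K + k] * b;
  }
  for (int t = 0; t < TILE; t++)
    C[(row0 + t) * N + col] = acc[t];
}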

23 Motivation of auto-tuning for tensor contractions on GPU
Running time (ms) of two functions on Fermi with different index orders:

            favor input   favor output
Ex 1 (a)       0.425          0.504
     (b)       0.487          0.584
     (c)       0.510          0.671
     (d)       0.681          0.881
Ex 2 (A)      13.6           11.0
     (B)     105.5           41.5
     (C)     199.7          149.9
     (D)      27.1           22.6

The algorithm must be modified for different architectures, and different inputs call for different algorithm choices.

24 Approaches to Auto-tuning
Existing approaches:
  Analytical cost models: hard to capture complex architectural features.
  Empirical search: impractical when the search space is large.
Our approach: parametrizable micro-benchmarks that focus on the main features affecting performance.

25 Auto-tuning with Parametrizable Micro-benchmarks
(Framework diagram.) Components: target expressions and their different implementations; architecture features probed by the micro-benchmark over a parameter space; the expression and problem size from the application; execution models and thresholds; and the resulting implementation choice.

26 Auto-tuning Approach for Tensor Contractions on Different GPUs
Auto-tuning tool: parametrizable micro-benchmarks.
Auto-tuning parameters: memory access pattern and kernel consolidation.

27 Micro-benchmark Evaluation for Memory Access
The access stride on device memory makes a large difference: accesses are coalesced when adjacent threads access contiguous words in device memory, and the L1 and L2 caches matter as well (a sketch of coalesced vs. strided access follows).
Mapping to tensor contractions, via the index calculation order:
  for an uncommon index: in the order of the input/output;
  for a common index: in the order of each input.
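A minimal sketch contrasting coalesced and strided device-memory access (illustrative kernels, not code from the thesis):

__global__ void copy_coalesced(const float *in, float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                    /* adjacent threads touch adjacent words */
}

__global__ void read_strided(const float *in, float *out, int n, int stride)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  long long j = ((long long)i * stride) % n;    /* adjacent threads are `stride` words apart */
  if (i < n) out[i] = in[j];
}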

28 Mapping to tensor contractions
r[h1 h2 p4 p3] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7], mapped to C[a,b] = A[a,c] * B[c,b] with collaborative loading of the input; threadIdx.x corresponds to index c of B, i.e., index p3 of v.
Calculating with the input order (p3 as the inner loop): when accessing v, the stride between two threads with adjacent x indices is 1.
Calculating with the output order (p4 as the inner loop): the stride between two threads with adjacent x indices is range(p3). (A sketch of the offset computation follows.)
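A sketch of the flattened offset into v[p3 p4 h6 h7], assuming p3 is the fastest-varying dimension as the slide's stride-1 case implies; the extent names RP3, RP4, RH6 are illustrative:

/* offset of v[p3][p4][h6][h7] with p3 fastest-varying */
__device__ int v_offset(int p3, int p4, int h6, int h7,
                        int RP3, int RP4, int RH6)
{
  /* mapping threadIdx.x to p3 makes adjacent threads 1 word apart
     (coalesced); mapping it to p4 makes them RP3 = range(p3) apart */
  return p3 + RP3 * (p4 + RP4 * (h6 + RH6 * h7));
}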

29 Micro-benchmark Evaluation for Memory Access
A simple micro-benchmark with three types of stride (stride_x, stride_y, stride_iter):
A[tid.x*stride_x + tid.y*stride_y + i*stride_iter]  /* i is the loop index */
(The slide plots results for Fermi and T10; a sketch of such a kernel follows.)
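A minimal sketch of such a micro-benchmark kernel; the dummy accumulator and the output write (to keep the compiler from eliminating the loads) are assumptions of this sketch:

__global__ void mem_bench(const float *A, float *out, int iters,
                          int stride_x, int stride_y, int stride_iter)
{
  float sink = 0.0f;
  for (int i = 0; i < iters; i++)   /* the slide's access expression */
    sink += A[threadIdx.x * stride_x + threadIdx.y * stride_y + i * stride_iter];
  /* one write per thread so the reads are not optimized away */
  out[blockIdx.x * blockDim.x * blockDim.y
      + threadIdx.y * blockDim.x + threadIdx.x] = sink;
}

Timed on the host (e.g., with CUDA events) while sweeping the three strides, this exposes the cost of non-coalesced access on each GPU.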

30 Experiments: Memory access for a single expression
Actual values are running times in ms.

Case 1:
Tile size   Predicted choice   in order   out order
12          in order           0.241      0.295
13                             0.312      0.302
14                             0.425      0.504
15                             0.487      0.584
16                             0.510      0.671
17                             0.681      0.881
18                             1.078      1.471

Case 2:
Tile size   Predicted choice   in order   out order
12          out order          0.222      0.214
13                             0.280      0.270
14                             0.364      0.354
15                             0.511      0.482
16                             0.854      0.644
17          equal              0.943      0.920
18                             1.193      1.124

31 Choice of kernel consolidation
Tightly coupled consolidation, for functions with a large data-movement cost:
  foreach (task i): data copy (host to device)
  launch the kernels
  data copy (device to host)
Loosely coupled consolidation, for functions with comparable computation and data movement:
  foreach (task i):
    data copy for task i (host to device)
    launch kernel(i)
    data copy for task i (device to host)
(A host-side sketch follows.)
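A host-side sketch of the two strategies; the Task descriptor, the kernel names, and the launch configuration are illustrative assumptions of this sketch:

#include <cuda_runtime.h>

typedef struct {              /* illustrative task descriptor */
  float *h_in, *h_out;        /* host buffers */
  float *d_in, *d_out;        /* device buffers */
  size_t bytes_in, bytes_out;
} Task;

__global__ void task_kernel(const float *in, float *out);   /* per-task kernel */
__global__ void consolidated_kernel(Task *d_tasks, int n);  /* one fused launch */

void tightly_coupled(Task *tasks, Task *d_tasks, int n, dim3 grid, dim3 block)
{
  for (int i = 0; i < n; i++)   /* one bulk host-to-device phase */
    cudaMemcpy(tasks[i].d_in, tasks[i].h_in, tasks[i].bytes_in,
               cudaMemcpyHostToDevice);
  consolidated_kernel<<<grid, block>>>(d_tasks, n);
  for (int i = 0; i < n; i++)   /* one bulk device-to-host phase */
    cudaMemcpy(tasks[i].h_out, tasks[i].d_out, tasks[i].bytes_out,
               cudaMemcpyDeviceToHost);
}

void loosely_coupled(Task *tasks, int n, dim3 grid, dim3 block)
{
  for (int i = 0; i < n; i++) { /* per-task copy, launch, copy pipeline */
    cudaMemcpy(tasks[i].d_in, tasks[i].h_in, tasks[i].bytes_in,
               cudaMemcpyHostToDevice);
    task_kernel<<<grid, block>>>(tasks[i].d_in, tasks[i].d_out);
    cudaMemcpy(tasks[i].h_out, tasks[i].d_out, tasks[i].bytes_out,
               cudaMemcpyDeviceToHost);
  }
}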

32 Experiments: Running on collections of tensor contractions
(Performance charts.) Panels: Fermi without data copy; T10 without data copy; Fermi with data copy.

33 Outline of Contents
Motivation
  Accelerators, GPGPU, and GPU clusters
  Difficulty of GPU programming
Framework and Approaches
Code generation for data mining applications
  Translation system for enabling data mining applications on GPUs
  Automatic translation of data mining applications from MATLAB to GPUs
  Automatic code generation for data mining on clusters with GPU support
  Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions
  Auto-tuning approach for tensor contractions on GPUs
  Loop transformation for tensor contraction sequences on multi-level memory architecture

34 Motivation of loop fusion for sequences of tensor contractions
A tensor contraction sequence:
T3(a, q, r, s) = ∑p C4(p, a) × A(p, q, r, s)
T2(a, b, r, s) = ∑q C3(q, b) × T3(a, q, r, s)
T1(a, b, c, s) = ∑r C2(r, c) × T2(a, b, r, s)
B(a, b, c, d) = ∑s C1(s, d) × T1(a, b, c, s)
We need to find the "fusion chains", given memory limits at different levels; with a GPU, the memory limitation is stricter. (A sketch of fusing the first two contractions follows.)
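A minimal C sketch of fusing the first two contractions so that the full intermediate T3(a, q, r, s) is never materialized: only a length-NQ slice of T3 stays live for each (a, r, s). The extents and the row-major layouts are assumptions of this sketch:

#include <stdlib.h>

void fused_T3_T2(const float *C4, const float *A, const float *C3, float *T2,
                 int NP, int NQ, int NA, int NB, int NR, int NS)
{
  float *t3 = malloc(NQ * sizeof *t3);     /* one slice of T3, reused per (a,r,s) */
  for (int a = 0; a < NA; a++)
    for (int r = 0; r < NR; r++)
      for (int s = 0; s < NS; s++) {
        for (int q = 0; q < NQ; q++) {     /* T3(a,q,r,s) = sum_p C4(p,a)*A(p,q,r,s) */
          float sum = 0.0f;
          for (int p = 0; p < NP; p++)
            sum += C4[p*NA + a] * A[((p*NQ + q)*NR + r)*NS + s];
          t3[q] = sum;
        }
        for (int b = 0; b < NB; b++) {     /* T2(a,b,r,s) = sum_q C3(q,b)*T3(a,q,r,s) */
          float sum = 0.0f;
          for (int q = 0; q < NQ; q++)
            sum += C3[q*NB + b] * t3[q];
          T2[((a*NB + b)*NR + r)*NS + s] = sum;
        }
      }
  free(t3);
}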

35 Tensor contractions in a multi-level memory hierarchy
Memory hierarchy in GPU clusters: α: disk, β: global memory, γ: local memory/GPU memory.
None of the levels can be bypassed, and a higher level is smaller and faster than a lower one.

36 Loop transformation for tensor contraction sequences on multi-level memory architecture
Single tensor contraction: memory and data-movement cost on multi-level memory, with contractions represented as Z = X × Y.
Loop fusion for sequences of tensor contractions: conditions for fusion, and fusion on a multi-level memory hierarchy.

37 Single Tensor Contraction on a Multi-level Memory Hierarchy
One array fits in memory: for X[x; y], Y[y; z], Z[x; z], assume X fits in memory.
Memory cost: Nx×Ny + min(Nx, Ny) + 1 ≤ Mβ, with no redundant data movement.
No array fits in memory: to minimize data movement, a preferred solution is square tiles, Ti = Tj = T, with T as large as the memory level permits.
Multi-level memory hierarchy: tile sizes are determined by the particular system parameters and problem sizes. (A tiled sketch follows.)
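A plain-C sketch of the tiled schedule for Z = X × Y when no array fits in the faster level; the staging of panels is only marked by a comment, and the tile parameters Ti, Tj are left free (with Ti = Tj = T preferred, per the slide):

/* Tiled Z[x,z] = sum_y X[x,y]*Y[y,z]. Each X element is re-read once per
   zj tile (about Nz/Tj times) and each Y element once per xi tile (about
   Nx/Ti times), which is why equal tiles minimize the total movement.
   Z is assumed zero-initialized. */
void tiled_contract(const float *X, const float *Y, float *Z,
                    int Nx, int Ny, int Nz, int Ti, int Tj)
{
  for (int xi = 0; xi < Nx; xi += Ti)        /* tile of rows of Z */
    for (int zj = 0; zj < Nz; zj += Tj) {    /* tile of columns of Z */
      /* a real implementation would stage X[xi..xi+Ti)[*] and
         Y[*][zj..zj+Tj) into the faster memory level here */
      for (int x = xi; x < xi + Ti && x < Nx; x++)
        for (int z = zj; z < zj + Tj && z < Nz; z++) {
          float sum = 0.0f;
          for (int y = 0; y < Ny; y++)
            sum += X[x*Ny + y] * Y[y*Nz + z];
          Z[x*Nz + z] += sum;
        }
    }
}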

38 Fusion Conditions
Fusing a sequence pays off only when data movement dominates. The factors determining the ratio are the common index of the first contraction and the uncommon index of the smaller matrix in the second contraction.
I1(d, c2, ..., cn) = I0(d, c1, ..., cn) × B0(d, c1, ..., cn)
I2(d, c3, ..., cn) = I1(d, c2, ..., cn) × B1(d, c2, ..., cn)
...
In(d) = In-1(d, cn) × Bn-1(d, cn)
Condition: |Ii(ci+1)| ≤ …

39 Fusion Conditions
|Ii(ci+1)| ≤ …, |Bi| ≤ …, |Bi| = … × |Ii(ci+1)|
The "B" matrices in the middle of the chain must be very small, so that each Bi resides in memory; the first B and the last B may be large. Tile sizes are then determined as in the single-contraction case.

40 Memory requirement and data movement cost of fused loops
S1 = Ii ∩ Ii+1 ∩ Ii+2;  S2 = Ii ∩ Ii+1;  S3 = Ii+2;  S4 = Ii

for sx ∈ S1 do
  { allocate Ii+1[sx] }
  for sy ∈ S2 − S1 do
    { allocate Ii[sy] }
    for sz ∈ S4 − S2 do
      { produce Ii[sz] }
    end for
    { update Ii+1[sy] }
  end for
  for sw ∈ S3 − S1 do
    { allocate Ii+2[sw] }
    { produce Ii+2[sw] }
  end for
end for

For the chain:
I1(d, c2, ..., cn) = I0(d, c1, ..., cn) × B0(d, c1, ..., cn)
I2(d, c3, ..., cn) = I1(d, c2, ..., cn) × B1(d, c2, ..., cn)
...
In(d) = In-1(d, cn) × Bn-1(d, cn)

41 Algorithm to determine fusion chains
For a "fusable" contraction list, with one matrix fitting in memory in each contraction, the memory cost of fusing contractions i..j is given by a recurrence:
f(i, j) = 0, if j < i
        = …, otherwise
When the memory cost exceeds the memory limit, a split is made to break the fusion chain. (A sketch of this splitting loop follows.)
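A minimal C sketch of the splitting idea; memory_cost() stands in for the slide's f(i, j) recurrence, and the greedy left-to-right scan is an assumption of this sketch:

#include <stddef.h>

/* Walk the fusable contraction list, accumulate the memory cost of the
   growing fused chain, and start a new chain whenever the cost would
   exceed the limit. chain_start[j] receives the chain id of contraction j. */
int split_chains(int n, size_t limit,
                 size_t (*memory_cost)(int i, int j),  /* cost of fusing i..j */
                 int *chain_start)
{
  int chains = 0, start = 0;
  for (int j = 0; j < n; j++) {
    if (j > start && memory_cost(start, j) > limit) {
      start = j;          /* break the chain before contraction j */
      chains++;
    }
    chain_start[j] = chains;
  }
  return chains + 1;      /* number of fusion chains */
}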

42 Fusion in a multi-level memory hierarchy
Given the chains at the lower level, determine subchains at the higher level; this reduces the memory requirement for the β level. The same procedure selects the fusion chains:
f(i, j) = 0, if j < i
        = …, if memoryγ(i, j) ≤ Mγ
        = …, otherwise

43 Evaluation: fusion at the global memory level and fusion at the disk level (performance charts)

44 Outline
Motivation
  Accelerators, GPGPU, and GPU clusters
  Difficulty of GPU programming
Framework and Approaches
Code generation for data mining applications
  Translation system for enabling data mining applications on GPUs
  Automatic translation of data mining applications from MATLAB to GPUs
  Automatic code generation for data mining on clusters with GPU support
  Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions
  Auto-tuning approach for tensor contractions on GPUs
  Loop transformation for tensor contraction sequences on multi-level memory architecture

45 GREENRIDE: A Translation System for Enabling Data Mining Applications on GPUs
User input.
Code analyzer: analysis of variables (variable type and size), and analysis of the reduction functions (sequential code from the user).
Code generator: generates the CUDA code and the C++ code that invokes the kernel function.
Optimization.

46 GREENRIDE: A Translation System for Enabling Data Mining Applications on GPUs
(System diagram.) Components: user input (variable information, reduction functions, optional functions); the code analyzer (in LLVM), which derives the variable access pattern and combination operations; the variable analyzer; the code generator, which emits the kernel functions, the host program, and the data copy and thread grid configuration; and the resulting executable.

47 GMAT-DM: Automatic Transformation from MATLAB for GPUs
GMAT-DM transforms MATLAB code for the GPU: it parses the MATLAB code with the OCTAVE parser, converts it to C, and uses GREENRIDE to convert the C code to CUDA.
Techniques: matrix manipulation, a modified metric for matrix multiplication chains, and function combination.
Flow: MATLAB code → GMAT-DM → C code → GREENRIDE → CUDA code.

48 AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support
Goal: add support for GPU clusters.
(System diagram.) User input: variable information, reduction functions, optional functions. Code analyzer: access pattern, reduction objects, combination operation. Variable analyzer: variable info, parallel loop. Code generator: FREERIDE code for the cluster of CPUs and CUDA code for the GPU on each node.

49 Future Work
Extend the code generation system for data mining applications to more structures.
Improve the ILP approach for shared-memory arrangement and apply it to other architectures.
Include more parameters in the auto-tuning framework.
Extend the loop transformation to heterogeneous structures.

50 Conclusion
Code generation for data mining applications:
  a translation system for enabling data mining applications on GPUs;
  automatic translation of data mining applications from MATLAB to GPUs;
  automatic code generation for data mining on clusters with GPU support;
  arranging data on shared memory with an ILP solver.
Code optimization for tensor contractions:
  an auto-tuning approach for tensor contractions on GPUs;
  loop transformation for tensor contraction sequences on multi-level memory architecture.

51 Thank you!

