Automatic Transformation and Optimization of Applications on GPUs and GPU Clusters
PhD Oral Defence: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University
Outline of Contents
Motivation
– Accelerators, GPGPU and GPU clusters
– Difficulty of GPU programming
– Framework and approaches
Code generation for data mining applications
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
– Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions
– Auto-tuning approach for tensor contractions on GPUs
– Loop transformation for tensor contraction sequences on multi-level memory architecture
Introduction
Accelerators, GPGPU and GPU clusters:
– Multi-core architectures are increasingly popular in high-performance computing
– Examples: GPUs, the Cell processor, FPGAs
– GPUs offer a good performance/price ratio
Difficulty of programming:
– How do we program a cluster with an accelerator on each node?
Our Approach
– Provide high-level support for programming emerging high-end configurations
– Use effective and simple optimization strategies
– Focus on specific application classes: data mining applications and tensor contraction expressions
Outline of Contents (recap). Next: Arranging data on shared memory with an ILP solver
Shared Memory on GPUs
Features of shared memory on a GPU:
– Small in size
– Software controlled
– Much faster than device memory
– …
We need a strategy for arranging data on shared memory:
– Arranging it by hand is time-consuming and rarely optimal
– Previous work relied on intuitive solutions
An Example of Shared Memory Usage

__global__ void Kernel_function(float *A, float *C, …) {
    __shared__ float s_C[r * NUM_THREADS];
    __shared__ float s_A[r * NUM_THREADS];
    for (int i = 0; i < n; i += NUM_THREADS) {
        for (int j = 0; j < r; j++)
            /* load A from device memory into s_A */
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                /* load C from device memory into s_C */
    }
    /* write C in s_C back to device memory … */
}
Problem Formulation for Shared Memory Arrangement
What to consider:
– A kernel function (with a number of basic blocks)
– Whole arrays, sections of arrays, and single elements of arrays
– The live range of each variable
Goal: determine in which basic block each variable is allocated to shared memory
– assign_point[i][k]: variable i is assigned in basic block k
Integer Linear Programming
Linear programming:
– Objective function: maximize z = cᵀx
– Constraints: Ax ≤ b
– Solution: the values of x
Integer linear programming restricts all unknown variables to integers (to {0, 1} in our case), and is solvable for problems of reasonable size.
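As a toy instance of this 0/1 form (coefficients invented purely for illustration):

maximize z = 3x₁ + 2x₂  subject to  x₁ + x₂ ≤ 1,  x₁, x₂ ∈ {0, 1}

The optimum is x₁ = 1, x₂ = 0 with z = 3. The shared memory formulation below has exactly this shape, with one 0/1 unknown per (variable, basic block) pair.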
Integer Programming for Shared Memory Arrangement
Objective function:
– Maximize shared memory usage
– Minimize data transfer between memory hierarchy levels

Maximize z = ∑_{i ∈ {1…nVar}, k ∈ {1…nLive[i]}} Agg_SMref_i^k − ∑_{i ∈ {1…nVar}, k ∈ {1…nLive[i]}} Total_memcopy_i^k
Integer Programming for Shared Memory Arrangement (cont'd)
Objective function components:

Agg_SMref_i^k = ∑_{j ∈ live_blocks[i][k]} Is_assigned_i^j × Refs_i^j × iters_j
Total_memcopy_i^k = ∑_{j ∈ live_blocks[i][k]} Data_trans_i^j × iters_j

Data_trans_i^j =
  2 × size_alloc_i^j, if Access_i^k = readwrite
  0,                  if Access_i^k = temp
  size_alloc_i^j,     otherwise
An Example to Show size_alloc

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];

size_alloc depends on the allocation granularity (whole array, array section, or single element): for instance, size_alloc = r*m for all of B, size_alloc = r for one row of r elements, and size_alloc = 1 for a single element.
Integer Programming for Shared Memory Arrangement (cont'd)
Constraints:
– The total allocation never exceeds the shared memory limit:
  ∑_{i ∈ live_list[j]} Is_assigned_i^j × size_alloc_i^j ≤ limit, for every basic block j
– At most one assign_point is 1 in each live range:
  ∑_{j ∈ live_blocks[i][k]} assign_point_i^j ≤ 1
An Example

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];

Sizes: A: n*r, B: m*r, C: r, with n = 2048, m = 3, r = 3, NUM_THREADS = 256.
The integer programming solver returns:

assign_point[0][1] = 1;
assign_point[1][0] = 1;
assign_point[2][0] = 1;
/* all other elements of assign_point are 0 */

Here assign_point[i][j] refers to variable i and basic block j; variables 0, 1, 2 correspond to A, B, C in the code.
An Example (cont'd)
Generated code:

__shared__ float s_B[m][r];
__shared__ float s_C[r * NUM_THREADS];
__shared__ float s_A[r * NUM_THREADS];
/* load B to s_B */
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid * r + j] = A[tid + i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[tid * r + k] += s_A[tid * r + k] - s_B[j][k];
}
/* synchronize and combine the per-thread copies of C */
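As a sanity check of the shared memory constraint for this example, assuming the 16 KB of shared memory per multiprocessor of pre-Fermi GPUs (an assumption; the exact limit depends on the target):

s_A: r × NUM_THREADS = 3 × 256 = 768 floats = 3072 bytes
s_C: r × NUM_THREADS = 3 × 256 = 768 floats = 3072 bytes
s_B: m × r = 3 × 3 = 9 floats = 36 bytes
Total: 6180 bytes ≤ 16384 bytes

so the assignment chosen by the solver fits comfortably.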
Suggesting Loop Transformation
Before:

for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After (the c loop is moved outermost, so the loaded values can be staged in shared memory and reused across all rc; a fleshed-out CUDA sketch follows):

for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}
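A CUDA sketch of the transformed loop with the marked load made explicit; the kernel signature, TILE, and the row-major layouts are assumptions for illustration, not the system's actual generated code:

#define TILE 256

/* one thread block computes the partial distances for one data row r */
__global__ void distance_kernel(const float *data,   /* numRow x numCol, row-major */
                                const float *Acomp,  /* nRowCl x nColCl, row-major */
                                const int *colCL,    /* column-cluster id of each column */
                                float *tempDis,      /* nRowCl partial distances */
                                int r, int nRowCl, int numCol, int nColCl) {
    __shared__ float s_data[TILE];
    for (int rc = threadIdx.x; rc < nRowCl; rc += blockDim.x)
        tempDis[rc] = 0.0f;
    for (int c0 = 0; c0 < numCol; c0 += TILE) {
        int lim = (numCol - c0 < TILE) ? (numCol - c0) : TILE;
        /* cooperative load of data[r][c0 .. c0+lim) into shared memory */
        for (int t = threadIdx.x; t < lim; t += blockDim.x)
            s_data[t] = data[r * numCol + c0 + t];
        __syncthreads();
        for (int c = 0; c < lim; c++)
            for (int rc = threadIdx.x; rc < nRowCl; rc += blockDim.x)
                tempDis[rc] += s_data[c] * Acomp[rc * nColCl + colCL[c0 + c]];
        __syncthreads();
    }
}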
Experiment Results
[Figures: K-means and EM]
Experiment Results
[Figures: PCA and Co-clustering]
Effect of Loop Transformation
[Figures: PCA and Co-clustering]
Outline of Contents (recap). Next: Auto-tuning approach for tensor contractions on GPUs
Tensor Contraction on GPUs and Auto-tuning
Tensor contraction expressions:
– Motivated by the CCSD(T) part of NWChem
– Take the form of high-dimensional matrix multiplication
– Example: r[h1 h2 p3 p4] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7]
Auto-tuning:
– Compile-time and run-time optimization
– Selects the best implementation for a given input problem
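Spelled out as a loop nest, the example is just a higher-dimensional matrix product. A C-style sketch, assuming dense row-major storage and extents H1, H2, P3, P4, H6, H7 (names chosen here for illustration):

/* r[h1 h2 p3 p4] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7] */
for (int h1 = 0; h1 < H1; h1++)
 for (int h2 = 0; h2 < H2; h2++)
  for (int p3 = 0; p3 < P3; p3++)
   for (int p4 = 0; p4 < P4; p4++)
    for (int h6 = 0; h6 < H6; h6++)
     for (int h7 = 0; h7 < H7; h7++)
      r[((h1 * H2 + h2) * P3 + p3) * P4 + p4] +=
          t[((h6 * H7 + h7) * H1 + h1) * H2 + h2] *
          v[((p3 * P4 + p4) * H6 + h6) * H7 + h7];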
Original Algorithm and Optimization
Original algorithm on the T10 GPU:
– Load input matrices into shared memory
– Index calculation: flattening and index combination (see the sketch below)
Optimization for Fermi:
– Register tiling, exploiting the larger shared memory and register file on Fermi
– Modified index calculation order, since each thread has a different output/input access ratio
Example: r[h1 h2 p4 p3] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7]
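Flattening and index combination reduce the contraction to an ordinary matrix multiplication by treating groups of indices as single combined indices; a sketch under the same hypothetical extents as above:

/* Combine (h1,h2) -> i, (p3,p4) -> j, (h6,h7) -> k, so that
   r[h1 h2 p3 p4] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7] becomes
   R[i][j] += T[k][i] * V[j][k]: a matrix product with one input
   effectively transposed, over flat index ranges. */
int NI = H1 * H2, NJ = P3 * P4, NK = H6 * H7;
for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
        for (int k = 0; k < NK; k++)
            R[i * NJ + j] += T[k * NI + i] * V[j * NK + k];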
Motivation for Auto-tuning of Tensor Contractions on GPUs
– Algorithms must be modified for different architectures
– Different inputs favor different algorithm choices (favor input vs. favor output)
[Figure: running time of two functions on Fermi with different index orders, Ex 1 (a)-(d) and Ex 2 (A)-(D)]
Approaches to Auto-tuning
Existing approaches:
– Analytical cost models: hard to capture complex architectural features
– Empirical search: not practical when the search space is large
Our approach:
– Parametrizable micro-benchmarks
– Focus on the main features that affect performance
Auto-tuning Approach for Tensor Contractions on Different GPUs
Auto-tuning tool: parametrizable micro-benchmarks
Auto-tuning parameters:
– Memory access pattern
– Kernel consolidation
Auto-tuning with Parametrizable Micro-benchmarks
Architecture features and the different candidate implementations of the target expressions drive the micro-benchmarks over a parameter space; the resulting execution models and thresholds, together with the expression and problem size in the application, determine the implementation choice.
Micro-benchmark Evaluation for Memory Access
The access stride on device memory makes a big difference:
– Coalesced accesses: adjacent threads access contiguous words in device memory
– L1 and L2 caches
– …
Mapping to tensor contractions:
– Index calculation order
– For an uncommon index: favor the input or the output
– For a common index: favor one of the two inputs
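To make the coalescing point concrete, a pair of hypothetical CUDA kernels: in the first, adjacent threads read adjacent words, so a warp's loads coalesce into few memory transactions; in the second, they read words `stride` apart, multiplying the number of transactions:

__global__ void coalesced(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];                 /* adjacent threads, adjacent words */
}

__global__ void strided(const float *in, float *out, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n)
        out[tid] = in[tid * stride];        /* adjacent threads, `stride` words apart */
}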
Mapping to Tensor Contractions
r[h1 h2 p4 p3] += t[h6 h7 h1 h2] * v[p3 p4 h6 h7], calculated in input order (p3 is the inner loop):
– Accessing v: loading v from device memory to shared memory; the stride between two threads with adjacent x indices is 1
– Accessing r: updating r in device memory; the stride between two threads with adjacent x indices is h1*h2*p4
Micro-benchmark Evaluation for Memory Access
A simple micro-benchmark with three types of strides: stride_x, stride_y, and stride_iter:

A[tid.x * stride_x + tid.y * stride_y + i * stride_iter]  /* i is the iteration index */

[Figures: micro-benchmark results on Fermi and T10]
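Fleshed out into a runnable form, the benchmark might look as follows; the kernel name, ITERS, and the sink store are assumptions for illustration:

#define ITERS 64
__global__ void mb_access(const float *A, float *sink,
                          int stride_x, int stride_y, int stride_iter) {
    float acc = 0.0f;
    for (int i = 0; i < ITERS; i++)   /* i is the iteration index */
        acc += A[threadIdx.x * stride_x + threadIdx.y * stride_y + i * stride_iter];
    sink[threadIdx.y * blockDim.x + threadIdx.x] = acc;  /* keep the loads live */
}

Timing this kernel across a grid of (stride_x, stride_y, stride_iter) values yields the thresholds fed into the execution model.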
Micro-benchmark Evaluation for Kernel Consolidation
Launching multiple kernels at the same time:
– With data copy: overlapping of computation and data transfer
– Without data copy: better utilization of the computing resources
A matrix-matrix multiplication kernel serves as the micro-benchmark.
Choice of Kernel Consolidation
Tightly coupled consolidation, for functions with large data movement cost:

foreach (task i) data copy (host to device)
foreach (task i) launch the kernel
foreach (task i) data copy (device to host)

Loosely coupled consolidation, for functions with comparable computation and data movement:

foreach (task i) {
    data copy for task i (host to device)
    launch kernel(i)
    data copy for task i (device to host)
}
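The per-task pattern above can overlap one task's copies with another task's kernel using CUDA streams. A minimal host-side sketch; the task count, kernel body, and buffer handling are illustrative, and the host buffers must be pinned for the copies to be truly asynchronous:

#include <cuda_runtime.h>
#define NTASKS 4

__global__ void task_kernel(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = 2.0f * in[tid];   /* stand-in for the real kernel */
}

/* h_in/h_out: pinned host buffers; d_in/d_out: device buffers, one pair per task */
void run_consolidated(float *h_in[], float *h_out[],
                      float *d_in[], float *d_out[], int n) {
    size_t bytes = n * sizeof(float);
    cudaStream_t s[NTASKS];
    for (int i = 0; i < NTASKS; i++) cudaStreamCreate(&s[i]);
    for (int i = 0; i < NTASKS; i++) {
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, s[i]);
        task_kernel<<<(n + 255) / 256, 256, 0, s[i]>>>(d_in[i], d_out[i], n);
        cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NTASKS; i++) cudaStreamDestroy(s[i]);
}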
Experiments: Memory Access for a Single Expression
[Tables: predicted choice (in order vs. out of order) and measured running times in ms of both implementations, for a range of tile sizes; the prediction tracks the faster measured order, and some configurations measure as equal.]
Experiments: Kernel Consolidation for a Single Expression
[Figures: micro-benchmark and real contraction]
Experiments: Running on Collections of Tensor Contractions
[Figures: T10 without data copy; Fermi without data copy; Fermi with data copy]
Outline of Contents (recap). Next: Loop transformation for tensor contraction sequences on multi-level memory architecture
Motivation for Loop Fusion over Sequences of Tensor Contractions
A tensor contraction sequence:

T3(a, q, r, s) = ∑_p C4(p, a) × A(p, q, r, s)
T2(a, b, r, s) = ∑_q C3(q, b) × T3(a, q, r, s)
T1(a, b, c, s) = ∑_r C2(r, c) × T2(a, b, r, s)
B(a, b, c, d)  = ∑_s C1(s, d) × T1(a, b, c, s)

We need to find the "fusion chains", subject to memory limits at the different levels; with GPUs, the memory limitation is even stricter.
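To see what fusion buys, here is a C-style sketch of the last two contractions in the chain, fused over the index c so that only one c-slice of the intermediate T1 is live at a time (the extents NA, NB, NC, ND, NR, NS and the slice buffer are illustrative; the real algorithm also tiles the remaining dimensions):

/* Fused computation of B(a,b,c,d) = sum_s C1(s,d) * T1(a,b,c,s),
   where T1(a,b,c,s) = sum_r C2(r,c) * T2(a,b,r,s).
   slice holds NA*NB*NS floats: one c-slice of T1. */
for (int c = 0; c < NC; c++) {
    for (int a = 0; a < NA; a++)
        for (int b = 0; b < NB; b++)
            for (int s = 0; s < NS; s++) {
                float acc = 0.0f;
                for (int rr = 0; rr < NR; rr++)
                    acc += C2[rr * NC + c] * T2[((a * NB + b) * NR + rr) * NS + s];
                slice[(a * NB + b) * NS + s] = acc;
            }
    for (int a = 0; a < NA; a++)
        for (int b = 0; b < NB; b++)
            for (int d = 0; d < ND; d++)
                for (int s = 0; s < NS; s++)
                    B[((a * NB + b) * NC + c) * ND + d] +=
                        C1[s * ND + d] * slice[(a * NB + b) * NS + s];
}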
Tensor Contractions in a Multi-level Memory Hierarchy
Memory hierarchy in GPU clusters:
– α: disk
– β: global memory
– γ: local memory / GPU memory
None of the levels can be bypassed, and each higher level is smaller and faster than the one below it.
Loop Transformation for Tensor Contraction Sequences on a Multi-level Memory Architecture
Single tensor contraction:
– Memory and data movement cost on multi-level memory
Loop fusion for a sequence of tensor contractions:
– Conditions for fusion
– Fusion on the multi-level memory hierarchy
Single Tensor Contraction on a Multi-level Memory Hierarchy
One array fits in memory:
– X[x; y], Y[y; z], Z[x; z]; assume X fits in memory
– Memory cost: N_x × N_y + min(N_x, N_y) + 1 ≤ M_β
– No redundant data movement
No array fits in memory:
– To minimize data movement, a preferred solution uses equal tile sizes, T_i = T_j = T
Multi-level memory hierarchy:
– Tile sizes are determined by the particular system parameters and problem sizes
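A sketch of the no-array-fits case, assuming for brevity square N×N row-major matrices and tile sizes TI, TJ that divide N (all names here are illustrative):

/* Z[x,z] += X[x,y] * Y[y,z]: keep one TI x TJ tile of Z in fast memory
   for the whole y loop; stream X and Y through memory. */
for (int xx = 0; xx < N; xx += TI)
    for (int zz = 0; zz < N; zz += TJ)
        for (int y = 0; y < N; y++)
            for (int x = xx; x < xx + TI; x++)
                for (int z = zz; z < zz + TJ; z++)
                    Z[x * N + z] += X[x * N + y] * Y[y * N + z];

With this schedule each element of Z moves once, while X is re-read once per zz tile (N/TJ times) and Y once per xx tile (N/TI times); under the capacity constraint TI × TJ ≤ M, that movement is minimized when TI = TJ, which is why the preferred solution above uses equal tile sizes.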
Fusion Conditions
Consider a sequence (we only consider the case where communication dominates):

I_1(d, c_2, …, c_n) = I_0(d, c_1, …, c_n) × B_0(d, c_1, …, c_n)
I_2(d, c_3, …, c_n) = I_1(d, c_2, …, c_n) × B_1(d, c_2, …, c_n)
…
I_n(d) = I_{n-1}(d, c_n) × B_{n-1}(d, c_n)

The fused index is a common index of the first contraction and an uncommon index of the smaller matrix in the second contraction, and each intermediate must satisfy |I_i(c_{i+1})| ≤ …
Fusion Conditions (cont'd)
The bound concerns the size of the matrix that is not eliminated; the first B and the last B may be large:
– Tile sizes are then determined as in the single-contraction case
– |B_i| ≤ …
Algorithm to Determine Fusion Chains
For a "fusable" contraction list, with one matrix fitting in memory in each contraction:
– The memory cost is given by a recurrence f(i, j) over contractions i through j, with f(i, j) = 0 if j < i
– When the memory cost exceeds the memory limit, a split is made to break the fusion chain
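A heavily simplified sketch of the chain-splitting idea; memory_cost, emit_chain, and MEMORY_LIMIT are hypothetical names, and the real recurrence f(i, j) is richer than a single scan:

/* Greedily grow a fusion chain over contractions 0 .. n-1 and
   cut it whenever fusing one more contraction would exceed memory. */
int start = 0;
for (int i = 0; i < n; i++) {
    if (memory_cost(start, i) > MEMORY_LIMIT && i > start) {  /* hypothetical cost query */
        emit_chain(start, i - 1);  /* hypothetical: fuse start .. i-1 as one chain */
        start = i;
    }
}
emit_chain(start, n - 1);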
Fusion in a Multi-level Memory Hierarchy
Given the chains at the lower level, determine the subchains at the higher level:
– Reduced memory requirement for the β level
– The same procedure selects fusion chains, with f(i, j) = 0 if j < i, and a chain (i, j) accepted only if memory_γ(i, j) ≤ M_γ
Evaluation
[Figures: fusion at the global memory level and fusion at the disk level]
Outline (recap). Next: Code generation for data mining applications (GREENRIDE, GMAT-DM, AUTO-GC)
GREENRIDE: A Translation System for Enabling Data Mining Applications on GPUs
– User input
– Code analyzer: analysis of variables (variable type and size) and of reduction functions (sequential code from the user)
– Code generator: generates the CUDA code and the C++ code invoking the kernel functions, and applies optimizations
GREENRIDE Architecture
User input (variable information, reduction functions, and optional functions) feeds the code analyzer (built in LLVM) and its variable analyzer, which derive variable access patterns and combination operations. The code generator then emits the host program (data copy and thread grid configuration) and the kernel functions, which are compiled into the executable.
GMAT-DM: Automatic Transformation from MATLAB for GPUs
Pipeline: MATLAB code → OCTAVE parser → C code → GREENRIDE → CUDA code
Transforming MATLAB code for GPUs:
– Convert the MATLAB code to C, then use GREENRIDE to convert it to CUDA
– Matrix manipulation
– A modified metric for matrix multiplication chains (see the recurrence below)
– Function combination
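For context, the classical matrix-chain-ordering recurrence, where matrix i has dimensions p_{i−1} × p_i; GMAT-DM's modified metric presumably changes the cost term for the GPU setting while keeping this chain structure (an assumption on our part):

m[i, i] = 0
m[i, j] = min_{i ≤ k < j} ( m[i, k] + m[k+1, j] + p_{i−1} · p_k · p_j )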
AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support
Adds support for GPU clusters: a cluster of CPUs with a GPU on each node. User input (variable information, reduction functions, and optional functions) goes through the code analyzer, whose variable analyzer extracts variable information, parallel loops, access patterns, reduction objects, and combination operations; the code generator then emits FREERIDE code and CUDA code.
Future Work
– Extend the code generation system for data mining applications to more structures
– Improve the ILP approach for shared memory arrangement and apply it to other architectures
– Include more parameters in the auto-tuning framework
– Extend the loop transformation work to heterogeneous structures
– …
Conclusion
Code generation for data mining applications:
– Translation system for enabling data mining applications on GPUs
– Automatic translation of data mining applications from MATLAB to GPUs
– Automatic code generation for data mining on clusters with GPU support
– Arranging data on shared memory with an ILP solver
Code optimization for tensor contractions:
– Auto-tuning approach for tensor contractions on GPUs
– Loop transformation for tensor contraction sequences on multi-level memory architecture
Thank you!