A Library of Data-Copy and Data-Layout Optimizations
Malik Muhammad Zaki Murtaza Khan, Chun Chen, Jacqueline Chame, Mary Hall
Funded by NSF CSR awards, DOE grant DE-FC02-06ER25765, and a gift from Intel Corporation.

Motivation & Overview

• Modern architectures are incorporating complex memory hierarchies.
• Heterogeneous memory (DRAM/SRAM, with different latency and bandwidth properties and different access policies).
• Software-managed storage.
• Data copy is an important transformation for orchestrating data movement and reorganization.
• A compiler framework can incorporate data copy to optimize memory hierarchy utilization.

Data copy

• Data copy dynamically rearranges data layouts by copying data between different memory locations, possibly in different memory structures.
• Conventional architectures:
  ♦ Improves locality and avoids cache and memory-bank conflicts.
• Accelerators:
  ♦ Copying data into accelerator memory (e.g., on FPGAs and GPUs) is a precursor to execution.
  ♦ Rearranging data can improve the parallelism of memory accesses and increase memory bandwidth.
  ♦ Automating this process eases the programmer's job of using the accelerator.
• What's the need?
  ♦ Automatic code generation for a number of different architectures.
  ♦ Many similar reorganizations are required across platforms, but the specific code is not portable.
• What's the solution?
  ♦ A library of different copy implementations, with a shared front end and architecture-specific code generation.
  ♦ Compiler technology to rewrite access expressions automatically and correctly.

Data copy library with Auto-tuning Compiler

• Data copy forms part of an auto-tuning compiler framework.
• While using the library, the auto-tuning compiler can empirically evaluate the different implementations.
• Future extensions of the library are easily incorporated by the auto-tuning framework.

Representation & Implementation

• A powerful polyhedral model supports rewriting of the affine access expressions.
• Multiple data-copy implementations form part of the library as callable functions.
• A script-level interface provides the mechanism for compilers and programmers to use the implementation functions.
• Multiple data layouts can be generated for different computations at different stages of optimization.

Polyhedral Model

• The polyhedral model provides simple representations of different program structures.
• It allows iteration domains and statement instances to be viewed as objects in a polyhedron.
• Omega Library Plus* provides the framework to implement the model.
• Efficient loop code is generated by the advanced Omega code generation tool.
  *Omega Library Plus is a new version of the old Omega Library, with unique features.

How does it work?

(Workflow diagram: Source Code, Auto-tuning Compiler Framework, Data Copy Function Calls, Data Copy Library, Data Copy Implementations, Polyhedral Framework, Optimized Code.)

Code Example: Jacobi

    DO I=1,N-1
      DO J=1,N-1
        A(I,J) = ( B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1) )/4

The polyhedral framework produces a simple representation of the iteration domain:

    { Sym=[n] [_t1,_t2] : 1 <= _t2 < n && 1 <= _t1 < n }

Simple relations, based on linear equations, represent the indices in the array accesses:

    {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = 1+Out_2 }
    {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && 1+_t2 = Out_2 }
    ...

Without framework support (from the original iteration space):

    for (t1 = 1; t1 <= n; t1++) {
      s1(t1,1);
      s2(t1,1);
      if (t1 <= 1) {
        s3(1,1);
      }
      for (t2 = 2; t2 <= t1-1; t2++) {
        s2(t1,t2);
      }
      if (t1 >= 2) {
        s2(t1,t1);
        s3(t1,t1);
      }
    }
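To make the conventional-architecture case above concrete, here is a minimal hand-written sketch of a data-copy version of the Jacobi kernel: a tile of B, plus a one-element halo, is copied into a small contiguous buffer and the access expressions are rewritten to index the copy. This illustrates the transformation the library automates; it is not output of the library, and the tile sizes TI and TJ, the buffer name B_local, and the helper imin are illustrative assumptions.

    /* Illustrative data-copy sketch for a conventional cache hierarchy
       (hand-written; not generated by the library).                     */
    #include <stdio.h>

    #define N  34          /* array extents match the Jacobi kernel in this poster */
    #define M  18
    #define TI  8          /* tile height: illustrative choice */
    #define TJ  8          /* tile width:  illustrative choice */

    static int A[N][M], B[N][M];

    static int imin(int a, int b) { return a < b ? a : b; }

    static void jacobi_datacopy(void)
    {
        /* contiguous copy of one tile of B plus a one-element halo */
        int B_local[TI + 2][TJ + 2];

        for (int ii = 1; ii < N - 1; ii += TI)
            for (int jj = 1; jj < M - 1; jj += TJ) {
                int ih = imin(ii + TI, N - 1);
                int jh = imin(jj + TJ, M - 1);

                /* data copy: gather the tile (with halo) into the buffer */
                for (int i = ii - 1; i <= ih; i++)
                    for (int j = jj - 1; j <= jh; j++)
                        B_local[i - ii + 1][j - jj + 1] = B[i][j];

                /* compute from the copy: each access B[x][y] is rewritten
                   to B_local[x - ii + 1][y - jj + 1]                      */
                for (int i = ii; i < ih; i++)
                    for (int j = jj; j < jh; j++)
                        A[i][j] = (B_local[i - ii + 2][j - jj + 1] +
                                   B_local[i - ii    ][j - jj + 1] +
                                   B_local[i - ii + 1][j - jj + 2] +
                                   B_local[i - ii + 1][j - jj    ]) / 4;
            }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                B[i][j] = i + j;
        jacobi_datacopy();
        printf("A[1][1] = %d\n", A[1][1]);
        return 0;
    }

Copying the reused part of B into contiguous storage is what gives the locality and conflict-avoidance benefit listed for conventional architectures; the library's polyhedral machinery performs the equivalent access-expression rewriting automatically.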
Memory Layout for FPGA

• Exploit the memory parallelism offered by multiple memory banks [1].
• Unroll the loops.
• Apply scalar replacement.
• Perform reuse analysis.
• Partition array references into different memory banks according to their access patterns.

(a) Jacobi unannotated kernel:

    int A[34][18];
    int B[34][18];
    for (i = 1; i < 33; i++)
      for (j = 1; j < 17; j++) {
        A[i][j] = (B[i + 1][j] + B[i - 1][j] + B[i][j + 1] + B[i][j - 1]) / 4;
      }

(b) unroll-and-jam:

    for (i = 1; i < 33; i += 2)        /* unroll by 1 */
      for (j = 1; j < 17; j += 2) {    /* unroll by 1 */
        A[i][j]         = (B[i + 1][j] + B[i - 1][j] + B[i][j + 1] + B[i][j - 1]) / 4;
        A[i][j + 1]     = (B[i + 1][j + 1] + B[i - 1][j + 1] + B[i][j + 2] + B[i][j]) / 4;
        A[i + 1][j]     = (B[i + 2][j] + B[i][j] + B[i + 1][j + 1] + B[i + 1][j - 1]) / 4;
        A[i + 1][j + 1] = (B[i + 2][j + 1] + B[i][j + 1] + B[i + 1][j + 2] + B[i + 1][j]) / 4;
      }

(Figure: multiple memory banks in an FPGA; array references partitioned across banks, with one bank holding A(0,0), A(0,2), ..., B(0,0), B(0,2), ...; another holding A(1,0), A(1,2), ..., B(1,0), B(1,2), ...; another holding A(0,1), A(0,3), ..., B(0,1), B(0,3), ...; and another holding A(1,1), A(1,3), ..., B(1,1), B(1,3), ....)

[1] Byoungro So, Mary Hall, and Heidi Ziegler. Custom Data Layout for Memory Parallelism. CGO'04, Palo Alto, CA, March 20-24, 2004.

Array access expressions can be modified to implement the required optimizations:

    {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 }
    {[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 }
    ...

Modified array references, or newly created arrays, set up the storage for the data copy optimization:

    DO I=1,N-1
      · · · = D0[i];
      · · · = D1[i];
      DO J=1,N-1
        · · · = · · · + B0[i+j]   * · · ·;   // u(0,0)
        · · · = · · · + B1[i+j]   * · · ·;   // u(0,1)
        · · · = · · · + · · ·     * · · ·;   // u(1,0)
        · · · = · · · + B0[i+j+1] * · · ·;   // u(1,1)
      }
      D1[i] = · · ·;
      D0[i] = · · ·;
    }

(a) Script Interface:

    source: jacobi.sp2
    procedure: 0
    ...
    unroll(...)
    datacopy(...)

Memory Layout for GPU

• Exploit parallelism in the memory hierarchy of a GPU.
• Split a task into subtasks.
• Divide the input data into blocks that fit in shared memory.
• Copy from global memory into shared memory.
• Copy results from shared memory back to global memory.

(Figure: memory hierarchy in a GPU.)

    #define N 16
    __global__ void jacobi_GPU(int* a, int* c)
    {
      __shared__ float b[4][4];              /* shared memory space allocation */
      int thidx = ....;
      int thidy = ....;
      if (blockIdx.x == 0) {                 /* copying data to the shared memory */
        if (threadIdx.x == 0) b[...] = a[...];
        if (threadIdx.x == 0 && (blockIdx.y == threadIdx.y)) b[...] = a[...];
        if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.y == 0) b[...] = a[...];
      }
      ....
      if (thidx > 0 && thidx < N-1 && thidy > 0 && thidy < N-1)
        c[...] = 0.8 * (b[...] + b[...] + b[...] + b[...]);
    }

    int main() {
      .....
      dim3 dimBlock(N/2, N/2);
      dim3 dimGrid(N/4, N/4);
      cudaMalloc((void **)&a_gpu, N*N*sizeof(int));
      cudaMalloc((void **)&c_gpu, N*N*sizeof(int));
      cudaMemcpy(a_gpu, a, N*N*sizeof(int), cudaMemcpyHostToDevice);
      jacobi_GPU<<<dimGrid, dimBlock>>>(a_gpu, c_gpu);
      cudaMemcpy(cr_gpu, c_gpu, N*sizeof(int), cudaMemcpyDeviceToHost);
      ....
      return 0;
    }
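The CUDA snippet above elides the copy subscripts; the following self-contained sketch fills in one concrete version of the same global-to-shared data-copy pattern for the Jacobi stencil. It is written by hand for illustration only: the kernel name jacobi_shared, the tile size TILE, the float data type, the 0.25 averaging coefficient, and the halo-loading scheme are all assumptions, not the poster's generated code.

    // Illustrative CUDA sketch: each thread block copies a TILE x TILE tile of
    // the input, plus a one-element halo, from global memory into shared memory,
    // then computes the Jacobi stencil entirely from the shared-memory copy.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define N    16
    #define TILE 8

    __global__ void jacobi_shared(const float* a, float* c)
    {
        __shared__ float b[TILE + 2][TILE + 2];     // shared-memory space allocation

        int gx = blockIdx.x * TILE + threadIdx.x;   // global column
        int gy = blockIdx.y * TILE + threadIdx.y;   // global row
        int lx = threadIdx.x + 1;                   // local column (past the halo)
        int ly = threadIdx.y + 1;                   // local row

        // Data copy: the element owned by this thread.
        b[ly][lx] = a[gy * N + gx];

        // Data copy: halo cells, loaded by the boundary threads of the block.
        if (threadIdx.x == 0        && gx > 0)     b[ly][0]        = a[gy * N + gx - 1];
        if (threadIdx.x == TILE - 1 && gx < N - 1) b[ly][TILE + 1] = a[gy * N + gx + 1];
        if (threadIdx.y == 0        && gy > 0)     b[0][lx]        = a[(gy - 1) * N + gx];
        if (threadIdx.y == TILE - 1 && gy < N - 1) b[TILE + 1][lx] = a[(gy + 1) * N + gx];

        __syncthreads();

        // Compute from the copy; skip the domain boundary.
        if (gx > 0 && gx < N - 1 && gy > 0 && gy < N - 1)
            c[gy * N + gx] = 0.25f * (b[ly - 1][lx] + b[ly + 1][lx] +
                                      b[ly][lx - 1] + b[ly][lx + 1]);
    }

    int main()
    {
        float a[N * N], c[N * N];
        for (int i = 0; i < N * N; i++) a[i] = (float)i;

        float *a_gpu, *c_gpu;
        cudaMalloc((void**)&a_gpu, N * N * sizeof(float));
        cudaMalloc((void**)&c_gpu, N * N * sizeof(float));
        cudaMemcpy(a_gpu, a, N * N * sizeof(float), cudaMemcpyHostToDevice);

        dim3 dimBlock(TILE, TILE);
        dim3 dimGrid(N / TILE, N / TILE);
        jacobi_shared<<<dimGrid, dimBlock>>>(a_gpu, c_gpu);

        cudaMemcpy(c, c_gpu, N * N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("c[1][1] = %f\n", c[1 * N + 1]);

        cudaFree(a_gpu);
        cudaFree(c_gpu);
        return 0;
    }

Loading a one-element halo around each tile lets every stencil read in the compute step come from shared memory, which is the memory-bandwidth benefit the GPU data-copy layout targets.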