Compiler-Driven Data Layout Transformation for Heterogeneous Platforms


Compiler-Driven Data Layout Transformation for Heterogeneous Platforms
HeteroPar'2013
Deepak Majeti (1), Rajkishore Barik (2), Jisheng Zhao (1), Max Grossman (1), Vivek Sarkar (1)
(1) Rice University, (2) Intel Corporation

Acknowledgments I would like to thank the Habanero team at Rice University for their support. I would also like to thank the reviewers for their valuable feedback.

Introduction Heterogeneity is ubiquitous, in both the client and server domains. The two central challenges are portability and performance.

Motivation How does one write portable and performant programs? How do we port existing applications to newer hardware? This is a hard problem: a CPU has very different architectural features from a GPU. We need high-level language constructs and efficient compiler and runtime frameworks.

Motivation Each processor has a different memory (cache) hierarchy, and each memory hierarchy demands a different data layout. A CPU performs well with a struct layout because of prefetching; a GPU performs well with an array layout because of coalescing. In our experiments, the layout changed the performance of one application by a factor of 27. We therefore need to map data structures to layouts automatically.
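To make the struct-versus-array distinction concrete, here is a minimal C sketch (illustrative only; the field names mirror the a/b/c arrays used in the examples later in the talk) of the same three floating-point fields stored as an array of structures (AOS) and as a structure of arrays (SOA):

#define N 1024

/* Array of Structures (AOS): the fields of one element are adjacent in memory,
 * which suits CPU caches and hardware prefetching. */
struct ABC { float a, b, c; };
static struct ABC aos[N];

/* Structure of Arrays (SOA): each field is a contiguous array, so consecutive
 * GPU work-items touch consecutive addresses and loads can be coalesced. */
struct ABC_SOA { float a[N], b[N], c[N]; };
static struct ABC_SOA soa;

float sum_aos(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += aos[i].a + aos[i].b + aos[i].c;  /* one cache line serves all three fields */
    return s;
}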

Prefetching on CPU (diagram): without prefetching, each processor issues 2 demand loads per iteration (one for A and one for B) against main memory; with prefetching, subsequent elements of A and B are already resident in the L1 cache and only 1 demand load per processor is needed.

Coalescing on GPU (diagram): with coalesced access, consecutive work-items on an SM read consecutive elements (A[0-15], B[0-15], C[0-15]) and each request is served by a single 128-byte load; with non-coalesced access over the interleaved layout ([ABC][0-15]), each SM needs several 64-byte and 128-byte loads to fetch the same elements.
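The effect can be seen in a pair of illustrative OpenCL kernels (these are our sketches, not code from the paper): with the SOA layout, work-item i reads a[i], so a warp touches one contiguous region and the load is coalesced; with the AOS layout, consecutive work-items are a full struct apart and the same read needs several memory transactions.

/* SOA: consecutive threads -> consecutive addresses -> coalesced load */
__kernel void scale_soa(__global const float *a, __global float *out) {
    int i = get_global_id(0);
    out[i] = 2.0f * a[i];
}

/* AOS: consecutive threads are 12 bytes apart (3 floats) -> non-coalesced */
struct ABC { float a, b, c; };
__kernel void scale_aos(__global const struct ABC *abc, __global float *out) {
    int i = get_global_id(0);
    out[i] = 2.0f * abc[i].a;
}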

Our Approach Restrict data structures in the programming model so that they can be implemented with different data layouts without changing the program's semantics. We propose a compiler-driven metadata layout framework that abstracts the data layout away from the program: an auto-tuner or expert programmer specifies a metadata file containing a high-level view of the data layout, and the compiler auto-generates the program for the target architecture from that metadata.
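A layout schema might look like the following sketch; the Arch/Struct/Field lines are taken from the examples later in the talk, while the indentation and exact grammar are our assumption. This schema groups the fields a, b, and c into a single struct ABC (an AOS layout) for the Intel CPU target.

Arch Intel_CPU
  Struct ABC
    Field fp a
    Field fp b
    Field fp c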

Overall Framework

Habanero-C forasync: Multidimensional Parallel Loop Construct
forasync point (args) size (args) seq (args) { Body }
Loop indices in each dimension are specified by the point clause; the number of iterations in each dimension by the size clause; the tile size by the seq clause.
forasync is compiled to OpenCL (CPU and GPU). forasync can also be compiled to the Habanero-C runtime for CPUs only.
Full details are available at https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C

forasync to OpenCL kernel

Source program:

int main() {
    ...
    forasync point(i, j) size(M, N) seq(tilesize1, tilesize2) {
        a[i * M + j] = b[i * M + j] + c[i * M + j];
    }
    ...
}

Generated host code:

void offload(float *a, float *b, float *c, char *kernel_name, char *ocl_kernel) {
    // Build the kernel
    // Copy data from the host to the device buffers
    // Execute the kernel
    // Copy data back from the device to the host buffers
    ...
}

int main() {
    offload(a, b, c, "kernel_1", Kernel_string);
}

Generated OpenCL kernel:

Kernel_string = "
__kernel void kernel_1(__global float *a, __global float *b, __global float *c, int M, int N) {
    int i = get_global_id(1);
    int j = get_global_id(0);
    a[i * M + j] = b[i * M + j] + c[i * M + j];
}
";

Metadata Layout Framework: Algorithm
1. The auto-tuner or expert programmer specifies a layout schema.
2. Create the data structure definitions for the schema.
3. For every function definition whose formal parameters belong to the schema, replace those parameters with the corresponding data structure from the schema.
4. For every instruction, update references to such formal parameters with references to the new data structure.

Metadata Framework Example (Intel CPU: group a, b, and c into one struct)

Source program:

int main() {
    forasync point(i, j) size(M, N) seq(tilesize1, tilesize2) {
        a[i * M + j] = b[i * M + j] + c[i * M + j];
    }
}

Metadata file:

Arch Intel_CPU
Struct ABC
Field fp a
Field fp b
Field fp c

Generated host code:

struct ABC { float a, b, c; };

void offload(struct ABC *abc, char *kernel_name, char *ocl_kernel) {
    // OpenCL host code
    ...
}

int main() {
    struct ABC *abc = ...;
    offload(abc, "kernel_1", Kernel_string);
}

Generated OpenCL kernel:

Kernel_string = "
struct ABC { float a, b, c; };
__kernel void kernel_1(__global struct ABC *abc, int M, int N) {
    int i = get_global_id(1);
    int j = get_global_id(0);
    abc[i * M + j].a = abc[i * M + j].b + abc[i * M + j].c;
}
";

Metadata Framework Example (AMD CPU: group a and b into one struct, keep c separate)

Source program:

int main() {
    forasync point(i, j) size(M, N) seq(tilesize, tilesize) {
        a[i * M + j] = b[i * M + j] + c[i * M + j];
    }
}

Metadata file:

Arch AMD_CPU
Struct AB
Field fp a
Field fp b

Generated host code:

struct AB { float a, b; };

void offload(struct AB *ab, float *c, char *kernel_name, char *ocl_kernel) {
    // OpenCL host code
    ...
}

int main() {
    struct AB *ab = ...;
    offload(ab, c, "kernel_1", Kernel_string);
}

Generated OpenCL kernel:

Kernel_string = "
struct AB { float a, b; };
__kernel void kernel_1(__global struct AB *ab, __global float *c, int M, int N) {
    int i = get_global_id(1);
    int j = get_global_id(0);
    ab[i * M + j].a = ab[i * M + j].b + c[i * M + j];
}
";

Metadata Layout Framework: Current Limitations The transformation depends on the names of the fields; a smart variable-renaming scheme could be used instead. It assumes a strongly typed program: pointer manipulations are hard to transform between layouts, and pointer swaps between fields of different layout groups cannot be handled (see the sketch below).
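A hypothetical illustration of the last limitation (the names and the struct are ours, not the paper's): if the program swaps pointers to two fields, and the metadata places those fields in different layout groups, the swap has no mechanical counterpart after the transformation.

/* Original program: a and b are independent arrays, so swapping the
 * pointers is legal and cheap. */
void swap_ab(float **a, float **b) {
    float *tmp = *a;
    *a = *b;
    *b = tmp;
}

/* After a transformation that places a inside struct AB but leaves b as a
 * separate array, there is no standalone "pointer to a" left to swap:
 * a's elements are interleaved with other fields, so swap_ab cannot be
 * rewritten automatically. */
struct AB { float a, other; };
/* void swap_ab_transformed(struct AB **ab, float **b);  -- no direct equivalent */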

Benchmarks

Name            | Description                                                                              | Original Layout | Num of Fields | Input
NBody           | N-Body Simulation                                                                        | SOA             | 7             | 32K nodes
Medical Imaging | Image Registration                                                                       |                 | 6             | 256 x 256 x 256
SRAD            | Speckle Reducing Anisotropic Diffusion                                                   |                 | 4             | 5020 x 4580
Seismic         | Seismic Wave Simulation                                                                  |                 |               | 4096 x 4096
MRIQ            | Matrix Q Computation for 3D Magnetic Resonance Image Reconstruction in Non-Cartesian Space |               |               | 64 x 64 x 64

Experimental Setup

Vendor | Type           | Model          | Freq     | Cores   | L1$    | L2$
Intel  | CPU            |                | 2.8 GHz  | 12 (HT) | 192 KB | 1.5 MB
Intel  | Integrated GPU | I7-3770U       | 1.15 GHz | 14      | N.A.   |
NVIDIA | Discrete GPU   | Tesla M2050    | 575 MHz  | 8       | 128 KB | 768 KB
AMD    | CPU            | A10-5800K      | 1.4 GHz  | 4 (HT)  |        | 4 MB
AMD    | Integrated GPU | Radeon HD 7660 | 800 MHz  | 6       |        |

Experimental Methodology Timing: minimum time over 5 runs. Normalized time: each layout is compared against the fastest layout implementation on that architecture. Overheads measured: data communication and OpenCL kernel creation/launching. Layouts: AOS (Array of Structures), SOA (Structure of Arrays), AOS1 (an intermediate hybrid representation).

Experimental Results: Medical Imaging benchmark. Array of Structures (AOS) is better for the CPUs: the kernel benefits from spatial locality. Structure of Arrays (SOA) is better for the GPUs, due to coalescing.

Experimental Results: Seismic benchmark. Array of Structures (AOS) is better for the AMD CPU, possibly due to cache associativity. Structure of Arrays (SOA) is better for the Intel CPU and for the GPUs, due to coalescing on the GPUs.

Experimental Results: MRIQ benchmark. Data layout plays no role in performance on the CPUs or the GPUs: the kernel is compute bound.

Experimental Results: NBody benchmark. AOS is better for the GPUs, due to coalescing. Data layout plays no role in performance on the CPUs: there is no gain from spatial locality.

Experimental Results: SRAD benchmark. Array of Structures (AOS) is best on the AMD GPU; the intermediate layout (AOS1) is best on all other platforms. The non-affine accesses in the kernel make it hard to predict the correct layout.

Conclusion Different architectures require different data layouts to reduce communication latency; the metadata layout framework helps bridge this gap. We extend Habanero-C to support a metadata layout framework: an auto-tuner or expert programmer specifies the metadata file, and the compiler auto-generates the program for the target architecture from that metadata. By automatically generating the program for a specified layout, the metadata framework also helps take advantage of the scratchpad memories available on a given piece of hardware.

Future Work Develop auto-tuning frameworks and heuristics to automatically determine the best layout for a given piece of hardware.

Related Work: Data Layout Transformation

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications. Sung et al., PACT '10. Changes indexing for memory-level parallelism.
DL: A Data Layout Transformation System for Heterogeneous Computing. Sung et al., InPar '12. Changes the data layout at runtime.
Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems. Che et al., SC '11. Runtime iteration-space re-ordering functions for better memory coalescing.
TALC: A Simple C Language Extension for Improved Performance and Code Maintainability. Keasler et al., Proceedings of the 9th LCI International Conference on High-Performance Clustered Computing. Only targets CPUs and focuses on meshes.

Thank you