Automatic CPU-GPU Communication Management and Optimization
Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, David I. August

Presentation transcript:

1 Automatic CPU-GPU Communication Management and Optimization
Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, David I. August
Princeton University, Brown University. SCIE, SCOPUS, 2011

2 Contents
- Introduction
- Runtime Library
- Communication Management
- Evaluation

3 Introduction
GPU execution requires copying data from the CPU to the GPU and back.
Weaknesses of manual communication management (see the sketch below):
- Added delay and latency
- Errors
- Unnecessary computation
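A minimal sketch of the manual style in plain CUDA (the scale kernel, buffer names, and sizes are illustrative, not from the paper): every input must be copied to the device and every result copied back by hand, which is where the latency, the errors, and the redundant work come from.

    #include <cuda_runtime.h>

    __global__ void scale(float *a, int n) {          // illustrative kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    void run_manual(float *host_a, int n) {
        size_t bytes = n * sizeof(float);
        float *dev_a = NULL;
        cudaMalloc((void **)&dev_a, bytes);                        // allocate GPU memory
        cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU copy
        scale<<<(n + 255) / 256, 256>>>(dev_a, n);                 // run the kernel
        cudaMemcpy(host_a, dev_a, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU copy
        cudaFree(dev_a);                                           // easy to misplace or forget
    }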

4 Introduction
[Figure: application structure, a host program (int main()) launching a device kernel (__kernel void opencl_kernel()).]

5 Cyclic communication pattern
- When a kernel repeats inside a loop, data is transferred on every iteration, limiting performance (see the sketch below)
Inspector-executor systems
- A communication-management approach from clusters with distributed memory
- Break a loop into an inspector, a scheduler, and an executor
- The inspector determines which array offsets the program reads or writes during each iteration
- The executor computes loop iterations in parallel
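As a hedged illustration of that pattern (reusing the illustrative scale kernel from the sketch above), the same buffer is shipped to the GPU and back on every trip around the loop, even though the CPU never touches it in between:

    void run_cyclic(float *host_a, int n, int iters) {
        size_t bytes = n * sizeof(float);
        float *dev_a = NULL;
        cudaMalloc((void **)&dev_a, bytes);
        for (int t = 0; t < iters; ++t) {
            cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);  // repeated CPU -> GPU
            scale<<<(n + 255) / 256, 256>>>(dev_a, n);
            cudaMemcpy(host_a, dev_a, bytes, cudaMemcpyDeviceToHost);  // repeated GPU -> CPU
        }
        cudaFree(dev_a);
    }

The two cudaMemcpy calls inside the loop are exactly the cyclic communication that the optimizations on slide 13 remove.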

6 Introduction

7 Motivation
Weaknesses of manual communication management
- Time-consuming
- Error-prone
- Limited applicability
- Repeated transfers inside loops
Goal: CGCM improves performance through
- Fully automatic communication management
- Removal of cyclic communication dependencies
- A supporting run-time library for users

8 Runtime library
- The CGCM run-time library enables automatic CPU-GPU communication and its optimization by determining which bytes to transfer
Two parts of the run-time support
- Memory allocation tracking
- CPU-GPU mapping semantics

9 Runtime library: memory allocation
- Intercepts allocations such as malloc() and calloc()
- The run-time library records information about heap, stack, and global data reachable through pointers
- Stores the base and size of each allocation unit in a balanced binary tree
- A pointer anywhere inside a block can therefore be resolved to the allocation unit that contains it
- Live data is transferred at allocation-unit granularity for GPU execution (see the sketch below)
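A rough sketch of that lookup, with a std::map (an ordered red-black tree) standing in for the balanced tree; the structure and names are assumptions for illustration, not CGCM's actual implementation:

    #include <cstddef>
    #include <cstdint>
    #include <map>

    struct AllocUnit { void *base; size_t size; };

    // Ordered map from base address to allocation unit.
    static std::map<uintptr_t, AllocUnit> g_units;

    void record_alloc(void *base, size_t size) {        // called from malloc()/calloc() wrappers
        g_units[(uintptr_t)base] = AllocUnit{base, size};
    }

    // Resolve an arbitrary (possibly interior) pointer to its allocation unit.
    AllocUnit *find_unit(void *p) {
        uintptr_t addr = (uintptr_t)p;
        std::map<uintptr_t, AllocUnit>::iterator it = g_units.upper_bound(addr);
        if (it == g_units.begin()) return NULL;          // nothing starts at or before p
        --it;                                            // last unit starting at or before p
        AllocUnit &u = it->second;
        return (addr < (uintptr_t)u.base + u.size) ? &u : NULL;
    }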

10 Runtime library: CPU-GPU mapping semantics
Three functions: map(), unmap(), release() (sketched below)
Mapping
- Copies memory from the CPU to the GPU, allocating GPU memory as necessary
- Returns the corresponding GPU pointer
Unmapping
- Given a CPU pointer, updates the CPU allocation unit from the corresponding GPU allocation unit
Releasing
- Given a CPU pointer, releases the corresponding GPU allocation unit, freeing it if necessary
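A hedged sketch of how these three calls could be built on the CUDA runtime and the allocation-unit table sketched above (the cgcm_ names are made up here; the real library also does reference counting and full pointer translation, which are omitted):

    #include <cuda_runtime.h>
    #include <map>

    static std::map<void *, void *> g_cpu_to_gpu;        // CPU base address -> GPU base address

    // map: copy the allocation unit to the GPU, allocating if necessary; return the GPU pointer.
    void *cgcm_map(void *cpu_ptr) {
        AllocUnit *u = find_unit(cpu_ptr);               // from the earlier sketch
        void *&gpu_base = g_cpu_to_gpu[u->base];
        if (gpu_base == NULL) cudaMalloc(&gpu_base, u->size);
        cudaMemcpy(gpu_base, u->base, u->size, cudaMemcpyHostToDevice);
        return (char *)gpu_base + ((char *)cpu_ptr - (char *)u->base);
    }

    // unmap: refresh the CPU allocation unit from its GPU copy.
    void cgcm_unmap(void *cpu_ptr) {
        AllocUnit *u = find_unit(cpu_ptr);
        cudaMemcpy(u->base, g_cpu_to_gpu[u->base], u->size, cudaMemcpyDeviceToHost);
    }

    // release: free the GPU copy of the allocation unit.
    void cgcm_release(void *cpu_ptr) {
        AllocUnit *u = find_unit(cpu_ptr);
        cudaFree(g_cpu_to_gpu[u->base]);
        g_cpu_to_gpu.erase(u->base);
    }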

11 Runtime library

12 Communication Management
Manual communication management
- A common source of errors when parallelizing
- Limits applicability
Communication management in CGCM (see the sketch below)
- The compiler builds a list of live data, such as global variables and CPU pointers, that can reach GPU code; these values are called "flows"
- Load operations that produce such values are labeled as carrying a "flow"
- After identifying the flows, each one is transferred from the CPU to the GPU with map() or mapArray() before the call
- After the function call, the compiler inserts unmap() or unmapArray() for each flow
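Roughly, the compiler's rewrite of a call site might look like the following, using the illustrative cgcm_ helpers from slide 10's sketch; launch_kernel and the argument names are hypothetical:

    void launch_kernel(float *a, float *b);               // assumed GPU launch wrapper

    void call_site(float *host_a, float *host_b) {
        float *gpu_a = (float *)cgcm_map(host_a);          // CPU -> GPU for each flow
        float *gpu_b = (float *)cgcm_map(host_b);
        launch_kernel(gpu_a, gpu_b);                       // GPU code sees device pointers only
        cgcm_unmap(host_a);                                // GPU -> CPU copy-back for each flow
        cgcm_unmap(host_b);
        cgcm_release(host_a);                              // drop the GPU copies
        cgcm_release(host_b);
    }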

13 Optimizing CPU-GPU Communication
Map promotion (see the sketch below)
- Scans each region (function or loop body) that contains GPU code
- Captures all calls to the CGCM run-time library in that region
- Hoists the map() calls before the loop region, copying each map() call out of the loop
- Produces an acyclic communication pattern, using information gathered about the loop
Alloca promotion
- Similar logic to map promotion
- Manages local variables with their parent's stack frame
Glue kernels
- Move small CPU code regions that sit between GPU regions onto the GPU
- Improve the applicability of map promotion
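Applied to the cyclic loop from slide 5's sketch, map promotion hoists the transfers so the buffer crosses the bus once in each direction instead of once per iteration (again using the illustrative scale kernel and cgcm_ helpers):

    void run_promoted(float *host_a, int n, int iters) {
        float *gpu_a = (float *)cgcm_map(host_a);           // hoisted: one CPU -> GPU copy
        for (int t = 0; t < iters; ++t) {
            scale<<<(n + 255) / 256, 256>>>(gpu_a, n);       // no per-iteration transfers
        }
        cgcm_unmap(host_a);                                  // hoisted: one GPU -> CPU copy
        cgcm_release(host_a);
    }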

14 Evaluation
Platform
- CPU: Intel Core 2 Quad clocked at 2.40 GHz
- GPU: NVIDIA GeForce GTX 480
- OpenCL: version 3.2
- CUDA: version 2.0
Benchmarks: PolyBench, Rodinia, DOALL, PARSEC, etc.

15 Evaluation results
[Figure: program speedup over sequential CPU-only execution for inspector-executor, optimized CGCM, unoptimized CGCM, and manual communication management.]

16 Evaluation

17 Conclusion
- CGCM is the first fully automatic system for managing and optimizing CPU-GPU communication
- CGCM has two parts: a run-time library and an optimizing compiler
- CGCM achieves a 5.3x whole-program speedup over the best sequential CPU-only execution