
Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores
Linchuan Chen, Xin Huo and Gagan Agrawal

Outline
- Introduction
- Background
- System Design
- Experiment Results
- Conclusions and Future Work

Introduction: Motivations
- Evolution of heterogeneous architectures
  - Decoupled CPU-GPU architectures: CPU + NVIDIA GPU
  - Coupled CPU-GPU architectures: AMD Fusion, Intel Ivy Bridge
    - Non-coherent, non-uniform-access shared memory
- Accelerating applications using both CPU and GPU
  - Desirable: both the CPU and the GPU provide substantial computing power
  - Issues: locking overhead (e.g., work stealing), lack of locking support between devices, kernel re-launch overhead

Introduction: Our Work
- Several locking-free scheduling methods:
  - Master-worker: one CPU core acts as the master for scheduling, the others as workers
  - Core-level token passing: a token passes the permission for retrieving tasks among cores/SMs
  - Device-level token passing: coarse-grained token passing between the CPU and GPU(s)
- Platforms
  - A decoupled CPU-GPU node with 2 GPUs
  - An AMD Fusion coupled CPU-GPU chip
- Efficiency
  - CPU+1GPU: up to 1.88x speedup over the better single-device version
  - CPU+2GPU: up to a further 1.79x improvement over the CPU+1GPU version
  - Up to 21% faster than StarPU or OmpSs

Outline
- Introduction
- Background
- System Design
- Experiment Results
- Conclusions and Future Work

Decoupled GPU Architectures: Processing Component
- Device: runs a Grid (CUDA) / NDRange (OpenCL)
- Streaming Multiprocessor (SM): runs a Block (CUDA) / Workgroup (OpenCL)
- Processing core: runs a Thread (CUDA) / Work Item (OpenCL)
[Figure: grid/block/thread hierarchy on the GPU device, connected to the CPU over PCIe]
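To make this hardware/software mapping concrete, here is a minimal CUDA sketch (not from the original slides) showing how a grid of blocks of threads indexes data; the kernel name and launch sizes are illustrative.

```cuda
// Minimal illustration: each CUDA thread uses its block and thread indices to
// pick one data element; blocks are scheduled onto SMs.
__global__ void scale(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < n)
        data[idx] *= factor;
}
// Host-side launch of a grid of (n + 255) / 256 blocks, 256 threads each:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```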

Decoupled GPU Architectures: Memory Component
- Device memory
- Shared memory: small size (32 KB), faster I/O, faster locking operations
- Registers
- Constant memory
[Figure: device memory hierarchy, with per-block shared memory, per-thread registers and local memory, and device-wide constant and texture memory, connected to the host]
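As a small illustration of the shared-memory level of this hierarchy, the following sketch stages data into a per-block __shared__ buffer before reducing it; the kernel and the 256-thread block size are assumptions, not code from the talk.

```cuda
// Illustrative kernel: data is staged from slow device memory into fast per-block
// shared memory before being reduced. Assumes at most 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];                  // fast, visible to one block only
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    buf[tid] = (idx < n) ? in[idx] : 0.0f;      // load from device memory once
    __syncthreads();                            // wait until the whole block has loaded
    if (tid == 0) {                             // thread 0 reduces the shared buffer
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += buf[i];
        out[blockIdx.x] = s;                    // one partial sum per block
    }
}
```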

Heterogeneous Architecture (AMD Fusion Chip)
- Processing component: same as a decoupled GPU
- Memory component
  - GPU shares the same physical memory with the CPU
    - No separate device memory, no PCIe bus
    - Zero-copy memory buffer
  - Also has a memory hierarchy: shared memory, registers
[Figure: CPU cores and GPU SMs sharing the same RAM, with zero-copy regions mapped into both the host and device memory spaces]
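The coupled platform in this work is an AMD APU programmed through OpenCL, but the zero-copy idea can be sketched in CUDA terms as well: a pinned, mapped host buffer that both the CPU and the GPU access without explicit PCIe copies. The example below only illustrates that concept and is not the authors' code.

```cuda
// Illustrative CUDA analogue of a zero-copy buffer: pinned host memory mapped
// into the GPU's address space, so both sides access one copy with no explicit
// data transfers.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void incr(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;                          // GPU writes host-visible memory
}

int main() {
    const int n = 1024;
    int *host_ptr = nullptr, *dev_ptr = nullptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);         // allow mapping host allocations
    cudaHostAlloc((void **)&host_ptr, n * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);
    for (int i = 0; i < n; ++i) host_ptr[i] = i;   // CPU fills the shared buffer
    incr<<<(n + 255) / 256, 256>>>(dev_ptr, n);    // GPU updates it in place
    cudaDeviceSynchronize();
    printf("host_ptr[0] = %d\n", host_ptr[0]);     // prints 1
    cudaFreeHost(host_ptr);
    return 0;
}
```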

Outline
- Introduction
- Background
- System Design
- Experiment Results
- Conclusions and Future Work

Task Scheduling Challenges
- Static scheduling?
  - Relative speeds of the CPU and GPU vary, so a partitioning ratio cannot be determined in advance
- Dynamic scheduling
  - Kernel-relaunch based: a device exits its kernel and gets tasks under a pthread lock, incurring high kernel launch overhead (sketched below)
  - Work stealing: retrieves tasks from within the kernel, but relies on the locking mechanism provided by coherent shared memory
    - Access to zero-copy memory by the CPU and GPU is non-coherent and non-uniform
    - Locking on this memory between the CPU and GPU is not correctly supported
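For reference, a hedged sketch of the kernel-relaunch scheme criticized above: the GPU's host thread grabs a task block under a pthread lock shared with CPU workers and re-launches the kernel for every block, so launch and synchronization overhead is paid repeatedly. All names and sizes are illustrative.

```cuda
// Hypothetical sketch of kernel-relaunch-based dynamic scheduling.
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void process_block(int first_task, int num_tasks) {
    // application-specific task processing would go here
}

static pthread_mutex_t offset_lock = PTHREAD_MUTEX_INITIALIZER;
static int global_offset = 0;

void gpu_host_thread(int total_tasks, int block_size) {
    while (true) {
        pthread_mutex_lock(&offset_lock);         // contended with CPU worker threads
        int first = global_offset;
        global_offset += block_size;
        pthread_mutex_unlock(&offset_lock);
        if (first >= total_tasks) break;
        int count = total_tasks - first;
        if (count > block_size) count = block_size;
        process_block<<<64, 256>>>(first, count); // one launch per task block
        cudaDeviceSynchronize();                  // relaunch + sync overhead repeats
    }
}
```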

We need relaunch-free, locking-free dynamic scheduling methods.

Master-worker Scheduling
[Figure: the scheduler (CPU core 0) exchanges schedule messages (has_task, task_idx, task_size) and worker status (busy/idle) with CPU cores and GPU SMs through zero-copy buffers; each worker maps its assigned task blocks, and the partial results are combined into the output]
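A rough sketch of how the schedule message and worker polling could look in zero-copy memory; only the field names (has_task, task_idx, task_size) come from the slide, the rest of the protocol is an assumption for illustration.

```cuda
// Rough sketch of the master-worker hand-shake over a zero-copy buffer.
struct ScheduleMsg {
    volatile int has_task;   // 1: a task block is ready for this worker
    volatile int task_idx;   // first task of the assigned block
    volatile int task_size;  // number of tasks in the block (0: no more work)
};

// Worker side (e.g., one thread block acting for its SM) polls its own slot,
// so no worker ever competes for a global task offset.
__device__ void worker_loop(ScheduleMsg *my_slot, volatile int *my_status) {
    while (true) {
        *my_status = 0;                        // report idle to the master (core 0)
        while (my_slot->has_task == 0) { }     // spin until the master assigns work
        int first = my_slot->task_idx;
        int count = my_slot->task_size;
        my_slot->has_task = 0;
        if (count == 0) break;                 // master signals completion
        *my_status = 1;                        // busy
        // ... process tasks [first, first + count) and combine partial results ...
    }
}
```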

Master-worker Scheduling
- Locking-free: worker cores do not retrieve tasks actively, so there is no competition on the global task offset
- Relaunch-free: the GPU kernel keeps processing until the end
- Drawback: dedicating a CPU core to scheduling wastes a resource, especially for applications where the CPU is much faster than the GPU

Token-based Scheduling: Basic Idea
- Put the input in a zero-copy buffer
- Use a token to pass the permission for retrieving tasks
  - A worker can retrieve tasks only while it holds the token
  - After retrieving a task block, a worker passes the token to an idle worker
- Mutual exclusion without locking, and no core is dedicated to scheduling

Core-level Token Passing
- The token is first held by CPU Core 0
- CPU Core 0 retrieves a task block from the global task offset
- CPU Core 0 passes the token to an idle core, e.g., CPU Core 2
- CPU Core 2 repeats the same process (see the sketch below)
[Figure: CPU cores and GPU SMs, each marked busy or idle, share a global task offset over the unprocessed tasks; the token moves among them]
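The sketch below illustrates core-level token passing from the perspective of one CPU core; the data structures and names are hypothetical, and the GPU side as well as termination handling are omitted.

```cuda
// Hypothetical sketch of core-level token passing, CPU side only.
// Only the token holder touches the global task offset, so no lock is needed;
// after taking a block it hands the token to a core it observes as idle.
struct CoreState { volatile int idle; };

static volatile int token_holder = 0;         // id of the core/SM holding the token
static volatile int global_task_offset = 0;   // next unassigned task
static CoreState core_state[64];              // one status slot per core/SM

// Called by a worker core; returns the first task of a new block, or -1 if this
// core does not currently hold the token (it checks again on later iterations).
int try_get_block(int my_id, int num_cores, int block_size, int total_tasks) {
    if (token_holder != my_id) return -1;     // not my turn to schedule
    int first = global_task_offset;           // exclusive access: I hold the token
    global_task_offset = first + block_size;
    for (int i = 1; i < num_cores; ++i) {     // pass the token to the next idle core
        int next = (my_id + i) % num_cores;
        if (core_state[next].idle) { token_holder = next; break; }
    }
    return (first < total_tasks) ? first : -1;
}
```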

Core-level Token Passing
- Pros: no core is used exclusively for scheduling; locking-free mutual exclusion; relaunch-free
- Cons: status-checking overhead, and a high frequency of token passing when the number of cores/SMs is large

Device-level Token Passing
- The token is first held by the CPU
- The CPU retrieves a task block, then passes the token to the GPU, which is idle
- The task block retrieved by the CPU is scheduled to CPU cores using intra-device locking
- The GPU repeats the same process (see the sketch below)
[Figure: CPU cores and GPU SMs share a global task offset over the unprocessed tasks; the token moves between the two devices]
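A simplified sketch of the CPU side of device-level token passing; the variable names and block handling are assumptions, and the GPU side (which mirrors this loop inside the kernel) as well as termination handling are omitted.

```cuda
// Hypothetical sketch of device-level token passing, CPU side only.
// The token moves only between the CPU and the GPU; inside a device the block is
// split with an ordinary atomic or lock, which is cheap because blocks are large.
enum Device { CPU_DEV = 0, GPU_DEV = 1 };

static volatile int token = CPU_DEV;          // which device may touch the offset
static volatile int global_task_offset = 0;   // next unassigned task

// Grab a large block for the whole CPU, then hand the token to the GPU.
bool cpu_get_device_block(int device_block, int total_tasks, int *first, int *count) {
    while (token != CPU_DEV) { /* spin until the GPU passes the token back */ }
    *first = global_task_offset;
    global_task_offset = *first + device_block;
    token = GPU_DEV;                          // pass the token to the GPU
    *count = (*first < total_tasks) ? device_block : 0;
    return *count > 0;
}
// CPU worker threads then split [*first, *first + *count) among themselves with a
// cheap intra-device atomic counter; the GPU kernel does the same for its blocks.
```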

Device-level Token Passing
- Token passing is between devices, which reduces status-checking overhead and token-passing frequency
- Intra-device task scheduling is done through locking; the locking frequency is low because tasks are scheduled in relatively large blocks

Outline
- Introduction
- Background
- System Design
- Experiment Results
- Conclusions and Future Work

Experimental Setup
- Platforms
  - Coupled CPU-GPU: AMD Fusion APU A3850, quad-core AMD CPU + HD6550 GPU (5 x 80 = 400 cores)
  - Decoupled CPU-GPU: 12-core Intel Xeon 5650 CPU + 2 NVIDIA M2070 (Fermi) cards (14 x 32 = 448 cores each)
- Applications
  - Generalized reductions: Kmeans, Gridding Kernel
  - Stencil computation: Jacobi, Sobel Filter
  - Irregular reductions: Moldyn, Euler

Overall Performance of Different Scheduling Methods (Coupled CPU-GPU)
Device-level token passing is the fastest, less than 8% slower than the hand-tuned optimal partitioning.

Overall Performance of Different Scheduling Methods (Decoupled CPU-GPU)
Device-level token passing is the fastest, less than 5% slower than the hand-tuned optimal partitioning.

Core-level Token Passing vs. Device-level Token Passing
Results are from the AMD Fusion APU.
- First set of curves (red lines): starts with GPU-only execution and gradually adds CPU cores
- Second set of curves (blue lines): starts with CPU-only execution and gradually adds GPU cores

Scaling on Multiple GPUs
- Scale from CPU+1GPU to CPU+2GPU for each scheduling scheme; the token passing method used here is device-level token passing
- Token passing is the fastest; the master-worker scheme becomes relatively better as the number of GPUs increases, since sacrificing a single CPU core as the master matters less

Comparison with StarPU and OmpSs
We use the device-level token passing scheme to compare against StarPU and OmpSs. Our scheduling scheme achieves 1.08x to 1.21x speedups over the faster of the two.

Outline
- Introduction
- Background
- System Design
- Experiment Results
- Conclusions and Future Work

Conclusions and Future Work
- Conclusions
  - Exploiting parallelism across heterogeneous cores with locking-free and relaunch-free scheduling schemes
  - Device-level token passing has very low scheduling overhead, makes efficient use of both CPU and GPU cores, and outperforms StarPU and OmpSs
- Future work
  - Extend to other emerging intra-node architectures, e.g., MIC
  - Extend to clusters where each node has a collection of heterogeneous cores

Thank you! Questions?
Contacts:
Linchuan Chen chenlinc@cse.ohio-state.edu
Xin Huo huox@cse.ohio-state.edu
Gagan Agrawal agrawal@cse.ohio-state.edu