Fine-grain Task Aggregation and Coordination on GPUs

Fine-grain Task Aggregation and Coordination on GPUs
Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§
ISCA, June 16, 2014

Executive Summary
- SIMT languages (e.g., CUDA and OpenCL) restrict GPU programmers to regular parallelism
  - Compare to Pthreads, Cilk, MapReduce, TBB, etc.
- Goal: enable irregular parallelism on GPUs
  - Why? More GPU applications
  - How? Fine-grain task aggregation
  - What? Cilk on GPUs

Outline
- Background: GPUs, Cilk, the channel abstraction
- Our work: Cilk on channels, channel design
- Results/Conclusion

GPUs Today
- GPU tasks are scheduled by a control processor (CP): a small, in-order programmable core
- Today's GPU abstractions are coarse-grain
  + Maps well to SIMD hardware
  - Limits fine-grain scheduling
[Figure: a GPU with SIMD units and CPs sharing system memory]

Cilk Background
Cilk extends C for divide-and-conquer parallelism. It adds two keywords:
- spawn: schedule a thread to execute a function
- sync: wait for prior spawns to complete

    int fib(int n) {
      if (n <= 2) return 1;
      int x = spawn fib(n - 1);
      int y = spawn fib(n - 2);
      sync;
      return (x + y);
    }

Prior Work on Channels
- The CP, here called the aggregator (agg), manages channels
- Channels are finite task queues, except:
  - User-defined scheduling
  - Dynamic aggregation
  - One consumption function per channel
- Dynamic aggregation enables "CPU-like" scheduling abstractions on GPUs
[Figure: a GPU with SIMD units and aggregators; channels live in system memory]
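
To make the abstraction concrete, below is a minimal single-threaded CPU sketch of such a queue. All names (channel_t, channel_send, channel_flush) and the bulk-dispatch policy are illustrative assumptions, not the paper's actual API; the point is only that producers enqueue fine-grain tasks and one consumption function later drains them as a batch.

    /* Hypothetical sketch of a channel: a finite, array-based task
       queue bound to exactly one consumption function. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef void (*consume_fn)(void *tasks, int count);

    typedef struct {
        void      *buf;        /* array-based task storage */
        int        task_size;  /* bytes per task */
        int        capacity;   /* finite queue bound */
        int        count;
        consume_fn consume;    /* the one consumption function */
    } channel_t;

    channel_t channel_make(int task_size, int capacity, consume_fn fn) {
        channel_t ch = { malloc((size_t)task_size * capacity),
                         task_size, capacity, 0, fn };
        return ch;
    }

    int channel_send(channel_t *ch, const void *task) {
        if (ch->count == ch->capacity) return -1;  /* full: caller retries */
        memcpy((char *)ch->buf + (size_t)ch->count * ch->task_size,
               task, (size_t)ch->task_size);
        ch->count++;
        return 0;
    }

    /* The aggregator dispatches the whole batch at once, turning many
       fine-grain sends into one coarse-grain launch. */
    void channel_flush(channel_t *ch) {
        if (ch->count > 0) ch->consume(ch->buf, ch->count);
        ch->count = 0;
    }

    static void print_tasks(void *tasks, int count) {
        int *args = tasks;
        for (int i = 0; i < count; i++) printf("fib(%d)\n", args[i]);
    }

    int main(void) {
        channel_t ch = channel_make(sizeof(int), 64, print_tasks);
        for (int n = 5; n > 2; n--) channel_send(&ch, &n);
        channel_flush(&ch);  /* one aggregated dispatch of three tasks */
        free(ch.buf);
        return 0;
    }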

Outline
- Background: GPUs, Cilk, the channel abstraction
- Our work: Cilk on channels, channel design
- Results/Conclusion

Enable Cilk on GPUs via Channels: Step 1
Cilk routines are split at each sync into sub-routines:

Original:

    int fib(int n) {
      if (n <= 2) return 1;
      int x = spawn fib(n - 1);
      int y = spawn fib(n - 2);
      sync;
      return (x + y);
    }

"Pre-sync" routine and its "continuation":

    int fib(int n) {
      if (n <= 2) return 1;
      int x = spawn fib(n - 1);
      int y = spawn fib(n - 2);
    }

    int fib_cont(int x, int y) {
      return (x + y);
    }
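
As a sanity check on this transformation, the sketch below runs it on the CPU: fib and fib_cont become explicit task queues, and a join counter stands in for the dependency tracking the runtime would do. All of the data structures here (cont, pending, the LIFO pop order) are illustrative assumptions, not the paper's implementation.

    /* CPU-only simulation of split Cilk routines: "pre-sync" fib tasks
       and "continuation" fib_cont records in explicit work queues. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct cont {
        int x, y;             /* results delivered by the two children */
        int pending;          /* children that have not finished yet */
        struct cont *parent;  /* continuation to deliver our result to */
        int slot;             /* 0: fill parent->x, 1: fill parent->y */
    } cont;

    typedef struct { int n; cont *parent; int slot; } fib_task;

    #define MAXQ 4096
    static fib_task fibq[MAXQ];  static int fib_n;
    static cont    *contq[MAXQ]; static int cont_n;

    static void deliver(cont *p, int slot, int v);

    static void push_fib(int n, cont *p, int slot) {
        fibq[fib_n++] = (fib_task){ n, p, slot };
    }

    /* "pre-sync" half: resolve the base case or spawn two children
       plus a continuation record that joins them. */
    static void run_fib(fib_task t) {
        if (t.n <= 2) { deliver(t.parent, t.slot, 1); return; }
        cont *c = malloc(sizeof *c);
        *c = (cont){ 0, 0, 2, t.parent, t.slot };
        push_fib(t.n - 1, c, 0);
        push_fib(t.n - 2, c, 1);
    }

    /* A child finished: record its result; once both results have
       arrived, the continuation becomes runnable. */
    static void deliver(cont *p, int slot, int v) {
        if (!p) { printf("result: %d\n", v); return; }
        if (slot == 0) p->x = v; else p->y = v;
        if (--p->pending == 0) contq[cont_n++] = p;
    }

    /* "continuation" half: the code after the original sync. */
    static void run_cont(cont *c) {
        deliver(c->parent, c->slot, c->x + c->y);
        free(c);
    }

    int main(void) {
        push_fib(10, NULL, 0);  /* prints "result: 55" */
        while (fib_n > 0 || cont_n > 0) {
            if (fib_n > 0) run_fib(fibq[--fib_n]);
            else           run_cont(contq[--cont_n]);
        }
        return 0;
    }

Popping LIFO makes this sketch depth-first; Step 2 below orders the same work breadth-first so the GPU's lanes fill quickly.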

Enable Cilk on GPUs via Channels: Step 2
Channels are instantiated for breadth-first traversal:
- Quickly populates the GPU's tens of thousands of lanes
- Facilitates coarse-grain dependency management
[Figure: fib and fib_cont channels organized as a stack, one level per recursion depth; the legend distinguishes "pre-sync" tasks (ready vs. done) from "continuation" tasks, and edges mark "task A spawned task B" and "task B depends on task A"]

Bound Cilk’s Memory Footprint Bound memory to the depth of the Cilk tree by draining channels closer to the base case The amount of work generated dynamically is not known a priori We propose that GPUs allow SIMT threads to yield Facilitates resolving conflicts on shared resources like memory 5 4 3 2 1

Channel Implementation
Our design accommodates SIMT access patterns:
+ array-based
+ lock-free
+ non-blocking
(See paper for details.)
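
The actual design is in the paper; purely to illustrate why "array-based, lock-free, non-blocking" suits SIMT, here is the standard fetch-add slot-reservation idiom, in which every lane of a wavefront can claim a slot with one atomic and no lane spins on a lock. This is a simplification: a real channel must also publish completed writes to the consumer (e.g., per-slot ready flags) and recycle slots between batches.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define CAP 1024

    typedef struct {
        _Atomic int next;        /* next free slot */
        int         tasks[CAP];  /* array-based storage */
    } slot_queue;

    /* Non-blocking: fails instead of waiting when the queue is full. */
    bool try_push(slot_queue *q, int task) {
        int slot = atomic_fetch_add_explicit(&q->next, 1,
                                             memory_order_relaxed);
        if (slot >= CAP) return false;  /* full; caller may yield and retry */
        q->tasks[slot] = task;
        return true;
    }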

Outline
- Background: GPUs, Cilk, the channel abstraction
- Our work: Cilk on channels, channel design
- Results/Conclusion

Methodology
- Implemented Cilk on channels on a simulated APU
- Caches are sequentially consistent
- Aggregator schedules Cilk tasks

Cilk Scales with the GPU Architecture
More compute units → faster execution

Conclusion
- We observed that dynamic aggregation enables new GPU programming languages and abstractions
- We enabled dynamic aggregation by extending the GPU's control processor to manage channels
- We found that breadth-first scheduling works well for Cilk on GPUs
- We proposed that GPUs allow SIMT threads to yield, in support of breadth-first scheduling
- Future work should focus on how the control processor can enable more GPU applications

Backup

Divergence and Channels
Branch divergence and memory divergence:
+ Data in channels: good
- Pointers to data in channels: bad
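
A hypothetical pair of task layouts shows why: when a wavefront consumes consecutive channel slots, by-value payloads give adjacent lanes adjacent addresses (coalesced loads), while pointer payloads make adjacent lanes chase arbitrary addresses.

    typedef struct { float a, b; } task_by_value;  /* good: slot i's data
                                                      sits in slot i */
    typedef struct { float *data; } task_by_ptr;   /* bad: slot i points
                                                      anywhere in memory */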

GPU NOT Blocked on Aggregator

GPU Cilk vs. Standard GPU Workloads
+ Cilk is more succinct than SIMT languages
- Channels trigger more GPU dispatches

              LOC reduction   Dispatch rate   Speedup
    Strassen  42%             13x             1.06
    Queens    36%             12.5x           0.98

Same performance, easier to program.

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.