Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Jason Jong Kyu Park (University of Michigan, Ann Arbor), Yongjun Park (Hongik University), and Scott Mahlke (University of Michigan, Ann Arbor)
GPUs in Modern Computer Systems
- The GPU is now a default component in modern computer systems: servers, desktops, laptops, and mobile devices
- It offloads data-parallel kernels written in CUDA, OpenCL, etc.
GPU Execution Model
- Threads are grouped into thread blocks; thread blocks are scheduled onto streaming multiprocessors (SMs)
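A minimal CUDA sketch of this hierarchy (ours, not from the talk): each thread handles one element, threads are grouped into thread blocks, and the hardware schedules thread blocks onto SMs independently.

    // Each thread computes one element; blockIdx/threadIdx expose the
    // thread/thread-block hierarchy that the hardware scheduler works with.
    __global__ void scale(const float *in, float *out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)
            out[idx] = 2.0f * in[idx];
    }

    // Launch with 256-thread blocks, enough blocks to cover n elements:
    //   scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);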
Multitasking Needs in GPUs
- Example: augmented reality combines 3D rendering (graphics) with data-parallel vision algorithms (e.g., Bodytrack)
- Multiple kernels must share a single GPU
Traditional Context Switching
- GTX 780 (Kepler): 256 kB of registers + 48 kB of shared memory per SM; 288.4 GB/s of memory bandwidth shared by 12 SMs
- Saving and restoring an SM's context takes ~88 us per SM, versus ~1 us for a CPU context switch
- Timeline: when K2 launches, K1's context is saved and K2's context is loaded before K2 can run
Challenge 1: Preemption Latency
- ~88 us per SM is too long for latency-critical kernels
Challenge 2: Throughput Overhead
- While the context is being saved and loaded, no useful work is done on the SM
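A back-of-envelope check (our own sketch, not from the talk) of the raw data-movement cost implied by those numbers; the quoted ~88 us per SM presumably also covers overheads beyond the pure context transfer that this lower bound ignores.

    /* Bandwidth-bound lower bound on one SM's context switch time
       (illustrative host code; not from the paper). */
    #include <stdio.h>

    int main(void) {
        double context_bytes = (256.0 + 48.0) * 1024.0;    /* registers + shared memory per SM */
        double bw_per_sm     = 288.4e9 / 12.0;             /* GTX 780 bandwidth split across 12 SMs */
        double one_way_s     = context_bytes / bw_per_sm;  /* save only */
        printf("save: %.1f us, save+load: %.1f us\n",
               one_way_s * 1e6, 2.0 * one_way_s * 1e6);    /* ~13 us and ~26 us */
        return 0;
    }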
Objective of This Work
- [Chart: preemption cost (y) vs. thread block progress, 0% to 100% (x)]
- Prior work (context Switch) pays the same preemption cost regardless of how far a thread block has progressed
SM Draining [Tanasic '14]
- Stop issuing new thread blocks to the SM ("no issue") and wait for K1's in-flight thread blocks to finish; K2 then launches on the drained SM
- No context is saved, but the preemption latency depends on how long the running thread blocks take to complete
Chimera Opportunity
- [Chart: preemption cost vs. thread block progress] Switch costs the same at any progress point, while Drain's cost falls as a thread block nears completion
- Opportunity: Drain when a thread block is nearly finished, Switch otherwise
SM Flushing
- Instant preemption: throw away what was running on the SM and re-execute it from the beginning later (idempotent kernels only)
- A kernel is idempotent if the state it reads is not modified before re-execution
- Global memory is the only state observable outside the GPU, so idempotence is judged on global-memory reads and writes
Finding Relaxed Idempotence
- Idempotent regions are detected by the compiler from the CUDA source:

    __global__ void kernel_cuda(const float *in, float *out, float *inout) {
        ... = inout[idx];    // reads state that the kernel later overwrites
        ...
        atomicAdd(...);      // atomic operation: ends the idempotent region
        ...
        out[idx]   = ...;    // global overwrite of write-only output
        inout[idx] = ...;    // global overwrite of previously read state
    }

- The region from kernel entry up to the first atomic operation (or the first overwrite of state the kernel has read) is idempotent: it can be flushed and safely re-executed
Chimera
- [Chart: preemption cost vs. thread block progress] The optimal choice follows the lower envelope of the three techniques:
- Flush near the beginning, context Switch in the middle, and Drain near the end of a thread block's execution
Independent Thread Block Execution
- There is no shared state between SMs or between thread blocks
- Therefore, each SM or thread block can be preempted with a different preemption technique
Chimera: Collaborative Preemption
- Based on each thread block's progress, Chimera preempts collaboratively across the GPU: it can Flush one SM, Drain another, and context Switch a third within a single preemption event
Architecture
- Two-level scheduler: a kernel scheduler plus a thread block scheduler
- Kernel scheduler: sets the SM scheduling policy (how many SMs will each kernel have?) and produces a TB-to-kernel mapping
- Chimera: decides which SM will be preempted and which preemption technique to use
- Thread block scheduler: decides which thread block will be scheduled next from a per-kernel thread block queue (preempted thread blocks re-enter their queue), and carries out the preemption decision
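A structural sketch of this two-level organization (our own illustration; type and field names such as TBQueue and sm_owner_kernel are hypothetical, not from the paper).

    #define NUM_SMS     12
    #define MAX_KERNELS 4
    #define MAX_TBS     1024

    enum PreemptKind { FLUSH, SWITCH, DRAIN };

    typedef struct {          /* per-kernel FIFO of thread blocks awaiting issue; */
        int head, tail;       /* preempted thread blocks are re-enqueued here    */
        int tb_ids[MAX_TBS];
    } TBQueue;

    typedef struct {
        int     sm_owner_kernel[NUM_SMS];  /* SM scheduling policy: which kernel owns each SM */
        TBQueue tb_queue[MAX_KERNELS];     /* thread block scheduler state */
    } SchedulerState;

    /* Kernel scheduler: recomputes sm_owner_kernel[] when a kernel arrives.
       Chimera: picks the victim SM and a PreemptKind for it.
       Thread block scheduler: pops the next TB for the kernel owning a free SM. */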
Cost Estimation: Preemption Latency
- Switch: context size / (memory bandwidth / # of SMs)
- Drain: progress is measured in instructions at warp granularity; (average thread block execution instructions - instructions executed so far) x CPI = estimated preemption latency
- Flush: zero preemption latency
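A hedged sketch of these latency estimators (our own code; names are illustrative). The drain estimate converts remaining instructions times CPI into time via the core clock, a step the slide leaves implicit.

    double switch_latency(double ctx_bytes, double mem_bw, int num_sms) {
        return ctx_bytes / (mem_bw / num_sms);          /* context / per-SM bandwidth share */
    }

    double drain_latency(double avg_insts, double progress_insts,
                         double cpi, double clock_hz) {
        double remaining = avg_insts - progress_insts;  /* warp-granularity counts */
        return remaining * cpi / clock_hz;              /* remaining insts x CPI, in seconds */
    }

    double flush_latency(void) {
        return 0.0;                                     /* flushing preempts instantly */
    }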
Cost Estimation: Throughput Overhead
- Switch: IPC x preemption latency x 2, doubled because the context is both saved and loaded
- Drain: instructions measured at warp granularity; after the most-progressed thread block in the same SM finishes, the SM's idle issue slots count as overhead until it fully drains
- Flush: the instructions already executed (at warp granularity) are thrown away and counted as overhead
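The matching throughput-overhead estimators, sketched under the same caveats; the drain formula follows our reading of the slide (idle issue slots accumulate as thread blocks finish at different times, until the SM fully drains).

    double switch_overhead(double ipc, double latency_cycles) {
        return ipc * latency_cycles * 2.0;   /* x2: context is saved AND loaded */
    }

    double flush_overhead(double executed_insts) {
        return executed_insts;               /* flushed work must be re-executed */
    }

    double drain_overhead(const double *progress_insts, int num_tbs,
                          double avg_insts) {
        double max_remaining = 0.0, lost = 0.0;
        for (int i = 0; i < num_tbs; ++i) {  /* remaining work per thread block */
            double rem = avg_insts - progress_insts[i];
            if (rem > max_remaining) max_remaining = rem;
        }
        for (int i = 0; i < num_tbs; ++i)    /* idle slots until the SM drains */
            lost += max_remaining - (avg_insts - progress_insts[i]);
        return lost;
    }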
Chimera Algorithm
- For every SM, estimate the preemption latency and throughput overhead of Flush, Switch, and Drain from each thread block's progress
- Pick as the preemption victim the SM/technique pair with the least throughput overhead among those that meet the preemption latency constraint
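A selection-loop sketch of this policy (ours, not the paper's implementation; the PreemptKind enum is redeclared for self-containment, and per-technique estimates are assumed precomputed by the estimators sketched above).

    enum PreemptKind { FLUSH, SWITCH, DRAIN };

    typedef struct {
        int    idempotent;   /* flush is legal only for idempotent kernels */
        double latency[3];   /* per-technique preemption latency estimates */
        double overhead[3];  /* per-technique throughput overhead estimates */
    } SMState;

    int pick_victim(const SMState *sms, int num_sms, double latency_limit,
                    int *chosen_kind) {
        int best_sm = -1;
        double best_overhead = 1e300;
        for (int s = 0; s < num_sms; ++s)
            for (int k = FLUSH; k <= DRAIN; ++k) {
                if (k == FLUSH && !sms[s].idempotent)
                    continue;                           /* flush needs idempotence */
                if (sms[s].latency[k] > latency_limit)
                    continue;                           /* latency constraint */
                if (sms[s].overhead[k] < best_overhead) {
                    best_overhead = sms[s].overhead[k]; /* least throughput overhead */
                    best_sm = s;
                    *chosen_kind = k;
                }
            }
        return best_sm;   /* -1 if no SM/technique meets the constraint */
    }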
Experimental Setup
- Simulator: GPGPU-Sim v3.2.2, modeling the Fermi architecture (up to 32,768 registers (128 kB) and up to 48 kB shared memory per SM)
- Workloads: 14 benchmarks from the Nvidia SDK, Parboil, and Rodinia
  - GPGPU benchmark + synthetic benchmark: the synthetic benchmark mimics a periodic real-time task (e.g., a graphics kernel) with a 1 ms period and 200 us execution time
  - GPGPU benchmark + GPGPU benchmark
- Baseline: non-preemptive first-come first-served
Preemption Latency Violations
- GPGPU benchmark + synthetic benchmark, with a 15 us preemption latency constraint for the real-time task
- Only 0.2% of preemptions violate the constraint
- Violations occur for non-idempotent kernels with short thread block execution times, where Chimera estimates a shorter preemption latency than actually occurs
System Throughput
- Case study: LUD + other GPGPU benchmarks; LUD has many kernel launches with varying numbers of thread blocks
- For this case, Drain achieves a lower (better) average normalized turnaround time than Chimera: 5.17x for Drain vs. 5.50x for Chimera
Preemption Technique Distribution
- [Chart: fraction of preemptions handled by each technique for the GPGPU benchmark + synthetic benchmark pairs]
Summary
- Context switching can incur high overhead on GPUs: long preemption latency and lost throughput
- Chimera:
  - Flushing provides instant preemption for idempotent kernels
  - Collaborative preemption combines Flush + Switch + Drain
  - Almost always meets the preemption latency constraint: only 0.2% violations (caused by under-estimated preemption latency)
  - 5.5x ANTT improvement and 12.2% STP improvement for GPGPU benchmark + GPGPU benchmark combinations
Questions?