CS6235 L15: Dynamic Scheduling

Administrative
STRSM due March 23 (EXTENDED)
Midterm coming
-In class March 28, can bring one page of notes
-Review notes, readings and review lecture
-Prior exams are posted
Design Review
-Intermediate assessment of progress on project, oral and short
-In class on April 4
Final projects
-Poster session, April 23 (dry run April 18)
-Final report, May 3

Schedule of Remaining Lectures
March 19 (today): Dynamic Scheduling
March 21: Sorting
March 26: Midterm Review
April 2: Tree Algorithms
April 4: Design Reviews
April 9: Fast Fourier Transforms or TBD
April 11: TBD
April 16: OpenCL
April 18: Poster Dry Run
April 23: Public Poster Presentation

Sources for Today's Lecture
"On Dynamic Load Balancing on Graphics Processors," D. Cederman and P. Tsigas, Graphics Hardware (2008). ancing-GH08.pdf
"A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems," P. Tsigas and Y. Zhang, SPAA 2001 (more on the lock-free queue)
Intel Threading Building Blocks (more on task stealing)

Motivation for Next Few Lectures
Goal is to discuss prior solutions to topics that might be useful to your projects
-Dynamic scheduling (TODAY)
-Tree-based algorithms
-Sorting
-Combining CUDA and OpenGL to display results of computation
-Combining CUDA with MPI for cluster execution (6-function MPI)
-Other topics of interest?
End of semester
-CUDA 4 features
-OpenCL

Motivation: Dynamic Task Queue
Mostly we have talked about how to partition large arrays to perform identical computations on different portions of the arrays
-Sometimes a little global synchronization is required
What if the work is very irregular in its structure?
-May not produce a balanced load
-Data representation may be sparse
-Work may be created on the GPU in response to prior computation

Dynamic Parallel Computations
These computations do not necessarily map well to a GPU, but they are also hard to do on conventional architectures
-Overhead associated with making scheduling decisions at run time
-May create a bottleneck (centralized scheduler? centralized work queue?)
-Interaction with locality (if a computation is performed on an arbitrary processor, we may need to move data from one processor to another)
Typically, there is a tradeoff between how balanced the load is and these other concerns.

Dynamic Task Queue, Conceptually
[Figure, built up over three slides: processors 0 through N-1 draw work from a set of task queues. Processor 0 requests a work assignment; the first task is assigned to processor 0 and the task queue is updated.]
Just to make this work correctly, what has to happen? Topic of today's lecture!

Constructing a Dynamic Task Queue on GPUs
Must use some sort of atomic operation for global synchronization to enqueue and dequeue tasks
Numerous decisions about how to manage task queues
-One on every SM?
-A global task queue?
-The former can be made far more efficient but is also more prone to load imbalance
Many choices of how to do synchronization
-Optimize for properties of the task queue (e.g., very large task queues can use simpler mechanisms)
Static vs. dynamic scheduling
-In batches vs. one-by-one
All proposed approaches have a statically allocated task list that must be as large as the maximum number of waiting tasks

Suggested Synchronization Mechanism

    // also unsigned int and long long versions
    int atomicCAS(int* address, int compare, int val);

Reads the 32-bit or 64-bit word old located at address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap). 64-bit words are only supported for global memory.

    __device__ void getLock(int *lockVarPtr)
    {
        while (atomicCAS(lockVarPtr, 0, 1) == 1);
    }
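The slide shows only the acquire side. A matching release (a minimal sketch; releaseLock, lockVar, and criticalSectionExample are illustrative names, not from the slides) simply resets the flag, with a memory fence so the critical section's writes become visible before the lock is observed as free:

    __device__ int lockVar = 0;   // illustrative global lock word

    __device__ void releaseLock(int *lockVarPtr)
    {
        __threadfence();          // make critical-section writes visible
        atomicExch(lockVarPtr, 0);
    }

    __global__ void criticalSectionExample(int *counter)
    {
        // Let one thread per block contend: if a whole warp spins in
        // getLock, the winning lane may never reconverge to release
        // the lock on pre-Volta hardware (SIMT deadlock).
        if (threadIdx.x == 0) {
            getLock(&lockVar);
            *counter += 1;        // critical section
            releaseLock(&lockVar);
        }
    }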

Synchronization
Blocking
-Uses mutual exclusion to only allow one process at a time to access the object.
Lock-free
-Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
Wait-free
-Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
Slide source: Daniel Cederman

Load Balancing Methods
-Static Blocking Task Queue
-Dynamic Blocking Task Queue
-Non-blocking Task Queue
-Task Stealing
Slide source: Daniel Cederman

Blocking Static Task Queue (Simplest)

    function DEQUEUE(q, id)
        return q.in[id]

    function ENQUEUE(q, task)
        localtail ← atomicAdd(&q.tail, 1)
        q.out[localtail] ← task

    function NEWTASKCNT(q)
        q.in, q.out, oldtail, q.tail ← q.out, q.in, q.tail, 0
        return oldtail

    procedure MAIN(taskinit)
        q.in, q.out ← newarray(maxsize), newarray(maxsize)
        q.tail ← 0, Tbid ← 0
        ENQUEUE(q, taskinit)
        blockcnt ← NEWTASKCNT(q)
        while blockcnt != 0 do
            run blockcnt blocks in parallel
                t ← DEQUEUE(q, ++Tbid)
                subtasks ← doWork(t)
                for each nt in subtasks do
                    ENQUEUE(q, nt)
            blockcnt ← NEWTASKCNT(q)

Two lists: q.in is read-only and not synchronized; q.out is write-only and updated atomically. When NEWTASKCNT is called at the end of a major task scheduling phase, q.in and q.out are swapped. Synchronization is required to insert tasks, but every insert completes in a bounded number of steps (wait-free).
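As a concrete illustration, here is a minimal CUDA sketch of the device side of this scheme. The Task type, MAXSIZE, and the struct/function names are assumptions of the sketch, not the paper's implementation:

    #define MAXSIZE 4096
    typedef int Task;                  // illustrative task representation

    struct StaticQueue {
        Task *in;    // read-only this phase; block b reads in[b]
        Task *out;   // write-only this phase; appended atomically
        int   tail;  // next free slot in out
    };

    // Wait-free insert: atomicAdd hands each caller a unique slot,
    // so every enqueue completes in a bounded number of steps.
    __device__ void enqueue(StaticQueue *q, Task t)
    {
        int slot = atomicAdd(&q->tail, 1);
        q->out[slot] = t;
    }

    __global__ void runPhase(StaticQueue *q)
    {
        Task t = q->in[blockIdx.x];    // DEQUEUE: no synchronization needed
        // ... doWork(t), calling enqueue(q, nt) for each new subtask ...
    }

Between kernel launches the host swaps the two lists and resets the tail, mirroring NEWTASKCNT above.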

Blocking Static Task Queue
[Figure: queue state before and after ENQUEUE; new tasks are appended at q.tail, and Tbid marks the next task a block will dequeue.]

Blocking Dynamic Task Queue

    function DEQUEUE(q)
        while atomicCAS(&q.lock, 0, 1) == 1 do;
        if q.beg != q.end then
            q.beg++
            result ← q.data[q.beg]
        else
            result ← NIL
        q.lock ← 0
        return result

    function ENQUEUE(q, task)
        while atomicCAS(&q.lock, 0, 1) == 1 do;
        q.end++
        q.data[q.end] ← task
        q.lock ← 0

Use a lock for both adding and deleting tasks from the queue. All other threads block waiting for the lock. Potentially very inefficient, particularly for fine-grained tasks.
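A CUDA rendering of the same locking scheme might look as follows; QSIZE, the Task type, and the fence-then-release detail are assumptions of this sketch rather than the paper's code:

    #define QSIZE 4096
    #define NIL (-1)
    typedef int Task;

    struct BlockingQueue {
        int  lock;        // 0 = free, 1 = held
        int  beg, end;    // occupied slots are data[beg+1 .. end]
        Task data[QSIZE];
    };

    __device__ Task dequeue(BlockingQueue *q)
    {
        Task result = NIL;
        while (atomicCAS(&q->lock, 0, 1) == 1);   // spin for the lock
        if (q->beg != q->end) {
            q->beg++;
            result = q->data[q->beg];
        }
        __threadfence();                          // publish queue updates
        atomicExch(&q->lock, 0);                  // release
        return result;
    }

    __device__ void enqueue(BlockingQueue *q, Task t)
    {
        while (atomicCAS(&q->lock, 0, 1) == 1);
        q->end++;
        q->data[q->end] = t;
        __threadfence();
        atomicExch(&q->lock, 0);
    }

As with getLock earlier, only one thread per block should contend for the lock to avoid SIMT-divergence deadlock.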

Blocking Dynamic Task Queue
[Figure: queue state before and after each operation; ENQUEUE advances q.end, DEQUEUE advances q.beg.]

Lock-free Dynamic Task Queue

    function DEQUEUE(q)
        oldbeg ← q.beg
        lbeg ← oldbeg
        while (task ← q.data[lbeg]) == NIL do
            lbeg++
        if atomicCAS(&q.data[lbeg], task, NIL) != task then
            restart
        if lbeg mod x == 0 then
            atomicCAS(&q.beg, oldbeg, lbeg)
        return task

    function ENQUEUE(q, task)
        oldend ← q.end
        lend ← oldend
        while q.data[lend] != NIL do
            lend++
        if atomicCAS(&q.data[lend], NIL, task) != NIL then
            restart
        if lend mod x == 0 then
            atomicCAS(&q.end, oldend, lend)

Idea: at least one thread will succeed in adding or removing a task from the queue. Optimization: only update the beginning and end pointers with atomicCAS every x elements.
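Below is a hedged CUDA sketch of the lock-free DEQUEUE, reusing the Task, NIL, and QSIZE conventions from the previous sketch. The bounded scan and the X parameter are assumptions of the sketch; the paper's code handles wraparound and emptiness more carefully:

    #define X 8   // lazily publish the head pointer every X slots

    struct LockFreeQueue {
        int  beg, end;       // lazily updated bounds (hints, not exact)
        Task data[QSIZE];    // NIL marks a free or consumed slot
    };

    __device__ Task dequeue(LockFreeQueue *q)
    {
        for (;;) {                                 // restart after a lost race
            int oldbeg = q->beg;
            int lbeg = oldbeg;
            Task task = NIL;
            while (lbeg < q->end && (task = q->data[lbeg]) == NIL)
                lbeg++;                            // skip consumed slots
            if (task == NIL)
                return NIL;                        // queue looks empty
            // Claim the slot; exactly one contender's CAS succeeds, so at
            // least one thread always makes progress (lock-free).
            if (atomicCAS(&q->data[lbeg], task, NIL) == task) {
                if (lbeg % X == 0)                 // amortized head update
                    atomicCAS(&q->beg, oldbeg, lbeg);
                return task;
            }
        }
    }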

Task Stealing
No code provided in paper
Idea:
-A set of independent task queues
-When a task queue becomes empty, it goes out to other task queues to find available work
-Lots and lots of engineering needed to get this right
-Best implementations of this are in Intel Threading Building Blocks and Cilk
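Since the paper gives no code, the following is purely a conceptual sketch: each block first drains its own queue, then probes the others round-robin, reusing the LockFreeQueue dequeue above (which returns NIL when a queue looks empty):

    __device__ Task getWork(LockFreeQueue *queues, int nqueues)
    {
        // Try the block's own queue first (good locality)...
        Task t = dequeue(&queues[blockIdx.x % nqueues]);
        // ...then attempt to steal from the other queues in turn.
        for (int i = 1; i < nqueues && t == NIL; ++i)
            t = dequeue(&queues[(blockIdx.x + i) % nqueues]);
        return t;
    }

Production implementations such as TBB and Cilk instead use per-worker double-ended queues, where the owner pops from one end and thieves steal from the other; that is the engineering effort the slide alludes to.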

General Issues
One or multiple task queues?
Where does the task queue reside?
-If possible, in shared memory
-Actual tasks can be stored elsewhere, perhaps in global memory
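As an illustration of that shared-vs-global split (entirely a sketch; the queue size, localQueue, and localCount names are made up), the per-block queue bookkeeping can live in fast shared memory while the task payloads stay in global memory and are referenced by index:

    __global__ void worker(Task *globalTasks)
    {
        __shared__ int localQueue[64];   // indices into globalTasks
        __shared__ int localCount;
        if (threadIdx.x == 0)
            localCount = 0;
        __syncthreads();
        // ... threads claim slots with atomicAdd(&localCount, 1) and
        // store task indices in localQueue; globalTasks[idx] is only
        // touched when a task actually executes ...
    }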

Remainder of Paper
Octree partitioning of a particle system used as the example application
A comparison of 4 implementations
-Figures 2 and 3 examine two different GPUs
-Figures 4 and 5 look at two different particle distributions