L16: Dynamic Task Queues and Synchronization


Administrative

Midterm
- In class, open notes
- Currently scheduled for Wednesday, March 31
- Some students have requested we move it due to another exam in the previous class
- Proposed new date: April 5. Vote?

Project Feedback (early afternoon)

Design Review
- Intermediate assessment of progress on project (next slide)
- Due Monday, April 12, 5PM

Final projects
- Poster session, April 28 (dry run April 26)
- Final report, May 5

Design Reviews

Goal is to see a solid plan for each project and make sure projects are on track:
- Plan to evolve the project so that results are guaranteed
- Show that at least one thing is working
- Explain how work is being divided among team members

Major suggestions from proposals:
- Project complexity – break it down into smaller chunks with an evolutionary strategy
- Add references – what has been done before? A known algorithm? A GPU implementation?
- In some cases, proposals claim no communication is needed, but it seems necessary to me

Design Reviews

Oral, 10-minute Q&A session (April 14 in class, April 13/14 office hours, or by appointment)
- Each team member presents one part
- Team should identify a “lead” to present the plan

Three major parts:
- Overview – define the computation and its high-level mapping to the GPU
- Project plan – the pieces and who is doing what; what is done so far? (Make sure something is working by the design review)
- Related work – prior sequential or parallel algorithms/implementations, and prior GPU implementations (or similar computations)

Submit slides and a written document revising the proposal that covers these parts and cleans up anything missing from the proposal.

Final Project Presentation

Dry run on April 26
- Easels, tape and poster board provided
- Tape a set of PowerPoint slides to a standard 2’x3’ poster, or bring your own poster

Poster session during class on April 28
- Invite your friends, profs who helped you, etc.

Final report on projects due May 5
- Submit code, and a written document, roughly 10 pages, based on the earlier submission
- In addition to the original proposal, include:
  - Project plan and how it was decomposed (from the design review)
  - Description of the CUDA implementation
  - Performance measurement
  - Related work (from the design review)

Let’s Talk about Demos

For some of you, with very visual projects, I asked you to think about demos for the poster session.
- This is not a requirement, just something that would enhance the poster session
- Realistic? I know everyone’s laptops are slow … and don’t have enough memory to solve very large problems
- Creative suggestions? Movies captured from a run on a larger system

Sources for Today’s Lecture

- “On Dynamic Load Balancing on Graphics Processors,” D. Cederman and P. Tsigas, Graphics Hardware (2008). http://www.cs.chalmers.se/~cederman/papers/GPU_Load_Balancing-GH08.pdf
- “A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems,” P. Tsigas and Y. Zhang, SPAA 2001. (more on the lock-free queue)
- Threading Building Blocks, http://www.threadingbuildingblocks.org/ (more on task stealing)

Motivation for Next Few Lectures

Goal is to discuss prior solutions to topics that might be useful to your projects:
- Dynamic scheduling
- Overlapping execution on the GPU and marshalling inputs
- Combining CUDA and OpenGL to display results of computation
- Other topics of interest? More case studies

End of semester (week of April 19):
- OpenCL
- Future GPU architectures

Motivation: Dynamic Task Queue

Mostly we have talked about how to partition large arrays to perform identical computations on different portions of the arrays.
- Sometimes a little global synchronization is required

What if the work is very irregular in its structure?
- It may not produce a balanced load
- The data representation may be sparse
- Work may be created on the GPU in response to prior computation

Constructing a dynamic task queue on GPUs

- Must use some sort of atomic operation for global synchronization to enqueue and dequeue tasks
- Numerous decisions about how to manage task queues:
  - One on every SM? A global task queue?
  - The former can be made far more efficient, but it is also more prone to load imbalance
- Many choices of how to do synchronization:
  - Optimize for the properties of the task queue (e.g., very large task queues can use simpler mechanisms)
- All proposed approaches have a statically allocated task list that must be as large as the maximum number of waiting tasks

Simple Lock Using Atomic Updates

Can you use atomic updates to create a lock variable? Consider these primitives:

    int lockVar;
    atomicAdd(&lockVar, 1);
    atomicAdd(&lockVar, -1);
    atomicInc(&lockVar, 1);
    atomicDec(&lockVar, 1);
    atomicExch(&lockVar, 1);
    atomicCAS(&lockVar, 0, 1);

Suggested Implementation

    // also unsigned int and unsigned long long versions
    int atomicCAS(int* address, int compare, int val);

Reads the 32-bit or 64-bit word old located at address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap). 64-bit words are only supported for global memory.

    __device__ void getLock(int *lockVarPtr)
    {
        // Spin until we atomically replace 0 (unlocked) with 1 (locked)
        while (atomicCAS(lockVarPtr, 0, 1) == 1);
    }
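The slide shows only acquisition. A minimal sketch of the matching release and a one-thread-per-block usage pattern, assuming the lock lives in global memory (releaseLock and lockVar are illustrative names, not from the paper):

    __device__ int lockVar = 0;        // hypothetical global lock; 0 = free, 1 = held

    __device__ void releaseLock(int *lockVarPtr)
    {
        __threadfence();               // make critical-section writes visible first
        atomicExch(lockVarPtr, 0);     // atomically store 0 (unlocked)
    }

    __global__ void kernel()
    {
        if (threadIdx.x == 0) {        // one thread per block contends for the lock
            getLock(&lockVar);
            // ... critical section, e.g., update a shared task queue ...
            releaseLock(&lockVar);
        }
    }

Having only one thread per block spin on the lock matters: on pre-Volta GPUs, sibling threads of a warp spinning on a lock held by another thread of the same warp can deadlock.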

Synchronization

- Blocking: uses mutual exclusion to only allow one process at a time to access the object.
- Lock-free: multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
- Wait-free: multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.

Slide source: Daniel Cederman

Load Balancing Methods

- Blocking Task Queue
- Non-blocking Task Queue
- Task Stealing
- Static Task List

Slide source: Daniel Cederman

Static Task List (Recommended)

    function DEQUEUE(q, id)
        return q.in[id]

    function ENQUEUE(q, task)
        localtail ← atomicAdd(&q.tail, 1)
        q.out[localtail] = task

    function NEWTASKCNT(q)
        q.in, q.out, oldtail, q.tail ← q.out, q.in, q.tail, 0
        return oldtail

    procedure MAIN(taskinit)
        q.in, q.out ← newarray(maxsize), newarray(maxsize)
        q.tail ← 0
        enqueue(q, taskinit)
        blockcnt ← newtaskcnt(q)
        while blockcnt != 0 do
            run blockcnt blocks in parallel
                t ← dequeue(q, TBid)
                subtasks ← doWork(t)
                for each nt in subtasks do
                    enqueue(q, nt)
            blockcnt ← newtaskcnt(q)

Two lists:
- q.in is read-only and not synchronized
- q.out is write-only and is updated atomically

When NEWTASKCNT is called at the end of a major task-scheduling phase, q.in and q.out are swapped. Synchronization is required to insert tasks, but at least one always gets through (wait-free).
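A minimal CUDA sketch of the device-side operations, assuming tasks are plain ints and the arrays are preallocated to maxsize; the struct and function names are illustrative, not from the paper. The NEWTASKCNT swap would run on the host between kernel launches:

    struct StaticTaskList {
        int *in;    // read-only this phase; indexed by block ID, no sync needed
        int *out;   // write-only this phase; slots claimed with atomicAdd
        int  tail;  // next free slot in out
    };

    __device__ int dequeue(StaticTaskList *q, int id)
    {
        return q->in[id];                        // each block reads its own task
    }

    __device__ void enqueue(StaticTaskList *q, int task)
    {
        int localtail = atomicAdd(&q->tail, 1);  // claim a unique slot
        q->out[localtail] = task;
    }

Because atomicAdd always completes, every enqueue finishes in a bounded number of its own steps, which is what makes this scheme wait-free.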

Blocking Dynamic Task Queue

Use a lock for both adding and deleting tasks from the queue. All other threads block waiting for the lock. Potentially very inefficient, particularly for fine-grained tasks.

    function DEQUEUE(q)
        while atomicCAS(&q.lock, 0, 1) == 1 do;
        if q.beg != q.end then
            q.beg++
            result ← q.data[q.beg]
        else
            result ← NIL
        q.lock ← 0
        return result

    function ENQUEUE(q, task)
        while atomicCAS(&q.lock, 0, 1) == 1 do;
        q.end++
        q.data[q.end] ← task
        q.lock ← 0
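A hedged CUDA rendering of the same queue, reusing the spin-lock pattern from the earlier slide; BlockingQueue and the use of -1 for NIL are illustrative assumptions, not from the paper:

    struct BlockingQueue {
        int  lock;       // 0 = free, 1 = held
        int  beg, end;   // queue bounds
        int *data;       // task storage in global memory
    };

    __device__ int dequeue(BlockingQueue *q)
    {
        int result;
        while (atomicCAS(&q->lock, 0, 1) == 1);  // spin until we hold the lock
        if (q->beg != q->end) {
            q->beg++;
            result = q->data[q->beg];
        } else {
            result = -1;                         // NIL: queue is empty
        }
        __threadfence();                         // publish updates before unlock
        atomicExch(&q->lock, 0);                 // release the lock
        return result;
    }

    __device__ void enqueue(BlockingQueue *q, int task)
    {
        while (atomicCAS(&q->lock, 0, 1) == 1);
        q->end++;
        q->data[q->end] = task;
        __threadfence();
        atomicExch(&q->lock, 0);
    }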

Lock-free Dynamic Task Queue

    function DEQUEUE(q)
        oldbeg ← q.beg
        lbeg ← oldbeg
        while (task ← q.data[lbeg]) == NIL do
            lbeg++
        if atomicCAS(&q.data[lbeg], task, NIL) != task then
            restart
        if lbeg mod x == 0 then
            atomicCAS(&q.beg, oldbeg, lbeg)
        return task

    function ENQUEUE(q, task)
        oldend ← q.end
        lend ← oldend
        while q.data[lend] != NIL do
            lend++
        if atomicCAS(&q.data[lend], NIL, task) != NIL then
            restart
        if lend mod x == 0 then
            atomicCAS(&q.end, oldend, lend)

Idea: at least one thread will succeed in adding or removing a task from the queue.
Optimization: only update the beginning and end pointers with atomicCAS every x elements.
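A rough CUDA sketch of the dequeue side, assuming tasks are nonzero ints in a global array with 0 standing in for NIL; the function name and retry-loop structure are illustrative, and production code would also need volatile loads or memory fences so that rereads of data[] are not hoisted by the compiler:

    // Hypothetical lock-free dequeue: 0 plays the role of NIL.
    // Assumes a task is always found before the end of the array.
    __device__ int dequeueLockFree(int *data, int *beg, int x)
    {
        while (true) {                            // "restart" = retry the loop
            int oldbeg = *beg;
            int lbeg   = oldbeg;
            int task;
            while ((task = data[lbeg]) == 0)      // scan forward for a task
                lbeg++;
            // Only one thread's CAS on this slot can succeed.
            if (atomicCAS(&data[lbeg], task, 0) == task) {
                if (lbeg % x == 0)                // lazily advance shared head
                    atomicCAS(beg, oldbeg, lbeg);
                return task;
            }
            // CAS failed: another thread claimed the task; restart.
        }
    }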

Task Stealing

- No code is provided in the paper
- Idea: a set of independent task queues. When a task queue becomes empty, its owner goes out to other task queues to find available work
- Lots and lots of engineering is needed to get this right
- The best work on this is in Intel Threading Building Blocks

General Issues

- One or multiple task queues?
- Where does the task queue reside?
  - If possible, in shared memory (see the sketch below)
  - Actual tasks can be stored elsewhere, perhaps in global memory
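A minimal sketch of that layout, assuming one queue per thread block; QueueHeader and taskSlots are illustrative names:

    struct QueueHeader { int beg, end, lock; };

    __global__ void worker(int *taskSlots /* task storage in global memory */)
    {
        __shared__ QueueHeader q;      // queue metadata in fast on-chip shared memory
        if (threadIdx.x == 0) {        // one thread initializes per block
            q.beg = 0; q.end = 0; q.lock = 0;
        }
        __syncthreads();               // all threads see the initialized queue
        // ... threads in this block enqueue/dequeue indices into taskSlots ...
    }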

Remainder of Paper

- An octree partitioning of a particle system is used as the example application
- A comparison of the four implementations
- Figures 2 and 3 examine two different GPUs
- Figures 4 and 5 look at two different particle distributions

Next Time

- Displaying results of computation with OpenGL
- Asynchronous computation/data staging