Scheduling Multithreaded Computations by Work Stealing. Robert D. Blumofe, The University of Texas at Austin. Charles E. Leiserson, MIT Laboratory for Computer Science.

Motivation Efficiently schedule strict multithreaded computations on parallel computers. Model a parallel execution as a directed acyclic graph (dag) of instructions, and schedule the computation by traversing the dag in an order consistent with its dependencies.

MIMD To maintain efficiency, enough threads must be active concurrently to keep all processors busy. The number of active threads should stay within the memory limits of the hardware. Related threads should be placed on the same processor to minimize communication.

Work Sharing vs Work Stealing Work Sharing: – The scheduler attempts to migrate threads to other processors, hoping to distribute work to underutilized processors. Work Stealing: – Underutilized processors take the initiative, 'stealing' threads from other processors.

Multithreaded Computation
– Blocks -> threads
– Circles -> instructions
– Horizontal arrows -> continuation edges within a thread
– Curved arrows -> join edges
– Vertical/slanted arrows -> spawn edges
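
In Cilk, the language named in the conclusion, such a dag arises directly from spawn and sync statements. A minimal sketch in Cilk-5 syntax (which postdates this paper, so treat it as illustrative rather than the paper's own notation): each spawn creates a spawn edge to a child thread, and sync creates the join edges back to the parent.

    cilk int fib(int n) {
        int x, y;
        if (n < 2) return n;      /* leaf thread: no children            */
        x = spawn fib(n - 1);     /* spawn edge to the first child       */
        y = spawn fib(n - 2);     /* second child may run elsewhere      */
        sync;                     /* join edges: wait for both children  */
        return x + y;
    }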

Strict vs Fully-Strict Computation In a strict computation, every join edge from a thread goes to an ancestor of that thread in the activation tree. In a fully-strict computation, every join edge from a thread goes to the thread's parent.

Busy-Leaves From the time a thread T is spawned until the time T dies, there is at least one thread from T's sub-computation that is ready. Disadvantage: the centralized thread pool becomes a bottleneck, so the algorithm does not scale to large numbers of processors.

Busy-Leaves conditions Spawn: if T_a spawns T_b, T_a is returned to the thread pool and the processor begins the next step working on T_b. Stall: if T_a stalls, it is returned to the thread pool and the processor begins the next step idle. Die: if T_a dies, the processor checks whether its parent T_b has any living children; if T_b has no living children and no other processor is working on it, the processor takes T_b from the pool, otherwise it begins the next step idle.
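
A structural sketch of these three rules in C. The types and pool helpers (thread_t, pool_push, pool_remove) are hypothetical stand-ins for the paper's abstract global thread pool, assumed to be accessed under a lock; each handler returns the thread the processor should run next, or NULL for an idle step.

    /* Hypothetical event handlers for one scheduling step of Busy-Leaves. */
    typedef struct thread_t thread_t;
    struct thread_t {
        thread_t *parent;
        int       living_children;
        int       being_worked_on;    /* some processor currently runs it */
    };

    extern void pool_push(thread_t *t);     /* return a thread to the pool  */
    extern void pool_remove(thread_t *t);   /* claim a thread from the pool */

    /* Spawn: parent goes back to the pool; work on the child next. */
    thread_t *on_spawn(thread_t *parent, thread_t *child) {
        pool_push(parent);
        return child;
    }

    /* Stall: thread returns to the pool; processor idles this step. */
    thread_t *on_stall(thread_t *t) {
        pool_push(t);
        return NULL;
    }

    /* Die: adopt the parent if it is childless and unclaimed. */
    thread_t *on_death(thread_t *t) {
        thread_t *p = t->parent;
        p->living_children--;
        if (p->living_children == 0 && !p->being_worked_on) {
            pool_remove(p);
            return p;
        }
        return NULL;
    }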

Thread Spawn/Death

Randomized Work-Stealing Algorithm The centralized pool of the Busy-Leaves algorithm is distributed across the processors: each processor maintains its own ready deque. Spawn: if T_a spawns T_b, T_a is pushed onto the bottom of the deque and the processor begins work on T_b. Stall: if T_a stalls, the processor pops a thread from the bottom of its deque; if the deque is empty, it becomes a thief and steals the top thread from the deque of a victim processor chosen uniformly at random. Die: if T_a dies, the processor proceeds as in the stalled case. Enable: if T_a enables a stalled thread T_b, T_b is pushed onto the bottom of the deque.
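
A hedged C sketch of the per-processor ready deque, using a pthread mutex for atomic access (the paper assumes only that deque accesses are atomic; the names deque_t, push_bottom, pop_bottom, and steal_top are illustrative, and a bounded array stands in for real storage management):

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct thread_t thread_t;   /* opaque ready thread */

    typedef struct {
        thread_t       **slot;   /* array of queued threads               */
        int              top;    /* index of the oldest thread (steal end) */
        int              bottom; /* index one past the youngest thread     */
        pthread_mutex_t  lock;   /* serializes the owner and thieves       */
    } deque_t;

    /* Owner pushes at the bottom (spawn, enable). */
    void push_bottom(deque_t *d, thread_t *t) {
        pthread_mutex_lock(&d->lock);
        d->slot[d->bottom++] = t;
        pthread_mutex_unlock(&d->lock);
    }

    /* Owner pops from the bottom; NULL means the deque is empty. */
    thread_t *pop_bottom(deque_t *d) {
        thread_t *t = NULL;
        pthread_mutex_lock(&d->lock);
        if (d->bottom > d->top) t = d->slot[--d->bottom];
        pthread_mutex_unlock(&d->lock);
        return t;
    }

    /* A thief steals the oldest thread from the top of a random victim. */
    thread_t *steal_top(deque_t *victims, int P) {
        deque_t  *v = &victims[rand() % P];   /* victim chosen uniformly */
        thread_t *t = NULL;
        pthread_mutex_lock(&v->lock);
        if (v->bottom > v->top) t = v->slot[v->top++];
        pthread_mutex_unlock(&v->lock);
        return t;
    }

The owner works at the bottom while thieves take the oldest thread at the top; this is what keeps the deque ordered by depth in the activation tree, as the next slide's properties state.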

Randomized Work-Stealing Properties For a processor p with k threads T_1, ..., T_k in its deque (ordered bottom to top) and T_0 the thread being worked on, if k > 0 the following properties are satisfied: (1) For i = 1, 2, ..., k, thread T_i is the parent of T_(i-1). (2) If k > 1 then, for i = 1, 2, ..., k-1, thread T_i has not been worked on since it spawned T_(i-1).
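
To make property (1) concrete, here is a self-contained assertion over a snapshot of the deque, assuming each thread records a parent pointer (the snapshot layout is hypothetical):

    #include <assert.h>

    typedef struct thread_t thread_t;
    struct thread_t { thread_t *parent; };

    /* Property (1) on a snapshot: t[0] is the thread being worked on,
       t[1..k] are the deque contents from bottom to top. Each deque
       thread must be the parent of the thread just below it. */
    void check_parent_chain(thread_t *t[], int k) {
        for (int i = 1; i <= k; i++)
            assert(t[i] == t[i - 1]->parent);
    }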

Work-Stealing For a fully-strict computation with work T_1 and critical-path length T∞, executed on P processors: the expected running time is T_1/P + O(T∞). For any ε > 0, with probability at least 1 - ε, the running time is T_1/P + O(T∞ + lg P + lg(1/ε)). The expected total communication is O(P T∞ (1 + n_d) S_max), where n_d is the maximum number of join edges from a thread to its parent and S_max is the size of the largest activation frame.
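
A short worked instance with hypothetical numbers, showing why the bound gives near-linear speedup when the parallelism T_1/T∞ greatly exceeds P:

    % Hypothetical workload: T_1 = 10^9 instructions, T_\infty = 10^4.
    \[ \text{parallelism} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{10^9}{10^4} \;=\; 10^5 \]
    % On P = 100 processors the expected running time is
    \[ \frac{T_1}{P} + O(T_\infty) \;=\; 10^7 + O(10^4) \;\approx\; \frac{T_1}{P} \]
    % i.e., near-perfect linear speedup, since P = 100 is far below T_1/T_\infty = 10^5.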

Recycling Game for Atomic Access The (P, M)-recycling game models contention for atomic deque access: P = number of balls = number of bins, M = total number of ball tosses. – In each round, the adversary chooses some balls from the reservoir and tosses each into one of the P bins chosen independently and uniformly at random. – The adversary then inspects each bin that contains at least one ball and removes one ball from it, returning the removed balls to the reservoir.
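
A Monte Carlo sketch of the game in C, under two simplifying assumptions that are mine rather than the paper's: every reservoir ball is tossed each round, and each nonempty bin releases exactly one ball per round, with every ball still queued at the end of a round accruing one unit of delay.

    #include <stdio.h>
    #include <stdlib.h>

    #define P 64          /* balls == bins */
    #define M 100000      /* total tosses  */

    int main(void) {
        int  bin[P] = {0};            /* balls currently queued per bin */
        long tosses = 0, delay = 0;
        srand(1);

        while (tosses < M) {
            /* Toss every reservoir ball at a bin chosen uniformly. */
            int queued = 0;
            for (int b = 0; b < P; b++) queued += bin[b];
            for (int i = 0; i < P - queued && tosses < M; i++, tosses++)
                bin[rand() % P]++;

            /* Each nonempty bin serves one ball; the rest wait a round. */
            for (int b = 0; b < P; b++) {
                if (bin[b] > 0) bin[b]--;
                delay += bin[b];      /* waiting balls accrue delay */
            }
        }
        printf("total delay for M=%d tosses on P=%d bins: %ld\n", M, P, delay);
        return 0;
    }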

Work Stealing With probability at least 1 - ε, the total delay incurred by M random requests made by P processors is O(M + P lg P + P lg(1/ε)).

Conclusion The proposed work-stealing scheduling algorithm is efficient in terms of time, space, and communication. A C-based language called Cilk is being developed for programming multithreaded computations that are scheduled by work stealing.