Futures, Scheduling, and Work Distribution. Speaker: Eliran Shmila. Based on Chapter 16 of the book “The Art of Multiprocessor Programming” by Maurice Herlihy and Nir Shavit.
Content
- Futures: introduction, motivation, and examples
- Parallelism analysis
- Work distribution: work dealing, work stealing, and work balancing (also called work sharing)
Futures
Introduction: Some applications break down naturally into parallel threads, e.g. a web server or a producer-consumer application. Other applications don't, yet they still have inherent parallelism, and it is not obvious how to take advantage of it. Matrix multiplication is one example.
Parallel Matrix Multiplication. Naive solution: compute each C_ij using its own thread, an n×n array of threads in all; the program initializes and starts all the threads, then waits for all of them to finish. An ideal design on paper; a sketch of it follows below.
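As a concrete illustration (this code is not from the slides, only of the design they describe), here is what the one-thread-per-entry scheme looks like; it compiles and runs, but as the next slide explains, it scales badly:

// Naive design: one short-lived thread per entry of C = A * B.
public class NaiveMatrixMultiply {
    public static void multiply(double[][] a, double[][] b, double[][] c)
            throws InterruptedException {
        int n = a.length;
        Thread[][] workers = new Thread[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                final int row = i, col = j;
                workers[i][j] = new Thread(() -> { // one thread per C[i][j]
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += a[row][k] * b[k][col];
                    c[row][col] = sum;
                });
                workers[i][j].start();
            }
        }
        for (Thread[] rowOfThreads : workers)      // wait for all n*n threads
            for (Thread t : rowOfThreads)
                t.join();
    }
}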
The problem: In practice, the naive solution performs extremely poorly on large matrices. Why? Each thread requires memory for its stack and other bookkeeping information, so creating, scheduling, and destroying many short-lived threads imposes significant overhead.
Solution: a pool of threads. Each thread in the pool repeatedly waits for a task. When a thread is assigned a task, it executes it and then rejoins the pool.
Pools in Java: In Java, a thread pool is an ExecutorService. It provides the ability to submit a task, wait for a task (or a set of tasks) to complete, and cancel an uncompleted task. There are two types of tasks: Runnable and Callable.
Futures: When a Callable object is submitted to an executor service (pool), the service returns an object implementing the Future interface. A Future is a promise to deliver the result of an asynchronous computation; it provides a get() method that returns the result, blocking if necessary. Similarly, when a Runnable object is submitted to an executor service, the service also returns a Future. This Future carries no value, but the programmer can use its get() method to block until the computation finishes. (See the example below.)
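A minimal demo of this API (this specific example is mine, not the slides'), using only the standard java.util.concurrent types:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);

        // Submitting a Callable returns a Future that will hold its result.
        Future<Integer> sum = exec.submit(new Callable<Integer>() {
            public Integer call() { return 1 + 2 + 3; }
        });

        // Submitting a Runnable also returns a Future, but one without a
        // value; get() just blocks until the task has completed.
        Future<?> done = exec.submit(new Runnable() {
            public void run() { System.out.println("side effects only"); }
        });

        System.out.println(sum.get()); // blocks until the result is ready: 6
        done.get();                    // blocks until the Runnable finishes
        exec.shutdown();
    }
}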
Matrix Split: From now on, we assume n is a power of 2. Any n×n matrix can be split into 4 sub-matrices, so matrix addition C = A + B can be decomposed as shown below.
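The decomposition itself was a figure on the slide; in standard block form it reads:

\[
C = \begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix},\quad
A = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix},\quad
B = \begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix},
\qquad C_{ij} = A_{ij} + B_{ij}\ \ (i,j \in \{0,1\}),
\]

so the four half-size additions are independent and can run in parallel.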
Parallel Matrix Addition (1/2): The MatrixTask class creates a new pool of threads, submits an AddTask to the pool, and waits for the computation (the future) to complete.
Parallel Matrix Addition (2/2): The AddTask handles the base case directly; otherwise it splits the source and destination matrices and recurses on the quadrants. A sketch of both classes follows below.
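The code on these two slides was an image and did not survive extraction. Below is a minimal sketch in the spirit of the chapter's MatrixTask/AddTask; the Matrix wrapper and its split() helper are assumptions standing in for the book's versions, and a cached thread pool is used so that tasks blocked in get() cannot starve their subtasks.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical Matrix wrapper: a view onto a shared backing array.
class Matrix {
    final double[][] data; final int row, col, dim;
    Matrix(double[][] data, int row, int col, int dim) {
        this.data = data; this.row = row; this.col = col; this.dim = dim;
    }
    double get(int i, int j)           { return data[row + i][col + j]; }
    void   set(int i, int j, double v) { data[row + i][col + j] = v; }
    // Split into four half-size quadrants sharing the backing array.
    Matrix[][] split() {
        int h = dim / 2;
        return new Matrix[][] {
            { new Matrix(data, row, col, h),     new Matrix(data, row, col + h, h) },
            { new Matrix(data, row + h, col, h), new Matrix(data, row + h, col + h, h) }
        };
    }
}

class MatrixTask {
    // Cached pool: recursive tasks that block in get() still make progress.
    static final ExecutorService exec = Executors.newCachedThreadPool();

    static void add(Matrix a, Matrix b, Matrix c) throws Exception {
        Future<?> future = exec.submit(new AddTask(a, b, c));
        future.get(); // wait for the whole computation tree to finish
    }

    static class AddTask implements Runnable {
        final Matrix a, b, c;
        AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
        public void run() {
            try {
                if (a.dim == 1) {        // base case: a single entry
                    c.set(0, 0, a.get(0, 0) + b.get(0, 0));
                } else {                 // split and recurse on the quadrants
                    Matrix[][] as = a.split(), bs = b.split(), cs = c.split();
                    Future<?>[] futures = new Future<?>[4];
                    int k = 0;
                    for (int i = 0; i < 2; i++)
                        for (int j = 0; j < 2; j++)
                            futures[k++] = exec.submit(new AddTask(as[i][j], bs[i][j], cs[i][j]));
                    for (Future<?> f : futures)
                        f.get();         // wait for all four sub-additions
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}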
Matrix Multiplication: Similarly to addition, matrix multiplication can be decomposed as shown below. The product terms can be computed in parallel, and when they are done, the 4 sums can be computed in parallel.
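In block form (the slide's figure), each quadrant of the product is:

\[
C_{ij} = A_{i0}B_{0j} + A_{i1}B_{1j}\ \ (i,j \in \{0,1\}),
\]

giving 8 half-size products, followed by 4 half-size additions.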
Parallelism analysis
Fibonacci Function, Parallel Computation: This implementation is very inefficient, but we use it here to illustrate multithreaded dependencies; a sketch follows below.
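The slide's code was an image; here is a minimal sketch of the parallel Fibonacci task in the style of the chapter. The cached pool is an assumption, chosen so a task blocked in get() cannot starve its child.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Deliberately inefficient (exponential work): used only to illustrate
// the dependencies between a task and the futures it creates.
class FibTask implements Callable<Integer> {
    static final ExecutorService exec = Executors.newCachedThreadPool();
    final int arg;
    FibTask(int n) { arg = n; }
    public Integer call() throws Exception {
        if (arg < 2) return arg;                 // fib(0) = 0, fib(1) = 1
        Future<Integer> left = exec.submit(new FibTask(arg - 1)); // parallel child
        int right = new FibTask(arg - 2).call(); // computed in this thread
        return left.get() + right;               // join: wait for the child
    }
}

For example, exec.submit(new FibTask(10)).get() returns 55.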
Dependencies and critical path: An edge u→v indicates that v depends on u (v needs u's result). A node that creates a future has 2 successors: one in the same thread, and one in the future's computation; the future's get() call joins them again. The longest directed path of dependencies is the critical path.
Analyzing parallelism: T_P is the minimum time needed to execute a multithreaded program on a system of P processors; it is an idealized measure. T_1 is the number of steps needed to execute the program on a single processor, also called the program's work. Since at any given time P processors can execute at most P steps, T_P ≥ T_1/P.
T_∞ is the number of steps needed to execute the program on an unlimited number of processors, also called the critical-path length. Since finite resources cannot do better than infinite resources, T_P ≥ T_∞. The speedup on P processors is the ratio T_1/T_P, a program has linear speedup if T_1/T_P = Θ(P), and the program's parallelism is the maximum possible speedup, T_1/T_∞. These bounds and definitions are collected below.
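In one display, the quantities from these two slides:

\[
T_P \;\ge\; \frac{T_1}{P}, \qquad T_P \;\ge\; T_\infty, \qquad
\text{speedup}(P) = \frac{T_1}{T_P}, \qquad
\text{parallelism} = \frac{T_1}{T_\infty},
\]

with linear speedup meaning \(T_1/T_P = \Theta(P)\).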
Matrix addition and multiplication, revisited (1/3): Denote by A_P(n) the number of steps needed to add two n×n matrices on P processors. Addition requires a constant amount of time to split the matrices, plus 4 half-size matrix additions. The work of the computation is therefore given by the first recurrence below, the same number of steps as the conventional doubly nested loop. Because the half-size additions can be done in parallel, the critical-path length is given by the second recurrence.
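The recurrences, which were figures on the slide, are:

\[
A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2), \qquad
A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n).
\]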
Matrix addition and multiplication, revisited (2/3): Denote by M_P(n) the number of steps needed to multiply two n×n matrices on P processors. Multiplication requires 8 half-size matrix multiplications and 4 matrix additions. The work of the computation is therefore given by the first recurrence below, the same work as the conventional triply nested loop. Since the 8 multiplications can be done in parallel, and the 4 sums can be done in parallel once the multiplications are over, the critical-path length is given by the second recurrence.
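Again reconstructing the slide's figures:

\[
M_1(n) = 8\,M_1(n/2) + 4\,A_1(n/2) + \Theta(1) = \Theta(n^3), \qquad
M_\infty(n) = M_\infty(n/2) + A_\infty(n/2) = \Theta(\log^2 n).
\]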
Matrix addition and multiplication, revisited (3/3): Thus, the parallelism for matrix multiplication is M_1(n)/M_∞(n), which is actually pretty high: for 1000×1000 matrices, n³ = 10⁹ and log n = log 1000 ≈ 10, so the parallelism is approximately 10⁹/10² = 10⁷ (worked out below). But the parallelism given here is a highly idealized upper bound on performance. Why? Idle threads (it may not be easy to assign them to idle processors), and a program that displays less parallelism but consumes less memory may perform better (fewer page faults).
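As a worked computation:

\[
\frac{M_1(n)}{M_\infty(n)} = \Theta\!\left(\frac{n^3}{\log^2 n}\right),
\qquad n = 1000:\quad \frac{10^9}{(\approx 10)^2} \approx 10^7.
\]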
Work Distribution
Work Dealing: We now understand that the key to achieving a good speedup is to keep user-level threads supplied with tasks. Multithreaded computations create and destroy threads dynamically, so a work-distribution algorithm is needed to assign ready tasks to idle threads. A simple approach is work dealing: an overloaded thread tries to offload tasks to other, less heavily loaded threads. The basic flaw in this approach is that if most threads are busy, they waste time trying to exchange tasks.
Work Stealing: Instead, we first consider work stealing: when a thread runs out of work, it tries to “steal” work from others. The advantage of this approach is that if all threads are already busy, no time is wasted on attempts to exchange work.
Work Stealing: Each thread keeps a pool of tasks waiting to be executed in the form of a double-ended queue (DEQueue). When a thread creates a new task, it calls pushBottom() to push the task onto its queue.
Work Stealing: When a thread needs a task to work on, it calls popBottom() to remove a task from its own DEQueue. If it finds that the queue is empty, it becomes a thief: it chooses a victim thread at random and calls the victim DEQueue's popTop() method to steal a task for itself.
Implementation: an array of DEQueues, one for each thread. Each thread removes a task from its own queue and runs it, performing tasks until the queue is empty. If it runs out of tasks, it repeatedly chooses a victim thread at random and tries to steal a task from the victim's queue; a sketch follows below.
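The slide's code was an image; the following is a sketch of that loop. The DEQueue interface here is an assumption matching the operations named on the previous slides (pushBottom, popBottom, popTop), and the Thread.yield() call anticipates the next slide.

import java.util.Random;

// Assumed interface for the per-thread double-ended queue.
interface DEQueue {
    void pushBottom(Runnable task);
    Runnable popBottom(); // null if the local queue is empty
    Runnable popTop();    // null if empty (or the steal fails)
}

class WorkStealingThread implements Runnable {
    final DEQueue[] queue; // one DEQueue per thread, shared by all
    final int me;          // this thread's index into the array
    final Random random = new Random();

    WorkStealingThread(DEQueue[] queue, int me) {
        this.queue = queue; this.me = me;
    }

    public void run() {
        Runnable task = queue[me].popBottom();
        while (true) {
            while (task != null) {          // drain the local queue
                task.run();
                task = queue[me].popBottom();
            }
            while (task == null) {          // out of work: become a thief
                Thread.yield();             // let descheduled workers run first
                int victim = random.nextInt(queue.length);
                if (victim != me)
                    task = queue[victim].popTop();
            }
        }
    }
}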
Yielding: A multiprogrammed environment is one in which there are more threads than processors, implying that not all threads can run at the same time and that any thread can be suspended at any time. To guarantee progress, we must ensure that threads that have work to do are not delayed by thief threads that do nothing but attempt to steal tasks. To prevent that, each thief thread calls Thread.yield() before trying to steal a task. This call yields the processor to another thread, allowing descheduled threads to regain the processor and make progress.
Work Balancing (1/4): An alternative to work stealing is work balancing: each thread periodically balances its workload with a randomly chosen partner. To ensure that heavily loaded threads do not waste effort trying to rebalance, we make it more likely for lightly loaded threads to initiate rebalancing: each thread periodically flips a biased coin to decide whether to balance with another thread.
Work Balancing (2/4): A thread's probability of rebalancing is inversely proportional to the number of tasks it has: threads with nothing to do are certain to rebalance, while threads with many tasks are unlikely to. A thread rebalances by choosing a victim uniformly at random; if the difference between the numbers of tasks the two threads have exceeds a predefined threshold, the threads exchange tasks until their queues hold the same number of tasks.
Work Balancing (3/4): the balancing thread's loop (code figure annotations: perform tasks until the queue is empty; flip a coin; balance). A sketch follows below.
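The code on this slide was also an image; below is a sketch of the loop those labels annotate. The THRESHOLD constant and the use of plain java.util.Queue are assumptions; the coin flip succeeds with probability 1/(size+1), so an empty queue always rebalances.

import java.util.Queue;
import java.util.Random;

class WorkBalancingThread implements Runnable {
    static final int THRESHOLD = 4; // assumed imbalance threshold
    final Queue<Runnable>[] queue;  // one task queue per thread
    final int me;
    final Random random = new Random();

    WorkBalancingThread(Queue<Runnable>[] queue, int me) {
        this.queue = queue; this.me = me;
    }

    public void run() {
        while (true) {
            Runnable task = queue[me].poll(); // perform tasks until empty
            if (task != null) task.run();
            int size = queue[me].size();
            // Flip a biased coin: rebalance with probability 1/(size + 1).
            if (random.nextInt(size + 1) == size) {
                int victim = random.nextInt(queue.length);
                int min = Math.min(victim, me), max = Math.max(victim, me);
                synchronized (queue[min]) {     // lock both queues in a
                    synchronized (queue[max]) { // fixed order (no deadlock)
                        balance(queue[min], queue[max]);
                    }
                }
            }
        }
    }

    static void balance(Queue<Runnable> q0, Queue<Runnable> q1) {
        Queue<Runnable> qMin = q0.size() < q1.size() ? q0 : q1;
        Queue<Runnable> qMax = (qMin == q0) ? q1 : q0;
        int diff = qMax.size() - qMin.size();
        if (diff > THRESHOLD)                   // only if imbalance is large
            while (qMax.size() > qMin.size())   // move tasks until even
                qMin.add(qMax.remove());
    }
}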
Work Balancing (4/4): Advantages of this approach: if one thread has much more work than the others, then under work stealing many threads steal individual tasks from the overloaded thread (contention overhead). Here, under work balancing, moving multiple tasks at a time means that work spreads quickly among the threads, eliminating the synchronization overhead per individual task.