Futures, Scheduling, and Work Distribution. Speaker: Eliran Shmila. Based on Chapter 16 of the book “The Art of Multiprocessor Programming” by Maurice Herlihy and Nir Shavit.
Content
- Futures: introduction, motivation, and examples
- Parallelism analysis
- Work distribution: work dealing, work stealing, and work balancing (also called work sharing)
Futures
Introduction: Some applications break down naturally into parallel threads, e.g. a web server or a producer-consumer application. Other applications don't, yet they still have inherent parallelism, and it is not obvious how to take advantage of it. Matrix multiplication is one example.
Parallel Matrix Multiplication. Naive solution: compute each C_ij using its own thread, an n×n array of threads in all; the program initializes and starts all the threads, then waits for all of them to finish. An ideal design on paper; a sketch of it follows below.
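As a concrete illustration (this code is not from the slides, only of the design they describe), here is what the one-thread-per-entry scheme looks like; it compiles and runs, but as the next slide explains, it scales badly:

// Naive design: one short-lived thread per entry of C = A * B.
public class NaiveMatrixMultiply {
    public static void multiply(double[][] a, double[][] b, double[][] c)
            throws InterruptedException {
        int n = a.length;
        Thread[][] workers = new Thread[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                final int row = i, col = j;
                workers[i][j] = new Thread(() -> { // one thread per C[i][j]
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += a[row][k] * b[k][col];
                    c[row][col] = sum;
                });
                workers[i][j].start();
            }
        }
        for (Thread[] rowOfThreads : workers)      // wait for all n*n threads
            for (Thread t : rowOfThreads)
                t.join();
    }
}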
The problem: In practice, the naive solution performs extremely poorly on large matrices. Why? Each thread requires memory for its stack and other bookkeeping information, so creating, scheduling, and destroying many short-lived threads imposes significant overhead.
Solution: a pool of threads. Each thread in the pool repeatedly waits for a task. When a thread is assigned a task, it executes it and then rejoins the pool.
Pools in Java: In Java, a thread pool is an ExecutorService. It provides the ability to submit a task, wait for a task (or a set of tasks) to complete, and cancel an uncompleted task. There are two types of tasks: Runnable and Callable.
Futures: When a Callable object is submitted to an executor service (pool), the service returns an object implementing the Future interface. A Future is a promise to deliver the result of an asynchronous computation; it provides a get() method that returns the result, blocking if necessary. Similarly, when a Runnable object is submitted to an executor service, the service also returns a Future. This Future carries no value, but the programmer can use its get() method to block until the computation finishes. (See the example below.)
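A minimal demo of this API (this specific example is mine, not the slides'), using only the standard java.util.concurrent types:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);

        // Submitting a Callable returns a Future that will hold its result.
        Future<Integer> sum = exec.submit(new Callable<Integer>() {
            public Integer call() { return 1 + 2 + 3; }
        });

        // Submitting a Runnable also returns a Future, but one without a
        // value; get() just blocks until the task has completed.
        Future<?> done = exec.submit(new Runnable() {
            public void run() { System.out.println("side effects only"); }
        });

        System.out.println(sum.get()); // blocks until the result is ready: 6
        done.get();                    // blocks until the Runnable finishes
        exec.shutdown();
    }
}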
Matrix Split: From now on, we assume n is a power of 2. Any n×n matrix can be split into 4 sub-matrices, so matrix addition C = A + B can be decomposed as shown below.
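The decomposition itself was a figure on the slide; in standard block form it reads:

\[
C = \begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix},\quad
A = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix},\quad
B = \begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix},
\qquad C_{ij} = A_{ij} + B_{ij}\ \ (i,j \in \{0,1\}),
\]

so the four half-size additions are independent and can run in parallel.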
Parallel Matrix Addition (1/2): The MatrixTask class creates a new pool of threads, submits an AddTask to the pool, and waits for the computation (the future) to complete.
Parallel Matrix Addition (2/2): The AddTask handles the base case directly; otherwise it splits the source and destination matrices and recurses on the quadrants. A sketch of both classes follows below.
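The code on these two slides was an image and did not survive extraction. Below is a minimal sketch in the spirit of the chapter's MatrixTask/AddTask; the Matrix wrapper and its split() helper are assumptions standing in for the book's versions, and a cached thread pool is used so that tasks blocked in get() cannot starve their subtasks.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical Matrix wrapper: a view onto a shared backing array.
class Matrix {
    final double[][] data; final int row, col, dim;
    Matrix(double[][] data, int row, int col, int dim) {
        this.data = data; this.row = row; this.col = col; this.dim = dim;
    }
    double get(int i, int j)           { return data[row + i][col + j]; }
    void   set(int i, int j, double v) { data[row + i][col + j] = v; }
    // Split into four half-size quadrants sharing the backing array.
    Matrix[][] split() {
        int h = dim / 2;
        return new Matrix[][] {
            { new Matrix(data, row, col, h),     new Matrix(data, row, col + h, h) },
            { new Matrix(data, row + h, col, h), new Matrix(data, row + h, col + h, h) }
        };
    }
}

class MatrixTask {
    // Cached pool: recursive tasks that block in get() still make progress.
    static final ExecutorService exec = Executors.newCachedThreadPool();

    static void add(Matrix a, Matrix b, Matrix c) throws Exception {
        Future<?> future = exec.submit(new AddTask(a, b, c));
        future.get(); // wait for the whole computation tree to finish
    }

    static class AddTask implements Runnable {
        final Matrix a, b, c;
        AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
        public void run() {
            try {
                if (a.dim == 1) {        // base case: a single entry
                    c.set(0, 0, a.get(0, 0) + b.get(0, 0));
                } else {                 // split and recurse on the quadrants
                    Matrix[][] as = a.split(), bs = b.split(), cs = c.split();
                    Future<?>[] futures = new Future<?>[4];
                    int k = 0;
                    for (int i = 0; i < 2; i++)
                        for (int j = 0; j < 2; j++)
                            futures[k++] = exec.submit(new AddTask(as[i][j], bs[i][j], cs[i][j]));
                    for (Future<?> f : futures)
                        f.get();         // wait for all four sub-additions
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}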
Matrix Multiplication: Similarly to addition, matrix multiplication can be decomposed as shown below. The product terms can be computed in parallel, and when they are done, the 4 sums can be computed in parallel.
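In block form (the slide's figure), each quadrant of the product is:

\[
C_{ij} = A_{i0}B_{0j} + A_{i1}B_{1j}\ \ (i,j \in \{0,1\}),
\]

giving 8 half-size products, followed by 4 half-size additions.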
Parallelism analysis
Fibonacci Function, Parallel Computation: This implementation is very inefficient, but we use it here to illustrate multithreaded dependencies; a sketch follows below.
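The slide's code was an image; here is a minimal sketch of the parallel Fibonacci task in the style of the chapter. The cached pool is an assumption, chosen so a task blocked in get() cannot starve its child.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Deliberately inefficient (exponential work): used only to illustrate
// the dependencies between a task and the futures it creates.
class FibTask implements Callable<Integer> {
    static final ExecutorService exec = Executors.newCachedThreadPool();
    final int arg;
    FibTask(int n) { arg = n; }
    public Integer call() throws Exception {
        if (arg < 2) return arg;                 // fib(0) = 0, fib(1) = 1
        Future<Integer> left = exec.submit(new FibTask(arg - 1)); // parallel child
        int right = new FibTask(arg - 2).call(); // computed in this thread
        return left.get() + right;               // join: wait for the child
    }
}

For example, exec.submit(new FibTask(10)).get() returns 55.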
Dependencies and critical path: An edge u→v indicates that v depends on u (v needs u's result). A node that creates a future has 2 successors: one in the same thread, and one in the future's computation; the future's get() call joins them again. The longest directed path of dependencies is the critical path.
Analyzing parallelism: T_P is the minimum time needed to execute a multithreaded program on a system of P processors; it is an idealized measure. T_1 is the number of steps needed to execute the program on a single processor, also called the program's work. Since at any given time P processors can execute at most P steps, T_P ≥ T_1/P.
T_∞ is the number of steps needed to execute the program on an unlimited number of processors, also called the critical-path length. Since finite resources cannot do better than infinite resources, T_P ≥ T_∞. The speedup on P processors is the ratio T_1/T_P, a program has linear speedup if T_1/T_P = Θ(P), and the program's parallelism is the maximum possible speedup, T_1/T_∞. These bounds and definitions are collected below.
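In one display, the quantities from these two slides:

\[
T_P \;\ge\; \frac{T_1}{P}, \qquad T_P \;\ge\; T_\infty, \qquad
\text{speedup}(P) = \frac{T_1}{T_P}, \qquad
\text{parallelism} = \frac{T_1}{T_\infty},
\]

with linear speedup meaning \(T_1/T_P = \Theta(P)\).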
Matrix addition and multiplication, revisited (1/3): Denote by A_P(n) the number of steps needed to add two n×n matrices on P processors. Addition requires a constant amount of time to split the matrices, plus 4 half-size matrix additions. The work of the computation is therefore given by the first recurrence below, the same number of steps as the conventional doubly nested loop. Because the half-size additions can be done in parallel, the critical-path length is given by the second recurrence.
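The recurrences, which were figures on the slide, are:

\[
A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2), \qquad
A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n).
\]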
Matrix addition and multiplication, revisited (2/3): Denote by M_P(n) the number of steps needed to multiply two n×n matrices on P processors. Multiplication requires 8 half-size matrix multiplications and 4 matrix additions. The work of the computation is therefore given by the first recurrence below, the same work as the conventional triply nested loop. Since the 8 multiplications can be done in parallel, and the 4 sums can be done in parallel once the multiplications are over, the critical-path length is given by the second recurrence.
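Again reconstructing the slide's figures:

\[
M_1(n) = 8\,M_1(n/2) + 4\,A_1(n/2) + \Theta(1) = \Theta(n^3), \qquad
M_\infty(n) = M_\infty(n/2) + A_\infty(n/2) = \Theta(\log^2 n).
\]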
Matrix addition and multiplication, revisited (3/3): Thus, the parallelism for matrix multiplication is M_1(n)/M_∞(n), which is actually pretty high: for 1000×1000 matrices, n³ = 10⁹ and log n = log 1000 ≈ 10, so the parallelism is approximately 10⁹/10² = 10⁷ (worked out below). But the parallelism given here is a highly idealized upper bound on performance. Why? Idle threads (it may not be easy to assign them to idle processors), and a program that displays less parallelism but consumes less memory may perform better (fewer page faults).
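As a worked computation:

\[
\frac{M_1(n)}{M_\infty(n)} = \Theta\!\left(\frac{n^3}{\log^2 n}\right),
\qquad n = 1000:\quad \frac{10^9}{(\approx 10)^2} \approx 10^7.
\]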
Work Distribution
Work Dealing: We now understand that the key to achieving a good speedup is to keep user-level threads supplied with tasks. Multithreaded computations create and destroy threads dynamically, so a work-distribution algorithm is needed to assign ready tasks to idle threads. A simple approach is work dealing: an overloaded thread tries to offload tasks to other, less heavily loaded threads. The basic flaw in this approach is that if most threads are busy, they waste time trying to exchange tasks.
Work Stealing: Instead, we first consider work stealing: when a thread runs out of work, it tries to “steal” work from others. The advantage of this approach is that if all threads are already busy, no time is wasted on attempts to exchange work.
Work Stealing: Each thread keeps a pool of tasks waiting to be executed in the form of a double-ended queue (DEQueue). When a thread creates a new task, it calls pushBottom() to push the task onto its queue.
Work Stealing: When a thread needs a task to work on, it calls popBottom() to remove a task from its own DEQueue. If it finds that the queue is empty, it becomes a thief: it chooses a victim thread at random and calls the victim DEQueue's popTop() method to steal a task for itself.
Implementation: an array of DEQueues, one for each thread. Each thread removes a task from its own queue and runs it, performing tasks until the queue is empty. If it runs out of tasks, it repeatedly chooses a victim thread at random and tries to steal a task from the victim's queue; a sketch follows below.
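The slide's code was an image; the following is a sketch of that loop. The DEQueue interface here is an assumption matching the operations named on the previous slides (pushBottom, popBottom, popTop), and the Thread.yield() call anticipates the next slide.

import java.util.Random;

// Assumed interface for the per-thread double-ended queue.
interface DEQueue {
    void pushBottom(Runnable task);
    Runnable popBottom(); // null if the local queue is empty
    Runnable popTop();    // null if empty (or the steal fails)
}

class WorkStealingThread implements Runnable {
    final DEQueue[] queue; // one DEQueue per thread, shared by all
    final int me;          // this thread's index into the array
    final Random random = new Random();

    WorkStealingThread(DEQueue[] queue, int me) {
        this.queue = queue; this.me = me;
    }

    public void run() {
        Runnable task = queue[me].popBottom();
        while (true) {
            while (task != null) {          // drain the local queue
                task.run();
                task = queue[me].popBottom();
            }
            while (task == null) {          // out of work: become a thief
                Thread.yield();             // let descheduled workers run first
                int victim = random.nextInt(queue.length);
                if (victim != me)
                    task = queue[victim].popTop();
            }
        }
    }
}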
Yielding: A multiprogrammed environment is one in which there are more threads than processors, implying that not all threads can run at the same time and that any thread can be suspended at any time. To guarantee progress, we must ensure that threads that have work to do are not delayed by thief threads that do nothing but attempt to steal tasks. To prevent that, each thief thread calls Thread.yield() before trying to steal a task. This call yields the processor to another thread, allowing descheduled threads to regain the processor and make progress.
Work Balancing (1/4): An alternative to work stealing is work balancing: each thread periodically balances its workload with a randomly chosen partner. To ensure that heavily loaded threads do not waste effort trying to rebalance, we make it more likely for lightly loaded threads to initiate rebalancing: each thread periodically flips a biased coin to decide whether to balance with another thread.
Work Balancing (2/4): A thread's probability of rebalancing is inversely proportional to the number of tasks it has: threads with nothing to do are certain to rebalance, while threads with many tasks are unlikely to. A thread rebalances by choosing a victim uniformly at random; if the difference between the numbers of tasks the two threads have exceeds a predefined threshold, the threads exchange tasks until their queues hold the same number of tasks.
Work Balancing (3/4): the balancing thread's loop (code figure annotations: perform tasks until the queue is empty; flip a coin; balance). A sketch follows below.
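The code on this slide was also an image; below is a sketch of the loop those labels annotate. The THRESHOLD constant and the use of plain java.util.Queue are assumptions; the coin flip succeeds with probability 1/(size+1), so an empty queue always rebalances.

import java.util.Queue;
import java.util.Random;

class WorkBalancingThread implements Runnable {
    static final int THRESHOLD = 4; // assumed imbalance threshold
    final Queue<Runnable>[] queue;  // one task queue per thread
    final int me;
    final Random random = new Random();

    WorkBalancingThread(Queue<Runnable>[] queue, int me) {
        this.queue = queue; this.me = me;
    }

    public void run() {
        while (true) {
            Runnable task = queue[me].poll(); // perform tasks until empty
            if (task != null) task.run();
            int size = queue[me].size();
            // Flip a biased coin: rebalance with probability 1/(size + 1).
            if (random.nextInt(size + 1) == size) {
                int victim = random.nextInt(queue.length);
                int min = Math.min(victim, me), max = Math.max(victim, me);
                synchronized (queue[min]) {     // lock both queues in a
                    synchronized (queue[max]) { // fixed order (no deadlock)
                        balance(queue[min], queue[max]);
                    }
                }
            }
        }
    }

    static void balance(Queue<Runnable> q0, Queue<Runnable> q1) {
        Queue<Runnable> qMin = q0.size() < q1.size() ? q0 : q1;
        Queue<Runnable> qMax = (qMin == q0) ? q1 : q0;
        int diff = qMax.size() - qMin.size();
        if (diff > THRESHOLD)                   // only if imbalance is large
            while (qMax.size() > qMin.size())   // move tasks until even
                qMin.add(qMax.remove());
    }
}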
Work Balancing (4/4): Advantages of this approach: if one thread has much more work than the others, then under work stealing many threads steal individual tasks from the overloaded thread (contention overhead). Here, under work balancing, moving multiple tasks at a time means that work spreads quickly among the threads, eliminating the synchronization overhead per individual task.