Parallelism Analysis and Work Distribution
By Daniel Livshen. Based on Chapter 16, "Futures, Scheduling, and Work Distribution," of "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit.
Content
- Intro and motivation
- Analyzing Parallelism
- Work Distribution
Intro
Some applications break down naturally into parallel threads.
- Web server: creates a thread to handle each incoming request.
- Producer-consumer: every producer and every consumer can be represented as a thread.
But we are here to talk about the hard stuff: applications that have inherent parallelism, where it is not obvious how to take advantage of it. Our running example will be matrix multiplication. Recall the definition below.
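As a reminder, the product of two n-by-n matrices a and b is the matrix c with entries
$$ c_{ij} = \sum_{k=0}^{n-1} a_{ik}\, b_{kj}, \qquad 0 \le i, j < n. $$
Each of the n^2 entries is an independent dot product; this is the parallelism we will try to exploit.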
How to Parallelize?
First Try

class MMThread {
  double[][] a, b, c;
  int n;

  public MMThread(double[][] myA, double[][] myB) {
    n = myA.length;
    a = myA;
    b = myB;
    c = new double[n][n];
  }

  void multiply() throws InterruptedException {
    Worker[][] worker = new Worker[n][n];
    // Create one worker per entry of the result matrix.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col] = new Worker(row, col);
    // Start all workers.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].start();
    // Wait for all workers to finish.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].join();
  }

  class Worker extends Thread {
    int row, col;
    Worker(int myRow, int myCol) {
      row = myRow;
      col = myCol;
    }
    public void run() {
      // Each worker computes a single entry of c as a dot product.
      double dotProduct = 0.0;
      for (int i = 0; i < n; i++)
        dotProduct += a[row][i] * b[i][col];
      c[row][col] = dotProduct;
    }
  }
}
This might seem like an ideal design, but:
- Poor performance for large matrices (a million threads for a 1000x1000 matrix!).
- High memory consumption.
- Many short-lived threads.
Thread Pool (Second Try)
A data structure that connects threads to tasks.
- A set of long-lived threads; the number of threads can be dynamic or static (fixed).
- Each thread waits until it is assigned a task.
- The thread executes the task and rejoins the pool to await its next assignment.
Benefits:
- Better performance, thanks to the reuse of long-lived threads.
- Platform independence: the same code runs on small machines and on machines with hundreds of cores.
Thread Pool: Java Terms
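The code for these terms is not included in the slide text; here is a minimal sketch of the Java executor-framework vocabulary used in the rest of the talk: ExecutorService as the thread pool, Callable as a task that returns a value, and Future as a handle to a pending result.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadPoolTerms {
  public static void main(String[] args) throws Exception {
    // The thread pool: a fixed number of long-lived worker threads.
    ExecutorService exec = Executors.newFixedThreadPool(4);

    // A Callable is a task that returns a value.
    Callable<Integer> task = () -> 6 * 7;

    // submit() hands the task to the pool and immediately returns a Future,
    // a placeholder for the value the task will eventually produce.
    Future<Integer> future = exec.submit(task);

    // get() blocks until the task has finished and its result is available.
    System.out.println(future.get()); // prints 42

    exec.shutdown();
  }
}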
Back to Matrix Multiplication

public class Matrix {
  int dim;
  double[][] data;
  int rowDisplace, colDisplace;

  public Matrix(int d) {
    dim = d;
    rowDisplace = colDisplace = 0;
    data = new double[d][d];
  }

  private Matrix(double[][] matrix, int x, int y, int d) {
    data = matrix;
    rowDisplace = x;
    colDisplace = y;
    dim = d;
  }

  public double get(int row, int col) {
    return data[row + rowDisplace][col + colDisplace];
  }

  public void set(int row, int col, double value) {
    data[row + rowDisplace][col + colDisplace] = value;
  }

  public int getDim() {
    return dim;
  }

  // Splits the matrix into 4 sub-matrices that share the backing array.
  Matrix[][] split() {
    Matrix[][] result = new Matrix[2][2];
    int newDim = dim / 2;
    result[0][0] = new Matrix(data, rowDisplace, colDisplace, newDim);
    result[0][1] = new Matrix(data, rowDisplace, colDisplace + newDim, newDim);
    result[1][0] = new Matrix(data, rowDisplace + newDim, colDisplace, newDim);
    result[1][1] = new Matrix(data, rowDisplace + newDim, colDisplace + newDim, newDim);
    return result;
  }
}
Back to matrix multiplication (cont.)
- Split each of the two matrices into four sub-matrices.
- Multiply the eight pairs of sub-matrices in parallel.
- Compute the four sums of the eight products in parallel.
[Figure: block decomposition of the matrix product, showing the task creation step, the parallel multiplication tasks, and the parallel addition tasks.]
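To make the decomposition concrete: if each matrix is split into four half-size blocks, the product can be written in terms of eight block products and four block sums,
$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{pmatrix}.
$$
The eight block products are independent and can run in parallel; each of the four result blocks then needs one block addition.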
Back to matrix multiplication (cont.)
The multiplication is driven by a class that holds the thread pool and by a multiplying task. The task's constructor creates two scratch matrices to hold the matrix product terms. Next we describe the task that performs the actual work.
The task splits all the matrices and submits tasks to compute the eight product terms in parallel. Once they complete, it submits tasks to compute the four sums in parallel and waits for them to complete. A sketch of this structure appears below.
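The slides refer to this code without showing it; the following is a minimal sketch of how such a multiplication task might look, assuming the Matrix class above (same package) and a shared cached thread pool. The names MulTask, AddTask, and exec are illustrative, not necessarily those of the original code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MatrixMultiply {
  static final ExecutorService exec = Executors.newCachedThreadPool();

  // Task that computes c = a * b for square matrices whose dimension is a power of two.
  static class MulTask implements Runnable {
    Matrix a, b, c, lhs, rhs;   // lhs and rhs hold the two product terms to be summed

    MulTask(Matrix a, Matrix b, Matrix c) {
      this.a = a; this.b = b; this.c = c;
      this.lhs = new Matrix(a.getDim());
      this.rhs = new Matrix(a.getDim());
    }

    public void run() {
      try {
        if (a.getDim() == 1) {
          c.set(0, 0, a.get(0, 0) * b.get(0, 0));
          return;
        }
        Matrix[][] aa = a.split(), bb = b.split();
        Matrix[][] ll = lhs.split(), rr = rhs.split();
        // Submit the eight half-size products in parallel and wait for them.
        Future<?>[] mul = new Future<?>[8];
        int k = 0;
        for (int i = 0; i < 2; i++)
          for (int j = 0; j < 2; j++) {
            mul[k++] = exec.submit(new MulTask(aa[i][0], bb[0][j], ll[i][j]));
            mul[k++] = exec.submit(new MulTask(aa[i][1], bb[1][j], rr[i][j]));
          }
        for (Future<?> f : mul) f.get();
        // Then submit the four sums in parallel and wait for them.
        Matrix[][] cc = c.split();
        Future<?>[] add = new Future<?>[4];
        k = 0;
        for (int i = 0; i < 2; i++)
          for (int j = 0; j < 2; j++)
            add[k++] = exec.submit(new AddTask(ll[i][j], rr[i][j], cc[i][j]));
        for (Future<?> f : add) f.get();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }

  // Task that computes c = a + b entry by entry (kept sequential here for brevity).
  static class AddTask implements Runnable {
    Matrix a, b, c;
    AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
    public void run() {
      int n = a.getDim();
      for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
          c.set(i, j, a.get(i, j) + b.get(i, j));
    }
  }
}

With this sketch, new MulTask(a, b, c).run() fills c with the product of a and b for power-of-two dimensions; a cached pool is used so that tasks blocked in get() do not starve their children of threads.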
In Conclusion
Two attempts at the same algorithm:
- The first is inefficient: it simply allocates a huge number of threads and runs them all.
- The second performs much better: with a careful design and far fewer threads we achieve better performance.
- Analyzing the parallelism of an algorithm can help us design better solutions.
Analyzing Parallelism
Program DAG
A multithreaded computation can be represented as a DAG (directed acyclic graph).
- Each node represents a task.
- Each directed edge links a predecessor task to a successor task, where the successor depends on the predecessor's result.
- A node that creates a future has two outgoing edges: one to the spawned computation, and one to the next step of the same task (its continuation).
Example: the Fibonacci sequence.
Fibonacci Example
A multithreaded Fibonacci implementation using futures, with a thread pool that holds the tasks.
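The implementation itself is not included in the slide text; the following is a minimal sketch of a futures-based Fibonacci task in the spirit of the book, assuming a shared cached thread pool (a fixed-size pool could deadlock, because tasks block waiting for their children).

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FibTask implements Callable<Integer> {
  // Thread pool shared by all Fibonacci tasks.
  static final ExecutorService exec = Executors.newCachedThreadPool();
  final int arg;

  public FibTask(int n) {
    arg = n;
  }

  public Integer call() throws Exception {
    if (arg < 2)
      return arg;               // base cases: fib(0) = 0, fib(1) = 1
    // Spawn fib(n-1) and fib(n-2) as futures.
    Future<Integer> left = exec.submit(new FibTask(arg - 1));
    Future<Integer> right = exec.submit(new FibTask(arg - 2));
    // Wait for both children; these waits are the dependency edges of the DAG.
    return left.get() + right.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(exec.submit(new FibTask(4)).get()); // prints 3
    exec.shutdown();
  }
}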
The Fibonacci DAG for fib(4)
[Figure: the computation DAG for fib(4). Each call to fib(n) spawns child tasks fib(n-1) and fib(n-2), down to the base cases fib(1) and fib(0).]
Back to Program DAGs: The Fibonacci DAG for fib(4)
Analyzing Parallelism
What do we mean by the claim that "some computations are inherently more parallel than others"? We want to give a precise answer to this question.
Analyzing Parallelism
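The definitions used for this analysis, following the book's notation: $T_P$ is the time to execute the computation on $P$ dedicated processors, $T_1$ (the work) is the time on a single processor, and $T_\infty$ (the span, or critical-path length) is the time on an unlimited number of processors, i.e. the length of the longest directed path in the program DAG. The speedup on $P$ processors and the parallelism are then
$$ \text{speedup} = \frac{T_1}{T_P}, \qquad \text{parallelism} = \frac{T_1}{T_\infty}. $$
The parallelism is the maximum possible speedup: it tells us how many processors the computation can usefully keep busy.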
Example - Addition
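As a sketch, the standard divide-and-conquer analysis of adding two n-by-n matrices (split each matrix into four half-size blocks and add the four pairs of blocks in parallel) gives the recurrences
$$ A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2), \qquad A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n), $$
so the parallelism is $A_1(n)/A_\infty(n) = \Theta(n^2/\log n)$.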
Example - Multiplication
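Similarly, a sketch of the analysis for the divide-and-conquer multiplication above (eight half-size products in parallel, followed by the parallel additions) is
$$ M_1(n) = 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3), \qquad M_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n), $$
so the parallelism is $M_1(n)/M_\infty(n) = \Theta(n^3/\log^2 n)$, which is enormous even for matrices of modest size.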
In real life…
The multithreaded speedup we computed is not realistic; it is a highly idealized upper bound. In real life it is not easy to assign idle threads to idle processors. In some cases a program that displays less parallelism but consumes less memory may perform better, because it encounters fewer page faults. Still, this kind of analysis is a good indication of which problems can be effectively parallelized.
Realistic Multiprocessor Scheduling
Recall: Operating Systems
Scheduling happens at three levels:
- Multithreaded programs operate at the task level.
- A user-level scheduler maps tasks onto a fixed number of threads. This level can be controlled by the application, and the programmer can optimize it with good work distribution.
- The kernel maps threads onto hardware processors.
Greedy Schedulers
A greedy scheduler never leaves a processor idle while there is work it could be doing. In other words, it executes as many of the ready nodes as possible, given the number of available processors.
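For reference, the bound behind this claim (the greedy-scheduling theorem from the book): for a computation with work $T_1$ and span $T_\infty$, any greedy scheduler on $P$ processors achieves
$$ T_P \le \frac{T_1}{P} + T_\infty, $$
and since every scheduler needs at least $\max(T_1/P,\; T_\infty)$ time, a greedy execution is within a factor of two of optimal.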
Conclusion: greedy schedulers are a simple and practical way to achieve performance that is reasonably close to optimal.
Work Distribution
Intro
The key to achieving a good speedup is to keep user-level threads supplied with tasks.
- However, multithreaded computations create and destroy tasks dynamically, sometimes in unpredictable ways.
- We need a work distribution algorithm to assign ready tasks to idle threads as efficiently as possible.
Work Dealing
A simple approach to work distribution: an overloaded thread tries to offload tasks to other, less heavily loaded threads.
[Figure: Thread A, which holds a heavy task, offloads work to Thread B.]
But what if all threads are overloaded?
Work Stealing
The opposite approach: a thread that runs out of work tries to "steal" work from others.
[Figure: Thread B steals work from Thread A, which holds a heavy task.]
Does this fix the issue?
DEQue (Double-Ended Queue)
Each thread keeps its pool of ready tasks in a double-ended queue: the owner pushes and pops tasks at one end, while thieves steal tasks from the other end.
Algorithm Review
The work-stealing thread holds the array of all thread queues, its own id, and a random number generator.
- It pops a task from its own queue (pool) and runs it.
- If its pool is empty, it randomly picks a victim thread and tries to steal a job from it.
Why choose the victim at random? Random choice spreads the steals across threads and avoids contention on any single queue. A sketch of this loop appears below.
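The code is not included in the slide text; this is a minimal sketch of such a work-stealing loop, assuming a DEQue type with popBottom() (owner end), popTop() (thief end), and isEmpty(). The type and method names are illustrative.

import java.util.Random;

interface DEQue {
  void pushBottom(Runnable task); // owner adds work at the bottom
  Runnable popBottom();           // owner removes work at the bottom (null if empty)
  Runnable popTop();              // thieves steal from the top (null if empty or lost a race)
  boolean isEmpty();
}

public class WorkStealingThread implements Runnable {
  final DEQue[] queue;   // one task queue per thread, shared by all threads
  final int me;          // index of this thread's own queue
  final Random random = new Random();

  public WorkStealingThread(DEQue[] queue, int me) {
    this.queue = queue;
    this.me = me;
  }

  public void run() {
    Runnable task = queue[me].popBottom();
    while (true) {
      // Run tasks from the local queue as long as there are any.
      while (task != null) {
        task.run();
        task = queue[me].popBottom();
      }
      // Local queue is empty: pick a random victim and try to steal from it.
      while (task == null) {
        Thread.yield();   // give other threads a chance to make progress
        int victim = random.nextInt(queue.length);
        if (!queue[victim].isEmpty()) {
          task = queue[victim].popTop();
        }
      }
    }
  }
}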
Work Balancing
Another, alternative work-distribution approach: periodically, each thread balances its workload with a randomly chosen partner. What could be a problem?
- Solution: coin flipping! We ensure that lightly loaded threads are more likely to initiate rebalancing.
- Each thread periodically flips a biased coin to decide whether to balance with another thread.
- The probability of balancing is inversely proportional to the number of tasks in the thread's queue: fewer tasks means a higher chance of initiating a rebalance.
Work Balancing (cont.)
A thread rebalances by selecting a victim uniformly at random.
- If the difference between its workload and the victim's exceeds a predefined threshold, they transfer tasks until their queues contain the same number of tasks.
Benefits:
- Fairness.
- The balancing operation moves multiple tasks at each exchange.
- If one thread has much more work than the others, it is easy to spread its work over all threads.
Drawbacks:
- A good threshold value must be chosen for every platform.
Work Balancing Implementation
The work-sharing thread holds its queue of tasks and a random number generator; the best threshold value ultimately depends on the OS and platform. The thread runs forever: it executes a task from its queue and, occasionally, finds a victim and performs the balance.
Work Balancing Implementation (2)
The balancing method takes two queues and computes the difference between their sizes. If the difference is bigger than the threshold, it moves items from the bigger queue to the smaller one until their sizes are even. A sketch of the whole thread is shown below.
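Since the code itself is not in the slide text, here is a minimal sketch of a work-sharing (balancing) thread along the lines described above, assuming one Deque<Runnable> of tasks per thread; the constant THRESHOLD and the biased coin flip are as described on the previous slides, and the synchronization is deliberately simplified.

import java.util.Deque;
import java.util.Random;

public class WorkSharingThread implements Runnable {
  static final int THRESHOLD = 16;   // platform-dependent; needs tuning
  final Deque<Runnable>[] queue;     // one task queue per thread
  final int me;
  final Random random = new Random();

  public WorkSharingThread(Deque<Runnable>[] queue, int me) {
    this.queue = queue;
    this.me = me;
  }

  public void run() {
    while (true) {
      Runnable task;
      int size;
      synchronized (queue[me]) {
        task = queue[me].pollFirst();
        size = queue[me].size();
      }
      if (task != null)
        task.run();
      // Biased coin flip: probability 1/(size+1) of rebalancing,
      // so lightly loaded threads rebalance more often.
      if (random.nextInt(size + 1) == size) {
        int victim = random.nextInt(queue.length);
        // Lock the two queues in a fixed order to avoid deadlock.
        int min = Math.min(victim, me);
        int max = Math.max(victim, me);
        synchronized (queue[min]) {
          synchronized (queue[max]) {
            balance(queue[min], queue[max]);
          }
        }
      }
    }
  }

  // Moves tasks from the larger queue to the smaller one, but only if
  // the size difference exceeds THRESHOLD.
  private void balance(Deque<Runnable> q0, Deque<Runnable> q1) {
    Deque<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
    Deque<Runnable> qMax = (qMin == q0) ? q1 : q0;
    int diff = qMax.size() - qMin.size();
    if (diff > THRESHOLD)
      while (qMax.size() > qMin.size())
        qMin.addLast(qMax.pollFirst());
  }
}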
Conclusion
We have seen:
- How to implement multithreaded programs with thread pools.
- How to analyze the parallelism of an algorithm with precise tools.
- How scheduling works at the user level and how it can be improved.
- Different approaches to work distribution.
Thank You!