Parallelism Analysis and Work Distribution
By Daniel Livshen. Based on Chapter 16, "Futures, Scheduling and Work Distribution", of "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit.
Content
Art of Multiprocessor Programming - Computer Science Seminar 3 - 2014
- Intro and motivation
- Analyzing parallelism
- Work distribution
Intro
Some applications break down naturally into parallel threads. A web server, for example, creates a thread to handle each request.
Intro
Another example is the producer-consumer pattern: every producer and every consumer can be represented as a thread.
Intro
But we are here to talk about the hard stuff: applications that have inherent parallelism, but where it is not obvious how to take advantage of it. Our running example will be matrix multiplication. Recall:
How To Parallelize?
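The formula on this slide did not survive the export; the standard definition of the product of two n x n matrices, which the slide recalls, is:

```latex
(AB)_{ij} \;=\; \sum_{k=0}^{n-1} a_{ik}\, b_{kj}, \qquad 0 \le i, j < n .
```

Each of the n^2 entries of the result is a dot product of a row of A and a column of B, and no entry depends on any other entry. That independence is the parallelism we will try to exploit.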
Intro: First Try

```java
class MMThread {
  double[][] a, b, c;
  int n;

  public MMThread(double[][] myA, double[][] myB) {
    n = myA.length;
    a = myA;
    b = myB;
    c = new double[n][n];
  }

  void multiply() throws InterruptedException {
    Worker[][] worker = new Worker[n][n];
    // One thread per entry of the result matrix.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col] = new Worker(row, col);
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].start();
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].join();
  }

  class Worker extends Thread {
    int row, col;
    Worker(int myRow, int myCol) {
      row = myRow; col = myCol;
    }
    public void run() {
      double dotProduct = 0.0;
      for (int i = 0; i < n; i++)
        dotProduct += a[row][i] * b[i][col];
      c[row][col] = dotProduct;
    }
  }
}
```
Intro: First Try
This might seem like an ideal design, but:
- Poor performance for large matrices (a million threads for 1000 x 1000 matrices!).
- High memory consumption.
- Many short-lived threads.
Intro: Thread Pool (Second Try)
- A data structure that connects threads to tasks.
- A set of long-lived threads; their number can be dynamic or static (fixed).
- Each thread waits until it is assigned a task, executes it, and rejoins the pool to await its next assignment.
Benefits:
- Better performance, thanks to the reuse of long-lived threads.
- Platform independence: the same code runs on small machines and on machines with hundreds of cores.
Intro: Thread Pool Java Terms
Intro: Thread Pool Java Terms (cont.)
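The tables on these two slides were images. As a stand-in, here is a minimal sketch of the Java terms the slides cover: `ExecutorService` (the pool), `Callable<T>` (a task that returns a value), and `Future<T>` (a handle used to wait for and fetch that value). The `sumRange` helper is hypothetical, for illustration only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    // Submit one Callable to a fixed pool and wait for its result.
    public static int sumRange(int lo, int hi) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 long-lived threads
        Callable<Integer> task = () -> {            // a task that returns a value
            int s = 0;
            for (int i = lo; i < hi; i++) s += i;
            return s;
        };
        Future<Integer> future = pool.submit(task); // runs on a pool thread
        int result = future.get();                  // blocks until the task completes
        pool.shutdown();                            // let the pool's threads exit
        return result;
    }
}
```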
Intro: Back to matrix multiplication

```java
public class Matrix {
  int dim;
  double[][] data;
  int rowDisplace, colDisplace;

  public Matrix(int d) {
    dim = d;
    rowDisplace = colDisplace = 0;
    data = new double[d][d];
  }
  private Matrix(double[][] matrix, int x, int y, int d) {
    data = matrix;
    rowDisplace = x;
    colDisplace = y;
    dim = d;
  }
  public double get(int row, int col) {
    return data[row + rowDisplace][col + colDisplace];
  }
  public void set(int row, int col, double value) {
    data[row + rowDisplace][col + colDisplace] = value;
  }
  public int getDim() {
    return dim;
  }
  // Splits the matrix into four sub-matrices (quadrants). The quadrants
  // share the backing array; only the displacements differ.
  Matrix[][] split() {
    Matrix[][] result = new Matrix[2][2];
    int newDim = dim / 2;
    result[0][0] = new Matrix(data, rowDisplace, colDisplace, newDim);
    result[0][1] = new Matrix(data, rowDisplace, colDisplace + newDim, newDim);
    result[1][0] = new Matrix(data, rowDisplace + newDim, colDisplace, newDim);
    result[1][1] = new Matrix(data, rowDisplace + newDim, colDisplace + newDim, newDim);
    return result;
  }
}
```
Intro: Back to matrix multiplication (cont.)
The plan: split each of the two matrices into 4 sub-matrices, compute the 8 sub-matrix products in parallel, then compute the 4 sums of those products.
Intro: Back to matrix multiplication (cont.)
Example with 2 x 2 matrices:

| 1 2 |   | 1 2 |   |  7 10 |
| 3 4 | x | 3 4 | = | 15 22 |
Intro: Back to matrix multiplication (cont.)
The slide shows the same 2 x 2 example as a task graph: task creation splits the matrices, the block products run as parallel multiplications, and the sums of the products run as parallel additions.
Intro: Back to matrix multiplication (cont.)
The code on this slide defines the class that holds the thread pool and the multiplying task; the task's constructor creates two Matrices to hold the matrix product terms. Next we describe the thread that performs the job.
Intro: Back to matrix multiplication (cont.)
The worker splits all the matrices and submits tasks to compute the eight product terms in parallel. Once they are complete, it submits tasks to compute the four sums in parallel and waits for them to complete.
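The code for this slide was an image. Below is a compact, runnable sketch of the scheme just described, not the book's exact code: it works on plain `double[][]` arrays rather than the `Matrix` wrapper, assumes the dimension is a power of two, and for brevity computes the four sums inline instead of submitting them as separate tasks. The eight quadrant-product tasks do run in parallel on the pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMM {
    // Daemon threads so the demo JVM can exit; a cached pool grows as the
    // recursion spawns nested tasks that block on their children.
    static final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Multiplies two square matrices whose dimension is a power of two.
    public static double[][] multiply(double[][] a, double[][] b) throws Exception {
        int n = a.length;
        if (n == 1)
            return new double[][] { { a[0][0] * b[0][0] } };
        int h = n / 2;
        // Task (i, j, k) computes the quadrant product A_ik * B_kj;
        // all eight are submitted to the pool and run in parallel.
        Future<double[][]>[][][] prod = new Future[2][2][2];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++) {
                    final int fi = i, fj = j, fk = k;
                    prod[i][j][k] = pool.submit(() ->
                            multiply(quad(a, fi, fk, h), quad(b, fk, fj, h)));
                }
        // C_ij = A_i0 * B_0j + A_i1 * B_1j: sum each pair of products.
        double[][] c = new double[n][n];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                double[][] s0 = prod[i][j][0].get(); // waits for the product tasks
                double[][] s1 = prod[i][j][1].get();
                for (int r = 0; r < h; r++)
                    for (int q = 0; q < h; q++)
                        c[i * h + r][j * h + q] = s0[r][q] + s1[r][q];
            }
        return c;
    }

    // Copies quadrant (i, j) of m into a fresh h x h array.
    static double[][] quad(double[][] m, int i, int j, int h) {
        double[][] q = new double[h][h];
        for (int r = 0; r < h; r++)
            for (int c = 0; c < h; c++)
                q[r][c] = m[i * h + r][j * h + c];
        return q;
    }
}
```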
In Conclusion
- We saw two attempts at the same algorithm.
- The first is inefficient because it is not smart: it just allocates thousands of short-lived threads and executes them.
- The second is a lot better: with a good design and fewer threads we achieve better performance.
- Some analysis of the parallelism can help us design better solutions for the same algorithm.
Analyzing Parallelism
Program DAG
- A multithreaded computation can be represented as a directed acyclic graph (DAG).
- Each node represents a task.
- Each directed edge links a predecessor task to a successor task, where the successor depends on the predecessor's result.
- A node that creates a future has two outgoing dependencies: the spawned computation, and the next task in the same node.
Example: the Fibonacci sequence.
Fibonacci Example
A multithreaded Fibonacci implementation with futures, using a thread pool that holds the tasks:
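The implementation on this slide was an image; here is a runnable sketch in its spirit (the book's `FibTask` submits `Callable` tasks to a pool). Each call hands `fib(n-1)` to the pool as a future and computes `fib(n-2)` itself; the `get()` call is the dependency edge that appears in the DAG on the next slide.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FibDemo {
    // Daemon threads so the JVM can exit when the demo is done.
    static final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static int fib(int n) throws Exception {
        if (n < 2) return n;                                   // fib(0) = 0, fib(1) = 1
        Future<Integer> left = pool.submit(() -> fib(n - 1));  // spawned subtask
        int right = fib(n - 2);                                // computed by this task
        return left.get() + right;                             // wait for the future
    }
}
```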
The Fibonacci DAG for fib(4)

fib(4)
├── fib(3)
│   ├── fib(2)
│   │   ├── fib(1)
│   │   └── fib(0)
│   └── fib(1)
└── fib(2)
    ├── fib(1)
    └── fib(0)
Back to Program DAGs: The Fibonacci DAG for fib(4)
Analyzing Parallelism
What do we mean by the notion that "some computations are inherently more parallel than others"? We want to give a precise answer to this question.
Analyzing Parallelism
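The definitions on this slide were an image; reconstructed from the chapter, the standard measures are:

```latex
\begin{aligned}
T_P &: \text{ time to execute the computation on } P \text{ processors} \\
T_1 &: \text{ the \emph{work}: total time on a single processor} \\
T_\infty &: \text{ the \emph{critical-path length}: time given an unbounded number of processors} \\
\text{speedup on } P \text{ processors} &= T_1 / T_P, \qquad
\text{parallelism} = T_1 / T_\infty
\end{aligned}
```

The parallelism T1/T-infinity is the maximum possible speedup, and it gives the precise answer to the question above: a computation is "inherently more parallel" when this ratio is larger.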
Example - Addition
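The figures on this slide were an image. From the chapter's analysis of adding two n x n matrices by splitting into quadrants (four half-size additions done in parallel, plus constant work to split):

```latex
T_1(n) = 4\,T_1(n/2) + \Theta(1) = \Theta(n^2), \qquad
T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\log n)
```

so the parallelism of parallel addition is Theta(n^2 / log n).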
Example - Multiplication
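Likewise reconstructed from the chapter: the divide-and-conquer multiplication does eight half-size products in parallel, followed by parallel additions, giving

```latex
T_1(n) = 8\,T_1(n/2) + \Theta(n^2) = \Theta(n^3), \qquad
T_\infty(n) = T_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)
```

so the parallelism is Theta(n^3 / log^2 n): enormous, roughly 10^7 for 1000 x 1000 matrices.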
In real life…
The multithreaded speedup we computed is not realistic; it is a highly idealized upper bound. In real life it is not easy to assign idle threads to idle processors. In some cases a program that displays less parallelism but consumes less memory may perform better, because it encounters fewer page faults. Still, this kind of analysis is a good indication of which problems can be solved well in parallel.
Realistic Multiprocessor Scheduling
Recall: Operating Systems
- Multithreaded programs work at the task level.
- A user-level scheduler maps tasks to a fixed number of threads. This level can be controlled by the application: the programmer can optimize it with good work distribution.
- The kernel maps threads to hardware processors.
Greedy Schedulers
A greedy scheduler executes as many of the ready nodes as possible, given the number of available processors.
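The bound on this slide was an image; the chapter's greedy-scheduling theorem (in the spirit of Graham and Brent) states that for any greedy scheduler on P processors,

```latex
T_P \;\le\; \frac{T_1}{P} + T_\infty .
```

Since every execution needs at least max(T1/P, T-infinity) time, a greedy schedule is always within a factor of two of optimal.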
Conclusion: greedy schedulers are a simple and practical way to achieve performance that is reasonably close to optimal.
Work Distribution
Intro
- The key to achieving a good speedup is to keep user-level threads supplied with tasks.
- However, multithreaded computations create and destroy tasks dynamically, sometimes in unpredictable ways.
- We need a work distribution algorithm to assign ready tasks to idle threads as efficiently as possible.
Work Dealing
A simple approach to work distribution: an overloaded thread tries to offload tasks to other, less heavily loaded threads (in the slide's figure, thread A offloads a heavy task to thread B). But what if all threads are overloaded?
Work Stealing
The opposite approach: a thread that runs out of work tries to "steal" work from others (in the slide's figure, thread B steals work from thread A). Does this fix the issue?
DEQueue
Each thread keeps its pool of tasks in a double-ended queue (DEQueue): the owner pushes and pops tasks at one end, while thieves steal from the other end.
Algorithm Review
The work-stealing thread holds the array of all thread queues, its own id, and a random number generator. It pops a task from its own queue (pool) and runs it. If the pool is empty, it randomly picks a victim to steal a job from. Why randomly? A random choice needs no coordination between threads and spreads the stealing attempts across victims.
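The code on this slide was an image. Here is a simplified, runnable sketch of the loop just described, with two deliberate departures from the book's version: plain `ArrayDeque`s stand in for the lock-free DEQueue, and the loop terminates when every queue is empty so the sketch can run on its own (a real pool thread never exits).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

public class WorkStealingThread {
    final Deque<Runnable>[] queues; // one task queue per thread, shared by all
    final int me;                   // this thread's index into queues
    final Random random = new Random();

    public WorkStealingThread(Deque<Runnable>[] queues, int me) {
        this.queues = queues;
        this.me = me;
    }

    public void run() {
        while (true) {
            Runnable task;
            while ((task = queues[me].pollLast()) != null)
                task.run();                        // pop from the bottom of my own queue
            int victim = random.nextInt(queues.length);
            task = queues[victim].pollFirst();     // steal from the top of a random victim
            if (task != null)
                queues[me].addLast(task);          // keep the stolen task locally
            else if (allEmpty())
                return;                            // simplification: stop when no work remains
        }
    }

    boolean allEmpty() {
        for (Deque<Runnable> q : queues)
            if (!q.isEmpty()) return false;
        return true;
    }
}
```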
Work Balancing
- Another, alternative work-distribution approach: periodically, each thread balances its workload with a randomly chosen partner.
- What could be a problem? The balancing work itself should not fall on the heavily loaded threads.
- Solution: coin flipping! We ensure that lightly loaded threads are the ones more likely to initiate rebalancing.
- Each thread periodically flips a biased coin to decide whether to balance with another. The thread's probability of balancing is inversely proportional to the number of tasks in its queue: fewer tasks means a higher chance of initiating a balance.
Work Balancing (cont.)
- A thread rebalances by selecting a victim uniformly at random.
- If the difference between its workload and the victim's exceeds a predefined threshold, they transfer tasks until their queues contain the same number of tasks.
- The algorithm's benefits:
  - Fairness.
  - The balancing operation moves multiple tasks at each exchange.
  - If one thread has much more work than the others, it is easy to spread its work over all threads.
- The algorithm's drawbacks:
  - A good threshold value must be found for every platform.
Work Balancing Implementation
The work-sharing thread holds its queue of tasks and a random number generator. It runs forever: it executes a task from its queue, then decides (by the biased coin flip) whether to find a victim and do the balance. The best threshold value ultimately depends on the OS and the platform.
Work Balancing Implementation (2)
The balance method gets two queues and calculates the difference between their sizes. If the difference is bigger than the threshold, it moves items from the bigger queue to the smaller one.
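The code on this slide was also an image; below is a minimal sketch of the balance step it describes. The threshold value and the queue type are illustrative assumptions, not the book's exact choices.

```java
import java.util.Queue;

public class Balancer {
    static final int THRESHOLD = 4; // platform-dependent, as the slides note

    // If the queues' sizes differ by more than THRESHOLD, move tasks from
    // the bigger queue to the smaller one until their sizes are equal.
    public static void balance(Queue<Runnable> q0, Queue<Runnable> q1) {
        Queue<Runnable> qMin = q0.size() < q1.size() ? q0 : q1;
        Queue<Runnable> qMax = (qMin == q0) ? q1 : q0;
        int diff = qMax.size() - qMin.size();
        if (diff > THRESHOLD)
            while (qMax.size() > qMin.size())
                qMin.add(qMax.remove());
    }
}
```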
Conclusion
We saw:
- How to implement multithreaded programs with thread pools.
- How to analyze the parallelism of an algorithm with precise tools.
- How to improve thread scheduling at the user level.
- Different approaches to work distribution.
Thank You!