Parallelism Analysis and Work Distribution
By Daniel Livshen. Based on Chapter 16, "Futures, Scheduling, and Work Distribution," of "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit.
Content
- Intro and motivation
- Analyzing Parallelism
- Work Distribution
Intro
Some applications break down naturally into parallel threads.
- Web server: creates a thread to handle each incoming request.
- Producer-consumer: every producer and every consumer can be represented as a thread.
But we are here to talk about the hard stuff: applications that have inherent parallelism, where it is not obvious how to take advantage of it. Our running example will be matrix multiplication. Recall the definition below.
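As a reminder, the product of two n-by-n matrices a and b is the matrix c with entries
$$ c_{ij} = \sum_{k=0}^{n-1} a_{ik}\, b_{kj}, \qquad 0 \le i, j < n. $$
Each of the n^2 entries is an independent dot product; this is the parallelism we will try to exploit.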
How to Parallelize?
First Try

class MMThread {
  double[][] a, b, c;
  int n;

  public MMThread(double[][] myA, double[][] myB) {
    n = myA.length;
    a = myA;
    b = myB;
    c = new double[n][n];
  }

  void multiply() throws InterruptedException {
    Worker[][] worker = new Worker[n][n];
    // Create one worker per entry of the result matrix.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col] = new Worker(row, col);
    // Start all workers.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].start();
    // Wait for all workers to finish.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].join();
  }

  class Worker extends Thread {
    int row, col;
    Worker(int myRow, int myCol) {
      row = myRow;
      col = myCol;
    }
    public void run() {
      // Each worker computes a single entry of c as a dot product.
      double dotProduct = 0.0;
      for (int i = 0; i < n; i++)
        dotProduct += a[row][i] * b[i][col];
      c[row][col] = dotProduct;
    }
  }
}
This might seem like an ideal design, but:
- Poor performance for large matrices (a million threads for a 1000x1000 matrix!).
- High memory consumption.
- Many short-lived threads.
Thread Pool (Second Try)
A data structure that connects threads to tasks.
- A set of long-lived threads; the number of threads can be dynamic or static (fixed).
- Each thread waits until it is assigned a task.
- The thread executes the task and rejoins the pool to await its next assignment.
Benefits:
- Better performance, thanks to the reuse of long-lived threads.
- Platform independence: the same code runs on small machines and on machines with hundreds of cores.
Thread Pool: Java Terms
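The code for these terms is not included in the slide text; here is a minimal sketch of the Java executor-framework vocabulary used in the rest of the talk: ExecutorService as the thread pool, Callable as a task that returns a value, and Future as a handle to a pending result.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadPoolTerms {
  public static void main(String[] args) throws Exception {
    // The thread pool: a fixed number of long-lived worker threads.
    ExecutorService exec = Executors.newFixedThreadPool(4);

    // A Callable is a task that returns a value.
    Callable<Integer> task = () -> 6 * 7;

    // submit() hands the task to the pool and immediately returns a Future,
    // a placeholder for the value the task will eventually produce.
    Future<Integer> future = exec.submit(task);

    // get() blocks until the task has finished and its result is available.
    System.out.println(future.get()); // prints 42

    exec.shutdown();
  }
}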
Back to Matrix Multiplication

public class Matrix {
  int dim;
  double[][] data;
  int rowDisplace, colDisplace;

  public Matrix(int d) {
    dim = d;
    rowDisplace = colDisplace = 0;
    data = new double[d][d];
  }

  private Matrix(double[][] matrix, int x, int y, int d) {
    data = matrix;
    rowDisplace = x;
    colDisplace = y;
    dim = d;
  }

  public double get(int row, int col) {
    return data[row + rowDisplace][col + colDisplace];
  }

  public void set(int row, int col, double value) {
    data[row + rowDisplace][col + colDisplace] = value;
  }

  public int getDim() {
    return dim;
  }

  // Splits the matrix into 4 sub-matrices that share the backing array.
  Matrix[][] split() {
    Matrix[][] result = new Matrix[2][2];
    int newDim = dim / 2;
    result[0][0] = new Matrix(data, rowDisplace, colDisplace, newDim);
    result[0][1] = new Matrix(data, rowDisplace, colDisplace + newDim, newDim);
    result[1][0] = new Matrix(data, rowDisplace + newDim, colDisplace, newDim);
    result[1][1] = new Matrix(data, rowDisplace + newDim, colDisplace + newDim, newDim);
    return result;
  }
}
Back to matrix multiplication (cont.)
- Split each of the two matrices into four sub-matrices.
- Multiply the eight pairs of sub-matrices in parallel.
- Compute the four sums of the eight products in parallel.
[Figure: block decomposition of the matrix product, showing the task creation step, the parallel multiplication tasks, and the parallel addition tasks.]
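To make the decomposition concrete: if each matrix is split into four half-size blocks, the product can be written in terms of eight block products and four block sums,
$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{pmatrix}.
$$
The eight block products are independent and can run in parallel; each of the four result blocks then needs one block addition.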
Back to matrix multiplication (cont.)
The multiplication is driven by a class that holds the thread pool and by a multiplying task. The task's constructor creates two scratch matrices to hold the matrix product terms. Next we describe the task that performs the actual work.
The task splits all the matrices and submits tasks to compute the eight product terms in parallel. Once they complete, it submits tasks to compute the four sums in parallel and waits for them to complete. A sketch of this structure appears below.
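The slides refer to this code without showing it; the following is a minimal sketch of how such a multiplication task might look, assuming the Matrix class above (same package) and a shared cached thread pool. The names MulTask, AddTask, and exec are illustrative, not necessarily those of the original code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MatrixMultiply {
  static final ExecutorService exec = Executors.newCachedThreadPool();

  // Task that computes c = a * b for square matrices whose dimension is a power of two.
  static class MulTask implements Runnable {
    Matrix a, b, c, lhs, rhs;   // lhs and rhs hold the two product terms to be summed

    MulTask(Matrix a, Matrix b, Matrix c) {
      this.a = a; this.b = b; this.c = c;
      this.lhs = new Matrix(a.getDim());
      this.rhs = new Matrix(a.getDim());
    }

    public void run() {
      try {
        if (a.getDim() == 1) {
          c.set(0, 0, a.get(0, 0) * b.get(0, 0));
          return;
        }
        Matrix[][] aa = a.split(), bb = b.split();
        Matrix[][] ll = lhs.split(), rr = rhs.split();
        // Submit the eight half-size products in parallel and wait for them.
        Future<?>[] mul = new Future<?>[8];
        int k = 0;
        for (int i = 0; i < 2; i++)
          for (int j = 0; j < 2; j++) {
            mul[k++] = exec.submit(new MulTask(aa[i][0], bb[0][j], ll[i][j]));
            mul[k++] = exec.submit(new MulTask(aa[i][1], bb[1][j], rr[i][j]));
          }
        for (Future<?> f : mul) f.get();
        // Then submit the four sums in parallel and wait for them.
        Matrix[][] cc = c.split();
        Future<?>[] add = new Future<?>[4];
        k = 0;
        for (int i = 0; i < 2; i++)
          for (int j = 0; j < 2; j++)
            add[k++] = exec.submit(new AddTask(ll[i][j], rr[i][j], cc[i][j]));
        for (Future<?> f : add) f.get();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }

  // Task that computes c = a + b entry by entry (kept sequential here for brevity).
  static class AddTask implements Runnable {
    Matrix a, b, c;
    AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
    public void run() {
      int n = a.getDim();
      for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
          c.set(i, j, a.get(i, j) + b.get(i, j));
    }
  }
}

With this sketch, new MulTask(a, b, c).run() fills c with the product of a and b for power-of-two dimensions; a cached pool is used so that tasks blocked in get() do not starve their children of threads.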
In Conclusion
Two attempts at the same algorithm:
- The first is inefficient: it simply allocates a huge number of threads and runs them all.
- The second performs much better: with a careful design and far fewer threads we achieve better performance.
- Analyzing the parallelism of an algorithm can help us design better solutions.
Analyzing Parallelism
Program DAG
A multithreaded computation can be represented as a DAG (directed acyclic graph).
- Each node represents a task.
- Each directed edge links a predecessor task to a successor task, where the successor depends on the predecessor's result.
- A node that creates a future has two outgoing edges: one to the spawned computation, and one to the next step of the same task (its continuation).
Example: the Fibonacci sequence.
Fibonacci Example
A multithreaded Fibonacci implementation using futures, with a thread pool that holds the tasks.
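The implementation itself is not included in the slide text; the following is a minimal sketch of a futures-based Fibonacci task in the spirit of the book, assuming a shared cached thread pool (a fixed-size pool could deadlock, because tasks block waiting for their children).

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FibTask implements Callable<Integer> {
  // Thread pool shared by all Fibonacci tasks.
  static final ExecutorService exec = Executors.newCachedThreadPool();
  final int arg;

  public FibTask(int n) {
    arg = n;
  }

  public Integer call() throws Exception {
    if (arg < 2)
      return arg;               // base cases: fib(0) = 0, fib(1) = 1
    // Spawn fib(n-1) and fib(n-2) as futures.
    Future<Integer> left = exec.submit(new FibTask(arg - 1));
    Future<Integer> right = exec.submit(new FibTask(arg - 2));
    // Wait for both children; these waits are the dependency edges of the DAG.
    return left.get() + right.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(exec.submit(new FibTask(4)).get()); // prints 3
    exec.shutdown();
  }
}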
The Fibonacci DAG for fib(4)
[Figure: the computation DAG for fib(4). Each call to fib(n) spawns child tasks fib(n-1) and fib(n-2), down to the base cases fib(1) and fib(0).]
Back to Program DAGs: The Fibonacci DAG for fib(4)
Analyzing Parallelism
What do we mean by the claim that "some computations are inherently more parallel than others"? We want to give a precise answer to this question.
Analyzing Parallelism
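The definitions used for this analysis, following the book's notation: $T_P$ is the time to execute the computation on $P$ dedicated processors, $T_1$ (the work) is the time on a single processor, and $T_\infty$ (the span, or critical-path length) is the time on an unlimited number of processors, i.e. the length of the longest directed path in the program DAG. The speedup on $P$ processors and the parallelism are then
$$ \text{speedup} = \frac{T_1}{T_P}, \qquad \text{parallelism} = \frac{T_1}{T_\infty}. $$
The parallelism is the maximum possible speedup: it tells us how many processors the computation can usefully keep busy.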
Example - Addition
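As a sketch, the standard divide-and-conquer analysis of adding two n-by-n matrices (split each matrix into four half-size blocks and add the four pairs of blocks in parallel) gives the recurrences
$$ A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2), \qquad A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n), $$
so the parallelism is $A_1(n)/A_\infty(n) = \Theta(n^2/\log n)$.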
Example - Multiplication
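Similarly, a sketch of the analysis for the divide-and-conquer multiplication above (eight half-size products in parallel, followed by the parallel additions) is
$$ M_1(n) = 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3), \qquad M_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n), $$
so the parallelism is $M_1(n)/M_\infty(n) = \Theta(n^3/\log^2 n)$, which is enormous even for matrices of modest size.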
In real life…
The multithreaded speedup we computed is not realistic; it is a highly idealized upper bound. In real life it is not easy to assign idle threads to idle processors. In some cases a program that displays less parallelism but consumes less memory may perform better, because it encounters fewer page faults. Still, this kind of analysis is a good indication of which problems can be effectively parallelized.
Realistic Multiprocessor Scheduling
Recall: Operating Systems
Scheduling happens at three levels:
- Multithreaded programs operate at the task level.
- A user-level scheduler maps tasks onto a fixed number of threads. This level can be controlled by the application, and the programmer can optimize it with good work distribution.
- The kernel maps threads onto hardware processors.
Greedy Schedulers
A greedy scheduler never leaves a processor idle while there is work it could be doing. In other words, it executes as many of the ready nodes as possible, given the number of available processors.
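For reference, the bound behind this claim (the greedy-scheduling theorem from the book): for a computation with work $T_1$ and span $T_\infty$, any greedy scheduler on $P$ processors achieves
$$ T_P \le \frac{T_1}{P} + T_\infty, $$
and since every scheduler needs at least $\max(T_1/P,\; T_\infty)$ time, a greedy execution is within a factor of two of optimal.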
Conclusion: greedy schedulers are a simple and practical way to achieve performance that is reasonably close to optimal.
Work Distribution
Intro
The key to achieving a good speedup is to keep user-level threads supplied with tasks.
- However, multithreaded computations create and destroy tasks dynamically, sometimes in unpredictable ways.
- We need a work distribution algorithm to assign ready tasks to idle threads as efficiently as possible.
Work Dealing
A simple approach to work distribution: an overloaded thread tries to offload tasks to other, less heavily loaded threads.
[Figure: Thread A, which holds a heavy task, offloads work to Thread B.]
But what if all threads are overloaded?
Work Stealing
The opposite approach: a thread that runs out of work tries to "steal" work from others.
[Figure: Thread B steals work from Thread A, which holds a heavy task.]
Does this fix the issue?
DEQue (Double-Ended Queue)
Each thread keeps its pool of ready tasks in a double-ended queue: the owner pushes and pops tasks at one end, while thieves steal tasks from the other end.
Algorithm Review
The work-stealing thread holds the array of all thread queues, its own id, and a random number generator.
- It pops a task from its own queue (pool) and runs it.
- If its pool is empty, it randomly picks a victim thread and tries to steal a job from it.
Why choose the victim at random? Random choice spreads the steals across threads and avoids contention on any single queue. A sketch of this loop appears below.
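The code is not included in the slide text; this is a minimal sketch of such a work-stealing loop, assuming a DEQue type with popBottom() (owner end), popTop() (thief end), and isEmpty(). The type and method names are illustrative.

import java.util.Random;

interface DEQue {
  void pushBottom(Runnable task); // owner adds work at the bottom
  Runnable popBottom();           // owner removes work at the bottom (null if empty)
  Runnable popTop();              // thieves steal from the top (null if empty or lost a race)
  boolean isEmpty();
}

public class WorkStealingThread implements Runnable {
  final DEQue[] queue;   // one task queue per thread, shared by all threads
  final int me;          // index of this thread's own queue
  final Random random = new Random();

  public WorkStealingThread(DEQue[] queue, int me) {
    this.queue = queue;
    this.me = me;
  }

  public void run() {
    Runnable task = queue[me].popBottom();
    while (true) {
      // Run tasks from the local queue as long as there are any.
      while (task != null) {
        task.run();
        task = queue[me].popBottom();
      }
      // Local queue is empty: pick a random victim and try to steal from it.
      while (task == null) {
        Thread.yield();   // give other threads a chance to make progress
        int victim = random.nextInt(queue.length);
        if (!queue[victim].isEmpty()) {
          task = queue[victim].popTop();
        }
      }
    }
  }
}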
Work Balancing
Another, alternative work-distribution approach: periodically, each thread balances its workload with a randomly chosen partner. What could be a problem?
- Solution: coin flipping! We ensure that lightly loaded threads are more likely to initiate rebalancing.
- Each thread periodically flips a biased coin to decide whether to balance with another thread.
- The probability of balancing is inversely proportional to the number of tasks in the thread's queue: fewer tasks means a higher chance of initiating a rebalance.
Work Balancing (cont.)
A thread rebalances by selecting a victim uniformly at random.
- If the difference between its workload and the victim's exceeds a predefined threshold, they transfer tasks until their queues contain the same number of tasks.
Benefits:
- Fairness.
- The balancing operation moves multiple tasks at each exchange.
- If one thread has much more work than the others, it is easy to spread its work over all threads.
Drawbacks:
- A good threshold value must be chosen for every platform.
Work Balancing Implementation
The work-sharing thread holds its queue of tasks and a random number generator; the best threshold value ultimately depends on the OS and platform. The thread runs forever: it executes a task from its queue and, occasionally, finds a victim and performs the balance.
Work Balancing Implementation (2)
The balancing method takes two queues and computes the difference between their sizes. If the difference is bigger than the threshold, it moves items from the bigger queue to the smaller one until their sizes are even. A sketch of the whole thread is shown below.
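Since the code itself is not in the slide text, here is a minimal sketch of a work-sharing (balancing) thread along the lines described above, assuming one Deque<Runnable> of tasks per thread; the constant THRESHOLD and the biased coin flip are as described on the previous slides, and the synchronization is deliberately simplified.

import java.util.Deque;
import java.util.Random;

public class WorkSharingThread implements Runnable {
  static final int THRESHOLD = 16;   // platform-dependent; needs tuning
  final Deque<Runnable>[] queue;     // one task queue per thread
  final int me;
  final Random random = new Random();

  public WorkSharingThread(Deque<Runnable>[] queue, int me) {
    this.queue = queue;
    this.me = me;
  }

  public void run() {
    while (true) {
      Runnable task;
      int size;
      synchronized (queue[me]) {
        task = queue[me].pollFirst();
        size = queue[me].size();
      }
      if (task != null)
        task.run();
      // Biased coin flip: probability 1/(size+1) of rebalancing,
      // so lightly loaded threads rebalance more often.
      if (random.nextInt(size + 1) == size) {
        int victim = random.nextInt(queue.length);
        // Lock the two queues in a fixed order to avoid deadlock.
        int min = Math.min(victim, me);
        int max = Math.max(victim, me);
        synchronized (queue[min]) {
          synchronized (queue[max]) {
            balance(queue[min], queue[max]);
          }
        }
      }
    }
  }

  // Moves tasks from the larger queue to the smaller one, but only if
  // the size difference exceeds THRESHOLD.
  private void balance(Deque<Runnable> q0, Deque<Runnable> q1) {
    Deque<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
    Deque<Runnable> qMax = (qMin == q0) ? q1 : q0;
    int diff = qMax.size() - qMin.size();
    if (diff > THRESHOLD)
      while (qMax.size() > qMin.size())
        qMin.addLast(qMax.pollFirst());
  }
}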
Conclusion
We have seen:
- How to implement multithreaded programs with thread pools.
- How to analyze the parallelism of an algorithm with precise tools.
- How scheduling works at the user level and how it can be improved.
- Different approaches to work distribution.
Thank You!