Computer Science 320 Load Balancing for Hybrid SMP/Clusters

Load Balancing Strategies
– For SMP, use a dynamic schedule that breaks the work into smaller chunks, keeping the threads continually busy
– For a cluster, use the master/worker pattern with a dynamic schedule to keep the nodes continually busy
– For a hybrid, put several worker threads in each node and schedule them as in the cluster program
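All three strategies rest on the same idea: idle workers pull the next small chunk of the index range on demand, so faster workers simply take more chunks. A minimal sketch of that idea in plain Java (not the Parallel Java API; N, CHUNK, and the thread count here are hypothetical):

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustration only: the idea behind a dynamic schedule. Each thread
    // repeatedly claims the next small chunk of the iteration range until
    // the range is exhausted.
    public class DynamicScheduleSketch {
        static final int N = 1000;       // total iterations (hypothetical)
        static final int CHUNK = 10;     // chunk size (hypothetical)
        static final AtomicInteger next = new AtomicInteger(0);

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[4];
            for (int t = 0; t < threads.length; ++t) {
                threads[t] = new Thread(() -> {
                    for (;;) {
                        int lb = next.getAndAdd(CHUNK);    // claim the next chunk
                        if (lb >= N) break;                // no work left
                        int ub = Math.min(lb + CHUNK, N);
                        for (int i = lb; i < ub; ++i) {
                            // process iteration i (the "work")
                        }
                    }
                });
                threads[t].start();
            }
            for (Thread th : threads) th.join();
        }
    }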

One-Level Scheduling Strategy
(Diagrams: the cluster scheduling layout and the hybrid scheduling layout.)

Hybrid Mandelbrot Set Program
– Each of Kp nodes has Kt worker threads; node 0 has one extra thread (the master)
– Each worker thread is numbered, from 0 to Kt * Kp - 1
– The master thread communicates with all worker threads; message tags identify them
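Concretely, the worker number doubles as the message tag, and the mapping is invertible; a small sketch of the arithmetic used throughout the code below:

    // Global worker number: node rank times threads per node plus local index.
    int worker = rank * Kt + threadIndex;
    // Integer division and modulus recover where a tagged message came from:
    int process = worker / Kt;   // node rank
    int thread  = worker % Kt;   // local thread index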

Set Up and Run the Threads

private static ParallelTeam team;
...
team = new ParallelTeam(rank == 0 ? Kt + 1 : Kt);

// Every parallel team thread runs the worker section, except thread Kt
// (which exists only in process 0), which runs the master section.
team.execute(new ParallelRegion() {
    public void run() throws Exception {
        if (getThreadIndex() == Kt)
            masterSection();
        else
            workerSection(rank * Kt + getThreadIndex());
    }
});

The workerSection method takes a parameter identifying the thread; it serves as the tag for messages to and from the master thread.
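For context, the surrounding setup these slides assume is PJ's standard initialization; a hedged sketch (Kp and Kt as defined on the previous slide; deriving Kt from getDefaultThreadCount is an assumption):

    // Hedged sketch of standard Parallel Java setup assumed by the slides.
    Comm.init(args);                    // initialize message passing
    Comm world = Comm.world();          // world communicator used throughout
    int rank = world.rank();            // this node's rank; rank 0 hosts the master
    int Kp = world.size();              // number of cluster nodes
    int Kt = ParallelTeam.getDefaultThreadCount();  // worker threads per node (assumption)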

Scheduling the Threads in the Master

private static void masterSection() throws IOException {
    int process, thread, worker;
    Range range;

    // Set up a schedule object to divide the row range into chunks.
    IntegerSchedule schedule = IntegerSchedule.runtime();
    schedule.start(K, new Range(0, height - 1));

    // Send an initial chunk range to each worker. If the range is null,
    // there is no more work for that worker. Keep count of active workers.
    int activeWorkers = K;   // K = Kp * Kt
    for (process = 0; process < Kp; ++process)
        for (thread = 0; thread < Kt; ++thread) {
            worker = process * Kt + thread;
            range = schedule.next(worker);
            world.send(process, worker, ObjectBuf.buffer(range));
            if (range == null) --activeWorkers;
        }

Scheduling the Threads in the Master

private static void masterSection() throws IOException {
    int process, thread, worker;
    Range range;
    ...
    // Repeat until all workers have finished.
    while (activeWorkers > 0) {
        // Receive an empty message from any worker.
        CommStatus status = world.receive(null, null, IntegerBuf.emptyBuffer());
        process = status.fromRank;
        worker = status.tag;

        // Send the next chunk range to that specific worker.
        // If it is null, there is no more work.
        range = schedule.next(worker);
        world.send(process, worker, ObjectBuf.buffer(range));
        if (range == null) --activeWorkers;
    }
}

Worker Thread Activity: Receive

private static void workerSection(int worker) throws IOException {
    // Image, writer, matrix, and row slice variables are now local here.
    ...
    for (;;) {
        // Receive a chunk range from the master. If it is null, no more work.
        ObjectItemBuf rangeBuf = ObjectBuf.buffer();
        world.receive(0, worker, rangeBuf);
        Range range = rangeBuf.item;
        if (range == null) break;
        int lb = range.lb();
        int ub = range.ub();
        int len = range.length();

        // Allocate storage for the matrix row slice if necessary.
        if (slice == null || slice.length < len)
            slice = new int[len][width];

        // Code to compute the rows and columns of the slice goes here.
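The elided computation is the standard Mandelbrot escape-time iteration. A minimal sketch of what could fill that comment (the bounds xmin, ymin, the steps dx, dy, and the limit maxiter are hypothetical names, not from the slides):

        // Hypothetical sketch: iterate z = z*z + c for each pixel of the
        // slice and record the escape count.
        for (int r = lb; r <= ub; ++r) {
            double y = ymin + r * dy;               // imaginary part of c
            for (int c = 0; c < width; ++c) {
                double x = xmin + c * dx;           // real part of c
                double zx = 0.0, zy = 0.0;
                int i = 0;
                while (i < maxiter && zx * zx + zy * zy <= 4.0) {
                    double t = zx * zx - zy * zy + x;
                    zy = 2.0 * zx * zy + y;
                    zx = t;
                    ++i;
                }
                slice[r - lb][c] = i;               // row r lives at slice index r - lb
            }
        }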

Worker Thread Activity: Send

private static void workerSection(int worker) throws IOException {
    // Image, writer, matrix, and row slice variables are now local here.
    ...
    for (;;) {
        // Receive a chunk range from the master. If it is null, no more work.
        ObjectItemBuf rangeBuf = ObjectBuf.buffer();
        world.receive(0, worker, rangeBuf);
        Range range = rangeBuf.item;
        if (range == null) break;
        ...
        // Report completion of the slice to the master.
        world.send(0, worker, IntegerBuf.emptyBuffer());

        // Make the full pixel matrix rows refer to the slice rows.
        System.arraycopy(slice, 0, matrix, lb, len);

        // Write the row slice of the full pixel matrix to the image file.
        writer.writeRowSlice(range);
    }
}

One-Level Scheduling Performance
With one master and Kt * Kp workers, many messages are needed just to schedule them all.
Two-level scheduling:
– One worker per node, but each worker uses multiple threads
– Two schedules: one in the master for the workers, and one in each worker for its threads
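To make the cost concrete, assume (hypothetically) Kp = 10 nodes, Kt = 4 threads per node, and 4000 one-row chunks. One-level scheduling costs roughly two messages per chunk (the worker's empty report and the master's next range) plus 40 final null ranges, about 8040 messages, all funneled through one master thread. Two-level scheduling with a master chunk size of 100 sends only about 40 ranges, 40 reports, and 10 nulls, roughly 90 network messages; the chunk-size-1 scheduling then happens inside each node through shared memory.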

Two-Level Scheduling

Changes to Program
– The master uses a schedule with a chunk size of 100; each worker uses a schedule with a chunk size of 1
– The master node has two parallel sections as well as a worker team
– No worker tags are needed
– The master section is otherwise unchanged
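A sketch of how the two schedules might be created (an assumption: the slides only show IntegerSchedule.runtime(), which reads the schedule from a JVM property, but PJ's IntegerSchedule also has factory methods that take an explicit chunk size; thrschedule is the field name used two slides later):

    // Assumption: dynamic schedules with explicit chunk sizes.
    IntegerSchedule schedule = IntegerSchedule.dynamic(100);   // master: coarse chunks across the network
    IntegerSchedule thrschedule = IntegerSchedule.dynamic(1);  // worker: fine chunks within the node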

Set Up and Run the Threads

// In the master process, run the master section and the worker section in parallel.
if (rank == 0)
    new ParallelTeam(2).execute(new ParallelRegion() {
        public void run() throws Exception {
            execute(new ParallelSection() {
                public void run() throws Exception {
                    masterSection();
                }
            },
            new ParallelSection() {
                public void run() throws Exception {
                    workerSection();
                }
            });
        }
    });

// In a worker process, run only the worker section.
else
    workerSection();

Worker Thread Activity

private static void workerSection() throws IOException {
    // Image, writer, matrix, and row slice variables are now local here.
    ...
    // Parallel team to calculate each slice in multiple threads.
    ParallelTeam team = new ParallelTeam();
    for (;;) {
        // Receive a chunk range from the master. If it is null, no more work.
        ObjectItemBuf rangeBuf = ObjectBuf.buffer();
        world.receive(0, rangeBuf);
        Range range = rangeBuf.item;
        if (range == null) break;
        final int lb = range.lb();
        final int ub = range.ub();
        final int len = range.length();

        // Allocate storage for the matrix row slice if necessary.
        if (slice == null || slice.length < len)
            slice = new int[len][width];

Worker Thread Activity

private static void workerSection() throws IOException {
    // Image, writer, matrix, and row slice variables are now local here.
    ...
    // Parallel team to calculate each slice in multiple threads.
    ParallelTeam team = new ParallelTeam();
    for (;;) {
        ...
        // Compute the rows of the slice in parallel threads.
        team.execute(new ParallelRegion() {
            public void run() throws Exception {
                execute(lb, ub, new IntegerForLoop() {
                    // Use the thread-level loop schedule.
                    public IntegerSchedule schedule() {
                        return thrschedule;
                    }
                    // Compute all rows and columns in the slice.
                    public void run(int first, int last) {
                        for (int r = first; r <= last; ++r) {
                            // Yadah, yadah, yadah
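The slide ends mid-loop. Presumably (an assumption, mirroring the one-level version shown earlier) the loop body is the same per-row computation, and the team execute is followed by the same completion report and row-slice write, now without a worker tag:

    // Continuation sketch (assumption): close out the parallel region, then
    // finish the chunk exactly as in the one-level version, minus the tag.
    ...                                          // end of team.execute(...) region
    world.send(0, IntegerBuf.emptyBuffer());     // report completion to master
    System.arraycopy(slice, 0, matrix, lb, len); // alias slice rows into the full matrix
    writer.writeRowSlice(range);                 // write the rows to the image file
    // end of the for(;;) loop and the method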