
High Performance Computing 1: Parallelization Strategies and Load Balancing
Some material borrowed from lectures of J. Demmel, UC Berkeley.

Ideas for dividing work: Embarrassingly parallel computations
– the 'ideal case': after perhaps some initial communication, all processes operate independently until the end of the job
– examples: computing pi; general Monte Carlo calculations; simple geometric transformations of an image
– static or dynamic (worker pool) task assignment
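As a concrete illustration (not from the original slides), here is a minimal MPI sketch of the Monte Carlo pi example: each rank samples independently and the only communication is a single reduction at the end. The sample count and seeding scheme are arbitrary choices for the sketch.

```c
/* Embarrassingly parallel Monte Carlo estimate of pi (illustrative sketch). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long samples_per_rank = 1000000;   /* arbitrary sample count */
    long local_hits = 0;
    srand(rank + 1);                         /* crude per-rank seeding */

    for (long i = 0; i < samples_per_rank; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) local_hits++;
    }

    long total_hits = 0;
    /* The only communication: combine the independent partial counts. */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * total_hits / (samples_per_rank * (double)size));

    MPI_Finalize();
    return 0;
}
```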

Ideas for dividing work: Partitioning
– partition the data, the domain, or the task list, perhaps master/slave
– examples: dot product of vectors; integration on a fixed interval; N-body problem using domain decomposition
– static or dynamic task assignment; needs care
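A minimal sketch of the dot-product example under a block partition, assuming each rank already holds its own block of the two vectors; the partial results are merged with a single reduction. Vector contents and sizes are placeholders.

```c
/* Data partitioning sketch: block-distributed dot product (illustrative). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_local = 100000;              /* each rank owns a block of this size */
    double *x = malloc(n_local * sizeof(double));
    double *y = malloc(n_local * sizeof(double));
    for (int i = 0; i < n_local; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Each rank computes the partial dot product over its own block. */
    double local_dot = 0.0;
    for (int i = 0; i < n_local; i++) local_dot += x[i] * y[i];

    /* Merge the partial results. */
    double global_dot = 0.0;
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("dot = %f\n", global_dot);
    free(x); free(y);
    MPI_Finalize();
    return 0;
}
```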

Ideas for dividing work: Divide & Conquer
– recursively partition the data, the domain, or the task list
– examples: tree algorithms for the N-body problem; multipole; multigrid
– usually dynamic work assignment
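A minimal divide-and-conquer sketch using OpenMP tasks for a recursive array sum; the cutoff value and the use of OpenMP tasks (rather than any specific framework from the slides) are my own choices. It shows the typical pattern of recursive splitting with dynamic assignment of subproblems to idle threads.

```c
/* Divide-and-conquer sketch: recursive array sum with OpenMP tasks (illustrative). */
#include <omp.h>
#include <stdio.h>

#define CUTOFF 1000   /* switch to serial work below this size */

static double rsum(const double *a, long lo, long hi) {
    if (hi - lo < CUTOFF) {                  /* base case: sum a small chunk serially */
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;
    double left, right;
    /* Each half becomes a task; the runtime assigns tasks to idle threads dynamically. */
    #pragma omp task shared(left)
    left = rsum(a, lo, mid);
    right = rsum(a, mid, hi);
    #pragma omp taskwait
    return left + right;
}

int main(void) {
    static double a[1000000];
    for (long i = 0; i < 1000000; i++) a[i] = 1.0;
    double s;
    #pragma omp parallel
    #pragma omp single
    s = rsum(a, 0, 1000000);
    printf("sum = %f\n", s);
    return 0;
}
```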

Ideas for dividing work: Pipelining
– a sequence of task stages, each performed by one of a set of processors; a form of functional decomposition
– examples: upper triangular linear solves; pipelined sorts
– usually dynamic work assignment
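A minimal pipelining sketch in MPI: each rank implements one stage of the pipeline and forwards items downstream. The per-stage operation is a placeholder of my own choosing.

```c
/* Pipelining sketch: each rank is one stage; items flow rank 0 -> 1 -> ... (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_items = 8;
    for (int i = 0; i < n_items; i++) {
        double item;
        if (rank == 0)
            item = (double)i;                            /* first stage generates the item */
        else
            MPI_Recv(&item, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* receive from upstream stage */

        item = item * 2.0 + 1.0;                         /* this stage's work on the item */

        if (rank < size - 1)
            MPI_Send(&item, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d result %f\n", i, item);      /* last stage outputs the result */
    }
    MPI_Finalize();
    return 0;
}
```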

Ideas for dividing work: Synchronous Computing
– same computation on different sets of data; often domain decomposition
– examples: iterative linear system solvers
– static work assignment can often be scheduled, if the data structures do not change
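A minimal synchronous-computation sketch: a 1D Jacobi sweep with halo exchange, where every rank performs the same update on its own subdomain in lock step. Subdomain size, boundary values, and iteration count are arbitrary choices for the sketch.

```c
/* Synchronous computation sketch: 1D Jacobi sweeps with halo exchange (illustrative). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define N_LOCAL 100   /* interior points owned by each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N_LOCAL + 2] = {0.0}, unew[N_LOCAL + 2] = {0.0};
    if (rank == 0) u[0] = 1.0;     /* left boundary condition */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < 100; iter++) {
        /* Exchange halo (ghost) values with neighbouring subdomains. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank == 0) u[0] = 1.0;  /* re-impose boundary value */

        /* Every rank does the same update on its own block of data. */
        for (int i = 1; i <= N_LOCAL; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        memcpy(&u[1], &unew[1], N_LOCAL * sizeof(double));
    }
    if (rank == 0) printf("u[1] = %f after 100 sweeps\n", u[1]);
    MPI_Finalize();
    return 0;
}
```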

Load balancing
Determined by
– task costs
– task dependencies
– locality needs
Spectrum of solutions
– static: all information available before starting
– semi-static: some information available before starting
– dynamic: little or no information available before starting
Survey of solutions
– how each one works
– theoretical bounds, if any
– when to use it

Load Balancing in General
– there is a large literature on the topic
– a closely related problem is scheduling: determining the order in which tasks run

Load Balancing Problems
Task costs
– do all tasks have equal cost?
Task dependencies
– can the tasks be run in any order (including in parallel)?
Task locality
– is it important for some tasks to be scheduled on the same processor (or a nearby one) to reduce communication cost?

Task Cost Spectrum

Task Dependency Spectrum

Task Locality Spectrum

Approaches
– static load balancing
– semi-static load balancing
– self-scheduling
– distributed task queues
– diffusion-based load balancing
– DAG scheduling
– mixed parallelism

Static Load Balancing
All information is available in advance. Common cases:
– dense matrix algorithms, e.g. LU factorization: use a blocked/cyclic layout (blocked for locality, cyclic for load balance)
– computations on a regular mesh, e.g., FFT: use a cyclic+transpose+blocked layout for 1D
– sparse matrix-vector multiplication: use graph partitioning, where the graph does not change over time
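For the blocked/cyclic layouts mentioned above, the ownership rule of a 1D block-cyclic distribution can be written in a few lines; the function and parameter names below are my own.

```c
/* 1D block-cyclic layout sketch: which process owns global index i? (illustrative) */
#include <stdio.h>

/* With block size b and p processes, global index i lives in block i/b,
   and blocks are dealt out to processes round-robin. */
static int owner_block_cyclic(long i, int block_size, int nprocs) {
    long block = i / block_size;
    return (int)(block % nprocs);
}

int main(void) {
    const int b = 4, p = 3;
    for (long i = 0; i < 24; i++)
        printf("index %2ld -> process %d\n", i, owner_block_cyclic(i, b, p));
    return 0;
}
```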

Semi-Static Load Balancing
Domain changes slowly; locality is important:
– use a static algorithm
– do some computation, allowing some load imbalance on later steps
– recompute a new load balance using the static algorithm
Examples: particle simulations and particle-in-cell (PIC) methods; tree-structured computations (Barnes-Hut, etc.); grid computations with a dynamically changing grid, where the grid changes slowly

Self-Scheduling
– keep a centralized pool of tasks that are ready to run
– when a processor completes its current task, it takes the next one from the pool
– if the computation of one task generates more tasks, add them to the pool
Originally used for scheduling loops by the compiler (really by the runtime system)
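A minimal shared-memory self-scheduling sketch: the "pool" is a shared counter, and each thread atomically claims the next task when it finishes its current one. The per-task work is a placeholder with artificially varying cost.

```c
/* Self-scheduling sketch: threads repeatedly grab the next unclaimed task
   from a shared counter (illustrative, OpenMP). */
#include <omp.h>
#include <stdio.h>

#define N_TASKS 1000

int main(void) {
    double result[N_TASKS];
    long next = 0;                 /* index of the next unclaimed task (the "pool") */

    #pragma omp parallel
    {
        for (;;) {
            long my_task;
            /* Atomically claim one task from the central pool. */
            #pragma omp atomic capture
            my_task = next++;
            if (my_task >= N_TASKS) break;

            /* Task cost varies; faster threads simply come back for more work. */
            double s = 0.0;
            for (long i = 0; i < 10000 * (my_task % 7 + 1); i++) s += 1.0;
            result[my_task] = s;
        }
    }
    printf("task 0 result %f, task %d result %f\n",
           result[0], N_TASKS - 1, result[N_TASKS - 1]);
    return 0;
}
```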

When is Self-Scheduling a Good Idea?
– a set of tasks without dependencies (it can also be used with dependencies, but most analysis has been done for task sets without dependencies)
– the cost of each task is unknown
– locality is not important
– a shared-memory multiprocessor is used, so a centralized pool of tasks is fine

Variations on Self-Scheduling
Instead of grabbing the smallest unit of parallel work, take a chunk of K tasks at a time:
– if K is large, the overhead of accessing the task queue is small
– if K is small, the load balance is likely to be good
Four variations:
– fixed chunk size
– guided self-scheduling
– tapering
– weighted factoring
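A minimal sketch of the chunked variant: each queue access claims K tasks with one atomic update, amortizing queue overhead over the chunk. The chunk size and per-task work are arbitrary choices for the sketch.

```c
/* Chunked self-scheduling sketch: each thread claims K tasks per queue access
   (illustrative, OpenMP). */
#include <omp.h>
#include <stdio.h>

#define N_TASKS 1000
#define K       16        /* chunk size: trades queue overhead against load balance */

int main(void) {
    double result[N_TASKS];
    long next = 0;

    #pragma omp parallel
    {
        for (;;) {
            long start;
            /* One atomic operation claims a whole chunk of K tasks. */
            #pragma omp atomic capture
            { start = next; next += K; }
            if (start >= N_TASKS) break;

            long end = start + K < N_TASKS ? start + K : N_TASKS;
            for (long t = start; t < end; t++)
                result[t] = (double)t * t;   /* the per-task work */
        }
    }
    printf("result[%d] = %f\n", N_TASKS - 1, result[N_TASKS - 1]);
    return 0;
}
```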

Variation 1: Fixed Chunk Size
How do we compute the optimal chunk size?
– it requires a lot of information about the problem characteristics, e.g. task costs and the number of tasks
– it needs an off-line algorithm, so it is not very useful in practice
– all tasks must be known in advance

Variation 2: Guided Self-Scheduling
Use larger chunks at the beginning to avoid excessive overhead, and smaller chunks near the end to even out the finish times.
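One common formulation of this rule (stated here as an assumption rather than taken from the slides) hands out chunks of size ceil(remaining/p), so chunks shrink automatically as the work runs out:

```c
/* Guided self-scheduling sketch: chunk size = ceil(remaining / p),
   so chunks shrink as the work runs out (illustrative). */
#include <stdio.h>

int main(void) {
    const long n_tasks = 1000;
    const int  p = 8;                 /* number of processors */
    long remaining = n_tasks;
    int  grab = 0;

    while (remaining > 0) {
        long chunk = (remaining + p - 1) / p;   /* ceil(remaining / p) */
        printf("grab %2d: chunk of %ld tasks (%ld remaining before grab)\n",
               ++grab, chunk, remaining);
        remaining -= chunk;
    }
    return 0;
}
```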

Variation 3: Tapering
The chunk size K_i is a function not only of the remaining work but also of the task cost variance:
– variance is estimated from history information
– high variance => a small chunk size should be used
– low variance => larger chunks are OK

Variation 4: Weighted Factoring
Similar to self-scheduling, but divide the task cost by the computational power of the requesting node:
– useful for heterogeneous systems
– also useful for shared resources, e.g. networks of workstations (NOWs)
– as with tapering, historical information is used to predict future speed
– "speed" may depend on the other loads currently running on a given processor
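A rough sketch of the weighted-factoring idea: each round distributes roughly half of the remaining work, split in proportion to the nodes' relative speeds. The speed values and the halving rule are illustrative assumptions, not values from the slides.

```c
/* Weighted factoring sketch: the chunk a node receives is scaled by its
   relative speed (illustrative; weights are made up). */
#include <stdio.h>

int main(void) {
    const int    p = 4;
    const double speed[] = {1.0, 2.0, 0.5, 1.5};   /* assumed relative node speeds */
    double total_speed = 0.0;
    for (int i = 0; i < p; i++) total_speed += speed[i];

    long remaining = 1000;
    int  round = 0;
    while (remaining > 0) {
        /* Each round hands out about half of the remaining work,
           split among nodes in proportion to their speed. */
        long batch = (remaining + 1) / 2;
        printf("round %d (batch %ld):", ++round, batch);
        long given = 0;
        for (int i = 0; i < p; i++) {
            long chunk = (long)(batch * speed[i] / total_speed);
            if (chunk < 1) chunk = 1;                      /* always hand out something */
            if (given + chunk > batch) chunk = batch - given;
            given += chunk;
            printf("  node %d -> %ld", i, chunk);
        }
        printf("\n");
        remaining -= given;
    }
    return 0;
}
```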

Distributed Task Queues
The obvious extension of self-scheduling to distributed memory. A good idea when:
– locality is not very important
– the machine is a distributed-memory multiprocessor, or shared memory with significant synchronization overhead
– the tasks are known in advance
– the costs of the tasks are not known in advance

DAG Scheduling
A directed acyclic graph (DAG) of tasks:
– nodes represent computation (weighted)
– edges represent orderings and usually communication (may also be weighted)
– it is not common to have the DAG available in advance

DAG Scheduling (continued)
Two application domains where the DAG is known in advance:
– digital signal processing computations
– sparse direct solvers (mainly Cholesky, since it does not require pivoting)
Basic strategy: partition the DAG to minimize communication while keeping all processors busy
– the problem is NP-complete, so approximations are needed
– this differs from graph partitioning, which was for tasks with communication but no dependencies
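A minimal sketch of the basic strategy: a greedy list scheduler that walks a small, hard-coded DAG in topological order and places each task on the processor that frees up earliest. It ignores edge (communication) weights for simplicity; the DAG and costs are made up.

```c
/* DAG scheduling sketch: greedy list scheduling of a small task DAG onto
   P processors (illustrative; ignores communication costs). */
#include <stdio.h>

#define N_TASKS 6
#define P       2

int main(void) {
    /* Task costs and dependencies (dep[i][j] = 1 means j must finish before i). */
    double cost[N_TASKS] = {2, 3, 1, 4, 2, 3};
    int dep[N_TASKS][N_TASKS] = {0};
    dep[2][0] = dep[2][1] = 1;     /* task 2 needs 0 and 1 */
    dep[3][0] = 1;                 /* task 3 needs 0        */
    dep[4][2] = dep[4][3] = 1;     /* task 4 needs 2 and 3  */
    dep[5][2] = 1;                 /* task 5 needs 2        */

    double finish[N_TASKS];
    double proc_free[P] = {0};     /* time at which each processor becomes idle */

    /* Tasks are indexed in topological order here, so a simple sweep works. */
    for (int t = 0; t < N_TASKS; t++) {
        double ready = 0.0;        /* earliest start: all predecessors finished */
        for (int j = 0; j < N_TASKS; j++)
            if (dep[t][j] && finish[j] > ready) ready = finish[j];

        int best = 0;              /* pick the processor that frees up earliest */
        for (int q = 1; q < P; q++)
            if (proc_free[q] < proc_free[best]) best = q;

        double start = ready > proc_free[best] ? ready : proc_free[best];
        finish[t] = start + cost[t];
        proc_free[best] = finish[t];
        printf("task %d on proc %d: start %.1f finish %.1f\n", t, best, start, finish[t]);
    }
    return 0;
}
```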

Mixed Parallelism
Another variation: a problem with two levels of parallelism
– coarse-grained task parallelism: good when there are many tasks, bad when there are few
– fine-grained data parallelism: good when there is much parallelism within a task, bad when there is little

Mixed Parallelism: examples
– adaptive mesh refinement
– discrete event simulation, e.g., circuit simulation
– database query processing
– sparse matrix direct solvers

Mixed Parallelism Strategies

Which Strategy to Use

Switch Parallelism: A Special Case

A Simple Performance Model for Data Parallelism

Values of Sigma: problem size