Virtues of Good (Parallel) Software

Virtues of Good (Parallel) Software
- Concurrency: able to exploit concurrency in the algorithm, problem, and hardware
- Scalability: resilient to an increasing processor count
- Locality: more frequent access to local data than to remote data
- Modularity: employs abstraction and modular design

Two Basic Requirements for a Parallel Program
- Safety: produce correct results. The result computed on P processors and on 1 processor must be identical.
- Liveness: able to proceed and finish; free of deadlock.

Sources of Overhead
- Execution time: the time that elapses from when the first processor starts executing on the problem to when the last processor completes execution.
  Execution time = computation time + communication time + idle time
- Communication / interprocess interaction: usually the main source of overhead.
  T_comm = t_s + t_w*L
  Minimize the volume and frequency of communications; overlap computation with communication.
- Idling: lack of computation or lack of data, due to load imbalance, synchronization, the presence of serial components, or waiting on remote data.
- Replicated computation: the basic trade-off is to communicate or to replicate.
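
As a rough worked example of the T_comm = t_s + t_w*L model (the startup latency, per-word time, and message length below are hypothetical, not taken from the slides):

\[
  T_{\mathrm{comm}} = t_s + t_w L
                    = 10\,\mu\mathrm{s} + (0.01\,\mu\mathrm{s}) \times 10^4
                    = 110\,\mu\mathrm{s}.
\]

For short messages the startup term t_s dominates, which is why sending fewer, larger messages usually pays off.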

Speedup & Efficiency
- Relative speedup: the factor by which the execution time is reduced on multiple processors, S(p) = T_1/T_p, where T_1 is the execution time on one processor and T_p is the execution time on p processors.
- Absolute speedup: the same ratio, but with T_1 the uniprocessor time of the best-known sequential algorithm.
- In general, S(p) <= p.
- Embarrassingly parallel (EP): no communication among processors.
- Superlinear speedup: does occur in practice (e.g., when the per-processor working set starts to fit in cache).
- Efficiency: the fraction of time that processors spend doing useful work, E = S/p = T_1/(p*T_p).
- Parallel cost: p*T_p.
- Parallel overhead: T_o = p*T_p - T_1.
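
A small worked example with illustrative timings (not from the slides): suppose T_1 = 100 s and T_p = 20 s on p = 8 processors. Then

\[
  S(8) = \frac{T_1}{T_8} = \frac{100}{20} = 5, \qquad
  E = \frac{S}{p} = \frac{5}{8} = 0.625, \qquad
  p\,T_p = 160\ \mathrm{s}, \qquad
  T_o = p\,T_p - T_1 = 60\ \mathrm{s}.
\]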

Amdahl’s Law
- This is for a fixed problem size.
- alpha: fraction of operations in the serial code that can be parallelized; p: number of processors.
- T_p = alpha*T_1/p + (1-alpha)*T_1
- S -> 1/(1-alpha) as p -> infinity
- alpha = 90%: S -> 10; alpha = 99%: S -> 100; alpha = 99.9%: S -> 1000.
- The "mental block": the serial fraction caps the achievable speedup no matter how many processors are added.
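
Substituting T_p into the definition of speedup gives the familiar closed form; the values alpha = 0.9 and p = 100 are only an illustration:

\[
  S(p) = \frac{T_1}{\alpha T_1/p + (1-\alpha)T_1}
       = \frac{1}{\alpha/p + (1-\alpha)}
       \;\to\; \frac{1}{1-\alpha} \ \text{as}\ p \to \infty;
  \qquad
  \alpha = 0.9,\ p = 100:\ S = \frac{1}{0.009 + 0.1} \approx 9.2.
\]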

Gustafson’s Law
- alpha: fraction of time spent on parallel operations in the parallel program.
- This is for a scaled problem size, i.e., constant run time.
- T_1 = (1-alpha)*T_p + p*alpha*T_p
- As the problem size increases, the fraction of parallel operations increases.
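
Dividing this scaled serial time by T_p gives the scaled speedup; alpha = 0.99 and p = 100 are illustrative values:

\[
  S(p) = \frac{T_1}{T_p} = (1-\alpha) + p\,\alpha;
  \qquad
  \alpha = 0.99,\ p = 100:\ S = 0.01 + 99 = 99.01.
\]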

Iso-Efficiency Function
- For a fixed problem size N, as P increases, the growth in speedup S slows down or levels off, and the efficiency E decreases.
- For fixed P, as N increases, S increases and the efficiency E increases.
- As P increases, the problem size N can be increased such that the efficiency is kept constant.
- This N(p) for fixed efficiency is called the iso-efficiency function.
- The rate of increase of N(p), dN/dp, measures the scalability of a parallel program: a smaller rate of increase means a more scalable program.
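
A sketch of how an iso-efficiency function is derived, using the textbook example of summing n numbers on p processors and assuming the reduction costs on the order of 2 log p time steps:

\[
  T_p \approx \frac{n}{p} + 2\log p, \qquad
  E = \frac{T_1}{p\,T_p} \approx \frac{n}{n + 2p\log p},
\]

so holding E constant requires n to grow like Theta(p log p), which is the iso-efficiency function of this algorithm.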

Parallel Program Design
- PCAM model (I. Foster): Partitioning, Communication, Agglomeration, Mapping.
- Partitioning and communication address concurrency and scalability.
- Agglomeration and mapping address locality and other performance-related issues.

Partitioning
- Decompose the computation to be performed and the data operated on by this computation into small tasks.
- Purpose: expose opportunities for parallel execution.
- Ignore practical issues such as the number of processors in the target machine.
- Avoid replicating computation and data.
- Focus: define a large number of small tasks in order to yield a fine-grained decomposition of the problem.
- A fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms.
- Maximize concurrency.

Partitioning
- A good partition divides both the computation associated with a problem and the data this computation operates on.
- Domain/data decomposition: focus first on the data.
  - Partition the data associated with the problem.
  - Associate computations with the partitioned data.
- Functional decomposition: focus first on the computation.
  - Decompose the computations to be performed.
  - Then deal with the data the decomposed computations work on.

Domain Decomposition
- Decompose the data first, then the associated computations (“owner computes”).
- Outcome: tasks comprising some data and a set of operations on that data.
- Some operations may require data from several tasks -> communication.
- The data can be input data, output data, intermediate data, or all of them.
- Rule of thumb: focus first on the largest data structure, or on the data structure accessed most frequently.
- Mesh-based problems:
  - Structured mesh: 1D, 2D, or 3D decompositions (see the sketch below).
  - Unstructured mesh: graph partitioning tools such as METIS.
- Favor the most aggressive decomposition possible at this stage.
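
A minimal C/MPI sketch of a 1D block domain decomposition, assuming a global 1D array of N cells; the value of N and the variable names are illustrative, not from the slides:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sketch: each rank owns a contiguous block of a global array
       of N cells ("owner computes" on its own block). */
    int main(int argc, char **argv)
    {
        int rank, size;
        const long N = 1000000;   /* global problem size (illustrative) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block partition: the first (N % size) ranks get one extra cell,
           so the per-rank load differs by at most one cell. */
        long base  = N / size;
        long extra = N % size;
        long local = base + (rank < extra ? 1 : 0);
        long start = rank * base + (rank < extra ? rank : extra);

        printf("rank %d owns cells [%ld, %ld)\n", rank, start, start + local);

        MPI_Finalize();
        return 0;
    }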

Functional Decomposition
- Focus first on the computation to be performed; divide the computations into disjoint tasks.
- Then consider the data associated with each sub-task:
  - If the data requirements are disjoint -> done.
  - If the data overlap significantly, communication is needed; it may be just as well to try domain decomposition.
- Provides an alternative way of thinking about the problem; a hybrid decomposition may be best.
- E.g., multi-physics simulations: an overall functional decomposition, with domain decomposition within each component.

Partitioning: Questions to Ask
- Does your partition define more tasks (an order of magnitude more?) than the number of processors of the target machine? No -> reduced flexibility in subsequent stages.
- Does your partition avoid redundant computation and storage requirements? No -> may not be scalable to large problems.
- Are the tasks of comparable size? No -> hard to allocate equal amounts of work to the CPUs -> load imbalance.
- Does the number of tasks scale with problem size? Ideal: an increased problem size -> an increased number of tasks. No -> may not be able to solve larger problems with more processors.
- Have you identified alternative partitions? Maximize flexibility; try both domain and functional decompositions.

Communication
- Purpose: determine the interactions among tasks.
- Distribute communication operations among many tasks.
- Organize communication operations in a way that permits concurrent execution.
- There are four categories of communication (this slide and the next).
- Local vs. global communication:
  - Local: each task communicates with a small set of other tasks (its neighbors).
  - Global: a task communicates with many or all other tasks.

Communication
- Structured vs. unstructured communication:
  - Structured: a task and its neighbors form a regular structure, e.g. a grid or a tree.
  - Unstructured: the communication pattern is an arbitrary graph.
- Static vs. dynamic communication:
  - Static: the identity of communication partners does not change over time.
  - Dynamic: the identity of partners is determined by data computed at runtime and may be highly variable.
- Synchronous vs. asynchronous communication:
  - Synchronous: requires coordination between communication partners.
  - Asynchronous: proceeds without such cooperation.
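
A minimal C/MPI sketch of a local, structured, static communication pattern: the 1D ghost-cell (halo) exchange a block-decomposed stencil code would perform each iteration. The array layout (u has local+2 entries, with u[0] and u[local+1] as ghost cells) is an assumption for illustration:

    #include <mpi.h>

    /* Exchange boundary values with the left and right neighbors.
       MPI_PROC_NULL turns the sends/receives at the domain ends into no-ops. */
    static void halo_exchange(double *u, int local, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my last interior cell to the right, receive my left ghost cell. */
        MPI_Sendrecv(&u[local], 1, MPI_DOUBLE, right, 0,
                     &u[0],     1, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
        /* Send my first interior cell to the left, receive my right ghost cell. */
        MPI_Sendrecv(&u[1],         1, MPI_DOUBLE, left,  1,
                     &u[local + 1], 1, MPI_DOUBLE, right, 1, comm, MPI_STATUS_IGNORE);
    }

Each rank talks only to its two neighbors (local), the partners form a regular chain (structured), and they never change over time (static).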

Task Dependency Graph
- Task dependencies: one task cannot start until some other task(s) finish, e.g. the output of one task is the input of another task.
- Dependencies are represented by the task dependency graph:
  - Directed and acyclic.
  - Nodes: tasks (with the task size as the weight of the node).
  - Directed edges: dependencies among tasks.

Task Dependency Graph
- Degree of concurrency: the number of tasks that can run concurrently.
- Maximum degree of concurrency: the maximum number of tasks that can be executed simultaneously at any given time.
- Average degree of concurrency: the average number of tasks that can run concurrently over the duration of the program.
- Critical path: the longest vertex-weighted directed path between any pair of start and finish nodes.
- Critical path length: the sum of the vertex weights along the critical path.
- Average degree of concurrency = total amount of work / critical path length.
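
A small hypothetical task graph makes these definitions concrete: tasks A, B, C, D each cost 10 units, with dependency edges A->B, A->C, B->D, C->D. Then

\[
  \text{total work} = 4 \times 10 = 40, \qquad
  \text{critical path length} = w(A) + w(B) + w(D) = 30, \qquad
  \frac{40}{30} \approx 1.33,
\]

so the average degree of concurrency is about 1.33, while the maximum degree of concurrency is 2 (B and C can execute simultaneously).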

Task Interaction Graph
- Even independent tasks may need to interact, e.g. to share data.
- The interaction graph captures the interaction patterns among tasks:
  - Nodes: tasks.
  - Edges: communications / interactions.
- It usually contains the task dependency graph as a sub-graph.
- (The slide shows an example interaction graph.)

Communication: Questions to Ask
- Do all tasks perform the same number of communication operations? Unbalanced communication -> poor scalability; distribute communications equitably.
- Does each task communicate only with a small number of neighbors? If not, you may need to re-formulate global communication in terms of local communication structures.
- Can communications proceed concurrently?
- Can the computations associated with different tasks proceed concurrently? No -> may need to re-order computations / communications.

Agglomeration
- Improve performance: combine tasks to reduce the task interaction strength, increase locality, and increase the computation and communication granularity.
- Also determine whether it is worthwhile to replicate data and/or computation.
- Dependent tasks will be combined; independent tasks may also be agglomerated to increase granularity.
- Goals: reduce communication cost while retaining flexibility with respect to scalability and mapping decisions.

Increasing Granularity
- Coarse grain usually performs better:
  - Send less data (reduce the volume of communication).
  - Use fewer messages when sending the same amount of data (reduce the frequency of communication).
- Surface-to-volume effects:
  - Communication cost is usually proportional to the surface area of the domain.
  - Computation cost is usually proportional to the volume of the domain.
  - As task size increases, the amount of communication per unit of computation decreases.
  - A higher-dimensional decomposition is usually more efficient than a lower-dimensional one, due to the reduced surface area for a given volume (see the worked example below).
- Replicated computation: one may trade replicated computation for reduced communication or execution time.
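
A worked illustration of the surface-to-volume effect (the grid size and task count are hypothetical): partition an n-by-n grid among p tasks either as 1D strips or as 2D blocks.

\[
  \text{1D strips: communication per task} \approx 2n, \qquad
  \text{2D blocks: communication per task} \approx \frac{4n}{\sqrt{p}};
  \qquad
  n = 1024,\ p = 64:\ 2048 \ \text{vs}\ 512\ \text{values}.
\]

The computation per task is about n^2/p in both cases, so the higher-dimensional decomposition does the same work with a quarter of the communication in this example.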

Agglomeration: Questions to Ask
- Has agglomeration reduced communication costs by increasing locality?
- If computation is replicated, have you verified that the benefits of replication outweigh its costs for a range of problem sizes and processor counts?
- If data is replicated, have you verified that it does not compromise scalability?
- Do the tasks have similar computation and communication costs after agglomeration (load balance)?
- Does the number of tasks still scale with problem size?

Mapping
- Map tasks to processors or processes. If the number of tasks is larger than the number of processors, more than one task may need to be placed on a single processor.
- Goal: minimize the total execution time.
  - Place tasks that execute concurrently on different processors.
  - Place tasks that communicate frequently on the same processor.
- In the general case there is no computationally tractable algorithm for the mapping problem; it is NP-complete.
- In SPMD-style programs, typically one task per processor.
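
A minimal C sketch of two common static task-to-process mappings; the helper names block_owner and cyclic_owner are illustrative, not from the slides:

    /* Block mapping: contiguous groups of tasks go to the same process,
       which preserves locality between neighboring tasks. */
    static int block_owner(int task, int ntasks, int nprocs)
    {
        int per_proc = (ntasks + nprocs - 1) / nprocs;  /* ceil(ntasks/nprocs) */
        return task / per_proc;
    }

    /* Cyclic mapping: tasks are dealt out round-robin, which balances the
       load better when task cost varies systematically with the task index. */
    static int cyclic_owner(int task, int nprocs)
    {
        return task % nprocs;
    }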

Parallel Algorithm Models
- Data-parallel model: processors perform similar operations on different data.
- Work/task pool model (replicated workers): a pool of tasks and a number of processors; a processor removes a task from the pool and works on it, and may generate new tasks during the computation and add them to the pool.
- Master-slave / manager-worker model: master processes generate work and allocate it to worker processes.
- Pipeline / producer-consumer model: a stream of data passes through a succession of processes, each of which performs some task on it.
- Hybrid model: a combination of two or more of these models.
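
A minimal sketch of the work/task pool idea in C with OpenMP tasks: one thread seeds the pool, idle threads pull tasks from it, and tasks may spawn further tasks. The recursive tree-counting workload is purely illustrative:

    #include <omp.h>
    #include <stdio.h>

    /* Count the nodes of a complete binary tree of the given depth.
       Each call spawns two child tasks that land in the runtime's task pool. */
    static long count_nodes(int depth)
    {
        if (depth == 0)
            return 1;

        long left = 0, right = 0;

        #pragma omp task shared(left)
        left = count_nodes(depth - 1);

        #pragma omp task shared(right)
        right = count_nodes(depth - 1);

        #pragma omp taskwait   /* wait for the two child tasks */
        return 1 + left + right;
    }

    int main(void)
    {
        long total = 0;

        #pragma omp parallel
        #pragma omp single     /* one thread seeds the task pool */
        total = count_nodes(10);

        printf("nodes: %ld (expected %d)\n", total, (1 << 11) - 1);
        return 0;
    }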