1 Steps in Creating a Parallel Program 4 steps: Decomposition, Assignment, Orchestration, Mapping Done by programmer or system software (compiler, runtime,...)

Slides:

Advertisements

Similar presentations

Symmetric Multiprocessors: Synchronization and Sequential Consistency.

Advertisements

Indian Institute of Science Bangalore, India भारतीय विज्ञान संस्थान बंगलौर, भारत Supercomputer Education and Research Centre (SERC) SE 292: High Performance.

1 Parallelization of An Example Program Examine a simplified version of a piece of Ocean simulation Iterative equation solver Illustrate parallel program.

Grid solver example Gauss-Seidel (near-neighbor) sweeps to convergence  interior n-by-n points of (n+2)-by-(n+2) updated in each sweep  updates done.

Reference: Message Passing Fundamentals.

ECE669 L4: Parallel Applications February 10, 2004 ECE 669 Parallel Computer Architecture Lecture 4 Parallel Applications.

ECE669 L15: Mid-term Review March 25, 2004 ECE 669 Parallel Computer Architecture Lecture 15 Mid-term Review.

1 Tuesday, November 07, 2006 “If anything can go wrong, it will.” -Murphy’s Law.

DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 7: SHARED MEMORY PARALLEL PROGRAMMING.

Fundamental Design Issues for Parallel Architecture Todd C. Mowry CS 495 January 22, 2002.

ECE669 L5: Grid Computations February 12, 2004 ECE 669 Parallel Computer Architecture Lecture 5 Grid Computations.

1 Parallel Programs. 2 Why Bother with Programs? They’re what runs on the machines we design Helps make design decisions Helps evaluate systems tradeoffs.

1 Lecture 24: Multiprocessors Today’s topics:  Directory-based cache coherence protocol  Synchronization  Consistency  Writing parallel programs Reminder:

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.

Efficient Parallelization for AMR MHD Multiphysics Calculations Implementation in AstroBEAR.

1 Steps in Creating a Parallel Program 4 steps: Decomposition, Assignment, Orchestration, Mapping Done by programmer or system software (compiler, runtime,...)

1 Lecture 2: Intro and Snooping Protocols Topics: multi-core cache organizations, programming models, cache coherence (snooping-based)

Implications for Programming Models Todd C. Mowry CS 495 September 12, 2002.

1 Lecture 1: Introduction Course organization:  13 lectures on parallel architectures  ~5 lectures on cache coherence, consistency  ~3 lectures on TM.

ECE669 L6: Programming for Performance February 17, 2004 ECE 669 Parallel Computer Architecture Lecture 6 Programming for Performance.

Parallel Programming: Case Studies Todd C. Mowry CS 495 September 12, 2002.

Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.

Lecture 4: Parallel Programming Models. Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism.

This module was created with support form NSF under grant # DUE Module developed by Martin Burtscher Module B1 and B2: Parallelization.

Synchronization (Barriers) Parallel Processing (CS453)

L15: Putting it together: N-body (Ch. 6) October 30, 2012.

ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.

Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.

9/20/2015 slide 1 PCOD: Lecture 1 Per Stenström © 2008, Sally A. McKee © credit points Instructor: Sally A. McKee.

Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.

Concurrency: Mutual Exclusion and Synchronization Chapter 5.

1 Lecture 2: Parallel Programs Topics: parallel applications, parallelization process, consistency models.

SE-292 High Performance Computing Intro. to Parallelization Sathish Vadhiyar.

Chapter 3 Parallel Programming Models. Abstraction Machine Level – Looks at hardware, OS, buffers Architectural models – Looks at interconnection network,

1 Concurrency Architecture Types Tasks Synchronization –Semaphores –Monitors –Message Passing Concurrency in Ada Java Threads.

Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.

CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.

Steps in Creating a Parallel Program

1 BİL 542 Parallel Computing. 2 Message Passing Chapter 2.

13-1 Chapter 13 Concurrency Topics Introduction Introduction to Subprogram-Level Concurrency Semaphores Monitors Message Passing Java Threads C# Threads.

Threaded Programming Lecture 1: Concepts. 2 Overview Shared memory systems Basic Concepts in Threaded Programming.

3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.

3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,

1 Lecture 25: Multiprocessors Today’s topics:  Synchronization  Consistency  Shared memory vs message-passing  Simultaneous multi-threading (SMT)

April 24, 2002 Parallel Port Example. April 24, 2002 Introduction The objective of this lecture is to go over a simple problem that illustrates the use.

Parallel Computing Presented by Justin Reschke

Concurrency and Performance Based on slides by Henri Casanova.

Department of Computer Science, Johns Hopkins University Lecture 7 Finding Concurrency EN /420 Instructor: Randal Burns 26 February 2014.

CS5102 High Performance Computer Systems Thread-Level Parallelism

Background on the need for Synchronization

Parallel Programming By J. H. Wang May 2, 2017.

3- Parallel Programming Models

Computer Engg, IIT(BHU)

Steps in Creating a Parallel Program

Steps in Creating a Parallel Program

Perspective on Parallel Programming

Steps in Creating a Parallel Program

Steps in Creating a Parallel Program

Lecture 26: Multiprocessors

Steps in Creating a Parallel Program

Lecture 2: Parallel Programs

Steps in Creating a Parallel Program

Steps in Creating a Parallel Program

Parallelization of An Example Program

Background and Motivation

Steps in Creating a Parallel Program

Lecture 26: Multiprocessors

Steps in Creating a Parallel Program

Parallel Programming in C with MPI and OpenMP

Steps in Creating a Parallel Program

Presentation transcript:

1 Steps in Creating a Parallel Program 4 steps: Decomposition, Assignment, Orchestration, Mapping Done by programmer or system software (compiler, runtime,...) Issues are the same, so assume programmer does it all explicitly + Scheduling Parallel Algorithm Find max. degree of Parallelism (DOP) or concurrency (Dependency analysis/ graph) Max. no of Tasks Processes Tasks How many tasks? Task (grain) size? Communication Abstraction Tasks  Processes Processes  Processors + Execution Order (scheduling) Fine-grain Parallel Computations  Tasks Fine-grain Parallel Computations 1 EECC756 - Shaaban lec # 4 Spring (PCA Chapter 2.3) From last lecture

2 Example Motivating Problem: Simulating Ocean Currents/Heat Transfer... Model as two-dimensional grids Discretize in space and time – finer spatial and temporal resolution => greater accuracy Many different computations per time step, O(n 2 ) per grid. – set up and solve equations iteratively (Gauss-Seidel). Concurrency across and within grid computations per iteration – n 2 parallel computations per grid x number of grids n n Degree of Parallelism (DOP) or concurrency: O(n 2 ) data parallel computations per grid A[i,j] = 0.2  (A[i,j] +A[i,j –1] +A[i –1, j] + A[i,j+ 1] +A[i+ 1, j]) Expression for updating each interior point: 2D Grid n grids Total O(n 3 ) Computations Per iteration From last lecture (PCA Chapter 2.3)

3 Parallelization of An Example Program Examine a simplified version of a piece of Ocean simulation Iterative equation solver Illustrate parallel program in low-level parallel language: C-like pseudo-code with simple extensions for parallelism Expose basic communication and synchronization primitives that must be supported by parallel programming model. (PCA Chapter 2.3) 2D Grid, n 2 points Data Parallel Shared Address Space (SAS) Message Passing

4 2D Grid Solver Example Simplified version of solver in Ocean simulation Gauss-Seidel (near-neighbor) sweeps (iterations) to convergence: – Interior n-by-n points of (n+2)-by-(n+2) updated in each sweep (iteration) – Updates done in-place in grid, and difference from previous value is computed – Accumulate partial differences into a global difference at the end of every sweep or iteration – Check if error (global difference) has converged (to within a tolerance parameter) If so, exit solver; if not, do another sweep (iteration) Or iterate for a set maximum number of iterations. Computation = O(n 2 ) per sweep or iteration n 2 = n x n interior grid points n + 2 points Boundary Points Fixed

5 Pseudocode, Sequential Equation Solver { Sweep O(n 2 ) computations Done? Iterate until convergence TOL, tolerance or threshold Initialize grid points Call equation solver

6 Decomposition Concurrency O(n) along anti-diagonals, serialization O(n) along diagonal Retain loop structure, use pt-to-pt synch; Problem: too many synch ops. Restructure loops, use global synch; imbalance and too much synch Simple way to identify concurrency is to look at loop iterations – Dependency analysis; if not enough concurrency is found, then look further into application Not much concurrency here at this level (all loops sequential) Examine fundamental dependencies, ignoring loop structure Concurrency along anti-diagonals O(n) Serialization along diagonals O(n) New Old

7 Exploit Application Knowledge Different ordering of updates: may converge quicker or slower Red sweep and black sweep are each fully parallel: Global synchronization between them (conservative but convenient) Ocean uses red-black; we use simpler, asynchronous one to illustrate – no red-black sweeps, simply ignore dependencies within a single sweep (iteration) all points can be updated in parallel – Sequential order same as original. Reorder grid traversal: red-black ordering Maximum Degree of parallelism = DOP = O(n 2 ) Type of parallelism: Data parallelism One point update per task (n 2 parallel tasks) Computation = 1 Communication = 4 Communication-to-Computation ratio = 4 For PRAM with n 2 processors: Sweep = O(1) Global Difference = O( log 2 n 2 ) Thus: T = O( log 2 n 2 ) Two parallel sweeps Each with n 2 /2 points Decomposition:

8 Decomposition Only Decomposition into elements: degree of concurrency n 2 To decompose into rows, make line 18 loop sequential; degree of parallelism (DOP) = n for_all leaves assignment left to system – but implicit global synch. at end of for_all loop Parallel PRAM O(1) Global Difference PRAM O( log 2 n 2 ) Task = update one grid point Coarser Grain: n parallel tasks each update a row Task = grid row Computation = O(n) Communication = O(n) ~ 2n Communication to Computation ratio = O(1) ~ 2 Degree of Parallelism (DOP) The “for_all” loop construct imply parallel loop computations Fine Grain: n 2 parallel tasks each updates one element DOP = O(n 2 ) Point Update DOP = O(n 2 ) Task = update one row of points O(n 2 ) Parallel computations

9 Assignment: (n/p rows per task) Dynamic assignment – Get a row index, work on the row, get a new row, and so on Static assignment into rows reduces concurrency (from n 2 to p) – concurrency (DOP) = n for one row per task C-to-C = O(1) – Block assign. reduces communication by keeping adjacent rows together Let’s examine orchestration under three programming models i p Static assignments (given decomposition into rows) – Block assignment of rows: Row i is assigned to process – Cyclic assignment of rows: process i is assigned rows i, i+p, and so on p = number of processors < n p tasks or processes Task = updates n/p rows = n 2 /p elements Computation = O(n 2 /p) Communication = O(n) Communication-to-Computation ratio = O ( n / (n 2 /p) ) = O(p/n) Lower C-to-C ratio is better Block or strip assignment n/p rows per task p = number of processors (tasks or processes)

10 Data Parallel Solver Block decomposition by row Sweep: T = O(n 2 /p) Add all local differences (REDUCE) cost depends on architecture and implementation of REDUCE best: O(log 2 p) using binary tree reduction Worst: O(p) sequentially } O(n 2 /p + log 2 p)  T(iteration)  O(n 2 /p + p) nprocs = number of processes = p In Parallel

11 Shared Address Space Solver Assignment controlled by values of variables used as loop bounds Single Program Multiple Data (SPMD) Still MIMD Setup Barrier 1 Barrier 2 Barrier 3 Not Done? Sweep again (iteration) n/p rows or n 2 /p points per process or task i.e Which n/p rows to update For process Done ? All processes test for convergence

12 Pseudo-code, Parallel Equation Solver for Shared Address Space (SAS) # of processors = p = nprocs T = O(p) Serialized update of global difference Barrier 1 Barrier 2 Barrier 3 Check convergence: all processes do it Main process or thread Create p-1 processes (Start sweep) (sweep done) mymin = low row mymax = high row T(p) = O(n 2 /p + p) Sweep: T = O(n 2 /p) Array “A” is shared Which rows? SAS Mutual Exclusion (lock) for global difference Critical Section: global difference Loop Bounds/Which Rows?

13 Notes on SAS Program SPMD: not lockstep (i.e. still MIMD not SIMD) or even necessarily same instructions. Assignment controlled by values of variables used as loop bounds (i.e. mymin, mymax) – Unique pid per process, used to control assignment of blocks of rows to processes. Done condition (convergence test) evaluated redundantly by all processes Code that does the update identical to sequential program – Each process has private mydiff variable Most interesting special operations needed are for synchronization – Accumulations of local differences (mydiff) into shared global difference have to be mutually exclusive – Why the need for all the barriers? SPMD = Single Program Multiple Data Otherwise each process must enter the shared global difference critical section n 2 /p times (n 2 times total ) instead of just p times per iteration for all processes

14 Need for Mutual Exclusion Code each process executes: load the value of diff into register r1 add the register r2 to register r1 store the value of register r1 into diff A possible interleaving: P1 P2 r1  diff {P1 gets 0 in its r1} r1  diff {P2 also gets 0} r1  r1+r2 {P1 sets its r1 to 1} r1  r1+r2 {P2 sets its r1 to 1} diff  r1 {P1 sets cell_cost to 1} diff  r1 {P2 also sets cell_cost to 1} Need the sets of operations to be atomic (mutually exclusive) Time r2 = mydiff = Local Difference

15 Mutual Exclusion Provided by LOCK-UNLOCK around critical section Set of operations we want to execute atomically Implementation of LOCK/UNLOCK must guarantee mutual exclusion. Can lead to significant serialization if contended (many tasks want to enter critical section at the same time) Especially since expect non-local accesses in critical section Another reason to use private mydiff for partial accumulation – Reduce the number times needed to enter critical section by each process to update global difference: Once per iteration vs. n 2 /p times per process without mydiff

16 Global Event Synchronization BARRIER(nprocs): wait here till nprocs processes get here Built using lower level primitives Global sum example: wait for all to accumulate before using sum Often used to separate phases of computation Process P_1Process P_2Process P_nprocs set up eqn system set up eqn systemset up eqn system Barrier (name, nprocs) Barrier (name, nprocs)Barrier (name, nprocs) solve eqn system solve eqn system solve eqn system Barrier (name, nprocs) Barrier (name, nprocs)Barrier (name, nprocs) apply results apply results apply results Barrier (name, nprocs) Barrier (name, nprocs)Barrier (name, nprocs) Conservative form of preserving dependencies, but easy to use WAIT_FOR_END (nprocs-1) Convergence Test Done by all processes

17 Point-to-point Event Synchronization (Not Used Here) One process notifies another of an event so it can proceed: Needed for task ordering according to data dependence between tasks Common example: producer-consumer (bounded buffer) Concurrent programming on uniprocessor: semaphores Shared address space parallel programs: semaphores, or use ordinary variables as flags Busy-waiting (i.e. spinning) Or block process (better for uniprocessors?) Initially flag = 0 P2P2 P1P1 On A Or compute using “A” as operand

18 Message Passing Grid Solver Cannot declare A to be a shared array any more Need to compose it logically from per-process private arrays – Usually allocated in accordance with the assignment of work – Process assigned a set of rows allocates them locally Transfers (communication) of entire rows between tasks needed Structurally similar to SAS (e.g. SPMD), but orchestration is different – Data structures and data access/naming – Communication – Synchronization Send/receive pairs No shared address space } Explicit Implicit Ghost rows

19 Parallel Computation = O(n 2 /p) Communication of rows = O(n) Communication of local DIFF = O(p) Computation = O(n 2 /p) Communication = O( n + p) Communication-to-Computation Ratio = O( (n+p)/(n 2 /p) ) = O( (np + p 2 ) / n 2 ) Message Passing Grid Solver Pid = nprocs -1 pid = 0 n/p rows per task n Send Row Receive Row Send Row Receive Row Send Row Receive Row Send Row Receive Row Send Row Receive Row Send Row Receive Row pid 1 Ghost Rows for Task pid 1 Time per iteration: T = T(computation) + T(communication) T = O(n 2 /p + n + p) nprocs = number of processes = number of processors = p n/p rows or n 2 /p points per process or task

20 Pseudo-code, Parallel Equation Solver for Message Passing Communication O(n) exchange ghost rows O(p) Computation O(n 2 /p) # of processors = p = nprocs Create p-1 processes Initialize local rows myA exchange ghost rows Sweep n/p rows = n 2 /p points per task T = O(n 2 /p) T = O(n 2 /p + n + p) Pid 0: calculate global difference and test for convergence send test result Send mydiff to pid 0 { Receive test result from pid 0 send/receive Message Passing

21 Notes on Message Passing Program Use of ghost rows Receive does not transfer data, send does (sender-initiated) – Unlike SAS which is usually receiver-initiated (load fetches data) Communication done at beginning of iteration ( exchange of ghost rows). Communication in whole rows, not one element at a time Core similar, but indices/bounds in local rather than global space Synchronization through sends and receives (implicit) – Update of global difference and event synch for done condition – Could implement locks and barriers with messages Only one process (pid = 0) checks convergence (done condition) Can use REDUCE and BROADCAST library calls to simplify code Tell all tasks if done

22 Message-Passing Modes: Send and Receive Alternatives Affect event synch (mutual exclusion implied: only one process touches data) Affect ease of programming and performance Synchronous messages provide built-in synch. through match Separate event synchronization needed with asynch. messages With synchronous messages, our code is deadlocked. Fix? Can extend functionality: stride, scatter-gather, groups Semantic flavors: based on when control is returned Affect when data structures or buffers can be reused at either end Send/Receive SynchronousAsynchronous BlockingNon-blocking Send waits until message is acutely received Receive: Wait until message is received Send: Wait until message is sent Return immediately (both) Easy to create Deadlock Use asynchronous sends/receives

23 Synchronous Message Passing: Process X executing a synchronous send to process Y has to wait until process Y has executed a synchronous receive from X. Asynchronous Message Passing: Blocking Send/Receive: A blocking send is executed when a process reaches it without waiting for a corresponding receive. Returns when the message is sent. A blocking receive is executed when a process reaches it and only returns after the message has been received. Non-Blocking Send/Receive: A non-blocking send is executed when reached by the process without waiting for a corresponding receive. A non-blocking receive is executed when a process reaches it without waiting for a corresponding send. Both return immediately. Message-Passing Modes: Send and Receive Alternatives

24 Orchestration: Summary Shared address space Shared and private data explicitly separate Communication implicit in access patterns No correctness need for data distribution Synchronization via atomic operations on shared data Synchronization explicit and distinct from data communication Message passing Data distribution among local address spaces needed No explicit shared structures (implicit in communication patterns) Communication is explicit Synchronization implicit in communication (at least in synch. case) – Mutual exclusion implied

25 Correctness in Grid Solver Program Decomposition and Assignment similar in SAS and message-passing Orchestration is different: Data structures, data access/naming, communication, synchronization Requirements for performance are another story... Ghost Rows Send/ Receive Pairs Lock/unlock Barriers