1 Lecture 2: Parallel Programs Topics: parallel applications, parallelization process, consistency models.

Slides:



Advertisements
Similar presentations
Concurrency Important and difficult (Ada slides copied from Ed Schonberg)
Advertisements

CS 484. Discrete Optimization Problems A discrete optimization problem can be expressed as (S, f) S is the set of all feasible solutions f is the cost.
Indian Institute of Science Bangalore, India भारतीय विज्ञान संस्थान बंगलौर, भारत Supercomputer Education and Research Centre (SERC) SE 292: High Performance.
Perspective on Parallel Programming Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals.
1 Parallelization of An Example Program Examine a simplified version of a piece of Ocean simulation Iterative equation solver Illustrate parallel program.
Grid solver example Gauss-Seidel (near-neighbor) sweeps to convergence  interior n-by-n points of (n+2)-by-(n+2) updated in each sweep  updates done.
1 Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models.
Reference: Message Passing Fundamentals.
1 Lecture 19: Shared-Memory Multiprocessors Topics: coherence protocols for symmetric shared-memory multiprocessors (Sections )
ECE669 L4: Parallel Applications February 10, 2004 ECE 669 Parallel Computer Architecture Lecture 4 Parallel Applications.
Lecture 18: Multiprocessors
1 Lecture 18: Large Caches, Multiprocessors Today: NUCA caches, multiprocessors (Sections ) Reminder: assignment 5 due Thursday (don’t procrastinate!)
ECE669 L5: Grid Computations February 12, 2004 ECE 669 Parallel Computer Architecture Lecture 5 Grid Computations.
1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.
1 Parallel Programs. 2 Why Bother with Programs? They’re what runs on the machines we design Helps make design decisions Helps evaluate systems tradeoffs.
1 Lecture 24: Multiprocessors Today’s topics:  Directory-based cache coherence protocol  Synchronization  Consistency  Writing parallel programs Reminder:
Efficient Parallelization for AMR MHD Multiphysics Calculations Implementation in AstroBEAR.
1 Steps in Creating a Parallel Program 4 steps: Decomposition, Assignment, Orchestration, Mapping Done by programmer or system software (compiler, runtime,...)
1 Steps in Creating a Parallel Program 4 steps: Decomposition, Assignment, Orchestration, Mapping Done by programmer or system software (compiler, runtime,...)
1 Lecture 2: Intro and Snooping Protocols Topics: multi-core cache organizations, programming models, cache coherence (snooping-based)
1 Lecture 18: Shared-Memory Multiprocessors Topics: coherence protocols for symmetric shared-memory multiprocessors (Sections )
Implications for Programming Models Todd C. Mowry CS 495 September 12, 2002.
1 Lecture 1: Introduction Course organization:  13 lectures on parallel architectures  ~5 lectures on cache coherence, consistency  ~3 lectures on TM.
1 Lecture 25: Multi-core Processors Today’s topics:  Writing parallel programs  SMT  Multi-core examples Reminder:  Assignment 9 due Tuesday.
ECE669 L6: Programming for Performance February 17, 2004 ECE 669 Parallel Computer Architecture Lecture 6 Programming for Performance.
Parallel Programming: Case Studies Todd C. Mowry CS 495 September 12, 2002.
1 Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols.
Distributed Shared Memory Systems and Programming
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
SE-292 High Performance Computing Intro. to Parallelization Sathish Vadhiyar.
(Short) Introduction to Parallel Computing CS 6560: Operating Systems Design.
CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.
Steps in Creating a Parallel Program
Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
A Pattern Language for Parallel Programming Beverly Sanders University of Florida.
August 13, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 11: Multiprocessors: Uniform Memory Access * Jeremy R. Johnson Monday,
1 Lecture 25: Multiprocessors Today’s topics:  Synchronization  Consistency  Shared memory vs message-passing  Simultaneous multi-threading (SMT)
1 Lecture: Memory Technology Innovations Topics: state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile cells, photonics Multiprocessor.
Concurrency and Performance Based on slides by Henri Casanova.
Parallel programs Inf-2202 Concurrent and Data-intensive Programming Fall 2016 Lars Ailo Bongo
Multiprocessors Oracle SPARC M core, 64MB L3 cache (8 x 8 MB), 1.6TB/s. 256 KB of 4-way SA L2 ICache, 0.5 TB/s per cluster. 2 cores share 256 KB,
Steps in Creating a Parallel Program
Perspective on Parallel Programming
Steps in Creating a Parallel Program
Lecture 26: Multiprocessors
Steps in Creating a Parallel Program
Lecture 26: Multiprocessors
Steps in Creating a Parallel Program
Lecture 2: Parallel Programs
Steps in Creating a Parallel Program
Lecture: Coherence Protocols
Steps in Creating a Parallel Program
Lecture 27: Pot-Pourri Today’s topics:
Parallelization of An Example Program
Lecture: Coherence Protocols
Steps in Creating a Parallel Program
Lecture 27: Multiprocessors
Lecture 17: Multi-threaded Applications
Lecture 26: Multiprocessors
Mattan Erez The University of Texas at Austin
Lecture 3: Coherence Protocols
Steps in Creating a Parallel Program
Lecture 19: Coherence Protocols
Lecture 18: Cache Coherence
Parallel Programming in C with MPI and OpenMP
Lecture 18: Coherence and Synchronization
Steps in Creating a Parallel Program
Presentation transcript:

1 Lecture 2: Parallel Programs Topics: parallel applications, parallelization process, consistency models

2 Parallel Application Examples Simulating ocean currents Simulating evolution of galaxies Visualizing complex scenes using raytracing Mining data for associations

3 Ocean Simulates motion of water currents, influenced by wind, friction, etc. We examine a horizontal cross-section of the ocean at a time and the cross-section is modeled as a grid of equidistant points At each time step, the value of each variable at each grid point is updated based on neighboring values and equations of motion Apparently high concurrency

4 Barnes-Hut Problem studies how galaxies evolve by simulating mutual gravitational effects of n bodies A naïve algorithm computes pairwise interactions in every time step (O(n 2 )) – a hierarchical algorithm can achieve good accuracy and run in O(n log n) by grouping distant stars Apparently high concurrency, but varying star density can lead to load balancing issues

5 Data Mining Data mining attempts to identify patterns in transactions For example, association mining computes sets of commonly purchased items and the conditional probability that a customer purchases a set, given they purchased another set of products Itemsets are iteratively computed – an itemset of size k is constructed by examining itemsets of size k-1 Database queries are simpler and computationally less expensive, but also represent an important benchmark for multiprocessors

6 Parallelization Process Ideally, a parallel application must be constructed by designing a parallel algorithm from scratch In most cases, we begin with a sequential version – the quest for efficient automated parallelization continues… Converting a sequential program involves:  Decomposition of the computation into tasks  Assignment of tasks to processes  Orchestration of data access, communication, and synchronization  Mapping or binding processes to processors

7 Partitioning Decomposition and Assignment are together called partitioning – partitioning is algorithmic, while orchestration is a function of the programming model and architecture The number of tasks at any given time is the level of concurrency in the application – the average level of concurrency places a bound on speedup (Amdahl’s Law) To reduce inter-process communication or load imbalance, many tasks may be assigned to a single process – this assignment may be either static or dynamic We will assume that processes do not migrate (fixed mapping) in order to preserve locality

8 Parallelization Goals StepArchitecture- Dependent? Major Performance Goals DecompositionMostly noExpose enough concurrency AssignmentMostly noBalance workload Reduce communication volume OrchestrationYesReduce communication via data locality Reduce communication and synch cost Reduce serialization at shared resources Schedule tasks to satisfy dependences early MappingYesPut related processes on same processor Exploit locality in network topology

9 Case Study: Ocean Kernel Gauss-Seidel method: sweep through the entire 2D array and update each point with the average of its value and its neighboring values; repeat until the values converge Since we sweep from top to bottom and left to right, the averaging step uses new values for the top and left neighbors, and old values for the bottom and right neighbors

10 Ocean Kernel Procedure Solve(A) begin diff = done = 0; while (!done) do diff = 0; for i  1 to n do for j  1 to n do temp = A[i,j]; A[i,j]  0.2 * (A[i,j] + neighbors); diff += abs(A[i,j] – temp); end for if (diff < TOL) then done = 1; end while end procedure

11 Concurrency Need synch after every anti-diagonal Potential load imbalance

12 Algorithmic Modifications Red-Black ordering: the grid is colored red and black similar to a checkerboard; sweep through all red points, then sweep through all black points; there are no dependences within a sweep Asynchronous updates: ignore dependences within a sweep  you may or may not get the most recent value Either of these algorithms expose sufficient concurrency, but you may or may not converge quickly

13 Assignment With the asynchronous method, each process can be assigned a subset of all rows What is the degree of concurrency? What is the communication to computation ratio

14 Orchestration Orchestration is a function of the programming model and architecture Consider the shared address space model – by using the following primitives, the program appears very similar to the sequential version:  CREATE: creates p processes that start executing at procedure proc  LOCK and UNLOCK: acquire and release mutually exclusive access  BARRIER: global synchronization: no process gets past the barrier until n processes have arrived  WAIT_FOR_END: wait for n processes to terminate

15 Shared Address Space Model int n, nprocs; float **A, diff; LOCKDEC(diff_lock); BARDEC(bar1); main() begin read(n); read(nprocs); A  G_MALLOC(); initialize (A); CREATE (nprocs,Solve,A); WAIT_FOR_END (nprocs); end main procedure Solve(A) int i, j, pid, done=0; float temp, mydiff=0; int mymin = 1 + (pid * n/procs); int mymax = mymin + n/nprocs -1; while (!done) do mydiff = diff = 0; BARRIER(bar1,nprocs); for i  mymin to mymax for j  1 to n do … endfor LOCK(diff_lock); diff += mydiff; UNLOCK(diff_lock); BARRIER (bar1, nprocs); if (diff < TOL) then done = 1; BARRIER (bar1, nprocs); endwhile

16 Message Passing Model main() read(n); read(nprocs); CREATE (nprocs-1, Solve); Solve(); WAIT_FOR_END (nprocs-1); procedure Solve() int i, j, pid, nn = n/nprocs, done=0; float temp, tempdiff, mydiff = 0; myA  malloc(…) initialize(myA); while (!done) do mydiff = 0; if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW); if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW); if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW); if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW); for i  1 to nn do for j  1 to n do … endfor if (pid != 0) SEND(mydiff, 1, 0, DIFF); RECEIVE(done, 1, 0, DONE); else for i  1 to nprocs-1 do RECEIVE(tempdiff, 1, *, DIFF); mydiff += tempdiff; endfor if (mydiff < TOL) done = 1; for i  1 to nprocs-1 do SEND(done, 1, I, DONE); endfor endif endwhile

17 Message Passing Model Note that each process has two additional rows to store data produced by its neighbors Unlike the shared memory model, execution is deterministic -- two executions will produce the same result A send-receive match is a synchronization event – hence, we no longer need locks while updating the diff counter, or barriers while allowing processes to proceed with the next iteration

18 Models for SEND and RECEIVE Synchronous: SEND returns control back to the program only when the RECEIVE has completed – will it work for the program on the previous slide? Blocking Asynchronous: SEND returns control back to the program after the OS has copied the message into its space -- the program can now modify the sent data structure Nonblocking Asynchronous: SEND and RECEIVE return control immediately – the message will get copied at some point, so the process must overlap some other computation with the communication – other primitives are used to probe if the communication has finished or not

19 Title Bullet