Lecture 2: Parallel Programs

Lecture 2: Parallel Programs Topics: consistency, parallel applications, parallelization process

Sequential Consistency A multiprocessor is sequentially consistent if the result of the execution is achievable by maintaining program order within a processor and interleaving accesses by different processors in an arbitrary fashion. For example, the code below should ensure mutual exclusion on a sequentially consistent machine.

Initially A = B = 0

P1                 P2
A ← 1              B ← 1
...                ...
if (B == 0)        if (A == 0)
  Crit. Section      Crit. Section
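
As a side note, this flag protocol can be sketched in runnable C11; plain atomics default to memory_order_seq_cst, so these accesses behave as on a sequentially consistent machine (thread and variable names are illustrative, not part of the original example):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int A = 0, B = 0;   /* the two flags, initially 0 */
atomic_int entered = 0;    /* how many threads reached the critical section */

static void *p1(void *arg) {
    atomic_store(&A, 1);                  /* A <- 1 */
    if (atomic_load(&B) == 0)             /* if (B == 0) */
        atomic_fetch_add(&entered, 1);    /* critical section */
    return NULL;
}

static void *p2(void *arg) {
    atomic_store(&B, 1);                  /* B <- 1 */
    if (atomic_load(&A) == 0)             /* if (A == 0) */
        atomic_fetch_add(&entered, 1);    /* critical section */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under sequential consistency the count is never 2; it can be 0,
       when both writes complete before either read. */
    printf("threads in critical section: %d\n", atomic_load(&entered));
    return 0;
}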

Relaxing Memory Ordering

Initially A = B = 0

P1                 P2
A ← 1              B ← 1
...                ...
if (B == 0)        if (A == 0)
  Crit. Section      Crit. Section

Executing memory accesses strictly in order is extremely slow, so we attempt optimizations → sequential consistency is lost. For example, each processor can be out-of-order; within P1, the write to A and the read of B are independent since they refer to different memory locations. Out-of-order execution will allow both processes to enter the critical section.

Relaxed Consistency Models In order to write correct programs, the programmer must understand that memory accesses do not always happen in order. The consistency model specifies how memory ordering differs from that of sequential consistency. If the programmer demands sequential consistency in places, he/she can impose it with special fence instructions – a fence ensures that we make progress only after completing earlier memory accesses. Fences are slow – a better understanding of the program and the consistency model can eliminate some fences.
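
As a concrete (and hedged) illustration, using C11's atomic_thread_fence as the "special fence instruction" in the example above: with relaxed accesses the write and read may be reordered and both threads can enter the critical section; a sequentially consistent fence between each thread's write and read restores mutual exclusion.

/* Variant of p1 from the earlier sketch; p2 is symmetric with A and B
   swapped. Without the fence, this relaxed version can admit both
   threads into the critical section. */
static void *p1_fenced(void *arg) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);       /* A <- 1 */
    atomic_thread_fence(memory_order_seq_cst);                /* fence */
    if (atomic_load_explicit(&B, memory_order_relaxed) == 0)  /* if (B == 0) */
        atomic_fetch_add(&entered, 1);                        /* critical section */
    return NULL;
}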

Parallel Application Examples
 Simulating ocean currents
 Simulating evolution of galaxies
 Visualizing complex scenes using ray tracing
 Mining data for associations

Ocean Simulates the motion of water currents, influenced by wind, friction, etc. We examine one horizontal cross-section of the ocean at a time; the cross-section is modeled as a grid of equidistant points. At each time step, the value of each variable at each grid point is updated based on neighboring values and the equations of motion. Apparently high concurrency.

Barnes-Hut The problem studies how galaxies evolve by simulating the mutual gravitational effects of n bodies. A naïve algorithm computes pairwise interactions in every time step (O(n²)); a hierarchical algorithm can achieve good accuracy and run in O(n log n) by grouping distant stars. Apparently high concurrency, but varying star density can lead to load balancing issues.
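
A minimal C sketch of the naïve O(n²) step (the struct layout, the gravitational constant G, and the softening term eps are illustrative assumptions; Barnes-Hut would replace the inner loop with a traversal of a spatial tree that summarizes distant groups of stars):

#include <math.h>

typedef struct { double x, y, m, ax, ay; } Body;

/* Accumulate the gravitational acceleration on every body from every
   other body: n-1 interactions per body, O(n^2) in total. */
void compute_forces(Body *b, int n, double G, double eps) {
    for (int i = 0; i < n; i++) {
        b[i].ax = b[i].ay = 0.0;
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
            double r2 = dx*dx + dy*dy + eps*eps;  /* softening avoids r = 0 */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            b[i].ax += G * b[j].m * dx * inv_r3;
            b[i].ay += G * b[j].m * dy * inv_r3;
        }
    }
}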

Data Mining Data mining attempts to identify patterns in transactions. For example, association mining computes sets of commonly purchased items and the conditional probability that a customer purchases a set, given that they purchased another set of products. Itemsets are iteratively computed: an itemset of size k is constructed by examining itemsets of size k-1. Database queries are simpler and computationally less expensive, but they too represent an important benchmark for multiprocessors.
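
The iterative construction is the "join" step of the Apriori algorithm; the sketch below (with itemsets as sorted integer arrays, a representation chosen here only for illustration) builds one size-k candidate from two size-(k-1) itemsets that share their first k-2 items. A full miner would also prune candidates containing an infrequent subset and then count each candidate's support over the transaction database.

#include <string.h>

/* Join two sorted (k-1)-itemsets into a k-itemset. Returns 1 and fills
   'out' (length k) when a and b agree on their first k-2 items and
   a's last item precedes b's, which keeps the result sorted and
   generates each candidate exactly once. */
int join_itemsets(const int *a, const int *b, int k, int *out) {
    for (int i = 0; i < k - 2; i++)
        if (a[i] != b[i]) return 0;
    if (a[k - 2] >= b[k - 2]) return 0;
    memcpy(out, a, (k - 1) * sizeof(int));
    out[k - 1] = b[k - 2];
    return 1;
}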

Parallelization Process Ideally, a parallel application is constructed by designing a parallel algorithm from scratch. In most cases, we begin with a sequential version; the quest for efficient automated parallelization continues… Converting a sequential program involves:
 Decomposition of the computation into tasks
 Assignment of tasks to processes
 Orchestration of data access, communication, and synchronization
 Mapping or binding of processes to processors

Partitioning Decomposition and assignment are together called partitioning; partitioning is algorithmic, while orchestration is a function of the programming model and architecture. The number of tasks at any given time is the level of concurrency in the application; the average level of concurrency places a bound on speedup (Amdahl's Law). To reduce inter-process communication or load imbalance, many tasks may be assigned to a single process; this assignment may be either static or dynamic. We will assume that processes do not migrate (fixed mapping) in order to preserve locality.
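
The bound mentioned above can be made explicit. If a fraction f of the total work can use all p processors and the remaining 1 - f is serial, then

  \text{Speedup}(p) = \frac{1}{(1-f) + f/p} \le \frac{1}{1-f}

so, for example, f = 0.9 caps the achievable speedup at 10 no matter how many processors are used.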

Parallelization Goals

Step            Architecture-Dependent?   Major Performance Goals
Decomposition   Mostly no                 Expose enough concurrency
Assignment      Mostly no                 Balance workload;
                                          reduce communication volume
Orchestration   Yes                       Reduce communication via data locality;
                                          reduce communication and synch cost;
                                          reduce serialization at shared resources;
                                          schedule tasks to satisfy dependences early
Mapping         Yes                       Put related processes on same processor;
                                          exploit locality in network topology

Case Study: Ocean Kernel Gauss-Seidel method: sweep through the entire 2D array and update each point with the average of its value and its neighbors' values; repeat until the values converge. Since we sweep from top to bottom and left to right, the averaging step uses new values for the top and left neighbors, and old values for the bottom and right neighbors.

Ocean Kernel

procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);
        diff += abs(A[i,j] – temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
end procedure

Concurrency Within a sweep, each point depends on its top and left neighbors, so the points that are independent at any moment lie along an anti-diagonal of the grid:
 Need synch after every anti-diagonal
 Potential load imbalance (anti-diagonals differ in length)
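
One way to code the anti-diagonal (wavefront) traversal, with OpenMP used purely for illustration: all points with i + j = d are independent, and the implicit barrier at the end of each parallel loop provides the per-diagonal synchronization. The varying diagonal lengths are exactly the load-imbalance concern noted above. (A is assumed to be (n+2) x (n+2), with rows/columns 0 and n+1 holding fixed boundary values.)

void sweep_wavefront(double **A, int n) {
    for (int d = 2; d <= 2 * n; d++) {   /* anti-diagonal index: i + j = d */
        int lo = (d - n > 1) ? d - n : 1;
        int hi = (d - 1 < n) ? d - 1 : n;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++) {
            int j = d - i;   /* every point on this diagonal is independent */
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);
        }
    }
}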

Algorithmic Modifications Red-Black ordering: the grid is colored red and black similar to a checkerboard; sweep through all red points, then sweep through all black points; there are no dependences within a sweep. Asynchronous updates: ignore dependences within a sweep, so a point may or may not see the most recent values of its neighbors. Either modification exposes sufficient concurrency, but the solver may or may not converge quickly.
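
A sketch of the red-black sweep (OpenMP again used only for illustration; coloring by the parity of i + j is one common convention, with the same boundary assumptions as before):

void sweep_red_black(double **A, int n) {
    for (int color = 0; color <= 1; color++) {   /* 0 = red, 1 = black */
        #pragma omp parallel for
        for (int i = 1; i <= n; i++)
            /* visit only the points where (i + j) % 2 == color */
            for (int j = 2 - ((i + color) % 2); j <= n; j += 2)
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                         + A[i][j+1] + A[i+1][j]);
    }
}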

Assignment With the asynchronous method, each process can be assigned a subset of all rows. What is the degree of concurrency? What is the communication-to-computation ratio?
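
One way to work these out, assuming each of p processes is statically assigned a block of n/p contiguous rows of the n x n grid: the degree of concurrency is n (at most one independent task per row), and per sweep each process computes about n²/p points while exchanging its two boundary rows (about 2n points), giving

  \frac{\text{communication}}{\text{computation}} \approx \frac{2n}{n^2/p} = \frac{2p}{n}

which improves as the problem size grows relative to the number of processes.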

Orchestration Orchestration is a function of the programming model and architecture. Consider the shared address space model; by using the following primitives, the program appears very similar to the sequential version:
 CREATE: creates p processes that start executing at procedure proc
 LOCK and UNLOCK: acquire and release mutually exclusive access
 BARRIER: global synchronization – no process gets past the barrier until n processes have arrived
 WAIT_FOR_END: wait for n processes to terminate

Shared Address Space Model

int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize(A);
  CREATE(nprocs, Solve, A);
  WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
  int i, j, pid, done = 0;
  float temp, mydiff = 0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1, nprocs);
    for i ← mymin to mymax do
      for j ← 1 to n do
        …
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER(bar1, nprocs);
    if (diff < TOL) then done = 1;
  endwhile
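
For comparison, a hedged sketch of the same orchestration in C with POSIX threads standing in for the CREATE / LOCK / UNLOCK / BARRIER / WAIT_FOR_END macros. The grid update is elided exactly as in the pseudocode, n and nprocs are hard-coded stand-ins for read(), and a third barrier is added here so that no process resets diff for the next iteration while another is still testing it:

#include <pthread.h>
#include <stdlib.h>

#define TOL 1e-3f

int n = 256, nprocs = 4;     /* stand-ins for read(n), read(nprocs) */
float **A;                   /* allocation and initialization elided */
float diff;
pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t bar1;

void *solve(void *arg) {
    int pid = (int)(long)arg;
    int mymin = 1 + pid * (n / nprocs);
    int mymax = mymin + n / nprocs - 1;
    int done = 0;
    while (!done) {
        float mydiff = 0;
        if (pid == 0) diff = 0;              /* one thread resets the sum */
        pthread_barrier_wait(&bar1);         /* everyone sees diff == 0 */
        for (int i = mymin; i <= mymax; i++)
            for (int j = 1; j <= n; j++) {
                /* ... update A[i][j], accumulate |change| into mydiff ... */
            }
        pthread_mutex_lock(&diff_lock);      /* LOCK(diff_lock) */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);    /* UNLOCK(diff_lock) */
        pthread_barrier_wait(&bar1);         /* all partial sums are in */
        if (diff < TOL) done = 1;            /* each thread tests the same value */
        pthread_barrier_wait(&bar1);         /* don't reset diff while others test */
    }
    return NULL;
}

int main(void) {
    pthread_barrier_init(&bar1, NULL, nprocs);
    pthread_t *t = malloc(nprocs * sizeof *t);
    for (long p = 0; p < nprocs; p++)
        pthread_create(&t[p], NULL, solve, (void *)p);   /* CREATE */
    for (int p = 0; p < nprocs; p++)
        pthread_join(t[p], NULL);                        /* WAIT_FOR_END */
    free(t);
    return 0;
}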
