Definitions
A synchronous application is one where all processes must reach certain points before execution continues.
Local synchronization is a requirement that a subset of processes (usually neighbors) reach a synchronization point before execution continues.
A barrier is the basic message-passing mechanism for synchronizing processes.
Deadlock occurs when a group of processes waits permanently for messages that can never arrive because the sending processes are themselves blocked waiting for messages.
Barrier Illustration
(figure: processor timelines — executing, then waiting at the barrier)
C: MPI_Barrier(MPI_COMM_WORLD);
Processors reach the barrier point at different times. This leads to idle time and load imbalance.
Counter (Linear) Barrier: Implementation
Master processor: O(P) steps
  FOR (i=0; i<P; i++)   // Entry phase
    Receive null message from any processor
  FOR (i=0; i<P; i++)   // Release phase
    Send null message to release slaves
Slave processors:
  Send null message to enter barrier
  Receive null message for barrier release
Note: This two-phase logic prevents a processor from entering the next barrier before all processors have been released from the current one.
Barriers consist of two phases: an entry phase and a release phase.
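The slide's counter barrier uses message passing; as a minimal shared-memory sketch of the same two-phase idea, the POSIX-threads version below counts arrivals (entry phase) and has the last arrival release everyone (release phase). The sense flag plays the role of the note above: it keeps a fast thread from slipping into the next barrier before the prior release completes. All names (`counter_barrier`, `barrier_demo`, `NTHREADS`) are illustrative, not from the slides.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int count = 0;                  /* arrivals in the current episode */
static int sense = 0;                  /* flips once per barrier episode */
static int arrived_before_barrier = 0;

static void counter_barrier(void) {
    pthread_mutex_lock(&lock);
    int my_sense = sense;
    if (++count == NTHREADS) {         /* last arrival: release phase */
        count = 0;
        sense = !sense;
        pthread_cond_broadcast(&cond);
    } else {
        while (my_sense == sense)      /* entry phase: wait for release */
            pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    arrived_before_barrier++;          /* "work" done before the barrier */
    pthread_mutex_unlock(&lock);
    counter_barrier();
    /* after the barrier, every thread must observe all arrivals */
    assert(arrived_before_barrier == NTHREADS);
    return NULL;
}

int barrier_demo(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return arrived_before_barrier;     /* NTHREADS if the barrier held */
}
```

Like the master/slave version, this takes O(P) work per episode, since one counter serializes all arrivals.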
Tree (Non-linear) Barrier
(figure: entry phase combines P0…P7 up a binary tree; release phase fans back down)
Note: The implementation logic is similar to divide and conquer.
The release phase uses the inverse tree construction; entry and release each require O(lg P) steps.
Butterfly Barrier
–Stage 1: p0↔p1; p2↔p3; p4↔p5; p6↔p7
–Stage 2: p0↔p2; p1↔p3; p4↔p6; p5↔p7
–Stage 3: p0↔p4; p1↔p5; p2↔p6; p3↔p7
(figure: pairwise exchanges among P0…P7 at each of the three stages)
Advantages: requires only a single parallel send()/receive() pair per processor at each stage; completes in only O(lg P) steps
Note: At stage s, processor p synchronizes with processor p XOR 2^(s-1)
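The stage pairings listed above can be computed with a bitwise XOR: at stage 1 partners differ in bit 0, at stage 2 in bit 1, and so on. A minimal sketch (the function name is illustrative; P must be a power of two):

```c
#include <assert.h>

/* Butterfly barrier partner: at stage s (1-based), processor p
 * synchronizes with the processor whose rank differs in bit s-1. */
int butterfly_partner(int p, int stage) {
    return p ^ (1 << (stage - 1));
}
```

The pairing is symmetric — `butterfly_partner(butterfly_partner(p, s), s) == p` — which is exactly why a single send()/receive() pair per processor suffices at each stage.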
Local Synchronization
Even-numbered processors:
  Send null message to processor i-1
  Receive null message from processor i-1
  Send null message to processor i+1
  Receive null message from processor i+1
Odd-numbered processors:
  Receive null message from processor i+1
  Send null message to processor i+1
  Receive null message from processor i-1
  Send null message to processor i-1
Notes:
–Local synchronization is an incomplete barrier: processors exit after exchanging messages with their immediate neighbors only
–Reminder: Deadlock can occur with incorrect message-passing orders. MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock-free
Synchronize with neighbors before proceeding
Local Synchronization Example
Heat Distribution Problem
–Goal: determine the final temperature at each point of an n x n grid
–Initial boundary condition: temperatures are known at designated points (e.g., the outer rim or an internal heat sink)
–Cannot proceed to the next iteration until local synchronization completes
DO
  Average each grid point with its neighbors
UNTIL temperature changes are small enough
New value = (∑ neighbors)/4
Sequential Heat Distribution Code
Initialize rows 0,n and columns 0,n of g and h
iteration = 0
DO
  FOR (i=1; i<n; i++)
    FOR (j=1; j<n; j++)
      IF (iteration % 2)
        h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
      ELSE
        g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4
  iteration++
UNTIL max(|g[i][j] – h[i][j]|) < MAX
Notes
Even iterations update the g array; odd iterations update the h array
Recall: odd/even sort
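The pseudocode above can be made runnable as the sketch below; the grid size `N`, tolerance `TOL`, and the uniform boundary temperature are illustrative choices, not from the slides. With every boundary cell held at one temperature, the interior must relax to that same temperature.

```c
#include <math.h>
#include <string.h>

#define N   8       /* grid is (N+1) x (N+1); rows/cols 0 and N are boundary */
#define TOL 0.001   /* stop when successive arrays differ by less than this */

double heat_center(double boundary_temp) {
    static double g[N + 1][N + 1], h[N + 1][N + 1];
    /* initialize rows 0,N and columns 0,N of g and h; interior starts at 0 */
    memset(g, 0, sizeof g);
    for (int i = 0; i <= N; i++)
        g[0][i] = g[N][i] = g[i][0] = g[i][N] = boundary_temp;
    memcpy(h, g, sizeof g);

    double maxdiff;
    int iteration = 0;
    do {
        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++) {
                if (iteration % 2)   /* odd iterations update h from g */
                    h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4;
                else                 /* even iterations update g from h */
                    g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4;
            }
        maxdiff = 0.0;               /* max |g - h| over the interior */
        for (int i = 1; i < N; i++)
            for (int j = 1; j < N; j++) {
                double d = fabs(g[i][j] - h[i][j]);
                if (d > maxdiff) maxdiff = d;
            }
        iteration++;
    } while (maxdiff > TOL);
    return g[N/2][N/2];              /* center temperature at convergence */
}
```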
Block or Strip Partitioning
(figure: 16 processors arranged as a 4x4 grid of blocks; 8 processors as column strips)
Block partitioning (allocate in squares)
–Eight messages exchanged at each iteration
–Data exchanged per message is n/sqrt(P)
Strip partitioning
–Four messages exchanged at each iteration
–Data exchanged per message is n
Question: Which is better?
Assign portions of the grid to processors in the topology
Strip versus Block Partitioning
Characteristics
–Strip partitioning: generally more data, fewer messages
–Block partitioning: generally less data, more messages
–Choice: low latency favors block; high latency favors strip
Example: Grid is 64 x 64, P = 16
–Strip partitioning: strips are 4 x 64; 4 x 64 cells transferred per iteration per processor
–Block partitioning: blocks are 16 x 16; 8 x 16 cells transferred per iteration per processor
Example: Grid is 64 x 64, P = 4
–Strip partitioning: strips are 16 x 64; 4 x 64 cells transferred per iteration per processor
–Block partitioning: blocks are 32 x 32; 8 x 32 cells transferred per iteration per processor
Parallel Implementation
Modifications to the sequential algorithm
–Declare "ghost" rows to hold adjacent data (declare a 10 x 10 array for an 8 x 8 block)
–Exchange data with neighbor processors
–Perform the calculation for the local grid cells
(figure: processor Pi exchanges ghost cells with neighbors to the north, south, east, and west)
Heat Distribution Partitioning
SendRcv(row, col, point)
  IF (row, col) is not local
    IF myrank is even
      Send(point, p_row,col)
      Receive(point, p_row,col)
    ELSE
      Receive(point, p_row,col)
      Send(point, p_row,col)
Main logic
FOR each iteration
  FOR each local point
    Compute new temperature
    SendRcv(row-1, col, point)
    SendRcv(row+1, col, point)
    SendRcv(row, col-1, point)
    SendRcv(row, col+1, point)
Full Synchronization
Data parallel computations
–Simultaneously apply the same operation to different data
–This approach models many numerical computations
–Easy to program and scales well to large data sets
Sequential code
  for (i=0; i<n; i++) a[i] = someFunction(a[i]);
Shared memory code
  forall (i=0; i<n; i++) {bodyOfInstructions}
–Note: the forall loop semantics imply a natural barrier
Distributed memory code
  for each local a[i] {someFunction(a[i])}
  barrier();
Data Parallel Example
(figure: processors p0…pn each execute A[i] += k in lock step)
All processors execute instructions in "lock step"
  forall (i=0; i<n; i++) a[i] += k;
Note: Multicomputers partition data into coarse-grain blocks
Prefix-Based Operations
Definition: Given a set of n values a_1, a_2, …, a_n and an associative operation, the operation is applied to each value and all of its predecessors
Prefix sum: {2, 7, 9, 4} → {2, 9, 18, 22}
Application: radix sort
Solution by doubling: an algorithm whose operations span distances that increase in powers of 2 (1, 2, 4, 8, etc. — each iteration doubles)
Prefix Sum by Doubling
Overview
–1. Each data[i] is added to data[i+1]
–2. Each data[i] is added to data[i+2]
–3. Each data[i] is added to data[i+4]
–4. Each data[i] is added to data[i+8]
–etc.
Note: Skip the operation if i + increment exceeds the array length
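The doubling steps above can be sketched sequentially as follows; a scratch copy of the array mimics the lock-step parallel reads (all additions at one step use values from the previous step). The function name and the n ≤ 64 limit are illustrative.

```c
#include <string.h>

/* Prefix sum by doubling: at each step with increment d, every a[i]
 * is added to a[i+d], skipping positions where i + d >= n. */
void prefix_sum_doubling(int *a, int n) {
    int old[64];                          /* scratch; assumes n <= 64 */
    for (int d = 1; d < n; d *= 2) {      /* increments 1, 2, 4, 8, ... */
        memcpy(old, a, n * sizeof(int));  /* mimic lock-step reads */
        for (int i = 0; i < n - d; i++)
            a[i + d] += old[i];
    }
}
```

Running this on the slide's example {2, 7, 9, 4} produces {2, 9, 18, 22}, matching the prefix sums given earlier.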
Prefix Sum Illustration
(figure: doubling steps applied to an array)
Prefix Sum Example
(figure: doubling steps on an example array; * marks sums not added at the next step)
Sequential time: O(n); parallel time: O((n/p) lg(n/p))
Note: * means the sum is not added at the next step
Prefix Sum Parallel Implementation
Sequential code
  for (j=0; j<lg(n); j++)
    for (i=n-1; i>=2^j; i--)
      a[i] += a[i-2^j];
  (the downward order avoids reusing values already updated in this step)
Parallel shared memory fine-grain logic
  for (j=0; j<lg(n); j++)
    forall (i=0; i<n-2^j; i++)
      a[i+2^j] += a[i];
Parallel distributed coarse-grain logic
  for (j=1; j<=lg(P); j++)
    if (myrank + 2^(j-1) < P)
      send(processor's sum, myrank + 2^(j-1));
    if (myrank >= 2^(j-1))
      receive(sum, myrank - 2^(j-1));
      add sum to processor's sum;
Synchronous Iteration
Processes synchronize at each iteration step
Example: simulation of natural processes
Shared memory code
  for (j=0; j<nIterations; j++)
    forall (i=0; i<P; i++)
      algorithm(i);
Distributed memory code
  for (j=0; j<nIterations; j++)
    algorithm(myRank);
    barrier();
Example: n equations, n unknowns
a_{n-1,0}x_0 + a_{n-1,1}x_1 + … + a_{n-1,n-1}x_{n-1} = b_{n-1}
∙∙∙
a_{k,0}x_0 + a_{k,1}x_1 + … + a_{k,n-1}x_{n-1} = b_k
∙∙∙
a_{1,0}x_0 + a_{1,1}x_1 + … + a_{1,n-1}x_{n-1} = b_1
a_{0,0}x_0 + a_{0,1}x_1 + … + a_{0,n-1}x_{n-1} = b_0
Or we can rewrite the equations as follows:
x_k = (b_k – a_{k,0}x_0 – … – a_{k,k-1}x_{k-1} – a_{k,k+1}x_{k+1} – … – a_{k,n-1}x_{n-1}) / a_{k,k}
    = (b_k – ∑_{j≠k} a_{k,j}x_j) / a_{k,k}
Jacobi Iteration
A numerical algorithm to solve n equations with n unknowns
Traditional solutions are O(n^3), or O(n^2) for special cases
Pseudo code
  xnew_i = initial guess
  DO
    x_i = xnew_i
    xnew_i = calculated next guess
  UNTIL ∑_i |xnew_i – x_i| < tolerance
Jacobi iteration always converges if:
  |a_{k,k}| > ∑_{i≠k} |a_{i,k}|
(the diagonal value dominates the column sum)
(figure: error decreasing from iteration i to i+1)
Parallel Jacobi Code
xnew_i = b_i
DO
  for each local i
    x_i = xnew_i
    sum = -a_{i,i} * x_i
    FOR (j=0; j<n; j++)
      sum += a_{i,j} * x_j
    xnew_i = (b_i – sum) / a_{i,i}
  allgather(xnew_i)
  barrier()
UNTIL iterations > MAX or ∑_i |xnew_i – x_i| < tolerance
(figure: Allgather() distributes each processor's xnew_i into every processor's x array)
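A runnable single-process sketch of the iteration above (so the allgather/barrier steps are omitted); the 3x3 system size, tolerance, and function name are illustrative. Note how `sum` starts at -a[i][i]*x[i] so that the full inner loop leaves exactly the off-diagonal terms.

```c
#include <math.h>

#define JN 3   /* illustrative system size */

void jacobi(const double a[JN][JN], const double b[JN], double x[JN],
            double tol, int max_iters) {
    double xnew[JN];
    for (int i = 0; i < JN; i++)
        xnew[i] = b[i];                      /* initial guess */
    for (int iter = 0; iter < max_iters; iter++) {
        for (int i = 0; i < JN; i++)
            x[i] = xnew[i];
        for (int i = 0; i < JN; i++) {
            double sum = -a[i][i] * x[i];    /* cancels the diagonal term */
            for (int j = 0; j < JN; j++)
                sum += a[i][j] * x[j];       /* leaves sum of j != i terms */
            xnew[i] = (b[i] - sum) / a[i][i];
        }
        double change = 0.0;                 /* total movement this step */
        for (int i = 0; i < JN; i++)
            change += fabs(xnew[i] - x[i]);
        if (change < tol) break;
    }
    for (int i = 0; i < JN; i++)
        x[i] = xnew[i];
}
```

For a diagonally dominant system such as a = {{4,1,1},{1,4,1},{1,1,4}}, b = {6,6,6}, the iteration converges to x = (1, 1, 1).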
Additional Jacobi Notes
If P (processor count) < n, allocate blocks of variables to processors
Block allocation: allocate consecutive x_i to processors
Cyclic allocation
–Allocate x_0, x_P, … to p0
–Allocate x_1, x_{P+1}, … to p1
–etc.
Question: Which allocation scheme is better?
(figure: Jacobi performance — computation and communication time versus number of processors)
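The two allocation schemes above reduce to two owner functions: block allocation divides the index range into consecutive chunks, while cyclic allocation deals indices round-robin. A minimal sketch, assuming n is a multiple of P (function names are illustrative):

```c
/* Block allocation: x_i goes to the processor owning the chunk of
 * n/P consecutive indices that contains i. */
int block_owner(int i, int n, int P) {
    return i / (n / P);
}

/* Cyclic allocation: x_i is dealt round-robin, so ownership depends
 * only on i mod P. */
int cyclic_owner(int i, int n, int P) {
    (void)n;          /* n unused; kept for a matching signature */
    return i % P;
}
```

With n = 8 and P = 4, block allocation gives p0 the pair {x_0, x_1}, while cyclic allocation gives p0 the pair {x_0, x_4} — the same load, but different communication patterns.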
Cellular Automata
Definition
–The system has a finite grid of cells
–Each cell can assume a finite number of states
–Cells change state according to a well-defined rule set
–All cell state changes occur simultaneously
–The system iterates through a number of generations
Serious applications: fluid and gas dynamics, biological growth, airplane wing airflow, erosion modeling, groundwater pollution
Fun applications: Game of Life, Sharks and Fishes, Foxes and Rabbits, gaming applications
Note: Animations of these systems can lead to interesting insights
Conway’s Game of Life
The grid (world) is a two-dimensional array of cells
Note: The grid ends can optionally wrap around (like a torus)
Each cell
–Can hold one “organism”
–Has eight neighbor cells: north, northeast, east, southeast, south, southwest, west, northwest
Rules (run the simulation over many generations)
1. An organism dies (loneliness) if fewer than 2 organisms live in neighbor cells
2. An organism survives if 2 or 3 organisms live in neighbor cells
3. An empty cell with exactly 3 living neighbors gives birth to a new organism
4. An organism dies (overpopulation) if 4 or more organisms live in neighbor cells
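One generation of the rules above can be sketched as follows, with the optional torus wrap-around; the 5x5 grid size is an illustrative choice. Note that `next` is computed entirely from `grid` before being copied back — the simultaneous-update requirement of cellular automata (and the reason a parallel version needs a barrier between generations).

```c
#include <string.h>

#define W 5   /* illustrative grid dimensions */
#define H 5

void life_step(int grid[H][W]) {
    int next[H][W];
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++) {
            int live = 0;                       /* count the 8 neighbors */
            for (int dr = -1; dr <= 1; dr++)
                for (int dc = -1; dc <= 1; dc++) {
                    if (dr == 0 && dc == 0) continue;
                    live += grid[(r + dr + H) % H][(c + dc + W) % W];
                }
            if (grid[r][c])
                next[r][c] = (live == 2 || live == 3);  /* rules 1, 2, 4 */
            else
                next[r][c] = (live == 3);               /* rule 3: birth */
        }
    memcpy(grid, next, sizeof next);            /* all changes at once */
}
```

A quick sanity check is the classic "blinker": a vertical line of three organisms becomes a horizontal line after one generation.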
Sharks and Fishes
The grid (ocean) is modeled by a three-dimensional array
Note: The grid ends can optionally wrap around (like a torus)
Each cell
–Can hold either a fish or a shark, but not both
–Has twenty-six adjacent cubic cells
Rules for fish
1. Fish move randomly to empty adjacent cells
2. If there are no empty adjacent cells, fish stay put
3. Fish of breeding age leave a baby fish in the vacated cell
4. Fish die after some fixed (or random) number of generations
Rules for sharks
1. Sharks move randomly to adjacent cells that don’t contain sharks
2. If they enter a cell containing a fish, they eat the fish
3. Sharks stay put when all adjacent cells contain sharks
4. Sharks of breeding age leave a baby shark in the vacated cell
5. Sharks die (starvation) if they don’t eat a fish for some fixed (or random) number of generations