Synchronous Computations

Definitions

- A synchronous application is one where all processes must reach certain points before execution continues.
- Local synchronization is a requirement that a subset of processes (usually neighbors) reach a synchronization point before execution continues.
- A barrier is the basic message-passing mechanism for synchronizing processes.
- Deadlock occurs when groups of processes wait permanently for messages that can never be satisfied because the sending processes are themselves permanently waiting for messages.

Barrier Illustration

In C: MPI_Barrier(MPI_COMM_WORLD);

Processor code reaches the barrier point at different times; processes that arrive early sit waiting while others are still executing. This leads to idle time and load imbalance.
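
A minimal runnable sketch of this barrier in MPI C; the unequal "work" simulated with sleep() is purely illustrative:

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char *argv[]) {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Simulate unequal work: higher ranks compute longer */
      sleep(rank);
      printf("Process %d reached the barrier\n", rank);

      /* No process continues until all nprocs processes arrive */
      MPI_Barrier(MPI_COMM_WORLD);

      printf("Process %d released from the barrier\n", rank);
      MPI_Finalize();
      return 0;
  }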

Counter (linear) Barrier: Implementation

Barriers consist of two phases: an entry phase and a departure (release) phase.

Master processor, O(P) steps:
  for (i = 0; i < P; i++)   // entry phase
      receive a null message from any processor
  for (i = 0; i < P; i++)   // release phase
      send a null message to release the slaves

Slave processors:
  send a null message to enter the barrier
  receive a null message for the barrier release

Note: The separate release phase prevents a fast processor from entering the next barrier before the previous barrier has been fully released.
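
A minimal MPI sketch of the counter barrier described above, assuming rank 0 acts as the master; the helper name counter_barrier() is illustrative, not from the slides:

  #include <mpi.h>

  /* Linear (counter) barrier: rank 0 is the master, all others are slaves.
     Entry phase: master collects one empty message from every other process.
     Release phase: master sends one empty message back to every other process. */
  void counter_barrier(MPI_Comm comm) {
      int rank, nprocs, i, dummy = 0;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      if (rank == 0) {
          for (i = 1; i < nprocs; i++)            /* entry phase */
              MPI_Recv(&dummy, 0, MPI_INT, MPI_ANY_SOURCE, 0, comm,
                       MPI_STATUS_IGNORE);
          for (i = 1; i < nprocs; i++)            /* release phase */
              MPI_Send(&dummy, 0, MPI_INT, i, 1, comm);
      } else {
          MPI_Send(&dummy, 0, MPI_INT, 0, 0, comm); /* enter barrier */
          MPI_Recv(&dummy, 0, MPI_INT, 0, 1, comm,  /* wait for release */
                   MPI_STATUS_IGNORE);
      }
  }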

Tree (non-linear) Barrier

The entry phase combines arrivals pairwise up a tree over the processors (P0 through P7 in the figure); the release phase uses the inverse tree construction. Entry and departure each require O(lg P) steps.

Note: The implementation logic is similar to divide and conquer.

Butterfly Barrier

Stage 1: P0–P1, P2–P3, P4–P5, P6–P7 synchronize in pairs.

Advantages: requires only a single send()/receive() pair per processor at each stage (all pairs exchange in parallel), and completes in only O(lg P) stages.

Note: At stage s, processor p synchronizes with processor (p + 2^(s-1)) mod P.
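
A sketch of this pattern in MPI, using the partner formula from the slide ((p + 2^(s-1)) mod P); for process counts that are not powers of two this pairing is usually called a dissemination barrier. The function name butterfly_barrier() is illustrative:

  #include <mpi.h>

  /* At each stage, every process exchanges an empty message with a partner
     at distance 2^(s-1), as described on the slide. MPI_Sendrecv keeps each
     exchange deadlock-free. */
  void butterfly_barrier(MPI_Comm comm) {
      int rank, nprocs, dist, dummy = 0;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      for (dist = 1; dist < nprocs; dist *= 2) {
          int to   = (rank + dist) % nprocs;            /* partner ahead  */
          int from = (rank - dist + nprocs) % nprocs;   /* partner behind */
          MPI_Sendrecv(&dummy, 0, MPI_INT, to, 0,
                       &dummy, 0, MPI_INT, from, 0,
                       comm, MPI_STATUS_IGNORE);
      }
  }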

Local Synchronization

Synchronize with neighbors before proceeding.

Even-numbered processors:
  send a null message to processor i-1
  receive a null message from processor i-1
  send a null message to processor i+1
  receive a null message from processor i+1

Odd-numbered processors (the complementary order, so every blocking send is matched by a waiting receive):
  receive a null message from processor i+1
  send a null message to processor i+1
  receive a null message from processor i-1
  send a null message to processor i-1

Notes:
- Local synchronization is an incomplete barrier: processors exit after exchanging messages with their neighbors only.
- Reminder: Deadlock can occur with an incorrect message-passing order. MPI_Sendrecv() and MPI_Sendrecv_replace() are deadlock-free.
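
A minimal sketch of neighbor synchronization using MPI_Sendrecv, which sidesteps the ordering issue entirely; boundary processes use MPI_PROC_NULL, and the function name local_sync() is illustrative:

  #include <mpi.h>

  /* Exchange empty messages with the left and right neighbors. Processes at
     the ends use MPI_PROC_NULL, which turns the corresponding send/receive
     into a no-op, so no special-casing is needed. */
  void local_sync(MPI_Comm comm) {
      int rank, nprocs, dummy = 0;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

      MPI_Sendrecv(&dummy, 0, MPI_INT, left,  0,   /* send toward left   */
                   &dummy, 0, MPI_INT, right, 0,   /* receive from right */
                   comm, MPI_STATUS_IGNORE);
      MPI_Sendrecv(&dummy, 0, MPI_INT, right, 0,   /* send toward right  */
                   &dummy, 0, MPI_INT, left,  0,   /* receive from left  */
                   comm, MPI_STATUS_IGNORE);
  }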

Local Synchronization Example: Heat Distribution Problem

Goal: determine the final temperature at each point of an n x n grid.

Initial boundary condition: the initial temperatures are known at designated points (e.g., the outer rim or an internal heat sink).

A process cannot proceed to the next iteration until local synchronization completes:
  DO
      average each grid point with its neighbors
  UNTIL temperature changes are small enough

New value = (∑ neighbors) / 4

Sequential Heat Distribution Code

  initialize rows 0..n and columns 0..n of g and h
  iteration = 0
  DO
      FOR (i = 1; i < n; i++)
          FOR (j = 1; j < n; j++)
              IF (iteration % 2)
                  h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
              ELSE
                  g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4
      iteration++
  UNTIL max(|g[i][j] - h[i][j]|) < tolerance or iteration > MAX

Notes: Even iterations update the g array from h; odd iterations update the h array from g (double buffering). Recall: odd/even sort alternates phases in the same way.
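
A compilable C version of this loop; the grid size N, tolerance, and boundary initialization (a hot top edge) are illustrative choices, not from the slides:

  #include <stdio.h>
  #include <math.h>

  #define N    8        /* interior points run over 1..N-1, as in the pseudocode */
  #define TOL  0.01
  #define MAX  1000

  int main(void) {
      double g[N + 1][N + 1] = {{0}}, h[N + 1][N + 1] = {{0}};
      int i, j, iteration = 0;
      double maxdiff;

      /* Boundary condition: hold the top edge of both buffers at 100 degrees */
      for (j = 0; j <= N; j++) { g[0][j] = 100.0; h[0][j] = 100.0; }

      do {
          maxdiff = 0.0;
          for (i = 1; i < N; i++)
              for (j = 1; j < N; j++) {
                  if (iteration % 2)
                      h[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4.0;
                  else
                      g[i][j] = (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]) / 4.0;
                  if (fabs(g[i][j] - h[i][j]) > maxdiff)
                      maxdiff = fabs(g[i][j] - h[i][j]);
              }
          iteration++;
      } while (maxdiff > TOL && iteration < MAX);

      printf("Stopped after %d iterations, max change %.4f\n", iteration, maxdiff);
      return 0;
  }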

Block or Strip Partitioning

Assign portions of the grid to the processors in the topology.

- Block partitioning (allocate square sub-grids, e.g., p0..p15 in a 4 x 4 arrangement): eight messages exchanged at each iteration (a send and a receive across each of the four edges); data exchanged per message is n/sqrt(P) points.
- Strip partitioning (allocate column strips, e.g., p0..p7): four messages exchanged at each iteration (a send and a receive across each of the two strip boundaries); data exchanged per message is n points.

Question: which is better?

Strip versus Block Partitioning

Characteristics:
- Strip partitioning: generally more data, fewer messages.
- Block partitioning: generally less data, more messages.
- Choice: low latency favors block; high latency favors strip.

Example: grid is 64 x 64, P = 16
- Strip partitioning: strips are 4 x 64; 4 x 64 cells transferred per iteration per processor.
- Block partitioning: blocks are 16 x 16; 8 x 16 cells transferred per iteration per processor.

Example: grid is 64 x 64, P = 4
- Strip partitioning: strips are 16 x 64; 4 x 64 cells transferred per iteration per processor.
- Block partitioning: blocks are 32 x 32; 8 x 32 cells transferred per iteration per processor.

Parallel Implementation

Each processor Pi holds a block of the grid and exchanges its edge cells with the processors to the north, south, east, and west.

Modifications to the sequential algorithm:
- Declare "ghost" rows and columns to hold adjacent data (e.g., declare a 10 x 10 array for an 8 x 8 block).
- Exchange data with the neighboring processors.
- Perform the calculation for the local grid cells.

Heat Distribution Partitioning: Main Logic

  for each iteration
      for each point, compute its new temperature
      SendRcv(row-1, col, point)
      SendRcv(row+1, col, point)
      SendRcv(row, col-1, point)
      SendRcv(row, col+1, point)

  SendRcv(row, col):   // only if (row, col) is not local
      if myrank is even
          Send(point, prow, col)
          Recv(point, prow, col)
      else
          Recv(point, prow, col)
          Send(point, prow, col)
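
A minimal sketch of the ghost-row exchange for a strip (row-block) decomposition using MPI_Sendrecv; the names exchange_ghost_rows, local, nrows_local, and ncols are illustrative assumptions:

  #include <mpi.h>

  /* local has nrows_local + 2 rows of ncols values: row 0 and row
     nrows_local + 1 are ghost rows mirroring the neighbors' edge rows. */
  void exchange_ghost_rows(double *local, int nrows_local, int ncols,
                           MPI_Comm comm) {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);

      int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

      /* Send my top real row up; receive the lower neighbor's top row
         into my bottom ghost row. */
      MPI_Sendrecv(&local[1 * ncols],                 ncols, MPI_DOUBLE, up,   0,
                   &local[(nrows_local + 1) * ncols], ncols, MPI_DOUBLE, down, 0,
                   comm, MPI_STATUS_IGNORE);

      /* Send my bottom real row down; receive the upper neighbor's bottom
         row into my top ghost row. */
      MPI_Sendrecv(&local[nrows_local * ncols],       ncols, MPI_DOUBLE, down, 1,
                   &local[0],                         ncols, MPI_DOUBLE, up,   1,
                   comm, MPI_STATUS_IGNORE);
  }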

Full Synchronization: Data Parallel Computations

Simultaneously apply the same operation to different data. This approach models many numerical computations; such programs are easy to write and scale well to large data sets.

Sequential code:
  for (i = 0; i < n; i++) a[i] = someFunction(a[i]);

Shared-memory code:
  forall (i = 0; i < n; i++) { bodyOfInstructions }
  // Note: the forall semantics imply a natural barrier at the end of the loop

Distributed-memory code:
  for each local a[i]: someFunction(a[i]);
  barrier();

Data Parallel Example: A[] += k

Each processor handles one element: p0 computes A[0] += k, p1 computes A[1] += k, …, pn-1 computes A[n-1] += k. All processors execute the instruction in "lock step":

  forall (i = 0; i < n; i++) a[i] += k;

Note: Multicomputers partition the data into coarse-grain blocks rather than assigning one element per processor.
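
The forall construct above is pseudocode; a shared-memory equivalent in C could use OpenMP (a minimal sketch, not from the slides):

  #include <stdio.h>
  #include <omp.h>

  #define N 16

  int main(void) {
      int a[N], k = 5, i;
      for (i = 0; i < N; i++) a[i] = i;

      /* Each iteration is independent; the implicit barrier at the end of
         the parallel for plays the role of the forall's natural barrier. */
      #pragma omp parallel for
      for (i = 0; i < N; i++)
          a[i] += k;

      for (i = 0; i < N; i++) printf("%d ", a[i]);
      printf("\n");
      return 0;
  }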

Prefix-Based Operations

Definition: given a set of n values a_1, a_2, …, a_n and an associative operation, the operation is applied to each value and all of its predecessors.

Prefix sum: {2, 7, 9, 4} -> {2, 9, 18, 22}

Application: radix sort.

Solution by doubling: an algorithm in which the distance over which the operation is applied increases in powers of 2 (1, 2, 4, 8, …; each iteration doubles).

Prefix Sum by Doubling: Overview

1. Each data[i] is added into data[i+1].
2. Each data[i] is added into data[i+2].
3. Each data[i] is added into data[i+4].
4. Each data[i] is added into data[i+8].
... and so on, doubling the distance each step.

Note: Skip the operation whenever i + increment exceeds the last array index.

Prefix Sum Illustration

Prefix Sum Example

Sequential time: O(n). Parallel time: O((N/P) lg(N/P)).

Note: * means the sum is not added at the next step.

Prefix Sum Parallel Implementation

Sequential code:
  for (j = 0; j < lg(n); j++)
      for (i = 0; i < n - 2^j; i++)
          a[i] += a[i + 2^j];

Parallel shared-memory fine-grain logic:
  for (j = 0; j < lg(n); j++)
      forall (i = 0; i < n - 2^j; i++)
          a[i + 2^j] += a[i];

Parallel distributed coarse-grain logic (each processor holds a running sum; a processor may both send and receive in the same step):
  for (j = 1; j <= lg(n); j++)
      if (myrank + 2^(j-1) < P)
          send(this processor's running sum, myrank + 2^(j-1))
      if (myrank >= 2^(j-1))
          receive(sum, myrank - 2^(j-1))
          add sum to this processor's running sum
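
In MPI the whole coarse-grain pattern is available directly as a collective; a minimal sketch using MPI_Scan, with each process contributing a single illustrative value:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, value, prefix;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      value = rank + 1;   /* e.g., process i contributes i + 1 */

      /* Inclusive prefix sum across ranks: process i receives
         value_0 + value_1 + ... + value_i. */
      MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

      printf("Process %d: prefix sum = %d\n", rank, prefix);
      MPI_Finalize();
      return 0;
  }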

Synchronous Iteration

Processes synchronize at each iteration step. Example: simulation of natural processes.

Shared-memory code:
  for (j = 0; j < n; j++)
      forall (i = 0; i < N; i++)
          algorithm(i);

Distributed-memory code:
  for (j = 0; j < n; j++) {
      algorithm(myRank);
      barrier();
  }

Example: n equations, n unknowns

  a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}
  ∙∙∙
  a_{k,0} x_0 + a_{k,1} x_1 + … + a_{k,n-1} x_{n-1} = b_k
  ∙∙∙
  a_{1,0} x_0 + a_{1,1} x_1 + … + a_{1,n-1} x_{n-1} = b_1
  a_{0,0} x_0 + a_{0,1} x_1 + … + a_{0,n-1} x_{n-1} = b_0

Or we can solve the k-th equation for x_k:

  x_k = (b_k - a_{k,0} x_0 - … - a_{k,k-1} x_{k-1} - a_{k,k+1} x_{k+1} - … - a_{k,n-1} x_{n-1}) / a_{k,k}
      = (b_k - ∑_{j≠k} a_{k,j} x_j) / a_{k,k}

Jacobi Iteration Pseudocode

A numerical algorithm to solve n equations with n unknowns.

  xnew_i = initial guess
  DO
      x_i = xnew_i
      xnew_i = calculated next guess
  UNTIL ∑_i |xnew_i - x_i| < tolerance

Jacobi iteration always converges if the matrix is diagonally dominant:
  |a_{k,k}| > ∑_{i≠k} |a_{i,k}| (the diagonal value dominates the column sum)

Figure: the error in x_i shrinks from iteration i to iteration i+1. Traditional direct solutions are O(N^3), or O(N^2) for special cases.

Parallel Jacobi Code

Each processor computes its xnew_i; an Allgather() then collects the xnew_0 … xnew_{n-1} values into every processor's x array.

  xnew_i = b_i
  DO for each i
      x_i = xnew_i
      sum = -a_{i,i} * x_i
      FOR (j = 0; j < n; j++)
          sum += a_{i,j} * x_j
      xnew_i = (b_i - sum) / a_{i,i}
      allgather(xnew_i)
      barrier()
  UNTIL iterations > MAX or ∑_i |xnew_i - x_i| < tolerance
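
A compilable sketch of one Jacobi step with one unknown per process and MPI_Allgather, assuming the coefficient matrix A and right-hand side b are already replicated on every process; the function name jacobi_step and the array layout are illustrative:

  #include <mpi.h>
  #include <math.h>

  /* One Jacobi iteration per call: each process owns unknown x[rank].
     n must equal the number of processes in comm; A is row-major n x n. */
  double jacobi_step(const double *A, const double *b, double *x,
                     int n, MPI_Comm comm) {
      int rank, j;
      MPI_Comm_rank(comm, &rank);

      double sum = 0.0;
      for (j = 0; j < n; j++)
          if (j != rank)
              sum += A[rank * n + j] * x[j];

      double xnew = (b[rank] - sum) / A[rank * n + rank];
      double change = fabs(xnew - x[rank]);

      /* Every process contributes its new value; afterwards every process
         holds the complete updated x vector. */
      MPI_Allgather(&xnew, 1, MPI_DOUBLE, x, 1, MPI_DOUBLE, comm);

      /* Global convergence measure: sum of local changes over all processes. */
      double total_change;
      MPI_Allreduce(&change, &total_change, 1, MPI_DOUBLE, MPI_SUM, comm);
      return total_change;
  }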

Additional Jacobi Notes

If P (the processor count) < n, allocate blocks of variables to processors:
- Block allocation: allocate consecutive x_i to each processor.
- Cyclic allocation: allocate x_0, x_P, … to p0; allocate x_1, x_{P+1}, … to p1; and so on.

Question: which allocation scheme is better? Answer: it depends on the cache and on whether the arrays are stored in row order or column order.

Figure: Jacobi performance, computation and communication time versus processor count (4, 8, 12, 16, 20, 24 processors).

Cellular Automata

Definition:
- The system has a finite grid of cells.
- Each cell can assume a finite number of states.
- Cells change state according to a well-defined rule set.
- All cell state changes occur simultaneously.
- The system iterates through a number of generations.

Note: Animations of these systems can lead to interesting insights.

Serious applications: fluid and gas dynamics, biological growth, airplane wing airflow, erosion modeling, groundwater pollution.

Fun applications: the Game of Life, Sharks and Fishes, Foxes and Rabbits, gaming applications.

Conway's Game of Life

The grid (the "world") is a two-dimensional array of cells. Note: the grid ends can optionally wrap around (like a torus).

Each cell:
- Can hold one "organism".
- Has eight neighboring cells: north, northeast, east, southeast, south, southwest, west, northwest.

Rules (run the simulation over many generations):
- An organism dies (loneliness) if fewer than 2 organisms live in neighboring cells.
- An organism survives if 2 or 3 organisms live in neighboring cells.
- An empty cell with exactly 3 living neighbors gives birth to an organism in that cell.
- An organism dies (overpopulation) if 4 or more organisms live in neighboring cells.
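
A minimal C sketch of one generation of these rules on a small non-wrapping grid; the grid size and helper name next_generation are illustrative:

  #define ROWS 16
  #define COLS 16

  /* Compute the next generation from the current one. Cells outside the
     grid are treated as empty (no wrap-around in this sketch). */
  void next_generation(const int cur[ROWS][COLS], int next[ROWS][COLS]) {
      for (int i = 0; i < ROWS; i++)
          for (int j = 0; j < COLS; j++) {
              int neighbors = 0;
              for (int di = -1; di <= 1; di++)
                  for (int dj = -1; dj <= 1; dj++) {
                      if (di == 0 && dj == 0) continue;
                      int ni = i + di, nj = j + dj;
                      if (ni >= 0 && ni < ROWS && nj >= 0 && nj < COLS)
                          neighbors += cur[ni][nj];
                  }
              if (cur[i][j])
                  next[i][j] = (neighbors == 2 || neighbors == 3); /* survival */
              else
                  next[i][j] = (neighbors == 3);                   /* birth   */
          }
  }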

Sharks and Fishes

The grid (the ocean) is modeled by a three-dimensional array. Note: the grid ends can optionally wrap around (like a torus).

Each cell:
- Can hold either a fish or a shark, but not both.
- Has twenty-six adjacent cubic cells.

Rules for fish:
- Fish move randomly to empty adjacent cells.
- If there are no empty adjacent cells, fish stay put.
- Fish of breeding age leave a baby fish in the vacated cell.
- Fish die after some fixed (or random) number of generations.

Rules for sharks:
- Sharks move randomly to adjacent cells that don't contain sharks.
- If a shark enters a cell containing a fish, it eats the fish.
- Sharks stay put when all adjacent cells contain sharks.
- Sharks of breeding age leave a baby shark in the vacated cell.
- Sharks die (starvation) if they don't eat a fish for some fixed (or random) number of generations.