Parallel Processing & Parallel Algorithm May 8, 2003 B4 Yuuki Horita
Chapter 3 Principles of Parallel Programming
- Data Dependency
- Processor Communication
- Mapping
- Granularity
Data Dependency
- data flow dependency
- data anti-dependency
- data output dependency
- data input dependency
- data control dependency
data flow dependency & data anti-dependency
- data flow dependency ( S2 reads A, which S1 writes )
    S1: A = B + C
    S2: D = A + E
- data anti-dependency ( S2 writes B, which S1 reads )
    S1: A = B + C
    S2: B = D + E
data output dependency & data input dependency
- data output dependency ( S1 and S2 both write A )
    S1: A = B + C
    S2: A = D + E
- data input dependency ( S1 and S2 both read B )
    S1: A = B + C
    S2: D = B + E
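The four data dependencies can be checked mechanically from what each statement reads and writes. The following is a minimal sketch, assuming each statement is represented as a hypothetical (written variable, read variables) pair and that the first statement executes before the second; the function name `classify` is an illustration and does not come from the slides.

```python
def classify(first, second):
    """Return the dependency kinds from `first` to `second`.

    Each statement is a (written_variable, read_variables) pair, and
    `first` is assumed to execute before `second`.
    """
    w1, r1 = first[0], set(first[1])
    w2, r2 = second[0], set(second[1])
    kinds = []
    if w1 in r2:
        kinds.append("flow")      # second reads what first wrote
    if w2 in r1:
        kinds.append("anti")      # second overwrites what first read
    if w1 == w2:
        kinds.append("output")    # both write the same variable
    if r1 & r2:
        kinds.append("input")     # both read the same variable
    return kinds


s1 = ("A", ["B", "C"])                      # S1: A = B + C
print(classify(s1, ("D", ["A", "E"])))      # S2: D = A + E  -> ['flow']
print(classify(s1, ("B", ["D", "E"])))      # S2: B = D + E  -> ['anti']
print(classify(s1, ("A", ["D", "E"])))      # S2: A = D + E  -> ['output']
print(classify(s1, ("D", ["B", "E"])))      # S2: D = B + E  -> ['input']
```

The four calls correspond to the four statement pairs shown on the two slides above.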
data control dependency
    S1: A = B - C
        if ( A > 0 )
    S2:     then D = 1;
    S3:     else D = 2;
        endif
S1 ⇒ S2, S3
Dependency Graph
G = ( V, E )
    V (Vertices) : statements
    E (Edges)    : dependencies
ex)
    S1: A = B + C
    S2: B = A * 3
    S3: A = 2 * C
    S4: P = B ≥ 0
        if ( P is True )
    S5:     then D = 1
    S6:     else D = 2
        endif

    S1 → S2     : flow dep., anti-dep.
    S1 → S3     : output dep., input dep.
    S2 → S3     : anti-dep.
    S2 → S4     : flow dep.
    S4 → S5, S6 : control dep.
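The data-dependency edges of the example can be derived from the read and write sets of S1 to S4. This is a rough sketch under two assumptions that are not stated on the slide: statements are given as hypothetical (label, written variable, read variables) triples, and a dependency between two statements is dropped when a statement between them rewrites the variable involved. Control dependencies (S4 → S5, S6) are outside its scope.

```python
from collections import defaultdict


def dependency_graph(statements):
    """Build data-dependency edges for straight-line code.

    `statements` is a list of (label, written_variable, read_variables).
    An edge (i, j) is reported only if no statement strictly between
    i and j rewrites the variable that causes the dependency.
    """
    edges = defaultdict(set)
    for i, (li, wi, ri) in enumerate(statements):
        for j in range(i + 1, len(statements)):
            lj, wj, rj = statements[j]
            # Variables rewritten by the statements between i and j.
            killed = {w for _, w, _ in statements[i + 1:j]}
            if wi in rj and wi not in killed:
                edges[(li, lj)].add("flow")
            if wj in ri and wj not in killed:
                edges[(li, lj)].add("anti")
            if wi == wj and wi not in killed:
                edges[(li, lj)].add("output")
            if (set(ri) & set(rj)) - killed:
                edges[(li, lj)].add("input")
    return dict(edges)


program = [
    ("S1", "A", ["B", "C"]),   # S1: A = B + C
    ("S2", "B", ["A"]),        # S2: B = A * 3
    ("S3", "A", ["C"]),        # S3: A = 2 * C
    ("S4", "P", ["B"]),        # S4: P = B >= 0
]
for edge, kinds in sorted(dependency_graph(program).items()):
    print(edge, sorted(kinds))
```

With this input the sketch reports the same four data-dependency edges as the slide: S1 → S2 (flow, anti), S1 → S3 (output, input), S2 → S3 (anti), and S2 → S4 (flow).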
Elimination of the dependencies
- data output dependency
- data anti-dependency
⇒ renaming can remove these forms of dependency
Elimination of the dependencies(2)
Ex)
    S1: A = B + C
    S2: B = D + E
    S3: A = F + G
    S1, S2 : anti-dep.
    S1, S3 : output dep.
⇒ Renaming
    S1: A = B + C
    S2: B' = D + E
    S3: A' = F + G
    No dependency!
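One way to see the effect of renaming is to execute the three statements in every possible order and count how many different final states result. The sketch below assumes arbitrary input values and writes the renamed variables B' and A' as `B2` and `A2`; the helper names `run` and `distinct_outcomes` are illustrative.

```python
from itertools import permutations


def run(statements, order, env):
    """Execute `statements` in the given order on a copy of `env`."""
    env = dict(env)
    for k in order:
        target, left, right = statements[k]
        env[target] = env[left] + env[right]
    return env


# Arbitrary input values, chosen only for illustration.
inputs = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6}

# Each triple (target, left, right) means: target = left + right.
original = [("A", "B", "C"), ("B", "D", "E"), ("A", "F", "G")]
renamed = [("A", "B", "C"), ("B2", "D", "E"), ("A2", "F", "G")]


def distinct_outcomes(statements):
    """Count the different final states over all execution orders."""
    results = set()
    for order in permutations(range(len(statements))):
        final = run(statements, order, inputs)
        results.add(tuple(sorted(final.items())))
    return len(results)


print(distinct_outcomes(original))   # 3: the result depends on the order
print(distinct_outcomes(renamed))    # 1: any order gives the same result
```

Because the renamed statements give the same result in every order, they can be executed in parallel.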
Processor Communication
- Message passing communication
    processors communicate via communication links
- Shared memory communication
    processors communicate via common memory
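Message Passing System
[Figure: processing elements PE1, PE2, PE3, ..., PEm, each paired with its own local memory M1, M2, M3, ..., Mm, connected through an interconnection network]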
Message Passing System(2)
Send and Receive operations
- blocking
- nonblocking
System
- synchronous
    blocking send and blocking receive operations
- asynchronous
    nonblocking send and blocking receive operations
    ( the messages are buffered )
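The asynchronous case (nonblocking send, blocking receive, buffered messages) can be imitated with two threads and a queue. This is only a sketch: Python threads stand in for processing elements, an unbounded `queue.Queue` stands in for the buffered channel, and the `None` end-of-stream marker is a hypothetical convention introduced here.

```python
import queue
import threading


def asynchronous_exchange(messages):
    """Send `messages` over a buffered channel and return what arrives."""
    channel = queue.Queue()          # unbounded buffer between the two PEs
    received = []

    def sender():
        for m in messages:
            channel.put(m)           # nonblocking send: never waits for space
        channel.put(None)            # hypothetical end-of-stream marker

    def receiver():
        while True:
            m = channel.get()        # blocking receive: waits for a message
            if m is None:
                break
            received.append(m)

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(asynchronous_exchange([1, 2, 3]))   # [1, 2, 3]
```

The sender can run arbitrarily far ahead of the receiver, which is why the channel has to buffer messages.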
Shared memory system
[Figure: processing elements PE1, PE2, PE3, ..., PEm connected through an interconnection network to the memory modules M1, M2, M3, ..., Mm of a global memory]
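In a shared memory system the processors exchange data by reading and writing common memory rather than by sending messages. A minimal sketch, assuming Python threads as the processing elements and a dictionary as the global memory; the lock is an addition needed to keep concurrent updates consistent and is not shown in the figure.

```python
import threading


def shared_memory_sum(chunks):
    """Each worker adds its chunk's total into one shared variable."""
    shared = {"total": 0}            # stands in for the global memory
    lock = threading.Lock()          # serializes access to the shared cell

    def worker(chunk):
        partial = sum(chunk)         # local computation
        with lock:
            shared["total"] += partial   # communication through memory

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]


print(shared_memory_sum([[1, 2], [3, 4], [5, 6]]))   # 21
```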
Mapping
… matching parallel algorithms to parallel architectures
mapping to
- asynchronous architecture
- synchronous architecture
- distributed architecture
Mapping to Asynchronous Architecture
Mapping a program to an asynchronous shared-memory computer has the following steps:
1. Allocate each statement in the program to a processor
2. Allocate each variable to a memory
3. Specify the control flow for each processor
The sender may send many messages without the receiver removing them from the channel
⇒ messages need to be buffered
Mapping to Synchronous Architecture
- mapping a program follows the same steps as for the asynchronous architecture
- a common clock is used for synchronization
- each processor executes an instruction at each step ( at each clock tick )
- only one message exists on the channel at any one time
⇒ no buffering is needed
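The lock-step behaviour of a synchronous architecture can be imitated in software with a barrier. The sketch below is an analogy rather than a model of real hardware: it assumes Python threads as processors and uses `threading.Barrier` in place of the common clock, so that no processor begins step k+1 before every processor has finished step k.

```python
import threading


def run_lockstep(num_processors, num_steps):
    """Run workers so that nobody starts step k+1 before all finish step k."""
    clock = threading.Barrier(num_processors)   # stands in for the common clock
    log = []

    def processor(pid):
        for step in range(num_steps):
            log.append((step, pid))   # the "instruction" executed at this tick
            clock.wait()              # wait for the next clock tick

    threads = [threading.Thread(target=processor, args=(p,))
               for p in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


steps = [step for step, _ in run_lockstep(3, 4)]
print(steps)   # step numbers never decrease: all of step 0, then step 1, ...
```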
Mapping to Distributed Architecture
- each local memory is accessible only by its owning processor
- each channel is accessible only by the pair of processors it connects
- mapping a program is the same as in the asynchronous shared-memory architecture, except that each variable is allocated either to a processor or to a channel
Granularity
relates to the ratio of the amount of computation to the amount of communication
- fine : at statement level
- medium : at procedure level
- coarse : at program level
Program Level Parallelism
a program creates a new process by creating a complete copy of itself
- fork() ( UNIX )
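The UNIX fork() call is exposed in Python as `os.fork()`. The following sketch assumes a UNIX-like system (it will not run on Windows); the pipe used to return the child's result and the function name `run_in_child` are illustrative choices, not part of the slide.

```python
import os


def run_in_child(value):
    """Fork a child that doubles `value` and sends it back through a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()                  # the process creates a copy of itself
    if pid == 0:
        # Child process: compute, report the result, and terminate.
        os.close(read_fd)
        os.write(write_fd, str(value * 2).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent process: read the child's result and wait for it to finish.
    os.close(write_fd)
    data = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return int(data.decode())


print(run_in_child(21))   # 42
```

Parent and child continue from the same point after the fork; only the return value of `os.fork()` tells them apart.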
Statement Level Parallelism
・ Par-begin / Par-end block
    Par-begin
        Statement 1
        Statement 2
            :
        Statement n
    Par-end
⇒ Statement 1 ~ Statement n are executed in parallel
Statement Level Parallelism(2)
ex) (a + b) * (c + d) - (e / f)
    Par-begin
        Begin
            Par-begin
                t1 = a + b
                t2 = c + d
            Par-end
            t4 = t1 * t2
        End
        t3 = e / f
    Par-end
    t5 = t4 - t3
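The Par-begin / Par-end evaluation of (a + b) * (c + d) - (e / f) can be approximated with a thread pool. This is a sketch under assumptions: the helper `parbegin` is hypothetical, threads stand in for parallel statement execution, and returning from `parbegin` only after all tasks finish plays the role of Par-end.

```python
from concurrent.futures import ThreadPoolExecutor


def parbegin(*tasks):
    """Run the given zero-argument callables in parallel; return their results.

    Returning only after every task has finished plays the role of Par-end.
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]


def evaluate(a, b, c, d, e, f):
    def product():
        # Inner block: t1 and t2 in parallel, then t4 sequentially after them.
        t1, t2 = parbegin(lambda: a + b, lambda: c + d)
        return t1 * t2

    # Outer block: the product branch runs in parallel with t3 = e / f.
    t4, t3 = parbegin(product, lambda: e / f)
    return t4 - t3                   # t5


print(evaluate(1, 2, 3, 4, 10, 5))   # (1 + 2) * (3 + 4) - 10 / 5 = 19.0
```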
Statement Level Parallelism(3)
・ Fork, Join, Quit
- Fork x
    causes a new process to be created and to start executing at the instruction labeled x
- Join t, y
    t = t - 1
    if ( t = 0 ) then go to y
- Quit
    the process terminates
Statement Level Parallelism(4)
ex) (a + b) * (c + d) - (e / f)
    n = 2
    m = 2
    Fork P2
    Fork P3
P1: t1 = a + b;   Join m, P4; Quit;
P2: t2 = c + d;   Join m, P4; Quit;
P4: t4 = t1 * t2; Join n, P5; Quit;
P3: t3 = e / f;   Join n, P5; Quit;
P5: t5 = t4 - t3
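The Fork / Join / Quit program above can be traced with threads and two shared counters. The sketch makes several assumptions: the class `JoinCounter` is hypothetical, a lock makes the decrement-and-test of Join indivisible, "go to y" is modelled as calling the function for label y, and Quit is modelled as returning from the thread.

```python
import threading


class JoinCounter:
    """Counter used by Join: only the last arriving process continues."""

    def __init__(self, count):
        self.count = count
        self.lock = threading.Lock()

    def join(self):
        """Decrement atomically; return True if the caller should go on."""
        with self.lock:
            self.count -= 1
            return self.count == 0


def evaluate(a, b, c, d, e, f):
    t = {}
    m, n = JoinCounter(2), JoinCounter(2)

    def p5():
        t["t5"] = t["t4"] - t["t3"]

    def p4():
        t["t4"] = t["t1"] * t["t2"]
        if n.join():
            p5()                     # go to P5; otherwise Quit

    def p1():
        t["t1"] = a + b
        if m.join():
            p4()                     # go to P4; otherwise Quit

    def p2():
        t["t2"] = c + d
        if m.join():
            p4()

    def p3():
        t["t3"] = e / f
        if n.join():
            p5()

    forked = [threading.Thread(target=p2), threading.Thread(target=p3)]
    for th in forked:
        th.start()                   # Fork P2, Fork P3
    p1()                             # the original process continues at P1
    for th in forked:
        th.join()                    # thread join, only to return the result
    return t["t5"]


print(evaluate(1, 2, 3, 4, 10, 5))   # 19.0
```

Whichever of P1 and P2 finishes last executes P4, and whichever of P4 and P3 finishes last executes P5, so each of them runs exactly once.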