Parallel Algorithms and Computing Selected topics Parallel Architecture
2 References
– An Introduction to Parallel Algorithms, Joseph JaJa
– Introduction to Parallel Computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
– Parallel Sorting Algorithms, Selim G. Akl
3 Models Three models:
– Graphs (DAG: Directed Acyclic Graph)
– Parallel Random Access Machine (PRAM)
– Network
4 Graphs Not studied here
Parallel Architecture Parallel random access machine
6 Parallel Random Access Machine Flynn classifies parallel machines based on:
– Data flow
– Instruction flow
Each flow can be:
– Single
– Multiple
7 Parallel Random Access Machine Flynn classification:

                        Data flow
  Instruction flow      Single    Multiple
  Single                SISD      SIMD
  Multiple              MISD      MIMD
8 Parallel Random Access Machine Extends the traditional RAM (Random Access Machine) model:
– multiple processors
– a global shared memory
– an interconnection network between the global memory and the processors
[Figure: processors P1, P2, …, Pp connected to a global (shared) memory]
9 Parallel Random Access Machine Characteristics
Processors Pi (0 ≤ i ≤ p−1)
– each with a local memory
– i is a unique identifier for processor Pi
A global shared memory
– it can be accessed by all processors
10 Parallel Random Access Machine Types of operation:
Synchronous
– processors work in lock step: at each step, a processor is either active or idle
– suited for SIMD and MIMD architectures
Asynchronous
– processors have local clocks
– the processors need to be synchronized explicitly
– suited for MIMD architectures
11 Parallel Random Access Machine Example of synchronous operation
Algorithm: processor i (i = 0 … 3)
Input: A, B; i is the processor id
Output: C
Begin
  if (B == 0) then C = A
  else C = A / B
End
12 Parallel Random Access Machine Step 1 (the processors with B = 0 execute C = A; the others are idle):

  Initial:  P3: A=7 B=0 C=0 | P2: A=2 B=1 C=0 | P1: A=4 B=2 C=0 | P0: A=5 B=0 C=0
  Step 1:   P3: C=7 (active, B=0) | P2: idle (B≠0) | P1: idle (B≠0) | P0: C=5 (active, B=0)
13 Parallel Random Access Machine Step 2 (the processors with B ≠ 0 execute C = A/B; the others are idle):

  Step 2:   P3: C=7 (idle, B=0) | P2: C=2 (active, B≠0) | P1: C=2 (active, B≠0) | P0: C=5 (idle, B=0)
14 Parallel Random Access Machine Read/write conflicts
EREW: Exclusive-Read, Exclusive-Write
– no concurrent operation (read or write) on a variable
CREW: Concurrent-Read, Exclusive-Write
– concurrent reads allowed on the same variable
– exclusive writes only
15 Parallel Random Access Machine
ERCW: Exclusive-Read, Concurrent-Write
CRCW: Concurrent-Read, Concurrent-Write
16 Parallel Random Access Machine Concurrent write on a variable X
– Common CRCW: succeeds only if all processors write the same value to X
– SUM CRCW: the sum of all written values is stored in X
– Random CRCW: one processor is chosen at random and its value is written to X
– Priority CRCW: the processor with the highest priority writes to X
17 Parallel Random Access Machine Example: concurrent write on X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X)
– Common CRCW or ERCW: failure
– SUM CRCW: X is the sum (180) of the written values
– Random CRCW: the final value of X is one of { 50, 60, 70 }
18 Parallel Random Access Machine Basic input/output operations
On global memory
– global read (X, x)
– global write (Y, y)
On local memory
– read (X, x)
– write (Y, y)
19 Example 1: Matrix-Vector product Matrix-vector product Y = AX
– A is an n×n matrix
– X = [x1, x2, …, xn] is a vector of n elements
– p processors (p ≤ n) and r = n/p
Each processor is assigned a block of r = n/p elements.
20 Example 1: Matrix-Vector product
[Figure: Y = AX written out, with Y = [Y1 … Yn], A = [Aij] an n×n matrix and X = [X1 … Xn]; A, X and Y reside in global memory, shared by processors P1, P2, …, Pp]
21 Example 1: Matrix-Vector product
Partition A into p blocks A1, …, Ap of r = n/p consecutive rows each.
Compute the p partial products in parallel: processor Pi computes the partial product Yi = Ai · X.
[Figure: A split row-wise into blocks A1 (rows 1..r) through Ap (rows (p−1)r+1 .. pr)]
22 Example 1: Matrix-Vector product Processor Pi computes Yi = Ai · X
[Figure: P1 multiplies rows 1..r of A by X to get Y1..Yr; P2 multiplies rows r+1..2r to get Yr+1..Y2r; …; Pp multiplies rows (p−1)r+1..pr to get Y(p−1)r+1..Ypr]
23 Example 1: Matrix-Vector product The solution requires:
– p concurrent reads of the vector X
– each processor Pi makes an exclusive read of the block Ai = A[((i−1)r + 1) : ir, 1:n]
– each processor Pi makes an exclusive write on the block Yi = Y[((i−1)r + 1) : ir]
Required architecture: CREW PRAM
24 Example 1: Matrix-Vector product Algorithm: processor Pi (i = 1, 2, …, p)
Input: A: n×n matrix in global memory; X: a vector in global memory
Output: Y = AX (Y is a vector in global memory)
Local variables: i: Pi's processor id; p: number of processors; n: dimension of A and X
Begin
  1. global read (X, z)
  2. global read (A((i−1)r + 1 : ir, 1:n), B)
  3. compute w = Bz
  4. global write (w, Y((i−1)r + 1 : ir))
End
25 Example 1: Matrix-Vector product Analysis (a runnable sketch follows below)
Computation cost
– line 3: O(n²/p) arithmetic operations by Pi (r rows × n operations, with r = n/p)
Communication cost
– line 1: O(n) numbers transferred from global to local memory by Pi
– line 2: O(n²/p) numbers transferred from global to local memory by Pi
– line 4: O(n/p) numbers transferred from local to global memory by Pi
Overall: the algorithm runs in O(n²/p) time.
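The CREW pattern above can be tried out on a shared-memory machine. The following minimal C sketch (an illustration added in this rewrite, with sample data, not part of the original slides) simulates the p PRAM processors with OpenMP threads: each thread makes an exclusive read of its row block, a concurrent read of X, and an exclusive write of its Y block.

    #include <stdio.h>
    #include <omp.h>

    #define N 8   /* matrix dimension (sample value) */
    #define P 4   /* number of simulated processors; r = N/P rows each */

    int main(void) {
        double A[N][N], x[N], y[N];
        for (int i = 0; i < N; i++) {          /* sample data */
            x[i] = 1.0;
            for (int j = 0; j < N; j++) A[i][j] = i + j;
        }
        int r = N / P;
        #pragma omp parallel num_threads(P)
        {
            int i = omp_get_thread_num();      /* processor id, 0..P-1 */
            /* exclusive read of block Ai, concurrent read of x,
               exclusive write of block Yi: the CREW pattern */
            for (int row = i * r; row < (i + 1) * r; row++) {
                double w = 0.0;
                for (int j = 0; j < N; j++) w += A[row][j] * x[j];
                y[row] = w;
            }
        }
        for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }

Compiled with cc -fopenmp, each thread performs the O(n²/p) work of line 3 on its own block.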
26 Example 1: Matrix-Vector product Another way to partition the matrix is vertically. A and X are split into blocks
– A1, A2, …, Ap (blocks of r columns)
– X1, X2, …, Xp (blocks of r elements)
Solution in two phases:
– compute the partial products Z1 = A1X1, …, Zp = ApXp
– synchronize the processors
– add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp
27 Example 1: Matrix-Vector product
[Figure: processor P1 multiplies the first r columns of A by X1..Xr; …; processor Pp multiplies the last r columns by X(p−1)r+1..Xpr; the processors synchronize before the partial results are added]
28 Example 1: Matrix-Vector product Algorithm: processor Pi (i = 1, 2, …, p)
Input: A: n×n matrix in global memory; X: a vector in global memory
Output: Y = AX (Y: vector in global memory)
Local variables: i: Pi's processor id; p: number of processors; n: dimension of A and X
Begin
  1. global read (X((i−1)r + 1 : ir), z)
  2. global read (A(1:n, (i−1)r + 1 : ir), B)
  3. compute w = Bz
  4. synchronize the processors Pi (i = 1, 2, …, p)
  5. global write (w, Y((i−1)r + 1 : ir))
End
29 Example 1: Matrix-Vector product Analysis
Work out the details (a sketch of the two phases follows below).
Overall: the algorithm runs in O(n²/p) time.
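As a complement to "work out the details", here is a hedged C/OpenMP sketch (again an added illustration, with sample data) of the two-phase column partitioning: each thread computes a full-length partial vector Zi, and after the synchronization point thread i sums the i-th blocks of all partials, which makes the final write to Y exclusive again.

    #include <stdio.h>
    #include <omp.h>

    #define N 8
    #define P 4

    int main(void) {
        double A[N][N], x[N], y[N];
        double Z[P][N];                        /* partial products Zi */
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            for (int j = 0; j < N; j++) A[i][j] = i + j;
        }
        int r = N / P;
        #pragma omp parallel num_threads(P)
        {
            int i = omp_get_thread_num();
            /* phase 1: Zi = Ai * Xi, Ai being a block of r columns */
            for (int row = 0; row < N; row++) {
                double s = 0.0;
                for (int j = i * r; j < (i + 1) * r; j++)
                    s += A[row][j] * x[j];
                Z[i][row] = s;
            }
            #pragma omp barrier                /* step 4: synchronize */
            /* phase 2: Pi adds the i-th blocks of all partial vectors */
            for (int row = i * r; row < (i + 1) * r; row++) {
                double s = 0.0;
                for (int k = 0; k < P; k++) s += Z[k][row];
                y[row] = s;
            }
        }
        for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }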
30 Example 2: Sum on the PRAM model
An array A of n = 2^k numbers.
A PRAM machine with n processors.
Compute S = A(1) + A(2) + … + A(n).
Construct a binary tree to compute the sum in log2 n time.
31 Example 2: Sum on the PRAM model
[Figure: binary reduction tree over processors P1..P8; at level 1, Pi sets B(i) = A(i); at each level h > 1, Pi computes B(i) = B(2i−1) + B(2i); the final sum is S = B(1) on P1]
32 Example 2: Sum on the PRAM model Algorithm: processor Pi (i = 1, …, n)
Input: A: array of n = 2^k elements in global memory
Output: S, where S = A(1) + A(2) + … + A(n)
Local variables: n; i: processor Pi's identity
Begin
  1. global read (A(i), a)
  2. global write (a, B(i))
  3. for h = 1 to log n do
       if (i ≤ n / 2^h) then begin
         global read (B(2i−1), x)
         global read (B(2i), y)
         z = x + y
         global write (z, B(i))
       end
  4. if i = 1 then global write (z, S)
End
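A faithful way to run this on real hardware is to emulate the lock-step rounds with barriers. The following C/OpenMP sketch (an added illustration with sample data, not from the slides) separates each round's reads from its writes, as the synchronous PRAM model assumes:

    #include <stdio.h>
    #include <omp.h>

    #define N 8   /* n = 2^k numbers, k = 3, one per processor */

    int main(void) {
        double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double B[N], S = 0.0;
        int k = 3;                              /* log2(N) */
        #pragma omp parallel num_threads(N)
        {
            int i = omp_get_thread_num() + 1;   /* ids 1..n as on the slide */
            B[i - 1] = A[i - 1];
            for (int h = 1; h <= k; h++) {
                #pragma omp barrier             /* start of a lock-step round */
                int active = (i <= (N >> h));
                double x = 0.0, y = 0.0;
                if (active) { x = B[2*i - 2]; y = B[2*i - 1]; }
                #pragma omp barrier             /* all reads before any write */
                if (active) B[i - 1] = x + y;
            }
            if (i == 1) S = B[0];
        }
        printf("S = %g\n", S);                  /* prints 36 */
        return 0;
    }

The two barriers per round mimic the PRAM's synchronous step: every B(2i−1) and B(2i) is read before any B(i) is overwritten.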
Parallel Architecture Network model
34 Network model Characteristics
– the communication structure is important
– the network can be seen as a graph G = (N, E): node i ∈ N is a processor; edge (i, j) ∈ E represents a two-way communication link between processors i and j
– basic communication operations: send (X, Pi), receive (X, Pi)
– no global shared memory
35 Network model
Linear array: n processors P1 – P2 – P3 – … – Pn
Ring: the same n processors with an additional link from Pn back to P1
36 Network model
Grid: n² processors Pij arranged in n rows and n columns
Torus: the same n² processors, with the rows and columns closed into n rings
37 Network model
Hypercube: n = 2^k processors, each connected to the k processors whose labels differ from its own in exactly one bit
[Figure: a 3-dimensional hypercube on processors P0 … P7]
39 Example 1: Matrix-Vector product on a linear array
A = [aij] an n×n matrix, i, j ∈ [1, n]
X = [xi], i ∈ [1, n]
Compute Y = AX, i.e. yi = Σ_{j=1..n} aij · xj.
40 Example 1: Matrix-Vector product on a linear array Systolic array algorithm for n = 4
[Figure: processors P1..P4 in a line; x1, x2, x3, x4 enter P1 from the left one per step, and row i of A (ai1 first) streams into Pi from the top, delayed by i−1 steps]
41 Example 1: Matrix-Vector product on a linear array
At step j, xj enters processor P1.
At each step, processor Pi receives (when possible) a value from its left and a value from the top and updates its partial sum: Yi = Yi + aij · xj, j = 1, 2, 3, …
Values xj and aij reach processor Pi at the same time, at step (i + j − 1):
– (x1, a11) reach P1 at step 1 = (1 + 1 − 1)
– (x3, a13) reach P1 at step 3 = (1 + 3 − 1)
In general, Yi is computed at step N + i − 1.
42 Example 1: Matrix-Vector product on a linear array
The computation is complete when x4 and a44 reach processor P4, at step N + N − 1 = 2N − 1.
Conclusion: the algorithm requires (2N − 1) steps; at each step, every active processor performs one addition and one multiplication.
Complexity of the algorithm: O(N).
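The schedule is easy to check in software. This short C program (an added sketch with sample data) simulates the 2N−1 systolic steps, activating Pi exactly when the pair (xj, aij) with t = i + j − 1 arrives:

    #include <stdio.h>

    #define N 4

    int main(void) {
        double a[N][N] = {{1, 2, 3, 4}, {5, 6, 7, 8},
                          {9, 10, 11, 12}, {13, 14, 15, 16}};
        double x[N] = {1, 1, 1, 1};
        double y[N] = {0};
        /* at step t, Pi (1-indexed) holds x_j and a_ij exactly when
           t = i + j - 1, i.e. j = t - i + 1 falls in [1, N] */
        for (int t = 1; t <= 2 * N - 1; t++)
            for (int i = 1; i <= N; i++) {
                int j = t - i + 1;
                if (j >= 1 && j <= N)
                    y[i-1] += a[i-1][j-1] * x[j-1];  /* one mult-add per active Pi */
            }
        for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }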
43 Example 1: Matrix-Vector product on a linear array
[Figure: steps 1–7 of the systolic computation for n = 4, showing the xj values moving right through P1..P4; after step 2N−1, each Pi holds yi = Σ_{j=1..4} aij · xj]
44 Example 1: Matrix-Vector product on a linear array Systolic array algorithm: time-cost analysis
– step 1: 1 add, 1 mult; active: P1; idle: P2, P3, P4
– step 2: 2 adds, 2 mults; active: P1, P2; idle: P3, P4
– step 3: 3 adds, 3 mults; active: P1, P2, P3; idle: P4
– step 4: 4 adds, 4 mults; active: P1, P2, P3, P4; idle: none
– step 5: 3 adds, 3 mults; active: P2, P3, P4; idle: P1
– step 6: 2 adds, 2 mults; active: P3, P4; idle: P1, P2
– step 7: 1 add, 1 mult; active: P4; idle: P1, P2, P3
45 Example 1: Matrix-Vector product on a linear array Systolic array algorithm: time-cost analysis
[Figure: the step-by-step diagram of slide 43 annotated with the per-step operation counts and active/idle processors listed on slide 44]
46 Example 2: Matrix multiplication on a 2-D n×n mesh
Given two n×n matrices A = [aij] and B = [bij], i, j ∈ [1, n], compute the product C = AB, where C is given by cij = Σ_{k=1..n} aik · bkj.
47 Example 2: Matrix multiplication on a 2-D n×n mesh
At step i, row i of A (starting with ai1) is entered from the top into column i (into processor P1i).
At step j, column j of B (starting with b1j) is entered from the left into row j (into processor Pj1).
The values aik and bkj reach processor Pji at step (i + j + k − 2). At the end of this step, aik is sent down and bkj is sent right.
48 Example 2: Matrix multiplication on a 2-D n×n mesh Example: systolic mesh algorithm for n = 4, step 1
[Figure: the 4×4 mesh of processors (1,1)..(4,4); the staggered rows of A (a14 a13 a12 a11, etc.) enter from the top and the staggered columns of B (b41 b31 b21 b11, etc.) enter from the left]
49 Example 2: Matrix multiplication on a 2-D n×n mesh Example: systolic mesh algorithm for n = 4, step 5
[Figure: the positions of the A and B values in flight through the mesh at step 5]
50 Example 2: Matrix multiplication on a 2-D n×n mesh Analysis
To determine the number of steps needed to complete the multiplication, we must find the step at which the terms ann and bnn reach processor Pnn:
– values aik and bkj reach processor Pji at step i + j + k − 2
– substituting n for i, j, k yields: n + n + n − 2 = 3n − 2
Complexity of the solution: O(n).
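As with the linear array, the 3N−2 step schedule can be verified by a sequential simulation. This C sketch (added here, with sample data) applies the meeting rule t = i + j + k − 2 directly:

    #include <stdio.h>

    #define N 4

    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {        /* sample data */
                A[i][j] = i + 1;
                B[i][j] = j + 1;
            }
        /* a_ik and b_kj meet in one mesh processor at step
           t = i + j + k - 2 (1-indexed); simulate the 3N-2 steps */
        for (int t = 1; t <= 3 * N - 2; t++)
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++) {
                    int k = t - i - j + 2;
                    if (k >= 1 && k <= N)
                        C[i-1][j-1] += A[i-1][k-1] * B[k-1][j-1];
                }
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%6g ", C[i][j]);
            printf("\n");
        }
        return 0;
    }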
51 Example 3: Matrix-Vector multiplication on a ring N = 4
[Figure: processors P1..P4 on a ring; the xi circulate among the processors while the staggered, rotated rows of A are fed in from above]
This algorithm requires N steps for a matrix-vector multiplication.
52 Example 3: Matrix-Vector multiplication on a ring Goal
Pipeline the data into the processors so that n product terms are computed and added to the partial sums at each step.
Distribution of X over the processors: Xj (1 ≤ j ≤ N) is assigned to processor PN−j+1.
This algorithm requires N steps for a matrix-vector multiplication.
53 Example 3: Matrix-Vector multiplication on a ring Another way to distribute the Xi over the processors and to input the matrix A:
– row i of the matrix A is shifted (rotated) down i (mod n) times and entered into processor Pi
– Xi is assigned to processor Pi; at each step the Xi are shifted right
54 Example 3: Matrix-Vector multiplication on a ring N = 4
[Figure: the diagonal distribution; Pi starts with xi and the rotated row i of A, so that a11, a22, a33, a44 are aligned on the diagonal, and the x values shift right around the ring at each step]
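The diagonal scheme can be checked with a small simulation. In this C sketch (an added illustration with sample data), xr[i] models the x value currently held by Pi, and at step t processor Pi uses the matrix entry that the rotation has aligned with it:

    #include <stdio.h>

    #define N 4

    int main(void) {
        double A[N][N], y[N] = {0};
        double xr[N];                       /* x value currently held by Pi */
        for (int i = 0; i < N; i++) {
            xr[i] = 1.0;                    /* x_i starts on P_i */
            for (int j = 0; j < N; j++) A[i][j] = i * N + j + 1;
        }
        /* diagonal scheme: at step t, Pi (0-indexed) holds x_j with
           j = (i - t) mod N and multiplies it by the matching a_ij */
        for (int t = 0; t < N; t++) {
            for (int i = 0; i < N; i++) {
                int j = (i - t + N) % N;
                y[i] += A[i][j] * xr[i];    /* xr[i] holds x_j right now */
            }
            /* shift the x values one position to the right on the ring */
            double last = xr[N - 1];
            for (int i = N - 1; i > 0; i--) xr[i] = xr[i - 1];
            xr[0] = last;
        }
        for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }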
55 Example 4: Sum of n = 2^d numbers on a d-hypercube
Assignment: xi is stored on processor Pi.
Computation of S = Σ_i xi.
[Figure: a 3-hypercube with X0..X7 on processors 0..7]
56 Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 1: the processors of sub-cube 1XX send their data to the corresponding processors of sub-cube 0XX.
[Figure: after step 1, processors 0..3 hold X0+X4, X1+X5, X2+X6, X3+X7]
57 Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 2: the processors of sub-cube 01X send their data to the corresponding processors of sub-cube 00X.
[Figure: after step 2, processor 0 holds X0+X4+X2+X6 and processor 1 holds X1+X5+X3+X7; active and idle processors are marked]
58 Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 3: processor 001 sends its data to processor 000.
S = (X0+X4+X2+X6) + (X1+X5+X3+X7): the sum of the n numbers is stored on node P0.
59 Example 4: Sum of n = 2^d numbers on a d-hypercube Algorithm: processor Pi
Input: 1) an array X of n = 2^d numbers, X[i] assigned to processor Pi; 2) the processor identity id
Output: S = X[0] + … + X[n−1], stored on processor P0
Begin
  My_id = id
  S = X[i]
  for j = 0 to (d − 1) do begin
    Partner = My_id XOR 2^j
    if (My_id AND 2^j) = 0 then begin
      receive (Si, Partner)
      S = S + Si
    end
    if (My_id AND 2^j) ≠ 0 then begin
      send (S, Partner)
      exit
    end
  end
End
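This exchange pattern maps directly onto message passing. Below is a minimal MPI sketch (an added illustration assuming p = 2^d ranks and the sample data X[i] = i + 1): a processor whose j-th bit is set sends its partial sum and drops out, the other receives and accumulates, and the total lands on rank 0.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int id, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        double s = id + 1.0;              /* X[i] = i + 1 (sample data) */
        for (int j = 1; j < p; j <<= 1) { /* j = 2^0, 2^1, ..., 2^(d-1) */
            int partner = id ^ j;
            if (id & j) {                 /* bit j set: send and drop out */
                MPI_Send(&s, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
                break;
            } else {                      /* bit j clear: receive and add */
                double si;
                MPI_Recv(&si, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                s += si;
            }
        }
        if (id == 0) printf("S = %g\n", s);  /* the sum lands on P0 */
        MPI_Finalize();
        return 0;
    }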
Parallel Architecture Message broadcast on network model (ring, torus, hypercube)
61 Basic communication Message Broadcast One-to-all broadcast – Ring – Mesh (Torus) – Hypercube All-to-all broadcast – Ring – Mesh (Torus) – Hypercube
62 Communication cost Message from Pi → Pj over l links:
Communication cost = ts + tw · m · l
– ts: message preparation (startup) time
– m: message length
– tw: per-unit (byte) transfer time
– l: number of links traversed by the message
63 Communication cost Communication time bounds:
– ring: (ts + tw · m) · p/2
– mesh: (ts + tw · m) · √p / 2
– hypercube: (ts + tw · m) · log2 p
The bound depends on the maximum number of links traversed by the message.
64 One-to-All broadcast Simple solution: P0 sends the message M0 to processors P1, P2, …, Pp−1 successively:
P0 → P1 (M0); P0 → P2 (M0); P0 → P3 (M0); …; P0 → Pp−1 (M0)
Communication cost = (ts + tw · m0) · Σ i = (ts + tw · m0) · p(p+1)/2
65 One-to-all broadcast A processor sends a message M to all other processors.
[Figure: one-to-all broadcast distributes M from one processor to processors 0..p−1; the dual operation (accumulation) gathers values back to a single processor]
66 All-to-all broadcast All-to-all broadcast: p simultaneous one-to-all broadcasts, one initiated by each processor Pi.
[Figure: each processor i starts with its own message Xi; after the all-to-all broadcast every processor holds X0, X1, …, Xp−1; the dual operation is an accumulation to every node]
Parallel Architecture Examples of message broadcasts
68 Example 1: One-to-All broadcast on a ring Each processor forwards the message to the next processor. Initially, the message is sent in the two directions.
[Figure: on an 8-ring, the message from P0 travels clockwise and counter-clockwise, reaching the farthest processor in 4 parallel steps]
Communication cost: T = (ts + tw · m) · p/2, where p is the number of processors.
69 Example 2: One-to-All broadcast on a torus Two phases
Phase 1: one-to-all broadcast on the first row.
[Figure: on a 4×4 torus, processor 0 broadcasts to processors 1, 2 and 3 along the first row (steps 1 and 2)]
70 Example 2: One-to-All broadcast on a torus
Phase 2: parallel one-to-all broadcasts in the columns.
[Figure: each first-row processor broadcasts along its column (steps 3 and 4)]
71 Example 2: One-to-All broadcast on a torus Communication cost:
– broadcast on the row: Tcom = (ts + tw · m) · √p / 2
– broadcast on the columns: Tcom = (ts + tw · m) · √p / 2
Total: T = 2 · (ts + tw · m) · √p / 2, where p is the number of processors.
72 Example 3: One-to-All broadcast on a hypercube Requires d steps; each step doubles the number of active processors.
[Figure: on a 3-hypercube, the message reaches 2, then 4, then 8 processors in steps 1, 2, 3]
Communication cost: T = (ts + tw · m) · log p, where p is the number of processors.
73 Example 3: One-to-All broadcast on a hypercube
Broadcast an element X stored on one processor (say P0) to the other processors of the hypercube. The broadcast can be performed in O(log n) steps, as follows.
[Figure: initial distribution of the data, X on processor 0 of a 3-hypercube]
74 Example 3: One-to-All broadcast on a hypercube
Step 1: processor P0 sends X to processor P1.
Step 2: processors P0 and P1 send X to P2 and P3, respectively.
Step 3: processors P0, P1, P2 and P3 send X to P4, P5, P6 and P7.
[Figure: active and idle processors at each of the three steps]
75 Example 3: One-to-All broadcast on a hypercube Algorithm for a broadcast of X on a d-hypercube
Input: 1) X assigned to processor P0; 2) the processor identity id
Output: every processor Pi contains X
Processor Pi
Begin
  if i = 0 then B = X
  My_id = id
  for j = 0 to (d − 1) do
    if My_id < 2^(j+1) then begin
      Partner = My_id XOR 2^j
      if My_id > Partner then receive (B, Partner)
      if My_id < Partner then send (B, Partner)
    end
End
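The same doubling pattern is easy to express in MPI. This hedged sketch (added here; the payload value 42 is arbitrary sample data) lets only the already-informed sub-cube act at each dimension:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int id, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);  /* p = 2^d assumed */
        double B = 0.0;
        if (id == 0) B = 42.0;              /* X lives on P0 initially */
        for (int j = 1; j < p; j <<= 1) {   /* dimensions 2^0 .. 2^(d-1) */
            if (id < 2 * j) {               /* only the informed sub-cube acts */
                int partner = id ^ j;
                if (id < partner)
                    MPI_Send(&B, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
                else
                    MPI_Recv(&B, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            }
        }
        printf("P%d holds %g\n", id, B);
        MPI_Finalize();
        return 0;
    }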
76-79 All-to-all broadcast on a ring
[Figures: steps 1, 2, 3 and 7 on an 8-processor ring. At step 1, each processor Pi sends its own message (i) to its successor. At each later step, every processor forwards the message it received at the previous step. After step k, Pi has accumulated the messages of its k nearest predecessors; after p − 1 = 7 steps, every processor holds all 8 messages (0, 1, …, 7).]
80 All-to-all broadcast on a 2-dimensional torus Two phases
– Phase 1: all-to-all broadcast on each row; afterwards each processor Pi holds a message of size Mi = √p · m
– Phase 2: all-to-all broadcast in the columns
81-82 All-to-all broadcast on a 2-dimensional torus
[Figures: on a 3×3 torus, phase 1 performs an all-to-all broadcast on each row, so the processors of row 0 end up with (0,1,2), row 1 with (3,4,5) and row 2 with (6,7,8); phase 2 then performs an all-to-all broadcast on each column with these concatenated messages]
83 All-to-all broadcast Communication cost = cost of phase 1 + cost of phase 2 = (√p − 1)(ts + tw · m) + (√p − 1)(ts + tw · √p · m)
Parallel Algorithms and Computing Selected topics Sorting in Parallel
85 Performance measures
– Speedup
– Efficiency
– Work-Time
– Amdahl's law
86 Speedup Speedup S(p) = T(1) / T(p), where p is the number of processors in the parallel solution, T(1) is the sequential execution time and T(p) the parallel execution time with p processors.
– S(p) < 1: poor performance (the parallel solution is worse)
– 1 ≤ S(p) ≤ p: normal speedup
– S(p) > p: hyper-speedup (not very frequent)
Ideal: S(p) = p.
87 Speedup Is hyper-speedup normal? It usually points to:
– a poor, non-optimal sequential algorithm as the baseline
– storage space (memory capacity) being a factor
88 Efficiency Efficiency E(p) = S(p) / p.
– 0 < E(p) ≤ 1: normal
– E(p) > 1: hyper-speedup
Points of view:
– speedup: the user's point of view
– efficiency: the manager's point of view
– speedup and efficiency together: the designer's point of view
89 Amdahl's law A program consists of two parts, a sequential part and a parallel part:
T(1) = sequential part + parallel part.
90 Amdahl's law Bound on speedup: only the parallel part benefits from the p processors, so
S(p) = T(1) / (sequential part + parallel part / p).
91 Amdahl's law Bound on speedup
With the sequential fraction fs and the parallel fraction fp (fs + fp = 1), the speedup can be rewritten as
S(p) = 1 / (fs + fp / p).
92 Amdahl's law Bound on speedup
As p grows, fp / p vanishes, so S(p) ≤ 1 / fs.
93 Amdahl's law Bound on speedup
For example, if fs is equal to 1%, S(p) is less than 100.
[Figure: S(p) as a function of p, rising from 1 and saturating at the asymptote 1/fs]
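As a worked check of the bound (added in this rewrite; the slides only state the result), with fp = 1 − fs:

    S(p) = \frac{1}{f_s + \frac{1 - f_s}{p}},
    \qquad
    \lim_{p \to \infty} S(p) = \frac{1}{f_s}

With fs = 0.01 this gives S(100) = 1 / (0.01 + 0.99/100) ≈ 50.3, and no number of processors can push S(p) past 1/0.01 = 100, matching the slide's example.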
94 Amdahl's law The above bound on the speedup does not take into account communication and synchronization overheads.
95 Parallel sorting Types of sorting algorithms Properties – Processor ordering determines order of the final result – Where input and output are stored – Basic compare-exchange operation
96 Issues in sorting algorithms Internal/external sort:
Internal sort
– data fits in processor memory (RAM)
– performance based on comparisons and basic operations
– complexity O(n log n)
External sort
– data in memory and on disk
– performance based on basic operations and on the overlap of computing and I/O
97 Issues in sorting algorithms
Comparison-based: executes comparisons and permutations.
Non-comparison-based: ordering based on properties of the keys.
98 Issues in sorting algorithms Internal sort (shared memory: PRAM)
– the processors share the data
– minimize memory access conflicts
– each processor sorts part of the data in memory
99 Issues in sorting algorithms Internal sort (distributed memory)
– each processor is assigned a block of N/P elements
– each processor locally sorts its assigned block (using any internal sorting algorithm)
– input: distributed among the processors; output: stored on the processors
– final order: the processor order defines the final ordering of the list (P1 < P2 < P3 < …)
100 Issues in sorting algorithms Internal sort (distributed memory)
Example: the final order is defined by the Gray-code labelling of the processors.
[Figure: a 3-hypercube whose nodes (0)..(7) are traversed in Gray-code order]
101 Issues in sorting algorithms Building block: the compare-exchange operation
Sequential: one CPU holds (ai, aj) in RAM; if they are out of order (ai < aj ?), swap ai ↔ aj.
Parallel: ai is on processor P(i) and aj on P(i+1); the two processors exchange their values, then
– P(i) keeps ai = min(ai, aj): Exchange-Compare-Min(P(i+1))
– P(i+1) keeps aj = max(ai, aj): Exchange-Compare-Max(P(i))
102 Issues in sorting algorithms Compare-exchange with N/p elements per processor
– P(i) and P(i+1) each hold a sorted block of N/p elements
– the blocks are exchanged and merged
– P(i) keeps the N/p smallest elements: Exchange-Compare-Min(P(i+1))
– P(i+1) keeps the N/p largest elements: Exchange-Compare-Max(P(i))
[Figure: two sorted blocks being exchanged, merged, and split into min and max halves]
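This block-wise operation is often called compare-split. Here is a small C sketch (an added illustration; the two five-element blocks are hypothetical sample data standing in for the garbled values on the slide):

    #include <stdio.h>
    #include <stdlib.h>

    /* compare-split between two neighbors holding k sorted elements each:
       merge the two blocks, the left keeps the k smallest, the right the
       k largest */
    static void compare_split(double *left, double *right, int k) {
        double *merged = malloc(2 * k * sizeof(double));
        int i = 0, j = 0;
        for (int m = 0; m < 2 * k; m++)          /* standard two-way merge */
            if (j >= k || (i < k && left[i] <= right[j]))
                merged[m] = left[i++];
            else
                merged[m] = right[j++];
        for (int m = 0; m < k; m++) {
            left[m] = merged[m];                 /* k smallest stay left */
            right[m] = merged[k + m];            /* k largest go right */
        }
        free(merged);
    }

    int main(void) {
        double a[] = {1, 6, 8, 11, 13};          /* P(i)'s sorted block */
        double b[] = {2, 7, 9, 10, 12};          /* P(i+1)'s sorted block */
        compare_split(a, b, 5);
        for (int m = 0; m < 5; m++) printf("%g ", a[m]);
        printf("| ");
        for (int m = 0; m < 5; m++) printf("%g ", b[m]);
        printf("\n");                            /* 1 2 6 7 8 | 9 10 11 12 13 */
        return 0;
    }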
103 Example: Odd-Even Merge Sort
– start from an unsorted list of n elements
– divide the list into two lists of n/2 elements, A0 A1 … AM−1 and B0 B1 … BM−1
– sort each sub-list
– divide each sorted sub-list into its even-index and odd-index elements: A0 A2 … AM−2 / A1 A3 … AM−1 and B0 B2 … BM−2 / B1 B3 … BM−1
– merge-sort the odd-even sub-lists into E0 E1 … EM−1 and O0 O1 … OM−1
– interleave the two lists as E0 O0 E1 O1 … EM−1 OM−1 and exchange the out-of-position elements
104 Where is the parallelism?
[Figure: the recursion tree of the merge: one merge of N elements at the top, two merges of N/2 below it, four merges of N/4, and so on; the E and O merges at each level are independent and can run in parallel]
105 Example: Odd-Even Merge Sort The key to the merge-sort algorithm is the method used to merge the sorted sub-lists.
Consider 2 sorted lists of m = 2^k elements: A = a0, a1, …, am−1 and B = b0, b1, …, bm−1.
Even(A) = a0, a2, …, am−2; Odd(A) = a1, a3, …, am−1
Even(B) = b0, b2, …, bm−2; Odd(B) = b1, b3, …, bm−1
106 Example: Odd-Even Merge Sort Create 2 merged lists:
– merge Even(A) and Odd(B) into E = E0 E1 … Em−1
– merge Even(B) and Odd(A) into O = O0 O1 … Om−1
Interleave E and O to create a list L′ = E0 O0 E1 O1 … Em−1 Om−1.
Exchange the out-of-order elements of L′ to obtain the sorted list L.
107 Example: Odd-Even Merge Sort A = 2, 3, 4, 8 and B = 1, 5, 6, 7
Even(A) = 2, 4 and Odd(A) = 3, 8
Even(B) = 1, 6 and Odd(B) = 5, 7
E = 2, 4, 5, 7 and O = 1, 3, 6, 8
L′ = 2 ↔ 1, 4 ↔ 3, 5, 6, 7, 8
L = 1, 2, 3, 4, 5, 6, 7, 8
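The whole merge step fits in a few lines of C. This sketch (added here) reproduces the slide's example exactly: the two cross merges, the interleaving into L′, and the final compare-exchange of adjacent pairs.

    #include <stdio.h>

    #define M 4

    /* merge two sorted k-element lists u and v into out */
    static void merge(const double *u, const double *v, int k, double *out) {
        int i = 0, j = 0;
        for (int m = 0; m < 2 * k; m++)
            if (j >= k || (i < k && u[i] <= v[j])) out[m] = u[i++];
            else out[m] = v[j++];
    }

    int main(void) {
        double A[M] = {2, 3, 4, 8}, B[M] = {1, 5, 6, 7};  /* slide's example */
        double evA[M/2], odA[M/2], evB[M/2], odB[M/2], E[M], O[M], L[2*M];
        for (int i = 0; i < M / 2; i++) {
            evA[i] = A[2*i]; odA[i] = A[2*i + 1];
            evB[i] = B[2*i]; odB[i] = B[2*i + 1];
        }
        merge(evA, odB, M / 2, E);          /* E = 2, 4, 5, 7 */
        merge(evB, odA, M / 2, O);          /* O = 1, 3, 6, 8 */
        for (int i = 0; i < M; i++) {       /* L' = E0 O0 E1 O1 ... */
            L[2*i] = E[i];
            L[2*i + 1] = O[i];
        }
        for (int i = 0; i + 1 < 2 * M; i += 2)   /* fix out-of-position pairs */
            if (L[i] > L[i+1]) { double t = L[i]; L[i] = L[i+1]; L[i+1] = t; }
        for (int i = 0; i < 2 * M; i++) printf("%g ", L[i]);
        printf("\n");                       /* 1 2 3 4 5 6 7 8 */
        return 0;
    }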
Parallel sorting Quicksort
109 Review: Quicksort Recursively:
– choose a pivot
– divide the list in two using the pivot
– sort the left and right sub-lists
Recall: sequential quicksort runs in O(n log n) time on average.
110 Review: Quicksort Sequential quicksort

    void Quicksort(double *A, int q, int r) {
        int s, i;
        double pivot;
        if (q < r) {
            /* partition A[q..r] around the pivot */
            pivot = A[q];
            s = q;
            for (i = q + 1; i <= r; i++) {
                if (A[i] <= pivot) {
                    s = s + 1;
                    exchange(A, s, i);
                }
            }
            exchange(A, q, s);
            /* recursive calls to sort the two sub-lists */
            Quicksort(A, q, s - 1);
            Quicksort(A, s + 1, r);
        }
    }
111 Review: Quicksort Create a binary tree of processors, one new processor for each recursive call of Quicksort.
Easy to implement, but can be inefficient performance-wise.
112 Review: Quicksort Shared-memory implementation (with fork() primitives)

    double A[nmax];

    void quicksort(int q, int r) {
        int s, i, n;
        double pivot;
        if (q < r) {
            /* partition */
            pivot = A[q];
            s = q;
            for (i = q + 1; i <= r; i++) {
                if (A[i] <= pivot) {
                    s = s + 1;
                    exchange(A, s, i);
                }
            }
            exchange(A, q, s);
            /* create a new processor for one of the two sub-lists */
            n = fork();
            if (n == 0)
                exec("quicksort", q, s - 1);
            else
                quicksort(s + 1, r);
        }
    }
113 Quicksort on a d-hypercube d steps, all processors active in each step. Each processor is assigned N/p elements (p = 2^d).
Steps of the solution:
– initially (step 0), one pivot is chosen and broadcast to all processors
– each processor partitions its elements into two sub-lists: one inferior (less than the current pivot) and one superior (greater or equal)
– exchange the inferior and superior sub-lists along the current dimension (dimension d at step 0), creating two sub-cubes: one for the inferior lists and one for the superior lists
– each processor merges its (inferior and superior) lists
– repeat within each sub-cube
114 Quicksort on a d-hypercube Example on a 3-hypercube
Step 0, pivot P0: division along dimension 3. Two blocks of elements are created:
– one block of elements less than pivot P0 (sub-cube 0XX)
– one block of elements greater than or equal to P0 (sub-cube 1XX)
115 Quicksort on a d-hypercube Example on a 3-hypercube
Step 1, pivots P10 and P11: division along dimension 2 divides each sub-cube in two smaller sub-cubes (00X holds the elements < P10 and 01X the elements > P10; 10X holds the elements < P11 and 11X the elements > P11).
116 Quicksort on a d-hypercube Example on a 3-hypercube
Step 2, pivots P20, P21, P22, P23: division along dimension 1; within each 2-node sub-cube, the lower-labelled processor keeps the elements below its pivot and the higher one the elements above it. The final order is defined by the label ordering of the processors.
117 Quicksort on a d-hypercube Example on a 3-hypercube
Final step: each processor sorts its final list, using for example a sequential quicksort. ({} denotes an empty list on a processor.)
118 Quicksort on a d-hypercube Data exchange at the initial step, between sub-cubes P0XX and P1XX:
– broadcast the pivot P0
– each processor splits its list into the part < P0 and the part ≥ P0
– sub-cubes P0XX and P1XX exchange the inferior/superior sub-lists
Question: sort the sub-lists at the end of each step?
119 Quicksort on a d-hypercube Algorithm: processor k (k = 0, …, p−1)

    Hypercube-Quicksort(B, d) {
        /* B contains the elements assigned to processor k */
        /* d is the hypercube dimension */
        int i;
        double x, B1[], B2[], T[];
        my-id = k;                        /* processor id */
        for (i = d-1 downto 0) {
            x = pivot(my-id, i);
            partition(B, x, B1, B2);      /* B1 inferior, B2 superior sub-list */
            if (my-id AND 2^i == 0) {     /* i-th bit is 0 */
                send(B2, my neighbor in dimension i);
                receive(T, my neighbor in dimension i);
                B = B1 ∪ T;
            } else {
                send(B1, my neighbor in dimension i);
                receive(T, my neighbor in dimension i);
                B = B2 ∪ T;
            }
        }
        Sequential-Quicksort(B);
    }  /* end Hypercube-Quicksort */
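For concreteness, here is a hedged MPI sketch of the same loop (added in this rewrite; the pivot choice is deliberately naive, block sizes vary so counts are exchanged before the data, and random sample values stand in for the real input):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int id, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);      /* p = 2^d assumed */
        int n = 4;                              /* N/p elements per processor */
        double *B = malloc(n * sizeof(double));
        srand(id + 1);
        for (int i = 0; i < n; i++) B[i] = rand() % 100;   /* sample data */

        for (int bit = p >> 1; bit >= 1; bit >>= 1) {
            /* the lowest-ranked processor of each sub-cube picks a pivot
               (naively, its first element) and broadcasts it */
            MPI_Comm sub;
            MPI_Comm_split(MPI_COMM_WORLD, id / (2 * bit), id, &sub);
            double pivot = (n > 0) ? B[0] : 0.0;
            MPI_Bcast(&pivot, 1, MPI_DOUBLE, 0, sub);
            MPI_Comm_free(&sub);
            /* partition B in place: B[0..lo-1] < pivot, B[lo..n-1] >= pivot */
            int lo = 0;
            for (int i = 0; i < n; i++)
                if (B[i] < pivot) {
                    double t = B[i]; B[i] = B[lo]; B[lo] = t; lo++;
                }
            /* exchange: the lower half keeps the inferior lists,
               the upper half the superior ones */
            int partner = id ^ bit;
            int send_cnt = (id & bit) ? lo : n - lo;
            double *send_buf = (id & bit) ? B : B + lo;
            double *keep_buf = (id & bit) ? B + lo : B;
            int keep_cnt = n - send_cnt, recv_cnt;
            MPI_Sendrecv(&send_cnt, 1, MPI_INT, partner, 0,
                         &recv_cnt, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double *nb = malloc((keep_cnt + recv_cnt) * sizeof(double));
            for (int i = 0; i < keep_cnt; i++) nb[i] = keep_buf[i];
            MPI_Sendrecv(send_buf, send_cnt, MPI_DOUBLE, partner, 1,
                         nb + keep_cnt, recv_cnt, MPI_DOUBLE, partner, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            free(B); B = nb; n = keep_cnt + recv_cnt;
        }
        qsort(B, n, sizeof(double), cmp);       /* final local sort */
        printf("P%d: %d elements\n", id, n);
        free(B);
        MPI_Finalize();
        return 0;
    }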
120 Quicksort on a d-hypercube Choice of the pivot: more important for performance than in the sequential case. It has a great impact on:
– the load balance between processors
– the performance of the algorithm (performance degrades quickly)
121 Quicksort on a d-hypercube Worst case: at step 0, the largest element of the list is selected as the pivot, Pivot0 = x = max{ Xi }.
[Figure: all elements end up in one half of the cube; the processors of that half are overloaded while the processors of the other half are idle]
122 Quicksort on a d-hypercube Choice of pivot: ideal case. In parallel do:
– sort the initial list assigned to each processor
– choose the median element of one of the processors of the cube
– assuming a uniform distribution of the elements, the median of that processor's list approximates the median of the whole list
[Figure: the median element of the list assigned to processor Pi stands in for the median element of the whole list]
123 Quicksort on a d-hypercube Steps of the algorithm (repeated for each dimension):
– local sort of the assigned list
– selection of the pivot by one processor
– broadcast of the pivot in the (d−i)-sub-hypercube
– division based on the pivot (binary search)
– exchange of sub-lists between neighbors
– merge of the sorted sub-lists
Time complexity: O(d) such rounds, with d = log p.
124 Parallel Quicksort on a PRAM Parallel QUICKSORT algorithm
/* The solution constructs a binary tree of processors which is traversed in in-order to yield the sorted list */
Variables shared by all processors:
– root: root of the global binary tree
– A[n]: an array of n elements (i = 1, 2, …, n)
– Leftchild[i]: the root of the left sub-tree of processor i (i = 1, 2, …)
– Rightchild[i]: the root of the right sub-tree of processor i (i = 1, 2, …)
125 Parallel Quicksort on a PRAM
Process /* do in parallel for each processor i */
begin
  root := i;
  Parent := i;
  Leftchild[i] := Rightchild[i] := n + 1;
end
repeat for each processor i ≠ root do
begin
  if (A[i] < A[Parent] …) then
  begin
    Leftchild[Parent] := i
    if i = Leftchild[Parent] then exit
    else Parent := Leftchild[Parent]
  end
  else
  begin
    Rightchild[Parent] := i
    if i = Rightchild[Parent] then exit
    else Parent := Rightchild[Parent]
  end
end repeat
end process
126 Parallel Quicksort on a PRAM Example
A = [33, 21, 13, 54, 82, 33, 40, 72] on processors 1..8.
Step 0: root = processor 4 (value 54); Leftchild[i] = Rightchild[i] = 9 (= n + 1) for all i.
[Figure: processor [4] {54} alone at the root of the binary tree]
127 Parallel Quicksort on a PRAM Example
Step 1: processor 1 (value 33) wins the competition for the left sub-tree of 4, and processor 5 (value 82) wins at the right.
[Figure: tree with root [4] {54}, left child [1] {33} and right child [5] {82}; processors 2, 3, 6, 7 and 8 continue the competition one level down; Leftchild[4] = 1, Rightchild[4] = 5]
128 Parallel Quicksort on a PRAM Example
[Figure: the Leftchild and Rightchild arrays after step 1 (Leftchild[4] = 1, Rightchild[4] = 5); the remaining processors continue the competition inside the two sub-trees]
129 Parallel Quicksort on a PRAM Example
[Figure: next step of the competition: [2] {21} becomes the left child of [1] {33} (processor 3 continues below it), [6] {33} its right child (processor 7 continues below it), and [8] {72} becomes the left child of [5] {82}]