Published by Baldric Elwin Johns. Modified over 8 years ago.
1
Recursion Unrolling for Divide and Conquer Programs Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology
2
What This Talk Is About: Automatic generation of efficient large base cases for divide and conquer programs.
3
Outline: 1. Motivating Example 2. Computation Structure 3. Transformations 4. Related Work 5. Conclusion
4
1. Motivating Example
5
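Divide and Conquer Matrix Multiply: Divide the matrices into sub-matrices A_0, A_1, A_2, etc., and use the blocked matrix multiply equations:
\[
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix} A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\ A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3 \end{pmatrix},
\qquad A B = R
\]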
6
Divide and Conquer Matrix Multiply: Recursively multiply the sub-matrices. Each of the eight products in the blocked equations (A0 B0, A1 B2, A0 B1, A1 B3, A2 B0, A3 B2, A2 B1, A3 B3) is computed by the same procedure, and A B = R.
7
Divide and Conquer Matrix Multiply: Terminate the recursion with a simple base case. For single-element matrices A = (a0) and B = (b0), the product A B = R is the scalar a0 b0.
8
Divide and Conquer Matrix Multiply: matmul implements R += A B.
    void matmul(int *A, int *B, int *R, int n) {
        if (n == 1) {
            (*R) += (*A) * (*B);
        } else {
            matmul(A, B, R, n/4);
            matmul(A, B+(n/4), R+(n/4), n/4);
            matmul(A+2*(n/4), B, R+2*(n/4), n/4);
            matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
            matmul(A+(n/4), B+2*(n/4), R, n/4);
            matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
            matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
            matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
        }
    }
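The following is a minimal sketch of how matmul could be exercised, not code from the talk. It assumes, as the pointer arithmetic suggests, that n is the number of elements per matrix (a power of 4) and that each matrix is stored quadrant by quadrant, recursively, so that quadrant k starts at offset k*(n/4). The const qualifiers, the local q, and the helper matmul_2x2 are additions for illustration.

```c
#include <assert.h>
#include <string.h>

/* R += A * B for matrices of n elements each (n assumed to be a power of 4),
 * stored quadrant by quadrant so that quadrant k starts at offset k*(n/4).
 * Same call structure as the slide. */
void matmul(const int *A, const int *B, int *R, int n) {
    assert(n >= 1);
    if (n == 1) {
        *R += (*A) * (*B);
    } else {
        int q = n / 4;  /* elements per quadrant */
        matmul(A,       B,       R,       q);  /* R0 += A0*B0 */
        matmul(A,       B + q,   R + q,   q);  /* R1 += A0*B1 */
        matmul(A + 2*q, B,       R + 2*q, q);  /* R2 += A2*B0 */
        matmul(A + 2*q, B + q,   R + 3*q, q);  /* R3 += A2*B1 */
        matmul(A + q,   B + 2*q, R,       q);  /* R0 += A1*B2 */
        matmul(A + q,   B + 3*q, R + q,   q);  /* R1 += A1*B3 */
        matmul(A + 3*q, B + 2*q, R + 2*q, q);  /* R2 += A3*B2 */
        matmul(A + 3*q, B + 3*q, R + 3*q, q);  /* R3 += A3*B3 */
    }
}

/* Hypothetical helper: R = A * B for 2x2 matrices given as
 * {top-left, top-right, bottom-left, bottom-right}. */
void matmul_2x2(const int A[4], const int B[4], int R[4]) {
    memset(R, 0, 4 * sizeof(int));  /* matmul accumulates, so clear R first */
    matmul(A, B, R, 4);
}
```

Because matmul accumulates into R, the caller is responsible for clearing R when a plain product is wanted, which is what the helper does.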
9
Divide and Conquer Matrix Multiply: Divide the matrices into sub-matrices and recursively multiply the sub-matrices (the else branch of matmul, slide 8).
10
Divide and Conquer Matrix Multiply: Identify the sub-matrices with pointers (the offsets A+k*(n/4), B+k*(n/4) and R+k*(n/4) in the recursive calls of matmul, slide 8).
11
Divide and Conquer Matrix Multiply: Use a simple algorithm for the base case (the n == 1 branch of matmul, slide 8: (*R) += (*A) * (*B)).
12
Divide and Conquer Matrix Multiply: Advantage of a small base case: simplicity. The code (matmul, slide 8) is easy to write, maintain, debug, and understand.
13
Divide and Conquer Matrix Multiply: Disadvantage: inefficiency. Large control flow overhead: most of the time in matmul (slide 8) is spent dividing the matrix into sub-matrices.
14
Hand Coded Implementation
    void serialmul(block *As, block *Bs, block *Rs) {
        int i, j;
        DOUBLE *A = (DOUBLE *) As;
        DOUBLE *B = (DOUBLE *) Bs;
        DOUBLE *R = (DOUBLE *) Rs;
        for (j = 0; j < 16; j += 2) {
            DOUBLE *bp = &B[j];
            for (i = 0; i < 16; i += 2) {
                DOUBLE *ap = &A[i * 16];
                DOUBLE *rp = &R[j + i * 16];
                register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
                register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
                s0_0 += ap[0] * bp[0]; s0_1 += ap[0] * bp[1]; s1_0 += ap[16] * bp[0]; s1_1 += ap[16] * bp[1];
                s0_0 += ap[1] * bp[16]; s0_1 += ap[1] * bp[17]; s1_0 += ap[17] * bp[16]; s1_1 += ap[17] * bp[17];
                s0_0 += ap[2] * bp[32]; s0_1 += ap[2] * bp[33]; s1_0 += ap[18] * bp[32]; s1_1 += ap[18] * bp[33];
                s0_0 += ap[3] * bp[48]; s0_1 += ap[3] * bp[49]; s1_0 += ap[19] * bp[48]; s1_1 += ap[19] * bp[49];
                s0_0 += ap[4] * bp[64]; s0_1 += ap[4] * bp[65]; s1_0 += ap[20] * bp[64]; s1_1 += ap[20] * bp[65];
                s0_0 += ap[5] * bp[80]; s0_1 += ap[5] * bp[81]; s1_0 += ap[21] * bp[80]; s1_1 += ap[21] * bp[81];
                s0_0 += ap[6] * bp[96]; s0_1 += ap[6] * bp[97]; s1_0 += ap[22] * bp[96]; s1_1 += ap[22] * bp[97];
                s0_0 += ap[7] * bp[112]; s0_1 += ap[7] * bp[113]; s1_0 += ap[23] * bp[112]; s1_1 += ap[23] * bp[113];
                s0_0 += ap[8] * bp[128]; s0_1 += ap[8] * bp[129]; s1_0 += ap[24] * bp[128]; s1_1 += ap[24] * bp[129];
                s0_0 += ap[9] * bp[144]; s0_1 += ap[9] * bp[145]; s1_0 += ap[25] * bp[144]; s1_1 += ap[25] * bp[145];
                s0_0 += ap[10] * bp[160]; s0_1 += ap[10] * bp[161]; s1_0 += ap[26] * bp[160]; s1_1 += ap[26] * bp[161];
                s0_0 += ap[11] * bp[176]; s0_1 += ap[11] * bp[177]; s1_0 += ap[27] * bp[176]; s1_1 += ap[27] * bp[177];
                s0_0 += ap[12] * bp[192]; s0_1 += ap[12] * bp[193]; s1_0 += ap[28] * bp[192]; s1_1 += ap[28] * bp[193];
                s0_0 += ap[13] * bp[208]; s0_1 += ap[13] * bp[209]; s1_0 += ap[29] * bp[208]; s1_1 += ap[29] * bp[209];
                s0_0 += ap[14] * bp[224]; s0_1 += ap[14] * bp[225]; s1_0 += ap[30] * bp[224]; s1_1 += ap[30] * bp[225];
                s0_0 += ap[15] * bp[240]; s0_1 += ap[15] * bp[241]; s1_0 += ap[31] * bp[240]; s1_1 += ap[31] * bp[241];
                rp[0] = s0_0; rp[1] = s0_1; rp[16] = s1_0; rp[17] = s1_1;
            }
        }
    }
    cilk void matrixmul(long nb, block *A, block *B, block *R) {
        if (nb == 1) {
            serialmul(A, B, R);
        } else if (nb >= 4) {
            /* first half of each sum: R0 += A0*B0, R1 += A0*B1, R2 += A2*B0, R3 += A2*B1 */
            spawn matrixmul(nb/4, A, B, R);
            spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
            spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));
            spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));
            sync;
            /* second half: R0 += A1*B2, R1 += A1*B3, R2 += A3*B2, R3 += A3*B3 */
            spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
            spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
            spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));
            spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
            sync;
        }
    }
15
Goal: The programmer writes simple code with small base cases. The compiler automatically generates efficient code with large base cases.
16
2. Computation Structure
17
Running Example: Array Increment
    void f(char *p, int n) {
        if (n == 1) {
            /* base case: increment one element */
            (*p) += 1;
        } else {
            f(p, n/2);       /* increment first half */
            f(p+n/2, n/2);   /* increment second half */
        }
    }
18
Dynamic Call Tree for n=4. Figure: the execution of f(p,4) drawn as a tree. Each node is an activation frame on the stack, labeled with the instructions it executes:
    n=4 (1 frame): Test n=1, Call f (twice)
    n=2 (2 frames): Test n=1, Call f (twice)
    n=1 (4 frames): Test n=1, Inc *p
25
Control Flow Overhead. In the call tree for f(p,4), the Call f instructions in the n=4 and n=2 frames are call overhead.
26
Control Flow Overhead. In the call tree for f(p,4), the overhead is call overhead plus test overhead: besides the Call f instructions, every frame executes a Test n=1 instruction.
27
Computation. In the call tree for f(p,4), only the four Inc *p instructions in the n=1 frames are computation. Everything else is call overhead and test overhead.
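The split between overhead and computation can be made concrete by instrumenting the running example. This is only an illustrative sketch: the struct overhead counters and the helper count_f4 are hypothetical, and calls counts activation frames including the initial one.

```c
#include <assert.h>

/* Hypothetical counters, not part of the slides. */
struct overhead {
    int calls;       /* activation frames entered, including the first call */
    int tests;       /* executions of the n == 1 test */
    int increments;  /* useful work ("Inc *p") */
};

/* The running example f, instrumented. */
void f_counted(char *p, int n, struct overhead *c) {
    assert(n >= 1);
    c->calls++;
    c->tests++;
    if (n == 1) {
        (*p) += 1;
        c->increments++;
    } else {
        f_counted(p, n / 2, c);
        f_counted(p + n / 2, n / 2, c);
    }
}

/* Count the work done by f(p, 4) on a zeroed 4-element array. */
struct overhead count_f4(void) {
    char a[4] = {0};
    struct overhead c = {0, 0, 0};
    f_counted(a, 4, &c);
    return c;
}
```

For n=4 the counters come out as 7 frames and 7 tests for only 4 increments, matching the 1 + 2 + 4 frames of the call tree.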
28
Large Base Cases = Reduced Overhead. Figure: the execution of f(p,4) with a base case for n=2. The n=4 frame executes Test n=2 and Call f; each of the two n=2 frames executes Test n=2, Inc *p, Inc *(p+1).
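A sketch of the same measurement for a hand-written base case of size 2, as in the figure. It assumes n is a power of two with n >= 2; the names f_base2_counted, count_f4_base2 and struct overhead are hypothetical.

```c
#include <assert.h>

/* Hypothetical counters, not part of the slides. */
struct overhead {
    int calls;       /* activation frames entered, including the first call */
    int tests;       /* executions of the n == 2 test */
    int increments;  /* useful work */
};

/* Array increment with a base case for n == 2.
 * Assumes n is a power of two and n >= 2. */
void f_base2_counted(char *p, int n, struct overhead *c) {
    assert(n >= 2);
    c->calls++;
    c->tests++;
    if (n == 2) {
        p[0] += 1;
        p[1] += 1;
        c->increments += 2;
    } else {
        f_base2_counted(p, n / 2, c);
        f_base2_counted(p + n / 2, n / 2, c);
    }
}

/* Count the work done for n = 4 on a zeroed 4-element array. */
struct overhead count_f4_base2(void) {
    char a[4] = {0};
    struct overhead c = {0, 0, 0};
    f_base2_counted(a, 4, &c);
    return c;
}
```

With the larger base case, the same 4 increments cost 3 frames and 3 tests.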
29
3. Transformations
30
Transformation 1: Recursion Inlining: Start with the original recursive procedure.
    void f(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f(p, n/2);
            f(p+n/2, n/2);
        }
    }
31
Transformation 1: Recursion Inlining: Make two copies of the original procedure.
    void f1(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }
    void f2(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f2(p, n/2);
            f2(p+n/2, n/2);
        }
    }
32
Transformation 1: Recursion Inlining: Transform direct recursion to mutual recursion.
    void f1(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f2(p, n/2);
            f2(p+n/2, n/2);
        }
    }
    void f2(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }
33
Transformation 1: Recursion Inlining: Inline procedure f2 at its call sites in f1 (the mutually recursive code on slide 32).
34
Transformation 1: Recursion Inlining: f1 after inlining f2 at both call sites.
    void f1(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            if (n/2 == 1) {
                *p += 1;
            } else {
                f1(p, n/2/2);
                f1(p+n/2/2, n/2/2);
            }
            if (n/2 == 1) {
                *(p+n/2) += 1;
            } else {
                f1(p+n/2, n/2/2);
                f1(p+n/2+n/2/2, n/2/2);
            }
        }
    }
35
Transformation 1: Recursion Inlining: The inlined f1 (slide 34) has reduced procedure call overhead, exposes more code at the intra-procedural level, and offers opportunities to simplify control flow in the inlined code.
36
Transformation 1: Recursion Inlining: Reduced procedure call overhead. More code exposed at the intra-procedural level. Opportunities to simplify control flow in the inlined code: identical condition expressions (both inlined if statements in f1, slide 34, test n/2 == 1).
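The reduction in procedure call overhead can be checked with a small sketch that counts activation frames before and after inlining. The calls parameter and the helpers calls_original and calls_inlined are hypothetical additions; the two procedures otherwise follow the slides.

```c
#include <assert.h>

/* Original procedure; *calls counts activation frames (hypothetical counter). */
void f(char *p, int n, int *calls) {
    assert(n >= 1);
    (*calls)++;
    if (n == 1) {
        *p += 1;
    } else {
        f(p, n / 2, calls);
        f(p + n / 2, n / 2, calls);
    }
}

/* f1 after inlining both calls (one unrolling step, before fusion). */
void f1_inlined(char *p, int n, int *calls) {
    assert(n >= 1);
    (*calls)++;
    if (n == 1) {
        *p += 1;
    } else {
        if (n / 2 == 1) {              /* inlined body of f2(p, n/2) */
            *p += 1;
        } else {
            f1_inlined(p, n / 2 / 2, calls);
            f1_inlined(p + n / 2 / 2, n / 2 / 2, calls);
        }
        if (n / 2 == 1) {              /* inlined body of f2(p+n/2, n/2) */
            *(p + n / 2) += 1;
        } else {
            f1_inlined(p + n / 2, n / 2 / 2, calls);
            f1_inlined(p + n / 2 + n / 2 / 2, n / 2 / 2, calls);
        }
    }
}

/* Frames used by each version on an n-element array, 1 <= n <= 64. */
int calls_original(int n) {
    char a[64] = {0};
    int calls = 0;
    assert(n >= 1 && n <= 64);
    f(a, n, &calls);
    return calls;
}

int calls_inlined(int n) {
    char a[64] = {0};
    int calls = 0;
    assert(n >= 1 && n <= 64);
    f1_inlined(a, n, &calls);
    return calls;
}
```

For n=8 the original procedure enters 15 frames (1 + 2 + 4 + 8), while the inlined version enters 5 (the initial frame plus four frames for n=2).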
37
Transformation 2: Conditional Fusion: Merge if statements with identical conditions.
    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else {
            f1(p, n/2/2);
            f1(p+n/2/2, n/2/2);
            f1(p+n/2, n/2/2);
            f1(p+n/2+n/2/2, n/2/2);
        }
    }
38
Transformation 2: Conditional Fusion: Merging the if statements with identical conditions (slide 37) reduces branching overhead, produces bigger basic blocks, and yields a larger base case for n/2 = 1.
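As a rough illustration of the reduced branching overhead, the sketch below counts how many condition tests the inlined and the fused versions of f1 execute. The tests parameter and the helpers tests_inlined and tests_fused are hypothetical; fusion is legal here because neither inlined branch modifies n.

```c
#include <assert.h>

/* f1 after inlining only: n/2 == 1 is tested twice per frame. */
void f1_inlined(char *p, int n, int *tests) {
    assert(n >= 1);
    (*tests)++;                         /* test n == 1 */
    if (n == 1) {
        *p += 1;
        return;
    }
    (*tests)++;                         /* first test of n/2 == 1 */
    if (n / 2 == 1) {
        *p += 1;
    } else {
        f1_inlined(p, n / 2 / 2, tests);
        f1_inlined(p + n / 2 / 2, n / 2 / 2, tests);
    }
    (*tests)++;                         /* second test of n/2 == 1 */
    if (n / 2 == 1) {
        *(p + n / 2) += 1;
    } else {
        f1_inlined(p + n / 2, n / 2 / 2, tests);
        f1_inlined(p + n / 2 + n / 2 / 2, n / 2 / 2, tests);
    }
}

/* f1 after conditional fusion: the two if statements are merged. */
void f1_fused(char *p, int n, int *tests) {
    assert(n >= 1);
    (*tests)++;                         /* test n == 1 */
    if (n == 1) {
        *p += 1;
        return;
    }
    (*tests)++;                         /* single test of n/2 == 1 */
    if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else {
        f1_fused(p, n / 2 / 2, tests);
        f1_fused(p + n / 2 / 2, n / 2 / 2, tests);
        f1_fused(p + n / 2, n / 2 / 2, tests);
        f1_fused(p + n / 2 + n / 2 / 2, n / 2 / 2, tests);
    }
}

/* Tests executed by each version on an n-element array, 1 <= n <= 64. */
int tests_inlined(int n) {
    char a[64] = {0};
    int tests = 0;
    assert(n >= 1 && n <= 64);
    f1_inlined(a, n, &tests);
    return tests;
}

int tests_fused(int n) {
    char a[64] = {0};
    int tests = 0;
    assert(n >= 1 && n <= 64);
    f1_fused(a, n, &tests);
    return tests;
}
```

For n=8 the inlined version executes 15 tests and the fused version 10, since every frame with n > 1 saves one test.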
39
Unrolling Iterations: Repeatedly apply inlining and conditional fusion, starting from the fused f1 on slide 37.
40
Second Unrolling Iteration: Start from the fused f1 and a copy f2 of the original procedure.
    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else {
            f1(p, n/2/2);
            f1(p+n/2/2, n/2/2);
            f1(p+n/2, n/2/2);
            f1(p+n/2+n/2/2, n/2/2);
        }
    }
    void f2(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else {
            f2(p, n/2);
            f2(p+n/2, n/2);
        }
    }
41
Second Unrolling Iteration: Make the recursion mutual, so that f1 calls f2 and f2 calls f1.
    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else {
            f2(p, n/2/2);
            f2(p+n/2/2, n/2/2);
            f2(p+n/2, n/2/2);
            f2(p+n/2+n/2/2, n/2/2);
        }
    }
    void f2(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }
42
Result of Second Unrolling Iteration
    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else if (n/2/2 == 1) {
            *p += 1;
            *(p+n/2/2) += 1;
            *(p+n/2) += 1;
            *(p+n/2+n/2/2) += 1;
        } else {
            f1(p, n/2/2/2);
            f1(p+n/2/2/2, n/2/2/2);
            f1(p+n/2/2, n/2/2/2);
            f1(p+n/2/2+n/2/2/2, n/2/2/2);
            f1(p+n/2, n/2/2/2);
            f1(p+n/2+n/2/2/2, n/2/2/2);
            f1(p+n/2+n/2/2, n/2/2/2);
            f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
        }
    }
43
Unrolling Iterations: The unrolling process stops when the number of iterations reaches the desired unrolling factor. The unrolled recursive procedure has base cases for larger problem sizes and divides the given problem into more sub-problems of smaller sizes. In our example: base cases for n=1, n=2, and n=4; problems are divided into 8 problems of 1/8 size.
44
Speedup for Matrix Multiply Matrix of 512 x 512 elements
46
Speedup for Matrix Multiply Matrix of 1024 x 1024 elements
47
Efficiency of Unrolled Recursive Part: Because the recursive part is also unrolled, recursion may not exercise the large base cases. Which base case is executed depends on the size of the input problem. In our example: for a problem of size n=8, the base case for n=1 is executed; for a problem of size n=16, the base case for n=2 is executed. The efficient base case for n=4 is not executed in these cases.
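The effect can be observed with a sketch that records which base case the twice-unrolled f1 actually reaches. The struct hits counters, the local e, and the helper run_unrolled are hypothetical instrumentation; the branch structure follows the result of the second unrolling iteration.

```c
#include <assert.h>

/* Hypothetical instrumentation: how often each base case is executed. */
struct hits { int base1, base2, base4; };

/* f1 after two unrolling iterations, without re-rolling. */
void f1_unrolled(char *p, int n, struct hits *h) {
    assert(n >= 1);
    if (n == 1) {
        *p += 1;
        h->base1++;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
        h->base2++;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
        h->base4++;
    } else {
        int e = n / 2 / 2 / 2;  /* size of each of the 8 sub-problems */
        f1_unrolled(p, e, h);
        f1_unrolled(p + e, e, h);
        f1_unrolled(p + n / 2 / 2, e, h);
        f1_unrolled(p + n / 2 / 2 + e, e, h);
        f1_unrolled(p + n / 2, e, h);
        f1_unrolled(p + n / 2 + e, e, h);
        f1_unrolled(p + n / 2 + n / 2 / 2, e, h);
        f1_unrolled(p + n / 2 + n / 2 / 2 + e, e, h);
    }
}

/* Run on a zeroed array of n elements, 1 <= n <= 64. */
struct hits run_unrolled(int n) {
    char a[64] = {0};
    struct hits h = {0, 0, 0};
    assert(n >= 1 && n <= 64);
    f1_unrolled(a, n, &h);
    return h;
}
```

Running it for n=8 reaches only the n=1 base case (8 times) and for n=16 only the n=2 base case (8 times); the n=4 base case is reached for n=4 and n=32.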
48
Solution: Recursion Re-Rolling: Roll back the recursive part of the unrolled procedure after the large base cases are generated. Re-rolling ensures that the larger base cases are always executed, independent of the input problem size. The compiler unrolls the recursive part only temporarily, to generate the base cases.
49
Transformation 3: Recursion Re-Rolling: Start from the result of the second unrolling iteration (the f1 on slide 42).
50
Transformation 3: Recursion Re-Rolling: Identify the recursive part of the unrolled f1 (slide 42): the final else branch, which contains the eight recursive calls f1(p, n/2/2/2) through f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2).
51
Transformation 3: Recursion Re-Rolling: Replace the recursive part with the recursive part of the original procedure. The base cases for n == 1, n/2 == 1 and n/2/2 == 1 stay unchanged, and the final else branch becomes: else { f1(p, n/2); f1(p+n/2, n/2); }
52
Final Result
    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else if (n/2/2 == 1) {
            *p += 1;
            *(p+n/2/2) += 1;
            *(p+n/2) += 1;
            *(p+n/2+n/2/2) += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }
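A sketch of the re-rolled procedure with instrumentation that records which base case is reached. The struct hits counters and the helper run_rerolled are hypothetical; the procedure body follows the final result.

```c
#include <assert.h>

/* Hypothetical instrumentation: how often each base case is executed. */
struct hits { int base1, base2, base4; };

/* Re-rolled f1: large base cases, original two-way recursive part. */
void f1_rerolled(char *p, int n, struct hits *h) {
    assert(n >= 1);
    if (n == 1) {
        *p += 1;
        h->base1++;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
        h->base2++;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
        h->base4++;
    } else {
        f1_rerolled(p, n / 2, h);
        f1_rerolled(p + n / 2, n / 2, h);
    }
}

/* Run on a zeroed array of n elements, 1 <= n <= 64. */
struct hits run_rerolled(int n) {
    char a[64] = {0};
    struct hits h = {0, 0, 0};
    assert(n >= 1 && n <= 64);
    f1_rerolled(a, n, &h);
    return h;
}
```

For power-of-two sizes from 4 upward, only the n=4 base case is reached: twice for n=8, four times for n=16, eight times for n=32.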
53
Speedup for Matrix Multiply Matrix of 512 x 512 elements
54
Speedup for Matrix Multiply Matrix of 1024 x 1024 elements
55
Other Optimizations: Inlining moves code from the inter-procedural level to the intra-procedural level. Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level. Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations.
56
Comparison to Hand Coded Programs: Two applications: matrix multiply and LU decomposition. Three machines: Pentium III, Origin 2000, PowerPC. Two different problem sizes. Compare automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks. The best automatically unrolled version performs between 2.2 and 2.9 times worse than the hand coded version for matrix multiply, and as well as the hand coded version for LU.
57
Related Work: Procedure Inlining: Scheifler (1977); Richardson, Ganapathi (1989); Chambers, Ungar (1989); Cooper, Hall, Torczon (1991); Appel (1992); Chang, Mahlke, Chen, Hwu (1992).
58
Conclusion: Recursion unrolling is analogous to the loop unrolling transformation. Divide and conquer programs: the programmer writes simple base cases, and the compiler automatically generates large base cases. Key techniques: Inlining (conceptually inline recursive calls), Conditional Fusion (simplify intra-procedural control flow), and Re-Rolling (ensure that large base cases are executed).
62
Comparison to Hand Coded Programs
    Matrix multiply, 512 x 512 elements: best automatically unrolled program 2.55 sec; hand coded with three nested loops 3.46 sec; hand coded Cilk program 1.16 sec.
    Matrix multiply, 1024 x 1024 elements: best automatically unrolled program 20.47 sec; hand coded with three nested loops 27.40 sec; hand coded Cilk program 9.19 sec.
63
Correctness: Recursion unrolling preserves the semantics of the program. The unrolled program terminates if and only if the original recursive program terminates. When both the original and the unrolled program terminate, they yield the same result.
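The claim can be spot-checked on the running example with a small sketch that compares the original f against the unrolled and re-rolled f1 for every size in a range, including sizes that are not powers of two. This is a test, not a proof, and the helper equivalent_up_to is hypothetical. Sizes below 1 are excluded because neither procedure terminates for n=0.

```c
#include <assert.h>
#include <string.h>

/* Original procedure from the running example. */
void f(char *p, int n) {
    assert(n >= 1);
    if (n == 1) {
        *p += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* Final result of unrolling twice and re-rolling. */
void f1(char *p, int n) {
    assert(n >= 1);
    if (n == 1) {
        *p += 1;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        f1(p, n / 2);
        f1(p + n / 2, n / 2);
    }
}

/* Hypothetical check: returns 1 if f and f1 leave identical arrays
 * for every size n in 1..max_n (max_n at most 64), else 0. */
int equivalent_up_to(int max_n) {
    assert(max_n >= 1 && max_n <= 64);
    for (int n = 1; n <= max_n; n++) {
        char a[64] = {0};
        char b[64] = {0};
        f(a, n);
        f1(b, n);
        if (memcmp(a, b, sizeof a) != 0) {
            return 0;
        }
    }
    return 1;
}
```

Note that for sizes that are not powers of two both versions skip the same elements (for n=6, for instance, elements 2 and 5 are left untouched), so the comparison checks equivalence with the original rather than "every element incremented".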
64
Speedup for Matrix Multiply Pentium III, Matrix of 512 x 512 elements
65
Speedup for Matrix Multiply Pentium III, Matrix of 1024 x 1024 elements
66
Speedup for Matrix Multiply Power PC, Matrix of 512 x 512 elements
67
Speedup for Matrix Multiply Power PC, Matrix of 1024 x 1024 elements
68
Speedup for Matrix Multiply Origin 2000, Matrix of 512 x 512 elements
69
Speedup for Matrix Multiply Origin 2000, Matrix of 1024 x 1024 elements
70
Speedup for LU Pentium III, Matrix of 512 x 512 elements
71
Speedup for LU Pentium III, Matrix of 1024 x 1024 elements
72
Speedup for LU Power PC, Matrix of 512 x 512 elements
73
Speedup for LU Power PC, Matrix of 1024 x 1024 elements
74
Speedup for LU Origin 2000, Matrix of 1024 x 1024 elements
75
Speedup for LU Origin 2000, Matrix of 512 x 512 elements