The Power of Belady's Algorithm in Register Allocation for Long Basic Blocks
Jia Guo, María Jesús Garzarán and David Padua
jiaguo, garzaran, uiuc.edu
University of Illinois at Urbana-Champaign
LCPC 2003

Motivation
  Long basic blocks
    Compiler optimizations
    Library generators: SPIRAL, ATLAS
  Speedup obtained from unrolling FFTs
  The effectiveness of register allocation on long basic blocks
Contributions
  Apply Belady's MIN algorithm to long basic blocks
  Compared with MIPSPro, Belady's MIN algorithm performs
    10% faster on Matrix Multiplication code
    12% faster on FFT code of size 32
    33% faster on FFT code of size 64
Outline
  Belady's MIN algorithm
  On FFT code
  On Matrix Multiplication code
  Conclusions
Belady's MIN Algorithm
  Chooses the farthest next use
  Guarantees the minimum number of reloads, but not of stores

  Example (3 FP registers):
    1  c = a + b
    2  e = a + d
    3  f = a - b
    4  h = d + c

  Generated code:
    load  R1, a
    load  R2, b
    add   R3, R1, R2
    store R3, c
    load  R3, d
    ...

  Register state after statement 1:
    Reg       R1     R2     R3
    Var       a      b      c
    Next use  2      3      4
    Status    clean  clean  dirty

  Statement 2 needs d: spill c (farthest next use); R3 then holds d (next use 4, clean)
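The farthest-next-use policy on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' compiler: the function name `belady_allocate`, the tuple format `(dest, op, src1, src2)`, and the write-back policy (store a dirty value when it is evicted, and flush remaining dirty values at the end of the block) are all assumptions made for the example. It assumes at least three registers.

```python
import math

def belady_allocate(block, num_regs=3):
    """Allocate registers for one basic block with a farthest-next-use policy.

    block: list of (dest, op, src1, src2) tuples (hypothetical format).
    Returns a list of assembly-like strings.
    """
    # uses[v] = ascending list of statement indices that read v
    uses = {}
    for i, (_dest, _op, s1, s2) in enumerate(block):
        for s in (s1, s2):
            uses.setdefault(s, []).append(i)

    def next_use(var, start):
        # First statement index >= start that reads var, or infinity if none.
        return next((i for i in uses.get(var, []) if i >= start), math.inf)

    reg_of = {}                            # variable -> register number
    dirty = set()                          # variables whose register copy is newer than memory
    free = list(range(1, num_regs + 1))    # unused registers, lowest first
    code = []

    def take_register(start, pinned):
        # Use a free register if possible; otherwise evict the variable
        # whose next use (from statement `start` on) is farthest away.
        if free:
            return free.pop(0)
        victim = max((v for v in reg_of if v not in pinned),
                     key=lambda v: next_use(v, start))
        reg = reg_of.pop(victim)
        if victim in dirty:                # assumption: write back on eviction
            dirty.discard(victim)
            code.append(f"store R{reg}, {victim}")
        return reg

    for i, (dest, op, s1, s2) in enumerate(block):
        for s in (s1, s2):
            if s not in reg_of:
                reg_of[s] = take_register(i, {s1, s2})
                code.append(f"load R{reg_of[s]}, {s}")
        r1, r2 = reg_of[s1], reg_of[s2]
        # Sources are read before the result is written, so the result may
        # take over the register of a source that is not needed soon.
        if dest not in reg_of:
            reg_of[dest] = take_register(i + 1, set())
        dirty.add(dest)
        code.append(f"{op} R{reg_of[dest]}, R{r1}, R{r2}")

    for var in sorted(dirty):              # assumption: flush results at block end
        code.append(f"store R{reg_of[var]}, {var}")
    return code

# The four-statement example from the slide, with 3 registers.
block = [("c", "add", "a", "b"),
         ("e", "add", "a", "d"),
         ("f", "sub", "a", "b"),
         ("h", "add", "d", "c")]
asm = belady_allocate(block, num_regs=3)
print("\n".join(asm))
```

Under these assumptions the first five emitted instructions agree with the slide: c is spilled (and stored, because it is dirty) to make room for d. Later in the block both d and c are reloaded, which shows why the policy minimizes reloads without saying anything about the number of stores.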
A Simple Compiler
  Input: long basic blocks (SPIRAL after optimizations, ATLAS after optimizations)
  Parsing
  Register allocation (Belady's algorithm)
  Target code generation
  Output: MIPS assembly code
SPIRAL and FFT Code
  Uses formula transformations and intelligent search to generate optimized DSP libraries
  Searches for the best degree of unrolling
  Small sizes (2 to 64): straight-line code
    FFT64: stmts
  Large sizes (128+): use small-size results as components
  [Figure: components labeled F_r and F_s]
Interesting Patterns in FFT Codes
    ...
    y32 = y2 - y18
    y33 = y3 - y19
    y34 = y2 + y18
    y35 = y3 + y19
    y36 = y10 - y26
    y37 = y11 - y27
    y38 = y10 + y26
    y39 = y11 + y27
    y40 = y34 - y38
    y41 = y35 - y39
    y42 = y34 + y38
    y43 = y35 + y39
    y44 = y32 - y37
    y45 = y33 + y36
    ...

  Each variable: one definition, two uses, and the uses are close together
  Simplify: a one-define, one-use program
  It can be proved that Belady's MIN algorithm generates the minimum number of reloads and stores!
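The pattern claimed on this slide can be checked mechanically on the excerpt. The following is a small sketch that counts definitions and uses per variable; the names `defs` and `use_sites` are illustrative, and the check covers only the fourteen statements shown, not a full FFT kernel.

```python
import re
from collections import Counter

snippet = """
y32 = y2 - y18
y33 = y3 - y19
y34 = y2 + y18
y35 = y3 + y19
y36 = y10 - y26
y37 = y11 - y27
y38 = y10 + y26
y39 = y11 + y27
y40 = y34 - y38
y41 = y35 - y39
y42 = y34 + y38
y43 = y35 + y39
y44 = y32 - y37
y45 = y33 + y36
"""

defs = Counter()   # variable -> number of definitions in the excerpt
use_sites = {}     # variable -> statement indices where it is read

lines = [ln for ln in snippet.strip().splitlines() if ln.strip()]
for idx, line in enumerate(lines):
    dest, expr = line.split("=")
    defs[dest.strip()] += 1
    for var in re.findall(r"y\d+", expr):
        use_sites.setdefault(var, []).append(idx)

max_defs = max(defs.values())
max_uses = max(len(sites) for sites in use_sites.values())
# Distance between the two reads of every variable that is read twice.
gaps = {v: s[1] - s[0] for v, s in use_sites.items() if len(s) == 2}

print("max definitions per variable:", max_defs)
print("max uses per variable:", max_uses)
print("largest gap between two uses:", max(gaps.values()))
```

Within the excerpt no variable is defined more than once or read more than twice, and the two reads of a variable are never more than two statements apart.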
Performance Evaluation: FFT
  Speedup: FFT 32: 12%, FFT 64: 33%
  [Figure: performance of the best formula for FFT 4-64, with no-spill and spill regions marked]
  [Figure: performance of all the formulas for FFT 64]
ATLAS and Matrix Multiplication
  An empirical optimizer that searches for the best parameters (degree of unrolling, etc.)
  Study: innermost loop body
    When KU=64, NU:MU = 2:2 to 8: LOC
  [Figure: tile of size TileSize; A block MU x KU, B block NU x KU, C]

  for (int j = 0; j < TileSize; j += NU)
    for (int i = 0; i < TileSize; i += MU)
      load C 1..8 into registers
      for (int k = 0; k < TileSize; k += KU)
        (*) load A k,1..4 into registers
            load B 1..2,k into registers
            C 1 += A k,1 * B 1,k
            C 2 += A k,1 * B 2,k
            ...
            C 8 += A k,4 * B 2,k
        Repeat (*) for k+1, ..., k+KU-1
      store C back to memory
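The loop structure above can be written out as a runnable sketch. This is an illustration under assumptions rather than ATLAS-generated code: it uses plain row-major indexing for C = A * B, the function names `tiled_matmul` and `naive_matmul` are hypothetical, and the unrolled body is expressed with short inner loops instead of straight-line statements. The list `acc` plays the role of the MU x NU block of C held in registers.

```python
def tiled_matmul(A, B, n, MU=4, NU=2, KU=4):
    """Multiply two n x n matrices with an MU x NU accumulator block.

    Assumes n is a multiple of MU, NU and KU.
    """
    assert n % MU == 0 and n % NU == 0 and n % KU == 0
    C = [[0] * n for _ in range(n)]
    for j in range(0, n, NU):
        for i in range(0, n, MU):
            # "load C 1..8 into registers": MU * NU accumulators
            acc = [[C[i + ii][j + jj] for jj in range(NU)] for ii in range(MU)]
            for k in range(0, n, KU):
                # Generated code unrolls this loop KU times into one long basic block.
                for kk in range(k, k + KU):
                    a = [A[i + ii][kk] for ii in range(MU)]   # load MU elements of A
                    b = [B[kk][j + jj] for jj in range(NU)]   # load NU elements of B
                    for ii in range(MU):
                        for jj in range(NU):
                            acc[ii][jj] += a[ii] * b[jj]      # one multiply-add each
            # "store C back to memory"
            for ii in range(MU):
                for jj in range(NU):
                    C[i + ii][j + jj] = acc[ii][jj]
    return C

def naive_matmul(A, B, n):
    """Reference triple loop used only to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 8
A = [[i + 2 * j for j in range(n)] for i in range(n)]
B = [[i * j - 3 for j in range(n)] for i in range(n)]
result = tiled_matmul(A, B, n)
print(result == naive_matmul(A, B, n))
```

With MU=4 and NU=2 the block holds 8 accumulators, matching the "C 1..8" on the slide. Each unrolled step needs MU + NU operand values in addition to the accumulators, which is where register pressure grows as the unrolling factors increase.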
Performance Evaluation for MM
  [Figure: MM performance, with no-spill and spill regions marked]
  Spills for NU:MU = 4:8
    MIPSPro: 438
    MIN: 892
Explanation
  MIN keeps spilling c elements
    More stores
    Long dependency chain

  Original:
     1  c0  += a0  * b0
     2  c1  += a1  * b0
     3  c2  += a2  * b0
     4  c3  += a3  * b0
     5  c64 += a0  * b64
     6  c65 += a1  * b64
     7  c66 += a2  * b64
     8  c67 += a3  * b64
     9  c0  += a64 * b1
    10  c1  += a65 * b1
    11  c2  += a66 * b1
    12  c3  += a67 * b1
    13  c64 += a64 * b65
    14  c65 += a65 * b65
    15  c66 += a66 * b65
    16  c67 += a67 * b65

  Code generated with MIN (statement numbers on the left):
       load  R0, a0
       load  R1, b0
       load  R2, c0
    1  madd  R2, R2, R0, R1
       load  R3, a1
       load  R4, c1
    2  madd  R4, R4, R3, R1
       load  R5, a2
       store R4, c1
       load  R4, c2
    3  madd  R4, R4, R5, R1
       store R4, c2
       load  R4, a3
       store R2, c0
       load  R2, c3
    4  madd  R2, R2, R4, R1
       ...
Solution
  Spill a, b, and c elements
    Fewer stores
    Break the dependency chain
  Use the instruction scheduling from MIPSPro

  Scheduled by MIPSPro:
    c0  += a0  * b0
    c0  += a64 * b1
    c1  += a65 * b1
    c1  += a1  * b0
    c2  += a2  * b0
    c3  += a3  * b0
    c2  += a66 * b1
    c3  += a67 * b1
    c66 += a2  * b64
    c64 += a0  * b64
    c66 += a66 * b65
    c64 += a64 * b65
    c65 += a1  * b64
    c67 += a3  * b64
    c67 += a67 * b65
    c65 += a65 * b65
Performance Evaluation for MM
  10% better when MU:NU is larger than 4:6
  [Figure: MM performance, with no-spill and spill regions marked]
  Spills for NU:MU = 4:8
    MINSched: 356
    MIPSPro: 438
    MIN: 892
Conclusions
  Belady's MIN algorithm
    Generates the minimum number of reloads and stores for one-define, one-use programs
    Performs better than state-of-the-art compilers such as MIPSPro and GCC
    Speedup: 12% (FFT32), 33% (FFT64), 10% (MM)
  Further benchmarks to be tested
Code Size after Register Allocation
  [Figure]

Speedup from Fully Unrolling
  [Figure]