The Power of Belady's Algorithm in Register Allocation for Long Basic Blocks
Jia Guo, María Jesús Garzarán, and David Padua
{jiaguo, garzaran, padua}@uiuc.edu
University of Illinois at Urbana-Champaign

LCPC 2003, slide 2: Motivation
- Long basic blocks come from
  - compiler optimizations
  - library generators such as SPIRAL and ATLAS
- Figure: speedup obtained from unrolling FFTs
- The effectiveness of register allocation on long basic blocks

Slide 3: Contributions
- Apply Belady's MIN algorithm to long basic blocks.
- Compared with MIPSPro, code allocated with Belady's MIN algorithm runs
  - 10% faster on matrix multiplication code
  - 12% faster on FFT code of size 32
  - 33% faster on FFT code of size 64

Slide 4: Outline
- Belady's MIN algorithm
- On FFT code
- On matrix multiplication code
- Conclusions

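Slide 5: Belady's MIN Algorithm
- When a register must be freed, it chooses the value with the farthest next use.
- It guarantees the minimum number of reloads, but not the minimum number of stores.

Example with 3 FP registers:

    1  c = a + b
    2  e = a + d
    3  f = a - b
    4  h = d + c

Generated code:

    load  R1, a
    load  R2, b
    add   R3, R1, R2
    store R3, c
    load  R3, d
    ...

Register state after statement 1:

    Reg       R1     R2     R3
    Var       a      b      c
    Next use  2      3      4
    Status    clean  clean  dirty

Statement 2 needs d. Of the resident values, c has the farthest next use, so R3 is spilled; because c is dirty, it is stored first. R3 then holds d (next use 4, status clean).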
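The eviction rule on this slide can be sketched as a small simulator. This is a minimal sketch and not the authors' compiler: the tuple layout for statements, the helper names `next_use` and `belady_min`, and the choice to treat every computed value as live-out (so dirty registers are stored on eviction and at the end of the block) are all assumptions made for illustration.

```python
import math

# Statements are (dest, op, src1, src2); this tuple layout is an assumption.
BLOCK = [
    ("c", "add", "a", "b"),
    ("e", "add", "a", "d"),
    ("f", "sub", "a", "b"),
    ("h", "add", "d", "c"),
]


def next_use(stmts, var, start):
    """Index of the first statement at or after `start` that reads `var`."""
    for j in range(start, len(stmts)):
        if var in stmts[j][2:]:
            return j
    return math.inf  # never read again in this block


def belady_min(stmts, num_regs=3):
    """Return the instruction list of a MIN-style allocator (needs >= 3 registers)."""
    resident = {}                        # variable -> [register, dirty flag]
    free = list(range(num_regs, 0, -1))  # pop() hands out R1 first
    code = []

    def take_register(start, pinned):
        if free:
            return free.pop()
        # Evict the unpinned variable whose next use is farthest away.
        victim = max((v for v in resident if v not in pinned),
                     key=lambda v: next_use(stmts, v, start))
        reg, dirty = resident.pop(victim)
        if dirty:                        # register differs from the memory copy
            code.append(f"store R{reg}, {victim}")
        return reg

    for i, (dst, op, a, b) in enumerate(stmts):
        # Bring both operands into registers; never evict the other operand.
        for src in (a, b):
            if src not in resident:
                reg = take_register(i, pinned={a, b})
                code.append(f"load R{reg}, {src}")
                resident[src] = [reg, False]
        ra, rb = resident[a][0], resident[b][0]
        # The destination may reuse an operand's register.
        if dst in resident:
            rd = resident[dst][0]
        else:
            rd = take_register(i + 1, pinned=set())
        resident[dst] = [rd, True]
        code.append(f"{op} R{rd}, R{ra}, R{rb}")

    # Assumption: every computed value is live-out, so flush dirty registers.
    for var, (reg, dirty) in resident.items():
        if dirty:
            code.append(f"store R{reg}, {var}")
    return code


for line in belady_min(BLOCK):
    print(line)
```

On the four-statement example, the first five instructions this sketch emits match the slide: two loads, the add, then `store R3, c` followed by `load R3, d`. Ties between values that are never used again are broken by insertion order here, which is an arbitrary choice.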

Slide 6: A Simple Compiler
- Input: long basic blocks (SPIRAL output after optimizations, ATLAS output after optimizations)
- Stages: parsing, register allocation (Belady's algorithm), target code generation
- Output: MIPS assembly code

Slide 8: SPIRAL and FFT Code
- SPIRAL uses formula transformations and intelligent search to generate optimized DSP libraries.
- It searches for the best degree of unrolling:
  - Small sizes (2 to 64): straight-line code; FFT64 has 1400+ statements.
  - Large sizes (128+): use the small-size results as components.
- Figure: components F_r and F_s

Slide 9: Interesting Patterns in FFT Codes

    ...
    y32 = y2 - y18
    y33 = y3 - y19
    y34 = y2 + y18
    y35 = y3 + y19
    y36 = y10 - y26
    y37 = y11 - y27
    y38 = y10 + y26
    y39 = y11 + y27
    y40 = y34 - y38
    y41 = y35 - y39
    y42 = y34 + y38
    y43 = y35 + y39
    y44 = y32 - y37
    y45 = y33 + y36
    ...

- Each variable has one definition and two uses, and the uses are close together.
- Simplification: treat the code as a one-define, one-use program.
- For such programs it can be proved that Belady's MIN algorithm generates the minimum number of reloads and stores.
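The pattern can be checked mechanically on the excerpt. The following is a minimal sketch, assuming each statement has the form `dest = src op src`; the function name `def_use_counts` is hypothetical. Because the excerpt is truncated at both ends, some variables show only one of their two uses.

```python
import re
from collections import Counter

# The statements shown on the slide (the surrounding code is elided there).
EXCERPT = """
y32 = y2 - y18
y33 = y3 - y19
y34 = y2 + y18
y35 = y3 + y19
y36 = y10 - y26
y37 = y11 - y27
y38 = y10 + y26
y39 = y11 + y27
y40 = y34 - y38
y41 = y35 - y39
y42 = y34 + y38
y43 = y35 + y39
y44 = y32 - y37
y45 = y33 + y36
"""


def def_use_counts(text):
    """Count how often each variable is defined and how often it is read."""
    defs, uses = Counter(), Counter()
    for line in text.strip().splitlines():
        lhs, rhs = line.split("=", 1)
        defs[lhs.strip()] += 1
        for name in re.findall(r"[A-Za-z_]\w*", rhs):
            uses[name] += 1
    return defs, uses


defs, uses = def_use_counts(EXCERPT)
print("max definitions per variable:", max(defs.values()))
print("max uses per variable:", max(uses.values()))
```

Within the excerpt no variable is defined more than once and none is read more than twice, which is the property the slide relies on.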

Slide 10: Performance Evaluation: FFT
- Speedup: 12% for FFT 32, 33% for FFT 64.
- Figure: performance of the best formula for FFT 4-64, with "no spill" and "spills" regions marked.
- Figure: performance of all the formulas for FFT 64.

Slide 12: ATLAS and Matrix Multiplication
- ATLAS is an empirical optimizer that searches for the best parameters (degree of unrolling, etc.).
- Study: the innermost loop body. With KU = 64 and NU:MU from 2:2 to 8:8, it is 300 to 4000 lines of code.
- Figure: matrices A (labeled MU, KU), B (labeled NU, KU), and C, with the tile size marked.

    for (int j = 0; j < TileSize; j += NU)
      for (int i = 0; i < TileSize; i += MU)
        load C[1..8] into registers
        for (int k = 0; k < TileSize; k += KU)
          // block (*)
          load A[k, 1..4] into registers
          load B[1..2, k] into registers
          C[1] += A[k,1] * B[1,k]
          C[2] += A[k,1] * B[2,k]
          ...
          C[8] += A[k,4] * B[2,k]
          // repeat block (*) for k+1 through k+KU-1
        store C back to memory
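To see why these loop bodies become long basic blocks, the multiply-add statements of one fully unrolled register tile can be generated directly. This is only a sketch: the function name `unrolled_body`, the variable naming scheme, and the statement order are assumptions, and ATLAS's actual generated code also contains loads and stores, so the counts are only roughly comparable to the line counts quoted on the slide.

```python
def unrolled_body(mu, nu, ku):
    """Multiply-add statements for one MU x NU register tile unrolled KU times."""
    stmts = []
    for k in range(ku):          # unrolled k iterations
        for j in range(nu):      # columns of the C tile
            for i in range(mu):  # rows of the C tile
                stmts.append(f"c{i}_{j} += a{i}_{k} * b{k}_{j}")
    return stmts


# One k step of the slide's example (MU = 4, NU = 2) updates 8 C elements.
print(len(unrolled_body(4, 2, 1)))

# Statement counts at the two ends of the studied range, with KU = 64.
print(len(unrolled_body(2, 2, 64)), len(unrolled_body(8, 8, 64)))
```

The count is MU * NU * KU, so it grows from 256 multiply-adds at 2:2 to 4096 at 8:8 when KU = 64.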

Slide 13: Performance Evaluation for MM
- Figure: performance plot with "no spill" and "spills" regions marked.
- Spills for NU:MU = 4:8: MIPSPro 438, MIN 892.

Slide 14: Explanation
- MIN keeps spilling c elements, which leads to more stores.
- The code has a long dependency chain.

Original:

     1  c0  += a0  * b0
     2  c1  += a1  * b0
     3  c2  += a2  * b0
     4  c3  += a3  * b0
     5  c64 += a0  * b64
     6  c65 += a1  * b64
     7  c66 += a2  * b64
     8  c67 += a3  * b64
     9  c0  += a64 * b1
    10  c1  += a65 * b1
    11  c2  += a66 * b1
    12  c3  += a67 * b1
    13  c64 += a64 * b65
    14  c65 += a65 * b65
    15  c66 += a66 * b65
    16  c67 += a67 * b65

Generated code (the number in parentheses is the original statement):

         load  R0, a0
         load  R1, b0
         load  R2, c0
    (1)  madd  R2, R2, R0, R1
         load  R3, a1
         load  R4, c1
    (2)  madd  R4, R4, R3, R1
         load  R5, a2
         store R4, c1
         load  R4, c2
    (3)  madd  R4, R4, R5, R1
         store R4, c2
         load  R4, a3
         store R2, c0
         load  R2, c3
    (4)  madd  R2, R2, R4, R1
         ...

Slide 15: Solution
- Spill a, b, and c elements, which results in fewer stores.
- Break the dependency chain.
- Use the instruction schedule from MIPSPro.

Scheduled by MIPSPro:

    c0  += a0  * b0
    c0  += a64 * b1
    c1  += a65 * b1
    c1  += a1  * b0
    c2  += a2  * b0
    c3  += a3  * b0
    c2  += a66 * b1
    c3  += a67 * b1
    c66 += a2  * b64
    c64 += a0  * b64
    c66 += a67 * b65
    c64 += a64 * b65
    c65 += a1  * b64
    c67 += a3  * b64
    c67 += a68 * b65
    c65 += a65 * b65

Slide 16: Performance Evaluation for MM
- 10% better when MU:NU is larger than 4:6.
- Figure: performance plot with "no spill" and "spills" regions marked.
- Spills for NU:MU = 4:8: MINSched 356, MIPSPro 438, MIN 892.

Slide 18: Conclusions
- Belady's MIN algorithm
  - generates the minimum number of reloads and stores for one-define, one-use programs
  - performs better than state-of-the-art compilers such as MIPSPro and GCC
  - speedup: 12% (FFT32), 33% (FFT64), 10% (MM)
- Further benchmarks remain to be tested.

Slide 20: Code Size after Register Allocation

Slide 21: Speedup from Fully Unrolling
