1
The Power of Belady's Algorithm in Register Allocation for Long Basic Blocks
Jia Guo, María Jesús Garzarán and David Padua
{jiaguo, garzaran, padua}@uiuc.edu
University of Illinois at Urbana-Champaign
2
LCPC 2003
Motivation
- Long basic blocks
  - Compiler optimizations
  - Library generators: SPIRAL, ATLAS
- Speedup obtained from unrolling FFTs
- The effectiveness of register allocation on long basic blocks
3
Contributions
- Apply Belady's MIN algorithm to long basic blocks
- Compared with MIPSPro, code allocated with Belady's MIN algorithm runs:
  - 10% faster on Matrix Multiplication code
  - 12% faster on FFT code of size 32
  - 33% faster on FFT code of size 64
4
Outline
- Belady's MIN algorithm
- On FFT code
- On Matrix Multiplication code
- Conclusions
5
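Belady's MIN Algorithm
- When a register must be freed, chooses the value whose next use is farthest away
- Guarantees the minimum number of reloads, but not the minimum number of stores

Example (3 FP registers):
    1  c = a + b
    2  e = a + d
    3  f = a - b
    4  h = d + c

Generated code:
    load  R1, a
    load  R2, b
    add   R3, R1, R2
    store R3, c
    load  R3, d
    …

Register state when statement 2 needs a register for d:
    Reg       R1     R2     R3
    Var       a      b      c
    Next use  2      3      4
    Status    clean  clean  dirty

Spill: c has the farthest next use and is dirty, so R3 is stored and then reloaded with d (next use 4, clean).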
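The eviction rule on this slide can be sketched as a small simulator. This is a minimal illustration, not the authors' compiler: the function names `next_use` and `min_allocate`, the tuple encoding of statements, and the operation labels are all assumptions made for the example. It also assumes that a value with no later use in the block is dead, so only dirty values that are read again get a store; a real allocator would also write back live-out values.

```python
from math import inf

def next_use(stmts, var, start):
    """Index of the first statement at or after `start` that reads `var`."""
    for i in range(start, len(stmts)):
        if var in stmts[i][1:]:
            return i
    return inf

def min_allocate(stmts, num_regs):
    """Simulate MIN on one basic block and return the emitted operations.

    Each statement is a tuple (dest, src, src, ...). `num_regs` must be at
    least the number of sources of any statement (and at least 1).
    """
    dirty = {}  # resident variable -> True if the register differs from memory
    ops = []

    def make_room(pinned, from_stmt):
        # Nothing to do while a register is still free.
        if len(dirty) < num_regs:
            return
        # Evict the unpinned resident value whose next use is farthest away.
        victim = max((v for v in dirty if v not in pinned),
                     key=lambda v: next_use(stmts, v, from_stmt))
        # Assumption: a value never read again in the block is dead,
        # so only dirty values with a later use need a store.
        if dirty[victim] and next_use(stmts, victim, from_stmt) != inf:
            ops.append(("store", victim))
        del dirty[victim]

    for i, (dest, *srcs) in enumerate(stmts):
        for src in srcs:
            if src not in dirty:
                make_room(pinned=set(srcs), from_stmt=i)
                ops.append(("load", src))
                dirty[src] = False
        if dest not in dirty:
            # The operands have been read, so their registers may be reused.
            make_room(pinned=set(), from_stmt=i + 1)
        dirty[dest] = True
        ops.append(("compute", dest))
    return ops

# The four-statement example from the slide, with 3 registers.
block = [("c", "a", "b"), ("e", "a", "d"), ("f", "a", "b"), ("h", "d", "c")]
ops = min_allocate(block, 3)
for op in ops:
    print(*op)
```

The first five operations match the slide's code: two loads, the add that defines c, the store of c, and the load of d. Ties between values that are never used again are broken by dictionary order, which is an arbitrary choice of this sketch.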
6
A Simple Compiler
Pipeline (from the slide diagram):
    Long basic blocks (SPIRAL after optimizations, ATLAS after optimizations)
    -> Parsing
    -> Register allocation (Belady's algorithm)
    -> Target code generation
    -> MIPS assembly code
7
Outline
- Belady's MIN algorithm
- On FFT code
- On Matrix Multiplication code
- Conclusions
8
SPIRAL and FFT Code
- SPIRAL makes use of formula transformations and intelligent search to generate optimized DSP libraries
- Search for the best degree of unrolling
  - Small sizes (2 to 64): straight-line code; FFT64 has 1400+ statements
  - Large sizes (128+): use small-size results as components
Figure labels: F_r, F_s
9
Interesting Patterns in FFT Codes
    …
    y32 = y2 - y18
    y33 = y3 - y19
    y34 = y2 + y18
    y35 = y3 + y19
    y36 = y10 - y26
    y37 = y11 - y27
    y38 = y10 + y26
    y39 = y11 + y27
    y40 = y34 - y38
    y41 = y35 - y39
    y42 = y34 + y38
    y43 = y35 + y39
    y44 = y32 - y37
    y45 = y33 + y36
    …
- One define
- Two uses
- Close uses
- Simplify: one-define, one-use program
- It can be proved that, for a one-define, one-use program, Belady's MIN algorithm generates the minimum number of reloads and stores!
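The "one define, two uses, close uses" pattern can be checked mechanically on the excerpt above. The following is a small sketch written for this purpose only; `def_use` and `FRAGMENT` are hypothetical names, and because the excerpt is cut off at both ends, the use counts for the last few variables (such as y32) are incomplete.

```python
import re
from collections import defaultdict

# The statements shown on the slide, without the elided parts.
FRAGMENT = """
y32 = y2 - y18
y33 = y3 - y19
y34 = y2 + y18
y35 = y3 + y19
y36 = y10 - y26
y37 = y11 - y27
y38 = y10 + y26
y39 = y11 + y27
y40 = y34 - y38
y41 = y35 - y39
y42 = y34 + y38
y43 = y35 + y39
y44 = y32 - y37
y45 = y33 + y36
"""

def def_use(lines):
    """Map each variable to the statement indices that define and read it."""
    defs = defaultdict(list)
    uses = defaultdict(list)
    for idx, line in enumerate(lines):
        dest, expr = (part.strip() for part in line.split("=", 1))
        defs[dest].append(idx)
        for var in re.findall(r"[A-Za-z_]\w*", expr):
            uses[var].append(idx)
    return defs, uses

lines = FRAGMENT.strip().splitlines()
defs, uses = def_use(lines)

for var, where in uses.items():
    gap = where[-1] - where[0]
    print(f"{var}: used at {where}, distance between first and last use = {gap}")
```

In this excerpt every variable is defined once, no variable is read more than twice, and the two reads of a value are only two statements apart.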
10
Performance Evaluation: FFT
- Speedup: FFT 32: 12%; FFT 64: 33%
- Figure: Performance of the best formula for FFT 4-64 (regions labeled "no spill" and "spills")
- Figure: Performance of all the formulas for FFT 64
11
Outline
- Belady's MIN algorithm
- On FFT code
- On Matrix Multiplication code
- Conclusions
12
ATLAS and Matrix Multiplication
- An empirical optimizer searching for best parameters (degree of unrolling, etc.)
- Study: innermost loop body
  - When KU=64, NU:MU = 2:2 to 8:8, 300-4000 LOC
Figure labels: A (MU, KU), B (NU, KU), C, Tile Size

    for (int j = 0; j < TileSize; j += NU)
      for (int i = 0; i < TileSize; i += MU)
        load C_1..8 into registers
        for (int k = 0; k < TileSize; k += KU)
          load A_k,1..4 into registers
          load B_1..2,k into registers
          C_1 += A_k,1 * B_1,k
          C_2 += A_k,1 * B_2,k
          …
          C_8 += A_k,4 * B_2,k
          Repeat * for k+1, ..., k+KU-1
        store C back to memory

    * marks the block repeated for each step of k.
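To make the size of the unrolled loop body concrete, the sketch below generates the multiply-add statements for one register tile. It is an illustration under stated assumptions, not ATLAS output: `unrolled_body` is a hypothetical helper, loads and stores are omitted, and the flat element numbering (a stride of `tile` between columns) is chosen to match the statement listing on the Explanation slide.

```python
def unrolled_body(mu, nu, ku, tile):
    """Return the multiply-add statements of a fully unrolled k loop.

    Each statement is a tuple (c, a, b) meaning `c += a * b`.
    Assumed flat numbering: c[i + j*tile], a[i + k*tile], b[k + j*tile].
    """
    stmts = []
    for k in range(ku):          # unrolled steps of the k loop
        for j in range(nu):      # NU columns of the register tile
            for i in range(mu):  # MU rows of the register tile
                stmts.append((f"c{i + j * tile}",
                              f"a{i + k * tile}",
                              f"b{k + j * tile}"))
    return stmts

# A 4 x 2 register tile, two unrolled steps of k, tile size 64.
body = unrolled_body(mu=4, nu=2, ku=2, tile=64)
for number, (c, a, b) in enumerate(body, start=1):
    print(f"{number:2d}  {c} += {a} * {b}")
```

The number of statements is MU * NU * KU, so the basic block grows quickly with the unrolling factors, which is what makes register allocation on it hard.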
13
Performance Evaluation for MM
- Figure regions: "no spill", "spills"
- Spills for NU:MU = 4:8
  - MIPSPro: 438
  - MIN: 892
14
Explanation
- Keep spilling c elements
- More stores
- Long dependency chain

Original:
     1  c0  += a0  * b0
     2  c1  += a1  * b0
     3  c2  += a2  * b0
     4  c3  += a3  * b0
     5  c64 += a0  * b64
     6  c65 += a1  * b64
     7  c66 += a2  * b64
     8  c67 += a3  * b64
     9  c0  += a64 * b1
    10  c1  += a65 * b1
    11  c2  += a66 * b1
    12  c3  += a67 * b1
    13  c64 += a64 * b65
    14  c65 += a65 * b65
    15  c66 += a66 * b65
    16  c67 += a67 * b65

Generated code (statement numbers on the right):
    load  R0, a0
    load  R1, b0
    load  R2, c0
    madd  R2, R2, R0, R1     (1)
    load  R3, a1
    load  R4, c1
    madd  R4, R4, R3, R1     (2)
    load  R5, a2
    store R4, c1
    load  R4, c2
    madd  R4, R4, R5, R1     (3)
    store R4, c2
    load  R4, a3
    store R2, c0
    load  R2, c3
    madd  R2, R2, R4, R1     (4)
    …
15
Solution
- Spill a, b, c elements
- Fewer stores
- Break dependency chain
- Use the instruction scheduling from MIPSPro

Scheduled by MIPSPro:
    c0  += a0  * b0
    c0  += a64 * b1
    c1  += a65 * b1
    c1  += a1  * b0
    c2  += a2  * b0
    c3  += a3  * b0
    c2  += a66 * b1
    c3  += a67 * b1
    c66 += a2  * b64
    c64 += a0  * b64
    c66 += a66 * b65
    c64 += a64 * b65
    c65 += a1  * b64
    c67 += a3  * b64
    c67 += a67 * b65
    c65 += a65 * b65
16
Performance Evaluation for MM
- 10% better when MU:NU is larger than 4:6
- Figure regions: "no spill", "spills"
- Spills for NU:MU = 4:8
  - MINSched: 356
  - MIPSPro: 438
  - MIN: 892
17
Outline
- Belady's MIN algorithm
- On FFT code
- On Matrix Multiplication code
- Conclusions
18
Conclusions
- Belady's MIN algorithm
  - Generates the minimum number of reloads and stores for the one-define, one-use problem
  - Performs better than state-of-the-art compilers like MIPSPro and GCC
  - Speedup: 12% (FFT32), 33% (FFT64), 10% (MM)
- Further benchmarks to be tested
20
Code Size after Register Allocation
21
Speedup from Fully Unrolling