Download presentation
Presentation is loading. Please wait.
Published byJett Warrior Modified over 9 years ago
1
Idempotent Code Generation: Implementation, Analysis, and Evaluation Marc de Kruijf ( ) Karthikeyan Sankaralingam CGO 2013, Shenzhen
2
Example 1 int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; } source code
3
Example 2 R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 assembly code FFF F 0 faults exceptions x load ? mis-speculations
4
Example 3 R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 assembly code B AD S TUFF H APPENS !
5
R0 and R1 are unmodified R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3 Example 4 assembly code just re-execute! convention: use checkpoints/buffers
6
It’s Idempotent! 5 idempoh… what…? int sum(int *data, int len) { int x = 0; for (int i = 0; i < len; ++i) x += data[i]; return x; } =
7
Idempotent Region Construction 6 previously… in PLDI ’12 idempotent regions A LL T HE T IME before: after:
8
Idempotent Code Generation 7 now… in CGO ’13 int sum(int *array, int len) { int x = 0; for (int i = 0; i < len; ++i) x += array[i]; return x; } how do we get from here...
9
Idempotent Code Generation 8 now… in CGO ’13 to here... R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
10
Idempotent Code Generation 9 now… in CGO ’13 not here (this is not idempotent)... R2 = load [R1] R1 = 0 LOOP: R4 = load [R0 + R2] R1 = add R1, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
11
Idempotent Code Generation 10 now… in CGO ’13 and not here (this is slow)... R3 = R1 R2 = load [R3] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
12
Idempotent Code Generation 11 now… in CGO ’13 here... R2 = load [R1] R3 = 0 LOOP: R4 = load [R0 + R2] R3 = add R3, R4 R2 = sub R2, 1 bnez R2, LOOP EXIT: return R3
13
12 FFF F 0 faults exceptions x load ? mis-speculations Hampton & Asanović, ICS ’06 De Kruijf & Sankaralingam, MICRO ’11 Menon et al., ISCA ’12 Kim et al., TOPLAS ’06 Zhang et al., ASPLOS ‘13 De Kruijf et al., ISCA ’10 Feng et al., MICRO ’11 De Kruijf et al., PLDI ’12 Idempotent Code Generation applications to prior work
14
Idempotent Code Generation 13 executive summary (1)how do we generate efficient idempotent code? (2) how do external factors affect overhead? (a)idempotent region size (b)instruction set (ISA) characteristics (c)control flow side-effects each can affect overheads by 10% or more algorithms made available in source code form: http://research.cs.wisc.edu/vertical/iCompiler not covered in this talk
15
14 Presentation Overview ❶ Introduction ❷ Analysis ❸ Evaluation (a)idempotent region size (b)ISA characteristics (c)control flow side-effects
16
Analysis 15 (a) idempotent region size region size overhead - number of inputs increasing - likelihood of spills growing - maximum spill cost reached - amortized over more instructions
17
Analysis 16 (b) ISA characteristics (1) two-address (e.g. x86) vs. three-address (e.g. ARM) ADD R1, R2 -> R1 Idempotent? NO! (2) register-memory (e.g. x86) vs. register-register (e.g. ARM) (3) number of available registers ADD R1, R2 = R3 idempotent? YES! for register-memory, register spills may be less costly (microarchitecture dependent) impact is obvious, but… more registers is not always enough (see back-up slide)
18
Analysis (c) control flow side-effects x =...... = f(x) y =... 17 region boundaries x ’s “shadow interval” given no side-effects x ’s live interval
19
Analysis (c) control flow side-effects x =...... = f(x) y =... 18 region boundaries x ’s “shadow interval” given side-effects x ’s live interval
20
19 Presentation Overview ❶ Introduction ❷ Analysis ❸ Evaluation (a)idempotent region size (b)ISA characteristics (c)control flow side-effects
21
Evaluation 20 methodology measurements – performance overhead: dynamic instruction count – for x86, using PIN – for ARM, using gem5 – region size: instructions between boundaries (path length) benchmarks – SPEC 2006, PARSEC, and Parboil suites
22
Evaluation region size overhead Y OU ARE HERE (baseline: typically 10-30 instructions) ? (a) idempotent region size 21 10+ instructions 13.1% (geometric mean)
23
region size overhead 22 detection latency ? ? Evaluation (a) idempotent region size 13.1%
24
region size overhead 23 detection latency re-execution time ? Evaluation (a) idempotent region size 0.06% 13.1% 11.1%
25
Evaluation 24 percentage overhead x86-64ARMv7 Three-address support matters more for FP benchmarks Register-memory matters more for integer benchmarks (b) ISA characteristics
26
Evaluation 25 percentage overhead no side-effectsside-effects (c) control flow side-effects substantial only in two cases; insubstantial otherwise intuition: typically compiler already spills for control flow divergence
27
26 Presentation Overview ❶ Introduction ❷ Analysis ❸ Evaluation
28
Conclusions 27 (a) region size – matters a lot; large regions are ideal if recovery is infrequent overheads approach zero as regions grow overheads drop below 10% only with careful co-design (b) instruction set – matters when region sizes must be small supporting control flow side-effects is not expensive (c) control flow side-effects – generally does not matter
29
Conclusions 28 code generation and static analysis algorithms http://research.cs.wisc.edu/vertical/iCompiler applicability not limited to architecture design see Zhang et al., ASPLOS ‘13: “ConAir: Featherweight Concurrency Bug Recovery [...]” thank you!
30
Back-up Slides 29
31
ISA Characteristics 30 more registers isn’t always enough x = 0; if (y > 0) x = 1; z = x + y; C code R0 = 0 if (R1 > 0) R0 = 1 R2 = R0 + R1
32
ISA Characteristics 31 more registers isn’t always enough R0 = 0 if (R1 > 0) R3 = R0 x = 0; if (y > 0) x = 1; z = x + y; C code R3 = 1 R2 = R3 + R1 need an extra instruction no matter what
33
32 data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only) percentage overhead ISA Characteristics idempotence vs. fewer registers no idempotence, #GPR reduced from 16
34
Very Large Regions 33 how do we get there? Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; hurts loops Problem #2: loop optimizations – boundaries in loops are bad for everyone (next slides) – loop blocking, fission/fusion, inter-change, peeling, unrolling, scalarization, etc. can all help Problem #3: large array structures – awareness of array access patterns can help (next slides) Problem #4: intra-procedural scope – limited scope aggravates all effects listed above
35
Very Large Regions 34 Re: Problem #2 (cut in loops are bad) i 0 = φ(0, i 1 ) i 1 = i 0 + 1 if (i 1 < X) for (i = 0; i < X; i++) {... } C codeCFG + SSA
36
Very Large Regions 35 Re: Problem #2 (cut in loops are bad) R0 = 0 R0 = R0 + 1 if (R0 < X) for (i = 0; i < X; i++) {... } C codemachine code N O B OUNDARIES = N O P ROBLEM
37
Very Large Regions 36 Re: Problem #2 (cut in loops are bad) R0 = 0 R0 = R0 + 1 if (R0 < X) for (i = 0; i < X; i++) {... } C codemachine code
38
Very Large Regions 37 Re: Problem #2 (cut in loops are bad) R1 = 0 R0 = R1 R1 = R0 + 1 if (R1 < X) for (i = 0; i < X; i++) {... } C codemachine code – “redundant” copy – extra boundary (pressure)
39
Very Large Regions 38 Re: Problem #3 (array access patterns) [x] = a; b = [x]; [x] = c; [x] = a; b = a; [x] = c; non-clobber antidependences… GONE! PLDI ‘12 algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays
40
Very Large Regions 39 Re: Problem #3 (array access patterns) not really practical for large arrays but if we don’t do it, non-clobber antidependences remain solution: handle potential non-clobbers in a post-pass (same way we deal with loop clobbers in static analysis) // initialize: int[100] array; memset(&array, 100*4, 0); // accumulate: for (...) array[i] += foo(i);
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.