1
Compiler Speculative Optimizations Wei Hsu 7/05/2006
2
Speculative Execution Speculative execution means executing code early, before it is known whether its result is needed (so some of the work may be wasted). In pipelined processors, speculative execution is often used to reduce the cost of branch mispredictions. Some processors automatically prefetch the next instruction and/or data cache lines into the on-chip caches, and prefetching has also been used for disk reads. More aggressive speculative execution appears in "run-ahead" or "execute-ahead" processors, which run ahead of stalled instructions to warm up the caches. Value prediction/speculation is another example.
3
Compiler Controlled Speculation Speculation is one of the most important methods for finding and exploiting ILP. It allows the execution to exploit statistical ILP (e.g., a branch is taken 90% of the time, or the address of pointer *p differs from the address of pointer *q most of the time). It overcomes the two most common constraints on instruction scheduling (and other optimizations): control dependence and memory dependence.
4
Compiler Controlled Speculation (cont.) Speculation allows the compiler to issue an operation early, before a dependence is resolved; it removes the operation's latency from the critical path and helps hide long-latency memory operations. Control speculation is the execution of an operation before the branch that guards it. Data speculation is the execution of a memory load before a preceding store that may alias with it. A source-level sketch of both forms follows.
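A minimal C-level sketch of the two forms. The names (cond, p, q, tmp) are illustrative and the p[i]->b access from the next slides is simplified to p->b; the explicit check in data_spec is what the ALAT hardware discussed later replaces.

    struct node { int b; };

    /* Control speculation: the load of p->b is hoisted above the branch
       that guards it, so it may execute even when cond is false and may
       fault if p holds a bad address -- hence the need for the special
       loads discussed on the following slides. */
    int control_spec(int cond, struct node *p, int a)
    {
        int tmp = p->b;              /* hoisted, speculative */
        if (cond)
            a = tmp;
        return a;
    }

    /* Data speculation: the load of *q is hoisted above the store to *p,
       which may alias it; an explicit fix-up repairs the value when the
       addresses turn out to be equal. */
    int data_spec(int *p, int *q, int a)
    {
        int tmp = *q;                /* hoisted, speculative */
        *p = a;
        if (p == q)
            tmp = a;                 /* fix-up on aliasing */
        return tmp + 1;
    }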
5
Control Speculation Example
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   sub $r3, $r6, …   bne …
Guarded block:   lw $r1, 0($r2)   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
In the guarded block there is no room to schedule the load! Why not move the load instruction into the previous block?
6
Control Speculation Example
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw $r1, 0($r2)   sub $r3, $r6, …   bne …
Guarded block:   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
1) Is the condition most likely to be true? Profile feedback may guide the optimization.
2) What if the address in p is bad and the hoisted load causes a memory fault? Can we have a special load instruction that ignores memory faults?
7
Control Speculation Example
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw $r1, 0($r2)   sub $r3, $r6, …   bne …
Guarded block:   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
What if the address in p is bad and the hoisted load causes a memory fault? Fault! Core dump. Can we have a special load instruction that ignores memory faults?
8
Control Speculation Example
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw.s $r1, 0($r2)   sub $r3, $r6, …   bne …
Guarded block:   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
Make the hoisted load a special instruction (lw.s) so that it never faults. For example, SPARC supports non-faulting load instructions that ignore segmentation faults.
9
Architecture Support in SPARC V9 SPARC V9 provides non-faulting loads (similar to the silent loads used in Multiflow's Trace and Cydrome's Cydra-5 computers). Non-faulting loads execute like any other loads except that segmentation-fault conditions do not cause program termination. To minimize page faults when a speculative load dereferences a null pointer (address zero), it is desirable to map low addresses (especially address zero) to a page with a special attribute. Non-faulting loads are often used for data prefetching, but not for general code motion.
10
Using Non-faulting Loads for Prefetching
Source:
  while (j < k) { i = Index[j][1]; x = array[i]; y = x + …; j += m; }
Compiled loop (may incur cache misses on each iteration):
  while (j < k) { load $r1, index[j][1]; load $r2, array($r1); add $r3, $r2, $r4; … }
Adding a prefetch of the element needed m iterations ahead:
  while (j < k) { load $r1, index[j][1]; load $r5, index[j+m][1]; load $r2, array($r1); prefetch array($r5); add $r3, $r2, $r4; … }
But the load of index[j+m][1] into $r5 may fault!
11
Using Non-faulting Loads for Prefetching
Source:
  while (j < k) { i = Index[j][1]; x = array[i]; y = x + …; j += m; }
Compiled loop (may incur cache misses on each iteration):
  while (j < k) { load $r1, index[j][1]; load $r2, array($r1); add $r3, $r2, $r4; … }
Since the look-ahead load of index[j+m][1] may fault, use a non-faulting load (nf-ld) for it:
  while (j < k) { load $r1, index[j][1]; nf-ld $r5, index[j+m][1]; load $r2, array($r1); prefetch array($r5); add $r3, $r2, $r4; … }
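A rough C rendering of the same idea, assuming the GCC/Clang __builtin_prefetch intrinsic (which never faults). The flattened index array, the out pointer, and the bounds guard are simplifications of the slide's example; the guard plays the role of the non-faulting load.

    void scan(const int *index, const double *array, double *out,
              int j, int k, int m)
    {
        while (j < k) {
            int i = index[j];
            double x = array[i];
            /* Prefetch the element that iteration j+m will use.  The
               prefetch itself cannot fault, but reading index[j+m] in C
               can, so it is guarded; a non-faulting load would make the
               guard unnecessary. */
            if (j + m < k)
                __builtin_prefetch(&array[index[j + m]]);
            *out++ = x + 1.0;
            j += m;
        }
    }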
12
Non-faulting Loads Are Insufficient
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw.s $r1, 0($r2)   sub $r3, $r6, …   bne …
Guarded block:   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
A special load can ignore a memory fault caused by a bad speculative address. But what if the real (non-speculative) load of p would have caused a memory fault? We cannot simply ignore it!
13
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw.s $r1, 0($r2)   sub $r3, $r6, …   bne …
Guarded block:   check.s $r1   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
Let's remember the fault status of the speculative load and check it (check.s $r1) when the loaded data is actually used.
14
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw.s $r1, 0($r2)   sub $r3, $r6, …   bne …
Guarded block:   check.s $r1, recovery   add $r3, $r1, $r4   lw $r5, 4($r3)   sw $r5, 4($sp)   …
Recovery code:   recovery:  lw $r1, 0($r2)
15
Source: if (cond) { A = p[i]->b; }
Previous block:  lw $r6, …   lw.s $r1, 0($r2)   sub $r3, $r6, …   add $r3, $r1, $r4   bne …
Guarded block:   check.s $r3, recovery   lw $r5, 4($r3)   sw $r5, 4($sp)   …
Recovery code:   recovery:  lw $r1, 0($r2)   add $r3, $r1, $r4
All instructions that are data dependent on the speculative load and moved with it must also go into the recovery block.
16
Architecture Support in IA-64: Control Speculation
Original (control dependence):
  (p1) br.cond
       ld8 r1 = [r2]
Transformed:
       ld8.s r1 = [r2]
       ...
  (p1) br.cond
       ...
       chk.s r1, recovery
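A toy C model of the ld8.s / chk.s semantics, for illustration only. The struct, spec_load, and the addr_ok flag are inventions standing in for the hardware's deferred-exception (NaT) mechanism, not real IA-64 intrinsics.

    #include <stdint.h>
    #include <stdbool.h>

    struct spec_val { uint64_t v; bool nat; };     /* nat = deferred fault */

    /* ld8.s: a load that defers the fault by setting nat instead of
       trapping.  addr_ok stands in for "this address would not fault". */
    static struct spec_val spec_load(const uint64_t *addr, bool addr_ok)
    {
        struct spec_val r = { 0, true };
        if (addr_ok) { r.v = *addr; r.nat = false; }
        return r;
    }

    uint64_t hoisted_use(const uint64_t *p, bool cond, bool addr_ok)
    {
        struct spec_val t = spec_load(p, addr_ok);  /* ld8.s, above the branch */
        uint64_t r3 = t.v + 4;                      /* speculated use          */
        if (!cond)
            return 0;                               /* result never needed     */
        if (t.nat)                                  /* chk.s r1, recovery      */
            r3 = *p + 4;                            /* recovery: redo the chain */
        return r3;
    }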
17
Data Speculation Example
Source: { *p = a; b = *q + 1; }
Code:  lw $r3, 4($sp)   sw $r3, 0($r1)   lw $r5, 0($r2)   addi $r6, $r5, 1   sw $r6, 8($sp)
In this block there is no room to schedule the load! How can we move the load instruction ahead of the store? $r2 and $r1 may be different most of the time, but they could be the same.
18
Data Speculation Example
Source: { *p = a; b = *q + 1; }
Code with the loads hoisted:  lw $r3, 4($sp)   lw $r5, 0($r2)   sw $r3, 0($r1)   addi $r6, $r5, 1   sw $r6, 8($sp)
19
Data Speculation Example
Source: { *p = a; b = *q + 1; }
Code with the loads hoisted and a fix-up check:
  lw $r3, 4($sp)   lw $r5, 0($r2)   sw $r3, 0($r1)   if (r1 == r2) copy $r5, $r3   addi $r6, $r5, 1   sw $r6, 8($sp)
What if there are m loads moving above n stores? m x n comparisons must be generated! So some hardware/architecture support is needed.
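A C sketch of why explicit checks do not scale: hoisting just two loads above two stores already needs 2 x 2 address comparisons. All names here are illustrative.

    /* Original order:  *p1 = a;  *p2 = b;  x = *q1;  y = *q2;  */
    void hoist_two_loads(long *p1, long *p2, long *q1, long *q2,
                         long a, long b, long *x, long *y)
    {
        long xv = *q1;                 /* hoisted, speculative    */
        long yv = *q2;                 /* hoisted, speculative    */
        *p1 = a;
        if (q1 == p1) xv = a;          /* check load 1 vs store 1 */
        if (q2 == p1) yv = a;          /* check load 2 vs store 1 */
        *p2 = b;
        if (q1 == p2) xv = b;          /* check load 1 vs store 2 */
        if (q2 == p2) yv = b;          /* check load 2 vs store 2 */
        *x = xv;
        *y = yv;
    }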
20
Architecture Support in IA-64: Data Speculation
Original (memory dependence):
  st4 [r3] = r7
  ld8 r1 = [r2]
Transformed:
  ld8.a r1 = [r2]
  ...
  st4 [r3] = r7
  ...
  chk.a r1, recovery
21
ALAT (Advanced Load Address Table)
Transformed code:
  ld8.a r1 = [r2]
  ...
  st4 [r3] = r7
  ...
  chk.a r1, recovery
Assume (r2) = 0x00001ab0: the advanced load allocates an ALAT entry (r1, 0x1ab0).
22
ALAT (Advanced Load Address Table)
Same transformed code as above; ALAT entry: (r1, 0x1ab0). Assume (r3) = 0x0000111a: the store address matches no ALAT entry, so the ALAT is unchanged. chk.a finds the r1 entry in the ALAT, and the check turns into a NOP.
23
ALAT (Advanced Load Address Table)
Same transformed code as above; ALAT entry: (r1, 0x1ab0). Assume (r3) = 0x00001ab0: the store address matches the ALAT entry, so the r1 entry is removed. chk.a then finds no entry for r1 in the ALAT; the check fails and execution branches to the recovery routine.
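A toy software model of the ALAT protocol shown on these slides. This is a sketch only: the real ALAT is a small hardware structure that also tracks access sizes and register types, which this ignores.

    #include <stdbool.h>
    #include <stdint.h>

    #define ALAT_ENTRIES 32

    struct alat_entry { bool valid; int reg; uintptr_t addr; };
    static struct alat_entry alat[ALAT_ENTRIES];

    /* ld.a: perform the load and allocate an ALAT entry (reg, addr). */
    static void alat_advanced_load(int reg, uintptr_t addr)
    {
        alat[reg % ALAT_ENTRIES] = (struct alat_entry){ true, reg, addr };
    }

    /* store: invalidate any entry whose tracked address matches. */
    static void alat_store(uintptr_t addr)
    {
        for (int i = 0; i < ALAT_ENTRIES; i++)
            if (alat[i].valid && alat[i].addr == addr)
                alat[i].valid = false;
    }

    /* chk.a: succeeds (acts as a NOP) if the entry survived; otherwise
       the code must branch to the recovery routine. */
    static bool alat_check(int reg)
    {
        struct alat_entry *e = &alat[reg % ALAT_ENTRIES];
        return e->valid && e->reg == reg;
    }

With (r2) = 0x00001ab0, the sequence alat_advanced_load(1, 0x1ab0); alat_store(0x111a); alat_check(1) returns true (the check is a NOP), while alat_store(0x1ab0) before the check would invalidate the entry and send execution to recovery, matching the two cases above.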
24
More Cases for Data Speculation Many high-performance architectural features are not effectively exploited by compilers due to imprecise analysis. Examples: automatic vectorization / parallelization, local memory allocation / assignment, register allocation, …
25
Examples
Vectorization:
  loop (k=1; k<n; k++)  a[k] = a[j] * b[k];  end
  What if a, b are pointers? What if j == k?
Register allocation:
  … = a->b;  *p = …;  … = a->b;
  Can we allocate a->b to a register? Could *p modify a->b, or a?
A loop-versioning sketch for the vectorization case follows.
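One compile-time fallback for the vectorization case is loop versioning: emit a runtime test that decides whether a[j] can participate in a dependence and pick a vectorizable or scalar copy of the loop accordingly. This is a sketch with illustrative names, and it ignores the separate question of whether a and b themselves overlap.

    /* a[k] = a[j] * b[k] for k = 1..n-1: a loop-carried dependence exists
       only if the invariant index j falls inside the written range.      */
    void scale(double *a, const double *b, int j, int n)
    {
        if (j < 1 || j >= n) {
            double aj = a[j];              /* never overwritten: hoist it  */
            for (int k = 1; k < n; k++)    /* safe to vectorize            */
                a[k] = aj * b[k];
        } else {
            for (int k = 1; k < n; k++)    /* possible dependence: scalar  */
                a[k] = a[j] * b[k];
        }
    }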
26
Complete alias and dependence analysis is costly and difficult: it needs interprocedural analysis, it is hard to handle dynamically allocated memory objects, and runtime disambiguation is expensive. But true memory dependences rarely happen!
27
Static and Dynamic Dependences Most ambiguous data dependences identified by the compiler do not occur at run time.
28
Speculation can compensate for imprecise alias information if speculation failures can be efficiently detected and recovered from. Can we effectively use hardware support to speculatively promote memory references to registers? Can we speculatively vectorize or parallelize loops?
29
Motivation Example
Original program:
  … = *p;  *q = …;  … = *p;
Traditionally compiled code:
  ld r32 = [r31];  *q = …;  ld r32 = [r31];  … = r32
Because *q may alias *p, the second load of *p cannot be eliminated.
30
Another Example
Original program:
  if (p->s1->s1->x1) { …  *ip = 0;  p->s1->s1->x1++;  … }
Traditionally compiled code:
  ld8 r14 = [r32]
  adds r14 = 8, r14
  ld8 r14 = [r14]
  ld4 r14 = [r14]
  cmp4 p6, p7 = 0, r14
  (p6) br …
  st [r16] = r0
  ld8 r14 = [r32]
  adds r14 = 8, r14
  ld8 r15 = [r14]
  ld4 r14 = [r15]
  adds r14 = 1, r14
  st4 [r15] = r14
The whole chain of loads for p->s1->s1->x1 is repeated after the store through ip, because *ip may alias any level of the chain.
31
Our Approach at UM Use alias profiles or compiler heuristics to obtain approximate alias information. Use data speculation to verify such alias information at run time. Use the Advanced Load Address Table (ALAT) in IA-64 as the necessary support for data speculation.
32
Background of ALAT in IA-64
33
Speculative Register Promotion Use ld.a for the first load and check the subsequent loads. Scheme 1: use ld.c for subsequent reads of the same reference. Scheme 2: use chk.a for subsequent reads; this allows promotion of multi-level pointer variables (e.g., if a->b->c is speculatively promoted to a register but a is aliased and modified, recovery code that reloads a, a->b, and a->b->c must be executed). A C-level sketch of scheme 2 follows.
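A C-level sketch of scheme 2 for a multi-level promotion such as a->b->c. The struct and field names are illustrative, and the alat_ok parameter is a hypothetical stand-in for the result of the chk.a on the promoted value.

    #include <stdbool.h>

    struct C { int c; };
    struct B { struct C *bc; };
    struct A { struct B *ab; };

    int read_twice(struct A *a, int *ip, bool alat_ok /* stand-in for chk.a */)
    {
        int t = a->ab->bc->c;        /* ld.a chain; value promoted to t      */
        int s = t;
        *ip = 0;                     /* store that may alias any level       */
        if (!alat_ok) {
            /* recovery: reload every level the promoted value depends on    */
            t = a->ab->bc->c;
        }
        s += t;                      /* second read served from the register */
        return s;
    }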
34
Examples
a. Read after read
  Source:  … = *p + 1;  *q = …;  … = *p + 3;
  Code:    ld.a r1 = [p]   add r3 = r1, 1   *q = …   ld.c r1 = [p]   add r4 = r1, 3
b. Read after write
  Source:  *p = …;  *q = …;  … = *p + 3;
  Code:    st [p] = r1   ld.a r1 = [p]   *q = …   ld.c r1 = [p]   add r4 = r1, 3
c. Multiple redundant loads
  Source:  … = *p;  *q = …;  … = *p;  *q = …;  … = *p;
  Code:    ld.a r1 = [p]   *q = …   ld.c.nc r1 = [p]   *q = …   ld.c.clr r1 = [p]
35
Compiler Support for Speculative Register Promotion
Enhanced SSA form with the notion of data speculation.
SSA form for indirect memory references:
  – χ operator: MayMod
  – μ operator: MayUse
Speculative SSA form:
  – χs operator: the variable in a χs is unlikely to be updated by the corresponding definition statement
  – μs operator: the variable in a μs is unlikely to be referenced by the indirect reference
36
Speculative SSA Form According to Alias Profiling
  *p = …    b2 = χ(b1)   a2 = χs(a1)   v2 = χ(v1)
  … = *p    μ(b1)   μs(a1)   μ(v1)
The two examples assume that the points-to set of p generated by the compiler is {a, b} and the points-to set of p obtained from alias profiling is {b}. v is the virtual variable for *p, and aj stands for version j of variable a.
37
Overview of Speculative Register Promotion*  Phi insertion → Rename → Down_safety → Will_be_available → Finalize → Code motion.  * Based on SSAPRE [Kennedy et al., ACM TOPLAS '99]
38
Enhanced Rename
(a) Traditional renaming:
  … = a1
  *p1 = …    v2 = χ(v1)   a2 = χ(a1)   b2 = χ(b1)
  … = a2
(b) Speculative renaming:
  … = a1
  *p1 = …    v2 = χ(v1)   a2 = χs(a1)   b2 = χ(b1)
  … = a1
The target set of *p generated by the compiler is {a, b}, the target set obtained from alias profiling is {b}, and v is the virtual variable for *p.
39
Example of Speculative Code Motion
(a) Before code motion:
  … = a1
  *p1 = …    v4 = χs(v3)   a2 = χs(a1)   b4 = χ(b3)
  … = a1
(b) Final output:
  t1 = a1 (ld.a)
  …
  *p1 = …    v4 = χ(v3)   a2 = χs(a1)   b4 = χ(b3)
  t4 = a1 (ld.c)
  …
40
Implementation Open Research Compiler v1.1. Benchmarks: SPEC2000 C programs. Platform: HP i2000, 733 MHz Itanium processor, 1 GB SDRAM, Red Hat Linux 7.1. Measurement: pfmon v1.1.
41
Example from Equake
Call site: smvp(,,,, disp[dispt], disp[disptplus]);
  void smvp(int nodes, double ***A, int *Acol, int *Aindex, double **v, double **w) {
    ...
    for (i = 0; i < nodes; i++) {
      ...
      while (Anext < Alast) {
        col = Acol[Anext];
        sum0 += A[Anext][0][0] * …
        sum1 += A[Anext][1][1] * …
        sum2 += A[Anext][2][2] * …
        w[col][0] += A[Anext][0][0]*v[i][0] + …
        w[col][1] += A[Anext][1][1]*v[i][1] + …
        w[col][2] += A[Anext][2][2]*v[i][2] + …
        Anext++;
  }}}
A[][][] and v[][] are not promoted to registers due to possible alias with w[][].
42
Example from Equake
(Same smvp code as on the previous slide.)
Promoting A[][][] and v[][] to registers using the ALAT improves this procedure by 10%.
43
Example from Equake
(Same smvp code as on the previous slides.)
Using heuristic rules, our compiler can promote both ***A and **v to registers. But using the alias profile, our compiler fails to promote **v, because at the call site v and w are passed with the same array name.
44
Performance Improvement of Speculative Register Promotion
45
Effectiveness of Speculative Register Promotion
46
Performance Improvement of Speculative Register Promotion based on Heuristic Rules
47
Performance Improvement of Speculative Register Promotion on Itanium-2
48
Advantages of Using Heuristic Rules Full coverage. Input-insensitive. Efficient. Scalable.
49
A Case for Using Profiles
  DO 140 L = L3, L4, 2
    Q(IJ(L))   = Q(IJ(L))   + W1(L)*QMLT(L)
    Q(IJ(L)+1) = Q(IJ(L)+1) + W2(L)*QMLT(L)
    ….
    Q(IJ(L+1))   = Q(IJ(L+1))   + W1(L+1)*QMLT(L+1)
    Q(IJ(L+1)+1) = Q(IJ(L+1)+1) + W2(L+1)*QMLT(L+1)
    ……
  140 CONTINUE
Heuristic rules think Q(IJ(L)) is different from Q(IJ(L+1)), but they are actually identical since IJ() is often sorted, e.g. 1,1,2,2,2,5,5,6,6,6,6,9,9,9.