Compiler Speculative Optimizations Wei Hsu 7/05/2006.


1 Compiler Speculative Optimizations Wei Hsu 7/05/2006

2 Speculative Execution
- Speculative execution is the early execution of code whose result may not be needed, so the work may be wasted.
- In pipelined processors, speculative execution is often used to reduce the cost of branch mispredictions.
- Some processors automatically prefetch the next instruction and/or data cache lines into the on-chip caches. Prefetching has also been used for disk reads.
- More aggressive speculative execution has been used in "run-ahead" or "execute-ahead" processors to warm up the caches.
- Value prediction/speculation is another example.

3 Compiler Controlled Speculation
- Speculation is one of the most important methods for finding and exploiting ILP.
- It lets execution exploit statistical ILP (e.g. a branch is taken 90% of the time, or the address of *p differs from the address of *q most of the time).
- It overcomes the two most common constraints on instruction scheduling (and other optimizations):
  - Control dependence
  - Memory dependence

4 Compiler Controlled Speculation (cont.)
- Allows the compiler to issue an operation early, before a dependency is resolved
- Removes the latency of the operation from the critical path
- Helps hide long-latency memory operations
- Control speculation: the execution of an operation before the branch that guards it
- Data speculation: the execution of a memory load before a preceding store that may alias with it

5 Control Speculation
Example source:
    …
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    sub  $r3, $r6 …
    bne  …
Guarded block:
    lw   $r1, 0($r2)
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
In the guarded block there is no room to schedule the load. Why not move the load instruction into the previous block?

6 Control Speculation
Example source:
    if (cond) { A = p[i]->b; }
Previous block (the load has been moved here):
    lw   $r1, 0($r2)
    lw   $r6, …
    sub  $r3, $r6 …
    bne  …
Guarded block:
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    …
1) Is cond most likely to be true? Profile feedback may guide the optimization.
2) What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?

7 Control Speculation
Example source:
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    lw   $r1, 0($r2)      <- Fault!! Core dump
    sub  $r3, $r6 …
    bne  …
Guarded block:
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    …
What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?

8 Control Speculation
Example source:
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    lw.s $r1, 0($r2)      <- make this a special instruction, so it never faults
    sub  $r3, $r6 …
    bne  …
Guarded block:
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    …
What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?
For example, Sparc supports non-faulting load instructions that can ignore segmentation faults.

9 Architecture Support in SparcV9
- SparcV9 provides non-faulting loads (similar to the silent loads used in Multiflow's Trace and Cydrome's Cydra-5 computers).
- Non-faulting loads execute like any other load, except that segmentation fault conditions do not cause program termination.
- To minimize page faults when a speculative load references a null pointer (address zero), it is desirable to map low addresses (especially address zero) to a page with a special attribute.
- Non-faulting loads are often used in data prefetching, but are not suitable for general code motion.

10 Using non-faulting loads for prefetching
Source loop:
    while (j < k) {
        i = Index[j][1];
        x = array[i];
        y = x + …
        j += m;
    }
Straightforward code (may incur cache misses on each iteration):
    while (j < k) {
        load $r1, index[j][1];
        load $r2, array($r1)
        add  $r3, $r2, $r4
        ….
    }
With prefetching (the load into $r5 may fault!!):
    while (j < k) {
        load $r1, index[j][1];
        load $r5, index[j+m][1];
        load $r2, array($r1)
        prefetch array($r5)
        add  $r3, $r2, $r4
        ….
    }

11 Using non-faulting loads for prefetching
Source loop:
    while (j < k) {
        i = Index[j][1];
        x = array[i];
        y = x + …
        j += m;
    }
Straightforward code (may incur cache misses on each iteration):
    while (j < k) {
        load $r1, index[j][1];
        load $r2, array($r1)
        add  $r3, $r2, $r4
        ….
    }
With prefetching (a plain load into $r5 may fault, so it becomes a non-faulting load, nf-ld):
    while (j < k) {
        load  $r1, index[j][1];
        nf-ld $r5, index[j+m][1];
        load  $r2, array($r1)
        prefetch array($r5)
        add   $r3, $r2, $r4
        ….
    }
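The prefetching loop on slides 10 and 11 can be approximated in C. This is only a sketch under several assumptions: the function name `sum_indexed` is hypothetical, the two-dimensional `index[j][1]` is flattened to a one-dimensional `idx[j]`, and because portable C has no non-faulting load, a bounds check stands in for `nf-ld`. `__builtin_prefetch` is a GCC/Clang builtin and is compiled out elsewhere.

```c
#include <stddef.h>

/* Hypothetical helper: sums array[idx[j]] for j = 0, m, 2m, ... < k,
 * prefetching the element needed one iteration ahead.
 * Assumes m > 0 and that every idx[j] is a valid index into array. */
double sum_indexed(const double *array, const int *idx, size_t k, size_t m)
{
    double sum = 0.0;
    for (size_t j = 0; j < k; j += m) {
        /* The look-ahead load of idx[j + m] is the one that "may fault"
         * on the slide. Here a bounds check plays the role of nf-ld. */
        if (j + m < k) {
#if defined(__GNUC__) || defined(__clang__)
            __builtin_prefetch(&array[idx[j + m]], 0, 1);
#endif
        }
        sum += array[idx[j]];   /* the real, non-speculative load */
    }
    return sum;
}
```

The bounds check costs a compare and branch per iteration, which is exactly the overhead a hardware non-faulting load avoids.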

12 Non-faulting Loads Insufficient
Example source:
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    lw.s $r1, 0($r2)
    sub  $r3, $r6 …
    beq  …
Guarded block:
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    beq  …
What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?
But what if the real load of p causes a memory fault? We cannot just ignore it!!

13
Example source:
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    lw.s $r1, 0($r2)
    sub  $r3, $r6 …
    bne  …
Guarded block:
    check.s $r1
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    …
What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?
But what if the real load of p causes a memory fault? We cannot just ignore it!!
Let's remember the fault status, and check it when the loaded data is actually used.

14
Example source:
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    lw.s $r1, 0($r2)
    sub  $r3, $r6 …
    bne  …
Guarded block:
    check.s $r1, recovery
    add  $r3, $r1, $r4
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    …
Recovery code:
    recovery: lw $r1, 0($r2)

15
Example source:
    if (cond) { A = p[i]->b; }
Previous block:
    lw   $r6, …
    lw.s $r1, 0($r2)
    sub  $r3, $r6 …
    add  $r3, $r1, $r4
    bne  …
Guarded block:
    check.s $r3, recovery
    lw   $r5, 4($r3)
    sw   $r5, 4($sp)
    …
Recovery code:
    recovery: lw  $r1, 0($r2)
              add $r3, $r1, $r4
All instructions that are data dependent on the speculative load and were moved with it must also go into the recovery block.
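The check-and-recover scheme of slides 13 to 15 can be modelled in plain C. The sketch below is a toy model, not real hardware behaviour: `spec_reg`, `load_s`, `add_s`, `load_real` and `guarded_add` are hypothetical names, a NULL pointer stands in for "bad address", and the deferred-fault flag plays the role of the remembered fault status (IA-64 calls this bit NaT).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a register written by a speculative load:
 * a value plus a deferred-fault flag. */
typedef struct {
    int  value;
    bool fault;
} spec_reg;

/* Models lw.s: a bad address (NULL here) sets the flag instead of trapping. */
static spec_reg load_s(const int *addr)
{
    spec_reg r = { 0, addr == NULL };
    if (!r.fault)
        r.value = *addr;
    return r;
}

/* A dependent operation moved with the load propagates the flag. */
static spec_reg add_s(spec_reg a, int b)
{
    spec_reg r = { a.value + b, a.fault };
    return r;
}

/* Models the ordinary load: here the fault is actually reported. */
static int load_real(const int *addr, int *value)
{
    if (addr == NULL)
        return -1;
    *value = *addr;
    return 0;
}

/* Returns 0 on success, -1 if the non-speculative path really faults. */
int guarded_add(const int *p, int r4, bool cond, int *out)
{
    spec_reg r1 = load_s(p);       /* lw.s hoisted above the branch */
    spec_reg r3 = add_s(r1, r4);   /* add hoisted with it */

    if (!cond)
        return 0;                  /* result unused: the fault is ignored */

    if (r3.fault) {                /* check.s $r3, recovery */
        int v;
        /* Recovery block: redo the load and the dependent add. In this
         * toy model the retry always fails; on real hardware a deferred
         * exception may also be one that succeeds on retry. */
        if (load_real(p, &v) != 0)
            return -1;
        r3.value = v + r4;
    }
    *out = r3.value;
    return 0;
}
```

Note that the recovery block repeats both the load and the add, matching the rule that every moved instruction depending on the speculative load must be re-executed.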

16 Architecture Support in IA64: control speculation
original (control dependence between the branch and the load):
    (p1) br.cond
    ld8 r1 = [ r2 ]
transformed:
    ld8.s r1 = [ r2 ]
    ...
    (p1) br.cond
    ...
    chk.s r1, recovery

17 Data Speculation
Example source:
    { *p = a; b = *q + 1; }
Code:
    lw   $r3, 4($sp)
    sw   $r3, 0($r1)
    lw   $r5, 0($r2)
    addi $r6, $r5, 1
    sw   $r6, 8($sp)
In this block there is no room to schedule the load. How can we move the load instruction ahead of the store? $r2 and $r1 may be different most of the time, but could possibly be the same.

18 Data Speculation
Example source:
    { *p = a; b = *q + 1; }
Code with the load moved above the store:
    lw   $r3, 4($sp)
    lw   $r5, 0($r2)
    sw   $r3, 0($r1)
    addi $r6, $r5, 1
    sw   $r6, 8($sp)

19 Data Speculation
Example source:
    { *p = a; b = *q + 1; }
Code with the load moved above the store and a software check:
    lw   $r3, 4($sp)
    lw   $r5, 0($r2)
    sw   $r3, 0($r1)
    if (r1 == r2) copy $r5, $r3
    addi $r6, $r5, 1
    sw   $r6, 8($sp)
What if m loads are moved above n stores?
- m x n comparisons must be generated!!
- So some hardware/architecture support is needed.
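The software check on slide 19 translates directly into C. This is a minimal sketch with a hypothetical function name, `store_then_load`; it assumes both pointers refer to properly aligned `int` objects, so two accesses either coincide exactly or do not overlap at all.

```c
/* Models: *p = a; b = *q + 1;  with the load of *q hoisted above the store.
 * Returns the value of b. */
int store_then_load(int *p, int a, const int *q)
{
    int r5 = *q;      /* lw $r5, 0($r2): executed early, before the store */
    *p = a;           /* sw $r3, 0($r1) */
    if (p == q)       /* if (r1 == r2) copy $r5, $r3 */
        r5 = a;       /* the early load read a stale value: patch it */
    return r5 + 1;    /* addi $r6, $r5, 1 */
}
```

With one load and one store a single comparison is enough, but each additional hoisted load must be compared against every store it crossed, which is the m x n growth the slide warns about.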

20 Architecture Support in IA64: data speculation
original (memory dependence between the store and the load):
    st4 [ r3 ] = r7
    ld8 r1 = [ r2 ]
transformed:
    ld8.a r1 = [ r2 ]
    ...
    st4 [ r3 ] = r7
    ...
    chk.a r1, recovery

21 ALAT (Advance Load Address Table): data speculation
original:
    st4 [ r3 ] = r7
    ld8 r1 = [ r2 ]
transformed:
    ld8.a r1 = [ r2 ]
    ...
    st4 [ r3 ] = r7
    ...
    chk.a r1, recovery
Assume (r2) = 0x00001ab0. The ld8.a creates an ALAT entry:
    register r1, address 0x1ab0

22 ALAT (Advance Load Address Table): data speculation
original:
    st4 [ r3 ] = r7
    ld8 r1 = [ r2 ]
transformed:
    ld8.a r1 = [ r2 ]
    ...
    st4 [ r3 ] = r7
    ...
    chk.a r1, recovery
ALAT entry: register r1, address 0x1ab0
Assume (r3) = 0x0000111a. The store address has no match in the ALAT, so the ALAT is unchanged. chk.a finds the r1 entry in the ALAT and turns into a NOP.

23 ALAT (Advance Load Address Table): data speculation
original:
    st4 [ r3 ] = r7
    ld8 r1 = [ r2 ]
transformed:
    ld8.a r1 = [ r2 ]
    ...
    st4 [ r3 ] = r7
    ...
    chk.a r1, recovery
ALAT entry: register r1, address 0x1ab0
Assume (r3) = 0x00001ab0. The store address matches an entry in the ALAT, so the r1 entry is removed. chk.a finds no entry for r1 in the ALAT, the check fails, and control branches to the recovery routine.
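The ALAT walkthrough on slides 21 to 23 can be mimicked with a small table in C. This is a simplified model under stated assumptions: the names (`alat`, `alat_advanced_load`, `alat_store`, `alat_check`), the table size, and the eviction policy are all made up for illustration, and entries match on the exact full address only, whereas real hardware also accounts for access sizes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ALAT_ENTRIES 8   /* assumed size, not taken from the slides */

typedef struct { bool valid; int reg; uintptr_t addr; } alat_entry;
typedef struct { alat_entry e[ALAT_ENTRIES]; } alat;

void alat_init(alat *t) { memset(t, 0, sizeof *t); }

/* ld.a: record (register, address), replacing any old entry for reg. */
void alat_advanced_load(alat *t, int reg, uintptr_t addr)
{
    int slot = -1;
    for (int i = 0; i < ALAT_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].reg == reg) { slot = i; break; }
        if (!t->e[i].valid && slot < 0) slot = i;
    }
    if (slot < 0)
        slot = 0;                    /* table full: evict entry 0 (arbitrary) */
    t->e[slot] = (alat_entry){ true, reg, addr };
}

/* Store: remove every entry whose address matches the store address. */
void alat_store(alat *t, uintptr_t addr)
{
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].addr == addr)
            t->e[i].valid = false;
}

/* chk.a: true if the entry survived (the check acts as a NOP),
 * false if the recovery routine must run. */
bool alat_check(const alat *t, int reg)
{
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].reg == reg)
            return true;
    return false;
}
```

Replaying the slides: after `alat_advanced_load(&t, 1, 0x1ab0)`, a store to 0x111a leaves the entry in place and the check passes, while a store to 0x1ab0 removes it and the check fails. The table does the work of the m x n software comparisons, since every store is compared against all entries at once.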

24 More Cases for Data Speculation
- Many high-performance architectural features are not effectively exploited by compilers because of imprecise analysis. Examples:
  - Automatic vectorization / parallelization
  - Local memory allocation / assignment
  - Register allocation
  - …

25 Examples
- Vectorization
    loop (k=1; k<n; k++)
        a[k] = a[j] * b[k];
    end
  What if a, b are pointers? What if j == k?
- Register allocation
    = a->b;
    *p = …
    = a->b;
  Can we allocate a->b to a register? Could *p modify a->b? Or a?
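The "what if j == k?" question on slide 25 can be made concrete with a runtime check in C. This sketch uses software disambiguation rather than the hardware speculation discussed elsewhere in the deck; the function name `scale` is hypothetical, and it assumes `a` and `b` do not overlap and that `a[j]` is a valid element.

```c
#include <stddef.h>

/* Computes a[k] = a[j] * b[k] for k = 1 .. n-1. */
void scale(double *a, const double *b, size_t n, size_t j)
{
    if (j < 1 || j >= n) {
        /* a[j] is never written by the loop, so it can live in a
         * register and the loop body has no cross-iteration dependence. */
        double aj = a[j];
        for (size_t k = 1; k < n; k++)
            a[k] = aj * b[k];
    } else {
        /* j == k on one iteration, after which a[j] has a new value:
         * fall back to the original loop. */
        for (size_t k = 1; k < n; k++)
            a[k] = a[j] * b[k];
    }
}
```

The first loop is the form a vectorizer can handle; the second preserves the original semantics for the rare case where the loop overwrites the value it keeps reading.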

26
- Complete alias and dependence analysis is costly and difficult:
  - it needs inter-procedural analysis
  - dynamically allocated memory objects are hard to handle
  - runtime disambiguation is expensive
But … true memory dependences rarely happen!!

27 Static and Dynamic Dependences
- Most ambiguous data dependences identified by the compiler do not occur at runtime.

28
- Speculation can compensate for imprecise alias information if speculation failures can be efficiently detected and recovered from.
- Can we effectively use hardware support to speculatively promote memory references to registers?
- Can we speculatively vectorize or parallelize loops?

29 Motivation Example
Original program:
    … = *p
    *q = ..
    … = *p
Traditional compiler code:
    ld r32=[r31]
    *q = …
    ld r32=[r31]
    … = r32

30 Another Example
Original program:
    if (p->s1->s1->x1) {
        ….
        *ip = 0;
        p->s1->s1->x1++;
        ….
    }
Traditional compiled code:
    ld8  r14=[r32]
    adds r14=8,r14
    ld8  r14 = [r14]
    ld4  r14 = [r14]
    cmp4 p6,p7=0,r14
    (p6) br….
    st   [r16] = r0
    ld8  r14=[r32]
    adds r14=8,r14
    ld8  r15 = [r14]
    ld4  r14 = [r15]
    adds r14=1,r14
    st4  [r15] = r14

31 Our approach at UM
- Use an alias profile or compiler heuristics to obtain approximate alias information
- Use data speculation to verify that alias information at run time
- Use the Advance Load Address Table (ALAT) in IA64 as the necessary support for data speculation

32 Background of ALAT in IA64

33 Speculative Register Promotion
- Use ld.a for the first load
- Check the subsequent loads
  - Scheme 1: use ld.c for subsequent reads of the same reference.
  - Scheme 2: use chk.a for subsequent reads. This allows promotion of multi-level pointer variables (e.g. if a->b->c is speculatively promoted to a register, but a is aliased and modified, then recovery code that reloads a, a->b and a->b->c must be executed).

34 Examples
a. read after read
   Source:
       =*p+1;
       *q=…
       =*p+3;
   Code:
       ld.a r1=[p]
       add  r3=r1,1
       *q = ….
       ld.c r1=[p]
       add  r4=r1,3
b. read after write
   Source:
       *p= ;
       *q =….
       …=*p+3;
   Code:
       st   [p]=r1
       ld.a r1=[p]
       *q = ….
       ld.c r1=[p]
       add  r4=r1,3
c. multiple redundant loads
   Source:
       =*p;
       *q = …
       =*p;
       *q = …
       =*p;
   Code:
       ld.a     r1=[p]
       *q = …
       ld.c.nc  r1=[p]
       *q = …
       ld.c.clr r1=[p]
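Case (a), read after read, can be traced with a one-entry stand-in for the ALAT. The sketch below is a toy model with hypothetical names (`promo_state`, `ld_a`, `st`, `ld_c`, `read_after_read`); it only shows when `ld.c` can reuse the register copy and when it has to reload from memory.

```c
#include <stddef.h>

/* One-entry stand-in for the ALAT: the address recorded by ld.a,
 * or NULL once a store has hit it. Also counts reloads. */
typedef struct { const int *tracked; int reloads; } promo_state;

/* ld.a: load and start tracking the address. */
static int ld_a(promo_state *s, const int *p)
{
    s->tracked = p;
    return *p;
}

/* Store: a matching address invalidates the tracked entry. */
static void st(promo_state *s, int *q, int v)
{
    *q = v;
    if (q == s->tracked)
        s->tracked = NULL;
}

/* ld.c: keep the register value r if the entry survived, else reload. */
static int ld_c(promo_state *s, const int *p, int r)
{
    if (s->tracked == p)
        return r;            /* no memory access needed */
    s->reloads++;
    s->tracked = p;
    return *p;
}

/* Models: ... = *p + 1;  *q = v;  ... = *p + 3;
 * Returns the second expression and reports how many reloads occurred. */
int read_after_read(int *p, int *q, int v, int *reloads)
{
    promo_state s = { NULL, 0 };
    int r1 = ld_a(&s, p);     /* ld.a r1=[p] */
    int r3 = r1 + 1;          /* add r3=r1,1 */
    (void)r3;
    st(&s, q, v);             /* *q = ... */
    r1 = ld_c(&s, p, r1);     /* ld.c r1=[p] */
    *reloads = s.reloads;
    return r1 + 3;            /* add r4=r1,3 */
}
```

When `p` and `q` differ, the second read costs no memory access; when they are equal, the reload happens and the result reflects the stored value.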

35 Compiler Support for Speculative Register Promotion
- Enhanced SSA form with the notion of data speculation
- SSA form for indirect memory references
  - χ operator: MayMod
  - μ operator: MayUse
- Speculative SSA form
  - χ_s operator: the variable in χ_s is unlikely to be updated by the corresponding definition statement
  - μ_s operator: the variable in μ_s is unlikely to be referenced by the indirect reference

36 Speculative SSA Form According To Alias Profiling
Store example:
    *p = …
    b_2 ← χ(b_1)
    a_2 ← χ(a_1)
    v_2 ← χ(v_1)
Load example:
    μ(b_1)
    μ(a_1)
    μ(v_1)
    … = *p
(Four of these operators carry the speculative subscript s on the slide; the transcript does not preserve which ones.)
The two examples assume that the points-to set of p generated by the compiler is {a, b} and that the points-to set of p obtained from alias profiling is {b}. v is the virtual variable for *p. a_j stands for version j of variable a.

37 Overview of Speculative Register Promotion*
- Phi insertion
- Rename
- Down_safety
- Will_be_available
- Finalize
- Code motion
* Based on SSAPRE [Kennedy et al., ACM TOPLAS '99]

38 Enhanced Rename
(a) Traditional renaming:
    … = a_1
    *p_1 = …
    v_2 ← χ(v_1)
    a_2 ← χ(a_1)
    b_2 ← χ(b_1)
    … = a_2
(b) Speculative renaming:
    … = a_1
    *p_1 = …
    v_2 ← χ(v_1)
    a_2 ← χ_s(a_1)
    b_2 ← χ(b_1)
    … = a_1
The target set of *p generated by the compiler is {a, b}, and v is the virtual variable for *p. The target set of *p generated by alias profiling is {b}.

39 Example of Speculative Code Motion
(a) Before code motion:
    … = a_1
    *p_1 = …
    v_4 ← χ_s(v_3)
    a_2 ← χ_s(a_1)
    b_4 ← χ(b_3)
    … = a_1
(b) Final output:
    t_1 = a_1 (ld.a)
    …
    *p_1 = …
    v_4 ← χ(v_3)
    a_2 ← χ_s(a_1)
    b_4 ← χ(b_3)
    t_4 = a_1 (ld.c)
    …

40 Implementation
- Open Research Compiler v1.1
- Benchmark
  - Spec2000 C programs
- Platform
  - HP i2000, 733 MHz Itanium processor, 1GB SDRAM
  - Redhat Linux v7.1
- Pfmon v1.1

41 Example from Equake
Call site:
    smvp(,,,, disp[dispt], disp[disptplus]);
Procedure:
    void smvp(int nodes, double ***A, int *Acol, int *Aindex, double **v, double **w) {
      ...
      for (i = 0; i < nodes; i++) {
        ...
        while (Anext < Alast) {
          col = Acol[Anext];
          sum0 += A[Anext][0][0] * …
          sum1 += A[Anext][1][1] * …
          sum2 += A[Anext][2][2] * …
          w[col][0] += A[Anext][0][0]*v[i][0] + …
          w[col][1] += A[Anext][1][1]*v[i][1] + …
          w[col][2] += A[Anext][2][2]*v[i][2] + …
          Anext++;
        }}}
A[][][] and v[][] are not promoted to registers because of a possible alias with w[][].

42 Example from Equake
(Same call site and smvp() loop as on slide 41.)
Promoting A[][][] and v[][] to registers using the ALAT improves this procedure by 10%.

43 Example from Equake
(Same call site and smvp() loop as on slide 41.)
Using heuristic rules, our compiler can promote both ***A and **v to registers. Using the alias profile, however, our compiler fails to promote **v, because at the call site v and w are passed with the same array name.

44 Performance Improvement of Speculative Register Promotion

45 Effectiveness of Speculative Register Promotion

46 Performance Improvement of Speculative Register Promotion based on Heuristic Rules

47 Performance Improvement of Speculative Register Promotion on Itanium-2

48 Advantages of Using Heuristic Rules
- Full coverage.
- Input-insensitive.
- Efficient.
- Scalable.

49 A case for using Profiles
          DO 140 L = L3,L4, 2
            Q(IJ(L)) = Q(IJ(L))+W1(L)*QMLT(L)
            Q(IJ(L)+1) = Q(IJ(L)+1)+W2(L)*QMLT(L)
            ….
            Q(IJ(L+1))=Q(IJ(L+1))+W1(L+1)*QMLT(L+1)
            Q(IJ(L+1)+1)=Q(IJ(L+1)+1)+W2(L+1)*QMLT(L+1)
            ……
      140 CONTINUE
Heuristic rules assume that Q(IJ(L)) is different from Q(IJ(L+1)), but the two are often identical because IJ() is often sorted and contains repeated values, e.g. 1,1,2,2,2,5,5,6,6,6,6,9,9,9
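A very small alias profile for this loop could simply count how often neighbouring entries of IJ are equal. The function below is an illustrative sketch, not the profiler used in the talk; the name `count_adjacent_aliases` is hypothetical, and it inspects every adjacent pair rather than only the pairs visited by the stride-2 loop.

```c
#include <stddef.h>

/* Counts positions l where ij[l] == ij[l + 1], i.e. where Q(IJ(L)) and
 * Q(IJ(L+1)) would refer to the same element of Q. */
size_t count_adjacent_aliases(const int *ij, size_t n)
{
    size_t hits = 0;
    for (size_t l = 0; l + 1 < n; l++)
        if (ij[l] == ij[l + 1])
            hits++;
    return hits;
}
```

A high count relative to the number of pairs indicates that treating the two references as independent would trigger frequent recovery, which is the kind of information a heuristic rule cannot see.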

