1
Loop Tiling for Iterative Stencil Computations
Marta Jiménez
2
What is an Iterative Stencil Computation?
ISC are often performed for PDE, GM, IP
– swim, tomcatv, mgrid (from SPEC95 benchmark)
– Jacobi

    DO K = 1, NITER    /* time-step loop */
      do J = ...
        do I = ...
          {A(I,J), A(I+1,J), …}
        enddo
      enddo
      {wrapped-around computations}
    ENDDO

[Figure: Matrix A]
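To make the loop structure above concrete, here is a minimal sketch of a Jacobi-style iterative stencil. The 4-point average, the function names `jacobi_step` and `iterate`, and the use of modulo indexing for the wrapped-around boundary are illustrative assumptions; the slides do not give the actual stencil, and SWIM handles its wrap-around in separate loops.

```python
def jacobi_step(a):
    """One time step of a 4-point stencil with wrap-around (periodic) boundaries."""
    n, m = len(a), len(a[0])
    new = [[0.0] * m for _ in range(n)]
    for j in range(n):                      # J loop
        for i in range(m):                  # I loop
            new[j][i] = 0.25 * (a[(j - 1) % n][i] + a[(j + 1) % n][i]
                                + a[j][(i - 1) % m] + a[j][(i + 1) % m])
    return new


def iterate(a, niter):
    """Time-step loop (K in the slides): every step sweeps the whole matrix."""
    for _ in range(niter):
        a = jacobi_step(a)
    return a
```

Each time step touches the whole matrix, so once the matrix is larger than the cache, no data is reused across time steps. That is the reuse tiling tries to recover.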
3
Loop Tiling
– divides the iteration space (IS) into regular tiles to make the working set fit in the memory level being exploited
– can be applied hierarchically (Multilevel Tiling)
Current algorithms for Loop Tiling are limited to loops that:
– are “perfectly” nested
– are fully permutable
– define a rectangular IS
However, in iterative stencil computations, loops are:
– NOT perfectly nested
– NOT fully permutable
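As a point of reference for the restrictions listed above, the following sketch tiles a perfectly nested, fully permutable, rectangular 2D loop nest. The names `untiled` and `tiled` and the tile sizes `tj`, `ti` are hypothetical. It only records the visiting order, to show that tiling reorders iterations without adding or dropping any.

```python
def untiled(n, m):
    """Original order: J outer, I inner."""
    return [(j, i) for j in range(n) for i in range(m)]


def tiled(n, m, tj, ti):
    """Same iterations, visited tile by tile (tiles of tj x ti points)."""
    order = []
    for jj in range(0, n, tj):                        # loops over tiles
        for ii in range(0, m, ti):
            for j in range(jj, min(jj + tj, n)):      # loops inside one tile
                for i in range(ii, min(ii + ti, m)):
                    order.append((j, i))
    return order
```

The reordering is legal here only because the nest is fully permutable. The time-step loop of a stencil code does not have that property, which is why the rest of the talk needs skewed tiles.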
4
Today’s talk
Show how Loop Tiling can be applied to iterative stencil computations
– based on Song & Li’s paper [PLDI99]
Define a Program Model
1 level of 1D-Tiling (cache)
– program example: SWIM
2 levels of Tiling
– 2D-Tiling at the cache level
– 1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98])
Performance Results
– Loop Tiling on EV5 & EV6
5
Steps
1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
6
1st Step: achieve desired program model
Program Model:

    DO K = 1, NITER    /* time-step loop */
      do J1 = L_J1, U_J1
        do I1 = L_I1, U_I1
          {A(I,J), A(I+1,J), …}
        enddo
      enddo
      ...
      do Jm = L_Jm, U_Jm
        do Im = L_Im, U_Im
          {A(I,J), A(I+1,J), …}
        enddo
      enddo
    ENDDO

Usually, programs are NOT directly written in this form
– We must apply a set of transformations to achieve this program model
7
SWIM original code

    initializations
    90 NCYCLE = NCYCLE + 1
       CALL CALC1
       CALL CALC2
       IF (NCYCLE >= ITMAX) STOP
       IF (NCYCLE <= 1) THEN
         CALL CALC3Z
       ELSE
         CALL CALC3
       ENDIF
       GO TO 90

    SUBROUTINE CALCX
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
    c wrapped-around computations
      do J = 1,N
        ...
      enddo
      do I = 1,M
        ...
      enddo
      ...

Transformations
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
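The GO TO conversion and the peeling can be checked on the call sequence alone. The sketch below models the control flow of the code above and compares it with a peeled DO-loop version. It assumes the initializations set NCYCLE to 0 and that ITMAX is at least 2; the function names `goto_version` and `peeled_version` are hypothetical.

```python
def goto_version(itmax):
    """Call trace of the original control flow ("90 ... GO TO 90")."""
    trace, ncycle = [], 0           # assumption: initializations set NCYCLE = 0
    while True:
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:         # IF (NCYCLE >= ITMAX) STOP
            break
        trace.append("CALC3Z" if ncycle <= 1 else "CALC3")
    return trace


def peeled_version(itmax):
    """Same trace with the first and last time steps peeled off (needs itmax >= 2)."""
    assert itmax >= 2
    trace = ["CALC1", "CALC2", "CALC3Z"]        # peeled first iteration
    for _k in range(2, itmax):                  # DO K = 2, ITMAX-1
        trace += ["CALC1", "CALC2", "CALC3"]    # no IF on NCYCLE left in the body
    trace += ["CALC1", "CALC2"]                 # peeled last iteration
    return trace
```

After peeling, the remaining loop runs K from 2 to ITMAX-1 with an IF-free body, which matches the bounds used on the following slides.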
8
Wrapped-around Computations

    DO K = 2, ITMAX-1
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
      wrapped-around comp
      do J = 1,N
        ...
      enddo
      do I = 1,M
        ...
      enddo
      ...
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
      ...
    ENDDO

[Figure: code regions labeled CALC1, CALC2, CALC3; axes J and I]
9
Wrapped-around Computations
Projection along direction I

    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around comp
      do J = 1,N
        ...
      enddo
      do J = 1,N
        ...
      enddo
      wrapped-around comp
      do J = 1,N
        ...
      enddo
      ...
    ENDDO

– Another way of dealing with the wrapped-around computations is performing code sinking
[Figure: axis J]
10
1st Step: achieved program model

    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
    ENDDO

Flow dependencies & iteration space for SWIM (Projection along direction I)
[Figure: iteration space with axis J (1 to N) and the K-loop (time) for K=2 and K=3; rows labeled CALC1, CALC2, CALC3]
11
Steps
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
12
1D-Tiling
Tiling parameters: SLOPE, OFFSETS-i
[Figure: iteration space over J (1 to N) for K=2, K=3, K=4; annotations: "Dependencies are violated", SLOPE, OFFSET-i]
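The role of SLOPE can be illustrated on a much simpler case than SWIM. The sketch below is an assumption-laden toy: a single 3-point 1D stencil with two ping-pong buffers and fixed boundary points, so there is only one loop nest and no per-nest OFFSET-i. The names `step`, `reference` and `skewed_tiled` are hypothetical. Each tile is shifted left by `slope` points per time step, so that every value a tile needs was already produced by the same tile or an earlier one.

```python
def step(src, dst, j):
    """3-point stencil update of one interior point."""
    dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0


def reference(a0, nsteps):
    """Untiled version: a full sweep over J for every time step."""
    bufs = [list(a0), list(a0)]
    n = len(a0)
    for t in range(nsteps):
        src, dst = bufs[t % 2], bufs[(t + 1) % 2]
        for j in range(1, n - 1):
            step(src, dst, j)
    return bufs[nsteps % 2]


def skewed_tiled(a0, nsteps, tile, slope=1):
    """1D tiling of J with the tile shifted left by `slope` points per time step."""
    bufs = [list(a0), list(a0)]
    n = len(a0)
    last = (n - 2) + slope * (nsteps - 1)        # largest skewed coordinate j + slope*t
    for jj in range(1, last + 1, tile):          # loop over tiles, left to right
        for t in range(nsteps):                  # time-step loop inside the tile
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            lo = max(1, jj - slope * t)
            hi = min(n - 2, jj + tile - 1 - slope * t)
            for j in range(lo, hi + 1):
                step(src, dst, j)
    return bufs[nsteps % 2]
```

For this toy stencil a slope of 1 is enough. With `slope=0` the tiles become rectangular and the results are no longer guaranteed to match `reference`, which is the "dependencies are violated" situation in the figure.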
13
2D-Tiling
Tiling parameters: SLOPE, OFFSETS-i for each tiled dimension (J and I)
– Computed using the JI-loop distance subgraph
[Figure: iteration space with axes K (time-step loop), J (1 to N) and I (1 to M)]
14
JI-loop Distance Subgraph
– Each node represents a JI-loop nest
– Each edge represents a dependence (distance vector)
[Figure: graph with nodes JI_1-loop, JI_2-loop, JI_3-loop; edge types: flow dependencies, anti-dependencies, output dependencies; edge labels: [1,-1,0] [1,0,-1] [1,-1,-1] [1,0,0] [0,0,0] [1,-1,0] [1,0,-1] [1,-1,0] [0,0,0]]
15
SWIM: Projection along direction I
Wrapped-around Computations
Backward dependencies with large distances make Tiling not profitable
– apply Circular Loop Skewing to shorten backward dependencies

    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
    ENDDO

[Figure: iteration space with axis J (1 to N) and the K-loop (time) for K=2 and K=3]
16
Circular Loop Skewing
Shortens backward dependencies by changing the iteration order
CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)
[Figure: iteration space over J (1 to N) for K=2 and K=3, before and after skewing; annotations: BETA-i, DELTA]
17
Circular Loop Skewing

    DO K = 2, ITMAX-1
      do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
      do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
      do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
    ENDDO
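The effect of the skewed bounds and the MOD can be seen by listing the J values one loop visits. This sketch assumes DELTA multiplies (K-2), as the constant DELTA=2 used for SWIM later in the talk suggests; the helper name `cls_order` is hypothetical.

```python
def cls_order(n, beta, delta, k):
    """J values (1-based) visited by one circularly skewed J loop at time step k."""
    shift = beta + delta * (k - 2)
    # do JX = 1+shift, N+shift ;  J = MOD(JX-1, N) + 1
    return [(jx - 1) % n + 1 for jx in range(1 + shift, n + shift + 1)]
```

For example, `cls_order(6, 1, 2, 3)` gives `[4, 5, 6, 1, 2, 3]`. Every J in 1..N is still visited exactly once, but the starting point rotates by BETA-i plus DELTA per time step.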
18
2nd Step: 2D-Tiling for cache level

    DO JJ = ...
      DO II = ...
        DO K = ...
          if (first tile) then
            do JX = ...
              offsets iter.
            enddo
          endif
          do JX = ...
            Iter. inside tile
          enddo
          do JX = ...
            Iter. inside tile
          enddo
          do JX = ...
            Iter. inside tile
          enddo
        ENDDO
      ENDDO
    ENDDO

SWIM: projection along direction I
CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0
[Figure: tiled iteration space over J (1 to N) for K=2, K=3, K=4]
19
Steps
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
20
3rd Step: 1D-Tiling for register level

    DO JJ = ...
      DO II = ...
        DO K = ...
          ...
          do JX = L_J, U_J
            J = MOD(JX-1, N) + 1
            do IX = L_I, U_I
              I = MOD(IX-1, M) + 1
              [loop body: {I,J}]
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

– The MOD operation introduced by CLS prevents us from fully unrolling the loop
– Apply Index Set Splitting to loop J first
[Figure: iteration space with axes J (1 to N) and I (1 to M); region labeled "unrolled"]
21
Index Set Splitting
ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space

    DO JJ = ...
      DO II = ...
        DO K = ...
          ...
          do JX = L_J, min(N, U_J)
            J = JX
            do IX = ...
            enddo
          enddo
          do JX = max(N+1, L_J), U_J
            J = JX-N
            do IX = ...
            enddo
          enddo
          ...
        ENDDO
      ENDDO
    ENDDO

[Figure: iteration space with axes J (1 to N) and I (1 to M) after ISS]
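The split above replaces the MOD by a plain copy in the first loop and a subtraction in the second. A small sketch can confirm that both forms enumerate the same J values, under the assumption that 1 <= L_J and U_J <= 2N (one wrap at most); the function names `with_mod` and `with_iss` are hypothetical.

```python
def with_mod(lj, uj, n):
    """J values of the CLS loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def with_iss(lj, uj, n):
    """J values after Index Set Splitting; assumes 1 <= lj and uj <= 2*n."""
    js = []
    for jx in range(lj, min(n, uj) + 1):        # first loop: no wrap, J = JX
        js.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):    # second loop: wrapped, J = JX - N
        js.append(jx - n)
    return js
```

Inside each of the two loops J is an affine function of JX, so the loop can be unrolled with the subscripts known at compile time.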
22
3rd Step: 1D-Tiling for register level
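Tiling for the register level amounts to strip-mining a loop and fully unrolling the strip inside the inner loop, so that a value stays in a register across several iterations. The sketch below shows the idea on a hypothetical column-sum kernel, not on SWIM; the function names and the kernel are assumptions, and the tile size 4 is borrowed from the register-level TS quoted in the results.

```python
def col_sums(a):
    """Baseline: s[i] is loaded and stored once per (j, i) iteration."""
    n, m = len(a), len(a[0])
    s = [0] * m
    for j in range(n):
        for i in range(m):
            s[i] += a[j][i]
    return s


def col_sums_register_tiled(a):
    """Loop J strip-mined by 4 and the strip fully unrolled inside loop I."""
    n, m = len(a), len(a[0])
    s = [0] * m
    main = n - n % 4
    for j in range(0, main, 4):
        for i in range(m):
            acc = s[i]               # one load, reused by four J iterations
            acc += a[j][i]
            acc += a[j + 1][i]
            acc += a[j + 2][i]
            acc += a[j + 3][i]
            s[i] = acc               # one store
    for j in range(main, n):         # leftover iterations when n is not a multiple of 4
        for i in range(m):
            s[i] += a[j][i]
    return s
```

In Python the "register" is only a local variable, so this shows the shape of the transformation rather than its performance effect.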
23
Code Transformations Summary
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements
2- Perform 2D-Tiling for the Cache Level
– Construct JI-loop distance subgraph
– Compute DELTA and BETAs and apply CLS to shorten backward dependencies
– Update JI-loop distance subgraph
– Compute OFFSETs and SLOPE and tile the IS
3- Perform 1D-Tiling for the Register Level
– Index Set Splitting
– Tiling in a straightforward manner
24
Performance Results (SWIM)
Architecture: EV56 (500 MHz, L1: 8KB, L2: 96KB), EV6 (500 MHz, L1: 64KB, L2: 4MB)
Compiler Invocation:
– f77 -O5 -arch ev56 (EV5)
– kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
Programs:
– 1D-Tiling for the Cache Level: loop J, TS = 4 (EV5), TS = 8 (EV6)
– 2D-Tiling for the Cache Level: TS IxJ = 32x16 (EV5), TS IxJ = 40x12 (EV6)
– 1D-Tiling for the register level: loop J, TS = 4 (EV5 & EV6)
Execution time:

    ORI      ORI + RT   1D       1D + RT   2D       2D + RT
    439s     658s       294s     371s      578s     296s
    1519s    1533s      1023s    999s      1009s    677s

[Figure: Speedup chart for EV5 and EV6]
25
Performance Results EV5 (SWIM)
Architecture: EV56 (500 MHz, L1: 8KB, L2: 96KB)
Compiler invocations:
– base: kf77 -O5 -arch ev56
– no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …..
[Figure: Speedup over ORI (base) for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]
26
Performance Results EV6 (SWIM)
Architecture: EV6 (500 MHz, L1: 64KB, L2: 4MB)
Compiler invocations:
– base: f77 -O5 -arch ev6
– no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …..
[Figure: Speedup over ORI (base) for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]
27
Code for Result Verification
NEW in SPEC2000!!

    DO K = 2, ITMAX-1
      ...
      do J = 1,N
        ...
      enddo
      result verification
      IF (MOD(K,MPRINT).eq.0) THEN
        do I =
          do J =
            UCHECK = UCHECK + {UNEW(I,J)}
          enddo
          UNEW(I,I) = ...
        enddo
        PRINTS
      ENDIF
      do J = 1,N
        ...
      enddo
    ENDDO

– Apply strip-mining to loop K (only useful if MPRINT is large)
[Figure: axis J]
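One way to read the strip-mining remark is sketched below: loop K is cut into strips that end at multiples of MPRINT, so the verification runs between strips and the inner strip contains no IF. This is a simplified model and an assumption about the intent. It treats the verification as happening at the end of a time step, whereas the slide shows more work after it within the same step. The names `with_if` and `strip_mined` are hypothetical.

```python
def with_if(itmax, mprint):
    """Trace of the original loop: verification guarded by an IF inside loop K."""
    trace = []
    for k in range(2, itmax):                    # DO K = 2, ITMAX-1
        trace.append(("step", k))
        if k % mprint == 0:
            trace.append(("verify", k))
    return trace


def strip_mined(itmax, mprint):
    """Loop K strip-mined so each strip ends at a multiple of MPRINT."""
    trace = []
    lo = 2
    while lo <= itmax - 1:
        next_multiple = -(-lo // mprint) * mprint    # smallest multiple of mprint >= lo
        hi = min(next_multiple, itmax - 1)
        for k in range(lo, hi + 1):                  # IF-free strip of time steps
            trace.append(("step", k))
        if hi % mprint == 0:                         # verification between strips
            trace.append(("verify", hi))
        lo = hi + 1
    return trace
```

Each strip holds at most MPRINT time steps, which is consistent with the note that the approach only pays off when MPRINT is large.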