CSL718 : VLIW - Software Driven ILP


1 CSL718 : VLIW - Software Driven ILP
Compiler Support for Exposing and Exploiting ILP
1st Apr, 2006
Anshul Kumar, CSE IITD

2 Code Scheduling for VLIW
Objective is to move code around and form packets of concurrently executable instructions.
Two possibilities:
    Local: work on a straight-line piece of code (basic block), i.e., do not go across conditional branches
    Global: code can move across conditional branches
Loops need to be tackled in both cases.

3 Pipeline scheduling example
for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

4 Latency due to data hazards
Producer instruction    Consumer instruction    Latency
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1

Assume no structural hazards.

5 Straight forward scheduling
                                 Cycle
Loop: L.D    F0, 0(R1)           1
      stall                      2
      ADD.D  F4, F0, F2          3
      stall                      4
      stall                      5
      S.D    F4, 0(R1)           6
      DADDUI R1, R1, #-8         7
      stall                      8
      BNE    R1, R2, Loop        9
      stall                      10

6 A better schedule
                                 Cycle
Loop: L.D    F0, 0(R1)           1
      DADDUI R1, R1, #-8         2
      ADD.D  F4, F0, F2          3
      stall                      4
      BNE    R1, R2, Loop        5
      S.D    F4, 8(R1)           6
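The two cycle counts above can be reproduced with a small model. The following is a minimal sketch, assuming a single-issue in-order pipeline, at most one producer per instruction, one stall between DADDUI and BNE, and one branch delay slot; the last two are assumptions that are not in the latency table. The names `struct instr`, `last_issue_cycle`, `straightforward_cycles`, and `reordered_cycles` are hypothetical. The model tracks only the total cycle count, not the exact position of each stall.

```c
#include <stddef.h>

/* One instruction in a straight-line schedule (hypothetical model). */
struct instr {
    int producer;  /* index of the instruction whose result is used, or -1 */
    int latency;   /* stall cycles required between producer and consumer */
};

/* Fills issue[] with the issue cycle of each instruction (cycles start at 1)
 * and returns the cycle in which the last instruction issues. */
int last_issue_cycle(const struct instr *code, size_t n, int *issue)
{
    int cycle = 0;
    for (size_t k = 0; k < n; k++) {
        int earliest = cycle + 1;  /* in order: after the previous issue */
        if (code[k].producer >= 0) {
            int ready = issue[code[k].producer] + code[k].latency + 1;
            if (ready > earliest)
                earliest = ready;
        }
        issue[k] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Program order: L.D, ADD.D, S.D, DADDUI, BNE, empty delay slot. */
int straightforward_cycles(void)
{
    const struct instr code[] = {
        { -1, 0 },  /* L.D    F0, 0(R1)                           */
        {  0, 1 },  /* ADD.D  F4, F0, F2   load -> FP ALU: 1      */
        {  1, 2 },  /* S.D    F4, 0(R1)    FP ALU -> store: 2     */
        { -1, 0 },  /* DADDUI R1, R1, #-8                         */
        {  3, 1 },  /* BNE    R1, R2, Loop (assumed latency 1)    */
    };
    int issue[5];
    /* +1 for the unfilled branch delay slot (assumption) */
    return last_issue_cycle(code, 5, issue) + 1;
}

/* Reordered: L.D, DADDUI, ADD.D, BNE, S.D in the delay slot. */
int reordered_cycles(void)
{
    const struct instr code[] = {
        { -1, 0 },  /* L.D    F0, 0(R1)                           */
        { -1, 0 },  /* DADDUI R1, R1, #-8                         */
        {  0, 1 },  /* ADD.D  F4, F0, F2   load -> FP ALU: 1      */
        {  1, 1 },  /* BNE    R1, R2, Loop (assumed latency 1)    */
        {  2, 2 },  /* S.D    F4, 8(R1)    FP ALU -> store: 2     */
    };
    int issue[5];
    return last_issue_cycle(code, 5, issue);
}
```

Under these assumptions the straightforward order takes 10 cycles per iteration and the reordered one takes 6.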

7 Loop unrolling
                                 Cycle
Loop: L.D    F0, 0(R1)           1
      ADD.D  F4, F0, F2          3
      S.D    F4, 0(R1)           6
      (body repeated for the other three elements)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop
28/4 = 7 cycles per iteration

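8 Re-scheduling
                                 Cycle
Loop: L.D    F0, 0(R1)           1
      L.D    F6, -8(R1)          2
      L.D    F10, -16(R1)        3
      L.D    F14, -24(R1)        4
      ADD.D  F4, F0, F2          5
      ADD.D  F8, F6, F2          6
      ADD.D  F12, F10, F2        7
      ADD.D  F16, F14, F2        8
      S.D    F4, 0(R1)           9
      S.D    F8, -8(R1)          10
      DADDUI R1, R1, #-32        11
      S.D    F12, 16(R1)         12
      BNE    R1, R2, Loop        13
      S.D    F16, 8(R1)          14
14/4 = 3.5 cycles per iteration

The two stores issued after DADDUI use offsets adjusted by +32 (16 and 8 instead of -16 and -24), because R1 has already been decremented.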
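At the source level, the unrolled and re-scheduled loop corresponds roughly to the following. This is a minimal sketch, assuming the trip count is a multiple of 4; the function names and the temporaries `t0` to `t3` (standing in for the renamed registers) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: x[i] = x[i] + s for i = n down to 1.
 * x must have at least n + 1 elements; x[0] is not touched. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled 4 times, with loads, adds, and stores grouped as in the
 * re-scheduled assembly. Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i -= 4) {
        /* four independent loads */
        double t0 = x[i], t1 = x[i - 1], t2 = x[i - 2], t3 = x[i - 3];
        /* four independent adds */
        t0 += s; t1 += s; t2 += s; t3 += s;
        /* four stores */
        x[i] = t0; x[i - 1] = t1; x[i - 2] = t2; x[i - 3] = t3;
    }
}
```

Each temporary plays the role of a distinct register, which is why unrolling further increases register pressure.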

9 Limits to unrolling
Decrease in amount of loop overhead amortized with each unroll
Growth in code size
Register renaming leads to register pressure

10 Scheduling example with 2 issue proc
Integer/memory instruction       FP instruction         Cycle
Loop: L.D    F0, 0(R1)                                  1
      L.D    F6, -8(R1)                                 2
      L.D    F10, -16(R1)        ADD.D F4, F0, F2       3
      L.D    F14, -24(R1)        ADD.D F8, F6, F2       4
      L.D    F18, -32(R1)        ADD.D F12, F10, F2     5
      S.D    F4, 0(R1)           ADD.D F16, F14, F2     6
      S.D    F8, -8(R1)          ADD.D F20, F18, F2     7
      S.D    F12, -16(R1)                               8
      DADDUI R1, R1, #-40                               9
      S.D    F16, 16(R1)                                10
      BNE    R1, R2, Loop                               11
      S.D    F20, 8(R1)                                 12

11 Scheduling example with 5 issue proc
Memory ref 1       Memory ref 2       FP op 1             FP op 2             Integer op / branch
L.D F0, 0(R1)      L.D F6, -8(R1)
L.D F10, -16(R1)   L.D F14, -24(R1)
L.D F18, -32(R1)   L.D F22, -40(R1)   ADD.D F4, F0, F2    ADD.D F8, F6, F2
L.D F26, -48(R1)                      ADD.D F12, F10, F2  ADD.D F16, F14, F2
                                      ADD.D F20, F18, F2  ADD.D F24, F22, F2
S.D F4, 0(R1)      S.D F8, -8(R1)     ADD.D F28, F26, F2
S.D F12, -16(R1)   S.D F16, -24(R1)                                           DADDUI R1, R1, #-56
S.D F20, 24(R1)    S.D F24, 16(R1)
S.D F28, 8(R1)                                                                BNE R1, R2, Loop

12 Scheduling Results
                                   cycles/iteration
Straight forward scheduling        10
With instruction re-ordering       6
With loop unrolling (4 times)      7
Unrolling + re-ordering            3.5
Scheduling on 2 issue VLIW         2.4 (12 cycles for 5 iterations)
Scheduling on 5 issue VLIW         1.3 (9 cycles for 7 iterations)

13 Loop Level Parallelism
Dependences in the context of a loop:
    Dependence within an iteration
    Dependence across iterations, or loop carried dependence

Example with no loop carried dependence:
for (i=1000; i>0; i--)
    x[i] = x[i] + s;
There is dependence within an iteration.

14 Example with loop carried dependence
for (i=1; i<=100; i++){
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
Assume that arrays are distinct and non-overlapping.
S1 uses a value of A from the previous iteration: restricts overlapping of different iterations
S2 uses a value of A from the same iteration: restricts movement of instructions within an iteration

15 Another example
for (i=1; i<=100; i++){
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
S1 uses a value of B computed by S2 in the previous iteration.
Still, iterations can be parallelized: there is no cycle among the dependences.
A transformation can remove the loop carried dependence.

16 Transformed loop of previous example
A[1] = A[1] + B[1];
for (i=1; i<=99; i++){
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
Now there is no loop carried dependence. The iterations can be parallelized, preserving the dependence within each iteration. The final assignment to B[101] completes S2 of the last original iteration.
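The equivalence of the two loops can be checked directly. The following is a minimal sketch; the function names and the `N_ITER` constant are hypothetical, and the transformed version includes the trailing `B[N_ITER + 1]` assignment needed to cover S2 of the last original iteration. Arrays A and B need at least `N_ITER + 2` elements.

```c
#define N_ITER 100

/* Original loop: S1 reads the B value written by S2 one iteration earlier. */
void original_loop(double *A, double *B, const double *C, const double *D)
{
    for (int i = 1; i <= N_ITER; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: S2 of iteration i is paired with S1 of iteration i+1,
 * so the only remaining dependence is inside one iteration. */
void transformed_loop(double *A, double *B, const double *C, const double *D)
{
    A[1] = A[1] + B[1];                    /* peeled S1 of iteration 1 */
    for (int i = 1; i <= N_ITER - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N_ITER + 1] = C[N_ITER] + D[N_ITER]; /* peeled S2 of the last iteration */
}
```

Running both functions on identical inputs should leave A and B with identical contents.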

17 Dependence distance
for (i=6; i<=100; i++){
    B[i] = B[i-5] + B[i];
}
Dependence distance is 5. This gives an opportunity for parallelization in spite of the loop carried dependence.
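One way to see the opportunity: with distance 5, iteration i depends only on iteration i-5, so the iterations fall into 5 independent chains. The following is a minimal sketch of that regrouping; the function names are hypothetical, and B is assumed to have at least n + 1 elements.

```c
/* Sequential form, as on the slide. */
void recurrence_sequential(double *B, int n)
{
    for (int i = 6; i <= n; i++)
        B[i] = B[i - 5] + B[i];
}

/* Same computation regrouped into 5 chains (i mod 5 is constant within a
 * chain). Each chain is still sequential, but the chains do not depend on
 * each other, so they may run in any order or in parallel. They are run
 * here in reverse order to make that point. */
void recurrence_by_chain(double *B, int n)
{
    for (int chain = 4; chain >= 0; chain--)
        for (int i = 6 + chain; i <= n; i += 5)
            B[i] = B[i - 5] + B[i];
}
```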

18 Finding dependence
Important for
    code scheduling
    loop parallelization
    removal of false dependences
Analysis is complicated by
    pointers and passing of parameters by reference, and the consequent potential for aliasing
    use of complex expressions as indices of arrays

19 Analysis with affine indices
for (i=m; i<=n; i++){ A[ai+b] = A[ci+d] + B[i]; } Can ai+b become equal to ci+d for some values of i within the range m to n? Difficult to determine in general. a,b,c,d may not be known at compile time. These could depend on other loop indices. When a,b,c,d are constants, we can use GCD test. Dependence implies: GCD(c,a) must divide d-b. Anshul Kumar, CSE IITD

20 Example with affine indices
for (i=1; i<=100; i++){
    A[4*i+1] = A[6*i+4] + B[i];
}
GCD(4,6) is 2 and d-b = 4-1 is 3. 2 does not divide 3. Therefore, the indices can never have the same value.
    values of 4*i+1 = 5,9,13,17,21,25,29,33,37,41,....
    values of 6*i+4 = 10,16,22,28,34,40,46,52,58,....
Sometimes GCD(c,a) may divide d-b, but a dependence may still not exist.
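The GCD test is easy to express in code. The following is a minimal sketch for a store to A[a*i+b] and a load from A[c*i+d] with constant a, b, c, d; the function names are hypothetical. The exhaustive check is included only to contrast the conservative test with the exact answer over a given range.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test: returns false only when a dependence is impossible.
 * A true result means a dependence MAY exist; it ignores loop bounds. */
bool gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)               /* a == c == 0: both indices are constants */
        return b == d;
    return (d - b) % g == 0;
}

/* Exact answer by brute force: is a*i+b == c*j+d for some i, j in m..n? */
bool exact_dependence(int a, int b, int c, int d, int m, int n)
{
    for (int i = m; i <= n; i++)
        for (int j = m; j <= n; j++)
            if (a * i + b == c * j + d)
                return true;
    return false;
}
```

For the slide's example, `gcd_test_may_depend(4, 1, 6, 4)` reports that no dependence is possible. For indices such as 2*i and 2*i+1000 over the range 1 to 100, the GCD test passes (2 divides 1000), yet the exhaustive check finds no dependence, because the test does not take the loop bounds into account.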

21 Reducing impact of dependent computations
Copy propagation (used in loop unrolling)
    Before:                      After:
    DADDUI R1,R2,#4              DADDUI R1,R2,#8
    DADDUI R1,R1,#4

Tree height reduction
    Before:                      After:
    ADD R1,R2,R3                 ADD R1,R2,R3
    ADD R4,R1,R6                 ADD R4,R6,R7
    ADD R8,R4,R7                 ADD R8,R1,R4
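The tree height reduction above can be written out at the source level. The following is a minimal sketch using integer operands, with variable names that mirror the registers; the function names, including the two depth helpers, are hypothetical. For floating-point operands, reassociating additions can change the rounded result, so a compiler may apply this only when that is permitted.

```c
/* Serial chain: each add waits for the previous one (3 dependent adds). */
long sum_chain(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    long r8 = r4 + r7;
    return r8;
}

/* Reassociated: the first two adds are independent of each other,
 * so the critical path is 2 adds instead of 3. */
long sum_tree(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r6 + r7;
    long r8 = r1 + r4;
    return r8;
}

/* Critical-path length (in adds) for summing n >= 1 operands serially. */
int chain_depth(int n)
{
    return n - 1;
}

/* Critical-path length for a balanced tree: ceil(log2(n)) for n >= 1. */
int tree_depth(int n)
{
    int depth = 0;
    for (int width = 1; width < n; width *= 2)
        depth++;
    return depth;
}
```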

22 Software pipelining: symbolic loop unrolling
[Figure: iteration 0, iteration 1, ..., iteration 6]

23 Software pipelining example
Original loop:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop

Software-pipelined loop:
Loop: S.D    F4,16(R1)
      ADD.D  F4,F0,F2
      L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop

iteration i      L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
iteration i+1    L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
iteration i+2    L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)

The pipelined loop body combines the S.D of iteration i, the ADD.D of iteration i+1, and the L.D of iteration i+2.
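The same idea at the source level needs start-up and wind-down code around the steady-state loop. The following is a minimal sketch, assuming at least 2 elements; the function name and the variables `loaded` and `summed` (standing in for F0 and F4) are hypothetical. The array x must have at least n + 1 elements, and x[0] is not touched.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s; */
void add_scalar_swp(double *x, size_t n, double s)
{
    assert(n >= 2);

    /* Prologue: start the first two iterations. */
    double loaded = x[n];          /* load for element n     */
    double summed = loaded + s;    /* add for element n      */
    loaded = x[n - 1];             /* load for element n - 1 */

    /* Steady state: one store, one add, one load, each belonging to a
     * different original iteration. */
    for (size_t i = n; i > 2; i--) {
        x[i] = summed;             /* store for element i     */
        summed = loaded + s;       /* add for element i - 1   */
        loaded = x[i - 2];         /* load for element i - 2  */
    }

    /* Epilogue: finish the last two iterations. */
    x[2] = summed;
    summed = loaded + s;
    x[1] = summed;
}
```

Within the steady-state body no statement consumes a value produced in the same pass, which is what removes the stalls; the cost is the extra prologue and epilogue code.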

24 Difficulties in software pipelining
Overheads: increased register requirement
Register management required
Loop body may be complex
It may require several transformations before pipelining

25 Global code scheduling
Consider regions of code which are larger than basic blocks
Include multiple basic blocks and conditionals
How to move code across branch and join points?

26 Static scheduling and branch prediction
Static branch prediction is helpful with
    delayed branches
    static scheduling
Methods
    Fixed prediction
    Opcode based prediction
    Address based prediction
    Profile driven prediction (misprediction ~10%, instructions between mispredictions ~100)

27 Branch prediction and scheduling
      LD     R1, 0(R2)
      DSUBU  R1, R1, R3
      BEQZ   R1, L
      OR     R4, R5, R6         (A)
      DADDU  R10, R4, R3
L:    DADDU  R7, R8, R9         (B)

A: move above the branch when the branch is predicted as not taken and R4 is not needed in the taken path
B: move above the branch when the branch is predicted as taken and R7 is not needed in the not-taken (fall-through) path

28 Global code scheduling
When can the assignment to B be moved before the comparison?
When can the assignment to C be moved above the join point? Above the comparison?
[Flowchart: A[i]=A[i]+B[i], then the test A[i]=0?; the predicted path leads to B[i]=..., the other path to X; both paths join at C[i]=...]

29 Trace scheduling
[Figure: the trace A[i]=A[i]+B[i], A[i]=0?, B[i]=..., C[i]=..., with its trace exits and trace entry marked]

30 Region for global scheduling
Trace: linear path through code (with high probability), multiple entries and exits
Superblock: linear path with single entry, multiple exits
Hyperblock: superblock plus internal control flow
Treegion: tree with single entry, multiple exits
Trace-2: loop free region

31 Trace
[Figure: control flow graph of blocks B1 to B6 with edge weights, showing a trace]

32 Superblock
[Figure: control flow graph of blocks B1 to B6 with edge weights, showing a superblock]

33 Superblock with tail duplication
[Figure: control flow graph of blocks B1 to B6 with edge weights, with duplicated tail blocks B4' and B6']

34 Hyperblock
[Figure: control flow graph of blocks B1 to B6 with edge weights, with duplicated block B6', showing a hyperblock]

35 Treegion
[Figure: control flow graph of blocks B1 to B6 with edge weights, with duplicated blocks B4', B5', B6', B6'', B6''', showing a treegion]

