CSL718: VLIW - Software Driven ILP
Compiler Support for Exposing and Exploiting ILP
1st Apr, 2006
Anshul Kumar, CSE IITD

Code Scheduling for VLIW
The objective is to move code around and form packets of concurrently executable instructions.
Two possibilities:
  Local: work on a straight-line piece of code (a basic block), i.e., do not go across conditional branches
  Global: code can move across conditional branches
Loops need to be tackled in both cases.

Pipeline scheduling example
    for (i=1000; i>0; i--)
        x[i] = x[i] + s;

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Latency due to data hazards
    Producer instruction    Consumer instruction    Latency
    FP ALU op               FP ALU op               3
    FP ALU op               Store double            2
    Load double             FP ALU op               1
Latency is the number of stall cycles needed between a producer and a dependent consumer.
Assume no structural hazards.

Straightforward scheduling
                                  Cycle
    Loop: L.D    F0, 0(R1)        1
          stall                   2
          ADD.D  F4, F0, F2       3
          stall                   4
          stall                   5
          S.D    F4, 0(R1)        6
          DADDUI R1, R1, #-8      7
          stall                   8
          BNE    R1, R2, Loop     9
          stall                   10

A better schedule
                                  Cycle
    Loop: L.D    F0, 0(R1)        1
          DADDUI R1, R1, #-8      2
          ADD.D  F4, F0, F2       3
          stall                   4
          BNE    R1, R2, Loop     5
          S.D    F4, 8(R1)        6

Loop unrolling
                                  Cycle
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)        6
          (three more copies of the body, for offsets -8, -16 and -24)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop     28
28 cycles for 4 elements: 28/4 = 7 cycles per element

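At the source level, unrolling four times corresponds roughly to the C sketch below. The function names add_scalar_rolled and add_scalar_unrolled4 are illustrative only, and the sketch assumes the trip count n is a multiple of 4 (as 1000 is in the example loop).

```c
#include <stddef.h>

/* Original loop: one element per iteration.
   x must have at least n+1 slots; x[1..n] are updated. */
void add_scalar_rolled(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled 4 times: one index update and one branch per four elements.
   Assumes n is a multiple of 4 (a precondition chosen for this sketch). */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

The four statements in the unrolled body are independent of one another, which is what gives the scheduler room to interleave their loads, adds and stores.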
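Re-scheduling
                                  Cycle
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2     8
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)       10
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)      12
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)       14
The two stores issued after DADDUI use offsets adjusted by +32, since R1 has already been decremented.
14 cycles for 4 elements: 14/4 = 3.5 cycles per element
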
Limits to unrolling
  Decrease in the amount of loop overhead amortized with each additional unroll
  Growth in code size
  Register renaming leads to register pressure

Scheduling example with 2-issue processor
          Integer/memory/branch slot    FP slot                 Cycle
    Loop: L.D    F0, 0(R1)                                      1
          L.D    F6, -8(R1)                                     2
          L.D    F10, -16(R1)           ADD.D F4, F0, F2        3
          L.D    F14, -24(R1)           ADD.D F8, F6, F2        4
          L.D    F18, -32(R1)           ADD.D F12, F10, F2      5
          S.D    F4, 0(R1)              ADD.D F16, F14, F2      6
          S.D    F8, -8(R1)             ADD.D F20, F18, F2      7
          S.D    F12, -16(R1)                                   8
          DADDUI R1, R1, #-40                                   9
          S.D    F16, 16(R1)                                    10
          BNE    R1, R2, Loop                                   11
          S.D    F20, 8(R1)                                     12

Scheduling example with 5-issue processor
    Memory ref 1         Memory ref 2         FP op 1              FP op 2              Integer op/branch
    L.D F0, 0(R1)        L.D F6, -8(R1)
    L.D F10, -16(R1)     L.D F14, -24(R1)
    L.D F18, -32(R1)     L.D F22, -40(R1)     ADD.D F4, F0, F2     ADD.D F8, F6, F2
    L.D F26, -48(R1)                          ADD.D F12, F10, F2   ADD.D F16, F14, F2
                                              ADD.D F20, F18, F2   ADD.D F24, F22, F2
    S.D F4, 0(R1)        S.D F8, -8(R1)       ADD.D F28, F26, F2
    S.D F12, -16(R1)     S.D F16, -24(R1)                                               DADDUI R1, R1, #-56
    S.D F20, 24(R1)      S.D F24, 16(R1)
    S.D F28, 8(R1)                                                                      BNE R1, R2, Loop
9 cycles for 7 elements

Scheduling Results
                                        Cycles per element
    Straightforward scheduling          10
    With instruction re-ordering        6
    With loop unrolling (4 times)       7
    Unrolling + re-ordering             3.5
    Scheduling on 2-issue processor     2.4
    Scheduling on 5-issue VLIW          1.3

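The per-element figures come from dividing the cycles for one pass through the (possibly unrolled) loop body by the number of array elements that pass handles. A minimal sketch of that arithmetic follows; the helper name cycles_per_element is an assumption made for illustration.

```c
/* Cycles per element = cycles for one pass through the loop body
   divided by the number of array elements processed in that pass. */
double cycles_per_element(int cycles_per_pass, int elements_per_pass)
{
    return (double)cycles_per_pass / (double)elements_per_pass;
}
```

With the cycle counts marked on the schedules, cycles_per_element(28, 4) is 7, cycles_per_element(14, 4) is 3.5, and the 12-cycle, 5-element schedule on the 2-issue processor works out to 12/5 = 2.4.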
Loop Level Parallelism
Dependences in the context of a loop:
  Dependence within an iteration
  Dependence across iterations, or loop carried dependence
Example with no loop carried dependence:
    for (i=1000; i>0; i--)
        x[i] = x[i] + s;
There is dependence only within an iteration.

Example with loop carried dependence
    for (i=1; i<=100; i++){
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }
Assume that the arrays are distinct and non-overlapping.
  S1 uses a value of A computed in the previous iteration; this restricts overlapping of different iterations.
  S2 uses a value of A computed in the same iteration; this restricts movement of instructions within an iteration.

Another example
    for (i=1; i<=100; i++){
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }
S1 uses a value of B computed by S2 in the previous iteration.
Still, the iterations can be parallelized: there is no cycle among the dependences.
A transformation can remove the loop carried dependence.

Transformed loop of previous example
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i++){
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
Now there is no loop carried dependence.
The iterations can be parallelized, preserving the dependence within each iteration.

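One way to convince oneself that the transformation preserves results is to run both versions on the same data. The sketch below does that; the function names and the LOOP_N macro are illustrative. It includes the trailing statement B[101] = C[100] + D[100], which the original loop executes in its last iteration and which the shortened loop would otherwise skip.

```c
#define LOOP_N 100

/* Original loop: S1 reads B[i], which S2 wrote in the previous iteration. */
void original_loop(double A[LOOP_N + 2], double B[LOOP_N + 2],
                   const double C[LOOP_N + 2], const double D[LOOP_N + 2])
{
    for (int i = 1; i <= LOOP_N; i++) {
        A[i] = A[i] + B[i];         /* S1 */
        B[i + 1] = C[i] + D[i];     /* S2 */
    }
}

/* Transformed loop: the first S1 is peeled off in front, the last S2 is
   peeled off behind, and each iteration pairs S2 with the S1 that uses it. */
void transformed_loop(double A[LOOP_N + 2], double B[LOOP_N + 2],
                      const double C[LOOP_N + 2], const double D[LOOP_N + 2])
{
    A[1] = A[1] + B[1];
    for (int i = 1; i <= LOOP_N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[LOOP_N + 1] = C[LOOP_N] + D[LOOP_N];
}
```

In the transformed body the only dependence is from the first statement to the second within the same iteration, so different iterations no longer communicate through B.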
Dependence distance
    for (i=6; i<=100; i++){
        B[i] = B[i-5] + B[i];
    }
The dependence distance is 5.
This gives an opportunity for parallelization in spite of the loop carried dependence, since any five consecutive iterations are independent of one another.

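A small sketch of what a distance of 5 buys: iterations can be processed in groups of five, and within a group the order does not matter. The function names and macros below are illustrative assumptions; the grouped version deliberately runs each group in reverse order to show that the result is unchanged.

```c
#define LAST 100
#define DIST 5

/* Reference version: the loop exactly as written. */
void distance_sequential(double B[LAST + 1])
{
    for (int i = 6; i <= LAST; i++)
        B[i] = B[i - DIST] + B[i];
}

/* Grouped version: each group holds DIST consecutive iterations.
   No iteration in a group reads an element written by another member
   of the same group, so the inner loop may run in any order (here it
   runs backwards) or in parallel. */
void distance_grouped(double B[LAST + 1])
{
    for (int base = 6; base <= LAST; base += DIST) {
        for (int k = DIST - 1; k >= 0; k--) {
            int i = base + k;
            if (i <= LAST)
                B[i] = B[i - DIST] + B[i];
        }
    }
}
```

The groups themselves must still run in order, because group g reads what group g-1 wrote.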
Finding dependence
Important for:
  code scheduling
  loop parallelization
  removal of false dependences
Analysis is complicated by:
  pointers and passing of parameters by reference, and the consequent potential for aliasing
  use of complex expressions as indices of arrays

Analysis with affine indices
    for (i=m; i<=n; i++){
        A[a*i+b] = A[c*i+d] + B[i];
    }
Can a*i+b become equal to c*i+d for some values of i within the range m to n?
Difficult to determine in general:
  a, b, c, d may not be known at compile time
  they could depend on other loop indices
When a, b, c, d are constants, we can use the GCD test.
If a dependence exists, then GCD(c,a) must divide d-b.

Example with affine indices
    for (i=1; i<=100; i++){
        A[4*i+1] = A[6*i+4] + B[i];
    }
GCD(4,6) is 2 and 4-1 is 3. 2 does not divide 3.
Therefore, the indices can never have the same value.
  values of 4*i+1 = 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, ...
  values of 6*i+4 = 10, 16, 22, 28, 34, 40, 46, 52, 58, ...
Sometimes GCD(c,a) may divide d-b, but the dependence may still not exist.

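The GCD test can be sketched in a few lines of C. The function names here are illustrative assumptions; gcd_test_may_depend returns 0 when a dependence is ruled out and 1 when the test cannot rule it out, while exact_dependence is a brute-force check over the loop bounds, included only to compare against.

```c
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a store to A[a*i+b] and a load from A[c*i+d].
   Returns 0 if a dependence is impossible, 1 if it cannot be ruled out. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(c, a);
    if (g == 0)                 /* a == c == 0: both indices are constants */
        return b == d;
    return (d - b) % g == 0;
}

/* Brute force: is there a store iteration i and a load iteration j,
   both within [m, n], that touch the same element? */
int exact_dependence(int a, int b, int c, int d, int m, int n)
{
    for (int i = m; i <= n; i++)
        for (int j = m; j <= n; j++)
            if (a * i + b == c * j + d)
                return 1;
    return 0;
}
```

For the loop above, gcd_test_may_depend(4, 1, 6, 4) rules the dependence out. For a hypothetical pair A[2*i] and A[2*i+300] with i from 1 to 100, the GCD test cannot rule it out (2 divides 300), yet exact_dependence finds none, because the test ignores the loop bounds.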
Reducing impact of dependent computations
Copy propagation (used in loop unrolling)
    DADDUI R1, R2, #4
    DADDUI R1, R1, #4
  becomes
    DADDUI R1, R2, #8
Tree height reduction
    ADD R1, R2, R3
    ADD R4, R1, R6
    ADD R8, R4, R7
  becomes
    ADD R1, R2, R3
    ADD R4, R6, R7
    ADD R8, R1, R4

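The same two rewrites can be written at the C level. This is a sketch with illustrative function names; variable names mirror the registers in the assembly. It uses integer arithmetic, where the rewrites are exact as long as nothing overflows. Floating-point addition is not associative, so a compiler generally cannot apply tree height reduction to floating-point sums without permission to reorder.

```c
/* Copy propagation: two dependent adds of a constant ... */
long add_four_twice(long r2)
{
    long r1 = r2 + 4;
    r1 = r1 + 4;            /* depends on the previous add */
    return r1;
}

/* ... collapse into one add. */
long add_eight_once(long r2)
{
    return r2 + 8;
}

/* Chain of three adds: each one waits for the previous (height 3). */
long sum_chain(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    long r8 = r4 + r7;
    return r8;
}

/* Tree of adds: the first two are independent (height 2). */
long sum_tree(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r6 + r7;      /* can issue in the same cycle as r1 */
    long r8 = r1 + r4;
    return r8;
}
```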
Software pipelining: symbolic loop unrolling
[Figure: iterations 0 to 6 of the loop]

Software pipelining example
Original loop:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
Three consecutive iterations:
    iteration i:     L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    iteration i+1:   L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    iteration i+2:   L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
Software pipelined loop:
    Loop: S.D    F4, 16(R1)
          ADD.D  F4, F0, F2
          L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

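A C-level sketch of the same idea is given below, with the start-up (prologue) and wind-down (epilogue) code that the assembly fragment leaves out. The function names and the local variables loaded and summed (standing in for F0 and F4) are illustrative assumptions, not part of the original example.

```c
/* Plain loop, for reference. x must have at least n+1 slots. */
void add_scalar_plain(double *x, int n, double s)
{
    for (int i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Software-pipelined version: each kernel iteration stores the result for
   element i+2, adds for element i+1, and loads element i. */
void add_scalar_swp(double *x, int n, double s)
{
    if (n < 2) {                       /* too short to pipeline */
        add_scalar_plain(x, n, s);
        return;
    }

    /* Prologue: start the first two elements. */
    double loaded = x[n];              /* load for element n     */
    double summed = loaded + s;        /* add  for element n     */
    loaded = x[n - 1];                 /* load for element n-1   */

    /* Kernel: one store, one add, one load, each from a different element. */
    for (int i = n - 2; i > 0; i--) {
        x[i + 2] = summed;             /* store for element i+2  */
        summed = loaded + s;           /* add   for element i+1  */
        loaded = x[i];                 /* load  for element i    */
    }

    /* Epilogue: finish the last two elements. */
    x[2] = summed;                     /* store for element 2    */
    summed = loaded + s;               /* add   for element 1    */
    x[1] = summed;                     /* store for element 1    */
}
```

Inside the kernel the store, add and load belong to different elements and do not depend on one another within the same iteration, which is why the pipelined loop can run without the stalls of the original body.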
Difficulties in software pipelining
  Overheads: increased register requirement
  Register management required
  Loop body may be complex
  It may require several transformations before pipelining

Global code scheduling
  Consider regions of code which are larger than basic blocks
  Include multiple basic blocks and conditionals
  How to move code across branch and join points?

Static scheduling and branch prediction
Static branch prediction is helpful with:
  delayed branches
  static scheduling
Methods:
  Fixed prediction
  Opcode based prediction
  Address based prediction
  Profile driven prediction (misprediction ~10%, instructions between mispredictions ~100)

Branch prediction and scheduling
        LD    R1, 0(R2)
        DSUBU R1, R1, R3
        BEQZ  R1, L
        OR    R4, R5, R6          (A)
        DADDU R10, R4, R3
    L:  DADDU R7, R8, R9          (B)
A: move when the branch is predicted as not taken and R4 is not needed in the taken path
B: move when the branch is predicted as taken and R7 is not needed in the not-taken (fall-through) path

Global code scheduling
When can the assignment to B be moved before the comparison?
When can the assignment to C be moved above the join point? Above the comparison?
[Figure: flow graph with A[i]=A[i]+B[i], the test A[i]=0?, the blocks B[i]=... and X on the two branch paths, and C[i]=... after the join; the predicted path is marked]

Trace scheduling
[Figure: the trace A[i]=A[i]+B[i], A[i]=0?, B[i]=..., C[i]=..., with trace entry and trace exit points marked]

Region for global scheduling
  Trace: linear path through code (with high probability), multiple entries and exits
  Superblock: linear path with single entry, multiple exits
  Hyperblock: superblock plus internal control flow
  Treegion: tree with single entry, multiple exits
  Trace-2: loop free region

Trace
[Figure: control flow graph of blocks B1 to B6 with edge execution frequencies]

Superblock
[Figure: control flow graph of blocks B1 to B6 with edge execution frequencies]

Superblock with tail duplication
[Figure: control flow graph of blocks B1 to B6 with edge execution frequencies, and the graph after tail duplication with copies B4' and B6']

Hyperblock
[Figure: control flow graph of blocks B1 to B6 with edge execution frequencies, and the graph after duplication with copy B6']

Treegion
[Figure: control flow graph of blocks B1 to B6 with edge execution frequencies, and the graph after duplication with copies B4', B5', B6', B6'' and B6''']