CSL718 : VLIW - Software Driven ILP

Compiler Support for Exposing and Exploiting ILP
1st Apr, 2006
Anshul Kumar, CSE IITD

Code Scheduling for VLIW
Objective is to move code around and form packets of concurrently executable instructions.
Two possibilities:
- Local: work on a straight-line piece of code (basic block), i.e., do not go across conditional branches
- Global: code can move across conditional branches
Loops need to be tackled in both cases.

Pipeline scheduling example
    for (i=1000; i>0; i--)
        x[i] = x[i] + s;

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Latency due to data hazards
Producer instruction    Consumer instruction    Latency (cycles)
FP ALU op               FP ALU op               3
FP ALU op               Store double            2
Load double             FP ALU op               1
Assume no structural hazards.

Straight forward scheduling
                                 Cycle
    Loop: L.D    F0, 0(R1)       1
          stall                  2
          ADD.D  F4, F0, F2      3
          stall                  4
          stall                  5
          S.D    F4, 0(R1)       6
          DADDUI R1, R1, #-8     7
          stall                  8
          BNE    R1, R2, Loop    9
          stall                  10

A better schedule
                                 Cycle
    Loop: L.D    F0, 0(R1)       1
          DADDUI R1, R1, #-8     2
          ADD.D  F4, F0, F2      3
          stall                  4
          BNE    R1, R2, Loop    5
          S.D    F4, 8(R1)       6
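
The cycle counts of the two schedules above can be checked with a small counter. The following is a minimal sketch in C under stated assumptions, not a tool from the course: it models a single-issue in-order pipeline in which each instruction names at most one producer, uses the latencies from the table, and assumes (from the stalls shown in the straight forward schedule) one stall between an integer op and a dependent branch, plus one cycle for an unfilled branch delay slot. The names `Instr`, `latency` and `cycles_per_iteration` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that appear in the example loop. */
typedef enum { LOAD, FP_ALU, STORE, INT_ALU, BRANCH } OpClass;

typedef struct {
    OpClass cls;
    int producer;   /* index of the instruction whose result is needed, or -1 */
} Instr;

/* Stall cycles needed between a producer and a dependent consumer. */
static int latency(OpClass producer, OpClass consumer)
{
    if (producer == FP_ALU  && consumer == FP_ALU) return 3;
    if (producer == FP_ALU  && consumer == STORE)  return 2;
    if (producer == LOAD    && consumer == FP_ALU) return 1;
    if (producer == INT_ALU && consumer == BRANCH) return 1; /* assumption */
    return 0;
}

/* Cycles taken by one loop iteration when issued strictly in order. */
int cycles_per_iteration(const Instr *code, size_t n)
{
    int issue[64];                       /* issue cycle of each instruction */
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        int earliest = (i == 0) ? 1 : issue[i - 1] + 1;
        int p = code[i].producer;
        if (p >= 0) {
            assert((size_t)p < i);       /* producer must come earlier */
            int ready = issue[p] + 1 + latency(code[p].cls, code[i].cls);
            if (ready > earliest)
                earliest = ready;
        }
        issue[i] = earliest;
    }
    /* A branch in the last position leaves its delay slot empty: one more cycle. */
    return issue[n - 1] + (code[n - 1].cls == BRANCH ? 1 : 0);
}
```

Describing the original order as L.D, ADD.D (needs L.D), S.D (needs ADD.D), DADDUI, BNE (needs DADDUI) and the reordered version as L.D, DADDUI, ADD.D, BNE, S.D lets the two totals be compared directly. The sketch only reproduces the total; it does not model where the hardware places the stall relative to the branch.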

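Loop unrolling
                                 Cycle
    Loop: L.D    F0, 0(R1)       1
          ADD.D  F4, F0, F2      3
          S.D    F4, 0(R1)       6
          L.D    F6, -8(R1)      7
          ADD.D  F8, F6, F2      9
          S.D    F8, -8(R1)      12
          L.D    F10, -16(R1)    13
          ADD.D  F12, F10, F2    15
          S.D    F12, -16(R1)    18
          L.D    F14, -24(R1)    19
          ADD.D  F16, F14, F2    21
          S.D    F16, -24(R1)    24
          DADDUI R1, R1, #-32    25
          BNE    R1, R2, Loop    27
28 cycles / 4 elements = 7 cycles per element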
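
At the source level, unrolling by four corresponds roughly to the following. This is a minimal sketch in C, assuming the element count `n` is a multiple of 4 (as 1000 is) and that `x` has `n + 1` elements so indices 1..n are valid; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: adds s to x[1..n], counting down as in the example. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled four times: one index update and one branch per four elements. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);          /* no clean-up loop in this sketch */
    for (size_t i = n; i > 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

The four statements in the unrolled body touch different elements, which is what lets a scheduler interleave their loads, adds and stores.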

Re-scheduling
                                 Cycle
    Loop: L.D    F0, 0(R1)       1
          L.D    F6, -8(R1)      2
          L.D    F10, -16(R1)    3
          L.D    F14, -24(R1)    4
          ADD.D  F4, F0, F2      5
          ADD.D  F8, F6, F2      6
          ADD.D  F12, F10, F2    7
          ADD.D  F16, F14, F2    8
          S.D    F4, 0(R1)       9
          S.D    F8, -8(R1)      10
          DADDUI R1, R1, #-32    11
          S.D    F12, 16(R1)     12
          BNE    R1, R2, Loop    13
          S.D    F16, 8(R1)      14
14 cycles / 4 elements = 3.5 cycles per element

Limits to unrolling
- Decrease in the amount of loop overhead amortized with each additional unroll
- Growth in code size
- Register renaming leads to register pressure

Scheduling example with 2 issue proc
      Integer instruction       FP instruction          Cycle
Loop: L.D    F0, 0(R1)                                  1
      L.D    F6, -8(R1)                                 2
      L.D    F10, -16(R1)       ADD.D F4, F0, F2        3
      L.D    F14, -24(R1)       ADD.D F8, F6, F2        4
      L.D    F18, -32(R1)       ADD.D F12, F10, F2      5
      S.D    F4, 0(R1)          ADD.D F16, F14, F2      6
      S.D    F8, -8(R1)         ADD.D F20, F18, F2      7
      S.D    F12, -16(R1)                               8
      DADDUI R1, R1, #-40                               9
      S.D    F16, 16(R1)                                10
      BNE    R1, R2, Loop                               11
      S.D    F20, 8(R1)                                 12

Scheduling example with 5 issue proc
Cycle  Mem ref 1         Mem ref 2         FP op 1            FP op 2            Int op/branch
1      L.D F0,0(R1)      L.D F6,-8(R1)
2      L.D F10,-16(R1)   L.D F14,-24(R1)
3      L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2     ADD.D F8,F6,F2
4      L.D F26,-48(R1)                     ADD.D F12,F10,F2   ADD.D F16,F14,F2
5                                          ADD.D F20,F18,F2   ADD.D F24,F22,F2
6      S.D F4,0(R1)      S.D F8,-8(R1)     ADD.D F28,F26,F2
7      S.D F12,-16(R1)   S.D F16,-24(R1)                                         DADDUI R1,R1,#-56
8      S.D F20,24(R1)    S.D F24,16(R1)
9      S.D F28,8(R1)                                                             BNE R1,R2,Loop

Scheduling Results
Schedule                          Cycles/iteration
Straight forward scheduling       10
With instruction re-ordering      6
With loop unrolling (4 times)     7
Unrolling + re-ordering           3.5
Scheduling on 2 issue VLIW        2.4
Scheduling on 5 issue VLIW        1.3
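
To compare the rows, it can help to express each result as a speedup over the straight forward schedule. A minimal sketch in C; the function name `speedup` is hypothetical, and the figures fed to it would be the cycles/iteration values listed above.

```c
#include <assert.h>

/* Speedup of an improved schedule relative to a baseline,
 * both given in cycles per iteration. */
double speedup(double baseline_cycles, double improved_cycles)
{
    assert(improved_cycles > 0.0);
    return baseline_cycles / improved_cycles;
}
```

For instance, `speedup(10.0, 6.0)` gives the gain from instruction re-ordering alone, and `speedup(10.0, 3.5)` the gain from unrolling combined with re-ordering.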

Loop Level Parallelism
Dependences in the context of a loop:
- Dependence within an iteration
- Dependence across iterations, or loop carried dependence
Example with no loop carried dependence:
    for (i=1000; i>0; i--)
        x[i] = x[i] + s;
There is a dependence within an iteration.

Example with loop carried dependence
    for (i=1; i<=100; i++){
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
Assume that the arrays are distinct and non-overlapping.
- S1 uses a value of A from the previous iteration: this restricts overlapping of different iterations
- S2 uses a value of A from the same iteration: this restricts movement of instructions within an iteration

Another example
    for (i=1; i<=100; i++){
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
S1 uses a value of B computed by S2 in the previous iteration. Still, iterations can be parallelized:
- There is no cycle among the dependences.
- A transformation can remove the loop carried dependence.

Transformed loop of previous example
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i++){
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
Now there is no loop carried dependence. The iterations can be parallelized, preserving the dependence within each iteration.
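
One way to gain confidence in such a transformation is to run both versions on the same data and compare. The following is a minimal sketch in C, assuming distinct arrays of at least 102 elements; the function names are hypothetical. Note that matching the original loop exactly requires the final statement `B[101] = C[100] + D[100]` after the loop, which completes S2 of the last original iteration.

```c
#include <assert.h>

enum { N = 100 };

/* Original loop: S1 in iteration i reads B[i], written by S2 in iteration i-1. */
void original_loop(double *A, double *B, const double *C, const double *D)
{
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: S2 of iteration i is paired with S1 of iteration i+1,
 * so the only dependence left is inside one iteration. */
void transformed_loop(double *A, double *B, const double *C, const double *D)
{
    A[1] = A[1] + B[1];            /* S1 of the first original iteration */
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];        /* S2 of the last original iteration */
}

/* Runs both versions on identical sample data and asserts equal results. */
void check_loops_agree(void)
{
    double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i < N + 2; i++) {
        A1[i] = A2[i] = i;
        B1[i] = B2[i] = 2 * i;
        C[i] = 3 * i;
        D[i] = 5 * i;
    }
    original_loop(A1, B1, C, D);
    transformed_loop(A2, B2, C, D);
    for (int i = 0; i < N + 2; i++) {
        assert(A1[i] == A2[i]);
        assert(B1[i] == B2[i]);
    }
}
```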

Dependence distance
    for (i=6; i<=100; i++){
        B[i] = B[i-5] + B[i];
    }
Dependence distance is 5. This gives an opportunity for parallelization in spite of the loop carried dependence.
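
With a distance of 5, any five consecutive iterations are independent of each other, since each one reads only values produced at least five iterations earlier. The following minimal sketch in C illustrates this by running each group of five iterations in reverse order and checking that the result matches the sequential loop; the function names are hypothetical, and reversing the order stands in for executing the five iterations in parallel.

```c
#include <assert.h>

enum { LAST = 100, DIST = 5 };

/* The loop as written: iteration i reads B[i-5], produced 5 iterations earlier. */
void sequential(double *B)
{
    for (int i = DIST + 1; i <= LAST; i++)
        B[i] = B[i - DIST] + B[i];
}

/* Same work in blocks of 5 iterations. Inside a block no iteration reads
 * a value written by another iteration of that block, so order is free. */
void blocked(double *B)
{
    assert((LAST - DIST) % DIST == 0);   /* whole number of blocks */
    for (int base = DIST + 1; base <= LAST; base += DIST)
        for (int i = base + DIST - 1; i >= base; i--)   /* reversed on purpose */
            B[i] = B[i - DIST] + B[i];
}

/* Asserts that both orders produce the same array. */
void check_blocked_matches_sequential(void)
{
    double s[LAST + 1], b[LAST + 1];
    for (int i = 0; i <= LAST; i++)
        s[i] = b[i] = i;
    sequential(s);
    blocked(b);
    for (int i = 0; i <= LAST; i++)
        assert(s[i] == b[i]);
}
```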

Finding dependence
Important for:
- code scheduling
- loop parallelization
- removal of false dependences
Analysis is complicated by:
- pointers and passing of parameters by reference, and the consequent potential for aliasing
- use of complex expressions as indices of arrays

Analysis with affine indices
    for (i=m; i<=n; i++){
        A[a*i+b] = A[c*i+d] + B[i];
    }
Can a*i+b (for some iteration i) become equal to c*j+d (for some iteration j), with i and j within the range m to n?
- Difficult to determine in general.
- a, b, c, d may not be known at compile time.
- These could depend on other loop indices.
When a, b, c, d are constants, we can use the GCD test. Dependence implies: GCD(c,a) must divide d-b.

Example with affine indices
    for (i=1; i<=100; i++){
        A[4*i+1] = A[6*i+4] + B[i];
    }
GCD(4,6) is 2 and 4-1 is 3. 2 does not divide 3. Therefore, the indices can never have the same value.
    values of 4*i+1 = 5,9,13,17,21,25,29,33,37,41,....
    values of 6*i+4 = 10,16,22,28,34,40,46,52,58,....
Sometimes GCD(c,a) may divide d-b, but a dependence may still not exist.
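
The test itself is short. Below is a minimal sketch in C with hypothetical function names: `gcd_test_may_depend` applies the divisibility condition, and `depends_in_range` is a brute-force check over the actual iteration range, included only for comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a write to A[a*i+b] and a read of A[c*j+d].
 * false: a dependence is impossible.
 * true:  a dependence cannot be ruled out (loop bounds are ignored). */
bool gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(c, a);
    assert(g != 0);              /* a and c must not both be zero */
    return (d - b) % g == 0;
}

/* Exhaustive check over the iteration range m..n. */
bool depends_in_range(int a, int b, int c, int d, int m, int n)
{
    for (int i = m; i <= n; i++)
        for (int j = m; j <= n; j++)
            if (a * i + b == c * j + d)
                return true;
    return false;
}
```

For the example above, `gcd_test_may_depend(4, 1, 6, 4)` is false, in agreement with the brute-force check. A made-up pair such as `A[2*i]` and `A[2*i+1000]` over i = 1..100 passes the GCD test (2 divides 1000) although the two index ranges never meet, which is the situation the last line of the slide describes.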

Reducing impact of dependent computations
Copy propagation (used in loop unrolling):
    DADDUI R1, R2, #4
    DADDUI R1, R1, #4
becomes
    DADDUI R1, R2, #8
Tree height reduction:
    ADD R1, R2, R3
    ADD R4, R1, R6
    ADD R8, R4, R7
becomes
    ADD R1, R2, R3
    ADD R4, R6, R7
    ADD R8, R1, R4
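
The tree height reduction above can be written at the source level as follows. This is a minimal sketch in C using integer operands, with hypothetical function names; the variable names mirror the registers in the example.

```c
#include <assert.h>

/* Chain form: each add waits for the previous one (three dependent steps). */
long sum_chain(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    long r8 = r4 + r7;
    return r8;
}

/* Tree form: the first two adds are independent (two dependent steps). */
long sum_tree(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r6 + r7;
    long r8 = r1 + r4;
    return r8;
}

/* Asserts that both forms agree on a sample input. */
void check_same_sum(void)
{
    assert(sum_chain(1, 2, 3, 4) == sum_tree(1, 2, 3, 4));
}
```

The rewrite relies on addition being associative. That holds for integers as long as no overflow occurs; for floating point, reassociation can change the rounding of the result, so a compiler typically applies it only when allowed to.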

Software pipelining: symbolic loop unrolling
Figure: iterations 0 through 6 of the loop, overlapped in time.

Software pipelining example
Original loop:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop
Iterations unrolled symbolically:
    iteration i:    L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
    iteration i+1:  L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
    iteration i+2:  L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
Software-pipelined loop:
    Loop: S.D    F4,16(R1)
          ADD.D  F4,F0,F2
          L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop
The pipelined loop body takes S.D from iteration i, ADD.D from iteration i+1 and L.D from iteration i+2.
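
The same idea can be sketched at the source level. The following minimal C version is an illustration under stated assumptions, not the compiler's actual output: it assumes `n >= 2` and that `x` has `n + 1` elements, and it includes the start-up and clean-up code that the assembly listing leaves out. The function and variable names are hypothetical; `loaded` plays the role of F0 and `summed` the role of F4.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s; */
void add_scalar_swp(double *x, size_t n, double s)
{
    assert(n >= 2);

    /* Prologue: fill the pipeline with the first load and add. */
    double loaded = x[n];            /* load for element n     */
    double summed = loaded + s;      /* add for element n      */
    loaded = x[n - 1];               /* load for element n - 1 */

    /* Steady state: each pass stores one element, adds the next,
     * and loads the one after that. */
    for (size_t i = n; i > 2; i--) {
        x[i] = summed;               /* store for element i     */
        summed = loaded + s;         /* add for element i - 1   */
        loaded = x[i - 2];           /* load for element i - 2  */
    }

    /* Epilogue: drain the two elements still in flight. */
    x[2] = summed;
    x[1] = loaded + s;
}
```

Within one pass of the steady-state loop the three statements belong to three different original iterations, so none of them has to wait for another.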

Difficulties in software pipelining
- Overheads: increased register requirement
- Register management required
- Loop body may be complex
- It may require several transformations before pipelining

Global code scheduling
- Consider regions of code which are larger than basic blocks
- Include multiple basic blocks and conditionals
- How to move code across branch and join points?

Static scheduling and branch prediction
Static branch prediction is helpful with:
- delayed branches
- static scheduling
Methods:
- Fixed prediction
- Opcode based prediction
- Address based prediction
- Profile driven prediction (misprediction ~10%, instructions between mispredictions ~100)

Branch prediction and scheduling
        LD    R1, 0(R2)
        DSUBU R1, R1, R3
        BEQZ  R1, L
        OR    R4, R5, R6
        DADDU R10, R4, R3
    L:  DADDU R7, R8, R9
A (OR R4, R5, R6): move when the branch is predicted as not taken and R4 is not needed in the taken path
B (DADDU R7, R8, R9): move when the branch is predicted as taken and R7 is not needed in the fall-through path

Global code scheduling
- When can the assignment to B be moved before the comparison?
- When can the assignment to C be moved above the join point? Above the comparison?
Figure: flow graph with nodes A[i]=A[i]+B[i]; A[i]=0?; B[i]=...; X; C[i]=...; one path is marked as the predicted path.

Trace scheduling
Figure: a trace through A[i]=A[i]+B[i]; A[i]=0?; B[i]=...; C[i]=..., marked with trace exits and a trace entry.

Trace scheduling
Figure: the same trace repeated twice, each copy marked with a trace exit and a trace entry.

Region for global scheduling
- Trace: linear path through code (with high probability), multiple entries and exits
- Superblock: linear path with single entry, multiple exits
- Hyperblock: superblock plus internal control flow
- Treegion: tree with single entry, multiple exits
- Trace-2: loop free region

Trace
Figure: control flow graph of basic blocks B1 to B6 with edge execution counts, illustrating trace formation.

Superblock
Figure: control flow graph of basic blocks B1 to B6 with edge execution counts, illustrating superblock formation.

Superblock with tail duplication
Figure: control flow graph of basic blocks B1 to B6 with edge execution counts, before and after tail duplication; B4 and B6 gain copies B4’ and B6’, and the edge counts are split between the copies.

Hyperblock
Figure: control flow graph of basic blocks B1 to B6 with edge execution counts, illustrating hyperblock formation; B6 gains a copy B6’.

Treegion
Figure: control flow graph of basic blocks B1 to B6 with edge execution counts, illustrating treegion formation; B4, B5 and B6 gain copies B4’, B5’, B6’, B6” and B6”’.
