1
CSL718 : VLIW - Software Driven ILP
Compiler Support for Exposing and Exploiting ILP
1st Apr, 2006
Anshul Kumar, CSE IITD
2
Code Scheduling for VLIW
Objective is to move code around and form packets of concurrently executable instructions.
Two possibilities:
- Local: work on a straight-line piece of code (basic block), i.e., do not go across conditional branches
- Global: code can move across conditional branches
Loops need to be tackled in both cases.
3
Pipeline scheduling example
for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
4
Latency due to data hazards
Producer instruction   Consumer instruction   Latency (cycles)
FP ALU op              FP ALU op              3
FP ALU op              Store double           2
Load double            FP ALU op              1
Assume no structural hazards.
5
Straight forward scheduling
                              Cycle
Loop: L.D    F0, 0(R1)        1
      stall                   2
      ADD.D  F4, F0, F2       3
      stall                   4
      stall                   5
      S.D    F4, 0(R1)        6
      DADDUI R1, R1, #-8      7
      stall                   8
      BNE    R1, R2, Loop     9
      stall                   10
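The stall count in this schedule can be reproduced mechanically. The following is a minimal sketch in C, assuming a single-issue, in-order pipeline in which each instruction has at most one producer. The names `Instr`, `count_cycles` and `naive_loop` are hypothetical, and the integer-op-to-branch latency of 1 and the one-cycle branch delay are assumptions inferred from the stalls shown on this slide, not entries of the latency table.

```c
#include <stddef.h>

/* One instruction in program order. `producer` is the index of the earlier
   instruction whose result it consumes (-1 if none); `latency` is the number
   of stall cycles required between that producer and this consumer. */
typedef struct {
    int producer;
    int latency;
} Instr;

/* Cycles used by a single-issue, in-order pipeline that issues at most one
   instruction per cycle and stalls until the operand is ready.
   `branch_delay` is the number of cycles lost after the final branch.
   Returns -1 for input this sketch does not handle. */
int count_cycles(const Instr *code, size_t n, int branch_delay)
{
    int issue[64];                       /* issue cycle of each instruction */
    if (n == 0 || n > 64) return -1;
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        int p = code[i].producer;
        if (p >= (int)i) return -1;      /* a producer must come earlier */
        if (p >= 0) {
            int ready = issue[p] + code[i].latency + 1;
            if (ready > cycle) cycle = ready;
        }
        issue[i] = cycle;
    }
    return issue[n - 1] + branch_delay;
}

/* The loop body in its original order. */
const Instr naive_loop[5] = {
    { -1, 0 },  /* L.D    F0, 0(R1)                               */
    {  0, 1 },  /* ADD.D  F4, F0, F2    load -> FP ALU op: 1      */
    {  1, 2 },  /* S.D    F4, 0(R1)     FP ALU op -> store: 2     */
    { -1, 0 },  /* DADDUI R1, R1, #-8                             */
    {  3, 1 },  /* BNE    R1, R2, Loop  assumed latency 1         */
};
```

Calling `count_cycles(naive_loop, 5, 1)` issues the five instructions in cycles 1, 3, 6, 7 and 9 and adds the branch delay, which matches the 10 cycles of the straightforward schedule.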
6
A better schedule
                              Cycle
Loop: L.D    F0, 0(R1)        1
      DADDUI R1, R1, #-8      2
      ADD.D  F4, F0, F2       3
      stall                   4
      BNE    R1, R2, Loop     5
      S.D    F4, 8(R1)        6
7
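Loop unrolling
                              Cycle
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)        6
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)       12
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)     18
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)     24
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop     28
28/4 = 7 cycles per iteration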
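At source level, unrolling four times corresponds roughly to the following C sketch. The function names are hypothetical, the array is indexed 1..n as in the original loop, and the clean-up loop for trip counts that are not a multiple of 4 is an addition that the slide does not need because 1000 is divisible by 4.

```c
/* Original loop: one element per iteration, one decrement and one branch
   for every element. */
void add_scalar(double *x, int n, double s)
{
    for (int i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled 4 times: the decrement and branch are paid once per 4 elements. */
void add_scalar_unrolled4(double *x, int n, double s)
{
    int i = n;
    for (; i >= 4; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
    /* Clean-up for the remaining 0..3 elements. */
    for (; i > 0; i--)
        x[i] = x[i] + s;
}
```

The four statements in the unrolled body are independent of one another, which is what lets a scheduler interleave their loads, adds and stores.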
8
Re-scheduling
                              Cycle
Loop: L.D    F0, 0(R1)        1
      L.D    F6, -8(R1)       2
      L.D    F10, -16(R1)     3
      L.D    F14, -24(R1)     4
      ADD.D  F4, F0, F2       5
      ADD.D  F8, F6, F2       6
      ADD.D  F12, F10, F2     7
      ADD.D  F16, F14, F2     8
      S.D    F4, 0(R1)        9
      S.D    F8, -8(R1)       10
      DADDUI R1, R1, #-32     11
      S.D    F12, 16(R1)      12    (16 - 32 = -16)
      BNE    R1, R2, Loop     13
      S.D    F16, 8(R1)       14    (8 - 32 = -24)
14/4 = 3.5 cycles per iteration
9
Limits to unrolling
- Decrease in the amount of loop overhead amortized with each additional unroll
- Growth in code size
- Register renaming leads to register pressure
10
Scheduling example with 2 issue proc
Issue slot 1                  Issue slot 2              Cycle
Loop: L.D    F0, 0(R1)                                  1
      L.D    F6, -8(R1)                                 2
      L.D    F10, -16(R1)     ADD.D F4, F0, F2          3
      L.D    F14, -24(R1)     ADD.D F8, F6, F2          4
      L.D    F18, -32(R1)     ADD.D F12, F10, F2        5
      S.D    F4, 0(R1)        ADD.D F16, F14, F2        6
      S.D    F8, -8(R1)       ADD.D F20, F18, F2        7
      S.D    F12, -16(R1)                               8
      DADDUI R1, R1, #-40                               9
      S.D    F16, 16(R1)                                10
      BNE    R1, R2, Loop                               11
      S.D    F20, 8(R1)                                 12
11
Scheduling example with 5 issue proc
Cycle  Slot 1             Slot 2             Slot 3              Slot 4              Slot 5
1      L.D F0,0(R1)       L.D F6,-8(R1)
2      L.D F10,-16(R1)    L.D F14,-24(R1)
3      L.D F18,-32(R1)    L.D F22,-40(R1)    ADD.D F4,F0,F2      ADD.D F8,F6,F2
4      L.D F26,-48(R1)                       ADD.D F12,F10,F2    ADD.D F16,F14,F2
5                                            ADD.D F20,F18,F2    ADD.D F24,F22,F2
6      S.D F4,0(R1)       S.D F8,-8(R1)      ADD.D F28,F26,F2
7      S.D F12,-16(R1)    S.D F16,-24(R1)                                            DADDUI R1,R1,#-56
8      S.D F20,24(R1)     S.D F24,16(R1)
9      S.D F28,8(R1)                                                                 BNE R1,R2,Loop
12
Scheduling Results
                                   cycles/iteration
Straight forward scheduling        10
With instruction re-ordering       6
With loop unrolling (4 times)      7
Unrolling + re-ordering            3.5
Scheduling on 2 issue VLIW         2.4  (12 cycles / 5 iterations)
Scheduling on 5 issue VLIW         1.3  (9 cycles / 7 iterations)
13
Loop Level Parallelism
Dependences in context of a loop:
- Dependence within an iteration
- Dependence across iterations, or loop carried dependence
Example with no loop carried dependence:
for (i=1000; i>0; i--)
    x[i] = x[i] + s;
There is dependence within an iteration.
14
Example with loop carried dependence
for (i=1; i<=100; i++){
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
Assume that arrays are distinct and non-overlapping.
- S1 uses a value of A from the previous iteration: restricts overlapping of different iterations
- S2 uses a value of A from the same iteration: restricts movement of instructions within an iteration
15
Another example
for (i=1; i<=100; i++){
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
S1 uses a value computed by S2 in the previous iteration. Still, iterations can be parallelized:
- There is no cycle among the dependences.
- A transformation can remove the loop carried dependence.
16
Transformed loop of previous example
A[1] = A[1] + B[1];
for (i=1; i<=99; i++){
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
Now there is no loop carried dependence. The iterations can be parallelized, preserving the dependence within the iteration.
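The equivalence of the two loops can be checked directly. The sketch below, in C, puts both versions side by side; the function names and the constant `N` are hypothetical, and the arrays are assumed to be distinct and indexed from 1, so each needs at least N+2 elements. Note the final statement after the transformed loop: it computes B[101], which the original loop produces in its last iteration.

```c
#define N 100

/* Original loop: S1 in iteration i reads B[i], written by S2 in iteration i-1. */
void loop_original(double *A, double *B, const double *C, const double *D)
{
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i + 1] = C[i] + D[i];      /* S2 */
    }
}

/* Transformed loop: S2 of iteration i is paired with S1 of iteration i+1,
   so the only dependence left is inside one iteration. */
void loop_transformed(double *A, double *B, const double *C, const double *D)
{
    A[1] = A[1] + B[1];              /* S1 of the first iteration */
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];          /* S2 of the last iteration */
}
```

Both functions perform the same floating-point operations on the same operands for every element, so they should leave A[1..100] and B[1..101] identical.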
17
Dependence distance
for (i=6; i<=100; i++){
    B[i] = B[i-5] + B[i];
}
Dependence distance is 5: iteration i depends only on iteration i-5, so any 5 consecutive iterations are independent of one another. This gives an opportunity for parallelization in spite of the loop carried dependence.
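One way to see the opportunity is to split the loop into groups of 5 iterations. The sketch below, in C with hypothetical function names, runs each group in reverse order; because a group only reads values written by the previous group, the order inside a group does not matter, so the 5 iterations could equally run in parallel.

```c
/* Reference version: iterations in their original order. B has 101 elements. */
void dist5_sequential(double *B)
{
    for (int i = 6; i <= 100; i++)
        B[i] = B[i - 5] + B[i];
}

/* Groups of 5 consecutive iterations. Within a group, B[i-5] always belongs
   to the previous group, so the inner loop may run in any order; here it
   runs backwards to make that visible. */
void dist5_grouped(double *B)
{
    for (int base = 6; base <= 100; base += 5) {
        for (int i = base + 4; i >= base; i--) {
            if (i <= 100)
                B[i] = B[i - 5] + B[i];
        }
    }
}
```

The groups themselves must still execute in order, since each group depends on the one before it.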
18
Finding dependence
Important for
- code scheduling
- loop parallelization
- removal of false dependences
Analysis is complicated by
- pointers and passing of parameters by reference, and the consequent potential for aliasing
- use of complex expressions as indices of arrays
19
Analysis with affine indices
for (i=m; i<=n; i++){
    A[a*i+b] = A[c*i+d] + B[i];
}
Can a*i+b (the element written in one iteration) become equal to c*j+d (the element read in another) for some i, j within the range m to n?
Difficult to determine in general:
- a, b, c, d may not be known at compile time.
- These could depend on other loop indices.
When a, b, c, d are constants, we can use the GCD test. If a dependence exists, GCD(c,a) must divide d-b.
20
Example with affine indices
for (i=1; i<=100; i++){
    A[4*i+1] = A[6*i+4] + B[i];
}
GCD(4,6) is 2 and 4-1 is 3. 2 does not divide 3. Therefore, the indices can never have the same value.
values of 4*i+1 = 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, ... (all odd)
values of 6*i+4 = 10, 16, 22, 28, 34, 40, 46, 52, 58, ... (all even)
Sometimes GCD(c,a) may divide d-b, but a dependence may still not exist.
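The GCD test and its limitation can be sketched in C as follows. The function names are hypothetical. `gcd_test_may_depend` is the conservative compile-time test for constant a, b, c, d; `exact_dependence` is a brute-force check over the actual loop bounds, included only to show cases where the GCD test answers "maybe" although no dependence exists.

```c
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a write to A[a*i+b] and a read of A[c*j+d].
   Returns 0 if a dependence is impossible, 1 if it cannot be ruled out. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)                      /* both indices are constants */
        return b == d;
    return (d - b) % g == 0;
}

/* Brute force: is there any pair of iterations i, j in [m, n]
   with a*i+b == c*j+d? */
int exact_dependence(int a, int b, int c, int d, int m, int n)
{
    for (int i = m; i <= n; i++)
        for (int j = m; j <= n; j++)
            if (a * i + b == c * j + d)
                return 1;
    return 0;
}
```

For the loop above, `gcd_test_may_depend(4, 1, 6, 4)` returns 0. For a hypothetical loop writing A[2*i] and reading A[2*i+300] with i from 1 to 100, the GCD test returns 1 because 2 divides 300, yet `exact_dependence(2, 0, 2, 300, 1, 100)` returns 0: the written indices 2..200 never reach the read indices 302..500.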
21
Reducing impact of dependent computations
Copy propagation (used in loop unrolling)
    DADDUI R1,R2,#4
    DADDUI R1,R1,#4
becomes
    DADDUI R1,R2,#8
Tree height reduction
    ADD R1,R2,R3
    ADD R4,R1,R6
    ADD R8,R4,R7
becomes
    ADD R1,R2,R3
    ADD R4,R6,R7
    ADD R8,R1,R4
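Tree height reduction relies on addition being associative. A minimal C sketch of the two forms follows; the function names are hypothetical and the variables are named after the registers in the example. It assumes integer operands that do not overflow; for floating-point values the two forms can round differently, so a compiler may apply the transformation only when that is permitted.

```c
/* Chain: each add depends on the previous one (dependence height 3). */
long sum_chain(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    long r8 = r4 + r7;
    return r8;
}

/* Tree: the first two adds are independent and can issue together
   (dependence height 2). */
long sum_tree(long r2, long r3, long r6, long r7)
{
    long r1 = r2 + r3;
    long r4 = r6 + r7;
    long r8 = r1 + r4;
    return r8;
}
```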
22
Software pipelining: symbolic loop unrolling
[Figure: overlapped execution of iteration 0 through iteration 6]
23
Software pipelining example
Original loop:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop

Successive iterations:
iteration i:    L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
iteration i+1:  L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)
iteration i+2:  L.D F0,0(R1)    ADD.D F4,F0,F2    S.D F4,0(R1)

Software pipelined loop:
Loop: S.D    F4,16(R1)       (store from iteration i)
      ADD.D  F4,F0,F2        (add from iteration i+1)
      L.D    F0,0(R1)        (load from iteration i+2)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop
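The same restructuring can be written at source level. The C sketch below is an illustration under stated assumptions, not the compiler's actual output: the function name is hypothetical, the array is indexed 1..n, and the start-up (prologue) and wind-down (epilogue) code, which the slide does not show, is written out explicitly. The variables `loaded` and `summed` play the roles of F0 and F4.

```c
/* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s; */
void add_scalar_swp(double *x, int n, double s)
{
    if (n < 3) {                         /* too short to pipeline */
        for (int i = n; i > 0; i--)
            x[i] = x[i] + s;
        return;
    }

    /* Prologue: start the first two elements. */
    double loaded = x[n];                /* load for element n     */
    double summed = loaded + s;          /* add  for element n     */
    loaded = x[n - 1];                   /* load for element n-1   */

    /* Kernel: each pass works on three different elements. */
    for (int i = n; i > 2; i--) {
        x[i] = summed;                   /* store for element i    */
        summed = loaded + s;             /* add   for element i-1  */
        loaded = x[i - 2];               /* load  for element i-2  */
    }

    /* Epilogue: finish the last two elements. */
    x[2] = summed;
    x[1] = loaded + s;
}
```

Inside the kernel the store, add and load belong to different elements, so none of them has to wait for another operation issued in the same pass.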
24
Difficulties in software pipelining
- Overheads: increased register requirement
- Register management required
- Loop body may be complex
- It may require several transformations before pipelining
25
Global code scheduling
- Consider regions of code which are larger than basic blocks
- Include multiple basic blocks and conditionals
- How to move code across branch and join points?
26
Static scheduling and branch prediction
Static branch prediction is helpful with
- delayed branches
- static scheduling
Methods
- Fixed prediction
- Opcode based prediction
- Address based prediction
- Profile driven prediction (misprediction ~10%, instructions between mispredictions ~100)
27
Branch prediction and scheduling
      LD    R1, 0(R2)
      DSUBU R1, R1, R3
      BEQZ  R1, L
      OR    R4, R5, R6        (A)
      DADDU R10, R4, R3
L:    DADDU R7, R8, R9        (B)
A: move OR above the branch when the branch is predicted as not taken and R4 is not needed in the taken path
B: move DADDU above the branch when the branch is predicted as taken and R7 is not needed in the not-taken (fall-through) path
28
Global code scheduling
When can the assignment to B be moved before the comparison?
When can the assignment to C be moved above the join point? Above the comparison?
[Figure: flow graph. A[i]=A[i]+B[i] is followed by the test A[i]=0?; the predicted path goes through B[i]=..., the other path through X; both paths join at C[i]=...]
29
Trace scheduling
[Figure: the trace A[i]=A[i]+B[i]; A[i]=0?; B[i]=...; C[i]=..., with its trace exits and trace entry marked]
30
Region for global scheduling
- Trace: linear path through code (with high probability), multiple entries and exits
- Superblock: linear path with single entry, multiple exits
- Hyperblock: superblock plus internal control flow
- Treegion: tree with single entry, multiple exits
- Trace-2: loop free region
31
Trace
[Figure: control flow graph of basic blocks B1 to B6 with execution counts on the edges, illustrating a trace]
32
Superblock
[Figure: control flow graph of basic blocks B1 to B6 with execution counts on the edges, illustrating a superblock]
33
Superblock with tail duplication
[Figure: control flow graph of basic blocks B1 to B6 with execution counts on the edges, before and after tail duplication; the duplicated blocks are B4’ and B6’]
34
Hyperblock
[Figure: control flow graph of basic blocks B1 to B6 with execution counts on the edges, illustrating a hyperblock; the duplicated block is B6’]
35
Treegion
[Figure: control flow graph of basic blocks B1 to B6 with execution counts on the edges, illustrating a treegion; the duplicated blocks are B4’, B5’, B6’, B6” and B6”’]