Slide 1: Compiler techniques for exposing ILP (cont.)
Slide 2: Loop-Level Parallelism

Analysis at the source level: look for dependences across iterations.

No dependence across iterations; every iteration is independent:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

Loop-carried dependences; each iteration uses values computed in the previous one:

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];    /* loop-carried dependence */
        y[i+1] = y[i] + x[i+1];
    }
Slide 3: Loop-Carried Dependences

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

The dependences here are non-circular (x[i] uses the y[i] computed in the previous iteration, but nothing in the loop depends on x), so the loop can be transformed to remove the loop-carried dependence:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];
Slide 4: Compiler support for ILP

Dependence analysis
- Finding dependences is important for:
  - good scheduling of code
  - determining loop-level parallelism
  - eliminating name dependences
- Complexity:
  - simple for scalar variable references
  - complex for pointers and array references (see the C sketch below)

Software pipelining

Trace scheduling
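As a minimal sketch of why pointers are the hard case (hypothetical function, not from the slides): unless the compiler can prove that p and q never alias, it must conservatively assume a dependence between the store and the load and cannot parallelize or freely reorder the loop.

    /* Hypothetical illustration: dependence analysis through pointers.
       If the caller passes p == q + 1, iteration i writes the element
       that iteration i+1 reads -- a loop-carried dependence. The
       compiler must assume this unless it can prove p and q never
       alias. */
    void shift_add(int *p, int *q, int n) {
        for (int i = 0; i < n; i++)
            p[i] = q[i] + 1;
    }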
Slide 5: Loop-level Parallelism

The primary focus of dependence analysis: determine all dependences and find cycles.

No loop-carried dependence (the dependence on x is within one iteration):

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

Loop-carried, recurrent, circular dependence (each iteration needs the previous iteration's result):

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];
    }

Loop-carried but non-circular dependence (from the previous slide):

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

which can be transformed into:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];
Slide 6: Dependence Analysis Algorithms

Assume array indexes are affine (a*i + b).

GCD test: for two affine array indexes a*i+b and c*i+d, if a loop-carried dependence exists, then GCD(a,c) must divide (d-b).

Example:

    x[8*i] = x[4*i + 2] + 3;

Here a=8, b=0, c=4, d=2: GCD(8,4) = 4 does not divide (d-b) = 2, so no loop-carried dependence exists.

Limitations:
- General graph cycle determination is NP-complete.
- a, b, c, and d may not be known at compile time.
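To make the test concrete, here is a minimal C sketch of the GCD test applied to the example above (hypothetical helper names; a real compiler pass would also handle unknown or negative coefficients):

    #include <stdio.h>

    /* greatest common divisor via Euclid's algorithm */
    static int gcd(int a, int b) {
        while (b != 0) { int t = b; b = a % b; a = t; }
        return a;
    }

    /* GCD test for accesses a*i+b (write) and c*i+d (read):
       a loop-carried dependence can exist only if
       gcd(a, c) divides (d - b). */
    static int may_depend(int a, int b, int c, int d) {
        return (d - b) % gcd(a, c) == 0;
    }

    int main(void) {
        /* x[8*i] = x[4*i + 2] + 3  =>  a=8, b=0, c=4, d=2 */
        printf("dependence possible: %s\n",
               may_depend(8, 0, 4, 2) ? "yes" : "no");  /* prints "no" */
        return 0;
    }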
Slide 7: Software Pipelining

[Figure: successive loop iterations overlap in time; the steady-state overlap forms the software-pipelined iteration, with start-up code before it and finish-up code after it.]
Slide 8: Example

Three consecutive iterations of the original loop:

    Iteration i:    LD   F0, 0(R1)
                    ADDD F4, F0, F2
                    SD   0(R1), F4
    Iteration i+1:  LD   F0, 0(R1)
                    ADDD F4, F0, F2
                    SD   0(R1), F4
    Iteration i+2:  LD   F0, 0(R1)
                    ADDD F4, F0, F2
                    SD   0(R1), F4

Original loop:

    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop

Software-pipelined loop (each instruction in the body comes from a different iteration):

    Loop: SD   16(R1), F4    ; store for iteration i
          ADDD F4, F0, F2    ; add for iteration i+1
          LD   F0, 0(R1)     ; load for iteration i+2
          SUBI R1, R1, #8
          BNEZ R1, Loop
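Writing the same transformation in C makes the start-up and finish-up code from the previous slide explicit. This is a minimal sketch (hypothetical function name; assumes n >= 2), with the kernel holding one operation from each of three iterations:

    /* software-pipelined version of:
       for (i = n-1; i >= 0; i--) x[i] += s; */
    void add_scalar(double *x, int n, double s) {
        double f0 = x[n-1];        /* start-up: load for iteration 0 */
        double f4 = f0 + s;        /* start-up: add for iteration 0  */
        f0 = x[n-2];               /* start-up: load for iteration 1 */
        for (int i = n-1; i >= 2; i--) {
            x[i] = f4;             /* store for the oldest iteration */
            f4 = f0 + s;           /* add for the middle iteration   */
            f0 = x[i-2];           /* load for the newest iteration  */
        }
        x[1] = f4;                 /* finish-up: store               */
        f4 = f0 + s;               /* finish-up: last add            */
        x[0] = f4;                 /* finish-up: last store          */
    }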
Slide 9: Trace (Global-Code) Scheduling

Find ILP across conditional branches. Two-step process:

Trace selection
- Find a trace (a sequence of basic blocks)
- Use loop unrolling to generate long traces
- Use static branch prediction for other conditional branches

Trace compaction
- Squeeze the trace into a small number of wide instructions
- Preserve data and control dependences
Slide 10: Trace Selection

Source fragment:

    A[I] = A[I] + B[I];
    if (A[I] == 0)
        B[I] = ...;
    else
        X;
    C[I] = ...;

[Flowchart: test "A[I] = 0?"; the true path assigns B[I], the false path executes X; both paths join at the assignment to C[I].]

The corresponding code, with the trace following the fall-through path to the B[I] assignment:

          LW   R4, 0(R1)      # load A[I]
          LW   R5, 0(R2)      # load B[I]
          ADD  R4, R4, R5     # A[I] + B[I]
          SW   0(R1), R4      # store A[I]
          BNEZ R4, else       # A[I] = 0?
          SW   0(R2), ...     # B[I] = ...
          J    join
    else: X
    join: SW   0(R3), ...     # C[I] = ...
Slide 11: Summary of Compiler Techniques

Try to avoid dependence stalls:
- Loop unrolling: reduces loop overhead
- Software pipelining: reduces dependence stalls within a single loop body
- Trace scheduling: reduces the impact of other branches

Compilers use a mix of all three. All techniques depend on prediction accuracy.
Slide 12: Analyze This

Analyze this code for different values of X and Y:
- to evaluate different branch prediction schemes
- for compiler scheduling purposes

          add  r1, r0, 1000   # all numbers in decimal
          add  r2, r0, a      # base address of array a
    loop: andi r10, r1, X
          beqz r10, even
          lw   r11, 0(r2)
          addi r11, r11, 1
          sw   0(r2), r11
    even: addi r2, r2, 4
          subi r1, r1, Y
          bnez r1, loop
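As a starting point for the analysis, here is a C sketch of what the loop computes (hypothetical function name; assumes a is an int array and X and Y are constants). Note, for example, that with X = 1 and Y = 1 the beqz branch alternates strictly between taken and not taken.

    /* C equivalent of the assembly above */
    void analyze(int *a, int X, int Y) {
        int *p = a;                          /* r2 = base address of a */
        for (int i = 1000; i != 0; i -= Y) { /* r1 counts down by Y    */
            if (i & X)                       /* andi r10, r1, X; the
                                                beqz is taken when
                                                (i & X) == 0           */
                *p += 1;                     /* lw; addi; sw           */
            p++;                             /* addi r2, r2, 4         */
        }
    }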
Slide 13: Midterm Performance

Solutions are posted on the website. Scores ranged from 16.5 to 45.5 out of 50.

[Grade distribution histogram: C+ B- B B+ A- A A+]

If you have questions about your score, please see me on Thursday during office hours.