Download presentation
Presentation is loading. Please wait.
1
Compiler Techniques for ILP
Instruction Level Parallelism Scheduling, Loop unrolling, Software pipelining
2
Software – Compiler ILP Techniques chapter 4
3
(ILP) Instruction level parallelism
Overlap unrelated instructions 17% are branches 5 instructions + 1 branch Must go Beyond single block to get more ILP Loop level parallelism one opportunity SW ch 3 HW (dynamic scheduling ch 4)
4
Example :: FP Loop Hazards & stalls To demo various techniques
Loop: LD F0,0(R1) ;F0=vector element ADDD F4,F0,F2 ;add scalar from F2 SD 0(R1),F4 ;store result SUBI R1,R1,8 ;decrement pointer BNEZ R1,Loop ;branch R1!=zero NOP ;delayed branch slot
5
FP Loop dependency Stalls
1 Loop: LD F0,0(R1) ;F0=vector element 2 stall ;latency between LD, ADDD 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall ; latency between ADDD, SD 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8 BNEZ R1,Loop ;branch R1!=zero 9 stall ;delayed branch slot 9 clocks: Rewrite to minimize stalls
6
Revised code to Minimize Stalls SD moved to slot 6, to separate from ADD
1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 SUBI R1,R1,8 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4 ;altered when move past SUBI 6 clocks Unroll loop to make code faster
7
Loop Unrolling Example
Before unrolling loop: ld f1, 0(r1) add r1, 8 fadd f2, f0, f1 sd f2, 0(r2) add r2, 8 bne r1, r3, loop for (i=0; i<N; i++) B[i] = A[i] + C; Unrolled inner loop 4 iterations at once for (i=0; i<N; i+=4) { B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C; } after unrolling Loop unrolling reduces branches & code Allows aggressive pipeline scheduling if N not multiple of unrolling factor; handle with final cleanup loop
8
Scheduled Unrolled Code
Multiple units needed Unroll 4 ways Int1 Int 2 M1 M2 FP+ FPx ld f1 ld f2 ld f3 ld f4 add r1 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r2 bne loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) sd f8, 24(r2) add r2, 32 bne r1, r3, loop loop: Schedule 4 fadds / 11 cycles = 0.36 Ptrs r1, r2 now incremented by 32
9
Another ILP technique: Software Pipelining
For independent loop iterations, take instructions from different loop iterations reorganize loops each iteration made from instructions from different loop iterations Software Pipelining is a loop scheduling technique that extracts loop parallelism by overlapping the execution of several consecutive iterations. One of the drawbacks of software pipelining is its high register requirements, which increase with the number of functional units and their degree of pipelining... dependent operations within each iteration are separated. More efficient than unrolling. Register management tough. Supported in itanium.
10
Software Pipelining Example
Int1 Int 2 M1 M2 FP+ FPx Unroll 4 ways first ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne loop: ld f1, 0(r1) ld f2, 8(r1) ld f3, 16(r1) ld f4, 24(r1) add r1, 32 fadd f5, f0, f1 fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4 sd f5, 0(r2) sd f6, 8(r2) sd f7, 16(r2) add r2, 32 sd f8, -8(r2) bne r1, r3, loop loop: iterate prolog epilog ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 sd f8 add r1 add r2 bne ld f1 ld f2 ld f3 ld f4 fadd f5 fadd f6 fadd f7 fadd f8 sd f5 add r1 How many FLOPS/cycle? 4 fadds / 4 cycles = 1
11
Software Pipelining vs. Loop Unrolling
Loop Unrolled Wind-down overhead performance Startup overhead time Loop Iteration Software Pipelined performance time Loop Iteration Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
12
Register Renaming eliminates name dependencies Only RAWs remain
Loop: L.D F0,0(R1) ADD.D F4,F0,F2 S.D F4,0(R1) ;drop DADDUI,BNE L.D F6,-8(R1) ADD.D F8,F6,F2 S.D F8,-8(R1) ;drop DADDUI,BNE L.D F10,-16(R1) ADD.D F12,F10,F2 S.D F12,-16(R1) ;drop DADDUI,BNE L.D F14,-24(R1) ADD.D F16,F14,F2 S.D F16,-24(R1) ;drop DADDUI,BNE DADDUI R1,R1,#-32 BNE R1,R2,Loop
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.