CS 286: Loop Unrolling
Instruction Rescheduling and Loop-Unroll
Department of Computer Science, Southern Illinois University Edwardsville
Fall, 2018
Dr. Hiroshi Fujinoki
E-mail: hfujino@siue.edu
Loop-unrolling

Loop-unrolling is a technique for increasing instruction-level parallelism (ILP) in loop structures.

Example: a for-loop structure written in a high-level programming language

    for (i = 0; i < 1000; i++) {
        a[i] = a[i] + 10;
    }

There is an array of integers, a[i], which has 1,000 elements.
The loop adds a constant, 10, to every element in the array.
Assumptions

[Figure: main memory drawn from the high address (FFFFFF) to the low address (000000), holding the array elements a[0], a[1], ..., a[999]. Each element is 8 bytes. R1 is drawn beside a[0] and R2 beside a[999], marking the two ends of the array.]

F2 = 10
After the high-level programming language statements are compiled

    for (i = 0; i < 1000; i++) {
        a[i] = a[i] + 10;
    }

becomes

    LOOP: LW   F0, 0(R1)      // F0 = Mem[R1]
          ADDI F4, F0, F2     // F4 = F0 + F2
          SW   F4, 0(R1)      // Mem[R1] = F4
          ADDI R1, R1, -8     // R1 = R1 - 8
          BNE  R1, R2, LOOP   // if R1 != R2, go to LOOP

BNE = "Branch if NOT EQUAL"
Categorizing instruction types

    LOOP: LW   F0, 0(R1)      // F0 = Mem[R1]              Load
          ADDI F4, F0, F2     // F4 = F0 + F2              ALU operation
          SW   F4, 0(R1)      // Mem[R1] = F4              Store
          ADDI R1, R1, -8     // R1 = R1 - 8               ALU operation
          BNE  R1, R2, LOOP   // if R1 != R2, go to LOOP   Conditional branch
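Identifying all pipeline hazards

    LOOP: LW   F0, 0(R1)      // F0 = Mem[R1]
          ADDI F4, F0, F2     // F4 = F0 + F2
          SW   F4, 0(R1)      // Mem[R1] = F4
          ADDI R1, R1, -8     // R1 = R1 - 8
          BNE  R1, R2, LOOP   // if R1 != R2, go to LOOP

RAW: LW writes F0, then ADDI F4, F0, F2 reads F0
RAW: ADDI F4, F0, F2 writes F4, then SW reads F4
WAR: SW reads R1, then ADDI R1, R1, -8 writes R1
RAW: ADDI R1, R1, -8 writes R1, then BNE reads R1
Control hazard: BNE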
Determining stalled and flushed cycles

How many cycles are stalled or flushed due to RAW and control hazards?

    Instruction                Type      Hazard                    # of stalls
    LOOP: LW   F0, 0(R1)       Load
          ADDI F4, F0, F2      ALU-OP    RAW on F0 (after Load)    1
          SW   F4, 0(R1)       Store     RAW on F4 (after ALU-OP)  2
          ADDI R1, R1, -8      ALU-OP
          BNE  R1, R2, LOOP    Branch    RAW on R1 (after ALU-OP)  1
                                         Control hazard            1 cycle flushed
Instruction issuing schedule with stalls and flush

    Cycle issued   Instruction
     1             LOOP: LW   F0, 0(R1)      // F0 = Mem[R1]
     2                   stall
     3                   ADDI F4, F0, F2     // F4 = F0 + F2
     4                   stall
     5                   stall
     6                   SW   F4, 0(R1)      // Mem[R1] = F4
     7                   ADDI R1, R1, -8     // R1 = R1 - 8
     8                   stall
     9                   BNE  R1, R2, LOOP   // if R1 != R2, go to LOOP
    10                   flush
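The 10-cycle count can be reproduced with a small calculation. The following is a minimal sketch, not part of the original slides: it assumes in-order issue of one instruction per cycle and at most one producer per instruction, and the names `struct instr`, `cycles_per_iteration`, and `original_loop_cycles` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction in program order. `producer` is the index of the earlier
 * instruction whose result it needs (-1 if none); `stalls` is the number of
 * stalled cycles required between that producer and this instruction. */
struct instr {
    const char *text;
    int producer;
    int stalls;
};

/* Cycles occupied by one pass through `code`, issuing in order, one
 * instruction per cycle, plus `flush` cycles lost after the final branch.
 * The limit of 16 instructions is an arbitrary choice for this sketch. */
int cycles_per_iteration(const struct instr *code, size_t n, int flush)
{
    int issue[16];
    assert(n > 0 && n <= 16);
    for (size_t i = 0; i < n; i++) {
        int earliest = (i == 0) ? 1 : issue[i - 1] + 1;
        assert(code[i].producer < (int)i);
        if (code[i].producer >= 0) {
            int ready = issue[code[i].producer] + 1 + code[i].stalls;
            if (ready > earliest)
                earliest = ready;
        }
        issue[i] = earliest;
    }
    return issue[n - 1] + flush;
}

/* The five-instruction loop body with the stall counts from the slides. */
int original_loop_cycles(void)
{
    static const struct instr body[] = {
        {"LW   F0, 0(R1)",    -1, 0},
        {"ADDI F4, F0, F2",    0, 1},  /* RAW on F0 after a load      */
        {"SW   F4, 0(R1)",     1, 2},  /* RAW on F4 after an ALU op   */
        {"ADDI R1, R1, -8",   -1, 0},
        {"BNE  R1, R2, LOOP",  3, 1},  /* RAW on R1 before the branch */
    };
    return cycles_per_iteration(body, sizeof body / sizeof body[0], 1);
}
```

Under these assumptions the instructions issue in cycles 1, 3, 6, 7, and 9, and the flushed cycle after BNE brings the total to 10.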
Technique #4: Instruction Re-Scheduling

Starting point (10 cycles): LW, stall, ADDI F4, stall, stall, SW, ADDI R1, stall, BNE, flush.

Re-scheduling steps:

- Move ADDI R1, R1, -8 ahead of SW so that it fills a stalled cycle.
- SW now executes after R1 has been decremented, so SW F4, 0(R1) becomes SW F4, 8(R1). Make sure to add 8!
- Delayed-branch applied: SW F4, 8(R1) is placed after BNE R1, R2, LOOP, in the cycle that used to be flushed. The loop is completed there.
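The "add 8" adjustment can be checked at the C level. This is a minimal sketch, not taken from the slides: it assumes the 8-byte elements are modeled as `int64_t`, with `p` playing the role of R1 and `stop` the role of R2, and the function names are hypothetical. Once the pointer decrement is moved ahead of the store, the store has to target one element (8 bytes) above the new pointer value.

```c
#include <assert.h>
#include <stdint.h>

/* Original order: load, add, store, then decrement the pointer. */
void add_const_original(int64_t *p, const int64_t *stop, int64_t c)
{
    assert(p != stop);            /* the loop body runs at least once */
    do {
        int64_t f0 = *p;          /* LW   F0, 0(R1)    */
        int64_t f4 = f0 + c;      /* ADDI F4, F0, F2   */
        *p = f4;                  /* SW   F4, 0(R1)    */
        p -= 1;                   /* ADDI R1, R1, -8   */
    } while (p != stop);          /* BNE  R1, R2, LOOP */
}

/* Re-scheduled order: the decrement happens before the store, so the
 * store uses offset +1 element (8 bytes) to reach the same location. */
void add_const_rescheduled(int64_t *p, const int64_t *stop, int64_t c)
{
    assert(p != stop);
    do {
        int64_t f0 = *p;          /* LW   F0, 0(R1)             */
        p -= 1;                   /* ADDI R1, R1, -8 (moved up) */
        int64_t f4 = f0 + c;      /* ADDI F4, F0, F2            */
        p[1] = f4;                /* SW   F4, 8(R1)             */
    } while (p != stop);          /* BNE  R1, R2, LOOP          */
}
```

Both functions walk from `p` down to, but not including, `stop`, and leave the array in the same state. In the C version the store is written before the loop test; with a delayed branch the hardware issues it after BNE, but it still executes once per iteration.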
Technique #5: Loop-Unrolling

    LOOP1: LW   F0, 0(R1)
           ADDI F4, F0, F2
           SW   F4, 0(R1)
           ADDI R1, R1, -8
           stall
           BNE  R1, R2, LOOP
           flush

    LOOP2: LW   F0, 0(R1)
           ADDI F4, F0, F2
           SW   F4, 0(R1)
           ADDI R1, R1, -8
           stall
           BNE  R1, R2, LOOP
           flush

We repeat this 1,000 times.
Merge them together.
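Technique #5: Loop-Unrolling

WAW dependency (pseudo dependency) = name dependency

    LOOP1                          LOOP2 (before)        LOOP2 (after)
    LW  F0, 0(R1)                  LW  F0, 0(R1)         LW  F6, 8(R1)
    ADD F4, F0, F2                 ADD F4, F0, F2        ADD F8, F6, F2
    SW  F4, 0(R1)                  SW  F4, 0(R1)         SW  F8, 8(R1)
    ADD R1, R1, -8    (removed)    ADD R1, R1, -8        ADD R1, R1, -16
    BNE R1, R2, LOOP  (removed)    BNE R1, R2, LOOP      BNE R1, R2, LOOP

Both copies write F0 and F4, which creates a WAW dependency between them. It is only a name dependency, so the second copy uses F6 and F8 instead.
The first copy's ADD R1 / BNE pair is removed together with its stall and flush, and the remaining ADD decrements R1 by 16.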
Technique #5: Loop-Unrolling

    Cycle issued   Instruction
     1             LOOP: LW  F0, 0(R1)
     2                   LW  F6, 8(R1)
     3                   ADD F4, F0, F2
     4                   ADD F8, F6, F2
     5                   stall
     6                   SW  F4, 0(R1)
     7                   SW  F8, 8(R1)
     8                   ADD R1, R1, -16
     9                   stall
    10                   BNE R1, R2, LOOP
    11                   flush

Previous: 10 cycles × 1,000 iterations = 10,000 cycles
Now:      11 cycles × 500 iterations   =  5,500 cycles
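At the source level, unrolling by two looks roughly like the sketch below. It is an illustration rather than compiler output: the function names are hypothetical, the elements are modeled as `int64_t`, and the C loop counts upward as in the original for-loop, whereas the assembly walks R1 downward. The temporaries are named after the registers to show the renaming of the second copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rolled loop: one element and one branch per iteration. */
void add_const(int64_t *a, size_t n, int64_t c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + c;
}

/* Unrolled by two: two elements per iteration, half as many branches.
 * Assumes n is even, as with the 1,000-element array in the slides. */
void add_const_unrolled2(int64_t *a, size_t n, int64_t c)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        int64_t f0 = a[i];        /* first copy:  F0            */
        int64_t f6 = a[i + 1];    /* second copy: renamed to F6 */
        int64_t f4 = f0 + c;      /* first copy:  F4            */
        int64_t f8 = f6 + c;      /* second copy: renamed to F8 */
        a[i]     = f4;
        a[i + 1] = f8;
    }
}
```

Because the two copies use separate temporaries, the two loads and the two adds are independent of each other, which is what lets the second copy fill the first copy's stalled cycles.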
Further Improvement

Is further improvement possible?

- Combine instruction scheduling (Technique #4) and loop-unrolling
- More loop-unrolling
    - Especially eliminates control hazards
    - Further eliminates stalls

But how much loop-unrolling should be performed?
How much loop-unrolling should be performed?

- Too much unrolling: the loop size becomes too big
- Too little unrolling: stalls still exist
- The best unrolling: only enough to eliminate stalls

How can we know the best unrolling if the number of loop iterations is unknown before run-time?
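One common way to handle an iteration count that is only known at run-time (not stated in the slides, so treat it as an illustration) is to pair the unrolled loop with a short cleanup loop for the leftover elements. The sketch below assumes an unroll factor of four and a hypothetical function name, `add_const_unrolled4`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds c to each of the n elements of a, for any n known only at run-time. */
void add_const_unrolled4(int64_t *a, size_t n, int64_t c)
{
    assert(a != NULL || n == 0);
    size_t i = 0;
    /* Main unrolled loop: runs while at least four elements remain. */
    for (; i + 4 <= n; i += 4) {
        a[i]     += c;
        a[i + 1] += c;
        a[i + 2] += c;
        a[i + 3] += c;
    }
    /* Cleanup loop: handles the 0 to 3 leftover elements. */
    for (; i < n; i++)
        a[i] += c;
}
```

The unroll factor is fixed when the code is written or compiled; only the split between the two loops depends on the run-time value of n.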
Code Optimization Examples by Visual Studio 2010
CS 286 Computer Architecture & Organization

Assumptions (Part 2)

Numbers of stalled cycles for this CPU are defined as follows:

- Branch slot (for a conditional branch) = 1 cycle
- RAW dependency for integer ALU instructions = 1 cycle

    Instruction producing result (WRITE)   Instruction using result (READ)   Stalled cycles
    FP ALU operation                       Another FP ALU operation          3
    FP ALU operation                       Store FP data                     2
    Load FP data                           FP ALU operation                  1
    Load FP data                           Store FP data

(This table appears on page 304 of the textbook)