
Instruction Rescheduling and Loop-Unroll


1 Instruction Rescheduling and Loop-Unroll
CS 286: Loop Unrolling. Department of Computer Science, Southern Illinois University Edwardsville. Fall 2018. Dr. Hiroshi Fujinoki

2 Loop-unrolling
Loop unrolling is a technique for increasing instruction-level parallelism (ILP) in loop structures.
Example: a for-loop written in a high-level programming language:
    for (i = 0; i < 1000; i++) { a[i] = a[i] + 10; }
There is an array of integers, a[i], with 1,000 elements. The loop adds a constant, 10, to every element of the array.

3 Assumptions
[Figure: main memory drawn from the high address (FFFFFF) to the low address (000000). The array elements a[0], a[1], ..., a[999] are 8 bytes each. Registers R1 and R2 are drawn beside the array, and F2 = 10.]

4 After the high-level programming language statements are compiled
    for (i = 0; i < 1000; i++) { a[i] = a[i] + 10; }
compiles to:
    LOOP: LW   F0, 0(R1)       // F0 = Mem[R1]
          ADDI F4, F0, F2      // F4 = F0 + F2
          SW   F4, 0(R1)       // Mem[R1] = F4
          ADDI R1, R1, -8      // R1 = R1 - 8
          BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP
BNE = "Branch if NOT EQUAL"

5 After the high-level programming language statements are compiled
    for (i = 0; i < 1000; i++) { a[i] = a[i] + 10; }
The same loop in register-transfer notation:
    LOOP: LW   F0, 0(R1)       // F0 ← Mem[R1]
          ADDI F4, F0, F2      // F4 ← F0 + F2
          SW   F4, 0(R1)       // Mem[R1] ← F4
          ADDI R1, R1, -8      // R1 ← R1 - 8
          BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP

6 Categorizing instruction types
    LOOP: LW   F0, 0(R1)       // F0 ← Mem[R1]         Load
          ADDI F4, F0, F2      // F4 ← F0 + F2         ALU operation
          SW   F4, 0(R1)       // Mem[R1] ← F4         Store
          ADDI R1, R1, -8      // R1 ← R1 - 8          ALU operation
          BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP       Conditional branch

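7 Identifying all pipeline hazards
    LOOP: LW   F0, 0(R1)       // F0 ← Mem[R1]
          ADDI F4, F0, F2      // F4 ← F0 + F2
          SW   F4, 0(R1)       // Mem[R1] ← F4
          ADDI R1, R1, -8      // R1 ← R1 - 8
          BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP
Hazards:
    RAW: LW writes F0, and ADDI F4, F0, F2 reads it
    RAW: ADDI writes F4, and SW reads it
    WAR: SW reads R1, and ADDI R1, R1, -8 then writes it
    RAW: ADDI writes R1, and BNE reads it
    Control hazard: BNE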

8 Determining stalled and flushed cycles
How many cycles are stalled or flushed due to RAW and control hazards?
    LOOP: LW   F0, 0(R1)       // F0 ← Mem[R1]
          ADDI F4, F0, F2      // F4 ← F0 + F2
          SW   F4, 0(R1)       // Mem[R1] ← F4
          ADDI R1, R1, -8      // R1 ← R1 - 8
          BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP

    Hazard           Producer → consumer                # of stalls
    RAW              LW (Load) → ADDI (ALU-OP)           1
    RAW              ADDI (ALU-OP) → SW (Store)          2
    RAW              ADDI (ALU-OP) → BNE (Branch)        1
    Control hazard   BNE                                 1-cycle flush

9 Instruction issuing schedule w/ stalls and flush
    Cycle   Issued
    1       LOOP: LW   F0, 0(R1)       // F0 ← Mem[R1]
    2             stall
    3             ADDI F4, F0, F2      // F4 ← F0 + F2
    4             stall
    5             stall
    6             SW   F4, 0(R1)       // Mem[R1] ← F4
    7             ADDI R1, R1, -8      // R1 ← R1 - 8
    8             stall
    9             BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP
    10            flush
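As a quick check of this schedule, the ten issue slots can be tallied in a few lines of C. This is only an illustrative sketch: the `slot_kind` enum, the `baseline_schedule` array, and the helper functions are names made up here and are not part of the course material.

```c
#include <stddef.h>

/* One issue slot in the schedule: a real instruction, a stall, or a flush. */
typedef enum { SLOT_INSTR, SLOT_STALL, SLOT_FLUSH } slot_kind;

/* The ten slots of one loop iteration, in the order listed on the slide. */
static const slot_kind baseline_schedule[] = {
    SLOT_INSTR,  /* cycle 1:  LW   F0, 0(R1)     */
    SLOT_STALL,  /* cycle 2:  load -> ALU RAW    */
    SLOT_INSTR,  /* cycle 3:  ADDI F4, F0, F2    */
    SLOT_STALL,  /* cycle 4:  ALU -> store RAW   */
    SLOT_STALL,  /* cycle 5:  ALU -> store RAW   */
    SLOT_INSTR,  /* cycle 6:  SW   F4, 0(R1)     */
    SLOT_INSTR,  /* cycle 7:  ADDI R1, R1, -8    */
    SLOT_STALL,  /* cycle 8:  ALU -> branch RAW  */
    SLOT_INSTR,  /* cycle 9:  BNE  R1, R2, LOOP  */
    SLOT_FLUSH   /* cycle 10: control hazard     */
};

/* Number of slots in the baseline schedule (cycles per iteration). */
size_t baseline_len(void) {
    return sizeof baseline_schedule / sizeof baseline_schedule[0];
}

/* Count how many slots of a given kind appear in a schedule. */
size_t count_slots(const slot_kind *schedule, size_t n, slot_kind kind) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (schedule[i] == kind) {
            count++;
        }
    }
    return count;
}
```

Counting this way gives 5 instructions, 4 stalls, and 1 flush, so only half of the 10 cycles in each iteration do useful work.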

10 Technique #4: Instruction Re-Scheduling
    Cycle   Issued
    1       LOOP: LW   F0, 0(R1)       // F0 ← Mem[R1]
    2             stall
    3             ADDI F4, F0, F2      // F4 ← F0 + F2
    4             stall
    5             stall
    6             SW   F4, 0(R1)       // Mem[R1] ← F4
    7             ADDI R1, R1, -8      // R1 ← R1 - 8
    8             stall
    9             BNE  R1, R2, LOOP    // R1 ≠ R2 → LOOP
    10            flush

11 Technique #4: Instruction Re-Scheduling
Starting from the 10-cycle schedule (LW, stall, ADDI F4, stall, stall, SW, ADDI R1, stall, BNE, flush), instructions are moved so that they fill stall and flush slots:
    ADDI R1, R1, -8 is moved up, ahead of the store.
    SW F4, 0(R1) becomes SW F4, 8(R1). Make sure to add 8! R1 has already been decremented by 8 by the time the store executes.
    Delayed branch is applied, so the slot after BNE R1, R2, LOOP does useful work instead of being flushed. The loop iteration completes there.
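The "add 8" adjustment can be checked at source level. The sketch below is an assumption-laden illustration, not the slide's code: it models the 8-byte elements with `int64_t`, models R1 with a pointer `p`, and the two function names are invented here. Both functions write the same element and leave the pointer in the same place; only the order of the store and the pointer update differs.

```c
#include <stdint.h>

/* Original order: store through the pointer, then move it down one element.
 * Corresponds to SW F4, 0(R1) followed by ADDI R1, R1, -8. */
int64_t *store_then_decrement(int64_t *p, int64_t value) {
    p[0] = value;
    return p - 1;
}

/* Rescheduled order: the pointer moves first, so the store must use an
 * offset of +1 element (+8 bytes) to reach the same location.
 * Corresponds to ADDI R1, R1, -8 followed by SW F4, 8(R1). */
int64_t *decrement_then_store(int64_t *p, int64_t value) {
    p = p - 1;
    p[1] = value;
    return p;
}
```

Called on a pointer into the middle of an array, both versions store to the element the pointer originally addressed and return a pointer one element lower.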

12 Technique #5: Loop-Unrolling
    LOOP: LW   F0, 0(R1)
          ADDI F4, F0, F2
          SW   F4, 0(R1)
          ADDI R1, R1, -8
          BNE  R1, R2, LOOP
Each pass through the loop includes the stalls and the flush. We repeat this 1,000 times.

13 Technique #5: Loop-Unrolling
Write out two consecutive iterations, each with its own stalls and flush:
    LOOP1: LW   F0, 0(R1)
           ADDI F4, F0, F2
           SW   F4, 0(R1)
           ADDI R1, R1, -8
           BNE  R1, R2, LOOP
    LOOP2: LW   F0, 0(R1)
           ADDI F4, F0, F2
           SW   F4, 0(R1)
           ADDI R1, R1, -8
           BNE  R1, R2, LOOP
We repeat this 1,000 times. Merge them together.
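At source level, merging two iterations looks roughly like the sketch below. It is an illustration only, under assumptions that are not in the slides: the element type is `long`, the function names are invented here, and the unrolled version assumes the element count is even (as it is for the 1,000-element array).

```c
#include <stddef.h>

/* Rolled loop: one element and one loop test per iteration. */
void add_const_rolled(long *a, size_t n, long c) {
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + c;
    }
}

/* Unrolled by two: two elements per iteration, so the index update and the
 * loop test run half as often. Assumes n is even. */
void add_const_unrolled2(long *a, size_t n, long c) {
    for (size_t i = 0; i < n; i += 2) {
        long t0 = a[i];      /* first copy of the loop body  */
        long t1 = a[i + 1];  /* second copy, its own temporary */
        a[i] = t0 + c;
        a[i + 1] = t1 + c;
    }
}
```

Both functions produce the same array contents; the unrolled one simply executes 500 passes instead of 1,000 for the array in the slides.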

14 Technique #5: Loop-Unrolling
WAW Dependency (Pseudo Dependency) = Name Dependency
Merging LOOP1 and LOOP2 drops the first copy's ADD R1, R1, -8 and BNE R1, R2, LOOP, together with their stall and flush. Both copies write F0 and F4, which is a WAW (name) dependency, so the second copy is renamed to F6 and F8. Because R1 is no longer decremented between the two copies, the second copy addresses its element as -8(R1), and the single remaining pointer update subtracts 16.
    Second copy before merging        After merging and renaming
    LW  F0, 0(R1)                     LW  F6, -8(R1)
    ADD F4, F0, F2                    ADD F8, F6, F2
    SW  F4, 0(R1)                     SW  F8, -8(R1)
    ADD R1, R1, -8 (in each copy)     ADD R1, R1, -16 (once)

15 Technique #5: Loop-Unrolling
    Cycle   Issued
    1       LOOP: LW  F0, 0(R1)
    2             LW  F6, -8(R1)
    3             ADD F4, F0, F2
    4             ADD F8, F6, F2
    5             stall
    6             SW  F4, 0(R1)
    7             SW  F8, -8(R1)
    8             ADD R1, R1, -16
    9             stall
    10            BNE R1, R2, LOOP
    11            flush
Previous: 10 cycles × 1,000
Now: 11 cycles × 500
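The comparison on this slide can be worked through numerically. The sketch below is illustrative only; the function names are invented here, and the per-pass counts are read off the schedules: 5 instructions, 4 stalls, and 1 flush for the original loop, and 8 instructions, 2 stalls, and 1 flush for the loop unrolled by two.

```c
/* Cycles for one pass through a loop body: every issued instruction, stall,
 * and flushed slot costs one cycle. */
unsigned long cycles_per_pass(unsigned long instructions,
                              unsigned long stalls,
                              unsigned long flushes) {
    return instructions + stalls + flushes;
}

/* Total cycles for a loop that makes `passes` passes. */
unsigned long total_cycles(unsigned long per_pass, unsigned long passes) {
    return per_pass * passes;
}
```

With these counts, `total_cycles(cycles_per_pass(5, 4, 1), 1000)` is 10,000 cycles and `total_cycles(cycles_per_pass(8, 2, 1), 500)` is 5,500 cycles, so unrolling by two makes the loop roughly 1.8 times faster under the slide's assumptions.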

16 Further Improvement
Is further improvement possible?
    • Combine instruction scheduling (Technique 4) and loop unrolling
    • More loop unrolling
        Especially eliminates control hazards
        Further eliminates stalls
But how much loop unrolling should be performed?

17 How much loop unrolling should be performed?
    Too much unrolling: the loop size becomes too big
    Too little unrolling: stalls still exist
    The best unrolling: only enough to eliminate stalls
How can we know the best unrolling if the number of iterations is unknown before run time?
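One common way to cope with an iteration count that is only known at run time is to pair the unrolled loop with a short cleanup loop for the leftover elements. The slides do not state this; the sketch below is an assumption offered for illustration, with an invented function name, a `long` element type, and an arbitrary unroll factor of 4.

```c
#include <stddef.h>

/* Add c to every element of a[0..n-1], where n is only known at run time.
 * The main loop handles groups of four; the cleanup loop handles the
 * remaining 0 to 3 elements one at a time. */
void add_const_unrolled4(long *a, size_t n, long c) {
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* largest multiple of 4 not above n */

    for (; i < main_end; i += 4) {
        a[i]     += c;
        a[i + 1] += c;
        a[i + 2] += c;
        a[i + 3] += c;
    }
    for (; i < n; i++) {            /* cleanup loop */
        a[i] += c;
    }
}
```

This keeps the unrolled body correct for any n, but it does not answer which unroll factor is best; that still depends on how many independent instructions are needed to fill the stall slots.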

18 Code Optimization Examples by Visual Studio 2010

19 CS 286 Computer Architecture & Organization
Assumptions (Part 2)
The numbers of stalled cycles for this CPU are defined as follows:
    Branch slot (for a conditional branch) = 1 cycle
    RAW dependency for integer ALU instructions = 1 cycle

    Instruction producing result (WRITE)   Instruction using result (READ)   Stalled cycles
    FP ALU operation                       Another FP ALU operation          3
    FP ALU operation                       Store FP data                     2
    Load FP data                           FP ALU operation                  1
    Load FP data                           Store FP data
(This table appears on page 304 of the textbook.)
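The stall table lends itself to a small lookup function. The sketch below is illustrative: the `instr_class` enum and `fp_stall_cycles` are names invented here, and the function encodes only the three rows whose stall counts appear in the slide text. The Load FP data to Store FP data row has no value in the text, so that pair (like any unlisted pair) returns -1 rather than a guessed number.

```c
/* Instruction classes that appear in the slide's stall table. */
typedef enum { FP_ALU_OP, LOAD_FP, STORE_FP } instr_class;

/* Stalled cycles between a producer and a dependent consumer, per the table.
 * Returns -1 when the slide text gives no value for the pair. */
int fp_stall_cycles(instr_class producer, instr_class consumer) {
    if (producer == FP_ALU_OP && consumer == FP_ALU_OP) {
        return 3;
    }
    if (producer == FP_ALU_OP && consumer == STORE_FP) {
        return 2;
    }
    if (producer == LOAD_FP && consumer == FP_ALU_OP) {
        return 1;
    }
    return -1;  /* not listed with a value */
}
```

For the loop in the earlier slides, `fp_stall_cycles(LOAD_FP, FP_ALU_OP)` gives the 1 stall after LW and `fp_stall_cycles(FP_ALU_OP, STORE_FP)` gives the 2 stalls before SW.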

