1
Instruction Rescheduling and Loop-Unroll
CS 286: Loop Unrolling
Department of Computer Science, Southern Illinois University Edwardsville
Fall 2018
Dr. Hiroshi Fujinoki
2
Loop-unrolling

Loop-unrolling is a technique to increase instruction-level parallelism (ILP) in loop structures.

Example: a for-loop written in a high-level programming language

for (i = 0; i < 1000; i++) {
    a[i] = a[i] + 10;
}

There is an array of integers, a[i], which has 1,000 elements.
The loop adds a constant, 10, to every element in the array.
3
Assumptions

[Figure: main-memory layout of the array, drawn from high address (FFFFFF) to low address (000000). Labels in order: R1, a[0], a[1], ..., a[999], R2. Each element occupies 8 bytes.]

F2 = 10
4
After the high-level programming language statements are compiled

for (i = 0; i < 1000; i++) {
    a[i] = a[i] + 10;
}

LOOP: LW   F0, 0(R1)       // F0 = Mem[R1]
      ADDI F4, F0, F2      // F4 = F0 + F2
      SW   F4, 0(R1)       // Mem[R1] = F4
      ADDI R1, R1, -8      // R1 = R1 - 8
      BNE  R1, R2, LOOP    // if R1 ≠ R2, go to LOOP

BNE = "Branch if NOT EQUAL"
6
Categorizing instruction types

LOOP: LW   F0, 0(R1)       // F0 = Mem[R1]             Load
      ADDI F4, F0, F2      // F4 = F0 + F2             ALU operation
      SW   F4, 0(R1)       // Mem[R1] = F4             Store
      ADDI R1, R1, -8      // R1 = R1 - 8              ALU operation
      BNE  R1, R2, LOOP    // if R1 ≠ R2, go to LOOP   Conditional branch
7
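Identifying all pipeline hazards

LOOP: LW   F0, 0(R1)       // F0 = Mem[R1]
      ADDI F4, F0, F2      // F4 = F0 + F2
      SW   F4, 0(R1)       // Mem[R1] = F4
      ADDI R1, R1, -8      // R1 = R1 - 8
      BNE  R1, R2, LOOP    // if R1 ≠ R2, go to LOOP

RAW: LW writes F0, which the following ADDI reads.
RAW: ADDI writes F4, which SW reads.
WAR: SW reads R1, which the following ADDI writes.
RAW: ADDI writes R1, which BNE reads.
Control hazard: BNE.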
8
Determining stalled and flushed cycles

How many cycles are stalled or flushed due to RAW and control hazards?

Instruction               Type      Hazard            # of stalls
LW   F0, 0(R1)            Load      RAW               1
ADDI F4, F0, F2           ALU-OP    RAW               2
SW   F4, 0(R1)            Store
ADDI R1, R1, -8           ALU-OP    RAW               1
BNE  R1, R2, LOOP         Branch    Control hazard    1 cycle flush
9
Instruction issuing schedule w/ stalls and flush

Cycle  Issued
1      LOOP: LW   F0, 0(R1)       // F0 = Mem[R1]
2      stall
3            ADDI F4, F0, F2      // F4 = F0 + F2
4      stall
5      stall
6            SW   F4, 0(R1)       // Mem[R1] = F4
7            ADDI R1, R1, -8      // R1 = R1 - 8
8      stall
9            BNE  R1, R2, LOOP    // if R1 ≠ R2, go to LOOP
10     flush
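The 10-cycle total of this schedule can be checked with a few lines of C. This is only a minimal sketch: the function name `cycles_per_iteration` and the array `original_loop_stalls` are illustrative names that do not appear in the slides. The stall counts are the ones listed for this loop (1 after LW, 2 after the ADDI that produces F4, 1 after the ADDI that updates R1, and 1 flushed cycle after BNE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): counts the issue cycles
 * of one loop iteration. stalls_after[i] is the number of stalled or
 * flushed cycles that follow instruction i. */
int cycles_per_iteration(const int *stalls_after, size_t n_instr)
{
    int cycles = 0;
    for (size_t i = 0; i < n_instr; i++) {
        assert(stalls_after[i] >= 0);
        cycles += 1 + stalls_after[i]; /* one issue slot plus its stalls */
    }
    return cycles;
}

/* Order: LW, ADDI (F4), SW, ADDI (R1), BNE */
const int original_loop_stalls[5] = {1, 2, 0, 1, 1};
```

Calling `cycles_per_iteration(original_loop_stalls, 5)` adds 5 issue slots, 4 stalls, and 1 flush, which gives the 10 cycles shown in the schedule.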
10
Technique #4: Instruction Re-Scheduling

Cycle  Issued
1      LOOP: LW   F0, 0(R1)       // F0 = Mem[R1]
2      stall
3            ADDI F4, F0, F2      // F4 = F0 + F2
4      stall
5      stall
6            SW   F4, 0(R1)       // Mem[R1] = F4
7            ADDI R1, R1, -8      // R1 = R1 - 8
8      stall
9            BNE  R1, R2, LOOP    // if R1 ≠ R2, go to LOOP
10     flush
11
Technique #4: Instruction Re-Scheduling

Independent instructions are moved into the stall and flush slots of the 10-cycle schedule.

When SW is moved after the ADDI that updates R1, R1 has already been decremented by 8, so the store offset must change from 0 to 8. Make sure to add 8!

SW F4, 0(R1)    // Mem[R1] = F4
becomes
SW F4, 8(R1)    // Mem[R1 + 8] = F4

Delayed branch is applied: the cycle after BNE does useful work instead of being flushed, and the loop iteration is completed there.
12
Technique #5: Loop-Unrolling

LOOP: LW   F0, 0(R1)
      ADDI F4, F0, F2
      SW   F4, 0(R1)
      ADDI R1, R1, -8      (followed by a stall)
      BNE  R1, R2, LOOP    (followed by a flush)

We repeat this 1,000 times.
13
Technique #5: Loop-Unrolling

LOOP1: LW   F0, 0(R1)
       ADDI F4, F0, F2
       SW   F4, 0(R1)
       ADDI R1, R1, -8      (followed by a stall)
       BNE  R1, R2, LOOP    (followed by a flush)

LOOP2: LW   F0, 0(R1)
       ADDI F4, F0, F2
       SW   F4, 0(R1)
       ADDI R1, R1, -8      (followed by a stall)
       BNE  R1, R2, LOOP    (followed by a flush)

We repeat this 1,000 times. Merge the two copies together.
14
Technique #5: Loop-Unrolling

WAW dependency (pseudo dependency) = name dependency

If both copies of the loop body write F0 and F4, the merged code has WAW dependencies on those registers. They are name dependencies only, so the second copy is renamed to use F6 and F8. The two R1 updates and the two branches are merged into one of each.

LOOP: LW  F0, 0(R1)
      LW  F6, 8(R1)
      ADD F4, F0, F2
      ADD F8, F6, F2
      SW  F4, 0(R1)
      SW  F8, 8(R1)
      ADD R1, R1, -16
      BNE R1, R2, LOOP
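At the source level, the merged loop corresponds roughly to the C sketch below. The function name `add_ten_unrolled2` and the temporaries `t0` and `t1` are illustrative and not taken from the slides, and the sketch assumes an even element count, as with the 1,000-element array. It indexes the array upward, whereas the assembly steps R1 downward; the order of traversal does not change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical source-level analogue of the two-way unrolled loop.
 * Assumption: n is even. */
void add_ten_unrolled2(int *a, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* Two separate temporaries play the role of the renamed
         * registers (F0/F4 for one copy, F6/F8 for the other). */
        int t0 = a[i];
        int t1 = a[i + 1];
        t0 = t0 + 10;
        t1 = t1 + 10;
        a[i] = t0;
        a[i + 1] = t1;
    }
}
```

Because the two copies use different temporaries, neither copy has to wait for the other to finish with a shared name, which is what lets the loads, adds, and stores be interleaved.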
15
Technique #5: Loop-Unrolling

Cycle  Issued
1      LW  F0, 0(R1)
2      LW  F6, 8(R1)
3      ADD F4, F0, F2
4      ADD F8, F6, F2
5      stall
6      SW  F4, 0(R1)
7      SW  F8, 8(R1)
8      ADD R1, R1, -16
9      stall
10     BNE R1, R2, LOOP
11     flush

Previous: 10 cycles × 1,000 iterations = 10,000 cycles
Now: 11 cycles × 500 iterations = 5,500 cycles
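The comparison on this slide is simple arithmetic: cycles per trip through the loop multiplied by the number of trips. A minimal sketch in C follows; the helper name `total_cycles` is hypothetical, and it assumes the element count is an exact multiple of the unroll factor.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides):
 * total cycles = cycles per trip * number of trips,
 * where the number of trips is elements / unroll_factor. */
long total_cycles(long cycles_per_trip, long elements, long unroll_factor)
{
    assert(unroll_factor > 0);
    assert(elements % unroll_factor == 0);
    return cycles_per_trip * (elements / unroll_factor);
}
```

With the numbers from the schedules, `total_cycles(10, 1000, 1)` gives 10,000 cycles for the original loop and `total_cycles(11, 1000, 2)` gives 5,500 cycles for the two-way unrolled loop.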
16
Further Improvement

Is further improvement possible?
Combine instruction scheduling (Technique #4) and loop-unrolling
More loop-unrolling
Especially eliminate control hazards
Further eliminate stalls

But how many times should the loop be unrolled?
17
How many times should the loop be unrolled?

Too much unrolling: the loop body becomes too big.
Too little unrolling: stalls still exist.
The best unrolling: only enough to eliminate stalls.

How can we know the best unrolling if the number of loop iterations is unknown before run time?
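One common way to unroll when the iteration count is only known at run time is to follow the unrolled loop with a short cleanup loop for the leftover elements. The slides do not present this; the sketch below is an illustration only, and the function name `add_ten_unrolled4` and the unroll factor of 4 are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example (not from the slides): four-way unrolling
 * that works for any n, including values not divisible by 4. */
void add_ten_unrolled4(int *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t i = 0;

    /* Main loop: four elements per trip while at least four remain. */
    for (; n - i >= 4; i += 4) {
        a[i]     += 10;
        a[i + 1] += 10;
        a[i + 2] += 10;
        a[i + 3] += 10;
    }

    /* Cleanup loop: handles the 0 to 3 leftover elements. */
    for (; i < n; i++) {
        a[i] += 10;
    }
}
```

The cleanup loop keeps the result correct for every n, at the cost of extra code, which is the code-size trade-off named above.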
18
Code Optimization Examples by Visual Studio 2010
19
CS 286 Computer Architecture & Organization
Assumptions (Part 2)

Numbers of stalled cycles for this CPU are defined as follows:
Branch slot (for a conditional branch) = 1 cycle
RAW dependency for integer ALU instructions = 1 cycle

Instruction producing result (WRITE)    Instruction using result (READ)    Stalled cycles
FP ALU operation                        Another FP ALU operation           3
FP ALU operation                        Store FP data                      2
Load FP data                            FP ALU operation                   1
Load FP data                            Store FP data

(This table appears on page 304 of the textbook.)