Published by Briana Neal. Modified over 9 years ago.
1
Instruction Rescheduling and Loop-Unroll
CS 312 Computer Architecture & Organization
Department of Computer Science
Southern Illinois University Edwardsville
Fall, 2015
Dr. Hiroshi Fujinoki
E-mail: hfujino@siue.edu
2
Example

A for-loop structure written in a high-level programming language:

    for (i = 0; i < 1000; i++)
    {
        a[i] = a[i] + 10.19;
    }

There is an array of floating-point numbers, a[i], which has 1,000 elements.
Add a constant, 10.19, to every element in the FP array.

Loop-unrolling
Loop-unrolling is a technique to increase ILP (instruction-level parallelism) for loop structures.
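As a rough illustration of what unrolling means at the source level, the sketch below shows the loop from this slide next to a version unrolled by a factor of two. The function names and the parameter c (standing in for 10.19) are hypothetical, and the unrolled version assumes an even element count, which holds for the 1,000-element array here.

```c
#include <assert.h>
#include <stddef.h>

/* The loop as written on the slide: one element per pass. */
void add_const_rolled(double *a, size_t n, double c)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + c;
    }
}

/* Unrolled by two: each pass handles two elements, so the loop
 * counter update and the branch are executed half as often.
 * Assumption: n is even (true for n = 1000). */
void add_const_unrolled2(double *a, size_t n, double c)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        a[i]     = a[i]     + c;
        a[i + 1] = a[i + 1] + c;
    }
}
```

Both functions leave the array in the same state; only the number of branch and counter instructions executed differs.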
3
Assumptions

[Figure: main memory drawn from the low address (000000) to the high address (FFFFFF). The array elements a[0], a[1], ..., a[999] are stored in it, 8 bytes each. Registers R1 and R2 are shown as pointers into this region.]

F2 = 10.19
4
After the high-level programming language statements are compiled

    for (i = 0; i < 1000; i++)
    {
        a[i] = a[i] + 10.19;
    }

    LOOP: L.D    F0, 0(R1)      // F0 = Mem[R1]
          ADD.D  F4, F0, F2     // F4 = F0+F2
          S.D    F4, 0(R1)      // Mem[R1] = F4
          DADDUI R1, R1, -8     // R1 = R1-8
          BNE    R1, R2, LOOP   // if R1 != R2, go to LOOP

We focus on this loop structure.
BNE = "Branch if NOT EQUAL"
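The compiled loop walks the array from the high address down. A minimal C sketch of the same control flow, one statement per instruction, may make that easier to follow. The names are hypothetical; r1 is modeled as a byte offset from a[0], and the stop value r2 = -8 is an assumption (it is the value R1 reaches right after a[0] has been processed).

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the five-instruction loop, one C statement per instruction.
 * r1 is a byte offset from the start of the array (8 bytes per element). */
void add_const_descending(double *a, size_t n, double f2)
{
    assert(a != NULL && n > 0);
    long r1 = (long)(n - 1) * 8;   /* start at the highest element */
    const long r2 = -8;            /* assumed stop value, one element below a[0] */
    do {
        double f0 = a[r1 / 8];     /* L.D    F0, 0(R1)    */
        double f4 = f0 + f2;       /* ADD.D  F4, F0, F2   */
        a[r1 / 8] = f4;            /* S.D    F4, 0(R1)    */
        r1 = r1 - 8;               /* DADDUI R1, R1, -8   */
    } while (r1 != r2);            /* BNE    R1, R2, LOOP */
}
```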
5
After the high-level programming language statements are compiled

    for (i = 0; i < 1000; i++)
    {
        a[i] = a[i] + 10.19;
    }

    LOOP: LOAD  F0, 0(R1)      // F0 <- Mem[R1]
          ADD   F4, F0, F2     // F4 <- F0+F2
          STORE F4, 0(R1)      // Mem[R1] <- F4
          ADD   R1, R1, -8     // R1 <- R1-8
          BNE   R1, R2, LOOP   // if R1 != R2, go to LOOP
6
Assumptions (Part 2)

Branch slot (for a conditional branch) = 1 cycle
RAW dependency for integer ALU instructions = 1 cycle

Numbers of stalled cycles for this CPU are defined as follows:

    Instruction producing result (WRITE)   Instruction using result (READ)   Stalled cycles
    FP ALU operation                       Another FP ALU operation          3
    FP ALU operation                       Store FP data                     2
    Load FP data                           FP ALU operation                  1
    Load FP data                           Store FP data                     0

(This table appears on page 304 of the textbook.)
7
Categorizing instruction types

    LOOP: LOAD  F0, 0(R1)      // F0 <- Mem[R1]          Floating-point instruction
          ADD   F4, F0, F2     // F4 <- F0+F2            Floating-point instruction
          STORE F4, 0(R1)      // Mem[R1] <- F4          Floating-point instruction
          ADD   R1, R1, -8     // R1 <- R1-8             Integer instruction
          BNE   R1, R2, LOOP   // if R1 != R2, go to LOOP   Conditional branch
8
Identifying all pipeline hazards

    LOOP: LOAD  F0, 0(R1)      // F0 <- Mem[R1]
          ADD   F4, F0, F2     // F4 <- F0+F2
          STORE F4, 0(R1)      // Mem[R1] <- F4
          ADD   R1, R1, -8     // R1 <- R1-8
          BNE   R1, R2, LOOP   // if R1 != R2, go to LOOP

RAW: LOAD writes F0, then ADD reads F0
RAW: ADD writes F4, then STORE reads F4
WAR: STORE reads R1, then ADD writes R1
RAW: ADD writes R1, then BNE reads R1
Control hazard: BNE
9
Determining stalled and flushed cycles

How many cycles are stalled or flushed due to RAW and control hazards?

    LOOP: L.D    F0, 0(R1)      // F0 <- Mem[R1]
          ADD.D  F4, F0, F2     // F4 <- F0+F2
          S.D    F4, 0(R1)      // Mem[R1] <- F4
          DADDUI R1, R1, -8     // R1 <- R1-8
          BNE    R1, R2, LOOP   // if R1 != R2, go to LOOP

    Instruction   Type       Hazard with next instruction    # of stalls
    L.D           FP Load    RAW                             1
    ADD.D         FP ALU     RAW                             2
    S.D           FP Store   (no stall)                      0
    DADDUI        Int ALU    RAW                             1
    BNE           Branch     Control hazard                  1 cycle flush
10
Instruction issuing schedule w/ stalls and flush

                                                          Cycle issued
    LOOP: L.D    F0, 0(R1)      // F0 <- Mem[R1]           1
          stall                                            2
          ADD.D  F4, F0, F2     // F4 <- F0+F2             3
          stall                                            4
          stall                                            5
          S.D    F4, 0(R1)      // Mem[R1] <- F4           6
          DADDUI R1, R1, -8     // R1 <- R1-8              7
          stall                                            8
          BNE    R1, R2, LOOP   // if R1 != R2, go to LOOP 9
          flush                                            10
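The 10-cycle count can be reproduced with a small in-order issue model. The sketch below is only that: the type and function names are hypothetical, the register numbering is arbitrary, and producer/consumer pairs that the slides do not list are assumed to cause no stall. It encodes the stall table from the assumptions slide, the 1-cycle integer ALU to branch stall, and one flushed cycle after the branch.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FP_LOAD, FP_ALU, FP_STORE, INT_ALU, BRANCH } Kind;

/* Arbitrary register ids for this sketch; NONE marks an unused field. */
enum { NONE = -1, F0 = 0, F2 = 2, F4 = 4, R1 = 101, R2 = 102 };

typedef struct { Kind kind; int dst; int src1; int src2; } Instr;

/* Stall cycles required between a producer and a dependent consumer. */
static int stalls_between(Kind producer, Kind consumer)
{
    if (producer == FP_ALU  && consumer == FP_ALU)   return 3;
    if (producer == FP_ALU  && consumer == FP_STORE) return 2;
    if (producer == FP_LOAD && consumer == FP_ALU)   return 1;
    if (producer == FP_LOAD && consumer == FP_STORE) return 0;
    if (producer == INT_ALU && consumer == BRANCH)   return 1;
    return 0; /* assumption: pairs not listed on the slides never stall */
}

/* Cycles for one pass over the loop body, issuing one instruction per
 * cycle in program order and stalling on RAW dependences. */
int cycles_per_pass(const Instr *code, size_t n)
{
    int issue[64];
    int cycle = 0;
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;
        for (int s = 0; s < 2; s++) {
            int reg = (s == 0) ? code[i].src1 : code[i].src2;
            if (reg < 0) continue;
            /* Find the most recent earlier instruction writing this register. */
            for (size_t j = i; j-- > 0; ) {
                if (code[j].dst == reg) {
                    int ready = issue[j] + 1
                              + stalls_between(code[j].kind, code[i].kind);
                    if (ready > earliest) earliest = ready;
                    break;
                }
            }
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    if (code[n - 1].kind == BRANCH) cycle += 1; /* flushed branch slot */
    return cycle;
}

/* The five-instruction loop from the slides. */
int original_loop_cycles(void)
{
    const Instr loop[] = {
        { FP_LOAD,  F0,   R1, NONE }, /* L.D    F0, 0(R1)    */
        { FP_ALU,   F4,   F0, F2   }, /* ADD.D  F4, F0, F2   */
        { FP_STORE, NONE, F4, R1   }, /* S.D    F4, 0(R1)    */
        { INT_ALU,  R1,   R1, NONE }, /* DADDUI R1, R1, -8   */
        { BRANCH,   NONE, R1, R2   }, /* BNE    R1, R2, LOOP */
    };
    return cycles_per_pass(loop, sizeof loop / sizeof loop[0]);
}
```

Calling original_loop_cycles() walks the loop body once and returns the cycle in which the pass finishes, including the flushed slot.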
11
Technique #4: Instruction Re-Scheduling

                                                          Cycle issued
    LOOP: L.D    F0, 0(R1)      // F0 <- Mem[R1]           1
          stall                                            2
          ADD.D  F4, F0, F2     // F4 <- F0+F2             3
          stall                                            4
          stall                                            5
          S.D    F4, 0(R1)      // Mem[R1] <- F4           6
          DADDUI R1, R1, -8     // R1 <- R1-8              7
          stall                                            8
          BNE    R1, R2, LOOP   // if R1 != R2, go to LOOP 9
          flush                                            10
12
Technique #4: Instruction Re-Scheduling

[Figure: the 10-cycle schedule of the loop (L.D, ADD.D, S.D, DADDUI, BNE with their stall and flush slots), with instructions moved into the idle slots and a marker "Loop Completed Here".]

The store is rewritten as:

    S.D F4, 8(R1)   // Mem[R1+8] <- F4

Make sure to add 8! The store now executes after DADDUI has already subtracted 8 from R1, so the offset must compensate.

Delayed-branch applied.
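The "add 8" note can be checked with a small C sketch. It assumes, as the 8(R1) offset implies, that DADDUI has been moved above the store; the ordering shown in the comments (L.D, DADDUI, ADD.D, then the store in the branch slot) is one ordering consistent with the slide's notes, not necessarily the exact one drawn on it. The function name, the byte-offset variable r1, and the stop value r2 = -8 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rescheduled loop body: the pointer update is hoisted above the store,
 * so the store must use offset +8 to reach the element that was loaded. */
void add_const_rescheduled(double *a, size_t n, double f2)
{
    assert(a != NULL && n > 0);
    long r1 = (long)(n - 1) * 8;     /* byte offset of the highest element */
    const long r2 = -8;              /* assumed stop value */
    do {
        double f0 = a[r1 / 8];       /* L.D    F0, 0(R1)                 */
        r1 = r1 - 8;                 /* DADDUI R1, R1, -8  (moved up)    */
        double f4 = f0 + f2;         /* ADD.D  F4, F0, F2                */
        a[(r1 + 8) / 8] = f4;        /* S.D    F4, 8(R1)   (branch slot) */
    } while (r1 != r2);              /* BNE    R1, R2, LOOP              */
}
```

Without the +8, every store would land one element below the one that was loaded, and the final pass would write below a[0].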
13
Technique #5: Loop-Unrolling

    LOOP: LOAD  F0, 0(R1)
          stall
          ADD   F4, F0, F2
          stall
          stall
          STORE F4, 0(R1)
          ADD   R1, R1, -8
          stall
          BNE   R1, R2, LOOP
          flush

We repeat this 1,000 times.
14
Technique #5: Loop-Unrolling

We repeat this 1,000 times. Take two consecutive passes through the loop:

    LOOP1: LOAD  F0, 0(R1)
           stall
           ADD   F4, F0, F2
           stall
           stall
           STORE F4, 0(R1)
           ADD   R1, R1, -8
           stall
           BNE   R1, R2, LOOP
           flush

    LOOP2: LOAD  F0, 0(R1)
           stall
           ADD   F4, F0, F2
           stall
           stall
           STORE F4, 0(R1)
           ADD   R1, R1, -8
           stall
           BNE   R1, R2, LOOP
           flush

Merge them together.
15
Technique #5: Loop-Unrolling

The ADD R1 / BNE pair of the first copy (LOOP1) is removed. The second copy (LOOP2) is rewritten with its own registers (F6, F8), its own offset, and a single pointer update of -16:

    LOOP: LOAD  F0, 0(R1)
          ADD   F4, F0, F2
          STORE F4, 0(R1)
          LOAD  F6, 8(R1)
          ADD   F8, F6, F2
          STORE F8, 8(R1)
          ADD   R1, R1, -16
          BNE   R1, R2, LOOP

WAW dependency (pseudo dependency) = name dependency. If the second copy reused F0 and F4, it would conflict with the first copy only through the register names; using F6 and F8 removes that dependency.
16
Technique #5: Loop-Unrolling

Instructions in the unrolled loop body:

    LOOP: LOAD  F0, 0(R1)
          ADD   F4, F0, F2
          STORE F4, 0(R1)
          LOAD  F6, 8(R1)
          ADD   F8, F6, F2
          STORE F8, 8(R1)
          ADD   R1, R1, -16
          BNE   R1, R2, LOOP

[Figure: issue schedule of the unrolled loop over cycles 1 to 11, containing the eight instructions above, two stall cycles, and one flush cycle.]

Previous: 10 cycles × 1,000 iterations
Now: 11 cycles × 500 iterations
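Taking the slide's figures at face value, the saving is easy to work out: 10 × 1,000 = 10,000 cycles before, 11 × 500 = 5,500 cycles after, a speedup of roughly 1.8. The sketch below just spells out that arithmetic; the function names are hypothetical.

```c
#include <assert.h>

/* Total cycles for a loop: cycles per pass times number of passes. */
long total_cycles(long cycles_per_pass, long passes)
{
    assert(cycles_per_pass > 0 && passes > 0);
    return cycles_per_pass * passes;
}

/* Speedup of the unrolled loop over the original, using the
 * per-pass cycle counts and pass counts stated on the slide. */
double unroll_speedup(void)
{
    long before = total_cycles(10, 1000);  /* original loop        */
    long after  = total_cycles(11, 500);   /* unrolled by two      */
    return (double)before / (double)after;
}
```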
17
Further Improvement

Is further improvement possible?
Especially, can control hazards be eliminated?

Combine instruction scheduling (Technique 4) and loop-unrolling
More loop-unrolling
Further eliminate stalls

But how much loop-unrolling should be performed?
18
How much loop-unrolling should be performed?

    Too much unrolling     Loop size becomes too big
    Too little unrolling   Stalls still exist
    The best unrolling     Only enough to eliminate stalls

How can we know the best unrolling if the number of loop iterations is unknown before run-time?

Exercise 2.7 (p. 144 in ED4; Exercise 4.4 in ED3)
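One common way to handle an iteration count that is only known at run time is to pair the unrolled loop with a short cleanup loop for the leftover elements. The sketch below shows this for an unroll factor of four; it is not necessarily the answer the textbook exercise expects, and the function name and the factor of four are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by four, with a cleanup loop so that n may be any value. */
void add_const_unrolled4(double *a, size_t n, double c)
{
    assert(a != NULL || n == 0);
    size_t i = 0;
    size_t main_end = n - (n % 4);   /* largest multiple of 4 not above n */

    /* Main loop: four elements per pass, one branch per four elements. */
    for (; i < main_end; i += 4) {
        a[i]     = a[i]     + c;
        a[i + 1] = a[i + 1] + c;
        a[i + 2] = a[i + 2] + c;
        a[i + 3] = a[i + 3] + c;
    }

    /* Cleanup loop: the remaining n % 4 elements, one per pass. */
    for (; i < n; i++) {
        a[i] = a[i] + c;
    }
}
```

The cleanup loop runs at most three times, so its stalls and branches cost little compared with the main loop when n is large.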
19
Code Optimization Examples by Visual Studio 2010