Instruction Rescheduling and Loop-Unroll
Department of Computer Science, Southern Illinois University Edwardsville
Fall 2015, Dr. Hiroshi Fujinoki


Loop_Unroll/000 Instruction Rescheduling and Loop-Unroll (CS 312 Computer Architecture & Organization)

Loop_Unroll/001 Example
A for-loop structure written in a high-level programming language:

    for (i = 0; i < 1000; i++) {
        a[i] = a[i] + 10.19;
    }

There is an array of floating-point numbers, a[i], which has 1,000 elements. The loop adds a constant, 10.19, to every element in the FP array. Loop-unrolling is a technique to increase ILP for loop structures.

Loop_Unroll/002 Assumptions
[Figure: main-memory layout. The array a[0] … a[999] occupies consecutive 8-byte double words between the low and high ends of memory (addresses up to FFFFFF). R1 holds the address of the element currently being processed and is decremented by 8 each iteration, R2 holds the address at which the loop stops, and F2 holds the constant 10.19.]

Data_Dependency/003 After the high-level programming language statements are compiled

    for (i = 0; i < 1000; i++) {
        a[i] = a[i] + 10.19;
    }

    LOOP: L.D    F0, 0(R1)    // F0 = Mem[R1]
          ADD.D  F4, F0, F2   // F4 = F0 + F2
          S.D    F4, 0(R1)    // Mem[R1] = F4
          DADDUI R1, R1, -8   // R1 = R1 - 8
          BNE    R1, R2, LOOP // if R1 != R2, go to LOOP

We focus on this loop structure. BNE = "Branch if NOT EQUAL".

Data_Dependency/004 After the high-level programming language statements are compiled (simplified mnemonics)

    for (i = 0; i < 1000; i++) {
        a[i] = a[i] + 10.19;
    }

    LOOP: LOAD  F0, 0(R1)    // F0 <- Mem[R1]
          ADD   F4, F0, F2   // F4 <- F0 + F2
          STORE F4, 0(R1)    // Mem[R1] <- F4
          ADD   R1, R1, -8   // R1 <- R1 - 8
          BNE   R1, R2, LOOP // if R1 != R2, go to LOOP

Loop_Unroll/005 Assumptions (Part 2)
Branch slot (for a conditional branch) = 1 cycle
RAW dependency for integer ALU instructions = 1 cycle
The numbers of stalled cycles for this CPU are defined as follows (this table appears on page 304 of the textbook):

    Instruction producing result   Instruction using result   Stalled cycles
    FP ALU operation               Another FP ALU operation   3
    FP ALU operation               Store FP data              2
    Load FP data                   FP ALU operation           1
    Load FP data                   Store FP data              0

(The producing instruction WRITEs the result that the using instruction READs.)

Data_Dependency/006 Categorizing instruction types

    LOOP: LOAD  F0, 0(R1)    // floating-point instruction
          ADD   F4, F0, F2   // floating-point instruction
          STORE F4, 0(R1)    // floating-point instruction
          ADD   R1, R1, -8   // integer instruction
          BNE   R1, R2, LOOP // conditional branch

Data_Dependency/007 Identifying all pipeline hazards

    LOOP: LOAD  F0, 0(R1)
          ADD   F4, F0, F2   // RAW on F0 (produced by LOAD)
          STORE F4, 0(R1)    // RAW on F4 (produced by ADD)
          ADD   R1, R1, -8   // WAR on R1 (STORE reads R1 before it is overwritten)
          BNE   R1, R2, LOOP // RAW on R1 (produced by ADD); control hazard

Data_Dependency/008 Determining stalled and flushed cycles
How many cycles are stalled or flushed due to the RAW and control hazards?

    LOOP: L.D    F0, 0(R1)    // FP load
          ADD.D  F4, F0, F2   // FP ALU: RAW on F0 -> 1 stall
          S.D    F4, 0(R1)    // FP store: RAW on F4 -> 2 stalls
          DADDUI R1, R1, -8   // integer ALU -> 0 stalls
          BNE    R1, R2, LOOP // branch: RAW on R1 -> 1 stall
                              // control hazard -> 1 cycle flushed

Data_Dependency/009 Instruction issuing schedule with stalls and flush

    Cycle  Issued
      1    L.D    F0, 0(R1)    // F0 <- Mem[R1]
      2    (stall)
      3    ADD.D  F4, F0, F2   // F4 <- F0 + F2
      4    (stall)
      5    (stall)
      6    S.D    F4, 0(R1)    // Mem[R1] <- F4
      7    DADDUI R1, R1, -8   // R1 <- R1 - 8
      8    (stall)
      9    BNE    R1, R2, LOOP
     10    (flushed)

Data_Dependency/0010 Technique #4: Instruction Re-Scheduling
Starting point: the issue schedule above, with its stalled cycles and the flushed cycle after the branch. The compiler now reorders the instructions to fill those empty slots.

Data_Dependency/011 Technique #4: Instruction Re-Scheduling

    LOOP: L.D    F0, 0(R1)    // F0 <- Mem[R1]
          DADDUI R1, R1, -8   // moved up to fill the load delay
          ADD.D  F4, F0, F2   // F4 <- F0 + F2
          (stall)
          BNE    R1, R2, LOOP // delayed-branch applied
          S.D    F4, 8(R1)    // branch slot; make sure to add 8!
                              // (R1 was already decremented by 8)

The loop completes here.

Data_Dependency/012 Technique #5: Loop-Unrolling
We repeat this 1,000 times:

    LOOP: LOAD  F0, 0(R1)
          (stall)
          ADD   F4, F0, F2
          (stall)
          (stall)
          STORE F4, 0(R1)
          ADD   R1, R1, -8
          (stall)
          BNE   R1, R2, LOOP
          (flushed)

Data_Dependency/013 Technique #5: Loop-Unrolling
Place two copies of the loop body back to back:

    LOOP1: LOAD  F0, 0(R1)
           (stall)
           ADD   F4, F0, F2
           (stall)
           (stall)
           STORE F4, 0(R1)
           ADD   R1, R1, -8
           (stall)
           BNE   R1, R2, LOOP
           (flushed)

    LOOP2: LOAD  F0, 0(R1)
           (stall)
           ADD   F4, F0, F2
           (stall)
           (stall)
           STORE F4, 0(R1)
           ADD   R1, R1, -8
           (stall)
           BNE   R1, R2, LOOP
           (flushed)

Merge them together.

Data_Dependency/014 Technique #5: Loop-Unrolling
If the second copy reused F0 and F4, the merge would create a WAW dependency, a pseudo dependency (also called a name dependency). Renaming the second copy's registers to F6 and F8 removes it, and the two address updates merge into a single -16:

    LOOP1: LOAD  F0, 0(R1)
           (stall)
           ADD   F4, F0, F2
           STORE F4, 0(R1)
           LOAD  F6, 8(R1)    // second copy, renamed from F0
           ADD   F8, F6, F2   // renamed from F4
           ADD   R1, R1, -16  // one address update covers both elements
           STORE F8, 8(R1)
           (stall)
           BNE   R1, R2, LOOP
           (flushed)

Data_Dependency/015 Technique #5: Loop-Unrolling

    LOOP1: LOAD  F0, 0(R1)
           (stall)
           ADD   F4, F0, F2
           STORE F4, 0(R1)
           LOAD  F6, 8(R1)
           ADD   F8, F6, F2
           ADD   R1, R1, -16
           STORE F8, 8(R1)
           (stall)
           BNE   R1, R2, LOOP
           (flushed)

Previous: 10 cycles x 1,000 iterations. Now: 11 cycles x 500 iterations.

Data_Dependency/016 Further Improvement
Is further improvement possible? In particular, can the control hazards be eliminated?
- Combine instruction re-scheduling (Technique #4) with loop-unrolling
- Unroll the loop further to eliminate the remaining stalls
But how many times should the loop be unrolled?

Data_Dependency/017 How many times should the loop be unrolled?
Too much unrolling: the loop body becomes too big
Too little unrolling: stalls still remain
The best unrolling: just enough to eliminate the stalls
How can we know the best unrolling factor if the number of iterations is unknown before run time? See Exercise 2.7 (p. 144 in the 4th edition; Exercise 4.4 in the 3rd edition).

Data_Dependency/018 Code Optimization Examples by Visual Studio 2010