Instruction Rescheduling and Loop-Unroll


CS 286: Loop Unrolling
Instruction Rescheduling and Loop-Unroll
Department of Computer Science
Southern Illinois University Edwardsville
Fall, 2018
Dr. Hiroshi Fujinoki
E-mail: hfujino@siue.edu

Loop-Unrolling

Loop-unrolling is a technique to increase ILP for loop structures.

Example: a for-loop written in a high-level programming language:

    for (i = 0; i < 1000; i++) {
      a[i] = a[i] + 10;
    }

There is an array of integers, a[], which has 1,000 elements.
The loop adds a constant, 10, to every element in the array.
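Before looking at the assembly-level version, the idea can be shown at the source level. This is a minimal sketch (the function names are ours, not from the slides): the unrolled version does two element updates per loop iteration, halving the number of loop-overhead checks.

```c
#include <stddef.h>

#define N 1000

/* Original loop: one add per iteration, N index/branch checks. */
void add_const(int a[N]) {
    for (size_t i = 0; i < N; i++)
        a[i] = a[i] + 10;
}

/* Unrolled by 2: two independent adds per iteration, N/2 checks.
   This simple form works because N is even and known at compile time. */
void add_const_unrolled(int a[N]) {
    for (size_t i = 0; i < N; i += 2) {
        a[i]     = a[i]     + 10;
        a[i + 1] = a[i + 1] + 10;
    }
}
```

Both functions compute the same result; the unrolled one simply exposes two independent updates inside each iteration.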

Assumptions

[Figure: main-memory layout. The array elements a[0] ... a[999] occupy
consecutive 8-byte slots, with a[0] toward the high end of the address
range (FFFFFF) and addresses decreasing toward 000000. R1 points to the
current element, R2 holds the address at which the loop stops, and F2
holds the constant 10.]

After the high-level programming language statements are compiled

    for (i = 0; i < 1000; i++) {
      a[i] = a[i] + 10;
    }

compiles to:

    LOOP: LW   F0, 0(R1)      // F0 <- Mem[R1]
          ADDI F4, F0, F2     // F4 <- F0 + F2
          SW   F4, 0(R1)      // Mem[R1] <- F4
          ADDI R1, R1, -8     // R1 <- R1 - 8
          BNE  R1, R2, LOOP   // if R1 != R2, go to LOOP

    BNE = "Branch if Not Equal"


Categorizing instruction types

    LOOP: LW   F0, 0(R1)      // load
          ADDI F4, F0, F2     // ALU operation
          SW   F4, 0(R1)      // store
          ADDI R1, R1, -8     // ALU operation
          BNE  R1, R2, LOOP   // conditional branch

Identifying all pipeline hazards

    LOOP: LW   F0, 0(R1)
          ADDI F4, F0, F2     // RAW on F0 (produced by LW)
          SW   F4, 0(R1)      // RAW on F4 (produced by ADDI)
          ADDI R1, R1, -8     // WAR on R1 (earlier instructions read R1)
          BNE  R1, R2, LOOP   // RAW on R1; control hazard

Determining stalled and flushed cycles

How many cycles are stalled or flushed due to the RAW and control hazards?

    Instruction pair                    Hazard           # of stalls
    LW (load) -> ADDI (ALU op)          RAW              1
    ADDI (ALU op) -> SW (store)         RAW              2
    ADDI (ALU op) -> BNE (branch)       RAW              1
    BNE (branch)                        Control hazard   1 (flushed cycle)

Instruction issuing schedule with stalls and a flush

    Cycle   Issued
      1     LW   F0, 0(R1)
      2     (stall)
      3     ADDI F4, F0, F2
      4     (stall)
      5     (stall)
      6     SW   F4, 0(R1)
      7     ADDI R1, R1, -8
      8     (stall)
      9     BNE  R1, R2, LOOP
     10     (flush)

One iteration takes 10 cycles.
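The 10-cycle figure can be checked arithmetically: 5 instructions, plus the 4 stalled cycles from the hazard table, plus 1 flushed cycle after the branch. A tiny sketch (the function name is ours):

```c
/* Cycles for one loop iteration on this in-order pipeline:
   every instruction issues in one cycle, and each stalled or
   flushed cycle adds one more. */
int cycles_per_iteration(int instructions, int stalls, int flushed) {
    return instructions + stalls + flushed;
}
```

For the unscheduled loop: cycles_per_iteration(5, 4, 1) gives 10, and 1,000 iterations give 10,000 cycles in total.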

Technique #4: Instruction Re-Scheduling

Starting point: the 10-cycle schedule above, with four stalled cycles and
one flushed cycle per iteration.

Technique #4: Instruction Re-Scheduling (applied)

    Cycle   Issued
      1     LW   F0, 0(R1)
      2     ADDI R1, R1, -8     // moved up to fill the load stall
      3     ADDI F4, F0, F2
      4     (stall)
      5     BNE  R1, R2, LOOP   // delayed branch applied
      6     SW   F4, 8(R1)      // in the branch-delay slot; make sure to
                                // add 8, since R1 was already decremented

The loop completes here: 6 cycles per iteration instead of 10.

Technique #5: Loop-Unrolling

    LOOP: LW   F0, 0(R1)
          (stall)
          ADDI F4, F0, F2
          (stall)
          (stall)
          SW   F4, 0(R1)
          ADDI R1, R1, -8
          (stall)
          BNE  R1, R2, LOOP
          (flush)

We repeat this body 1,000 times.

Technique #5: Loop-Unrolling (merging two iterations)

Copy the loop body so that two consecutive iterations appear back to back:

    LOOP1: LW   F0, 0(R1)
           (stall)
           ADDI F4, F0, F2
           (stall) (stall)
           SW   F4, 0(R1)
    LOOP2: LW   F0, 0(R1)
           (stall)
           ADDI F4, F0, F2
           (stall) (stall)
           SW   F4, 0(R1)
           ADDI R1, R1, -8
           (stall)
           BNE  R1, R2, LOOP
           (flush)

Merge them together: the loop overhead (ADDI on R1 and BNE) now runs once
per two elements, so the merged body repeats 500 times.

Technique #5: Loop-Unrolling and WAW (Pseudo, or "Name") Dependency

The two merged copies of the body both write F0 and F4, creating WAW name
dependencies. Renaming the second copy's registers (F0 -> F6, F4 -> F8)
removes them:

    LOOP: LW   F0, 0(R1)
          LW   F6, -8(R1)      // next element, 8 bytes below
          ADD  F4, F0, F2
          ADD  F8, F6, F2
          SW   F4, 0(R1)
          SW   F8, -8(R1)
          ADD  R1, R1, -16     // advance past two elements
          BNE  R1, R2, LOOP
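At the source level, register renaming corresponds to giving each copy of the body its own temporaries. A sketch under that reading (variable and function names are ours):

```c
#include <stddef.h>

/* One pass of the unrolled-by-2 body with renamed temporaries:
   t0/r0 play the role of F0/F4, and t1/r1 the role of F6/F8, so the
   two load/add/store chains are fully independent of each other. */
void unrolled_body(int *a, size_t i) {
    int t0 = a[i];       /* LW  F0 */
    int t1 = a[i + 1];   /* LW  F6 */
    int r0 = t0 + 10;    /* ADD F4, F0, F2 */
    int r1 = t1 + 10;    /* ADD F8, F6, F2 (renamed from F4) */
    a[i]     = r0;       /* SW  F4 */
    a[i + 1] = r1;       /* SW  F8 */
}
```

Because r0 and r1 are distinct names, the hardware (or scheduler) is free to overlap the two chains; reusing one temporary for both would reintroduce the name dependency.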

Technique #5: Loop-Unrolling (cycle count)

    Cycle   Issued
      1     LW   F0, 0(R1)
      2     LW   F6, -8(R1)
      3     ADD  F4, F0, F2
      4     ADD  F8, F6, F2
      5     (stall)
      6     SW   F4, 0(R1)
      7     SW   F8, -8(R1)
      8     ADD  R1, R1, -16
      9     (stall)
     10     BNE  R1, R2, LOOP
     11     (flush)

    Previous: 10 cycles x 1,000 iterations
    Now:      11 cycles x 500 iterations
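The comparison works out as follows; a minimal sketch of the arithmetic (the function name is ours):

```c
/* Total cycles for a loop: cycles per iteration times iteration count.
   Original loop: 10 cycles x 1,000 iterations.
   Unrolled by 2: 11 cycles x 500 iterations. */
long total_cycles(long cycles_per_iter, long iterations) {
    return cycles_per_iter * iterations;
}
```

Unrolling by two cuts the total from 10,000 cycles to 5,500 cycles, even though each merged iteration is one cycle longer.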

Further Improvement

Is further improvement possible?

 - Combine instruction re-scheduling (Technique #4) with loop-unrolling
 - Unroll the loop more times
   - Further eliminate stalls, and especially control hazards

But how many times should the loop be unrolled?

How many times should the loop be unrolled?

 - Too much unrolling: the loop body becomes too big
 - Too little unrolling: stalls still exist
 - The best unrolling: just enough to eliminate the stalls

How can we find the best unroll factor if the number of iterations is
unknown before run-time?
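One standard answer when the trip count is only known at run time is to unroll by a fixed factor and add a remainder loop for the leftover elements. A hedged sketch (names and the factor of 4 are our choices, not from the slides):

```c
#include <stddef.h>

/* Unroll by 4 for a run-time length n: the main loop handles groups
   of four elements, and the remainder loop handles the n % 4 that
   are left over. */
void add_const_runtime(int *a, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body */
        a[i]     += 10;
        a[i + 1] += 10;
        a[i + 2] += 10;
        a[i + 3] += 10;
    }
    for (; i < n; i++)             /* remainder loop */
        a[i] += 10;
}
```

This keeps the unrolled body's benefits for most iterations while staying correct for any n, at the cost of a slightly larger loop.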

Code Optimization Examples by Visual Studio 2010

Assumptions (Part 2)

The numbers of stalled cycles for this CPU are defined as follows:

 - Branch slot (for a conditional branch) = 1 cycle
 - RAW dependency for integer ALU instructions = 1 cycle

    Instruction producing result   Instruction using result   Stalled cycles
    FP ALU operation               Another FP ALU operation   3
    FP ALU operation               Store FP data              2
    Load FP data                   FP ALU operation           1
    Load FP data                   Store FP data              0

(This table appears on page 304 of the textbook.)
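The table can be encoded directly as a lookup; a small sketch (the enum and function names are ours):

```c
/* Stall counts from the slide's FP latency table: the producer is
   either an FP ALU operation or an FP load, and the consumer either
   uses the value in an FP ALU operation or stores it. */
typedef enum { FP_ALU_OP, FP_LOAD } producer_t;
typedef enum { USE_FP_ALU, USE_FP_STORE } consumer_t;

int stall_cycles(producer_t p, consumer_t c) {
    if (p == FP_ALU_OP)
        return (c == USE_FP_ALU) ? 3 : 2;   /* ALU->ALU: 3, ALU->store: 2 */
    else
        return (c == USE_FP_ALU) ? 1 : 0;   /* load->ALU: 1, load->store: 0 */
}
```

This is the table used throughout the slides to count stalls between dependent instruction pairs.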