Chapter 2 Instruction-Level Parallelism and Its Exploitation


See the subset of the MIPS64 instructions on the back cover of the textbook.

Instruction-Level Parallelism (ILP)
Instructions are evaluated in parallel; pipelining is the basic mechanism.
Two approaches to exploiting ILP:
- Dynamic & hardware-dependent (Chapter 2): Intel Pentium series, Athlon, MIPS R10000/12000, Sun UltraSPARC III, PowerPC, ...
- Static & software-dependent (Appendix A, Appendix G): IA-64, Intel Itanium, embedded processors

Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
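A hypothetical worked example of the equation (the stall rates are invented for illustration, not taken from the textbook): if the ideal CPI is 1.0 and a program averages 0.1 structural, 0.4 data hazard, and 0.2 control stall cycles per instruction, then

    Pipeline CPI = 1.0 + 0.1 + 0.4 + 0.2 = 1.7

so the pipeline achieves 1.0 / 1.7, or roughly 59%, of its ideal throughput. Each technique below attacks one of the three stall terms.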

Techniques to Decrease Pipeline CPI (p. 67)
- Forwarding and bypassing
- Delayed branches and simple branch scheduling
- Basic dynamic scheduling (scoreboarding)
- Dynamic scheduling with renaming
- Branch prediction
- Issuing multiple instructions per cycle
- Hardware speculation
- Dynamic memory disambiguation

Techniques to Decrease Pipeline CPI (p. 67), continued:
- Loop unrolling
- Basic compiler pipeline scheduling
- Compiler dependence analysis, software pipelining, trace scheduling
- Hardware support for compiler speculation

Data Dependences
If two instructions are parallel, they can execute simultaneously. If two instructions are dependent, they must be executed in order. How do we determine whether one instruction is dependent on another?

Data Dependences
There are three types of dependences: data dependences (true data dependences), name dependences, and control dependences.
An instruction j is data dependent on instruction i if either:
- i produces a result that may be used by j, or
- j is data dependent on instruction k, and k is data dependent on i (a chain of dependences; see the sketch below).
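A minimal C sketch of the chain case (the function and variable names are hypothetical, for illustration only):

double chain(double b, double c) {
    double a = b + c;    /* instruction i: produces a */
    double d = a * 2.0;  /* instruction j: reads a, so j is data dependent on i */
    double e = d - 1.0;  /* instruction k: reads d, so k depends on j and,
                            through the chain, on i */
    return e;            /* i, j, and k must execute in order */
}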

Loop: L.D    F0, 0(R1)    ;F0 = array element
      ADD.D  F4, F0, F2   ;add scalar in F2
      S.D    F4, 0(R1)    ;store the result
      DADDIU R1, R1, #-8  ;decrement pointer by 8 bytes
      BNE    R1, R2, Loop ;branch if R1 != R2
The L.D, ADD.D, and S.D form a chain of floating-point data dependences; the DADDIU and BNE form an integer data dependence.

Data Dependences
The order must be preserved for correct execution. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. The data dependence between DADDIU and BNE matters because the MIPS pipeline performs the branch test in the ID stage (the 2nd stage), so the DADDIU result is needed very early.

Pipelined Datapath (figure: the MIPS pipelined datapath; slide image not included in the transcript)

Data Dependences
A dependence indicates the possibility of a hazard, determines the order in which results must be calculated, and sets an upper bound on how much parallelism can possibly be exploited.

How to Overcome a Dependence
- Maintain the dependence but avoid the hazard: code scheduling (by the compiler or by the hardware).
- Eliminate the dependence by transforming the code (see the sketch below).
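A hedged C sketch of the second strategy (the function names are hypothetical; splitting the accumulator changes floating-point rounding, so this particular transformation is valid only when reassociation is acceptable):

/* Serial version: every addition depends on the previous one. */
double sum_serial(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];            /* each iteration depends on the last value of s */
    return s;
}

/* Transformed version: two independent accumulators break the chain,
   so additions into s0 and s1 can be overlapped. */
double sum_split(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (n & 1)
        s0 += a[n - 1];       /* handle a leftover element when n is odd */
    return s0 + s1;
}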

Name Dependence
Occurs when two instructions use the same register or memory location (the "name"), but there is no flow of data between the instructions associated with that name. When i precedes j in program order:
- Antidependence: instruction j writes a register or memory location that instruction i reads.
- Output dependence: instructions i and j write the same register or memory location.
In both cases, no value is transmitted between the instructions.

Register Renaming
Instructions involved in a name dependence can execute simultaneously or be reordered if the name (register number or memory location) used in the instructions is changed so that the instructions do not conflict. This is easiest for register operands and can be done statically by the compiler or dynamically by the hardware, as sketched below.
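A minimal C sketch of a name dependence and its removal by renaming (the names are hypothetical; hardware renaming does the same thing with physical registers):

/* Before renaming: t is reused for two unrelated values. */
void before(double *out, double x, double y) {
    double t = x + y;   /* i: writes t */
    out[0] = t;         /* reads t: a true dependence on i */
    t = x - y;          /* j: writes t again: an output dependence with i and
                           an antidependence with the read above */
    out[1] = t;
}

/* After renaming: t2 replaces the second use of the name t, so the two
   computations no longer conflict and may be reordered or overlapped. */
void after(double *out, double x, double y) {
    double t1 = x + y;
    out[0] = t1;
    double t2 = x - y;  /* renamed: no name dependence remains */
    out[1] = t2;
}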

Hazards
A hazard is created whenever there is a dependence between instructions and they are close enough that the overlap during execution caused by pipelining, or by other reordering of instructions, would change the order of access to the operand involved in the dependence.

Three Data Hazards
The goal of the software and hardware techniques in this course is to maximize ILP while preserving program order only where it affects the outcome of the program. When instruction i occurs before instruction j in program order:
- RAW (read after write): j tries to read a source before i writes it.
- WAW (write after write): j tries to write an operand before it is written by i.
- WAR (write after read): j tries to write a destination before it is read by i.
Each case is illustrated in the sketch below.
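One C-level illustration of each hazard (the variable names are hypothetical; in each pair, letting j slip ahead of i would change a value the program observes):

int hazards(void) {
    int a = 0, b = 1, c = 2;

    a = b + c;   /* i writes a */
    b = a * 2;   /* j reads a: RAW, j must wait for i's result */

    c = b + 1;   /* i writes c */
    c = a - 1;   /* j writes c: WAW, j must complete after i so c keeps j's value */

    a = c + 5;   /* i reads c */
    c = 7;       /* j writes c: WAR, j must not overwrite c before i reads it */

    return a + b + c;
}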

Control Dependences
- Caused by branch instructions.
- An instruction that is control dependent on a branch cannot be moved before the branch (its execution would no longer be controlled by the branch).
- An instruction that is not control dependent on a branch cannot be moved after the branch (its execution would then be controlled by the branch).

Control dependence is not the critical property that must be preserved. We can violate control dependences if we can do so without affecting the correctness of the program (e.g., branch prediction).

Basic Compiler Techniques for Exposing ILP

Goal: keep the pipeline full. To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

Basic Pipeline Scheduling and Loop Unrolling

Latencies
Latency: the number of clock cycles needed between a producer and a consumer to avoid a stall.

Instruction producing result    Instruction using result    Latency in cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1

Other latencies: branch, 1; integer ALU op to branch, 1; integer load, 1; integer ALU op to integer ALU op, 0.
Functional units are fully pipelined or replicated, so there are no structural hazards.
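A hedged sketch of how a compiler's list scheduler might encode this table (the enum, array layout, and defaults are assumptions, not from the textbook; entries the slide does not give default to 0 here):

#include <stdio.h>

enum op_class { FP_ALU, LOAD_FP, STORE_FP, INT_ALU, BRANCH, NUM_CLASSES };

/* latency[p][c]: cycles that must separate a producer of class p from a
   consumer of class c to avoid a stall (values taken from the slide). */
static const int latency[NUM_CLASSES][NUM_CLASSES] = {
    [FP_ALU]  = { [FP_ALU] = 3, [STORE_FP] = 2 },
    [LOAD_FP] = { [FP_ALU] = 1 },
    [INT_ALU] = { [BRANCH] = 1, [INT_ALU] = 0 },
};

int main(void) {
    printf("FP ALU -> FP ALU: %d cycles\n", latency[FP_ALU][FP_ALU]);
    printf("load double -> FP ALU: %d cycle\n", latency[LOAD_FP][FP_ALU]);
    return 0;
}

A scheduler would consult such a table when deciding how far apart to place a dependent instruction pair.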

Example
C source:
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
MIPS assembly code:
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop

Without Any Scheduling (clock cycle issued on the right)
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      DADDIU R1, R1, #-8   7
      stall                8
      BNE    R1, R2, Loop  9
      stall                10
This takes 10 clock cycles per array element.

With Scheduling (clock cycle issued on the right)
Loop: L.D    F0, 0(R1)     1
      DADDIU R1, R1, #-8   2
      ADD.D  F4, F0, F2    3
      stall                4
      BNE    R1, R2, Loop  5   ;delayed branch
      S.D    F4, 8(R1)     6   ;fills the branch delay slot
This takes 6 clock cycles per element, versus 10 unscheduled. The swap is not trivial: the compiler must prove that the S.D can legally move past the DADDIU and BNE, and must adjust its offset from 0(R1) to 8(R1).

Of the 6 cycles, the actual work of operating on the array element takes 3 (the load, add, and store). The remaining 3 cycles are loop overhead (the DADDIU and BNE) and a stall. To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions: loop unrolling.

Reducing Loop Overhead: Loop Unrolling
- A simple scheme for increasing the number of instructions relative to the branch and overhead instructions.
- Simply replicates the loop body multiple times, adjusting the loop termination code (see the C sketch below).
- Improves scheduling: it allows instructions from different iterations to be scheduled together.
- Uses different registers for each iteration.
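A minimal C sketch of 4-way unrolling applied to the running example (the function name is hypothetical; the sketch assumes the trip count, 1000, is a multiple of the unroll factor, so no cleanup loop is needed):

/* Original: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s; */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 1000; i > 0; i -= 4) {
        /* four copies of the body: the decrement and branch now run
           once per four elements instead of once per element */
        x[i]     += s;
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
    }
}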

Unrolled Loop (No Scheduling), clock cycle issued on the right:
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      L.D    F6, -8(R1)    7
      stall                8
      ADD.D  F8, F6, F2    9
      stall                10
      stall                11
      S.D    F8, -8(R1)    12
      L.D    F10, -16(R1)  13
      stall                14
      ADD.D  F12, F10, F2  15
      stall                16
      stall                17
      S.D    F12, -16(R1)  18
      L.D    F14, -24(R1)  19
      stall                20
      ADD.D  F16, F14, F2  21
      stall                22
      stall                23
      S.D    F16, -24(R1)  24
      DADDIU R1, R1, #-32  25
      stall                26
      BNE    R1, R2, Loop  27
      stall                28
Three DADDIU and three BNE instructions have been dropped (one pair per replicated iteration): 28 cycles for four elements, or 7 cycles per element.

Loop Unrolling
Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer. Unrolling improves the performance of the loop by eliminating overhead instructions.

Loop Unrolling (Scheduling), clock cycle issued on the right:
Loop: L.D    F0, 0(R1)     1
      L.D    F6, -8(R1)    2
      L.D    F10, -16(R1)  3
      L.D    F14, -24(R1)  4
      ADD.D  F4, F0, F2    5
      ADD.D  F8, F6, F2    6
      ADD.D  F12, F10, F2  7
      ADD.D  F16, F14, F2  8
      S.D    F4, 0(R1)     9
      S.D    F8, -8(R1)    10
      DADDIU R1, R1, #-32  11
      S.D    F12, 16(R1)   12
      BNE    R1, R2, Loop  13
      S.D    F16, 8(R1)    14
14 cycles for four elements, or 3.5 cycles per element. The positive offsets 16(R1) and 8(R1) account for the DADDIU that now precedes those stores.

Summary
The key to most hardware and software ILP techniques is to know when and how the ordering among instructions may be changed. This process must be performed in a methodical fashion, either by a compiler or by hardware.

To obtain the final unrolled code, the compiler had to make the following decisions:
1. Determine that it was legal to move the S.D after the DADDIU and BNE, and find the amount by which to adjust the S.D offset.
2. Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code.
3. Use different registers to avoid unnecessary constraints.
4. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.

5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
6. Schedule the code, preserving any dependences needed to yield the same result as the original code.

Loop Unrolling I (Unoptimized, No Delayed Branch)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      L.D    F0, 0(R1)
      (remaining replicated iterations omitted on the slide)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
The intermediate DADDIU instructions can be removed by symbolically computing the intermediate value of R1 and folding it into the load/store offsets, as shown next.

Loop Unrolling I (Unoptimized, No Delayed Branch)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, Loop
The reuse of F0 and F4 across the replicated iterations creates name dependences; the L.D to ADD.D to S.D chain within each iteration is a true dependence. Remove the name dependences using register renaming.

Loop Unrolling II (Register Renaming)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, Loop
Only the true dependences within each iteration remain.

With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
Problem: a potential shortfall in registers, known as register pressure. It arises because scheduling code to increase ILP causes the number of live values to increase, and it may not be possible to allocate all the live values to registers. The combination of unrolling and aggressive scheduling can cause this problem: in the renamed loop above, nine FP registers can be live at once (F2 plus F0, F4, F6, F8, F10, F12, F14, and F16), versus three (F0, F2, F4) in the original loop body.

Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.