Exploiting ILP with Software Approaches

Outline
- Basic Compiler Techniques for Exposing ILP
- Static Branch Prediction
- Static Multiple Issue: The VLIW Approach
- Hardware Support for Exposing More Parallelism at Compile Time
- HW versus SW Solutions

4.1 Basic Compiler Techniques for Exposing ILP

Basic Pipeline Scheduling and Loop Unrolling
To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, a dependent instruction must be separated from its source instruction. A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.

Basic Pipeline Scheduling and Loop Unrolling (cont'd)
- Idea: find sequences of unrelated instructions (no hazards) that can be overlapped in the pipeline to exploit ILP.
- To avoid a stall, a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the latency of the source instruction.
- The latencies of the FP operations used are assumed known.

Consider adding a scalar s to a vector:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

    Loop: L.D    F0,0(R1)     ;F0 = vector element
          ADD.D  F4,F0,F2     ;add scalar in F2
          S.D    F4,0(R1)     ;store result
          DADDUI R1,R1,#-8    ;decrement pointer 8 bytes (DW)
          BNE    R1,R2,Loop   ;branch if R1 != R2

Assume R2 is precomputed, so that 8(R2) is the last element to operate on.

Unscheduled Loop (clock cycle issued)

    Loop: L.D    F0,0(R1)      1
          stall                2
          ADD.D  F4,F0,F2      3
          stall                4
          stall                5
          S.D    F4,0(R1)      6
          DADDUI R1,R1,#-8     7
          stall                8
          BNE    R1,R2,Loop    9
          stall               10

One iteration takes 10 clock cycles.

Scheduled Loop (clock cycle issued)

    Loop: L.D    F0,0(R1)      1
          DADDUI R1,R1,#-8     2
          ADD.D  F4,F0,F2      3
          stall                4
          BNE    R1,R2,Loop    5
          S.D    F4,8(R1)      6   ;in the branch delay slot

- The two-cycle latency between ADD.D and S.D is hidden by the stall and the branch.
- Overhead: at minimum 6 cycles are necessary to execute this sequence, and only 3 of them (L.D, ADD.D, S.D) do useful work.
- Moving S.D below DADDUI is not trivial, because R1 has been modified.
- Common view: S.D depends on the DADDUI before it and therefore cannot be moved.
- Smarter view: DADDUI adds a known immediate, so a solution exists; the compiler compensates by changing the store offset from 0(R1) to 8(R1).

Basic Pipeline Scheduling and Unrolling
- To eliminate the three overhead clock cycles, we need more operations in the loop relative to the number of overhead instructions.
- A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
- Unrolling simply replicates the loop body multiple times, adjusting the loop-termination code.
- Loop unrolling also improves scheduling: because it eliminates the branch, it allows instructions from different iterations to be scheduled together.

Loop Unrolling – Make the Body Fat
- Three of the six instructions in the scheduled loop are overhead.
- Goal: get more operations within the loop relative to the number of overhead instructions.
- Loop unrolling: replicate the loop body multiple times and adjust the loop-termination code.
- Basic idea: take n loop bodies and concatenate them into one basic block, then adjust the new termination code.
- Say n = 4: then the R1 update in the example becomes 4x what it was before. Savings: 4 BNEs and 4 DADDUIs become just one of each, a 75% reduction in those overhead instructions.
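The transformation can be sketched in Python (a hypothetical illustration of what the compiler does to the C loop above; the function names are ours, not from the slides):

```python
def add_scalar(x, s):
    """Rolled loop: one add, plus one loop test/increment, per element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled4(x, s):
    """Unrolled 4x: four adds per copy of the loop overhead.

    A real compiler must also emit cleanup code for when the trip count
    is not a multiple of 4; it is included here for completeness.
    """
    i, n = 0, len(x)
    while i + 4 <= n:            # one test/increment per 4 elements
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:                 # cleanup loop for leftover elements
        x[i] = x[i] + s
        i += 1
    return x
```

Both versions compute the same result; the unrolled one simply pays the branch and pointer update once per four elements instead of once per element.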

Summary of Loop Unrolling and Scheduling
We will look at a variety of hardware and software techniques that allow us to take advantage of instruction-level parallelism to fully utilize the potential of the functional units in a processor. The key to most of these techniques is knowing when and how the ordering among instructions may be changed. This process must be performed in a methodical fashion, either by a compiler or by hardware.

To obtain the final unrolled code, we had to make the following decisions and transformations:
- Determine that it is legal to move the instructions and adjust the offsets.
- Determine that unrolling the loop would be useful by finding that the loop iterations are independent.
- Use different registers to avoid unnecessary constraints.
- Eliminate the extra tests and branches.
- Determine that the loads and stores in the unrolled loop can be interchanged.
- Schedule the code, preserving any dependences needed.
Key requirement: an understanding of how one instruction depends on another and how the instructions can be changed or reordered given the dependences.

Limitations of the Gains from Loop Unrolling
- Only the loop overhead is amortized with each unroll: unrolled 4 times, 2 of the 14 clock cycles are overhead, i.e., 0.5 overhead cycles per iteration; unrolled 8 times, 0.25 overhead cycles per iteration.
- Growth in code size: large code size is bad for embedded computers and may increase the cache miss rate.
- Potential shortfall in registers created by aggressive unrolling and scheduling (register pressure).
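The amortization arithmetic in the first bullet can be made explicit (a sketch; the 2 overhead cycles are the DADDUI and BNE of the example loop):

```python
def overhead_per_iteration(overhead_cycles, unroll_factor):
    """Loop overhead cycles amortized over the iterations merged
    into one unrolled loop body."""
    return overhead_cycles / unroll_factor

# 2 overhead cycles (DADDUI + BNE) shared by 4 unrolled iterations
assert overhead_per_iteration(2, 4) == 0.5
# ...and shared by 8 unrolled iterations
assert overhead_per_iteration(2, 8) == 0.25
```

Note the diminishing returns: doubling the unroll factor again only saves another 0.125 cycles per iteration, while the code size keeps doubling.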

4.2 Static Branch Prediction

Static Branch Prediction
- Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.
- Static prediction can also be used to assist dynamic predictors.

Static Branch Prediction: Using Compiler Technology
How can branches be predicted statically? To perform some optimizations, we need to predict a branch statically when we compile the program. There are several methods:
- Predict every branch as taken. This scheme has an average misprediction rate equal to the untaken branch frequency.
- Predict backward-going branches as taken and forward-going branches as not taken. For some programs and compilation systems, the frequency of taken forward branches may be significantly less than 50%, so this scheme does better than just predicting all branches taken.
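The backward-taken/forward-not-taken heuristic can be sketched as follows (a simplified model; a real ISA encodes the target as a PC-relative displacement, but the sign test is the same):

```python
def predict_taken_btfn(branch_pc, target_pc):
    """Backward branches (e.g. loop-closing branches) are predicted taken;
    forward branches (e.g. error-handling skips) are predicted not taken."""
    return target_pc < branch_pc

# A loop-closing branch at 0x40 jumping back to 0x10 -> predict taken.
# A forward skip from 0x40 to 0x60 -> predict not taken.
```

The heuristic works because loop-closing branches are backward and taken on almost every iteration, while forward branches often guard rarely executed paths.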

Static Branch Prediction: Using Compiler Technology (cont'd)
- Profile-based predictor: use profile information collected from earlier runs.
- The simplest version keeps a single prediction bit per branch; it easily extends to use more bits.
- A definite win for some regular applications.

Static Branch Prediction: Using Compiler Technology (cont'd)
Static prediction is useful for:
- scheduling instructions when the branch delays are exposed by the architecture (either delayed or canceling branches),
- assisting dynamic predictors, and
- determining which code paths are more frequent, a key step in code scheduling.

4.3 Static Multiple Issue: VLIW

Overview
- The compiler does most of the work of finding and scheduling instructions for parallel execution.
- Superscalar processors decide on the fly how many instructions to issue.
- A statically scheduled superscalar must check for any dependences between instructions in the issue packet, and with any instruction already in the pipeline.
- A statically scheduled superscalar requires significant compiler assistance to achieve good performance; in contrast, a dynamically scheduled superscalar requires less compiler assistance but significant hardware costs.

Overview (cont'd)
An alternative to the superscalar approach is to rely on compiler technology to:
- minimize potential hazard stalls, and
- actually format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
The compiler ensures either that dependences within the issue packet cannot be present, or that the packet indicates when a dependence may occur. This offers the potential advantage of simpler hardware while still achieving good performance through extensive compiler technology. This architectural approach is named VLIW (Very Long Instruction Word).

Basic VLIW
- A VLIW uses multiple, independent functional units.
- A VLIW packages multiple independent operations into one very long instruction.
- The burden of choosing and packaging independent operations falls on the compiler; the hardware a superscalar uses to make issue decisions is unneeded.
- This advantage increases as the maximum issue rate grows.
- Here we consider a VLIW processor whose instructions contain 5 operations: 1 integer (or branch), 2 FP, and 2 memory references. The mix depends on the available FUs and the frequency of each operation.
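A toy sketch of how such a 5-slot instruction word might be filled (the slot layout follows the slide; the greedy packing logic and names are ours, and a real compiler must also respect dependences between the operations):

```python
SLOTS = ("int", "fp", "fp", "mem", "mem")  # 1 integer/branch, 2 FP, 2 memory

def pack_word(ops):
    """Greedily place (unit, op) pairs into one VLIW word.

    Unfilled slots become NOPs (wasted encoding bits, see the VLIW
    problems below); ops that do not fit wait for the next word.
    """
    word, pending = [], list(ops)
    for slot in SLOTS:
        for op in pending:
            if op[0] == slot:
                word.append(op[1])
                pending.remove(op)
                break
        else:
            word.append("NOP")
    return word, pending
```

For example, packing a load, an FP add, and a pointer update fills three of the five slots and pads the rest with NOPs, illustrating why sparse packets inflate VLIW code size.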

Basic VLIW (cont'd)
- VLIW depends on there being enough parallelism to keep the FUs busy.
- This parallelism is uncovered by unrolling loops and then scheduling code within the single larger loop body.
- If unrolling generates straight-line code, local scheduling techniques, which operate on a single basic block, can be used.
- If finding and exploiting the parallelism requires scheduling code across branches, a more complex global scheduling algorithm must be used.

VLIW Problems – Technical
Increase in code size, caused by:
- ambitious loop unrolling, and
- the instruction encoding: whenever instructions are not full, the unused FU slots translate to wasted bits, and an instruction may need to be left completely empty if no operation can be scheduled.
Clever encoding or compression/decompression of the instructions can help.

VLIW Problems – Logistical
Synchronous vs. independent FUs:
- Early VLIW: all FUs had to be kept synchronized, so a stall in any FU pipeline could cause the entire processor to stall.
- Recent VLIW: FUs operate more independently. The compiler is used to avoid hazards at issue time, and hardware checks allow for unsynchronized execution once instructions are issued.

VLIW Problems – Logistical (cont'd)
Binary code compatibility:
- The code sequence makes use of both the instruction set definition and the detailed pipeline structure (FUs and their latencies).
- Migration between successive implementations therefore requires recompilation.
- Solutions: object-code translation or emulation, or tempering the strictness of the approach so that binary compatibility is still feasible.

Advantages of Superscalar over VLIW
- Old codes still run, like those tools you have that came only as binaries.
- Hardware detects whether an instruction pair is a legal dual-issue pair; if not, the instructions are run sequentially.
- Little impact on code density: no need to fill all of the "can't issue here" slots with NOPs.
- Compiler issues are very similar: instruction scheduling is still needed, but since the dynamic issue hardware is there, the compiler does not have to be too conservative.

4.4 Hardware Support for Exposing More Parallelism at Compile Time

Hardware Support for Exposing More Parallelism at Compile Time
- When the behavior of branches is not well known, compiler techniques alone may not be able to uncover much ILP. In such cases, the control dependences may severely limit the amount of parallelism that can be exploited.
- Potential dependences between memory-reference instructions can also prevent code movement.
- Several techniques can help overcome these limitations.

(cont'd)
- An extension of the instruction set to include conditional or predicated instructions. Such instructions can be used to eliminate branches, converting a control dependence into a data dependence and potentially improving performance.
- Hardware support that enhances the ability of the compiler to speculatively move code over branches while still preserving the exception behavior.
- Hardware speculation schemes that support reordering loads and stores.

Conditional or Predicated Instructions
- The concept behind conditional instructions is quite simple: an instruction refers to a condition, which is evaluated as part of the instruction's execution.
- Many newer architectures include some form of conditional instruction.
- The most common example is the conditional move, which moves a value from one register to another if the condition is true.

Conditional or Predicated Instructions (cont'd)
- Other variants: conditional loads and stores.
- Alpha, MIPS, SPARC, PowerPC, and the P6 all have simple conditional moves.
- The effect is to eliminate simple branches, changing a control dependence into a data dependence.
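The semantics of a conditional move, and the if-conversion it enables, can be sketched in Python (a behavioral model, not any particular ISA's instruction):

```python
def cmov(cond, src, dst):
    """Conditional move: result is src if cond holds, else the old dst.
    The instruction always executes; the control dependence on cond
    becomes a data dependence of the result on cond."""
    return src if cond else dst

# Branchy form:       if a < 0: a = -a
# If-converted form:  a = cmov(a < 0, -a, a)
def abs_branchy(a):
    if a < 0:
        a = -a
    return a

def abs_cmov(a):
    return cmov(a < 0, -a, a)
```

Both functions compute the same result, but the if-converted form contains no branch for the hardware to predict.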

Conditional Instruction Limitations
- Precise exceptions: if an exception happens prior to condition evaluation, it must be carried through the pipe. This is simple for register accesses, but consider a memory-protection violation or a page fault.
- Long conditional sequences (an if-then with a big then-body): if the task to be done is complex, it is better to evaluate the condition once and branch.
- Conditional instructions are most useful when the condition can be evaluated early; a data dependence in determining the condition makes them help less.

Conditional Instruction Limitations (cont'd)
- Wasted resources: conditional instructions consume real resources. This tends to work out in the superscalar case; in our simple 2-way model, even without a conditional instruction the other issue slot is often wasted anyway.
- Cycle-time or CPI issues: conditional instructions are more complex, and the danger is that they consume more cycles or lengthen the cycle time. Their utility is mainly to patch short control flaws, which may not be the common case, and things had better not slow down the real common case to support the uncommon one.

Compiler Speculation with HW Support
As we saw earlier, many programs have branches that can be accurately predicted at compile time, either from the program structure or by using a profile. In such cases, the compiler may want to speculate, either to improve the scheduling or to increase the issue rate. Predicated instructions provide one method to speculate, but they are really more useful when control dependences can be completely eliminated by if-conversion.

Compiler Speculation with HW Support (cont'd)
In many cases, we would like to move speculated instructions:
- do conditional work in advance of the branch (and before the condition evaluation), and
- nullify it if the branch goes the wrong way.
This also implies the need to nullify the exception behavior of the moved instructions. The limit: exceptions must not cause any destructive activity.

To Speculate Ambitiously…
To speculate ambitiously requires three capabilities:
- the ability of the compiler to find instructions that can be speculatively moved without affecting the program data flow;
- the ability of the hardware to ignore exceptions in speculated instructions until we know that such exceptions should really occur;
- the ability of the hardware to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.

HW Support for Preserving Exception Behavior
How do we make sure that a mispredicted speculated instruction (SI) cannot cause an exception? Four methods have been used to support speculation without introducing erroneous exceptions:
1. The hardware and the OS cooperatively ignore exceptions for speculative instructions.
2. Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
3. A set of status bits, called poison bits, is attached to the result registers written by speculative instructions when they cause exceptions; the poison bits cause a fault when a normal instruction attempts to use the register.
4. A mechanism indicates that an instruction is speculative, and the hardware buffers the instruction's result until it is certain that the instruction is no longer speculative.
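The poison-bit scheme (method 3) can be sketched as follows (a behavioral model; the class and function names are ours, not from any ISA):

```python
class Reg:
    """A register with a value and a poison bit, which records that a
    speculative producer faulted."""
    def __init__(self):
        self.value = 0
        self.poison = False

def spec_load(reg, fault, value=0):
    """Speculative load: on a fault, set the poison bit instead of trapping."""
    if fault:
        reg.poison = True
    else:
        reg.value, reg.poison = value, False

def normal_use(reg):
    """A non-speculative consumer faults only now, when the speculated
    result is actually needed on the executed path."""
    if reg.poison:
        raise RuntimeError("deferred exception from speculative instruction")
    return reg.value
```

If the branch goes the other way and no normal instruction ever reads the poisoned register, the deferred exception simply never fires, which is exactly the behavior speculation needs.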

Exception Types
- Exceptions that indicate a program error and normally cause termination (e.g., a memory-protection violation): these should not be handled for a speculative instruction when the prediction is wrong; such exceptions cannot be taken until we know the instruction is no longer speculative.
- Exceptions that are handled and normally resumed (e.g., a page fault): these can be handled for speculative instructions just as if they were normal instructions; they only have a negative performance effect when the prediction is wrong.

HW-SW Cooperation for Speculation
- Return an undefined value for any terminating exception: the program is allowed to continue, but will almost certainly generate incorrect results.
- If the excepting instruction is not speculative, the program is in error.
- If the excepting instruction is speculative, the program is correct, and the speculative result will simply be unused (no harm).
- This never causes a correct program to fail, no matter how much speculation is used.
- An incorrect program, which formerly might have received a terminating exception, will get an incorrect result. This is acceptable if the compiler can also generate a normal version of the program, which does not speculate and can receive a terminating exception.
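The first bullet can be sketched as follows (a behavioral model of a speculative load that returns an undefined value, here 0, instead of trapping on an invalid address; the names are ours):

```python
def spec_load_undefined(memory, addr):
    """Speculative load that returns an undefined value instead of a
    terminating exception when the address is invalid."""
    if 0 <= addr < len(memory):
        return memory[addr]
    return 0  # undefined value; harmless if the speculation is discarded

mem = [10, 20, 30]
# On a correctly speculated path the result is real:
good = spec_load_undefined(mem, 1)
# On a mispredicted path, the bad access no longer kills the program;
# the bogus value is simply never used:
bogus = spec_load_undefined(mem, 99)
```

In a correct program the bogus value is only produced on paths whose results are discarded, so the program's output is unaffected; only a buggy program that actually consumes the value sees an incorrect result instead of a crash.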

4.5 HW versus SW Speculation Mechanisms
- To speculate extensively, we must be able to disambiguate memory references.
- HW speculation works better when control flow is unpredictable, and when HW branch prediction is superior to SW branch prediction done at compile time.
- HW speculation maintains a completely precise exception model even for speculated instructions.
- HW speculation does not require the compensation or bookkeeping code needed by ambitious SW speculation.

HW versus SW Speculation Mechanisms (cont'd)
- HW speculation with dynamic scheduling does not require different code sequences to achieve good performance on different implementations of an architecture.
- HW speculation requires complex and additional hardware resources.
- Some designers have tried to combine the dynamic and compiler-based approaches to achieve the best of each.