CPSC614 Lec 6.1 Exploiting Instruction-Level Parallelism with Software Approach #1 E. J. Kim.


CPSC614 Lec 6.2 To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. Goal: keep the pipeline full.

CPSC614 Lec 6.3 Latencies

Instruction producing result    Instruction using result    Latency in cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Also: branch delay: 1 cycle; integer ALU op – branch: 1; integer load: 1; integer ALU – integer ALU: 1. (A small C helper expressing this rule follows.)
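The table gives, for each producer/consumer pair, the number of intervening clock cycles needed to avoid a stall. A small C helper makes the rule explicit (a sketch; issue_gap is a hypothetical name for the distance in cycles between the two instructions as scheduled):

    /* Stall cycles required when a dependent instruction issues
       issue_gap cycles after its producer, given the latency above. */
    int stalls_needed(int latency, int issue_gap) {
        int intervening = issue_gap - 1;   /* cycles between the two issues */
        return latency > intervening ? latency - intervening : 0;
    }

    /* Example: an ADD.D issued right after the L.D it depends on
       (issue_gap = 1, latency 1) needs 1 stall, as on slide 6.5. */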

CPSC614 Lec 6.4 Example

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDIU  R1, R1, #-8
       BNE     R1, R2, LOOP

CPSC614 Lec 6.5 Without Any Scheduling

                                Clock cycle issued
Loop:  L.D     F0, 0(R1)         1
       stall                     2
       ADD.D   F4, F0, F2        3
       stall                     4
       stall                     5
       S.D     F4, 0(R1)         6
       DADDIU  R1, R1, #-8       7
       stall                     8
       BNE     R1, R2, LOOP      9
       stall                    10

CPSC614 Lec 6.6 With Scheduling

                                Clock cycle issued
Loop:  L.D     F0, 0(R1)         1
       DADDIU  R1, R1, #-8       2
       ADD.D   F4, F0, F2        3
       stall                     4
       BNE     R1, R2, LOOP      5
       S.D     F4, 8(R1)         6

The S.D fills the delay slot of the BNE (a delayed branch). Moving the S.D past the DADDIU is not trivial: since R1 has already been decremented, the store offset must change from 0 to 8.

CPSC614 Lec 6.7 The actual work of operating on the array element takes 3 of the 6 cycles (the load, the add, and the store). The remaining 3 cycles are:
– Loop overhead (DADDIU, BNE)
– A stall
To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.

CPSC614 Lec 6.8 Reducing Loop Overhead: Loop Unrolling
– A simple scheme for increasing the number of instructions relative to the branch and overhead instructions.
– Simply replicates the loop body multiple times, adjusting the loop termination code (a source-level sketch follows).
– Improves scheduling:
  » It allows instructions from different iterations to be scheduled together.
– Uses different registers for each iteration.
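At the source level the transformation looks like this (a minimal C sketch; it assumes x, s, and i declared as in the earlier example, and uses the fact that the trip count 1000 is divisible by the unroll factor 4, so no cleanup loop is needed):

    /* Original loop */
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

    /* Unrolled by a factor of 4: one branch and one induction-variable
       update now serve four array elements. */
    for (i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }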

CPSC614 Lec 6.9 Unrolled Loop (No Scheduling)

                                Clock cycle issued
Loop:  L.D     F0, 0(R1)         1
       stall                     2
       ADD.D   F4, F0, F2        3
       stall                     4
       stall                     5
       S.D     F4, 0(R1)         6
       L.D     F6, -8(R1)        7
       stall                     8
       ADD.D   F8, F6, F2        9
       stall                    10
       stall                    11
       S.D     F8, -8(R1)       12
       L.D     F10, -16(R1)     13
       stall                    14
       ADD.D   F12, F10, F2     15
       stall                    16
       stall                    17
       S.D     F12, -16(R1)     18
       L.D     F14, -24(R1)     19
       stall                    20
       ADD.D   F16, F14, F2     21
       stall                    22
       stall                    23
       S.D     F16, -24(R1)     24
       DADDIU  R1, R1, #-32     25
       stall                    26
       BNE     R1, R2, LOOP     27
       stall                    28

CPSC614 Lec 6.10 Loop Unrolling Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer. Unrolling improves the performance of the loop by eliminating overhead instructions.

CPSC614 Lec 6.11 Loop Unrolling (Scheduling)

                                Clock cycle issued
Loop:  L.D     F0, 0(R1)         1
       L.D     F6, -8(R1)        2
       L.D     F10, -16(R1)      3
       L.D     F14, -24(R1)      4
       ADD.D   F4, F0, F2        5
       ADD.D   F8, F6, F2        6
       ADD.D   F12, F10, F2      7
       ADD.D   F16, F14, F2      8
       S.D     F4, 0(R1)         9
       S.D     F8, -8(R1)       10
       DADDIU  R1, R1, #-32     11
       S.D     F12, 16(R1)      12
       BNE     R1, R2, LOOP     13
       S.D     F16, 8(R1)       14

CPSC614 Lec 6.12 Summary Goal: To know when and how the ordering among instructions may be changed. This process must be performed in a methodical fashion either by a compiler or by hardware.

CPSC614 Lec 6.13 To obtain the final unrolled code, we must:
– Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount by which to adjust the S.D offset.
– Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
– Use different registers to avoid unnecessary constraints.
– Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.

CPSC614 Lec 6.14
– Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
– Schedule the code, preserving any dependences needed to yield the same result as the original code.

Loop Unrolling I (No Delayed Branch)

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       L.D     F0, -8(R1)
       ADD.D   F4, F0, F2
       S.D     F4, -8(R1)
       L.D     F0, -16(R1)
       ADD.D   F4, F0, F2
       S.D     F4, -16(R1)
       L.D     F0, -24(R1)
       ADD.D   F4, F0, F2
       S.D     F4, -24(R1)
       DADDIU  R1, R1, #-32
       BNE     R1, R2, LOOP

Reusing F0 and F4 in every copy of the body creates name dependences between the copies; the L.D → ADD.D → S.D chain within each copy is a true dependence.

Loop Unrolling II (Register Renaming)

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       L.D     F6, -8(R1)
       ADD.D   F8, F6, F2
       S.D     F8, -8(R1)
       L.D     F10, -16(R1)
       ADD.D   F12, F10, F2
       S.D     F12, -16(R1)
       L.D     F14, -24(R1)
       ADD.D   F16, F14, F2
       S.D     F16, -24(R1)
       DADDIU  R1, R1, #-32
       BNE     R1, R2, LOOP

With renaming, only the true dependences (the L.D → ADD.D → S.D chain within each copy) remain.

CPSC614 Lec 6.17 With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
– Potential shortfall in registers: register pressure.
– Register pressure arises because scheduling code to increase ILP causes the number of live values to increase, and it may not be possible to allocate all the live values to registers.
– The combination of unrolling and aggressive scheduling can cause this problem.

CPSC614 Lec 6.18 Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.

CPSC614 Lec 6.19 Unrolling with Two-Issue

       Integer instruction          FP instruction           Clock cycle
Loop:  L.D     F0, 0(R1)                                      1
       L.D     F6, -8(R1)                                     2
       L.D     F10, -16(R1)         ADD.D  F4, F0, F2         3
       L.D     F14, -24(R1)         ADD.D  F8, F6, F2         4
       L.D     F18, -32(R1)         ADD.D  F12, F10, F2       5
       S.D     F4, 0(R1)            ADD.D  F16, F14, F2       6
       S.D     F8, -8(R1)           ADD.D  F20, F18, F2       7
       S.D     F12, -16(R1)                                   8
       DADDIU  R1, R1, #-40                                   9
       S.D     F16, 16(R1)                                   10
       BNE     R1, R2, LOOP                                  11
       S.D     F20, 8(R1)                                    12

CPSC614 Lec 6.20 Static Branch Prediction Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.

CPSC614 Lec 6.21 Static Branch Prediction
Predict every branch as taken:
– The simplest scheme.
– Average misprediction rate for SPEC: 34% (ranging from 9% to 59%).
Predict on the basis of branch direction (a C sketch of this heuristic follows):
– Backward-going branches: predict taken.
– Forward-going branches: predict not taken.
– Unlikely to achieve an overall misprediction rate below 30% to 40%.
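The direction-based heuristic fits in a few lines of C (a sketch with hypothetical names; a real predictor works on instruction addresses at compile or decode time):

    /* Backward-taken / forward-not-taken: a branch whose target precedes
       it in program order is assumed to close a loop and be taken. */
    int predict_taken(unsigned long branch_pc, unsigned long target_pc) {
        return target_pc < branch_pc;    /* backward branch => taken */
    }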

CPSC614 Lec 6.22 Static Branch Prediction
Predict branches on the basis of profile information collected from earlier runs.
– An individual branch is often highly biased toward taken or untaken (bimodally distributed).
– Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.

CPSC614 Lec 6.23 VLIW
Very Long Instruction Word:
– Relies on compiler technology to minimize potential data hazard stalls.
– Formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
– Wide instructions with multiple operations per instruction (64 or 128 bits, or more).
– Example: the Intel IA-64 architecture.

CPSC614 Lec 6.24 Basic VLIW Approach
– VLIWs use multiple, independent functional units.
– A VLIW packages the multiple operations into one very long instruction (a sketch of one possible format follows).
– The hardware a superscalar needs to check for dependences among simultaneously issued instructions is unnecessary.
– Uses loop unrolling, scheduling, etc.
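As a concrete illustration of the format, one very long instruction word might carry five operation slots; the slot mix and field names below are assumptions for illustration, not a real ISA:

    #include <stdint.h>

    /* One hypothetical VLIW instruction word: the compiler fills each
       slot with an independent operation for that unit, or a no-op. */
    struct vliw_word {
        uint32_t mem_op[2];   /* two memory-reference slots (load/store) */
        uint32_t fp_op[2];    /* two floating-point operation slots      */
        uint32_t int_op;      /* one integer ALU / branch slot           */
    };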

CPSC614 Lec 6.25
– Local scheduling: scheduling the code within a single basic block.
– Global scheduling: scheduling code across branches – much more complex.
– Trace scheduling: see Section 4.5.
[Figure 4.5: VLIW instructions]

CPSC614 Lec 6.26 Problems
– Increase in code size.
– Wasted functional units: in the previous example, only about 60% of the functional units were used.

CPSC614 Lec 6.27 Detecting and Enhancing Loop-Level Parallelism
– Loop-level parallelism: analyzed at the source level.
– ILP: analyzed on the machine-level code, after compilation.

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

CPSC614 Lec 6.28 Advanced Compiler Support for Exposing and Exploiting ILP

for (i = 1; i <= 100; i++) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

CPSC614 Lec 6.29 Loop-Carried Dependence
Data accesses in later iterations are dependent on data values produced in earlier iterations.

for (i = 1; i <= 100; i++) {
    A[i+1] = A[i] + C[i];      /* S1: loop-carried dependence on A */
    B[i+1] = B[i] + A[i+1];    /* S2: loop-carried dependence on B */
}

These loop-carried dependences force successive iterations of this loop to execute in series.

CPSC614 Lec 6.30 Does a loop-carried dependence mean there is no parallelism?

Consider:

for (i = 0; i < 8; i = i + 1)
    A = A + C[i];    /* S1 */

Could compute:

"Cycle 1":  temp0 = C[0] + C[1];
            temp1 = C[2] + C[3];
            temp2 = C[4] + C[5];
            temp3 = C[6] + C[7];
"Cycle 2":  temp4 = temp0 + temp1;
            temp5 = temp2 + temp3;
"Cycle 3":  A = A + (temp4 + temp5);

This relies on the associative nature of "+".

CPSC614 Lec 6.31

for (i = 1; i <= 100; i++) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2: loop-carried dependence (S2 feeds S1 of the next iteration) */
}

Despite this loop-carried dependence, this loop can be made parallel.

CPSC614 Lec 6.32

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i++) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

CPSC614 Lec 6.33 Recurrence
A recurrence exists when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.
Detecting a recurrence can be important:
– Some architectures (especially vector computers) have special support for executing recurrences.
– Some recurrences can be the source of a reasonable amount of parallelism.

CPSC614 Lec 6.34

for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i-1] + Y[i];      /* dependence distance: 1 */

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];      /* dependence distance: 5 */

The larger the distance, the more potential parallelism can be obtained by unrolling the loop, as the sketch below shows.
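For example, with distance 5 the second loop can be unrolled 5 times into a body of five mutually independent statements (a sketch; the trip count 95 divides evenly by 5, so no cleanup code is needed):

    for (i = 6; i <= 96; i = i + 5) {
        /* Within one body no statement writes an element that another
           statement reads, so all five can execute in parallel. */
        Y[i]   = Y[i-5] + Y[i];
        Y[i+1] = Y[i-4] + Y[i+1];
        Y[i+2] = Y[i-3] + Y[i+2];
        Y[i+3] = Y[i-2] + Y[i+3];
        Y[i+4] = Y[i-1] + Y[i+4];
    }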

CPSC614 Lec 6.35 Finding Dependences
– Determining whether a dependence actually exists is NP-complete in general.
– Dependence analysis is the basic tool for detecting loop-level parallelism, but it applies only under a limited set of circumstances.
– Techniques include the greatest common divisor (GCD) test (sketched below), points-to analysis, interprocedural analysis, …
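As a sketch of how the GCD test works (assuming a loop that writes X[a*i + b] and reads X[c*i + d]; the function names here are illustrative):

    #include <stdio.h>

    /* Euclid's algorithm. */
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /* GCD test: a dependence between a store to X[a*i + b] and a load
       of X[c*i + d] is possible only if gcd(a, c) divides d - b.
       Returns 1 if a dependence may exist, 0 if it provably cannot. */
    static int may_depend(int a, int b, int c, int d) {
        return (d - b) % gcd(a, c) == 0;
    }

    int main(void) {
        /* X[2*i + 3] written, X[2*i] read: gcd(2, 2) = 2 does not divide
           -3, so the two accesses never touch the same element. */
        printf("%d\n", may_depend(2, 3, 2, 0));   /* prints 0 */
        return 0;
    }

Note that the test is conservative: a result of 1 only means a dependence may exist, not that it does.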

CPSC614 Lec 6.36 Eliminating Dependent Computation
Algebraic simplifications of expressions.
Copy propagation: eliminates operations that copy values.

DADDIU  R1, R2, #4
DADDIU  R1, R1, #4

becomes

DADDIU  R1, R2, #8

CPSC614 Lec 6.37 Eliminating Dependent Computation
Tree height reduction: reduces the height of the tree structure representing a computation.

ADD  R1, R2, R3
ADD  R4, R1, R6
ADD  R8, R4, R7

becomes

ADD  R1, R2, R3
ADD  R4, R6, R7
ADD  R8, R1, R4

(The original chain of three dependent adds has height 3; after the rewrite, the first two adds are independent, so the height is 2.)

CPSC614 Lec 6.38 Eliminating Dependent Computation
Recurrences:

sum = sum + x1 + x2 + x3 + x4 + x5

becomes

sum = (sum + x1) + (x2 + x3) + (x4 + x5)

The rewritten form exposes parallelism: the three parenthesized adds are independent, shortening the serial chain from five adds to three.

CPSC614 Lec 6.39 Software Pipelining Technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop. By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.

CPSC614 Lec 6.40 Software Pipelining
– The software counterpart to what Tomasulo's algorithm does in hardware.
– Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.
– Start-up code before the loop and finish-up code after the loop are required.

CPSC614 Lec 6.41 Software Pipelining [figure omitted in transcript]

CPSC614 Lec 6.42 Software Pipelining – Example
Show a software-pipelined version of the following loop, omitting the start-up and finish-up code. (One possible answer is sketched after this slide.)

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDIU  R1, R1, #-8
       BNE     R1, R2, Loop
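At the source level, one possible software-pipelined version looks like the following C sketch (hypothetical temporaries t_add and t_load; x[1..1000] and s as in the original loop; the start-up and finish-up code that the slide omits is shown so the sketch is complete). Each kernel pass stores the result for iteration i, does the add for iteration i-1, and the load for iteration i-2, so no value is used in the pass that produces it:

    double t_add, t_load;

    /* Start-up: load and add for the first element, load for the second. */
    t_add  = x[1000] + s;
    t_load = x[999];

    /* Kernel: store for i, add for i-1, load for i-2. */
    for (i = 1000; i > 2; i = i - 1) {
        x[i]   = t_add;        /* store result of iteration i */
        t_add  = t_load + s;   /* add for iteration i-1       */
        t_load = x[i - 2];     /* load for iteration i-2      */
    }

    /* Finish-up: drain the last two results. */
    x[2] = t_add;
    x[1] = t_load + s;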

CPSC614 Lec 6.43 Software Pipelining
– Software pipelining consumes less code space than loop unrolling.
– Loop unrolling reduces the overhead of the loop (the branch and counter-update code).
– Software pipelining reduces the time during which the loop is not running at peak speed to once per loop, at the beginning and the end.

CPSC614 Lec 6.44 [figure omitted in transcript]

CPSC614 Lec 6.45 HW Support for More Parallelism at Compile Time
Conditional (predicated) instructions – an extension of the instruction set.
A conditional instruction refers to a condition that is evaluated as part of the instruction's execution:
– If the condition is true, the instruction is executed normally.
– If it is false, the instruction becomes a no-op.
– Example: the conditional move.

CPSC614 Lec 6.46 Example

if (A == 0) { S = T; }

With R1 = A, R2 = S, R3 = T, the branch version is:

       BNEZ   R1, L
       ADDU   R2, R3, R0
L:

With a conditional move, which moves only if the third operand is equal to zero:

       CMOVZ  R2, R3, R1

CPSC614 Lec 6.47
– Conditional moves are used to change a control dependence into a data dependence.
– Handling multiple branches per cycle is complex, so conditional moves provide a way of reducing branch pressure.
– A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.
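In source terms, the transformation replaces an if with a select (a sketch using lowercase versions of the variables from the previous slide):

    /* Branch version: control dependence on (a == 0). */
    if (a == 0)
        s = t;

    /* Conditional-move version: straight-line code in which the
       condition is just another data input to the select. */
    s = (a == 0) ? t : s;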