Architecture-dependent optimizations: functional units, delay slots and dependency analysis

RISC architectures

The pipeline structure of modern architectures requires careful instruction scheduling:
- If instruction I1 creates a value, there may be a latency that must elapse before another instruction I2 can use this value.
- If an instruction awaits the result of a previous computation, the pipeline may have to stall until the result becomes available.
- A branch instruction is time-consuming and affects the contents of the instruction cache; execution cannot start at the destination before one or more cycles have elapsed.

Instruction scheduling

Purpose: minimize stalls and delays, fill delay slots with useful computations, and minimize the execution time of a basic block.
Tool: dependency analysis, which uncovers legal reorderings of instructions and the parallelism available within basic blocks and beyond.
Applications:
- Filling delay slots is important for all programs.
- Dependency analysis is critical for reordering loop computations on vector processors and other architectures.
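As a concrete illustration of what the scheduler is minimizing, here is a minimal sketch (not from the original slides; all names are illustrative) that counts the stall cycles a given instruction order incurs on a single-issue pipeline, given the latency of each dependence:

```python
def count_stalls(order, deps):
    """order: instruction ids in issue order (single-issue pipeline).
    deps: {(producer, consumer): latency}, meaning the consumer may
    issue no earlier than `latency` cycles after the producer."""
    issue = {}
    clock = 0
    stalls = 0
    for instr in order:
        earliest = clock  # next free issue slot
        for (prod, cons), lat in deps.items():
            if cons == instr and prod in issue:
                earliest = max(earliest, issue[prod] + lat)
        stalls += earliest - clock          # cycles wasted waiting
        issue[instr] = earliest
        clock = earliest + 1
    return stalls

# With a load latency of 2 cycles, a load immediately followed by its
# consumer stalls once; separating them with an independent instruction
# hides the latency:
assert count_stalls(["ld", "sub"], {("ld", "sub"): 2}) == 1
assert count_stalls(["ld", "other", "sub"], {("ld", "sub"): 2}) == 0
```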

Dependence relations

A data dependence is a constraint that arises from the flow of data between statements; violating a data dependence by reordering may lead to incorrect results.
- If S1 sets a value that S2 uses, there is a flow dependence (true dependence) between S1 and S2.
- If S1 uses some variable's value and S2 sets it, there is an antidependence between them.
- If both S1 and S2 set the value of some variable, there is an output dependence between them.
- If both S1 and S2 read the value of some variable, there is an input dependence between them. This does not impose an ordering.
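These four cases can be read directly off the read and write sets of the two statements. A minimal sketch (illustrative names, not from the slides):

```python
def classify(s1_reads, s1_writes, s2_reads, s2_writes):
    """Dependences from S1 to S2, where S1 precedes S2."""
    kinds = []
    if s1_writes & s2_reads:
        kinds.append("flow")        # S1 sets a value that S2 uses
    if s1_reads & s2_writes:
        kinds.append("anti")        # S1 uses a value that S2 overwrites
    if s1_writes & s2_writes:
        kinds.append("output")      # both set the same variable
    if s1_reads & s2_reads:
        kinds.append("input")       # imposes no ordering
    return kinds

# S1: x := a + b;   S2: a := x * 2
# classify({"a", "b"}, {"x"}, {"x"}, {"a"}) -> ["flow", "anti"]
```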

The dependence DAG of a basic block

There is an edge from I1 to I2 in the dependence DAG if:
- I1 writes a register or location that I2 uses: I1 →f I2 (flow)
- I1 uses a register or location that I2 changes: I1 →a I2 (anti)
- I1 and I2 write to the same register or location: I1 →o I2 (output)
- I1 and I2 exhibit a structural hazard: a load followed by a store cannot be interchanged unless the addresses are known to be distinct:
    X := A[I]; A[J] := Y;  -- cannot interchange, X might get Y
If there is an edge from I1 to I2, I2 must not start executing until I1 has executed for some number of cycles.

Example

1: R3 = [R15]
2: R4 = [R15 + 4]
3: R2 = R3 - R4    -- needs R4: stall one cycle
4: R5 = [R12]
5: R12 = R12 + 4
6: R6 = R3 * R5
7: [R15 + 4] = R3
8: R5 = R…
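A sketch of DAG construction over such a block, modelling each instruction by the registers it defines and uses plus a memory flag; memory is treated as one conservative location, which yields the structural-hazard edge of the previous slide (names illustrative):

```python
from collections import namedtuple

Instr = namedtuple("Instr", "name defs uses mem")   # mem: None, "load" or "store"

def dependence_dag(block):
    edges = []                                      # (earlier, later, kind)
    for i, a in enumerate(block):
        for b in block[i + 1:]:
            if a.defs & b.uses:
                edges.append((a.name, b.name, "flow"))
            if a.uses & b.defs:
                edges.append((a.name, b.name, "anti"))
            if a.defs & b.defs:
                edges.append((a.name, b.name, "output"))
            # A load and a store (or two stores) may not be interchanged
            # unless their addresses are known to be distinct.
            if a.mem and b.mem and "store" in (a.mem, b.mem):
                edges.append((a.name, b.name, "memory"))
    return edges

# The first instructions of the example above:
block = [
    Instr("1", {"R3"}, {"R15"}, "load"),            # R3 = [R15]
    Instr("2", {"R4"}, {"R15"}, "load"),            # R4 = [R15 + 4]
    Instr("3", {"R2"}, {"R3", "R4"}, None),         # R2 = R3 - R4
    Instr("7", set(), {"R15", "R3"}, "store"),      # [R15 + 4] = R3
]
# dependence_dag(block) includes flow edges 1->3, 2->3 and 1->7,
# plus conservative memory edges from both loads to the store.
```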

Contention for resources

A functional unit is pipelined and consists of multiple resources; instructions flowing through the pipeline may conflict on the use of a resource. E.g., the floating-point unit on MIPS:
- A: Mantissa add
- E: Exception test
- M: Multiplier first stage
- N: Multiplier second stage
- R: Adder round
- S: Operand shift
- U: Unpack
Add uses successively U, then S and A, then A and R, then R and S (4 cycles).
Mul uses U, M, M, M, M, M and A, R (7 cycles).
Whether two instructions conflict depends on their relative starting times. Edges in the dependency graph are labelled with latencies (>= 1).
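A sketch of the conflict test using per-cycle reservation tables; the tables below are an assumed transcription of the stage usage just listed:

```python
# One set of resources per cycle, from the stage usage above (assumed):
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]                # 4 cycles
MUL = [{"U"}, {"M"}, {"M"}, {"M"}, {"M"}, {"M", "A"}, {"R"}]     # 7 cycles

def conflicts(first, second, offset):
    """True if `second`, issued `offset` cycles after `first`, needs a
    resource in the same cycle in which `first` holds it."""
    for t, needed in enumerate(second):
        u = t + offset                  # cycle relative to `first`'s issue
        if 0 <= u < len(first) and first[u] & needed:
            return True
    return False

# An add issued 3 cycles after a multiply collides with it on the shared
# adder stage A; five cycles apart, the two coexist:
assert conflicts(MUL, ADD, 3) is True
assert conflicts(MUL, ADD, 5) is False
```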

Branch scheduling

An important use of the dependency graph is to fill delay slots (a branch takes two cycles to reach its destination).

Before scheduling:
R2 = [R1]
R3 = [R1+4]
R4 = R2 + R3   -- (stall)
R5 = R2 - 1
goto L1
nop            -- delay slot wasted

After scheduling:
R2 = [R1]
R3 = [R1+4]
R5 = R2 - 1
goto L1
R4 = R2 + R3   -- fills the delay slot
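A deliberately coarse sketch of candidate selection for the delay slot (assumptions: edges as produced by a dependence-DAG pass like the one above; a real scheduler must also respect condition codes and, for conditional branches, annulment, as the next slide discusses):

```python
def delay_slot_candidates(instrs, branch, edges):
    """instrs: instruction names in the block; branch: the final jump;
    edges: (earlier, later, kind) dependence triples within the block.
    An instruction may move into the delay slot if nothing in the block,
    including the branch itself, depends on it."""
    has_dependent = {a for (a, b, _) in edges}
    return [i for i in instrs if i != branch and i not in has_dependent]

# In the example above, "R4 = R2 + R3" has no dependents in the block,
# so it may legally follow "goto L1" in the delay slot.
```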

Conditional jumps and delay slots

The instruction in the delay slot is executed while the jump is in progress. What if the jump is not taken? We need a mechanism to annul the instruction.
- Branch prediction: if the target is assumed known, fill the delay slot with the first instruction of the target block.
- If both destinations start with the same instruction, it is the ideal choice for the delay slot.
- A good heuristic for loops: assume that a backwards conditional jump is usually taken, and move the first instruction of the loop into the delay slot of the branch at the end.
- A call instruction also has a delay slot: fill it with a parameter push.

A greedy algorithm: list scheduling

Finding an optimal schedule for a DAG is NP-complete; this simple algorithm is O(N^2) at worst, and usually linear. The roots of the DAG are instructions without predecessors.
First pass, from leaves to roots: compute the latest possible starting time for each instruction, relative to the end of the block.
- For a leaf: the execution time of the instruction.
- For an inner node: the maximum delay imposed by its successors. E.g., if In is followed by Im, Im can start at T - 4, and there is a latency of 2 between In and Im, then In must start by T - 6.

List scheduling: second pass

Second pass, from roots to leaves: schedule the instructions with the greatest slack (farthest from the end of the block) that can start as early as possible. At each step:
- D1: candidates with the largest remaining delay
- D2: candidates with the earliest possible starting time (computed from the starting times of their predecessors)
Choose from D1 if it is unique, else from D2 if it is unique; otherwise use heuristics:
- choose the earliest starting time, or
- choose the instruction that uses the least-used pipeline, or
- choose an instruction that frees a register.
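Both passes fit in a short sketch (illustrative, single-issue; `exec_time` and the tie-breaking are simplified relative to the heuristics listed above):

```python
def list_schedule(nodes, edges, exec_time):
    """nodes: instruction ids; edges: {(pred, succ): latency};
    exec_time: {node: cycles}. Returns an issue order."""
    succs = {n: [] for n in nodes}
    preds = {n: [] for n in nodes}
    for (a, b), lat in edges.items():
        succs[a].append((b, lat))
        preds[b].append((a, lat))

    # Pass 1, leaves to roots: latest-start delay to the end of the block.
    delay = {}
    def compute_delay(n):
        if n not in delay:
            delay[n] = max([lat + compute_delay(s) for s, lat in succs[n]]
                           + [exec_time[n]])
        return delay[n]
    for n in nodes:
        compute_delay(n)

    # Pass 2, roots to leaves: greedily issue the ready instruction with
    # the greatest remaining delay, earliest possible start as tie-break.
    issue, schedule, clock, pending = {}, [], 0, set(nodes)
    while pending:
        ready = [n for n in pending if all(p in issue for p, _ in preds[n])]
        start = {n: max([issue[p] + lat for p, lat in preds[n]], default=0)
                 for n in ready}
        runnable = [n for n in ready if start[n] <= clock]
        if not runnable:
            clock = min(start[n] for n in ready)    # forced stall
            continue
        best = max(runnable, key=lambda n: (delay[n], -start[n]))
        issue[best] = clock
        schedule.append(best)
        pending.remove(best)
        clock += 1
    return schedule
```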

Procedure integration: inlining

Calls make optimizations harder, and there is a large payoff to local optimizations over large basic blocks, so inlining subprogram bodies is often very effective:
- it exposes the values of the actuals in the body,
- it creates larger basic blocks,
- it saves the cost of the call.
Inlining can be done at the tree level or at the RTL level; in both cases it can enable other optimizations. Possible disadvantages: code size increases and debugging becomes harder.

Inlining as a tree transformation

- Treat the body of the subprogram as a generic unit.
- Each inlined body needs its own local variables.
- Global references are captured at the point of definition.
- Inlining works like instantiation: replace formals with actuals, then complete the analysis and expansion of the inserted body.
- Replace multiple return statements where needed.
- Introduce a temporary to hold the return value of a function.

Name capture: recognize global entities

function memo (x : integer) return integer is
   local : integer := x ** 2;
begin
   Saved := Saved + local + x;  -- Saved is global
   return Saved;
end memo;
...
Val := memo (15);

becomes

declare
   local : integer := 15 ** 2;   -- each inlining has its own
   result : integer;             -- maybe superfluous if context is an assignment
begin
   Saved := Saved + local + 15;  -- Saved is the same entity in all inlinings
   result := Saved;
end;
Val := result;

Handling return statements

The inlined subprogram needs a label to serve as a single exit point.
- In a function: identify the target of the result, or create a temporary for it; replace each return with an assignment to the target, followed by a goto to the exit label.
- In a procedure: replace each return with a goto to the exit label.
Optimizations:
- If the body of the function is a single return statement and the context is an assignment, the right-hand side can be replaced with the expression.
- If the procedure has no return statement, the exit label is superfluous.
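A sketch of the rewriting on a flat statement list (a stand-in for the real tree transformation; the tuple encoding of statements is made up for the example):

```python
def rewrite_returns(body, result, exit_label):
    """Replace each ('return', expr) in an inlined function body with an
    assignment to the result temporary plus a jump to a single exit label."""
    out = []
    for stmt in body:
        if stmt[0] == "return":
            out.append(("assign", result, stmt[1]))
            out.append(("goto", exit_label))
        else:
            out.append(stmt)
    out.append(("label", exit_label))
    return out

# If the only return is the last statement of the body, the trailing goto
# and the exit label are superfluous and can be dropped, as noted above.
```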

Parameter passing

If the actual is an expression, it must be evaluated only once: create a temporary in the block and replace the formal with the temporary.

Val2 := memo (x * f (z));

becomes

declare
   C1 : integer := x * f (z);
   local : integer := C1 ** 2;
begin
   Saved := Saved + local + C1;
   Val2 := Saved;  -- context is an assignment
end;

Parameter passing: variables

An in-out parameter is a location; we cannot create a temporary for it and must use a renaming declaration instead.

procedure incr (x : in out integer) is
begin
   x := x + 1;
end;
...
incr (a (i));

becomes

declare
   c1 : integer renames a (i);
begin
   c1 := c1 + 1;
end;

Context includes more than global names

The semantics of an inlined call must be identical to that of the original program:
- If constraint checks are not suppressed in the body, they must not be suppressed in the inlined block, even if they are suppressed at the point of call.
- The status of constraint checks is part of the closure of the body to inline, and must be applied when analyzing the inlined block.

Specialized inlining: loop unrolling

Create successive copies of the body of the loop: this saves tests, creates bigger basic blocks, and increases instruction-level parallelism.

for j in 1 .. N loop
   loop_body;
end loop;

becomes

for k in 1 .. N / r loop            -- unroll r times
   loop_body
   loop_body [j -> j + 1]           -- replace the loop variable in each copy
   ...
   loop_body [j -> j + r - 1]
end loop;
for j in (N / r) * r + 1 .. N loop  -- leftover iterations
   loop_body
end loop;
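The same transformation written out concretely (a sketch in Python rather than the source language; r = 4, summing an array):

```python
def sum_unrolled(a):
    n, r = len(a), 4
    total, k = 0, 0
    # Main loop: r copies of the body per iteration -> fewer loop tests,
    # a larger basic block, and more instruction-level parallelism.
    while k + r <= n:
        total += a[k]
        total += a[k + 1]
        total += a[k + 2]
        total += a[k + 3]
        k += r
    while k < n:            # leftover iterations (n mod r of them)
        total += a[k]
        k += 1
    return total
```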