Instruction Scheduling Hal Perkins Autumn 2005

Slides:

Advertisements

Similar presentations

Register Allocation Consists of two parts: Goal : minimize spills

Advertisements

Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.

Register Allocation Zach Ma.

U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling John Cavazos University.

Architecture-dependent optimizations Functional units, delay slots and dependency analysis.

1 CS 201 Compiler Construction Machine Code Generation.

11/21/2002© 2002 Hal Perkins & UW CSEO-1 CSE 582 – Compilers Instruction Scheduling Hal Perkins Autumn 2002.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:

CS6290 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

6/9/2015© Hal Perkins & UW CSEU-1 CSE P 501 – Compilers SSA Hal Perkins Winter 2008.

U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts, Amherst Advanced Compilers CMPSCI 710.

Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

Improving Code Generation Honors Compilers April 16 th 2002.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

Instruction Scheduling II: Beyond Basic Blocks Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp.

Saman Amarasinghe ©MIT Fall 1998 Simple Machine Model Instructions are executed in sequence –Fetch, decode, execute, store results –One instruction.

Fall 2002 Lecture 14: Instruction Scheduling. Saman Amarasinghe ©MIT Fall 1998 Outline Modern architectures Branch delay slots Introduction to.

10/1/2015© Hal Perkins & UW CSEG-1 CSE P 501 – Compilers Intermediate Representations Hal Perkins Autumn 2009.

Spring 2014Jim Hogg - UW - CSE - P501O-1 CSE P501 – Compiler Construction Instruction Scheduling Issues Latencies List scheduling.

1 Code Generation Part II Chapter 8 (1 st ed. Ch.9) COP5621 Compiler Construction Copyright Robert van Engelen, Florida State University,

1 Code Generation Part II Chapter 9 COP5621 Compiler Construction Copyright Robert van Engelen, Florida State University, 2005.

Instruction Selection and Scheduling. The Problem Writing a compiler is a lot of work Would like to reuse components whenever possible Would like to automate.

Local Instruction Scheduling — A Primer for Lab 3 — Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in.

Local Instruction Scheduling — A Primer for Lab 3 — Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled.

2/22/2016© Hal Perkins & UW CSEP-1 CSE P 501 – Compilers Register Allocation Hal Perkins Winter 2008.

Instruction Scheduling Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.

Instruction Scheduling: Beyond Basic Blocks Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

Register Allocation Hal Perkins Autumn 2009

Morgan Kaufmann Publishers

CS203 – Advanced Computer Architecture

Pipeline Implementation (4.6)

Local Instruction Scheduling

Instruction Scheduling for Instruction-Level Parallelism

Intermediate Representations Hal Perkins Autumn 2011

Register Allocation Hal Perkins Autumn 2011

Instruction Scheduling Hal Perkins Summer 2004

Wrapping Up Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit.

Instruction Scheduling: Beyond Basic Blocks

Instruction Scheduling Hal Perkins Winter 2008

Local Instruction Scheduling — A Primer for Lab 3 —

CS 201 Compiler Construction

How to improve (decrease) CPI

Optimizing Transformations Hal Perkins Winter 2008

Register Allocation Hal Perkins Summer 2004

Register Allocation Hal Perkins Autumn 2005

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

Static Code Scheduling

Lecture 15: Code Generation - Instruction Selection

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lecture 16: Register Allocation

Instruction Scheduling: Beyond Basic Blocks

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lecture 17: Register Allocation via Graph Colouring

How to improve (decrease) CPI

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Code Generation Part II

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Loop-Level Parallelism

Target Code Generation

CSE P 501 – Compilers SSA Hal Perkins Autumn /31/2019

Instruction Scheduling Hal Perkins Autumn 2011

Conceptual execution on a processor which exploits ILP

Presentation transcript:

Instruction Scheduling Hal Perkins Autumn 2005 CSE P 501 – Compilers Instruction Scheduling Hal Perkins Autumn 2005 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Agenda Instruction scheduling issues – latencies List scheduling Credits: Adapted from slides by Keith Cooper, Rice University 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Issues Many operations have non-zero latencies Modern machines can issue several operations per cycle Loads & Stores may or may not block  may be slots after load/store for other useful work Branch costs vary Branches on modern processors typically have delay slots (or on x86, delay slots in underlying execution engine – try to generate code that doesn’t get in the way of exploiting these) GOAL: Scheduler should reorder instructions to hide latencies and take advantage of delay slots 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Some Idealized Latencies Operation Cycles LOAD 3 STORE ADD 1 MULT 2 SHIFT BRANCH 0 TO 8 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Example: w = w*2*x*y*z; Simple schedule Loads early 1 LOAD r1 <- w 4 ADD r1 <- r1,r1 5 LOAD r2 <- x 8 MULT r1 <- r1,r2 9 LOAD r2 <- y 12 MULT r1 <- r1,r2 13 LOAD r2 <- z 16 MULT r1 <- r1,r2 18 STORE w <- r1 21 r1 free 2 registers, 20 cycles Loads early 1 LOAD r1 <- w 2 LOAD r2 <- x 3 LOAD r3 <- y 4 ADD r1 <- r1,r1 5 MULT r1 <- r1,r2 6 LOAD r2 <- z 7 MULT r1 <- r1,r3 9 MULT r1 <- r1,r2 11 STORE w <- r1 14 r1 is free 3 registers, 13 cycles 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Instruction Scheduling Problem Given a code fragment for some machine and latencies for each operation, reorder to minimize execution time Constraints Produce correct code Minimize wasted cycles Avoid spilling registers Do this efficiently 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Precedence Graph Nodes n are operations Attributes of each node type – kind of operation delay – latency If node n2 uses the result of node n1, there is an edge e = (n1,n2) in the graph 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Example Graph Code a LOAD r1 <- w b ADD r1 <- r1,r1 c LOAD r2 <- x d MULT r1 <- r1,r2 e LOAD r2 <- y f MULT r1 <- r1,r2 g LOAD r2 <- z h MULT r1 <- r1,r2 i STORE w <- r1 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Schedules (1) A correct schedule S maps each node n into a non-negative integer representing its cycle number, and S (n ) >= 0 for all nodes n (obvious) If (n1,n2) is an edge, then S(n1)+delay(n1) <= S(n2) For each type t there are no more operations of type t in any cycle than the target machine can issue 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Schedules (2) The length of a schedule S, denoted L(S) is L(S) = maxn ( S(n )+delay(n ) ) The goal is to find the shortest possible correct schedule Other possible goals: minimize use of registers, power, space, … 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Constraints Main points Collectively this makes scheduling NP-complete All operands must be available Multiple operations can be ready at any given point Moving operations can lengthen register lifetimes Moving uses near definitions can shorten register lifetimes Operations can have multiple predecessors Collectively this makes scheduling NP-complete Local scheduling is the simpler case Straight-line code Consistent, predictable latencies 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Algorithm Overview Build a precedence graph P Compute a priority function over the nodes in P (typical: longest latency-weighted path) Use list scheduling to construct a schedule, one cycle at a time Use queue of operations that are ready At each cycle Chose a ready operation and schedule it Update ready queue Rename registers to avoid false dependencies and conflicts 2/25/2019 © 2002-05 Hal Perkins & UW CSE

List Scheduling Algorithm Cycle = 1; Ready = leaves of P; Active = empty; while (Ready and/or Active are not empty) if (Ready is not empty) remove an op from Ready; S(op) = Cycle; Active = Active  op; Cycle++; for each op in Active if (S(op) + delay(op) <= Cycle) remove op from Active; for each successor s of op in P if (s is ready – i.e., all operands available) add s to Ready 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Example Code a LOAD r1 <- w b ADD r1 <- r1,r1 c LOAD r2 <- x d MULT r1 <- r1,r2 e LOAD r2 <- y f MULT r1 <- r1,r2 g LOAD r2 <- z h MULT r1 <- r1,r2 i STORE w <- r1 2/25/2019 © 2002-05 Hal Perkins & UW CSE

Variations Backward list scheduling Work from the root to the leaves Schedules instructions from end to beginning of the block In practice, try both and pick the result that minimizes costs Little extra expense since the precedence graph and other information can be reused Global scheduling and loop scheduling Extend basic idea in more aggressive compilers 2/25/2019 © 2002-05 Hal Perkins & UW CSE