Static Code Scheduling

CS 671, April 1, 2008

Code Scheduling
Scheduling reorders instructions to improve performance and/or guarantee correctness.
- Important for dynamically scheduled architectures
- Crucial (assumed!) for statically scheduled architectures, e.g. VLIW or EPIC
- Takes anticipated latencies into account
- Machine-specific, so it is performed late in the optimization sequence
How does this contrast with our earlier exploration of code motion?

Why Must the Compiler Schedule?
Many machines are pipelined and expose some aspects of pipelining to the user (i.e., the compiler). Examples:
- Branch delay slots
- Memory-access delays
- Multi-cycle operations
Some machines have no scheduling hardware at all.

Example
Assume loads take 2 cycles and branches have a delay slot.

instruction       start time
r2 ← [r1]         1
r3 ← [r1+4]       2
r4 ← r2 + r3      4    (stalls waiting for r3)
r5 ← r2 + 1       5
goto L1           6
nop               7    (wasted delay slot)

Example
Assume loads take 2 cycles and branches have a delay slot. After scheduling:

instruction       start time
r2 ← [r1]         1
r3 ← [r1+4]       2
r5 ← r2 + 1       3
goto L1           4
r4 ← r2 + r3      5    (useful work in the delay slot)

Seven cycles become five.

Code Scheduling Strategy
Goal: get resources operating in parallel
- Integer data path
- Integer multiply/divide hardware
- FP adder, multiplier, divider
Method: between an operation's start and the first use of its result, fill the gap with computations that need neither that result nor the same hardware resources.
Drawback: highly hardware-dependent

Scheduling Approaches
Local:
- Branch scheduling
- Basic-block scheduling
Global:
- Cross-block scheduling
- Software pipelining
- Trace scheduling
- Percolation scheduling

Basic-block and branch scheduling are the simplest approaches (roughly 10% speedup). Cross-block scheduling considers a tree of blocks at once and may move instructions from one block to another. Software pipelining operates specifically on loops (up to 2x speedup). Trace and percolation scheduling are global approaches that work well for high-degree superscalar and VLIW machines. All are among the last optimizations run (except the last two, which enable other transformations).

Branch Scheduling
Two problems:
- Branches often take some number of cycles to complete
- There can be a delay between a compare and its associated branch
A compiler will try to fill these slots with useful instructions (rather than nops).
- Delay slots: PA-RISC, SPARC, MIPS
- Condition delay: PowerPC, Pentium

Recall from Architecture…
IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory Access
WB – Write Back

Three overlapped instructions in a 5-stage pipeline:

instr 1   IF ID EX MA WB
instr 2      IF ID EX MA WB
instr 3         IF ID EX MA WB

Control Hazards
A taken branch squashes the instruction fetched after it:

Taken branch         IF ID EX MA WB
Instr + 1               IF -- -- -- --    (squashed)
Branch target              IF ID EX MA WB
Branch target + 1             IF ID EX MA WB

Data Dependences
If two operations access the same register, they are dependent.
Types of data dependences:

Flow (read after write):
r1 = r2 + r3
r4 = r1 * 6

Output (write after write):
r1 = r2 + r3
r1 = r4 * 6

Anti (write after read):
r1 = r2 + r3
r2 = r5 * 6
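As a concrete illustration, a scheduler can classify the dependence between two instructions by comparing the registers they define and use. A minimal Python sketch, not from the slides; the (dest, sources) representation is an assumption:

# Classify the dependence(s) of `second` on `first`.
# Each instruction is a hypothetical (dest, sources) pair.
def classify_dependence(first, second):
    d1, srcs1 = first
    d2, srcs2 = second
    deps = []
    if d1 in srcs2:        # read after write
        deps.append("flow")
    if d2 in srcs1:        # write after read
        deps.append("anti")
    if d1 == d2:           # write after write
        deps.append("output")
    return deps

# r1 = r2 + r3 followed by r4 = r1 * 6:
print(classify_dependence(("r1", {"r2", "r3"}), ("r4", {"r1"})))  # ['flow']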

Data Hazards
Memory latency: data not ready.

lw  R1,0(R2)    IF ID EX MA WB
add R3,R1,R4       IF ID stall EX MA WB

Data Hazards
Instruction latency: execute takes more than 1 cycle.

addf R3,R1,R2   IF ID EX EX MA WB
addf R3,R3,R4      IF ID stall EX EX MA WB

Assumes floating-point ops take 2 execute cycles.

Multi-cycle Instructions
Scheduling is particularly important for multi-cycle operations. Alpha instructions with latency > 1 cycle (partial list):

instruction                           latency (cycles)
mull  (32-bit integer multiply)       8
mulq  (64-bit integer multiply)       16
addt  (fp add)                        4
mult  (fp multiply)                   4
divs  (fp single-precision divide)    10
divt  (fp double-precision divide)    23
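Inside a scheduler, a table like this is typically just a lookup from opcode to latency. A minimal Python sketch with names of my own choosing:

# Hypothetical encoding of the Alpha latencies above for use by a scheduler.
ALPHA_LATENCY = {
    "mull": 8,   # 32-bit integer multiply
    "mulq": 16,  # 64-bit integer multiply
    "addt": 4,   # fp add
    "mult": 4,   # fp multiply
    "divs": 10,  # fp single-precision divide
    "divt": 23,  # fp double-precision divide
}

def exec_latency(opcode):
    return ALPHA_LATENCY.get(opcode, 1)  # default: single-cycle op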

Avoiding Data Hazards
- Move loads earlier and stores later (assuming this does not violate correctness).
- Other stalls may require more sophisticated reordering, e.g. rewriting ((a+b)+c)+d as (a+b)+(c+d): the first form is a serial chain of three adds, while in the second form a+b and c+d can execute in parallel, cutting the dependence height from 3 to 2.
How can we do this in a systematic way?

Example: Without Scheduling
Assume memory instructions take 3 cycles, mult takes 2 cycles (to have the result in a register), and the rest take 1 cycle.

start time   code
____         lw   r1, w
____         add  r1,r1,r1
____         lw   r2, x
____         mult r1,r1,r2
____         lw   r2, y
____         mult r1,r1,r2
____         lw   r2, z
____         mult r1,r1,r2
____         sw   r1, a

Total: ____ cycles

Basic Block Dependence DAGs
- Nodes: instructions
- Edges: dependence between I1 and I2
- When we cannot determine whether there is a dependence, we must assume there is one.

a) lw R2, 0(R1)
b) lw R3, 4(R1)
c) R4 ← R2 + R3
d) R5 ← R2 - 1

Resulting DAG: edges a→c, b→c, and a→d, each labeled with the load latency of 2.
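A dependence-DAG builder can be sketched directly from these rules. In this illustrative Python version (representation assumed, not from the slides), each instruction is a (dest, sources, latency) tuple and edges carry the producer's latency; a real builder would also add conservative edges between memory operations it cannot disambiguate:

def build_dag(block):
    """block: list of (dest, sources, latency); returns (i, j, delay) edges."""
    edges = []
    for j in range(len(block)):
        dj, sj, _ = block[j]
        for i in range(j):
            di, si, lat_i = block[i]
            if di in sj or di == dj:      # flow or output dependence
                edges.append((i, j, lat_i))
            elif dj in si:                # anti dependence: no extra delay
                edges.append((i, j, 0))
    return edges

# The example above, with 2-cycle loads and 1-cycle ALU ops:
block = [("R2", {"R1"}, 2), ("R3", {"R1"}, 2),
         ("R4", {"R2", "R3"}, 1), ("R5", {"R2"}, 1)]
print(build_dag(block))   # [(0, 2, 2), (1, 2, 2), (0, 3, 2)]  i.e. a->c, b->c, a->d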

Example – Build the DAG
Assume memory instructions take 3 cycles, mult takes 2 (to have the result in a register), and the rest take 1 cycle.

a  lw   r1, w
b  add  r1,r1,r1
c  lw   r2, x
d  mult r1,r1,r2
e  lw   r2, y
f  mult r1,r1,r2
g  lw   r2, z
h  mult r1,r1,r2
i  sw   r1, a

Creating a Schedule
1. Create a DAG of dependences
2. Determine priority
3. Schedule instructions with:
   - ready operands
   - highest priority
Heuristic: if there are multiple possibilities, fall back on other priority functions.

Operation Priority
We need a mechanism to decide which ops to schedule first (when there are choices). Common priority functions:
- Height: distance from the exit node; gives priority to the amount of work left to do
- Slackness: inversely proportional to slack; gives priority to ops on the critical path
- Register use: priority to nodes with more source operands and fewer destination operands; reduces the number of live registers
- Uncover: high priority to nodes with many children; frees up more nodes
- Original order: when all else fails

Computing Priorities
height(n) = exec(n)                                          if n is a leaf
height(n) = exec(n) + max{ height(m) : m a successor of n }  otherwise

Critical path(s): the path(s) through the dependence DAG with the longest latency.
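The height recurrence memoizes naturally. A small Python sketch, assuming the DAG is given as successor lists and exec_time maps each node to its latency (names are illustrative, not from the slides):

def heights(succs, exec_time):
    memo = {}
    def height(n):
        if n not in memo:
            kids = succs.get(n, [])
            # leaf: just its own latency; otherwise add the tallest successor
            memo[n] = exec_time[n] + (max(height(m) for m in kids) if kids else 0)
        return memo[n]
    return {n: height(n) for n in exec_time}

# The 4-node DAG from two slides back (2-cycle loads a, b):
succs = {"a": ["c", "d"], "b": ["c"]}
exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}
h = heights(succs, exec_time)   # {'a': 3, 'b': 3, 'c': 1, 'd': 1}
# critical-path length = max(h.values()) = 3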

Example – Determine Height and CP
Same code and latencies as before (memory = 3, mult = 2 to have the result in a register, rest = 1):

a  lw   r1, w
b  add  r1,r1,r1
c  lw   r2, x
d  mult r1,r1,r2
e  lw   r2, y
f  mult r1,r1,r2
g  lw   r2, z
h  mult r1,r1,r2
i  sw   r1, a

Critical path: _______

Example – List Scheduling
Same code, latencies, and DAG as above:

a  lw   r1, w
b  add  r1,r1,r1
c  lw   r2, x
d  mult r1,r1,r2
e  lw   r2, y
f  mult r1,r1,r2
g  lw   r2, z
h  mult r1,r1,r2
i  sw   r1, a

Resulting schedule: _____ cycles

Scheduling vs. Register Allocation

a  lw  r1 ← (r12)
b  lw  r2 ← (r12+4)
c  r1 ← r1 + r2
d  stw (r12) ← r1
e  lw  r1 ← (r12+8)
f  lw  r2 ← (r12+12)
g  r2 ← r1 + r2

Reusing r1 and r2 keeps register pressure low, but the resulting anti- and output dependences prevent the scheduler from moving the later loads up.

Register Renaming

a  lw  r1 ← (r12)
b  lw  r2 ← (r12+4)
c  r3 ← r1 + r2
d  stw (r12) ← r3
e  lw  r4 ← (r12+8)
f  lw  r5 ← (r12+12)
g  r6 ← r4 + r5

With every definition in its own register, only flow dependences remain and the loads can be scheduled freely.
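A single forward pass over a basic block suffices for this kind of local renaming. A minimal Python sketch, with an assumed (dest, sources) instruction form and an unbounded supply of virtual registers:

from itertools import count

def rename(block):
    """block: list of (dest, sources); returns the renamed list."""
    current = {}                          # architectural name -> latest virtual name
    fresh = (f"v{i}" for i in count())    # v0, v1, v2, ...
    out = []
    for dest, srcs in block:
        new_srcs = [current.get(s, s) for s in srcs]  # rewrite uses first
        new_dest = next(fresh)                        # every def gets a fresh reg
        current[dest] = new_dest
        out.append((new_dest, new_srcs))
    return out

block = [("r1", ["r12"]), ("r2", ["r12"]), ("r1", ["r1", "r2"])]
print(rename(block))   # [('v0', ['r12']), ('v1', ['r12']), ('v2', ['v0', 'v1'])]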

VLIW
Very Long Instruction Word
- The compiler determines exactly what is issued every cycle (before the program is run)
- Schedules also account for latencies
- Any hardware change forces a compiler change
- Usually embedded systems (hence simple hardware)
- Itanium is actually an EPIC-style machine (the compiler accounts for most of the parallelism, but not latencies)

Sample VLIW Code
VLIW processor: 5-issue
- 2 Add/Sub units (1 cycle)
- 1 Mul/Div unit (2 cycles, unpipelined)
- 1 Ld/St unit (2 cycles, pipelined)
- 1 Branch unit (no delay slots)

cycle  Add/Sub    Add/Sub    Mul/Div    Ld/St       Branch
1      c = a + b  d = a - b  e = a * b  ld j = [x]  nop
2      g = c + d  h = c - d  nop        ld k = [y]  nop
3      nop        nop        i = j * c  ld f = [z]  br g

Multi-Issue Scheduling Example
Machine: 2-issue, 1 memory port, 1 ALU
- Memory port: 2 cycles, non-pipelined
- ALU: 1 cycle
[Slide figure: a dependence DAG over ops 1–10, with ops 2, 3, 5, and 7 marked as memory ops, plus an RU_map (time × ALU/MEM) and ready/placed worklists filled in cycle by cycle.]
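The RU_map bookkeeping for a machine like this can be sketched as follows (Python; all names are assumptions, not from the slides). The non-pipelined memory port must be reserved for both cycles it is busy, and the 2-issue limit caps how many ops start per cycle:

from collections import defaultdict

busy = defaultdict(set)      # cycle -> resources reserved in that cycle
issued = defaultdict(int)    # cycle -> number of ops issued that cycle

def can_issue(is_mem, t):
    if issued[t] >= 2:                      # 2-issue limit
        return False
    if is_mem:                              # non-pipelined port: busy at t and t+1
        return "MEM" not in busy[t] and "MEM" not in busy[t + 1]
    return "ALU" not in busy[t]

def reserve(is_mem, t):
    issued[t] += 1
    if is_mem:
        busy[t].add("MEM")
        busy[t + 1].add("MEM")
    else:
        busy[t].add("ALU")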

Earliest/Latest Sets
Machine: 2-issue, 1 memory port, 1 ALU
- Memory port: 2 cycles, pipelined
- ALU: 1 cycle
[Slide figure: a dependence DAG over ops 1–10, with ops 1, 2, 4, and 9 marked as memory ops, used to derive each op's earliest and latest issue slots.]

List Scheduling Algorithm

build the dependence graph, calculate priorities
add all ops to the UNSCHEDULED set
time = 0
while UNSCHEDULED is not empty:
    time++
    READY = UNSCHEDULED ops whose incoming dependences have been satisfied
    sort READY using the priority function
    for each op in READY (highest to lowest priority):
        if op can be scheduled at the current time (resources free):
            schedule it: op.issue_time = time
            mark resources busy in RU_map relative to issue time
            remove op from UNSCHEDULED and READY
        else:
            continue
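The loop above translates almost line for line into code. A compact Python sketch, assuming ops are hashable IDs, preds maps each op to (predecessor, delay) pairs from the dependence DAG, priority is a height-style function, and can_issue/reserve wrap a resource map like the RU_map sketch earlier; all of these names are illustrative:

def list_schedule(ops, preds, priority, can_issue, reserve):
    unscheduled = set(ops)
    issue_time = {}
    time = 0
    while unscheduled:
        time += 1
        # ready: every predecessor has issued and its latency has elapsed
        ready = [op for op in unscheduled
                 if all(p in issue_time and issue_time[p] + delay <= time
                        for p, delay in preds.get(op, []))]
        for op in sorted(ready, key=priority, reverse=True):
            if can_issue(op, time):          # resources free this cycle?
                issue_time[op] = time
                reserve(op, time)
                unscheduled.remove(op)
    return issue_time

# Ready-but-unplaced ops simply stay in `unscheduled` and are retried the
# next cycle, matching the "else: continue" case in the pseudocode above.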

Improving Basic Block Scheduling
- Loop unrolling: creates longer basic blocks
- Register renaming: changes register usage within blocks to remove the immediate reuse of registers

Summary
Static scheduling complements (or replaces) dynamic scheduling by the hardware.