Instruction Scheduling: Combining Scheduling with Allocation
Comp 512, Spring 2011, Rice University
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Combining Scheduling & Allocation
Sometimes, combining two optimizations can produce solutions that cannot be obtained by solving them independently; doing so requires bilateral interactions between the optimizations. (See Click and Cooper, "Combining Analyses, Combining Optimizations," TOPLAS 17(2), March 1995.) Combining two optimizations can be a challenge, as with SCCP. Scheduling & allocation are a classic example: scheduling changes variable lifetimes, renaming in the allocator changes the (false) dependences, and spilling changes the underlying code.

Combining Scheduling & Allocation
Many authors have tried to combine allocation & scheduling. Underallocate to leave room for the scheduler, and you can underutilize the registers; preallocate to use all the registers, and you can create false dependences. Solving the two problems together can produce solutions that cannot be obtained by solving them independently (see Click and Cooper, "Combining Analyses, Combining Optimizations," TOPLAS 17(2), March 1995). In general, these papers try to combine global allocators with local or regional schedulers: an algorithmic mismatch. Before we go there, a long digression about how much improvement we might expect …

Quick Review of Local Scheduling
Given a sequence of machine operations, reorder the operations so that data dependences are respected, execution time is minimized, and demand for registers is kept below k. Vocabulary: an operation is an indivisible command; an instruction is a set of operations that issue in the same cycle; a dependence graph is constructed to represent the necessary delays. (Nodes are operations; edges show the flow of values; edge weights represent operation latencies.)
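
Since the dependence graph drives everything that follows, here is a minimal sketch of how one might build it for a basic block. The Op record, the latency table, and the restriction to true (flow) dependences are illustrative assumptions, not part of the slides; a production scheduler must also add anti- and output-dependence edges.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Op:
        name: str            # label, e.g., 'a'
        opcode: str          # e.g., 'load', 'mult'
        uses: tuple = ()     # registers read
        defs: tuple = ()     # registers written

    LATENCY = {'load': 3, 'store': 3, 'loadI': 1, 'add': 1, 'mult': 2}

    def build_dependence_graph(block):
        """Return edges (producer, consumer, latency) for flow dependences."""
        last_def = {}        # register -> most recent op that defined it
        edges = []
        for op in block:
            for r in op.uses:
                if r in last_def:            # consumer needs producer's result
                    p = last_def[r]
                    edges.append((p, op, LATENCY[p.opcode]))
            for r in op.defs:
                last_def[r] = op
        return edges

Each edge carries the producer's latency, matching the slide's convention that edge weights represent operation latencies.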

Scheduling Example
Many operations have non-zero latencies. Modern machines can issue several operations per cycle. Execution time is order-dependent (and has been since the 60s).

Assumed latencies (conservative):

    Operation   Cycles
    load        3
    store       3
    loadI       1
    add         1
    mult        2
    fadd        1
    fmult       2
    shift       1
    branch      0 to 8

Loads & stores may or may not block; with non-blocking loads, the scheduler can fill those issue slots. Branch costs vary with the path taken. Branches typically have delay slots; filling the slots with unrelated operations percolates the branch upward. The scheduler should hide the latencies. List scheduling is the dominant algorithm.

Example: w ← w * 2 * x * y * z
Reordering operations for speed is called instruction scheduling. A simple schedule uses 2 registers and takes 20 cycles; scheduling the loads early uses 3 registers and takes 13 cycles.

Instruction Scheduling (The Abstract View)
To capture the properties of the code, build a dependence graph G. Nodes n ∈ G are operations, with type(n) and delay(n). An edge e = (n1, n2) ∈ G if and only if n2 uses the result of n1. [Figure: the example code, operations a through i, and its dependence graph.]

Instruction Scheduling (Definitions)
A correct schedule S maps each n ∈ N into a non-negative integer representing its cycle number, such that:
1. S(n) ≥ 0, for all n ∈ N, obviously
2. If (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2)
3. For each type of operation (functional unit) t, there are no more operations of type t in any cycle than the target machine can issue
The length of a schedule S, denoted L(S), is L(S) = max over n ∈ N of (S(n) + delay(n)).
The goal is to find the shortest possible correct schedule. S is time-optimal if L(S) ≤ L(S1) for all other schedules S1. A schedule might also be optimal in terms of registers, power, or …
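
The three conditions translate directly into a checker. This is a minimal sketch under assumed representations: sched maps each operation to its cycle, edges lists the (n1, n2) pairs in E, delay and optype map operations to latencies and functional-unit types, and units gives the per-cycle issue limit for each type.

    from collections import Counter

    def is_correct(sched, edges, delay, optype, units):
        if any(c < 0 for c in sched.values()):        # condition 1: S(n) >= 0
            return False
        for n1, n2 in edges:                          # condition 2: latencies
            if sched[n1] + delay[n1] > sched[n2]:
                return False
        per_cycle = Counter((sched[n], optype[n]) for n in sched)
        return all(count <= units[t]                  # condition 3: issue width
                   for (_, t), count in per_cycle.items())

    def length(sched, delay):
        """L(S) = max over n of (S(n) + delay(n))."""
        return max(sched[n] + delay[n] for n in sched)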

What's So Difficult? Critical Points
All operands must be available when an operation issues. Multiple operations can be ready (and often are). Moving an operation can lengthen register lifetimes, or it can shorten them. Operands can have multiple predecessors (the code is not in SSA form). Together, these issues make scheduling hard (NP-complete). Local scheduling is the simple case: it is restricted to straight-line code with consistent and predictable latencies.

Instruction Scheduling: The Big Picture
1. Build a dependence graph, D
2. Compute a priority function over the nodes in D
3. Use list scheduling to construct a schedule, one cycle at a time:
   a. Use a queue of operations that are ready
   b. At each cycle:
      I. Choose a ready operation and schedule it
      II. Update the ready queue
This is local list scheduling: the dominant algorithm for twenty years, a greedy, heuristic, local technique.

Local List Scheduling

    Cycle ← 1
    Ready ← leaves of D
    Active ← Ø
    while (Ready ∪ Active ≠ Ø)
        if (Ready ≠ Ø) then
            remove an op from Ready          // removal in priority order
            S(op) ← Cycle
            Active ← Active ∪ {op}
        Cycle ← Cycle + 1
        for each op ∈ Active
            if (S(op) + delay(op) ≤ Cycle) then    // op has completed execution
                remove op from Active
                for each successor s of op in D
                    if (s is ready) then           // all of s's operands are ready
                        Ready ← Ready ∪ {s}

Efficiency can be improved by using a set of queues, one more than the maximum delay on the target machine; see the 412 notes.
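
A runnable version of the loop above, as a sketch under assumed inputs: succs maps each operation to its successors in D, npreds counts each operation's predecessors, delay gives latencies, and priority is any numeric function over operations (higher issues first). It is single-issue, like the pseudocode; a multi-issue version would pop several ready operations per cycle, subject to functional-unit limits.

    import heapq

    def list_schedule(succs, npreds, delay, priority):
        waiting = dict(npreds)                 # predecessors not yet completed
        ready = [(-priority(op), id(op), op)
                 for op, n in waiting.items() if n == 0]
        heapq.heapify(ready)
        active, sched, cycle = [], {}, 1
        while ready or active:
            if ready:
                _, _, op = heapq.heappop(ready)    # removal in priority order
                sched[op] = cycle
                active.append(op)
            cycle += 1
            done = [op for op in active if sched[op] + delay[op] <= cycle]
            for op in done:                        # op has completed execution
                active.remove(op)
                for s in succs[op]:
                    waiting[s] -= 1
                    if waiting[s] == 0:            # all operands now available
                        heapq.heappush(ready, (-priority(s), id(s), s))
        return sched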

Scheduling Example
1. Build the dependence graph. [Figure: the example code, operations a through i, and its dependence graph.]

Scheduling Example
1. Build the dependence graph.
2. Determine priorities: longest latency-weighted path. [Figure: the code and the dependence graph, annotated with priorities.]

Scheduling Example
1. Build the dependence graph.
2. Determine priorities: longest latency-weighted path.
3. Perform list scheduling:

     1) a: loadAI  …      ⇒ r1
     2) c: loadAI  …      ⇒ r2
     3) e: loadAI  …      ⇒ r3      (new register name used)
     4) b: add     r1, r1 ⇒ r1
     5) d: mult    r1, r2 ⇒ r1
     6) g: loadAI  …      ⇒ r2
     7) f: mult    r1, r3 ⇒ r1
     9) h: mult    r1, r2 ⇒ r1
    11) i: storeAI r1     ⇒ …

Local Scheduling
As long as we stay within a single block, list scheduling does well. The problem is hard, so tie-breaking matters: prefer the operation with more descendants in the dependence graph, and prefer an operation with a last use over one with none. Breadth-first makes progress on all paths and tends toward more ILP & fewer interlocks; depth-first tries to complete the uses of a value and tends to use fewer registers. The classic work on this is Gibbons & Muchnick (PLDI '86).
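
The latency-weighted path priority used throughout these examples takes one bottom-up pass to compute. A sketch, assuming ops is given in topological order (producers before consumers), with succs and delay as before:

    def latency_weighted_priority(ops, succs, delay):
        """priority(n) = delay(n) + longest latency-weighted path below n."""
        prio = {}
        for op in reversed(ops):        # visit consumers before producers
            tail = max((prio[s] for s in succs[op]), default=0)
            prio[op] = delay[op] + tail
        return prio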

Local Scheduling
Forward and backward list scheduling can produce different results. [Figure: the dependence graph for a block from the SPEC benchmark go: a cmp feeding the cbr, five stores, four adds, an addI, an lshift, and four loadIs, with subscripts to identify the individual operations. An accompanying table gave each operation's latency to the cbr for load, loadI, add, addI, store, and cmp.]

Local Scheduling
[Figure: the forward schedule and the backward schedule for this block, shown cycle by cycle across the Int and Mem units; the two directions yield schedules of different lengths.] Both use latency to root as the priority.

Local Scheduling
The priority function strongly affects the properties of the result. The longest latency-weighted path biases the result toward finishing the long paths as soon as possible: execution speed is paramount, but the schedule may use more registers than the minimum. A depth-first approach can reduce the demand for registers by minimizing the lifetimes of values, as in Sethi-Ullman numbering extended to DAGs.

Iterative Repair Scheduling
The Problem: List scheduling has dominated the field for 20 years. The evidence, both good & bad, is anecdotal; there is little solid evidence and no intuitive paradigm for how it works. It works well, but will it work well in the future? Is there room for improvement (e.g., with allocation)?
Our Idea: Try more powerful algorithms from other domains. Look for better schedules, and look for an understanding of the solution space. This led us to iterative repair scheduling.

Iterative Repair Scheduling
The Algorithm (sketched in code below): Start from some approximation to a schedule (bad or broken). Find & prioritize all cycles that need repair, from either resource or data constraints (we tried 6 priority schemes). Perform the needed repairs in priority order, breaking ties randomly, and reschedule the dependent operations in random order. An evaluation function on each repair can reject the repair (try another). Iterate until the repair list is empty. Repeat this whole process many times to explore the solution space, and keep the best result!
Randomization & restart is a fundamental theme of our recent work. Iterative repair works well on many kinds of scheduling problems, such as scheduling cargo for the space shuttle. Typical problems in the literature involve 10s or 100s of repairs; we used it with millions of repairs.
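
A highly simplified sketch of that loop. The helpers find_violations, repair, and evaluate stand in for the six repair-priority schemes and the evaluation function from the actual work; the pass bound and the acceptance test are assumptions for illustration.

    import random

    def iterative_repair(initial, find_violations, repair, evaluate,
                         restarts=100, max_passes=10000, seed=0):
        rng = random.Random(seed)
        best, best_score = None, float('inf')
        for _ in range(restarts):                  # randomization & restart
            sched = dict(initial)
            for _ in range(max_passes):
                viols = find_violations(sched)
                if not viols:                      # repair list is empty
                    break
                rng.shuffle(viols)                 # break ties randomly
                candidate = repair(sched, viols[0], rng)
                if evaluate(candidate) <= evaluate(sched):
                    sched = candidate              # else reject; try another
            score = evaluate(sched)
            if score < best_score:
                best, best_score = sched, score    # keep the best result
        return best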

Iterative Repair Scheduling
How does iterative repair do versus list scheduling? It found many schedules that used fewer registers, but very few faster schedules. We were disappointed with the results, and began a study of the properties of scheduling problems. Iterative repair, by itself, doesn't justify the additional costs. Can we identify the scheduling problems where it will win? Can we learn about the properties of scheduling problems, and about the behavior of list scheduling? (A hopeful sign for this lecture.)

Instruction Scheduling Study
Methodology: We looked at blocks & extended blocks in benchmarks, using a randomized version of backward & forward list scheduling. If the result was non-optimal (by simple tests), we used the IR scheduler to find its best schedule, and we checked these results against an IP formulation using CPLEX.
The Results: List scheduling does quite well on a conventional uniprocessor. Over 92% of blocks and over 73% of extended blocks were scheduled optimally for speed.

Schielke's RBF Algorithm for Local Scheduling
Relying on randomization & restart, we can smooth the behavior of the classic list scheduling algorithms. Schielke's RBF (Randomized Backward & Forward) algorithm: run 5 passes of forward list scheduling and 5 passes of backward list scheduling, breaking each tie randomly, and keep the best schedule, i.e., the shortest time to completion. Other metrics are possible (e.g., shortest time + fewest registers). In practice, this approach does very well, and it reuses the dependence graph. My algorithm of choice for list scheduling …
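
A sketch of RBF in terms of a list scheduler. Here list_schedule is an assumed wrapper of the form (graph, delay, priority) around a scheduler like the one sketched earlier, and the tiny numeric jitter is an assumed way to break ties among equal integer priorities randomly; a real implementation must also map the backward schedule's cycles back into forward order.

    import random

    def rbf(forward_graph, backward_graph, delay, base_priority,
            list_schedule, trials=5, seed=0):
        rng = random.Random(seed)
        best, best_len = None, float('inf')
        for g in (forward_graph, backward_graph):  # both directions
            for _ in range(trials):                # 5 randomized passes each
                # equal priorities order randomly under a sub-unit jitter
                prio = lambda op: base_priority(op) + rng.random() * 1e-6
                sched = list_schedule(g, delay, prio)
                slen = max(sched[op] + delay[op] for op in sched)
                if slen < best_len:                # keep the shortest schedule
                    best, best_len = sched, slen
        return best, best_len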

Instruction Scheduling Study
Methodology: We looked at blocks & extended blocks in benchmarks, applied the RBF algorithm, and tested each result for optimality. If it was non-optimal, we used the IR scheduler to find its best schedule, and we checked these results against an IP formulation using CPLEX.
The Results: List scheduling does quite well on a conventional uniprocessor. Over 92% of blocks and over 73% of extended blocks were scheduled optimally for speed.

Instruction Scheduling Study
Methodology: We repeated the same experiment with randomly-generated blocks, generating over 85,000 random blocks of 10, 20, & 50 ops (3 compute-months on a pair of UltraSparc workstations).
The Results: List scheduling finds optimal schedules over 80% of the time. We plotted the percentage of non-optimal schedules against the available ILP, measured as worst-case schedule length over critical-path length; the peak is around 2.8 for 1 functional unit and 4.7 for 2 units. The IR scheduler usually found optimal schedules for these harder problems, so use it when list scheduling fails.

How Well Does List Scheduling Do?
[Figure, from Phil Schielke's thesis: percentage of non-optimal list schedules versus available parallelism, for 1 functional unit on randomly generated blocks of 10, 20, & 50 ops. Most codes fall below the peak unless the compiler transforms them for ILP; if the compiler transforms the code, it should avoid that area.] At the peak, the compiler should apply other techniques: measure the parallelism in the list scheduler and invoke stronger techniques when there is a high probability of payoff.

Instruction Scheduling Study
The Lessons: In general, list scheduling does well; use a randomized version, in both directions, with a few trials. To find the hard problems, measure the average length of the ready queue. If it falls in the hard region, check the resulting schedule for interlocks or holes, and if it is non-optimal, run the IR scheduler. If transforming for parallelism, avoid hitting the hard range. (CPLEX had a hard time with the easy blocks: too many optimal solutions to check out.)

Combining Allocation & Scheduling
The Problem: It is well understood that the problems are intricately related. Previous work under-allocates or under-schedules, except Goodman & Hsu.
Our Approach: Formulate an iterative repair framework with moves for scheduling, as before, plus moves to decrease register pressure or to spill. This allows a fair competition in a combined attack. It grows out of the search for novel techniques from other areas.

Combining Allocation & Scheduling
The Details: Run the IR scheduler & keep the schedule with the lowest pressure. Start with an ALAP schedule rather than an ASAP schedule. Reject any repair that increases the maximum pressure. A cycle with pressure > k triggers a pressure repair: identify the ops that reduce pressure & move one. A lower threshold for k seems to help. We ran it against the classic method: schedule, allocate, schedule (using the Briggs allocator).

Combining Allocation & Scheduling
The Results: There are many opportunities to lower pressure: in 12% of basic blocks and 33% of extended blocks. This can produce faster code: the best case was 41.3%; the average case was 5.4% with 16 registers and 3.5% with 32 registers (over whole applications). This approach finds faster code that spills fewer values. It is competing against a very good global allocator, and rematerialization catches many of the same effects. Knowing that new solutions exist does not ensure that they are better solutions! This work confirms years of suspicion, while providing an effective, albeit nontraditional, technique. The opportunity is present, but the IR scheduler is still quite slow …

Sethi-Ullman Numbering
A two-pass algorithm. First, number each subtree in the expression:

    label(n) = 1                                 if n is a leaf
    label(n) = max(label(left), label(right))    if the labels are unequal
    label(n) = label(left) + 1                   if the labels are equal

The labels correspond to the number of registers needed to evaluate that subtree. Second, use the numbers to guide evaluation. At each node, generate the more demanding subtree first:
1. generate code for the subtree with the larger label
2. store the result in a temporary register
3. generate code for the subtree with the smaller label
4. generate code for the node
This approach minimizes register use for a given tree.

Sethi-Ullman Numbering
Of course, code shape matters: deep trees use fewer registers than broad trees. For a + b + c + d, the left-deep tree ((a + b) + c) + d evaluates in 2 registers but has less ILP, while the balanced tree (a + b) + (c + d) evaluates in 3 registers and has more ILP.
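
The labeling rule is short enough to show directly. A minimal sketch with an assumed Node type, checked against the two trees above:

    class Node:
        def __init__(self, op, left=None, right=None):
            self.op, self.left, self.right = op, left, right

    def label(n):
        """Registers needed to evaluate the subtree rooted at n."""
        if n.left is None and n.right is None:     # leaf
            return 1
        l, r = label(n.left), label(n.right)
        return max(l, r) if l != r else l + 1

    # ((a + b) + c) + d: the deep tree
    deep = Node('+', Node('+', Node('+', Node('a'), Node('b')),
                          Node('c')), Node('d'))
    # (a + b) + (c + d): the broad tree
    broad = Node('+', Node('+', Node('a'), Node('b')),
                 Node('+', Node('c'), Node('d')))
    print(label(deep), label(broad))               # prints: 2 3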

Balancing Speed and Register Pressure
Goodman & Hsu proposed a novel scheme. Context: the debate about prepass versus postpass scheduling. Problem: the tradeoff between allocation & scheduling. Solution: schedule for speed until fewer than Threshold registers remain free; schedule for registers until more than Threshold registers are free. Details: "for speed" means one of the latency-weighted priorities; "for registers" means an incremental adaptation of the Sethi-Ullman scheme.
James R. Goodman and Wei-Chung Hsu, "Code Scheduling and Register Allocation in Large Basic Blocks," Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, 1988.
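
The switching rule itself is tiny. A sketch, with the threshold value and both priority functions as placeholder assumptions:

    THRESHOLD = 4    # free registers held in reserve; tuned per machine

    def pick_next(ready, free_regs, speed_priority, register_priority):
        """Goodman-Hsu style selection from the ready set."""
        if free_regs > THRESHOLD:
            return max(ready, key=speed_priority)     # schedule for speed
        return max(ready, key=register_priority)      # schedule for registers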

Local Scheduling & Register Allocation
List scheduling is a local, incremental algorithm: decisions are made on an operation-by-operation basis, using local (basic-block level) metrics. We need a local, incremental register-allocation algorithm to match: Best's algorithm, called "bottom-up local" in EaC. To free a register, it evicts the value with the furthest next use, again using local (basic-block level) metrics. Combining these two algorithms leads to a fair, local algorithm for the combined problem, called IRIS. The idea is due to Dae-Hwan Kim & Hyuk-Jae Lee; it can use a non-local eviction heuristic (a new twist on Best's algorithm).
D-H. Kim and H-J. Lee, "Integrated Instruction Scheduling and Fine-Grain Register Allocation for Embedded Processors," in Embedded Computer Systems: Architectures, Modeling, and Simulation, LNCS 4017.
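
The heart of Best's algorithm is its eviction rule. A one-function sketch, assuming next_use maps each live value to the index of its next use:

    def pick_victim(live_values, next_use):
        """Evict the value whose next use is farthest away (or absent)."""
        return max(live_values, key=lambda v: next_use.get(v, float('inf')))

IRIS's twist is to let next_use look beyond the current block, which turns this purely local heuristic into a non-local one without changing the mechanism.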

Original Code for Local List Scheduling
Paraphrasing from the earlier slide:

    Cycle ← 1
    Ready ← leaves of D
    Active ← Ø
    while (Ready ∪ Active ≠ Ø)
        if (Ready ≠ Ø) then
            remove an op from Ready
            S(op) ← Cycle
            Active ← Active ∪ {op}
        Cycle ← Cycle + 1
        update the Ready queue

The Combined Algorithm
Sketch of the algorithm:

    Cycle ← 1
    Ready ← leaves of D
    Active ← Ø
    while (Ready ∪ Active ≠ Ø)
        if (Ready ≠ Ø) then
            remove an op from Ready
            make the operands available in registers
            allocate a register for the target
            S(op) ← Cycle
            Active ← Active ∪ {op}
        Cycle ← Cycle + 1
        update the Ready queue

Reload live-on-exit values, if necessary. Keep a list of free registers; on a last use, put the register back on the free list. To free a register, store the value used farthest in the future.
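
The two new steps, "make the operands available" and "allocate a register for the target", reduce to one helper plus the free list. A sketch with assumed names; spill and reload latencies are ignored for brevity:

    def ensure_register(value, free, loc, next_use, code):
        """Return a register holding value, spilling by farthest next use."""
        if value in loc:                       # already in a register
            return loc[value]
        if not free:                           # evict: farthest next use
            victim = max(loc, key=lambda v: next_use.get(v, float('inf')))
            code.append(f'store r{loc[victim]}    ; spill {victim}')
            free.append(loc.pop(victim))
        r = free.pop()
        code.append(f'load {value} => r{r}')   # reload or first load
        loc[value] = r
        return r

Inside the scheduler's issue step, call this for each operand and for the target, and return a value's register to the free list on its last use. That is the whole coupling: the scheduler's choice of the next op and the allocator's choice of a victim now interact cycle by cycle.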