Local Instruction Scheduling
— A Primer for Lab 3 —
COMP 412, Fall 2010

Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

What Makes Code Run Fast?

- Many operations have non-zero latencies
- Modern machines can issue several operations per cycle
- Execution time is order-dependent (and has been since the 60's)

Assumed latencies, for the example on the next slide:

    Operation   Cycles
    load        3
    store       3
    loadI       1
    add         1
    mult        2
    fadd        1
    fmult       2
    shift       1
    branch      0 to 8

- Loads & stores may or may not block on issue
  > Non-blocking ⇒ fill those issue slots
- Branch costs vary with the path taken
- Branches typically have delay slots
  > Fill the slots with unrelated operations
  > Percolates the branch upward
- The scheduler should hide the latencies

Lab 3 will build a local scheduler.

Example

    w ← w * 2 * x * y * z

Two schedules for the same block:
- Simple schedule: 2 registers, 20 cycles
- Schedule loads early: 3 registers, 13 cycles

Reordering operations for speed is called instruction scheduling.

ALU Characteristics

This data is surprisingly hard to measure accurately:
- Value-dependent behavior
- Context-dependent behavior
- Compiler behavior
  — We have seen gcc underallocate & inflate operation costs with memory references (spills)
  — We have seen a commercial compiler generate 3 extra ops per divide, raising the effective cost by 3
- It is difficult to reconcile measured reality with the data in the manuals (e.g., integer divide on Nehalem)

Intel Xeon E5530 operation latencies:

    Instruction                 Cost
    64-bit integer subtract        1
    64-bit integer multiply        3
    64-bit integer divide         41
    Double-precision add           3
    Double-precision subtract      3
    Double-precision multiply      5
    Double-precision divide       22
    Single-precision add           3
    Single-precision subtract      3
    Single-precision multiply      4
    Single-precision divide       14

The Xeon E5530 uses the Nehalem microarchitecture, as does the Core i7.

Instruction Scheduling (Engineer's View)

The Problem: Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time.

The Concept: the scheduler takes slow code plus a machine description as input and produces fast code.

The Task:
- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Operate efficiently

Instruction Scheduling (The Abstract View)

To capture properties of the code, build a precedence graph G:
- Nodes n ∈ G are operations with type(n) and delay(n)
- An edge e = (n1, n2) ∈ G if and only if n2 uses the result of n1

[Figure: the example code, operations labeled a through i, and its precedence graph]
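
To make the abstraction concrete, here is a minimal sketch, in Python, of building the precedence graph from a basic block. The Op record and its field names are illustrative, not part of ILOC. It records only true (flow) dependences; register renaming, discussed in the Lab 3 hints near the end of the deck, removes the anti- and output dependences that this builder ignores.

    from dataclasses import dataclass

    @dataclass
    class Op:
        label: str    # e.g., 'a'
        opcode: str   # e.g., 'loadAI'
        uses: list    # registers read
        defs: list    # registers written
        delay: int    # latency in cycles

    def build_precedence_graph(block):
        """Return (succs, preds): succs[d] holds the labels of the
        operations that use the result of the operation labeled d."""
        succs = {op.label: set() for op in block}
        preds = {op.label: set() for op in block}
        last_def = {}                      # register -> label of latest def
        for op in block:
            for r in op.uses:              # true (flow) dependences only
                if r in last_def:
                    succs[last_def[r]].add(op.label)
                    preds[op.label].add(last_def[r])
            for r in op.defs:
                last_def[r] = op.label
        return succs, preds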

Instruction Scheduling (Definitions)

A correct schedule S maps each n ∈ N into a non-negative integer representing its cycle number, and:
1. S(n) ≥ 0, for all n ∈ N, obviously
2. If (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2)
3. For each type t, there are no more operations of type t in any cycle than the target machine can issue

The length of a schedule S, denoted L(S), is

    L(S) = max over n ∈ N of (S(n) + delay(n))

The goal is to find the shortest possible correct schedule. S is time-optimal if L(S) ≤ L(S1) for all other schedules S1.

A schedule might also be optimal in terms of registers, power, or space...
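
These three conditions translate directly into code. A minimal sketch, assuming ops maps each label to a (type, delay) pair, edges lists the (n1, n2) pairs of E, S maps labels to cycle numbers, and issue_width[t] gives the machine's per-cycle issue limit for type t (all of these names are illustrative):

    from collections import Counter

    def is_correct(S, ops, edges, issue_width):
        if any(S[n] < 0 for n in ops):                    # condition 1
            return False
        for n1, n2 in edges:                              # condition 2
            if S[n1] + ops[n1][1] > S[n2]:
                return False
        issued = Counter((S[n], ops[n][0]) for n in ops)  # condition 3
        return all(count <= issue_width[t]
                   for (cycle, t), count in issued.items())

    def schedule_length(S, ops):                          # L(S)
        return max(S[n] + ops[n][1] for n in ops)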

Instruction Scheduling (What's So Difficult?)

Critical points:
- All operands must be available
- Multiple operations can be ready
- Moving operations can lengthen register lifetimes
- Placing uses near definitions can shorten register lifetimes
- Operands can have multiple predecessors

Together, these issues make scheduling hard (NP-complete).

Local scheduling is the simple case:
- Restricted to straight-line code
- Consistent and predictable latencies

Instruction Scheduling: The Big Picture

1. Build a precedence graph, P
2. Compute a priority function over the nodes in P
3. Use list scheduling to construct a schedule, 1 cycle at a time:
   a. Use a queue of operations that are ready
   b. At each cycle:
      I. Choose the highest-priority ready operation & schedule it
      II. Update the ready queue

Local list scheduling: the dominant algorithm for thirty years; a greedy, heuristic, local technique.

Editorial: My one complaint about the literature on scheduling is that it does not provide us with an analogy that gives us leverage on the problem in the way that coloring does with allocation. List scheduling is just an algorithm. It doesn't provide us with any intuitions about the problem.

Local List Scheduling

    Cycle ← 1
    Ready ← leaves of P
    Active ← Ø

    while (Ready ∪ Active ≠ Ø)
        if (Ready ≠ Ø) then
            remove an op from Ready        // removal in priority order
            S(op) ← Cycle
            Active ← Active ∪ op
        Cycle ← Cycle + 1
        for each op ∈ Active
            if (S(op) + delay(op) ≤ Cycle) then    // op has completed execution
                remove op from Active
                for each successor s of op in P
                    if (s is ready) then           // s's operands are all "ready"
                        Ready ← Ready ∪ s
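
A runnable transcription of this pseudocode, offered as a sketch: it assumes the single-issue machine implicit in the slide (one operation starts per cycle), reuses the succs/preds maps from the graph-building sketch earlier, and takes a priority map from labels to numbers, higher first.

    def list_schedule(ops, preds, succs, delay, priority):
        S = {}
        cycle = 1
        ready = {n for n in ops if not preds[n]}      # leaves of P
        active = set()
        waiting = {n: len(preds[n]) for n in ops}     # uncompleted predecessors
        while ready or active:
            if ready:
                op = max(ready, key=lambda n: priority[n])  # priority order
                ready.remove(op)
                S[op] = cycle
                active.add(op)
            cycle += 1
            for op in list(active):
                if S[op] + delay[op] <= cycle:        # op has completed
                    active.remove(op)
                    for s in succs[op]:
                        waiting[s] -= 1
                        if waiting[s] == 0:           # s's operands are available
                            ready.add(s)
        return S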

Scheduling Example

1. Build the precedence graph

[Figure: the example code, operations a through i, and its precedence graph]

Scheduling Example

1. Build the precedence graph
2. Determine priorities: longest latency-weighted path

[Figure: the example code and its precedence graph, each node annotated with its latency-weighted path length]
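
The priority used here, each node's longest latency-weighted path to a root of the precedence graph, is a short bottom-up computation over the DAG. A sketch, reusing the names from the scheduler sketch above:

    def latency_weighted_priority(ops, succs, delay):
        prio = {}
        def depth(n):                     # memoized DFS; blocks are small
            if n not in prio:
                prio[n] = delay[n] + max((depth(s) for s in succs[n]),
                                         default=0)
            return prio[n]
        for n in ops:
            depth(n)
        return prio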

Scheduling Example

1. Build the precedence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling

The resulting schedule, with issue cycles on the left (operation e used a new register name, r3, so the loads could be reordered):

     1) a: loadAI  r0, @w ⇒ r1        4) b: add     r1, r1 ⇒ r1
     2) c: loadAI  r0, @x ⇒ r2        5) d: mult    r1, r2 ⇒ r1
     3) e: loadAI  r0, @y ⇒ r3        7) f: mult    r1, r3 ⇒ r1
     6) g: loadAI  r0, @z ⇒ r2        9) h: mult    r1, r2 ⇒ r1
                                     11) i: storeAI r1 ⇒ r0, @w
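
Tying the sketches together: a hypothetical encoding of this block drives the earlier graph builder, priority function, and scheduler. The operand details follow the EaC2e version of this example, and every definition is given a fresh register here, so only true dependences remain (the renaming hint later in the deck justifies this).

    block = [
        Op('a', 'loadAI',  ['r0'],       ['r1'], 3),  # loadAI r0, @w => r1
        Op('b', 'add',     ['r1', 'r1'], ['r2'], 1),
        Op('c', 'loadAI',  ['r0'],       ['r3'], 3),  # loadAI r0, @x => r3
        Op('d', 'mult',    ['r2', 'r3'], ['r4'], 2),
        Op('e', 'loadAI',  ['r0'],       ['r5'], 3),  # loadAI r0, @y => r5
        Op('f', 'mult',    ['r4', 'r5'], ['r6'], 2),
        Op('g', 'loadAI',  ['r0'],       ['r7'], 3),  # loadAI r0, @z => r7
        Op('h', 'mult',    ['r6', 'r7'], ['r8'], 2),
        Op('i', 'storeAI', ['r8', 'r0'], [],     3),  # storeAI r8 => r0, @w
    ]
    succs, preds = build_precedence_graph(block)
    labels = [op.label for op in block]
    delay = {op.label: op.delay for op in block}
    prio = latency_weighted_priority(labels, succs, delay)
    S = list_schedule(labels, preds, succs, delay, prio)
    print(sorted(S.items(), key=lambda kv: kv[1]))
    # -> a:1, c:2, e:3, b:4, d:5, g:6, f:7, h:9, i:11, as on the slide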

Improving the Efficiency of Scheduling

Updating the Ready queue is the most expensive part of this algorithm.
- Can create a separate queue for each cycle
  — Add an op to the queue for the cycle in which it will complete
  — A naïve implementation needs too many queues
- Can use fewer queues
  — The number of queues is bounded by the maximum operation latency
  — Use them in a cyclic (or modulo) fashion

The detailed algorithm is given at the end of this set of slides (go look on your own time).

More List Scheduling

List scheduling breaks down into two distinct classes:

- Forward list scheduling: start with the available operations, work forward in time; Ready ⇒ all operands available
- Backward list scheduling: start with the operations that have no successors, work backward in time; Ready ⇒ latency covers uses

Variations on list scheduling:
- Prioritize critical path(s)
- Schedule last use as soon as possible
- Depth-first in the precedence graph (minimize registers)
- Breadth-first in the precedence graph (minimize interlocks)
- Prefer the operation with the most successors
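
A deliberately simple backward scheduler, offered as a sketch under the same single-issue assumption and names as the forward sketch earlier: it walks from the operations with no successors, assigns strictly decreasing cycle numbers (so one op per cycle holds by construction), places each operation early enough that its latency covers every use, and finally renumbers so cycles start at 1.

    def backward_schedule(ops, preds, succs, delay, priority):
        S = {}
        ready = {n for n in ops if not succs[n]}      # no successors
        waiting = {n: len(succs[n]) for n in ops}     # unscheduled successors
        cycle = 0                                     # counts downward
        while ready:
            op = max(ready, key=lambda n: priority[n])
            ready.remove(op)
            # latest issue cycle whose latency still covers every use
            latest = min((S[s] - delay[op] for s in succs[op]),
                         default=cycle)
            S[op] = min(latest, cycle)
            cycle = S[op] - 1                         # one op per cycle
            for p in preds[op]:
                waiting[p] -= 1
                if waiting[p] == 0:
                    ready.add(p)
        shift = 1 - min(S.values())
        return {n: S[n] + shift for n in S}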

Local Scheduling

Forward and backward scheduling can produce different results.

[Figure: the precedence graph for a block from the SPEC benchmark "go": a cbr fed by a cmp, five stores, four adds plus an addI, an lshift, and four loadIs; subscripts identify the distinct instances, and edges are labeled with latency to the cbr]

Local Scheduling

[Figure: the forward schedule and the backward schedule for this block, using "latency to root" as the priority function]

My Favorite List Scheduling Variant: Schielke's RBF Algorithm

- Recognize that forward & backward scheduling both have their place
- Recognize that tie-breaking matters & that, while we can rationalize various tie-breakers, we do not understand them
  — Part of approximating the solution to an NP-complete problem

The Algorithm:
- Run the forward scheduler 5 times, breaking ties randomly
- Run the backward scheduler 5 times, breaking ties randomly
- Keep the best schedule

RBF means "randomized backward & forward". The implementation must be scrupulous about randomizing all choices.
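
A sketch of the RBF loop, assuming the list_schedule and backward_schedule sketches above; every tie is randomized by pairing each base priority with a fresh random tie-breaker on every pass, so equal-priority operations are ordered differently each time.

    import random

    def rbf(ops, preds, succs, delay, priority, passes=5):
        def sched_len(S):
            return max(S[n] + delay[n] for n in S)
        best = None
        for scheduler in (list_schedule, backward_schedule):
            for _ in range(passes):          # 5 forward + 5 backward passes
                jittered = {n: (priority[n], random.random()) for n in ops}
                S = scheduler(ops, preds, succs, delay, jittered)
                if best is None or sched_len(S) < sched_len(best):
                    best = S
        return best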

How Good Is List Scheduling? Schielke's Study

- 85,000 randomly generated blocks of 10, 20, and 50 ops, scheduled for 1 functional unit (3 compute-months of work)
- RBF found optimal schedules for > 80% of the blocks
- Tie-breaking matters because it affects the choice whenever the ready queue has > 1 element

[Figure: percentage of non-optimal list schedules versus available parallelism; peak difficulty for RBF is around 2.8]

Lab 3

Implement two schedulers for basic blocks in ILOC:
- One must be list scheduling with the specified priority
- The other can use a second priority, or a different algorithm

Same ILOC subset as in lab 1 (plus NOP), but a different execution model:
- Two asymmetric functional units
- Latencies different from lab 1
- Simulator different from lab 1

I will post a lab handout later today. Some of these specifics may change. I will update the posted lecture notes to match the actual assignment.

Lab 3 — Specific Questions

1. Are we in over our heads?
   No. The hard part of this lab should be trying to get good code. The programming is not bad; you can reuse some stuff from lab 1. You have all the specifications & tools you need. Jump in and start programming.

2. How many registers can we use?
   Assume that you have as many registers as you need. If the simulator runs out, we'll get more. Rename registers to avoid false dependences & conflicts.

3. What about a store followed by a load?
   If you can show that the two operations must refer to different memory locations, the scheduler can overlap their execution. Otherwise, the store must complete before the load issues. (A sketch of this conservative rule follows below.)

4. What about test files?
   We will put some test files online on OwlNet. We will test your lab on files you do not see. We have had problems (in the past) with people optimizing their labs for the test data. (Not that you would do that!)
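
That conservative rule can be expressed as extra precedence edges, using the Op record and graph maps from earlier. This is a sketch; may_alias is a hypothetical analysis callback, and with no real analysis available its default of always-True serializes every store against all later memory operations.

    def add_memory_edges(block, succs, preds, may_alias=lambda a, b: True):
        """Force each store to complete before any later load or store
        that may touch the same location can issue."""
        for i, a in enumerate(block):
            if a.opcode not in ('store', 'storeAI'):
                continue
            for b in block[i + 1:]:
                if (b.opcode in ('load', 'loadAI', 'store', 'storeAI')
                        and may_alias(a, b)):
                    succs[a.label].add(b.label)
                    preds[b.label].add(a.label)

A fuller version would also keep a load from drifting past a later store to the same location.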

Lab 3 — Hints

- Begin by renaming registers to eliminate false dependences (a sketch follows below)
- Pay attention to loads and stores
  — When can they be reordered?
- Understand the simulator
- Inserting NOPs may be necessary to get correct code
- Tie-breakers can make a difference
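
A minimal sketch of local renaming over the Op records from earlier: every definition gets a fresh name, and each use is rewritten to the most recent name for its register, so only true dependences survive into the precedence graph. The starting index 100 is arbitrary; registers live on entry to the block keep their original names.

    from itertools import count

    def rename_registers(block):
        fresh = ('r%d' % i for i in count(100))  # fresh virtual registers
        current = {}                             # old name -> newest name
        for op in block:
            op.uses = [current.get(r, r) for r in op.uses]
            new_defs = []
            for r in op.defs:
                n = next(fresh)
                current[r] = n
                new_defs.append(n)
            op.defs = new_defs
        return block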

Detailed Scheduling Algorithm I

Idea: keep a collection of worklists W[c], one per cycle.
— We need MaxC = max delay + 1 such worklists
— The precedence graph is (N, E)

Code:

    for each n ∈ N do begin
        count[n] := 0;
        earliest[n] := 0
    end
    for each (n1, n2) ∈ E do begin
        count[n2] := count[n2] + 1;
        successors[n1] := successors[n1] ∪ {n2};
    end
    for i := 0 to MaxC – 1 do
        W[i] := Ø;
    Wcount := 0;
    for each n ∈ N do
        if count[n] = 0 then begin
            W[0] := W[0] ∪ {n};
            Wcount := Wcount + 1;
        end
    c := 0;         // c is the cycle number
    cW := 0;        // cW is the number of the worklist for cycle c
    instr[c] := Ø;

Detailed Scheduling Algorithm II

    while Wcount > 0 do begin
        while W[cW] = Ø do begin
            c := c + 1;
            instr[c] := Ø;
            cW := mod(cW + 1, MaxC);
        end
        nextc := mod(c + 1, MaxC);
        while W[cW] ≠ Ø do begin
            select and remove an arbitrary instruction x from W[cW];   // priority choice goes here
            if ∃ free issue units of type(x) on cycle c then begin
                instr[c] := instr[c] ∪ {x};
                Wcount := Wcount – 1;
                for each y ∈ successors[x] do begin
                    count[y] := count[y] – 1;
                    earliest[y] := max(earliest[y], c + delay(x));
                    if count[y] = 0 then begin
                        loc := mod(earliest[y], MaxC);
                        W[loc] := W[loc] ∪ {y};
                        Wcount := Wcount + 1;
                    end
                end
            end
            else
                W[nextc] := W[nextc] ∪ {x};
        end
    end
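
For completeness, a runnable sketch of the same modulo-worklist idea, reusing the succs/preds maps from the earlier sketches. The names units (issue units per type), optype, and delay (maps from labels) are illustrative; the arbitrary pop is where a priority choice would go.

    def modulo_list_schedule(ops, preds, succs, optype, delay, units):
        max_c = max(delay.values()) + 1        # MaxC = max delay + 1
        W = [set() for _ in range(max_c)]      # one worklist per cycle slot
        count = {n: len(preds[n]) for n in ops}
        earliest = {n: 0 for n in ops}
        W[0] = {n for n in ops if count[n] == 0}
        wcount = len(W[0])
        instr = {}                             # cycle -> ops issued that cycle
        c, cW = 0, 0
        while wcount > 0:
            while not W[cW]:                   # advance to a non-empty slot
                c += 1
                cW = (cW + 1) % max_c
            nextc = (c + 1) % max_c
            instr.setdefault(c, [])
            deferred = set()
            while W[cW]:
                x = W[cW].pop()                # priority choice goes here
                used = sum(1 for y in instr[c] if optype[y] == optype[x])
                if used < units[optype[x]]:    # a free issue unit this cycle?
                    instr[c].append(x)
                    wcount -= 1
                    for y in succs[x]:
                        count[y] -= 1
                        earliest[y] = max(earliest[y], c + delay[x])
                        if count[y] == 0:
                            W[earliest[y] % max_c].add(y)
                            wcount += 1
                else:
                    deferred.add(x)            # retry on the next cycle
            W[nextc] |= deferred
        return instr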