Scheduling Chapter 10 Optimizing Compilers for Modern Architectures.


Introduction
We shall discuss:
- Straight-line scheduling
- Trace scheduling
- Kernel scheduling (software pipelining)
- Vector unit scheduling
- Cache coherence in coprocessors

Introduction
Scheduling: mapping parallelism onto the constraints of limited available parallel resources.
- Best case: all the uncovered parallelism can be exploited by the machine.
- In general, we must sacrifice some execution time to fit a program within the available resources.
- Our goal: minimize the amount of execution time sacrificed.

Introduction
Variants of the scheduling problem:
- Instruction scheduling: specifying the order in which instructions will be executed.
- Vector unit scheduling: making the most effective use of the instructions and capabilities of a vector unit; requires pattern recognition and synchronization minimization.
We will concentrate on instruction scheduling (fine-grained parallelism).

Introduction
Categories of processors supporting fine-grained parallelism:
- VLIW
- Superscalar processors

Introduction
Scheduling for VLIW and superscalar architectures: order the instruction stream so that as many functional units as possible are busy on every cycle.
Standard approach:
- Emit a sequential stream of instructions.
- Reorder this sequential stream to utilize the available parallelism.
- Reordering must preserve dependences.

Introduction
Issue: creating the sequential stream must consider the available resources; otherwise it may create artificial dependences.

    a = b + c + d + e

One possible sequential stream (a three-add dependence chain through a):

    add a, b, c
    add a, a, d
    add a, a, e

And another (the first two adds are independent):

    add r1, b, c
    add r2, d, e
    add a, r1, r2
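
To see why the second stream is better, here is a small editorial Python sketch (not from the book) that computes the critical path of each stream, assuming a uniform 1-cycle delay per add:

    # Compare the critical paths of the two streams above.
    def critical_path(deps, delay=1):
        """deps maps each instruction to the instructions it must wait for."""
        memo = {}
        def depth(i):
            if i not in memo:
                memo[i] = delay + max((depth(p) for p in deps[i]), default=0)
            return memo[i]
        return max(depth(i) for i in deps)

    # Sequential stream: add a,b,c ; add a,a,d ; add a,a,e  (chain through a)
    chain = {1: [], 2: [1], 3: [2]}
    # Tree-shaped stream: add r1,b,c ; add r2,d,e ; add a,r1,r2
    tree = {1: [], 2: [], 3: [1, 2]}

    print(critical_path(chain))  # 3 cycles: every add waits on the previous one
    print(critical_path(tree))   # 2 cycles: the first two adds can issue together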

Fundamental Conflict in Scheduling
- If the original instruction stream takes available resources into account, it will create artificial dependences.
- If not, there may not be enough resources to correctly execute the stream.

Machine Model
- The machine contains a number of issue units.
- Each issue unit has an associated type and a delay.
- I_j^k denotes the j-th unit of type k.
- Number of units of type k = m_k.
- Total number of issue units: M = m_1 + m_2 + … + m_l, where l is the number of issue-unit types in the machine.

Machine Model
- We will assume a VLIW model.
- Goal of the compiler: select a set of M instructions for each cycle such that the number of instructions of type k is m_k.
- Note that code can easily be generated for an equivalent superscalar machine.

Straight-Line Graph Scheduling
Scheduling a basic block uses a dependence graph G = (N, E, type, delay):
- N: the set of instructions in the code.
- Each n ∈ N has a type, type(n), and a delay, delay(n).
- (n1, n2) ∈ E iff n2 must await the completion of n1 due to a shared register (true, anti, and output dependences).

Straight-Line Graph Scheduling
A correct schedule is a mapping S from vertices in the graph to nonnegative integers representing cycle numbers such that:
1. S(n) ≥ 0 for all n ∈ N,
2. if (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2), and
3. for any type t, no more than m_t vertices of type t are mapped to a given integer.
The length of a schedule S, denoted L(S), is defined as:

    L(S) = max over n ∈ N of (S(n) + delay(n))

Goal of straight-line scheduling: find a shortest possible correct schedule. A straight-line schedule S is optimal if L(S) ≤ L(S') for all correct schedules S'.
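
The three conditions and the length formula translate directly into code. A minimal Python checker, a sketch under the definitions above (not the book's code):

    from collections import Counter

    def is_correct(S, edges, typ, delay, m):
        """S: node -> cycle; edges: (n1, n2) pairs; m: type -> units available."""
        # Condition 1: every instruction is scheduled at a nonnegative cycle.
        if any(S[n] < 0 for n in S):
            return False
        # Condition 2: dependences are honored: S(n1) + delay(n1) <= S(n2).
        if any(S[n1] + delay[n1] > S[n2] for (n1, n2) in edges):
            return False
        # Condition 3: no cycle uses more units of a type than the machine has.
        used = Counter((S[n], typ[n]) for n in S)
        return all(used[(c, t)] <= m[t] for (c, t) in used)

    def length(S, delay):
        """L(S) = max over n of S(n) + delay(n)."""
        return max(S[n] + delay[n] for n in S)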

List Scheduling
Use a variant of topological sort:
- Maintain a list of instructions that have no predecessors in the graph.
- Schedule these instructions.
- This allows other instructions to be added to the list.

List Scheduling
Algorithm for list scheduling:
- Schedule each instruction at the first opportunity after all instructions it depends on have completed.
- The count array records how many predecessors of each instruction remain to be scheduled.
- The earliest array maintains the earliest cycle on which each instruction can be scheduled.
- Maintain a number of worklists that hold the instructions to be scheduled for a particular cycle number. How many worklists are required?

List Scheduling
How shall we select instructions from the worklist?
- Random selection.
- Selection based on other criteria, treating the worklists as priority queues. The Highest Level First (HLF) heuristic schedules the more critical instructions first.

List Scheduling Algorithm I
Idea: keep a collection of worklists W[c], one per cycle. We need MaxC = max delay + 1 such worklists.
Code:

    for each n ∈ N do begin
        count[n] := 0;
        earliest[n] := 0;
    end
    for each (n1, n2) ∈ E do begin
        count[n2] := count[n2] + 1;
        successors[n1] := successors[n1] ∪ {n2};
    end
    for i := 0 to MaxC - 1 do
        W[i] := ∅;
    Wcount := 0;
    for each n ∈ N do
        if count[n] = 0 then begin
            W[0] := W[0] ∪ {n};
            Wcount := Wcount + 1;
        end
    c := 0;       // c is the cycle number
    cW := 0;      // cW is the number of the worklist for cycle c
    instr[c] := ∅;

List Scheduling Algorithm II

    while Wcount > 0 do begin
        while W[cW] = ∅ do begin          // advance to the next nonempty worklist
            c := c + 1;
            instr[c] := ∅;
            cW := mod(cW + 1, MaxC);
        end
        nextc := mod(c + 1, MaxC);
        while W[cW] ≠ ∅ do begin
            select and remove an instruction x from W[cW];   // priority choice goes here
            if there is a free issue unit of type(x) on cycle c then begin
                instr[c] := instr[c] ∪ {x};
                Wcount := Wcount - 1;
                for each y ∈ successors[x] do begin
                    count[y] := count[y] - 1;
                    earliest[y] := max(earliest[y], c + delay(x));
                    if count[y] = 0 then begin
                        loc := mod(earliest[y], MaxC);
                        W[loc] := W[loc] ∪ {y};
                        Wcount := Wcount + 1;
                    end
                end
            end
            else W[nextc] := W[nextc] ∪ {x};   // no unit free: retry x next cycle
        end
    end
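
For concreteness, here is a compact Python rendering of list scheduling. It is a sketch: it keeps a single ready set instead of the book's cyclic worklists W[0..MaxC-1], uses a sorted() tie-break in place of random or HLF selection, and assumes every instruction type has at least one issue unit:

    from collections import defaultdict

    def list_schedule(nodes, edges, typ, delay, m):
        succs = defaultdict(list)
        count = {n: 0 for n in nodes}        # unscheduled predecessors
        earliest = {n: 0 for n in nodes}     # earliest legal cycle
        for n1, n2 in edges:
            succs[n1].append(n2)
            count[n2] += 1
        ready = {n for n in nodes if count[n] == 0}
        schedule = defaultdict(list)         # cycle -> instructions issued
        c = 0
        while ready:
            free = dict(m)                   # issue units left on cycle c
            for x in sorted(n for n in ready if earliest[n] <= c):
                if free.get(typ[x], 0) > 0:  # a unit of x's type is free
                    free[typ[x]] -= 1
                    ready.remove(x)
                    schedule[c].append(x)
                    for y in succs[x]:       # newly ready nodes issue from c+1 on
                        count[y] -= 1
                        earliest[y] = max(earliest[y], c + delay[x])
                        if count[y] == 0:
                            ready.add(y)
            c += 1
        return schedule

A priority queue keyed by HLF level would slot in where the sorted() call appears.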

Trace Scheduling
Problem with list scheduling: transition points between basic blocks. We must insert enough instructions at the end of a basic block to ensure that results are available on entry into the next basic block. This results in significant overhead!
Alternative to list scheduling: trace scheduling.
- A trace is a collection of basic blocks that form a single path through all or part of the program.
- Trace scheduling schedules an entire trace at a time.
- Traces are chosen based on their expected frequencies of execution.
- Caveat: trace scheduling cannot handle cyclic graphs, so loops must be unrolled.

Trace Scheduling
Three steps for trace scheduling:
1. Selecting a trace (a sketch of one selection heuristic follows)
2. Scheduling the trace
3. Inserting fixup code
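
As an illustration of step 1, here is a greedy selection sketch. It is one common heuristic, not necessarily the book's exact rule, and it assumes per-block execution frequencies freq and caller-maintained bookkeeping (placed holds blocks already assigned to earlier traces):

    def pick_trace(blocks, freq, succ, pred, placed):
        """Grow a trace from the hottest unplaced block along the most
        frequent edges. Assumes at least one unplaced block remains; the
        caller adds the returned blocks to `placed`."""
        seed = max((b for b in blocks if b not in placed), key=freq.get)
        trace = [seed]
        b = seed
        while True:  # grow forward along the most frequent successor
            nxt = [s for s in succ.get(b, []) if s not in placed and s not in trace]
            if not nxt:
                break
            b = max(nxt, key=freq.get)
            trace.append(b)
        b = seed
        while True:  # grow backward along the most frequent predecessor
            prv = [p for p in pred.get(b, []) if p not in placed and p not in trace]
            if not prv:
                break
            b = max(prv, key=freq.get)
            trace.insert(0, b)
        return trace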

Inserting fixup code

Trace Scheduling
Trace scheduling avoids moving operations above splits or below joins unless it can prove that other instructions will not be adversely affected.

Trace Scheduling
Trace scheduling will always converge. However, in the worst case a very large amount of fixup code may result: the number of operations can grow to O(n·e^n).

Straight-Line Scheduling: Conclusion
Issues in straight-line scheduling:
- The relative order of register allocation and instruction scheduling.
- Dealing with loads and stores: without sophisticated analysis, almost no movement is possible among memory references.

Kernel Scheduling
Drawback of straight-line scheduling: loops are unrolled, which ignores parallelism among loop iterations.
Kernel scheduling: try to maximize parallelism across loop iterations.

Kernel Scheduling
Schedule a loop in three parts:
- a kernel: the code that must be executed on every cycle of the loop;
- a prolog: the code that must be performed before the steady state can be reached;
- an epilog: the code that must be executed to finish the loop once the kernel can no longer be executed.
The kernel scheduling problem seeks a minimal-length kernel for a given loop.
Issue: what about loops with small iteration counts?

Kernel Scheduling: Software Pipelining
A kernel scheduling problem is a graph G = (N, E, delay, type, cross), where cross(n1, n2), defined for each edge in E, is the number of iterations crossed by the dependence relating n1 and n2.
- cross captures the temporal movement of instructions through loop iterations.
- Software pipelining: the body of one loop iteration is pipelined across multiple iterations.

Software Pipelining
A solution to the kernel scheduling problem is a pair of tables (S, I), where the schedule S maps each instruction n to a cycle within the kernel, and the iteration I maps each instruction to an iteration offset from zero, such that for each edge (n1, n2) in E:

    S[n1] + delay(n1) ≤ S[n2] + (I[n2] - I[n1] + cross(n1, n2)) · L_k(S)

where L_k(S), the length of the kernel for S, is:

    L_k(S) = (max over n ∈ N of S[n]) + 1
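
The constraint checks mechanically. A minimal Python sketch under the definitions above:

    def kernel_ok(S, I, edges, delay, cross, Lk):
        """edges: list of (n1, n2); cross[(n1, n2)]: iterations the edge spans."""
        return all(
            S[n1] + delay[n1] <= S[n2] + (I[n2] - I[n1] + cross[(n1, n2)]) * Lk
            for (n1, n2) in edges
        )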

Software Pipelining
Example:

        ld   r1,0
        ld   r2,400
        fld  fr1,c
    l0: fld  fr2,a(r1)
    l1: fadd fr2,fr2,fr1
    l2: fst  fr2,b(r1)
    l3: ai   r1,r1,8
    l4: comp r1,r2
    l5: ble  l0

A legal kernel schedule (one row per kernel cycle):

        Load/Store          Integer        Floating Pt.
    l0: fld fr2,a(r1)       ai r1,r1,8
                            comp r1,r2
        fst fr3,b-16(r1)    ble l0         fadd fr3,fr2,fr1

Software Pipelining
The same loop with its schedule and iteration tables:

        ld   r1,0
        ld   r2,400
        fld  fr1,c
    l0: fld  fr2,a(r1)
    l1: fadd fr2,fr2,fr1
    l2: fst  fr2,b(r1)
    l3: ai   r1,r1,8
    l4: comp r1,r2
    l5: ble  l0

    S[l0] = 0; I[l0] = 0
    S[l1] = 2; I[l1] = 0
    S[l2] = 2; I[l2] = 1
    S[l3] = 0; I[l3] = 0
    S[l4] = 1; I[l4] = 0
    S[l5] = 2; I[l5] = 0

Kernel:

        Load/Store          Integer        Floating Pt.
    l0: fld fr2,a(r1)       ai r1,r1,8
                            comp r1,r2
        fst fr3,b-16(r1)    ble l0         fadd fr3,fr2,fr1

Software Pipelining
We have to generate a prolog and an epilog to ensure correctness.

Prolog:
        ld   r1,0
        ld   r2,400
        fld  fr1,c
    p1: fld  fr2,a(r1); ai r1,r1,8
    p2: comp r1,r2
    p3: beq  e1; fadd fr3,fr2,fr1

Epilog:
    e1: nop
    e2: nop
    e3: fst  fr3,b-8(r1)

Software Pipelining
Let N be the loop upper bound. Then the schedule length L(S) is given by:

    L(S) = N · L_k(S) + max over all instructions n of (S[n] + delay(n) + (I[n] - 1) · L_k(S))

Minimizing the length of the kernel minimizes the length of the schedule.
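
As a sketch, the length formula in code (S, I, delay, N, and Lk as defined above):

    def schedule_length(S, I, delay, N, Lk):
        """N kernel iterations plus the wind-down of instructions in flight."""
        return N * Lk + max(S[n] + delay[n] + (I[n] - 1) * Lk for n in S)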

Kernel Scheduling Algorithm
Is there an optimal kernel scheduling algorithm? Try to establish a lower bound on how well scheduling can do, i.e., how short a kernel can be:
- based on available resources;
- based on data dependences.

Kernel Scheduling Algorithm
Resource usage constraint (assuming no recurrence in the loop). Let #t be the number of instructions in each iteration that must issue on a unit of type t. Then:

    L_k(S) ≥ max over t of ceil(#t / m_t)     (EQN 10.7)

We can always find a schedule S such that L_k(S) = max over t of ceil(#t / m_t).
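
A small Python sketch of this bound (typ maps each instruction to its type; m gives the units per type):

    from collections import Counter
    from math import ceil

    def resource_bound(typ, m):
        """A kernel cycle issues at most m[t] instructions of type t, so the
        kernel needs at least ceil(#t / m[t]) cycles for the tightest type."""
        counts = Counter(typ.values())
        return max(ceil(counts[t] / m[t]) for t in counts)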

Software Pipelining Algorithm

    procedure loop_schedule(G, L, S, I)
        topologically sort G;
        for each instruction x in G in topological order do begin
            earlyS := 0; earlyI := 0;
            for each predecessor y of x in G do begin
                thisS := S[y] + delay(y);
                thisI := I[y];
                if thisS ≥ L then begin
                    thisI := thisI + floor(thisS / L);
                    thisS := mod(thisS, L);
                end
                if thisI > earlyI or (thisI = earlyI and thisS > earlyS) then begin
                    earlyI := thisI;
                    earlyS := thisS;
                end
            end
            starting at cycle earlyS, find the first cycle c0 where the
                resource needed by x is available, wrapping to the beginning
                of the kernel if necessary;
            S[x] := c0;
            if c0 < earlyS then I[x] := earlyI + 1;
            else I[x] := earlyI;
        end
    end loop_schedule
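
A Python rendering of loop_schedule for the acyclic graph G0. It is a sketch: it assumes a precomputed topological order and raises if L is too small for the resources, which the caller handles by retrying with a larger L:

    from collections import defaultdict

    def loop_schedule(order, preds, typ, delay, m, L):
        S, I = {}, {}
        used = defaultdict(int)            # (kernel cycle, type) -> units busy
        for x in order:                    # topological order of G0
            earlyS, earlyI = 0, 0
            for y in preds.get(x, []):
                thisS, thisI = S[y] + delay[y], I[y]
                if thisS >= L:             # fold the delay into iteration offsets
                    thisI += thisS // L
                    thisS %= L
                if (thisI, thisS) > (earlyI, earlyS):
                    earlyI, earlyS = thisI, thisS
            # First cycle at or after earlyS with a free unit, wrapping mod L.
            for k in range(L):
                c0 = (earlyS + k) % L
                if used[(c0, typ[x])] < m[typ[x]]:
                    break
            else:
                raise ValueError("kernel length L too small for resources")
            used[(c0, typ[x])] += 1
            S[x] = c0
            I[x] = earlyI + 1 if c0 < earlyS else earlyI  # wrapped past earlyS
        return S, I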

Software Pipelining Algorithm
Example: five instructions, each placed on its own unit in kernel cycle 0 of successive iterations:

    l0: ld a,x(i)
    l1: ai a,a,1
    l2: ai a,a,1
    l3: ai a,a,1
    l4: st a,x(i)

    Unit:     Memory1        Integer1       Integer2       Integer3       Memory2
    Schedule: l0: S=0; I=0   l1: S=0; I=1   l2: S=0; I=2   l3: S=0; I=3   l4: S=0; I=4

Cyclic Data Dependence Constraint
Given a cycle of dependences (n1, n2, …, nk):

    L_k(S) ≥ ceil( (delay(n1) + … + delay(nk)) / (cross(n1,n2) + … + cross(nk,n1)) )

The right-hand side is called the slope of the recurrence. Taking all dependence cycles c in the loop into account:

    L_k(S) ≥ MAX over cycles c of slope(c)     (EQN 10.10)
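
In code, the slope of one recurrence (a sketch; iterate it over all dependence cycles and take the maximum for Equation 10.10):

    from math import ceil

    def slope(cycle_edges, delay, cross):
        """cycle_edges: [(n1, n2), ...] closing back on the first node.
        Total delay around the cycle divided by total iterations crossed."""
        total_delay = sum(delay[n1] for (n1, _) in cycle_edges)
        total_cross = sum(cross[e] for e in cycle_edges)
        return ceil(total_delay / total_cross)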

Kernel Scheduling Algorithm

    procedure kernel_schedule(G, S, I)
        use the all-pairs shortest-path algorithm to find the cycle in the
            schedule graph G with the greatest slope;
        designate all cycles with this slope as critical cycles;
        mark every instruction in G that is on a critical cycle as a
            critical instruction;
        compute the lower bound LB for the loop as the maximum of the slope
            of the critical recurrence given by Equation 10.10 and the
            hardware constraint given in Equation 10.7;
        N := the number of instructions in the original loop body;
        let G0 be G with all cycles broken by eliminating edges into the
            earliest instruction in the cycle within the loop body;

Kernel Scheduling Algorithm

        failed := true;
        for L := LB to N while failed do begin
            // try to schedule the loop to length L
            loop_schedule(G0, L, S, I);
            // test to see if the schedule succeeded
            allOK := true;
            for each dependence cycle C while allOK do begin
                for each instruction v that is a part of C while allOK do begin
                    if I[v] > 0 then allOK := false;
                    else if v is the last instruction in the cycle C and
                            v0 is the first instruction in the cycle and
                            mod(S[v] + delay(v), L) > S[v0] then
                        allOK := false;
                end
            end
            if allOK then failed := false;
        end
    end kernel_schedule

Prolog Generation
Prolog: range(S) = (max over n of I[n]) + 1. The range r is the number of iterations that must execute before every instruction corresponding to a single instruction in the original loop has issued.
To get the loop into steady state (priming the pipeline):
- Lay out r - 1 copies of the kernel.
- Any instruction n with I[n] = i is replaced by a no-op in the first i copies.
- Use list scheduling to schedule the prolog.
A sketch of this layout rule follows.
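
A Python sketch of the prolog layout rule (the kernel is given as a list of cycles, each listing the instruction names issued on that cycle):

    def build_prolog(kernel, I, r):
        """Lay out r-1 copies of the kernel; an instruction with iteration
        offset I[n] = i is a no-op in the first i copies, because the
        operands it pipelines forward are not yet available."""
        prolog = []
        for copy in range(r - 1):
            for cycle in kernel:
                prolog.append([n if I[n] <= copy else "nop" for n in cycle])
        return prolog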

Epilog Generation
After the last iteration of the kernel, r - 1 more kernel iterations are required to wind down. However, we must also account for the last instructions to complete, to ensure that all hazards outside the loop are accommodated. The additional time required is:

    ΔS = ( max over n of ((I[n] - 1) · L_k(S) + S[n] + delay(n)) - r · L_k(S) )⁺

where (x)⁺ denotes max(x, 0). The length of the epilog is then:

    (r - 1) · L_k(S) + ΔS

Software Pipelining: Conclusion
Issues to consider in software pipelining:
- Increased register pressure: we may have to resort to spills.
- Control flow within loops: use if-conversion or construct control dependences; alternatively, schedule control-flow regions using a non-pipelining approach and treat those areas as black boxes when pipelining.

Vector Unit Scheduling
Chaining:

    vload  t1,a
    vload  t2,b
    vadd   t3,t1,t2
    vstore t3,c

- 192 cycles without chaining
- 66 cycles with chaining
Proximity within the instruction stream is required for the hardware to identify opportunities for chaining.
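
One way these counts can arise, assuming 64-element vector registers and ignoring exact startup latencies (an assumption; the slide does not state the machine parameters): without chaining, the dependent chain load → add → store runs one full vector operation at a time, roughly 3 × 64 = 192 cycles; with chaining, the add and the store consume elements as soon as the loads produce them, so the whole sequence finishes in about one vector length plus a couple of cycles of pipeline startup, 64 + 2 = 66 cycles.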

Vector Unit Scheduling
On a machine with 2 load pipes, 1 addition pipe, and 1 multiplication pipe:

    vload a,x(i)
    vload b,y(i)
    vadd  t1,a,b
    vload c,z(i)
    vmul  t2,c,t1
    vmul  t3,a,b
    vadd  t4,c,t3

Rearranging:

    vload a,x(i)
    vload b,y(i)
    vadd  t1,a,b
    vmul  t3,a,b
    vload c,z(i)
    vmul  t2,c,t1
    vadd  t4,c,t3

Vector Unit Scheduling
The chaining problem is solved by a weighted fusion algorithm:
- A variant of the fusion algorithm seen in Chapter 8.
- Takes into consideration the resource constraints of the machine (number of pipes).
- Weights are recomputed dynamically: for instance, if an addition and a subtraction are selected for chaining, then a load that is an input to both the addition and the subtraction is given a higher weight after the fusion.

Vector Unit Scheduling
Before fusion:

    vload a,x(i)
    vload b,y(i)
    vadd  t1,a,b
    vload c,z(i)
    vmul  t2,c,t1
    vmul  t3,a,b
    vadd  t4,c,t3

Vector Unit Scheduling
After fusion:

    vload a,x(i)
    vload b,y(i)
    vadd  t1,a,b
    vmul  t3,a,b
    vload c,z(i)
    vmul  t2,c,t1
    vadd  t4,c,t3

Co-processors
A co-processor can access main memory but cannot see the cache: a cache coherence problem.
Solutions:
- A special set of memory synchronization operations.
- Stall the processor on reads and writes (waits).
A minimal number of waits is essential for fast execution:
- Use data dependence to insert these waits.
- The positioning of the waits is important to reduce their number.

Co-processors
Algorithm to insert waits (a sketch follows):
- Make a single pass starting from the beginning of the block.
- Note the sources of hazard edges as they are passed.
- When the target of a noted edge is reached, insert a wait.
- This produces the minimum number of waits in the absence of control flow.
- Minimizing waits in the presence of control flow is NP-complete, so the compiler must use heuristics.
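
A Python sketch of the straight-line case, under an assumption about the wait semantics: one wait stalls until all earlier co-processor operations complete, so it clears every pending hazard at once:

    from collections import defaultdict

    def insert_waits(block, hazard_edges):
        """block: list of instructions; hazard_edges: (src, tgt) index pairs."""
        sources_of = defaultdict(set)      # target index -> its source indices
        for s, t in hazard_edges:
            sources_of[t].add(s)
        sources = {s for (s, t) in hazard_edges}
        pending = set()                    # hazard sources not yet waited on
        out = []
        for i, instr in enumerate(block):
            if pending & sources_of[i]:    # an uncleared hazard reaches here
                out.append("wait")         # one wait clears all pending hazards
                pending.clear()
            out.append(instr)
            if i in sources:
                pending.add(i)
        return out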

Conclusion
We looked at:
- Straight-line scheduling: for basic blocks
- Trace scheduling: across basic blocks
- Kernel scheduling: exploiting parallelism across loop iterations
- Vector unit scheduling
- Issues in cache coherence for co-processors