Chapter 10 Scheduling Presented by Vladimir Yanovsky.

The Goals
Scheduling: mapping parallelism onto limited available parallel resources. In general, we must sacrifice some parallelism to fit a program within the available resources. Our goal: minimize the amount of parallelism sacrificed, i.e. maximize utilization of the resources.

Lecture Outline
– Straight-line scheduling
– Trace scheduling
– Loops: kernel scheduling (software pipelining)
– Vector unit scheduling

Scheduling – Motivation
Transistor sizes have shrunk. This can be exploited by:
1. Several processors on the same silicon.
2. Multiple identical execution units.
The more parallelism the processor allows, the more important scheduling becomes.

Processor Types
Superscalar: multiple functional units controlled and scheduled by the hardware.
VLIW (Very Long Instruction Word): scheduled by the compiler.

VLIW vs. Superscalar
– Compatibility
– Capability of run-time adjustment (branches & cache misses)
– Design simplicity
– Global view of the program

Scheduling – Standard Approach
Scheduling in VLIW and superscalar architectures:
– Receive a sequential stream of instructions
– Reorder this sequential stream to utilize available parallelism
– Reordering must preserve dependences
Our model for this talk is VLIW.

Reuse Constraints
Need to execute: a = b + c + d + e
One possible sequential stream:
  add a, b, c
  add a, a, d
  add a, a, e
And another:
  add r1, b, c
  add r2, d, e
  add a, r1, r2
The first stream reuses a and forces a serial chain; the second uses two extra registers but leaves its first two adds independent.

Fundamental Problem
A fundamental conflict in scheduling:
– If the original instruction stream takes available resources into account, it creates artificial dependences.
– If not, there may not be enough resources to correctly execute the stream.
Which should come first: register allocation or scheduling?

Processor Model
VLIW-type processor containing a number of issue units. Each issue unit has an associated type and a delay. Purpose: select a set of instructions for each cycle such that the number of instructions of each type is not greater than the number of issue units of that type.

Straight-Line Scheduling
Scheduling a basic block: receives a dependence graph G = (N, E, type, delay)
– N: the set of instructions in the code
– E: (n1, n2) ∈ E iff n2 must wait for the completion of n1 due to a dependence
– each n ∈ N has a type, type(n), and a delay, delay(n).

Straight-Line Scheduling
A correct schedule is a mapping S from vertices of the graph to nonnegative integers representing cycle numbers such that:
1. If (n1, n2) ∈ E then S(n1) + delay(n1) ≤ S(n2), i.e. dependences are satisfied.
2. Hardware constraints are satisfied.
The length of a schedule S, denoted L(S), is defined as:
  L(S) = max_n (S(n) + delay(n))
Goal of straight-line scheduling: find a shortest possible correct schedule.
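The two correctness conditions and the length definition can be checked mechanically. Below is a minimal Python sketch, assuming a dict-based encoding of the dependence graph (the names `is_correct` and `length` are mine, not the chapter's):

```python
from collections import Counter

def is_correct(schedule, edges, delay, node_type, units):
    """schedule: node -> cycle; edges: list of (n1, n2);
    units: unit type -> number of issue units of that type."""
    # 1. Every dependence (n1, n2) is satisfied: S(n1) + delay(n1) <= S(n2).
    for n1, n2 in edges:
        if schedule[n1] + delay[n1] > schedule[n2]:
            return False
    # 2. No cycle issues more instructions of a type than there are units.
    used = Counter((schedule[n], node_type[n]) for n in schedule)
    return all(used[(c, t)] <= units[t] for (c, t) in used)

def length(schedule, delay):
    # L(S) = max_n (S(n) + delay(n))
    return max(schedule[n] + delay[n] for n in schedule)
```

For example, with two 1- and 2-cycle adds feeding a third instruction, issuing the consumer before its producers finish fails condition 1, so `is_correct` rejects that schedule.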

List Scheduling
Use a variant of topological sort:
– Maintain a list of instructions that have no unscheduled predecessors in the graph
– Schedule these instructions
– This allows further instructions to be added to the list
– Repeat until all instructions are scheduled

List Scheduling
We maintain two arrays:
– count records, for each instruction, how many predecessors are still to be scheduled
– earliest records the earliest cycle on which the instruction can be scheduled.
We also maintain a number of worklists; each holds the instructions ready to be scheduled on a particular cycle number (all their predecessors are scheduled).

List Scheduling – Initialization

for each n ∈ N do begin
    count[n] := 0; earliest[n] := 0
end
for each (n1, n2) ∈ E do begin
    count[n2] := count[n2] + 1;
    successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC – 1 do W[i] := ∅;   // MaxC = max(delay) + 1
Wcount := 0;                           // the number of ready instructions
for each n ∈ N do
    if count[n] = 0 then begin         // no dependences
        W[0] := W[0] ∪ {n}; Wcount := Wcount + 1;
    end
c := 0;    // c is the cycle number
cW := 0;   // cW is the number of the worklist for cycle c
instr[c] := ∅;

List Scheduling – Algorithm

while Wcount > 0 do begin
    while W[cW] = ∅ do begin           // no instruction ready on cycle c
        c := c + 1; instr[c] := ∅;
        cW := mod(cW + 1, MaxC);
    end
    nextc := mod(c + 1, MaxC);         // next cycle
    while W[cW] ≠ ∅ do begin
        select and remove an arbitrary instruction x from W[cW];
                                       // a priority heuristic goes here
        if ∃ a free issue unit of type(x) on cycle c then begin
            instr[c] := instr[c] ∪ {x};
            Wcount := Wcount – 1;
            for each y ∈ successors[x] do begin
                count[y] := count[y] – 1;
                earliest[y] := max(earliest[y], c + delay(x));
                if count[y] = 0 then begin
                    loc := mod(earliest[y], MaxC);
                    W[loc] := W[loc] ∪ {y}; Wcount := Wcount + 1;
                end
            end
        end
        else W[nextc] := W[nextc] ∪ {x};   // x could not be scheduled
    end
    for each unused unit insert a stall;
end
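The worklist machinery above can be compressed into a short runnable sketch. This Python version (illustrative names; it scans cycles linearly rather than keeping MaxC circular worklists, and breaks ties alphabetically instead of by a real priority) implements the same greedy policy:

```python
def list_schedule(nodes, edges, delay, node_type, units):
    """Greedy cycle-by-cycle list scheduling.
    units: unit type -> number of issue slots per cycle."""
    succs = {n: [] for n in nodes}
    count = {n: 0 for n in nodes}      # unscheduled predecessors
    earliest = {n: 0 for n in nodes}   # earliest legal cycle for each node
    for n1, n2 in edges:
        succs[n1].append(n2)
        count[n2] += 1
    ready = sorted(n for n in nodes if count[n] == 0)
    sched, cycle = {}, 0
    while len(sched) < len(nodes):
        free = dict(units)             # issue slots left on this cycle
        for n in list(ready):
            if earliest[n] <= cycle and free.get(node_type[n], 0) > 0:
                free[node_type[n]] -= 1
                sched[n] = cycle
                ready.remove(n)
                for m in succs[n]:     # release successors
                    count[m] -= 1
                    earliest[m] = max(earliest[m], cycle + delay[n])
                    if count[m] == 0:
                        ready.append(m)
        cycle += 1
        ready.sort()  # deterministic tie-breaking; a real scheduler uses a priority
    return sched
```

On a diamond-shaped dependence graph with one integer unit, the scheduler serializes the two middle instructions and starts the sink only after the longest delay path completes.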

Finding the Critical Path

for each n ∈ N do begin
    count[n] := 0; remaining[n] := delay(n);
end
for each (n1, n2) ∈ E do begin
    count[n1] := count[n1] + 1;        // count[n] = 0 iff nothing depends on n
    predecessors[n2] := predecessors[n2] ∪ {n1};
end
W := ∅;
for each n ∈ N do
    if count[n] = 0 then W := W ∪ {n}; // initialize W with the instructions nothing depends on
while W ≠ ∅ do begin
    select and remove an arbitrary instruction x from W;
    for each y ∈ predecessors[x] do begin
        count[y] := count[y] – 1;
        remaining[y] := max(remaining[y], remaining[x] + delay(y));
        if count[y] = 0 then W := W ∪ {y};
    end
end
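This backward pass gives each instruction the length of its longest delay-weighted path to the end of the block, the usual priority fed into the "select an instruction" step of the list scheduler. A direct Python transcription (assumed names):

```python
def remaining_latency(nodes, edges, delay):
    """remaining[n] = longest delay-weighted path from n to any leaf."""
    preds = {n: [] for n in nodes}
    count = {n: 0 for n in nodes}   # number of unprocessed successors
    remaining = {n: delay[n] for n in nodes}
    for n1, n2 in edges:
        preds[n2].append(n1)
        count[n1] += 1
    work = [n for n in nodes if count[n] == 0]   # nothing depends on them
    while work:
        x = work.pop()
        for y in preds[x]:
            count[y] -= 1
            remaining[y] = max(remaining[y], remaining[x] + delay[y])
            if count[y] == 0:
                work.append(y)
    return remaining
```

Scheduling the instruction with the largest `remaining` value first tends to keep the critical path moving.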

Problems of List Scheduling
– The previous basic block must complete before the next one starts.
– Cannot schedule loops.

Trace Scheduling
Exploits parallelism across several basic blocks. A trace is a collection of basic blocks that form a single path through all or part of the program, i.e. a path in the CFG without loops.

Trace Scheduling – Example
[Figure: a trace containing j = j + 1, i = i + 2 and a conditional branch on e1; when i = i + 2 is moved below the split, fixup code is inserted on the off-trace path.]

Trace Scheduling
The trace scheduling algorithm:
1. Select a trace based on profiling information.
2. Schedule the trace using the basic-block scheduler, adding dependences from the splits/joins to the upstream/downstream instructions respectively.
3. Insert fixup code.
4. Remove the scheduled trace from the CFG.
5. If the CFG is not empty, go to 1.

Trace & Straight-Line Scheduling – Conclusions
1. Neither straight-line nor trace scheduling can schedule loops effectively; loops must be unrolled to provide more "meat" for work.
2. Trace scheduling increases code size by inserting fixup code, which may lead to exponential code growth.
3. Up-to-date memory dependence information is needed to do anything about moving memory accesses.

Kernel Scheduling
Moves instructions not only in space but also in time, across iterations. This allows us to better exploit parallelism between loop iterations.

Kernel Scheduling Problem
A kernel scheduling problem is a graph
  G = (N, E, delay, type, cross)
where cross(n1, n2), defined for each edge in E, is the number of iterations crossed by the dependence relating n1 and n2.

Software Pipelining
Example:
      ld   r1, 0
      ld   r2, 400
      fld  fr1, c
l0:   fld  fr2, a(r1)
l1:   fadd fr2, fr2, fr1
l2:   fst  fr2, b(r1)
l3:   ai   r1, r1, 8
l4:   comp r1, r2
l5:   ble  l0
Dependence chain: fld → fadd (delay 2), fadd → fst (delay 3).
A legal kernel schedule (columns are the Integer, Load/Store and Floating Pt. units):
  cycle 0:  ai r1, r1, 8    fld fr2, a(r1)
  cycle 1:  comp r1, r2
  cycle 2:  ble l0          fst fr3, b-16(r1)    fadd fr3, fr2, fr1

Software Pipelining
For the loop above, the schedule and iteration tables are:
  S[l0] = 0; I[l0] = 0
  S[l1] = 2; I[l1] = 0
  S[l2] = 2; I[l2] = 1
  S[l3] = 0; I[l3] = 0
  S[l4] = 1; I[l4] = 0
  S[l5] = 2; I[l5] = 0
giving the kernel:
  cycle 0:  ai r1, r1, 8    fld fr2, a(r1)
  cycle 1:  comp r1, r2
  cycle 2:  ble l0          fst fr3, b-16(r1)    fadd fr3, fr2, fr1

Software Pipelining
Have to generate an epilog and a prolog to ensure correctness.
Prolog:
      ld   r1, 0
      ld   r2, 400
      fld  fr1, c
p1:   fld  fr2, a(r1);  ai r1, r1, 8
p2:   comp r1, r2
p3:   beq e1;  fadd fr3, fr2, fr1
Epilog:
e1:   nop
e2:   nop
e3:   fst  fr3, b-8(r1)

Kernel Scheduling
A solution to the kernel scheduling problem is a pair of tables (S, I), where:
– the schedule S maps each instruction n to a cycle within the kernel
– the iteration I maps each instruction to an iteration offset from zero,
such that for each edge (n1, n2) in E:
  S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1, n2)) · L_k(S)
where L_k(S) = max_n(S(n)) is the length of the kernel for S. Another name for the kernel's length is II, the initiation interval.

Kernel Scheduling – Intuition
  S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1, n2)) · L_k(S)
Instructions with I[n] = 0 run in the "current" iteration. I[n] > 0 means the instruction is delayed by I[n] iterations. Even if n1 has a large delay, n2 can be moved to a later iteration instead of being forced to be scheduled in cycle S[n1] + delay(n1).

Resource Constraints
Resource usage constraint (assuming no recurrence in the loop): let #t be the number of instructions in each iteration that must issue on a unit of type t, and let u_t be the number of units of type t. Then
  L_k(S) ≥ max over t of ⌈#t / u_t⌉
and we can always find a schedule S such that L_k(S) equals this bound.
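As a sketch (the helper name and encoding are mine), the bound can be computed for the five-instruction loop used on a later slide, with 2 memory units and 3 integer units:

```python
from collections import Counter
from math import ceil

def resource_ii(op_types, units):
    """Resource-constrained lower bound on the kernel length:
    max over unit types t of ceil(#t / number of units of type t)."""
    per_type = Counter(op_types.values())          # unit type -> #t
    return max(ceil(per_type[t] / units[t]) for t in per_type)

# One load, three integer adds, one store per iteration:
ops = {'l0': 'mem', 'l1': 'int', 'l2': 'int', 'l3': 'int', 'l4': 'mem'}
```

With `{'mem': 2, 'int': 3}` the bound is 1, matching the II = 1 kernel shown later; dropping to a single memory unit raises it to 2.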

Kernel Scheduling

for each instruction x in G in topological order do begin
    earlyS := 0; earlyI := 0;
    for each predecessor y of x in G do begin
        thisS := S[y] + delay(y); thisI := I[y];
        if thisS ≥ L then begin
            thisI := thisI + ⌊thisS / L⌋;   // carry whole kernels into the iteration offset
            thisS := mod(thisS, L);
        end
        if thisI > earlyI or (thisI = earlyI and thisS > earlyS) then begin
            earlyI := thisI; earlyS := thisS;
        end
    end
    starting at cycle earlyS, find the first cycle c0 where the resource needed by x
        is available, wrapping to the beginning of the kernel if necessary;
    S[x] := c0;
    if c0 < earlyS then I[x] := earlyI + 1   // wrapped over the kernel
    else I[x] := earlyI;
end
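The algorithm above can be written as a runnable Python sketch (names assumed; it presumes L is at least the resource bound, so a free slot always exists within one wrap of the kernel):

```python
def kernel_schedule(order, preds, delay, node_type, units, L):
    """Simple (recurrence-free) modulo scheduler: place each instruction at
    the first free slot at or after its earliest position, modulo L."""
    S, I = {}, {}
    used = {}                      # (cycle, type) -> busy units in the kernel
    for x in order:                # 'order' must be a topological order
        earlyS, earlyI = 0, 0
        for y in preds.get(x, []):
            thisS, thisI = S[y] + delay[y], I[y]
            thisI += thisS // L    # carry whole kernels into the iteration offset
            thisS %= L
            if (thisI, thisS) > (earlyI, earlyS):
                earlyI, earlyS = thisI, thisS
        for k in range(L):         # scan for a free slot, wrapping if necessary
            c0 = (earlyS + k) % L
            if used.get((c0, node_type[x]), 0) < units[node_type[x]]:
                break
        used[(c0, node_type[x])] = used.get((c0, node_type[x]), 0) + 1
        S[x] = c0
        I[x] = earlyI + 1 if c0 < earlyS else earlyI   # wrapped over the kernel
    return S, I
```

On the five-instruction loop from the next slide (unit delays, 2 memory and 3 integer units, L = 1) this reproduces S = 0 everywhere and I = 0, 1, 2, 3, 4.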

Software Pipelining Example

l0:  ld  a, x(i)
l1:  ai  a, a, 1
l2:  ai  a, a, 1
l3:  ai  a, a, 1
l4:  st  a, x(i)

With 2 memory units and 3 integer units, II = 1 is enough:
  Memory1: l0 (S=0, I=0)   Integer1: l1 (S=0, I=1)   Integer2: l2 (S=0, I=2)
  Integer3: l3 (S=0, I=3)  Memory2: l4 (S=0, I=4)
Each successive instruction is pushed to the next iteration.

Register Pressure

l0:  ld  a0, x(i)
l1:  ai  a1, a0, 1
l2:  ai  a2, a1, 1
l3:  ai  a3, a2, 1
l4:  st  a3, x(i)

1. The same register a cannot be used in 4 different iterations running simultaneously.
2. Need to store the register's value for each of the overlapping iterations and rename the copies cyclically after each iteration.
3. Issue 2 can be solved by unrolling with renaming, though this increases code size.

Prolog & Epilog
1. The current iteration when entering the kernel is 5.
2. I(Stage A) = 0, that is, we execute Stage A in the same iteration as originally.
3. I(Stage B) = 1, i.e. Stage B is always delayed to the next iteration.
4. Prolog: StageA1; StageB1, StageA2; StageC1, StageB2, StageA3; …

Prolog & Epilog Generation
Prolog: for k = 0 to max_n(I(n)) – 1, lay out the kernel replacing every n such that I(n) > k by a NO-OP.
Epilog: for k = 1 to max_n(I(n)), lay out the kernel replacing every n such that I(n) < k by a NO-OP.
Compact both using the list scheduler.
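The two lay-out rules translate directly into Python (a sketch with assumed names; None stands for a NO-OP):

```python
def prolog_epilog(kernel, iteration):
    """Lay out prolog/epilog copies of the kernel per the rules above.
    kernel: instruction names in kernel order; iteration: name -> I(n)."""
    depth = max(iteration.values())
    # Prolog copy k keeps only instructions with I(n) <= k.
    prolog = [[n if iteration[n] <= k else None for n in kernel]
              for k in range(depth)]
    # Epilog copy k keeps only instructions with I(n) >= k.
    epilog = [[n if iteration[n] >= k else None for n in kernel]
              for k in range(1, depth + 1)]
    return prolog, epilog
```

For a three-stage kernel A, B, C with I = 0, 1, 2 this yields the prolog A; A, B and the epilog B, C; C, the staircase pattern sketched on the previous slide.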

Recurrences
Given a recurrence (n1, n2, …, nk):
  L_k(S) ≥ ⌈(sum of the delays around the recurrence) / (sum of the cross values around the recurrence)⌉
– The right-hand side is called the slope of the recurrence. The numerator is the number of cycles it takes to complete all the computations of the recurrence; the denominator is the number of iterations available to do this.
– L_k(S) ≥ MAX over all recurrences c of slope(c).
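A sketch of the slope computation (the encoding is assumed: a recurrence is a list of edges around the cycle, with per-source delays and per-edge cross values):

```python
from math import ceil

def slope(cycle_edges, delay, cross):
    """Slope of one recurrence: total delay around the dependence cycle
    divided by the total iteration distance, rounded up."""
    total_delay = sum(delay[n1] for (n1, n2) in cycle_edges)
    total_cross = sum(cross[(n1, n2)] for (n1, n2) in cycle_edges)
    return ceil(total_delay / total_cross)

def recurrence_mii(recurrences, delay, cross):
    # The kernel length must be at least the largest slope.
    return max(slope(c, delay, cross) for c in recurrences)
```

For example, a two-instruction cycle whose delays sum to 4 and that spans a single iteration forces II ≥ 4 no matter how many units are free.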

Kernel Scheduling – General Case
1. Compute MII as the maximum of the resource constraint and the maximum slope.
2. II := MII.
3. Remove an edge from every recurrence.
4. Schedule(II) using the simple kernel scheduling algorithm.
5. If it failed (the dependence along any removed edge is violated), increase II and go to 4.

Kernel Scheduling – Conclusions
– Handling control flow is difficult: may use hardware support for predicated execution, or handle the "control flow regions" as black boxes.
– Increased register pressure may limit the technique to single-basic-block inner loops anyway.
– Benefits from unrolling with renaming.

Vector Unit Scheduling
A vector instruction involves the execution of many scalar instructions, so much of the benefit of pipelining is already achieved. Still, something can be done.

Chaining

vload  t1, a
vload  t2, b
vadd   t3, t1, t2
vstore t3, c

Two load units; each operation takes 64 cycles. Without chaining the sequence takes 192 cycles; with chaining, 66. The instructions must be close together for the hardware to identify opportunities for chaining.
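The two cycle counts can be reproduced with a toy cost model. The model below is entirely an assumption on my part (results start flowing after one full vector length, and each chained stage adds a single startup cycle, chosen to match the slide's figures):

```python
def chain_cycles(n_ops, vector_length, chained, per_stage_startup=1):
    """Toy cost model for a chain of dependent vector operations."""
    if chained:
        # The first op streams for vector_length cycles; each further op in
        # the chain only adds its startup latency before results emerge.
        return vector_length + (n_ops - 1) * per_stage_startup
    # Unchained: each op must fully complete before the next can start.
    return n_ops * vector_length
```

`chain_cycles(3, 64, chained=False)` gives 192 and `chain_cycles(3, 64, chained=True)` gives 66, the numbers quoted above.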

Chaining – Rearranging
2 load pipes, 1 addition pipe, 1 multiplication pipe.
Original order:
  vload a, x(i)
  vload b, y(i)
  vadd  t1, a, b
  vload c, z(i)
  vmul  t2, c, t1
  vmul  t3, a, b
  vadd  t4, c, t3
Rearranged:
  vload a, x(i)
  vload b, y(i)
  vadd  t1, a, b
  vmul  t3, a, b
  vload c, z(i)
  vmul  t2, c, t1
  vadd  t4, c, t3

Instruction Fusion
  vload a, x(i)
  vload b, y(i)
  vadd  t1, a, b
  vload c, z(i)
  vmul  t2, c, t1
  vmul  t3, a, b
  vadd  t4, c, t3

Instruction Fusion – cont.
After fusion:
  vload a, x(i)
  vload b, y(i)
  vadd  t1, a, b
  vmul  t3, a, b
  vload c, z(i)
  vmul  t2, c, t1
  vadd  t4, c, t3

The End!