EECS 583 – Class 14 Modulo Scheduling Reloaded University of Michigan October 31, 2011.


Announcements + Reading Material
- Project proposal – due Friday, Nov 4
  - 1 from each group: names, plus a paragraph summarizing what you plan to do
- Today's class reading
  - "Code Generation Schema for Modulo Scheduled Loops," B. Rau, M. Schlansker, and P. Tirumalai, MICRO-25, Dec. 1992
- Next reading – last class before the research material!
  - "Register Allocation and Spilling via Graph Coloring," G. Chaitin, Proc. SIGPLAN Symposium on Compiler Construction, 1982

Review: A Software Pipeline
[Figure: a loop body with 4 ops (A, B, C, D) overlapped across iterations over time; the prologue fills the pipe, the kernel is the steady state, and the epilogue drains the pipe]
Steady state: 4 iterations execute simultaneously, with 1 operation from each iteration. Once the pipe is full, an iteration starts and an iteration finishes every cycle.

Loop Prolog and Epilog
[Figure: schedule with II = 3, divided into prolog, kernel, and epilog regions]
Only the kernel executes the full width of operations; the prolog and epilog execute a subset (ramp-up and ramp-down).

Separate Code for Prolog and Epilog
[Figure: loop body with 4 ops (A, B, C, D); the prolog fills the pipe (A0; A1 B0; A2 B1 C0), the kernel runs in steady state, and the epilog drains the pipe (Bn Cn-1 Dn-2; Cn Dn-1; Dn)]
Generate special code before the loop (preheader) to fill the pipe and special code after the loop to drain the pipe. Peel off II-1 iterations for the prolog; complete II-1 iterations in the epilog.

Removing Prolog/Epilog
[Figure: the same II = 3 schedule, with the prolog and epilog portions disabled using predicated execution]
Execute the loop kernel on every iteration, but during the prolog and epilog selectively disable the appropriate operations to fill/drain the pipeline.

Kernel-only Code Using Rotating Predicates
Kernel: A if P[0]; B if P[1]; C if P[2]; D if P[3]
[Figure: execution pattern under rotating predicates – A---, AB--, ABC-, ABCD, ..., -BCD, --CD, ---D – the prolog and epilog emerge from which predicates happen to be set]
P is referred to as the staging predicate.

Modulo Scheduling Architectural Support
- Loop requiring N iterations
  - Will take N + (S – 1) passes through the kernel, where S is the number of stages
- 2 special registers created
  - LC: loop counter (holds N)
  - ESC: epilog stage counter (holds S)
- Software pipeline branch operations
  - Initialize LC = N, ESC = S in the loop preheader
  - All rotating predicates are cleared
  - BRF.B.B.F:
    - While LC > 0: decrement LC and RRB, set P[0] = 1, and branch to the top of the loop (this occurs for the prolog and kernel)
    - If LC = 0, then while ESC > 0: decrement ESC and RRB, write a 0 into P[0], and branch to the top of the loop (this occurs for the epilog)

Execution History With LC/ESC
Kernel: A if P[0]; B if P[1]; C if P[2]; D if P[3]; P[0] = BRF.B.B.F
Preheader: LC = 3, ESC = 3 /* remember, 0 relative!! */; clear all rotating predicates; P[0] = 1

LC  ESC  P[0] P[1] P[2] P[3]   ops executed
3   3    1    0    0    0      A
2   3    1    1    0    0      A B
1   3    1    1    1    0      A B C
0   3    1    1    1    1      A B C D
0   2    0    1    1    1      B C D
0   1    0    0    1    1      C D
0   0    0    0    0    1      D

4 iterations, 4 stages, II = 1. Note: only 1 iteration of the full-width kernel (A B C D) is executed.
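The trace above can be reproduced mechanically from the BRF semantics on the previous slide. Below is a minimal Python sketch (an illustrative model, not actual hardware or any course tool) that rotates a predicate vector and decrements LC/ESC the way BRF.B.B.F is described to, printing one row per kernel pass. Running it for 4 iterations and 4 stages prints the seven rows shown above.

# Minimal simulation of kernel-only execution with LC, ESC, and a rotating
# staging predicate, following the BRF.B.B.F behavior described on the
# previous slide (a sketch, not actual hardware).
STAGES = ["A", "B", "C", "D"]

def run_kernel_only(n_iters, n_stages):
    lc, esc = n_iters - 1, n_stages - 1           # 0-relative, as on the slide
    pred = [True] + [False] * (n_stages - 1)      # P[0] = 1, others cleared
    trace = []
    while True:
        ops = [STAGES[s] for s in range(n_stages) if pred[s]]
        trace.append((lc, esc, [int(p) for p in pred], ops))
        if lc > 0:                                # prolog + kernel: P[0] = 1
            lc -= 1
            pred = [True] + pred[:-1]             # rotate (decrement RRB)
        elif esc > 0:                             # epilog: P[0] = 0
            esc -= 1
            pred = [False] + pred[:-1]
        else:                                     # ESC exhausted: fall through
            break
    return trace

for lc, esc, pred, ops in run_kernel_only(4, 4):
    print(lc, esc, pred, " ".join(ops))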

Review: Modulo Scheduling Process
- Use list scheduling, but with a few twists
  - II is predetermined – it starts at MII, then is incremented on failure
  - Cyclic dependences complicate matters (Estart, priority, etc.)
    - A consumer scheduled before its producer must be considered
    - There is a window in which each operation can be scheduled!
  - Must guarantee the repeating pattern
- 2 constraints enforced on the schedule
  - Each iteration begins exactly II cycles after the previous one
  - Each time an operation is scheduled in one iteration, it is tentatively scheduled in subsequent iterations at intervals of II (the MRT is used for this)

Review: ResMII Example
Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

ALU: used by 2, 4, 5, 6 -> 4 ops / 2 units = 2
Mem: used by 1, 3 -> 2 ops / 1 unit = 2
Br:  used by 7 -> 1 op / 1 unit = 1
ResMII = MAX(2, 2, 1) = 2

Concept: if there were no dependences between the operations, what is the shortest possible schedule?
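As a concrete companion to the calculation above, here is a small Python sketch that reproduces the MAX(2, 2, 1) = 2 result. It assumes each op occupies exactly one unit of one resource class for one cycle, which is a simplification for illustration.

import math

# ResMII sketch: each op occupies one unit of one resource class for one
# cycle (an assumed simplification); the counts match the slide.
resource_count = {"alu": 2, "mem": 1, "br": 1}
op_resource = {1: "mem", 2: "alu", 3: "mem", 4: "alu", 5: "alu", 6: "alu", 7: "br"}

def res_mii(op_resource, resource_count):
    usage = {r: 0 for r in resource_count}
    for res in op_resource.values():
        usage[res] += 1
    # for each class: ceil(ops needing it / units available)
    return max(math.ceil(usage[r] / resource_count[r]) for r in resource_count)

print(res_mii(op_resource, resource_count))   # -> 2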

Review: RecMII Example
Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

[Figure: dependence graph with (delay, distance) edge labels]

4 -> 4: 1 / 1 = 1
5 -> 5: 1 / 1 = 1
4 -> 1 -> 4: 1 / 1 = 1
5 -> 3 -> 5: 1 / 1 = 1
RecMII = MAX(1, 1, 1, 1) = 1

Then, MII = MAX(ResMII, RecMII) = MAX(2, 1) = 2

Concept: if there were infinite resources, what is the fewest number of cycles between initiation of successive iterations?
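A matching sketch for RecMII: each elementary cycle listed above is expressed as a list of (delay, distance) edge labels, and RecMII is the maximum over cycles of ceil(total delay / total distance). The zero-delay edges that close the 4 -> 1 -> 4 and 5 -> 3 -> 5 cycles are an illustrative assumption; the per-cycle ratios match the slide.

import math

# RecMII sketch: for each elementary cycle c in the dependence graph,
# RecMII >= ceil(sum of delays around c / sum of distances around c).
cycles = [
    [(1, 1)],            # 4 -> 4
    [(1, 1)],            # 5 -> 5
    [(1, 1), (0, 0)],    # 4 -> 1 -> 4 (zero-delay closing edge assumed)
    [(1, 1), (0, 0)],    # 5 -> 3 -> 5 (zero-delay closing edge assumed)
]

def rec_mii(cycles):
    return max(math.ceil(sum(d for d, _ in c) / sum(k for _, k in c))
               for c in cycles)

res_mii_value = 2                            # from the ResMII slide
mii = max(res_mii_value, rec_mii(cycles))
print(rec_mii(cycles), mii)                  # -> 1 2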

Review: Priority Function
Height-based priority worked well for acyclic scheduling, so it makes sense that it will work for loops as well.

Acyclic:
  Height(X) = 0, if X has no successors
            = MAX over all Y in succ(X) of (Height(Y) + Delay(X,Y)), otherwise

Cyclic:
  HeightR(X) = 0, if X has no successors
             = MAX over all Y in succ(X) of (HeightR(Y) + EffDelay(X,Y)), otherwise
  where EffDelay(X,Y) = Delay(X,Y) – II * Distance(X,Y)

Calculating Height
[Figure: 4-node dependence graph with (delay, distance) edge labels; dotted pseudo edges run from every node to the branch (node 4)]
1. Insert pseudo edges from all nodes to the branch with latency = 0, distance = 0 (dotted edges)
2. Compute II; for this example assume II = 2
3. HeightR(4) = 0
4. HeightR(3) = 0
   H(4) + EffDelay(3,4) = 0 + 0 – 0*II = 0
   H(2) + EffDelay(3,2) = 2 + 2 – 2*II = 0
   MAX(0, 0) = 0
5. HeightR(2) = 2
   H(3) + EffDelay(2,3) = 0 + 2 – 0*II = 2
   H(4) + EffDelay(2,4) = 0 + 0 – 0*II = 0
   MAX(2, 0) = 2
6. HeightR(1) = 5
   H(2) + EffDelay(1,2) = 2 + 3 – 0*II = 5
   H(4) + EffDelay(1,4) = 0 + 0 – 0*II = 0
   MAX(5, 0) = 5
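HeightR can also be computed by iterating the cyclic recurrence to a fixed point. The sketch below does this for the four-node graph on this slide. The edge delays (3 for 1->2, 2 for 2->3, 2 for the distance-2 back edge 3->2) are inferred from the arithmetic above rather than quoted from the figure, so treat them as assumptions; the pseudo edges to node 4 carry delay 0, distance 0. With II = 2 it reproduces the heights 5, 2, 0, 0. The iteration terminates only because II >= RecMII, so no cycle has positive gain.

# HeightR by fixed-point iteration of the cyclic recurrence
#   HeightR(x) = MAX over successors y of HeightR(y) + Delay(x,y) - II*Distance(x,y)
# Edges are (src, dst, delay, distance); delays are inferred from this slide.
II = 2
nodes = [1, 2, 3, 4]
edges = [
    (1, 2, 3, 0),
    (2, 3, 2, 0),
    (3, 2, 2, 2),                               # cross-iteration back edge
    (1, 4, 0, 0), (2, 4, 0, 0), (3, 4, 0, 0),   # pseudo edges to the branch
]

def height_r(nodes, edges, ii):
    h = {n: 0 for n in nodes}
    changed = True
    while changed:                  # terminates because II >= RecMII here
        changed = False
        for x, y, delay, dist in edges:
            cand = h[y] + delay - ii * dist
            if cand > h[x]:
                h[x], changed = cand, True
    return h

print(height_r(nodes, edges, II))   # -> {1: 5, 2: 2, 3: 0, 4: 0}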

The Scheduling Window
With cyclic scheduling, not all of an operation's predecessors may be scheduled yet, so a more flexible earliest schedule time is used:

  E(Y) = MAX over all X in pred(Y) of:
           0, if X is not scheduled
           MAX(0, SchedTime(X) + EffDelay(X,Y)), otherwise
  where EffDelay(X,Y) = Delay(X,Y) – II*Distance(X,Y)

  Latest schedule time: L(Y) = E(Y) + II – 1

Every II cycles a new loop iteration is initiated, so every II cycles the pattern repeats. Thus you only have to look in a window of size II: if the operation cannot be scheduled there, it cannot be scheduled at all.
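A small sketch of the window computation, using illustrative data: op 1 (the load, latency 2) is already placed at time 0 and op 2 is about to be scheduled, as in Step 8 of the upcoming example.

# Scheduling-window sketch for one op Y: E(Y) from its scheduled predecessors,
# L(Y) = E(Y) + II - 1.
II = 2
sched_time = {1: 0, 4: 0}              # ops already placed (op -> start cycle)
preds_of_op2 = [(1, 2, 0)]             # (predecessor X, delay, distance)

def eff_delay(delay, distance, ii):
    return delay - ii * distance

def earliest(preds, sched_time, ii):
    e = 0
    for x, delay, dist in preds:
        if x in sched_time:            # unscheduled predecessors contribute 0
            e = max(e, max(0, sched_time[x] + eff_delay(delay, dist, ii)))
    return e

E = earliest(preds_of_op2, sched_time, II)
L = E + II - 1
print(E, L)                            # -> 2 3, matching "Op2: E = 2, L = 3"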

Implementing Modulo Scheduling - Driver
  compute MII
  II = MII
  budget = BUDGET_RATIO * number of ops
  while (schedule is not found) do
    iterative_schedule(II, budget)
    II++

BUDGET_RATIO is a measure of the amount of backtracking that can be performed before giving up and trying a higher II.

Modulo Scheduling – Iterative Scheduler
  iterative_schedule(II, budget)
    compute op priorities
    while (there are unscheduled ops and budget > 0) do
      op = unscheduled op with the highest priority
      min = early time for op (E(op))
      max = min + II – 1
      t = find_slot(op, min, max)
      schedule op at time t
      /* Backtracking phase – undo previous scheduling decisions */
      unschedule all previously scheduled ops that conflict with op
      budget--

Modulo Scheduling – Find_slot
  find_slot(op, min, max)
    /* Successively try each time in the range */
    for (t = min to max) do
      if (op has no resource conflicts in MRT at t)
        return t
    /* Op cannot be scheduled in its specified range, */
    /* so schedule it anyway and displace all conflicting ops */
    if (op has never been scheduled or min > previous scheduled time of op)
      return min
    else
      return MIN(1 + prev scheduled time of op, max)
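The three pieces above (driver, iterative scheduler, find_slot) fit together as in the following Python sketch. It is a simplified rendering, not the course's reference implementation: the resource mix, latencies, and (delay, distance) dependence edges come from the running example, the loop-back branch is pinned at time II – 1 as in Step 5, and the "unschedule conflicting ops" step is approximated by displacing ops that collide in the MRT slot or whose incoming dependence from the newly placed op is violated. On the example it reports II = 2 with ops 1 and 4 at time 0, ops 5 and 7 at time 1, op 2 at time 2, and op 3 at time 5, matching Steps 5 through 10.

import math
from collections import defaultdict

RES   = {"alu": 2, "mem": 1, "br": 1}                 # units per resource class
OPS   = {1: "mem", 2: "alu", 3: "mem", 4: "alu", 5: "alu", 7: "br"}
EDGES = [(1, 2, 2, 0), (2, 3, 3, 0),                  # load -> mpy -> store
         (4, 1, 1, 1), (5, 3, 1, 1),                  # adds feed the next iteration
         (4, 4, 1, 1), (5, 5, 1, 1)]                  # self recurrences on r1, r2
BR_OP   = 7                                           # loop-back branch (brlc)
REC_MII = 1                                           # taken from the RecMII slide

def res_mii(ops, res):
    use = defaultdict(int)
    for r in ops.values():
        use[r] += 1
    return max(math.ceil(use[r] / res[r]) for r in use)

def heights(ops, edges, ii):                          # HeightR by fixed point
    h = {o: 0 for o in ops}
    changed = True
    while changed:
        changed = False
        for x, y, d, k in edges:
            if h[y] + d - ii * k > h[x]:
                h[x], changed = h[y] + d - ii * k, True
    return h

def find_slot(op, lo, hi, ops, res, mrt, ii, last):
    for t in range(lo, hi + 1):                       # try each time in the window
        if len(mrt[(ops[op], t % ii)]) < res[ops[op]]:
            return t
    # no conflict-free slot: place the op anyway; caller displaces conflicts
    if op not in last or lo > last[op]:
        return lo
    return min(last[op] + 1, hi)

def iterative_schedule(ops, edges, res, ii, budget):
    prio, last = heights(ops, edges, ii), {}
    sched = {BR_OP: ii - 1}                           # pin brlc at II - 1 (Step 5)
    mrt = defaultdict(set)
    mrt[(ops[BR_OP], (ii - 1) % ii)].add(BR_OP)
    while len(sched) < len(ops) and budget > 0:
        op = max((o for o in ops if o not in sched), key=lambda o: prio[o])
        lo = max([0] + [max(0, sched[x] + d - ii * k)
                        for x, y, d, k in edges if y == op and x in sched])
        t = find_slot(op, lo, lo + ii - 1, ops, res, mrt, ii, last)
        for o in list(sched):                         # backtracking: displace conflicts
            res_clash = (ops[o] == ops[op] and sched[o] % ii == t % ii
                         and len(mrt[(ops[op], t % ii)]) >= res[ops[op]])
            dep_clash = any(x == op and y == o and t + d - ii * k > sched[o]
                            for x, y, d, k in edges)
            if res_clash or dep_clash:
                mrt[(ops[o], sched[o] % ii)].discard(o)
                del sched[o]
        sched[op], last[op] = t, t
        mrt[(ops[op], t % ii)].add(op)
        budget -= 1
    return sched if len(sched) == len(ops) else None

def modulo_schedule(ops, edges, res, budget_ratio=6):
    ii = max(res_mii(ops, res), REC_MII)
    while True:                                       # driver: bump II until success
        sched = iterative_schedule(ops, edges, res, ii, budget_ratio * len(ops))
        if sched is not None:
            return ii, sched
        ii += 1

ii, sched = modulo_schedule(OPS, EDGES, RES)
print("II =", ii, sched)    # II = 2; ops 1, 4 at 0; 5, 7 at 1; 2 at 2; 3 at 5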

Modulo Scheduling Example
Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

for (j=0; j<100; j++)
  b[j] = a[j] * 26

Original loop:
Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

Step 1: convert the loop into a form that uses LC
LC = 99
Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
7: brlc Loop

Example – Step 2
Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

Step 2: DSA (dynamic single assignment) convert

Before:
LC = 99
Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
7: brlc Loop

After:
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Example – Step 3
Step 3: draw the dependence graph and calculate MII
[Figure: dependence graph of the DSA-converted loop with (delay, distance) edge labels]
RecMII = 1
ResMII = 2
MII = 2

Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Example – Step 4
Step 4: calculate priorities (MAX height to the pseudo stop node)
[Figure: dependence graph with (delay, distance) edge labels]

        Iter 1   Iter 2
1: H =    5        5
2: H =    3        3
3: H =    0        0
4: H =    0        4
5: H =    0        0
7: H =    0        0

(The height computation iterates until it reaches a fixed point; op 4's height rises to 4 in the second pass through its cross-iteration edge to op 1.)

Example – Step 5
Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 5: schedule brlc at time II – 1 = 1
MRT (II = 2): slot 0: empty | slot 1: br = op 7
Unrolled schedule so far: 7 @ 1

Example – Step 6
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 6: schedule the highest-priority op
Op 1: E = 0, L = 1; place at time 0 (0 % 2 = slot 0)
MRT (II = 2): slot 0: mem = op 1 | slot 1: br = op 7
Unrolled schedule so far: 1 @ 0, 7 @ 1

Example – Step 7
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 7: schedule the highest-priority op
Op 4: E = 0, L = 1; place at time 0 (0 % 2 = slot 0)
MRT (II = 2): slot 0: alu = op 4, mem = op 1 | slot 1: br = op 7
Unrolled schedule so far: 1 @ 0, 4 @ 0, 7 @ 1

Example – Step 8
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 8: schedule the highest-priority op
Op 2: E = 2, L = 3; place at time 2 (2 % 2 = slot 0)
MRT (II = 2): slot 0: alu = ops 4 and 2, mem = op 1 | slot 1: br = op 7
Unrolled schedule so far: 1 @ 0, 4 @ 0, 7 @ 1, 2 @ 2

Example – Step 9
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 9: schedule the highest-priority op
Op 3: E = 5, L = 6; place at time 5 (5 % 2 = slot 1)
MRT (II = 2): slot 0: alu = ops 4 and 2, mem = op 1 | slot 1: mem = op 3, br = op 7
Unrolled schedule so far: 1 @ 0, 4 @ 0, 7 @ 1, 2 @ 2, 3 @ 5

Example – Step 10
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 10: schedule the highest-priority op
Op 5: E = 0, L = 1; place at time 1 (1 % 2 = slot 1; both alu units at slot 0 are already taken by ops 4 and 2)
MRT (II = 2): slot 0: alu = ops 4 and 2, mem = op 1 | slot 1: alu = op 5, mem = op 3, br = op 7
Unrolled schedule: 1 @ 0, 4 @ 0, 5 @ 1, 7 @ 1, 2 @ 2, 3 @ 5

Example – Step 11
LC = 99
Loop:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
remap r1, r2, r3, r4
7: brlc Loop

Step 11: calculate ESC
SC = max unrolled schedule length / II = 6 / 2 = 3; ESC = SC – 1 = 2
Unrolled schedule time of the branch = rolled schedule time of br + (II * ESC) = 1 + 2*2 = 5
MRT (II = 2): slot 0: alu = ops 4 and 2, mem = op 1 | slot 1: alu = op 5, mem = op 3, br = op 7
Unrolled schedule: 1 @ 0, 4 @ 0, 5 @ 1, 7 @ 1, 2 @ 2, 3 @ 5
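The Step 11 arithmetic in a few lines of Python, using the unrolled start times produced in Steps 5 through 10:

import math

# Step 11 arithmetic for the example (op -> unrolled start cycle), II = 2.
II = 2
unrolled = {1: 0, 4: 0, 5: 1, 7: 1, 2: 2, 3: 5}
rolled_br_time = unrolled[7] % II                 # brlc sits at kernel cycle 1

sched_len = max(unrolled.values()) + 1            # last op issues at cycle 5
SC  = math.ceil(sched_len / II)                   # number of stages = 3
ESC = SC - 1                                      # epilog stage count = 2
br_time = rolled_br_time + II * ESC               # branch's unrolled time = 5
print(SC, ESC, br_time)                           # -> 3 2 5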

Example – Step 12
Finishing touches: sort the ops by kernel cycle, initialize ESC, attach the staging predicate (BRF-style branch), and initialize the staging predicate outside the loop.

LC = 99
ESC = 2
p1[0] = 1
Loop:
1: r3[-1] = load(r1[0])      if p1[0]   /* stage 1 */
2: r4[-1] = r3[-1] * 26      if p1[1]   /* stage 2 */
4: r1[-1] = r1[0] + 4        if p1[0]   /* stage 1 */
3: store (r2[0], r4[-1])     if p1[2]   /* stage 3 */
5: r2[-1] = r2[0] + 4        if p1[0]   /* stage 1 */
7: brlc Loop                 if p1[2]   /* stage 3 */

Staging predicate: each successive stage increments the index of the staging predicate by 1; stage 1 gets px[0] (here p1[0]).

Example – Dynamic Execution of the Code
LC = 99
ESC = 2
p1[0] = 1
Loop:
1: r3[-1] = load(r1[0])      if p1[0]
2: r4[-1] = r3[-1] * 26      if p1[1]
4: r1[-1] = r1[0] + 4        if p1[0]
3: store (r2[0], r4[-1])     if p1[2]
5: r2[-1] = r2[0] + 4        if p1[0]
7: brlc Loop                 if p1[2]

Dynamic execution (time: ops executed):
0: 1, 4
1: 5
2: 1, 2, 4
3: 5
4: 1, 2, 4
5: 3, 5, 7
6: 1, 2, 4
7: 3, 5, 7
…
98: 1, 2, 4
99: 3, 5, 7
100: 2
101: 3, 7
102: –
103: 3, 7

Homework Problem
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

for (j=0; j<100; j++)
  b[j] = a[j] * 26

LC = 99
Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
7: brlc Loop

- How many resources of each type are required to achieve an II = 1 schedule?
- If the resources are non-pipelined, how many resources of each type are required to achieve II = 1?
- Assuming pipelined resources, generate the II = 1 modulo schedule.

What if We Don't Have Hardware Support?
- No predicates
  - Predicates enable kernel-only code by selectively enabling/disabling operations to create the prolog/epilog
  - Without them, explicit prolog/epilog code segments must be created
- No rotating registers
  - Register names are not automatically changed each iteration
  - Must unroll the body of the software pipeline and explicitly rename (see the sketch below)
    - Consider each register lifetime i in the loop
    - Kmin = min unroll factor = MAX_i (ceiling((End_i – Start_i) / II))
    - Create Kmin static names to handle the maximum register lifetime
  - Apply modulo variable expansion
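A minimal sketch of the Kmin computation for modulo variable expansion. The lifetime endpoints below are illustrative values, not taken from the slides; the formula is the one in the list above.

import math

# Kmin sketch: a lifetime that spans more than II cycles would be overwritten
# by the next iteration, so the kernel is unrolled
#   Kmin = MAX_i ceil((End_i - Start_i) / II)
# times, with explicit renaming per copy. Lifetimes are illustrative.
II = 2
lifetimes = {
    "v1": (0, 2),
    "v2": (2, 5),
    "v3": (0, 1),
    "v4": (1, 5),
}

def kmin(lifetimes, ii):
    return max(math.ceil((end - start) / ii) for start, end in lifetimes.values())

print(kmin(lifetimes, II))   # -> 2: unroll the kernel twice and rename per copy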

No Predicates
[Figure: kernel-only code with rotating registers and predicates (II = 1) for a 5-stage pipeline A–E, compared with the same pipeline written as an explicit prolog (ramp-up), kernel, and epilog (ramp-down)]
Without predicates, explicit prolog and epilog code must be created, but no explicit renaming is needed because the rotating registers take care of that.

No Predicates and No Rotating Registers
Assume Kmin = 4 for this example.
[Figure: the kernel is unrolled 4 times with explicit renaming (A1 B1 C1 D1 E1, A2 B2 C2 D2 E2, A3 B3 C3 D3 E3, A4 B4 C4 D4 E4), surrounded by an explicit prolog (ramp-up) and epilog (ramp-down) that use the renamed copies]