Scheduling Algorithms

Presentation transcript:

Scheduling Algorithms 4541.633A SoC Design Automation School of EECS Seoul National University

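Unconstrained minimum-latency scheduling problem
Find $\varphi : V \to Z^+$ such that $\varphi(v_i) = t_i$, $t_i \ge t_j + d_j$ for all $i, j$ with $(v_j, v_i) \in E$, and $t_n$ is minimum.
Resource-constrained minimum-latency scheduling problem
In addition, $|\{v_i \mid \tau(v_i) = k \text{ and } t_i \le l < t_i + d_i\}| \le a_k$ for each operation type $k = 1, 2, \ldots, n_{res}$ and each schedule step $l = 1, 2, \ldots, t_n$.
[Figure: operation $v_i$ occupying the steps from $t_i$ up to $t_i + d_i$]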

Scheduling without Resource Constraints
Unconstrained scheduling is used in three situations:
Dedicated resources: operation types are all different, or resource cost is marginal.
Resource binding is already done: resource conflicts are resolved by serializing operations that share the same resource.
Bounding: unconstrained scheduling gives a lower bound on latency for constrained problems.

Scheduling without Resource Constraints
ASAP (As Soon As Possible) scheduling
[Figure: ASAP schedule of the example sequencing graph. C-step 1: v1, v2, v6, v8 (*), v10 (+). C-step 2: v3, v7 (*), v9 (+), v11 (<). C-step 3: v4 (-). C-step 4: v5 (-). v0 and vn are NOPs.]

Scheduling without Resource Constraints
ASAP scheduling algorithm
ASAP (G(V, E)) {
    schedule v0 by setting t0S = 1;
    repeat {
        select a vertex vj whose predecessors are all scheduled;
        schedule vj by setting tjS = max (tiS + di), (vi, vj) ∈ E;
    } until (vn is scheduled);
    return (TS);    -- TS = {t0S, t1S, ..., tnS}
}
Complexity: topological sorting --> O(|V| + |E|)
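The ASAP procedure above can be sketched in Python as a queue-based topological traversal. This is an illustrative sketch, not the course's implementation: the names `SUCC`, `DELAY`, and `asap` are hypothetical, the edge list is the example sequencing graph as read from the slides, and NOPs are assumed to take zero steps while every other operation takes one.

```python
from collections import deque

# Example sequencing graph as read from the slides (successor lists).
SUCC = {
    "v0": ["v1", "v2", "v6", "v8", "v10"],
    "v1": ["v3"], "v2": ["v3"], "v3": ["v4"], "v4": ["v5"],
    "v6": ["v7"], "v7": ["v5"], "v8": ["v9"], "v10": ["v11"],
    "v5": ["vn"], "v9": ["vn"], "v11": ["vn"], "vn": [],
}
# Assumption: NOPs take 0 steps, every other operation takes 1 step.
DELAY = {v: 0 if v in ("v0", "vn") else 1 for v in SUCC}


def asap(succ, delay, source="v0"):
    """Return the ASAP start step of every vertex; the source starts at step 1."""
    pending = {v: 0 for v in succ}      # number of unprocessed predecessors
    for v in succ:
        for w in succ[v]:
            pending[w] += 1
    start = {source: 1}
    ready = deque([source])
    while ready:
        v = ready.popleft()             # start[v] is final at this point
        for w in succ[v]:
            start[w] = max(start.get(w, 1), start[v] + delay[v])
            pending[w] -= 1
            if pending[w] == 0:         # all predecessors are scheduled
                ready.append(w)
    return start


t_asap = asap(SUCC, DELAY)
print(t_asap)
```

Each vertex and each edge is visited once, which matches the O(|V| + |E|) bound stated on the slide.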

Scheduling without Resource Constraints
ALAP (As Late As Possible) scheduling
[Figure: ALAP schedule of the example sequencing graph (v0 to vn) over C-steps 1 to 4]
Mobility: $m_i = t_i^L - t_i^S$

Scheduling without Resource Constraints
ALAP scheduling algorithm
ALAP (G(V, E), l’) {
    schedule vn by setting tnL = l’ + 1;
    repeat {
        select a vertex vi whose successors are all scheduled;
        schedule vi by setting tiL = min (tjL - di), (vi, vj) ∈ E;
    } until (v0 is scheduled);
    return (TL);    -- TL = {t0L, t1L, ..., tnL}
}
where l’ = tnS - t0S
Complexity: topological sorting --> O(|V| + |E|)
Mobility: mi = tiL - tiS
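A possible Python rendering of ALAP and of the mobility computation is sketched below. The names `SUCC`, `DELAY`, `T_ASAP`, and `alap` are hypothetical; the graph is the example sequencing graph as read from the slides, NOPs are assumed to have zero delay, and the ASAP start steps are supplied as input data rather than recomputed.

```python
# Example sequencing graph as read from the slides (successor lists).
SUCC = {
    "v0": ["v1", "v2", "v6", "v8", "v10"],
    "v1": ["v3"], "v2": ["v3"], "v3": ["v4"], "v4": ["v5"],
    "v6": ["v7"], "v7": ["v5"], "v8": ["v9"], "v10": ["v11"],
    "v5": ["vn"], "v9": ["vn"], "v11": ["vn"], "vn": [],
}
# Assumption: NOPs take 0 steps, every other operation takes 1 step.
DELAY = {v: 0 if v in ("v0", "vn") else 1 for v in SUCC}
# ASAP start steps of the same graph (assumed input for this sketch).
T_ASAP = {"v0": 1, "v1": 1, "v2": 1, "v3": 2, "v4": 3, "v5": 4, "v6": 1,
          "v7": 2, "v8": 1, "v9": 2, "v10": 1, "v11": 2, "vn": 5}


def alap(succ, delay, latency, sink="vn"):
    """Return the ALAP start step of every vertex for a given latency bound."""
    latest = {}

    def visit(v):
        if v not in latest:
            if v == sink:
                latest[v] = latency + 1
            else:
                # A vertex must finish before its earliest-starting successor.
                latest[v] = min(visit(w) for w in succ[v]) - delay[v]
        return latest[v]

    for v in succ:
        visit(v)
    return latest


latency = T_ASAP["vn"] - T_ASAP["v0"]          # l' = tnS - t0S
t_alap = alap(SUCC, DELAY, latency)
mobility = {v: t_alap[v] - T_ASAP[v] for v in SUCC}
print(mobility)
```

Operations with zero mobility lie on a critical path; in this example those are v1 to v5.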

Scheduling with Resource Constraints
Given resource constraints, find area/latency trade-off points.
Integer Linear Programming (ILP) model: C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, “A formal approach to the scheduling problem in high level synthesis,” IEEE Trans. on CAD, April 1991.
The ILP yields an exact solution, but the problem is NP-complete.

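Scheduling with Resource Constraints
Minimize $c^T t = [0\ 0\ \ldots\ 0\ 1]\,[t_0\ t_1\ \ldots\ t_n]^T = t_n = \sum_{l=t_n^S}^{t_n^L} l \cdot x_{nl}$
subject to
Unique start time: $\sum_{l=t_i^S}^{t_i^L} x_{il} = 1$, $i = 0, 1, \ldots, n$
Data dependency: $\sum_{l=t_i^S}^{t_i^L} l \cdot x_{il} - \sum_{l=t_j^S}^{t_j^L} l \cdot x_{jl} - d_j \ge 0$, $i, j = 0, 1, \ldots, n$, $(v_j, v_i) \in E$
Resource constraint: $\sum_{i:\tau(v_i)=k} \sum_{m=l-d_i+1}^{l} x_{im} \le a_k$, $k = 1, 2, \ldots, n_{res}$, $l = 1, 2, \ldots, l'+1$
$x_{il} \in \{0, 1\}$, $i = 0, 1, \ldots, n$, $l = 1, 2, \ldots, l'+1$
where
$x_{il} = 1$ if $v_i$ starts in step $l$, and 0 otherwise
$d_j$: execution delay of operation $j$
$\tau(v_i)$: resource type of operation $v_i$
$a_k$: resource constraint
$l'$: latency obtained by a heuristic algorithm
[Figure: an operation $v_i$ that starts at $t_i = l - d_i + 1$ is still executing in step $l$, which is why the inner sum of the resource constraint starts at $m = l - d_i + 1$]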

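Scheduling with Resource Constraints
Minimize area under latency constraint --> $a_k$ becomes a variable
Objective function: $c^T a = [area_1, area_2, \ldots, area_{n_{res}}]\,[a_1\ a_2\ \ldots\ a_{n_{res}}]^T$
Latency constraint (added): $\sum_{l=t_n^S}^{t_n^L} l \cdot x_{nl} \le l'+1$
This constraint is redundant with the unique-start-time constraints $\sum_{l=t_i^S}^{t_i^L} x_{il} = 1$, $i = 0, 1, \ldots, n$.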

Scheduling with Resource Constraints
Example 1
# mult = $a_1$ = 2, # ALU = $a_2$ = 2
$l'$ = 4, obtained by a heuristic (list scheduling) algorithm
[Figure: example sequencing graph with NOPs v0 and vn, multiplications v1, v2, v3, v6, v7, v8, and ALU operations v4, v5 (-), v9, v10 (+), v11 (<)]

Scheduling with Resource Constraints
Unique start time:
$x_{0,1} = 1$, $x_{1,1} = 1$, $x_{2,1} = 1$, $x_{3,2} = 1$, $x_{4,3} = 1$, $x_{5,4} = 1$
$x_{6,1} + x_{6,2} = 1$
$x_{7,2} + x_{7,3} = 1$
$x_{8,1} + x_{8,2} + x_{8,3} = 1$
$x_{9,2} + x_{9,3} + x_{9,4} = 1$
$x_{10,1} + x_{10,2} + x_{10,3} = 1$
$x_{11,2} + x_{11,3} + x_{11,4} = 1$
$x_{n,5} = 1$
Data dependency:
$2x_{7,2} + 3x_{7,3} - x_{6,1} - 2x_{6,2} - 1 \ge 0$
$2x_{9,2} + 3x_{9,3} + 4x_{9,4} - x_{8,1} - 2x_{8,2} - 3x_{8,3} - 1 \ge 0$
$2x_{11,2} + 3x_{11,3} + 4x_{11,4} - x_{10,1} - 2x_{10,2} - 3x_{10,3} - 1 \ge 0$
$4x_{5,4} - 2x_{7,2} - 3x_{7,3} - 1 \ge 0$
$5x_{n,5} - 2x_{9,2} - 3x_{9,3} - 4x_{9,4} - 1 \ge 0$
$5x_{n,5} - 2x_{11,2} - 3x_{11,3} - 4x_{11,4} - 1 \ge 0$
Resource constraint (multiplier):
$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$
$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$
$x_{7,3} + x_{8,3} \le 2$
Resource constraint (ALU):
$x_{10,1} \le 2$
$x_{9,2} + x_{10,2} + x_{11,2} \le 2$
$x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} \le 2$
$x_{5,4} + x_{9,4} + x_{11,4} \le 2$
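Because this instance is tiny, the constraint system can be checked by brute-force enumeration instead of an ILP solver. The sketch below is only an illustration under that substitution: it enumerates one start step per operation inside its time frame (which enforces the unique-start-time constraints by construction) and filters by the data-dependency and resource constraints. All names (`FIXED`, `FRAMES`, `feasible_schedules`, and so on) are hypothetical.

```python
from itertools import product

# Operations whose time frame is a single step (from the constraints above).
FIXED = {"v1": 1, "v2": 1, "v3": 2, "v4": 3, "v5": 4, "vn": 5}
# Operations with mobility > 0 and their candidate start steps.
FRAMES = {"v6": (1, 2), "v7": (2, 3), "v8": (1, 2, 3),
          "v9": (2, 3, 4), "v10": (1, 2, 3), "v11": (2, 3, 4)}
# Data dependencies that involve at least one movable operation.
EDGES = [("v6", "v7"), ("v8", "v9"), ("v10", "v11"),
         ("v7", "v5"), ("v9", "vn"), ("v11", "vn")]
MULT = ("v1", "v2", "v3", "v6", "v7", "v8")
ALU = ("v4", "v5", "v9", "v10", "v11")
LIMIT = {MULT: 2, ALU: 2}   # a1 = 2 multipliers, a2 = 2 ALUs


def feasible_schedules():
    """Enumerate all start-step assignments that satisfy every constraint."""
    free = list(FRAMES)
    found = []
    for steps in product(*(FRAMES[v] for v in free)):
        t = {**FIXED, **dict(zip(free, steps))}
        # Unit delays: t_succ - t_pred - 1 >= 0 for every edge.
        deps_ok = all(t[succ] - t[pred] - 1 >= 0 for pred, succ in EDGES)
        # At most a_k operations of each type may start in any step.
        res_ok = all(sum(t[v] == l for v in group) <= a_k
                     for group, a_k in LIMIT.items() for l in range(1, 5))
        if deps_ok and res_ok:
            found.append(t)
    return found


schedules = feasible_schedules()
for s in schedules:
    print({v: s[v] for v in FRAMES})
```

Since $x_{n,5} = 1$ is fixed, every feasible assignment has the same objective value, so any of them is optimal for this instance.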

Scheduling with Resource Constraints
[Figure: resulting schedule of the example sequencing graph (v0 to vn) over C-steps 1 to 4]

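Scheduling with Resource Constraints
Example 2 (minimize area under latency constraint)
$c^T a = [5, 1]\,[a_{mult}, a_{ALU}]^T$, $l'$ = 4
The fixed bounds of the previous example,
$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$
$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$
$x_{7,3} + x_{8,3} \le 2$
$x_{10,1} \le 2$
$x_{9,2} + x_{10,2} + x_{11,2} \le 2$
$x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} \le 2$
$x_{5,4} + x_{9,4} + x_{11,4} \le 2$
are replaced by constraints in which $a_1$ and $a_2$ are variables:
$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} - a_1 \le 0$
$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} - a_1 \le 0$
$x_{7,3} + x_{8,3} - a_1 \le 0$
$x_{10,1} - a_2 \le 0$
$x_{9,2} + x_{10,2} + x_{11,2} - a_2 \le 0$
$x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} - a_2 \le 0$
$x_{5,4} + x_{9,4} + x_{11,4} - a_2 \le 0$
Result: $a_1 = a_{mult} = 2$, $a_2 = a_{ALU} = 2$, the same as in the previous example.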

Scheduling with Resource Constraints
Heuristic scheduling algorithms: list scheduling and force-directed scheduling
List scheduling (resource-constrained minimum-latency)
Priority list: weight of the longest path to the sink
[Figure: example sequencing graph (v0 to vn) with each operation labelled by its priority, from 4 down to 1]

Scheduling with Resource Constraints
LIST_L (G(V, E), a) {
    l = 1;
    repeat {
        for each resource type k = 1, 2, ..., nres {
            Determine candidate operations Cl,k;
            Determine unfinished operations Ul,k;
            Select Sk ⊆ Cl,k vertices, such that |Sk| + |Ul,k| ≤ ak;
            Schedule the Sk operations at step l by setting ti = l ∀ i : vi ∈ Sk;
        }
        l = l + 1;
    } until (vn is scheduled);
    return (T);
}
[Figure: example sequencing graph (v0 to vn) with priorities 4 to 1]
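A minimal Python sketch of LIST_L follows. It is not taken from the slides: the names `PRED`, `RTYPE`, `PRIORITY`, and `list_schedule` are hypothetical, the NOPs v0 and vn are left out, the priorities are the longest-path labels of the example graph, and ties are broken by operation name, which is an arbitrary choice.

```python
# Example sequencing graph without the NOPs (predecessor lists).
PRED = {"v1": [], "v2": [], "v3": ["v1", "v2"], "v4": ["v3"],
        "v5": ["v4", "v7"], "v6": [], "v7": ["v6"], "v8": [],
        "v9": ["v8"], "v10": [], "v11": ["v10"]}
RTYPE = {v: "mult" if v in ("v1", "v2", "v3", "v6", "v7", "v8") else "alu"
         for v in PRED}
# Priority = weight of the longest path to the sink (labels of the example).
PRIORITY = {"v1": 4, "v2": 4, "v3": 3, "v4": 2, "v5": 1, "v6": 3,
            "v7": 2, "v8": 2, "v9": 1, "v10": 2, "v11": 1}


def list_schedule(pred, rtype, delay, priority, limit):
    """Resource-constrained list scheduling; returns the start step per op."""
    start, step = {}, 1
    while len(start) < len(pred):
        for k, a_k in limit.items():
            # U_{l,k}: operations of type k started earlier and still running.
            busy = sum(1 for v, t in start.items()
                       if rtype[v] == k and t <= step < t + delay[v])
            # C_{l,k}: unscheduled operations whose predecessors have finished.
            cand = [v for v in pred
                    if v not in start and rtype[v] == k
                    and all(p in start and start[p] + delay[p] <= step
                            for p in pred[v])]
            cand.sort(key=lambda v: (-priority[v], v))
            # S_k: highest-priority candidates with |S_k| + |U_{l,k}| <= a_k.
            for v in cand[:max(a_k - busy, 0)]:
                start[v] = step
        step += 1
    return start


# Unit delays, 2 multipliers and 2 ALUs.
unit = {v: 1 for v in PRED}
s1 = list_schedule(PRED, RTYPE, unit, PRIORITY, {"mult": 2, "alu": 2})
# Multiplier delay 2, ALU delay 1, 3 multipliers and 1 ALU.
multi = {v: 2 if RTYPE[v] == "mult" else 1 for v in PRED}
s2 = list_schedule(PRED, RTYPE, multi, PRIORITY, {"mult": 3, "alu": 1})
latency1 = max(s1[v] + unit[v] for v in PRED) - 1
latency2 = max(s2[v] + multi[v] for v in PRED) - 1
print(latency1, latency2)
```

The loop assumes every resource limit is at least one; otherwise an operation of that type would never be scheduled and the loop would not terminate.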

Scheduling with Resource Constraints
Example 1: $a_1$ = 2 mult, $a_2$ = 2 ALU
Priorities (longest path to sink): v1 = 4, v2 = 4, v3 = 3, v6 = 3, v4 = 2, v7 = 2, v8 = 2, v10 = 2, v5 = 1, v9 = 1, v11 = 1
Selected operations per step (mult | ALU):
C-step 1: {v1, v2} | {v10}
C-step 2: {v3, v6} | {v11}
C-step 3: {v7, v8} | {v4}
C-step 4: - | {v5, v9}
[Figure: prioritized sequencing graph and the resulting 4-step schedule]

Scheduling with Resource Constraints
Example 2: $a_1$ = 3 mult with mult delay = 2, $a_2$ = 1 ALU with ALU delay = 1
Selected operations per step (mult | ALU):
C-step 1: {v1, v2, v6} | {v10}
C-step 2: - | {v11}
C-step 3: {v3, v7, v8} | -
C-step 4: - | -
C-step 5: - | {v4}
C-step 6: - | {v5}
C-step 7: - | {v9}
[Figure: resulting 7-step schedule of the example sequencing graph]

Scheduling with Resource Constraints
List scheduling (latency-constrained minimum-resource)
Start with one resource per type (a = 1).
Use the slack computed from ALAP: $s_i = t_i^L - l$.
The lower the slack, the higher the urgency.
An operation with zero slack must be scheduled in the current step; if no resource is left for it, add a resource.

Scheduling with Resource Constraints
LIST_R (G(V, E), l’) {
    a = 1;
    Compute the latest possible start times TL by ALAP(G(V, E), l’);
    if (t0L < 0) return (∅);
    l = 1;
    repeat {
        for each resource type k = 1, 2, ..., nres {
            Determine candidate operations Cl,k;
            Compute the slacks {si = tiL - l, ∀ vi ∈ Cl,k};
            Schedule the candidate operations with zero slack and update a;
            Schedule the candidate operations requiring no additional resources;
        }
        l = l + 1;
    } until (vn is scheduled);
    return (T, a);
}
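LIST_R can be sketched along the same lines. The code below is an illustration under stated assumptions rather than the reference algorithm: the names `PRED`, `RTYPE`, `T_ALAP`, and `list_r` are hypothetical, the NOPs are omitted, the ALAP start steps for $l'$ = 4 are supplied as input and assumed feasible, and candidates are ordered by slack with ties broken by name.

```python
# Example sequencing graph without the NOPs (predecessor lists).
PRED = {"v1": [], "v2": [], "v3": ["v1", "v2"], "v4": ["v3"],
        "v5": ["v4", "v7"], "v6": [], "v7": ["v6"], "v8": [],
        "v9": ["v8"], "v10": [], "v11": ["v10"]}
RTYPE = {v: "mult" if v in ("v1", "v2", "v3", "v6", "v7", "v8") else "alu"
         for v in PRED}
DELAY = {v: 1 for v in PRED}                      # assumption: unit delays
# ALAP start steps for latency bound l' = 4 (assumed input for this sketch).
T_ALAP = {"v1": 1, "v2": 1, "v3": 2, "v4": 3, "v5": 4, "v6": 2,
          "v7": 3, "v8": 3, "v9": 4, "v10": 3, "v11": 4}


def list_r(pred, rtype, delay, latest, types):
    """Latency-constrained list scheduling; returns (start steps, resources)."""
    a = {k: 1 for k in types}                     # one resource per type
    start, step = {}, 1
    while len(start) < len(pred):
        for k in types:
            busy = sum(1 for v, t in start.items()
                       if rtype[v] == k and t <= step < t + delay[v])
            cand = [v for v in pred
                    if v not in start and rtype[v] == k
                    and all(p in start and start[p] + delay[p] <= step
                            for p in pred[v])]
            # Lowest slack first, so zero-slack operations lead the list.
            cand.sort(key=lambda v: (latest[v] - step, v))
            urgent = [v for v in cand if latest[v] - step <= 0]
            # Zero-slack operations must run now: add resources if needed.
            a[k] = max(a[k], busy + len(urgent))
            # Then fill the remaining free units without adding resources.
            for v in cand[:a[k] - busy]:
                start[v] = step
        step += 1
    return start, a


start, resources = list_r(PRED, RTYPE, DELAY, T_ALAP, ("mult", "alu"))
print(start, resources)
```

On the example graph this sketch ends with two multipliers and two ALUs, the second ALU being added in step 4 when v5 and v9 both reach zero slack.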

Scheduling with Resource Constraints
Example: initially $a = [1, 1]^T$
Selected operations per step (mult | ALU):
C-step 1: {v1, v2} (zero slack) ---> $a = [2, 1]^T$ | {v10}
C-step 2: {v3, v6} | {v11}
C-step 3: {v7, v8} | {v4}
C-step 4: - | {v5, v9} (zero slack) ---> $a = [2, 2]^T$
[Figure: resulting 4-step schedule of the example sequencing graph]

Scheduling with Resource Constraints
Force-directed scheduling: P. Paulin and J. Knight, “Force-directed scheduling for the behavioral synthesis of ASIC’s,” IEEE Trans. on CAD, June 1989.
Time frame: $[t_i^S, t_i^L]$, $i = 0, 1, \ldots, n$ ---> width = mobility + 1
Operation probability: $p_i(l)$ = 1/(width of time frame)
Type distribution: $q_k(l)$ = sum of the operation probabilities in step $l$ for the operations implementable by type $k$ --> distribution graph
[Figure: example sequencing graph with time frames and the distribution graph for the ALU over steps 1 to 4 (labelled values 1/3 and 5/3)]

Scheduling with Resource Constraints
Example: unit delay, latency bound = 4
ASAP, ALAP --> time frames
$p_1(1) = 1$, $p_1(2) = p_1(3) = p_1(4) = 0$
$p_6(1) = p_6(2) = 1/2$
$p_8(1) = p_8(2) = p_8(3) = 1/3$
$q_1(1) = 1 + 1 + 1/2 + 1/3 = 17/6$
[Figure: distribution graph for the multiplier over steps 1 to 4, with values 17/6, 7/3, and 5/6]
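The type distributions can be recomputed from the time frames with exact fractions. The sketch below assumes the time frames that ASAP and ALAP give for the example graph with unit delays and a latency bound of 4; the names `MULT_FRAMES`, `ALU_FRAMES`, and `distribution` are hypothetical.

```python
from fractions import Fraction

# Time frames [t_S, t_L] of the example graph (unit delay, latency bound 4).
MULT_FRAMES = {"v1": (1, 1), "v2": (1, 1), "v3": (2, 2),
               "v6": (1, 2), "v7": (2, 3), "v8": (1, 3)}
ALU_FRAMES = {"v4": (3, 3), "v5": (4, 4), "v9": (2, 4),
              "v10": (1, 3), "v11": (2, 4)}


def distribution(frames, steps=range(1, 5)):
    """Return q_k(l): the sum of operation probabilities in each step l."""
    q = {l: Fraction(0) for l in steps}
    for lo, hi in frames.values():
        p = Fraction(1, hi - lo + 1)    # uniform probability inside the frame
        for l in range(lo, hi + 1):
            q[l] += p
    return q


q_mult = distribution(MULT_FRAMES)
q_alu = distribution(ALU_FRAMES)
print(q_mult)
print(q_alu)
```

Each operation contributes a total probability of one, so the values of a distribution graph sum to the number of operations of that type.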

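Scheduling with Resource Constraints
Self-force, by analogy with a spring force $f = kx$, where the displacement $x$ is the change in operation probability:
$\text{self-force}(i, l) = \sum k \cdot x = \sum_m q_k(m)\,(\delta_{lm} - p_i(m)) = q_k(l) - \frac{\sum_m q_k(m)}{m_i + 1}$, $m = t_i^S, \ldots, t_i^L$
Here $\delta_{lm}$ is 1 if $l = m$ and 0 otherwise, $m_i$ is the mobility of $v_i$, and the second term is the average distribution over the time frame. The force is repelling.
[Figure: example sequencing graph and the multiplier distribution graph (17/6, 7/3, 5/6) with the time frame of v6 and its average distribution]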

Scheduling with Resource Constraints
Example: $q_1(1) = 17/6$, $q_1(2) = 7/3$
$\text{self-force}(6, 1) = 17/6\,(1 - 1/2) + 7/3\,(0 - 1/2) = 0.25$
$\text{self-force}(6, 2) = 17/6\,(0 - 1/2) + 7/3\,(1 - 1/2) = -0.25$
Predecessor/successor force: assigning an operation to a specific step may reduce the time frames of other operations because of dependency relations.
v6 to step 2 ---> v7 to step 3
$\text{self-force}(7, 3) = q_1(2)\,(0 - p_7(2)) + q_1(3)\,(1 - p_7(3)) = -0.75 = \text{successor-force}(6, 2)$
$\text{total-force}(6, 2) = -0.25 - 0.75 = -1$
[Figure: example sequencing graph (v0 to vn)]
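The arithmetic of this example can be reproduced with a few lines of Python. The multiplier distribution values are those given on the slides; the function name `self_force` and the variable names are hypothetical.

```python
from fractions import Fraction

# Multiplier distribution graph q_1(l) from the example.
Q_MULT = {1: Fraction(17, 6), 2: Fraction(7, 3), 3: Fraction(5, 6), 4: Fraction(0)}


def self_force(q, frame, l):
    """Sum over the time frame of q(m) * (delta_lm - p(m))."""
    lo, hi = frame
    p = Fraction(1, hi - lo + 1)        # operation probability inside the frame
    return sum(q[m] * ((1 if m == l else 0) - p) for m in range(lo, hi + 1))


force_6_1 = self_force(Q_MULT, (1, 2), 1)       # v6 has time frame [1, 2]
force_6_2 = self_force(Q_MULT, (1, 2), 2)
# Placing v6 in step 2 forces v7 (time frame [2, 3]) into step 3.
successor_6_2 = self_force(Q_MULT, (2, 3), 3)
total_6_2 = force_6_2 + successor_6_2
print(force_6_1, force_6_2, successor_6_2, total_6_2)
```

The negative total force for step 2 indicates that placing v6 there lowers the expected multiplier concurrency, so step 2 is preferred over step 1.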

Scheduling with Resource Constraints
$\text{ps-force}(i, l) = \sum_{j \in ps(i)} \left( \frac{\sum_{m'} q_k(m')}{m_j' + 1} - \frac{\sum_{m} q_k(m)}{m_j + 1} \right)$
where $m'$ ranges over $[t_j^{S\prime}, t_j^{L\prime}]$ ---> reduced time frame, and $m$ ranges over $[t_j^S, t_j^L]$ ---> initial time frame
Example: v8 to step 2 ---> v9 to step 3 or 4
$\text{ps-force}(8, 2) = 1/2\,(q_2(3) + q_2(4)) - 1/3\,(q_2(2) + q_2(3) + q_2(4)) \approx 0.3$
Compared to list scheduling, force-directed scheduling produces better results but takes longer.
[Figure: example sequencing graph (v0 to vn)]
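The predecessor/successor force of the example can be checked numerically. In the sketch below the ALU distribution is recomputed from the ALU time frames of the example graph (unit delay, latency bound 4), which is an assumption about the underlying data; the names `ALU_FRAMES`, `frame_average`, and `ps_force` are hypothetical.

```python
from fractions import Fraction

# ALU time frames of the example graph (unit delay, latency bound 4).
ALU_FRAMES = {"v4": (3, 3), "v5": (4, 4), "v9": (2, 4),
              "v10": (1, 3), "v11": (2, 4)}

# q_2(l): sum of operation probabilities of the ALU operations in step l.
q_alu = {l: Fraction(0) for l in range(1, 5)}
for lo, hi in ALU_FRAMES.values():
    for l in range(lo, hi + 1):
        q_alu[l] += Fraction(1, hi - lo + 1)


def frame_average(q, frame):
    """Average of the distribution q over a time frame [lo, hi]."""
    lo, hi = frame
    return sum(q[m] for m in range(lo, hi + 1)) / (hi - lo + 1)


def ps_force(q, initial, reduced):
    """Force on a neighbour whose frame shrinks from `initial` to `reduced`."""
    return frame_average(q, reduced) - frame_average(q, initial)


# Placing v8 in step 2 shrinks the frame of v9 from [2, 4] to [3, 4].
force_8_2 = ps_force(q_alu, (2, 4), (3, 4))
print(force_8_2, float(force_8_2))
```

With these frames the exact value is 5/18, which rounds to the 0.3 quoted in the example.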

Scheduling Graphs with Alternative Paths
Branching
$\sum_{i:\,\tau(v_i)=k \text{ and } C(v_i)=c}\ \sum_{m=l-d_i+1}^{l} x_{im} \le a_k$, $k = 1, \ldots, n_{res}$, $l = 1, \ldots, l'+1$, $c = 1, \ldots, n_c$
$C : V \to \{1, \ldots, n_c\}$ partitions $V$ into $n_c$ groups; operations in different groups are mutually exclusive.
Mutually exclusive operations can share a resource without affecting the performance.
[Figure: branch with TRUE and FALSE paths whose operations are mutually exclusive]

Scheduling Graphs with Alternative Paths
Example: assume path (v0, v8, v9, vn) is mutually exclusive with the remaining operations.
Original resource constraints:
$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} - a_1 \le 0$
$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} - a_1 \le 0$
$x_{7,3} + x_{8,3} - a_1 \le 0$
$x_{10,1} - a_2 \le 0$
$x_{9,2} + x_{10,2} + x_{11,2} - a_2 \le 0$
$x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} - a_2 \le 0$
$x_{5,4} + x_{9,4} + x_{11,4} - a_2 \le 0$
Constraints for the group {v8, v9}:
$x_{8,1} - a_1 \le 0$, $x_{8,2} - a_1 \le 0$, $x_{8,3} - a_1 \le 0$
$x_{9,2} - a_2 \le 0$, $x_{9,3} - a_2 \le 0$, $x_{9,4} - a_2 \le 0$
Constraints for the remaining operations:
$x_{1,1} + x_{2,1} + x_{6,1} - a_1 \le 0$
$x_{3,2} + x_{6,2} + x_{7,2} - a_1 \le 0$
$x_{7,3} - a_1 \le 0$
$x_{10,1} - a_2 \le 0$
$x_{10,2} + x_{11,2} - a_2 \le 0$
$x_{4,3} + x_{10,3} + x_{11,3} - a_2 \le 0$
$x_{5,4} + x_{11,4} - a_2 \le 0$
[Figure: example sequencing graph (v0 to vn)]

Scheduling Graphs with Alternative Paths
List scheduling + condition vector + branching probability
K. Wakabayashi and T. Yoshimura, “A resource sharing and control synthesis method for conditional branches,” ICCAD, Proceedings of the International Conference on Computer-Aided Design, 1989.
Example:
na;
if (c1) {
    nb;
    if (c2) nc;
    else {
        nd;
        if (c3) ne;
        else nf;
    }
} else ng;
Basic condition vectors: na [1,1,1,1], nb [1,1,1,0], nc [1,0,0,0], nd [0,1,1,0], ne [0,1,0,0], nf [0,0,1,0], ng [0,0,0,1]

Scheduling Graphs with Alternative Paths
Basic condition vector $v_i$:
one-hot encoding for a leaf node;
$v_k$ or $v_l$ or ... or $v_m$ for a non-leaf node, where $v_k, v_l, \ldots, v_m$ are the vectors of the immediate successors of $n_i$.
Extended condition vector $e_i$:
all ones for a predecessor of the sink node;
$v_i$ for a predecessor of a join node;
$e_k$ or $e_l$ or ... or $e_m$ in other cases, where $e_k, e_l, \ldots, e_m$ are the vectors of the immediate successors of $n_i$.
Actual condition vector $a_i = e_i$ or $e_l$ or ... or $e_m$, where $n_l, \ldots, n_m \in C_i$ and $C_i = \{n_j \mid n_j$ is a conditional node related to $n_i$, $n_j$ has not been scheduled yet, $e_i$ and $e_j \ne 0\}$.
[Figure: source, conditional node nj with branches [1,0] and [0,1], node ni with vector [1,1], sink]
Statically compute the extended condition vectors and use them as the priorities. Dynamically compute the actual condition vectors and use them to compute the number of resources used.
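The basic condition vectors of the nested-if example (na through ng) follow mechanically from the rule above: one-hot for a leaf, bitwise OR of the successors otherwise. The sketch below assumes the branch tree is given as a child map and that leaf positions follow depth-first order; `CHILDREN` and `basic_condition_vectors` are hypothetical names.

```python
# Branch tree of the nested-if example: node -> nodes reached on each branch.
CHILDREN = {"na": ["nb", "ng"], "nb": ["nc", "nd"], "nd": ["ne", "nf"]}


def basic_condition_vectors(children, root):
    """One-hot vector per leaf, bitwise OR of the successors otherwise."""
    leaves = []

    def collect(node):
        kids = children.get(node, [])
        if not kids:
            leaves.append(node)         # leaf order fixes the bit positions
        for child in kids:
            collect(child)

    collect(root)
    vectors = {}

    def build(node):
        kids = children.get(node, [])
        if not kids:
            vectors[node] = [int(leaf == node) for leaf in leaves]
        else:
            parts = [build(child) for child in kids]
            vectors[node] = [int(any(bits)) for bits in zip(*parts)]
        return vectors[node]

    build(root)
    return vectors


cv = basic_condition_vectors(CHILDREN, "na")
for name in sorted(cv):
    print(name, cv[name])
```

Two nodes are mutually exclusive exactly when the bitwise AND of their vectors is zero, for example nc and nd here.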

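Scheduling Graphs with Alternative Paths
if (a > b) then x = a + b else x = c + b;
[Figure: two implementations with inputs a, b, c and output x: two adders followed by a MUX controlled by the comparison, and a MUX on the operand followed by a single adder]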

Scheduling Graphs with Alternative Paths
Priority function: $pf(n_i) = (p_i, d_i) = \sum_j p_{ij} \cdot d_{ij}$
where $p_{ij}$ is the occurrence probability of leaf condition $j$, and $d_i$ is the sum of the extended condition vectors of all operation nodes on the path from the successors of $n_i$ to the sink.
$d_i$ is computed from the sink (when several paths merge, the largest components are adopted).
$d_{ij}$ = longest path length (no node on the path has 0 in the $j$-th position of the CV for non-zero $e_{ij}$). $e$ is used rather than $a$ because the schedule of conditional nodes is not known yet.
[Figure: graph from source to sink with nodes labelled by condition vectors [1,1], [1,0], and [0,1]]

Scheduling Graphs with Alternative Paths
Algorithm
(1) calculate $e_i$, $d_i$, $pf(n_i)$ for all nodes in set R; current c-step $l$ = 1
(2) move candidate nodes for c-step $l$ from R to set C
(3) from C, select $n_i$ with the largest $pf(n_i)$
(4) if $n_i$ is a successor of a join node, duplicate
(5) if the largest component of $s_{lk}$, $k = \tau(n_i)$, does not exceed the number of available functional units of type $k$, then assign $n_i$ to c-step $l$, where $s_{lk} = \sum a_{lk}$ over all nodes of type $k$ scheduled at c-step $l$; otherwise, put $n_i$ into R
(6) if C is not empty, go to (3)
(7) if R is not empty, $l = l + 1$ and go to (2)
(8) re-assign operation nodes (post processing for further optimization)
(9) synthesize control sequence

Scheduling Graphs with Alternative Paths
- Duplication of operation nodes (4): code lowering
[Figure: an addition with vector [1,1,1] on inputs a, b, c is duplicated into two additions with vectors [1,0,0] and [0,1,1], each producing x]
- Re-assignment of operation nodes (8): the number of zeros in $s_{all}$ increases by moving operations with mobility > 0
- Synthesis of control sequence (9): if some components of $s_{l,all}$ are zero, then c-step $l$ can be skipped for the corresponding branches
Duplication is used for (8) and (9). See the next slide.

Scheduling Graphs with Alternative Paths
[Figure: example of duplication and re-assignment on a graph of =, +, <, and - operations, listing the vectors $s_{all}$, $s_+$, and $s_-$ for each c-step before and after the transformation]

Scheduling Pipelined Circuits
Structural pipelining: pipelined resources
List scheduling can be extended to allow scheduling of overlapping operations on the same resource, provided they have different start times and no data dependency.
Example: pipelined mult (Wallace tree + CPA); with 3 pipelined mult and 1 ALU, latency 7 --> 6
ILP and FDS can also be extended.
[Figure: schedule of the example sequencing graph with pipelined multipliers over C-steps 1 to 7]

Scheduling Pipelined Circuits
Functional pipelining
Lower bound on the number of resources of type $k$: $a_k' = \lceil n_k / \delta_0 \rceil$, where $n_k$ is the number of operations of type $k$ and $\delta_0$ is the data introduction interval.
$\sum_{p=0}^{\lceil l'/\delta_0 \rceil - 1}\ \sum_{i:\tau(v_i)=k}\ \sum_{m=l-d_i+1+p\delta_0}^{l+p\delta_0} x_{im} \le a_k$, $k = 1, \ldots, n_{res}$, $l = 1, \ldots, \delta_0$
$\lceil l'/\delta_0 \rceil$: number of pipeline stages
Example: unit delay, $\delta_0 = 2$
$a_1' = \lceil 6/2 \rceil = 3$ mult, $a_2' = \lceil 5/2 \rceil = 3$ ALU
Try scheduling with $a = [3, 3]^T$, using ILP.
[Figure: pipelined schedule. Stage 1, C-step 1: v1, v2, v6, v10. Stage 1, C-step 2: v3, v7, v8, v11. Stage 2, C-step 1: v4, v9. Stage 2, C-step 2: v5.]
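The resource lower bound and the stage count are simple ceiling divisions, sketched below with integer arithmetic. The function names `ceil_div` and `min_resources` are hypothetical, and the operation counts are those of the example.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)


def min_resources(op_counts, interval):
    """Lower bound on resources per type for a data introduction interval."""
    return {k: ceil_div(n, interval) for k, n in op_counts.items()}


bounds = min_resources({"mult": 6, "alu": 5}, interval=2)
stages = ceil_div(4, 2)     # latency l' = 4 and interval 2
print(bounds, stages)
```

The bound only states what is necessary; whether a schedule with exactly these resources exists still has to be checked, for example with the ILP.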

Scheduling Pipelined Circuits
Heuristic scheduling can be extended to schedule pipelined circuits.
List scheduling: at each control step $l$, check the resource bound
$\sum_{p=0}^{\lceil l/\delta_0 \rceil - 1}\ \sum_{i:\tau(v_i)=k}\ \sum_{m=(l \bmod \delta_0)-d_i+1+p\delta_0}^{(l \bmod \delta_0)+p\delta_0} x_{im} \le a_k$
to determine schedulable candidates.
N. Park and A.C. Parker, "Sehwa: a software package for synthesis of pipelines from behavioral specifications," IEEE Trans. on Computer-Aided Design, Mar. 1988.
Force-directed scheduling: the computation of the type distribution must consider the actual operation concurrency across the control step boundaries.

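Scheduling Pipelined Circuits
Loop folding: pipeline the loop body
Data introduction interval: $\delta_l$
Loop execution delay $= (n_l - 1)\,\delta_l + \lceil \lambda_l / \delta_l \rceil\,\delta_l$, where $n_l$ is the number of iterations, $\lambda_l$ is the latency of the loop body, and $\lceil \lambda_l / \delta_l \rceil$ is the number of pipe stages.
[Figure: $n_l$ overlapped iterations of latency $\lambda_l$, each starting $\delta_l$ steps after the previous one]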

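Scheduling Pipelined Circuits
Example 1: $\lambda_l = 4$, $\delta_l = 2$, $n_l = 10$
Loop execution delay (without folding) $= n_l\,\lambda_l = 40$
Loop execution delay (with folding) $= (n_l - 1)\,\delta_l + \lceil \lambda_l / \delta_l \rceil\,\delta_l = 22$
[Figure: loop body with operations 1 to 5 between NOPs, before and after folding]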
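The two delay figures follow directly from the formulas. The snippet below is a small check of that arithmetic; the function names and parameter names are hypothetical.

```python
def delay_without_folding(iterations, body_latency):
    """Iterations run back to back: n_l * lambda_l."""
    return iterations * body_latency


def delay_with_folding(iterations, body_latency, interval):
    """(n_l - 1) * delta_l + ceil(lambda_l / delta_l) * delta_l."""
    stages = -(-body_latency // interval)   # integer ceiling division
    return (iterations - 1) * interval + stages * interval


plain = delay_without_folding(10, 4)
folded = delay_with_folding(10, 4, 2)
print(plain, folded)
```

When the interval divides the body latency, as here, the last term equals the body latency itself, so the folded delay is the start time of the last iteration plus one full pass through the body.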

Scheduling Pipelined Circuits
Example 2
s = 0;                          step 1
for i = 1 to 10 {
    p[i] = c[i] + in[i];        step 2
    s = s + p[i];               step 3
}
--->
p[1] = c[1] + in[1];            step 2
for i = 2 to 10 {
    s = s + p[i-1];
    p[i] = c[i] + in[i];        step 3
}
s = s + p[10];                  step 4
21 cycles --> 12 cycles
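The folded loop must compute the same sum as the original one. The sketch below runs both versions on arbitrary sample data and counts control steps under the assumption that each labelled step costs one cycle and that the two statements of the folded loop body share a single step; the initialization `s = 0` is assumed to remain as step 1 in the folded version. Function names and sample data are hypothetical, and indices are 0-based.

```python
def run_original(c, inp):
    """Original loop: two control steps per iteration."""
    s, cycles = 0, 1                    # step 1: s = 0
    p = [0] * len(c)
    for i in range(len(c)):
        p[i] = c[i] + inp[i]            # step 2
        cycles += 1
        s = s + p[i]                    # step 3
        cycles += 1
    return s, cycles


def run_folded(c, inp):
    """Folded loop: the accumulation of iteration i-1 overlaps the add of i."""
    n = len(c)
    s, cycles = 0, 1                    # step 1: s = 0
    p = [0] * n
    p[0] = c[0] + inp[0]                # step 2 (prologue)
    cycles += 1
    for i in range(1, n):
        s = s + p[i - 1]                # both statements share step 3
        p[i] = c[i] + inp[i]
        cycles += 1
    s = s + p[n - 1]                    # step 4 (epilogue)
    cycles += 1
    return s, cycles


c = list(range(1, 11))
inp = [2 * x for x in range(10)]
print(run_original(c, inp), run_folded(c, inp))
```

Under these assumptions the counts are 1 + 10 × 2 = 21 cycles for the original loop and 1 + 1 + 9 + 1 = 12 cycles for the folded one.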