Scheduling for Synthesis of Embedded Hardware EE202A (Fall 2004): Lecture #11 Note: Several slides in this lecture are from Prof. Miodrag Potkonjak, UCLA Computer Science
Reading List for This Lecture Recommended: R. A. Walker and S. Chaudhuri, “Introduction to the scheduling problem”, IEEE Design & Test of Computers (special issue on high-level synthesis), vol. 12, iss. 2, pp. 60-69, June 1995. http://ieeexplore.ieee.org/iel1/54/8746/00386007.pdf C-T. Hwang, J-H. Lee, and Y-C. Hsu, “A formal approach to the scheduling problem in high level synthesis”, IEEE Transactions on CAD, vol. 10, iss. 4, pp. 464-475, April 1991. http://ieeexplore.ieee.org/iel1/43/2519/00075629.pdf P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASICs”, IEEE Transactions on CAD, vol. 8, iss. 6, June 1989. http://ieeexplore.ieee.org/iel1/43/1364/00031522.pdf
HW-SW Design Flow System Specification High-Level Synthesis C/C++ C/VHDL/Verilog Behavioral Assembly Code Structural RTL System Specification [Gupta, UCSD]
High-Level Synthesis (HLS) Converts behavioral specification into structural register transfer level (RTL) description Example: GCD computation High-Level Synthesis [Ghosh, ICCAD ‘96]
Typical HLS System
High Level Synthesis Resource Allocation - How Much? Scheduling - When? Assignment - Where? Module Selection Template Matching & Operation Chaining Clock Selection Partitioning Transformations
Algorithm Description
CDFG
Precedence Graph
Sequence Graph: Start & End Nodes
Hierarchy in Sequence Graphs
Hierarchy in Sequence Graphs (contd.)
Hierarchy in Sequence Graphs (contd.)
Implementation
Timing Constraints Time measured in “cycles” or “control steps” (what problems does this discretization cause?) Max & min timing constraints
Constraint Graphs
Operations with Unknown Delays Unknown but bounded, e.g., conditionals, loops Unknown and unbounded, e.g., I/O operations, synchronization (completion indicated by a completion signal) Such operations are called “anchor nodes” Need to schedule relative to these anchors
Scheduling Under Timing Constraints Feasible constraint graph: timing constraints satisfied when the execution delays of all the anchors are zero Necessary for existence of a schedule Well-posed constraint graph: timing constraints satisfied for all values of execution delays Implies feasibility A feasible constraint graph is well-posed, or can be made well-posed, iff no cycles with unbounded weight exist
Ill-posed (a, b) vs. Well-posed (c) Timing Constraints
Allocation, Assignment, and Scheduling Techniques well understood and mature
Example: Scheduling, Allocation, and Assignment Control Step
Variants of HLS scheduling Unconstrained Scheduling (UCS) Unlimited HW resources, No latency constraints Time Constrained Scheduling (TCS) Given upper bound on schedule length Goal is to minimize total resource cost Resource Constrained Scheduling (RCS) Given maximum number of each resource type Goal is to minimize the schedule length Time & Resource Constrained Sched. (TRCS)
ASAP Scheduling Algorithm (Solves the Unconstrained Scheduling Problem)
ASAP Scheduling Example
ASAP Scheduling Example Dummy start node scheduled at time = 0 Sequence Graph ASAP Schedule
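The ASAP rule above can be sketched in Python: a minimal sketch assuming an acyclic successor-map graph, integer delays, and operations with no predecessors starting in step 1 (the dummy start node at step 0 is left implicit).

```python
def asap_schedule(succ, delay):
    """ASAP scheduling: start each operation as soon as all its
    predecessors have finished.  succ maps node -> list of successors;
    delay maps node -> execution delay in control steps.
    Assumes the graph is acyclic."""
    # Build predecessor lists from the successor map.
    pred = {v: [] for v in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    start = {}
    remaining = set(succ)
    # Repeatedly schedule any node whose predecessors are all scheduled.
    while remaining:
        for v in sorted(remaining):
            if all(p in start for p in pred[v]):
                # Earliest start = max over predecessors of (start + delay);
                # sources (no predecessors) start at step 1.
                start[v] = max((start[p] + delay[p] for p in pred[v]),
                               default=1)
                remaining.remove(v)
                break
    return start
```

For example, two independent operations feeding a third all get their earliest possible steps: the sources at step 1 and the consumer at step 2.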
ALAP Scheduling Algorithm (Solves the Unconstrained Scheduling Problem, but latency constraint is required for algo. to make sense)
ALAP Scheduling Example
ALAP Scheduling Example Dummy sink node scheduled at (latency constraint + 1) ALAP Schedule (latency constraint = 4) Sequence Graph
Observation about ALAP & ASAP Start time of an operation given by ASAP is the earliest possible through any scheduling algorithm For a given latency constraint, start time given by ALAP is the latest possible through any scheduling algorithm (ALAP start time – ASAP start time) denotes the mobility No priority is given to nodes on critical path Unimportant nodes may be scheduled ahead of critical nodes No problem if unlimited hardware. If limited resources, less critical nodes may block critical ones & yield poor schedules List scheduling techniques overcome this problem by utilizing a more global node selection criterion
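ALAP is the mirror image, working backward from the latency constraint; mobility then falls out as ALAP start minus ASAP start. A minimal sketch under the same representation as the ASAP case (successor map, integer delays, acyclic graph):

```python
def alap_schedule(succ, delay, latency):
    """ALAP scheduling: start each operation as late as possible while
    still meeting the latency constraint (schedule occupies steps
    1..latency).  Mobility of an op = ALAP start - ASAP start."""
    start = {}
    remaining = set(succ)
    # Repeatedly schedule any node whose successors are all scheduled.
    while remaining:
        for v in sorted(remaining):
            if all(s in start for s in succ[v]):
                # Latest start so every successor can still start on time;
                # sinks are pushed to the last step the latency bound allows.
                start[v] = min((start[s] - delay[v] for s in succ[v]),
                               default=latency - delay[v] + 1)
                remaining.remove(v)
                break
    return start
```

With a latency bound of 4, the three-node example from the ASAP sketch gets its sink at step 4 and both sources at step 3, giving each source a mobility of 2.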
List Scheduling Illustration Candidate list in each control step (CS) Operation(s) in shaded boxes are the ones selected for scheduling in the current CS Resource constraint: 1 ADD, 2 MUL Order of selecting candidate operations affects schedule quality
List Scheduling Algorithm Commonly used selection criteria: Nodes with least mobility picked first, Nodes with maximum number of successors picked first
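The list-scheduling loop can be sketched as follows (a sketch with hypothetical helper names; the priority map is supplied externally, e.g. mobility from ASAP/ALAP, with smaller values picked first):

```python
def list_schedule(succ, delay, optype, avail, priority):
    """Resource-constrained list scheduling.  At each control step,
    ready operations are taken in priority order (smallest first)
    until the avail[type] units of their resource type run out;
    the rest are deferred to a later step."""
    # Build predecessor lists from the successor map.
    pred = {v: [] for v in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    start, step = {}, 1
    while len(start) < len(succ):
        # Ready = unscheduled ops whose predecessors have all finished.
        ready = [v for v in succ if v not in start and
                 all(p in start and start[p] + delay[p] <= step
                     for p in pred[v])]
        used = {}
        for v in sorted(ready, key=priority.get):
            if used.get(optype[v], 0) < avail[optype[v]]:
                start[v] = step
                used[optype[v]] = used.get(optype[v], 0) + 1
        step += 1
    return start
```

With three independent unit-delay multiplies and only one multiplier, the priority order serializes them into steps 1, 2, and 3.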
Taxonomy of scheduling algorithms Scheduling is an NP-complete problem Optimal techniques: Integer Linear Programming (ILP) Heuristics, iterative-improvement based: e.g., simulated annealing Heuristics, constructive: e.g., force-directed scheduling, list scheduling If all resources are identical, the problem reduces to multiprocessor scheduling Minimum-latency multiprocessor scheduling is also NP-complete
Scheduling - Optimal Techniques Integer Linear Programming Branch and Bound
Integer Linear Programming Given: integer-valued m×n matrix A, vectors B = ( b1, b2, … , bm ) and C = ( c1, c2, … , cn ) Minimize: CᵀX Subject to: AX ≥ B, where X = ( x1, x2, … , xn ) is an integer-valued vector
ILP based scheduling RCS version: For a set of (dependent) operations {V0, V1, …, Vn}, given an upper bound ak on the # of available resources of type k, where k ∈ {1, …, nres}, and latency di of each operation Vi, find a schedule of minimum length that satisfies all resource and precedence constraints V0 denotes the dummy start node, Vn the dummy sink node Step 1: Run a heuristic, e.g., list scheduling, to obtain a possibly sub-optimal but achievable schedule length. Say it is λ̄ Step 2: Run ASAP & ALAP without resource constraints to get earliest & latest start times for each operation (used for pruning)
Integer Linear Programming For each computation dependency “ti has to be done before tj”, introduce a constraint: k·x1i + (k−1)·x2i + … + xki ≥ k·x1j + (k−1)·x2j + … + xkj + 1 (*) Minimize: y0 Subject to: x1i + x2i + … + xki = 1 for all 1 ≤ i ≤ n yj ≤ y0 for all 1 ≤ j ≤ k all computation dependency constraints of type (*)
An Example 6 computations 3 control steps c1 c2 c3 c4 c6 c5
An Example Introduce variables: xij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 6 yi = xi1 + xi2 + xi3 + xi4 + xi5 + xi6 for 1 ≤ i ≤ 3, and y0 Dependency constraints: e.g., execute c1 before c4: 3x11 + 2x21 + x31 ≥ 3x14 + 2x24 + x34 + 1 Execution constraints: x1j + x2j + x3j = 1 for 1 ≤ j ≤ 6
An Example Minimize: y0 Subject to: yi ≤ y0 for all 1 ≤ i ≤ 3 dependency constraints execution constraints One solution: y0 = 2 x11 = 1, x12 = 1, x23 = 1, x24 = 1, x35 = 1, x36 = 1. All other xij = 0
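The dependency encoding (*) can be sanity-checked by brute force: with k = 3 control steps, the weighted sum k·x1 + (k−1)·x2 + … + xk evaluates to k + 1 − (chosen step), so the inequality holds exactly when the first operation is scheduled strictly earlier than the second.

```python
# Brute-force check of the (*) encoding for k = 3 control steps.
k = 3

def weighted_sum(step):
    """Value of k*x1 + (k-1)*x2 + ... + xk when the operation is
    assigned to the given control step (one-hot x)."""
    x = [1 if s == step else 0 for s in range(1, k + 1)]
    return sum((k - idx) * xi for idx, xi in enumerate(x))

# "sum_i >= sum_j + 1" should hold iff i runs strictly before j.
for si in range(1, k + 1):
    for sj in range(1, k + 1):
        encodes_before = weighted_sum(si) >= weighted_sum(sj) + 1
        assert encodes_before == (si < sj)
```

This mirrors the c1-before-c4 instance on the next slide: an op in step 1 scores 3, one in step 2 scores 2, and 3 ≥ 2 + 1 holds.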
ILP Model of Scheduling For each operation Vi, for each control step j (1 ≤ j ≤ λ̄), define variable xij as: xij = 1, if computation Vi is executed in control step j xij = 0, otherwise Constraint 1: Start time is unique (i.e., each operation must be scheduled exactly once): Σj xij = 1 Start time and end time of operation Vi are given by Σj j·xij and Σj j·xij + di, respectively
ILP Model of Scheduling (contd.) Constraint 2: Sequencing relationships must be satisfied: for each dependency edge (Vi, Vj), Σl l·xjl ≥ Σl l·xil + di Constraint 3: Resource bounds must be met: since the upper bound on the # of resources of type k is ak, for every control step l, Σ{i : Vi of type k} Σ{m = l−di+1 … l} xim ≤ ak
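The three constraint classes can be checked directly on a candidate 0/1 assignment (a sketch of a constraint checker, not an ILP solver; the dictionary-of-dictionaries encoding x[i][j] = 1 iff operation i starts in step j is an illustrative choice):

```python
def check_ilp_constraints(x, delay, edges, optype, avail, steps):
    """Verify a 0/1 schedule matrix against the three ILP constraint
    classes; returns the recovered start times if all hold."""
    start = {}
    for i, row in x.items():
        # Constraint 1: each operation starts exactly once.
        assert sum(row.values()) == 1
        start[i] = sum(j * xij for j, xij in row.items())
    # Constraint 2: sequencing -- t_j >= t_i + d_i for each edge (i, j).
    for i, j in edges:
        assert start[j] >= start[i] + delay[i]
    # Constraint 3: resource bounds -- at most avail[k] type-k operations
    # executing (started but not yet finished) in any control step l.
    for l in range(1, steps + 1):
        used = {}
        for i in x:
            if start[i] <= l < start[i] + delay[i]:
                used[optype[i]] = used.get(optype[i], 0) + 1
        for kind, n in used.items():
            assert n <= avail[kind]
    return start
```

For instance, two dependent unit-delay multiplies sharing one multiplier pass the check when placed in steps 1 and 2.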
Minimum-latency Scheduling Under Resource-constraints Let t be the vector whose entries are the operation start times, and let c = [0, 0, …, 1]ᵀ The formal ILP model is then: minimize cᵀt (i.e., the start time of the sink, which is the latency) subject to the uniqueness, sequencing, and resource constraints above
Example Two types of resources: Multiplier (2 available) ALU (2 available), performing addition, subtraction, and comparison Both take 1 cycle execution time
Example (contd.) Heuristic (list scheduling) gives latency = 4 steps Use ALAP and ASAP (with no resource constraints) to get bounds on start times ASAP matches latency of heuristic So heuristic is optimum, but let us ignore it! Constraints?
Example (contd.) Start time is unique: Σj xij = 1 for every operation Vi
Example (contd.) Sequencing constraints Note: only the non-trivial ones (those where at least one operation has more than one possible start time) are listed
Example (contd.) Resource constraints
Example (contd.) Consider c = [0, 0, …, 1]ᵀ: minimum-latency schedule Since the sink has no mobility (xn,5 = 1), any feasible schedule is optimum Consider c = [1, 1, …, 1]ᵀ: finds the earliest start times for all operations Equivalently, it minimizes Σi Σj j·xij, the sum of the operation start times
Example: Optimum Schedule Under Resource Constraints
Example (contd.) Extension to TCS version: Assume a multiplier costs 5 units of area and an ALU costs 1 unit of area λ̄ now becomes the (given) schedule length bound Same uniqueness and sequencing constraints as before Resource constraints are now in terms of the unknown variables a1 and a2 a1 = # of multipliers a2 = # of ALUs
Example (contd.) Resource constraints
Example Solution Minimize cᵀa = 5·a1 + 1·a2 Solution with cost 12
Extensions ILP formulation can be extended to consider: Operation chaining Functional pipelining and other transformations See recommended reading paper for details
Precedence-constrained Multiprocessor Scheduling All operations done by the same type of resource NP-complete even if all operations have unit delay
Scheduling - Iterative Improvement Kernighan - Lin (deterministic) Simulated Annealing Lottery Iterative Improvement Neural Networks Genetic Algorithms Taboo Search
Scheduling - Constructive Techniques Most Constrained Least Constraining
Force Directed Scheduling Goal is to reduce hardware by balancing concurrency Iterative algorithm, one operation scheduled per iteration Information (i.e. speed & area) fed back into scheduler
The Force Directed Scheduling Algorithm
Step 1 Determine ASAP and ALAP schedules (diagram: the example DFG with multiply, subtract, add, and compare operations)
Step 2 Determine the Time Frame of each op Length of box ~ possible execution cycles Width of box ~ probability of assignment Uniform distribution, area assigned = 1 (diagram: time frames across C-steps 1–4; ops with 2-step and 3-step frames have per-step probabilities 1/2 and 1/3)
Step 3 Create Distribution Graphs Sum of probabilities of each Op type in each C-step Indicates concurrency of similar Ops DG(i) = Σ Prob(Op, i), summed over all ops of the given type (diagrams: DG for Multiply; DG for Add, Sub, Comp over C-steps 1–4)
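The DG construction can be sketched directly from the time frames. The op names m1…m8 below are hypothetical labels chosen to resemble the lecture's differential-equation example; each op spreads a uniform probability over its frame.

```python
def distribution_graph(frames, steps):
    """Build a distribution graph from operation time frames.
    frames maps op -> (asap, alap); each op contributes a uniform
    probability 1/(alap - asap + 1) to every step in its frame."""
    dg = {s: 0.0 for s in range(1, steps + 1)}
    for op, (asap, alap) in frames.items():
        p = 1.0 / (alap - asap + 1)
        for s in range(asap, alap + 1):
            dg[s] += p
    return dg
```

With two multiplies fixed in step 1, one fixed in step 2, and frames (1,2), (2,3), (1,3) for the rest, DG(1) ≈ 2.833 and DG(2) ≈ 2.333, matching the values used in the self-force example later.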
Diff Eq Example: Precedence Graph Recalled
Diff Eq Example: Time Frame & Probability Calculation
Diff Eq Example: DG Calculation
Conditional Statements Operations in different branches are mutually exclusive Operations of the same type can be overlapped onto the DG Probability of the most likely operation is added to the DG (diagram: fork/join branches whose add operations are overlapped on the Add DG)
Self Forces Scheduling an operation will affect overall concurrency Every operation has a 'self force' for every C-step of its time frame Analogous to the force of a spring: f = Kx A desirable scheduling will have negative self force Will achieve better concurrency (lower potential energy) Force(i) = DG(i) * x(i) DG(i) ~ current Distribution Graph value x(i) ~ change in the operation’s probability in C-step i Self Force(j) = Σi [Force(i)], summed over the C-steps i in the operation’s time frame
Example Attempt to schedule the multiply in C-step 1 Self Force(1) = Force(1) + Force(2) = ( DG(1) * x(1) ) + ( DG(2) * x(2) ) = [2.833*(0.5) + 2.333*(-0.5)] = +0.25 This is positive, so scheduling the multiply in the first C-step would be bad (diagram: DG for Multiply and the time frames with probabilities 1/2 and 1/3)
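The arithmetic can be checked in a few lines (DG values 2.833 and 2.333 are taken from the worked example; x(i) is the probability change when the op, with a two-step frame, is forced into step 1):

```python
# Self force for forcing the example multiply into C-step 1.
dg = {1: 2.833, 2: 2.333}     # current DG values (from the example)
x = {1: +0.5, 2: -0.5}        # probability change: 0.5 -> 1 in step 1,
                              # 0.5 -> 0 in step 2
self_force = sum(dg[s] * x[s] for s in dg)   # 2.833*0.5 - 2.333*0.5
# Positive (+0.25), so this assignment would worsen concurrency.
```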
Diff Eq. Example: Self Force for Node 4
Predecessor & Successor Forces Scheduling an operation may affect the time frames of other linked operations This may negate the benefits of the desired assignment Predecessor/Successor Forces = sum of the self forces of any implicitly scheduled operations
Diff Eq Example: Successor Force on Node 4 If node 4 is scheduled in step 1, there is no effect on the time frame of its successor node 8 Total force = Force4(1) = +0.25 If node 4 is scheduled in step 2, node 8 is forced into step 3, so the successor force must be calculated
Diff Eq Example: Final Time Frame and Schedule
Diff Eq Example: Final DG
Lookahead Temporarily modify the constant DG(i) to include the effect of the iteration being considered: Force(i) = temp_DG(i) * x(i), where temp_DG(i) = DG(i) + x(i)/3 Consider the previous example: Self Force(1) = (DG(1) + x(1)/3)·x(1) + (DG(2) + x(2)/3)·x(2) = 0.5·(2.833 + 0.5/3) − 0.5·(2.333 − 0.5/3) = +0.41667 This is even worse than before
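With the lookahead term folded in, the same example works out as follows (same assumed DG values as in the earlier self-force snippet):

```python
# Lookahead variant: fold the probability change into the DG before
# computing the force: Force(i) = (DG(i) + x(i)/3) * x(i).
dg = {1: 2.833, 2: 2.333}
x = {1: +0.5, 2: -0.5}
force = sum((dg[s] + x[s] / 3) * x[s] for s in dg)
# ~ +0.417: still positive, i.e., even worse than the +0.25 without lookahead.
```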
Minimization of Bus Costs Basic algorithm suitable for a narrow class of problems Algorithm can be refined to consider “cost” factors Number of buses ~ number of concurrent data transfers Number of buses = maximum transfers in any C-step Create a modified DG to include transfers: Transfer DG Trans DG(i) = Σ [Prob(op, i) * Opn_No_InOuts] Opn_No_InOuts ~ combined distinct in/outputs of the Op Calculate Force with this DG and add it to the Self Force
Minimization of Register Costs Minimum registers required is given by the largest number of data arcs crossing a C-step boundary Create storage operations at the output of any operation that transfers a value to a destination in a later C-step Generate a Storage DG for these “operations” Length of a storage operation depends on the final schedule (diagram: storage distribution for a value S, showing its ASAP, ALAP, and MAX lifetimes)
Minimization of Register Costs (contd.) [avg life] = (ASAP life + ALAP life + MAX life) / 3 storage DG(i) is obtained by spreading the storage probability over the possible lifetime (one expression when the ASAP and ALAP lifetimes do not overlap, another when they do) Calculate and add the “Storage” Force to the Self Force (example: the ASAP schedule needs a minimum of 7 registers; the force-directed schedule needs 5)
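The register lower bound ("largest number of data arcs crossing a C-step boundary") can be sketched directly; the schedule/uses representation below is a hypothetical encoding chosen for illustration.

```python
def min_registers(schedule, uses, end_step):
    """Lower bound on registers for a fixed schedule: the largest
    number of values live across any C-step boundary.  schedule maps
    op -> C-step producing its result; uses maps op -> list of
    C-steps in which that result is consumed."""
    best = 0
    for boundary in range(1, end_step):
        # A value crosses this boundary if it is produced at or before
        # the boundary and still consumed after it.
        live = sum(1 for op, s in schedule.items()
                   if s <= boundary and any(u > boundary for u in uses[op]))
        best = max(best, live)
    return best
```

For example, two step-1 results consumed in steps 2 and 3, plus a step-2 result consumed in step 3, never have more than two values crossing a boundary, so two registers suffice as a lower bound.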
Pipelining Functional Pipelining: pipelining across multiple operations Must balance the distribution across groups of concurrent C-steps Cut the DG horizontally and superimpose Finally perform regular force-directed scheduling (diagram: two overlapped instances of the schedule, steps 1′–4′ superimposed on 1–4, with the resulting DG for Multiply) Structural Pipelining: pipelining within an operation For non-data-dependent operations, only the first C-step need be considered (diagram: a pipelined multiplier occupying successive C-steps)
Other Optimizations Local timing constraints: insert dummy timing operations → restricted time frames Multiclass FUs: create a multiclass DG by summing the probabilities of the relevant ops Multistep/chained operations: carry propagation delay information with the operation, extending time frames into other C-steps as required Hardware constraints: use Force as the priority function in list scheduling algorithms
Scheduling using Simulated Annealing Reference: S. Devadas and A. R. Newton, “Algorithms for hardware allocation in data path synthesis”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 7, pp. 768-781, July 1989.
Simulated Annealing Local search over a solution space with a cost function — but how do we escape local minima?
Statistical Mechanics Combinatorial Optimization State {ri} (configuration: a set of atomic positions) has weight e^(−E({ri})/kB·T) — the Boltzmann distribution E({ri}): energy of the configuration kB: Boltzmann constant T: temperature In the low temperature limit, only minimum-energy (ground state) configurations have significant weight
Analogy Physical System ↔ Optimization Problem: State (configuration) ↔ Solution Energy ↔ Cost Function Ground State ↔ Optimal Solution Rapid Quenching ↔ Iterative Improvement Careful Annealing ↔ Simulated Annealing
Generic Simulated Annealing Algorithm 1. Get an initial solution S 2. Get an initial temperature T > 0 3. While not yet 'frozen' do the following: 3.1 For 1 ≤ i ≤ L, do the following: 3.1.1 Pick a random neighbor S′ of S 3.1.2 Let Δ = cost(S′) − cost(S) 3.1.3 If Δ ≤ 0 (downhill move), set S = S′ 3.1.4 If Δ > 0 (uphill move), set S = S′ with probability e^(−Δ/T) 3.2 Set T = r·T (reduce temperature) 4. Return S
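The generic loop translates almost line-for-line into Python. A sketch: the temperature parameters t0, r, L and the freezing threshold t_min are illustrative choices, not values from the slide, and the best-seen solution is tracked as a common practical addition.

```python
import math
import random

def anneal(cost, neighbor, s0, t0=10.0, r=0.9, L=50, t_min=0.01, seed=0):
    """Generic simulated annealing: L moves per temperature, accept
    uphill moves with probability exp(-delta/T), cool by T = r*T
    until 'frozen' (T < t_min)."""
    rng = random.Random(seed)          # seeded for reproducibility
    s, t = s0, t0
    best = s
    while t > t_min:
        for _ in range(L):
            s2 = neighbor(s, rng)
            delta = cost(s2) - cost(s)
            # Downhill moves always accepted; uphill with prob e^(-delta/T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                s = s2
            if cost(s) < cost(best):
                best = s
        t *= r
    return best
```

For example, minimizing (x − 3)² over the integers with a ±1 neighbor move starting from x = 20 settles on x = 3.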
Basic Ingredients for S.A. Solution Space Neighborhood Structure Cost Function Annealing Schedule
Observation All scheduling algorithms we have discussed so far are critical path schedulers They can only generate schedules for iteration period larger than or equal to the critical path They only exploit concurrency within a single iteration, and only utilize the intra-iteration precedence constraints
Example Can one do better than an iteration period of 4? Pipelining + retiming can reduce the critical path to 3, and also the # of functional units Approaches: Transformations followed by scheduling Transformations integrated with scheduling
Conclusions High Level Synthesis Connects Behavioral Description and Structural Description Scheduling is a key step Estimations, Transformations are others High Level of Abstraction, High Impact on the Final Design