Scheduling for Synthesis of Embedded Hardware EE202A (Fall 2004): Lecture #11 Note: Several slides in this lecture are from Prof. Miodrag Potkonjak, UCLA Computer Science
Reading List for This Lecture Recommended: R. A. Walker and S. Chaudhuri, “Introduction to the scheduling problem”, IEEE Design & Test of Computers (special issue on high-level synthesis), vol. 12, iss. 2, pp. 60-69, June 1995. http://ieeexplore.ieee.org/iel1/54/8746/00386007.pdf C-T. Hwang, J-H. Lee, and Y-C. Hsu, “A formal approach to the scheduling problem in high level synthesis”, IEEE Transactions on CAD, vol. 10, iss. 4, pp. 464-475, April 1991. http://ieeexplore.ieee.org/iel1/43/2519/00075629.pdf P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASICs”, IEEE Transactions on CAD, vol. 8, iss. 6, June 1989. http://ieeexplore.ieee.org/iel1/43/1364/00031522.pdf
HW-SW Design Flow System Specification High-Level Synthesis C/C++ C/VHDL/Verilog Behavioral Assembly Code Structural RTL System Specification [Gupta, UCSD]
High-Level Synthesis (HLS) Converts behavioral specification into structural register transfer level (RTL) description Example: GCD computation High-Level Synthesis [Ghosh, ICCAD ‘96]
Typical HLS System
High Level Synthesis Resource Allocation - How Much? Scheduling - When? Assignment - Where? Module Selection Template Matching & Operation Chaining Clock Selection Partitioning Transformations
Algorithm Description
CDFG
Precedence Graph
Sequence Graph: Start & End Nodes
Hierarchy in Sequence Graphs
Hierarchy in Sequence Graphs (contd.)
Hierarchy in Sequence Graphs (contd.)
Implementation
Timing Constraints Time measured in “cycles” or “control steps” (what problems does this discretization cause?) Max & min timing constraints
Constraint Graphs
Operations with Unknown Delays Unknown but bounded, e.g., conditionals, loops Unknown and unbounded, e.g., I/O operations, synchronization (completion indicated by a completion signal) Such operations are called “anchor nodes” Need to schedule relative to these anchors
Scheduling Under Timing Constraints Feasible constraint graph: timing constraints satisfied when the execution delays of all the anchors are zero Necessary for existence of a schedule Well-posed constraint graph: timing constraints satisfied for all values of execution delays Implies feasibility A feasible constraint graph is well-posed, or can be made well-posed, iff no cycles with unbounded weight exist
Ill-posed (a, b) vs. Well-posed (c) Timing Constraints
Allocation, Assignment, and Scheduling Techniques well understood and mature
Example: Scheduling, Allocation, and Assignment Control Step
Variants of HLS scheduling Unconstrained Scheduling (UCS) Unlimited HW resources, No latency constraints Time Constrained Scheduling (TCS) Given upper bound on schedule length Goal is to minimize total resource cost Resource Constrained Scheduling (RCS) Given maximum number of each resource type Goal is to minimize the schedule length Time & Resource Constrained Sched. (TRCS)
ASAP Scheduling Algorithm (Solves the Unconstrained Scheduling Problem)
ASAP Scheduling Example
ASAP Scheduling Example Dummy start node scheduled at time = 0 Sequence Graph ASAP Schedule
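The ASAP rule above can be sketched in Python: a minimal sketch assuming an acyclic successor-map graph, integer delays, and operations with no predecessors starting in step 1 (the dummy start node at step 0 is left implicit).

```python
def asap_schedule(succ, delay):
    """ASAP scheduling: start each operation as soon as all its
    predecessors have finished.  succ maps node -> list of successors;
    delay maps node -> execution delay in control steps.
    Assumes the graph is acyclic."""
    # Build predecessor lists from the successor map.
    pred = {v: [] for v in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    start = {}
    remaining = set(succ)
    # Repeatedly schedule any node whose predecessors are all scheduled.
    while remaining:
        for v in sorted(remaining):
            if all(p in start for p in pred[v]):
                # Earliest start = max over predecessors of (start + delay);
                # sources (no predecessors) start at step 1.
                start[v] = max((start[p] + delay[p] for p in pred[v]),
                               default=1)
                remaining.remove(v)
                break
    return start
```

For example, two independent operations feeding a third all get their earliest possible steps: the sources at step 1 and the consumer at step 2.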
ALAP Scheduling Algorithm (Solves the Unconstrained Scheduling Problem, but latency constraint is required for algo. to make sense)
ALAP Scheduling Example
ALAP Scheduling Example Dummy sink node scheduled at (latency constraint + 1) ALAP Schedule (latency constraint = 4) Sequence Graph
Observation about ALAP & ASAP Start time of an operation given by ASAP is the earliest possible through any scheduling algorithm For a given latency constraint, start time given by ALAP is the latest possible through any scheduling algorithm (ALAP start time – ASAP start time) denotes the mobility No priority is given to nodes on critical path Unimportant nodes may be scheduled ahead of critical nodes No problem if unlimited hardware. If limited resources, less critical nodes may block critical ones & yield poor schedules List scheduling techniques overcome this problem by utilizing a more global node selection criterion
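ALAP is the mirror image, working backward from the latency constraint; mobility then falls out as ALAP start minus ASAP start. A minimal sketch under the same representation as the ASAP case (successor map, integer delays, acyclic graph):

```python
def alap_schedule(succ, delay, latency):
    """ALAP scheduling: start each operation as late as possible while
    still meeting the latency constraint (schedule occupies steps
    1..latency).  Mobility of an op = ALAP start - ASAP start."""
    start = {}
    remaining = set(succ)
    # Repeatedly schedule any node whose successors are all scheduled.
    while remaining:
        for v in sorted(remaining):
            if all(s in start for s in succ[v]):
                # Latest start so every successor can still start on time;
                # sinks are pushed to the last step the latency bound allows.
                start[v] = min((start[s] - delay[v] for s in succ[v]),
                               default=latency - delay[v] + 1)
                remaining.remove(v)
                break
    return start
```

With a latency bound of 4, the three-node example from the ASAP sketch gets its sink at step 4 and both sources at step 3, giving each source a mobility of 2.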
List Scheduling Illustration Candidate list in each control step (CS) Operation(s) in shaded boxes are the ones selected for scheduling in the current CS Resource constraint: 1 ADD, 2 MUL Order of selecting candidate operations affects schedule quality
List Scheduling Algorithm Commonly used selection criteria: Nodes with least mobility picked first, Nodes with maximum number of successors picked first
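The list-scheduling loop can be sketched as follows (a sketch with hypothetical helper names; the priority map is supplied externally, e.g. mobility from ASAP/ALAP, with smaller values picked first):

```python
def list_schedule(succ, delay, optype, avail, priority):
    """Resource-constrained list scheduling.  At each control step,
    ready operations are taken in priority order (smallest first)
    until the avail[type] units of their resource type run out;
    the rest are deferred to a later step."""
    # Build predecessor lists from the successor map.
    pred = {v: [] for v in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    start, step = {}, 1
    while len(start) < len(succ):
        # Ready = unscheduled ops whose predecessors have all finished.
        ready = [v for v in succ if v not in start and
                 all(p in start and start[p] + delay[p] <= step
                     for p in pred[v])]
        used = {}
        for v in sorted(ready, key=priority.get):
            if used.get(optype[v], 0) < avail[optype[v]]:
                start[v] = step
                used[optype[v]] = used.get(optype[v], 0) + 1
        step += 1
    return start
```

With three independent unit-delay multiplies and only one multiplier, the priority order serializes them into steps 1, 2, and 3.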
Taxonomy of scheduling algorithms Scheduling is an NP-complete problem Optimal techniques: Integer Linear Programming (ILP) Heuristics, iterative-improvement based: e.g., simulated annealing Heuristics, constructive: e.g., force-directed scheduling, list scheduling If all resources are identical, the problem reduces to multiprocessor scheduling Minimum-latency multiprocessor scheduling is also NP-complete
Scheduling - Optimal Techniques Integer Linear Programming Branch and Bound
Integer Linear Programming Given: integer-valued m×n matrix A, vectors B = ( b1, b2, … , bm ) and C = ( c1, c2, … , cn ) Minimize: CᵀX Subject to: AX ≥ B, where X = ( x1, x2, … , xn ) is an integer-valued vector
ILP based scheduling RCS version: For a set of (dependent) operations {V0, V1, …, Vn}, given an upper bound ak on the # of available resources of type k, where k ∈ {1, …, nres}, and latency di of each operation Vi, find a schedule of minimum length that satisfies all resource and precedence constraints V0 denotes the dummy start node, Vn the dummy sink node Step 1: Run a heuristic, e.g., list scheduling, to obtain a possibly sub-optimal but achievable schedule length. Say it is λ̄ Step 2: Run ASAP & ALAP without resource constraints to get earliest & latest start times for each operation (used for pruning)
Integer Linear Programming For each computation dependency “ti has to be done before tj”, introduce a constraint: k·x1i + (k−1)·x2i + … + xki ≥ k·x1j + (k−1)·x2j + … + xkj + 1 (*) Minimize: y0 Subject to: x1i + x2i + … + xki = 1 for all 1 ≤ i ≤ n yj ≤ y0 for all 1 ≤ j ≤ k all computation dependency constraints of type (*)
An Example 6 computations 3 control steps c1 c2 c3 c4 c6 c5
An Example Introduce variables: xij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 6 yi = xi1 + xi2 + xi3 + xi4 + xi5 + xi6 for 1 ≤ i ≤ 3, and y0 Dependency constraints: e.g., execute c1 before c4: 3x11 + 2x21 + x31 ≥ 3x14 + 2x24 + x34 + 1 Execution constraints: x1j + x2j + x3j = 1 for 1 ≤ j ≤ 6
An Example Minimize: y0 Subject to: yi ≤ y0 for all 1 ≤ i ≤ 3 dependency constraints execution constraints One solution: y0 = 2 x11 = 1, x12 = 1, x23 = 1, x24 = 1, x35 = 1, x36 = 1. All other xij = 0
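The dependency encoding (*) can be sanity-checked by brute force: with k = 3 control steps, the weighted sum k·x1 + (k−1)·x2 + … + xk evaluates to k + 1 − (chosen step), so the inequality holds exactly when the first operation is scheduled strictly earlier than the second.

```python
# Brute-force check of the (*) encoding for k = 3 control steps.
k = 3

def weighted_sum(step):
    """Value of k*x1 + (k-1)*x2 + ... + xk when the operation is
    assigned to the given control step (one-hot x)."""
    x = [1 if s == step else 0 for s in range(1, k + 1)]
    return sum((k - idx) * xi for idx, xi in enumerate(x))

# "sum_i >= sum_j + 1" should hold iff i runs strictly before j.
for si in range(1, k + 1):
    for sj in range(1, k + 1):
        encodes_before = weighted_sum(si) >= weighted_sum(sj) + 1
        assert encodes_before == (si < sj)
```

This mirrors the c1-before-c4 instance on the next slide: an op in step 1 scores 3, one in step 2 scores 2, and 3 ≥ 2 + 1 holds.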
ILP Model of Scheduling For each operation Vi, for each control step j (1 ≤ j ≤ λ̄), define variable xij as: xij = 1, if computation Vi is executed in control step j xij = 0, otherwise Constraint 1: Start time is unique (i.e., each operation must be scheduled exactly once): Σj xij = 1 Start time and end time of operation Vi are given by Σj j·xij and Σj j·xij + di, respectively
ILP Model of Scheduling (contd.) Constraint 2: Sequencing relationships must be satisfied: for each dependency edge (Vi, Vj), Σl l·xjl ≥ Σl l·xil + di Constraint 3: Resource bounds must be met: since the upper bound on the # of resources of type k is ak, for every control step l, Σ{i : Vi of type k} Σ{m = l−di+1 … l} xim ≤ ak
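The three constraint classes can be checked directly on a candidate 0/1 assignment (a sketch of a constraint checker, not an ILP solver; the dictionary-of-dictionaries encoding x[i][j] = 1 iff operation i starts in step j is an illustrative choice):

```python
def check_ilp_constraints(x, delay, edges, optype, avail, steps):
    """Verify a 0/1 schedule matrix against the three ILP constraint
    classes; returns the recovered start times if all hold."""
    start = {}
    for i, row in x.items():
        # Constraint 1: each operation starts exactly once.
        assert sum(row.values()) == 1
        start[i] = sum(j * xij for j, xij in row.items())
    # Constraint 2: sequencing -- t_j >= t_i + d_i for each edge (i, j).
    for i, j in edges:
        assert start[j] >= start[i] + delay[i]
    # Constraint 3: resource bounds -- at most avail[k] type-k operations
    # executing (started but not yet finished) in any control step l.
    for l in range(1, steps + 1):
        used = {}
        for i in x:
            if start[i] <= l < start[i] + delay[i]:
                used[optype[i]] = used.get(optype[i], 0) + 1
        for kind, n in used.items():
            assert n <= avail[kind]
    return start
```

For instance, two dependent unit-delay multiplies sharing one multiplier pass the check when placed in steps 1 and 2.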
Minimum-latency Scheduling Under Resource-constraints Let t be the vector whose entries are the operation start times, and let c = [0, 0, …, 1]ᵀ The formal ILP model is then: minimize cᵀt (i.e., the start time of the sink, which is the latency) subject to the uniqueness, sequencing, and resource constraints above
Example Two types of resources: Multiplier (2 available) ALU (2 available), performing addition, subtraction, and comparison Both take 1 cycle execution time
Example (contd.) Heuristic (list scheduling) gives latency = 4 steps Use ALAP and ASAP (with no resource constraints) to get bounds on start times ASAP matches latency of heuristic So heuristic is optimum, but let us ignore it! Constraints?
Example (contd.) Start time is unique: Σj xij = 1 for every operation Vi
Example (contd.) Sequencing constraints Note: only the non-trivial ones (those where at least one operation has more than one possible start time) are listed
Example (contd.) Resource constraints
Example (contd.) Consider c = [0, 0, …, 1]ᵀ: minimum-latency schedule Since the sink has no mobility (xn,5 = 1), any feasible schedule is optimum Consider c = [1, 1, …, 1]ᵀ: finds the earliest start times for all operations Equivalently, it minimizes Σi Σj j·xij, the sum of the operation start times
Example: Optimum Schedule Under Resource Constraints
Example (contd.) Extension to TCS version: Assume a multiplier costs 5 units of area and an ALU costs 1 unit of area λ̄ now becomes the (given) schedule length bound Same uniqueness and sequencing constraints as before Resource constraints are now in terms of the unknown variables a1 and a2 a1 = # of multipliers a2 = # of ALUs
Example (contd.) Resource constraints
Example Solution Minimize cᵀa = 5·a1 + 1·a2 Solution with cost 12
Extensions ILP formulation can be extended to consider: Operation chaining Functional pipelining and other transformations See recommended reading paper for details
Precedence-constrained Multiprocessor Scheduling All operations done by the same type of resource NP-complete even if all operations have unit delay
Scheduling - Iterative Improvement Kernighan - Lin (deterministic) Simulated Annealing Lottery Iterative Improvement Neural Networks Genetic Algorithms Taboo Search
Scheduling - Constructive Techniques Most Constrained Least Constraining
Force Directed Scheduling Goal is to reduce hardware by balancing concurrency Iterative algorithm, one operation scheduled per iteration Information (i.e. speed & area) fed back into scheduler
The Force Directed Scheduling Algorithm
Step 1 Determine ASAP and ALAP schedules (diagram: the example DFG with multiply, subtract, add, and compare operations)
Step 2 Determine the Time Frame of each op Length of box ~ possible execution cycles Width of box ~ probability of assignment Uniform distribution, area assigned = 1 (diagram: time frames across C-steps 1–4; ops with 2-step and 3-step frames have per-step probabilities 1/2 and 1/3)
Step 3 Create Distribution Graphs Sum of probabilities of each Op type in each C-step Indicates concurrency of similar Ops DG(i) = Σ Prob(Op, i), summed over all ops of the given type (diagrams: DG for Multiply; DG for Add, Sub, Comp over C-steps 1–4)
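The DG construction can be sketched directly from the time frames. The op names m1…m8 below are hypothetical labels chosen to resemble the lecture's differential-equation example; each op spreads a uniform probability over its frame.

```python
def distribution_graph(frames, steps):
    """Build a distribution graph from operation time frames.
    frames maps op -> (asap, alap); each op contributes a uniform
    probability 1/(alap - asap + 1) to every step in its frame."""
    dg = {s: 0.0 for s in range(1, steps + 1)}
    for op, (asap, alap) in frames.items():
        p = 1.0 / (alap - asap + 1)
        for s in range(asap, alap + 1):
            dg[s] += p
    return dg
```

With two multiplies fixed in step 1, one fixed in step 2, and frames (1,2), (2,3), (1,3) for the rest, DG(1) ≈ 2.833 and DG(2) ≈ 2.333, matching the values used in the self-force example later.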
Diff Eq Example: Precedence Graph Recalled
Diff Eq Example: Time Frame & Probability Calculation
Diff Eq Example: DG Calculation
Conditional Statements Operations in different branches are mutually exclusive Operations of the same type can be overlapped onto the DG Probability of the most likely operation is added to the DG (diagram: fork/join branches whose add operations are overlapped on the Add DG)
Self Forces Scheduling an operation will affect overall concurrency Every operation has a 'self force' for every C-step of its time frame Analogous to the force of a spring: f = Kx A desirable scheduling will have negative self force Will achieve better concurrency (lower potential energy) Force(i) = DG(i) * x(i) DG(i) ~ current Distribution Graph value x(i) ~ change in the operation’s probability in C-step i Self Force(j) = Σi [Force(i)], summed over the C-steps i in the operation’s time frame
Example Attempt to schedule the multiply in C-step 1 Self Force(1) = Force(1) + Force(2) = ( DG(1) * x(1) ) + ( DG(2) * x(2) ) = [2.833*(0.5) + 2.333*(-0.5)] = +0.25 This is positive, so scheduling the multiply in the first C-step would be bad (diagram: DG for Multiply and the time frames with probabilities 1/2 and 1/3)
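The arithmetic can be checked in a few lines (DG values 2.833 and 2.333 are taken from the worked example; x(i) is the probability change when the op, with a two-step frame, is forced into step 1):

```python
# Self force for forcing the example multiply into C-step 1.
dg = {1: 2.833, 2: 2.333}     # current DG values (from the example)
x = {1: +0.5, 2: -0.5}        # probability change: 0.5 -> 1 in step 1,
                              # 0.5 -> 0 in step 2
self_force = sum(dg[s] * x[s] for s in dg)   # 2.833*0.5 - 2.333*0.5
# Positive (+0.25), so this assignment would worsen concurrency.
```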
Diff Eq. Example: Self Force for Node 4
Predecessor & Successor Forces Scheduling an operation may affect the time frames of other linked operations This may negate the benefits of the desired assignment Predecessor/Successor Forces = sum of the self forces of any implicitly scheduled operations
Diff Eq Example: Successor Force on Node 4 If node 4 is scheduled in step 1, there is no effect on the time frame of its successor node 8 Total force = Force4(1) = +0.25 If node 4 is scheduled in step 2, node 8 is forced into step 3, so the successor force must be calculated
Diff Eq Example: Final Time Frame and Schedule
Diff Eq Example: Final DG
Lookahead Temporarily modify the constant DG(i) to include the effect of the iteration being considered: Force(i) = temp_DG(i) * x(i), where temp_DG(i) = DG(i) + x(i)/3 Consider the previous example: Self Force(1) = (DG(1) + x(1)/3)·x(1) + (DG(2) + x(2)/3)·x(2) = 0.5·(2.833 + 0.5/3) − 0.5·(2.333 − 0.5/3) = +0.41667 This is even worse than before
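With the lookahead term folded in, the same example works out as follows (same assumed DG values as in the earlier self-force snippet):

```python
# Lookahead variant: fold the probability change into the DG before
# computing the force: Force(i) = (DG(i) + x(i)/3) * x(i).
dg = {1: 2.833, 2: 2.333}
x = {1: +0.5, 2: -0.5}
force = sum((dg[s] + x[s] / 3) * x[s] for s in dg)
# ~ +0.417: still positive, i.e., even worse than the +0.25 without lookahead.
```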
Minimization of Bus Costs Basic algorithm suitable for a narrow class of problems Algorithm can be refined to consider “cost” factors Number of buses ~ number of concurrent data transfers Number of buses = maximum transfers in any C-step Create a modified DG to include transfers: Transfer DG Trans DG(i) = Σ [Prob(op, i) * Opn_No_InOuts] Opn_No_InOuts ~ combined distinct in/outputs of the Op Calculate Force with this DG and add it to the Self Force
Minimization of Register Costs Minimum registers required is given by the largest number of data arcs crossing a C-step boundary Create storage operations at the output of any operation that transfers a value to a destination in a later C-step Generate a Storage DG for these “operations” Length of a storage operation depends on the final schedule (diagram: storage distribution for a value S, showing its ASAP, ALAP, and MAX lifetimes)
Minimization of Register Costs (contd.) [avg life] = (ASAP life + ALAP life + MAX life) / 3 storage DG(i) is obtained by spreading the storage probability over the possible lifetime (one expression when the ASAP and ALAP lifetimes do not overlap, another when they do) Calculate and add the “Storage” Force to the Self Force (example: the ASAP schedule needs a minimum of 7 registers; the force-directed schedule needs 5)
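The register lower bound ("largest number of data arcs crossing a C-step boundary") can be sketched directly; the schedule/uses representation below is a hypothetical encoding chosen for illustration.

```python
def min_registers(schedule, uses, end_step):
    """Lower bound on registers for a fixed schedule: the largest
    number of values live across any C-step boundary.  schedule maps
    op -> C-step producing its result; uses maps op -> list of
    C-steps in which that result is consumed."""
    best = 0
    for boundary in range(1, end_step):
        # A value crosses this boundary if it is produced at or before
        # the boundary and still consumed after it.
        live = sum(1 for op, s in schedule.items()
                   if s <= boundary and any(u > boundary for u in uses[op]))
        best = max(best, live)
    return best
```

For example, two step-1 results consumed in steps 2 and 3, plus a step-2 result consumed in step 3, never have more than two values crossing a boundary, so two registers suffice as a lower bound.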
Pipelining Functional Pipelining: pipelining across multiple operations Must balance the distribution across groups of concurrent C-steps Cut the DG horizontally and superimpose Finally perform regular force-directed scheduling (diagram: two overlapped instances of the schedule, steps 1′–4′ superimposed on 1–4, with the resulting DG for Multiply) Structural Pipelining: pipelining within an operation For non-data-dependent operations, only the first C-step need be considered (diagram: a pipelined multiplier occupying successive C-steps)
Other Optimizations Local timing constraints: insert dummy timing operations → restricted time frames Multiclass FUs: create a multiclass DG by summing the probabilities of the relevant ops Multistep/chained operations: carry propagation delay information with the operation, extending time frames into other C-steps as required Hardware constraints: use Force as the priority function in list scheduling algorithms
Scheduling using Simulated Annealing Reference: S. Devadas and A. R. Newton, “Algorithms for hardware allocation in data path synthesis”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 7, pp. 768-781, July 1989.
Simulated Annealing Local search over a solution space with a cost function — but how do we escape local minima?
Statistical Mechanics Combinatorial Optimization State {ri} (configuration: a set of atomic positions) has weight e^(−E({ri})/kB·T) — the Boltzmann distribution E({ri}): energy of the configuration kB: Boltzmann constant T: temperature In the low temperature limit, only minimum-energy (ground state) configurations have significant weight
Analogy Physical System ↔ Optimization Problem: State (configuration) ↔ Solution Energy ↔ Cost Function Ground State ↔ Optimal Solution Rapid Quenching ↔ Iterative Improvement Careful Annealing ↔ Simulated Annealing
Generic Simulated Annealing Algorithm 1. Get an initial solution S 2. Get an initial temperature T > 0 3. While not yet 'frozen' do the following: 3.1 For 1 ≤ i ≤ L, do the following: 3.1.1 Pick a random neighbor S′ of S 3.1.2 Let Δ = cost(S′) − cost(S) 3.1.3 If Δ ≤ 0 (downhill move), set S = S′ 3.1.4 If Δ > 0 (uphill move), set S = S′ with probability e^(−Δ/T) 3.2 Set T = r·T (reduce temperature) 4. Return S
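The generic loop translates almost line-for-line into Python. A sketch: the temperature parameters t0, r, L and the freezing threshold t_min are illustrative choices, not values from the slide, and the best-seen solution is tracked as a common practical addition.

```python
import math
import random

def anneal(cost, neighbor, s0, t0=10.0, r=0.9, L=50, t_min=0.01, seed=0):
    """Generic simulated annealing: L moves per temperature, accept
    uphill moves with probability exp(-delta/T), cool by T = r*T
    until 'frozen' (T < t_min)."""
    rng = random.Random(seed)          # seeded for reproducibility
    s, t = s0, t0
    best = s
    while t > t_min:
        for _ in range(L):
            s2 = neighbor(s, rng)
            delta = cost(s2) - cost(s)
            # Downhill moves always accepted; uphill with prob e^(-delta/T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                s = s2
            if cost(s) < cost(best):
                best = s
        t *= r
    return best
```

For example, minimizing (x − 3)² over the integers with a ±1 neighbor move starting from x = 20 settles on x = 3.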
Basic Ingredients for S.A. Solution Space Neighborhood Structure Cost Function Annealing Schedule
Observation All scheduling algorithms we have discussed so far are critical path schedulers They can only generate schedules for iteration period larger than or equal to the critical path They only exploit concurrency within a single iteration, and only utilize the intra-iteration precedence constraints
Example Can one do better than an iteration period of 4? Pipelining + retiming can reduce the critical path to 3, and also the # of functional units Approaches: Transformations followed by scheduling Transformations integrated with scheduling
Conclusions High Level Synthesis Connects Behavioral Description and Structural Description Scheduling is a key step Estimations, Transformations are others High Level of Abstraction, High Impact on the Final Design