ECE 565 High-Level Synthesis—An Introduction


ECE 565 High-Level Synthesis—An Introduction Shantanu Dutt ECE Dept., UIC

HLS Flow
Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects).
Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization).

HLS Flow (contd)

HLS Flow (contd)
Figure annotations:
Taken into consideration during register allocation (post scheduling).
Taken into consideration during scheduling.
(Binding) Allocation: simple counting of FUs after the above 2 stages.

Simple HLS Examples

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ an X delay of 2 cc’s and a + delay of 1 cc.
(a) Scheduling
i) Non-overlapped pipelined scheduling: schedule an operation when its i/p data and an FU are available (may need to break ties between competing operations).
Schedule over cc’s 1 to 6, where ci(j) is operation ci of iteration j: X: c1(1), c1(2); +: c2(1), c3(1), c2(2), c3(2).
(b) Arch. Synthesis: binding & FU, reg, mux/demux allocation and interconnection.
[Figure: registers a, b, c, d, x, y, z w/ load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; one X and one +; mux1 and mux2 (inputs I0, I1); demux (outputs O0, O1).]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset):
cc 3i: [z ← x+y] (c3); lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
cc 3i+1: [y ← c+d] (c2); mux1=0, mux2=0, demux=0, ldy=1
cc 3i+2: [x ← a X b] (c1); ldx=1
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge triggered system) of the cc after the one in which its load signal is asserted; e.g., lda = 1 → reg. “a” loaded.
Note: Unspecified control signals (cs) have either an inactive value or, if such a concept doesn’t exist for the cs, the don’t-care value.

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+) (cont’d)
(a) Scheduling
ii) Overlapped pipelined scheduling.
Schedule over cc’s 1 to 6: X: c1(1), c1(2); +: c2(1), c3(1), c2(2), c3(2).
(b) Arch. Synthesis
[Figure: registers a, b, c, d, x, y, z w/ load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; one X and one +; mux1 and mux2 (inputs I0, I1); demux.]
(c) Controller FSM Synthesis
Controller FSM (entered on Reset; state labels on the slide: cc 3i, cc 3i+1):
[y ← c+d, x ← a X b] (c1, c2): lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1
[z ← x+y] (c3): ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
For 4 iterations, the overlapped schedule takes 9 cc’s versus 12 cc’s for the non-overlapped schedule.
Overlapped sched.: time for n iterations = 2n+1 cc’s; throughput = n/(2n+1) ≈ 0.5 outputs/cc.
Non-overlapped sched.: time for n iterations = 3n cc’s; throughput = n/(3n) ≈ 0.33 outputs/cc.
⇒ The overlapped schedule improves throughput by ≈ 50% (0.5 vs. 0.33 outputs/cc), i.e., it needs ≈ 33% fewer cc’s per output.
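The cycle counts quoted above (3n for the non-overlapped schedule, 2n+1 for the overlapped one) can be checked numerically. The following is a minimal Python sketch; the function names are hypothetical and only the two formulas come from the slide. For large n the ratio of the two throughputs tends to 3/2.

```python
def nonoverlapped_cycles(n):
    """cc's for n iterations when each iteration occupies 3 cc's back to back."""
    return 3 * n


def overlapped_cycles(n):
    """cc's for n iterations when a new iteration starts every 2 cc's,
    plus 1 cc for the last iteration to finish (formula from the slide)."""
    return 2 * n + 1


def throughput(n, cycles):
    """Outputs produced per clock cycle."""
    return n / cycles


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        t_non = throughput(n, nonoverlapped_cycles(n))
        t_ovl = throughput(n, overlapped_cycles(n))
        # Print n, both cycle counts, and the throughput ratio.
        print(n, nonoverlapped_cycles(n), overlapped_cycles(n), round(t_ovl / t_non, 3))
```

The ratio is below 3/2 for small n because of the extra drain cycle in the overlapped schedule.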

Simple HLS Examples (contd)
Some DFG control operation nodes:
Distributor: input in, control Condition (T/F), outputs out1 and out2 (the T and F branches).
Selector: inputs in1 and in2, output out.
Conditional code: if (a > b) then c ← a-b; else c ← b-a;
Possible DFGs corresponding to the above conditional code: [figure]
Note that the 2 subs in the left DFG do not mean that an HLS algorithm will use 2 subtractors/adders. A good one will use 1, which will be shared in a mutually exclusive way between the two subs, since they are in two different branches of an if-then-else.
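The sharing argument can be illustrated in software: because the two subtractions sit in mutually exclusive branches, one subtractor suffices if its operands are selected by the condition. This is a small Python sketch under that assumption; `subtract`, its call counter and `abs_diff_shared` are hypothetical names used only for illustration.

```python
def subtract(p, q):
    """Stands in for the single shared subtractor; counts how often it is used."""
    subtract.calls += 1
    return p - q


subtract.calls = 0


def abs_diff_shared(a, b):
    """if (a > b) c <- a-b else c <- b-a, using one subtractor per evaluation."""
    cond = a > b
    # Operand selection plays the role of the muxes in front of the subtractor.
    left, right = (a, b) if cond else (b, a)
    return subtract(left, right)


if __name__ == "__main__":
    print(abs_diff_shared(7, 3), abs_diff_shared(3, 7), subtract.calls)
```

Each evaluation of the conditional triggers exactly one subtraction, which is why binding both subs to the same FU causes no conflict.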

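Simple HLS Examples (contd)
Iterative code: while (a > b) a ← a-b;
DFG: [figure: nodes dist, >, sel and -, w/ inputs a and b and T/F branches; the condition is initialized to F]
(a) Scheduling (using only 1 adder/sub). Scheduling & binding: operations c1 and c2 are both placed on the single + in different cc’s.
(b) Arch. Synthesis: [figure: registers a and b (lda, ldb); a mux (control signal mux) selecting the adder inputs for c1 and c2; the adder takes b’ w/ cin = 1, where b’+1 = 2’s complement of b, i.e., -b; s xor ovfl = 1 → -ve, = 0 → +ve, and this sign is sent to the FSM; register r1 (ldr1); a demux (control signal demux); final a (ldfina)]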
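The adder-based comparison noted in the figure (subtract by adding b’ with cin = 1, then read the sign as s xor ovfl) can be sketched in Python as follows. The function names, the default 8-bit width, and the choice to test a > b by checking whether b - a is negative are assumptions made for illustration; the slide text does not spell these out.

```python
def add_with_flags(x, y, cin, width):
    """width-bit adder: returns (result, sign bit s, signed-overflow flag ovfl)."""
    mask = (1 << width) - 1
    sign_bit = 1 << (width - 1)
    x &= mask
    y &= mask
    result = (x + y + cin) & mask
    s = 1 if result & sign_bit else 0
    # Signed overflow: both operands share a sign that the result does not have.
    ovfl = 1 if (x ^ result) & (y ^ result) & sign_bit else 0
    return result, s, ovfl


def is_less_than(p, q, width):
    """p < q for signed width-bit values, via p - q = p + q' + 1 and s xor ovfl."""
    mask = (1 << width) - 1
    _, s, ovfl = add_with_flags(p, ~q & mask, 1, width)
    return (s ^ ovfl) == 1


def to_signed(value, width):
    """Interpret a width-bit pattern as a 2's-complement integer."""
    return value - (1 << width) if value >> (width - 1) else value


def repeated_subtract(a, b, width=8):
    """while (a > b) a <- a - b, reusing the one adder for test and update.
    Assumes b > 0 so that the loop terminates."""
    mask = (1 << width) - 1
    while is_less_than(b, a, width):          # a > b  <=>  b - a is negative
        result, _, _ = add_with_flags(a, ~b & mask, 1, width)
        a = to_signed(result, width)
    return a
```

For example, `repeated_subtract(17, 5)` steps through 12 and 7 and stops at 2, since 2 > 5 is false.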

Delay Nodes in DFGs A delay node is generally implemented as a register (or a series of registers if clock period < T0); a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)
Transformation in the DFG: [figure]
Mapping to the architecture w/ the register decoupling input and output, s.t. register i/p = o/p of the combinational part and register o/p = i/p of the combinational part. These can be treated as independent of each other, as their availabilities are in different time steps (e.g., clock cycles).

Detailed HLS Example

Detailed HLS Example (contd)
Different paths (i/p → o/p) in the DFG: [figure]
(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); Goal: Minimize latency.
Scheduling heuristic: Among available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps to the same children) are available or will be available closest to u’s earliest finish (i.e., the asap time of the child is earliest); otherwise the FU(s) will be idle unnecessarily, leading to a larger latency (this will also reduce the lifetimes of sibling o/ps).
(b) Reg. alloc. for o/p of operations. For the WAR constraint: [can’t store in d1, as would be natural, since d1’s current data is yet to be consumed by c6, which has not been scheduled yet].
(c) Arch. synthesis: the synthesized architecture. [figure]
Note: The above register allocation for the adder has been done w/ separate regs for multiplier and adder o/ps. It is sub-optimal (4 non-primary i/p regs. needed).
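The primary rule of the heuristic (prefer the available operation with the longest delay to the o/p) is a form of list scheduling. Below is a minimal Python sketch with one FU per type. It omits the sibling-based tie-break described above and breaks ties by dictionary order, and it uses the small 3-operation DFG (x ← a X b, y ← c+d, z ← x+y) rather than the DFG of this example, which is only in the figure. All names are hypothetical.

```python
def list_schedule(ops, delay):
    """ops: name -> (fu_type, [predecessor names]); delay: fu_type -> cc's.
    One FU per type. Returns name -> (start cc, finish cc), cc's numbered from 1."""
    succs = {n: [] for n in ops}
    for n, (_, preds) in ops.items():
        for p in preds:
            succs[p].append(n)

    prio = {}

    def delay_to_output(n):
        # Longest path (in cc's) from the start of n to any DFG output.
        if n not in prio:
            prio[n] = delay[ops[n][0]] + max(
                (delay_to_output(s) for s in succs[n]), default=0)
        return prio[n]

    for n in ops:
        delay_to_output(n)

    start, finish = {}, {}
    fu_free = {fu: 1 for fu in delay}   # first cc in which each FU is free
    cc = 1
    while len(start) < len(ops):        # assumes the DFG is acyclic
        for fu in delay:
            if fu_free[fu] > cc:
                continue
            ready = [n for n, (f, preds) in ops.items()
                     if f == fu and n not in start
                     and all(p in finish and finish[p] < cc for p in preds)]
            if ready:
                n = max(ready, key=lambda m: prio[m])
                start[n] = cc
                finish[n] = cc + delay[fu] - 1
                fu_free[fu] = finish[n] + 1
        cc += 1
    return {n: (start[n], finish[n]) for n in ops}


if __name__ == "__main__":
    dfg = {"c1": ("mul", []), "c2": ("add", []), "c3": ("add", ["c1", "c2"])}
    print(list_schedule(dfg, {"mul": 2, "add": 1}))
```

On this DFG the sketch places c1 on the multiplier in cc’s 1 to 2, c2 on the adder in cc 1, and c3 on the adder in cc 3, for a latency of 3 cc’s.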

Detailed HLS Example (contd)

Detailed HLS Example—Register Allocation

Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed. Scheduling heuristic: as stated earlier.
In the conflict graph (one per FU [as here] if regs are grouped by FU, else one per FU type if regs are shared across each FU type, or only one [global] if regs are shared across all FUs), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them).
Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is in general NP-hard.
The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing).
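As a rough illustration of the left-edge idea for register allocation, the Python sketch below sorts the lifetimes by their left edge (start time) and gives each variable the first register that is already free. It assumes half-open lifetimes [start, end), so a register can be reused in the cc in which its previous value dies; the variable names and lifetimes in the usage example are made up, not taken from the slide.

```python
def left_edge(lifetimes):
    """lifetimes: name -> (start, end), live in the half-open interval [start, end).
    Returns name -> register index, using as few registers as the intervals allow."""
    order = sorted(lifetimes, key=lambda v: lifetimes[v])  # sort by left edge
    reg_free_at = []   # reg_free_at[r] = end of the last lifetime placed in register r
    assignment = {}
    for v in order:
        start, end = lifetimes[v]
        for r, free_at in enumerate(reg_free_at):
            if free_at <= start:          # register r no longer holds a live value
                reg_free_at[r] = end
                assignment[v] = r
                break
        else:
            reg_free_at.append(end)       # no free register: allocate a new one
            assignment[v] = len(reg_free_at) - 1
    return assignment


if __name__ == "__main__":
    # Hypothetical lifetimes in cc's.
    print(left_edge({"p": (1, 4), "q": (2, 5), "r": (4, 6), "s": (5, 7)}))
```

Two variables end up in the same register only if their lifetimes do not overlap, which is exactly the condition for having no edge between them in the conflict graph.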

Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed.
Scheduling heuristic: Among available opers, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily. B’s lifetime increases, but D’s (dep. of B) decreases similarly; the heuristic should be based on more global information.