ECE 565 High-Level Synthesis—An Introduction Shantanu Dutt ECE Dept., UIC.

HLS Flow
Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs) via muxes, demuxes, tristate buffers, buses, dedicated interconnects)
Classically, these 3 stages were performed sequentially; currently they are performed together, which leads to better optimization.

HLS Flow (contd)

Allocation: Simple counting of FUs after the above 2 stages (Binding)

Simple HLS Examples

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+), w/ X delay of 2 cc's and + delay of 1 cc
i) Non-overlapped pipelined scheduling
Operations: [x ← a x b] (c1); [y ← c+d] (c2); [z ← x+y] (c3)
[Figure: (a) Scheduling: X row c1(1), c1(2); + row c2(1), c3(1), c2(2), c3(2), plotted against cc's. (b) Arch. synthesis: registers a, b, c, d, x, y, z with load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; one X and one +; mux1 and mux2 (inputs I0, I1) and a demux (outputs O0, O1). (c) Controller FSM synthesis: states entered from Reset, labeled cc 3i, cc 3(i+1), cc 3(i+2).]
Control signals shown on the controller FSM: lda=1, ldb=1, ldc=1, ldd=1 (load the input registers); ldx=1 for x ← a x b; mux1=0, mux2=0, demux=0, ldy=1 for y ← c+d; mux1=1, mux2=1, demux=1, ldz=1 for z ← x+y.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge triggered system) of the cc after the one in which its load signal is asserted (e.g., lda = 1 → reg. "a" loaded).
Note: Unspecified control signals have either an inactive value or, if such a concept doesn't exist for the control signal, the don't-care value.

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont'd)
ii) Overlapped pipelined scheduling
Operations per FSM state: [y ← c+d, x ← a x b] (c1, c2); [z ← x+y] (c3)
[Figure: (a) Scheduling: X row c1(1), c1(2); + row c2(1), c3(1), c2(2), c3(2), plotted against cc's. (b) Arch. synthesis: registers a, b, c, d, x, y, z with load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; one X and one +; mux1 and mux2 (inputs I0, I1) and a demux. (c) Controller FSM synthesis: states entered from Reset, labeled cc 3i, cc 3(i+1).]
Control signals shown on the controller FSM: lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1 in state (c1, c2); ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1 in state (c3).
For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
Overlapped sched.: Time for n iterations = 2n+1; throughput = n/(2n+1) ~ 0.5 outputs/cc
Non-overlapped sched.: Time for n iterations = 3n; throughput = n/3n ~ 0.33 outputs/cc
⇒ Using an overlapped schedule raises throughput by a factor of ~0.5/0.33 = 1.5 for large n, i.e. ~50% (the gain of ~0.17 outputs/cc is ~34% of the overlapped throughput).
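The cycle counts and throughputs quoted on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration and are not part of the lecture.

```python
from fractions import Fraction


def nonoverlapped_cycles(n: int) -> int:
    """Non-overlapped schedule: every iteration occupies 3 cc's."""
    return 3 * n


def overlapped_cycles(n: int) -> int:
    """Overlapped schedule: one iteration starts every 2 cc's, plus 1 final cc."""
    return 2 * n + 1


def throughput(n: int, cycles: int) -> Fraction:
    """Outputs produced per clock cycle, kept exact as a fraction."""
    return Fraction(n, cycles)


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        t_non = throughput(n, nonoverlapped_cycles(n))
        t_ovl = throughput(n, overlapped_cycles(n))
        ratio = float(t_ovl / t_non)
        print(f"n={n}: {nonoverlapped_cycles(n)} vs {overlapped_cycles(n)} cc's, "
              f"throughput ratio {ratio:.3f}")
```

As n grows, the ratio of overlapped to non-overlapped throughput, 3n/(2n+1), approaches 3/2.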

Simple HLS Examples (contd)
Some DFG control operation nodes:
Distributor: i/ps in and Condition (T/F); o/ps out1, out2 (the T and F branches)
Selector: i/ps in1, in2 (the T and F branches) and Condition (T/F); o/p out
Conditional code: if (a > b) then c ← a-b; else c ← b-a;
[Figure: possible DFGs corresponding to the above conditional code]
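A behavioral sketch of the two control nodes may help in reading the DFGs. The Python functions below are illustrative models only: the (T output, F output) ordering of the distributor and all function names are assumptions, not taken from the slide.

```python
from typing import Optional, Tuple


def distributor(cond: bool, value: int) -> Tuple[Optional[int], Optional[int]]:
    """Route `value` to the T output if cond holds, else to the F output.

    Returns (out_T, out_F); the unused output carries no token (None).
    """
    return (value, None) if cond else (None, value)


def selector(cond: bool, in_true: int, in_false: int) -> int:
    """Pass the T-branch input if cond holds, else the F-branch input."""
    return in_true if cond else in_false


def cond_with_selector(a: int, b: int) -> int:
    """One possible DFG: compute both a-b and b-a, then select one result."""
    return selector(a > b, a - b, b - a)


def cond_with_distributors(a: int, b: int) -> int:
    """Another possible DFG: steer the operands so only one subtraction fires."""
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    if cond:
        return a_t - b_t   # T branch: c <- a-b
    return b_f - a_f       # F branch: c <- b-a


if __name__ == "__main__":
    print(cond_with_selector(7, 3), cond_with_distributors(3, 7))
```

The selector version spends two subtractions per evaluation, while the distributor version activates only one branch; which is preferable depends on the available FUs.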

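Simple HLS Examples (contd)
Iterative code: while (a > b) a ← a-b;
[Figure: DFG for the loop, built from dist (distributor), sel (selector), ">" and "-" nodes with T/F branches; the condition is initialized to F; o/p is "final a".]
Scheduling & binding: (a) Scheduling (using only 1 adder/sub): c1 and c2 on the single + over consecutive cc's. (b) Arch. synthesis: registers a, b, r1 and "final a" (load signals lda, ldb, ldr1, ldfina), a mux, a demux and one adder with cin = 1 and input b'.
b' + 1 = 2's compl. of b, so the adder computes a - b.
The bit (s xor ovfl) goes to the FSM: = 1 → result is -ve; = 0 → result is +ve.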
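The adder-based subtraction and the sign test sent to the FSM can be modeled in software. The sketch below assumes an 8-bit two's-complement datapath and adds an explicit zero check so that the loop condition is a > b rather than a >= b; both choices, and all names, are illustrative assumptions rather than details from the slide.

```python
def sub_via_adder(a: int, b: int, width: int = 8):
    """Compute a - b as a + b' + 1 on a `width`-bit adder (cin = 1).

    Returns (result_bits, negative), where negative = s xor ovfl.
    """
    mask = (1 << width) - 1
    msb = width - 1
    a_bits = a & mask
    b_inv = ~b & mask                 # b' : bitwise complement of b
    result = (a_bits + b_inv + 1) & mask
    a_s = (a_bits >> msb) & 1
    b_s = (b_inv >> msb) & 1
    s = (result >> msb) & 1           # sign bit of the adder output
    # Signed overflow: both adder operands share a sign the result lacks.
    ovfl = int(a_s == b_s and s != a_s)
    return result, bool(s ^ ovfl)


def to_signed(bits: int, width: int = 8) -> int:
    """Interpret `bits` as a two's-complement number."""
    return bits - (1 << width) if (bits >> (width - 1)) & 1 else bits


def iterate(a: int, b: int, width: int = 8) -> int:
    """Model of: while (a > b) a <- a - b, reusing one subtraction per pass."""
    if b <= 0:
        raise ValueError("the loop only terminates for b > 0")
    while True:
        diff, negative = sub_via_adder(a, b, width)
        if negative or diff == 0:     # a <= b: leave the loop
            return a
        a = to_signed(diff, width)


if __name__ == "__main__":
    print(iterate(17, 5))
```

Using s xor ovfl instead of the raw sign bit keeps the comparison correct even when the subtraction overflows the datapath width.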

Delay Nodes in DFGs A delay node is generally implemented as a register; a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)
[Figure: transformation in the DFG; mapping to the architecture (delay node → register)]

Detailed HLS Example

Detailed HLS Example (contd)
[Figure: different paths (i/p → o/p) in the DFG; (a) Scheduling w/ one X (2 cc's) & one + (1 cc), goal: min. latency; (b) Reg. alloc. for o/p of operations; (c) Arch. synthesis (the synthesized architecture), annotated "For WAR constraint".]
Note: It is not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed).
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties in favor of those operations u whose "sibling" o/ps (o/ps going to the same children) that are available, or will be available at u's earliest finish, will have the largest lifetime at that point.
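The first part of the scheduling heuristic (prefer the ready operation with the largest delay to the output) is a form of resource-constrained list scheduling, which can be sketched as follows. The small DFG used here is the x = a x b, y = c + d, z = x + y example from the earlier slides, not the DFG of this detailed example; ties are broken by operation name, so the sibling-lifetime tie-break is not modeled. All identifiers are hypothetical.

```python
DELAY = {"mul": 2, "add": 1}   # X takes 2 cc's, + takes 1 cc


def list_schedule(ops, deps, fu_count):
    """ops: name -> FU type; deps: name -> list of predecessor names (a DAG).

    Returns (start, finish) dicts giving 0-based clock cycles.
    """
    succs = {u: [] for u in ops}
    for v, preds in deps.items():
        for u in preds:
            succs[u].append(v)

    prio = {}

    def delay_to_output(u):
        # Longest path delay from the start of u to any DFG output.
        if u not in prio:
            tail = max((delay_to_output(v) for v in succs[u]), default=0)
            prio[u] = DELAY[ops[u]] + tail
        return prio[u]

    for u in ops:
        delay_to_output(u)

    start, finish = {}, {}
    free_at = {kind: [0] * n for kind, n in fu_count.items()}
    cc = 0
    while len(start) < len(ops):
        ready = [u for u in ops if u not in start
                 and all(finish.get(p, cc + 1) <= cc for p in deps.get(u, []))]
        # Highest delay-to-output first; name order is an assumed tie-break.
        for u in sorted(ready, key=lambda u: (-prio[u], u)):
            units = free_at[ops[u]]
            for i, t in enumerate(units):
                if t <= cc:            # this FU instance is idle
                    start[u] = cc
                    finish[u] = cc + DELAY[ops[u]]
                    units[i] = finish[u]
                    break
        cc += 1
    return start, finish


if __name__ == "__main__":
    ops = {"c1": "mul", "c2": "add", "c3": "add"}
    deps = {"c3": ["c1", "c2"]}
    print(list_schedule(ops, deps, {"mul": 1, "add": 1}))
```

With one multiplier and one adder, the sketch starts c1 and c2 in cycle 0 and c3 in cycle 2, for a latency of 3 cc's.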

Detailed HLS Example (contd)

Detailed HLS Example—Register Allocation

Detailed HLS Example—Register Allocation (contd)
[Figure annotation: 3 non-primary i/p regs. needed]
In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them).
Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general.
The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing).
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties in favor of those operations u whose "sibling" o/ps (o/ps going to the same children) that are available, or will be available at u's earliest finish, will have the largest lifetime at that point.
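A minimal sketch of the left-edge algorithm applied to register allocation is shown below. It assumes each lifetime is a half-open interval [start, end) in clock cycles, so a value that dies at cc t can share a register with one born at cc t. The variable names and lifetimes are invented for illustration and do not come from the lecture's example, and this simple track-by-track version makes one pass per register rather than aiming for the linear-time bound.

```python
def left_edge(lifetimes):
    """lifetimes: name -> (start, end), half-open. Returns name -> register index."""
    # Sort by left edge (start of lifetime); end and name only make ties stable.
    remaining = sorted(lifetimes.items(),
                       key=lambda kv: (kv[1][0], kv[1][1], kv[0]))
    assignment = {}
    reg = 0
    while remaining:
        last_end = None
        leftover = []
        # Fill the current register with as many non-overlapping lifetimes
        # as possible, scanning from the leftmost start.
        for name, (start, end) in remaining:
            if last_end is None or start >= last_end:
                assignment[name] = reg
                last_end = end
            else:
                leftover.append((name, (start, end)))
        remaining = leftover
        reg += 1
    return assignment


if __name__ == "__main__":
    example = {"a": (0, 3), "b": (1, 4), "c": (3, 6), "d": (4, 7), "e": (2, 5)}
    regs = left_edge(example)
    print(regs, "registers used:", max(regs.values()) + 1)
```

In this example three lifetimes (a, b, e) overlap between cc 2 and cc 3, so three registers is the minimum, and the sketch uses exactly three.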

Detailed HLS Example—Register Allocation (contd)
[Figure annotation: 3 non-primary i/p regs. needed]
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily. Here B's lifetime increases, but D's (D is a dependent of B) decreases similarly, so the heuristic should be based on more global information.