Jordi Cortadella and Jordi Petit

Slides:



Advertisements
Similar presentations
Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu.
Advertisements

ECE-777 System Level Design and Automation Hardware/Software Co-design
Courseware Integer Linear Programming approach to Scheduling Sune Fallgaard Nielsen Informatics and Mathematical Modelling Technical University of Denmark.
Logic Synthesis – 3 Optimization Ahmed Hemani Sources: Synopsys Documentation.
Department of Electrical and Computer Engineering M.A. Basith, T. Ahmad, A. Rossi *, M. Ciesielski ECE Dept. Univ. Massachusetts, Amherst * Univ. Bretagne.
Frame-Level Pipelined Motion Estimation Array Processor Surin Kittitornkun and Yu Hen Hu IEEE Trans. on, for Video Tech., Vol. 11, NO.2 FEB, 2001.
Handshake protocols for de-synchronization I. Blunno, J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin and C. Sotiriou Politecnico di Torino, Italy Universitat.
1 Logic synthesis from concurrent specifications Jordi Cortadella Universitat Politecnica de Catalunya Barcelona, Spain In collaboration with M. Kishinevsky,
Models of Computation for Embedded System Design Alvise Bonivento.
Continuous Retiming EECS 290A Sequential Logic Synthesis and Verification.
Synthesis of synchronous elastic architectures Jordi Cortadella (Universitat Politècnica Catalunya) Mike Kishinevsky (Intel Corp.) Bill Grundmann (Intel.
Logic Design Outline –Logic Design –Schematic Capture –Logic Simulation –Logic Synthesis –Technology Mapping –Logic Verification Goal –Understand logic.
1 An Exploration of the MPEG Algorithm Using Latency Insensitive Design EE249 Presentation (12/04/1999) Trevor Meyerowitz Mentored by: Luca Carloni.
1 State Encoding of Large Asynchronous Controllers Josep Carmona and Jordi Cortadella Universitat Politècnica de Catalunya Barcelona, Spain.
Dipartimento di Informatica - Università di Verona Networked Embedded Systems The HW/SW/Network Cosimulation-based Design Flow Introduction Transaction.
EECS 470 Cache and Memory Systems Lecture 14 Coverage: Chapter 5.
1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.
TM Efficient IP Design flow for Low-Power High-Level Synthesis Quick & Accurate Power Analysis and Optimization Flow JAN Asher Berkovitz Yaniv.
Physical Planning for the Architectural Exploration of Large-Scale Chip Multiprocessors Javier de San Pedro, Nikita Nikitin, Jordi Cortadella and Jordi.
CSE 242A Integrated Circuit Layout Automation Lecture: Partitioning Winter 2009 Chung-Kuan Cheng.
New Modeling Techniques for the Global Routing Problem Anthony Vannelli Department of Electrical and Computer Engineering University of Waterloo Waterloo,
Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of FPGA-based Systolic Arrays Greg Nash Reconfigurable Technology: FPGAs and.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
FORMAL VERIFICATION OF ADVANCED SYNTHESIS OPTIMIZATIONS Anant Kumar Jain Pradish Mathews Mike Mahar.
A Graph Based Algorithm for Data Path Optimization in Custom Processors J. Trajkovic, M. Reshadi, B. Gorjiara, D. Gajski Center for Embedded Computer Systems.
Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,
ISSS 2001, Montréal1 ISSS’01 S.Derrien, S.Rajopadhye, S.Sur-Kolay* IRISA France *ISI calcutta Combined Instruction and Loop Level Parallelism for Regular.
1 Implementation in Hardware of Video Processing Algorithm Performed by: Yony Dekell & Tsion Bublil Supervisor : Mike Sumszyk SPRING 2008 High Speed Digital.
6. A PPLICATION MAPPING 6.3 HW/SW partitioning 6.4 Mapping to heterogeneous multi-processors 1 6. Application mapping (part 2)
High-Level Synthesis-II Virendra Singh Indian Institute of Science Bangalore IEP on Digital System IIT Kanpur.
Lecture 9 RTL Design Methodology. Structure of a Typical Digital System Datapath (Execution Unit) Controller (Control Unit) Data Inputs Data Outputs Control.
IMPLEMENTATION OF MIPS 64 WITH VERILOG HARDWARE DESIGN LANGUAGE BY PRAMOD MENON CET520 S’03.
Automatic Pipelining during Sequential Logic Synthesis Jordi Cortadella Universitat Politècnica de Catalunya, Barcelona Joint work with Marc Galceran-Oms.
A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.
Retiming EECS 290A Sequential Logic Synthesis and Verification.
Specification mining for asynchronous controllers Javier de San Pedro† Thomas Bourgeat ‡ Jordi Cortadella† † Universitat Politecnica de Catalunya ‡ Massachusetts.
1 Introduction to Engineering Spring 2007 Lecture 19: Digital Tools 3.
INF3430 / 4431 Synthesis and the Integrated Logic Analyzer (ILA) (WORK IN PROGRESS)
Computer Organization
System-on-Chip Design Homework Solutions
Framework For Exploring Interconnect Level Cache Coherency
Asynchronous Interface Specification, Analysis and Synthesis
CSE 140 Lecture 14 System Designs
VLSI Testing Lecture 5: Logic Simulation
ECE 565 High-Level Synthesis—An Introduction
VLSI Testing Lecture 5: Logic Simulation
Paul Pop, Petru Eles, Zebo Peng
Ring Oscillator Clocks and Margins
Vishwani D. Agrawal Department of ECE, Auburn University
James D. Z. Ma Department of Electrical and Computer Engineering
IP – Based Design Methodology
Israel Cidon, Ran Ginosar and Avinoam Kolodny
Introduction to cosynthesis Rabi Mahapatra CSCE617
Centar ( Global Signal Processing Expo
CSE 140 Lecture 15 System Designs
From C to Elastic Circuits
LPSAT: A Unified Approach to RTL Satisfiability
Mihir Awatramani Lakshmi kiran Tondehal Xinying Wang Y. Ravi Chandra
ECE-C662 Introduction to Behavioral Synthesis Knapp Text Ch
Dynamically Scheduled High-level Synthesis
Steve Dai, Gai Liu, Zhiru Zhang
Architecture Synthesis
ICS 252 Introduction to Computer Design
De-synchronization: from synchronous to asynchronous
ESE535: Electronic Design Automation
RTL Design Methodology
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.
Fast Min-Register Retiming Through Binary Max-Flow
Research: Past, Present and Future
Presentation transcript:

A Hierarchical Mathematical Model for Automatic Pipelining and Allocation using Elastic Systems Jordi Cortadella and Jordi Petit Universitat Politècnica de Catalunya, Barcelona

A Hierarchical Mathematical Model for RTL optimizations Jordi Cortadella and Jordi Petit Universitat Politècnica de Catalunya, Barcelona

Outline RTL optimizations: introduction Elasticity Basic mathematical model Hierarchical mathematical model Two examples October 30, 2017 RTL optimization

RTL optimizations F3 F1 F3 F4 Control F2 October 30, 2017

RTL optimizations F1 F4 F2 Control Mathematical models to find optimal solutions F1 F4 Control F2 October 30, 2017 RTL optimization

Synthesis flow C, C++, SystemC Retiming Elasticity (bubbles) RTL Preserve execution order HLS Retiming Elasticity (bubbles) Unit selection RTL RTL optimization RTL Synthesis throughput area architectural exploration Netlist October 30, 2017 RTL optimization

Time is elastic … October 30, 2017 RTL optimization

Elasticity is known from long ago VME bus AMBA AHB Unpredictable timing handled by handshake signals October 30, 2017 RTL optimization

Rigid vs. Elastic timing time time In S Out 9 8 7 6 5 4 3 2 1 9 8 7 6 5 4 3 2 1 CLK 5 4 3 2 1 S In Out 6 req ack req ack bubble 5 4 3 2 1 S In Out CLK valid ready October 30, 2017 RTL optimization

Incorporating elasticity Out In F1 F3 F2 Modus operandi: compute only when the inputs are valid October 30, 2017 RTL optimization

+ Elastic Join Valid V1 Cntrl V Valid Ready R1 Cntrl VS R Ready Valid October 30, 2017 RTL optimization

Variable Latency Units F Datain Dataout Validin Readyin Validout Readyout [0 - k] cycles go done clear Cntrl Cntrl October 30, 2017 RTL optimization

Basic mathematical model October 30, 2017 RTL optimization

The equations of retiming (registers) In every cycle, the number of registers is invariant October 30, 2017 RTL optimization

The equations of retiming (delays) October 30, 2017 RTL optimization

Beyond retiming Can we do it better? Min cycle period: 2 No! … unless we use elasticity Min cycle period: 2 1 1 In every cycle, the number of registers is invariant October 30, 2017 RTL optimization

Beyond retiming Can we do it better? No! … unless we use elasticity 1 Introduce bubbles (registers with invalid data) tokens In every cycle, the number of registers is invariant October 30, 2017 RTL optimization

Performance optimization tokens/cycle Cycle Period: 1 Throughput: 2 tokens / 3 cycles 1 max Throughput Period tokens/ns ns/cycle 1 1 1 without bubbles 1 token 2 ns with bubbles 2 tokens 3 ns Bubbles contribute to reduce cycle period, … but they also reduce throughput ! better !!! October 30, 2017 RTL optimization

Performance model Retiming (fluid) tokens 4/3 2/3 J. Campos, G. Chiola, and M. Silva, “Ergodicity and throughput boundsfor Petri nets with unique consistent firing count vector,” IEEE Transactions on Software Engineering, 1991. October 30, 2017 RTL optimization

Putting all together Delay & Area equations (same as retiming)  Bufistov, Cortadella, Kishinevsky, and Sapatnekar. A general model for performance optimization of sequential systems. ICCAD, 2007. October 30, 2017 RTL optimization

Hierarchical mathematical model October 30, 2017 RTL optimization

Hierarchical model P Arch Regfile Exec FPU OR Decode Memory FPU1 AND FADD FMUL FDIV Netlist: Blocks Connections Registers cadd xmul cmul cdiv carr cnorm October 30, 2017 RTL optimization

F Hierarchical model c a b Each module has an ILP model with interface variables. The interface variables connect the local ILP models. October 30, 2017 RTL optimization

Unit selection F October 30, 2017 RTL optimization

Unit selection  can be linearized October 30, 2017 RTL optimization

Block netlist F ILP(F) G ILP(G) ILP(F,G) October 30, 2017 RTL optimization

ILP model (top) Hierarchical model ILP model (FPU) ILP model (FPU1) Regfile Exec FPU OR Decode Memory ILP model (FPU1) ILP model (FPU2) FPU1 AND FPU2 ILP model FADD ILP model FMUL FDIV cadd xmul cmul cdiv carr cnorm October 30, 2017 RTL optimization

Example: systolic grid October 30, 2017 RTL optimization

Area: 10 Delay: 20 Area: 20 Delay: 10 Area: 5 Delay: 0

ILP model: 1766 rows x 1334 columns (578 integer, 66 binary) 4778 nonZeros Presolved: 857 rows x 809 columns (432 integer, 16 binary) 2752 nonZeros CPU  2 mins

Area-Throughput trade-off October 30, 2017 RTL optimization

Example: fork/join pipeline October 30, 2017 RTL optimization

Example Area 𝑘 =𝑘 ∙Area 1 Thr 𝑘 =𝑘 ∙Thr 1 Problems: Area: 10 Delay: 40 Area: 10 Delay: 40 Area: 15 Delay: 40 Area: 10 Delay: 20 Area 𝑘 =𝑘 ∙Area 1 Thr 𝑘 =𝑘 ∙Thr 1 Area: 30 Delay: 30 Problems: Max Throughput under Area Constraint Min Area under Throughput Constraint October 30, 2017 RTL optimization

Throughput Area October 30, 2017 RTL optimization

4 3 2 1 3 2 Throughput 2 1 Area October 30, 2017 RTL optimization

Conclusions There is room for architectural exploration at RTL after High-Level Synthesis Elasticity (adaptive timing) is essential to boost performance Mathematical optimization models for elasticity and unit selection are possible Future: Incorporate models with branch/merge nodes Scalable (heuristic) optimization methods October 30, 2017 RTL optimization