Time-borrowing platform in the Xilinx UltraScale+ family of FPGAs and MPSoCs Ilya Ganusov, Benjamin Devlin.

Presentation transcript:

Agenda
1. Time-borrowing concept
2. Hardware support for time-borrowing in UltraScale+
3. Time-borrowing algorithm based on ILP
4. Experimental results
5. Conclusion

Time-borrowing
Improve Fmax by redistributing slack between fast and slow paths.
Uneven slack arises from: different logic depth; quantized routing; point-to-point vs. high-fanout connectivity; control sets, routing congestion, and other PNR restrictions.

Time Borrowing based on Clock Skew Scheduling
[Figure: timing diagram, annotated "critical setup path just meets".]
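As a worked illustration of the idea (the three flip-flops A, B, C and both path delays are hypothetical, not taken from the slides): suppose A → B has 5.0 ns of logic and B → C has 3.0 ns, and the clock arriving at B is delayed by a skew s. Using the setup form given on the ILP slide of this talk, PathDelay − (skew_end − skew_start) < T, and ignoring hold, the period bound becomes:

```latex
\begin{align*}
\text{no skew:}\quad & T \ge \max(5.0,\ 3.0) = 5.0\ \text{ns} \\
\text{skew } s \text{ on B:}\quad & T \ge \max(5.0 - s,\ 3.0 + s) \\
s = 0.295\ \text{ns:}\quad & T \ge \max(4.705,\ 3.295) = 4.705\ \text{ns}
\end{align*}
```

The slow path gains exactly what the fast path gives up; 0.295 ns is the largest single delay tap quoted in the results slides.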

Time Borrowing based on using pulsed latches
[Figure: timing diagram, annotated "critical setup path just meets".]
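A minimal sketch of how pulsed-latch borrowing can be reasoned about, under a deliberately simplified model that is an assumption rather than the talk's method: every intermediate register is a pulsed latch that stays transparent for the pulse width after its clock edge, the last register is an ordinary flip-flop, and setup time and clock-to-Q are ignored. The function names and the example delays are hypothetical.

```python
def meets_timing(stage_delays_ps, period_ps, pulse_width_ps):
    """Return True if the pipeline meets setup at the given clock period.

    Assumed model: intermediate registers are pulsed latches (transparent
    for pulse_width_ps after the clock edge); the final register is a
    plain flip-flop. Setup time and clock-to-Q are ignored.
    """
    borrowed = 0  # how late (ps) the data left the previous register
    last = len(stage_delays_ps) - 1
    for i, delay in enumerate(stage_delays_ps):
        lateness = borrowed + delay - period_ps
        window = 0 if i == last else pulse_width_ps
        if lateness > window:
            return False  # data arrives after the capture window closes
        # Late data passes through the open latch and eats into the next stage.
        borrowed = max(0, lateness)
    return True


def min_period_ps(stage_delays_ps, pulse_width_ps):
    """Smallest integer period (ps) that meets timing, by binary search."""
    lo, hi = 1, max(stage_delays_ps)  # hi always works: nothing needs to borrow
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_timing(stage_delays_ps, mid, pulse_width_ps):
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    print(min_period_ps([5000, 4000], 0))    # no borrowing
    print(min_period_ps([5000, 4000], 295))  # 295 ps pulse, as in the talk
```

In this model a 5000 ps stage followed by a 4000 ps stage is limited to a 5000 ps period without borrowing, while a 295 ps pulse lets the first stage finish late and brings the period down to 4705 ps.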

Time Borrowing and Re-timing
Constraint: $\mathit{Delay}(i \to j) \le T_{\mathrm{clock}}$
[Diagram labels: Re-timing, Time-borrowing*]
Practical differences (re-timing vs. time-borrowing):
Transparency to user: invasive netlist changes vs. no design changes
Granularity: coarse vs. fine-grain
Sensitivity to control sets (CE/RST): sensitive vs. insensitive
Max WNS change: ∞ vs. HW-defined
*Sapatnekar and Deokar, "Utilizing the retiming-skew equivalence in a practical algorithm for retiming large circuits," CAD 1996

UltraScale+ MPSoC Floorplan
[Figure: floorplan showing the programmable delays and pulse generators.]

Programmable Delay Hardware
Location: junction between distribution and leaf clocking.
Quantity: one per leaf clock track; 16 time-borrowing blocks per 960 FFs (one block per 60 FFs).
Features: 5 clock delay taps plus a pulse generator; cascading as a cost-efficient way of borrowing > 300 ps.

Clock Skew Scheduling and Pulsed Latches
Baseline: leaf clock bypasses the programmable delays.
Bypass logic is optimized for latency (minimizes extra variation and jitter).
[Figure labels: FF; clock skew scheduling; pulsed latches.]

Time-borrowing optimization
Software flow: synthesis → place → route → time-borrowing optimization → bitgen.
Many strategies possible: use a subset of skews/pulse widths (minimize runtime); use all features, violate hold and fix with the hold router (maximize Fmax).
Time-borrowing algorithms: local greedy optimization; globally optimal ILP-based.
This work (Vivado 2016.1): do not violate hold; globally optimal ILP solution.

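Time Borrowing Based on Global ILP algorithm
Flow: extract timing subgraph → construct LP formulation → LP solver → deposit skew solution; report.
Extracting the timing subgraph:
Max paths: $\mathit{WNS} < \mathit{WNS}_{\mathrm{worst}} + 2 \times \mathit{Tborrow}_{\max}$
Min paths: $\mathit{WHS} < \mathit{Tborrow}_{\max}$
Construct LP constraints for each path:
Setup: $\mathit{PathDelay} - (\mathit{skew}_{\mathrm{end}} - \mathit{skew}_{\mathrm{start}}) < T$
Hold: $\mathit{PathDelay} - (\mathit{skew}_{\mathrm{end}} - \mathit{skew}_{\mathrm{start}}) > 0$
Objective function: minimize $T$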
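The setup and hold constraints above can be exercised on a toy graph. The sketch below is not the ILP used in Vivado: it swaps the solver for exhaustive enumeration over the five default skew taps quoted in the results slides, which is only practical for a handful of clock nodes. The path tuples, node names, and function names are hypothetical, and clock and pulse-width variation are ignored.

```python
from itertools import product

# Default skew taps in ps, as listed on the performance results slide.
TAPS_PS = (0, 41, 96, 168, 295)


def required_period(paths, skew):
    """Smallest T satisfying setup on every path:
    PathDelay_max - (skew_end - skew_start) <= T."""
    return max(d_max - (skew[end] - skew[start])
               for start, end, d_max, _ in paths)


def hold_ok(paths, skew):
    """Hold on every path: PathDelay_min - (skew_end - skew_start) > 0."""
    return all(d_min - (skew[end] - skew[start]) > 0
               for start, end, _, d_min in paths)


def best_skews(paths, taps=TAPS_PS):
    """Try every tap assignment; return (period, skew dict) with the
    smallest period among assignments that keep hold clean, or None."""
    nodes = sorted({n for s, e, _, _ in paths for n in (s, e)})
    best = None
    for choice in product(taps, repeat=len(nodes)):
        skew = dict(zip(nodes, choice))
        if not hold_ok(paths, skew):
            continue  # mirror the "do not violate hold" policy
        period = required_period(paths, skew)
        if best is None or period < best[0]:
            best = (period, skew)
    return best


if __name__ == "__main__":
    # (start, end, max path delay ps, min path delay ps) -- made-up numbers
    paths = [("A", "B", 5000, 400), ("B", "C", 4000, 100)]
    period, skew = best_skews(paths)
    print(period, skew)
```

With these made-up delays the unskewed period is 5000 ps; delaying the clock at B by the 295 ps tap brings it to 4705 ps without breaking either hold check. Tightening the minimum delay of the first path makes hold, not the tap range, the limiting factor, which is the effect the hold sensitivity slide examines.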

Full Set of ILP constraints
Setup constraint; hold constraint; clock delay variation; pulse width variation; clock skew delay tap / pulsed latch exclusivity.
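The transcript lists only the names of these constraints, not their formulas. As a hedged sketch of how discrete tap selection and tap/pulse exclusivity are commonly encoded in an ILP (the selector variables $x_{c,k}$ and $y_c$, the tap delays $d_k$, and the index sets are assumptions introduced here, and the two variation terms are left out):

```latex
\begin{align*}
& x_{c,k} \in \{0,1\},\quad \sum_{k} x_{c,k} = 1
  && \text{one delay tap } k \text{ per leaf clock } c \\
& \mathit{skew}_c = \sum_{k} d_k\, x_{c,k}
  && d_0 = 0 \text{ is the zero-delay tap} \\
& \mathit{PathDelay}_p - (\mathit{skew}_{\mathrm{end}(p)} - \mathit{skew}_{\mathrm{start}(p)}) < T
  && \text{setup, each max path } p \\
& \mathit{PathDelay}_p - (\mathit{skew}_{\mathrm{end}(p)} - \mathit{skew}_{\mathrm{start}(p)}) > 0
  && \text{hold, each min path } p \\
& y_c \in \{0,1\},\quad y_c + (1 - x_{c,0}) \le 1
  && \text{pulse mode excludes a nonzero tap}
\end{align*}
```

How the pulse width enters the setup and hold rows, and how the variation terms are modeled, is not recoverable from the transcript and is not guessed at here.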

Experimental setup
Vivado Design Suite version 2016.1.
≈90 representative designs and Xilinx IP blocks (communications, test/measurement, emulation, etc.).
Implemented on UltraScale+ devices, fastest speed grade (-3E).
Design statistics (min / max / avg):
clk domains: 1 / 28 / 2
Fmax: 77 MHz / 850 MHz / 300 MHz
LUT: 8k / 464k / 129k
FF: 3k / 586k / 123k
BRAM: 1152, 187 (one of the three values is missing in the transcript)
DSP: 2700, 195 (one of the three values is missing in the transcript)
Total designs: 89

Performance improvement results
Default time-borrowing configuration: 5 clock skew values, [0, 41, 96, 168, 295] ps; 1 clock pulse width, 295 ps.
Globally optimal ILP algorithm.
No hold violations allowed.
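The concluding slide summarizes the result for this configuration as a geometric-mean ("gmean") Fmax increase. A small sketch of how such a figure is computed from per-design Fmax values before and after time-borrowing; the function name is hypothetical and the numbers in the usage lines are placeholders, not data from the talk.

```python
import math


def gmean_gain_percent(fmax_before_mhz, fmax_after_mhz):
    """Geometric-mean Fmax improvement, in percent, over paired designs."""
    ratios = [after / before
              for before, after in zip(fmax_before_mhz, fmax_after_mhz)]
    # Average in the log domain so that one outlier design cannot dominate.
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


if __name__ == "__main__":
    before = [100.0, 100.0]  # placeholder Fmax values, MHz
    after = [100.0, 121.0]   # placeholder Fmax values, MHz
    print(round(gmean_gain_percent(before, after), 3))
```

The geometric mean is the usual choice for averaging ratios across designs whose absolute clock rates differ widely, as they do here (77 MHz to 850 MHz).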

Cascading programmable delays
Cost-efficient way to borrow > 300 ps.
8 possible clock skew values: [0, 41, 96, 168, 295][+295]* ps.
2 pulse widths: [295, 610] ps.
No hold violations allowed.

Hold Sensitivity Analysis
Impact of hold on Fmax: 5.5% Fmax increase with 0 hold violations; the router can potentially delay fast paths; measure the impact of adding hold margin.
[Figure: results vs. holdMargin.]

Location, cost, and performance
Why delay and replicate leaf clocks? Why not global clock buffers? Why not in the logic slice?
[Chart values: 5% Fmax/unit area; 1.3% Fmax/unit area.]
Replicated leaf architecture provides the highest Fmax/$.

Concluding Remarks
UltraScale+ architecture with programmable time-borrowing: improves Fmax by redistributing slack between fast and slow paths; employs both clock skew scheduling and pulsed latches; transparent to the customer, no netlist changes.
Performance results in production (Vivado 2016.1): 5.5% gmean Fmax increase with the zero-hold ILP-based algorithm; higher Fmax possible when using cascades or increasing hold margin.
Area- and runtime-efficient: less than 0.1% of additional chip area; less than 4 minutes of additional runtime on average.

Thank you