UC Berkeley BRASS Group Post Placement C-Slow Retiming for Xilinx Virtex FPGAs Nicholas Weaver Yury Markovskiy Yatish Patel John Wawrzynek UC Berkeley.

Presentation transcript:

Slide 1: Post Placement C-Slow Retiming for Xilinx Virtex FPGAs
Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek
UC Berkeley Reconfigurable Architectures, Systems, and Software (BRASS) Group
ACM Symposium on Field Programmable Gate Arrays (FPGA), February 2x,

UC Berkeley BRASS Group: Automatic C-Slow Retiming for Virtex FPGAs

Slide 2: Outline
- “Automatically Double Your Throughput”
  - “You paid for those registers, here’s how to use them”
- Retiming and C-slow Retiming
  - The transformation
- C-slow Retiming and the Virtex FPGA
  - The target
- Retiming 3 Benchmarks
  - The tests

Slide 3: Retiming and Repipelining
- Retiming
  - Automatically moving registers to minimize the clock period
  - Benefits limited by the number of registers
  - Algorithm developed by Leiserson et al.
- Repipelining
  - Adding registers to the front or back
  - Let retiming then move them around
- But what about feedback loops?
  - Retiming and repipelining are of limited benefit when you have feedback loops
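The feedback-loop limit can be made concrete. Retiming moves registers but never changes how many sit around a cycle, so the clock period cannot drop below the loop's total delay divided by its register count. The following is a minimal sketch of that bound; the function name and the 5 ns block delays are illustrative assumptions, not figures from the slides.

```python
def loop_period_bound(delays_ns, registers_in_loop):
    """Lower bound on the clock period imposed by one feedback loop.

    Retiming preserves the register count around any cycle, so at least one
    register-to-register segment carries total_delay / registers of logic.
    """
    if registers_in_loop < 1:
        raise ValueError("a synchronous loop needs at least one register")
    return sum(delays_ns) / registers_in_loop


# Hypothetical loop: four logic blocks of 5 ns each (made-up numbers).
loop = [5.0, 5.0, 5.0, 5.0]
print(loop_period_bound(loop, 1))  # one register: moving it around cannot help
print(loop_period_bound(loop, 2))  # a second register in the loop halves the bound
```

With a single register in the loop, no placement of that register shortens the cycle, which is why retiming alone may give no benefit on loop-limited designs.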

Slide 4: C-Slow Retiming
- Replace every register with a sequence of C registers
  - With more registers, retiming can break the design into finer pieces
  - Again proposed by Leiserson et al., to meet systolic slowdown
- Semantic-altering transformation
  - But the resulting semantics are predictable and useful
- Ideal: C-slow in synthesis, retime after placement
- Our prototype: C-slow and retime after placement
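As a rough illustration of the transformation itself, the sketch below models a design as a mapping from connections to register counts and multiplies every count by C. The dictionary representation and the name `c_slow` are assumptions made for this example; the slides do not describe the prototype's data structures.

```python
def c_slow(edge_registers, c):
    """Return a copy of the graph with every register replaced by c registers.

    edge_registers maps (source, sink) to the number of registers on that
    connection; edges with zero registers stay purely combinational.
    """
    if c < 1:
        raise ValueError("C must be at least 1")
    return {edge: count * c for edge, count in edge_registers.items()}


# Hypothetical three-node loop with a single register on the feedback edge.
design = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}
three_slow = c_slow(design, 3)
print(three_slow)  # the feedback edge now holds 3 registers for retiming to spread out
```

The extra registers start out stacked on the original register locations; it is the later retiming pass that distributes them through the logic.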

Slide 5: Design Semantics After C-Slowing
- Design operates on C independent data streams
  - Data streams are externally interleaved on a round-robin basis
- Semantics apply to designs with task-level parallelism
  - Encryption: Counter (CTR) mode works on independent blocks
  - Sequence matching: compare sequence vs. database
- C-slowing improves throughput but adds latency and registers
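The changed semantics can be checked on a toy design. The sketch below is an assumed example, not one of the benchmarks: it takes a running-sum accumulator, replaces its single state register with a chain of C registers, and feeds it round-robin interleaved inputs. Each stream then sees exactly the results the original design would have produced for it alone.

```python
from collections import deque


def run_accumulator(inputs):
    """Reference design: one state register, one stream."""
    state, outputs = 0, []
    for x in inputs:
        state = state + x
        outputs.append(state)
    return outputs


def run_c_slowed_accumulator(interleaved_inputs, c):
    """Same logic with the state register replaced by a chain of c registers."""
    chain = deque([0] * c)      # c registers in the feedback loop, reset to 0
    outputs = []
    for x in interleaved_inputs:
        oldest = chain.pop()    # value that has passed through all c registers
        new_state = oldest + x  # unchanged combinational logic
        chain.appendleft(new_state)
        outputs.append(new_state)
    return outputs


def interleave(streams):
    """Round-robin interleave equally long streams: s0[0], s1[0], ..., s0[1], ..."""
    return [x for group in zip(*streams) for x in group]


streams = [[1, 2, 3], [10, 20, 30]]
mixed = run_c_slowed_accumulator(interleave(streams), c=2)
per_stream = [mixed[i::2] for i in range(2)]  # deinterleave the outputs
print(per_stream)
```

The interleaving and deinterleaving happen outside the design, which matches the slide's note that the streams are externally interleaved.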

Slide 6: C-slowing, Retiming, and the Virtex FPGA
- Every 4-LUT has an associated register
  - Register can, almost always, be used independently of the LUT
- LUTs can act as clocked shift registers (SRL16s)
  - Used in our AES hand-benchmark
  - Not used in our tool
- Many designs have low register utilization
  - Excess of registers available in unoptimized designs
- Retiming best performed with/after placement
  - Xilinx placement operates on mapped slices
  - Need net delay information for better results
[Figure: 4-LUT with labels F1, F2, F3, F4, BX, X, XB]

Slide 7: Sketch of Tool’s Operation
1. Convert .ncd to .xdl after placement
2. Load design into graph representation
3. Replace registers with edge annotations to represent registers
4. Replace every single register with C registers
5. Compute costs based on delay model
6. Retime
7. Convert edge annotations back to instance registers
8. Write out .xdl, convert to .ncd
9. Route
[Figure: flow diagram with labels Placer, Router, .xdl, .xdl]
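Steps 3 through 6 can be sketched on a tiny graph using the standard Leiserson-Saxe formulation, in which a retiming vector r changes the register count on an edge u to v from w to w + r[v] - r[u]. This is only an illustration of evaluating a given retiming; it does not search for the best one, and the graph, delays, and function names are assumptions rather than details of the prototype tool.

```python
def retimed_weights(edges, r):
    """Apply retiming vector r: w_r(u, v) = w(u, v) + r[v] - r[u]."""
    out = {(u, v): w + r.get(v, 0) - r.get(u, 0) for (u, v), w in edges.items()}
    if any(w < 0 for w in out.values()):
        raise ValueError("illegal retiming: negative register count on an edge")
    return out


def clock_period(delay, edges):
    """Longest combinational path, following only edges with zero registers."""
    arrival = {}

    def arrive(v, visiting=()):
        if v in visiting:
            raise ValueError("combinational loop (cycle with no registers)")
        if v not in arrival:
            preds = [u for (u, t), w in edges.items() if t == v and w == 0]
            arrival[v] = delay[v] + max(
                (arrive(u, visiting + (v,)) for u in preds), default=0.0)
        return arrival[v]

    return max(arrive(v) for v in delay)


# Hypothetical loop a -> b -> c -> a, 5 ns per node, one register on c -> a.
delay = {"a": 5.0, "b": 5.0, "c": 5.0}
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}
print(clock_period(delay, edges))                      # original design

slowed = {e: 3 * w for e, w in edges.items()}          # step 4 with C = 3
balanced = retimed_weights(slowed, {"a": 0, "b": 1, "c": 2})
print(balanced, clock_period(delay, balanced))         # one register per edge
```

A real flow would also fold interconnect delay into the node or edge costs (step 5) and solve for r rather than supplying it by hand.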

Slide 8: Experiment 1: How Good is the Tool?
- Tool is a simple prototype
  - Manhattan distance delay estimate
  - No attempt to minimize flip-flops
  - Basic flip-flop allocation
- Two benchmarks: AES and Smith/Waterman
  - Hand mapped
  - (optionally) hand placed
  - (optionally) hand C-slowed and retimed
- Our best hand AES implementation
  - 1.3 Gb/s
  - <800 slices, 10 BlockRAMs
  - $10 part, Spartan II-100
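A Manhattan distance delay estimate can be as simple as a fixed cost plus a per-unit cost times the horizontal and vertical distance between two placed sites. The sketch below shows that shape only; the coefficients and the function name are placeholders, since the slides do not give the prototype's actual delay model.

```python
def manhattan_delay_ns(src, dst, fixed_ns=0.5, per_unit_ns=0.1):
    """Estimate net delay from placed (column, row) coordinates.

    fixed_ns and per_unit_ns are placeholder coefficients, not values
    taken from the slides or from any Xilinx timing data.
    """
    (x0, y0), (x1, y1) = src, dst
    return fixed_ns + per_unit_ns * (abs(x1 - x0) + abs(y1 - y0))


# A net spanning 3 columns and 4 rows covers 7 Manhattan units.
print(manhattan_delay_ns((0, 0), (3, 4)))
```

Such an estimate only needs placement coordinates, which is one reason it suits a tool that runs after placement and before routing.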

Slide 9: Experiment 1: AES, Automatically Placed

Version | Clock Rate (Throughput) | Stream Clock Rate (1 / Latency)
--- | --- | ---
Initial Design | 48 MHz |
5-Slow by hand | 105 MHz | 21 MHz
Retimed Automatically | 47 MHz |
2-Slow Automatically | 64 MHz | 32 MHz
3-Slow Automatically | 75 MHz | 25 MHz
4-Slow Automatically | 87 MHz | 21 MHz
5-Slow Automatically | 88 MHz | 18 MHz

- Just retiming is of no benefit
- Automatic C-slowing very effective
  - But could do even better
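The stream clock rate follows from the clock rate: each of the C interleaved streams advances once every C cycles, so its effective rate is the clock rate divided by C. The short sketch below recomputes that column from the automatically C-slowed AES clock rates; the function name is an assumption, and the slide appears to report whole-MHz values.

```python
def stream_clock_mhz(clock_mhz, c):
    """Per-stream rate of a C-slowed design: each stream advances once every c cycles."""
    return clock_mhz / c


# (label, clock rate in MHz, C) taken from the automatically C-slowed AES rows.
rows = [("2-Slow", 64, 2), ("3-Slow", 75, 3), ("4-Slow", 87, 4), ("5-Slow", 88, 5)]
for name, clock, c in rows:
    print(f"{name}: {clock} MHz total, {stream_clock_mhz(clock, c):.1f} MHz per stream")
```

Throughput rises with the clock rate while each individual stream slows down, which is the latency cost the slides mention.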

Slide 10: Experiment 1: Smith/Waterman, Automatically Placed

Version | Clock Rate (Throughput) | Stream Clock Rate (1 / Latency)
--- | --- | ---
Initial Design | 43 MHz |
4-Slow by hand | 90 MHz | 22 MHz
Retimed Automatically | 40 MHz |
2-Slow Automatically | 69 MHz | 34 MHz
3-Slow Automatically | 84 MHz | 28 MHz
4-Slow Automatically | 76 MHz | 25 MHz

- Again, just retiming is of no benefit
- C-slowing highly effective
  - Within 7% of hand-built implementation

Slide 11: Experiment 1: Comments
- Just retiming is of no benefit
  - Both designs limited by single-cycle feedback loops
- C-slowing very effective
  - Able to automatically nearly double throughput
  - Hand implementations more than doubled throughput
  - Reasonable numbers of additional registers
- Limitations of prototype tool:
  - Flip-flop allocation routines could be better
  - Some AES hand benchmarks used SRL16 delay chains
- Simple is pretty good
  - Relatively simplistic implementation gets reasonably close to hand-mapped performance

Slide 12: Experiment 2: Retiming LEON
- Can we automatically C-slow a large, synthesized design?
- Leon 1: a synthesized, GPLed, SPARC-compatible microprocessor core [1]
  - 5-stage pipeline, integer only
  - Modify register file to use BlockRAMs
    - BlockRAMs are used as negative-edge devices
  - Remove caches, I/O, etc.
  - Synthesize using Synplify with CEs disabled
  - Edit EDIF to replace sets/resets
- Retime and C-slow with prototype tool
  - Prototype tool converts BlockRAMs to positive edge
- C-slow a microprocessor core...
  - Get an interleaved multithreaded architecture
[1] Leon 1, by Jiri Gaisler,

Slide 13: Experiment 2: Results

Version | Clock Rate (Throughput) | Thread Clock Rate (Latency) | LUT-Associated Flip-Flops | LUT-Independent Flip-Flops
--- | --- | --- | --- | ---
Initial Design | 23 MHz | | 1611 | NA
Retimed Automatically | 25 MHz | | [missing] | [missing]
2-Slow Automatically | 46 MHz | 23 MHz | [missing] | [missing]
3-Slow Automatically | 47 MHz | 16 MHz | [missing] | [missing]

[count missing] LUTs for all designs

- Retiming alone worked surprisingly well
- 2-slowing very effective
- 3-slowing hit diminishing returns

Slide 14: Experiment 2: Comments
- Retiming alone worked surprisingly well
  - Tool automatically converted BlockRAMs to positive-edge clocking and rebalanced the pipeline
- 2-slowing very effective
  - Effectively doubled the initial throughput
    - No slowdown in latency over the initial design, because retiming was effective without C-slowing
  - Used many more registers, but fewer registers than LUTs
- 3-slowing hit diminishing returns
  - Too many registers required, combined with poor register allocation, led to poor performance

Slide 15: Conclusions
- C-slow retiming is very effective
  - "Automatically double your throughput"
    - Benefits: more throughput
    - Costs: more flip-flops, worse latency
- Post-placement retiming appropriate
  - Independent flip-flop usage critical
  - Have delay model for interconnect as well as logic
- Some room for improvement
  - Faster/better implementation
    - Minimize flip-flop usage as well as delay
    - Use SRL16s
    - Better placement of flip-flops
  - Experience suggests more flip-flops per LUT would be useful

Slide 16: Backup Slide: Why Not Use (Current) Synthesis Tools?
- Many synthesis tools support retiming, but with caveats:
  - ONLY works for synthesized items
    - AES and Smith/Waterman didn't use synthesis
  - Can't automatically C-slow
  - Can't retime through memory blocks
  - Can't accurately guesstimate interconnect delay before placement
    - >½ of the delay is the interconnect
  - Can't effectively scavenge unused flip-flops before placement
    - Xilinx placement operates on slices, not LUTs

Slide 17: Backup Slide: Why the Limitations on Total Speedup?
- Absolute maximum
  - Interconnect + LUT + flip-flop
- Practical maximums
  - Too many flip-flops to allocate
    - “Only” one flip-flop per LUT available
  - Flip-flop allocation poor
    - Quick and dirty greedy heuristic
  - Works well for mild C-slowing
  - Fails with highly aggressive C-slowing
  - Tool doesn’t minimize flip-flops
  - Critical path is defined by the single worst path
  - Tool uses “cheap and dirty” interconnect delay model

Slide 18: Backup Slide: Design Restrictions to Enable C-slowing
- Resets and clock enables
  - Convert to explicit logic
- Memories
  - Increase by a factor of C
    - Add high bits of address to provide round-robin access
    - Every stream sees an independent memory
- Global set/reset
  - Convert to individual resets
  - Still highly restrictive
- Interleave/deinterleave I/O
  - Requires external logic
- No asynchronous sets/resets
[Figure: memory diagram with labels Din, Dout, Addr, WE (shown twice) and Thread Counter]
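The memory restriction can be illustrated with a small behavioral model. In the sketch below a memory is enlarged by a factor of C and a thread counter selects the bank, so each interleaved stream sees its own private copy. The class name, the write-before-read ordering, and the bank arithmetic are assumptions for illustration; computing the bank as thread times depth corresponds to prepending high address bits when the depth is a power of two.

```python
class CSlowedMemory:
    """Memory enlarged by a factor of C, banked by a round-robin thread counter."""

    def __init__(self, depth, c):
        self.depth, self.c = depth, c
        self.cells = [0] * (depth * c)
        self.thread = 0  # thread counter, advances every clock

    def _physical(self, addr):
        if not 0 <= addr < self.depth:
            raise IndexError("address outside the original memory depth")
        return self.thread * self.depth + addr  # thread counter forms the high bits

    def clock(self, addr, din=0, we=False):
        """One clock cycle: optional write, then read, then move to the next stream."""
        phys = self._physical(addr)
        if we:
            self.cells[phys] = din  # assumed write-before-read behavior
        dout = self.cells[phys]
        self.thread = (self.thread + 1) % self.c
        return dout


mem = CSlowedMemory(depth=4, c=2)
mem.clock(1, din=7, we=True)      # stream 0 writes address 1
mem.clock(1, din=9, we=True)      # stream 1 writes the same logical address
print(mem.clock(1), mem.clock(1)) # each stream reads back its own value
```

The design itself still issues the original addresses; only the added thread counter distinguishes the streams.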
