Xilinx Adaptive Compute Acceleration Platform: Versal Architecture

Presentation transcript:

Xilinx Adaptive Compute Acceleration Platform: Versal Architecture
Brian Gaide, Dinesh Gaitonde, Chirag Ravishankar, Trevor Bauer (2/25/19)
Speaker notes: Intro: Principal Engineer on the Architecture team, Longmont, Colorado. What I'm not going to cover: with limited time, I'll bypass discussions of what an ACAP is, the AI engines, and the NoC (Ian Swarbrick covers the NoC later) and focus on the fabric architecture.

In Search of a Scalable Fabric Solution
- Technology scaling alone is insufficient to meet project goals
  - Slowed pace of Moore's Law means compute efficiency is not scaling well
  - Metal resistance is a primary issue, especially for FPGAs
- Economic challenges
  - Heterogeneous compute targets higher-volume, more cost-sensitive markets
  - Competition with non-FPGA-based solutions
- Need scalable fabric solutions to address these new challenges
Speaker notes: Porting UltraScale+ forward would not have met many of our design goals. Why? A combination of technology and economic forces.

Scalable Routing

Interconnect
- Metal is not scaling
  - More layers, but fewer tracks
  - Metal resistance forces a tradeoff between delay and routability
- Coarser CLE + local crossbar
  - A small number of muxes captures a disproportionate share of internal routes
  - Both the demanded fraction and the realized fraction of internal routes increase
- Hierarchical routing without the delay penalty
  - Leverage local connections in cheaper metal
Speaker notes: Why hierarchical interconnect? The basic idea is to capture a larger percentage of local routes on lower-performance, cheaper wires. A virtual crossbar connects all CLE cluster outputs to its inputs. Internally demanded connections increase by 3X, internally satisfied connections increase by 8X, and 15% of all routes are captured internally. The muxing structure is optional, so the delay penalty is paid only when congested. The structure is also leveraged for multi-pin nets and control-signal propagation.
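The locality numbers above can be related with back-of-envelope arithmetic. The sketch below is illustrative only: the baseline fractions are hypothetical, chosen so the ratios reproduce the 3X demanded, 8X satisfied, and 15% captured figures quoted on the slide.

```python
# Hypothetical baseline locality fractions for a fine-grained CLE; only the
# 3X / 8X multipliers and the 15% end point come from the slide.
demanded_fine = 0.07     # assumed: fraction of routes demanded internal to a small CLE
realized_fine = 0.01875  # assumed: fraction actually satisfied locally

# Coarser CLE cluster plus a local (virtual) crossbar from cluster outputs
# back to cluster inputs:
demanded_coarse = 3 * demanded_fine   # internally demanded connections: 3X
realized_coarse = 8 * realized_fine   # internally satisfied connections: 8X

print(f"demanded internally:  {demanded_coarse:.1%}")   # 21.0%
print(f"satisfied internally: {realized_coarse:.1%}")   # 15.0% of all routes
```

Every route satisfied inside the cluster never touches the expensive general interconnect, which is the point of the local crossbar.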

Interposer Routing
- Distributed die-to-die interface
  - Reduces congestion to/from the interface
- Leverage the interposer for long-distance routing
  - 30% faster than standard interconnect
- Interconnect capacity scaling
  - Only pay for more routing on larger devices that require it
Speaker notes: SSIT explanation: interposer routing is a second-level routing layer that sits on top of the base-die routing. It is scalable in the sense that it enables large devices to route, while small (monolithic) devices don't have to pay for it. How? The interposer exists not just over the die-to-die interfaces but over the full device, enabling a full mesh not only between dice but within a die in all four directions, with the interposer interface fully distributed into each CLE. Why? Any interposer route used frees up the on-die routing resources it would otherwise consume. Long-distance routes are 30% faster than on-die, and a single hop lands 75 tiles away vertically. This reduces on-die routing congestion: concentrated die-to-die interfaces cause routing bottlenecks getting to the scarce interfaces, and the distributed approach cuts interface congestion by a factor of 2. More vertical tracks near the die edge mean no impact on inter-die bandwidth.
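A toy delay model makes the long-distance argument concrete. Everything numeric here except the 30% speedup and the 75-tile hop span is an assumption made up for illustration:

```python
# Hypothetical delay model comparing on-die routing with interposer hops.
ONDIE_DELAY_PER_TILE = 1.0       # assumed, arbitrary units
INTERPOSER_DELAY_PER_TILE = 0.7  # 30% faster than on-die routing (from the slide)
HOP_SPAN = 75                    # one interposer hop covers 75 tiles vertically

def route_delay(tiles, use_interposer):
    """Delay to cover `tiles` vertical tiles; an interposer path finishes the
    remainder (less than one hop) on ordinary on-die wires."""
    if not use_interposer:
        return tiles * ONDIE_DELAY_PER_TILE
    hops, rem = divmod(tiles, HOP_SPAN)
    return hops * HOP_SPAN * INTERPOSER_DELAY_PER_TILE + rem * ONDIE_DELAY_PER_TILE

print(route_delay(300, False))  # all on-die
print(route_delay(300, True))   # four interposer hops at 0.7x the delay
```

On top of the raw speedup, each interposer hop also frees 75 tiles' worth of on-die tracks for other nets.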

Scalable Compute

Streamlined CLB
[Figure: enhanced LUT with inputs A1-A6, outputs O5/O6 (O5_1, O5_2), and prop/cascade_in cascade pins]
- Less-is-more philosophy
  - More general-purpose CLEs are better than fewer specialized ones
  - Use soft logic for wide muxes, wide functions, and deep LUTRAM/SRL modes
- Every CLE is the same
  - 50% LUTRAM/SRL capable everywhere
- Enhanced LUT
  - Packs more dual-LUT functions
  - Fast cascade path, leveraged for a lower-cost carry chain
Speaker notes: It is better to have more general-purpose CLEs than fewer, more specialized ones; some corner-case designs suffer, but in aggregate most designs see a compute-density improvement. Less-used functions (wide muxes, wide function generators, deep LUTRAM/SRL modes) move to soft logic. The result is a uniform canvas for software: all CLEs have LUTRAM. The LUT gains packing improvements for dual-input functions and a generic cascade path, giving lower-cost carry at no performance expense.
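To see what "wide muxes in soft logic" costs, here is some illustrative LUT-count arithmetic. It assumes generic 6-input LUTs and a simple 4:1-mux tree decomposition (4 data + 2 select = 6 inputs = one 6-LUT); real packing details in any given device differ.

```python
import math

def luts_for_mux(n_inputs):
    """6-LUTs needed for an n:1 mux built as a tree of 4:1 muxes,
    where each 4:1 mux (4 data + 2 select inputs) fills one 6-LUT."""
    luts = 0
    while n_inputs > 1:
        stage = math.ceil(n_inputs / 4)  # 4:1 muxes needed at this tree level
        luts += stage
        n_inputs = stage                 # their outputs feed the next level
    return luts

print(luts_for_mux(16))  # 5: four first-level 4:1 muxes plus one combiner
```

The soft-logic cost is modest and is paid only by designs that actually need wide muxes, instead of every CLE carrying dedicated mux hardware.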

Imux Registers
- Increase design speed with minimal design adaptation and low cost; all designs benefit
- Bypassable registers + clock modulation on each block input
- Flexible input registers: clock enable, sync/async reset, initialization
- Multiple modes:
  - Time borrowing (transparent to the user)
  - Hold fixing (fixes min delays)
  - Pipelining/retiming
  - Pipelining/retiming + time borrowing
Speaker notes: An imux is the mux on every input pin of a fabric-facing block. These are pipeline registers that sit on the imux pins, along with some sophisticated clock-modulation features; we call them this because marketing hasn't had a chance to name them something weird yet. The idea behind this feature was to be generic enough to enhance performance for all designs. Any optional feature costs additional delay when not used; we overcome that with aggressive time borrowing. Min delays are often the limiter, so a hold-fixing mode fixes them. Coupling pipelining with time borrowing lowers the barrier to entry for customers willing to modify their design. The input registers are fully featured.
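A small model shows why time borrowing helps. This is a from-scratch sketch, not Xilinx's timing engine: the stage delays and borrowing window are hypothetical, and the feasibility rule (a register may push a bounded shortfall into the next stage's timing budget) is a simplification of real latch-based borrowing.

```python
def feasible(stages, period, window):
    """True if every stage meets timing at `period` when each register may
    borrow up to `window` time units from the following stage."""
    borrow = 0.0  # shortfall carried into the current stage
    for delay in stages:
        borrow = delay + borrow - period  # leftover after this clock cycle
        if borrow > window:
            return False                  # exceeds the borrowing window
        borrow = max(borrow, 0.0)         # unused slack cannot be banked
    return True

def min_period(stages, window, eps=1e-6):
    """Binary-search the smallest feasible clock period."""
    lo, hi = 0.0, max(stages)             # max(stages) is always feasible
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(stages, mid, window):
            hi = mid
        else:
            lo = mid
    return hi

# Hard-edged flops are limited by the worst stage; borrowing pulls the
# achievable period toward the average stage delay.
print(min_period([3.0, 1.0], window=0.0))  # ~3.0: no borrowing allowed
print(min_period([3.0, 1.0], window=1.0))  # ~2.0: long stage borrows from short one
```

This is also why the feature can be transparent to the user: the borrowing happens in the clocking of the imux register, not in the design's RTL.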

Imux Register Examples
- Original path dominated by interconnect (INT) delay
- Time borrowing or retiming moves the virtual location of the register
- Pipelining isolates each level of logic
- Pipelining + time borrowing breaks up the interconnect delay

Global Clocking
[Figure: 24 horizontal clock spines, PLL, clock divider, leaf selection mux, clock leaf]
- Challenge: reduce clocking overhead without sacrificing capacity or flexibility
- Three-pronged approach to reducing clocking overhead:
  1) Isolated global clocking supply, for jitter reduction
  2) Active clock deskew, for intra-clock-domain skew reduction (also between dice in SSIT)
  3) Local clock dividers, for inter-clock-domain skew reduction
Speaker notes: We wanted to maintain the ASIC-like clocking we pioneered two generations ago: a flexible root location and segmentable clock tracks that allow multiple smaller clock networks to be placed on the same track layer. Clocking overhead becomes substantial for high-speed designs. How do we overcome it?
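Why do local dividers reduce inter-domain skew? A toy insertion-delay model (all numbers below are hypothetical) shows the mechanism: two related clocks distributed on separate global trees see the difference of the trees' insertion delays as skew, whereas a divider at the leaf derives both domains from the same tree.

```python
# Hypothetical insertion delays (ns) for two separately routed global trees.
insertion_clk  = 2.10   # assumed delay of the clk tree
insertion_div2 = 2.35   # assumed delay of a separately routed clk/2 tree

# Separate trees: the insertion-delay mismatch appears as inter-domain skew.
skew_separate_trees = abs(insertion_clk - insertion_div2)

# Local divider at the clock leaf: clk and clk/2 share one tree, so only a
# small assumed divider-output mismatch remains.
divider_mismatch = 0.02
skew_local_divider = divider_mismatch

print(f"separate trees: {skew_separate_trees:.2f} ns")
print(f"local divider:  {skew_local_divider:.2f} ns")
```

The same shared-path reasoning is why active deskew helps within a domain: it cancels insertion-delay differences rather than tolerating them.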

Scalable Platform

Configuration
Fabric blocks:
- 8X reduction in configuration time per bit
  - Fully pipelined data path
  - Aligned, repeated fabric blocks allow address-buffer insertion
  - Internal config bus width increased by 4X
- 56X-300X readback enhancements
  - Leverages the same configuration-path speedups
  - Concentrates CLB flop data into a minimal number of frames
  - Read-pipeline efficiency gains
  - Parallel readback of multiple dice in SSIT
- Design state snapshotting (at 50 MHz Fmax or less)
  - Capture design state without stopping the clock; read out in the background
Perimeter blocks:
- Separate NoC-based configuration scheme: lower overhead, more flexible
Speaker notes: Larger designs require longer configuration times, and there is demand for faster configuration in the datacenter and faster debug turnaround in emulation markets, so we wanted configuration, reconfiguration, and debug times to scale better than a straight technology port. Configuration takes a brute-force approach: a wider internal datapath, heavy pipelining, address-line buffer insertion between fabric blocks, and a higher config clock speed, for an 8X speedup. Readback leverages the config infrastructure plus additional enhancements: CLB flop data concentrated into a minimal number of frames, transaction reordering, parallel readback of multiple SSIT dice, and design snapshotting.
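The 8X per-bit speedup can be decomposed with back-of-envelope arithmetic. The 4X bus widening is from the slide; the bitstream size, baseline bus width, and the 2X clock-rate gain from pipelining are assumptions chosen only to illustrate how the factors multiply.

```python
def config_time_s(bits, bus_width_bits, clock_hz):
    """Idealized time to stream `bits` of configuration data."""
    return bits / bus_width_bits / clock_hz

# Hypothetical prior-generation baseline.
baseline = config_time_s(200e6, bus_width_bits=32, clock_hz=100e6)

# Versal-style: 4X wider internal bus (from the slide) and an assumed 2X
# clock-rate gain from the fully pipelined data path.
improved = config_time_s(200e6, bus_width_bits=32 * 4, clock_hz=200e6)

print(baseline / improved)  # 8.0: the width and clock speedups multiply
```

The readback gains stack further multipliers (frame concentration, transaction reordering, parallel SSIT readback) on top of this same path, which is how they reach 56X-300X.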

Hardened Features
- Everything required for a shell:
  - Controller: processor subsystem / platform management
  - Data channel: NoC
  - Communication protocol: CPM / PCIe
  - External communication interface: DDR / HBM memory controller interface
- The shell system is fully operational without a bitstream
- Additional market-specific features:
  - Wired comms: various hardened protocols
  - Wireless / machine learning: AI engines
Speaker notes: Hardened features are not a fabric feature so much as the absence of needing to use fabric. To reach higher-volume markets where hardware-programming experience is more limited, we raise the level of abstraction: hardened solutions for all of the low-level requirements of system bring-up enable the system to be fully operational without a user bitstream.

Conclusion
Versal enables a scalable fabric solution for next-generation designs.
- Scalable interconnect
  - Hierarchical approach reduces metal demand
  - Interposer routing adds an extra routing level on larger devices
- Scalable compute
  - Compute-density-optimized architecture
  - Pipelining/time borrowing with minimal design perturbation
  - Lower clocking overhead
- Scalable platform
  - Substantial reductions in config and readback times
  - Hardened shell features

Backup

Versal Architecture
- The architecture behind the first Adaptive Compute Acceleration Platform (ACAP)
- Tight integration between:
  1) SW-programmable processors (ARM cores)
  2) SW-programmable accelerators (AI engines)
  3) HW-programmable fabric (traditional FPGA)
- Raise the abstraction level through critical-function hardening
- A single integrated platform is key: higher system performance, lower system power
[Figure labels: Hardened Shell Functions, Microprocessor, Domain-Specific Compute Array, Hardware-Programmable Logic]