FPGA Intra-cluster Routing Crossbar Design Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.


Generating Highly Routable Sparse Crossbars for PLDs
Guy Lemieux, Paul Leventis, David Lewis
International Symposium on FPGAs, 2000

Basic Notation

Fully Populated Crossbar
Full capacity – can connect as many signals as the number of outputs
Flexibility – can connect any input to any output

Full-capacity Minimal Crossbars
Full capacity
Reduced flexibility: you lose the ability to connect any input to any output
p = m(n – m + 1) switches (n inputs, m outputs)

Full-capacity Minimal Crossbars …
Area savings are minimal if n >> m
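To make the "minimal savings" point concrete, the short sketch below compares the switch count of a fully populated crossbar (n·m) with that of a full-capacity minimal crossbar, m(n – m + 1), taking n as the number of inputs and m as the number of outputs with n ≥ m. The function names are illustrative only and do not come from the paper.

```python
def full_crossbar_switches(n: int, m: int) -> int:
    """Switch count of a fully populated n-input, m-output crossbar."""
    return n * m


def minimal_full_capacity_switches(n: int, m: int) -> int:
    """Switch count m(n - m + 1) of a full-capacity minimal crossbar (n >= m)."""
    if n < m:
        raise ValueError("expected n >= m")
    return m * (n - m + 1)


def relative_saving(n: int, m: int) -> float:
    """Fraction of switches removed relative to the fully populated crossbar."""
    full = full_crossbar_switches(n, m)
    return (full - minimal_full_capacity_switches(n, m)) / full


if __name__ == "__main__":
    for n, m in [(6, 4), (168, 24), (1000, 4)]:
        print(n, m,
              full_crossbar_switches(n, m),
              minimal_full_capacity_switches(n, m),
              f"{relative_saving(n, m):.1%}")
```

Algebraically the saving is (m – 1)/n, so it shrinks toward zero as n grows relative to m.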

Perfect and Sparse Crossbars
Perfect crossbars
– Can disjointly route any m-sized subset of the n inputs to the m outputs
– Both full and full-capacity minimal crossbars are perfect
Sparse crossbars
– Have p < m(n – m + 1) switches
– Cannot be perfect

Bipartite Graph Representation
[Figure: crossbar drawn as a bipartite graph with input vertices I1–I6 and output vertices O1–O4]

Evaluation Challenge
How “routable” is a given crossbar?
– Build an FPGA, map 20+ applications, observe results: slow, and highly subject to the application mix
– Monte Carlo test: generate random test vectors, route each test vector on the crossbar (network flow), and report the number of successes as a percentage
A highly routable sparse crossbar has a >= 95% success rate
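A minimal sketch of the Monte Carlo test follows. It makes several assumptions not stated on the slide: a test vector is a uniformly random subset of inputs of a fixed size, the crossbar is given as a list `adj` where `adj[i]` is the set of outputs reachable from input `i`, and routing is checked with an augmenting-path bipartite matching, which gives the same yes/no answer as a unit-capacity network flow. All names are hypothetical.

```python
import random


def can_route(adj, signals, num_outputs):
    """Return True if every input in `signals` can be given its own output.

    adj[i] is the set of outputs that input i has a switch to.
    """
    match = [-1] * num_outputs  # match[o] = input currently assigned to output o

    def try_assign(i, seen):
        # Try each output of input i, re-routing earlier inputs if needed.
        for o in adj[i]:
            if o in seen:
                continue
            seen.add(o)
            if match[o] == -1 or try_assign(match[o], seen):
                match[o] = i
                return True
        return False

    return all(try_assign(i, set()) for i in signals)


def routability(adj, num_outputs, vector_size, trials=10_000, seed=0):
    """Fraction of random test vectors (input subsets) that route successfully."""
    rng = random.Random(seed)
    inputs = range(len(adj))
    successes = sum(
        can_route(adj, rng.sample(inputs, vector_size), num_outputs)
        for _ in range(trials)
    )
    return successes / trials


if __name__ == "__main__":
    # Toy 6-input, 4-output sparse crossbar with two switches per input.
    sparse = [{0, 1}, {1, 2}, {2, 3}, {3, 0}, {0, 2}, {1, 3}]
    print(f"success rate: {routability(sparse, 4, vector_size=4):.1%}")
```

The reported fraction can then be compared against the 95% threshold quoted above.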

Hall’s Theorem
Given a bipartite graph G = (V, E)
– X, Y are the bipartite independent sets of G
– N(v) is the set of neighbors of vertex v
– N(S) is the set of neighbors of all vertices in S
G has a matching that covers every vertex of X if and only if |N(S)| ≥ |S| for every subset S of X
Leverage Hall’s Theorem to generate routable sparse crossbars!
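As an illustration of Hall's condition, the sketch below searches a small bipartite graph for a violating subset S with |N(S)| < |S|. The adjacency representation (`adj[v]` is the set of neighbors of vertex v) and the function names are assumptions made for this example. The search is exhaustive and therefore exponential in the number of vertices, so it is only usable on toy instances.

```python
from itertools import combinations


def neighbors(adj, subset):
    """N(S): union of the neighbor sets of all vertices in `subset`."""
    out = set()
    for v in subset:
        out |= adj[v]
    return out


def hall_violation(adj, vertices):
    """Return a smallest subset S of `vertices` with |N(S)| < |S|, or None."""
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            if len(neighbors(adj, subset)) < size:
                return subset
    return None


if __name__ == "__main__":
    adj = [{0}, {0}, {1}]  # inputs 0 and 1 both reach only output 0
    print(hall_violation(adj, [0, 1, 2]))  # a subset that cannot be matched
    print(hall_violation(adj, [0, 2]))     # None: a full matching exists
```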

Practical Issues
Cannot enumerate all subsets of m inputs
|N(x)| should be approximately equal for all input vertices x in X
– Otherwise, any subset containing a large number of low-degree vertices is unlikely to be routable
|N(y)| should be approximately equal for all output vertices y in Y
– Symmetric argument

Hamming Distance and Coding Theory
Represent N(v) as a bitvector b_v
– b_v[i] = 1 if v fans out to O_i
Hamming distance
– d(b_v1, b_v2): the number of bit positions in which b_v1 and b_v2 differ
Strategy
– Maximize d(b_vi, b_vj) for every pair of distinct vertices v_i and v_j
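A small sketch of the bitvector encoding and the Hamming distance, assuming each bitvector is stored as a Python integer whose bit i is set when the input has a switch to output O_i (function names are illustrative):

```python
def bitvector(outputs):
    """Encode the set of output indices an input fans out to as an integer."""
    bv = 0
    for o in outputs:
        bv |= 1 << o
    return bv


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which bitvectors a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    b1 = bitvector({0, 1})
    b2 = bitvector({1, 2})
    print(hamming(b1, b2))  # the two inputs differ on outputs 0 and 2
```

Two inputs with distance 0 have identical switch patterns and compete for exactly the same outputs, which is what the strategy tries to avoid.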

Switch Placement Optimizer
Start with an initial switch placement
Generate a random swap of switch positions
– Accept the swap if there is an improvement
– Otherwise, reject the swap
Stop after a fixed number of swap candidates (e.g., 10K) fail to find an improvement
Objective is to minimize:
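The objective formula is not preserved in this transcript, so the sketch below substitutes a clearly hypothetical stand-in cost that penalizes pairs of inputs whose bitvectors are close in Hamming distance. The move type is also an assumption: a "swap" here relocates one switch within a single input's row, which keeps the total switch count and each input's fanout unchanged. Only the accept-if-better loop and the stopping rule follow the slide.

```python
import random
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two bitvectors."""
    return bin(a ^ b).count("1")


def cost(rows):
    """ASSUMED stand-in objective (not from the slides): close pairs cost more."""
    return sum(1.0 / (1 + hamming(a, b)) ** 2 for a, b in combinations(rows, 2))


def optimize(rows, num_outputs, patience=10_000, seed=0):
    """Greedy random-swap search; rows[i] is the bitvector of input i."""
    rng = random.Random(seed)
    rows = list(rows)
    best = cost(rows)
    fails = 0  # consecutive candidates that did not improve the cost
    while fails < patience:
        i = rng.randrange(len(rows))
        on = [o for o in range(num_outputs) if (rows[i] >> o) & 1]
        off = [o for o in range(num_outputs) if not (rows[i] >> o) & 1]
        if not on or not off:
            fails += 1
            continue
        # Candidate move: relocate one switch of input i to an empty position.
        candidate = rows[i] ^ (1 << rng.choice(on)) ^ (1 << rng.choice(off))
        trial = rows[:i] + [candidate] + rows[i + 1:]
        trial_cost = cost(trial)
        if trial_cost < best:
            rows, best, fails = trial, trial_cost, 0  # accept improvement
        else:
            fails += 1  # reject the swap
    return rows, best


if __name__ == "__main__":
    start = [0b0011] * 6  # six inputs, all wired to the same two outputs
    final, final_cost = optimize(start, num_outputs=4, patience=2_000)
    print([format(r, "04b") for r in final], round(final_cost, 3))
```

Because only strict improvements are accepted, the search can stall in a local minimum; the patience counter is what ends the run.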

Example
Identical Hamming costs before and after the swap
Before: cannot route {1, 2, 3}
After: reduces Hamming costs

168x24 Crossbar, 10K Test Vectors

Altera Flex 8000 HP Plasma Hextant

# Switches vs. Routability

Using Sparse Crossbars within LUT Clusters
Guy Lemieux, David Lewis
International Symposium on FPGAs, 2001

Five Questions
1. Will depopulation save area, require greater routing area, or create unroutable architectures?
2. Will depopulation reduce or increase routing delays?
3. What amount of depopulation is reasonable?
4. How much area or delay reduction can be attained, if any?
5. What are the other effects of depopulating the cluster?

Architecture and Parameters

Results

Designing Efficient Input Interconnect Blocks for LUT Clusters Using Counting and Entropy
Wenyi Feng and Sinan Kaptanoglu
ACM Transactions on Reconfigurable Technology and Systems (TRETS), 1(1): article #6, March 2008
Note: Paper is from Actel (now Microsemi)

Count Configurations (Details Omitted)
784 configurations, 312 configurations, 256 configurations

Routing Requirement Vector (RRV)
An ordered list of N subsets, each containing K distinct signals
– The i-th subset is the K distinct signals to route to the i-th K-LUT
Total number of RRVs for the crossbar (M inputs, KN outputs):
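The count itself is not preserved here, but one reading of the definition gives a simple closed form. Assuming each of the N subsets is chosen independently from the M crossbar inputs and the order of signals within a subset does not matter, there are C(M, K) choices per LUT and C(M, K)^N RRVs in total. The sketch below evaluates that expression; treat both the formula and the function name as assumptions rather than the paper's statement.

```python
from math import comb


def total_rrvs(M: int, K: int, N: int) -> int:
    """Assumed RRV count: N independent K-subsets drawn from M inputs."""
    return comb(M, K) ** N


if __name__ == "__main__":
    # Hypothetical cluster: 4 inputs feeding two 2-input LUTs.
    print(total_rrvs(M=4, K=2, N=2))
```

Even modest cluster sizes make this number very large, which is why the slides that follow work with its logarithm.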

Entropy of an Intra-cluster Routing Crossbar
H = lg(# routable RRVs)
– Accounts for equivalence of LUT inputs
Why entropy?
– # routable RRVs is huge
– Minimum number of configuration bits to program the crossbar
– Inversely correlated with usage of global routing muxes (details omitted)
If we reduce the routability of the crossbar, we will end up programming more global routing muxes to compensate for the entropy loss

Conceptual Idea
[Figure labels: intra-cluster crossbar, global routing]

Theorem
Let P and L be the number of muxes and switches in a crossbar, respectively
– The entropy is at most P lg(L/P)
– The entropy per switch is at most lg(L/P) / (L/P)
– These bounds are achieved only when each mux has size L/P and each configuration realizes a unique RRV
Proof omitted because I DO NOT HATE YOU!
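The two bounds can be evaluated directly. The sketch below is a numeric illustration only, with hypothetical function names; P is the number of muxes and L the number of switches, as on the slide.

```python
from math import log2


def entropy_upper_bound(P: int, L: int) -> float:
    """Upper bound P * lg(L/P) on crossbar entropy, in bits."""
    return P * log2(L / P)


def entropy_per_switch_bound(P: int, L: int) -> float:
    """Upper bound lg(L/P) / (L/P) on entropy per switch."""
    size = L / P  # average mux size
    return log2(size) / size


if __name__ == "__main__":
    # Hypothetical crossbar: 16 muxes built from 64 switches (4:1 muxes).
    print(entropy_upper_bound(16, 64), entropy_per_switch_bound(16, 64))
```

The per-switch bound is just the total bound divided by L. Since lg(x)/x peaks at x = e, the bound favors small muxes: 2:1 and 4:1 muxes both give 0.5 bits per switch, and 3:1 muxes give slightly more.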

What are we doing here?
Lemieux and Lewis
– Routability: Monte Carlo simulations
– Area: Count switches
Feng and Kaptanoglu
– Routability: Crossbar entropy
– Area: Entropy per switch
– Caveat: Focus only on crossbars where we can count routable, non-redundant RRVs!

Type-1 Crossbar
1-level
– L2 muxes are driven directly by crossbar input signals
– #routable RRVs depends on L2 crossbar topology
Not area-efficient due to big L2 muxes
Xilinx Virtex-style

Type-2 Crossbar
2-level
– L1 is sparsely populated
– L2 is fully populated
Fully populated L2 reduces area efficiency
VPR
– F_c,in determines L1 population density

Type-3 Crossbar
2-level, partitioned
– L1 partition P_i only drives L2 partition O_i
– From input m to LUT input n, all paths go through muxes in P_i and O_i exclusively
– #Routable RRVs is the product of #Routable RRVs for each disjoint sub-crossbar
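Because the routable-RRV count of a partitioned crossbar is a product over its disjoint sub-crossbars, its entropy H = lg(# routable RRVs) is the sum of the sub-crossbar entropies. The sketch below shows this with made-up per-partition counts; the numbers and function names are purely illustrative.

```python
from math import log2, prod


def type3_routable_rrvs(per_partition_counts):
    """Total routable RRVs: product over the disjoint sub-crossbars."""
    return prod(per_partition_counts)


def type3_entropy(per_partition_counts):
    """Entropy in bits: sum of lg(count) over the sub-crossbars."""
    return sum(log2(c) for c in per_partition_counts)


if __name__ == "__main__":
    counts = [8, 16, 4]  # hypothetical counts for three partitions
    print(type3_routable_rrvs(counts), type3_entropy(counts))
```

Summing logarithms also avoids ever forming the huge product when only the entropy is needed.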

Proposed Type-3 Crossbar and Generation Algorithm
Each sub-crossbar is Type-2
Can count #routable RRVs (details omitted)

Entropy vs. # Switches

Entropy vs. Global Routing Mux Usage

The Bottom Line…
Who cares…
– Theoretical properties are cute
– Actel/Microsemi did not use these crossbars in their FPGAs
Practical observation…
– The cheaper you make the intra-cluster routing crossbar, the more expensive the global routing…

A 65nm flash-based FPGA fabric optimized for low cost and power
Jonathan W. Greene, et al.
International Symposium on FPGAs, 2011
Note: Paper is from Microsemi (Feng and Kaptanoglu are co-authors)

Corporate Secrets Divulged
They used a Clos Network
– Three parameters: m, n, r

Clos Network Properties
Used when the physical circuit switching needs exceed the capacity of the largest feasible single crossbar
Much cheaper than a fully populated n x n crossbar

Strict-sense Nonblocking Clos Network (m ≥ 2n – 1)
An unused input on an ingress switch can always be connected to an unused output on an egress switch, without reconfiguration!

Rearrangeably Nonblocking Clos Network (m ≥ n)
An unused input on an ingress switch can always be connected to an unused output on an egress switch, but reconfiguration may be necessary!
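The sketch below classifies a three-stage Clos network and counts its crosspoints. It assumes the textbook parameterization, which the slides do not spell out: r ingress switches of size n x m, m middle switches of size r x r, and r egress switches of size m x n, serving N = n·r ports in total. The standard conditions are m ≥ 2n – 1 for strict-sense nonblocking and m ≥ n for rearrangeably nonblocking. Function names are illustrative.

```python
def clos_crosspoints(m: int, n: int, r: int) -> int:
    """Crosspoints in a 3-stage Clos network: ingress + middle + egress."""
    return r * n * m + m * r * r + r * m * n


def full_crossbar_crosspoints(n: int, r: int) -> int:
    """Crosspoints in a single N x N crossbar with N = n * r ports."""
    ports = n * r
    return ports * ports


def classify(m: int, n: int) -> str:
    """Blocking class determined by the number of middle switches m."""
    if m >= 2 * n - 1:
        return "strict-sense nonblocking"
    if m >= n:
        return "rearrangeably nonblocking"
    return "blocking"


if __name__ == "__main__":
    n, r = 8, 8  # hypothetical 64-port network
    for m in (15, 8, 4):
        print(m, classify(m, n), clos_crosspoints(m, n, r),
              "vs", full_crossbar_crosspoints(n, r))
```

In this example both nonblocking variants use fewer crosspoints than the single 64 x 64 crossbar, with the rearrangeable one being the cheapest at the price of possible reconfiguration.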

Recursive Clos Network Design
Scalable to any ODD number of stages
– Replace center crossbar with a 3-stage Clos Network