The Future of FPGA Interconnect Guy Lemieux The University of British Columbia Tuesday, December 8, 2009 FPT 2009 Workshop Getting the LUT-heads to work…


Layman’s viewpoint How do I explain FPGA interconnect to mom? Imagine planning a city on a grid – Maximum of 100,000 people, the “LUT-heads” – Every LUT-head is given two things Home location Work location (often multiple work locations…) Problem: Getting the LUT-heads to work! – Design a fixed road network – Every LUT-head drives in their own lane (no time-sharing or bus) – Very expensive, lots of infrastructure 2

Layman’s viewpoint (2) Problem, Version 2 – After 25 years, every LUT-head changes home & work LUT-head population may grow or shrink – The same road network must still be used Can only ‘reconfigure lanes’ by changing road paint Problem, Version 3 – Start over, assuming 1,000,000 LUT-heads – What new issues arise when the problem scales? Average trip length? Average number of lanes per road? 3

Overview What’s in FPGA interconnect? – Review of typical design What are the main application areas? – Driving the future of interconnect design What are the interconnect metrics? – Pushing the envelope, then becoming practical Open research problems? – Driving the future of interconnect design 4

Overview What’s in FPGA interconnect? – Review of typical design What are the main application areas? – Driving the future of interconnect design What are the interconnect metrics? – Pushing the envelope, then becoming practical Open research problems? – Driving the future of interconnect design 5

Input connections S Block C Block Altera Stratix Interconnect CLB aka LAB 6

Input connections IIB: input interconnect block Altera Stratix Interconnect 7

Input connections, neighbours 1 S Block C Block Connections in CLB grow bigger 8

Input connections, neighbours 2 S Block C Block Connections in C Block grow bigger 9

Output connections, local S Block C Block Altera Stratix Interconnect Single-driver: LUT outputs must only feed muxes 10

Output connections, global S Block C Block Altera Stratix Interconnect Single-driver wiring, extended to include LUT outputs: LUT outputs must only feed muxes 11

Design considerations Design of C Block / IIB – Selects LUT inputs – Overall function: ‘M’ choose ‘kN’ M = wires (H + V) N = LUTs k = 4..6 inputs/LUT 12

Design considerations Design of S Block – Steers M signals throughout array (turns) Also accepts N LUT outputs – Topologically simple Fs = 3: each wire connects to only 3 outgoing wires Exception: LUT outputs connect to > 3 wires – Strongly influenced by circuit implementation Bidirectional vs directional 13
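The Fs = 3 rule above fixes the switch count per S Block. As a rough back-of-envelope sketch (a hypothetical helper, not from the slides), for a classic bidirectional switch block each of the 4 sides contributes W incoming wires, each fanning out to Fs outgoing wires, and each switch is shared by the two wires it joins:

```python
# Hypothetical sketch: switch points in a classic Fs = 3 switch block.
def switch_points(channel_width, fs=3):
    """Total programmable switch points for a 4-sided bidirectional S Block.

    4 sides * W wires * fs connections counts every switch twice
    (once from each end), so divide by 2.
    """
    return 4 * channel_width * fs // 2

print(switch_points(8))   # W = 8, Fs = 3 -> 48 switch points (6W)
print(switch_points(10))  # W = 10        -> 60 switch points
```

This reproduces the familiar 6W switch count for a disjoint-style Fs = 3 block; LUT outputs, which connect to more than 3 wires, are excluded.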

Array of CLBs, C and S Blocks 14

Bidirectional vs. Directional Wiring bidir/dir == S Block Design + single-driver == C Block Design

Bidirectional Wires Logic C Block S Block 16

Bidirectional Wires Problem Half of tristate buffers left unused Buffers + input muxes dominate interconnect area 17

Bidirectional vs Directional 18

Bidirectional vs Directional 19

Bidirectional vs Directional 20

Bidirectional vs Directional 21

Bidirectional Switch Block 22

Directional Switch Block 23

Bidirectional vs Directional Switch Element: same quantity and type of circuit elements, but twice the wiring Switch Block: directional needs half as many Switch Elements 24

Quantization of Channel Width Bidirectional (Q=1) 4 Switch Elements Ch. Width = 4 * Q = 4 * 1 Directional (Q=2) 2 Switch Elements Ch. Width = 2 * Q = 2 * 2 No “partial” switch elements with < Q wires 25
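The channel-width quantum Q above can be sketched as a one-line formula (hypothetical helper names; Q = L bidirectional, Q = 2L directional, per the later long-wire slides):

```python
# Sketch of channel-width quantization: widths come only in multiples of Q.
def channel_width(num_switch_elements, directional, span=1):
    """Channel width from switch-element count.

    span = L, the number of tiles each wire spans (1 = single-tile wires).
    Q = L for bidirectional wiring, Q = 2L for directional wiring.
    """
    q = 2 * span if directional else span
    return num_switch_elements * q

print(channel_width(4, directional=False))          # bidirectional: 4 * 1 = 4
print(channel_width(2, directional=True))           # directional:   2 * 2 = 4
print(channel_width(1, directional=True, span=3))   # L = 3 element:  Q = 6
```

The last line matches the later “Full S Block with Long Wires” slide, where one L = 3 directional switch element gives Q = 2L = 6, and two give width 12 — area growing linearly with channel width.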

S Blocks with Long Wires Long wires span L tiles – Example L = 3 Changes Q: Q = L for bidirectional, Q = 2L for directional CLB 26

Building up Long Wires Start with One Switch Element Wire ends for straight connections. CLB 27

Building up Long Wires Connect MUX Inputs Extend MUX inputs CLB 28

Building up Long Wires Connect MUX Inputs TURN UP from wire-ends to mux CLB 29

Building up Long Wires Connect MUX Inputs TURN DOWN from wire-ends to mux CLB 30

Building up Long Wires Add +2 More Wires (4 total) Add LONG WIRES, turning UP and DOWN. CLB 31

Building up Long Wires Add +2 More Wires (6 total) Add LONG WIRES, turning UP and DOWN CLB 32

Building up Long Wires Twisting to Next Tile Add wire twisting CLB 33

CLB Full S Block with Long Wires Using One L=3 Switch Element (Q = 2L = 6) 34

Scaling Channel Width Using L=3 Switch Element CLB 2 Switch Elements Channel width = 2Q = 12 1 Switch Element Channel width = Q = 6 VERY IMPORTANT: Area growth is linear with channel width 35

Long Wires  Changes Quantum Long wires, span L tiles – Example L = 3 Q = L for bidirectional Q = 2L for directional 123 CLB 36

Multi-driver Wiring Logic outputs use tristate buffers (C Block) Directional & multi-driver wiring C Block S Block CLB 37

Single-driver Wiring Logic outputs use muxes (S Block) Directional & single-driver wiring New connectivity constraint S Block CLB 38

Directional, Single-driver Benefits Average improvements 0% channel width 9% delay 14% tile length of physical layout 25% transistor count 32% area-delay product 37% wiring capacitance Any reason to use bidir? – Important implications on future interconnect! 39

C Block design C Block 40

C Block design 41 M inputs (100 … 500) Up to kN outputs (4*8... 8*10)

C Block design 42

C Block design Sparse crossbar Similar # switchpoints – On inputs – On outputs Spread out pattern – Two columns have maximum Hamming distance (most # of different switch points) – True for all pairs of columns 43
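The “spread out” property above — every pair of columns at maximum Hamming distance — can be checked mechanically. A minimal sketch (hypothetical helper, illustrative 4×3 pattern of my own, not a real device’s):

```python
# Score a sparse-crossbar switch pattern by the minimum pairwise Hamming
# distance between its output columns; larger = more spread out.
def min_column_hamming(pattern):
    """pattern[i][j] == 1 if input wire i has a switch point to output j."""
    cols = list(zip(*pattern))  # transpose: one tuple per output column
    return min(
        sum(a != b for a, b in zip(c1, c2))
        for i, c1 in enumerate(cols)
        for c2 in cols[i + 1:]
    )

# 4 input wires x 3 output muxes, two switch points per column:
good = [[1, 0, 0],
        [1, 0, 1],
        [0, 1, 0],
        [0, 1, 1]]
print(min_column_hamming(good))
```

A pattern maximizing this minimum distance gives output muxes the most dissimilar input choices, which is what makes a sparse crossbar route almost as well as a full one.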

Overview What’s in FPGA interconnect? – Review of typical design What are the main application areas? – Driving the future of interconnect design What are the interconnect metrics? – Pushing the envelope, then becoming practical Open research problems? – Driving the future of interconnect design 44

What are the main application areas? What are FPGAs used for? – A long long time ago… small glue logic Modern… – Internet routers (table lookups, multiplexing) – Embedded systems design (NIOS II, MicroBlaze) – Cell phone basestations (communications DSP) – HDTV sets / set-top boxes (video/image DSP) Future? 45

Application drivers What we know – FPGAs increasingly more powerful, at constant cost – ASIC design costs escalating wildly Most ASICs use older technology (0.18/0.13 µm) Increasingly, ASICs are implemented as FPGAs instead – FPGAs survive only in low-volume products E.g., being designed out of HDTV sets Extrapolate to find new emerging markets… 46

Application drivers (2) Extrapolating… – Industrial/scientific instruments: low volume, high margin Medical sensing, imaging (ultrasound, PET, …) Electronics test & measurement (router tester, …) Physics (neutrino detection, …) – Computation: mixed volume, mixed margin Computer system simulation (RAMP, …) Molecular dynamics, financial modeling, seismic / oil & gas – Portable/handheld: mixed volume, mixed margin Consumer Industrial/Medical 47

Application drivers (3) Problems with FPGAs – Expensive for high-volume markets Need cost-reduction strategy – Insufficient capacity Could just wait for Moore’s Law to catch up Capture emerging markets early: ultra-capacity FPGA – Hard to program Particularly important when used for computation Domain-specific languages help – Power – Slow 48

Overview What’s in FPGA interconnect? – Review of typical design What are the main application areas? – Driving the future of interconnect design What are the interconnect metrics? – Pushing the envelope, then becoming practical Open research problems? – Driving the future of interconnect design 49

Interconnect metrics Typical – Area – Delay (latency) – Power Obscure, but important! – Co$t – Bandwidth – Programmability/Ease of use – Reliability/Integrity – Flexibility/Runtime reconfigurability 50

Pushing the envelope Research is about discovery, ideas, exploration – Also evaluation, limit studies, potential uses One general research strategy – Pick a metric – Push the envelope How far did you get? – Back off until practical – Re-integrate with reality 51

Pushing the envelope (2) Example: Area – Cyclone/Spartan are low-cost (low-area) FPGAs Push area to the limits? – Reduce every routing buffer to 1x inverter – Extensive use of pass transistor switches – Reduce connectivity, force sparse logic – Bit-serial logic + routing for datapath How small can we get? – Is this practical? Is there a market? Is it cost-effective? – Increased parallelism? Prototype future FPGA designs now? 52

Pushing the envelope (3) Example: Bandwidth – Virtex/Stratix are high-performance FPGAs Push bandwidth to the limits? – E.g., pipeline every routing wire / switch – Use registers or wave-pipeline How much throughput can we get? – Wave-pipelining ~5Gbps in 65nm [FPGA2009] – Is this practical? Is there a market? 53

Pushing the envelope (4) Example: Flexibility/Runtime reconfigurability – Limited reconfigurability on Xilinx, not on Altera Push flexibility/RTR to the limits? – Note: not a naïve “fully connected” graph – Every switch is dynamically addressable, reconfigurable – Every route has an alternative/backup What can we gain? – Choose-your-own adventure routing [FPGA2009] – Improved NoC-on-FPGA (?) – Is this practical? Is there a market? 54

Pushing the envelope (5) Pushing envelope for other metrics – Power [Kaptanoglu, keynote FPT2007] – Co$t (area?) – Programmability/Ease of use (a CPU?) – Reliability/Integrity (built-in TMR & Razor?) 55

Pushing the envelope (5’) Pushing envelope for other metrics – Power [Kaptanoglu, keynote FPT2007] Portable/handheld – Co$t (area?) Portable/handheld, computation – Programmability/Ease of use (a CPU?) Computation – Reliability/Integrity (built-in TMR & Razor?) Scientific/industrial instruments Markets exist for these metrics! 56

Overview What’s in FPGA interconnect? – Review of typical design What are the main application areas? – Driving the future of interconnect design What are the interconnect metrics? – Pushing the envelope, then becoming practical Open research problems? – Driving the future of interconnect design 57

Open research problems Defect tolerance IIB design – Hard core integration Memory-footprint / Runtime optimized Performance guarantees Layout-aware methods Efficient datapaths Expose the muxes Low-latency, area-efficient repeaters/switches 58

Open research problems (2) Defect tolerance – Future semiconductor technologies expected to be less reliable – Interconnect has built-in redundancy (by design) Issues – Defect localization – Delay-oriented defects – Abstraction suitable for CAD or bitstream-load – Intentional redundancy: how, where, quantity 59

Open research problems (3) IIB (input interconnect block) design – Function: ‘M’ choose ‘kN’ – Conserve ‘switchpoints’, area (# muxes, mux size), delay (levels) – Maximize ‘entropy’ == # of unique functional configurations Are some configurations more important than others? How to count # of configurations? – Generally, difficult topological design problem Most promising ‘type 3’ IIB [TRETS2008] ≈ Clos network ? IIB: input interconnect block M inputs kN outputs 60
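The entropy metric above — the number of unique functional configurations — can be counted by brute force for toy IIB instances. A sketch under stated assumptions (hypothetical helper; I treat LUT input pins as interchangeable, so configurations that deliver the same multiset of inputs collapse together):

```python
# Count distinct functional configurations of a tiny IIB, where
# pattern[j] is the set of input wires selectable by output mux j.
from itertools import product

def count_configurations(pattern):
    """Distinct multisets of inputs deliverable to the LUT pins.

    Sorting each selection makes pin order irrelevant, since LUT
    inputs are permutable.
    """
    return len({tuple(sorted(cfg)) for cfg in product(*pattern)})

# Two 2:1 muxes over the same inputs: 4 raw settings, only 3 distinct.
print(count_configurations([{0, 1}, {0, 1}]))
# Three 2:1 muxes with staggered inputs: all 8 settings distinct.
print(count_configurations([{0, 1}, {1, 2}, {2, 3}]))
```

Even this toy shows why the counting question is hard: overlapping mux inputs waste switch points on duplicate configurations, and enumeration blows up exponentially for realistic M and kN.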

Open research problems (4) Hard core integration – Heterogeneous instance of IIB design problem Issues – Each hard core has different # inputs, # outputs Complicates uniformity – Some have large # inputs, outputs Creates congestion ‘pinch points’ Need to design for ‘worst case’ routability – Would prefer ‘average case’ 61

Open research problems (5) Memory-footprint / Runtime optimized – Architecture graph – Netlist search graph Issues – Entire architecture graph is huge, static – Netlist search graph dynamic, alloc/dealloc – Random pointer-chasing – Cache-unfriendly, cache-DRAM bandwidth – Can architecture changes make improvements? 62

Open research problems (6) Performance guarantees – FPGA routers work well, nobody complains Thank you, PathFinder [McMurchie & Ebeling] Issues – Not guaranteed to find a solution (no detection!) Want ‘Just (unoptimally) route it!’ algorithm – No performance bounds on metrics Within X% tracks, Y% delay from minimum 63

Open research problems (7) Layout-aware methods – Altera, Xilinx know how to lay out interconnect – 10+ levels of metal, metal-over-switches, integration of switches and logic Issues – Arbitrary ‘topology’ graphs not practical to build – “One size fits all” FPGA diminishing “Application-specific” FPGA likely to arrive – Automated layout, automated circuit design tools Aware of FPGA architecture / structure 64

Open research problems (8) Efficient datapaths – Multi-bit connections; same source, same sink – Datapath connections are coherent, seemingly simple – Very common in computation designs Issues – No successful datapath circuit-switched architecture Dedicated datapath interconnect only 5-10% smaller Abandon circuit switching? → power – How wide? 4b, 8b, 32b? – How to build? 65

Open research problems (9) Expose the muxes (1) – LUTs terrible for implementing multiplexers 2 x 4LUTs = 1 x 6LUT = 4:1 mux Imagine 54b barrel shifter (IEEE double-precision) 1 CLB ≈ 8 x 6LUTs ≈ 2 x 16:1 muxes – Interconnect is full of muxes 1 CLB ≈ 60 x 16:1 muxes Issues – How to ‘expose’ interconnect muxes to users? – Put routing mux select bits under user control – How to guarantee signal ordering? 66
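The LUT-cost arithmetic above can be sketched as a small calculation (hypothetical helper; it counts a pure 6-LUT tree of 4:1 muxes — 4 data + 2 select inputs per LUT — and ignores any dedicated wide-mux circuitry a real CLB may add):

```python
# Back-of-envelope cost of an n:1 mux built from 6-LUTs,
# each 6-LUT implementing one 4:1 mux (4 data + 2 select = 6 inputs).
import math

def lut6_count_for_mux(n_to_1):
    """6-LUTs needed for an n:1 mux built as a tree of 4:1 muxes."""
    count, inputs = 0, n_to_1
    while inputs > 1:
        stage = math.ceil(inputs / 4)  # 4:1 muxes in this tree level
        count += stage
        inputs = stage                 # their outputs feed the next level
    return count

print(lut6_count_for_mux(4))   # 1 LUT, as on the slide
print(lut6_count_for_mux(16))  # 4 + 1 = 5 LUTs
```

So a 16:1 mux burns 5 of a CLB’s ~8 6-LUTs, while the interconnect surrounding that same CLB already contains on the order of sixty 16:1 routing muxes sitting idle from the user’s perspective — the motivation for exposing them.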

Open research problems (9’) Expose the muxes (2) – Many systems use lots of 32b muxes NIOS, MicroBlaze, NoC, Compute engines – Can we use fast run-time reconfiguration instead of building muxes? Issues – How to expose programming bits to user? – How to enumerate & pre-p&r all configurations? 67

Summary Interconnect design is fun and challenging Many ‘practical’ issues solved – Lots of ‘academically interesting’ problems remain – Can still ‘push the envelope’ Promising open problems Final thoughts… – Circuit design ↔ Topology ↔ Layout ↔ CAD – Architectural models (C block, S block) are restrictive 68

EOF 69