Mohamed ABDELFATTAH, Vaughn BETZ. Why NoCs on FPGAs? Hard/soft efficiency gap. Integrating hard NoCs with FPGA.

Presentation transcript:

Mohamed ABDELFATTAH Vaughn BETZ

Outline
1. Why NoCs on FPGAs?
   - Motivation
   - Previous Work
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA

1. Why NoCs on FPGAs?

Figure: an FPGA built from Logic Blocks plus an Interconnect of Switch Blocks and Wires.

Figure: the same fabric with Hard Blocks added (Memory, Multiplier, Processor).

Figure: the same fabric with Hard Interfaces added (DDR/PCIe, ...). The interconnect is still the same. Clock frequencies labeled in the figure: 1600 MHz, 200 MHz, 800 MHz.

Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet (1600 MHz, 200 MHz, 800 MHz).

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   - Huge CAD problem
   - Slow compilation
   - Power/area utilization
4. Wire speed not scaling:
   - Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   - Parallel compilation
   - Partial reconfiguration
   - Multi-chip interconnect

Barcelona vs. Los Angeles: keep the “roads”, but add “freeways”. Figure labels: Hard Blocks, Logic Cluster. Image source: Google Earth.

Figure labels: NoC, Routers, Links. Router forwards data packet. Router moves data to local interconnect.

NoC properties that address the problems listed above:
- Pre-design NoC to requirements
- NoC links are “re-usable”
- Latency-tolerant communication
- NoC abstraction favors modularity
- High bandwidth endpoints known

Implementation options:
- Soft Logic (LUTs, ...)
- Hard Logic (unchangeable)
- Mixed Soft/Hard

Soft NoC                       Hard NoC
Build as needed out of LUTs    Must build the whole thing
Tailor to application          Must be general enough for any application
Slower, bigger                 Faster, smaller

Soft favors configurability; hard favors efficiency.
Investigate the hard vs. soft tradeoff for NoCs (area/delay).

Previous work:
- FPGA-tuned soft NoCs: LiPar (2005), NoCeM (2008), Connect (2012)
- Hard NoCs: Francis and Moore (2008), "Exploring Hard and Soft Networks-on-Chip for FPGAs"
- Applications that leverage NoCs: Chung et al. (2011), "CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing"

Our contributions:
1. Quantify area/performance gap of hard and soft NoCs
2. Investigate how this impacts NoC design (hard/soft)
3. Integrate hard NoC with FPGA fabric

Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
   - NoC Architecture
   - Methodology
   - Soft NoC design
   - Results
   - Area/Speed Efficiency Gap
3. Integrating hard NoCs with FPGA

2. Hard/Soft Efficiency

NoC = Routers + Links

State-of-the-art router architecture from Stanford:
1. The NoC community has excelled at building routers: we simply use one
2. To meet FPGA bandwidth requirements: high-performance router
3. A complex router includes a superset of the NoC components that may be used: more complete analysis

Split the router into 5 components.

Input Modules: multi-queue buffer = memory + control logic. Parameters: port width, buffer depth, number of VCs.

Crossbar: multiplexers = logic + crowded interconnect. Parameters: port width, number of ports.

Output Modules: retiming register = registers + little control logic. Parameters: port width, number of VCs.

Allocators: arbiters = logic + registers. Parameters: number of ports, number of VCs.

5 components: Input Module, Crossbar, VC Allocator, SW Allocator, Output Module.
4 parameters: Port Width, Number of Ports, Number of VCs, Buffer Depth.
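The component-to-parameter mapping on these slides can be written down as a small lookup, which makes it easy to ask which components a given parameter affects. This is only a sketch: the names `RouterParams`, `COMPONENT_PARAMS`, and `components_affected_by` are hypothetical, and applying the single "Allocators" parameter list to both the VC and SW allocators is an assumption. The example values are the "typical router" quoted later in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterParams:
    """The 4 router parameters named on the slide (hypothetical container)."""
    port_width: int    # bits per port
    num_ports: int     # router ports
    num_vcs: int       # virtual channels
    buffer_depth: int  # buffer words


# Parameters that size each of the 5 components, as listed on the slides.
# Assumption: the "Allocators" list applies to both VC and SW allocators.
COMPONENT_PARAMS = {
    "Input Module": ("port_width", "buffer_depth", "num_vcs"),
    "Crossbar": ("port_width", "num_ports"),
    "VC Allocator": ("num_ports", "num_vcs"),
    "SW Allocator": ("num_ports", "num_vcs"),
    "Output Module": ("port_width", "num_vcs"),
}


def components_affected_by(param: str) -> list:
    """Return the components whose size depends on the given parameter."""
    return [name for name, params in COMPONENT_PARAMS.items() if param in params]


# The "typical router" from the talk: 5 ports, 32 bits wide, 2 VCs, 10 buffer words.
typical = RouterParams(port_width=32, num_ports=5, num_vcs=2, buffer_depth=10)

print(components_affected_by("num_ports"))
print(components_affected_by("buffer_depth"))
```

Buffer depth touches only the input module, which is consistent with the later observation that deep buffers are cheap when the input buffers sit in BRAM.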

Methodology (per router component):
- Post-routing FPGA (soft) area and delay
- Post-synthesis ASIC (hard) area and delay
- Both in TSMC 65 nm technology (Stratix III)
- Verify results against the previous FPGA:ASIC comparison by Kuon and Rose

 Relatively small memories  Critical component in router design  3 options for FPGA: 26 Registers LUTRAM Block RAM One per LUT 640 bits 9 Kbits 2. Hard/Soft Efficiency  Area of each implementation option 

Figure (width = 32 bits). Annotation: another logic cluster used.

 Relatively small memories  3 options for implementation on FPGA 28 Registers LUTRAM Block RAM One per LUT 640 bits 9 Kbits 0.77 Kbit/mm 2 23 Kbit/mm Kbit/mm 2  16% utilized BRAM more area efficient than fully used LUTRAM (Valid for Stratix III)  LUTRAM could win for some points in other FPGAs Use BRAM for FPGA (soft) implementation Soft 2. Hard/Soft Efficiency

Soft: high port count is inefficient in soft. Figure annotations: 24X – 94X, 60X – 170X.

Soft: high port count is inefficient in soft; width scales better. Figure annotations: 26X – 17X, 72X.

Soft: buffer depth is free on FPGAs when using BRAM. Figure annotation: filling up the BRAM.

Design recommendations for a soft NoC, based on FPGA silicon area and supported by delay measurements:
- Use BRAM for the FPGA (soft) implementation
- High port count is inefficient in soft; width scales better
- Buffer depth is free on FPGAs when using BRAM

Memory = Logic + Registers

Router Component    Mean Area Ratio    LUT:REG
Input Module        17                 --
Crossbar            85                 --
VC Allocator        48                 8:1
Switch Allocator    56                 20:1
Output Module       39                 0.6:1
Router              30

Router Component    Mean Delay Ratio
Input Module        2.9
Crossbar            4.4
VC Allocator        3.9
Switch Allocator    3.3
Output Module       3.4
Router              3.6

Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
   - Hard NoC + FPGA Wiring
   - Conclusion
   - Future Work

3. Hard NoC with FPGA

Router Component    Area Ratio    Delay Ratio
Input Module        17            2.9
Crossbar            85            4.4
VC Allocator        48            3.9
Switch Allocator    56            3.3
Output Module       39            3.4
Router              30            3.6

Figure annotations: % Total Area, Critical Path, 40%, 10%.
Results suggest hardening the Crossbar and Allocators: a mixed hard/soft implementation.
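To see why the crossbar and allocators stand out, the per-component ratios on this slide can be sorted. This is a simplified sketch: it ranks by the soft:hard ratios alone, whereas the slide also weighs each component's share of total area and its place on the critical path, which are not fully recoverable from the transcript. The names `RATIOS` and `rank_by` are hypothetical.

```python
# (area ratio, delay ratio) of soft vs. hard per component, taken from the slide.
RATIOS = {
    "Input Module": (17, 2.9),
    "Crossbar": (85, 4.4),
    "VC Allocator": (48, 3.9),
    "Switch Allocator": (56, 3.3),
    "Output Module": (39, 3.4),
}


def rank_by(index: int) -> list:
    """Sort component names by the chosen ratio, largest gap first.

    index 0 ranks by area ratio, index 1 by delay ratio.
    """
    return sorted(RATIOS, key=lambda name: RATIOS[name][index], reverse=True)


by_area = rank_by(0)
by_delay = rank_by(1)

print("Largest area gap first: ", by_area)
print("Largest delay gap first:", by_delay)
```

By area ratio, the crossbar and the two allocators occupy the top three places and the input module comes last, which matches the slide's suggestion of which components to harden.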

For a typical router (5 ports, 32 bits wide, 2 VCs, 10 buffer words):

         Soft             Hard               Mixed
Area     4.1 mm² (1X)     0.14 mm² (30X)     2.3 mm² (1.8X)
Speed    150 MHz (1X)     810 MHz (5X)       390 MHz (2.5X)

Mixed: not worth hardening.
How to connect hard and soft? How efficient is mixed/hard after doing that?
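The multipliers on this slide can be re-derived from the printed areas and frequencies. The sketch below does that with hypothetical names (`SOFT`, `OPTIONS`, `gains`); because the slide's figures are rounded, the computed gains land close to, but not exactly on, the quoted 30X, 5X, 1.8X, and 2.5X.

```python
# Area and frequency of the typical router, as printed on the slide.
SOFT = {"area_mm2": 4.1, "fmax_mhz": 150}
OPTIONS = {
    "Hard": {"area_mm2": 0.14, "fmax_mhz": 810},
    "Mixed": {"area_mm2": 2.3, "fmax_mhz": 390},
}


def gains(option: dict) -> tuple:
    """Return (area gain, speed gain) of an option relative to the soft router."""
    area_gain = SOFT["area_mm2"] / option["area_mm2"]     # how many times smaller
    speed_gain = option["fmax_mhz"] / SOFT["fmax_mhz"]    # how many times faster
    return area_gain, speed_gain


for name, option in OPTIONS.items():
    area_gain, speed_gain = gains(option)
    print(f"{name}: {area_gain:.1f}X smaller, {speed_gain:.1f}X faster")
```

The mixed option gains less than 2X in area, which is the arithmetic behind the slide's "not worth hardening" label.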

Figure labels: FPGA, Router, Router Logic, Logic clusters.
- Same I/O mux structure as a logic block, at 9X the area
- Conventional FPGA interconnect between routers
- 730 MHz
- Assumed a mesh, but can form any topology

Per router:

         Soft             Hard               Hard (+ interconnect)
Area     4.1 mm² (1X)     0.14 mm² (30X)     0.18 mm² = 9 LABs (22X)
Speed    150 MHz (1X)     810 MHz (5X)       730 MHz (4.7X)

64-node NoC on Stratix V:

         Soft             Hard (+ interconnect)
Area     ~12,500 LABs     576 LABs
%LABs    33%              1.6%
%FPGA    12%              0.6%

Hard NoC + soft interconnect is very compelling:
- Provides 47 GB/s peak bisection bandwidth
- Very cheap: less than the cost of 3 soft nodes
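The 64-node figures on this slide follow from simple arithmetic, sketched below. The LAB counts come straight from the slide. The bisection-bandwidth part rests on assumptions that the slide does not spell out: an 8x8 mesh, 32-bit links (the typical router width), a 730 MHz link clock, and both directions counted across the cut. All variable names are hypothetical.

```python
NODES = 64
LABS_PER_HARD_ROUTER = 9   # slide: 0.18 mm² = 9 LABs
SOFT_NOC_LABS = 12_500     # slide: approximately 12,500 LABs for the soft NoC

# Area of the hard NoC, and how many soft routers that area would buy.
hard_noc_labs = NODES * LABS_PER_HARD_ROUTER
labs_per_soft_router = SOFT_NOC_LABS / NODES
hard_noc_in_soft_routers = hard_noc_labs / labs_per_soft_router

# Peak bisection bandwidth under the assumptions stated above.
MESH_SIDE = 8              # assumption: 64 nodes arranged as an 8x8 mesh
LINK_WIDTH_BITS = 32       # assumption: links as wide as the typical router port
LINK_FREQ_HZ = 730e6       # assumption: links clocked at the 730 MHz router speed
links_across_cut = MESH_SIDE * 2   # one link per row, counted in both directions
bisection_gbytes_per_s = links_across_cut * LINK_WIDTH_BITS / 8 * LINK_FREQ_HZ / 1e9

print(f"Hard NoC area: {hard_noc_labs} LABs")
print(f"Equivalent soft routers: {hard_noc_in_soft_routers:.2f}")
print(f"Peak bisection bandwidth: {bisection_gbytes_per_s:.2f} GB/s")
```

Under these assumptions the result agrees with the slide: 576 LABs, fewer than 3 soft routers' worth of area, and roughly 47 GB/s across the bisection.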

1. Why NoCs on FPGAs?
   - A big city needs freeways to handle traffic
   - Solve communication problems for a large/heterogeneous FPGA: timing closure, interconnect scaling, modular design
2. Hard/soft efficiency gap
   - A hard NoC is on average 30X smaller and 3.6X faster than soft
   - Crossbars and allocators are worst; the input buffer is best
   - An efficient soft NoC uses BRAMs, large width with low port count, and deep buffers
3. Integrating hard NoCs with FPGA
   - Mixed implementation does not make sense
   - Integrated fully hard NoC with FPGA fabric (for NoC links)
   - 22X area improvement over soft
   - Reaches max. FPGA frequency (4.7X faster than soft)
   - 64-node NoC = 0.6% of total FPGA area (Stratix V)

Future work:
- Power analysis
- More hardening:
   - Dedicated inter-router links (hard wires)
   - Clock domain crossing hardware
- How do traffic hotspots (DDR/PCIe) influence NoC design?
- Latency-insensitive design methodology that uses the NoC
- CAD tool changes for a NoC-based FPGA