OpenSMART: An Opensource Single-cycle Multi-hop NoC Generator

1 OpenSMART: An Opensource Single-cycle Multi-hop NoC Generator
Hyoukjun Kwon and Tushar Krishna, Georgia Institute of Technology, Synergy Lab. Nov 12, 2017

2 OpenSMART NoC

3 Challenges for NoCs
Scalability: supporting many-IP heterogeneous systems; lower latency; lower area and energy
Flexibility: supporting diverse connectivity for custom heterogeneous systems; supporting diverse latency/throughput requirements
Design cost: automating the design of high-performance, low-energy NoCs; lowering design/verification costs of SoCs with NoCs

4 OpenSMART

5 SMART NoC
Single-cycle Multi-hop Asynchronous Repeated Traversal
SMART: achieve the performance of dedicated connections over a network of shared links
[Figure: SSRs (SMART Setup Requests) sent from source S toward destination D; labels: HPCmax, 1-cycle (no other traffic)]
Krishna et al., HPCA 2013; Chen et al., DATE 2013; Krishna et al., IEEE Micro Top Picks 2014

6 Features of SMART
Low latency network
  Dynamic bypass of intermediate routers between any two routers
  Limit: HPCmax (hops per cycle max), the maximum number of hops that the underlying wire allows a flit to traverse within a clock cycle
Separate control path
  HPCmax bits from every router along each direction
  Arbitration of multiple bypass requests on the same link
  No ACK required

7 How to Get the Source Code
Go to Synergy lab homepage (synergy.ece.gatech.edu)

8 How to Get the Source Code
In the released tools tab, click OPENSMART

9 How to Get the Source Code
You will be forwarded to the access request form page. Fill in and submit the form to receive a link to the OpenSMART repository

10 How to Get the Source Code
Using the link, you can access the repository

11 Source Tree (Under Backend/BSV)
Frontend: Configuration parser (under development)
Backend/BSV: BSV implementation (main files)
  src: Building blocks
  Network.bsv: Connectivity configuration (default: Mesh)
  Types/Types.bsv: Topology (number of routers), VC, routing algorithm, SMART (HPCmax) configuration
  lib: Fundamental BSV libraries (FIFOs and CReg)
  testbenches: Include synthetic traffic-based simulation
Backend/Chisel: Chisel implementation (router only)
We provide a scalable NoC generator that supports irregular topologies

12 OpenSMART Design Flow

13 How to Specify a topology
In Backend/BSV/src/Network.bsv
    ...
    // Interconnect all the data/credit West -> East links in a mesh network
    for(Integer i=0; i < meshHeight; i++) begin
      for(Integer j=0; j < meshWidth -1; j++) begin
        mkConnection(routers[i][j].dataLinks[East].getFlit, routers[i][j+1].dataLinks[West].putFlit);
        mkConnection(routers[i][j].controlLinks[East].putCredit, routers[i][j+1].controlLinks[West].getCredit);
      end
    ...
Connectivity can be changed by calling "mkConnection" with different routers/links
Automation of this process is under development
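The West -> East loop from Network.bsv can be mirrored in a small Python model to check how many links a given mesh size produces. This is only a sketch of the loop bounds, not part of OpenSMART; the function name `west_east_links` and the tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple

# (source router (row, col), source port, destination router (row, col), destination port)
Link = Tuple[Tuple[int, int], str, Tuple[int, int], str]


def west_east_links(mesh_height: int, mesh_width: int) -> List[Link]:
    """Enumerate every West -> East link using the same loop bounds as the slide."""
    links: List[Link] = []
    for i in range(mesh_height):
        # The last column has no East neighbour, hence mesh_width - 1.
        for j in range(mesh_width - 1):
            links.append(((i, j), "East", (i, j + 1), "West"))
    return links


links = west_east_links(mesh_height=4, mesh_width=4)
print(len(links))  # 4 rows x 3 links per row
print(links[0])
```

An irregular topology would replace the two nested loops with an explicit list of router/port pairs, which is what calling "mkConnection" with different routers/links amounts to.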

14 OpenSMART Design Flow

15 How to Configure OpenSMART
In Backend/BSV/Types/types.bsv
    typedef Benchmark Cycle
    typedef 32 DataSz;                 // Flit data size
    typedef 4 NumFlitsPerDataMessage;  // Determines the number of flits in a packet

    typedef 6 UserHPCMax;              // Determines HPCmax (SMART feature)
    typedef 8 MeshWidth;               // Mesh dimension (determines the number of routers)
    typedef 8 MeshHeight;

    typedef 4 NumUserVCs;              // Determines the number of VCs

    currentRoutingAlgorithm = XY_;     // Determines the routing algorithm
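To see what these settings imply, the values can be collected in a small Python record and the derived quantities computed. This is an illustrative sketch, not OpenSMART code: the class `NocConfig` and its property names are hypothetical, and treating DataSz as a width in bits is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NocConfig:
    """Mirror of the typedef values shown on the slide (names are hypothetical)."""
    data_sz: int = 32                     # DataSz: flit data size (assumed to be bits)
    num_flits_per_data_message: int = 4   # NumFlitsPerDataMessage
    user_hpc_max: int = 6                 # UserHPCMax
    mesh_width: int = 8                   # MeshWidth
    mesh_height: int = 8                  # MeshHeight
    num_user_vcs: int = 4                 # NumUserVCs

    @property
    def num_routers(self) -> int:
        # The mesh dimensions determine the number of routers.
        return self.mesh_width * self.mesh_height

    @property
    def data_message_bits(self) -> int:
        # Payload carried by one data message across all of its flits.
        return self.data_sz * self.num_flits_per_data_message

    @property
    def longest_xy_path_hops(self) -> int:
        # Corner-to-corner distance under XY routing.
        return (self.mesh_width - 1) + (self.mesh_height - 1)


cfg = NocConfig()
print(cfg.num_routers, cfg.data_message_bits, cfg.longest_xy_path_hops)
```

With the slide's values this gives 64 routers, 128 payload bits per data message, and a longest XY path of 14 hops, which is more than twice the configured HPCmax of 6.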

16 OpenSMART Design Flow

17 OpenSMART Building Blocks
Input buffer + Input VC arbitration
Output VC selection + Output port arbitration + Credit management
Switching (via crossbar) + Routing calculation
SSR communication & Arbitration + Bypass flag

18 OpenSMART Router Arbiter Number of VCs/VC Depth Flit Size

19 OpenSMART Router Arbiter

20 OpenSMART Router Routing Algorithm

21 OpenSMART Router (SMART)
HPCmax: SSRs travel up to HPCmax hops
SSR prioritization: SSRs are prioritized by distance. An SSR from a nearer router gets the higher priority, and the local router (distance = 0) has the highest priority

22 Walk-through Example
Router r4 sends a flit to router r7; router r5 sends a flit to router r7; HPCmax = 3
Cycle 0: SSR send. r4 and r5 each send an SSR (SMART Setup Request)
Cycle 1: Multi-hop bypass
[Figure labels: 110, 110, 110, 100, 100; Winner; SMART unit in r5]
The SMART unit in r5 sees both an incoming SSR and a local flit. Its arbiter is a simple priority arbiter that prioritizes SSRs by distance
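The contention case in this walk-through can be traced with a small behavioral model. This is a sketch under stated assumptions, not the OpenSMART arbiter: all flits travel East along one row, each router holds at most one sending flit, and every router applies the same distance-based priority (nearest sender wins). The function name `arbitrate_one_cycle` is hypothetical.

```python
from typing import Dict, List, Tuple


def arbitrate_one_cycle(flits: List[Tuple[int, int]], hpc_max: int) -> Dict[int, int]:
    """Return, for each source router, the router where its flit ends the cycle.

    flits: (source, destination) router indices with destination > source.
    """
    # Each SSR asks for as many hops as needed, capped at hpc_max.
    requested = {src: min(dst - src, hpc_max) for src, dst in flits}
    stops: Dict[int, int] = {}
    for src, hops in requested.items():
        position = src
        while position < src + hops:
            # Senders whose SSR covers the East output link of `position`.
            contenders = [s for s, h in requested.items() if s <= position < s + h]
            # Distance-based priority: the nearest sender (distance 0 = local) wins.
            winner = min(contenders, key=lambda s: position - s)
            if winner != src:
                break  # lost arbitration: the flit is stopped at this router
            position += 1
        stops[src] = position
    return stops


print(arbitrate_one_cycle([(4, 7), (5, 7)], hpc_max=3))
```

Under these assumptions r5's local flit wins the r5 output link, so r4's flit is stopped at r5 while r5's flit bypasses r6 and reaches r7 in the same cycle.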

23 OpenSMART Design Flow

24 How to Run OpenSMART
In Backend/BSV/
> ./OpenSMART -c      Compile synthetic traffic-based simulation
> ./OpenSMART -r      Run compiled simulation
> ./OpenSMART -v      Generate Verilog code
> ./OpenSMART -clean  Clean up build files

25 How to Run OpenSMART Simulation Compilation Print-out Messages

26 How to Run OpenSMART Simulation Print-out Messages
Simulation ticks: printed every 10,000 cycles; they indicate whether the simulation is alive

27 How to Run OpenSMART Simulation Print-out Messages
Send/receive counts for every router
Summary of the total statistics

28 How to Run OpenSMART: Generating Verilog files
Similar print-out messages as simulation compilation

29 How to Run OpenSMART: Generating Verilog files
Verilog files are generated in ./Verilog

30 Thank you!
OpenSMART (https://tinyurl.com/Get-OpenSMART)

31 Backup Slides Each design leverages latency and area

32 Outline
Motivation and Background
OpenSMART
  Getting source code
  Changing topology
  Modifying other configurations
  Building blocks
Conclusions

33 Outline
Motivation and Background
OpenSMART
  Design Flow
  Building Blocks
  Walk-through Examples
OpenSMART: User guide
  Source tree
  Commands
Conclusions

35 Walk-through Example 1
Router r4 sends a flit to router r7; HPCmax = 3
Cycle 0: SSR send. r4 sends an SSR (SMART Setup Request)
Cycle 1: Multi-hop bypass (bypass, bypass, stop)
[Figure labels: 110, 110, 110]

36 Latency
[Figure: latency under (a) uniform random and (b) bit-complement traffic; annotations: 5X, 4X]

37 Energy Consumption
Repeaters require less energy than clocked latches

38 HPCmax
Depends on operating clock frequency and technology
[Figure: (a) HPCmax on ASIC, (b) HPCmax on FPGA]

39 Outline
Motivation: Scalable, Flexible, and Low-cost NoCs
Background: SMART NoCs
OpenSMART
  Design Flow
  Building Blocks
  Walk-through Examples
Case Studies
  Mesh vs. SMART
  High-radix vs. Low-radix
Conclusions

40 Router Area

41 Router Power Number of Ports (a) ASIC (b) FPGA

42 Maximum Clock Frequency

43 Outline
Motivation: Scalable, Flexible, and Low-cost NoCs
Background: SMART NoCs
OpenSMART
  Design Flow
  Building Blocks
  Walk-through Examples
Case Studies
  Mesh vs. SMART
  High-radix vs. Low-radix
Conclusions

44 Conclusion
NoCs are crucial components for supporting many-IP heterogeneous systems: they provide connectivity while satisfying diverse requirements
OpenSMART provides automatic generation of NoCs for many-IP heterogeneous systems
  Supports the recent low-latency SMART NoC as well as highly optimized 1-cycle routers
  Written in high-level HDLs

45 Announcement
OpenSMART contributes to the open-source hardware ecosystem!
Source code will be available in May 2017
Please sign up via our webpage to request the source code
Thank you!

46 Is 1-cycle Network Possible?
Is wire fast enough to support a 1-cycle network? Yes
Wire traversal length within 1 ns (1 GHz): 10-16 mm
Wire delay over technology: constant
Chip dimension: remains similar (~20 mm)
Clock frequency: remains similar (1-3 GHz)
Tile dimension: decreases over technology
On-chip wires are fast enough to transmit across the chip within 1-2 cycles at 1 GHz even as technology scales
Repeaters are just inverters and buffers. Global repeated wires are known to require 70-80 ps to traverse 1 mm
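The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above (70-80 ps per mm, a 1 ns clock period, a ~20 mm chip); the function names are hypothetical.

```python
import math


def reach_mm(clock_period_ps: float, delay_ps_per_mm: float) -> float:
    """Distance a signal covers on a repeated wire within one clock period."""
    return clock_period_ps / delay_ps_per_mm


def cycles_to_cross(chip_mm: float, clock_period_ps: float, delay_ps_per_mm: float) -> int:
    """Whole clock cycles needed to traverse the given distance."""
    return math.ceil(chip_mm * delay_ps_per_mm / clock_period_ps)


PERIOD_PS = 1000.0  # 1 ns, i.e. a 1 GHz clock

slow_reach = reach_mm(PERIOD_PS, 80.0)   # pessimistic wire: 80 ps per mm
fast_reach = reach_mm(PERIOD_PS, 70.0)   # optimistic wire: 70 ps per mm
print(f"reach per cycle: {slow_reach:.1f} to {fast_reach:.1f} mm")
print("cycles to cross 20 mm:", cycles_to_cross(20.0, PERIOD_PS, 80.0))
```

Both delay figures give a per-cycle reach inside the 10-16 mm range quoted on the slide, and a 20 mm traversal fits in 2 cycles, consistent with the "1-2 cycles" claim.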

47 Hardware Development Cost
There has been no culture of IP reuse because companies try to squeeze out as much performance as possible by reimplementing each IP for specific systems (source: Todd Austin, Micro-49 keynote)
Low cost challenge

48 Many-IP Heterogeneous System
Network-on-Chip (NoC) Scalability challenge Flexibility challenge

49 Diverse System Requirements
Throughput Critical Latency Critical source: MNIST, Engadget, TheStack

50 OpenSMART Router (1cycle)

51 OpenSMART Router (2cycle/SMART)

52 OpenSMART Design Flow
Topology
    graph mesh16 {
      r0--r1--r2--r3;
      r4--r5--r6--r7;
      r8--r9--r10--r11;
      r12--r13--r14--r15;
      r0--r4--r8--r12;
      r1--r5--r9--r13;
      r2--r6--r10--r14;
      r3--r7--r11--r15;
    }
Configuration
    #Network
    Num_Nodes 16
    Topology_File Mesh.dot
    Routing_Algorithm XY
    Flow_Control VC
    Link_Width 128
    #Router
    Pipeline_Stages 1
    SMART False
    Num_VCs 4
    VC_Depth 1
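A topology description in this chained `a--b--c;` form is easy to expand into an explicit edge list. The parser below is a minimal sketch for exactly the syntax shown on the slide, not the OpenSMART frontend (which the deck describes as under development); `parse_edges` is a hypothetical name, and the sketch does not handle the full DOT language.

```python
from typing import List, Tuple

MESH16 = """
graph mesh16 {
  r0--r1--r2--r3; r4--r5--r6--r7; r8--r9--r10--r11; r12--r13--r14--r15;
  r0--r4--r8--r12; r1--r5--r9--r13; r2--r6--r10--r14; r3--r7--r11--r15;
}
"""


def parse_edges(dot_text: str) -> List[Tuple[str, str]]:
    """Expand chained 'a--b--c;' statements into (a, b), (b, c) edges."""
    body = dot_text[dot_text.index("{") + 1 : dot_text.rindex("}")]
    edges: List[Tuple[str, str]] = []
    for statement in body.split(";"):
        nodes = [name.strip() for name in statement.split("--") if name.strip()]
        # Each adjacent pair in a chain is one undirected link.
        edges.extend(zip(nodes, nodes[1:]))
    return edges


edges = parse_edges(MESH16)
routers = {name for edge in edges for name in edge}
print(len(edges), "links between", len(routers), "routers")
```

For the 4x4 mesh this yields 24 links over 16 routers, matching Num_Nodes 16 in the configuration.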

53 OpenSMART Design Flow Topology graph mesh16 { r0--r1--r2--r3;
}

54 OpenSMART Design Flow

55 OpenSMART Design Flow

56 OpenSMART Design Flow

57 Open-source Hardware CPU GPU Accelerator ...
All of you might be familiar with open-source hardware projects such as RISC-V CPU and MIAOW GPU

58 Designing a 1-cycle Network
What limits us from designing a 1-cycle network? Is it the wire delay?
Classic scaling challenge with wires: wire delay increases relative to logic delay
But wire delay in cycles is expected to remain constant
Wires are fast enough to transmit across the chip in 1-2 cycles today and in the future
The intuition is that the RC delay goes up as the square of the wire's length (since both R and C go up linearly with the wire's length). Breaking the long wire into multiple stages makes the delay go up linearly with the number of stages instead
Smaller transistor => smaller C, slightly higher R, overall RC lower
Thinner wire => lower cross-sectional area => higher R, lower Cgnd. Earlier, thickness was increased to keep the cross-sectional area constant; this is no longer possible
Coupling capacitance goes up as wires get closer. Keeping them 3x apart makes the coupling capacitance negligible
Sources: the future of computer performance; ITRS; DSENT simulation

59 Designing a 1-cycle Network
What limits us from designing a 1-cycle network? Is it the wire delay? No!
Dream traversal: dedicated 1-cycle wire with repeaters
Actual traversal: hop, stop, hop, with an on-chip router to manage sharing of the output link every cycle
Fully-connected: number of wires = O(n^2) (impractical). This design strategy blows up area and power
Mesh: number of wires = O(n) (practical). What is more scalable is a mesh, which means shared links
Krishna et al., HPCA 2013; Chen et al., DATE 2013; Krishna et al., IEEE Micro Top Picks 2014

60 Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay? No! A dedicated repeated wire takes 1 cycle
Is it the routers? Yes! The mesh takes 9 cycles in the best case (single-cycle router, no other traffic)
Dedicated topologies are impractical (unless we design a chip for a specific application), so routers need to share links
Router delay: 2-4 cycles; best case: 1 cycle. Note: this is at every hop
Fully-connected: number of wires = O(n^2) (impractical)
Mesh: number of wires = O(n) (practical)

61 Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay? No! A dedicated repeated wire takes 1 cycle
Is it the routers? Yes! The mesh takes 9 cycles in the best case
Dedicated topologies are impractical (unless we design a chip for a specific application), so routers need to share links
Can we address both? Yes! Single-cycle Multi-hop Asynchronous Repeated Traversal
SMART: achieve the performance of dedicated connections over a network of shared links; 1 cycle (no other traffic)
Krishna et al., HPCA 2013; Chen et al., DATE 2013; Krishna et al., IEEE Micro Top Picks 2014

62 Designing a 1-cycle Network
What limits us from designing a 1-cycle network? Is it the wire delay?
Repeated global wire delay is expected to remain constant or decrease slightly with technology scaling
Repeated global wires can go up to 17 mm within 1 ns
DSENT* setup: Technology = 45nm; Target Clock Period = 1ns; Metal Layer = M6; Repeater Spacing = 1mm; Wire Width = DRCmin; Wire Spacing = 3×DRCmin (coupling cap ≈ 0)
The intuition is that the RC delay goes up as the square of the wire's length (since both R and C go up linearly with the wire's length). Breaking the long wire into multiple stages makes the delay go up linearly with the number of stages instead
Smaller transistor => smaller C, slightly higher R, overall RC lower
Thinner wire => lower cross-sectional area => higher R, lower Cgnd. Earlier, thickness was increased to keep the cross-sectional area constant; this is no longer possible
Coupling capacitance goes up as wires get closer. Keeping them 3x apart makes the coupling capacitance negligible
*DSENT (NOCS 2012): Timing-driven NoC Power Estimation Tool
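The quadratic-versus-linear intuition can be written out with a simplified distributed-RC model. This is a sketch under assumptions, not the model DSENT uses: r and c are the wire resistance and capacitance per unit length, L is the total wire length, \ell is the repeater spacing, and t_rep is a fixed per-repeater delay. All of these symbols are introduced here for illustration, and repeater drive resistance and input capacitance are ignored.

```latex
% Unrepeated wire: R = rL and C = cL both grow with L, so the delay is quadratic in L.
t_{\text{wire}}(L) \approx \tfrac{1}{2}\,(rL)(cL) = \tfrac{1}{2}\, r c L^{2}

% Repeated wire: N = L/\ell segments of length \ell, with one repeater per segment.
t_{\text{repeated}}(L) \approx N\left(\tfrac{1}{2}\, r c\, \ell^{2} + t_{\text{rep}}\right)
  = \frac{L}{\ell}\left(\tfrac{1}{2}\, r c\, \ell^{2} + t_{\text{rep}}\right)
```

With \ell held fixed (the setup above uses 1 mm repeater spacing), the term in parentheses is a constant, so the delay of the repeated wire grows linearly with L.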

63 Critical Paths Mention 2GHz

64 Features of SMART
Conditions to stop
[HPCmax] When a flit has traversed HPCmax hops
[Turn] When a flit reaches a router where it must make a turn
[Contention] When two flits want to traverse the same link, the router prioritizes its local flit over the bypassing one
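The HPCmax and turn conditions can be illustrated by listing where a flit stops on an XY route when there is no contention. This is a behavioral sketch, not OpenSMART code: `smart_stops` is a hypothetical name, routers are addressed as (x, y) pairs, and the X-then-Y order is assumed from the XY routing setting.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def smart_stops(src: Coord, dst: Coord, hpc_max: int) -> List[Coord]:
    """Routers where a flit stops on its way from src to dst, without contention."""
    if hpc_max < 1:
        raise ValueError("hpc_max must be at least 1")
    stops: List[Coord] = []
    x, y = src
    # X leg: each traversal covers at most hpc_max hops and ends at the
    # turning router at the latest ([HPCmax] and [Turn] conditions).
    while x != dst[0]:
        step = min(abs(dst[0] - x), hpc_max)
        x += step if dst[0] > x else -step
        stops.append((x, y))
    # Y leg: same rule, ending at the destination.
    while y != dst[1]:
        step = min(abs(dst[1] - y), hpc_max)
        y += step if dst[1] > y else -step
        stops.append((x, y))
    return stops


print(smart_stops((0, 0), (5, 2), hpc_max=3))
```

For a route from (0, 0) to (5, 2) with HPCmax = 3, the flit stops once because of the hop limit, once at the turning router, and finally at the destination.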

67 Managing Distributed Arbitration
Assume HPCmax = 3 (max Hops Per Cycle)
Cycle 1: R0 sends Req = 3. R2 sends Req = 2
Can different routers enforce different priorities?
Prio = Bypass: prioritize ReqR0 over ReqR2
Prio = Local: prioritize ReqR2 over ReqR0
[Figure: routers R0 to R4, each with a buffer, bypass mux, and crossbar]

68 Managing Distributed Arbitration
Assume HPCmax = 3 (max Hops Per Cycle)
Cycle 1: R0 sends Req = 3. R2 sends Req = 2
Can different routers enforce different priorities? No!
Prio = Bypass: prioritize ReqR0 over ReqR2
Prio = Local: prioritize ReqR2 over ReqR0
R0's flit incorrectly reaches R4, instead of getting stopped at R3
[Figure: routers R0 to R4; R0 mux set to local, R1 to R3 muxes set to bypass, crossbars W->E]
October 25, 2016

69 Managing Distributed Arbitration
Distributed consensus: all routers need to make the same decision about multiple contending flits in a distributed manner
Solution: all routers follow the same static priority between the path setup requests that they receive
  Prio = Local: 0 hop > 1 hop > … > (HPCmax-1) hop > HPCmax hop
  Prio = Bypass: HPCmax hop > (HPCmax-1) hop > … > 1 hop > 0 hop
Implication: a router will not receive a flit that it does not expect
But can a router fail to receive a flit that it does expect?
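The two static orderings can be expressed as one selection function over the hop distance of each received request. This is a sketch of the priority rule only, not the OpenSMART arbiter; `pick_winner` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict


def pick_winner(distance_by_sender: Dict[str, int], policy: str) -> str:
    """Choose which setup request wins a link under a static priority.

    distance_by_sender maps each requesting router to its hop distance from
    the arbitrating router (0 means the arbitrating router's own flit).
    """
    if policy == "local":
        # 0 hop > 1 hop > ... > HPCmax hop
        return min(distance_by_sender, key=distance_by_sender.get)
    if policy == "bypass":
        # HPCmax hop > ... > 1 hop > 0 hop
        return max(distance_by_sender, key=distance_by_sender.get)
    raise ValueError(f"unknown policy: {policy}")


# Requests seen at R2 for its East link: R0 is 2 hops away, R2 is local.
at_r2 = {"R0": 2, "R2": 0}
print(pick_winner(at_r2, "local"))   # R2's own flit wins
print(pick_winner(at_r2, "bypass"))  # R0's bypassing flit wins
```

Because every router evaluates the same function over the same set of requests, all routers reach the same decision without exchanging acknowledgements, which is the consensus property described above.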

70 Managing Distributed Arbitration
Can a router not receive a flit that it does expect?
Prio = Local. ReqR0 = 3 (control for E direction); ReqR1 = X (control for N direction)
[Figure: routers R0 to R4, each with a buffer, bypass mux, and crossbar]

71 Managing Distributed Arbitration
Can a router not receive a flit that it does expect? Yes!
The R2-R3 link was granted for this cycle, but went unused. What if some other flit wanted to use it?
Is there a performance loss? No (Prio = Local)
[Figure: R0 mux local, crossbar W->E; R1 mux local, crossbar W->N; R2 mux bypass, crossbar W->E]

