Published by Jayson Collins. Modified over 6 years ago.
1
OpenSMART: An Opensource Single-cycle Multi-hop NoC Generator
Hyoukjun Kwon and Tushar Krishna, Georgia Institute of Technology, Synergy Lab. Nov 12, 2017
2
OpenSMART NoC
3
Challenges for NoCs
Scalability: supporting many-IP heterogeneous systems; lower latency; lower area and energy
Flexibility: supporting diverse connectivity for custom heterogeneous systems; supporting diverse latency/throughput requirements
Design cost: automating the design of high-performance, low-energy NoCs; lowering design/verification costs of SoCs with NoCs
4
OpenSMART
5
SMART NoC: Single-cycle Multi-hop Asynchronous Repeated Traversal
SMART: achieve the performance of dedicated connections over a network of shared links.
[Figure: source S sends SSRs (SMART Setup Requests) toward destination D, up to HPCmax hops; the traversal takes 1 cycle when there is no other traffic.]
Krishna et al., HPCA 2013; Chen et al., DATE 2013; Krishna et al., IEEE Micro Top Picks 2014
6
Features of SMART
Low latency network: dynamic bypass of intermediate routers between any two routers. Limit: HPCmax (hops per cycle max), the maximum number of hops that the underlying wire allows a flit to traverse within a clock cycle.
Separate control path: HPCmax bits from every router along each direction; arbitration of multiple bypass requests on the same link; no ACK required.
7
How to Get the Source Code
Go to Synergy lab homepage (synergy.ece.gatech.edu)
8
How to Get the Source Code
In the released tools tab, click OPENSMART
9
How to Get the Source Code
You will be forwarded to the access request form. Fill in and submit the form, and you will receive a link to the OpenSMART repository.
10
How to Get the Source Code
Using the link, you can access the repository
11
Source Tree (Under Backend/BSV)
Frontend: configuration parser (under development)
Backend/BSV: BSV implementation (main files)
src: building blocks
Network.bsv: connectivity configuration (default: mesh)
Types/Types.bsv: topology (number of routers), VC, routing algorithm, SMART (HPCmax) configuration
lib: fundamental BSV libraries (FIFOs and CReg)
testbenches: includes synthetic traffic-based simulation
Backend/Chisel: Chisel implementation (router only)
We provide a scalable NoC generator that supports irregular topologies.
12
OpenSMART Design Flow
13
How to Specify a topology
In Backend/BSV/src/Network.bsv:
    ...
    // Interconnect all the data/credit West -> East links in a mesh network
    for(Integer i = 0; i < meshHeight; i = i + 1) begin
      for(Integer j = 0; j < meshWidth - 1; j = j + 1) begin
        mkConnection(routers[i][j].dataLinks[East].getFlit, routers[i][j+1].dataLinks[West].putFlit);
        mkConnection(routers[i][j].controlLinks[East].putCredit, routers[i][j+1].controlLinks[West].getCredit);
      end
    ...
Connectivity can be changed by calling "mkConnection" with different routers/links.
Automation of this process is under development.
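To see which router pairs the loop above wires together, the following Python sketch prints the same West -> East connection statements as plain text. It is only an illustration: the function name is hypothetical, and the emitted strings simply mirror the interface names shown on the slide rather than anything checked against the released code.

```python
def mesh_east_west_connections(mesh_width, mesh_height):
    """Return the West -> East mkConnection statements for a mesh, as strings.

    Hypothetical helper: the text it emits copies the names used on the slide.
    """
    lines = []
    for i in range(mesh_height):
        # The last column has no East neighbour, hence mesh_width - 1.
        for j in range(mesh_width - 1):
            src = f"routers[{i}][{j}]"
            dst = f"routers[{i}][{j + 1}]"
            lines.append(
                f"mkConnection({src}.dataLinks[East].getFlit, "
                f"{dst}.dataLinks[West].putFlit);"
            )
            lines.append(
                f"mkConnection({src}.controlLinks[East].putCredit, "
                f"{dst}.controlLinks[West].getCredit);"
            )
    return lines


if __name__ == "__main__":
    for line in mesh_east_west_connections(mesh_width=3, mesh_height=2):
        print(line)
```

An irregular topology would replace the two nested loops with an explicit list of router pairs.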
14
OpenSMART Design Flow
15
How to Configure OpenSMART
In Backend/BSV/Types/types.bsv:
    typedef Benchmark Cycle
    typedef 32 DataSz;                  // Flit data size
    typedef 4  NumFlitsPerDataMessage;  // Number of flits in a packet

    typedef 6  UserHPCMax;              // HPCmax (SMART feature)
    typedef 8  MeshWidth;               // Mesh dimensions (determine the number of routers)
    typedef 8  MeshHeight;

    typedef 4  NumUserVCs;              // Number of VCs

    currentRoutingAlgorithm = XY_;      // Routing algorithm
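As a rough check on what these settings imply, the sketch below derives a few quantities from the values shown on the slide. The class and property names are hypothetical and exist only for this illustration; they are not part of OpenSMART.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NocConfig:
    """Values copied from the slide; the field names are illustrative only."""
    data_sz: int = 32                    # DataSz: flit data size in bits
    num_flits_per_data_message: int = 4  # NumFlitsPerDataMessage
    user_hpc_max: int = 6                # UserHPCMax
    mesh_width: int = 8                  # MeshWidth
    mesh_height: int = 8                 # MeshHeight
    num_user_vcs: int = 4                # NumUserVCs

    @property
    def num_routers(self) -> int:
        # The mesh dimensions determine the number of routers.
        return self.mesh_width * self.mesh_height

    @property
    def data_message_bits(self) -> int:
        # Data bits carried by one data message (flits x flit data size).
        return self.data_sz * self.num_flits_per_data_message

    @property
    def max_xy_hops(self) -> int:
        # Longest XY route: corner to opposite corner.
        return (self.mesh_width - 1) + (self.mesh_height - 1)


if __name__ == "__main__":
    cfg = NocConfig()
    print(cfg.num_routers, cfg.data_message_bits, cfg.max_xy_hops)
```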
16
OpenSMART Design Flow
17
OpenSMART Building Blocks
Input buffer + input VC arbitration
Output VC selection + output port arbitration + credit management
Switching (via crossbar) + routing calculation
SSR communication and arbitration + bypass flag
18
OpenSMART Router Arbiter Number of VCs/VC Depth Flit Size
19
OpenSMART Router Arbiter
20
OpenSMART Router Routing Algorithm
21
OpenSMART Router (SMART)
HPCmax: SSRs travel up to HPCmax hops from the router that sends them.
SSR prioritization: by distance. An SSR from a nearer router gets higher priority, and a local request (distance = 0) has the highest priority. A walk-through example later in the deck shows this.
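The distance-based prioritization can be sketched in a few lines of Python. This is a behavioural illustration only, not the BSV arbiter from the repository; the function names and the representation of requests as hop distances are assumptions made for the example.

```python
def pick_winner(ssr_distances, has_local_flit):
    """Return the distance of the winning request, or None if there is none.

    ssr_distances: hop distances of the routers whose SSRs reached this router.
    A local flit counts as a request at distance 0, which always wins.
    """
    candidates = list(ssr_distances)
    if has_local_flit:
        candidates.append(0)
    # Nearer requests have higher priority, so the smallest distance wins.
    return min(candidates) if candidates else None


def bypass_enabled(ssr_distances, has_local_flit):
    """The bypass path is used only when a remote SSR wins the arbitration."""
    winner = pick_winner(ssr_distances, has_local_flit)
    return winner is not None and winner > 0


if __name__ == "__main__":
    # An SSR from one hop away competes with a local flit: local wins.
    print(pick_winner([1], has_local_flit=True))
    # Two remote SSRs and no local flit: the nearer one wins.
    print(pick_winner([3, 1], has_local_flit=False))
```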
22
Walk-through Example
Router r4 sends a flit to router r7, and router r5 sends a flit to router r7. HPCmax = 3.
Cycle 0: SSR send. r4 and r5 send SSRs (SMART Setup Requests). [Figure: SSR bits 110 110 110 100 100]
Cycle 1: Multi-hop bypass.
Contention: once the SSRs are sent, the SMART unit in r5 sees both an incoming SSR (hasSSR) and a local flit, and its priority arbiter sets the bypass flag (shown in red). The arbiter is a simple one that prioritizes SSRs by distance; the figure marks the winner.
23
OpenSMART Design Flow
24
How to Run OpenSMART
In Backend/BSV/:
    > ./OpenSMART -c        Compile the synthetic traffic-based simulation
    > ./OpenSMART -r        Run the compiled simulation
    > ./OpenSMART -v        Generate Verilog code
    > ./OpenSMART -clean    Clean up build files
25
How to Run OpenSMART Simulation Compilation Print-out Messages
26
How to Run OpenSMART Simulation Print-out Messages
Simulation ticks: printed every 10,000 cycles; they indicate whether the simulation is still alive.
27
How to Run OpenSMART Simulation Print-out Messages
Send/receive counts for every router
Summary of the total statistics
28
How to Run OpenSMART: Generating Verilog files
Print-out messages are similar to those of simulation compilation.
29
How to Run OpenSMART: Generating Verilog files
Verilog files are generated in ./Verilog
30
Thank you!
OpenSMART (https://tinyurl.com/Get-OpenSMART)
31
Backup Slides Each design leverages latency and area
32
Outline
Motivation and Background
OpenSMART: Getting source code; Changing topology; Modifying other configurations; Building blocks
Conclusions
33
Outline
Motivation and Background
OpenSMART: Design Flow; Building Blocks; Walk-through Examples
OpenSMART User guide: Source tree; Commands
Conclusions
35
Walk-through Example 1
Router r4 sends a flit to router r7. HPCmax = 3.
Cycle 0: SSR send. r4 sends an SSR (SMART Setup Request). [Figure: SSR bits 110 110 110]
Cycle 1: Multi-hop bypass (bypass, bypass, stop).
36
Latency
[Figure: latency under (a) Uniform Random and (b) Bit-complement traffic; annotations: 5X, 4X]
37
Energy Consumption
Repeaters require less energy than clocked latches.
Anticipated question: How did you estimate the energy?
38
HPCmax
Depends on the operating clock frequency and the technology.
[Figure: (a) HPCmax on ASIC, (b) HPCmax on FPGA]
39
Outline
Motivation: Scalable, Flexible, and Low-cost NoCs
Background: SMART NoCs
OpenSMART: Design Flow; Building Blocks; Walk-through Examples
Case Studies: Mesh vs. SMART; High-radix vs. Low-radix
Conclusions
40
Router Area
41
Router Power Number of Ports (a) ASIC (b) FPGA
42
Maximum Clock Frequency
44
Conclusion
NoCs are crucial components for supporting many-IP heterogeneous systems: they provide connectivity while satisfying the systems' diverse requirements.
OpenSMART provides automatic generation of NoCs for many-IP heterogeneous systems.
It supports the recent low-latency SMART NoC as well as highly optimized 1-cycle routers.
It is written in high-level HDLs.
45
Announcement
OpenSMART contributes to the open-source hardware ecosystem!
Source code will be available in May 2017.
Please sign up via our webpage to request the source code.
Thank you!
46
Is 1-cycle Network Possible?
Yes.
Is wire fast enough to support a 1-cycle network?
Wire traversal length within 1 ns (1 GHz): 10-16 mm
Wire delay over technology: constant
Chip dimension: remains similar (~20 mm)
Clock frequency: remains similar (1-3 GHz)
Tile dimension: decreases over technology
On-chip wires are fast enough to transmit across the chip within 1-2 cycles at 1 GHz, even as technology scales.
Notes: repeaters are just inverters and buffers. It is known that 70-80 ps is required to traverse 1 mm with global repeated wires.
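The figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes only the 70 to 80 ps per mm repeated-wire delay and the roughly 20 mm chip dimension quoted above; the function names are made up for this example.

```python
import math


def reach_mm(clock_ghz, ps_per_mm):
    """Distance a repeated wire can cover within one clock period."""
    period_ps = 1000.0 / clock_ghz
    return period_ps / ps_per_mm


def cycles_to_cross(distance_mm, clock_ghz, ps_per_mm):
    """Whole clock cycles needed to traverse distance_mm of repeated wire."""
    period_ps = 1000.0 / clock_ghz
    return math.ceil(distance_mm * ps_per_mm / period_ps)


if __name__ == "__main__":
    for ps_per_mm in (70, 80):
        print(
            f"{ps_per_mm} ps/mm: reach at 1 GHz = {reach_mm(1.0, ps_per_mm):.1f} mm, "
            f"cycles to cross 20 mm = {cycles_to_cross(20, 1.0, ps_per_mm)}"
        )
```

At 1 GHz this gives a reach of about 12.5 to 14.3 mm per cycle, inside the 10-16 mm range on the slide, and two cycles to cross a 20 mm chip.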
47
Hardware Development Cost
There has been no culture of IP reuse because companies try to squeeze out as much performance as possible by reimplementing each IP for specific systems. Source: Todd Austin, Micro-49 keynote. Low cost challenge.
48
Many-IP Heterogeneous System
Network-on-Chip (NoC) Scalability challenge Flexibility challenge
49
Diverse System Requirements
Throughput Critical Latency Critical source: MNIST, Engadget, TheStack
50
OpenSMART Router (1cycle)
51
OpenSMART Router (2cycle/SMART)
52
OpenSMART Design Flow: Configuration and Topology
Topology:
    graph mesh16 {
      r0--r1--r2--r3;
      r4--r5--r6--r7;
      r8--r9--r10--r11;
      r12--r13--r14--r15;
      r0--r4--r8--r12;
      r1--r5--r9--r13;
      r2--r6--r10--r14;
      r3--r7--r11--r15;
    }
Configuration:
    #Network
    Num_Nodes 16
    Topology_File Mesh.dot
    Routing_Algorithm XY
    Flow_Control VC
    Link_Width 128
    #Router
    Pipeline_Stages 1
    SMART False
    Num_VCs 4
    VC_Depth 1
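To make the topology description concrete, here is a minimal Python sketch that turns the chain-style graph above into a set of undirected links. It is not the OpenSMART frontend (which the deck describes as under development) and it handles only the `a--b--c;` syntax used on this slide, not the full DOT language; all names are hypothetical.

```python
MESH16 = """
graph mesh16 {
  r0--r1--r2--r3; r4--r5--r6--r7; r8--r9--r10--r11; r12--r13--r14--r15;
  r0--r4--r8--r12; r1--r5--r9--r13; r2--r6--r10--r14; r3--r7--r11--r15;
}
"""


def parse_dot_edges(dot_text):
    """Return the undirected edges of a chain-style graph as frozensets."""
    body = dot_text[dot_text.index("{") + 1 : dot_text.rindex("}")]
    edges = set()
    for statement in body.split(";"):
        nodes = [name.strip() for name in statement.split("--") if name.strip()]
        # A chain a--b--c contributes the edges (a, b) and (b, c).
        for a, b in zip(nodes, nodes[1:]):
            edges.add(frozenset((a, b)))
    return edges


def neighbors(edges, node):
    """Sorted names of the routers directly linked to node."""
    return sorted(other for edge in edges if node in edge for other in edge - {node})


if __name__ == "__main__":
    edges = parse_dot_edges(MESH16)
    print(len(edges), "links")
    print("r5 connects to", neighbors(edges, "r5"))
```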
53
OpenSMART Design Flow Topology graph mesh16 { r0--r1--r2--r3;
}
54
OpenSMART Design Flow
55
OpenSMART Design Flow
56
OpenSMART Design Flow
57
Open-source Hardware CPU GPU Accelerator ...
All of you might be familiar with open-source hardware projects such as RISC-V CPU and MIAOW GPU
58
Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay?
Classic scaling challenge with wires: wire delay increases relative to logic delay. But wire delay in cycles is expected to remain constant.
Wires are fast enough to transmit across the chip in 1-2 cycles today and in the future.
Sources: the future of computer performance; ITRS; DSENT simulation
Notes:
The intuition is that the RC delay goes up as the square of the wire's length (since both R and C go up linearly with the wire's length). Breaking the long wire into multiple stages makes the delay go up linearly with the number of stages instead.
Smaller transistor => smaller C, slightly higher R, overall RC lower.
Thinner wire => lower cross-sectional area => higher R, lower Cgnd. Earlier, thickness was increased to keep the cross-sectional area constant; this is no longer possible.
Coupling capacitance goes up as wires get closer. Keep them 3x apart => coupling cap negligible.
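The quadratic-versus-linear intuition in the notes can be written out as follows. This is a first-order sketch, not a result from the deck: $r$ and $c$ (wire resistance and capacitance per unit length), $L$ (total wire length), $\ell$ (repeater spacing), $t_{\text{rep}}$ (delay of one repeater) and the constant $k$ are all symbols introduced here for illustration.

```latex
% Assumed symbols: r, c = resistance and capacitance per unit length,
% L = wire length, \ell = repeater spacing, t_rep = repeater delay,
% k = a technology-dependent constant of order one.
\begin{align}
t_{\text{unrepeated}} &\approx k\,(rL)(cL) = k\,r\,c\,L^{2} \\
t_{\text{repeated}}   &\approx \frac{L}{\ell}\left(k\,r\,c\,\ell^{2} + t_{\text{rep}}\right)
\end{align}
```

With the repeater spacing $\ell$ held fixed, the term in parentheses is a constant, so the repeated-wire delay grows linearly with $L$ (that is, with the number of stages $L/\ell$) rather than quadratically.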
59
Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay? No!
[Figure: dream traversal over a dedicated 1-cycle repeated wire versus actual traversal (hop, stop, hop), where an on-chip router manages sharing of the output link every cycle.]
Fully-connected (impractical): number of wires = O(n^2). This design strategy blows up area and power.
Mesh (practical): number of wires = O(n), with shared links. The mesh is what scales.
Krishna et al., HPCA 2013; Chen et al., DATE 2013; Krishna et al., IEEE Micro Top Picks 2014
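A quick count makes the O(n^2) versus O(n) contrast tangible. The sketch below counts bidirectional links, assuming one link per router pair in the fully-connected case and one link per adjacent pair in the mesh; the function names are illustrative.

```python
def fully_connected_links(num_routers):
    """One dedicated link per pair of routers: n * (n - 1) / 2."""
    return num_routers * (num_routers - 1) // 2


def mesh_links(width, height):
    """Horizontal plus vertical links between adjacent routers in a 2D mesh."""
    return height * (width - 1) + width * (height - 1)


if __name__ == "__main__":
    for side in (4, 8, 16):
        n = side * side
        print(
            f"{n:4d} routers: fully-connected = {fully_connected_links(n):6d} links, "
            f"mesh = {mesh_links(side, side):4d} links"
        )
```

For 64 routers the fully-connected topology needs 2016 links while the 8x8 mesh needs 112.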
60
Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay? No!
Is it the routers? Yes!
Dedicated topologies are impractical*. Routers need to share links.
Router delay: 2-4 cycles. Best case: 1 cycle. Note: this is at every hop.
[Figure: repeated wire, 1 cycle, versus mesh, 9 cycles in the best case (single-cycle router, no other traffic).]
Fully-connected (impractical): number of wires = O(n^2)
Mesh (practical): number of wires = O(n)
*Unless we design a chip for a specific application.
61
Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay? No!
Is it the routers? Yes!
Dedicated topologies are impractical*. Routers need to share links.
[Figure: repeated wire, 1 cycle, versus mesh, 9 cycles in the best case.]
Can we address both? Yes!
SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal): achieve the performance of dedicated connections over a network of shared links. [Figure: 1-cycle traversal with no other traffic]
*Unless we design a chip for a specific application.
Krishna et al., HPCA 2013; Chen et al., DATE 2013; Krishna et al., IEEE Micro Top Picks 2014
62
Designing a 1-cycle Network
What limits us from designing a 1-cycle network?
Is it the wire delay?
Repeated global wire delay is expected to remain constant or decrease slightly with technology scaling.
Repeated global wires can go up to 17 mm within 1 ns.
Setup: technology = 45 nm; target clock period = 1 ns; metal layer = M6; repeater spacing = 1 mm; wire width = DRCmin; wire spacing = 3×DRCmin (coupling capacitance negligible)
*DSENT (NOCS 2012): timing-driven NoC power estimation tool
63
Critical Paths Mention 2GHz
64
Features of SMART: Conditions to stop
[HPCmax] When a flit has traversed HPCmax hops
[Turn] When a flit reaches a router where it makes a turn
[Contention] When two flits want to traverse the same link, the router prioritizes its local flit over the bypassing one
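The first two stop conditions can be illustrated with a small Python sketch that lists the routers where a flit is buffered along an XY route. It assumes no contention (so the third condition never triggers) and uses (x, y) coordinates; the function name and the coordinate convention are assumptions for this example only.

```python
def xy_stops(src, dst, hpc_max):
    """Routers where a flit stops on an XY route, assuming no contention.

    The flit stops after at most hpc_max hops ([HPCmax]) and at the router
    where it turns from the X dimension to the Y dimension ([Turn]).
    """
    if hpc_max < 1:
        raise ValueError("hpc_max must be at least 1")
    x, y = src
    stops = []
    # X dimension first; the last stop of this loop is the turning router.
    while x != dst[0]:
        step = min(hpc_max, abs(dst[0] - x))
        x += step if dst[0] > x else -step
        stops.append((x, y))
    # Then the Y dimension, again at most hpc_max hops per traversal.
    while y != dst[1]:
        step = min(hpc_max, abs(dst[1] - y))
        y += step if dst[1] > y else -step
        stops.append((x, y))
    return stops


if __name__ == "__main__":
    # 7 hops in X with HPCmax = 3 take three traversals; the turn forces a stop.
    print(xy_stops((0, 0), (7, 2), hpc_max=3))
```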
67
Managing Distributed Arbitration
Assume HPCmax = 3 (max hops per cycle).
Cycle 1: R0 sends Req = 3. R2 sends Req = 2.
Can different routers enforce different priorities?
[Figure: routers R0 to R4, each with an enable, buffer, mux, and crossbar whose settings are still undecided (X). One router uses Prio = Bypass and prioritizes ReqR0 over ReqR2; another uses Prio = Local and prioritizes ReqR2 over ReqR0.]
68
Managing Distributed Arbitration
Assume HPCmax = 3 (max hops per cycle).
Cycle 1: R0 sends Req = 3. R2 sends Req = 2.
Can different routers enforce different priorities? No!
R0's flit incorrectly reaches R4, instead of getting stopped at R3.
[Figure: R0 mux = local, xbar W->E; R1, R2, and R3 mux = bypass, xbar W->E; R4 buffer enable = 1. One router uses Prio = Bypass and prioritizes ReqR0 over ReqR2; another uses Prio = Local and prioritizes ReqR2 over ReqR0.]
October 25, 2016
69
Managing Distributed Arbitration
Distributed consensus: all routers need to take the same decision about multiple contending flits in a distributed manner.
Solution: all routers follow the same static priority between the path setup requests that they receive.
Prio = Local: 0 hop > 1 hop > … > (HPCmax-1) hop > HPCmax hop
Prio = Bypass: HPCmax hop > (HPCmax-1) hop > … > 1 hop > 0 hop
Implication: a router will not receive a flit that it does not expect.
But can a router not receive a flit that it does expect?
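The need for one shared static priority can be reproduced with a small behavioural model of the five-router example (R0 requests 3 hops, R2 requests 2 hops). This is a simplified one-dimensional, eastbound-only sketch written for illustration; the function names, the dictionary of requests, and the per-router priority list are assumptions, not the OpenSMART implementation.

```python
LOCAL, BYPASS = "local", "bypass"


def winner_at(router, requests, prio):
    """Source router of the request that wins arbitration at `router`.

    requests maps a source router index to the number of hops it asked for.
    Prio = Local favours the nearest request, Prio = Bypass the farthest.
    """
    covering = {
        src: router - src
        for src, hops in requests.items()
        if 0 <= router - src <= hops
    }
    if not covering:
        return None
    choose = min if prio == LOCAL else max
    return choose(covering, key=covering.get)


def final_router(src, requests, prios):
    """Router where the flit launched by `src` ends up after one traversal."""
    if winner_at(src, requests, prios[src]) != src:
        return src  # lost arbitration at home; the flit stays buffered
    pos = src
    while pos + 1 < len(prios):
        pos += 1
        win = winner_at(pos, requests, prios[pos])
        if win is None or win == pos:
            return pos  # no bypass set up here, so the flit is buffered
        if pos == win + requests[win]:
            return pos  # this router is the stop point of the winning request
        # Otherwise the bypass mux is set and the flit keeps going.
    return pos


if __name__ == "__main__":
    requests = {0: 3, 2: 2}
    mixed = [LOCAL, LOCAL, BYPASS, LOCAL, LOCAL]  # R2 and R3 disagree
    print("mixed priorities:", final_router(0, requests, mixed))
    print("all Prio = Local:", final_router(0, requests, [LOCAL] * 5))
    print("all Prio = Bypass:", final_router(0, requests, [BYPASS] * 5))
```

With mixed priorities R0's flit overshoots to R4; with a uniform priority it stops at or before R3, so no router receives a flit it does not expect.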
70
Managing Distributed Arbitration
Can a router not receive a flit that it does expect?
[Figure: routers R0 to R4 with Prio = Local, each with an enable, buffer, mux, and crossbar whose settings are still undecided (X). ReqR0 = 3 travels on the control path for the E direction; ReqR1 = X travels on the control path for the N direction.]
71
Managing Distributed Arbitration
Can a router not receive a flit that it does expect? Yes!
Is there a performance loss? The R2-to-R3 link was granted for this cycle, but went unused. What if some other flit wanted to use it? No. (Prio = Local)
[Figure: R0 mux = local, xbar W->E; R1 buffer enable = 1, mux = local, xbar W->N; R2 mux = bypass, xbar W->E; R3 buffer enable = 1; R4 unset (X).]