NoC: Network OR Chip? Israel Cidon Technion
Technion’s NoC Research
PIs: Israel Cidon (networking), Ran Ginosar (VLSI), Idit Keidar (Dist. Systems), Avinoam Kolodny (VLSI)
Students: Evgeny Bolotin, Reuven Dobkin, Zvika Guz, Arkadiy Morgenshtein, Zigi Walter, Roman Gindin
Origins of the NoC concept
Early publications:
- Guerrier and Greiner (2000), “A generic architecture for on-chip packet-switched interconnections”
- Hemani, Jantsch, Kumar, Postula, Oberg, Millberg and Lindqvist (2000), “Network on chip: An architecture for billion transistor era”
- Dally and Towles (2001), “Route packets, not wires: on-chip interconnection networks”
- Wingard (2001), “MicroNetwork-based integration of SoCs”
- Rijpkema, Goossens and Wielage (2001), “A router architecture for networks on silicon”
- De Micheli and Benini (2002), “Networks on chip: A new paradigm for systems on chip design”
- Bolotin, Cidon, Ginosar and Kolodny (2004), “QNoC: QoS architecture and design process for network on chip”
Evolution or Paradigm Shift?
[Diagram: computing modules connected by network links and routers vs. a bus]
- Architectural paradigm shift: replace wire spaghetti with an intelligent network infrastructure
- Design paradigm shift: busses and signals replaced by packets
- Organizational paradigm shift: create a new discipline, a new infrastructure responsibility
Characteristics of a successful paradigm shift
- Addresses a critical and topical need
- Enables a quantum leap in productivity and application
- Meets resistance from legacy experts
- Requires a major change of mindset and skills!
Think: Networking, not Bus evolution!
Critical needs addressed by NoC
1) Efficient interconnect: delay, power, noise, scalability, reliability
2) Increased system integration productivity
3) Enabling Chip Multi-Processors
NoC offers Area and Power Scalability
For the same performance, compare the wire area and power:
[Chart: wire area and power for NoC, simple bus, point-to-point, and segmented bus interconnects]
E. Bolotin et al., “Cost Considerations in Network on Chip”, Integration, special issue on Network on Chip, October 2004
4 Decades of Network 101
- Networks evolved from busses and point-to-point connections
- Extensive architecture, modeling and analysis research
- Architecture is about optimizing network costs
- Different goals and element costs => different architectures: Local Area Networks (LANs), Metropolitan Area Networks (MANs), system interconnect networks (SAN, InfiniBand, …), WAN (TCP/IP, ATM, …), wireless networks
- Cross-layered design
- Early architecture standardization is an optimization burden!
Local Area Networks (LANs)
Critical need: distributing operations and sharing of heterogeneous systems
Constraints: standardization
Main cost: incremental cost (NICs, wiring)
Typical optimized architecture: low-cost hubs/switches; tree-like architecture; exploit low-cost local BW; shared media; broadcast; host-embedded NICs
System interconnect (SAN, InfiniBand)
Critical need: create a powerful specialized system from low-cost units
Constraints: low latency
Main cost: total system cost per MIP
Typical architecture: wormhole/cut-through switching; connection based; over-provisioned network; high-degree/regular topology; specific optimizations (e.g., RDMA)
WAN (TCP/IP, ATM, …)
Critical need: global application networking (collaboration, WWW, file sharing, voice)
Constraints: scalability; heterogeneous user and application QoS requirements
Main cost: physical infrastructure (mainly long-distance trunks)
Typical architecture of choice: packet switching; irregular, small-degree networks of high-speed trunks; optimization of topology and link capacities
CAN (Chip Area Network) optimization
The design envelope (constraints): the collection of designs supported by a given chip; the convex hull of traffic requirements over all configurations; QoS constraints; other requirements (e.g., design automation)
The main cost(s): total area; power; others: design time, verification and testability
Optimization variables: switching mechanism; QoS; topology (incl. link capacities); routing; flow and congestion control; buffering; application support; …
One NoC does not fit all!
[Chart: reconfiguration rate (at design time, at boot time, during run time) vs. flexibility (single application to general-purpose computer), positioning ASIC, FPGA, ASSP and CMP]
I. Cidon and K. Goossens, in “Networks on Chips”, G. De Micheli and L. Benini (eds.), Morgan Kaufmann, 2006
One NoC does not fit all!
[Chart: traffic unpredictability (at design time, at configuration, at run time) vs. flexibility (single application to general-purpose computer), positioning ASIC, FPGA, ASSP and CMP]
A large solution range!
I. Cidon and K. Goossens, in “Networks on Chips”, G. De Micheli and L. Benini (eds.), Morgan Kaufmann, 2006
Apply the paradigm to ASIC-based NoC
Design envelope / constraints: well-defined inter-module traffic; automatic synthesis; variable QoS requirements
Main cost: power and area
Architecture of choice: wormhole or small-frame switching; small number of buffers, VCs, tables; simple QoS mechanisms (which?); topology and routing optimized for cost
Example: QNoC
A quality-of-service NoC architecture for ASICs; traffic requirements are known a priori
Overall approach: wormhole switching; QoS based on priority classes; small buffer/VC budget; in-order SP XY routing (see the sketch below); irregular topology; optimized link capacities
[Figure: irregular mesh with routers at grid coordinates]
* E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, “QNoC: QoS architecture and design process for Network on Chip”, JSA special issue on NoC, 2004.
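To make the routing rule concrete, here is a minimal Python sketch of dimension-ordered XY routing on a mesh. The function name and port labels are illustrative, not taken from the QNoC implementation:

    # Dimension-ordered XY routing: exhaust the X offset, then the Y
    # offset. On a full mesh this is deadlock-free and keeps each
    # flow's packets in order, matching the slide's "in-order XY".
    def xy_route(src, dst):
        """Return the sequence of output ports from src to dst."""
        (x, y), (dx, dy) = src, dst
        hops = []
        while x != dx:                     # X dimension first
            hops.append('E' if dx > x else 'W')
            x += 1 if dx > x else -1
        while y != dy:                     # then the Y dimension
            hops.append('N' if dy > y else 'S')
            y += 1 if dy > y else -1
        return hops

    # Example: from router (0, 0) to router (2, 1)
    print(xy_route((0, 0), (2, 1)))        # ['E', 'E', 'N']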
Quality-of-Service in QNoC
- Multiple priority classes define latency; service is preemptive
- Possible ASIC classes: Signaling; Real-Time Stream; Read-Write; DMA Block Transfer
- Statistical guarantees, e.g., <0.01% of packets arrive later than required
* E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, “QNoC: QoS architecture and design process for Network on Chip”, JSA special issue on NoC, 2004.
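A minimal sketch of the preemptive priority rule, assuming one queue per service level at a link; the class names follow the slide, while the data structures are illustrative:

    from collections import deque

    # Service levels from highest to lowest priority (per the slide).
    PRIORITY = ['Signaling', 'RealTimeStream', 'ReadWrite', 'BlockTransfer']

    def next_flit(vc_queues):
        """Pick the flit to send: the highest-priority non-empty class
        wins, preempting lower-priority worms between flits."""
        for level in PRIORITY:
            q = vc_queues.get(level)
            if q:
                return level, q.popleft()
        return None                        # link idles this cycle

    queues = {'ReadWrite': deque(['rw0']), 'Signaling': deque(['sig0'])}
    print(next_flit(queues))               # ('Signaling', 'sig0')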
QNoC Design Flow
1. Extract inter-module traffic
2. Place modules
3. Allocate link capacities
4. Verify QoS and cost
QNoC Design Flow: capacity allocation
Optimize link capacities for the performance/power tradeoff. Capacity allocation is a traditional WAN optimization problem; in a wormhole NoC, however, it requires a new delay model.
Wormhole Delay Modeling
Approximate delay analysis in wormhole networks with multiple virtual channels, different link capacities and different communication demands
Modeled delay components: queuing delay; a flit-interleaving delay approximation
* I. Walter, Z. Guz, I. Cidon, R. Ginosar and A. Kolodny, “Efficient Link Capacity and QoS Design for Wormhole Network-on-Chip,” DATE 2006.
The Capacity Allocation Problem
Given: the system topology and routing; each flow’s bandwidth (f_i) and delay bound (T_i^REQ)
Minimize total link capacity such that every flow meets its delay bound (formalized below).
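One way to write the problem down, assuming a generic M/M/1-style per-hop delay term; this is only an illustrative stand-in, since the DATE 2006 paper derives a wormhole-specific delay approximation instead:

    % Capacity allocation: minimize total capacity subject to per-flow
    % delay bounds. F_l is the total bandwidth routed over link l.
    \begin{aligned}
      \min_{\{c_l\}} \quad & \sum_{l \in L} c_l \\
      \text{s.t.}    \quad & \sum_{l \in \mathrm{path}(i)} \frac{1}{c_l - F_l}
                             \le T_i^{\mathrm{REQ}} \quad \forall \text{ flow } i, \\
                           & c_l > F_l, \qquad
                             F_l = \sum_{i :\, l \in \mathrm{path}(i)} f_i .
    \end{aligned}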
Capacity Allocation: a realistic example
A SoC-like system with realistic traffic demands and delay requirements
“Classic” design: 41.8 Gbit/s; using the algorithm: 28.7 Gbit/s; total capacity reduced by 30%
[Figure: mesh link capacities before and after optimization]
Optimizing routing on an Irregular Mesh
[Figure: routes forced around a block and into a dead end by missing mesh links]
Goal: minimize the total size of routing tables
E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, "Routing Table Minimization for Irregular Mesh NoCs", DATE 2007.
Saving Table Hardware
Traditional solutions use full routing tables: destination-based routing (a table at each router) or source routing (tables at the sources)
Solution idea: use reduced tables; store only the relevant destinations (PLA); a default function (“Go XY” or “Don’t turn”) plus a table for deviations, as sketched below
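A sketch of the reduced-table idea in Python (illustrative only, not the DATE 2007 hardware): the router follows a default XY rule and consults a small table holding just the deviations forced by the irregular topology:

    def output_port(router, dst, deviation_table):
        """deviation_table maps (router, dst) -> forced output port;
        every other destination falls through to default XY routing."""
        forced = deviation_table.get((router, dst))
        if forced is not None:
            return forced                  # stored exception (PLA entry)
        (x, y), (dx, dy) = router, dst     # default function: "Go XY"
        if x != dx:
            return 'E' if dx > x else 'W'
        if y != dy:
            return 'N' if dy > y else 'S'
        return 'LOCAL'                     # flit has arrived

    # Only deviations are stored, so table size tracks the number of
    # obstacles rather than the number of destinations.
    dev = {((1, 0), (3, 0)): 'N'}          # detour around a missing link
    print(output_port((1, 0), (3, 0), dev))   # 'N'  (deviation)
    print(output_port((1, 0), (2, 2), dev))   # 'E'  (default XY)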
Routing Heuristics for Irregular Mesh
Schemes compared: distributed routing (full tables); X-Y routing with deviation tables; source routing; source routing for deviation points
Evaluated on random problem instances and on systems with real applications
Efficient Routing Results
[Chart: table-hardware savings vs. network size, showing how the savings scale]
NoC for Shared-Memory CMP
Constraints: multiple accesses to a coherent cache; unpredictable traffic patterns; QoS requirements (fetch, pre-fetch)
Main cost: CMP power/area per unit of performance
Architecture of choice: tailored for a given CMP; in-order or adaptive routing? simple QoS mechanisms? regular topology (is the CMP symmetric)? built-in support functions (multicast, search, …)
NoC can facilitate critical transactions
* E. Bolotin, Z. Guz, I. Cidon, R. Ginosar and A. Kolodny, “The Power of Priority: NoC based Distributed Cache Coherency”, NoCs 2007.
Priority NoC: Results
NoC Based FPGA Architecture
Premises: (1) future FPGAs will be NoC-based, and (2) the design will be 2-tiered, or hierarchical
[Diagram: functional units, routers, a NoC for inter-region routing, configurable network interfaces and configurable regions of user logic]
NoC for FPGA
Design envelope / constraints: many ASIC-like applications for a given FPGA; a hard NoC infrastructure is efficient but inflexible; soft logic is reusable but has inferior performance
Main cost: average NoC cost over the most demanding designs; hard grid links and router logic; total configured NoC logic used
Architecture of choice: regular and uniform grid; in-order/load-balanced routing; hard logic for links and routers; soft logic for routing algorithms, headers and CNIs; soft NoC tuning (routing, CNI) for a given implementation
Source Toggle XY
- Unlike TXY, traffic to the same destination is not split
- Maximum capacity similar to TXY
- The route is chosen by a bitwise XOR of the source and destination IDs (see the sketch below)
- Can be extended to weighted source toggle (WOT)
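A minimal sketch of the selection rule, with the reduction of the XOR to a single toggle bit chosen for illustration (the actual bit used may differ):

    def stxy_order(src_id: int, dst_id: int) -> str:
        """Choose XY or YX per source-destination pair: the toggle bit
        is derived from a bitwise XOR of the two IDs, so all traffic of
        a pair takes one route (no split), while different pairs spread
        between the two orders much like toggle-XY."""
        parity = bin(src_id ^ dst_id).count('1') & 1
        return 'XY' if parity == 0 else 'YX'

    print(stxy_order(0b0010, 0b1011))      # deterministic per pair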
Two Hotspots
[Chart: design envelope (maximum capacity) for various distances between the hotspots under WOT]
Generic NoC Problems
Many problems are shared across the design spectrum, for example:
- The need for a low-latency class of service
- Verification and predictability
- Power control of NoCs
- Centralized vs. distributed control
- Is a single NoC per chip enough? Bus precedents suggest otherwise
- Hot modules slow down incoming NoC traffic: off-chip systems, shared memory subsystems, expensive functional units
NoC clogging by hot modules
[Diagram: IP1, IP2 and IP3 contend through their network interfaces for a hot module (HM)]
An HM is not a local problem, yet it is transparent to NoC performance mechanisms
Walter, Cidon, Ginosar and Kolodny, “Access Regulation to Hot-Modules in Wormhole NoCs”, NOCS 2007.
Source Fairness
No fairness is guaranteed, since routers’ arbitration is based on local state only
The farther a source is from the destination, the more arbitrations its worm has to win
The hot module’s bandwidth is therefore not fairly shared
Hot Module Distributed Arbitration
- Control can be distributed or centralized; centralized control can account for dependencies
- Requests and grants are sent at a high service level
- Requests and grants carry additional data as needed: requested quota, source queue size, priority, deadline, etc.; granted quota, scheduling of transmissions, etc.
- Initial credits hide the request-grant latency under light load
- Emphasis: bypassing (blocked) data packets!
A toy model of the request-grant exchange follows.
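A toy Python model of the grant side, assuming a simple equal-split quota policy; the policy, names and quotas are illustrative, not the NOCS 2007 regulation scheme:

    class HotModuleArbiter:
        """Grants per-round quotas to requesting sources so that the
        hot module's bandwidth is shared independently of distance."""
        def __init__(self, capacity_per_round):
            self.capacity = capacity_per_round

        def grant(self, requests):
            """requests: {source: requested_quota} -> granted quotas."""
            grants, remaining, pending = {}, self.capacity, dict(requests)
            while pending and remaining:
                share = max(1, remaining // len(pending))   # equal split
                for src in list(pending):
                    give = min(share, pending[src], remaining)
                    if give == 0:
                        break              # round capacity exhausted
                    grants[src] = grants.get(src, 0) + give
                    remaining -= give
                    pending[src] -= give
                    if pending[src] == 0:
                        del pending[src]
            return grants

    arb = HotModuleArbiter(capacity_per_round=8)
    # Near and far sources get equal grants, unlike local arbitration:
    print(arb.grant({'near_ip': 10, 'far_ip': 10}))  # {'near_ip': 4, 'far_ip': 4}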
Hot vs. non-Hot Module Traffic
[Chart: HM traffic and other traffic, with and without access-regulation control]
NoC: A Network AND A Chip (Conclusions)
- NoC is a chip design paradigm shift that introduces many new and diverse networking challenges
- There is no killer NoC for all chips: it should not comply with any X-AN concept; it may include centralized mechanisms; it may involve more than one NoC/bus mechanism; it may combine several communication methodologies (e.g., a low-latency NoC/bus for metadata and urgent signals)
- Beware of early standardization and legacy barriers
- There is mutual benefit in VLSI-networking collaboration
NoC: A Network AND A Chip