Approaching Ideal NoC Latency with Pre-Configured Routes

Presentation transcript:

Approaching Ideal NoC Latency with Pre-Configured Routes
George Michelogiannakis, Dionisios Pnevmatikatos and Manolis Katevenis
Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH)
P.O. Box 1385, Heraklion, Crete, GR-711-10 Greece
Email: {mihelog,pnevmati,kateveni}@ics.forth.gr

Introduction
Problem: the latency that NoCs impose.
Motivation: this latency is added to every communicating pair.
Past work: achieves 1 cycle/hop at 500 MHz. We extend speculation to routing decisions.
Goal: approach buffered-wire latency, i.e. a fraction of a cycle per hop.
NoCs also serve communication needs from any source to any destination, and they significantly affect area, latency, power and fault tolerance.
May 2007, ICS-FORTH, Greece

Our Approach
Latency: 400 ps in the good scenario; 1 cycle otherwise.
Inverters approximately every 2 mm.
130 nm implementation library.

Preferred Paths
Each output has one preferred input. This preferred I/O pair is connected by a single pre-enabled tri-state driver.
Pre-enabling is crucial: 200 ps through a pre-enabled mux; 500 ps otherwise. Numbers are with a 1 mm wire ahead.
A later check determines whether flits were forwarded correctly.
Thus, preferred paths are formed. They are reconfigurable at run time, and custom routes (shapes) are allowed.
Our switch's tri-state drivers are slower because (1) they are tri-states and (2) they fan out to more bits at the next hop.
One input may be preferred for many outputs, so paths may split. They may not converge.
A new configuration is safe to apply, i.e. it cannot cause out-of-order delivery, when: (1) the old preferred path has been idle for 1 clock cycle; (2) the new preferred path's FIFO is empty; (3) no more flits of a packet are expected.
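The three safety conditions for switching an output to a new preferred path can be sketched as a single predicate. This is only an illustrative model, not the authors' hardware: the `OutputState` class and its field names are hypothetical, and the slides do not say how each condition is detected in the switch.

```python
from dataclasses import dataclass


@dataclass
class OutputState:
    """Hypothetical per-output bookkeeping; names are illustrative only."""
    old_path_idle_cycles: int     # full clock cycles the old preferred path has been idle
    new_path_fifo_occupancy: int  # flits waiting in the FIFO of the new preferred input
    mid_packet: bool              # True while more flits of a packet are still expected


def safe_to_reconfigure(s: OutputState) -> bool:
    """True only when all three conditions listed on the slide hold."""
    return (
        s.old_path_idle_cycles >= 1          # (1) old preferred path idle for 1 cycle
        and s.new_path_fifo_occupancy == 0   # (2) new preferred path's FIFO is empty
        and not s.mid_packet                 # (3) no more flits of a packet expected
    )
```

Applying the new configuration only when the predicate holds keeps all flits of a packet on one path, which is what rules out reordering.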

Switch Architecture - Output
[Figure labels: configuration & arbitration logic; preferred-path pre-enabled tri-states; routing logic; tri-states; input FIFOs; 400 ps; 1 cycle.]
Configuration & arbitration logic: stores the preferred-path configuration and arbitrates.
Each output has 5 inputs (the reason is deferred for now).
Input FIFOs are selectable when non-empty or when a flit is about to be enqueued. They are served if the preferred path has been idle for 1 clock cycle, giving 1 clock cycle per hop on a non-preferred path.
Routing combinational logic is placed at each input (see next slide). It determines whether a flit needs to be enqueued: is it dead, i.e. incorrectly forwarded?
Reasons for this architecture: (1) Mad-postman operation: dead flits do not enter the FIFOs. (2) We have no VCs, so crosspoint FIFOs are used to avoid head-of-line blocking. (3) One arbitration logic per output makes for simpler logic and shorter critical paths.
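The rule that FIFOs are served only after the preferred path has been idle for one clock cycle can be modelled in a few lines. This is a behavioural sketch under assumptions: the function name is hypothetical, and round-robin selection among non-empty FIFOs is an assumption, since the slides do not state the arbitration policy.

```python
from typing import Optional, Sequence


def pick_fifo(pref_idle_cycles: int,
              fifo_occupancy: Sequence[int],
              last_served: int = -1) -> Optional[int]:
    """Return the index of the FIFO to serve this cycle, or None.

    None means the output stays with its pre-enabled preferred path,
    either because that path was recently active or no FIFO holds a flit.
    """
    if pref_idle_cycles < 1:
        return None  # preferred path not yet idle for a full cycle
    n = len(fifo_occupancy)
    # Assumed policy: round-robin, starting after the last FIFO served.
    for step in range(1, n + 1):
        i = (last_served + step) % n
        if fifo_occupancy[i] > 0:
            return i
    return None
```

For example, with occupancies `[0, 2, 0, 1, 0]` and an idle preferred path, the first call serves FIFO 1 and a following call with `last_served=1` serves FIFO 3.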

Switch Architecture - Input
Dead flits: flits that were incorrectly eagerly forwarded. They are terminated at the end of the preferred path.
The switch resembles a buffered crossbar. Preferred paths correspond to a buffered crossbar's cut-through paths.
An input is not connected to the output of the same port.
A preferred-path bus is restrictive in configuration and in simultaneous forwards.
The routing logic decides whether a flit needs to be enqueued.

Routing Algorithm
Deterministic routing is employed. Non-preferred paths follow XY routing.
We slightly modify XY routing to handle preferred paths: a flit counts as correctly eagerly forwarded if it approaches its destination in any axis; otherwise it is considered dead.
Deterministic routing is needed because a switch must know whether the previous hop regarded it as the best next hop at the time, or simply eagerly forwarded the flit wrongly. Global configuration knowledge would be too expensive.
Deadlock freedom: preferred paths are the only ones that do not follow XY. They do not contend for resources, except under congestion. In that case, serve them first.
Flits on preferred paths may reach their destination in a non-XY fashion.
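The modified XY check can be illustrated with a short sketch. It assumes a 2D mesh with integer (x, y) switch coordinates, which the slides imply but do not spell out; the function names are hypothetical, and the real check is combinational logic at each switch input.

```python
from typing import Tuple

Coord = Tuple[int, int]  # assumed (x, y) position of a switch in a 2D mesh


def xy_next_hop(cur: Coord, dst: Coord) -> Coord:
    """Plain XY (dimension-order) routing: correct X first, then Y."""
    x, y = cur
    dx, dy = dst
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return cur  # already at the destination


def eagerly_forwarded_ok(prev: Coord, cur: Coord, dst: Coord) -> bool:
    """A flit arriving over a preferred path is kept if the hop prev -> cur
    brought it closer to dst in any axis; otherwise it is dead."""
    closer_x = abs(dst[0] - cur[0]) < abs(dst[0] - prev[0])
    closer_y = abs(dst[1] - cur[1]) < abs(dst[1] - prev[1])
    return closer_x or closer_y
```

A hop from (0, 0) to (0, 1) toward destination (2, 1) moves in Y before X, so it is not an XY hop, yet it passes the check because it approaches the destination in Y. The same hop toward destination (2, 0) fails the check and the flit is dead.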

Routing Characteristics
Flits on preferred paths may not follow XY routing.
Duplicate copies of a flit may be delivered.
[Figure: an XY route and preferred paths between source S and destination D.]
Flits arrive out of order if the path taken by a packet's flits changes. We disallow this by applying a new configuration only when it is safe: (1) the old preferred path has been idle for 1 clock cycle; (2) the new preferred path's FIFO is empty; (3) no more flits of a packet are expected.

NoC Topology – Bar Floorplan
Application: tiled CPU and RAM blocks.
Each switch is 6x6 and serves 4 PEs.
It is not possible to receive at the lowest latency from two adjacent RAMs at the same time.
Vertical outputs are slightly more congested (address inputs and Y traversal).
Data to a PE are sent to both. Data from a PE go through this mini 2x1 switch.
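The 6x6 switch size appears consistent with the FIFO count reported in the layout results. The sketch below assumes one crosspoint FIFO per (input, output) pair, excluding each port's own output, and reads "FIFO lines 2" as two flit-wide entries per FIFO. Both readings are assumptions and the function names are hypothetical.

```python
def crosspoint_fifo_count(ports: int) -> int:
    """Each output buffers flits from every input except its own port's input."""
    return ports * (ports - 1)


def fifo_storage_bits(fifos: int, lines_per_fifo: int, flit_width_bits: int) -> int:
    """Total buffering, assuming each FIFO line holds one full flit."""
    return fifos * lines_per_fifo * flit_width_bits
```

With 6 ports this gives 6 x 5 = 30 FIFOs, matching the reported "Number of FIFOs 30" and the 5 inputs per output. At 2 lines of 39 bits each, the total would be 30 x 2 x 39 = 2340 bits of buffering per switch.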

Bar Floorplan
Would be 8x12: vertical links drive address inputs; 2 PE data ports are served by 1 switch port.
Switch distance in Y: approx. 1.1 mm.
Minimum A: 170 μm (13% overhead).
RAM blocks: 4096x32 bits, column mux 16, 740.07 x 576.64 μm².

Cross Floorplan
A: 130 μm. B: 140 μm (18% overhead).
Advantage: switch distance in Y is minimum.
Disadvantage: area efficiency drops in this case. The gain in the X axis is small.
RAM blocks: 4096x32 bits, column mux 16, 740.07 x 576.64 μm².

Layout Results
130 nm implementation library. Typical case.
Preferred-path latency: 300-420 ps, or 450-500 ps including a 1 mm wire; 1 cycle/node otherwise.
Past work: 1 cycle/node at 500 MHz.

Clock frequency       667 MHz (400 MHz worst case)
Flit width            39 (32 payload bits)
FIFO lines            2
Number of FIFOs       30
Bar area overhead     13%
Cross area overhead   18%
Number of cells       15 K
Number of gates       45 K
Total dynamic power   80 mW

Unbuffered dummy 1 mm wire: 135 ps. Pre-configured mux: 200 ps.
Cross: A: 130 μm, B: 140 μm (18% overhead). Bar: minimum A: 170 μm (13% overhead).
Power: full-traffic scenario. Typical-case voltage: 1.2 V. It's hard to work fast without sweating.
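A quick calculation shows what these figures mean relative to the clock. The sketch below assumes the 450-500 ps figure (including the 1 mm wire) is the cost of one hop on a preferred path and that hops add linearly; the slides do not state either explicitly, and the helper names are hypothetical.

```python
CLOCK_MHZ = 667             # reported typical-case clock frequency
PREF_HOP_PS = (450, 500)    # reported preferred-path latency incl. 1 mm wire


def cycle_ps(clock_mhz: float) -> float:
    """Clock period in picoseconds."""
    return 1e6 / clock_mhz


def path_latency_ps(hops: int, per_hop_ps: float) -> float:
    """Latency of a path, assuming each hop costs the same."""
    return hops * per_hop_ps


period = cycle_ps(CLOCK_MHZ)                          # roughly 1.5 ns
fraction = tuple(p / period for p in PREF_HOP_PS)     # share of a cycle per hop
preferred_4 = path_latency_ps(4, PREF_HOP_PS[1])      # 4 hops on a preferred path
buffered_4 = path_latency_ps(4, period)               # 4 hops at 1 cycle/hop
```

Under these assumptions a preferred-path hop costs about a third of a 667 MHz clock cycle, so four preferred hops take about 2 ns against roughly 6 ns at one cycle per hop.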

Advanced Issues
Deadlock & livelock freedom: constraints to prevent cycles. Keep the NoC functional in any case.
Out-of-order delivery of flits in the same packet: apply reconfiguration at a "safe" time.
Adaptive routing.

Future Work
Synchronization issues: a flit may arrive at any time. Steps to solve: (1) Find a setting that works, perhaps by defining a maximum number of preferred-path hops. (2) Place synchronizers at destinations and in the switch combinational logic. (3) Use asynchronous FIFOs and make all logic asynchronous. Not hard.
Impose preferred-path constraints.
Implement the switch asynchronously: easy, but handshaking will delay us. Is handshaking necessary? It seems to be. Perhaps some kind of encoding.
Evaluation in a complete system: study the exact benefits. Study a method for choosing preferred paths.
Implement fault tolerance.
Adaptive routing routes around problems (congestion etc.) and is beneficial.

Conclusion
We approach ideal latency by using pre-enabled tri-state paths.
Our NoC is a generalized "mad-postman" [C. R. Jesshope et al., 1989].
Our NoC is easily generalized, although the topology may need to be changed.
Past NoC research can be applied for further optimizations.
Latency is independent of the clock cycle.