Approaching Ideal NoC Latency with Pre-Configured Routes
George Michelogiannakis, Dionisios Pnevmatikatos and Manolis Katevenis
Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH)
P.O. Box 1385, Heraklion, Crete, GR GREECE
May 2007, ICS-FORTH, Greece

Introduction
Problem: the latency that NoCs impose. Motivation: this latency is added to every communicating pair. Past work achieves 1 cycle/hop at 500 MHz. We extend speculation to routing decisions. Goal: approach buffered-wire latency, a fraction of a cycle per hop.
Our Approach
400 ps in the good scenario; 1 cycle otherwise. 130 nm library.
Preferred Paths
Each output has one preferred input. This preferred I/O pair is connected by a single pre-enabled tri-state driver. Pre-enabling is crucial: 200 ps through a pre-enabled mux; 500 ps otherwise. Whether flits were correctly forwarded is checked later. Thus, preferred paths are formed. They are reconfigurable at run-time, and custom routes (shapes) are allowed.
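The preferred-path rule (one preferred input per output, reconfigurable at run-time) can be illustrated with a small table. This is a minimal sketch, not the switch's actual configuration logic: the class name, method names and port labels ("N", "E", "W") are hypothetical, and only the 200 ps and 500 ps figures come from the slide.

```python
from dataclasses import dataclass, field

PRE_ENABLED_MUX_PS = 200  # delay through a pre-enabled mux (figure from the slide)
NORMAL_MUX_PS = 500       # delay when the path is not pre-enabled (figure from the slide)


@dataclass
class PreferredPathConfig:
    """Hypothetical per-switch table: output port -> its single preferred input."""
    preferred_input: dict = field(default_factory=dict)

    def set_preferred(self, output, input_port):
        # A dict holds one value per key, so each output keeps exactly one
        # preferred input; assigning again models run-time reconfiguration.
        self.preferred_input[output] = input_port

    def is_preferred(self, input_port, output):
        return self.preferred_input.get(output) == input_port

    def mux_delay_ps(self, input_port, output):
        # Only the preferred I/O pair benefits from the pre-enabled driver.
        if self.is_preferred(input_port, output):
            return PRE_ENABLED_MUX_PS
        return NORMAL_MUX_PS
```

For example, after `cfg.set_preferred("E", "W")` a flit going from input "W" to output "E" sees the 200 ps delay, while any other input heading to "E" sees 500 ps.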
Switch Architecture - Output
[Figure labels: 400 ps; 1 cycle.]
Input FIFOs: selectable when non-empty, or when a flit is to be enqueued. Pref. path pre-enabled tri-states. Routing logic tri-state. Config. & arbitration logic: stores pref. path config. & arbitrates.
Switch Architecture - Input
Dead flits: incorrectly eagerly forwarded; terminated at the end of the preferred path. The switch resembles a buffered crossbar. Decides if a flit needs to be enqueued.
Routing Algorithm
Deterministic routing is employed. Non-preferred paths follow XY routing. We slightly modify XY routing to handle preferred paths: a flit is correctly eagerly forwarded if it approaches the destination in any axis; it is considered dead otherwise.
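The rule above can be sketched as two small functions. This is one possible reading, under stated assumptions: nodes are (x, y) coordinates on a 2D mesh, a hop moves between neighbouring nodes, and "approaches the destination in any axis" is taken to mean that the distance to the destination shrinks along X or along Y. The function names are hypothetical.

```python
def xy_next_hop(cur, dst):
    """Plain XY routing: correct X first, then Y. Returns None at the destination."""
    (x, y), (dx, dy) = cur, dst
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return None


def eager_forward_ok(cur, nxt, dst):
    """True if the eager hop cur -> nxt brings the flit closer to dst in at least one axis.

    A False result corresponds to a dead flit under the assumed reading of the rule.
    """
    closer_x = abs(dst[0] - nxt[0]) < abs(dst[0] - cur[0])
    closer_y = abs(dst[1] - nxt[1]) < abs(dst[1] - cur[1])
    return closer_x or closer_y
```

Under this reading, a preferred path that moves a flit in Y before X is still accepted even though plain XY routing would never choose it, whereas a hop that moves away from the destination marks the flit as dead.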
Routing Characteristics
Flits in preferred paths may not follow XY routing. Duplicate copies of a flit may be delivered.
[Figure legend: XY routing; pref. paths; S; D.]
NoC Topology - Bar Floorplan
Application: tiled CPU and RAM blocks. Each switch is 6x6 and serves 4 PEs.
Bar Floorplan
Each switch would be 8x12 without compromises: vertical links drive address inputs, and 2 PE data ports are served by 1 switch port.
Cross Floorplan
Layout Results
130 nm implementation library. Typical case. Pref. path latency: [...] ps, [...] ps (incl. 1 mm). 1 cycle/node otherwise. Past work: 1 cycle/node at 500 MHz.

Clock frequency:       667 MHz
Flit width:            39
FIFO lines:            2
Number of FIFOs:       30
Bar area overhead:     13%
Cross area overhead:   18%
Number of cells:       15 K
Number of gates:       45 K
Total dynamic power:   80 mW
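To put the numbers in proportion, a back-of-the-envelope comparison follows. It is only a sketch under loud assumptions: it uses the 667 MHz clock from the layout results and the 400 ps "good scenario" figure from the Our Approach slide, and it assumes per-hop latencies simply add up, ignoring the synchronization effects listed under future work. The function names are hypothetical.

```python
def cycle_ps(clock_hz):
    """Clock period in picoseconds."""
    return 1e12 / clock_hz


def path_latency_ps(hops, preferred_hops, clock_hz=667e6, preferred_hop_ps=400.0):
    """Assumed additive model: preferred hops cost preferred_hop_ps, the rest one cycle each."""
    if not 0 <= preferred_hops <= hops:
        raise ValueError("preferred_hops must be between 0 and hops")
    normal_hops = hops - preferred_hops
    return preferred_hops * preferred_hop_ps + normal_hops * cycle_ps(clock_hz)
```

At 667 MHz one cycle is roughly 1.5 ns, so under this model a hop on a preferred path costs a little over a quarter of a hop that takes a full cycle.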
Advanced Issues
Deadlock & livelock freedom. Constraints to prevent cycles. Keep the NoC functional in any case. Out-of-order delivery of flits in the same packet. Apply reconfiguration at a “safe” time. Adaptive routing.
Future Work
Synchronization issues: a flit may arrive at any time. Impose preferred path constraints. Implement the switch asynchronously. Evaluation in a complete system. Implement fault tolerance.
Conclusion
We approach ideal latency by pre-enabled tri-state paths. Our NoC is a generalized “mad-postman” [C. R. Jesshope et al., 1989]. Our NoC is easily generalized, though the topology may need to be changed. Past NoC research can be applied for further optimizations.
Related Work
Most work assumes 2D mesh-like topologies. Reconfigurable topologies studied. Various performance enhancement techniques studied; they achieve 1 clock cycle/node at approx. 500 MHz. Various routing algorithms studied. Recent field: fault-tolerant techniques.
Backpressure
FIFO almost full: the previous hop’s feeding output is alerted. If fed by a preferred path in the previous hop, flits are also enqueued. The preferred path may or may not be broken.
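The almost-full signalling can be sketched with a tiny FIFO model. This is an illustration only: the class and attribute names are hypothetical, the default capacity of 2 mirrors the "FIFO lines" entry in the layout results, and the almost-full threshold of 1 is an assumption rather than a figure from the slides. How the previous hop reacts (enqueueing flits, breaking the preferred path or not) is not modelled.

```python
from collections import deque


class InputFifo:
    """Hypothetical input FIFO that raises an almost-full flag for the feeding output."""

    def __init__(self, capacity=2, almost_full_at=1):
        self.capacity = capacity              # assumed: 2 lines per FIFO
        self.almost_full_at = almost_full_at  # assumed threshold, not from the slides
        self.lines = deque()

    @property
    def almost_full(self):
        # The flag the previous hop's feeding output would be alerted with.
        return len(self.lines) >= self.almost_full_at

    def enqueue(self, flit):
        if len(self.lines) >= self.capacity:
            raise OverflowError("FIFO full: upstream ignored backpressure")
        self.lines.append(flit)
        return self.almost_full

    def dequeue(self):
        return self.lines.popleft()
```

Raising the flag before the FIFO is actually full leaves room for a flit that is already in flight when the previous hop is alerted.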
Mad Postman
XY routing. Eagerly forward incoming flits along the same axis. Later examine whether they were correctly forwarded. Terminate “dead” flits in later hops.
[Figure labels: MSG; Penalty; Source; Destination.]
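The eager-forwarding idea can be traced on a small grid. This is a simplified sketch, not the original mad-postman design: it assumes an unbounded 2D grid of (x, y) nodes, XY routing, and that an eagerly forwarded copy overshoots by exactly one hop before being terminated. The function names are hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)


def step(node, direction):
    return (node[0] + direction[0], node[1] + direction[1])


def mad_postman_trace(src, dst):
    """Return (path, dead): the XY route taken, and nodes reached only by dead copies."""
    path, dead = [src], []
    cur, heading = src, None  # heading: direction the flit arrived with
    while cur != dst:
        if cur[0] != dst[0]:
            want = (sign(dst[0] - cur[0]), 0)   # still correcting X
        else:
            want = (0, sign(dst[1] - cur[1]))   # X done, correct Y
        if heading is not None and heading != want:
            # The flit was already forwarded straight on before the check,
            # so a dead copy exists one hop further along the old axis.
            dead.append(step(cur, heading))
        cur, heading = step(cur, want), want
        path.append(cur)
    if heading is not None:
        # At the destination the eager copy has also gone one hop too far.
        dead.append(step(cur, heading))
    return path, dead
```

Calling `mad_postman_trace((0, 0), (2, 1))` gives the route through (1, 0), (2, 0) and (2, 1), with dead copies at (3, 0) after the turn and at (2, 2) past the destination.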
NoC Topology
Application: tiled CPU and RAM blocks. Each switch is 6x6 and serves 4 PEs. It would be 8x12 without compromises: vertical links also drive RAM address in; two PE data ports are served by a single switch port. Data from one PE: