© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) The Black Widow High Radix Clos Network S. Scott, D.Abts, J. Kim, and W.

Slides:



Advertisements
Similar presentations
1 UNIT I (Contd..) High-Speed LANs. 2 Introduction Fast Ethernet and Gigabit Ethernet Fast Ethernet and Gigabit Ethernet Fibre Channel Fibre Channel High-speed.
Advertisements

1 Keith D. Underwood, Eric Borch May 16, 2011 A Unified Algorithm for both Randomized Deterministic and Adaptive Routing in Torus Networks.
1 Maintaining Packet Order in Two-Stage Switches Isaac Keslassy, Nick McKeown Stanford University.
SE-292 High Performance Computing
Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC) Ran Manevich, Isask har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny Technion – Israel.
1 Peripheral Component Interconnect (PCI). 2 PCI based System.
1 Networks for Multi-core Chip A Controversial View Shekhar Borkar Intel Corp.
Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally Stanford.
Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Power Analysis
A Programmable Adaptive Router for a GALS Parallel System Jian Wu APT Group University of Manchester May 2009.
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan
A Novel 3D Layer-Multiplexed On-Chip Network
Presentation of Designing Efficient Irregular Networks for Heterogeneous Systems-on-Chip by Christian Neeb and Norbert Wehn and Workload Driven Synthesis.
Flattened Butterfly Topology for On-Chip Networks John Kim, James Balfour, and William J. Dally Presented by Jun Pang.
On-Chip Interconnects Alexander Grubb Jennifer Tam Jiri Simsa Harsha Simhadri Martha Mercaldi Kim, John D. Davis, Mark Oskin, and Todd Austin. “Polymorphic.
PRESENTED BY: PRIYANK GUPTA 04/02/2012 Generic Low Latency NoC Router Architecture for FPGA Computing Systems & A Complete Network on Chip Emulation Framework.
Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks ______________________________ John Kim, William J. Dally &Dennis Abts Presented.
Allocator Implementations for Network-on-Chip Routers Daniel U. Becker and William J. Dally Concurrent VLSI Architecture Group Stanford University.
CS 258 Parallel Computer Architecture Lecture 5 Routing February 6, 2008 Prof John D. Kubiatowicz
1 Lecture 21: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO’03, Princeton A Gracefully Degrading and.
1 Lecture 13: Interconnection Networks Topics: flow control, router pipelines, case studies.
1 Lecture 24: Interconnection Networks Topics: communication latency, centralized and decentralized switches (Sections 8.1 – 8.5)
Issues in System-Level Direct Networks Jason D. Bakos.
1 Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control.
1 Lecture 25: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Review session,
CPU Chips The logical pinout of a generic CPU. The arrows indicate input signals and output signals. The short diagonal lines indicate that multiple pins.
Low-Latency Virtual-Channel Routers for On-Chip Networks Robert Mullins, Andrew West, Simon Moore Presented by Sailesh Kumar.
High Performance Embedded Computing © 2007 Elsevier Lecture 16: Interconnection Networks Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.
Blue Gene / C Cellular architecture 64-bit Cyclops64 chip: –500 Mhz –80 processors ( each has 2 thread units and a FP unit) Software –Cyclops64 exposes.
Interconnect Networks
On-Chip Networks and Testing
Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Networks-on-Chips (NoCs) Basics
Dynamic Networks CS 213, LECTURE 15 L.N. Bhuyan CS258 S99.
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Blue Gene/L Torus Interconnection Network N. R. Adiga, et.al IBM Journal.
The Alpha Network Architecture By Shubhendu S. Mukherjee, Peter Bannon Steven Lang, Aaron Spink, and David Webb Compaq Computer Corporation Presented.
A Scalable, Commodity Data Center Network Architecture Jingyang Zhu.
High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.
High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.
1 Dynamic Interconnection Networks Miodrag Bolic.
Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 3, 2000 Topics Network design issues Network Topology.
ECE669 L21: Routing April 15, 2004 ECE 669 Parallel Computer Architecture Lecture 21 Routing.
Anshul Kumar, CSE IITD CSL718 : Multiprocessors Interconnection Mechanisms Performance Models 20 th April, 2006.
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Switch Microarchitecture Basics.
Anshul Kumar, CSE IITD ECE729 : Advanced Computer Architecture Lecture 27, 28: Interconnection Mechanisms In Multiprocessors 29 th, 31 st March, 2010.
Lecture 16: Router Design
Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Deadlock: Part II.
Team LDPC, SoC Lab. Graduate Institute of CSIE, NTU Implementing LDPC Decoding on Network-On-Chip T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin.
Lecture Note on Switch Architectures. Function of Switch.
1 A quick tutorial on IP Router design Optics and Routing Seminar October 10 th, 2000 Nick McKeown
Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220.
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Deadlock: Part II - Recovery.
Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.
Lecture 23: Interconnection Networks
Physical constraints (1/2)
Exploring Concentration and Channel Slicing in On-chip Network Router
Azeddien M. Sllame, Amani Hasan Abdelkader
Lecture 23: Router Design
Static and Dynamic Networks
Lecture 14: Interconnection Networks
Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti
Low-Latency Virtual-Channel Routers for On-Chip Networks Robert Mullins, Andrew West, Simon Moore Presented by Sailesh Kumar.
Interconnection Networks Contd.
RECONFIGURABLE NETWORK ON CHIP ARCHITECTURE FOR AEROSPACE APPLICATIONS
CS 6290 Many-core & Interconnect
Multiprocessors and Multi-computers
Presentation transcript:

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) The Black Widow High Radix Clos Network S. Scott, D.Abts, J. Kim, and W. Dally ISCA

ECE 8813a (2) System Overview Designed for communication intensive systems High BW, low latency synchronization Shared memory Direct load/store architecture Thousands of outstanding memory references 32K processors (72K in principle) Folded Clos topology with side links Local and global fault tolerant routing Adaptive and deterministic routing High radix and side links

ECE 8813a (3) Technology Influence Pin bandwidth and signaling rates have increased while packet sizes ~ constant ASIC technology More logic/pin more logic/bits/sec Off-chip vs. on-chip signaling rates Wiring vs. buffer tradeoffs Use topology to reduce latency Higher radix topologies

ECE 8813a (4) Logical Topology x32 R1R2R1 Folded Organized as two 32x32 virtual crossbars

ECE 8813a (5) Architecture Each channel is three bits wide Radix 64 router chips From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007

ECE 8813a (6) Scaling Use side links for incremental scalability Multiple size configurations Half size configurations Physical bandwidth provisioning via packaging From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007

ECE 8813a (7) Follow the Packets From input port to column switch From column switch to output port multiplexor Two stage arbitration For output of tile switch For output port Note the size of the arbiters From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007 Note wire length due to co-location of output buffers

ECE 8813a (8) Design Tradeoffs High radix routers Scaling of arbitration costs with input queuing Hierarchical design NRE costs tiled architecture Where do I use my abundance of wires? Bus across the row tiles Point to point across the column tiles

ECE 8813a (9) The YARC Pipeline 25 stages no load latency of ns All major blocks and row and column buses are pipelined 24-bit (phit) internal buffer VCT flow control externally, wormhole internal Variable length packets From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007

ECE 8813a (10) Packet formats Source routing for maintenance only Optional hash used for deterministic routing 2 VCs for request-reply 256 phit buffers – link delay sized Retransmit handled by credit management Soft errors handled by the NI Physical layer Mapping 8 lane SERDES macros Channel swizzling From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007

ECE 8813a (11) Routing Routing – partially adaptive Up routing is adaptive or deterministic Down routing is deterministic Requests are ordered Memory consistency model Requests only use deterministic routing Destination-based, table driven routing in each tile YARC ports are initialized to physical and logical port numbers

ECE 8813a (12) Up Routing: Organization Root detect register for locating the nodes accessible downstream Mask for identifying address bits Node in the sub-tree if unmasked destination and root detection bits match Associative routing table to direct packets Matching entry identifies destination sub-tree May be uplink or side link For processors Root detect = 0x0060 Mask = 0x001F From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007

ECE 8813a (13) Up Routing: Adaptive Adaptive if header bit is set Produce a 64 mask of feasible output ports OR operation to produce column mask Route to column based on input buffer occupancy Break ties via matrix arbiter in general Heuristic oCheck MSB of buffer occupancy oRound robin arbitration Route to column based on occupancy Use the row mask to identify candidates Use 2 bits of buffer occupancy

ECE 8813a (14) Up Routing: Deterministic Ex-OR of input port and destination Spread the traffic across up links For memory addresses optionally include bits of the destination address for further spreading Retains ordering relationship between request to same memory location

ECE 8813a (15) Down Routing Deterministic Pick the set of bits in the address Map these to a logical output port Remap table to determine physical output port Enables portability of YARC chip: use across different nodes in the hierarchy

ECE 8813a (16) Fault Tolerance Fixed number of retries on transmission failures Note the use of credits to control retransmission Error notification for failed links Graceful degradation on channel widths during MTTR Soft errors via CRC Error code in header to direct soft error packets to destination NI Timeouts on packet injection Routing table update to avoid faulty components

ECE 8813a (17) Implementation 800 MHz core clock, 6.25 Gbps links Over half the chip is memory From S. Scott, D. Abts, J Kim, and W. Dally, The Black Widow High Radix Clos Network, Proceedings of the International Symposium on Computer Architecture, 2007

ECE 8813a (18) Comparison to Conventional Wisdom Simpler routing than torus No turn or channel restrictions to enforce Addressing is simpler for load balancing Lower cost for same bisection bandwidth? Easier for partitioning Support for virtualization VCT on the links and Wormhole within the switch Cost of on-chip buffers in a tile Design space and the degree of the tile switch Range from Xbar to buffered crossbar

ECE 8813a (19) Summary New design solution for large scale chip-to chip interconnect Revisit on-chip networks Abundance of routing resources Scarcity of buffer resources Complexity of the routing function Switching and arbitration does not scale oWhat can fit in a single clock cycle? Locality of design becomes more critical