Switch Microarchitecture Basics

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Reading
- Relevant papers are cited in the presentations
- Duato, Yalamanchili, and Ni: Sections 7.2.1, 7.2.2, and 7.2.3 (pages 390-393)

Overview
- Operation and microarchitecture
- Integration with flow control
- Impact of switching mechanisms
- What does the message pipeline look like?
- Basis for optimized operation

Physical Channel Router
Example: routing of a wormhole-switched message; note the impact of flit types (head vs. body vs. tail) on the message pipeline
- States (head flit): routing, arbitration/allocation, switch traversal
- State information (body & tail flits): output port
- State transition (tail flit): free the channel

Physical Channel Router
- What does the routing function implementation look like, adaptive vs. deterministic?
- What does the selection function (adaptive routing) implementation look like?
- What about arbitration? (more soon)

Routing Decisions
- Formally represented as a routing function; distinct for oblivious vs. adaptive routing
- Example mappings:
  - destination to output ports (channels)
  - input port (channel) & destination to output ports (channels)
  - header flit to output ports (channels), e.g., source routing
- Turn restrictions are implemented here

Implementation of Routing Functions
- Common implementation forms: finite state machine, table look-up
- Centralized vs. distributed across input ports (virtual channels)
- Impact on cycle time, e.g., table size
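As a concrete illustration of the table look-up form, here is a minimal Python sketch, assuming deterministic XY (dimension-order) routing on a 2D mesh. The function and port names are hypothetical, not taken from any cited router:

```python
# Sketch of a distributed table-lookup routing function (hypothetical names).
# Each router precomputes a table mapping destination -> output port,
# filled here for X-then-Y dimension-order routing on a width x height mesh.

def build_xy_table(node, width, height):
    """Precompute the output port for every destination (deterministic XY)."""
    x, y = node
    table = {}
    for dx in range(width):
        for dy in range(height):
            if (dx, dy) == node:
                table[(dx, dy)] = "EJECT"       # packet has arrived
            elif dx > x:
                table[(dx, dy)] = "EAST"        # correct X first
            elif dx < x:
                table[(dx, dy)] = "WEST"
            elif dy > y:
                table[(dx, dy)] = "NORTH"       # then correct Y
            else:
                table[(dx, dy)] = "SOUTH"
    return table

def route(table, dest):
    """The RC stage reduces to a single table read on the head flit's destination."""
    return table[dest]
```

Table size grows with the number of destinations, which is the cycle-time concern noted above; an FSM implementation trades that area for logic depth.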

A Virtual Channel Router
[Figure: control plane and data plane of a virtual channel router. From L. S. Peh and W. J. Dally, "A Delay Model for Router Microarchitectures," IEEE Micro, January-February 2001]

A Virtual Channel Router
- What does the routing function implementation look like, adaptive vs. deterministic?
- What does the selection function (adaptive routing) implementation look like?
- What about arbitration? (more soon)
Figure from L. S. Peh and W. J. Dally, "A Delay Model for Router Microarchitectures," IEEE Micro, January-February 2001

Pipelined Switch Microarchitecture
[Figure: five-stage pipelined switch: IB (input buffering), RC (routing computation), VCA (VC allocation), SA (switch allocation), ST (switch traversal) & output buffering; input/output buffers, crossbar, and link control on the physical channels. From L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001]

Some Operational Principles
- Data and control planes operate at three rates: phit, flit, and packet
- Resources are allocated and de-allocated at these rates
- Fixed clock cycle model
- What are the atomic switch functions?
- State management: of resources (allocation) and of data (mapping to resources)
- Granularity of allocation/management is key to deadlock freedom in pipelined switches

Buffer States
- Input buffers: free, routing, VCA, transmitting, stalled (flow control)
  - Track the assigned output port and output virtual channel
  - Flow control information: stop/go, credits
- Output buffers: transmitting, stalled (flow control), free
  - Track the assigned input port and input virtual channel

Pipeline Disruptions
- Resource availability disruptions
  - VC availability
  - Downstream buffer space not available (lack of credits)
  - Inter-packet gap is a function of deadlock freedom
- Allocated flow disruptions
  - Switch not available
  - Downstream buffer space not available
- Disruptions (pipeline bubbles) propagate to the destination through intermediate routers

Look at Channel Dependencies
[Figure: the five-stage pipelined switch (IB, RC, VCA, SA, ST & output buffering), repeated here to examine channel dependencies. From L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001]

A Look at Channel Dependencies
- Issue: structural dependencies between messages due to concurrent use of VC buffers
- Such dependencies must be globally managed to avoid deadlock
- Architectural decision: when is a VC freed?
  - When the tail flit releases the input virtual channel?
  - When the tail flit releases the output virtual channel?
- Remember: a VC traverses a link!

Buffer Occupancy
- Deeper pipelining increases the buffer turnaround time and decreases occupancy
L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001

Main Functions of a Router
- Routing, switching, and flow control
  - Note the modules on Flow Control, Switching Techniques, and Routing Algorithms
- Allocators and arbiters: a closer look
  - Atomic modules not easily amenable to pipelining

Main Functions of a Router
- Optimizations
  - Intra-router: pipelined, speculative operation of router functions
  - Inter-router pipelining/overlap: hiding route computation, reservation protocols and express channels
- Buffer management and arbitration
  - High-speed and efficient queue management

Allocation vs. Arbitration
- What is the difference between arbitration and allocation?
[Figure: several requesters (RQ) matched to several resources (allocation: a matching) vs. several requesters competing for a single winner (arbitration)]

Allocators
- Requesters vs. granters: ports or channels
- Formally equivalent to a bipartite matching problem
- Maximal vs. maximum matching
- Challenge: fast computation

Allocation
Request matrix (rows = inputs, columns = outputs):
  1 0 1 1
  1 1 0 1
  1 0 0 1
  0 1 1 0
Grant matrix:
  1 0 0 0
  0 1 0 0
  0 0 0 1
  0 0 1 0
Correctness criteria: only one grant per input and only one grant per output, i.e., only 1 bit set in any row or column.
*From W. J. Dally & B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004
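The correctness criteria can be stated executably. This is a small sketch (the helper name is mine, not from the book) that checks whether a grant matrix is a legal matching for a given request matrix:

```python
def is_valid_grant(request, grant):
    """Allocator correctness check: every grant matches a pending request,
    and no row (input) or column (output) carries more than one grant."""
    n = len(grant)
    for i in range(n):
        for j in range(n):
            if grant[i][j] and not request[i][j]:
                return False                      # granted without a request
    row_ok = all(sum(row) <= 1 for row in grant)             # one grant/input
    col_ok = all(sum(col) <= 1 for col in zip(*grant))       # one grant/output
    return row_ok and col_ok
```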

Separable Allocators
- Input-first: arbitrate along rows; winners arbitrate along columns
- Output-first: arbitrate along columns; winners arbitrate along rows
[Figure: request matrix feeding an input allocator followed by an output allocator]
*From W. J. Dally & B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004
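A minimal sketch of an input-first separable allocator, using fixed-priority arbiters in both stages for brevity (real designs rotate priority to avoid starvation):

```python
def separable_input_first(request):
    """Input-first separable allocation over an n x n 0/1 request matrix:
    each input first picks one of its requests (row arbitration), then each
    output picks one surviving winner (column arbitration)."""
    n = len(request)
    # Stage 1: arbitrate along rows - at most one request survives per input.
    stage1 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if request[i][j]:
                stage1[i][j] = 1
                break
    # Stage 2: arbitrate along columns - at most one grant per output.
    grant = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if stage1[i][j]:
                grant[i][j] = 1
                break
    return grant
```

On the 4x4 request matrix of the previous slide this yields only two grants even though a maximum matching has four: separable allocation is at best maximal, which is exactly what motivates iterative and wavefront schemes.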

Separable Allocator: Operation
- Requests r00..r33 (from the routing functions) produce grants g00..g33
- Multiple requests per input port; arbitration amongst the requests
- Arbitration is a distinct problem
*From W. J. Dally & B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004

Classes of Approaches
- Exact solutions
  - Time consuming; can be computed offline for known patterns
- Heuristics
  - Single step vs. iterative
  - Pipelined implementations: overlapping computation and switch scheduling
- Forms: single-stage allocation vs. separable

Improving Performance
- Preventing starvation
  - A function of the arbiter: ensuring "fairness"
  - Dynamically adjust priorities
- Improving the quality of the solution
  - Iterative arbitration: winners at one stage may lose at the next, leaving holes in the allocation
- Key challenge: speed vs. quality

Switch and VC Allocators
- VC allocator: output port allocated for the packet duration; low state update rate
- Switch allocator: separable allocator; separate allocators for speculative and non-speculative requests
L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001

Switch Allocation
- Flits bid on a per-cycle basis for crossbar slots
- Possible to increase the granularity of bids or the duration for which a crossbar port can be held
- SA cannot create deadlock since ports are not held indefinitely
- Success in SA is accompanied by flow control updates, e.g., transmitting credits
- Traversal of the tail flit reinitializes the input channel and resets input/output buffer allocations
L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001

Virtual Channel Allocation
- How are candidates for arbitration created? By the routing function
- The alternatives depend on routing flexibility: deterministic vs. fully adaptive routing
- This is the point at which dependencies are created, and hence where deadlock is avoided
L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001

Wavefront Allocator
[Figure: wavefront allocator cell with request rij and grantij, row/column token signals (xin/xout, yin/yout) and priority inputs (pri, xpri, ypri)]
- A grant is generated when a request holds both a row token and a column token
- The process is seeded by asserting pri along a diagonal
- Stagger the seeded diagonal across cycles to avoid bias
*From W. J. Dally & B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004
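The token-passing behavior can be modeled in a few lines. This is a behavioral sketch only (the cell-level wiring is omitted), assuming an n x n allocator whose diagonal `pri` is seeded with row and column tokens:

```python
def wavefront_allocate(request, pri=0):
    """Behavioral model of a wavefront allocator: tokens are seeded on the
    priority diagonal; a cell grants when its request holds both its row and
    column token, consuming them; unused tokens pass to the next diagonal."""
    n = len(request)
    row_free = [True] * n                     # row token still available
    col_free = [True] * n                     # column token still available
    grant = [[0] * n for _ in range(n)]
    for wave in range(n):                     # sweep n diagonals from pri
        for i in range(n):
            j = (pri + wave - i) % n          # cell (i, j) lies on this diagonal
            if request[i][j] and row_free[i] and col_free[j]:
                grant[i][j] = 1
                row_free[i] = False
                col_free[j] = False
    return grant
```

Rotating `pri` each cycle is the "stagger the diagonals" step that removes bias.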

Arbitration
- What is the difference between arbitration and allocation?
[Figure: several requesters (RQ), a single winner]

Arbitration Issues
- Who? Prioritized vs. non-prioritized
- Metrics: fairness (weighted vs. equal), starvation, mutual exclusion
- For how long? Cycle by cycle vs. an extended number of cycles
- Who decides: the requestor, the granter, or the resource?

Switch Microarchitecture
[Figure: basic switch microarchitecture with input/output buffers, crossbar, link control on the physical channels, and a routing, control, and arbitration unit]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Fairness
- Weak fairness: eventually served
- Strong fairness: equally served; may be weighted
- FIFO fairness: served in the order requested
- Local vs. global fairness, e.g., under cascaded arbitration
*From W. J. Dally & B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004

Global Fairness
[Figure: sources merged along a chain of locally fair arbiters]
- S3 will get half the link bandwidth
- S1 will effectively get 1/8th the link bandwidth
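The slide's numbers follow from compounding locally fair merges. As a sketch, assume four sources S0..S3 merged along a chain of 2-input locally fair arbiters, with S3 at the merge closest to the link; each merge gives its local source half of the downstream bandwidth:

```python
def cascaded_shares(n):
    """Link-bandwidth share of each of n sources merged along a chain of
    locally fair 2-input arbiters. Each merge halves the upstream stream,
    so the most upstream sources are squeezed exponentially."""
    shares = [0.5 ** (n - k) for k in range(n)]
    shares[0] = 0.5 ** (n - 1)   # the first source shares the deepest merge
    return shares
```

`cascaded_shares(4)` gives S3 half the link and S1 one eighth: local fairness at every arbiter does not imply global fairness.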

Arbitration Techniques
- Fixed priority: priority encoder; used in older buses
- Variable priority order
  - Oblivious priority: oblivious to requests and grants
  - Round robin: rotate the priority
  - Weighted round robin: proportional grants
*From W. J. Dally & B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kaufmann, 2004
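A round-robin arbiter is small enough to sketch directly; this hypothetical Python model rotates the priority pointer just past each winner, giving equal service to persistent requesters:

```python
class RoundRobinArbiter:
    """Round-robin arbiter: the most recent winner becomes lowest priority
    on the next cycle, giving strong (equal) fairness among requesters."""

    def __init__(self, n):
        self.n = n
        self.pointer = 0                 # index with highest priority this cycle

    def arbitrate(self, requests):
        """Return the winning index for a 0/1 request vector, or None."""
        for offset in range(self.n):
            idx = (self.pointer + offset) % self.n
            if requests[idx]:
                self.pointer = (idx + 1) % self.n   # rotate priority past winner
                return idx
        return None
```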

Main Functions
- Routing, switching, and flow control
  - Note the modules on Flow Control, Switching Techniques, and Routing Algorithms
- Allocators/arbiters: a closer look
  - Arbiters: atomic modules not amenable to pipelining
- Optimizations
  - Intra-router: pipelined, speculative operation of router functions
  - Inter-router pipelining/overlap: hiding route computation, reservation protocols and express channels
- Buffer management and arbitration
  - High-speed and efficient queue management

Opportunities
Pipeline: IB RC VCA SA ST
- How can I reduce latency? Reduce the number of pipeline stages
- How can I increase throughput? Increase the number of messages in transit; improve buffer/wire utilization

Speculation
Can I shorten the IB RC VCA SA ST pipeline?
- Deterministic routing: concurrently request the output port and the VC
- Adaptive routing: requests can be made for multiple physical output ports

Speculation
- What can be speculated? Crossbar allocation pending VC allocation
  - More complex for adaptive routing protocols
- Speculative vs. non-speculative flits; header flits vs. body & tail flits
  - Speculative requests have lower priority than non-speculative ones
- Overhead of speculation
  - High traffic loads mask failures in speculation
  - Low traffic loads increase the probability of success

Impact of Flow Control & Speculation
[Figure: base performance and the impact of flow control. From L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001]

Main Functions
- Routing, switching, and flow control
  - Note the modules on Flow Control, Switching Techniques, and Routing Algorithms
- Allocators/arbiters: a closer look
  - Arbiters: atomic modules not amenable to pipelining
- Optimizations
  - Intra-router: pipelined, speculative operation of router functions
  - Inter-router pipelining/overlap: hiding route computation, reservation protocols and express channels
- Buffer management and arbitration
  - High-speed and efficient queue management

Look-Ahead Routing
- Implement the routing function for the next node, e.g., with deterministic routing
- The pipeline becomes RC/IB VCA SA ST: route computation for the next hop overlaps input buffering
Introduced in the SGI Spider chip: M. Galles, "Spider: A High-Speed Network Interconnect," IEEE Micro, vol. 17, no. 1, Feb. 1997, pp. 34-39.

Look-Ahead Routing
- Pipelining/overlap across routers
- Applied to oblivious and deterministic routing functions
- Table look-up implementation of routing functions
  - The table can encode the index of the table entry at the next node
  - Enables flexible, statically routed, pipelined, high-speed routers
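A behavioral sketch of look-ahead route computation, assuming deterministic XY routing on a 2D mesh (the function and port names are hypothetical): the current router computes the output port the next router will need and carries it in the head flit.

```python
def xy_output_port(cur, dest):
    """Deterministic XY routing decision at node `cur`."""
    (cx, cy), (dx, dy) = cur, dest
    if dx != cx:
        return "EAST" if dx > cx else "WEST"
    if dy != cy:
        return "NORTH" if dy > cy else "SOUTH"
    return "EJECT"

def next_node(cur, port):
    """Neighbor reached by leaving `cur` through `port`."""
    x, y = cur
    return {"EAST": (x + 1, y), "WEST": (x - 1, y),
            "NORTH": (x, y + 1), "SOUTH": (x, y - 1), "EJECT": cur}[port]

def lookahead_route(cur, dest):
    """Return (port used here, precomputed port for the next router).
    The second value rides in the head flit, so the next router's RC
    stage overlaps this router's link traversal."""
    here = xy_output_port(cur, dest)
    nxt = next_node(cur, here)
    return here, xy_output_port(nxt, dest)
```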

A Closer Look
- Can I take this idea further? Set up the complete pipeline (IB RC VCA SA ST) at the next node?
- Break up the packet flow into control and data: we have flit reservation flow control!
From L. S. Peh and W. J. Dally, "Flit Reservation Flow Control," Proceedings of the International Symposium on High Performance Computer Architecture, 2000

Problem
- Take a closer look at flow control: its impact on latency & throughput
- Goal: enable this overhead to be zero!
Figure from L. S. Peh and W. J. Dally, "A Delay Model for Router Microarchitectures," IEEE Micro, January-February 2001

Idealized Operation
[Figure: a chain of routers R with buffers always full]
- Ideally the buffers are 100% utilized as long as there is data to be transmitted
- Need to remove control from the (in-band) data path

Existing Solutions
- Compiler-based scheduling
  - Information must be statically known; precludes dynamic optimizations
- Adaptations of circuit switching
  - Circuit switching, wave switching, pipelined circuit switching

Looking a Little Deeper
- Flow control latency is on the critical message latency path
  - How many cycles does flow control add per message?
- Consequently, flow control latency determines bisection bandwidth utilization for a fixed topology and routing algorithm
Router pipeline: IB RC VCA SA ST LT
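A back-of-the-envelope zero-load latency model makes the cost concrete. This sketch assumes the head flit pays the full pipeline at every hop while body/tail flits stream one per cycle behind it (names are mine, not the paper's notation):

```python
def zero_load_latency(hops, pipeline_stages, flits_per_packet):
    """Zero-load packet latency in cycles for a wormhole/VC router:
    head flit pays the full pipeline per hop; the remaining flits
    add one cycle each of serialization at the destination."""
    return hops * pipeline_stages + (flits_per_packet - 1)
```

With the six-stage IB RC VCA SA ST LT pipeline, a 5-flit packet crossing 4 hops takes 4*6 + 4 = 28 cycles at zero load, which is why removing stages from the critical path matters so much.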

Looking a Little Deeper
- Key: remove/hide routing/arbitration from the (in-band) end-to-end datapath
- Focus on the efficiency of buffer usage rather than solely on the efficiency of channel bandwidth
- Get the benefits of statically scheduled routing with the flexibility of dynamically scheduled routing

Impact
- Improving buffer occupancy improves both latency and throughput
- Shifts the latency/throughput curve to the right: higher saturation throughput

Approach
- Similar to pipelined circuit switching in that control flits set up a path
- Data flits are transmitted without examination
- Unique in the goal of hiding/overlapping routing & arbitration overheads with data transmission
- Applicable to any deterministic routing protocol

Pipelined Switch Microarchitecture
[Figure: the five-stage pipelined switch (IB, RC, VCA, SA, ST & output buffering), repeated for reference. From L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001]

Router & Routing Schedule Setup
[Figure: router with reservation tables holding scheduling information for data flits and the scheduled input/output transfers. From L. S. Peh and W. J. Dally, "Flit Reservation Flow Control," Proceedings of the International Symposium on High Performance Computer Architecture, 2000]

Scheduling Reservations
[Figure: the control flit is routed, then departures and arrivals are scheduled. From L. S. Peh and W. J. Dally, "Flit Reservation Flow Control," Proceedings of the International Symposium on High Performance Computer Architecture, 2000]

Some Details
- Buffers are actually allocated the cycle before a flit arrives
  - Placeholders in the table, since we do not know the future
  - Better utilization without the need for buffer transfers
- Credits are turned on in advance, reducing buffer turnaround time
- Injection protocol similar to the switch protocol
- Early data arrivals are handled via a free buffer pool

Architectural Issues
- Control flits traverse a faster network, e.g., on upper metal layers
- Narrow control flits + wide data flits
- A good match for on-chip networks

Overhead Comparison
- Virtual channel FC
  - Overhead: buffer queue pointers, channel status bits, credit counts
  - Data flits carry a VCID and a type field
- Flit reservation FC
  - Overhead: control buffers and I/O reservation tables
  - Data flits: payload only; control flits carry arrival times
  - Approximately 2% overhead for 256-bit data flits

Some Performance Results
- Base latency improves from 32 cycles to 27 cycles
From L. S. Peh and W. J. Dally, "Flit Reservation Flow Control," Proceedings of the International Symposium on High Performance Computer Architecture, 2000

Impact of Scheduling Horizon
- A larger horizon improves the probability of successfully scheduling a flit
- A larger horizon can be exploited only if control flits lead data flits (proportionally)
- Hence the importance of the relative bandwidths of the control and data networks
- Control flit lead time has little impact when control and data flits use the same network

System Design Issues
- Per-flit vs. all-or-nothing scheduling: bookkeeping vs. simplicity
- Ratio of control flits to data flits
  - Overhead encapsulated in control flits; determines the capacity of the control network
- Buffer pool implementation
  - Reservations preclude the need for physical partitioning of buffer resources
  - Buffer allocation at arrival time

Flit Reservation Conclusion
- Resource reservation improves
  - Throughput: reduces idle time
  - Latency: hides routing and arbitration cost
- In-band or out-of-band control flow to hide reservations
- Significant improvements in saturation behavior

A Still Closer Look
- Can I take this idea even further? What is an ideal network? Direct connections to destinations
- Can I approximate an ideal network? Remember express (physical) channels?
- Energy/delay performance is dominated by routers vs. links
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

Goal
- Approach the energy and performance characteristics of the ideal interconnect fabric: direct connections between every pair of nodes
- Key idea: virtualize physical express links
  - Express traffic has higher priority in the use of physical bandwidth
  - Virtual express links skip router pipeline stages at intermediate routers

Key Idea
- Express channels: approach Manhattan wire delay by adding express physical channels
- Express virtual channels: approach the same goal without adding physical wiring, using virtual channels that span a set of physical routers

Ideal Network Properties
[Figure: ideal interconnect for a source-destination (S-D) pair. Latency parameters: packet size, bandwidth, propagation velocity, Manhattan wire distance, average #hops, router delay, congestion factor. Power parameters: router power, interconnect transmission power/bit]
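The figure's parameters combine into a simple latency model. As a hedged sketch (the parameter names are mine, not the paper's notation): ideal latency is pure wire delay plus serialization, while a real network adds per-hop router delay inflated by a congestion factor.

```python
def ideal_latency(dist, v, pkt_bits, bw):
    """Ideal interconnect latency for an S-D pair: wire delay over the
    Manhattan distance plus serialization, with no routers in the path."""
    return dist / v + pkt_bits / bw

def network_latency(hops, t_router, congestion, dist, v, pkt_bits, bw):
    """A real network adds per-hop router delay, inflated by congestion."""
    return hops * t_router * congestion + dist / v + pkt_bits / bw
```

The gap between the two expressions, hops * t_router * congestion, is exactly what EVCs attack by letting flits skip intermediate router pipelines.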

Performance Gap
- Increasingly aggressive routers do better at low loads but eventually have no room to speculate
- Even at low loads, the gap is substantial
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

Express Virtual Channels
[Figure: express links connecting source/sink nodes across bypass nodes]
- Add long-distance "channels" and virtualize these physical channels
- Express virtual channels (EVCs) bypass stages of the router pipeline
- EVCs do not cross dimensions

Router Pipelines
- Baseline pipeline
- Express pipeline: look-ahead routing via a single-bit signal; EVCs have priority in switch allocation; eliminates BW, RC, VCA, and SA
- Aggressive pipeline: bypass the crossbar, a latch-to-latch transfer
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

Impact on Energy and Throughput
- Pipeline: eliminates several energy-consuming stages
- Throughput: better utilization of wire bandwidth at high loads
- How do other schemes, like speculation, fare?
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

Recap: Baseline Router Microarchitecture
[Figure from A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007]

EVC Router Architecture

Router Microarchitecture: A Bypass Node
- Lookahead signals set up the bypass through the switch
- Non-express pipeline (head and body/tail flits): BW VA SA ST LT
- Express pipeline (head and body/tail flits): latch, ST, LT
- Aggressive pipelining can reduce the pipeline to LT (bypass the switch altogether)
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

Router Microarchitecture: A Source/Sink Node
- The choice of allocator depends on the number of hops to the destination
- Non-express pipeline (head and body/tail flits): BW VA SA ST LT
- Express pipeline (head and body/tail flits): latch, ST, LT
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

EVC Flow Control
[Figure: express links spanning bypass nodes. From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007]

EVC Flow Control
- Flow control must be managed across multiple links
- A lookahead signal travels one step ahead of the EVC flit; it also carries the number of non-express router pipeline stages at the sink node (the other end of the EVC)
- The credit loop now spans: credit propagation delay, flit delay to the first bypass node, K-1 bypass routers, and credit processing
- Note: EVCs require deeper buffers!
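Why EVCs need deeper buffers follows from the credit loop. A sketch under simplified assumptions (uniform per-link delay, hypothetical names): a VC sustains one flit per cycle only if it has at least as many buffers as the credit round trip is long, and an EVC's round trip grows with the number of links it spans.

```python
def buffer_turnaround(link_delay, flit_pipeline, credit_pipeline):
    """Credit round trip in cycles for a one-hop VC: the flit crosses the
    link and the downstream pipeline before its buffer frees, then the
    credit crosses back and is processed."""
    return link_delay + flit_pipeline + link_delay + credit_pipeline

def min_vc_buffers(hops, link_delay, flit_pipeline, credit_pipeline):
    """An EVC spanning `hops` links sees a proportionally longer credit
    loop, so it needs proportionally deeper buffers for full throughput."""
    return hops * 2 * link_delay + flit_pipeline + credit_pipeline
```

With 1-cycle links, a 4-cycle flit pipeline, and 1-cycle credit processing, a one-hop VC needs 7 buffers while a 3-hop EVC needs 11 under this model.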

Buffer Management
[Figure from A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007]

Dynamic EVCs
- Available in a range of distances up to lmax
- Every node is both a source/sink node and a bypass node; all routers are identical
- Unlike static EVCs, dynamic EVCs can adapt to the exact packet route
- EVCs remain prioritized over local packets
- Partition VCs across all EVC lengths

Router Microarchitecture: Dynamic EVCs
[Figure from A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007]

Routing with Dynamic EVCs
- Smaller steps or larger steps?
- Load balancing: distribution of VCs across EVCs
  - Non-uniform vs. uniform; improve utilization (longer-hop EVCs are underutilized)
- Starvation: send "pause" tokens upstream to the EVC source after a threshold
  - Dynamic EVC implementations propagate pause tokens for (lmax - 1) links

Some Observations
- EVC traffic consumes less energy (pipeline stages are skipped, buffering requirements drop) and improves throughput
- Buffer management, static EVCs
  - Need deeper buffers for the longer source/sink round-trip credit delay
  - Use upper metal layers for faster credit loop transmission
- Buffer management, dynamic EVCs
  - Buffer pools with stop-and-go flow control
  - One buffer reserved for each VC to ensure progress
  - Use multiple thresholds for dynamic EVCs

Performance
- Uniform traffic: speculation failures begin to catch up with the baseline; contention reduction for EVCs increases throughput
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

Performance (cont.)
- EVCs reduce contention by effectively partitioning traffic and pre-allocating resources across nodes; this also reduces energy
- The performance difference is sensitive to the aggressiveness of the pipeline
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," Proceedings of ISCA 2007

EVC Summary
- Effectively performs non-local pre-allocation of resources
  - Reduces contention, saves energy, improves throughput
- Can make better use of wire bandwidth if headroom exists
- Hence should compare favorably with heterogeneous networks

Buffering Strategies

Pipelined Switch Microarchitecture
[Figure: the five-stage pipelined switch (IB, RC, VCA, SA, ST & output buffering), repeated for reference. From L. S. Peh and W. J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," Proc. of the 7th Int'l Symposium on High Performance Computer Architecture, Monterrey, January 2001]

Reading (note the definition of switch speedup)
- Y. Tamir and G. Frazier, "Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches," IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725-734, June 1992.
- Y. Choi and T. M. Pinkston, "Evaluation of Queue Designs for True Fully Adaptive Routers," Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
- R. Sivaram, C. Stunkel, and D. K. Panda, "HIPIQS: A High-Performance Switch Architecture Using Input Queuing," IEEE Transactions on Parallel and Distributed Systems, March 2002.
- B. Prabhakar and N. McKeown, "On the Speedup Required for Combined Input Output Queueing," Proceedings of the IEEE International Symposium on Information Theory, August 1998.

Need for Buffering Flow control – downstream buffers are not available Conflict – multiple concurrent requests for the same output port Decisions – routing/processing the packet

Basic Buffer Organization FIFO Strict FIFOs require traversal of the full queue Circular Queue (CQ) Efficient FIFO implementation Strict ordering leads to head-of-line (HOL) blocking Analogy with in-order instruction issue Central buffering with dynamic allocation Effective sharing of buffer space
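The circular-queue organization above can be sketched as a small simulation. This is an illustrative Python model, not from the slides: a fixed-size ring buffer with head/tail pointers, the usual hardware-friendly FIFO implementation, which also shows why strict ordering invites HOL blocking (only the head can ever leave).

```python
# Minimal circular-queue (CQ) FIFO sketch. Names are illustrative.
class CircularQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0       # next slot to read
        self.tail = 0       # next free slot to write
        self.count = 0

    def enqueue(self, flit):
        if self.count == len(self.buf):
            raise IndexError("buffer full: backpressure upstream")
        self.buf[self.tail] = flit
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1

    def dequeue(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        flit = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return flit

q = CircularQueue(4)
for f in "ABCD":
    q.enqueue(f)
assert q.dequeue() == "A"   # strict FIFO: the head gates everything behind it
```

The pointer arithmetic maps directly onto two log N-bit hardware counters, which is why CQs are the standard efficient FIFO implementation.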

Issue: Utilization vs. Throughput Shared memory resources High utilization via dynamic allocation Stresses I/O rates (channel or switch) Partitioned resources Scaling to match I/O rates Less flexibility in sharing → lower utilization Options? Multi-porting Physical partitioning Application across physical, virtual, and switch resources

Key Design Issues Where do we place buffers? Input queued, output queued, and input & output queued Decouple internal datapath from physical link transmission Centrally buffered Buffered crossbars How are the buffers designed/organized? FIFO, circular queue (CQ), statically allocated multiqueue (SAMQ), dynamically allocated multiqueue (DAMQ) Impact of buffering strategies on link vs. switch speeds Arbitration and scheduling Impact on flow control Consider multicast traffic

Challenges with Central Buffers High bandwidth requirements N read ports and N write ports Wide I/O design Problematic for variable length packets – fast hardware allocation and de-allocation with wide I/O Uneven traffic One output port can monopolize storage There are ways around this!

Switch Microarchitecture Basic switch microarchitecture [figure: input buffers, crossbar, and output buffers between physical channels, with link control and a routing control and arbitration unit; variants showing switch input speedup and switch input & output speedup] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Switch Speedup Ratio of the switch bandwidth to the line bandwidth Speedup requirements for maximal throughput Generally speedup is limited to 20%–30% of maximum Goal is to increase output port utilization and maximize throughput Note the difference between speedup and parallelism
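As a toy illustration of the definition (the numbers below are made up, not from the slides), speedup is just the ratio of internal switch bandwidth to external line bandwidth:

```python
# Illustrative sketch: switch speedup = switch bandwidth / line bandwidth.
def switch_speedup(switch_bw_gbps, line_bw_gbps):
    return switch_bw_gbps / line_bw_gbps

# An internal datapath at 1.25x the link rate gives 25% speedup,
# inside the 20%-30% range the slide cites as typical.
assert switch_speedup(12.5, 10.0) == 1.25
```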

Independent Buffers We will utilize input buffering FIFO buffers naturally accommodate variable length packets, but…

Buffer Organization HOL blocking at an input port [figure: input port i with a single input queue holding packets destined for the X+, X-, Y+, and Y- output ports; the head packet blocks those behind it] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Buffer Organization HOL blocking at an input port using a single queue per port [figure: 2D mesh, no VCs, DOR routing; a blocked head packet in VC0 stalls the packets behind it] © T.M. Pinkston, J. Duato, with major contributions by J. Flich
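The HOL-blocking situation in the figure can be replayed in a few lines. This is an illustrative Python model (port names follow the slide's 2D-mesh example; the setup is assumed): with one strict FIFO per input, a packet whose output is busy stalls packets behind it that could otherwise advance.

```python
# Toy head-of-line (HOL) blocking demo with a single queue per input port.
from collections import deque

input_queue = deque([("p1", "X+"), ("p2", "Y+")])   # p1 is at the head
busy_outputs = {"X+"}                                # X+ is contended this cycle

def issue_strict_fifo(queue, busy):
    """Only the head may issue; if its output is busy, nothing moves."""
    if queue and queue[0][1] not in busy:
        return queue.popleft()
    return None

# p2 could use the idle Y+ port, but p1 at the head blocks it.
assert issue_strict_fifo(input_queue, busy_outputs) is None
```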

Buffer Organization HOL blocking is reduced when using virtual channels (2 queues) [figure: input port i demultiplexed into two virtual-channel queues feeding the X+, X-, Y+, and Y- output ports] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Buffer Organization HOL blocking removed when using virtual channels (2 queues) [figure: 2D mesh, 2 VCs, DOR routing; the blocked packet sits in VC0 while VC1 issues] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Buffer Organization HOL blocking remains when using virtual channels (2 queues) and no VCs are available [figure: 2D mesh, 2 VCs, DOR routing; both VC0 and VC1 are blocked] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Buffer Organization HOL blocking is avoided at the switch using VOQs (need k queues) [figure: input port i demultiplexed into one queue per output port (X+, X-, Y+, Y-)] © T.M. Pinkston, J. Duato, with major contributions by J. Flich
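Virtual output queuing can be sketched the same way (illustrative Python; class and port names are assumptions, not from the slides): one queue per output port at each input, so a blocked output never stalls traffic bound elsewhere in this switch.

```python
# Toy VOQ input port: one queue per output, no intra-switch HOL blocking.
from collections import deque

class VOQInput:
    def __init__(self, outputs):
        self.voq = {out: deque() for out in outputs}

    def enqueue(self, packet, out):
        self.voq[out].append(packet)   # packet is "pre-routed" into its output's queue

    def issue(self, busy):
        # Any non-busy output with a waiting packet may issue this cycle.
        for out, q in self.voq.items():
            if q and out not in busy:
                return out, q.popleft()
        return None

port = VOQInput(["X+", "X-", "Y+", "Y-"])
port.enqueue("p1", "X+")
port.enqueue("p2", "Y+")
# X+ is busy, yet p2 still issues to Y+: the blocked head no longer matters.
assert port.issue({"X+"}) == ("Y+", "p2")
```

Note the cost the slides point out: this needs k queues per input, and enqueuing into a specific VOQ commits the packet to one output, which is exactly the pre-routing restriction discussed later.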

Buffer Organization HOL blocking avoided at roots using VOQs, but not at branches!! However, HOL blocking can still arise at the neighboring switch!! [figure: 2D mesh, VOQs, DOR routing; packets that share a VOQ at the next hop still block each other] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Dynamically Allocated Buffers How should a fixed amount of storage be designed and used? Challenge of variable sized packets Goals: Avoid head-of-line blocking Per-flow performance a function of total buffer storage Advantages of FIFO without the disadvantages Baseline Centrally buffered, dynamically allocated (CBDA) switch Note that FIFOs make it easier to handle variable sized packets. Note that cut-through designs require dual-ported buffers. Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June 1992.

FIFO Queues Single buffer per port Only one read port and one write port per queue Efficient for variable length packets Logical extension Statically allocate for multiple output ports

Statically Allocated Fully Connected (SAFC) Complexity Multiple switches to be controlled Flow control bandwidth is O(#buffers/port) Pre-routing to determine target buffer at next router Multiple queue controllers Efficiency Packets can access only one fourth of the buffer space (per-output partitions of N/4 each) Variable length packets cannot make efficient use of buffer space [figure: per-input buffers of N/4 packets feeding 4x1 crossbars at the outputs] Fully connected since each buffer has a direct connection to the output port it belongs to; this is effectively a 16x4 crossbar, except all crosspoints need not be fully populated since each buffer feeds only one output port. A FIFO acts like a dynamically allocated space with respect to buffer utilization; this is amplified in networks with variable length packets. Note that pre-routing limits the routing function, i.e., once bound to a queue a packet can no longer go to another port (no adaptive routing).

Statically Allocated Multiqueue (SAMQ) Single buffer statically allocated to multiple queues (N/4 packets per queue) Only one read port and one write port per queue Efficiency and pre-routing are still concerns Logical extension Improve storage efficiency and per-port queuing → DAMQs

Dynamically Allocated Multiqueue (DAMQ) [figure: shared buffer with per-queue head/tail pointer registers, a free list, and separate read and write buses] Per block (allocation) and per queue (destination routing) data structures Incoming packet → head of the free list Length and write register counters for each block Routing adds the packet to the tail of its queue Concurrent access to registers/buffers speeds turnaround time
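The DAMQ organization can be modeled compactly. This is a hedged sketch in the spirit of Tamir & Frazier, not their implementation: a shared pool of buffer blocks linked by pointer registers, with a free list plus one linked list per destination queue (field and method names are illustrative).

```python
# Minimal DAMQ model: shared blocks, a free list, per-queue head/tail pointers.
class DAMQ:
    def __init__(self, num_blocks):
        self.data = [None] * num_blocks
        self.next = list(range(1, num_blocks)) + [None]  # pointer registers
        self.free_head = 0
        self.head = {}    # per-queue head pointer
        self.tail = {}    # per-queue tail pointer

    def enqueue(self, queue_id, packet):
        blk = self.free_head
        if blk is None:
            raise MemoryError("shared buffer exhausted")
        self.free_head = self.next[blk]           # pop a block off the free list
        self.data[blk], self.next[blk] = packet, None
        if queue_id in self.tail:                 # link onto the queue's tail
            self.next[self.tail[queue_id]] = blk
        else:
            self.head[queue_id] = blk
        self.tail[queue_id] = blk

    def dequeue(self, queue_id):
        blk = self.head.get(queue_id)
        if blk is None:
            return None
        packet, nxt = self.data[blk], self.next[blk]
        if nxt is None:
            del self.head[queue_id]; del self.tail[queue_id]
        else:
            self.head[queue_id] = nxt
        self.next[blk] = self.free_head           # return the block to the free list
        self.free_head = blk
        return packet

d = DAMQ(4)
d.enqueue("X+", "a"); d.enqueue("Y+", "b"); d.enqueue("X+", "c")
assert d.dequeue("Y+") == "b"   # queues share one pool yet dequeue independently
assert d.dequeue("X+") == "a"
```

The point of the pointer-register design is that all queues draw on the whole storage pool dynamically, unlike SAMQ's static partitions.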

Dynamically Allocated Multiqueue (DAMQ) [figure: shared buffer with per-queue head/tail pointer registers, a free list, and separate read and write buses] Five lists at each port List of packets for each (other) output port Free list List of packets for the processor port Implements virtual channels

DAMQ Properties Supports fast cut-through Happens only on empty buffers and a free output port Dual-ported SRAM Write and read bus Buffers and registers can be accessed in parallel Shift-register-based block addressing Separate registers for read and write operations Fast operation Per-port management Three FSMs: buffer manager, routing, transmission Note the implicit routing restriction associated with queue insertion

Performance The centrally buffered switch (CBDA) represents the idealized option Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.

Queue Designs for Adaptive Routing Goal: queue structures that are supportive of true fully adaptive routing Compare with existing designs

Buffer Implementation Buffer organizations Implementation of VOQs via DAMQs [figure: switches A, B, and C, each with per-input queues Queue(i,j) for every input i (1..N) and output j (1..N)] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Impact of Routing Algorithms Queue structures historically focused on single-path routing protocols and associated issues For example, head-of-line (HOL) blocking Virtual output queues essentially pre-route the packet Restricts routing freedom, i.e., what if another port becomes available in the next cycle? What about adaptive routing protocols? What does HOL mean now? What is the “issue” logic now?

Supporting Adaptive Routing Goal Support true, fully adaptive routing Key Issue Multiple output ports are candidates for a packet Flexibility in issuing packets to “available” output port Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.

DAMQs revisited Dynamically Allocated Fully Connected (DAFC) Queue BW vs. crossbar complexity Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.

Recruiting Candidates In each queue, recruit registers identify candidates for other output ports Exclude the native port and the reverse direction When a port queue is empty, recruit packets from other queues Need to hide register updates Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
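The recruiting idea can be sketched as follows. This is an illustrative Python model of the concept, with an assumed routing representation (each packet carries its set of legal output ports): when an output's queue runs empty, it pulls in a waiting packet from another queue that adaptive routing also permits to use this output.

```python
# Toy recruit-register sketch: an empty output port recruits a packet
# from another queue whose adaptive route set includes this port.
from collections import deque

# Each entry: (packet, set of legal output ports under adaptive routing).
queues = {"X+": deque(), "Y+": deque([("p1", {"X+", "Y+"})])}

def recruit(empty_port, queues):
    for port, q in queues.items():
        if port == empty_port:
            continue
        for i, (pkt, legal) in enumerate(q):
            if empty_port in legal:   # a candidate a recruit register would flag
                del q[i]
                return pkt
    return None

# X+ has no packets, but p1 (queued for Y+) may adaptively use X+ as well.
assert recruit("X+", queues) == "p1"
```

In hardware the candidate scan is precomputed into recruit registers rather than searched on demand, which is why the slide notes the need to hide register updates.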

VC DAMQs Decouple queues from output ports – these are now VCs Assign queues to output ports Does not eliminate HOL – play the odds Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.

Evaluation of Implementations Assumptions VCT, input buffering, hierarchical buffer management, 8 packet buffers Use a common baseline approach

Implementation of CQs Tradeoff in granularity vs. cost Overlapping flit and phit access Two log N-bit counters for the CQ Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606–616.

DAMQ Implementation Model (2K + N) log N-bit pointer registers Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606–616.

DAMQs with Recruit Registers 2 × (K−1) × (K−2) log N-bit registers Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606–616.

Operation Cost Asymmetry of reads and writes Pointer updates in the more complex schemes Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.

Performance Impact of VC allocation Flexibility pays Static vs. dynamic Flexibility pays Cost in storage and speed How about power? Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606–616.

A Closer Look at DAMQs Only one output port can be reading from an input port at a time Conflicts at input ports lead to idle link cycles Solution: pipeline the memory (buffers) (see * below) No more centralized arbitration! Multicast Replication should not incur significant synchronization penalties *M. Katevenis, P. Vatsolaki and A. Efthymiou, “Pipelined Memory Shared Buffer for VLSI Switches,” Proceedings of ACM SIGCOMM, August 1995, pp. 39–48

Taxonomy of Buffering Schemes R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002

HIPIQS: Basic Problem with DAMQs [figure: DAMQ shared buffer with per-queue pointer registers, a free list, and single read/write buses] Single input (write) port Single output (read) port

Basic Idea Key idea: use pipelined buffers and input-queued switches [figure: Read 0, Read 1, Read 2, Read 3 proceeding concurrently from one pipelined buffer] Concurrent reads from output ports Simple to replicate R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002
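A very rough way to see the benefit of pipelined buffers (this is a toy Python model under assumed banking, not the HIPIQS design): if the input buffer is split into stages and one read per stage may proceed each cycle, several output ports can drain the same input concurrently instead of serializing on a single read port.

```python
# Toy pipelined-buffer model: one read per stage (bank) per cycle,
# so up to len(stages) flits leave the same input buffer at once.
stages = [["a"], ["b"], ["c"], []]   # four pipeline stages holding flits

def concurrent_reads(stages):
    """Collect at most one flit from each stage in a single cycle."""
    out = []
    for bank in stages:
        if bank:
            out.append(bank.pop(0))
    return out

# Three flits read in one cycle -- a single-read-port DAMQ would need three cycles.
assert concurrent_reads(stages) == ["a", "b", "c"]
```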

Input Module Equivalent to virtual output queuing or buffered crossbars Fast path [figure: HIPIQS input module datapath] R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002

Pipelined Buffer Management R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002

Application Proposed for use in multistage networks Pays for performance via increased connectivity, O(K³f) Performance approaches that of output queuing with central buffers

Buffered Crossbars Provide one buffer for each input-output port pair Can achieve 100% throughput Captures VOQ principles
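The per-pair buffering can be sketched as follows (illustrative Python; a one-flit crosspoint buffer and a trivial arbiter are assumptions, not from the slides): each output arbitrates only among its own column of crosspoint buffers, so inputs never block one another.

```python
# Toy buffered crossbar: one small buffer per (input, output) crosspoint.
N = 3
xpoint = [[None] * N for _ in range(N)]   # xpoint[i][j]: buffer from input i to output j

def inject(i, j, flit):
    if xpoint[i][j] is None:              # crosspoint has room
        xpoint[i][j] = flit
        return True
    return False                          # backpressure applies to input i only

def output_arbitrate(j):
    """Trivial fixed-priority arbiter: drain the first occupied crosspoint in column j."""
    for i in range(N):
        if xpoint[i][j] is not None:
            flit, xpoint[i][j] = xpoint[i][j], None
            return flit
    return None

inject(0, 2, "a"); inject(1, 2, "b")
assert output_arbitrate(2) == "a"         # output 2 serves input 0 this cycle
assert output_arbitrate(2) == "b"         # then input 1; conflicts stay local to one output
```

This is why buffered crossbars capture the VOQ principle: the crosspoint buffers play the role of per-output queues, at the cost of N² small memories.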

Buffered Crossbar [figure: N inputs and N outputs with a memory at each crosspoint and a per-output arbiter] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

Buffering Summary Basic set of design decisions Allocation Static vs. dynamic Physical partitioning Across switch, port, virtual channel Buffer bandwidth Pipelined, multi-ported Location Input, output, or centralized Combinations to meet specific deployment needs

Microarchitecture Summary Power and area Buffers and crossbars account for the majority of power Packet latency through the switch Arbitration Queuing structure Microarchitectural techniques are not that different from those found in cores Pipelined router Speculative router pipeline Inter-router pipelining/overlap Buffering techniques