Switch Microarchitecture Basics
Reading
Relevant papers are cited in the presentations.
Duato, Yalamanchili, and Ni: Sections 7.2.1, 7.2.2, and 7.2.3 (pages 390-393)
Operation and Microarchitecture
Overview; integration with flow control; impact of switching mechanisms
What does the message pipeline look like? Basis for optimized operation
Physical Channel Router
States (head flit): routing, arbitration/allocation, switch traversal
State information (body & tail flits): output port
State transition (tail): free the channel
Example: routing of a wormhole-switched message
Impact of flit types (head vs. body vs. tail) on the message pipeline
Physical Channel Router What does the routing function implementation look like – adaptive vs. deterministic? What does the selection function (adaptive routing) implementation look like? What about arbitration? (more soon)
Routing Decisions
Formally represented as a routing function. Examples of mappings: destination to output ports (channels); input port (channel) & destination to output ports (channels); header flit to output ports (channels), e.g., source routing
Distinct for oblivious vs. adaptive routing
Turn restrictions are implemented here
Implementation of Routing Functions Common implementation forms Finite state machine Table look up Centralized vs. distributed Across input ports (virtual channels) Impact on cycle time, e.g., table size
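To make the table-lookup form concrete, here is a minimal Python sketch of a destination-indexed routing table for a small ring network. The topology, port names ("CW", "CCW", "EJECT"), and sizes are illustrative assumptions, not from the slides; a hardware table would hold the same mapping in a small RAM indexed by destination.

```python
# Hypothetical sketch: a table-lookup routing function for an 8-node ring.
# Port names CW/CCW/EJECT are illustrative, not from the slides.
def build_routing_table(node, num_nodes):
    """Map each destination to an output port (deterministic routing)."""
    table = {}
    for dest in range(num_nodes):
        if dest == node:
            table[dest] = "EJECT"
        else:
            # choose the shorter direction around the ring
            cw_hops = (dest - node) % num_nodes
            table[dest] = "CW" if cw_hops <= num_nodes // 2 else "CCW"
    return table

table = build_routing_table(node=2, num_nodes=8)
print(table[2], table[4], table[7])  # EJECT CW CCW
```

Table size grows with the number of destinations, which is exactly the cycle-time concern the slide raises; distributed per-input-port tables keep each lookup small.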
A Virtual Channel Router Control plane Data plane From L. S. Peh and W. J. Dally, “A Delay Model for Router Microarchitectures,” IEEE Micro, January-February 2001
A Virtual Channel Router What does the routing function implementation look like – adaptive vs. deterministic? What does the selection function (adaptive routing) implementation look like? What about arbitration? (more soon) Figure from L. S. Peh and W. J. Dally, “A Delay Model for Router Microarchitectures,” IEEE Micro, January-February 2001
Pipelined Switch Microarchitecture
Five stages: Stage 1 IB (input buffering), Stage 2 RC (routing computation), Stage 3 VCA (VC allocation), Stage 4 SA (switch allocation), Stage 5 ST (switch traversal) and output buffering
[Figure: per-port input/output buffers, crossbar, link control, routing computation, VC and switch allocation]
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Some Operational Principles
Data and control planes operate at three rates: phit, flit, packet; resources are allocated and de-allocated at these rates
Fixed clock cycle model
What are the atomic switch functions?
State management: of resources (allocation) and of data (mapping to resources)
Granularity of allocation/management is key to deadlock freedom in pipelined switches
Buffer States
Input buffers: free, routing, VCA, transmitting, stalled (flow control); track the output port and output virtual channel; flow control information: stop/go, credits
Output buffers: transmitting, stalled (flow control), free; track the input port and input virtual channel
Pipeline Disruptions Resource availability disruptions VC availability Downstream buffer space not available (lack of credits) Inter-packet gap is a function of deadlock freedom Allocated flow disruptions Switch not available Downstream buffer space not available Disruptions (pipeline bubbles) propagate to the destination through intermediate routers
Look at Channel Dependencies
[Figure: the five-stage pipelined switch — IB, RC, VCA, SA, ST and output buffering — with per-port input/output buffers, crossbar, and link control]
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
A Look at Channel Dependencies Issue: creating structural dependencies Dependencies between messages due to concurrent use of VC buffers Such dependencies must be globally managed to avoid deadlock Architectural decision: when is a VC freed? When the tail flit releases an input virtual channel When the tail releases the output virtual channel Remember a VC traverses a link!
Buffer Occupancy Deeper pipelining increases the buffer turnaround time and decreases occupancy L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Main Functions of Router Routing, switching and flow control Note modules on Flow Control, Switching Techniques, and Routing Algorithms Allocators and Arbiters: a closer look Atomic modules not easily amenable to pipelining
Main Functions of Router Optimizations Intra-router Pipelined, speculative operation of router functions Inter-router pipelining/overlap Hiding route computation Reservation protocols and express channels Buffer management and arbitration High speed and efficient queue management
Allocation vs. Arbitration
Difference between arbitration and allocation? Allocation matches multiple requesters to multiple resources (a matching); arbitration selects a single winner among requesters for one resource
Allocators
Requesters vs. granters: ports or channels
Formally equivalent to a matching problem: maximal vs. maximum matching
Challenge: fast computation
Allocation
Request matrix (inputs × outputs):
1 0 1 1
1 1 0 1
1 0 0 1
0 1 1 0
Grant matrix (correctness criteria: only one grant per input, only one grant per output — only 1 bit set in any row or column):
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
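The correctness criteria above are easy to state as a predicate. The following Python sketch (an illustration, not from the book) checks a grant matrix: every grant must match a request, and no row or column may carry more than one grant.

```python
# Sketch: validate an allocator's grant matrix against its request matrix.
# Criteria: a grant implies a request; at most one grant per input (row)
# and at most one grant per output (column).
def valid_grant(request, grant):
    n = len(request)
    for i in range(n):
        for j in range(n):
            if grant[i][j] and not request[i][j]:
                return False          # grant without a matching request
    rows_ok = all(sum(row) <= 1 for row in grant)
    cols_ok = all(sum(col) <= 1 for col in zip(*grant))
    return rows_ok and cols_ok

R = [[1,0,1,1],[1,1,0,1],[1,0,0,1],[0,1,1,0]]  # request matrix from the slide
G = [[1,0,0,0],[0,1,0,0],[0,0,0,1],[0,0,1,0]]  # grant matrix from the slide
print(valid_grant(R, G))  # True
```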
Separable Allocators
Input first: arbitrate along rows (input allocator); winners arbitrate along columns (output allocator)
Output first: arbitrate along columns; winners arbitrate along rows
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
Separable Allocator: Operation
[Figure: requests per input port (r00…r33, from the routing functions) enter input arbiters; grants (g00…g33) emerge after output arbitration]
Multiple requests per port; arbitration amongst requests; arbitration is a distinct problem
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
Classes of Approaches
Exact solutions: time consuming; can be computed offline for known patterns
Heuristics: single step vs. iterative; pipelined implementations; overlapping computation and switch scheduling
Forms: single-stage allocation vs. separable
Improving Performance
Preventing starvation: function of the arbiter; ensuring “fairness”; dynamically adjust priorities
Improving quality of solution: iterative arbitration — winners at one stage may lose in the next stage, leaving holes in the allocation
Key challenge: speed vs. quality
Switch Allocator
VC allocator (vs. non-VC allocator): output port allocated for packet duration; low state update rate
Separable allocator; separate allocator for speculative and non-speculative requests
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Switch Allocation Flits bid on a cycle basis for cross-bar slots Possible to increase the granularity of bids or the duration which a crossbar port can be held SA cannot create deadlock since ports are not held indefinitely Success in SA is accompanied by flow control updates For example, transmitting credits Traversal of tail flit reinitializes input channel, resets input/output buffer allocations L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
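The credit updates that accompany successful switch allocation can be sketched in a few lines. This is an illustrative Python model of a credit-based link (buffer depth and names are assumptions, not from the paper): a flit consumes a credit on transmission, and draining the downstream buffer returns one.

```python
# Sketch of credit-based flow control between an upstream output and a
# downstream input buffer. Depth and names are illustrative.
class CreditChannel:
    def __init__(self, depth):
        self.credits = depth      # free flit slots downstream
        self.downstream = []

    def send_flit(self, flit):
        if self.credits == 0:
            return False          # stalled: no downstream buffer space
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_flit(self):
        """Downstream forwards a flit and returns a credit upstream."""
        flit = self.downstream.pop(0)
        self.credits += 1
        return flit

ch = CreditChannel(depth=2)
print(ch.send_flit("head"), ch.send_flit("body"), ch.send_flit("tail"))
# True True False: the third flit stalls until a credit returns
ch.drain_flit()
print(ch.send_flit("tail"))  # True
```

The round trip from drain to the next send is the buffer turnaround time discussed earlier; deeper pipelines lengthen this loop.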
Virtual Channel Allocation
How are candidates for arbitration created? By the routing function; the alternatives depend on routing flexibility (deterministic vs. fully adaptive)
This is the point at which channel dependencies are created, and hence where deadlock avoidance is enforced
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Wavefront Allocator
[Figure: allocator cell with request r_ij, grant_ij, row/column token inputs and outputs (x_in/x_out, y_in/y_out), and priority inputs (x_pri/y_pri)]
Grants are generated when a request has both a row and a column token
The process is seeded by asserting pri along a diagonal; stagger diagonals to avoid bias
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
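A software emulation of the wavefront idea, written as an illustrative Python sketch (the real structure is combinational hardware, not a loop): a cell grants when its request sees both a free row token and a free column token, and the sweep starts from the seeded priority diagonal.

```python
# Software sketch of a wavefront allocator on an n x n request matrix.
# Cells on diagonal k satisfy (i + j) mod n == k; the sweep starts at
# the priority diagonal and claims row/column tokens as it grants.
def wavefront_allocate(request, pri_diag=0):
    n = len(request)
    row_free = [True] * n
    col_free = [True] * n
    grant = [[0] * n for _ in range(n)]
    for step in range(n):                  # wavefront sweeps the diagonals
        k = (pri_diag + step) % n
        for i in range(n):
            j = (k - i) % n
            if request[i][j] and row_free[i] and col_free[j]:
                grant[i][j] = 1
                row_free[i] = False        # row token consumed
                col_free[j] = False        # column token consumed
    return grant

R = [[1,0,1,1],[1,1,0,1],[1,0,0,1],[0,1,1,0]]
g = wavefront_allocate(R, pri_diag=0)
print(sum(sum(row) for row in g))  # 3
```

Rotating pri_diag from cycle to cycle is what "stagger diagonals to avoid bias" achieves: each input/output pairing periodically gets first claim on the tokens.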
Arbitration
Difference between arbitration and allocation? Arbitration picks a single winner among multiple requesters for one resource
Arbitration Issues
Who? Prioritized vs. non-prioritized
Metrics: fairness (weighted vs. equal), starvation
Mutual exclusion
For how long? Cycle by cycle, or an extended number of cycles
Who decides: the requester, the granter, or the resource?
Switch Microarchitecture
Basic switch microarchitecture: per-port input and output buffers, crossbar, link control, and a routing, control, and arbitration unit
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Fairness
Weak fairness: eventually served
Strong fairness: equally served; may be weighted
FIFO fairness: served in the order requested
Local vs. global fairness: cascaded arbitration
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
Global Fairness
With locally fair (cascaded) arbitration along a chain of switches, S3 will get half the link bandwidth while S1 will effectively get 1/8th of the link bandwidth
Arbitration Techniques
Fixed priority: priority encoder; used in older buses
Variable priority order — oblivious priority: oblivious to requests and grants; round robin: rotate the priority; weighted round robin: proportional grants
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
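The round-robin technique above can be sketched in a few lines of Python (an illustration, not a hardware description): the pointer marks the highest-priority requester, and the winner becomes lowest priority for the next cycle, which is what delivers strong fairness over time.

```python
# Sketch of a round-robin arbiter: rotate priority so the most recent
# winner becomes lowest priority next cycle.
class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.pointer = 0              # index with highest priority

    def arbitrate(self, requests):
        for offset in range(self.n):
            idx = (self.pointer + offset) % self.n
            if requests[idx]:
                self.pointer = (idx + 1) % self.n   # rotate priority
                return idx
        return None                   # no requests this cycle

arb = RoundRobinArbiter(4)
print(arb.arbitrate([1, 1, 0, 1]))  # 0
print(arb.arbitrate([1, 1, 0, 1]))  # 1
print(arb.arbitrate([1, 1, 0, 1]))  # 3
```

A weighted variant would grant a requester multiple consecutive slots in proportion to its weight before rotating.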
Main Functions Routing, switching and flow control Note modules on Flow Control, Switching Techniques, and Routing Algorithms Allocators/arbiters: a closer look Arbiters – atomic modules not amenable to pipelining Optimizations Intra-router Pipelined, speculative operation of router functions Inter-router pipelining/overlap Hiding route computation Reservation protocols and express channels Buffer management and arbitration High speed and efficient queue management
Opportunities
Pipeline: IB RC VCA SA ST
How can I reduce latency? Reduce the number of pipeline stages
How can I increase throughput? Increase the number of messages in transit; improve buffer/wire utilization
Speculation
Can I shorten the pipeline IB RC VCA SA ST?
Deterministic routing: concurrently request output port and VC
Adaptive routing: requests can be made for multiple physical output ports
Speculation
What can be speculated? Crossbar (switch) allocation pending VC allocation; more complex for adaptive routing protocols
Speculative vs. non-speculative flits: header flits vs. body & tail flits; speculative requests have lower priority
Overhead of speculation: high traffic loads mask failures in speculation; low traffic loads increase the probability of success
Impact of Flow Control & Speculation Base Performance Impact of Flow control L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Main Functions Routing, switching and flow control Note modules on Flow Control, Switching Techniques, and Routing Algorithms Allocators/arbiters: a closer look Arbiters – atomic modules not amenable to pipelining Optimizations Intra-router Pipelined, speculative operation of router functions Inter-router pipelining/overlap Hiding route computation Reservation protocols and express channels Buffer management and arbitration High speed and efficient queue management
Look-Ahead Routing
Implement the routing function for the next node: the pipeline IB RC VCA SA ST becomes RC IB VCA SA ST, with RC for the next hop overlapped
Give an example with deterministic routing
Introduced in the Spider chip: M. Galles, “Spider: A High-Speed Network Interconnect,” IEEE Micro, vol. 17, no. 1, Feb. 1997, pp. 34-39
Look-Ahead Routing Pipelining/overlap across routers Applied to oblivious and deterministic routing functions Table look-up implementation of routing functions Table can encode the index of the table entry in the next node Enables flexible, statically routed, pipelined high speed routers
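A deterministic example in Python (the mesh coordinates and port names are illustrative assumptions): each router receives, with the head flit, the output port already computed upstream, and spends its RC stage computing the port the *next* router will use.

```python
# Hypothetical sketch of look-ahead routing for dimension-order routing
# (X then Y) on a 2-D mesh. Port names X+/X-/Y+/Y-/EJECT are illustrative.
def dor_port(cur, dest):
    """Output port that dimension-order routing selects at node cur."""
    (cx, cy), (dx, dy) = cur, dest
    if dx != cx:
        return "X+" if dx > cx else "X-"
    if dy != cy:
        return "Y+" if dy > cy else "Y-"
    return "EJECT"

def next_node(cur, port):
    x, y = cur
    return {"X+": (x + 1, y), "X-": (x - 1, y),
            "Y+": (x, y + 1), "Y-": (x, y - 1), "EJECT": cur}[port]

def lookahead_route(cur, precomputed_port, dest):
    """Use the port computed upstream; compute the next hop's port now."""
    nxt = next_node(cur, precomputed_port)
    return precomputed_port, dor_port(nxt, dest)

# Head flit arrives at (1,1) carrying port "X+" computed upstream; the
# route computation for node (2,1) overlaps with this hop's traversal.
port_now, port_next = lookahead_route((1, 1), "X+", (2, 3))
print(port_now, port_next)  # X+ Y+
```

In the table-lookup form described above, the table entry simply stores the index of the next node's table entry instead of recomputing it.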
A Closer Look Can I take this idea further? Set up the complete pipeline at the next node? Break up the packet flow into control and data We have flit reservation flow control! IB RC VCA SA ST From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of International Symposium on High Performance Computer Architecture, 2000
Problem Take a closer look at flow control Goal: Enable this to be zero! Impact on Latency & Throughput Figure From L. S. Peh and W. J. Dally, “A Delay Model for Router Microarchitectures,” IEEE Micro, January-February 2001
Idealized Operation
[Figure: a chain of routers R–R–R–R with buffers always full]
Ideally the buffers are 100% utilized as long as there is data to be transmitted
Need to remove control from the (in-band) data path
Existing Solutions Compiler-based scheduling Information must be statically known Precludes dynamic optimizations Adaptations of circuit switching Circuit switching Wave switching Pipelined circuit switching
Looking a Little Deeper Flow control latency is in the critical message latency path How many cycles does FC add/message? Consequently flow control latency determines bisection bandwidth utilization For a fixed topology and routing algorithm router pipeline IB RC VCA SA ST LT
Looking a Little Deeper Key: remove/hide routing/arbitration from (in band) the end-to-end datapath Focus on efficiency of buffer usage rather than solely on efficiency of channel bandwidth Get the benefits of statically scheduled routing with the flexibility of dynamically scheduled routing
Impact
Improving buffer occupancy improves latency and throughput: shift the latency/throughput curve to the right; higher saturation throughput
Approach Similar to pipelined circuit switching in that control flits setup a path Data flits are transmitted without examination Unique in the goal of hiding/overlapping routing & arbitration overheads with data transmission Applicable to any deterministic routing protocol
Pipelined Switch Microarchitecture
[Figure: the five-stage pipelined switch — IB, RC, VCA, SA, ST and output buffering — with per-port input/output buffers, crossbar, and link control]
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Router & Routing Schedule Setup Scheduling information for data flits Scheduled input/output transfers From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of International Symposium on High Performance Computer Architecture, 2000
Scheduling Reservations Route Schedule departures Schedule arrivals From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of International Symposium on High Performance Computer Architecture, 2000
Some Details
Buffers are actually allocated the cycle before the flit arrives: placeholders in the table (we do not know the future); better utilization without the need for buffer transfers
Note credits are turned on in advance, reducing buffer turnaround time
Injection protocol similar to the switch protocol
Early arrivals (data) are handled via a free buffer pool
Architectural Issues Control flits traverse faster network Upper metal layers Narrow control flits + wide data flits Good match for on-chip networks
Overhead Comparison
Virtual channel FC overhead: buffer queue pointers, channel status bits, credit counts; data flits carry a VCID and type field
Flit reservation FC overhead: control buffers and I/O reservation table; data flits carry payload only; control flits carry arrival times — approximately 2% overhead for 256-bit data flits
Some Performance Results Base latency improves to 27 cycles from 32 cycles From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of International Symposium on High Performance Computer Architecture, 2000
Impact of Scheduling Horizon Larger horizon improves the probability of successfully scheduling a flit Larger horizon can be exploited only if control flits lead (proportionally) data flits Importance of relative bandwidths of control and data networks Control flit lead time has little impact when control and data flits use the same network.
System Design Issues
Per-flit vs. all-or-nothing scheduling: bookkeeping vs. simplicity
Ratio of control flits to data flits: overhead encapsulated in control flits determines the capacity of the control network
Buffer pool implementation: reservations preclude the need for physical partitioning of buffer resources; buffer allocation at arrival time
Flit Reservation Conclusion Resource reservation to improve Throughput: reduce idle time Latency: hide routing and arbitration cost In-band or out-of-band control flow to hide reservations Significant improvements in saturation behavior
A Still Closer Look
Can I take this idea even further? What is an ideal network? Direct connections to destinations
Can I approximate an ideal network? Remember express (physical) channels?
Energy/delay performance is dominated by routers vs. links
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
Goal Approach the energy and performance characteristics of the ideal interconnect fabric Direct connections between every pair of nodes Key Idea: virtualize physical express links Express traffic has higher priority in use of physical bandwidth Virtual express links skip router pipeline stages at intermediate routers
Key Idea
Express channels: approach Manhattan wire delay by using express physical channels
Express virtual channels: approach the same goal without adding physical wiring, using virtual channels that span a set of physical routers
Ideal Network Properties
[Figure: ideal interconnect for a source-destination pair S→D; symbols shown: packet size, propagation velocity, router delay, congestion factor, bandwidth, average #hops, Manhattan wire distance; router power vs. interconnect transmission power/bit]
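The figure's symbol list implies a latency model; a hedged reconstruction, consistent with the Kumar et al. ISCA 2007 formulation but with symbol names assumed from the slide, is:

```latex
% D_m = Manhattan wire distance, v = propagation velocity,
% L = packet size, b = bandwidth, H = average number of hops,
% t_r = per-router delay, c = congestion factor
T_{ideal} = \frac{D_m}{v} + \frac{L}{b}
\qquad
T_{network} \approx c \left( \frac{D_m}{v} + \frac{L}{b} + H\, t_r \right)
```

The ideal interconnect has only wire propagation and serialization terms; the real network adds per-hop router delay and inflation under congestion, which is the gap EVCs try to close.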
Performance Gap
Increasingly aggressive routers do better at low loads but eventually have no room to speculate
Even at low loads, the gap is substantive
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
Express Virtual Channels
[Figure: express links spanning bypass nodes between source/sink nodes]
Add long-distance “channels” and virtualize these physical channels
Express virtual channels (EVCs) bypass stages of the router pipeline; EVCs do not cross dimensions
Router Pipelines
Baseline pipeline
Express pipeline: look-ahead routing (single-bit signal); EVCs have priority in switch allocation; eliminates BW, RC, VCA and SA
Aggressive pipeline: bypass the crossbar (latch-to-latch transfer)
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
Impact on Energy and Throughput
The express pipeline eliminates several energy-consuming stages
Throughput: better utilization of wire bandwidth at high loads
How do other schemes, like speculation, fare?
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
Recap: Baseline Router Microarchitecture From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
EVC Router Architecture
Router Microarchitecture: A Bypass Node
Lookahead signals set up the bypass through the switch
Non-express pipeline: BW VA SA ST LT (head flit); BW VA SA ST LT (body/tail flit)
Express pipeline: Latch ST LT (head flit); Latch ST LT (body/tail flit)
Aggressive pipelining can reduce the pipeline to LT (bypass the switch altogether)
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
Router Microarchitecture: A Source/Sink Node
Choice of allocator depends on the number of hops to the destination
Non-express pipeline: BW VA SA ST LT (head flit); BW VA SA ST LT (body/tail flit)
Express pipeline: Latch ST LT (head flit); Latch ST LT (body/tail flit)
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
EVC Flow Control
[Figure: express links spanning bypass nodes]
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007
EVC Flow Control
Flow control must be managed across multiple links
Credit processing: credit propagation delay plus flit delay to the first bypass node, across k-1 bypass routers
Look-ahead signal: travels one step ahead of the EVC flit, and carries the number of non-express router pipeline stages at the sink node (the other end of the EVC link)
Note: EVCs require deeper buffers!
Buffer Management From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric, Proceedings of ISCA 2007
Dynamic EVCs
Available in a range of distances up to lmax
Every node is a source/sink node and every node is a bypass node; all routers are identical
Unlike static EVCs, dynamic EVCs can adapt to the exact packet route
EVCs remain prioritized over local packets; partition VCs across all EVC lengths
Router Microarchitecture: Dynamic EVCs From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric, Proceedings of ISCA 2007
Routing with Dynamic EVCs
Smaller steps or larger steps?
Load balancing: distribution of VCs across EVCs, non-uniform vs. uniform, to improve utilization (longer-hop EVCs are underutilized)
Starvation: send “pause” tokens upstream to the EVC source after a threshold; dynamic EVC implementations propagate pause tokens for (lmax - 1) links
Some Observations
EVC traffic consumes less energy (pipeline stages skipped, less buffering required) and improves throughput
Buffer management, static: need deeper buffers for the longer source/sink round-trip credit delay; use upper metal layers for faster credit-loop transmission
Buffer management, dynamic: buffer pools with stop-and-go flow control; one buffer reserved for each VC to ensure progress; use multiple thresholds for dynamic EVCs
Performance Uniform traffic Speculation failures begin to catch up in the baseline Contention reduction for EVCs increases throughput From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric, Proceedings of ISCA 2007
Performance (cont.) EVCs reduce contention Effectively partitioning traffic and pre-allocating resources across nodes Also reduces energy Performance difference is sensitive to aggressiveness of the pipeline From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric, Proceedings of ISCA 2007
EVC Summary Effectively performing non-local, pre-allocation of resources Reduces contention Saves energy Improves throughput Can make better use of wire bandwidth if headroom exists Hence should be better than heterogeneous networks
Buffering Strategies
Pipelined Switch Microarchitecture
[Figure: the five-stage pipelined switch — IB, RC, VCA, SA, ST and output buffering — with per-port input/output buffers, crossbar, and link control]
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
Reading
Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, vol. 41, no. 6, pp. 725–734, June 1992.
Y. Choi and T. M. Pinkston, “Evaluation of Queue Designs for True Fully Adaptive Routers,” Journal of Parallel and Distributed Computing, 2004, pp. 606–616.
R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002.
B. Prabhakar and N. McKeown, “On the Speedup Required for Combined Input Output Queuing,” Proceedings of the IEEE International Symposium on Information Theory, August 1998.
Also: definition of switch speedup
Need for Buffering Flow control – downstream buffers are not available Conflict – multiple concurrent requests for the same output port Decisions – routing/processing the packet
Basic Buffer Organization FIFO Strict FIFOs require traversal of the full queue Circular Queue (CQ) Efficient FIFO implementation Strict ordering leads to (HOL) blocking Analogy with in-order instruction issue Central buffering with dynamic allocation Effective sharing of buffer space
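The circular queue can be sketched directly; this illustrative Python model (capacity and names assumed) shows why CQs are an efficient FIFO implementation: enqueue and dequeue are O(1) pointer updates into a fixed buffer, at the cost of strict FIFO ordering.

```python
# Sketch of a circular-queue (CQ) FIFO: head/tail indices into a fixed
# buffer, as used for router input buffers.
class CircularQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = self.tail = self.count = 0

    def enqueue(self, flit):
        if self.count == len(self.buf):
            return False                        # full: exert backpressure
        self.buf[self.tail] = flit
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None
        flit = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return flit

q = CircularQueue(2)
q.enqueue("A"); q.enqueue("B")
print(q.enqueue("C"))   # False (full)
print(q.dequeue())      # A (strict FIFO order: HOL blocking possible)
```

Because only the head is visible, a blocked head packet stalls everything behind it, which is the in-order-issue analogy made above.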
Issue: Utilization vs. Throughput
Shared memory resources: high utilization via dynamic allocation, but stresses I/O rates (channel or switch)
Partitioned resources: scale to match I/O rates, but less flexibility in sharing and hence lower utilization
Options? Multi-porting; physical partitioning; application across physical channels, virtual channels, and the switch
Key Design Issues Where do we place buffers? Input Queued, Output Queued, and Input & Output Queued Decouple internal datapath from physical link transmission Centrally buffered Buffered cross bars How are the buffers designed/organized? FIFO, circular queue (CQ), statically allocated multiqueue (SAMQ), dynamically allocated multiqueue (DAMQ) Impact of buffering strategies link vs. switch speeds Arbitration and scheduling Impact on flow control Consider multicast traffic
Challenges with Central Buffers
High bandwidth requirements: N read ports and N write ports; wide I/O design
Problematic for variable-length packets: requires fast hardware allocation and de-allocation with wide I/O
Uneven traffic: one output port can monopolize storage (there are ways around this!)
Switch Microarchitecture
Basic switch microarchitecture with switch input speedup and switch input & output speedup
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Switch Speedup
Ratio of the switch bandwidth to the line bandwidth; speedup requirements for maximal throughput
Generally speedup is limited to 20%-30% of maximum
The goal is to increase output port utilization and maximize throughput
Note the difference between speedup and parallelism
Independent Buffers Will utilize input buffering FIFO buffers naturally accommodate variable length packets, but….
Buffer Organization
HOL blocking at an input port: [figure: a single input FIFO at port i holds packets destined for X+, X-, Y+, Y-; the blocked head packet stalls all packets behind it]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffer Organization
HOL blocking at an input port using a single queue per port (2D mesh, no VCs, DOR routing)
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffer Organization
HOL blocking is reduced when using virtual channels (2 queues): [figure: input port i demultiplexes into two VC queues feeding output ports X+, X-, Y+, Y-]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffer Organization
HOL blocking removed when using virtual channels (2 queues: VC0, VC1) (2D mesh, 2 VCs, DOR routing)
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffer Organization
HOL blocking remains when using virtual channels (2 queues) if no VCs are available (2D mesh, 2 VCs, DOR routing)
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffer Organization
HOL blocking is avoided at the switch using VOQs (need k queues): [figure: input port i demultiplexes into one queue per output port X+, X-, Y+, Y-]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffer Organization
HOL blocking is avoided at the roots using VOQs, but not at the branches: blocked packets can still cause HOL blocking at the neighboring switch (2D mesh, VOQs, DOR routing)
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
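The VOQ effect at a single switch can be shown in a toy Python sketch (packet and port names are illustrative): with one FIFO per input, a blocked head packet stalls everything behind it; with one queue per output, packets for free outputs can still advance.

```python
# Sketch contrasting a single input FIFO with virtual output queues.
# Only queue heads are eligible to advance in a given cycle.
def sendable(queues, blocked_outputs):
    """Head-of-line packets that could advance this cycle."""
    return [q[0] for q in queues if q and q[0][1] not in blocked_outputs]

# Packets tagged (id, output port); output "X+" is blocked downstream.
single_fifo = [[("p0", "X+"), ("p1", "Y+"), ("p2", "Y-")]]
voqs        = [[("p0", "X+")], [("p1", "Y+")], [("p2", "Y-")]]

print(sendable(single_fifo, {"X+"}))  # []  (HOL blocking)
print(sendable(voqs, {"X+"}))         # [('p1', 'Y+'), ('p2', 'Y-')]
```

As the slide notes, this only removes HOL blocking locally: p0 still occupies a VOQ, and once it advances it can become the blocked head at the neighboring switch.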
Dynamically Allocated Buffers
How should a fixed amount of storage be designed and used? Challenge of variable-sized packets
Goals: avoid head-of-line blocking; per-flow performance a function of total buffer storage; advantages of FIFO without the disadvantages
Baseline: centrally buffered, dynamically allocated (CBDA) switch
Notes: FIFOs make it easier to handle variable-sized packets; cut-through designs require dual-ported buffers
Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.
FIFO Queues Single buffer per port Only one read port and one write port per queue Efficient for variable length packets Logical extension Statically allocate for multiple output ports
Statically Allocated Fully Connected (SAFC)
Complexity: multiple switches (N/4-deep buffers feeding 4x1 crossbars) to be controlled; flow control bandwidth is O(#buffers/port); pre-routing to determine the target buffer at the next router; multiple queue controllers
Efficiency: packets can access only one fourth of the buffer space; variable-length packets cannot make efficient use of buffer space
Notes: fully connected, since each input buffer has a direct connection to the output port it belongs to — effectively a 16x4 crossbar, but with sparse crosspoints since each buffer feeds only one output port
Pre-routing restricts the routing function: once bound to a queue, a packet can no longer go to another port (no adaptive routing)
Statically Allocated Multiqueue (SAMQ)
A single buffer per port, statically partitioned (N/4 per output) into multiple queues; only one read port and one write port per queue
Efficiency and pre-routing are still concerns
Logical extension: improve storage efficiency and per-port queuing — DAMQs
Dynamically Allocated Multiqueue (DAMQ) (Figure: per-queue head/tail pointer registers, free list, shared read and write buses; head of the free list, queues point to the free list when empty; free-list destination pointer not shown.) Per-block (allocation) and per-queue (destination routing) data structures Length and write-register counters for each block Routing adds an incoming packet to the tail of its queue Concurrent access to registers/buffers speeds turnaround time
Dynamically Allocated Multiqueue (DAMQ) (Figure: per-queue data structures as before.) Five lists at each port: A list of packets for each (other) output port The free list A list of packets for the processor port Implements virtual channels
DAMQ Properties Supports fast cut-through Happens only on empty buffers with a free output port Dual-ported SRAM Write and read buses; buffers and registers can be accessed in parallel Shift-register-based block addressing Separate registers for read and write operations enable fast operation Per-port management Three FSMs: buffer manager, routing, transmission Note the implicit routing restriction associated with queue insertion
Performance The centrally buffered switch (CBDA) represents the idealized option Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.
Queue Designs for Adaptive Routing Goal: queue structures that are supportive of true fully adaptive routing Compare with existing designs
Buffer Implementation Buffer Organizations: implementation of VOQs via DAMQs (Figure: switches A, B, and C, each with inputs 1..N, outputs 1..N, and queues (i, j) for input i and output j.) © T.M. Pinkston, J. Duato, with major contributions by J. Flich
Impact of Routing Algorithms Queue structures have historically focused on single-path routing protocols and their associated issues For example, head-of-line (HOL) blocking Virtual output queues essentially pre-route the packet Restricts routing freedom, i.e., what if another port becomes available in the next cycle? What about adaptive routing protocols? What does HOL blocking mean now? What is the “issue” logic now?
Supporting Adaptive Routing Goal: support true, fully adaptive routing Key issue: Multiple output ports are candidates for a packet Flexibility in issuing packets to any “available” output port Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
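The changed “issue” logic can be sketched as follows. This is our own illustrative model, not a design from the cited paper: under fully adaptive routing each queued packet carries a set of candidate output ports, and the issue stage may grant any free, unclaimed member of that set each cycle.

```python
def issue(packets, free_ports):
    """Greedy one-cycle match of queued packets to free output ports.

    packets    : list of (packet_id, candidate_output_ports), in queue order
    free_ports : set of output ports that are free this cycle
    returns    : dict packet_id -> granted output port
    """
    grants = {}
    taken = set()
    for pkt_id, candidates in packets:
        for port in candidates:
            # grant the first candidate that is free and not yet claimed
            if port in free_ports and port not in taken:
                grants[pkt_id] = port
                taken.add(port)
                break
    return grants
```

A real allocator would arbitrate fairly rather than scan greedily in queue order, but the sketch shows the key difference from VOQs: the output port is chosen at issue time, not at enqueue time.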
DAMQs revisited Dynamically Allocated Fully Connected (DAFC) Queue BW vs. crossbar complexity Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.
Recruiting Candidates In each queue, recruit registers identify candidates for other output ports Exclude the native port and the reverse direction When a port queue is empty, recruit packets from other queues Need to hide register updates Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
VC DAMQs Decouple queues from output ports – these are now VCs Assign queues to output ports Does not eliminate HOL – play the odds Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
Evaluation of Implementations Assumptions VCT, input buffering, hierarchical buffer management, 8 packet buffers Use a common baseline approach
Implementation of CQs Tradeoff in granularity vs. cost Overlapping flit and phit access Two log2(N)-bit counters per CQ Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
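The two log2(N)-bit counters are simply wrapping read and write pointers into a circular buffer. A minimal sketch (our own illustration) of a circular queue (CQ) addressed this way:

```python
class CircularQueue:
    """Circular queue addressed by two wrapping counters (rd, wr)."""

    def __init__(self, n):
        self.buf = [None] * n
        self.n = n
        self.rd = self.wr = 0      # the two log2(N)-bit counters
        self.count = 0             # occupancy, to disambiguate full vs. empty

    def push(self, flit):
        if self.count == self.n:
            return False           # queue full
        self.buf[self.wr] = flit
        self.wr = (self.wr + 1) % self.n   # counter wraps modulo N
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None            # queue empty
        flit = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.n
        self.count -= 1
        return flit
```

The cost appeal is clear: two small counters replace the per-block pointer registers of a DAMQ, at the price of strict FIFO order within the queue.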
DAMQ Implementation Model (2K +N) Log N-bit pointer registers Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
DAMQs with Recruit Registers 2 x (K-1) x (K-2) Log N-bit registers Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
Operation Cost Asymmetry of reads and writes Pointer updates in the more complex schemes Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
Performance Impact of VCs allocation Flexibility pays Static vs. dynamic Flexibility pays Cost in storage and speed How about power? Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. 606-616.
A Closer Look at DAMQs Only one output port can read from an input port at a time Conflicts at input ports lead to idle link cycles Solution: pipeline the memory (buffers) (see * below) No more centralized arbitration! Multicast: replication should not incur significant synchronization penalties *M. Katevenis, P. Vatsolaki and A. Efthymiou, “Pipelined Shared Buffer Memory for VLSI Switches,” Proceedings of ACM SIGCOMM, August 1995, pp. 39-48
Taxonomy of Buffering Schemes R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002
HIPIQS: Basic Problem with DAMQs (Figure: DAMQ per-queue data structures as before.) A single input (write) port and a single output (read) port
Basic Idea Key idea: use pipelined buffers and input-queued switches (Figure: Reads 0 through 3 proceeding concurrently.) Concurrent reads from output ports Simple to replicate R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002
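The pipelined-buffer idea can be modeled cycle by cycle. This is a rough sketch under our own assumptions (a fixed read latency, one new read accepted per cycle), not the HIPIQS implementation itself: several output ports can have reads in flight in the buffer memory concurrently.

```python
class PipelinedMemory:
    """Buffer memory that accepts a new read every cycle and returns
    the data LATENCY cycles later, so reads from different output
    ports overlap instead of serializing."""

    LATENCY = 3    # assumed pipeline depth, for illustration

    def __init__(self, data):
        self.data = data
        self.inflight = []         # [remaining_cycles, value] per read

    def start_read(self, addr):
        # a new read can be launched each cycle, regardless of how
        # many earlier reads are still in flight
        self.inflight.append([self.LATENCY, self.data[addr]])

    def cycle(self):
        """Advance one clock; return the reads completing this cycle."""
        done = []
        for r in self.inflight:
            r[0] -= 1
            if r[0] == 0:
                done.append(r[1])
        self.inflight = [r for r in self.inflight if r[0] > 0]
        return done
```

With a single-ported (non-pipelined) memory the second read below could not even start until the first finished; here the two overlap and retire on consecutive cycles.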
Input Module Equivalent to virtual output queuing or buffered crossbars Fast path R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002
Pipelined Buffer Management R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002
Application Proposed for use in multistage networks Pays for performance via increased connectivity, O(K3f) Performance approaches that of output queuing with central buffers
Buffered Crossbars Provide one buffer for each input-output port pair Can achieve 100% throughput Captures VOQ principles
Buffered Crossbar (Figure: inputs 1..N, a crosspoint memory per input-output pair, and one arbiter per output 1..N.) © T.M. Pinkston, J. Duato, with major contributions by J. Flich
Buffering Summary Basic set of design decisions Allocation Static vs. dynamic Physical partitioning Across switch, port, virtual channel Buffer bandwidth Pipelined, multi-ported Location Input, output, or centralized Combinations to meet specific deployment needs
Microarchitecture Summary Power and area: buffers and crossbars account for the majority of power Packet latency through the switch: arbitration, queuing structure Microarchitectural techniques are not that different from those found in cores: pipelined routers, speculative router pipelines, inter-router pipelining/overlap, buffering techniques