Switch Microarchitecture Basics


1 Switch Microarchitecture Basics

2 Reading
Relevant papers are cited in the presentations. Duato, Yalamanchili, and Ni: Sections 7.2.1, 7.2.2, and (pages )

3 Overview
Operation and microarchitecture. Integration with flow control. Impact of switching mechanisms. What does the message pipeline look like? Basis for optimized operation

4 Physical Channel Router
States (head flit): routing, arbitration/allocation, switch traversal. State information (body & tail): output port. State transition (tail): free the channel. Example: routing of a wormhole-switched message. Impact of flit types, e.g., head vs. body vs. tail. Message pipeline
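The per-channel states above can be sketched as a small state machine. This is an illustrative model only, not the hardware design; the class and callback names are invented for the example.

```python
# Illustrative sketch of the per-input-channel state machine for a
# wormhole-switched message. Names (InputChannel, route_fn, arbitrate_fn)
# are hypothetical, chosen for the example.

class InputChannel:
    def __init__(self):
        self.state = "idle"      # idle -> routing -> arbitration -> traversal
        self.output_port = None  # state carried on behalf of body/tail flits

    def on_flit(self, flit_type, route_fn=None, arbitrate_fn=None):
        if flit_type == "head":
            # Only the head flit drives routing and switch arbitration.
            self.state = "routing"
            self.output_port = route_fn()
            self.state = "arbitration"
            arbitrate_fn(self.output_port)
            self.state = "traversal"
        elif flit_type == "body":
            # Body flits simply follow the stored output port.
            assert self.state == "traversal"
        elif flit_type == "tail":
            # The tail flit frees the channel on its way out.
            self.state = "idle"
            self.output_port = None
```

Body and tail flits carry no routing information of their own, which is why the channel must hold the output port as state between the head and the tail.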

5 Physical Channel Router
What does the routing function implementation look like – adaptive vs. deterministic? What does the selection function (adaptive routing) implementation look like? What about arbitration? (more soon)

6 Routing Decisions
Formally represented as a routing function. Examples: mapping from destination to output ports (channels); input port (channel) & destination to output ports (channels); header flit to output ports (channels), e.g., source routing. Distinct for oblivious vs. adaptive routing. Turn restrictions are implemented here

7 Implementation of Routing Functions
Common implementation forms: finite state machine; table look-up. Centralized vs. distributed across input ports (virtual channels). Impact on cycle time, e.g., table size
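A table look-up routing function can be sketched as below; here the table is filled with dimension-order routing (DOR) for a 2D mesh, purely as an illustration. The function and port names are invented for the example.

```python
# Illustrative table look-up routing function: one table per router,
# mapping destination coordinates to an output port. Filled here using
# dimension-order (X then Y) routing on a 2D mesh.

def build_routing_table(my_x, my_y, width, height):
    table = {}
    for dx in range(width):
        for dy in range(height):
            if dx > my_x:
                port = "X+"        # correct X first (dimension order)
            elif dx < my_x:
                port = "X-"
            elif dy > my_y:
                port = "Y+"        # then correct Y
            elif dy < my_y:
                port = "Y-"
            else:
                port = "LOCAL"     # destination is this router
            table[(dx, dy)] = port
    return table

# Table for the router at (1, 1) in a 4x4 mesh.
table = build_routing_table(1, 1, 4, 4)
```

A finite-state-machine implementation would compute the same decision combinationally from the header; the table form trades area (table size, and hence cycle time) for the flexibility to change routes.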

8 A Virtual Channel Router
[Figure: control plane and data plane.] From L. S. Peh and W. J. Dally, “A Delay Model for Router Microarchitectures,” IEEE Micro, January-February 2001

9 A Virtual Channel Router
What does the routing function implementation look like – adaptive vs. deterministic? What does the selection function (adaptive routing) implementation look like? What about arbitration? (more soon) Figure from L. S. Peh and W. J. Dally, “A Delay Model for Router Microarchitectures,” IEEE Micro, January-February 2001

10 Pipelined Switch Microarchitecture
Stage 1: IB (Input Buffering); Stage 2: RC (Routing Computation); Stage 3: VCA (VC Allocation); Stage 4: SA (Switch Allocation); Stage 5: ST (Switch Traversal) & Output Buffering. [Figure: five-stage pipelined router datapath with input buffers, routing computation, VC and switch allocation, crossbar, output buffers, and link control on each physical channel.] L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

11 Some Operational Principles
Data and control planes operate at three rates: phit, flit, and packet. Resources are allocated and de-allocated at these rates. Fixed clock cycle model. What are the atomic switch functions? State management: of resources (allocation); of data (mapping to resources). Granularity of allocation/management is key to deadlock freedom in pipelined switches

12 Buffer States
Input buffers: free, routing, VCA, transmitting, stalled (flow control); output port and output virtual channel; flow control information: stop/go, credits. Output buffers: transmitting, stalled (flow control), free; input port and input virtual channel

13 Pipeline Disruptions
Resource availability disruptions: VC availability; downstream buffer space not available (lack of credits); inter-packet gap is a function of deadlock freedom. Allocated flow disruptions: switch not available; downstream buffer space not available. Disruptions (pipeline bubbles) propagate to the destination through intermediate routers

14 Look at Channel Dependencies
[Figure: five-stage pipelined router datapath (IB, RC, VCA, SA, ST & Output Buffering), repeated from slide 10.] L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

15 A Look at Channel Dependencies
Issue: creating structural dependencies Dependencies between messages due to concurrent use of VC buffers Such dependencies must be globally managed to avoid deadlock Architectural decision: when is a VC freed? When the tail flit releases an input virtual channel When the tail releases the output virtual channel Remember a VC traverses a link!

16 Buffer Occupancy Deeper pipelining increases the buffer turnaround time and decreases occupancy L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

17 Main Functions of Router
Routing, switching and flow control Note modules on Flow Control, Switching Techniques, and Routing Algorithms Allocators and Arbiters: a closer look Atomic modules not easily amenable to pipelining

18 Main Functions of Router
Optimizations Intra-router Pipelined, speculative operation of router functions Inter-router pipelining/overlap Hiding route computation Reservation protocols and express channels Buffer management and arbitration High speed and efficient queue management

19 Allocation vs. Arbitration
What is the difference between arbitration and allocation? [Figure: allocation matches multiple requesters (RQ) to multiple resources; arbitration selects a single winner among requesters]

20 Allocators
Matching: requesters vs. granters → ports or channels. Formally equivalent to a matching problem. Maximal vs. maximum matching. Challenge: fast computation
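The maximal-vs-maximum distinction can be made concrete with a small sketch. A single greedy pass over the request matrix yields a maximal matching (no grant can be added without removing one), which may be smaller than the maximum matching; the function name is invented for the example.

```python
# Illustrative sketch: allocation as bipartite matching. A greedy pass
# produces a maximal (not necessarily maximum) matching of inputs to outputs.

def greedy_maximal_match(requests):
    # requests[i][j] == 1 if input i requests output j
    granted_outputs = set()
    match = {}
    for i, row in enumerate(requests):
        for j, wants in enumerate(row):
            if wants and j not in granted_outputs:
                match[i] = j             # grant the first free output
                granted_outputs.add(j)
                break
    return match

# Input 1 loses: its only request (output 0) was taken by input 0,
# even though the maximum matching {0:1, 1:0, 2:2} would serve everyone.
match = greedy_maximal_match([[1, 1, 0],
                              [1, 0, 0],
                              [0, 1, 1]])
```

Hardware allocators accept this quality gap because a maximum matching is too slow to compute in one cycle; iterative schemes recover some of the loss.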

21 Allocation
Request matrix: inputs × outputs. Grant matrix correctness criteria: only one grant per input, only one grant per output, i.e., only 1 bit set in any row or column. *From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004

22 Separable Allocators
Input first: arbitrate along rows; winners arbitrate along columns. Output first: arbitrate along columns; winners arbitrate along rows. [Figure: request matrix feeding input and output allocators.] *From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
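An input-first separable allocator can be sketched in two passes over the request matrix, as below. Fixed-priority arbiters are used in both stages purely for brevity; a real design would use round-robin or similar arbiters, and the function name is invented.

```python
# Illustrative input-first separable allocator: arbitrate along rows first
# (one surviving request per input), then along columns (one grant per output).
# Fixed-priority arbitration in both stages, for simplicity.

def separable_input_first(requests):
    n = len(requests)
    # Stage 1: each input keeps only one of its requests.
    stage1 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if requests[i][j]:
                stage1[i][j] = 1
                break
    # Stage 2: each output grants at most one surviving request.
    grants = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if stage1[i][j]:
                grants[i][j] = 1
                break
    return grants

# Both inputs pick output 0 in stage 1, so only one grant survives,
# although granting input 0 -> output 1 and input 1 -> output 0 was possible.
grants = separable_input_first([[1, 1],
                                [1, 0]])
```

The lost grant illustrates why separable allocators produce only an approximation to a good matching, motivating the iterative and wavefront schemes discussed later.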

23 Separable Allocator: Operation
Requests per input port arrive from the routing functions; multiple requests per port. Arbitration amongst requests; arbitration is a distinct problem. [Figure: request lines r00–r33 and grant lines g00–g33 through the arbitration stages.] *From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004

24 Classes of Approaches
Exact solutions: time consuming; can be computed offline for known patterns. Heuristics: single step vs. iterative; pipelined implementations; overlapping computation and switch scheduling. Forms: single stage allocation vs. separable

25 Improving Performance
Preventing starvation: function of the arbiter; ensuring “fairness”; dynamically adjust priorities. Improving quality of solution: iterative arbitration; winners at one stage may lose at the next stage, leaving holes in the allocation. Key challenge: speed vs. quality

26 Switch Allocator
Output port allocated for packet duration; low state update rate. Separable allocator; separate allocators for speculative and non-speculative requests. [Figure: non-VC allocator vs. VC allocator.] L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

27 Switch Allocation
Flits bid on a per-cycle basis for crossbar slots. Possible to increase the granularity of bids or the duration for which a crossbar port can be held. SA cannot create deadlock since ports are not held indefinitely. Success in SA is accompanied by flow control updates, for example, transmitting credits. Traversal of the tail flit reinitializes the input channel and resets input/output buffer allocations. L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

28 Virtual Channel Allocation
How are candidates for arbitration created? → the routing function. Alternatives depend on routing flexibility (deterministic vs. fully adaptive). This is the point at which dependencies are created → where deadlock is avoided. L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

29 Wavefront Allocator
Grants are generated when a request has both a row and a column token. The process is seeded by asserting pri along a diagonal. Stagger diagonals to avoid bias. [Figure: allocator cell with row/column token inputs (xin, yin), priority inputs (xpri, ypri), token outputs (xout, yout), request rij, and grant grantij.] *From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
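The wavefront scheme can be emulated in software as below: the sweep starts at the priority diagonal, and a request is granted when its row and column tokens are both still available. This is a behavioral sketch of the idea, not the combinational token-passing circuit; the function name is invented.

```python
# Behavioral sketch of a wavefront allocator. Sweeping diagonals starting
# from the priority diagonal emulates the propagation of row/column tokens;
# a cell grants when it holds both its row token and its column token.

def wavefront_alloc(requests, pri=0):
    n = len(requests)
    row_token = [True] * n
    col_token = [True] * n
    grants = []
    for d in range(n):
        diag = (pri + d) % n
        for i in range(n):
            j = (diag + i) % n          # cells on one wavefront diagonal
            if requests[i][j] and row_token[i] and col_token[j]:
                grants.append((i, j))
                row_token[i] = False    # consume the tokens
                col_token[j] = False
    return grants
```

With all-ones requests, seeding at diagonal 0 grants (0,0) and (1,1), while seeding at diagonal 1 grants (0,1) and (1,0): rotating the priority diagonal is what removes the bias toward any fixed pairing.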

30 Arbitration
What is the difference between arbitration and allocation? [Figure: multiple requesters (RQ), a single winner]

31 Arbitration Issues
Who? Prioritized vs. non-prioritized; metrics: fairness (weighted vs. equal), starvation; mutual exclusion. For how long? Cycle by cycle, or an extended number of cycles. Who decides: the requestor, the granter, or the resource?

32 Switch Microarchitecture
Basic switch microarchitecture. [Figure: input buffers, crossbar, output buffers, link control on each physical channel, and a routing, control, and arbitration unit.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

33 Fairness
Weak fairness: eventually served. Strong fairness: equally served; may be weighted. FIFO fairness: served in the order requested. Local vs. global fairness: cascaded arbitration. *From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004

34 Global Fairness
S3 will get half the link bandwidth; S1 will effectively get 1/8th the link bandwidth

35 Arbitration Techniques
Fixed priority: priority encoder; used in older buses. Variable priority order. Oblivious priority: oblivious to requests and grants. Round robin: rotate the priority. Weighted round robin: proportional grants. *From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
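Round-robin rotation can be sketched in a few lines: after a grant, the priority pointer moves just past the winner, so the most recently served requester becomes lowest priority. The class name is invented for the example.

```python
# Illustrative round-robin arbiter. The pointer marks the highest-priority
# requester; after each grant it rotates past the winner, which gives the
# eventual-service (weak fairness) property under continuous requests.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.pointer = 0            # index with current highest priority

    def arbitrate(self, requests):
        for k in range(self.n):
            i = (self.pointer + k) % self.n
            if requests[i]:
                self.pointer = (i + 1) % self.n   # winner gets lowest priority
                return i
        return None                 # no requests this cycle
```

A weighted variant would skip the rotation for a requester until it has received its share of grants, yielding proportional service instead of equal service.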

36 Main Functions Routing, switching and flow control
Note modules on Flow Control, Switching Techniques, and Routing Algorithms Allocators/arbiters: a closer look Arbiters – atomic modules not amenable to pipelining Optimizations Intra-router Pipelined, speculative operation of router functions Inter-router pipelining/overlap Hiding route computation Reservation protocols and express channels Buffer management and arbitration High speed and efficient queue management

37 Opportunities
Pipeline: IB RC VCA SA ST. How can I reduce latency? Reduce the number of pipeline stages. How can I increase throughput? Increase the number of messages in transit; improve buffer/wire utilization

38 Speculation
Can I shorten the pipeline (IB RC VCA SA ST)? Deterministic routing: concurrently request output port and VC. Adaptive routing: requests can be made for multiple physical output ports

39 Speculation
What can be speculated? Crossbar traversal pending VC allocation; more complex for adaptive routing protocols. Speculative vs. non-speculative flits: header flits vs. body & tail flits; speculative requests have lower priority than non-speculative ones. Overhead of speculation: high traffic loads mask failures in speculation; low traffic loads increase the probability of success

40 Impact of Flow Control & Speculation
[Figure: base performance and the impact of flow control.] L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

41 Main Functions Routing, switching and flow control
Note modules on Flow Control, Switching Techniques, and Routing Algorithms Allocators/arbiters: a closer look Arbiters – atomic modules not amenable to pipelining Optimizations Intra-router Pipelined, speculative operation of router functions Inter-router pipelining/overlap Hiding route computation Reservation protocols and express channels Buffer management and arbitration High speed and efficient queue management

42 Look-Ahead Routing
Implement the routing function for the next node, overlapping RC with the other pipeline stages (IB RC VCA SA ST). Give an example with deterministic routing. Introduced in the Spider chip, M. Galles, “Spider: A High-Speed Network Interconnect,” IEEE Micro, vol. 17, no. 1, Feb. 1997, pp

43 Look-Ahead Routing Pipelining/overlap across routers
Applied to oblivious and deterministic routing functions Table look-up implementation of routing functions Table can encode the index of the table entry in the next node Enables flexible, statically routed, pipelined high speed routers
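The look-ahead idea above can be sketched as follows: while a header sits in the current node, the router computes the output port the packet will need at the next node and carries it in the header. DOR on a 2D mesh is used as the deterministic routing function; the helper names are invented.

```python
# Illustrative look-ahead routing with dimension-order routing (DOR) on a
# 2D mesh: at node `cur` we precompute the output port needed at the *next*
# node, so that node can skip its RC stage. Helper names are hypothetical.

def dor_port(node, dest):
    (x, y), (dx, dy) = node, dest
    if dx != x:
        return "X+" if dx > x else "X-"   # correct X first
    if dy != y:
        return "Y+" if dy > y else "Y-"
    return "LOCAL"

def next_node(node, port):
    x, y = node
    return {"X+": (x + 1, y), "X-": (x - 1, y),
            "Y+": (x, y + 1), "Y-": (x, y - 1)}[port]

def lookahead(cur, dest):
    port_here = dor_port(cur, dest)       # already carried in the header
    nxt = next_node(cur, port_here)
    return dor_port(nxt, dest)            # shipped in the header to `nxt`
```

With a table-based routing function, the same effect is achieved by storing in each table entry the index to use in the next node's table.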

44 A Closer Look
Can I take this idea further? Set up the complete pipeline (IB RC VCA SA ST) at the next node? Break up the packet flow into control and data: we have flit reservation flow control! From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of the International Symposium on High Performance Computer Architecture, 2000

45 Problem
Take a closer look at flow control. Goal: enable this to be zero! Impact on latency & throughput. Figure from L. S. Peh and W. J. Dally, “A Delay Model for Router Microarchitectures,” IEEE Micro, January-February 2001

46 Idealized Operation
[Figure: chain of routers with full buffers.] Ideally the buffers are 100% utilized, i.e., always full as long as there is data to be transmitted. Need to remove control from (in band) the data path

47 Existing Solutions Compiler-based scheduling
Information must be statically known Precludes dynamic optimizations Adaptations of circuit switching Circuit switching Wave switching Pipelined circuit switching

48 Looking a Little Deeper
Flow control latency is in the critical message latency path: how many cycles does FC add per message? Consequently, flow control latency determines bisection bandwidth utilization for a fixed topology and routing algorithm. (Router pipeline: IB RC VCA SA ST LT)

49 Looking a Little Deeper
Key: remove/hide routing/arbitration from (in band) the end-to-end datapath Focus on efficiency of buffer usage rather than solely on efficiency of channel bandwidth Get the benefits of statically scheduled routing with the flexibility of dynamically scheduled routing

50 Impact
Improving buffer occupancy: latency and throughput. Shift the latency/throughput curve to the right; higher saturation throughput

51 Approach Similar to pipelined circuit switching in that control flits setup a path Data flits are transmitted without examination Unique in the goal of hiding/overlapping routing & arbitration overheads with data transmission Applicable to any deterministic routing protocol

52 Pipelined Switch Microarchitecture
[Figure: five-stage pipelined router datapath (IB, RC, VCA, SA, ST & Output Buffering), repeated from slide 10.] L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

53 Router & Routing Schedule Setup
Scheduling information for data flits; scheduled input/output transfers. From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of the International Symposium on High Performance Computer Architecture, 2000

54 Scheduling Reservations
Route; schedule departures; schedule arrivals. From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of the International Symposium on High Performance Computer Architecture, 2000

55 Some Details
Buffers are actually allocated the cycle before a flit arrives: placeholders in the table (we do not know the future); better utilization without the need for buffer transfers. Note credits are turned on in advance, reducing buffer turnaround time. Injection protocol similar to the switch protocol. Early arrivals (data) handled via a free buffer pool

56 Architectural Issues
Control flits traverse a faster network (upper metal layers). Narrow control flits + wide data flits. Good match for on-chip networks

57 Overhead Comparison
Virtual channel FC: overhead is buffer queue pointers, channel status bits, and credit counts; data flits carry a VCID and type field. Flit reservation FC: overhead is control buffers and the I/O reservation table; data flits are payload only; control flits carry arrival times. Approximately 2% for 256-bit data flits

58 Some Performance Results
Base latency improves to 27 cycles from 32 cycles From L. Peh and W. J. Dally, “Flit Reservation Flow Control,” Proceedings of International Symposium on High Performance Computer Architecture, 2000

59 Impact of Scheduling Horizon
Larger horizon improves the probability of successfully scheduling a flit Larger horizon can be exploited only if control flits lead (proportionally) data flits Importance of relative bandwidths of control and data networks Control flit lead time has little impact when control and data flits use the same network.

60 System Design Issues
Per-flit vs. all-or-nothing scheduling: bookkeeping vs. simplicity. Ratio of control flits to data flits: overhead encapsulated in control flits determines the capacity of the control network. Buffer pool implementation: reservations preclude the need for physical partitioning of buffer resources; buffer allocation at arrival time

61 Flit Reservation Conclusion
Resource reservation to improve Throughput: reduce idle time Latency: hide routing and arbitration cost In-band or out-of-band control flow to hide reservations Significant improvements in saturation behavior

62 A Still Closer Look
Can I take this idea even further? What is an ideal network? Direct connections to destinations. Can I approximate an ideal network? Remember express (physical) channels? Energy/delay performance is dominated by routers vs. links. From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

63 Goal Approach the energy and performance characteristics of the ideal interconnect fabric Direct connections between every pair of nodes Key Idea: virtualize physical express links Express traffic has higher priority in use of physical bandwidth Virtual express links skip router pipeline stages at intermediate routers

64 Key Idea
Express channels: the goal was to approach Manhattan wire delay by using express physical channels. Express virtual channels: how can we approach this goal without adding physical wiring? → express virtual channels that span a set of physical routers

65 Ideal Network Properties
[Figure: ideal interconnect for a source-destination pair S→D, parameterized by packet size, propagation velocity, router delay, congestion factor, bandwidth, and average number of hops; power is router power plus interconnect transmission power/bit over the Manhattan wire distance]

66 Performance Gap
Increasingly aggressive routers do better at low loads but eventually have no room to speculate. Even at low loads, the gap is substantial. From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

67 Express Virtual Channels
Add long-distance “channels” and virtualize these physical channels. Express virtual channels (EVCs) bypass stages of the router pipeline. EVCs do not cross dimensions. [Figure: express links between source/sink nodes, passing over bypass nodes]

68 Router Pipelines
Baseline pipeline vs. express pipeline. Look-ahead routing; single-bit signal. EVCs have priority in switch allocation. The express pipeline eliminates BW, RC, VCA and SA. Aggressive pipeline: bypass the crossbar (latch-to-latch transfer). From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

69 Impact on Energy and Throughput
Pipeline: eliminates several energy-consuming stages. Throughput: better utilization of wire bandwidth at high loads. How do other schemes like speculation fare? From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

70 Recap: Baseline Router Microarchitecture
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

71 EVC Router Architecture

72 Router Microarchitecture: A Bypass Node
Lookahead signals set up the bypass through the switch. Non-express pipeline (head and body/tail flits): BW VA SA ST LT. Express pipeline (head and body/tail flits): latch, ST, LT. Aggressive pipelining can reduce the pipeline to LT (bypass the switch altogether). From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

73 Router Microarchitecture: A Source/Sink Node
Choice of allocator depends on the number of hops to the destination. Non-express pipeline (head and body/tail flits): BW VA SA ST LT. Express pipeline (head and body/tail flits): latch, ST, LT. From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

74 EVC Flow Control
[Figure: express links and bypass nodes.] From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

75 EVC Flow Control
Flow control must be managed across multiple links. Credit processing: credit propagation delay, flit delay to the first bypass node, K-1 bypass routers. Look-ahead signal: there is a lookahead signal one step ahead of the EVC flit, which also carries the number of non-express router pipeline stages at the sink node (the other end of the EVC link). Note: EVCs require deeper buffers!

76 Buffer Management From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric, Proceedings of ISCA 2007

77 Dynamic EVCs
Available in a range of distances up to lmax. Every node is a source/sink node and every node is a bypass node; all routers are identical. Unlike static EVCs, dynamic EVCs can adapt to the exact packet route. EVCs remain prioritized over local packets. Partition VCs across all EVC lengths

78 Router Microarchitecture: Dynamic EVCs
From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “ Express Virtual Channels: Towards the Ideal Interconnection Fabric, Proceedings of ISCA 2007

79 Routing with Dynamic EVCs
Smaller steps or larger steps? Load balancing: distribution of VCs across EVCs, non-uniform vs. uniform, to improve utilization (longer-hop EVCs are underutilized). Starvation: “pause” tokens sent upstream to the EVC source after a threshold; dynamic EVC implementations propagate pause tokens for (lmax - 1) links

80 Some Observations
EVC traffic consumes less energy: pipeline stages are skipped; less buffering required. Improves throughput. Buffer management (static): need deeper buffers for the longer source/sink round-trip credit delay; use upper metal layers for faster credit-loop transmission. Buffer management (dynamic): buffer pools with stop-and-go flow control; one buffer reserved for each VC to ensure progress; use multiple thresholds for dynamic EVCs

81 Performance
Uniform traffic. Speculation failures begin to catch up in the baseline. Contention reduction for EVCs increases throughput. From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

82 Performance (cont.)
EVCs reduce contention, effectively partitioning traffic and pre-allocating resources across nodes; this also reduces energy. The performance difference is sensitive to the aggressiveness of the pipeline. From A. Kumar, L.-S. Peh, P. Kundu and N. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” Proceedings of ISCA 2007

83 EVC Summary
Effectively performs non-local pre-allocation of resources: reduces contention, saves energy, improves throughput. Can make better use of wire bandwidth if headroom exists; hence should be better than heterogeneous networks

84 Buffering Strategies

85 Pipelined Switch Microarchitecture
[Figure: five-stage pipelined router datapath (IB, RC, VCA, SA, ST & Output Buffering), repeated from slide 10.] L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

86 Reading
Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June 1992. Y. Choi and T. M. Pinkston, “Evaluation of Queue Designs for True Fully Adaptive Routers,” Journal of Parallel and Distributed Computing, 6(2004), pp. R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002. B. Prabhakar and N. McKeown, “On the Speedup Required for Combined Input Output Queuing,” Proceedings of the IEEE International Symposium on Information Theory, August 1998. Definition of switch speedup

87 Need for Buffering
Flow control: downstream buffers are not available. Conflict: multiple concurrent requests for the same output port. Decisions: routing/processing the packet

88 Basic Buffer Organization
FIFO: strict FIFOs require traversal of the full queue. Circular queue (CQ): efficient FIFO implementation; strict ordering leads to head-of-line (HOL) blocking (analogy with in-order instruction issue). Central buffering with dynamic allocation: effective sharing of buffer space
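The circular-queue organization can be sketched as below: head and tail indices wrap around a fixed array, giving FIFO behavior without shifting entries. The class name is invented for the example.

```python
# Illustrative circular queue (CQ): an efficient FIFO built from head/tail
# indices over a fixed-size array. Strict FIFO order is exactly what causes
# head-of-line blocking: only the head entry is ever visible.

class CircularQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = self.tail = self.count = 0

    def push(self, flit):
        if self.count == len(self.buf):
            return False                     # full: flow control must stall
        self.buf[self.tail] = flit
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None
        flit = self.buf[self.head]           # only the head can be served
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return flit
```

Because `pop` can only return the head entry, a blocked head flit stalls every flit behind it regardless of destination, which motivates the multi-queue organizations that follow.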

89 Issue: Utilization vs. Throughput
Shared memory resources: high utilization via dynamic allocation; stresses I/O rates (channel or switch). Partitioned resources: scaling to match I/O rates; less flexibility in sharing → lower utilization. Options? Multi-porting; physical partitioning. Applies across physical channels, virtual channels, and the switch

90 Key Design Issues
Where do we place buffers? Input queued, output queued, and input & output queued; decouple the internal datapath from physical link transmission; centrally buffered; buffered crossbars. How are the buffers designed/organized? FIFO, circular queue (CQ), statically allocated multiqueue (SAMQ), dynamically allocated multiqueue (DAMQ). Impact of buffering strategies: link vs. switch speeds; arbitration and scheduling; impact on flow control; consider multicast traffic

91 Challenges with Central Buffers
High bandwidth requirements: N read ports and N write ports; wide I/O design. Problematic for variable-length packets: fast hardware allocation and de-allocation with wide I/O. Uneven traffic: one output port can monopolize storage. There are ways around this!

92 Switch Microarchitecture
Basic switch microarchitecture. [Figure: input buffers, crossbar, output buffers, and link control on each physical channel, annotated with switch input speedup and switch input & output speedup.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

93 Switch Speedup
Ratio of the switch bandwidth to the line bandwidth. Speedup requirements for maximal throughput: generally speedup is limited to 20%-30% of maximum. The goal is to increase output port utilization and maximize throughput. Difference between speedup and parallelism

94 Independent Buffers
We will utilize input buffering. FIFO buffers naturally accommodate variable-length packets, but….

95 Buffer Organization
HOL blocking at an input port. [Figure: input port i with input buffers holding flits destined for X+, X-, Y+, Y-; output ports X+, X-, Y+, Y-.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

96 Buffer Organization
HOL blocking at an input port using a single queue per port. [Figure: 2D mesh, no VCs, DOR routing.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

97 Buffer Organization
HOL blocking is reduced when using virtual channels (2 queues). [Figure: input port i with two virtual-channel queues per input buffer, a DEMUX, and output ports X+, X-, Y+, Y-.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

98 Buffer Organization
HOL blocking removed when using virtual channels (2 queues). [Figure: 2D mesh, 2 VCs, DOR routing.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

99 Buffer Organization
HOL blocking remains when using virtual channels (2 queues) and no VCs are available. [Figure: 2D mesh, 2 VCs, DOR routing.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

100 Buffer Organization
HOL blocking is avoided at the switch using VOQs (need k queues). [Figure: input port i with one queue per output port (X+, X-, Y+, Y-) and a DEMUX.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich
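The virtual-output-queue (VOQ) organization can be sketched as below: each input port keeps one queue per output, so a blocked packet for one output never sits in front of a packet for another. The class name is invented for the example.

```python
# Illustrative VOQ input port: one FIFO per output port. The scheduler can
# serve any queue whose output is free, so a blocked head packet for one
# output cannot block traffic bound for another (no HOL blocking here).

from collections import deque

class VOQInputPort:
    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, out_port):
        self.voqs[out_port].append(packet)

    def dequeue_for(self, free_outputs):
        # A single shared FIFO could only ever look at its head entry;
        # here we may serve whichever queue has a currently free output.
        for out in free_outputs:
            if self.voqs[out]:
                return out, self.voqs[out].popleft()
        return None
```

The cost is k queues per input port, and, as the next slide notes, VOQs remove HOL blocking only locally: blocking can reappear at the branches in a downstream switch.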

101 Buffer Organization
HOL blocking is avoided at roots using VOQs, but not at branches!! However: HOL blocking at the neighboring switch!! [Figure: 2D mesh, VOQs, DOR routing.] © T.M. Pinkston, J. Duato, with major contributions by J. Flich

102 Dynamically Allocated Buffers
How should a fixed amount of storage be designed and used? Challenge of variable-sized packets. Goals: avoid head-of-line blocking; per-flow performance a function of total buffer storage; the advantages of FIFOs without the disadvantages. Baseline: centrally buffered, dynamically allocated (CBDA) switch. Note that FIFOs make it easier to handle variable-sized packets, and that cut-through designs required dual-ported buffers. Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June 1992.

103 FIFO Queues
Single buffer per port; only one read port and one write port per queue. Efficient for variable-length packets. Logical extension: statically allocate for multiple output ports

104 Statically Allocated Fully Connected (SAFC)
Complexity: multiple switches to be controlled; flow control bandwidth is O(#buffers/port); pre-routing to determine the target buffer at the next router; multiple queue controllers. Efficiency: packets access only one fourth of the buffer space; variable-length packets cannot make efficient use of buffer space. [Figure: N/4-deep buffers feeding 4x1 crossbars at the outputs.] Notes: fully connected since each input buffer has a direct connection to the output port it belongs to; effectively a 16x4 crossbar, though all crosspoints need not be populated since each buffer feeds only one output port. A FIFO acts like dynamically allocated space with respect to buffer utilization, which is amplified in networks with variable-length packets. Pre-routing limits the routing function: once bound to a queue, a packet can no longer go to another port (no adaptive routing)

105 Statically Allocated Multiqueue (SAMQ)
Single buffer (N/4 per queue) statically allocated to multiple queues
Only one read port and one write port per queue
Efficiency and pre-routing are still concerns
Logical extension: improve storage efficiency and per-port queuing → DAMQs

106 Dynamically Allocated Multiqueue (DAMQ)
[Figure: shared buffer with a write bus, a read bus, and per-queue head/tail pointer registers; empty queues point to the head of the free list (free-list and destination pointers not shown)]
Per-block (allocation) and per-queue (destination routing) data structures
Incoming packet → head of the free list
Length and write-register counters for each block
Routing adds the packet to the tail of its queue
Concurrent access to registers/buffers speeds turnaround time

107 Dynamically Allocated Multiqueue (DAMQ)
[Figure: same shared-buffer organization with per-queue head/tail pointer registers and free list]
Five lists at each port:
  A list of packets for each (other) output port
  The free list
  A list of packets for the processor port
Implements virtual channels

108 DAMQ Properties
Supports fast cut-through
  Happens only on empty buffers with a free output port
Dual-ported SRAM
  Write bus and read bus; buffers and registers can be accessed in parallel
Shift-register-based block addressing
  Separate registers for read and write operations; fast operation
Per-port management
  Three FSMs: buffer manager, routing, transmission
Note the implicit routing restriction associated with queue insertion
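The free-list and per-queue pointer structures of slides 106–108 can be sketched as a behavioral model (class and method names are assumed for illustration; the real design uses shift-register addressing and dual-ported SRAM, not Python lists):

```python
class DAMQ:
    """Toy dynamically allocated multi-queue: one shared block pool,
    a free list, and one linked list of blocks per destination queue."""
    def __init__(self, n_blocks, queues):
        self.data = [None] * n_blocks
        self.next = [i + 1 for i in range(n_blocks)]  # per-block link pointers
        self.next[-1] = -1                            # -1 terminates a list
        self.free_head = 0                            # head of the free list
        self.head = {q: -1 for q in queues}           # per-queue head registers
        self.tail = {q: -1 for q in queues}           # per-queue tail registers

    def enqueue(self, q, flit):
        b = self.free_head
        if b == -1:
            return False                  # shared pool exhausted
        self.free_head = self.next[b]     # pop a block off the free list
        self.data[b], self.next[b] = flit, -1
        if self.head[q] == -1:
            self.head[q] = b              # queue was empty
        else:
            self.next[self.tail[q]] = b   # link at the tail of queue q
        self.tail[q] = b
        return True

    def dequeue(self, q):
        b = self.head[q]
        if b == -1:
            return None                   # queue empty
        self.head[q] = self.next[b]
        if self.head[q] == -1:
            self.tail[q] = -1
        flit = self.data[b]
        self.next[b] = self.free_head     # return the block to the free list
        self.free_head = b
        return flit
```

Both operations are O(1) pointer updates, which mirrors why the hardware can keep the per-queue head/tail registers and the free-list pointer in small register files and update them concurrently with buffer access.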

109 Performance The centrally buffered switch (CBDA) represents the idealized option Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.

110 Queue Designs for Adaptive Routing
Goal: queue structures that are supportive of true fully adaptive routing Compare with existing designs

111 Buffer Implementation
Buffer organizations: implementation of VOQs via DAMQs
[Figure: switches A, B, and C, each with inputs 1..N and outputs 1..N, and a queue (i, j) for every input i / output j pair]

112 Impact of Routing Algorithms
Queue structures have historically focused on single-path routing protocols and their associated issues
  For example, head-of-line (HOL) blocking
Virtual output queues essentially pre-route the packet
  This restricts routing freedom, i.e., what if another port becomes available in the next cycle?
What about adaptive routing protocols? What does HOL mean now? What is the “issue” logic now?

113 Supporting Adaptive Routing
Goal Support true, fully adaptive routing Key Issue Multiple output ports are candidates for a packet Flexibility in issuing packets to “available” output port Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp
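The "issue logic" flexibility this slide asks for can be stated as a one-line policy (a sketch with an invented helper name): a pre-routed, VOQ-bound packet is committed to a single port, while a truly adaptive issue stage may take any currently free candidate.

```python
def issue_adaptive(candidates, free_ports):
    """Adaptive issue sketch: a packet with several candidate output
    ports is issued to whichever one is currently free; candidate order
    stands in for the selection function of an adaptive router."""
    for port in candidates:
        if port in free_ports:
            return port
    return None  # all candidates busy: the packet waits
```

Restricting `candidates` to a single entry recovers the pre-routed behavior criticized on slide 112: the packet blocks even when other ports are free.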

114 DAMQs revisited Dynamically Allocated Fully Connected (DAFC)
Queue BW vs. crossbar complexity Y. Tamir and G. Frazier, “Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches,” IEEE Trans. on Computers, Vol. 41, No. 6, pp. 725–734, June, 1992.

115 Recruiting Candidates
In each queue, recruit registers identify candidates for other output ports Exclude the native port and the reverse direction When a port queue is empty, recruit packets from other queues Need to hide register updates Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp

116 VC DAMQs Decouple queues from output ports – these are now VCs
Assign queues to output ports Does not eliminate HOL – play the odds Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp

117 Evaluation of Implementations
Assumptions VCT, input buffering, hierarchical buffer management, 8 packet buffers Use a common baseline approach

118 Implementation of CQs
Tradeoff in granularity vs. cost
Overlapping flit and phit access
Two log2(N)-bit counters for the CQ
Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp
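The "two log2(N)-bit counters" of the CQ are simply a circular buffer's read and write pointers. A minimal behavioral sketch (names are illustrative; the explicit `count` stands in for the full/empty disambiguation logic):

```python
class CircularQueue:
    """Circular flit buffer addressed by two wrap-around counters,
    the two log2(N)-bit registers of the CQ implementation."""
    def __init__(self, n):
        self.buf = [None] * n
        self.n = n
        self.read_ptr = 0    # first log2(n)-bit counter
        self.write_ptr = 0   # second log2(n)-bit counter
        self.count = 0

    def push(self, flit):
        if self.count == self.n:
            return False                                 # buffer full
        self.buf[self.write_ptr] = flit
        self.write_ptr = (self.write_ptr + 1) % self.n   # counter wraps
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None                                  # buffer empty
        flit = self.buf[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.n
        self.count -= 1
        return flit
```

The appeal noted on the slide is the cost: two small counters per queue, versus the pointer register file a DAMQ needs (next two slides).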

119 DAMQ Implementation Model
(2K + N) log2(N)-bit pointer registers
Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp

120 DAMQs with Recruit Registers
2 × (K−1) × (K−2) log2(N)-bit registers
Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp

121 Operation Cost Asymmetry of reads and writes
Pointer updates in the more complex schemes Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp

122 Performance
Impact of VC allocation: static vs. dynamic
Flexibility pays, at a cost in storage and speed
How about power?
Y. Choi and T. M. Pinkston, “Evaluation of queue designs for true fully adaptive routers,” Journal of Parallel and Distributed Computing, 6(2004), pp

123 A Closer Look at DAMQs
Only one output port can be reading from an input port at a time
  Conflicts at input ports lead to idle link cycles
Solution: pipelined memory (buffers) (see * below)
  No more centralized arbitration!
Multicast: replication should not incur significant synchronization penalties
*M. Katevenis, P. Vatsolaki and A. Efthymiou, “Pipelined Shared Buffer Memory for VLSI Switches,” Proceedings of ACM SIGCOMM, August 1995, pp

124 Taxonomy of Buffering Schemes
R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002

125 HIPIQS: Basic Problem with DAMQs
[Figure: the DAMQ shared buffer with its free list and per-queue pointer registers]
The DAMQ has a single input (write) port and a single output (read) port

126 Basic Idea
Key idea: use pipelined buffers and input-queued switches
  Concurrent reads from the output ports
  Simple to replicate
[Figure: pipelined buffer with four concurrent read ports, Read 0 through Read 3]
R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002
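One way to picture how a pipelined buffer supports concurrent reads (a deliberate simplification of the pipelined-memory idea, with invented names): split the memory into banks and rotate each output port across them, so no two ports ever read the same bank in the same cycle.

```python
def read_schedule(n_banks, n_ports, cycle):
    """Rotating read schedule: in a given cycle, output port j reads
    bank (j + cycle) mod n_banks, so all ports read distinct banks
    concurrently (assumes n_ports <= n_banks)."""
    return {port: (port + cycle) % n_banks for port in range(n_ports)}
```

With four banks and four ports, every cycle assigns the four read streams (Read 0 through Read 3 in the figure) to four distinct banks, and over four cycles each port sweeps the whole buffer.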

127 Input Module
Equivalent to virtual output queuing or buffered crossbars
[Figure: HIPIQS input module, including a fast path]
R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002

128 Pipelined Buffer Management
R. Sivaram, C. Stunkel, and D. K. Panda, “HIPIQS: A High Performance Switch Architecture Using Input Queuing,” IEEE Transactions on Parallel and Distributed Systems, March 2002

129 Application
Proposed for use in multistage networks
Pays for performance via increased connectivity: O(K³f)
Performance approaches that of output queuing with central buffers

130 Buffered Crossbars Provide one buffer for each input-output port pair
Can achieve 100% throughput Captures VOQ principles
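The crosspoint-buffer organization of slides 130–131 can be modeled in a few lines (a behavioral sketch with assumed names and round-robin output arbiters; a real design also needs crosspoint flow control back to the inputs):

```python
from collections import deque

def buffered_crossbar_cycle(xpoint, rr_ptr):
    """One cycle of an N x N buffered crossbar: each output independently
    round-robin-arbitrates among its N crosspoint buffers (one per input).
    xpoint[i][j] is the deque from input i to output j; rr_ptr[j] is
    output j's round-robin pointer, advanced past each winner."""
    n = len(xpoint)
    granted = {}
    for j in range(n):                      # outputs arbitrate independently
        for k in range(n):
            i = (rr_ptr[j] + k) % n
            if xpoint[i][j]:
                granted[j] = xpoint[i][j].popleft()
                rr_ptr[j] = (i + 1) % n     # fairness: rotate past the winner
                break
    return granted
```

Because every input-output pair has its own buffer, inputs never compete for queue space at an output's crosspoints, which is how this organization captures the VOQ principle while keeping arbitration local to each output.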

131 Buffered Crossbar
[Figure: N×N buffered crossbar with a dedicated crosspoint memory for each input 1..N / output 1..N pair, and a per-output arbiter selecting among the N crosspoint buffers]

132 Buffering Summary
Basic set of design decisions:
  Allocation: static vs. dynamic
  Physical partitioning: across switch, port, or virtual channel
  Buffer bandwidth: pipelined, multi-ported
  Location: input, output, or centralized
Combinations of these meet specific deployment needs

133 Microarchitecture Summary
Power and area: buffers and crossbars account for the majority of power
Packet latency through the switch: arbitration and queuing structure
Microarchitectural techniques are not that different from those found in cores:
  Pipelined router
  Speculative router pipeline
  Inter-router pipelining/overlap
  Buffering techniques

