Packet Switches

Packet switches
- In a circuit switch, the path of a sample is determined at connection-establishment time; no sample header is needed, because position in the frame identifies it
- In a packet switch, packets carry a destination field or label, so the destination port must be looked up on the fly
- Datagram switches: lookup based on the entire destination address (longest-prefix match)
- Cell or label switches: lookup based on VCIs or labels
- L2 switches, L3 switches, L4-L7 switches
- The key difference among these is the lookup (i.e., filtering/classification) function, not the switching (forwarding) itself

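To make the datagram lookup step concrete, here is a minimal longest-prefix-match sketch; the forwarding table, addresses, and port numbers are invented for illustration, and a real router would use a trie or TCAM rather than a linear scan.

```python
import ipaddress

# Hypothetical forwarding table: (prefix, output port)
FIB = [
    (ipaddress.ip_network("10.0.0.0/8"), 1),
    (ipaddress.ip_network("10.1.0.0/16"), 2),
    (ipaddress.ip_network("0.0.0.0/0"), 0),   # default route
]

def lookup(dst: str) -> int:
    """Return the output port of the longest prefix that matches dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [(net.prefixlen, port) for net, port in FIB if addr in net]
    return max(candidates)[1]          # longest prefix wins

print(lookup("10.1.2.3"))    # -> 2 (the more specific /16 beats the /8)
print(lookup("192.0.2.7"))   # -> 0 (falls through to the default route)
```
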
Shared Memory Switches
- Dual-ported RAM; incoming cells are converted from serial to parallel
- Elegant, but memory speeds and port counts don't scale
- Output buffering: 100% throughput under heavy load, and buffering is minimized
- Examples: CNET Prelude, Hitachi shared-buffer switch, AT&T GCNS-2000

Shared memory fabrics: more…
- Memory interface hardware is expensive => many "ports" share fewer memory interfaces (e.g., dual-ported memory)
- Separate low-speed bus lines for the controller

Shared Medium Switches
- Share a medium (i.e., a bus, ring, etc.) instead of memory
- The medium has to run N times as fast as a single port
- Address filters and output buffers must run at the medium speed as well!
- TDM + round robin
- Examples: IBM PARIS and plaNET switches, Fore ForeRunner ASX-100, NEC ATOM

Fully Interconnected Switches
- Full interconnection: broadcast plus address filters
- Multicasting is natural
- Output queueing
- All hardware runs at the same speed => scalable
- But quadratic growth of buffers/filters
- The Knockout switch (AT&T) reduced the number of buffers: a fixed L (= 8) buffers per output plus a tournament ("knockout") method to eliminate excess packets
- Small residual packet loss rate (about one in a million)
- Examples: Fujitsu bus matrix, GTE SPANet

Crossbar: "switched" interconnections
- 2N media (i.e., buses), BUT…
- Use "switches" (crosspoints) between each input and output bus instead of broadcasting
- Total number of "paths" (buses) required = N + M
- Number of switching points = N x M
- Arbitration/scheduling is needed to deal with output-port contention

Multi-Stage Fabrics
- A compromise between pure time division and pure space division, attempting to combine the advantages of each:
  - lower cost from time division
  - higher performance from space division
- Technique: limited sharing (e.g., the banyan switch)
- Features: scalable; self-routing, i.e., no central controller; packet queues allowed, but not required
- Note: multi-stage switches share the "crosspoints", which have now become the expensive resource…

Multi-stage switches: fewer crosspoints
- Issue: output and internal blocking…

Banyan Switch Fabric (contd.)
- Basic building block: a 2x2 switching element whose outputs are labelled 0 and 1
- Can be synchronous or asynchronous (asynchronous => packets can arrive at arbitrary times)
- A synchronous banyan offers TWICE the effective throughput!
- Worst case: all inputs receive packets with the same label

Switch fabric element
- Goal: "self-routing" fabrics
- Build complicated fabrics from a simple element
- Routing rule: if the routing bit is 0, send the packet to the upper output, else to the lower output
- If both packets want the same output, buffer or drop one

Multi-stage Interconnects (MINs): Banyan
- Key idea: reduce the number of crosspoints needed in a crossbar
- 8x8 banyan: recursive design
  - Use the first bit of the output address to route the cell through the first stage, either to the upper or to the lower 4x4 network
  - Use the last two bits to route the cell through the 4x4 network to the appropriate output port
- Self-routing: the output address completely specifies the route through the network (aka digit-controlled routing); see the sketch below
- Simple elements, scalable, parallel routing, all elements run at the same speed
- Examples: Bellcore Sunshine, Alcatel DN 1100

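A minimal sketch of digit-controlled routing, using an 8x8 omega network (one member of the banyan family) as the concrete topology; the wiring model and the example cells are my own illustration, not taken from the slides. It also flags internal collisions, which anticipates the blocking discussion below.

```python
# Digit-controlled self-routing through an 8x8 omega network (banyan family):
# between stages the lines are perfect-shuffled, and at stage k each 2x2
# element forwards a cell to its upper (0) or lower (1) output according to
# bit k of the cell's destination address.

N, STAGES = 8, 3

def route(cells):
    """cells: {input_port: dest_port}.  Follows every cell stage by stage and
    reports internal collisions (two cells needing the same element output)."""
    pos = dict(cells)                      # line currently occupied by each cell
    for stage in range(STAGES):
        taken = {}
        for inp, dest in cells.items():
            bit = (dest >> (STAGES - 1 - stage)) & 1
            # perfect shuffle (line -> 2*line mod N), then the 2x2 element
            # overwrites the low bit with the routing bit
            new = ((pos[inp] << 1) & (N - 1)) | bit
            if new in taken:
                print(f"stage {stage}: cells from inputs {taken[new]} and {inp} collide")
            taken[new] = inp
            pos[inp] = new
    return pos                             # uncollided cells end on their destination line

route({0: 0, 4: 1})             # internal blocking even though the destinations differ
print(route({1: 5, 6: 2}))      # a conflict-free pair: ends at {1: 5, 6: 2}
```
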
Banyan Fabric: another view… (figure)

Banyan
- Simplest self-routing recursive fabric
- Two packets wanting the same output => output blocking
- In a banyan, packets may block even when they want different outputs => internal blocking!
- Unlike a crossbar, because the banyan has fewer crosspoints
- However, feasible non-blocking schedules exist => pre-sort and shuffle the packets to reach such a schedule

Non-Blocking Batcher-Banyan
- A Batcher sorter feeding a self-routing (banyan) network
- The fabric can be used as a scheduler
- The Batcher-Banyan network is still blocking for multicast

Blocking in Banyan Switches: Sorting
- Blocking can be avoided by choosing the order in which packets appear at the input ports
- If we can
  - present packets at the inputs sorted by output,
  - "trap" duplicates (i.e., packets going to the same output port),
  - remove gaps, and
  - precede the banyan with a perfect-shuffle stage,
  then there is no internal blocking
- For example, inputs [X, 010, 010, X, 011, X, X, X]:
  - Sort => [010, 010, 011, X, X, X, X, X]
  - Trap duplicates => [010, 011, X, X, X, X, X, X]
  - Shuffle => [010, X, 011, X, X, X, X, X]
- Need sort, shuffle, and trap networks

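A small sketch of that sort-trap-shuffle front end, run on the example above; the X placeholder, the list representation, and the function names are my own simplified illustration of what dedicated sort, trap, and shuffle networks would do in hardware.

```python
X = None  # empty input slot

def sort_by_output(slots):
    """Sort occupied slots by destination; empty slots sink to the end."""
    occupied = sorted(s for s in slots if s is not None)
    return occupied + [X] * (len(slots) - len(occupied))

def trap_duplicates(slots):
    """Keep one packet per destination; trapped duplicates would be recirculated."""
    seen, out = set(), []
    for s in slots:
        if s is None or s in seen:
            out.append(X)
        else:
            seen.add(s)
            out.append(s)
    return sort_by_output(out)   # also removes the gaps left by trapping

def shuffle(slots):
    """Perfect shuffle: interleave the first and second halves."""
    half = len(slots) // 2
    out = []
    for a, b in zip(slots[:half], slots[half:]):
        out.extend([a, b])
    return out

cells = [X, "010", "010", X, "011", X, X, X]
s = sort_by_output(cells)        # ['010', '010', '011', X, X, X, X, X]
t = trap_duplicates(s)           # ['010', '011', X, X, X, X, X, X]
print(shuffle(t))                # ['010', X, '011', X, X, X, X, X]
```
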
Sorting using Merging
- Build sorters from merge networks
- Assume we can merge two sorted lists
- Sort pairwise, merge, recurse; a compact sketch follows

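Batcher's odd-even merge sort is the classic way to realize this recursion as a network of fixed compare-exchange elements. The sketch below just enumerates the comparator pairs and applies them in software, so the function names and list-based wiring are illustrative rather than a hardware description.

```python
def oddeven_merge(lo, hi, r):
    """Comparators that merge two sorted sub-sequences of lines lo..hi
    interleaved at stride r."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)          # merge the even sub-sequence
        yield from oddeven_merge(lo + r, hi, step)      # merge the odd sub-sequence
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    """Comparators that sort lines lo..hi (hi - lo + 1 must be a power of two)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)          # sort the top half
        yield from oddeven_merge_sort(mid + 1, hi)      # sort the bottom half
        yield from oddeven_merge(lo, hi, 1)             # merge the two halves

def run(network, values):
    values = list(values)
    for i, j in network:                                # each pair is a compare-exchange element
        if values[i] > values[j]:
            values[i], values[j] = values[j], values[i]
    return values

comparators = list(oddeven_merge_sort(0, 7))            # an 8-input Batcher sorter (19 comparators)
print(run(comparators, [5, 2, 7, 0, 6, 1, 4, 3]))       # -> [0, 1, 2, 3, 4, 5, 6, 7]
```
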
Putting it together: Batcher-Banyan (figure)

Scaling Banyan Networks: Challenges
1. Batcher-banyan networks of significant size are physically limited by achievable circuit density and the number of input/output pins per integrated circuit; when several boards must be interconnected, interconnection complexity and power dissipation constrain how many boards can be used
2. The entire set of N cells must be synchronized at every stage
3. Large sizes increase the difficulty of reliability and repairability
4. All modifications that maximize the throughput of space-division networks increase implementation complexity

Other Non-Blocking Fabrics: the Clos Network (figure)
- Expansion factor required = 2 - 1/N (but still blocking for multicast)

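The expansion factor comes from Clos's classic non-blocking condition; writing it out (a standard result, with n the number of inputs per first-stage module, which the slide calls N):

```latex
% Three-stage Clos network C(m, n, r): r ingress switches (n x m), m middle
% switches (r x r), r egress switches (m x n).  Clos (1953): strictly
% non-blocking for unicast iff
\[
  m \ge 2n - 1
  \quad\Longrightarrow\quad
  \text{expansion } \frac{m}{n} \ge 2 - \frac{1}{n}.
\]
```
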
Blocking and Buffering

Blocking in packet switches
- Both internal and output blocking can occur
  - Internal: no path to the output
  - Output: the output trunk is unavailable
- Unlike a circuit switch, we cannot predict whether packets will block (why?)
- If a packet is blocked, it must be either buffered or dropped

Dealing with blocking in packet switches
- Over-provisioning: internal links much faster than inputs
- Buffers: at input or output
- Backpressure: if the switch fabric doesn't have buffers, prevent a packet from entering until a path is available
- Parallel switch fabrics: increase the effective switching capacity

Blocking in the Banyan Fabric (figure)

Buffering: where?
- Input
- Output
- Internal
- Re-circulating

Queueing: input and output buffers (figure)

Switch Fabrics: Buffered crossbar
- What happens if packets at two inputs both want to go to the same output?
- One can be deferred in an input buffer
- Or the crosspoints themselves can be buffered, at the cost of a complex arbiter

Queueing: two basic practical techniques
- Input queueing: usually paired with a non-blocking switch fabric (e.g., a crossbar)
- Output queueing: usually paired with a fast bus

Queueing: output queueing
- Individual output queues (ports 1..N): memory bandwidth = (N+1)·R per queue
- Centralized shared memory: memory bandwidth = 2N·R
- (Accounting sketch below)

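The two bandwidth figures just count worst-case reads and writes per cell time at line rate R (my accounting, consistent with the numbers quoted above):

```latex
% Per-output queue: in one cell time it may have to absorb a cell from every
% input and emit one cell on its own output line:
\[
  B_{\text{per-output}} = \underbrace{N R}_{\text{writes}} + \underbrace{R}_{\text{read}} = (N+1)R
\]
% Centralized shared memory: it carries every input's writes and every
% output's reads:
\[
  B_{\text{shared}} = \underbrace{N R}_{\text{writes}} + \underbrace{N R}_{\text{reads}} = 2NR
\]
```
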
Output Queueing (figure)

Input Queueing (figure)

Input Queueing: Head-of-Line Blocking
- (Figure: delay vs. load; delay grows without bound as the offered load approaches 58.6%, well short of 100%)

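The 58.6% figure is the classic head-of-line blocking limit for an input-queued switch with a single FIFO per input under uniform i.i.d. Bernoulli traffic (Karol, Hluchyj and Morgan, 1987):

```latex
\[
  \text{maximum throughput} \;=\; 2 - \sqrt{2} \;\approx\; 0.586
  \qquad (N \to \infty)
\]
```
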
Solution: Input Queueing with Virtual Output Queues (VOQs)
- Each input keeps one queue per output, so a cell is never stuck behind a cell destined for a different (busy) output (figure); a small sketch of the data structure follows

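A minimal sketch of the per-input VOQ bookkeeping; the class and method names are invented for illustration.

```python
from collections import deque

class InputPort:
    """One input port of an N x N switch, with one virtual output queue per output."""
    def __init__(self, num_outputs: int):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, dest: int):
        self.voq[dest].append(cell)        # cells only wait behind cells for the SAME output

    def requests(self):
        """Outputs this input wants to send to in the next cell time."""
        return [d for d, q in enumerate(self.voq) if q]

    def dequeue(self, dest: int):
        return self.voq[dest].popleft()

port = InputPort(num_outputs=4)
port.enqueue("cell-A", dest=2)
port.enqueue("cell-B", dest=0)
print(port.requests())                     # [0, 2]: no head-of-line coupling between them
```
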
Head-of-Line (HOL) blocking in Input Queueing (figure)

Input Queues vs. Virtual Output Queues
- (Figure: delay vs. load; with VOQs the delay curve extends to 100% load)

Input Queueing with a Scheduler
- Memory bandwidth per port = 2R
- The scheduler can be quite complex!

Input Queueing Scheduling

Input Queueing Scheduling: Example
- (Figure: a request graph between inputs and outputs, and one bipartite matching of it with total weight 18)

Input Queueing: Longest Queue First or Oldest Cell First
- Weight = queue length (LQF) or waiting time (OCF)
- (Figure: with these weights, delay stays bounded up to 100% load)

Input Queueing Scheduling
- Maximum size matching
  - Maximizes instantaneous throughput
  - But does it maximize long-term throughput?
- Maximum weight matching
  - Can clear the most backlogged queues
  - But does it sacrifice long-term throughput?
- (A greedy sketch follows below)

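For concreteness, here is a greedy maximal-weight matching sketch over VOQ occupancies. This is a heuristic in the spirit of LQF, not the exact maximum-weight matching that the 100%-throughput theory assumes, and the occupancy matrix below is invented.

```python
def greedy_weight_match(weights):
    """weights[i][j] = occupancy of input i's VOQ for output j (0 = no request).
    Greedily pick the heaviest remaining edge; returns {input: output}."""
    edges = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row) if w > 0),
        reverse=True,
    )
    match, used_in, used_out = {}, set(), set()
    for w, i, j in edges:
        if i not in used_in and j not in used_out:
            match[i] = j
            used_in.add(i)
            used_out.add(j)
    return match

voq_len = [
    [5, 0, 1],   # input 0
    [4, 3, 0],   # input 1
    [0, 2, 2],   # input 2
]
print(greedy_weight_match(voq_len))   # -> {0: 0, 1: 1, 2: 2}, total weight 10
```
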
Input Queueing
- Why is serving long/old queues better than serving the maximum number of queues?
- When traffic is uniformly distributed, serving the maximum number of queues gives 100% throughput
- When traffic is non-uniform, some queues become longer than others
- A good algorithm keeps the queue lengths matched, and still serves a large number of queues
- (Figures: average occupancy per VOQ under uniform vs. non-uniform traffic)

Input Queueing: Practical Algorithms
- Maximal size algorithms: Wave Front Arbiter (WFA), Parallel Iterative Matching (PIM), iSLIP
- Maximal weight algorithms: Fair Access Round Robin (FARR), Longest Port First (LPF)

iSLIP
- (Figure: each iteration has request, grant, and accept/match phases with round-robin selection; two iterations, #1 and #2, are shown)

iSLIP Properties
- Behaves randomly under low load, like TDM under high load
- Gives lowest priority to the most recently used (MRU) port
- 1 iteration: fair to outputs
- Converges in at most N iterations; on average <= log2 N
- Implementation: N priority encoders
- Up to 100% throughput for uniform traffic

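A single-iteration sketch of the SLIP request-grant-accept exchange with round-robin pointers (the multi-iteration version repeats this over the still-unmatched ports). The data layout and variable names are mine, so treat this as an illustration of the idea rather than McKeown's reference implementation.

```python
def slip_iteration(requests, grant_ptr, accept_ptr):
    """requests[i] = set of outputs input i has cells for.
    grant_ptr[j], accept_ptr[i] = round-robin pointers of output j / input i.
    Returns the matching as {input: output} and updates the pointers."""
    n = len(requests)

    # Grant phase: each output grants the first requesting input at or after its pointer.
    grants = {}                                    # input -> set of granting outputs
    for out in range(n):
        for k in range(n):
            inp = (grant_ptr[out] + k) % n
            if out in requests[inp]:
                grants.setdefault(inp, set()).add(out)
                break

    # Accept phase: each input accepts the first granting output at or after its pointer.
    match = {}
    for inp, granting in grants.items():
        for k in range(n):
            out = (accept_ptr[inp] + k) % n
            if out in granting:
                match[inp] = out
                # Pointers advance one beyond the matched port ONLY on acceptance,
                # which desynchronizes the arbiters and avoids starvation.
                accept_ptr[inp] = (out + 1) % n
                grant_ptr[out] = (inp + 1) % n
                break
    return match

requests = [{0, 2}, {0}, {1, 2}]                   # hypothetical VOQ requests
grant_ptr, accept_ptr = [0, 0, 0], [0, 0, 0]
print(slip_iteration(requests, grant_ptr, accept_ptr))   # -> {0: 0, 2: 1}
```
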
iSLIP Implementation
- (Figure: N grant arbiters and N accept arbiters, each a programmable priority encoder with its own state, producing a log2 N-bit decision per port)

Throughput results
Theory:
- Input queueing (IQ): 58% [Karol, 1987]
- IQ + VOQ, maximum weight matching: 100% [McKeown et al., 1995]
- Randomized algorithms: 100% [Tassiulas, 1998]
- IQ + VOQ, maximal size matching, speedup of two: 100% [Dai & Prabhakar, 2000]
Practice:
- IQ + VOQ, sub-maximal size matching, e.g., PIM, iSLIP
- Different weight functions, incomplete information, pipelining
- Various heuristics, distributed algorithms, and amounts of speedup

Speedup: Context
- (Figure: a generic switch with memory at the inputs and at the outputs)
- The placement of memory gives:
  - output-queued switches
  - input-queued switches
  - combined input- and output-queued switches

Output-queued switches
- Best delay and throughput performance
  - Possible to erect "bandwidth firewalls" between sessions
- Main problem: requires a high fabric speedup (S = N, since up to N cells can arrive for the same output in one cell time)
- Hence unsuitable for high-speed switching

Input-queued switches
- Big advantage: a speedup of one is sufficient
- Main problem: can't guarantee delay, due to input contention
- Overcoming input contention: use a higher speedup

The Speedup Problem
- Find a compromise: 1 < speedup << N
  - to get the performance of an OQ switch
  - at close to the cost of an IQ switch
- Essential for high-speed QoS switching

Intuition (Bernoulli IID inputs)
- Speedup = 1: fabric throughput = 0.58
- Speedup = 2: fabric throughput = 1.16; input efficiency = 1/1.16; average input queue = 6.25 cells

Intuition (continued)
- Speedup = 3: fabric throughput = 1.74; input efficiency = 1/1.74; average input queue = 1.35 cells
- Speedup = 4: fabric throughput = 2.32; input efficiency = 1/2.32; average input queue = 0.75 cells

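One way to read these numbers (my interpretation, consistent with the figures quoted above): with speedup S the fabric can complete about 0.58·S HOL-limited matchings' worth of traffic per input per cell time, while each input only needs to clear a load of 1, so the fraction of fabric capacity the inputs actually need shrinks with S:

```latex
\[
  \text{fabric throughput} \approx 0.58\,S, \qquad
  \text{input efficiency} \approx \frac{1}{0.58\,S}
  \quad\bigl(= \tfrac{1}{1.16}, \tfrac{1}{1.74}, \tfrac{1}{2.32}
  \text{ for } S = 2, 3, 4\bigr).
\]
```
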