
1 High Performance Switches and Routers: Theory and Practice. Sigcomm 99, August 30, 1999, Harvard University. Nick McKeown, Balaji Prabhakar, Departments of Electrical Engineering and Computer Science. nickm@stanford.edu, balaji@isl.stanford.edu

2 Copyright 1999. All Rights Reserved. Tutorial Outline. Introduction: What is a Packet Switch? Packet Lookup and Classification: Where does a packet go next? Switching Fabrics: How does the packet get there? Output Scheduling: When should the packet leave?

3 Introduction. What is a Packet Switch? Basic Architectural Components. Some Example Packet Switches. The Evolution of IP Routers.

4 Basic Architectural Components: Policing, Output Scheduling, Switching, Routing, Congestion Control, Reservation, Admission Control. Datapath: per-packet processing.

5 Basic Architectural Components. Datapath: per-packet processing: 1. Forwarding Decision (via a Forwarding Table at each input), 2. Interconnect, 3. Output Scheduling.

6 Where high performance packet switches are used: the Internet core and edge. Enterprise WAN access and enterprise campus switches; carrier-class core routers; ATM switches; Frame Relay switches; edge routers.

7 Introduction. What is a Packet Switch? Basic Architectural Components. Some Example Packet Switches. The Evolution of IP Routers.

8 ATM Switch. Lookup cell VCI/VPI in VC table. Replace old VCI/VPI with new. Forward cell to outgoing interface. Transmit cell onto link.
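The per-cell ATM datapath just listed can be sketched in a few lines; the VC table contents and port numbers below are invented for illustration.

```python
# Toy model of the ATM switch datapath: VCI lookup, VCI rewrite, forward.
vc_table = {
    (1, 42): (3, 17),   # cells on input port 1 with VCI 42 leave port 3 as VCI 17
}

def switch_cell(in_port, in_vci, payload):
    """Look up the cell's VCI in the VC table, rewrite it, and forward."""
    out_port, out_vci = vc_table[(in_port, in_vci)]   # VC table lookup
    return out_port, (out_vci, payload)               # old VCI replaced with new
```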

9 Ethernet Switch. Lookup frame DA in forwarding table. –If known, forward to correct port. –If unknown, broadcast to all ports. Learn SA of incoming frame. Forward frame to outgoing interface. Transmit frame onto link.
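The learn-and-flood behaviour above can be sketched as a toy transparent bridge; port numbers and MAC addresses are illustrative.

```python
class LearningBridge:
    """Toy model of the Ethernet-switch steps above."""
    def __init__(self, ports):
        self.ports = ports
        self.table = {}                 # learned MAC address -> port

    def handle_frame(self, in_port, src, dst):
        self.table[src] = in_port       # learn SA of the incoming frame
        if dst in self.table:           # known DA: forward to the correct port
            return [self.table[dst]]
        # unknown DA: flood to every port except the arrival port
        return [p for p in self.ports if p != in_port]

bridge = LearningBridge([1, 2, 3])
flooded = bridge.handle_frame(1, "00:aa", "00:bb")   # unknown DA: flooded
learned = bridge.handle_frame(2, "00:bb", "00:aa")   # DA already learned
```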

10 IP Router. Lookup packet DA in forwarding table. –If known, forward to correct port. –If unknown, drop packet. Decrement TTL, update header checksum. Forward packet to outgoing interface. Transmit packet onto link.
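The "decrement TTL, update header checksum" step need not re-sum the whole header: the checksum can be patched incrementally, in the style of RFC 1624. A sketch, using an invented header:

```python
def fold(s):
    """One's-complement fold a sum into 16 bits."""
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s

def checksum(words):
    """Full IPv4 header checksum over 16-bit words (checksum field = 0)."""
    return ~fold(sum(words)) & 0xFFFF

def decrement_ttl(ttl_proto, cksum):
    """Decrement TTL (high byte of the TTL/protocol word) and update the
    checksum incrementally: HC' = ~(~HC + ~m + m') per RFC 1624."""
    new_word = ttl_proto - 0x0100
    new_cksum = ~fold((~cksum & 0xFFFF) + (~ttl_proto & 0xFFFF) + new_word) & 0xFFFF
    return new_word, new_cksum

# invented header words; index 4 is TTL/protocol, index 5 the checksum
hdr = [0x4500, 0x0054, 0x1C46, 0x4000, 0x4001, 0x0000, 0xAC10, 0x0A63, 0xAC10, 0x0A0C]
hdr[5] = checksum(hdr)
new_ttl_proto, new_cksum = decrement_ttl(hdr[4], hdr[5])
```

The incremental result agrees with a full recomputation over the patched header.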

11 Introduction. What is a Packet Switch? Basic Architectural Components. Some Example Packet Switches. The Evolution of IP Routers.

12 First-Generation IP Routers: a shared backplane; a central CPU with buffer memory; line interfaces (DMA, MAC).

13 Second-Generation IP Routers: a central CPU with buffer memory; line cards (DMA, MAC), each with local buffer memory.

14 Third-Generation Switches/Routers: a switched backplane; line cards (MAC, local buffer memory); a CPU card.

15 Fourth-Generation Switches/Routers: Clustering and Multistage. (Figure: many switch elements clustered into a multistage system.)

16 Packet Switches: References. J. Giacopelli, M. Littlewood, W. D. Sincoskie, "Sunshine: A high performance self-routing broadband packet switch architecture", ISS '90. J. S. Turner, "Design of a broadcast packet switching network", IEEE Trans. Comm., June 1988, pp. 734-743. C. Partridge et al., "A fifty gigabit per second IP router", IEEE Trans. Networking, 1998. N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A packet switch core", IEEE Micro Magazine, Jan-Feb 1997.

17 Tutorial Outline. Introduction: What is a Packet Switch? Packet Lookup and Classification: Where does a packet go next? Switching Fabrics: How does the packet get there? Output Scheduling: When should the packet leave?

18 Basic Architectural Components. Datapath: per-packet processing: 1. Forwarding Decision (via a Forwarding Table at each input), 2. Interconnect, 3. Output Scheduling.

19 Forwarding Decisions. ATM and MPLS switches –Direct Lookup. Bridges and Ethernet switches –Associative Lookup –Hashing –Trees and tries. IP Routers –Caching –CIDR –Patricia trees/tries –Other methods. Packet Classification.

20 ATM and MPLS Switches: Direct Lookup. The incoming VCI is used directly as the memory address; the data stored there gives the outgoing (Port, VCI).

21 Forwarding Decisions. ATM and MPLS switches –Direct Lookup. Bridges and Ethernet switches –Associative Lookup –Hashing –Trees and tries. IP Routers –Caching –CIDR –Patricia trees/tries –Other methods. Packet Classification.

22 Bridges and Ethernet Switches: Associative Lookups. An associative memory (CAM) is searched with the 48-bit network address and returns the associated data (a log2 N-bit address) and a hit signal. Advantages: simple. Disadvantages: slow, high power, small, expensive.

23 Bridges and Ethernet Switches: Hashing. A hashing function reduces the 48-bit search data to a 16-bit memory address; the memory returns the associated data and a hit signal.

24 Lookups Using Hashing: an example. The hashing function is CRC-16; entries that collide in a bucket are kept on linked lists.

25 Lookups Using Hashing: performance of the simple example. (Figure.)

26 Lookups Using Hashing. Advantages: simple; expected lookup time can be small. Disadvantages: non-deterministic lookup time; inefficient use of memory.
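The hashed-lookup scheme above can be sketched as follows; a Python list stands in for the linked list, and CRC-32 truncated to 16 bits stands in for the CRC-16 on the slide.

```python
import zlib

class HashTable:
    """Hashed 48-bit MAC address lookup: a 16-bit hash selects a bucket,
    and collisions are resolved by chaining within the bucket."""
    def __init__(self, bits=16):
        self.mask = (1 << bits) - 1
        self.buckets = {}

    def _hash(self, addr):
        # stand-in for the CRC-16 hashing function on the slide
        return zlib.crc32(addr.to_bytes(6, "big")) & self.mask

    def insert(self, addr, port):
        self.buckets.setdefault(self._hash(addr), []).append((addr, port))

    def lookup(self, addr):
        for a, port in self.buckets.get(self._hash(addr), []):  # walk the chain
            if a == addr:
                return port
        return None                                             # miss

table = HashTable()
table.insert(0x00AABBCCDDEE, 7)
table.insert(0x001122334455, 2)
```

The non-deterministic lookup time noted on the slide is visible here as the variable chain length walked in `lookup`.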

27 Trees and Tries. A binary search tree over N entries has depth log2 N; a binary search trie branches on successive address bits (0/1).

28 Trees and Tries: multiway tries. A 16-ary search trie consumes four address bits per step; each node holds up to sixteen (entry, pointer) pairs.

29 Trees and Tries: multiway tries. Table produced from 2^15 randomly generated 48-bit addresses.

30 Forwarding Decisions. ATM and MPLS switches –Direct Lookup. Bridges and Ethernet switches –Associative Lookup –Hashing –Trees and tries. IP Routers –Caching –CIDR –Patricia trees/tries –Other methods. Packet Classification.

31 Caching Addresses. Recently used addresses are kept in each line card's local buffer memory (the fast path); misses go to the central CPU and forwarding table over the slow path.

32 Caching Addresses. LAN: average flow < 40 packets. WAN: huge number of flows. (Figure: cache hit rate with a cache 10% the size of the full table.)

33 IP Routers: class-based addresses. The IP address space is divided into Classes A, B, C and D, and routing tables hold exact-match entries, e.g. 212.17.9.0 → Port 4 matches destination 212.17.9.4.

34 IP Routers: CIDR. Class-based addressing divides the space 0 to 2^32-1 into Classes A, B, C and D. Classless addressing uses arbitrary prefixes such as 65/8, 128.9/16 (the 2^16 addresses starting at 128.9.0.0) and 142.12/19; 128.9.16.14 lies inside 128.9/16.

35 IP Routers: CIDR. 128.9.16.14 is covered by both 128.9/16 and 128.9.16/20 (but not by 128.9.176/20, 128.9.19/24 or 128.9.25/24); the most specific route is the "longest matching prefix".

36 IP Routers: metrics for lookups. Example table: 128.9/16 → port 3, 128.9.16/20 → 5, 128.9.176/20 → 2, 128.9.19/24 → 7, 128.9.25/24 → 10, 142.12/19 → 1, 65/8 → 3; lookup address 128.9.16.14. Metrics: lookup time, storage space, update time, preprocessing time.
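Longest-prefix matching on the example table above can be sketched as a naive linear scan; real routers use the trie and hardware schemes discussed in the following slides, but the scan pins down what "most specific" means.

```python
import ipaddress

# the example routing table from the slide (prefix -> port)
table = {
    "128.9.0.0/16": 3, "128.9.16.0/20": 5, "128.9.176.0/20": 2,
    "128.9.19.0/24": 7, "128.9.25.0/24": 10, "142.12.0.0/19": 1,
    "65.0.0.0/8": 3,
}

def longest_prefix_match(dst):
    """Return the port of the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(dst)
    best, best_len = None, -1
    for prefix, port in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best, best_len = port, net.prefixlen
    return best
```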

37 IP Router Lookup: IPv4 unicast destination-address-based lookup. The forwarding engine extracts the destination address from each incoming packet's header, and the next-hop computation consults a forwarding table of (destination, next hop) entries.

38 Need more than IPv4 unicast lookups. Multicast PIM-SM: longest prefix matching on the source and group address; try (S,G), then (*,G), then (*,*,RP); check the incoming interface. DVMRP: incoming-interface check followed by an (S,G) lookup. IPv6: 128-bit destination address field; exact address architecture not yet known.

39 Lookup Performance Required. Gigabit Ethernet (84B packets): 1.49 Mpps. By line rate (40B / 240B packets): T1, 1.5 Mbps: 4.68 Kpps / 0.78 Kpps. OC3, 155 Mbps: 480 Kpps / 80 Kpps. OC12, 622 Mbps: 1.94 Mpps / 323 Kpps. OC48, 2.5 Gbps: 7.81 Mpps / 1.3 Mpps. OC192, 10 Gbps: 31.25 Mpps / 5.21 Mpps.

40 Size of the Routing Table. Source: http://www.telstra.net/ops/bgptable.html

41 Ternary CAMs. Each entry is a (value, mask) pair: 10.0.0.0 / 255.0.0.0 → R1; 10.1.0.0 / 255.255.0.0 → R2; 10.1.1.0 / 255.255.255.0 → R3; 10.1.3.0 / 255.255.255.0 → R4. All entries are compared in parallel and a priority encoder selects the next hop among the matches, so searching for 10.1.3.1 returns R4.
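A software model of the ternary-CAM lookup above: hardware compares every (value, mask) entry in parallel, while the priority encoder corresponds to taking the first match in an ordered list (longest mask first here).

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# (value, mask, next hop), sorted longest-mask first as on the slide
tcam = [
    (ip("10.1.3.0"), ip("255.255.255.0"), "R4"),
    (ip("10.1.1.0"), ip("255.255.255.0"), "R3"),
    (ip("10.1.0.0"), ip("255.255.0.0"),   "R2"),
    (ip("10.0.0.0"), ip("255.0.0.0"),     "R1"),
]

def tcam_lookup(dst):
    for value, mask, hop in tcam:     # priority encoder = first match wins
        if dst & mask == value:
            return hop
    return None
```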

42 Binary Tries. Example prefixes: a) 00001, b) 00010, c) 00011, d) 001, e) 0101, f) 011, g) 100, h) 1010, i) 1100, j) 11110000.

43 Patricia Tree. Same example prefixes; chains of single-child trie nodes are compressed, with each node storing a skip count (e.g. Skip=5 on the path to j).

44 Patricia Tree. Disadvantages: many memory accesses; may need backtracking; pointers take up a lot of space. Advantages: general solution; extensible to wider fields. Backtracking can be avoided by storing the intermediate best-matched prefix (Dynamic Prefix Tries). 40K entries: 2MB data structure at 0.3-0.5 Mpps [O(W)].

45 Binary search on trie levels. Probe a middle level (e.g. level 8) and move deeper on a hit or shallower on a miss, between level 0 and level 29, to find the best-matching prefix P.

46 Binary search on trie levels. Store a hash table for each prefix length to aid search at a particular trie level. Example prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24. Hash tables by length: 8 → {10}; 16 → {10.1, 10.2}; 24 → {10.1.1, 10.1.2, 10.2.3}. Example addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8.

47 Binary search on trie levels. Disadvantages: multiple hashed memory accesses; updates are complex. Advantages: scalable to IPv6. 33K entries: 1.4MB data structure at 1.2-2.2 Mpps [O(log W)].
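A sketch of binary search over prefix lengths using the slide's example prefixes. One hash table per length; markers carrying a precomputed best-match (built here by a brute-force helper) are planted at shorter lengths, which is why entries such as 10.2 appear in the length-16 table even though there is no /16 prefix there. The port labels P8 to P24c are invented.

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def key(addr, length):
    return addr >> (32 - length)

prefixes = {("10.0.0.0", 8): "P8", ("10.1.0.0", 16): "P16",
            ("10.1.1.0", 24): "P24a", ("10.1.2.0", 24): "P24b",
            ("10.2.3.0", 24): "P24c"}
lengths = sorted({l for (_, l) in prefixes})
tables = {l: {} for l in lengths}

def best_match_upto(addr, max_len):
    """Build-time helper: longest real prefix of addr with length <= max_len."""
    best, best_len = None, -1
    for (p, l), port in prefixes.items():
        if l <= max_len and l > best_len and key(addr, l) == key(ip(p), l):
            best, best_len = port, l
    return best

# plant an entry (real prefix or marker) at every shorter length, then fill
# each entry with its precomputed best match (all markers here are covered)
for (p, l), port in prefixes.items():
    for ml in lengths:
        if ml <= l:
            tables[ml].setdefault(key(ip(p), ml), None)
for l in lengths:
    for k in tables[l]:
        tables[l][k] = best_match_upto(k << (32 - l), l)

def lookup(dst):
    addr, best = ip(dst), None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:                       # binary search on prefix lengths
        mid = (lo + hi) // 2
        hit = tables[lengths[mid]].get(key(addr, lengths[mid]))
        if hit is not None:
            best, lo = hit, mid + 1       # hit: try longer prefixes
        else:
            hi = mid - 1                  # miss: try shorter prefixes
    return best
```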

48 Compacting Forwarding Tables. (Figure: the trie is encoded as a bit vector.)

49 Compacting Forwarding Tables. The bit vector (e.g. 1000101011100010100000101011010011000000) is divided into chunks; a codeword array and a base index array locate each entry's position in the next-hop list (R1, R2, R3, R4, R5).

50 Compacting Forwarding Tables. Disadvantages: scalability to larger tables?; updates are complex. Advantages: extremely small data structure, able to fit in cache. 33K entries: 160KB data structure at an average 2 Mpps [O(W/k)].

51 Multi-bit Tries. The 16-ary search trie again: four address bits are consumed per step.

52 Compressed Tries. With levels L8, L16 and L24, only 3 memory accesses are needed.

53 Routing Lookups in Hardware. (Figure: number of prefixes vs. prefix length.) Most prefixes are 24 bits or shorter.

54 Routing Lookups in Hardware: prefixes up to 24 bits. The top 24 bits of the address (e.g. 142.19.6 of 142.19.6.14) directly index a 2^24 = 16M-entry table whose entries hold the next hop.

55 Routing Lookups in Hardware: prefixes above 24 bits. For 128.3.72.44, the first-table entry for 128.3.72 holds a pointer instead of a next hop (signalled by a flag bit); the low 8 bits of the address (44) are added as an offset to that pointer's base to index a second table of next hops.

56 Routing Lookups in Hardware, generalized. A first table of 2^N entries covers prefixes up to N bits; entries for longer prefixes point into second-level tables covering the next M bits; prefixes longer than N+M bits continue into further next-hop tables.

57 Routing Lookups in Hardware. Disadvantages: large memory required (9-33MB), depending on the prefix-length distribution. Advantages: 20 Mpps with 50ns DRAM; easy to implement in hardware. Various compression schemes can decrease the storage requirements, e.g. carefully chosen variable-length strides, bitmap compression, etc.
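The two-table scheme of slides 54-55 can be sketched as below; Python dicts stand in for the 2^24-entry DRAM and the second-stage table, and the routes are invented.

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

TBL24 = {}     # top-24-bit index -> ("hop", next_hop) or ("ptr", base)
TBLlong = {}   # (base, low 8 bits) -> next_hop
_next_base = [0]

def add_short(prefix, plen, hop):           # prefixes with plen <= 24
    start = ip(prefix) >> 8
    for i in range(start, start + (1 << (24 - plen))):
        TBL24[i] = ("hop", hop)

def add_long(prefix, plen, hop):            # prefixes with 24 < plen <= 32
    idx = ip(prefix) >> 8
    if TBL24.get(idx, ("hop", None))[0] != "ptr":
        base, _next_base[0] = _next_base[0], _next_base[0] + 1
        old = TBL24.get(idx, ("hop", None))[1]
        for low in range(256):              # seed with the old 24-bit answer
            TBLlong[(base, low)] = old
        TBL24[idx] = ("ptr", base)
    base = TBL24[idx][1]
    start = ip(prefix) & 0xFF
    for low in range(start, start + (1 << (32 - plen))):
        TBLlong[(base, low)] = hop

def lookup(dst):
    addr = ip(dst)
    kind, val = TBL24.get(addr >> 8, ("hop", None))
    return val if kind == "hop" else TBLlong[(val, addr & 0xFF)]

add_short("128.3.0.0", 16, "A")
add_long("128.3.72.192", 29, "B")
```

Every lookup costs at most two memory accesses, which is the point of the scheme.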

58 IP Router Lookups: References. A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small forwarding tables for fast routing lookups", Sigcomm 1997, pp. 3-14. B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search", Infocom 1998, pp. 1248-56, vol. 3. M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high speed IP routing lookups", Sigcomm 1997, pp. 25-36. P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds", Infocom 1998, pp. 1241-1248, vol. 3. S. Nilsson, G. Karlsson, "Fast address lookup for Internet routers", IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998. V. Srinivasan, G. Varghese, "Fast IP lookups using controlled prefix expansion", Sigmetrics, June 1998.

59 Forwarding Decisions. ATM and MPLS switches –Direct Lookup. Bridges and Ethernet switches –Associative Lookup –Hashing –Trees and tries. IP Routers –Caching –CIDR –Patricia trees/tries –Other methods. Packet Classification.

60 Providing Value-Added Services: some examples. Differentiated services: regard traffic from Autonomous System #33 as "platinum-grade". Access Control Lists: deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp. Committed Access Rate: rate-limit WWW traffic from sub-interface #739 to 10 Mbps. Policy-based Routing: route all voice traffic through the ATM network.

61 Packet Classification. The classification engine matches each incoming packet's header against a classifier (policy database) of (predicate, action) rules and applies the resulting action.

62 Multi-field Packet Classification. Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.
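The problem statement above has a one-screen brute-force solution, useful as a reference point for the faster schemes that follow; the two-field rules below are invented.

```python
import ipaddress

# rules in priority order over two prefix fields; first match wins
rules = [
    (ipaddress.ip_network("144.24.0.0/16"), ipaddress.ip_network("64.0.0.0/24"), "deny"),
    (ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_network("0.0.0.0/0"), "permit"),
]

def classify(field1, field2):
    """Return the action of the highest priority matching rule."""
    a1, a2 = ipaddress.ip_address(field1), ipaddress.ip_address(field2)
    for p1, p2, action in rules:          # linear scan, priority order
        if a1 in p1 and a2 in p2:
            return action
```

The linear scan is O(N) per packet; the schemes on the next slides exist precisely to beat this.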

63 Geometric Interpretation in 2D. Each rule is a rectangle in the (Field #1, Field #2) plane, e.g. (128.16.46.23, *) or (144.24/16, 64/24); rules R1-R7 may overlap, and each arriving packet is a point, such as P1 or P2.

64 Proposed Schemes. (Table omitted.)

65 Proposed Schemes (contd.). (Table omitted.)

66 Proposed Schemes (contd.). (Table omitted.)

67 Grid of Tries. A trie on dimension 1 whose nodes point to tries on dimension 2 holding the rules R1-R7.

68 Grid of Tries. Disadvantages: static solution; not easy to extend to higher dimensions. Advantages: a good solution for two dimensions. 20K entries: 2MB data structure with 9 memory accesses [at most 2W].

69 Classification using Bit Parallelism. For each field, a lookup returns a bit vector with one bit per rule (e.g. 1100 and 1011 for rules R1-R4); ANDing the vectors from all fields yields the set of matching rules.

70 Classification using Bit Parallelism. Disadvantages: large memory bandwidth; hardware-optimized. Advantages: a good solution for multiple dimensions for small classifiers. 512 rules: 1 Mpps with a single FPGA and five 128KB SRAM chips.

71 Classification Using Multiple Fields: Recursive Flow Classification. The packet header fields F1, F2, F3, F4, ..., Fn (2^S = 2^128 possible headers) are reduced in stages through memories (to 2^64, then 2^24) until a final action memory of 2^T = 2^12 entries is reached.

72 Packet Classification References. T. V. Lakshman, D. Stiliadis, "High speed policy based packet forwarding using efficient multi-dimensional range matching", Sigcomm 1998, pp. 191-202. V. Srinivasan, S. Suri, G. Varghese, M. Waldvogel, "Fast and scalable layer 4 switching", Sigcomm 1998, pp. 203-214. V. Srinivasan, G. Varghese, S. Suri, "Fast packet classification using tuple space search", to be presented at Sigcomm 1999. P. Gupta, N. McKeown, "Packet classification using hierarchical intelligent cuttings", Hot Interconnects VII, 1999. P. Gupta, N. McKeown, "Packet classification on multiple fields", Sigcomm 1999.

73 Tutorial Outline. Introduction: What is a Packet Switch? Packet Lookup and Classification: Where does a packet go next? Switching Fabrics: How does the packet get there? Output Scheduling: When should the packet leave?

74 Switching Fabrics. Output and Input Queueing. Output Queueing. Input Queueing –Scheduling algorithms –Combining input and output queues –Other non-blocking fabrics –Multicast traffic.

75 Basic Architectural Components. Datapath: per-packet processing: 1. Forwarding Decision (via a Forwarding Table at each input), 2. Interconnect, 3. Output Scheduling.

76 Interconnects: two basic techniques. Output queueing, usually over a fast bus; and input queueing, usually with a non-blocking switch fabric (e.g. a crossbar).

77 Interconnects: Output Queueing. With individual output queues, each memory needs bandwidth (N+1)·R; with centralized shared memory, the memory needs bandwidth 2N·R.

78 Output Queueing: the "ideal". (Figure: arriving cells pass straight through the fabric into their output queues.)

79 Output Queueing. How fast can we make centralized shared memory? With a 200-byte bus and 5ns SRAM: 5ns per memory operation and two memory operations per packet give up to 160 Gb/s; in practice, closer to 80 Gb/s.

80 Switching Fabrics. Output and Input Queueing. Output Queueing. Input Queueing –Scheduling algorithms –Other non-blocking fabrics –Combining input and output queues –Multicast traffic.

81 Interconnects: input queueing with a crossbar. A scheduler configures the crossbar each cell time; each input's memory bandwidth is only 2R.

82 Input Queueing: head-of-line blocking. (Figure: delay vs. load, saturating at 58.6% rather than 100%.)

83 Head of Line Blocking. (Figures.)


86 Input Queueing: virtual output queues. Each input keeps a separate queue per output.

87 Input Queueing: virtual output queues. (Figure: delay vs. load, now reaching 100%.)

88 Input Queueing. The scheduler can be quite complex! Memory bandwidth remains 2R.

89 Input Queueing Scheduling.

90 Input Queueing Scheduling. The occupied VOQs form a request graph from inputs 1-4 to outputs 1-4 with weights (e.g. 2, 5, 2, 4, 2, 7); the scheduler must choose a bipartite matching (weight = 18 in the example). Question: maximum weight or maximum size?

91 Input Queueing Scheduling. Maximum Size: maximizes instantaneous throughput, but does it maximize long-term throughput? Maximum Weight: can clear the most backlogged queues, but does it sacrifice long-term throughput?

92 Input Queueing Scheduling. (Figure: a 2x2 example.)

93 Input Queueing: Longest Queue First or Oldest Cell First. Weight = queue length (LQF) or waiting time (OCF); 100% throughput.

94 Input Queueing. Why is serving long/old queues better than serving the maximum number of queues? When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput. When traffic is non-uniform, some queues become longer than others; a good algorithm keeps the queue lengths matched and still services a large number of queues. (Figures: average VOQ occupancy under uniform and non-uniform traffic.)

95 Input Queueing: Practical Algorithms. Maximal Size Algorithms: –Wave Front Arbiter (WFA) –Parallel Iterative Matching (PIM) –iSLIP. Maximal Weight Algorithms: –Fair Access Round Robin (FARR) –Longest Port First (LPF).

96 Wave Front Arbiter. (Figure: a 4x4 request matrix and the resulting match.)

97 Wave Front Arbiter. (Figure: the wave of decisions sweeping diagonally across the request matrix.)

98 Wave Front Arbiter Implementation. An array of combinational logic blocks, one per (input, output) pair (1,1) through (4,4).

99 Wave Front Arbiter: Wrapped WFA (WWFA). Wrapping the diagonals gives N steps instead of 2N-1.

100 Input Queueing: Practical Algorithms. Maximal Size Algorithms: –Wave Front Arbiter (WFA) –Parallel Iterative Matching (PIM) –iSLIP. Maximal Weight Algorithms: –Fair Access Round Robin (FARR) –Longest Port First (LPF).

101 Parallel Iterative Matching. Each iteration has three phases: inputs send requests; each output grants one request, chosen by random selection; each input accepts one grant. Matched pairs drop out and the unmatched ports repeat in iteration #2.

102 Parallel Iterative Matching: maximal is not maximum. The match PIM converges to can be smaller than the maximum-size match for the same requests.

103 Parallel Iterative Matching: analytical results. The expected number of iterations to converge is O(log N).

104 Parallel Iterative Matching. (Simulation figures.)

105 Parallel Iterative Matching. (Simulation figures.)

106 Parallel Iterative Matching. (Simulation figures.)

107 Input Queueing: Practical Algorithms. Maximal Size Algorithms: –Wave Front Arbiter (WFA) –Parallel Iterative Matching (PIM) –iSLIP. Maximal Weight Algorithms: –Fair Access Round Robin (FARR) –Longest Port First (LPF).

108 iSLIP. Like PIM, but the grant and accept choices use round-robin pointers instead of random selection.

109 iSLIP Properties. Behaves randomly under low load; becomes TDM-like under high load; gives lowest priority to the most recently used (MRU) port; with 1 iteration, fair to outputs; converges in at most N iterations, on average <= log2 N; implementation: N priority encoders; up to 100% throughput for uniform traffic.
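One iteration of the grant-accept idea above can be sketched as follows; this is a simplified single-iteration model, not the full iSLIP specification, but it shows the round-robin pointers desynchronizing after a match.

```python
def islip_iteration(requests, grant_ptr, accept_ptr, n):
    """requests: set of (input, output) pairs with queued cells.
    grant_ptr / accept_ptr: per-output and per-input round-robin pointers,
    updated in place only when a match is made."""
    grants = {}
    for out in range(n):
        reqs = [i for i in range(n) if (i, out) in requests]
        if reqs:
            # grant the requesting input at or after this output's pointer
            grants[out] = min(reqs, key=lambda i: (i - grant_ptr[out]) % n)
    matches = {}
    for inp in range(n):
        outs = [o for o, i in grants.items() if i == inp]
        if outs:
            # accept the granting output at or after this input's pointer
            o = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
            matches[inp] = o
            grant_ptr[o] = (inp + 1) % n      # pointers move past the match
            accept_ptr[inp] = (o + 1) % n
    return matches
```

In the test below, input 0 wins output 0 in the first slot; the moved pointers then steer output 0 to input 1, so the second slot matches both ports.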

110 iSLIP. (Simulation figures.)

111 iSLIP. (Simulation figures.)

112 iSLIP Implementation. Programmable priority encoders perform the grant and accept steps for ports 1 through N, each keeping N bits of state and producing a log2 N-bit decision.

113 Input Queueing References. M. Karol et al., "Input vs output queueing on a space-division packet switch", IEEE Trans. Comm., Dec 1987, pp. 1347-1356. Y. Tamir, "Symmetric crossbar arbiters for VLSI communication switches", IEEE Trans. Parallel and Dist. Sys., Jan 1993, pp. 13-27. T. Anderson et al., "High-speed switch scheduling for local area networks", ACM Trans. Comp. Sys., Nov 1993, pp. 319-352. N. McKeown, "The iSLIP scheduling algorithm for input-queued switches", IEEE Trans. Networking, April 1999, pp. 188-201. C. Lund et al., "Fair prioritized scheduling in an input-buffered switch", Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69. A. Mekkittikul et al., "A practical scheduling algorithm to achieve 100% throughput in input-queued switches", IEEE Infocom 98, April 1998.

114 Switching Fabrics. Output and Input Queueing. Output Queueing. Input Queueing –Scheduling algorithms –Other non-blocking fabrics –Combining input and output queues –Multicast traffic.

115 Other Non-Blocking Fabrics: Clos Network.

116 Other Non-Blocking Fabrics: Clos Network. Expansion factor required = 2-1/N (but still blocking for multicast).

117 Other Non-Blocking Fabrics: Self-Routing Networks. (Figure: a banyan network connecting inputs 000-111 to outputs 000-111.)

118 Other Non-Blocking Fabrics: Self-Routing Networks. A Batcher sorter feeding a self-routing banyan gives the non-blocking Batcher-Banyan network; the fabric can be used as a scheduler, but the Batcher-Banyan network is blocking for multicast.
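The self-routing property is simple enough to state in code: in an 8-port banyan, the 2x2 element at stage k steers the packet up or down according to bit k of the destination port, with no central control.

```python
def route(dest, stages=3):
    """Up/down decision (bit value) taken at each stage, MSB first,
    for a 2^stages-port banyan."""
    return [(dest >> (stages - 1 - k)) & 1 for k in range(stages)]
```

For example, a packet for output 101 goes down, up, down through the three stages, wherever it enters.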

119 Switching Fabrics. Output and Input Queueing. Output Queueing. Input Queueing –Scheduling algorithms –Other non-blocking fabrics –Combining input and output queues –Multicast traffic.

120 Speedup. Context: input-queued switches, output-queued switches, the speedup problem. Early approaches. Algorithms. Implementation considerations.

121 Speedup: Context. In a generic switch, the placement of memory gives output-queued switches, input-queued switches, or combined input- and output-queued (CIOQ) switches.

122 Output-queued switches. Best delay and throughput performance; possible to erect "bandwidth firewalls" between sessions. Main problem: requires high fabric speedup (S = N), making it unsuitable for high-speed switching.

123 Input-queued switches. Big advantage: a speedup of one is sufficient. Main problem: delay can't be guaranteed due to input contention. Overcoming input contention: use a higher speedup.

124 A Comparison: memory speeds for a 32x32 switch. Output-queued vs. input-queued (memory bandwidth, access time per cell): 100 Mb/s line rate: 3.3 Gb/s, 128 ns vs. 200 Mb/s, 2.12 µs. 1 Gb/s: 33 Gb/s, 12.8 ns vs. 2 Gb/s, 212 ns. 2.5 Gb/s: 82.5 Gb/s, 5.12 ns vs. 5 Gb/s, 84.8 ns. 10 Gb/s: 330 Gb/s, 1.28 ns vs. 20 Gb/s, 21.2 ns.

125 The Speedup Problem. Find a compromise, 1 < Speedup << N, to get the performance of an OQ switch at close to the cost of an IQ switch. Essential for high speed QoS switching.

126 Some Early Approaches. Probabilistic analyses: assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, "friendly correlated"); obtain mean throughput and delays, and bounds on tails; analyze different fabrics (crossbar, multistage, etc). Numerical methods: use actual and simulated traffic traces; run different algorithms; set the "speedup dial" at various values.

127 The findings. Very tantalizing: under different settings (traffic, loading, algorithm, etc), and even for varying switch sizes, a speedup of between 2 and 5 was sufficient!

128 Using Speedup. (Figure: with speedup, the fabric transfers more than one cell per input per time slot.)

129 Intuition. Speedup = 1: fabric throughput = 0.58 for Bernoulli IID inputs. Speedup = 2: fabric throughput = 1.16 for Bernoulli IID inputs; input efficiency = 1/1.16; average input queue = 6.25.

130 Intuition (continued). Speedup = 3: fabric throughput = 1.74 for Bernoulli IID inputs; input efficiency = 1/1.74; average input queue = 1.35. Speedup = 4: fabric throughput = 2.32 for Bernoulli IID inputs; input efficiency = 1/2.32; average input queue = 0.75.

131 Issues. Need hard guarantees: exact, not average. Robustness: realistic, even adversarial, traffic, not friendly Bernoulli IID.

132 The Ideal Solution. Question: with Speedup << N instead of Speedup = N, can we find a simple and good algorithm that exactly mimics output queueing, regardless of switch size and traffic pattern?

133 What is exact mimicking? Apply the same inputs, packet by packet, to an OQ switch and a CIOQ switch; obtain the same outputs, packet by packet.

134 Algorithm - MUCF. Key concept: the urgency value. urgency = departure time - present time.

135 MUCF: the algorithm. Outputs try to get their most urgent packets. Inputs grant to the output whose packet is most urgent, ties broken by port number. Loser outputs go for their next most urgent packet. The algorithm terminates when no more matchings are possible.
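A toy sketch of that matching loop; the cells and urgency values below are invented, smaller urgency meaning the cell departs sooner in the shadow OQ switch.

```python
def mucf_match(cells, n):
    """cells: list of (input, output, urgency) triples.
    Each round, every unmatched output bids for its most urgent remaining
    cell; each input grants the most urgent bid (ties by port number).
    Losing outputs retry in the next round; stop when nothing changes."""
    matched_in, matched_out = set(), set()
    matching = {}
    progress = True
    while progress:
        progress = False
        bids = {}                              # input -> [(urgency, output)]
        for out in range(n):
            if out in matched_out:
                continue
            cand = [(u, i) for (i, o, u) in cells
                    if o == out and i not in matched_in]
            if cand:
                u, i = min(cand)               # output's most urgent cell
                bids.setdefault(i, []).append((u, out))
        for i, lst in bids.items():
            u, out = min(lst)                  # input grants the most urgent bid
            matching[i] = out
            matched_in.add(i)
            matched_out.add(out)
            progress = True
    return matching
```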

136 Stable Marriage Problem. Men = outputs (Pedro, John, Bill); women = inputs (Maria, Hillary, Monica).

137 An example. Observation: there are only two reasons a packet doesn't get to its output, input contention and output contention. This is why a speedup of 2 works!!

138 What does this get us? A speedup of 4 is sufficient for exact emulation of FIFO OQ switches with MUCF. What about non-FIFO OQ switches, e.g. WFQ or strict priority?

139 Other results. To exactly emulate an NxN OQ switch, a speedup of 2 - 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N). Input traffic patterns can be absolutely arbitrary. The emulated OQ switch may use any "monotone" scheduling policy, e.g. FIFO, LIFO, strict priority, WFQ, etc.

140 What gives? Complexity of the algorithms: extra hardware for processing, extra run time (time complexity). What is the benefit? Reduced memory bandwidth requirements. Tradeoff: memory for processing; Moore's Law supports this tradeoff.

141 Implementation: a closer look. Main sources of difficulty: estimating urgency, etc, since the information is distributed (and must be communicated among inputs and outputs); and the matching process, which may take too many iterations. Estimating urgency depends on what is being emulated; it is like taking a ticket to hold a place in a queue. FIFO and strict priorities: no problem. WFQ, etc: problems.

142 Implementation (contd). The matching process is a variant of the stable marriage problem. Worst-case number of iterations in switching = N; with high probability, and on average, approximately log(N). Worst-case number of iterations for SMP = N^2.

143 Other Work. Relax the stringent requirement of exact emulation: the Least Occupied Output First Algorithm (LOOFA) keeps outputs busy whenever there are packets for them, and by time-stamping packets it also exactly mimics output queueing. Disallow arbitrary inputs (e.g. leaky-bucket constrained) and obtain worst-case delay bounds.

144 References for speedup. Y. Oie et al., "Effect of speedup in nonblocking packet switch", ICC 89. A. L. Gupta, N. D. Georganas, "Analysis of a packet switch with input and output buffers and speed constraints", Infocom 91. S-T. Chuang et al., "Matching output queueing with a combined input and output queued switch", IEEE JSAC, vol 17, no 6, 1999. B. Prabhakar, N. McKeown, "On the speedup required for combined input and output queued switching", Automatica, vol 35, 1999. P. Krishna et al., "On the speedup required for work-conserving crossbar switches", IEEE JSAC, vol 17, no 6, 1999. A. Charny, "Providing QoS guarantees in input buffered crossbar switches with speedup", PhD Thesis, MIT, 1998.

145 Copyright 1999. All Rights Reserved145 Switching Fabrics Output and Input Queueing Output Queueing Input Queueing –Scheduling algorithms –Other non-blocking fabrics –Combining input and output queues –Multicast traffic

146 Copyright 1999. All Rights Reserved146 Multicast Switching The problem Switching with crossbar fabrics Switching with other fabrics

147 Copyright 1999. All Rights Reserved147 Multicasting 1 2 64 35

148 Copyright 1999. All Rights Reserved148 Crossbar fabrics: Method 1 Copy networks Copy network + unicast switching Increased hardware, increased input contention

149 Copyright 1999. All Rights Reserved149 Method 2 Use copying properties of crossbar fabric No fanout-splitting: Easy, but low throughput Fanout-splitting: higher throughput, but not as simple. Leaves “residue”.

150 The effect of fanout-splitting Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic

151 Placement of residue Key question: How should outputs grant requests? (and hence decide placement of residue)

152 Residue and throughput Result: Concentrating residue brings more new work forward and hence leads to higher throughput. But there are fairness problems to deal with. This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.

153 Multicasting and Tetris (diagram: residue distributed across input ports, Tetris-style, over output ports 1-5)

154 Multicasting and Tetris (diagram: the same residue concentrated on as few input ports as possible)

155 Replication by recycling Main idea: Make two copies at a time using a binary tree with the input at the root and all possible destination outputs at the leaves.

156 Replication by recycling (cont'd) (diagram: Receive, Recycle Network, Resequence, Transmit stages with an Output Table) Scalable to large fanouts. Needs resequencing at outputs and introduces variable delays.
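The binary-tree replication of slides 155-156 can be sketched as repeated two-way copying: each recycling pass splits the remaining destination set in half until one copy exists per leaf. The toy function below (names are illustrative) shows only the copy tree; the recycle network, output table, and resequencing are not modeled:

```python
def recycle_copies(cell, dests):
    """Produce one (cell, destination) pair per leaf by making at most
    two copies at a time, as a binary tree rooted at the input."""
    if len(dests) <= 1:
        return [(cell, d) for d in dests]
    mid = len(dests) // 2          # split the fanout in two, recycle halves
    return recycle_copies(cell, dests[:mid]) + recycle_copies(cell, dests[mid:])

copies = recycle_copies("cell", ["a", "b", "c", "d", "e"])
# five copies, one per destination, built two at a time
```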

157 References for Multicasting J. Hayes et al., "Performance analysis of a multicast switch", IEEE Trans. on Communications, vol 39, April 1991. B. Prabhakar et al., "Tetris models for multicast switches", Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996. B. Prabhakar et al., "Multicast scheduling for input-queued switches", IEEE JSAC, 1997. J. Turner, "An optimal nonblocking multicast virtual circuit switch", INFOCOM, 1994.

158 Tutorial Outline Introduction: What is a Packet Switch? Packet Lookup and Classification: Where does a packet go next? Switching Fabrics: How does the packet get there? Output Scheduling: When should the packet leave?

159 Output Scheduling What is output scheduling? How is it done? Practical Considerations

160 Output Scheduling (diagram: per-output scheduler) Allocating output bandwidth. Controlling packet delay.

161 Output Scheduling FIFO Fair Queueing

162 Motivation FIFO is natural but gives poor QoS –bursty flows increase delays for others –hence cannot guarantee delays Need round robin scheduling of packets –Fair Queueing –Weighted Fair Queueing, Generalized Processor Sharing

163 Fair queueing: Main issues Level of granularity –packet-by-packet? (favors long packets) –bit-by-bit? (ideal, but very complicated) Packet Generalized Processor Sharing (PGPS) –serves packet-by-packet –and imitates bit-by-bit schedule within a tolerance
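PGPS's packet-by-packet imitation of the bit-by-bit schedule can be sketched by stamping each packet with a virtual finish time and serving in increasing-stamp order. This toy collapses GPS virtual time to real arrival time, which is only a fair approximation while all flows stay backlogged; flow names and units are illustrative:

```python
def wfq_order(packets, weights):
    """packets: list of (arrival, flow, length) in arrival order.
    Stamps each packet with finish = max(flow's last finish, arrival)
    + length / weight, then returns flows in increasing-stamp order."""
    last_finish = {}
    stamped = []
    for arr, flow, length in packets:
        start = max(last_finish.get(flow, 0.0), arr)
        finish = start + length / weights[flow]
        last_finish[flow] = finish
        stamped.append((finish, arr, flow))
    return [f for _, _, f in sorted(stamped)]
```

With weights like those on slide 164 (W_R = 1, W_G = 5, W_P = 2) and three equal-length packets arriving together, flow G's packet is served first, then P's, then R's.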

164 How does WFQ work? (example weights: W_R = 1, W_G = 5, W_P = 2)

165 Delay guarantees Theorem: If flows are leaky-bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.

166 Practical considerations For every packet, the scheduler needs to –classify it into the right flow queue and maintain a linked-list for each flow –schedule it for departure Complexities of both are O(log [# of flows]) –first is hard to overcome –second can be overcome by DRR

167 Deficit Round Robin (diagram: per-flow packet queues served round-robin, each with a deficit counter and a quantum size) Good approximation of FQ. Much simpler to implement.
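A minimal Deficit Round Robin sketch, with hypothetical flow names, packet sizes, and quantum in the spirit of the slide's figures: each visited backlogged flow earns one quantum of deficit and drains head packets that fit, so a large packet simply waits until enough deficit accumulates.

```python
from collections import deque

def drr_round(queues, deficits, quantum, send):
    """One DRR round. queues: flow -> deque of packet sizes;
    deficits: flow -> accumulated deficit. Each backlogged flow gets
    one quantum, then sends head packets while they fit the deficit."""
    for f, q in queues.items():
        if not q:
            continue                     # skip idle flows
        deficits[f] += quantum
        while q and q[0] <= deficits[f]:
            pkt = q.popleft()
            deficits[f] -= pkt
            send(f, pkt)
        if not q:
            deficits[f] = 0              # idle flows carry no deficit

queues = {"A": deque([700, 250]), "B": deque([400])}
deficits = {"A": 0, "B": 0}
sent = []
drr_round(queues, deficits, 500, lambda f, p: sent.append((f, p)))
```

In this example, flow A's 700-byte head packet does not fit the first 500-byte quantum, so only B sends in round one; A sends both of its packets in round two.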

168 But... WFQ is still very hard to implement –classification is a problem –needs to maintain too much state information –doesn't scale well

169 Strict Priorities and Diff Serv Classify flows into priority classes –maintain only per-class queues –perform FIFO within each class –avoid "curse of dimensionality"

170 Diff Serv A framework for providing differentiated QoS –set Type of Service (ToS) bits in packet headers –this classifies packets into classes –routers maintain per-class queues –condition traffic at network edges to conform to class requirements May still need queue management inside the network

171 References for O/p Scheduling - A. Demers et al., "Analysis and simulation of a fair queueing algorithm", ACM SIGCOMM, 1989. - A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single node case", IEEE/ACM Trans. on Networking, June 1993. - A. Parekh, R. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the multiple node case", IEEE/ACM Trans. on Networking, August 1993. - M. Shreedhar, G. Varghese, "Efficient Fair Queueing using Deficit Round Robin", ACM SIGCOMM, 1995. - K. Nichols, S. Blake (eds), "Differentiated Services: Operational Model and Definitions", Internet Draft, 1998.

172 Active Queue Management Problems with traditional queue management –tail drop Active Queue Management –goals –an example –effectiveness

173 Tail Drop Queue Management: Lock-Out (diagram: queue held at max queue length)

174 Tail Drop Queue Management Drop packets only when queue is full –long steady-state delay –global synchronization –bias against bursty traffic

175 Global Synchronization (diagram: queue at max queue length)

176 Bias Against Bursty Traffic (diagram: queue at max queue length)

177 Alternative Queue Management Schemes Drop from front on full queue. Drop at random on full queue. Both solve the lock-out problem; both have the full-queues problem.

178 Active Queue Management Goals Solve lock-out and full-queue problems –no lock-out behavior –no global synchronization –no bias against bursty flows Provide better QoS at a router –low steady-state delay –lower packet dropping

179 Active Queue Management Problems with traditional queue management –tail drop Active Queue Management –goals –an example –effectiveness

180 Random Early Detection (RED) If q_avg < min_th: admit every packet. Else if q_avg <= max_th: drop an incoming packet with probability p = (q_avg - min_th)/(max_th - min_th). Else (q_avg > max_th): drop every incoming packet. (diagram: drop probability as a function of q_avg between min_th and max_th)
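The slide's RED decision translates directly into code. This sketch adds a common EWMA update for q_avg (the weight w = 0.002 is a typical choice, not from the slide); the injectable rng parameter exists only to make the sketch deterministic for testing:

```python
import random

def red_admit(q_avg, min_th, max_th, rng=random.random):
    """RED admit/drop decision as stated on the slide."""
    if q_avg < min_th:
        return True                               # admit every packet
    if q_avg <= max_th:
        p = (q_avg - min_th) / (max_th - min_th)  # linear drop probability
        return rng() >= p                         # drop with probability p
    return False                                  # drop every packet

def update_qavg(q_avg, q_now, w=0.002):
    """EWMA of instantaneous queue length, allowing short bursts."""
    return (1 - w) * q_avg + w * q_now
```

Because q_avg (not the instantaneous queue length) drives the decision, short bursts pass through, and the randomness is what breaks global synchronization.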

181 Effectiveness of RED: Lock-Out Packets are randomly dropped Each flow has the same probability of being discarded

182 Effectiveness of RED: Full-Queue Drop packets probabilistically in anticipation of congestion (not when queue is full). Use q_avg to decide packet dropping probability: allow instantaneous bursts. Randomness avoids global synchronization.

183 What QoS does RED Provide? Lower buffer delay: good interactive service –q_avg is controlled to be small Given responsive flows: packet dropping is reduced –early congestion indication allows traffic to throttle back before congestion Given responsive flows: fair bandwidth allocation

184 Unresponsive or aggressive flows Don't properly back off during congestion Take away bandwidth from TCP compatible flows Monopolize buffer space

185 Control Unresponsive Flows Some active queue management schemes (RED with penalty box, Flow RED (FRED), Stabilized RED (SRED)) identify and penalize unresponsive flows with a bit of extra work.

186 Active Queue Management References B. Braden et al., "Recommendations on queue management and congestion avoidance in the internet", RFC 2309, 1998. S. Floyd, V. Jacobson, "Random early detection gateways for congestion avoidance", IEEE/ACM Trans. on Networking, 1(4), Aug. 1993. D. Lin, R. Morris, "Dynamics of random early detection", ACM SIGCOMM, 1997. T. Ott et al., "SRED: Stabilized RED", INFOCOM, 1999. S. Floyd, K. Fall, "Router mechanisms to support end-to-end congestion control", LBL technical report, 1997.

187 Tutorial Outline Introduction: What is a Packet Switch? Packet Lookup and Classification: Where does a packet go next? Switching Fabrics: How does the packet get there? Output Scheduling: When should the packet leave?

188 Basic Architectural Components Policing Output Scheduling Switching Routing Congestion Control Reservation Admission Control Datapath: per-packet processing

189 Basic Architectural Components (diagram: per-packet datapath in three stages: 1. forwarding decision via forwarding table, 2. interconnect, 3. output scheduling)

