A Load-Balanced Switch with an Arbitrary Number of Linecards
Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown
Stanford University
Stanford 100Tb/s Router
"Optics in Routers" project. Some challenging numbers:
100 Tb/s aggregate capacity
R = 160 Gb/s linecard rate
N = 640 linecards (640 x 160 Gb/s ≈ 100 Tb/s)
Performance guarantees
Router Wish List
Scale to high linecard speeds: no centralized scheduler, optical switch fabric, low packet-processing complexity
Scale to a high number of linecards: high number of linecards, arbitrary arrangement of linecards
Provide performance guarantees: 100% throughput guarantee, delay guarantee, no packet reordering
Load-Balanced Switch
[Figure: two meshes. A load-balancing mesh spreads traffic from each input (rate R) uniformly over all N linecards, and a forwarding mesh delivers it to the correct output (rate R); each mesh link runs at R/N.]
Combining the Two Meshes
[Figure: the load-balancing mesh and the forwarding mesh are overlaid so that one physical linecard hosts an input of the first mesh and an output of the second.]
A Single Combined Mesh
[Figure: one N x N mesh in which every pair of linecards is connected at rate 2R/N, carrying both load-balancing and forwarding traffic; each linecard sends R into each logical stage.]
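To make the two logical stages concrete, here is a minimal toy simulation of a load-balanced switch (my sketch, not the authors' implementation; N and the arrival pattern are made up). Stage 1 spreads each arriving packet over the intermediate linecards with a fixed cyclic shift, and stage 2 forwards buffered packets to their true outputs with the same traffic-independent pattern; in the combined mesh both stages share the same 2R/N links.

```python
# Toy model of the load-balanced switch (a sketch; N and the traffic
# pattern are invented for illustration).
N = 4  # number of linecards

# viq[m][d]: packets parked at intermediate linecard m, destined to d
viq = [[0] * N for _ in range(N)]

def stage1_load_balance(arrivals, t):
    """Stage 1: input i sends its packet to intermediate linecard
    (i + t) % N -- a fixed cyclic shift, independent of the traffic."""
    for i, dest in arrivals:
        viq[(i + t) % N][dest] += 1

def stage2_forward(t):
    """Stage 2: intermediate linecard m serves output (m + t) % N,
    the same traffic-independent cyclic shift."""
    sent = []
    for m in range(N):
        out = (m + t) % N
        if viq[m][out] > 0:
            viq[m][out] -= 1
            sent.append(out)
    return sent

# Even if every input sends to output 0, stage 1 spreads the load so
# that the shared mesh never needs reconfiguration.
for t in range(8):
    stage1_load_balance([(i, 0) for i in range(N)], t)
    print(f"slot {t}: delivered to outputs {stage2_forward(t)}")
```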
References on Early Work
Initial work: C.-S. Chang, D.-S. Lee and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches, Part I: One-Stage Buffering," Computer Communications, Vol. 25, 2002.
Sigcomm '03: I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, "Scaling Internet Routers Using Optics," ACM SIGCOMM '03, Karlsruhe, Germany, August 2003.
Summary of Early Work
Initial work (C.-S. Chang et al.): no centralized scheduler; crossbar-based architecture; 100% throughput guarantee for weakly-mixing traffic.
Sigcomm '03: no centralized scheduler; mesh-based architecture, so no reconfiguration (a single mesh); 100% throughput guarantee for any adversarial traffic; average delay within a constant of an output-queued router; no packet reordering.
Router Wish List
Scale to high linecard speeds: no centralized scheduler, optical switch fabric, low packet-processing complexity
Scale to a high number of linecards: high number of linecards, arbitrary arrangement of linecards
Provide performance guarantees: 100% throughput guarantee, delay guarantee, no packet reordering
Example
[Figure: a small example of the combined mesh; the original labels show N and a per-link rate of R/8.]
When N is Too Large
Decompose into groups (or racks).
[Figure: a small example in which linecards of rate 2R are collected into groups of rate 4R, and the fine-grained mesh links are bundled into thicker inter-group links.]
When N is Too Large
Decompose into groups (or racks).
[Figure: G groups/racks of L linecards each. Each linecard puts 2R on the combined mesh, so a full group sources 2RL, spread evenly over its links to the G groups: 2RL/G per inter-group link.]
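As a worked example of these rates: with R = 160 Gb/s and N = 640 linecards from the opening slide, arranged as G = 40 groups of L = 16 (40 groups is the group count used in the running-time simulation later in the talk; the even split is my assumption), the aggregated capacities come out as follows.

```python
# Link-rate arithmetic for the grouped mesh. R and N come from the
# talk's opening slide; the split into 40 groups of 16 is an assumed
# example arrangement.
R = 160e9   # linecard rate, bits/s; each linecard puts 2R on the mesh
L = 16      # linecards per group/rack (assumed)
G = 40      # number of groups/racks (assumed)

group_traffic = 2 * R * L        # a full group sources 2RL
per_link = group_traffic / G     # spread over its G inter-group links

print(f"per group: {group_traffic / 1e12:.2f} Tb/s")       # 5.12 Tb/s
print(f"per inter-group link: {per_link / 1e9:.0f} Gb/s")  # 128 Gb/s
```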
Router Wish List
Scale to high linecard speeds: no centralized scheduler, optical switch fabric, low packet-processing complexity
Scale to a high number of linecards: high number of linecards, arbitrary arrangement of linecards
Provide performance guarantees: 100% throughput guarantee, delay guarantee, no packet reordering
When Linecards are Missing
Failures, incremental additions, and removals...
[Figure: the grouped architecture as before, but with some linecard slots empty, so the fixed 2RL/G inter-group links no longer match the traffic.]
Solution: replace the fixed mesh with a sum of permutations: the inter-group capacity is written as G permutations of rate 2RL/G each (G x 2RL/G ≤ 2RL), and the permutations can be rearranged when linecards come and go.
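The "sum of permutations" step can be illustrated with a Birkhoff-style greedy decomposition (an illustrative sketch, not necessarily the paper's method): permutation matrices are peeled off the inter-group capacity matrix until nothing remains, and each permutation becomes one static configuration of an optical switch.

```python
# Greedy Birkhoff-style decomposition of a G x G inter-group demand
# matrix (nonnegative integers, equal row and column sums) into
# weighted permutations -- one static optical-switch setting each.

def find_permutation(d, row=0, perm=None, used=None):
    """Backtracking: pick one positive entry per row, distinct columns."""
    if perm is None:
        perm, used = [None] * len(d), set()
    if row == len(d):
        return perm
    for col in range(len(d)):
        if col not in used and d[row][col] > 0:
            perm[row] = col
            used.add(col)
            if find_permutation(d, row + 1, perm, used) is not None:
                return perm
            used.discard(col)
    return None

def decompose(demand):
    """Peel permutations off `demand` until it is all zeros."""
    d = [row[:] for row in demand]
    parts = []
    while any(any(row) for row in d):
        perm = find_permutation(d)
        if perm is None:   # happens only if row/column sums are unequal
            raise ValueError("matrix is not decomposable")
        w = min(d[r][perm[r]] for r in range(len(d)))
        for r in range(len(d)):
            d[r][perm[r]] -= w
        parts.append((w, perm[:]))
    return parts

# Uniform demand among 3 groups splits into 3 unit permutations:
for w, perm in decompose([[1, 1, 1], [1, 1, 1], [1, 1, 1]]):
    print(w, perm)
```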
Hybrid Electro-Optical Architecture Using MEMS Switches
[Figure: G groups/racks of L linecards each (electronics) interconnected through a set of MEMS switches (optics); each MEMS switch holds one static permutation between groups.]
When Linecards are Missing
[Figure: the same hybrid architecture with some linecards absent; the MEMS switches are reconfigured so that capacity matches the populated groups.]
Router Wish List
Scale to high linecard speeds: no centralized scheduler, optical switch fabric, low packet-processing complexity
Scale to a high number of linecards: high number of linecards, arbitrary arrangement of linecards
Provide performance guarantees: 100% throughput guarantee, delay guarantee, no packet reordering
Questions
Number of MEMS switches?
TDM schedule?
All Link Capacities Are Equal
[Figure: each group reaches each MEMS switch through a laser/modulator and MUX.]
Link capacity ≈ 64 λ's x 5 Gb/s/λ = 320 Gb/s = 2R, so every link carries at most 2R.
Example: 2 Groups of 2 Linecards
[Figure: two groups/racks, each with two linecards of rate 2R (4R per group), interconnected by links of capacity 2R.]
Intuition on Worst-Case
[Figure: one group/rack holds L linecards (2RL of traffic, over links of capacity at most 2R each), while each of the remaining G-1 groups/racks holds a single linecard (2R each).] The full group alone needs at least L switches just to source its 2RL over 2R links, and reaching the G-1 single-linecard groups accounts for up to G-1 more.
Number of MEMS Switches
Theorem: M ≤ L + G - 1 MEMS switches suffice, where L is the maximum number of linecards in a group and G is the number of groups.
Examples: a few values of the bound are computed in the sketch below.
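A quick evaluation of the theorem's bound (the two 640-linecard rows match the router's scale from the opening slide and the 40-group simulation setup; the exact splits are invented for illustration):

```python
# Evaluating the bound M <= L + G - 1 for a few configurations.
def mems_bound(L, G):
    """L: max linecards per group; G: number of groups."""
    return L + G - 1

for L, G in [(2, 2), (16, 40), (64, 10)]:
    print(f"L={L:2d}, G={G:2d}: up to {L * G:3d} linecards "
          f"with at most {mems_bound(L, G)} MEMS switches")
```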
Questions
Number of MEMS switches?
TDM schedule?
TDM Schedule
[Figure: the 2-groups-of-2-linecards example again: linecards of rate 2R, groups A and B of rate 4R, inter-group links of capacity 2R.]
TDM Schedule
[Figure: the same 2-groups-of-2 example.] The schedule must satisfy:
Uniform-spreading constraint on linecards (each transmitter visits every destination once per frame)
Constraints on linecards at each time-slot (each destination linecard receives at most one packet)
Constraints on groups at each time-slot (each 2R link between a pair of groups carries at most one packet)
TDM Schedule

           T+1   T+2   T+3   T+4
Tx LC A1    ?     ?     ?     ?
Tx LC A2    ?     ?     ?     ?
Tx LC B1    ?     ?     ?     ?
Tx LC B2    ?     ?     ?     ?

(Tx Group A = {A1, A2}, Tx Group B = {B1, B2}; each entry is the destination linecard for that time slot.)
TDM Schedule

           T+1   T+2   T+3   T+4
Tx LC A1   A1    A2    B1    B2
Tx LC A2   B2    A1    A2    B1
Tx LC B1   B1    B2    A1    A2
Tx LC B2   A2    B1    B2    A1

Every transmitter visits all four destinations once per frame, and every column is a permutation, so the linecard constraints are satisfied.
Bad TDM Schedule
[Figure: the same table, with the offending slots highlighted.] The group constraints fail: at slot T+2, A1 and A2 both send within group A (and B1, B2 within group B), and at slot T+4, A1 and A2 both send to group B, so a 2R group-pair link would need to carry two packets in one slot. A checker for all three constraints is sketched below.
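Here is a sketch of a checker for the three constraints (my formalization of the constraints slide, assuming each ordered group pair can carry at most one packet per slot over its 2R link); it flags the schedule above and accepts a corrected one.

```python
# Checker for the three TDM constraints (my formalization: one packet
# per ordered group pair per slot, matching the 2R link capacity).

GROUP = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

def check_schedule(sched):
    """sched[tx] = list of destination linecards, one per time slot."""
    txs = sorted(sched)
    ok = True
    for tx, dests in sched.items():
        if sorted(dests) != txs:                 # uniform spreading
            print(f"{tx}: non-uniform spreading {dests}")
            ok = False
    for t in range(len(txs)):
        col = [sched[tx][t] for tx in txs]
        if len(set(col)) != len(col):            # linecard constraint
            print(f"slot T+{t + 1}: destination linecard reused")
            ok = False
        pairs = [(GROUP[tx], GROUP[sched[tx][t]]) for tx in txs]
        if len(set(pairs)) != len(pairs):        # group constraint
            print(f"slot T+{t + 1}: group-pair link overbooked")
            ok = False
    return ok

# The table above: linecard constraints hold, group constraints fail.
bad = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "B1", "B2", "A1"]}
print(check_schedule(bad))    # flags slots T+2 and T+4 -> False

# A corrected schedule (my construction, not necessarily the talk's):
good = {"A1": ["A1", "B1", "A2", "B2"], "A2": ["B2", "A2", "B1", "A1"],
        "B1": ["B1", "A1", "B2", "A2"], "B2": ["A2", "B2", "A1", "B1"]}
print(check_schedule(good))   # True
```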
TDM Schedule Algorithm Intuition
1. Create a TDM schedule between groups: "group A sends to group B".
2. Assign the group connections to specific linecards: "linecard A1 sends to linecard B3".
Theorem: there exists a polynomial-time algorithm that finds a correct TDM schedule.
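Below is an illustrative instance of the two-step construction for the special case of fully populated, equal-size groups (G groups of L linecards); the paper's polynomial-time algorithm is more general and also handles missing linecards. Step 1 picks the destination group so that the load on every ordered group pair stays as even as possible in every slot (so static MEMS configurations suffice); step 2 rotates through linecards inside the chosen group so each transmitter covers every destination once per frame.

```python
# Two-step TDM construction for G equal groups of L linecards each
# (an illustrative special case, not the paper's general algorithm).

def tdm_schedule(G, L):
    """sched[(g, i)][t] = (h, j): at slot t, linecard i of group g sends
    to linecard j of group h. The frame has N = G * L slots."""
    sched = {}
    for g in range(G):
        for i in range(L):
            dests = []
            for t in range(G * L):
                h = (g + i + t) % G    # step 1: destination group, spread
                                       # evenly over groups in every slot
                j = (t // G + i) % L   # step 2: rotate through linecards
                                       # inside the destination group
                dests.append((h, j))
            sched[(g, i)] = dests
    return sched

# For 2 groups of 2 this reproduces the corrected schedule above:
name = lambda g, j: "AB"[g] + str(j + 1)
for (g, i), dests in sorted(tdm_schedule(2, 2).items()):
    print(f"Tx LC {name(g, i)}:", [name(h, j) for h, j in dests])
```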
Algorithm Running Time
[Plot: running time (milliseconds) vs. number of linecards; worst-case, average-case, and best-case curves. Verilog simulation; linecard placement generated uniformly at random among 40 groups; 4 ns clock cycle; 1000 runs per case. Source: Srikanth Arekapudi.]
Open Questions
Greedy TDM algorithm with more capacity?
A better switch fabric architecture?
Thank you.