Challenges in Modern Multi-Terabit Class Switch Design
Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
Tbps System Architecture
[Figure: line cards 1..N, each with an OC-x interface, network processor, and ingress/egress buffers, connect through switch fabric interface chips to the switch fabric (switch core); ingress and egress data and flow-control (FC) paths and the ingress/egress round trips (iRT, eRT) are marked.]
Trend: Single POP routers
- Very high capacity (10+ Tb/s)
- Line rates T1 to OC-768
Reasons:
- Big multi-rack router more efficient than many single-rack routers
- Easier to manage fewer routers
- Power requirements easier to meet
Multi-Tbps Systems: Goals…
- Design of a terabit-class system
  - Several Tb/s aggregate throughput
  - 2.5 Tb/s: 256x256 OC-192 or 64x64 OC-768
  - OEM
- Achieve wide coverage of the application spectrum
- Single-stage
- Electronic fabric
Trends & Consequences
Trend 1: CPU instructions per minimum-length packet.
Trend 2: Disparity between traffic and router growth: router capacity doubles every 18 months, traffic doubles every year. [Chart: router capacity growing from 1 Tb/s toward 100 Tb/s.]
Consequences:
1. Per-packet processing is getting harder.
2. Efficient, simple processing will become more important.
3. Routers will get faster, simpler and more efficient. (Weren't they supposed to be simple in the first place?)
Trends and Consequences (2)
Trend 3: Power consumption is out of control.
Trend 4: Disparity between line rate and memory access time.
Consequences:
1. Power efficiency will continue to be important.
2. Memories will seem slower and slower. Are we just going to keep adding more parallelism?
What's hard, what's not
- Line-rate forwarding:
  - Line-rate LPM was an issue for a while.
  - Commercial TCAMs and algorithms are available up to 100 Gb/s.
  - 1M prefixes fit in a corner of a 90nm ASIC.
  - 2^32 addresses will fit in a $10 DRAM in 8 years.
- Packet buffering:
  - Not a problem up to about 10 Gb/s; a big problem above 10 Gb/s.
- Header processing:
  - For basic IPv4 operations: not a problem.
  - If we keep adding functions, it will be a problem.
  - More on this later…
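To make the LPM bullet concrete, here is a minimal longest-prefix-match sketch using a binary trie in Python. It illustrates the lookup problem only, not how line-rate hardware (TCAMs or compressed multibit tries in ASICs) implements it; the prefixes and next hops are made up.

```python
# Minimal longest-prefix-match sketch using a binary trie.
# Illustration only: real line-rate LPM uses TCAMs or compressed
# multibit tries in ASICs, not a Python dict-of-nodes.

class TrieNode:
    def __init__(self):
        self.children = {}      # bit ('0'/'1') -> TrieNode
        self.next_hop = None    # set if a prefix ends at this node

def insert(root, prefix_bits, next_hop):
    """prefix_bits is a bit string like '00001010' (the prefix, MSB first)."""
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, TrieNode())
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Walk the trie, remembering the last (i.e. longest) matching prefix."""
    node, best = root, None
    for b in addr_bits:
        if node.next_hop is not None:
            best = node.next_hop
        node = node.children.get(b)
        if node is None:
            break
    else:
        if node.next_hop is not None:
            best = node.next_hop
    return best

# Example routes: 10.0.0.0/8 -> 'A', 10.1.0.0/16 -> 'B'
root = TrieNode()
insert(root, '00001010', 'A')
insert(root, '0000101000000001', 'B')
print(lookup(root, format(0x0A010203, '032b')))  # 10.1.2.3 -> 'B' (longest match wins)
```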
What's hard, what's not (2)
- Switching
  - If throughput doesn't matter:
    - Easy: lots of multistage, distributed or load-balanced switch fabrics.
  - If throughput matters:
    - Use a crossbar, VOQs and a centralized scheduler, or
    - a multistage fabric and lots of speedup.
  - If a throughput guarantee is required:
    - Maximal matching, VOQs and a speedup of two [Dai & Prabhakar '00]; or
    - a load-balanced two-stage switch [Chang '01; SIGCOMM '03].
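As a toy illustration of the "crossbar + VOQs + scheduler" option, the sketch below computes one greedy maximal matching over a VOQ occupancy matrix. It is not Dai & Prabhakar's algorithm or any product's scheduler; the request matrix and tie-breaking order are assumptions.

```python
# One crossbar arbitration step as a greedy maximal matching over a
# VOQ occupancy matrix. Real fabrics use parallel iterative schemes
# in hardware; this sequential version just illustrates the idea.

def greedy_maximal_match(voq):
    """voq[i][j] = number of cells queued at input i for output j.
    Returns a dict input -> output; no unmatched input/output pair
    with queued cells remains (maximal, not necessarily maximum)."""
    n = len(voq)
    match, used_outputs = {}, set()
    for i in range(n):
        for j in range(n):
            if voq[i][j] > 0 and j not in used_outputs:
                match[i] = j
                used_outputs.add(j)
                break
    return match

voq = [
    [2, 0, 1],   # input 0 has cells for outputs 0 and 2
    [0, 3, 0],   # input 1 has cells for output 1
    [1, 0, 0],   # input 2 has cells only for output 0
]
print(greedy_maximal_match(voq))  # -> {0: 0, 1: 1}; input 2 loses this round (output 0 taken)
```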
Memory: Buffers
Memory speed will matter more than size.
- Memory speed: will remain a problem. Waiting for slow off-chip memory will become intolerable.
- Memory size: will become less of an issue. Packet buffers: today they are too big; they'll get smaller.
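The claim that packet buffers are too big and will shrink lines up with the well-known buffer-sizing argument: the classic rule of thumb is B = RTT x C, while with many long-lived flows roughly RTT x C / sqrt(N) suffices. A back-of-envelope sketch with assumed numbers (the RTT, line rate and flow count below are not from the slides):

```python
# Back-of-envelope buffer sizing, illustrating why packet buffers may shrink.
# Classic rule of thumb: B = RTT * C; with N long-lived flows, B ~ RTT*C/sqrt(N)
# (the "small buffers" result). All numbers here are assumptions.
from math import sqrt

rtt = 0.25          # seconds (assumed round-trip time)
C = 40e9            # bits/s (OC-768 line rate)
N = 10_000          # assumed number of long-lived flows

classic = rtt * C               # ~10 Gbit of buffering
small   = rtt * C / sqrt(N)     # ~100 Mbit: far easier to keep on-chip
print(f"RTT*C = {classic/1e9:.1f} Gbit, RTT*C/sqrt(N) = {small/1e6:.0f} Mbit")
```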
Switching: Myths about CIOQ-based crossbar switches
1. "Input-queued crossbars have low throughput"
   - An input-queued crossbar can have as high a throughput as any switch.
2. "Crossbars don't support multicast traffic well"
   - A crossbar inherently supports multicast efficiently.
3. "Crossbars don't scale well"
   - Today it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5-10 Tb/s crossbar switches.
Packet processing gets harder
[Chart: instructions per arriving byte over time; what we'd like (more features: QoS, multicast, security, …) vs. what will happen.]
Packet processing gets harder
[Chart: clock cycles per minimum-length packet, since 1996.]
Power Requirement
- Do not exceed the per-shelf (2 kW), per-board (150 W), and per-chip (20 W) budgets
  - Forced-air cooling; avoid hot-spots
- More throughput at the same power: Gb/s/W density is increasing
- I/O uses an increasing fraction of power (> 50%)
  - Electrical I/O technology has not kept pace with capacity demand
  - Low-power, high-density I/O technology is a must
- CMOS density increases faster than W/gate decreases
  - Functionality per chip constrained by power rather than density
Power determines the number of chips and boards
- Architecture must be able to be distributed accordingly
Packaging Requirement
- NEBS compliance
- Constrained by
  - Standard form factors
  - Power budget at chip, card, and rack level
- Switch core
  - Link, connector, and chip packaging technology
  - Connector density (pins/inch)
  - CMOS density doubles, but the number of pins grows only 5-10% per generation
  - This determines the maximum per-chip and per-card throughput
- Line cards
  - Increasing port counts
  - Prevalent line-rate granularity: OC-192 (10 Gb/s)
  - 1 adapter/card
- > 1 Tb/s systems require multi-rack solutions
  - Long cables instead of a backplane (30 to 100 m)
  - Interconnect accounts for a large part of system cost
Packaging
- 2.5 Tb/s, 1.6x speedup, 2.5 Gb/s links with 8b/10b coding: 4000 links (differential pairs)
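A quick sanity check of where the 4000-pair figure can come from, assuming the 1.6x speedup applies to the fabric-facing bandwidth, 8b/10b coding leaves 80% payload per 2.5 Gb/s link, and the count covers both ingress and egress directions (these assumptions are not spelled out on the slide):

```python
# Sanity check of the link count quoted on this slide, under the stated assumptions.
throughput   = 2.5e12          # bits/s aggregate switch throughput
speedup      = 1.6
link_rate    = 2.5e9           # bits/s raw, per differential pair
payload_rate = link_rate * 0.8 # 2.0 Gb/s of payload after 8b/10b coding

links_per_direction = throughput * speedup / payload_rate   # 2000
total_pairs = 2 * links_per_direction                        # 4000 (ingress + egress)
print(int(links_per_direction), int(total_pairs))
```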
Switch-Internal Round-Trip (RT)
- Physical system size
  - Direct consequence of packaging
- CMOS technology
  - Clock speeds increase much more slowly than density
  - More parallelism required to increase throughput
- Shrinking packet cycle
  - Line rates have gone up drastically (OC-3 through OC-768)
  - Minimum packet size has remained constant
Large round-trip (RT) in terms of minimum packet duration
- Can be (many) tens of packets per port
- Used to be only a node-to-node issue, now also inside the node
- System-wide clocking and synchronization
[Chart: evolution of RT]
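To get a feel for "tens of packets per port", here is a rough calculation. The cable length and chip latency are illustrative assumptions, not figures from the deck; only the 8 ns minimum-packet duration at OC-768 appears on a later slide.

```python
# Rough feel for why the switch-internal round trip spans many packet times.
# Cable length and chip latency below are assumptions.

cable_m         = 50            # one-way cable length, line card to switch core
prop_ns_per_m   = 5             # ~5 ns/m signal propagation
chip_latency_ns = 200           # assumed serdes + fabric-interface latency, round trip

rt_ns = 2 * cable_m * prop_ns_per_m + chip_latency_ns   # ~700 ns round trip

pkt_bits  = 40 * 8              # minimum-size 40 B packet
line_rate = 40e9                # OC-768, ~40 Gb/s
pkt_ns    = pkt_bits / line_rate * 1e9                  # 8 ns per packet

print(f"RT ~ {rt_ns:.0f} ns ~ {rt_ns/pkt_ns:.0f} minimum-size packet times")
```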
Switch-Internal Round-Trip (RT)
[Figure: the system architecture from before, with the ingress and egress round trips (iRT, eRT) between line cards and switch core highlighted.]
Consequences
- Performance impact?
- All buffers must be scaled by the RT
- Fabric-internal flow control becomes an important issue
Physical Separation: Separating Control & Data
[Figure: line card and switch fabric connected by separate control and data channels (LCS). Per packet: 1: the line card sends a request over the control channel; 2: the switch scheduler returns a grant/credit; 3: the line card sends the data over the data channel through the switch fabric. The line card measures the RTT to within about one cell time; a buffer or guard-band covers the residual uncertainty.]
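A toy model of the request/grant/data exchange in the figure: line cards issue requests over the control channel, a central scheduler returns at most one grant per input and per output each cell time, and data then crosses the fabric. The class and message names are illustrative assumptions, not the actual LCS protocol.

```python
# Toy model of the control/data separation above; not the real LCS protocol.

class Scheduler:
    def __init__(self, num_outputs):
        self.requests = {o: [] for o in range(num_outputs)}  # output -> waiting inputs

    def request(self, input_port, output_port):
        """Step 1: a line card sends a request over the control channel."""
        self.requests[output_port].append(input_port)

    def grant(self):
        """Step 2: issue grants/credits; at most one per input and per output."""
        grants, granted_inputs = {}, set()
        for out, waiting in self.requests.items():
            for idx, inp in enumerate(waiting):
                if inp not in granted_inputs:
                    grants[inp] = out
                    granted_inputs.add(inp)
                    waiting.pop(idx)
                    break
        return grants

sched = Scheduler(num_outputs=4)
sched.request(0, 2)
sched.request(1, 2)        # input 1 contends with input 0 for output 2
sched.request(1, 3)
print(sched.grant())       # Step 3: granted inputs send data cells, e.g. {0: 2, 1: 3}
```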
Speed-Up Requirement
- "Industry standard" 2x speed-up
- Three flavors
  - Utilization: compensate for SAR overhead
  - Performance: compensate for scheduling inefficiencies
  - OQ speed-up: memory access time
Switch-core speed-up S is very costly
- Bandwidth is a scarce resource: COST and POWER
- Core buffers must run S times faster
- Core scheduler must run S times faster
- Is it really needed?
  - SAR overhead reduction
    - Variable-length packet switching: hard to implement, but may be more cost-effective
  - Performance: does the gain in performance justify the increase in cost and power?
    - Depends on the application
    - Low Internet utilization reduces the need for speed-up
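A worked example of why SAR overhead alone pushes the speed-up toward 2x, assuming 64 B cells (the cell size is an assumption): the worst case is a packet one byte longer than a cell, which occupies two cells.

```python
# Why SAR (segmentation and reassembly) overhead drives the speed-up toward 2x.
# Assumes 64 B fixed cells: a packet one byte longer than a cell needs two
# cells and nearly doubles the load on the fabric.

CELL = 64

def cells_needed(pkt_bytes):
    return -(-pkt_bytes // CELL)     # ceiling division

for pkt in (40, 64, 65, 1500):
    expansion = cells_needed(pkt) * CELL / pkt
    print(f"{pkt:5d} B packet -> {cells_needed(pkt):2d} cells, fabric load x{expansion:.2f}")
# 65 B -> 2 cells, load x1.97: hence the ~2x speed-up for utilization,
# unless packets are packed into cells (see the packing slide later on).
```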
Multicast Requirement
- Full multicast support
  - Many multicast groups, full link utilization, no blocking, QoS
- Complicates everything
  - Buffering, queuing, scheduling, flow control, QoS
Is sophisticated multicast support really needed?
- Expensive
- Often disabled in the field…
  - Complexity, billing, potential for abuse, etc.
- Again, depends on the application
Packet Size Requirement
- Support very short packets (32-64 B)
  - At OC-768, a minimum packet lasts about 8 ns
- Short packet duration
  - Determines the speed of the control section (queues and schedulers)
  - Implies a longer RT (in packets)
  - Wider data paths
Do we have to switch short packets individually?
- Aggregation techniques
  - Burst, envelope, container switching, "packing"
- Single-stage, multi-path switches
  - Parallel packet switch
Increase Payload Size: Packing Packets in "Cells"
[Figure: packets from the same VOQ (178 B, 40 B, 100 B in the example) are packed back to back into 128 B cell payloads; each cell header carries start-of-packet and packet-length fields so packets can span cell boundaries.]
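A minimal sketch of the packing idea: variable-length packets from one VOQ are laid back to back into fixed-size cell payloads, and each cell records where the first new packet starts. The 128 B payload and the per-cell description are simplified assumptions; the real cell header format and reassembly logic are not shown.

```python
# Sketch of packing variable-length packets from one VOQ into fixed cells.
CELL_PAYLOAD = 128

def pack(voq_packets):
    """voq_packets: list of packet lengths (bytes), all for the same output.
    Returns one (used_bytes, first_start_offset) tuple per cell."""
    cells = []
    carry = 0                       # bytes of a packet continued from the previous cell
    queue = list(voq_packets)
    while queue or carry:
        space = CELL_PAYLOAD
        # offset at which the first *new* packet starts in this cell, if any
        first_start = carry if (queue and carry < CELL_PAYLOAD) else None
        # finish any packet spilling over from the previous cell
        take = min(carry, space)
        carry -= take
        space -= take
        # then pack whole/partial packets back to back
        while queue and space > 0:
            take = min(queue[0], space)
            queue[0] -= take
            space -= take
            if queue[0] == 0:
                queue.pop(0)
            else:
                carry = queue.pop(0)    # this packet continues in the next cell
        cells.append((CELL_PAYLOAD - space, first_start))
    return cells

print(pack([178, 40, 100]))   # 318 B of packets -> 3 cells instead of 4 unpacked
```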
Can optics help in switching?
[Figure: a typical IP router line card (physical layer, framing & maintenance, packet processing with lookup tables, buffer management & scheduling with buffer & state memory) connects to a buffered or bufferless switch fabric with arbitration; the request/grant path is electrical, while the datapath through the fabric can be optical.]
Can optics help? Cynical view:
1. A packet switch (e.g. an IP router) must have buffering.
2. Optical buffering is not feasible.
3. Therefore, optical routers are not feasible.
4. Hence, "optical switches" are circuit switches (e.g. TDM, space or lambda switches).
Can optics help? Open-minded view:
- Optics seems ill-suited to processing-intensive functions, or where random-access memory is required.
- Optics seems well-suited to bufferless, reconfigurable datapaths.
100 Tb/s optical router: Stanford University Research Project
- Collaboration
  - 4 professors at Stanford (Mark Horowitz, Nick McKeown, David Miller and Olav Solgaard), and our groups
- Objective
  - To determine the best way to incorporate optics into routers
  - Push technology hard to expose new issues
  - Photonics, electronics, system design
- Motivating example: the design of a 100 Tb/s Internet router
  - Challenging but not impossible (~100x current commercial systems)
  - It identifies some interesting research problems
100 Tb/s optical router
[Figure: 625 electronic line cards (#1 through #625), each with line termination, IP packet processing and packet buffering at 160 Gb/s, connect over 40 Gb/s links to an optical switch; request/grant arbitration is electronic. 100 Tb/s = 625 x 160 Gb/s.]
Research Problems
- Line card
  - Memory bottleneck: address lookup and packet buffering
- Architecture
  - Arbitration: computation complexity
- Switch fabric
  - Optics: fabric scalability and speed
  - Electronics: switch control and link electronics
  - Packaging: the three-surface problem
160 Gb/s Linecard: Packet Buffering
- Problem
  - The packet buffer needs the density of DRAM (40 Gbits) and the speed of SRAM (2 ns per packet)
- Solution
  - A hybrid approach uses on-chip SRAM and off-chip DRAM
  - Identified optimal algorithms that minimize the size of the SRAM (12 Mbits)
  - Precisely emulates the behavior of a 40 Gbit, 2 ns SRAM
[Figure: 160 Gb/s stream through a queue manager with on-chip SRAM backed by off-chip DRAM.]
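A toy sketch of the hybrid-buffer idea: only the head and tail of each queue live in fast SRAM, the bulk sits in slow, dense DRAM, and data moves between them in large bursts. This is not the optimal algorithm the slide refers to; the burst size and thresholds are arbitrary assumptions.

```python
# Toy hybrid SRAM/DRAM packet buffer: head and tail of the queue in "SRAM",
# the middle in "DRAM", moved in bursts to amortize slow DRAM accesses.
from collections import deque

BURST = 8            # cells moved per DRAM access
SRAM_TAIL_MAX = 16   # keep at most this many cells in the SRAM tail

class HybridQueue:
    def __init__(self):
        self.sram_head = deque()   # cells ready to depart at SRAM speed
        self.dram = deque()        # bulk storage, accessed in bursts only
        self.sram_tail = deque()   # recently arrived cells

    def enqueue(self, cell):
        self.sram_tail.append(cell)
        if len(self.sram_tail) >= SRAM_TAIL_MAX:
            # spill a burst of the oldest tail cells into DRAM
            for _ in range(BURST):
                self.dram.append(self.sram_tail.popleft())

    def dequeue(self):
        if not self.sram_head:
            # refill the head: prefer DRAM (older data), else drain the tail
            src = self.dram if self.dram else self.sram_tail
            for _ in range(min(BURST, len(src))):
                self.sram_head.append(src.popleft())
        return self.sram_head.popleft() if self.sram_head else None

q = HybridQueue()
for i in range(40):
    q.enqueue(i)
print([q.dequeue() for _ in range(5)])   # -> [0, 1, 2, 3, 4]: FIFO order preserved
```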
The Arbitration Problem
- A packet switch fabric is reconfigured for every packet transfer.
- At 160 Gb/s, a new IP packet can arrive every 2 ns.
- The configuration is picked to maximize throughput and not waste capacity.
- Known algorithms are too slow.
100 Tb/s Router
[Figure: racks of 160 Gb/s line cards connected by optical links to an optical switch fabric.]
Racks with 160 Gb/s linecards
[Figure: each line card contains a lookup engine and a queue manager with SRAM backed by DRAM.]
Passive Optical Switching
[Figure: n ingress line cards, n midstage line cards and n egress line cards interconnected through an n x n integrated AWGR or diffraction-grating-based wavelength router.]
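The property that makes a passive AWGR usable as a switch stage is its fixed cyclic wavelength routing: a signal entering input i on wavelength index k leaves output (i + k) mod n, so "switching" reduces to choosing the transmit wavelength, with no powered elements in the fabric. A toy model of that commonly cited mapping (port and wavelength numbering conventions vary by device, so treat the exact formula as an assumption):

```python
# Toy model of the cyclic routing property of an n x n AWGR wavelength router.

def awgr_output(input_port, wavelength_index, n):
    return (input_port + wavelength_index) % n

n = 4
for i in range(n):
    # To reach output j from input i, tune the laser to wavelength (j - i) mod n.
    row = [awgr_output(i, k, n) for k in range(n)]
    print(f"input {i}: wavelengths 0..{n-1} reach outputs {row}")
# Each input reaches every output on some wavelength, and no two inputs
# collide at an output on the same wavelength: a passive, static "crossbar".
```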
Question
- Can we use an optical fabric at 100 Tb/s with 100% throughput?
- Conventional answer: No.
  - Need to reconfigure the switch too often
  - 100% throughput requires a complex electronic scheduler
Two-stage load-balancing switch
[Figure: N inputs at rate R feed a load-balancing stage over internal links of rate R/N, which feed a switching stage driving N outputs at rate R.]
100% throughput for weakly mixing, stochastic traffic. [C.-S. Chang, Valiant]
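A minimal sketch of the two-stage load-balanced idea: the first stage spreads each input's cells over the intermediate ports with a fixed rotating connection pattern (no scheduler), and the second stage uses another fixed rotation to drain per-destination VOQs at the intermediate ports. This toy model ignores timing and the resulting mis-sequencing (which a later slide notes will be tolerated).

```python
# Sketch of the load-balanced two-stage switch idea [Chang / Valiant]:
# both stages follow fixed rotating permutations, so no central scheduler
# is needed. Toy model: cell timing and reordering are ignored.

N = 4

def stage1_intermediate(input_port, t):
    """At time slot t, input i is connected to this intermediate port."""
    return (input_port + t) % N

def stage2_output(intermediate_port, t):
    """At time slot t, intermediate port m is connected to this output."""
    return (intermediate_port + t) % N

# Intermediate ports keep VOQs: voq[m][dest] holds cells destined to 'dest'.
voq = [[[] for _ in range(N)] for _ in range(N)]

def arrive(input_port, dest, cell, t):
    m = stage1_intermediate(input_port, t)    # stage 1: spread, ignoring the destination
    voq[m][dest].append(cell)

def depart(t):
    delivered = []
    for m in range(N):
        out = stage2_output(m, t)             # stage 2: fixed rotation
        if voq[m][out]:
            delivered.append((out, voq[m][out].pop(0)))
    return delivered

# Even traffic from a single input to a single output is spread over all intermediates:
for t in range(4):
    arrive(0, 2, f"cell{t}", t)
for t in range(4, 12):
    for out, cell in depart(t):
        print(t, out, cell)   # cells emerge on output 2, possibly out of order
```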
Optical two-stage router
[Figure: line cards (lookup + buffer) connected through two optical phases, phase 1 and phase 2.]
100 Tb/s Load-Balanced Router
[Figure: G = 40 linecard racks, each holding L = … Gb/s linecards, connected to a central MEMS switch rack dissipating < 100 W.]
Predictions: Core Internet routers
The need for more capacity for a given power and volume budget will mean:
- Fewer functions in routers:
  - Little or no optimization for multicast
  - Continued over-provisioning will lead to little or no support for QoS, DiffServ, …
- Fewer unnecessary requirements:
  - Mis-sequencing will be tolerated
  - Latency requirements will be relaxed
- Less programmability in routers, and hence no network processors (NPs used at the edge…)
- Greater use of optics to reduce power in the switch
Likely Events
The need for capacity and reliability will mean:
- Widespread replacement of core routers with transport switching based on circuits:
  - Circuit switches have proved simpler, more reliable, lower power, higher capacity and lower cost per Gb/s. Eventually, this is going to matter.
- The Internet will evolve to become edge routers interconnected by a rich mesh of WDM circuit switches.
Summary
- High-speed routers: lookup, switching, classification, buffer management
- Lookup: range matching, tries, multi-way tries
- Switching: circuit switching, crossbar, Batcher-banyan, …
- Queuing: input/output queuing issues
- Classification, scheduling: …
- Road ahead to 100 Tbps routers…