Techniques and problems for 100Tb/s routers
Nick McKeown
Professor of Electrical Engineering and Computer Science, Stanford University
nickm@stanford.edu
http://www.stanford.edu/~nickm
Topics
System architecture:
  - Parallel Packet Switches: building big routers from lots of little routers (today)
  - TCP Switching: exposing (optical) circuits to IP
  - Load-balancing and mis-sequencing
Specific problems:
  - Fast IP lookup and classification
  - Fast packet buffers: buffering packets at 40+ Gb/s line rates (today)
  - Crossbar scheduling (other SNRC project)
Building a big router from lots of little routers
What we'd like: a single NxN router with line rate R per port.
The building blocks we'd like to use: smaller routers, each with line rate R.
[Figure: an NxN rate-R router assembled from rate-R building blocks]
Why this might be a good idea
  - Larger overall capacity
  - Faster line rates
  - Redundancy
  - Familiarity: "After all, this is how the Internet is built"
Parallel Packet Switch
[Figure: N = 4 external ports, each at line rate R. Each input has a demultiplexor and each output a multiplexor; between them sit k = 3 output-queued (OQ) switches in parallel, each connected to the demultiplexors and multiplexors by internal links of rate sR/k.]
Parallel Packet Switch Results
A PPS can precisely emulate a FCFS shared-memory switch with:
  - speedup S > 2k/(k+2) ≈ 2 and a centralized algorithm, or
  - speedup S = 1, with (N·k) buffers in the demultiplexors and multiplexors, and a distributed algorithm.
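A worked check of the speedup bound above, using the 2k/(k+2) expression as stated on this slide (k = 3 matches the figure):

```latex
% Speedup needed for FCFS emulation with a centralized algorithm,
% as stated above: S > 2k/(k+2).
%   k = 3 (the figure):  2*3/(3+2) = 6/5 = 1.2
%   k -> infinity:       2k/(k+2) -> 2
% so a speedup of 2 suffices regardless of the number of layers.
\[
  S > \frac{2k}{k+2}, \qquad
  \frac{2k}{k+2}\bigg|_{k=3} = \frac{6}{5} = 1.2, \qquad
  \lim_{k\to\infty}\frac{2k}{k+2} = 2 .
\]
```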
Topics
System architecture:
  - Parallel Packet Switches: building big routers from lots of little routers
  - TCP Switching: exposing (optical) circuits to IP
  - Load-balancing and mis-sequencing
Specific problems:
  - Fast IP lookup and classification
  - Fast packet buffers: buffering packets at 40+ Gb/s line rates (today)
  - Crossbar scheduling (other SNRC project)
Optics and Routing Seminar: A Motivating System
A Phictitious 100Tb/s IP Router
[Figure: 625 linecards (#1 ... #625), each performing line termination, IP packet processing, and packet buffering for 40Gb/s lines, carrying 160Gb/s toward a central switch over 160-320Gb/s links; arbitration via request/grant signalling.]
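A quick sanity check of the aggregate capacity implied by the figure; the assumption that each linecard aggregates four 40Gb/s lines into 160Gb/s toward the switch is inferred from the figure, not stated explicitly:

```python
# Aggregate capacity implied by the figure.
# Assumption (inferred, not stated): each of the 625 linecards carries
# 4 x 40 Gb/s external lines, i.e. 160 Gb/s toward the switch.
linecards = 625
per_linecard_gbps = 4 * 40                     # 160 Gb/s per linecard
print(linecards * per_linecard_gbps / 1000)    # 100.0 -> 100 Tb/s
```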
Basic Flow
[Figure: a packet's path through the router. Ingress linecard: packet parse, IP lookup against the forwarding table, buffering under buffer management, then a request/grant exchange with the arbiter for the 625x625 switch. Across the switch at 160-320Gb/s. Egress linecard: buffering under buffer management, then out at 160Gb/s.]
Packet buffers
[Figure: external line -> buffer memory -> external line, over a 64-byte wide bus; random access time ~5ns for SRAM, ~50ns for DRAM]
Rough estimate: 5 or 50ns per memory operation; two memory operations per packet (one write, one read); therefore a maximum of ~50 or ~5 Gb/s.
Aside: buffers need to be large for TCP to work well, so DRAM is usually required.
Packet buffers
Example of ingress queues on an OC768c linecard: OC768c = 40Gb/s; RTT×BW = 10Gb of buffering; 64-byte "cells"; one memory operation every 12.8ns.
[Figure: external line -> buffer memory (64-byte wide bus) -> scheduler or arbiter]
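The arithmetic behind the last two slides, as a small sketch; the ~250ms round-trip time is implied by RTT×BW = 10Gb at 40Gb/s rather than stated:

```python
line_rate = 40e9                      # OC768c, bits/s
cell_bits = 64 * 8                    # 64-byte cells

# One cell arrives (and one departs) every 12.8ns:
print(cell_bits / line_rate * 1e9)    # 12.8 ns

# RTT x BW buffering, with the ~250ms RTT implied by the slide:
print(0.25 * line_rate / 1e9)         # 10 Gb

# With one write + one read per cell, a memory with access time T
# sustains a line rate of at most cell_bits / (2T):
for name, t_ns in (("SRAM", 5), ("DRAM", 50)):
    print(name, cell_bits / (2 * t_ns * 1e-9) / 1e9, "Gb/s")
    # SRAM ~51 Gb/s, DRAM ~5 Gb/s -- matching the "~50 or 5 Gb/s" estimate
```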
This is not possible today
SRAM:
  + fast enough random access time, but
  - too expensive, and
  - too low density to store 10Gb of data.
DRAM:
  + high density means we can store the data, but
  - too slow (typically 50ns random access time).
FIFO caches: Memory Hierarchy
[Figure: Q FIFOs split across a memory hierarchy. Arriving packets at rate R enter a small ingress SRAM cache of FIFO tails; a large DRAM holds the middle of each FIFO, written and read b cells at a time; a small egress SRAM cache of FIFO heads serves requests from the arbiter or scheduler, which drives departures.]
We'll focus on the sizing of this SRAM.
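A minimal sketch of the head/tail caching idea in the figure; the class and field names are illustrative, not from the talk:

```python
from collections import deque

class CachedFIFO:
    """One logical FIFO split across the hierarchy in the figure:
    a small SRAM tail cache, a large DRAM 'middle', and a small SRAM
    head cache. Cells move between SRAM and DRAM only b at a time,
    so the slow DRAM sees one wide access instead of b narrow ones."""

    def __init__(self, b):
        self.b = b
        self.tail_sram = deque()   # most recently written cells
        self.dram = deque()        # middle of the FIFO, moved b cells at a time
        self.head_sram = deque()   # cells staged for imminent departure

    def write(self, cell):
        self.tail_sram.append(cell)
        if len(self.tail_sram) == self.b:      # one wide DRAM write
            self.dram.extend(self.tail_sram)
            self.tail_sram.clear()

    def replenish(self):
        """One wide DRAM read: stage up to b cells in the head cache."""
        for _ in range(min(self.b, len(self.dram))):
            self.head_sram.append(self.dram.popleft())

    def read(self):
        if not self.head_sram:
            self.replenish()                   # normally done ahead of time (see ECQF)
        if self.head_sram:
            return self.head_sram.popleft()
        return self.tail_sram.popleft() if self.tail_sram else None
```

Moving b cells per DRAM access is what makes the numbers work: with b = 8 and a 12.8ns cell slot, each direction needs one wide DRAM access per 102.4ns, within reach of a ~50ns access time.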
Earliest Critical Queue First (ECQF)
In SRAM: a dynamic buffer of size Q(b-1) cells (up to b-1 cells per queue).
Lookahead: arbiter requests for the next Q(b-1)+1 slots are known (held in a lookahead buffer).
Determine which queue runs into "trouble" first, and replenish b cells from DRAM for that troubled queue.
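A minimal sketch of the selection rule described above, assuming a simple occupancy model; variable names are illustrative:

```python
def earliest_critical_queue(occupancy, lookahead):
    """Return the queue that runs into trouble (runs dry) soonest.

    occupancy: dict queue_id -> cells currently in its head-SRAM cache
    lookahead: list of queue_ids for the next Q(b-1)+1 arbiter requests
    """
    remaining = dict(occupancy)
    for q in lookahead:
        remaining[q] -= 1
        if remaining[q] < 0:          # this request cannot be served from SRAM
            return q                  # q is the earliest critical queue
    return None                       # no queue becomes critical in the window

# Each time a queue is identified as critical, the memory-management
# algorithm issues one wide DRAM read of b cells for that queue.
print(earliest_critical_queue({0: 2, 1: 1, 2: 3},
                              [1, 0, 1, 2, 0, 0]))   # -> 1
```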
Results (single address bus)
Patient arbiter: ECQF-MMA (earliest critical queue first) minimizes the size of the SRAM buffer to Q(b-1) cells, and guarantees that requested cells are dispatched within Q(b-1)+1 cell slots.
Proof: pigeonhole principle (the same argument used for the strictly non-blocking conditions of a Clos network).
Implementation Numbers (64-byte cells, b = 8, DRAM T = 50ns)
VOQ switch, 32 ports:
  - Brute force: egress SRAM = 10Gb, no DRAM
  - Patient arbiter: egress SRAM = 115kb, latency = 2.9us, DRAM = 10Gb
  - Impatient arbiter: egress SRAM = 787kb, DRAM = 10Gb
  - Patient arbiter (MA): no SRAM, latency = 3.2us, DRAM = 10Gb
VOQ switch, 512 ports:
  - Brute force: egress SRAM = 10Gb, no DRAM
  - Patient arbiter: egress SRAM = 1.85Mb, latency = 45.9us, DRAM = 10Gb
  - Impatient arbiter: egress SRAM = 18.9Mb, DRAM = 10Gb
  - Patient arbiter (MA): no SRAM, latency = 51.2us, DRAM = 10Gb
(Speaker notes: add a point comparing against a buffer built only from SRAM; check and re-calculate all numbers for b = 8; focus on the "why?")
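The "patient arbiter" rows follow from the Q(b-1) bound on the previous slide; a quick check, assuming Q equals the port count (one VOQ per port) and a 12.8ns cell slot (64-byte cells at 40Gb/s):

```python
# Verify the patient-arbiter SRAM size and latency against Q(b-1) and Q(b-1)+1.
b, cell_bits, slot_ns = 8, 64 * 8, 12.8

for ports in (32, 512):
    Q = ports
    sram_kb = Q * (b - 1) * cell_bits / 1e3
    latency_us = (Q * (b - 1) + 1) * slot_ns / 1e3
    print(ports, round(sram_kb, 1), "kb,", round(latency_us, 1), "us")
    # 32  -> ~114.7 kb SRAM, ~2.9 us   (slide: 115 kb, 2.9 us)
    # 512 -> ~1835 kb  SRAM, ~45.9 us  (slide: 1.85 Mb, 45.9 us)
```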
Some Next Steps
Generalizing results and design rules for:
  - Selectable degrees of "impatience"
  - Deterministic vs. statistical bounds
Application to the design of large shared-memory switches.