The Fork-Join Router Nick McKeown Assistant Professor of Electrical Engineering and Computer Science, Stanford University
Outline Quick Background on Packet Switches What’s the problem? “What if data rates exceed memory bandwidth?” The Fork-Join Router Parallel Packet Switches
First Generation Packet Switches [Figure: shared backplane; CPU with buffer memory; three line interfaces, each with DMA and MAC] Fixed-length “DMA” blocks or cells, reassembled on the egress linecard. Fixed-length cells or variable-length packets.
Second Generation Packet Switches [Figure: CPU with buffer memory; three line cards, each with DMA, MAC, and local buffer memory]
Third Generation Packet Switches [Figure: switched backplane; CPU card; line cards, each with MAC and local buffer memory; labels: Line Interface, CPU, Memory]
Fourth Generation Packet Switches
Two Basic Techniques –Input-queued Crossbar: 1+1 = 2 operations per cell time –Shared Memory: N+N = 2N operations per cell time
Shared Memory: The Ideal [Figure: shared-memory switch holding queued cells] A large body of work has shown how to provide: –Fairness –Delay Guarantees –Delay Variation Control –Loss Guarantees –Statistical Guarantees
Precise Emulation of an Output Queued Switch [Figure: N×N output queued switch (ports 1 … N) =? combined input-output queued switch with scheduler]
Result Theorem: A speedup of 2-1/N is necessary and sufficient for a combined input- and output-queued switch to precisely emulate an output-queued switch for all traffic. Joint work with Balaji Prabhakar at Stanford.
How Fast Can I Make a Packet Buffer? [Figure: buffer memory, 5ns SRAM, 64-byte wide bus] Rough Estimate: –5ns per memory operation. –Two memory operations per packet. –Therefore, maximum 51.2Gb/s. –In practice, closer to 40Gb/s.
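The rough estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming (as on the slide) a 64-byte wide bus, 5 ns per memory operation, and two operations per packet (one write, one read); the function name `max_buffer_rate_gbps` is hypothetical.

```python
def max_buffer_rate_gbps(bus_width_bytes: int,
                         access_time_ns: float,
                         ops_per_packet: int) -> float:
    """Peak rate of a single packet buffer in Gb/s.

    Assumes each packet needs `ops_per_packet` memory operations and
    that every operation moves one bus-width word.
    """
    bits_per_word = bus_width_bytes * 8
    time_per_packet_ns = access_time_ns * ops_per_packet
    # bits per nanosecond is numerically equal to Gb/s
    return bits_per_word / time_per_packet_ns


rate = max_buffer_rate_gbps(bus_width_bytes=64, access_time_ns=5.0, ops_per_packet=2)
print(f"{rate:.1f} Gb/s")
```

With the slide's numbers this gives 512 bits every 10 ns, which is the 51.2 Gb/s ceiling quoted above.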
Is It Going to Get Better? [Figure: two plots against time: (1) Specmarks, memory size, gate density; (2) memory bandwidth (to core)]
Optical Physical Layers… …are Going to Make Things “Worse” DWDM: –More λ’s per fiber → more “ports” per switch. –# ports: 16, …, 1000’s. Data rate: –More b/s per λ → higher capacity. –Data rates: 2.5Gb/s, 10Gb/s, 40Gb/s, 160Gb/s, …
Approach #1: Ping-pong Buffering [Figure: two buffer memories, each on a 64-byte wide bus] Memory bandwidth doubled to ~80 Gb/s
Approach #2: Multiple Parallel Buffers aka Banking, Interleaving Buffer Memory Buffer Memory Buffer Memory Buffer Memory
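One way to picture banking is round-robin interleaving: consecutive cells go to consecutive banks, so each bank sees only every k-th access. The sketch below is an illustration only and is not taken from the slides; the class name `BankedBuffer` and the round-robin placement policy are assumptions.

```python
from collections import deque


class BankedBuffer:
    """FIFO buffer striped round-robin across memory banks (hypothetical sketch)."""

    def __init__(self, num_banks: int) -> None:
        self.banks = [deque() for _ in range(num_banks)]
        self.write_idx = 0  # bank that receives the next write
        self.read_idx = 0   # bank that serves the next read

    def write(self, cell) -> int:
        """Store a cell and return the index of the bank that took it."""
        bank = self.write_idx
        self.banks[bank].append(cell)
        self.write_idx = (bank + 1) % len(self.banks)
        return bank

    def read(self):
        """Return the oldest cell, visiting banks in the same round-robin order."""
        bank = self.read_idx
        cell = self.banks[bank].popleft()
        self.read_idx = (bank + 1) % len(self.banks)
        return cell


buf = BankedBuffer(num_banks=4)
banks_used = [buf.write(f"cell{i}") for i in range(8)]
cells_out = [buf.read() for _ in range(8)]
print(banks_used)
print(cells_out)
```

In this idealized access pattern each of the k banks handles one in every k operations, so each bank can run at roughly 1/k of the aggregate rate. Real traffic does not always allow such an even spread.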
The Fork-Join Router [Figure: N inputs and N outputs at rate R; k parallel routers (1, 2, …, k); labels: Router, Bufferless]
The Fork-Join Router Advantages –k memory bandwidth –k lookup/classification rate –k routing/classification table size Problems –How to demultiplex prior to lookup/classification? –How does the system perform/behave? –Can we predict/guarantee performance?
A Parallel Packet Switch [Figure: N inputs and N outputs at rate R; k parallel output queued switches (1, 2, …, k)]
Parallel Packet Switch Questions 1.Can it be work-conserving? 2.Can it emulate a single big output queued switch? 3.Can it support delay guarantees, strict-priorities, WFQ, …? 4.What happens with multicast?
Parallel Packet Switch Work Conservation [Figure: external link at rate R; k internal links (1, …, k), each at rate R/k; labels: Input Link Constraint, Output Link Constraint]
Parallel Packet Switch Work Conservation [Figure: external link at rate R; k internal links (1, …, k), each at rate R/k; label: Output Link Constraint]
Parallel Packet Switch Work Conservation [Figure: N inputs and N outputs at rate R; k parallel output queued switches (1, 2, …, k); internal links at rate S(R/k)]
Precise Emulation of an Output Queued Switch [Figure: N×N output queued switch (ports 1 … N) =? parallel packet switch (ports 1 … N)]
Parallel Packet Switch Theorems 1.If S > 2k/(k+2) ≈ 2 then a parallel packet switch can be work-conserving for all traffic. 2.If S > 2k/(k+2) ≈ 2 then a parallel packet switch can precisely emulate an FCFS output-queued switch for all traffic.
Parallel Packet Switch Theorems 3. If S > 3k/(k+3) ≈ 3 then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
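To get a feel for the thresholds in the theorems, the sketch below evaluates 2k/(k+2) and 3k/(k+3) for a few values of k. It only tabulates the bounds as they appear on the slides; the function names are hypothetical.

```python
from fractions import Fraction


def fcfs_speedup_bound(k: int) -> Fraction:
    """Threshold 2k/(k+2) from theorems 1 and 2 (work conservation, FCFS emulation)."""
    return Fraction(2 * k, k + 2)


def qos_speedup_bound(k: int) -> Fraction:
    """Threshold 3k/(k+3) from theorem 3 (WFQ, strict priorities, other QoS)."""
    return Fraction(3 * k, k + 3)


for k in (2, 4, 10, 100, 1000):
    print(k, float(fcfs_speedup_bound(k)), float(qos_speedup_bound(k)))
```

Both thresholds grow with k and stay strictly below 2 and 3 respectively, which is why the slides round them to those values.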
An aside Unbuffered Clos Circuit Switch Expansion factor required = 2-1/N
Clos Network [Figure: Clos network with input switches I_1 … I_X and output switches O_1 … O_X, m ports each, and R middle stage switches (a, b, c); connection matrix with rows I_1 … I_X and columns O_1 … O_X, with example entry b] ≤ min(R,m) entries in each row; ≤ min(R,m) entries in each column. Define: UIL(I_i) = used links at switch I_i to connect to middle stages. UOL(O_j) = used links at switch O_j to connect to middle stages. If we wish to connect I_i to O_j: When adding connection: |UIL(I_i)| ≤ m-1 and |UOL(O_j)| ≤ m-1. Worst-case: |UIL(I_i) ∪ UOL(O_j)| = 2m-2. Therefore, if R > 2m-2 (that is, R ≥ 2m-1) there is always a free middle stage.
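The counting argument can be checked with a small sketch. Middle stage switches are numbered 0 … R-1, and a new connection needs a middle switch whose link is unused at both the input switch and the output switch. The function name `free_middle_switch` and the value m = 4 are hypothetical, and the sketch assumes every middle switch has exactly one link to each input and output switch.

```python
def free_middle_switch(num_middle: int, used_in: set, used_out: set):
    """Return the index of a middle switch free at both ends, or None if blocked."""
    busy = used_in | used_out
    for mid in range(num_middle):
        if mid not in busy:
            return mid
    return None


m = 4  # ports per input/output switch (hypothetical value)

# Worst case: m-1 links busy at the input switch and m-1 at the output
# switch, all going through different middle switches.
used_in = set(range(0, m - 1))            # middle switches 0 .. m-2
used_out = set(range(m - 1, 2 * m - 2))   # middle switches m-1 .. 2m-3

blocked = free_middle_switch(2 * m - 2, used_in, used_out)  # R = 2m-2
routed = free_middle_switch(2 * m - 1, used_in, used_out)   # R = 2m-1
print(blocked, routed)
```

In the worst case the 2m-2 busy middle switches are all distinct, so one additional middle switch is what guarantees that the new connection can be routed.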
An aside Unbuffered Clos Circuit Switch Expansion factor required = 2-1/N Expansion 2 - 4/(k+2)
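The expansion 2 - 4/(k+2) quoted here is the same quantity as the 2k/(k+2) threshold in the parallel packet switch theorems. A one-line check, with k as used in those theorems:

```latex
\[
2 - \frac{4}{k+2} \;=\; \frac{2(k+2) - 4}{k+2} \;=\; \frac{2k}{k+2}
\]
```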
Fork-Join Router Project What’s next? Theory: –Extending results to distributed algorithms. –Extending results to multicast. Implementation/Prototyping: –Under discussion...