1 Tutorial
Survey of LL-FC Methods for Datacenter Ethernet 101 Flow Control
M. Gusat
Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries
26 Sept. 2006, IBM Zurich Research Lab
2 Outline
Part I
- Requirements of datacenter link-level flow control (LL-FC)
- Brief survey of the top 3 LL-FC methods: PAUSE (aka On/Off grants), credit, rate
- Baseline performance evaluation
Part II
- Selectivity and scope of LL-FC
- Per-what?: LL-FC's resolution
3 Req'ts of '.3x': Next Generation of Ethernet Flow Control for Datacenters
1. Lossless operation
   - No-drop expectation of datacenter apps (storage, IPC)
   - Low latency
2. Selective
   - Discrimination granularity: link, prio/VL, VLAN, VC, flow...?
   - Scope: backpressure upstream one hop, k hops, e2e...?
3. Simple... PAUSE-compatible!!
4 Generic LL-FC System
One link with 2 adjacent buffers: TX (SRC) and RX (DST)
The round-trip time (RTT) per link is the system's time constant
LL-FC issues:
- link traversal (channel Bw allocation)
- RX buffer allocation
- pairwise communication between the channel's terminations
- signaling overhead (PAUSE, credit, rate commands)
- backpressure (BP): increase / decrease injections
- stop-and-restart protocol
5 FC-Basics: PAUSE (On/Off Grants)
BP semantics: STOP / GO / STOP...
Stop/Go thresholds on the RX buffer trigger the backpressure: "over-run" = send STOP on the FC return path
[Diagram: xbar with TX queues (OQ) feeding the data link to downstream links; RX buffer with Stop/Go thresholds; PAUSE carried on the FC return path]
* Note: Selectivity and granularity of FC domains are not considered here.
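To make the STOP/GO semantics concrete, below is a minimal discrete-time sketch of On/Off flow control over one link. It is not taken from the talk; the RTT, buffer size, thresholds and the half-rate receiver in the second run are illustrative assumptions. With a buffer of 2*RTT+1 the sketch neither over-runs nor starves, previewing the sizing argument of slides 7-8.

```python
# Minimal sketch of threshold-based On/Off (PAUSE-style) link flow control.
from collections import deque

RTT = 4                  # round-trip time of the link, in time steps (assumption)
D = RTT // 2             # one-way delay in each direction
BUF = 2 * RTT + 1        # RX buffer sized for no-drop and no pipeline bubbles
STOP_AT = BUF - RTT      # STOP early: up to RTT-1 more packets can still arrive
GO_AT = RTT              # GO early: the restart takes a full RTT to refill the buffer

def simulate(steps=1000, drain_period=1):
    """drain_period=1: RX forwards at line rate; 2: RX is a half-rate bottleneck."""
    rx, paused, delivered = 0, False, 0
    data_wire = deque([0] * D, maxlen=D)      # packets in flight TX -> RX
    ctrl_wire = deque([None] * D, maxlen=D)   # STOP/GO commands in flight RX -> TX
    for t in range(steps):
        cmd = ctrl_wire.popleft()             # command issued D steps ago reaches TX
        if cmd is not None:
            paused = cmd
        inject = 0 if paused else 1           # TX keeps sending until it is STOPped
        arrived = data_wire.popleft()
        data_wire.append(inject)
        rx += arrived
        assert rx <= BUF, "over-run: the buffer is too small for this RTT"
        if rx > 0 and t % drain_period == 0:  # RX forwards one packet downstream
            rx -= 1
            delivered += 1
        if rx >= STOP_AT:                     # "over-run" threshold crossed: send STOP
            ctrl_wire.append(True)
        elif rx <= GO_AT:                     # enough room again: send GO
            ctrl_wire.append(False)
        else:
            ctrl_wire.append(None)            # keep the previous command
    return delivered / steps                  # achieved link utilisation

print(simulate())                   # ~1.0: a fast receiver is never throttled
print(simulate(drain_period=2))     # ~0.5: STOP/GO tracks the slow receiver, no drops
```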
6 FC-Basics: Credits
[Diagram: credit-based LL-FC between two adjacent buffers across the xbar]
* Note: Selectivity and granularity of FC domains are not considered here.
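The companion sketch for credit-based flow control (again with an assumed RTT and assumed credit counts, not slide values) shows that losslessness holds for any number of credits, while throughput is capped at roughly credits/RTT; the single-credit case reproduces the 25% figure discussed on the next slide.

```python
# Minimal sketch of credit-based link flow control.
from collections import deque

RTT = 4                  # round-trip time of the link, in time steps (assumption)
D = RTT // 2             # one-way delay in each direction

def simulate(credits, steps=1000):
    tx_credits = credits                      # 1 credit == 1 RX buffer location
    rx = delivered = 0
    data_wire = deque([0] * D, maxlen=D)      # packets in flight TX -> RX
    credit_wire = deque([0] * D, maxlen=D)    # credits in flight RX -> TX
    for _ in range(steps):
        tx_credits += credit_wire.popleft()   # returned credits reach TX after D steps
        inject = 1 if tx_credits > 0 else 0   # TX may only send while it holds a credit
        tx_credits -= inject
        arrived = data_wire.popleft()
        data_wire.append(inject)
        rx += arrived                         # never exceeds `credits`: no drop possible
        if rx > 0:                            # RX forwards one packet downstream
            rx -= 1
            delivered += 1
            credit_wire.append(1)             # the freed location returns one credit
        else:
            credit_wire.append(0)
    return delivered / steps                  # achieved link utilisation

for c in (1, 2, 4, 5):
    print(c, round(simulate(c), 2))           # ~min(1, c/RTT): 0.25, 0.5, 1.0, 1.0
```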
7 Correctness: Min. Memory for "No Drop"
"Minimum": to operate lossless => O(RTT_link)
- Credit: 1 credit = 1 memory location
- Grant: 5 (= RTT+1) memory locations
Credits
- Under full load the single credit is constantly looping between RX and TX
- RTT = 4 => max. performance = f(up-link utilisation) = 25%
Grants
- Determined by the slow restart: once the last packet has left the RX queue, it takes an RTT until the next packet arrives
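Restating the slide's 25% figure as a formula (my shorthand, assuming the link carries one packet per time slot), with C the number of credits:

```latex
U_{\text{credit}} = \min\!\left(1,\ \frac{C}{RTT}\right)
\qquad\Longrightarrow\qquad
C = 1,\ RTT = 4:\ U_{\text{credit}} = \tfrac{1}{4} = 25\%
```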
8 PAUSE vs. Credit @ M = RTT+1
"Equivalent" = 'fair' comparison:
1. Credit scheme: 5 credits = 5 memory locations
2. Grant scheme: 5 (= RTT+1) memory locations
The performance loss for PAUSE/grants is due to the lack of underflow protection: if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart).
For performance equivalent to credits, M = 9 is required for PAUSE.
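Putting slides 7 and 8 together for RTT = 4 (a restatement of the slides' numbers, not new material): the PAUSE buffer needs one RTT of headroom above the STOP threshold to absorb in-flight data, plus one RTT of reserve below the GO threshold to bridge the restart gap, whereas credits are lossless with a single location and reach full throughput with about an RTT's worth:

```latex
\begin{aligned}
\text{no drop:}\qquad & M_{\text{grant}} \ge RTT + 1 = 5, & M_{\text{credit}} &\ge 1 \\
\text{no drop and work-conserving:}\qquad & M_{\text{PAUSE}} \ge 2\,RTT + 1 = 9, & M_{\text{credit}} &= RTT + 1 = 5
\end{aligned}
```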
9 FC-Basics: Rate
Setup: RX queue Qi = 1 (full capacity). Max. flow (input arrivals) during one timestep (Dt = 1) is 1/8.
Goal: update the TX probability Ti of any sending node during the time interval [t, t+1) to obtain the new Ti applied during [t+1, t+2), i.e. an algorithm for obtaining Ti(t+1) from Ti(t).
Scenario: Initially the offered rate from source0 was set to .100 and from source1 to .025; all other processing rates were .125, hence all queues show low occupancy. At timestep 20, the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times the processing capacity.
Results: The average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
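The update rule itself is not spelled out on the slide, so the fluid-model sketch below uses an assumed additive-increase / multiplicative-decrease rule and assumed gains purely to illustrate the mechanism; only the scenario numbers (offered rates .100 and .025, service rate .125 dropping to .050 at timestep 20, max flow 1/8) follow the slide. It is not the algorithm evaluated in the talk.

```python
# Sketch of rate-based LL-FC: per-source transmit probability Ti, updated each
# time step from the occupancy of the congested downstream queue (fluid model).
MAX_FLOW = 1.0 / 8          # maximum flow per source per time step (Dt = 1), from the slide
ALPHA, BETA = 0.005, 0.9    # assumed increase / decrease gains (not from the slide)
TARGET = 0.25               # assumed queue-occupancy target (not from the slide)

def run(steps=5000, offered=(0.100, 0.025), slow_after=20,
        service=0.125, slow_service=0.050):
    T = [1.0, 1.0]                      # per-source transmit probabilities Ti
    q = delivered = 0.0
    for t in range(steps):
        mu = service if t < slow_after else slow_service
        # fluid injections: each source is limited by its offered rate and by Ti
        inj = sum(min(offered[i], T[i] * MAX_FLOW) for i in range(2))
        served = min(mu, q + inj)       # the congested queue drains at its service rate
        q += inj - served
        if t >= slow_after:
            delivered += served
        for i in range(2):              # backpressure: compute Ti(t+1) from Ti(t) and q
            if q < TARGET:
                T[i] = min(1.0, T[i] + ALPHA)   # additive increase while the queue is healthy
            else:
                T[i] *= BETA                    # multiplicative decrease under congestion
    return q, T, delivered / (steps - slow_after)

q, T, rate = run()
print(f"final queue={q:.2f}, T=({T[0]:.2f}, {T[1]:.2f}), long-run rate={rate:.3f}")
# with these gains the long-run delivered rate approaches the 0.050 bottleneck
```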
10 Conclusion Part I: Which Scheme is "Better"?
PAUSE
+ simple
+ scalable (lower signalling overhead)
- 2x M size required
Credits (absolute or incremental)
+ always lossless, independent of the RTT and memory size
+ adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...)
- not trivial for buffer sharing
- protocol reliability
- scalability
At equal M = RTT, credits show 30+% higher throughput vs. PAUSE
*Note: Stability of both was formally proven here
Rate: in-between PAUSE and credits
+ adopted in adapters
+ potential good match for BCN (e2e CM)
- complexity (cheap fast bridges)
11 Part II: Selectivity and Scope of LL-FC
"Per-Prio/VL PAUSE"
The FC-ed 'link' could be:
- a physical channel (e.g. 802.3x)
- a virtual lane (VL, e.g. IBA 2-16 VLs)
- a virtual channel (VC, larger figure)
- ...
Per-Prio/VL PAUSE is the often-proposed PAUSE v2.0... Yet, is it good enough for the next decade of datacenter Ethernet?
Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (Prizma CI)
12 Already Implemented in IBA (and other ICTNs...)
IBA has 15 FC-ed VLs for QoS; SL-to-VL mapping is performed per hop, according to capabilities.
However, IBA doesn't have VOQ-selective LL-FC ("selective" = per switch (virtual) output port).
So what? Hogging, aka buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0).
How can we prove that hogging really occurs in IBA?
A. Back-of-the-envelope reasoning
B. Analytical modeling of stability and work-conservation (papers available)
C. Comparative simulations: IBA, PCI-AS, etc. (next slides)
13 Simulation: Parallel Backup to a RAID Across an IBA Switch (IBA SE Hogging Scenario)
TX / SRC
- 16 independent IBA sources, e.g. 16 "producer" CPUs/threads
- SRC behavior: greedy, using any communication model (UD)
- SL: BE service discipline on a single VL (the other VLs suffer from their own)
Fabrics (single stage)
- 16x16 IBA generic SE
- 16x16 PCI-AS switch
- 16x16 Prizma CI switch
RX / DST
- 16 HDD "consumers"
- t0: initially each HDD sinks data at full 1x (100%)
- t_sim: during the simulation HDD[0] enters thermal recalibration or sector remapping; consequently HDD[0] progressively slows down its incoming link throughput: 90, 80, ..., 10%
14 First: Friendly Bernoulli Traffic
2 sources (A, B) sending @ (12x + 4x) to 16 * 1x end nodes (C..R)
[Chart: aggregate throughput vs. link-0 throughput reduction, comparing achievable performance against actual IBA performance; the gap is the throughput loss. Fig. from IBA Spec]
15 Myths and Fallacies about Hogging
Isn't IBA's static rate control sufficient? No, because it is STATIC.
Aren't IBA's VLs sufficient...?! No. VLs and ports are orthogonal dimensions of LL-FC:
1. VLs are for SL and QoS => VLs are assigned to prios, not ports!
2. Max. no. of VLs = 15 << max (SE_degree x SL) = 4K
Can SE buffer partitioning solve hogging, blocking and sat_trees, at least in single-SE systems? No.
1. Partitioning makes sense only w/ status-based FC (per bridge output port - see PCIe/AS SBFC); IBA doesn't have a native status-based FC
2. Sizing becomes the issue => we need dedication per I and O ports: M = O(SL * max{RTT, MTU} * N^2), a very large number! (see the sizing sketch below)
Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large.
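To see how quickly that bound grows, here is a back-of-the-envelope check with illustrative values of my own (a 32-port bridge, 8 service levels, 2 KB per dedicated buffer), not figures from the slide; even these modest numbers land at 16 MiB of dedicated buffering per bridge, a very large on-chip amount for 2006-era hardware.

```python
# Back-of-the-envelope check of M = O(SL * max{RTT, MTU} * N^2)
# with assumed values: 32 ports, 8 service levels, 2 KB per dedicated buffer.
SL, N, per_buffer_bytes = 8, 32, 2048
M = SL * per_buffer_bytes * N ** 2
print(f"{M / 2**20:.0f} MiB of dedicated buffering in a single bridge")  # 16 MiB
```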
16 Conclusion Part II: Selectivity and Scope of LL-FC
Despite 16 VLs, IBA/DCE is exposed to the "transistor effect": any single flow can modulate the aggregate throughput of all the others.
Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop).
Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC.
Q: Is QoS violation the price of 'non-blocking' LL-FC?
Possible granularities of LL-FC queuing domains:
A. CM can also serve as LL-FC in single-hop fabrics
B. Introduce VOQ-FC, an intermediate coarser grain: no. of VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs (worked out below)
Alternative: 802.1p (map prios to 8 VLs) + .1Q (map VLANs to 4K VCs)? Was proposed in 802.3ar...
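The upper end of that range is exactly the 64K figure quoted:

```latex
\#\text{VC} = \max\{VOQ\} \times \max\{VL\} = 4096 \times 16 = 65\,536 = 64\text{K}
```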
17 Backup
18 LL-FC Between Two Bridges
[Diagram: Switch[k] TX Port[k,j] (VOQ[1]..VOQ[n], TX Scheduler, LL-FC Reception) sends packets to Switch[k+1] RX Port[k+1,i] (RX Buffer, RX Mgnt. Unit for buffer allocation, LL-FC TX Unit); the LL-FC token travels back on the return path]