Less is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center
Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda
Latency in Data Centers
- Latency is becoming a primary performance metric in the data center.
- Low-latency applications: high-frequency trading, high-performance computing, large-scale web applications, RAMClouds (want < 10μs RPCs).
- What all these applications want from the network is predictable low-latency delivery of individual packets. For instance, one of RAMCloud's goals is 5-10μs RPCs end-to-end across the fabric. But, as we show, this is very difficult in today's networks.
Why Does Latency Matter?
- Traditional application: app logic and data structures live on a single machine; data access latency is << 1μs.
- Large-scale web application in the data center: app logic runs on racks of machines in an app tier, data lives on other racks in a data tier, and every data access crosses the fabric with 0.5-10ms of latency.
- [Figure: Alice's page (who does she know? what has she done? - friends such as Eric and Minnie, pics, apps, videos) fans out across the app and data tiers.]
- The comparison John Ousterhout uses to motivate RAMCloud: before the web, you loaded the application and its data onto one machine, limited by that machine's capacity. To deploy a large web application today, we spread application logic across racks of machines and data across other racks, so the application must go over the network to reach its data, with latency 4-5 orders of magnitude higher than on a single machine. Counted in small-object reads per second, that one machine actually beats the thousands of machines: we have scaled compute and storage, but not data access rates. Facebook, for example, can only make 100-150 internal requests per page.
- Latency limits the data access rate, which fundamentally limits applications.
- An operation may issue thousands of RPCs, so microseconds matter, even at the tail (e.g., the 99.9th percentile).
Reducing Latency
- Software and hardware are improving: kernel bypass, RDMA; RAMCloud software processing ~1μs; low-latency switches forward packets in a few hundred ns.
- Baseline fabric latency (propagation, switching) under 10μs is achievable.
- Queuing delay is random and traffic-dependent, and can easily reach hundreds of microseconds or even milliseconds. One 1500B packet = 12μs @ 1Gbps (see the back-of-the-envelope check below).
- Goal: reduce queuing delays to zero.
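A quick back-of-the-envelope check of the serialization numbers quoted above; a sketch, using only the packet size and link rate from the slide:

```python
# Serialization delay of one MTU-sized packet, and of a short queue of them.
def serialization_delay_us(num_bytes, gbps):
    """Time to put num_bytes on the wire at gbps Gb/s, in microseconds."""
    return num_bytes * 8 / (gbps * 1e9) * 1e6

print(serialization_delay_us(1500, 1.0))       # one 1500B packet at 1 Gbps -> 12.0 us
print(serialization_delay_us(20 * 1500, 1.0))  # a 20-packet backlog already adds 240 us
```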
Low Latency AND High Throughput
- Data center workloads mix short messages [100B-10KB], which need low latency, and large flows [1MB-100MB], which need high throughput.
- We want baseline fabric latency AND high throughput.
Why Do We Need Buffers?
- Main reason: to create "slack"
  - Handle temporary oversubscription
  - Absorb TCP's rate fluctuations as it discovers path bandwidth
- Example: the bandwidth-delay product rule of thumb. A single TCP flow needs B >= C x RTT of buffering for 100% throughput; with B < C x RTT, throughput drops below 100%.
- [Figure: throughput vs. buffer size - throughput reaches 100% once B >= C x RTT and falls off when B < C x RTT.]
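As a concrete instance of the rule of thumb (a sketch; the link speeds and the 300 μs RTT are illustrative numbers, not from the slide):

```python
# Bandwidth-delay product: buffering a single TCP flow needs for 100% throughput.
def bdp_bytes(link_gbps, rtt_us):
    """C x RTT, in bytes."""
    return link_gbps * 1e9 / 8 * rtt_us * 1e-6

print(bdp_bytes(1.0, 300))   # 1 Gbps link, 300 us RTT  -> 37500 bytes (~37 KB)
print(bdp_bytes(10.0, 300))  # 10 Gbps link, 300 us RTT -> 375000 bytes (~366 KB)
```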
Overview of Our Approach
Main idea:
- Use "phantom queues" to signal congestion before any queuing occurs.
- Use DCTCP [SIGCOMM'10] to mitigate the throughput loss that can occur without buffers.
- Use hardware pacers to combat burstiness due to offload mechanisms like LSO and interrupt coalescing.
Review: DCTCP
- Key question: how can we extract multi-bit congestion information from a single-bit stream of ECN marks?
- Switch: set the ECN mark when queue length > K (mark above the threshold K, don't mark below; buffer size B).
- Source: react in proportion to the extent of congestion - reduce the window size based on the fraction of marked packets, giving smaller fluctuations (sketched below).

  ECN Marks              TCP                 DCTCP
  1 0 1 1 1 1 0 1 1 1    Cut window by 50%   Cut window by 40%
  0 0 0 0 0 0 0 0 0 1    Cut window by 50%   Cut window by 5%
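A minimal sketch of the DCTCP sender reaction described above. Variable names are mine; the EWMA gain g = 1/16 is an assumed (commonly cited) value, not something stated on the slide:

```python
class DctcpSender:
    """Sketch of DCTCP's reaction to ECN marks: estimate the fraction of
    marked packets (alpha) and cut the window in proportion to it."""

    def __init__(self, cwnd, g=1.0 / 16):
        self.cwnd = cwnd      # congestion window (packets)
        self.alpha = 0.0      # running estimate of the fraction of marked packets
        self.g = g            # EWMA gain

    def on_window_acked(self, acked, marked):
        """Called once per window: acked packets ACKed, marked of them ECN-marked."""
        frac = marked / acked if acked else 0.0
        # Update the moving average of the marking fraction.
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked > 0:
            # Cut in proportion to the extent of congestion
            # (alpha = 1 reduces to TCP-style halving; small alpha barely backs off).
            self.cwnd = max(1, int(self.cwnd * (1 - self.alpha / 2)))
        else:
            self.cwnd += 1    # additive increase, as in TCP

# Example: a window where 1 of 10 packets was marked barely shrinks cwnd.
s = DctcpSender(cwnd=100)
s.on_window_acked(acked=10, marked=1)
print(s.cwnd, round(s.alpha, 4))
```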
DCTCP vs TCP: Queue Length (KBytes)
- Setup: Windows 7 hosts, Broadcom 1Gbps switch (real hardware, not ns-2).
- Scenario: 2 long-lived flows, ECN marking threshold = 30KB.
- Queue-length standard deviation: TCP 33.6KB, DCTCP 11.5KB.
- From Alizadeh et al. [SIGCOMM'10].
Achieving Zero Queuing Delay
- [Figure: queue buildup for incoming traffic] TCP: ~1-10ms of queuing; DCTCP (marking threshold K, link capacity C): ~100μs.
- Target: ~zero latency. How do we get this?
Phantom Queue
- Key idea: associate congestion with link utilization, not buffer occupancy.
- Related to virtual queues (Gibbens & Kelly 1999; Kunniyur & Srikant 2001).
- A "bump on the wire" (NetFPGA implementation) attached to a link of speed C: it simulates a queue draining at γC and ECN-marks packets once this phantom queue exceeds the marking threshold (sketched below).
- γ < 1 creates "bandwidth headroom".
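A minimal sketch of the phantom-queue bookkeeping described above (not the NetFPGA implementation; γ, the marking threshold, and the names are illustrative assumptions):

```python
class PhantomQueue:
    """Counter that drains at gamma*C and ECN-marks packets once it exceeds a
    threshold, so congestion is signaled while the real queue is still empty."""

    def __init__(self, link_gbps, gamma=0.95, mark_thresh_bytes=6 * 1024):
        self.drain_rate = gamma * link_gbps * 1e9 / 8   # bytes per second
        self.mark_thresh = mark_thresh_bytes
        self.backlog = 0.0          # simulated (phantom) queue length, in bytes
        self.last_time = 0.0

    def on_packet(self, now, pkt_bytes):
        """Update the phantom backlog; return True if the packet should be ECN-marked."""
        # Drain the counter for the time elapsed since the last packet.
        self.backlog = max(0.0, self.backlog - (now - self.last_time) * self.drain_rate)
        self.last_time = now
        self.backlog += pkt_bytes
        return self.backlog > self.mark_thresh

pq = PhantomQueue(link_gbps=10.0)
print(pq.on_packet(now=0.0, pkt_bytes=1500))  # False: backlog far below the threshold
```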
Throughput & Latency vs. PQ Drain Rate
[Figure: throughput and mean switch latency as a function of the phantom queue drain rate.]
The Need for Pacing
- TCP traffic is very bursty.
- Made worse by CPU-offload optimizations like Large Send Offload (LSO) and interrupt coalescing.
- Bursts cause spikes in queuing, increasing latency.
- Example: a 1Gbps flow on a 10G NIC is emitted as 65KB bursts every 0.5ms (checked below).
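A quick check that those bursts average out to the flow's rate while hitting the wire at line rate; a sketch using only the numbers on the slide:

```python
# 65KB every 0.5ms averages to ~1 Gbps, but each burst leaves the 10G NIC back-to-back.
burst_bytes = 65 * 1024
interval_s = 0.5e-3

avg_gbps = burst_bytes * 8 / interval_s / 1e9
burst_duration_us = burst_bytes * 8 / 10e9 * 1e6

print(round(avg_gbps, 2))           # ~1.06 Gbps average rate
print(round(burst_duration_us, 1))  # each burst occupies the 10G link for ~53 us
```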
Impact of Interrupt Coalescing

  Setting         Receiver CPU (%)   Throughput (Gbps)   Burst Size (KB)
  disabled        99                 7.7                 67.4
  rx-frames=2     98.7               9.3                 11.4
  rx-frames=8     75                 9.5                 12.2
  rx-frames=32    53.2               -                   16.5
  rx-frames=128   30.7               -                   64.0

More interrupt coalescing: lower CPU utilization and higher throughput, but more burstiness.
Hardware Pacer Module
- Algorithmic challenges:
  - At what rate to pace? Found dynamically.
  - Which flows to pace? Elephants: on each ACK with the ECN bit set, begin pacing the flow with some probability.
- [Figure: outgoing packets from the server pass through the NIC; a flow association table steers paced flows into a token bucket rate limiter (rate R, queue Q_TB) before TX, while un-paced traffic bypasses it.] A software analogue is sketched below.
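A minimal sketch of the per-flow token-bucket rate limiting the pacer performs. This is a software analogue, not the NetFPGA module; the pacing rate R is assumed to come from whatever dynamic estimate the pacer keeps, and the association probability is an illustrative value:

```python
import random

class TokenBucketPacer:
    """Software analogue of the pacer: flows flagged on ECN-marked ACKs are
    associated with a token bucket of rate R; everything else passes through."""

    def __init__(self, rate_bytes_per_s, bucket_bytes, assoc_prob=0.125):
        self.rate = rate_bytes_per_s       # pacing rate R (assumed given)
        self.bucket_cap = bucket_bytes
        self.assoc_prob = assoc_prob       # illustrative association probability
        self.tokens = bucket_bytes
        self.last_refill = 0.0
        self.paced_flows = set()           # flow association table

    def on_ack(self, flow_id, ecn_marked):
        # On each ACK carrying an ECN mark, start pacing the flow with some probability.
        if ecn_marked and random.random() < self.assoc_prob:
            self.paced_flows.add(flow_id)

    def send_delay(self, now, flow_id, pkt_bytes):
        """Return how long this packet must wait before transmission (0 if un-paced)."""
        if flow_id not in self.paced_flows:
            return 0.0
        # Refill tokens at rate R for the time elapsed since the last refill.
        self.tokens = min(self.bucket_cap,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return 0.0
        deficit = pkt_bytes - self.tokens
        self.tokens = 0.0
        return deficit / self.rate         # wait until enough tokens accumulate
```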
Throughput & Latency vs. PQ Drain Rate (with Pacing)
[Figure: throughput and mean switch latency vs. phantom queue drain rate, with pacing enabled.]
No Pacing vs Pacing (Mean Latency)
No Pacing vs Pacing (99th Percentile Latency)
The HULL Architecture
- Phantom Queue
- DCTCP Congestion Control
- Hardware Pacer
Implementation and Evaluation
- Implementation:
  - PQ, Pacer, and latency measurement modules implemented in NetFPGA
  - DCTCP in Linux (patch available online)
- Evaluation:
  - 10-server testbed
  - Numerous micro-benchmarks: static & dynamic workloads; comparison with an 'ideal' 2-priority QoS scheme; different marking thresholds and switch buffer sizes; effect of parameters
  - Large-scale ns-2 simulations
Dynamic Flow Experiment (20% load)
- 9 senders, 1 receiver; 80% 1KB flows, 20% 10MB flows.

                       Switch Latency (μs)      10MB FCT (ms)
                       Avg        99th          Avg        99th
  TCP                  111.5      1,224.8       110.2      349.6
  DCTCP-30K            38.4       295.2         106.8      301.7
  DCTCP-6K-Pacer       6.6        59.7          111.8      320.0
  DCTCP-PQ950-Pacer    2.8        18.6          125.4      359.9
Conclusion
- The HULL architecture combines phantom queues, DCTCP, and hardware pacing.
- We trade some bandwidth, a resource that is relatively plentiful in modern data centers, for significant latency reductions (often 10-40x compared to TCP and DCTCP).
Thank you!