1
George Prekas, Marios Kogias, Edouard Bugnion
ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks
2
My name is Marios Kogias, and I am going to present our system, ZygOS, built in collaboration with George Prekas and our advisor Ed Bugnion at EPFL. ZygOS is the Greek word for a balancing scale; we named our system that way because we are dealing with careful load balancing of tiny tasks within a server.
3
Problem: Serve μs-scale RPCs
Applications: KV-stores, in-memory databases.
Datacenter environment: complex fan-out/fan-in patterns across load balancers, root nodes, and leaf nodes; the tail-at-scale problem; tail-latency Service-Level Objectives.
Goal: improve throughput at an aggressive tail-latency SLO.
How? Focus within the leaf nodes: reduce system overheads and achieve better scheduling.
4
Elementary Queuing Theory
Model components: processors (FCFS or processor sharing), a multi-queue or single-queue configuration, a Poisson inter-arrival distribution (λ), and a service-time distribution (μ) that is fixed, exponential, or bimodal. FCFS queuing models have no OS overheads, are independent of the service time, and act as an upper performance bound.
Although this is a systems paper, in the rest of the talk we'll leverage some basic queuing theory to better understand and reason about scheduling. We model our systems as a collection of processors and queues with an inter-arrival request distribution and a service-time distribution. We do that because queuing models describe only the application behavior, free of any operating system overheads. As a result, these models are independent of the service time and can be used as upper performance bounds.
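To make the single-queue versus multi-queue comparison concrete, below is a minimal simulation sketch (our own illustration, not the simulator behind the talk's figures). The worker count, load points, and the exact bimodal shape are assumptions chosen for illustration: a single FCFS queue feeding 16 workers is compared with 16 independent per-worker queues at the same total load.

```python
import random

def bimodal(mean):
    # One possible bimodal shape (assumed): 10% of requests take 5.5x the mean,
    # the remaining 90% take 0.5x, so the overall mean is preserved.
    return 5.5 * mean if random.random() < 0.1 else 0.5 * mean

def p99_latency(n_workers, load, single_queue, n_reqs=200_000, mean_svc=1.0):
    lam = load * n_workers / mean_svc        # aggregate Poisson arrival rate
    free = [0.0] * n_workers                 # time at which each worker becomes idle
    now, lat = 0.0, []
    for _ in range(n_reqs):
        now += random.expovariate(lam)       # Poisson inter-arrival times
        svc = bimodal(mean_svc)
        if single_queue:
            w = min(range(n_workers), key=free.__getitem__)   # FCFS: earliest-free worker
        else:
            w = random.randrange(n_workers)  # multi-queue: request pinned to a random worker
        start = max(now, free[w])
        free[w] = start + svc
        lat.append(free[w] - now)            # queueing delay + service time
    lat.sort()
    return lat[int(0.99 * len(lat))]         # 99th-percentile latency

for load in (0.5, 0.7, 0.9):
    s = p99_latency(16, load, single_queue=True)
    m = p99_latency(16, load, single_queue=False)
    print(f"load={load:.1f}  p99 single-queue={s:5.1f}  p99 multi-queue={m:5.1f}")
```

At higher loads the multi-queue tail grows much faster than the single-queue one, which is the transient load imbalance the talk refers to.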
5
Connection Delegation
Baseline systems compared:
- Linux with partitioned connections: kernel networking (epoll), medium complexity, not work-conserving, multi-queue.
- Linux with floating connections: kernel networking (epoll), high complexity, work-conserving, single-queue.
- Dataplanes: userspace networking, partitioned connections, low complexity, not work-conserving, multi-queue.
There is a variety of alternatives: conventional operating systems such as Linux, or dataplanes. With partitioned connections, each core is responsible for a fixed set of connections. This reduces system complexity because applications don't have to deal with synchronisation, but it comes at the expense of work conservation: a core can sit idle while another core still has pending requests. Floating connections, on the other hand, are work-conserving and form a single-queue model, but applications have to use locks for synchronisation. Can we build a system with low overheads that achieves work conservation? A toy sketch contrasting the two delegation models follows.
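The sketch below (invented names and data, not the Linux baselines themselves) shows why partitioned connections give up work conservation while floating connections keep it at the cost of a shared, synchronized structure.

```python
from collections import deque

pending = {0: deque(["req-A", "req-B", "req-C"]), 1: deque()}  # per-connection backlogs
owner = {0: "core-0", 1: "core-1"}                             # connection -> home core

def next_task_partitioned(core):
    # Partitioned: a core only serves connections it owns, so it may idle
    # even though another core has a backlog (not work-conserving).
    for conn, q in pending.items():
        if owner[conn] == core and q:
            return conn, q.popleft()
    return None

def next_task_floating(core):
    # Floating: any core may serve any connection (work-conserving), but a real
    # server needs locks around this shared state.
    for conn, q in pending.items():
        if q:
            return conn, q.popleft()
    return None

print(next_task_partitioned("core-1"))  # None: core-1 idles while core-0 has three requests
print(next_task_floating("core-1"))     # (0, 'req-A'): the idle core picks up pending work
```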
6
Upcoming Key Observations: Key Contributions
Key observations: single-queue systems perform better in theory; dataplanes, despite being multi-queue systems, perform better in practice.
Key contributions: ZygOS combines the best of the two worlds: reduced system overheads, similar to dataplanes, and convergence to a single-queue model.
For the rest of my talk I am going to focus on and analyze these two observations, which will guide our system design: (1) single-queue systems perform better in theory; (2) despite this fact, dataplanes have proven to perform better in practice, because their sweeping simplifications lead to lower overheads. Our system, ZygOS, combines the best of the two worlds.
7
Analysis Metric to optimize: Load @ Tail-Latency SLO
Run timescale-independent simulations. Run synthetic benchmarks on a real system. Questions: Which model achieves better throughput? Which system converges to its model even at low service times?
To make the previous observations clear and to guide our design, we perform the following analysis. We try to optimize throughput under a tail-latency SLO, and we leverage both queuing-model simulations and synthetic benchmarks on real systems (the metric is sketched below). The goal of this analysis is to understand which model achieves better throughput at the SLO, and which system converges to its model even at low service times, i.e., which has lower system overheads.
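The metric itself is simple to state in code. The helper below (with made-up sample numbers) returns the highest measured load whose 99th-percentile latency stays within 10x the average service time.

```python
def load_at_slo(curve, mean_service_time_us, slo_factor=10):
    # curve: measured (load, p99 latency) points for one system, e.g. (MRPS, microseconds).
    slo = slo_factor * mean_service_time_us
    feasible = [load for load, p99 in curve if p99 <= slo]
    return max(feasible, default=0.0)

# Made-up sample points for illustration: with a 10us mean service time the SLO is 100us,
# so the reported "load at SLO" is 0.6 MRPS.
curve = [(0.2, 25.0), (0.4, 40.0), (0.6, 70.0), (0.8, 120.0)]
print(load_at_slo(curve, mean_service_time_us=10.0))
```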
8
Latency vs Load – Queuing model
Fixed, exponential, and bimodal panels; the mismatch grows at higher dispersion. We used three distributions to study the impact of service-time dispersion on tail latency (bimodal: 10% of requests are 11 times faster than the others). Single-queue models provide better throughput at the SLO because multi-queue models suffer from transient load imbalance. 99th-percentile latency SLO: 10 x AVG[service_time].
9
Latency vs Load – Service Time 10μs
Fixed, exponential, and bimodal panels. We repeat the same experiment, now using real systems running a benchmark with a synthetic service time. We deployed... and we measured the end-to-end latency on the client side. The red line is the SLO... The maximum throughput we can achieve here is 1.6 MRPS, because we run on a 16-core machine. Dataplanes perform better because of their reduced overheads, but as the service-time dispersion increases these benefits diminish. Linux with floating connections maintains steady throughput. 99th-percentile latency SLO: 10 x AVG[service_time]. IX, Belay et al., OSDI 2014.
10
Latency vs Load – Service Time 25μs
Fixed, exponential, and bimodal panels; Linux with floating connections outperforms IX. We run the same experiment with a 25μs service time, where the relative system overheads are smaller. Dataplanes perform better only at very low service times with low dispersion. 99th-percentile latency SLO: 10 x AVG[service_time]. IX, Belay et al., OSDI 2014.
11
Implement work-stealing to achieve work-conservation in a dataplane
ZygOS approach. Dataplane aspects: reduced system overheads, share-nothing network processing. Single-queue aspects: work conservation, reduction of head-of-line blocking. If we combine the previous findings, we arrive at the design of our system: keep the dataplane's low overheads and share-nothing network processing, and implement work-stealing to achieve work conservation in a dataplane.
12
Background on IX
[Figure: the IX architecture: an event-driven application with libIX in Ring 3, communicating through event conditions and batched syscalls with the dataplane in guest Ring 0 (TCP/IP processing, timer, RX FIFO, RX/TX queues); numbered steps mark the processing loop.]
We started from IX, which is a virtualized dataplane that leverages RSS.
13
IX design / ZygOS design: application layer, shuffle layer, network layer
Application layer: an event-based application that is agnostic to work-stealing. Shuffle layer: a per-core list of ready connections that allows stealing. Network layer: coherence- and synchronisation-free network processing; this is where dataplanes get their benefits.
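A conceptual sketch of the shuffle layer follows. The class and function names are our own and the structure is simplified (the real shuffle queues are shared across cores and protected accordingly); it only illustrates the idea of a per-core list of ready connections from which idle cores can steal.

```python
import random
from collections import deque

class ShuffleQueue:
    """Per-core list of connections with ready events (simplified, no locking)."""
    def __init__(self):
        self.ready = deque()

    def push(self, conn):
        self.ready.append(conn)

    def pop_local(self):
        # The home core consumes from the front of its own queue.
        return self.ready.popleft() if self.ready else None

    def steal(self):
        # A thief takes from the other end of a victim's queue.
        return self.ready.pop() if self.ready else None

def next_connection(core_id, shuffle_queues):
    conn = shuffle_queues[core_id].pop_local()   # serve local work first
    if conn is not None:
        return conn
    victims = [i for i in range(len(shuffle_queues)) if i != core_id]
    random.shuffle(victims)
    for v in victims:                            # otherwise, try to steal from other cores
        conn = shuffle_queues[v].steal()
        if conn is not None:
            return conn
    return None                                  # no pending work anywhere: truly idle

queues = [ShuffleQueue() for _ in range(4)]
queues[0].push("conn-17")
queues[0].push("conn-42")
print(next_connection(3, queues))   # core 3 has no local work, so it steals "conn-42" from core 0
```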
14
ZygOS Architecture
[Figure: per core, an event-driven application with libIX in Ring 3 (application layer); a shuffle layer with per-core shuffle queues and remote syscalls; and the network layer (TCP/IP, timer, RX/TX queues) in guest Ring 0, shown for a home core and a remote core.]
This figure shows the ZygOS architecture, with the same structure replicated on each core.
15-22
Execution Model
[Figure animation across eight slides: the ZygOS architecture diagram (event-driven app and libIX in Ring 3; shuffle layer with per-core shuffle queues and remote syscalls; TCP/IP, timer, and RX/TX queues in guest Ring 0), stepping through how work flows between the home core and a remote core.]
23
Dealing with Head of Line Blocking
Definition: the home core is busy in userspace, tasks are pending, and remote cores are idle. ZygOS sends inter-processor interrupts (IPIs) when HOL blocking is detected; high service-time dispersion increases HOL blocking. So far we showed how ZygOS deals with load imbalance, but the previous design still suffers from head-of-line blocking. A remote processor sends an IPI to the home core once HOL blocking is detected.
24-26
HOL – RX
[Figure animation: an idle remote core detects receive-side head-of-line blocking and sends an IPI to the busy home core, so pending RX work reaches the shuffle queue and can be stolen.]
27-28
HOL – TX
[Figure animation: a remote core issues a remote syscall and sends an IPI to the busy home core to resolve transmit-side head-of-line blocking.]
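To summarize the interrupt mechanism, here is a toy model (invented fields and names, not the ZygOS implementation): an idle core that finds nothing to steal checks whether any busy home core still has unprocessed receive-side work or pending remote syscalls, and if so it simulates sending that core an IPI.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    cid: int
    busy_in_userspace: bool = False
    rx_fifo: list = field(default_factory=list)          # packets not yet processed by TCP/IP
    remote_syscalls: list = field(default_factory=list)  # TX work posted by remote cores
    ipis_received: int = 0                                # stand-in for real inter-processor interrupts

def check_hol_and_interrupt(idle_core, cores):
    # Called by a core that has no local work and found nothing to steal.
    for home in cores:
        if home is idle_core or not home.busy_in_userspace:
            continue
        if home.rx_fifo or home.remote_syscalls:          # head-of-line blocking detected
            # An IPI would make the home core briefly run its network layer;
            # here we just count it and report it.
            home.ipis_received += 1
            print(f"core {idle_core.cid} sends IPI to core {home.cid}")

cores = [Core(0, busy_in_userspace=True, rx_fifo=["pkt"]), Core(1)]
check_hol_and_interrupt(cores[1], cores)   # core 1 is idle and detects HOL blocking on core 0
```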
29
Evaluation Setup
Environment: 10+1 Xeon servers (the server machine has 16 hyperthreads), connected by a Quanta/Cumulus 48x10GbE switch.
Experiments: synthetic micro-benchmarks, Silo [SOSP 2013], Memcached.
Baselines: IX, Linux (partitioned and floating connections).
30
Latency vs Load – Service Time 10μs
Fixed, exponential, and bimodal panels. 99th-percentile latency SLO: 10 x AVG[service_time]. IX, Belay et al., OSDI 2014.
31
Latency vs Load – Service Time 10μs
Fixed, exponential, and bimodal panels. ZygOS achieves better latency than IX even in cases of low service-time dispersion, and it performs better as the dispersion increases; this becomes obvious in the case of the bimodal distribution. 99th-percentile latency SLO: 10 x AVG[service_time]. IX, Belay et al., OSDI 2014.
32
Latency vs Load – Service Time 10μs
Fixed, exponential, and bimodal panels; annotation: interrupt benefit. 99th-percentile latency SLO: 10 x AVG[service_time].
33
Latency vs Load – Service Time 10μs
Fixed, exponential, and bimodal panels; annotation: closer to single-queue. ZygOS achieves much better throughput than Linux and much lower tail latency than IX. It achieves the best convergence to the optimal theoretical model, and it performs well even with service-time dispersion because of the interrupt mechanism. 99th-percentile latency SLO: 10 x AVG[service_time].
34
Silo with TPC-C workload
Five types of transactions; service-time variability; average service time: 33μs; open-loop experiment with 2752 TCP connections; we report 99th-percentile latency. We chose Silo, an in-memory database, because it is an application with potential service-time variability. We ran TPC-C on top of it, which has five transaction types, and we ended up with a multi-modal service-time distribution: the median is 20μs and the 99th percentile is 203μs. (How big was the database? 40GB.)
35
Silo with TPC-C workload
1.63x speedup over Linux; 3.68x lower 99th-percentile latency.
36
Conclusion ZygOS: A datacenter operating system for low-latency
Reduced system overheads; converges to a single-queue model; work conservation through work stealing; reduced HOL blocking through lightweight IPIs. We ♥️ open source. I presented ZygOS, a datacenter operating system for low latency. ZygOS achieves reduced system overheads while converging to a single-queue model. To do that, we implement work stealing for work conservation and leverage lightweight IPIs to reduce HOL blocking.