ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks
George Prekas, Marios Kogias, Edouard Bugnion

Presentation transcript:

ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks
George Prekas, Marios Kogias, Edouard Bugnion
My name is Marios Kogias, and I am going to present our system, ZygOS, done in collaboration with George Prekas and our advisor Ed Bugnion at EPFL. ZygOS is the Greek word for a balancing scale; we named our system that way because we deal with careful load balancing of tiny tasks within a server.

Problem: Serve μs-scale RPCs
Applications: key-value stores, in-memory databases
Datacenter environment: complex fan-out/fan-in patterns, the tail-at-scale problem, tail-latency service-level objectives
Goal: improve throughput under an aggressive tail-latency SLO
How? Focus within the leaf nodes: reduce system overheads and achieve better scheduling
[Figure: a load balancer in front of a fan-out/fan-in tree of root and leaf nodes]

Elementary Queuing Theory
Model dimensions: processor discipline (FCFS or processor sharing), multi-queue vs. single-queue, inter-arrival distribution (λ: Poisson), service-time distribution (μ: fixed, exponential, bimodal)
No OS overheads; independent of the absolute service time; an upper performance bound
Although this is a systems paper, in the rest of the talk we will leverage some basic queuing theory to better understand and reason about scheduling. We model our systems as collections of processors and queues with an inter-arrival distribution and a service-time distribution. We do that because queuing models describe only the application behavior, free of any operating-system overheads. As a result, these models are independent of the absolute service time and can be used as upper performance bounds.
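As a worked example (this calculation is ours, not a number from the talk), standard M/M/1 formulas already show how tight a 99th-percentile SLO of 10× the average service time is for a single FCFS queue with Poisson arrivals and exponential service times:

```latex
% For M/M/1 FCFS, the response time T is exponential with rate (mu - lambda):
\[
  \Pr[T > t] = e^{-(\mu - \lambda)t}
  \quad\Rightarrow\quad
  t_{99} = \frac{\ln 100}{\mu - \lambda}.
\]
% Imposing the SLO used later in the talk, t_99 <= 10/mu, and solving for the
% utilization rho = lambda/mu gives
\[
  \frac{\ln 100}{\mu - \lambda} \le \frac{10}{\mu}
  \quad\Rightarrow\quad
  \rho \le 1 - \frac{\ln 100}{10} \approx 0.54 .
\]
```

So even with no OS overheads, an isolated per-core FCFS queue with exponential service times can sustain only about 54% load before violating that SLO, which is why the choice between one shared queue and many per-core queues matters so much.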

Connection Delegation
- Linux, partitioned connections: kernel networking (epoll), medium complexity, not work conserving, multi-queue
- Linux, floating connections: kernel networking (epoll), high complexity, work conserving, single queue
- Dataplanes: userspace networking, low complexity, not work conserving, multi-queue
There is a variety of alternatives: conventional operating systems, such as Linux, or dataplanes. With partitioned connections, each core is responsible for a fixed set of connections. This reduces system complexity, because applications do not have to deal with synchronisation, but it comes at the expense of work conservation. In this model we can have… On the other hand, floating connections are work conserving: it is a single-queue model, but applications have to use locks for synchronisation. Can we build a system with low overheads that achieves work conservation?
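To make the two Linux models concrete, here is a minimal sketch of the two worker loops using the standard epoll API (socket setup, error handling, and the application logic are omitted; process_request is a hypothetical stand-in):

```c
/* Sketch of the two Linux connection-delegation models (illustrative only:
 * socket setup, error handling, and the application logic are omitted, and
 * process_request() is a hypothetical stand-in). */
#include <pthread.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

static void process_request(int fd) { (void)fd; /* application work goes here */ }

/* Partitioned: each worker owns a private epoll instance and the connections
 * registered with it.  No locks needed, but no work conservation: an idle
 * worker cannot help a loaded one. */
void *worker_partitioned(void *arg) {
    int my_epfd = *(int *)arg;                    /* per-core epoll fd */
    struct epoll_event evs[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(my_epfd, evs, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++)
            process_request(evs[i].data.fd);
    }
    return NULL;
}

/* Floating: all workers share one epoll instance, so any idle worker can pick
 * up any ready connection (a single-queue, work-conserving model), at the cost
 * of synchronization on shared state.  Real servers would also register events
 * with EPOLLONESHOT or EPOLLEXCLUSIVE to control which worker is woken. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker_floating(void *arg) {
    int shared_epfd = *(int *)arg;                /* one epoll fd for all cores */
    struct epoll_event evs[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(shared_epfd, evs, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            pthread_mutex_lock(&app_lock);
            process_request(evs[i].data.fd);
            pthread_mutex_unlock(&app_lock);
        }
    }
    return NULL;
}
```

The partitioned loop never takes a lock but can sit idle while another core's queue is backed up; the floating loop is work conserving but serializes on shared state, which is where the extra complexity and overhead come from.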

Key Observations and Key Contributions
Key observations: (1) single-queue systems perform better in theory; (2) dataplanes, despite being multi-queue systems, perform better in practice.
Key contributions: ZygOS combines the best of the two worlds, with reduced system overheads similar to dataplanes and convergence to a single-queue model.
For the rest of my talk I am going to focus on and analyze the two observations above, which guided our system design: single-queue systems perform better in theory, yet dataplanes have proven to perform better in practice because their sweeping simplifications lead to lower overheads. Our system, ZygOS, combines the best of the two worlds.

Analysis
Metric to optimize: load at the tail-latency SLO
Method: run timescale-independent simulations, and run synthetic benchmarks on a real system
Questions: Which model achieves better throughput? Which system converges to its model at low service times?
To make the previous observations clear and to guide our design, we perform the following analysis. We try to maximize throughput under a tail-latency SLO, and we leverage both queuing-model simulations and synthetic benchmarks on real systems. The goal of this analysis is to understand which model achieves better throughput at the SLO, and which system converges to its model even at low service times, i.e., which system has lower overheads.

Latency vs. Load – Queuing Model (fixed, exponential, bimodal service times)
We used three distributions to study the impact of service-time dispersion on tail latency. Bimodal: 10% of the requests are 11 times longer than the rest. The mismatch between the models grows at high dispersion: single-queue models provide better throughput at the SLO because multi-queue models suffer from transient load imbalance.
99th-percentile latency SLO: 10 × AVG[service_time]
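To see this effect directly, here is a minimal simulation sketch (ours, not the paper's simulator) that compares a single shared FCFS queue against statically partitioned per-worker queues under Poisson arrivals; the 16 workers, 75% offered load, and the 90%/10% split at 0.5×/5.5× the mean service time are assumptions chosen to match the bimodal description above:

```c
/* Minimal illustration (not the paper's simulator): compare the 99th-percentile
 * response time of a single shared FCFS queue vs. per-worker FCFS queues.
 * Parameters (16 workers, 75% load, 90%/10% bimodal split at 0.5x/5.5x the
 * mean) are assumptions for this sketch.  Compile with: cc -O2 sim.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NWORKERS 16
#define NJOBS    1000000

static double urand(void) {                   /* uniform in (0,1) */
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

static double exp_sample(double mean) {       /* exponential inter-arrivals */
    return -mean * log(urand());
}

static double bimodal_service(double mean) {  /* 90% short, 10% long, same mean */
    return (urand() < 0.9) ? 0.5 * mean : 5.5 * mean;
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double p99(double *lat, int n) {
    qsort(lat, n, sizeof(double), cmp_double);
    return lat[(int)(0.99 * n)];
}

static double lat_single[NJOBS], lat_multi[NJOBS];

int main(void) {
    const double mean_svc = 1.0;                      /* one time unit */
    const double lambda = 0.75 * NWORKERS / mean_svc; /* aggregate arrival rate */
    double free_single[NWORKERS] = {0}, free_multi[NWORKERS] = {0};
    double t = 0.0;
    srand(42);

    for (int i = 0; i < NJOBS; i++) {
        t += exp_sample(1.0 / lambda);                /* Poisson arrivals */
        double svc = bimodal_service(mean_svc);

        /* Single queue: the next job in FCFS order runs on whichever
         * worker becomes free first. */
        int best = 0;
        for (int w = 1; w < NWORKERS; w++)
            if (free_single[w] < free_single[best]) best = w;
        double start = free_single[best] > t ? free_single[best] : t;
        free_single[best] = start + svc;
        lat_single[i] = free_single[best] - t;

        /* Multi-queue: each job is pinned to one worker (random dispatch
         * stands in for hash-based assignment such as RSS). */
        int q = rand() % NWORKERS;
        start = free_multi[q] > t ? free_multi[q] : t;
        free_multi[q] = start + svc;
        lat_multi[i] = free_multi[q] - t;
    }
    printf("99th percentile response time: single-queue %.2f, multi-queue %.2f "
           "(in units of the mean service time)\n",
           p99(lat_single, NJOBS), p99(lat_multi, NJOBS));
    return 0;
}
```

At these settings the multi-queue 99th percentile comes out far higher than the single-queue one, illustrating the transient load imbalance the slide refers to.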

Latency vs. Load – Service Time 10μs (fixed, exponential, bimodal)
We repeat the same experiment, now on real systems running a benchmark with a synthetic service time. We deployed... and we measure end-to-end latency on the client side. The red line is the SLO... The maximum throughput we can achieve here is 1.6 MRPS, because we run on a 16-core machine with a 10μs service time. Dataplanes perform better because of their reduced overheads, but as the service-time dispersion increases these benefits diminish. Linux with floating connections maintains steady throughput.
99th-percentile latency SLO: 10 × AVG[service_time]
IX, Belay et al., OSDI 2014
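For reference, a synthetic-service-time handler of the kind used in such benchmarks can be as simple as a spin loop; the sketch below is illustrative and is not the actual benchmark code from the talk:

```c
/* Illustrative spin-loop handler for a synthetic-service-time benchmark:
 * it busy-waits for the requested duration to emulate application work.
 * This is a sketch, not the actual benchmark used in the talk. */
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Handle one request whose service time was drawn from the chosen
 * distribution (fixed, exponential, or bimodal). */
void handle_request(uint64_t service_ns) {
    uint64_t deadline = now_ns() + service_ns;
    while (now_ns() < deadline)
        ;                       /* burn CPU until the service time elapses */
}
```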

Latency vs. Load – Service Time 25μs (fixed, exponential, bimodal)
Linux with floating connections outperforms IX.
We run the same experiment with a 25μs service time, where the relative impact of system overheads is smaller. Dataplanes perform better only at very low service times with low dispersion.
99th-percentile latency SLO: 10 × AVG[service_time]
IX, Belay et al., OSDI 2014

ZygOS Approach: implement work stealing to achieve work conservation in a dataplane
Dataplane aspects: reduced system overheads, share-nothing network processing
Single-queue aspects: work conservation, reduction of head-of-line blocking
If we combine the previous findings, we arrive at the design of our system. From dataplanes we want to keep the reduced overheads and the share-nothing network processing, and we implement work stealing to achieve work conservation in a dataplane.

Background on IX
[Figure: IX architecture — the event-driven application and libIX run in Ring 3 and exchange event conditions and batched syscalls with the dataplane kernel in guest Ring 0, which runs the TCP/IP stack, timers, and the RX/TX paths; numbered steps trace a request from the RX FIFO through the stack to the application and back out to TX.]
We started from IX, which is a virtualized dataplane. It leverages RSS to distribute incoming flows across cores.

IX Design vs. ZygOS Design
Application layer: an event-based application that is agnostic to work stealing
Shuffle layer: a per-core list of ready connections that allows stealing
Network layer: coherence- and synchronization-free network processing
The network layer is coherence- and synchronization-free; it is where dataplanes get their benefits.
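As a rough illustration of what a per-core list of ready connections with a steal path can look like, here is a toy user-space sketch (hypothetical names and a simple spinlocked ring buffer; the real shuffle layer lives inside the ZygOS kernel and is more involved):

```c
/* Toy model of a per-core shuffle queue with stealing (illustrative data
 * structure only; not the in-kernel ZygOS implementation). */
#include <pthread.h>
#include <stddef.h>

#define NCORES   16
#define QUEUE_SZ 1024

struct conn;                        /* opaque ready-connection handle */

struct shuffle_queue {
    pthread_spinlock_t lock;
    struct conn *items[QUEUE_SZ];
    size_t head, tail;              /* ring-buffer indices */
};

static struct shuffle_queue queues[NCORES];

void shuffle_init(void) {
    for (int c = 0; c < NCORES; c++)
        pthread_spin_init(&queues[c].lock, PTHREAD_PROCESS_PRIVATE);
}

/* Home core publishes a connection that has pending work. */
int shuffle_push(int core, struct conn *c) {
    struct shuffle_queue *q = &queues[core];
    int ok = 0;
    pthread_spin_lock(&q->lock);
    if (q->tail - q->head < QUEUE_SZ) {
        q->items[q->tail++ % QUEUE_SZ] = c;
        ok = 1;
    }
    pthread_spin_unlock(&q->lock);
    return ok;
}

static struct conn *try_pop(int core) {
    struct shuffle_queue *q = &queues[core];
    struct conn *c = NULL;
    pthread_spin_lock(&q->lock);
    if (q->head != q->tail)
        c = q->items[q->head++ % QUEUE_SZ];
    pthread_spin_unlock(&q->lock);
    return c;
}

/* A core drains its own queue first, then steals from the others. */
struct conn *shuffle_next(int my_core) {
    struct conn *c = try_pop(my_core);
    for (int i = 1; i < NCORES && c == NULL; i++)
        c = try_pop((my_core + i) % NCORES);
    return c;
}
```

A core probes the other cores' queues only when its own is empty, which is what makes the server work conserving while keeping the common case local.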

ZygOS Architecture
[Figure: home core and remote core side by side — each runs an event-driven app with libIX in Ring 3 (application layer), a per-core shuffle queue plus remote syscalls (shuffle layer), and TCP/IP, timers, and RX/TX queues in guest Ring 0 (network layer).]
This figure shows the ZygOS architecture, with each core...

Execution Model
[Animation over several slides: the home core's network layer receives requests and publishes ready connections in its shuffle queue; an idle remote core steals a connection from the shuffle layer, runs the application code, and uses remote syscalls so that the connection's TCP/IP processing stays on the home core.]

Dealing with Head-of-Line Blocking
Definition: the home core is busy in userspace, tasks are pending, and remote cores are idle
Remedy: inter-processor interrupts (IPIs) when HOL blocking is detected
High service-time dispersion increases HOL blocking
So far we have shown how ZygOS deals with load imbalance, but the previous design still suffers from head-of-line blocking. A remote processor sends an IPI to the home core once HOL blocking is detected.
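The detection condition itself is just the definition above; the sketch below shows the kind of check an idle remote core could perform (all names are hypothetical, and the real mechanism is an in-kernel IPI rather than a function call):

```c
/* Sketch of the head-of-line-blocking check a remote core might perform.
 * Everything here is illustrative: core_state, send_reschedule_ipi(), etc.
 * are hypothetical names, not ZygOS symbols. */
#include <stdbool.h>

enum core_mode { IN_KERNEL, IN_USERSPACE };

struct core_state {
    enum core_mode mode;      /* what the home core is currently doing */
    int pending_tasks;        /* items left in its shuffle queue */
};

/* Hypothetical stand-in: in the real system this would be an IPI that forces
 * the home core back into the kernel so blocked events can be dispatched. */
static void send_reschedule_ipi(int core) { (void)core; }

/* Called by an idle remote core: if the home core is stuck in userspace with
 * work pending, poke it. */
void maybe_break_hol(int home, const struct core_state *state) {
    bool hol = state[home].mode == IN_USERSPACE && state[home].pending_tasks > 0;
    if (hol)
        send_reschedule_ipi(home);
}
```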

HOL – RX
[Animation: an idle remote core detects head-of-line blocking on the home core's receive path and sends an IPI so that the pending events get dispatched.]

HOL – TX
[Animation: the same mechanism on the transmit path — a remote core sends an IPI to the home core so that pending remote syscalls in the shuffle layer get processed.]

Evaluation Setup
Environment: 10+1 Xeon servers; a 16-hyperthread server machine; a Quanta/Cumulus 48x10GbE switch
Experiments: synthetic micro-benchmarks, Silo [SOSP 2013], Memcached
Baselines: IX, Linux (partitioned and floating connections)

Latency vs. Load – Service Time 10μs (fixed, exponential, bimodal)
99th-percentile latency SLO: 10 × AVG[service_time]
IX, Belay et al., OSDI 2014

Latency vs. Load – Service Time 10μs (fixed, exponential, bimodal)
ZygOS achieves better latency than IX in cases of low service-time dispersion, and it performs increasingly better as the dispersion grows; this becomes obvious in the case of the bimodal distribution.
99th-percentile latency SLO: 10 × AVG[service_time]
IX, Belay et al., OSDI 2014

Latency vs. Load – Service Time 10μs (fixed, exponential, bimodal)
Interrupt benefit
99th-percentile latency SLO: 10 × AVG[service_time]

Latency vs. Load – Service Time 10μs (fixed, exponential, bimodal)
Closer to the single-queue model: much better throughput than Linux, much lower tail latency than IX.
ZygOS achieves the best convergence to the optimal theoretical model, and it performs well even with service-time dispersion thanks to the interrupt mechanism.
99th-percentile latency SLO: 10 × AVG[service_time]

Silo with TPC-C Workload
5 types of transactions; service-time variability; average service time: 33μs
Open-loop experiment; 2752 TCP connections; 99th-percentile latency reported
We chose Silo, an in-memory database, because it is an application with potential service-time variability. We ran TPC-C on top of it, which has 5 types of transactions, and ended up with a multi-modal service-time distribution: the median is 20μs and the 99th percentile is 203μs. How big was the database? 40GB.

Silo with TPC-C workload: 1.63x speedup over Linux, 3.68x lower 99th-percentile latency.

Conclusion
ZygOS: a datacenter operating system for low latency
Reduced system overheads; converges to a single-queue model; work conservation through work stealing; reduced HOL blocking through lightweight IPIs
We ♥️ open source
I presented ZygOS, a datacenter operating system for low latency. ZygOS achieves reduced system overheads while converging to a single-queue model. To do that, we implement work stealing for work conservation and leverage lightweight IPIs to reduce head-of-line blocking.
https://github.com/ix-project/zygos