Transparent Checkpoint of Closed Distributed Systems in Emulab
Anton Burtsev, Prashanth Radhakrishnan, Mike Hibler, and Jay Lepreau
University of Utah, School of Computing

Emulab
Public testbed for network experimentation
Complex networking experiments within minutes

Emulab: precise research tool
Realism:
– Real dedicated hardware: machines and networks
– Real operating systems
– Freedom to configure any component of the software stack
– Meaningful real-world results
Control:
– Closed system: controlled external dependencies and side effects
– Control interface
– Repeatable, directed experimentation

Goal: more control over execution
Stateful swap-out
– Demand for physical resources exceeds capacity
– Preemptive experiment scheduling
– Long-running, large-scale experiments: no loss of experiment state
Time-travel
– Replay experiments, deterministically or non-deterministically
– Debugging and analysis aid

Challenge
Both controls must preserve the fidelity of experimentation
Both rely on the transparency of distributed checkpoint

Transparent checkpoint
Traditionally, semantic transparency:
– Checkpointed execution is one of the possible correct executions
What if we want to preserve performance correctness?
– Checkpointed execution is one of the correct executions closest to a non-checkpointed run
Preserve measurable parameters of the system:
– CPU allocation
– Elapsed time
– Disk throughput
– Network delay and bandwidth

Traditional view
Local case
– Transparency = smallest possible downtime
– Several milliseconds [Remus]
– Background work harms realism
Distributed case
– Lamport checkpoint provides consistency
– But: packet delays, timeouts, traffic bursts, replay buffer overflows

Main insight
Conceal the checkpoint from the system under test
– But still stay on the real hardware as much as possible
“Instantly” freeze the system
– Time and execution
– Ensure atomicity of the checkpoint: a single, indivisible action
Conceal the checkpoint through time virtualization

Contributions
Transparency of distributed checkpoint
Local atomicity
– Temporal firewall
Execution control mechanisms for Emulab
– Stateful swap-out
– Time-travel
Branching storage

Challenges and implementation

Checkpoint essentials
State encapsulation
– Suspend execution
– Save the running state of the system
Virtualization layer (see the sketch below)
– Suspends the system
– Saves its state
– Saves in-flight state
– Disconnects/reconnects to the hardware

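A minimal sketch of this sequence, with hypothetical helper names standing in for the real virtualization-layer hooks (none of these are the actual Emulab/Xen interfaces):

    /* Outline of the checkpoint path; every helper below is a stand-in. */
    extern void suspend_execution(void);    /* stop vCPUs and virtual time */
    extern void save_device_state(void);    /* running state of virtual devices */
    extern void save_inflight_state(void);  /* e.g., packets queued in virtual NICs */
    extern void save_memory_image(void);
    extern void disconnect_hardware(void);  /* detach from physical devices */

    void checkpoint(void)
    {
        suspend_execution();
        save_device_state();
        save_inflight_state();
        save_memory_image();
        disconnect_hardware();
        /* restore runs the inverse: reconnect, reload state, resume */
    }
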
First challenge: atomicity
Permanent encapsulation is harmful
– Too slow
– Some state is shared
Shared state must be encapsulated upon checkpoint, either:
– Externally to the VM: requires full memory virtualization and a declarative description of shared state
– Internally to the VM: breaks atomicity

Atomicity in the local case
Temporal firewall
– Selectively suspends execution and time
– Provides atomicity inside the firewall
Execution control in the Linux kernel
– Kernel threads
– Interrupts, exceptions, IRQs
Conceals the checkpoint
– Time virtualization (sketched below)

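A minimal sketch of time virtualization across a pause, assuming the firewall can interpose on the guest's clock reads; the fw_* names are invented for the illustration:

    #include <sys/time.h>

    static struct timeval total_downtime;    /* sum of all checkpoint pauses */
    static struct timeval pause_start;

    void fw_freeze(void)                     /* execution suspended */
    {
        gettimeofday(&pause_start, NULL);
    }

    void fw_thaw(void)                       /* execution resumed */
    {
        struct timeval now, pause;
        gettimeofday(&now, NULL);
        timersub(&now, &pause_start, &pause);
        timeradd(&total_downtime, &pause, &total_downtime);
    }

    void fw_gettimeofday(struct timeval *tv) /* clock read inside the firewall */
    {
        gettimeofday(tv, NULL);
        timersub(tv, &total_downtime, tv);   /* the pause stays invisible */
    }
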
Second challenge: synchronization
Lamport checkpoint
– No synchronization: the system is only partially suspended
– Preserves consistency
– Logs in-flight packets; once logged, a packet is impossible to remove
– Unsuspended nodes run into time-outs

Synchronized checkpoint
Synchronize clocks across the system
Schedule the checkpoint
Checkpoint all nodes at once (per-node sketch below)
Almost no in-flight packets

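The per-node side of such a scheme can be as simple as the sketch below, assuming clocks are already synchronized (e.g., via NTP) and a coordinator has distributed a common deadline; suspend_local_vm() is a hypothetical hook:

    #include <time.h>

    extern void suspend_local_vm(void);     /* hypothetical suspend hook */

    /* Sleep until the agreed absolute deadline, then suspend; all nodes
       stop within the residual clock skew of one another. */
    void checkpoint_at(struct timespec deadline)
    {
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &deadline, NULL);
        suspend_local_vm();
    }
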
Bandwidth-delay product
Large number of in-flight packets (worked example below)
Slow links dominate the log
Faster links wait for the entire log to complete
Per-path replay?
– Unavailable at Layer 2
– Requires an accurate replay engine on every node

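A rough worked example (numbers chosen for illustration, not taken from the talk): a 100 Mbps link with 100 ms one-way delay can hold 100×10^6 b/s × 0.1 s = 10^7 bits ≈ 1.25 MB in flight, all of which a log-based scheme must record at suspend and replay at resume; a 1 Gbps LAN link with 0.1 ms delay holds only about 12.5 KB, yet must wait for the slow link's log to drain.
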
Checkpoint the network core
Leverage Emulab delay nodes
– Emulab links are no-delay
– Link emulation is done by delay nodes
Avoid replay of in-flight packets
Capture all in-flight packets in the core
– Checkpoint the delay nodes (sketch below)

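Why this suffices, in a sketch: the delay node's link emulator already holds every in-flight packet in a queue stamped with its release time, so checkpointing reduces to serializing that queue (types and names here are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    struct delayed_pkt {
        uint64_t release_ns;    /* when emulation would put it on the wire */
        uint32_t len;
        uint8_t  data[1514];
    };

    /* Nothing is "in flight" outside the delay node, so saving its queue
       captures the network core without any per-node replay engine. */
    int save_delay_queue(const struct delayed_pkt *q, size_t n, FILE *out)
    {
        return fwrite(q, sizeof q[0], n, out) == n ? 0 : -1;
    }
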
Efficient branching storage
To be practical, stateful swap-out has to be fast
Mostly read-only FS
– Shared across nodes and experiments
Deltas accumulate across swap-outs
Based on LVM
– Many optimizations

Evaluation

Evaluation plan
Transparency of the checkpoint
Measurable metrics
– Time virtualization
– CPU allocation
– Network parameters

Time virtualization
Workload:
    do {
        usleep(10 ms)
        gettimeofday()
    } while ()
sleep + overhead = 20 ms
Checkpoint every 5 sec (24 checkpoints)
Timer accuracy is 28 μsec
Checkpoint adds ±80 μsec of error (runnable sketch below)

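The benchmark loop above, reconstructed as runnable C (a sketch of the measurement, not the original harness):

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval prev, now;
        gettimeofday(&prev, NULL);
        for (;;) {                           /* Ctrl-C to stop */
            usleep(10 * 1000);               /* 10 ms sleep */
            gettimeofday(&now, NULL);
            long us = (now.tv_sec - prev.tv_sec) * 1000000L
                    + (now.tv_usec - prev.tv_usec);
            printf("%ld\n", us);             /* sleep + overhead per iteration */
            prev = now;
        }
    }
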
CPU allocation
Workload:
    do {
        stress_cpu()
        gettimeofday()
    } while ()
stress + overhead = ms
Checkpoint every 5 sec (29 checkpoints)
Normally within 9 ms of the average
Checkpoint adds 27 ms of error
– ls /root: 7 ms overhead
– xm list: 130 ms
(runnable sketch below)

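A runnable version of this loop, with a hypothetical busy-loop standing in for the slide's stress_cpu():

    #include <stdio.h>
    #include <sys/time.h>

    static void stress_cpu(void)             /* fixed CPU-bound work unit */
    {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 100000000UL; i++)
            x += i;
    }

    int main(void)
    {
        struct timeval prev, now;
        gettimeofday(&prev, NULL);
        for (;;) {                            /* Ctrl-C to stop */
            stress_cpu();
            gettimeofday(&now, NULL);
            long ms = (now.tv_sec - prev.tv_sec) * 1000L
                    + (now.tv_usec - prev.tv_usec) / 1000;
            printf("%ld\n", ms);              /* stress + overhead per iteration */
            prev = now;
        }
    }
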
Network transparency: iperf
Setup:
– 1 Gbps, 0-delay network
– iperf between two VMs
– tcpdump inside one of the VMs
– averaging over 0.5 ms
Checkpoint every 5 sec (4 checkpoints)
No TCP window change
No packet drops
Throughput drop is due to background activity
Average inter-packet time: 18 μsec
Checkpoint adds: μsec

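Inter-packet times like those averaged above can be extracted from the tcpdump capture with a few lines of libpcap; a sketch (compile with -lpcap, pass the pcap file as the argument):

    #include <pcap/pcap.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char err[PCAP_ERRBUF_SIZE];
        pcap_t *p;
        struct pcap_pkthdr *h;
        const u_char *data;
        struct timeval prev = {0, 0};

        if (argc != 2 || !(p = pcap_open_offline(argv[1], err))) {
            fprintf(stderr, "usage: ipt file.pcap\n");
            return 1;
        }
        while (pcap_next_ex(p, &h, &data) == 1) {
            if (prev.tv_sec || prev.tv_usec) {
                long us = (h->ts.tv_sec - prev.tv_sec) * 1000000L
                        + (h->ts.tv_usec - prev.tv_usec);
                printf("%ld\n", us);          /* inter-packet gap, μsec */
            }
            prev = h->ts;
        }
        pcap_close(p);
        return 0;
    }
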
Network transparency: BitTorrent
Setup:
– Mbps, low-delay network
– 1 BitTorrent server + 3 clients
– 3 GB file
Checkpoint every 5 sec (20 checkpoints)
Checkpoint preserves average throughput

Conclusions
Transparent distributed checkpoint
– Precise research tool
– Fidelity of distributed system analysis
Temporal firewall
– General mechanism to change the system's perception of time
– Conceals various external events
Future work: time-travel

Thank you

Backup

Branching storage
Copy-on-write as a redo log
Linear addressing
Free block elimination
Read-before-write elimination
(toy illustration below)

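A toy illustration of these four ideas together; in-memory arrays stand in for the base volume and the on-disk redo log, and every name is invented for the sketch:

    #include <stdint.h>
    #include <string.h>

    enum { BLOCKS = 1024, BSIZE = 4096 };

    static uint8_t base[BLOCKS][BSIZE];      /* shared read-only image */
    static uint8_t redo[BLOCKS][BSIZE];      /* per-branch redo log */
    static int32_t map[BLOCKS];              /* block -> redo slot, -1 = clean */
    static int32_t next_slot;                /* bump allocation = linear layout */

    void bs_init(void) { memset(map, -1, sizeof map); }

    /* Whole-block writes go straight to the log: the base block is never
       read first (read-before-write elimination) and slots are handed out
       in write order (linear addressing). Blocks the FS never touches
       never enter the log at all (free block elimination). */
    void bs_write(int b, const uint8_t *buf)
    {
        if (map[b] < 0)
            map[b] = next_slot++;
        memcpy(redo[map[b]], buf, BSIZE);
    }

    void bs_read(int b, uint8_t *buf)
    {
        memcpy(buf, map[b] < 0 ? base[b] : redo[map[b]], BSIZE);
    }
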