Dependability Evaluation and Benchmarking of Network Function Virtualization Infrastructures
D. Cotroneo, L. De Simone, A.K. Iannillo, A. Lanzaro, R. Natella
Critiware s.r.l. and Federico II University of Naples, Italy
Towards Network Functions Virtualization
[Figure: physical network equipment (RGW, DPI, BRAS, IMS, EPC, ...) migrated to virtual network equipment]
- Telecom workloads have demanding requirements (99.99...% availability) and cannot afford outages
- NFV promises reduced costs, improved manageability, and faster innovation
- Can it deliver comparable performance and reliability?
Why is engineering reliable NFV challenging?
- Complex stack of hardware and software off-the-shelf components
- Exposure to several sources of hardware and software faults
- Lack of tools and methodologies for testing fault tolerance
[Figure: NFV stack with hardware, hypervisor, VM, and guest OS layers, each a potential source of faults]
As a result, it is hard to trust the reliability of NFV services
In this presentation:
- An experimental methodology for dependability benchmarking of NFV, based on fault injection
- A case study on a virtual IP Multimedia Subsystem (IMS), analyzing:
  - the impact of faults on performance and availability
  - the sensitivity to different types of faults
  - the pitfalls in the design of NFVIs
What is a dependability benchmark?
- A dependability benchmark evaluates a system in the presence of (deliberately injected) faults
- Are NFV services still available and high-performing even when a fault is injected?
- The dependability benchmark includes:
  - measures (KPIs) for characterizing performance and availability
  - procedures, tools, and conditions under which the measures are obtained
Overview of the benchmarking process
[Diagram: definition of workload, faultload, and measures → fault injection experiments (deployment of VNFs over the NFVI → workload and VNF execution → injection of the i-th fault → data collection → testbed clean-up; iterated over several different faults) → computation of measures and reporting]
The first part consists of the definition of the key performance indicators (KPIs), the faultload (i.e., a set of faults to inject in the NFVI), and the workload (i.e., inputs to submit to the NFVI) that will support the experimental evaluation of an NFVI.
Based on these elements, the second part of the methodology consists of the execution of a sequence of fault injection experiments. In each experiment, the NFVI under evaluation is first configured by deploying a set of VNFs to exercise it; the workload is then submitted to the VNFs running on the NFVI and, during their execution, faults are injected; at the end of the execution, performance and failure data are collected from the target NFVI; finally, the experimental testbed is cleaned up (e.g., by un-deploying VNFs) before starting the next experiment. This process is repeated several times, injecting a different fault in each experiment, while using the same workload and collecting the same performance and failure metrics. The execution of fault injection experiments can be supported by automated tools for configuring virtualization infrastructures, generating network workloads, and injecting faults.
Finally, performance and failure data from all experiments are processed to compute the KPIs and to identify performance/dependability bottlenecks in the target NFVI.
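To make the process concrete, below is a minimal sketch of a campaign driver that iterates the above steps over a list of faults. The helper functions and fault descriptions are hypothetical placeholders for infrastructure-specific tooling, not part of the methodology itself.

```python
# Sketch of a fault injection campaign driver. All helpers below are no-op
# placeholders standing in for infrastructure-specific tooling (e.g., VM
# deployment scripts, workload generators, fault injectors).
from dataclasses import dataclass

@dataclass
class Fault:
    target: str   # e.g., a host or a VM name
    kind: str     # e.g., "network_drop", "cpu_hog"

def deploy_vnfs():            print("deploying VNFs over the NFVI")
def start_workload():         print("submitting the workload to the VNFs")
def inject_fault(fault):      print(f"injecting {fault.kind} on {fault.target}")
def wait_for_workload_end():  print("waiting for the workload to complete")
def collect_data(fault, rep): return {"fault": fault, "repetition": rep}
def cleanup_testbed():        print("un-deploying VNFs and cleaning up the testbed")

def run_campaign(faults, repetitions=3):
    """Run one fault injection experiment per (fault, repetition) pair."""
    results = []
    for fault in faults:                      # a different fault in each experiment
        for rep in range(repetitions):
            deploy_vnfs()
            start_workload()
            inject_fault(fault)               # injected while the workload runs
            wait_for_workload_end()
            results.append(collect_data(fault, rep))
            cleanup_testbed()
    return results                            # later processed to compute the KPIs

if __name__ == "__main__":
    run_campaign([Fault("host1", "network_drop"), Fault("sprout-vm", "cpu_hog")])
```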
Benchmark measures
The dependability benchmark measures the quality of service as perceived by NFV users:
- VNF latency
- VNF throughput
- VNF experimental availability
- Risk Score
We compare fault-injected experiments against the QoS objectives and against the fault-free experiment (the benchmark baseline)
VNF Latency and Throughput
[Figure: end points send requests (t_request) to VNFs running on the virtualization layer over off-the-shelf hardware and software, and receive responses (t_response), while faults are injected]
- VNF latency: the time required to process a unit of traffic (such as a packet or a service request)
- VNF throughput: the rate of processed traffic (packets or service requests) per second
Characterization of VNF latency
Percentiles of the latency distribution are compared against QoS objectives, e.g.:
- 50th percentile ≤ 150 ms
- 90th percentile ≤ 250 ms
[Plot: 50th and 90th percentiles of response latency for the fault-free run, for faulty runs with good performance, and for faulty runs with bad performance, showing the gap from the QoS objectives]
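A minimal sketch of such a percentile check, using the example thresholds above (the latency samples in the usage example are made up):

```python
# Check latency percentiles of one experiment against QoS objectives (ms).
import statistics

QOS_OBJECTIVES_MS = {50: 150.0, 90: 250.0}   # percentile -> threshold in ms

def check_latency(latencies_ms):
    """Return {percentile: (measured value, objective met?)} for each objective."""
    cuts = statistics.quantiles(latencies_ms, n=100)   # 1st..99th percentiles
    return {p: (cuts[p - 1], cuts[p - 1] <= t) for p, t in QOS_OBJECTIVES_MS.items()}

# Usage with made-up samples from a fault-injected run
print(check_latency([120, 135, 140, 150, 160, 180, 220, 260, 310, 400]))
```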
VNF Experimental Availability
[Figure: end points exchanging traffic with VNFs on the virtualization layer over off-the-shelf hardware and software, while faults are injected]
Experimental availability: the percentage of traffic units that are successfully processed
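As a sketch, this can be computed directly from per-experiment request counts (the numbers in the example are illustrative):

```python
# Experimental availability: percentage of traffic units processed successfully.
def experimental_availability(successful, total):
    return 100.0 * successful / total if total else 0.0

print(experimental_availability(successful=9450, total=10000))  # 94.5 (%)
```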
Risk Score
- The Risk Score is a summary measure of the risk of experiencing service unavailability and/or performance failures
- It combines performance failures and availability failures as a weighted average over all injected faults
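The slide does not spell out the exact formula, so the following is only a plausible sketch: a weighted average of per-experiment failure indicators, where the weights and the failure predicates are assumptions.

```python
# Illustrative Risk Score: weighted average, over all injected faults, of
# experiments that exhibited an availability and/or performance failure.
# Weights and failure predicates are assumptions, not taken from the paper.
def risk_score(experiments, weights=None):
    """experiments: list of dicts with 'availability_failure'/'performance_failure' flags."""
    weights = weights or [1.0] * len(experiments)
    failed = [w for w, e in zip(weights, experiments)
              if e["availability_failure"] or e["performance_failure"]]
    return 100.0 * sum(failed) / sum(weights)

print(risk_score([
    {"availability_failure": True,  "performance_failure": False},
    {"availability_failure": False, "performance_failure": True},
    {"availability_failure": False, "performance_failure": False},
]))  # ~66.7 (%)
```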
Benchmark faultload
- Faults in virtualized environments include disruptions in network and storage I/O traffic, and in CPUs and memory
- I/O faults (injected at both host and VM level): corruption, drop, and delay of network frame receive/transmit and of storage block reads/writes
- Compute faults (injected at both host and VM level): CPU and memory hogs, termination, code corruption, data corruption
- A fault injector has been implemented as a set of kernel modules for VMware ESXi and Linux
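The kernel-module injector itself is not shown here; as an illustrative stand-in for one class of faults (network frame delay and drop inside a Linux VM), the standard tc/netem traffic control tool can emulate similar disruptions. The interface name and parameters below are placeholders, and root privileges are required.

```python
# Stand-in example (NOT the paper's injector): emulate network I/O faults on a
# Linux guest using the standard tc/netem queueing discipline.
import subprocess

def inject_network_fault(interface="eth0", delay_ms=100, loss_pct=10):
    """Add delay and packet loss on outgoing frames of the given interface."""
    subprocess.run(["tc", "qdisc", "add", "dev", interface, "root", "netem",
                    "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"], check=True)

def remove_network_fault(interface="eth0"):
    """Restore the default queueing discipline, removing the injected fault."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)
```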
Benchmark workload
- The VNFs should be exercised using a representative workload
- Our dependability benchmarking methodology is not tied to a specific choice of workload
- Realistic workloads can be generated using load testing and performance benchmarking tools (e.g., Netperf)
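As an example of workload automation with such a tool, a Netperf TCP stream test could be launched from a script as below; the target host and duration are placeholders, and netserver must be running on the target.

```python
# Example: drive a generic network workload with Netperf from a script.
import subprocess

def run_netperf(target_host="192.0.2.10", duration_s=60, test="TCP_STREAM"):
    # -H: remote netserver host, -l: test duration in seconds, -t: test type
    result = subprocess.run(["netperf", "-H", target_host, "-l", str(duration_s),
                             "-t", test], capture_output=True, text=True, check=True)
    return result.stdout   # Netperf prints throughput figures to stdout
```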
Case study: Clearwater IMS
- Clearwater: an open-source, NFV-oriented implementation of the IP Multimedia Subsystem (IMS)
- In a first round of experiments, we test a replicated, load-balanced deployment over several VMs
- In a second round of experiments, we add automated recovery of VMs (VMware HA cluster) to the setup
- We use SIPp to generate SIP call set-up requests
[Figure: replicated Clearwater servers deployed on VMware ESXi, with fault injection in the infrastructure]
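For illustration, SIPp's built-in UAC scenario can be launched as sketched below; the target address, call rate, and call count are placeholders, and the actual scenarios used against Clearwater (registration, authentication, call set-up) are more elaborate.

```python
# Example: generate SIP call set-up traffic with SIPp's built-in UAC scenario.
import subprocess

def run_sipp(target="192.0.2.20:5060", calls_per_second=50, total_calls=10000):
    # -sn uac: embedded User Agent Client scenario, -r: call rate, -m: total calls
    subprocess.run(["sipp", "-sn", "uac",
                    "-r", str(calls_per_second), "-m", str(total_calls), target],
                   check=True)
```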
Fault injection test plan
- We inject faults in one of the physical host machines, and in a subset of the VMs (Sprout and Homestead)
- We inject both I/O (network, storage) and compute (CPU, memory) faults, both intermittent and permanent
- Each fault injection experiment has been repeated three times
- In total, 93 fault injection experiments have been performed
Experimental availability
- We computed performance and availability KPIs from the logs of the SIPp workload generator
- Faults have a strong impact on availability
- Compute faults and faults on the Sprout VMs have the strongest impact
VNF latency (by fault type)
More than 10% of requests exhibit a latency much higher than 250 ms!
[Plot: latency distributions by fault type, with the QoS thresholds T50 = 150 ms and T90 = 250 ms marked]
Risk Score and problem determination
- The overall risk score (55%) is quite high and reflects the strong impact of faults
- The infrastructure was affected by a capacity problem: once a VM or host fails, the remaining replicas cannot handle the SIP traffic
- NFVI design choices have a big impact on reliability! e.g., placement of VMs across hosts, topology of virtual networks and storage, allocation of CPUs and memory for VMs, etc.
Evaluating automated recovery mechanisms
[Plot: timelines of the fault-free run, a faulty run with load-balancing only, and a faulty run with load-balancing + automated recovery; fault injection and VM recovery events are marked, roughly ~1 min apart]
- Fault tolerance mechanisms require careful tuning, based on experimentation
- In our experiments, automated VM recovery was too slow, and availability remained low
Conclusion
- Performance and availability are critical concerns for NFV
- NFVIs are very complex, and making design choices is difficult
- We proposed a dependability benchmark useful to point out dependability issues and to guide designers
- Future work will extend the evaluation to alternative virtualization technologies
Thank you! Questions?