Latency as a Performability Metric: Experimental Results Pete Broadwell

Outline
1. Motivation and background
 – Performability overview
 – Project summary
2. Test setup
 – PRESS web server
 – Mendosus fault injection system
3. Experimental results & analysis
 – How to represent latency
 – Questions for future research

Performability overview
Goal of ROC project: develop metrics to evaluate new recovery techniques
Performability – a class of metrics describing how a system performs in the presence of faults
 – First used in the fault-tolerant computing field¹
 – Now being applied to online services
¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
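One standard way to make this concrete (not stated on the slide, but the same Markov-reward formulation used in the backup slide at the end of the deck) is to define performability as the expected reward accumulated over an interval, where p_i(s) is the probability of being in state i at time s and w_i is the reward rate (e.g., requests/sec) earned in that state:

\[
W(t) \;=\; \int_0^t \sum_i p_i(s)\, w_i \,\mathrm{d}s
\]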

Example: microbenchmark of a RAID disk failure

Project motivation
Rutgers study: performability analysis of a web server, using throughput
Other studies (esp. from the HP Labs Storage group) also use response time as a metric
Assertion: latency and data quality describe the user experience better than throughput does
How best to represent latency in performability reports?

Project overview
Goals:
1. Replicate the PRESS/Mendosus study with response time measurements
2. Discuss how to incorporate latency into performability statistics
Contributions:
1. Provide a latency-based analysis of a web server's performability (currently rare)
2. Further the development of more comprehensive dependability benchmarks

Experiment components
The Mendosus fault injection system
 – From Rutgers (Rich Martin)
 – Goal: low-overhead emulation of a cluster of workstations, with injection of likely faults
The PRESS web server
 – Cluster-based, uses cooperative caching. Designed by Carreira et al. (Rutgers)
 – Perf-PRESS: basic version
 – HA-PRESS: incorporates heartbeats and a master node for automated cluster management
Client simulators
 – Submit a set number of requests/sec, based on real traces

Mendosus design (architecture diagram): a Global Controller (Java) reads fault, LAN-emulation and application config files and drives the workstations (real or VMs); each workstation runs a user-level daemon (Java), a modified NIC driver, a SCSI module and a proc module alongside the applications, connected over an emulated LAN

Experimental setup

Fault types
Category    | Fault                | Possible Root Cause
Node        | Node crash           | Operator error, OS bug, hardware component failure, power outage
            | Node freeze          | OS or kernel module bug
Application | App crash            | Application bug or resource unavailability
            | App hang             | Application bug or resource contention with other processes
Network     | Link down or flaky   | Broken, damaged or misattached cable
            | Switch down or flaky | Damaged or misconfigured switch, power outage

Test case timeline
 – Warm-up time: seconds
 – Time to repair: up to 90 seconds
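A minimal sketch of how one such test case might be scheduled, assuming the warm-up and measurement durations are parameters (the warm-up value is not preserved in the transcript above) and treating fault injection and repair as opaque callbacks rather than the actual Mendosus interface:

```python
import time

def run_test_case(inject_fault, repair_fault, collect_sample,
                  warmup_secs, run_secs, fault_time, repair_delay=90):
    """Schedule one fault-injection test case.

    inject_fault/repair_fault: callbacks into the fault injector (placeholders,
    not the real Mendosus API). collect_sample(): returns (throughput, latency).
    """
    samples = []
    start = time.time()
    time.sleep(warmup_secs)                 # let server caches warm up before measuring

    injected = repaired = False
    while time.time() - start < warmup_secs + run_secs:
        elapsed = time.time() - start - warmup_secs
        if not injected and elapsed >= fault_time:
            inject_fault()                  # e.g., node crash, app hang, link down
            injected = True
        if injected and not repaired and elapsed >= fault_time + repair_delay:
            repair_fault()                  # operator repairs non-transient faults
            repaired = True
        samples.append(collect_sample())    # per-second throughput/latency sample
        time.sleep(1)
    return samples
```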

Simplifying assumptions
 – Operator repairs any non-transient failure after 90 seconds
 – Web page size is constant
 – Faults are independent
 – Each client request is independent of all others (no sessions!)
   – Request arrival times are determined by a Poisson process (not self-similar)
 – Simulated clients abandon a connection attempt after 2 seconds and give up on a page load after 8 seconds
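A sketch of the client-simulator behavior implied by these assumptions (Poisson arrivals, 2-second connect abandonment, 8-second page-load timeout); the names and the classification scheme here are illustrative, not taken from the actual simulator:

```python
import random

CONNECT_TIMEOUT = 2.0   # seconds before an unanswered connection attempt is abandoned
PAGE_TIMEOUT = 8.0      # seconds before the client gives up on a page load

def poisson_arrivals(rate_per_sec, duration_secs):
    """Yield request arrival times drawn from a Poisson process
    (exponentially distributed inter-arrival times)."""
    t = 0.0
    while True:
        t += random.expovariate(rate_per_sec)
        if t >= duration_secs:
            return
        yield t

def classify(connect_time, response_time):
    """Classify one request outcome under the 2 s / 8 s client timeouts."""
    if connect_time is None or connect_time > CONNECT_TIMEOUT:
        return "abandoned"      # connection attempt given up after 2 s
    if response_time > PAGE_TIMEOUT:
        return "timed_out"      # page load given up after 8 s
    return "served"
```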

Sample result: app crash (throughput and latency over time, Perf-PRESS vs. HA-PRESS)

Sample result: node hang (throughput and latency over time, Perf-PRESS vs. HA-PRESS)

Representing latency
Total seconds of wait time
 – Not good for comparing cases with different workloads
Average (mean) wait time per request
 – OK, but requires that the expected (normal) response time be given separately
Variance of wait time
 – Not very intuitive to describe. Also, a read-only workload means that all variance is toward longer wait times anyway

Representing latency (2)
Consider “goodput”-based availability:
  availability = total responses served / total requests
Idea: latency-based “punctuality”:
  punctuality = ideal total latency / actual total latency
Like goodput, the maximum value is 1
“Ideal” total latency: average latency for non-fault cases × total # of requests (shouldn’t be 0)
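A small sketch of these two metrics computed over a list of per-request records, assuming each record carries whether it was served and its measured latency (the field names are illustrative):

```python
def availability(requests):
    """Goodput-based availability: fraction of issued requests actually served."""
    served = sum(1 for r in requests if r["served"])
    return served / len(requests)

def punctuality(requests, ideal_latency_per_request):
    """Latency-based punctuality: ideal total latency / actual total latency.

    ideal_latency_per_request is the average latency measured in fault-free
    runs (must be > 0); like availability, the maximum value is 1.
    """
    ideal_total = ideal_latency_per_request * len(requests)
    actual_total = sum(r["latency"] for r in requests)
    return ideal_total / actual_total
```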

Representing latency (3)
Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
 – Can capture these in a separate statistic (e.g., 1% of 100k responses took >8 sec)
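A corresponding tail statistic, again over illustrative per-request records, would simply report the fraction of served responses exceeding a wait-time threshold (8 seconds here, matching the client timeout):

```python
def slow_fraction(requests, threshold_secs=8.0):
    """Fraction of served responses whose latency exceeded the threshold,
    reported alongside aggregate punctuality to expose brief, severe spikes."""
    served = [r for r in requests if r["served"]]
    slow = sum(1 for r in served if r["latency"] > threshold_secs)
    return slow / len(served) if served else 0.0
```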

Availability and punctuality

Other metrics
Data quality, latency and throughput are interrelated
 – Is a 5-second wait for a response “worse” than waiting 1 second to get a “try back later”?
To combine data quality, latency and throughput, one can use a “demerit” system (proposed by Keynote)¹
 – These can be very arbitrary, so it’s important that the demerit formula be straightforward and publicly available
¹ Zona Research and Keynote Systems, The Need for Speed II, 2001

Sample demerit system
Rules:
 – Each aborted (2 s) connection: 2 demerits
 – Each connection error: 1 demerit
 – Each user timeout (8 s): 8 demerits
 – Each second of total latency above the ideal level: (1 demerit / total # of requests) × scaling factor
(Chart: demerit scores for the app hang, app crash, node crash, node freeze and link down fault cases)
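A sketch of how such a demerit score might be computed from the rules above; the record fields and the scaling factor are illustrative assumptions, not the exact formula used in the study:

```python
def demerits(requests, ideal_latency_per_request, scaling_factor=1.0):
    """Compute a demerit score under the sample rules above.

    Each request record is assumed to carry an 'outcome' field
    ('served', 'aborted', 'conn_error', 'timed_out') and a 'latency' field.
    """
    score = 0.0
    total = len(requests)
    for r in requests:
        if r["outcome"] == "aborted":       # abandoned connection (2 s): 2 demerits
            score += 2
        elif r["outcome"] == "conn_error":  # connection error: 1 demerit
            score += 1
        elif r["outcome"] == "timed_out":   # user timeout (8 s): 8 demerits
            score += 8
    # Each second of total latency above the ideal level:
    # (1 demerit / total # of requests) x scaling factor
    excess = sum(r["latency"] for r in requests) - ideal_latency_per_request * total
    if excess > 0:
        score += excess * (1.0 / total) * scaling_factor
    return score
```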

Online service optimization
Inputs:
 – Performance metrics: throughput, latency & data quality
 – Environment: workload & faults
 – Cost of operations & components
Design-space outcomes:
 – Cheap, fast & flaky
 – Cheap & robust, but slow
 – Expensive, fast & flaky
 – Expensive & robust, but slow
 – Expensive, robust and fast
 – Cheap, robust & fast (optimal)

Conclusions
Latency-based punctuality and throughput-based availability give similar results for a read-only web workload
The applied workload is very important
 – Reliability metrics do not (and should not) reflect maximum performance/workload!
Latency did not degrade gracefully in proportion to workload
 – At high loads, PRESS “oscillates” between full service and 100% load shedding

Further Work
Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?)
Further study will benefit from more sophisticated client & workload simulators
Services that generate dynamic content should lead to more interesting data (e.g., RUBiS)

Latency as a Performability Metric: Experimental Results Pete Broadwell

Example: long-term model
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹
 – p_i(t) = probability that the system is in state i at time t
 – w_i(t) = reward (disk I/O operations/sec)
 – λ = failure rate of a single disk drive
 – μ = disk repair rate
 – D = number of data disks
¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
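To make the reward formalism concrete, here is a minimal sketch of such a Markov reward model: a three-state RAID-5 chain (full, degraded, data lost) whose expected reward Σ_i p_i(t)·w_i approximates long-term performability. The state space, rates and rewards are illustrative assumptions, not the model from Kari's thesis:

```python
import numpy as np

def raid5_performability(lam, mu, D, w_full, w_degraded, horizon_secs, dt=1.0):
    """Expected reward E[W(t)] = sum_i p_i(t) * w_i for a 3-state RAID-5 model.

    States: 0 = all disks up (full reward), 1 = one disk failed (degraded),
            2 = second failure before repair (data lost, zero reward, absorbing).
    Per-step transition probabilities use a small-dt approximation of the
    rates lam (per-disk failure rate) and mu (repair rate).
    """
    P = np.array([
        [1 - (D + 1) * lam * dt, (D + 1) * lam * dt, 0.0],   # from full service
        [mu * dt, 1 - (mu + D * lam) * dt, D * lam * dt],    # from degraded mode
        [0.0, 0.0, 1.0],                                     # data loss absorbs
    ])
    w = np.array([w_full, w_degraded, 0.0])   # reward (I/O ops/sec) per state
    p = np.array([1.0, 0.0, 0.0])             # start with all disks working
    rewards = []
    for _ in range(int(horizon_secs / dt)):
        rewards.append(p @ w)                 # expected I/O rate at this step
        p = p @ P                             # advance the DTMC one step
    return rewards

# Example: 0.5 failures/year per disk, 8-hour mean repair time, 4 data disks.
# Rates are converted to per-second; all numbers are purely illustrative.
per_sec = 1.0 / (365 * 24 * 3600)
trace = raid5_performability(lam=0.5 * per_sec, mu=1.0 / (8 * 3600),
                             D=4, w_full=1000.0, w_degraded=600.0,
                             horizon_secs=24 * 3600)
```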