Latency as a Performability Metric: Experimental Results Pete Broadwell
Outline
1. Motivation and background
  – Performability overview
  – Project summary
2. Test setup
  – PRESS web server
  – Mendosus fault injection system
3. Experimental results & analysis
  – How to represent latency
  – Questions for future research
Performability overview
- Goal of ROC project: develop metrics to evaluate new recovery techniques
- Performability: a class of metrics that describe how a system performs in the presence of faults
  – First used in the fault-tolerant computing field [1]
  – Now being applied to online services
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Example: microbenchmark RAID disk failure
Project motivation
- Rutgers study: performability analysis of a web server, using throughput
- Other studies (esp. from HP Labs Storage group) also use response time as a metric
- Assertion: latency and data quality are better than throughput for describing user experience
- How best to represent latency in performability reports?
Project overview
- Goals:
  1. Replicate PRESS/Mendosus study with response time measurements
  2. Discuss how to incorporate latency into performability statistics
- Contributions:
  1. Provide a latency-based analysis of a web server’s performability (currently rare)
  2. Further the development of more comprehensive dependability benchmarks
Experiment components
- The Mendosus fault injection system
  – From Rutgers (Rich Martin)
  – Goal: low-overhead emulation of a cluster of workstations, injection of likely faults
- The PRESS web server
  – Cluster-based, uses cooperative caching. Designed by Carreira et al. (Rutgers)
  – Perf-PRESS: basic version
  – HA-PRESS: incorporates heartbeats, master node for automated cluster management
- Client simulators
  – Submit set # of requests/sec, based on real traces
Mendosus design
[Figure: Mendosus architecture. Labeled components: Global Controller (Java); fault config file, LAN emu config file, apps config file; workstations (real or VMs); user-level daemon (Java); modified NIC driver; SCSI module; proc module; apps; emulated LAN]
Experimental setup
Fault types
Category | Fault | Possible root cause
Node | Node crash | Operator error, OS bug, hardware component failure, power outage
Node | Node freeze | OS or kernel module bug
Application | App crash | Application bug or resource unavailability
Application | App hang | Application bug or resource contention with other processes
Network | Link down or flaky | Broken, damaged or misattached cable
Network | Switch down or flaky | Damaged or misconfigured switch, power outage
Test case timeline - Warm-up time: seconds - Time to repair: up to 90 seconds
Simplifying assumptions
- Operator repairs any non-transient failure after 90 seconds
- Web page size is constant
- Faults are independent
- Each client request is independent of all others (no sessions!)
  – Request arrival times are determined by a Poisson process (not self-similar)
- Simulated clients abandon a connection attempt after 2 secs and give up on a page load after 8 secs
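The client-side assumptions above can be sketched in a few lines of Python. This is only an illustration, not the client simulator used in the study: the function names, the request rate, and the outcome labels are hypothetical, and only the 2-second and 8-second thresholds come from the slide.

```python
import random

CONNECT_TIMEOUT_S = 2.0  # from the slide: abandon connection attempt after 2 s
PAGE_TIMEOUT_S = 8.0     # from the slide: give up on page load after 8 s


def poisson_arrivals(rate_per_sec, duration_s, seed=0):
    """Return request arrival times for a Poisson process.

    A Poisson process has independent, exponentially distributed
    gaps between consecutive arrivals, so we sum exponential draws.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_sec)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def classify(connect_time_s, total_time_s):
    """Label one request by how the simulated client would react (hypothetical labels)."""
    if connect_time_s > CONNECT_TIMEOUT_S:
        return "aborted"   # connection attempt abandoned
    if total_time_s > PAGE_TIMEOUT_S:
        return "timeout"   # user gave up on the page load
    return "served"


# Hypothetical workload: 100 requests/sec for one minute.
arrivals = poisson_arrivals(rate_per_sec=100.0, duration_s=60.0, seed=1)
```

Because each request is independent (no sessions), every arrival can be classified on its own, with no per-client state.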
Sample result: app crash
[Figure: throughput and latency plots for Perf-PRESS and HA-PRESS]
Sample result: node hang
[Figure: throughput and latency plots for Perf-PRESS and HA-PRESS]
Representing latency
- Total seconds of wait time
  – Not good for comparing cases with different workloads
- Average (mean) wait time per request
  – OK, but requires that expected (normal) response time be given separately
- Variance of wait time
  – Not very intuitive to describe. Also, read-only workload means that all variance is toward longer wait times anyway
Representing latency (2)
- Consider “goodput”-based availability:
    $\text{availability} = \frac{\text{total responses served}}{\text{total requests}}$
- Idea: latency-based “punctuality”:
    $\text{punctuality} = \frac{\text{ideal total latency}}{\text{actual total latency}}$
- Like goodput, maximum value is 1
- “Ideal” total latency: average latency for non-fault cases × total #requests (shouldn’t be 0)
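A minimal sketch of the two ratios, assuming per-request latencies are available as a list of seconds. The function and parameter names are hypothetical. Clamping punctuality at 1 is an assumption made here so that a run that happens to beat the fault-free average stays within the stated maximum; the slide does not say how that case is handled.

```python
def availability(responses_served, total_requests):
    """Goodput-based availability: fraction of requests that were served."""
    return responses_served / total_requests


def punctuality(latencies_s, baseline_mean_latency_s):
    """Latency-based punctuality: ideal total latency / actual total latency.

    baseline_mean_latency_s is the average latency measured in fault-free
    runs; it must be greater than 0 for the ratio to be meaningful.
    """
    ideal_total = baseline_mean_latency_s * len(latencies_s)
    actual_total = sum(latencies_s)
    # Assumption: clamp at 1 so the value never exceeds the stated maximum.
    return min(1.0, ideal_total / actual_total)


# Four requests against a 0.1 s fault-free average: ideal total 0.4 s,
# actual total 0.8 s, so punctuality is one half.
example = punctuality([0.1, 0.1, 0.2, 0.4], baseline_mean_latency_s=0.1)
```

Both metrics are dimensionless and bounded by 1, which is what makes them easy to report side by side.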
Representing latency (3)
- Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
  – Can capture these in a separate statistic (e.g., 1% of 100k responses took >8 sec)
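The separate spike statistic could be as simple as the fraction of responses that exceed a threshold. The sketch below uses hypothetical names and made-up latencies; the 8-second default mirrors the user timeout mentioned in the slides.

```python
def fraction_over(latencies_s, threshold_s=8.0):
    """Fraction of responses whose latency exceeded threshold_s."""
    if not latencies_s:
        return 0.0
    slow = sum(1 for latency in latencies_s if latency > threshold_s)
    return slow / len(latencies_s)


# Made-up run: 99,000 fast responses and 1,000 that took 9 seconds.
latencies = [0.1] * 99_000 + [9.0] * 1_000
mean_latency = sum(latencies) / len(latencies)  # well under 1 s despite the spikes
tail = fraction_over(latencies)                 # 1,000 of 100,000 responses
```

In this made-up run the mean stays below a fifth of a second even though one response in a hundred hit the timeout, which is the effect the slide warns about.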
Availability and punctuality
Other metrics
- Data quality, latency and throughput are interrelated
  – Is a 5-second wait for a response “worse” than waiting 1 second to get a “try back later”?
- To combine DQ, latency and throughput, can use a “demerit” system (proposed by Keynote) [1]
  – These can be very arbitrary, so it’s important that the demerit formula be straightforward and publicly available
[1] Zona Research and Keynote Systems, The Need for Speed II, 2001
Sample demerit system
- Rules:
  – Each aborted (2s) conn: 2 demerits
  – Each conn error: 1 demerit
  – Each user timeout (8s): 8 demerits
  – Each sec of total latency above ideal level: (1 demerit/total #requests) × scaling factor
[Figure: demerits by fault type: app hang, app crash, node crash, node freeze, link down]
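The rules above translate directly into a scoring function. This is a sketch with hypothetical names; the slide does not give the scaling factor, so it is left as a parameter, and the input counts in the example are made up.

```python
def demerits(aborted, conn_errors, timeouts,
             total_latency_s, ideal_latency_s,
             total_requests, scaling_factor=1.0):
    """Score one test run using the sample demerit rules."""
    # Only latency above the ideal level is penalized.
    excess_latency_s = max(0.0, total_latency_s - ideal_latency_s)
    latency_penalty = excess_latency_s / total_requests * scaling_factor
    return (2 * aborted          # each aborted (2 s) connection
            + 1 * conn_errors    # each connection error
            + 8 * timeouts       # each user timeout (8 s)
            + latency_penalty)


# Made-up run: 10 aborts, 5 errors, 3 timeouts, 500 s of excess latency
# over 10,000 requests, with a hypothetical scaling factor of 100.
score = demerits(10, 5, 3, total_latency_s=1500.0, ideal_latency_s=1000.0,
                 total_requests=10_000, scaling_factor=100.0)
```

Here the fixed penalties contribute 20 + 5 + 24 = 49 demerits and the latency term adds about 5 more, so the scaling factor decides how much excess latency matters relative to outright failures.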
Online service optimization
[Figure: trade-off among performance metrics (throughput, latency & data quality), environment (workload & faults), and cost of operations & components. Labeled regions: cheap, robust & fast (optimal); cheap, fast & flaky; cheap & robust, but slow; expensive, fast & flaky; expensive & robust, but slow; expensive, robust and fast]
Conclusions
- Latency-based punctuality and throughput-based availability give similar results for a read-only web workload
- Applied workload is very important
  – Reliability metrics do not (and should not) reflect maximum performance/workload!
- Latency did not degrade gracefully in proportion to workload
  – At high loads, PRESS “oscillates” between full service and 100% load shedding
Further Work
- Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?)
- Further study will benefit from more sophisticated client & workload simulators
- Services that generate dynamic content should lead to more interesting data (ex: RUBiS)
Example: long-term model
- Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
- $p_i(t)$ = probability that system is in state $i$ at time $t$
- $w_i(t)$ = reward (disk I/O operations/sec)
- $\lambda$ = failure rate of a single disk drive
- $\mu$ = disk repair rate
- $D$ = number of data disks
[1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
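A model of this kind is typically evaluated by stepping the state probabilities forward and taking the reward-weighted sum over states, i.e. the sum of p_i(t) × w_i(t). The sketch below is not the model from the cited thesis: the three-state structure (all disks working, one disk failed, data lost), the per-step probabilities, and the reward values are all assumptions chosen for illustration.

```python
def raid5_transition_matrix(fail_prob, repair_prob, data_disks):
    """Per-step transition matrix for a hypothetical 3-state RAID-5 chain.

    States: 0 = all disks working, 1 = one disk failed (degraded),
    2 = data lost (absorbing). fail_prob and repair_prob are per-step
    probabilities for a single disk; all values here are assumptions.
    """
    first_failure = (data_disks + 1) * fail_prob  # any of D+1 disks fails
    second_failure = data_disks * fail_prob       # another disk fails while degraded
    return [
        [1.0 - first_failure, first_failure, 0.0],
        [repair_prob, 1.0 - repair_prob - second_failure, second_failure],
        [0.0, 0.0, 1.0],
    ]


def step(p, matrix):
    """Advance the state distribution p by one time step."""
    n = len(p)
    return [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def expected_reward(p, rewards):
    """Expected reward: sum over states of p_i(t) * w_i."""
    return sum(p_i * w_i for p_i, w_i in zip(p, rewards))


# Hypothetical numbers: 4 data disks, rewards in disk I/O operations/sec.
matrix = raid5_transition_matrix(fail_prob=1e-5, repair_prob=0.1, data_disks=4)
rewards = [1000.0, 600.0, 0.0]
p = [1.0, 0.0, 0.0]  # start with all disks working
for _ in range(1000):
    p = step(p, matrix)
long_run_estimate = expected_reward(p, rewards)
```

Time-independent rewards are used here for simplicity; a time-varying w_i(t) would just replace the fixed list at each step.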