Latency as a Performability Metric for Internet Services
Pete Broadwell
Outline
1. Performability background/review
2. Latency-related concepts
3. Project status
– Initial test results
– Current issues
Motivation
A goal of the ROC project: develop metrics to evaluate new recovery techniques
Problem: the basic concept of availability assumes a system is either “up” or “down” at a given time
– “Nines” only describe the fraction of uptime over a certain interval (e.g., “five nines” = 99.999% uptime)
Why Is Availability Insufficient?
Availability doesn’t describe the durations or frequencies of individual outages
– Both can strongly influence user perception of a service, as well as revenue
Availability doesn’t capture a system’s capacity to support degraded service
– degraded performance during failures
– reduced data quality during high load (Web)
What is “performability”?
Combination of performance and dependability measures
Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults [1]
– Concept originated in the traditional fault-tolerant systems community
– Has since been applied to other areas, but is still not in widespread use
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
p_i(t) = probability that the system is in state i at time t
λ = failure rate of a single disk drive
D = number of data disks
μ = disk repair rate
w_i(t) = reward (disk I/O operations/sec)
[1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
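A minimal sketch (not from the slides) of how such a DTMC reward model can be evaluated numerically. The three states, transition probabilities, and per-state rewards below are illustrative placeholders, not the values from Kari's thesis; the point is only that the expected reward E[w](t) = Σ_i p_i(t)·w_i(t) can be computed by iterating the chain.

```python
import numpy as np

# Illustrative parameters (placeholders, not Kari's actual values)
lam = 1e-5   # per-step failure rate of a single disk
mu = 1e-2    # per-step repair rate
D = 4        # number of data disks (plus one parity disk)

# Simplified 3-state DTMC for a RAID-5 array:
#   0 = all disks working, 1 = one disk failed (degraded), 2 = data loss
P = np.array([
    [1 - (D + 1) * lam, (D + 1) * lam,    0.0    ],  # from state 0
    [mu,                1 - mu - D * lam, D * lam],  # from state 1
    [0.0,               0.0,              1.0    ],  # data loss is absorbing
])

# Per-state reward w_i: I/O operations/sec (degraded mode is slower)
w = np.array([1000.0, 400.0, 0.0])

def expected_reward(p0, steps):
    """Transient solution: E[w](t) = sum_i p_i(t) * w_i, for t = 0..steps-1."""
    p = p0.copy()
    rewards = []
    for _ in range(steps):
        rewards.append(p @ w)
        p = p @ P            # p(t+1) = p(t) * P
    return rewards

# Start with all disks working and evaluate the first 10,000 steps
series = expected_reward(np.array([1.0, 0.0, 0.0]), 10_000)
print(f"E[reward] at t=0: {series[0]:.1f}, at t=9999: {series[-1]:.1f}")
```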
Visualizing Performability
[Figure: throughput (I/O operations/sec) vs. time — normal throughput drops to a degraded level between FAILURE/DETECT and RECOVER/REPAIR, then returns to normal; the average throughput over the interval is shown as a reference line]
Metrics for Web Services
Throughput – requests/sec
Latency – render time, time to first byte
Data quality [1]
– harvest (response completeness)
– yield (% of queries answered)
[1] E. Brewer, Lessons from Giant-Scale Internet Services, 2001
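A minimal sketch of the harvest/yield idea from Brewer's paper, applied to a batch of requests. The per-request record format and the choice to average harvest over answered queries are assumptions for illustration.

```python
# Yield: fraction of offered queries that were answered at all.
# Harvest: completeness of the data behind each answer.

def yield_fraction(requests):
    """Fraction of offered queries that received any answer."""
    answered = sum(1 for r in requests if r["answered"])
    return answered / len(requests)

def harvest_fraction(requests):
    """Average completeness of the answered queries
    (fraction of the full result set actually returned)."""
    answered = [r for r in requests if r["answered"]]
    if not answered:
        return 0.0
    return sum(r["partitions_returned"] / r["partitions_total"]
               for r in answered) / len(answered)

requests = [
    {"answered": True,  "partitions_returned": 8, "partitions_total": 8},
    {"answered": True,  "partitions_returned": 6, "partitions_total": 8},  # degraded answer
    {"answered": False, "partitions_returned": 0, "partitions_total": 8},  # dropped query
]
print(f"yield = {yield_fraction(requests):.2f}, harvest = {harvest_fraction(requests):.2f}")
```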
Applications of Metrics
Modeling the expected failure-related performance of a system, prior to deployment
Benchmarking the performance of an existing system during various recovery phases
Comparing the reliability gains offered by different recovery strategies
Related Projects
HP: Automating Data Dependability
– uses “time to data access” as one objective for storage systems
Rutgers: PRESS/Mendosus
– evaluated throughput of the PRESS server during injected failures
IBM: Autonomic Storage
Numerous ROC projects
Arguments for Using Latency as a Metric
Originally, performability metrics were meant to capture end-user experience [1]
Latency better describes the experience of an end user of a web site
– response time > 8 sec = site abandonment = lost income $$ [2]
Throughput describes the raw processing ability of a service
– best used to quantify expenses
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
[2] Zona Research and Keynote Systems, The Need for Speed II, 2001
Current Progress
Using the Mendosus fault injection system on a 4-node PRESS web server (both from Rutgers)
Running latency-based performability tests on the cluster
– Inject faults during a load test
– Record page-load times before, during and after faults
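A minimal sketch of the kind of measurement loop described above, assuming a generic HTTP client. This is not the actual Mendosus/PRESS test harness; the URL and timing parameters are placeholders.

```python
import time
import urllib.request

# Hypothetical target and test parameters (placeholders, not the real setup)
URL = "http://press-cluster.example:8080/index.html"
DURATION_SEC = 300     # length of the load test
INTERVAL_SEC = 0.5     # delay between requests

def run_load_test():
    """Issue requests at a fixed rate and record per-request latency,
    so page-load times can be compared before, during and after a fault."""
    samples = []  # (seconds since start, latency in seconds or None on failure)
    start = time.time()
    while time.time() - start < DURATION_SEC:
        t0 = time.time()
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                resp.read()                      # read the full page body
            samples.append((t0 - start, time.time() - t0))
        except OSError:
            samples.append((t0 - start, None))   # failed or timed-out request
        time.sleep(INTERVAL_SEC)
    return samples

if __name__ == "__main__":
    for ts, lat in run_load_test():
        print(f"{ts:8.1f}s  {'FAIL' if lat is None else f'{lat:.3f}s'}")
```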
Test Setup
Normal version: cooperative caching
HA version: cooperative caching + heartbeat monitoring
[Diagram: test clients send page requests through an emulated switch to the PRESS web server cluster running under Mendosus; server nodes exchange caching info and return page responses]
Effect of Component Failure on Performability Metrics
[Figure: performability metric vs. time between FAILURE and REPAIR, showing curves for both throughput and latency]
Observations
Below saturation, throughput tracks the offered load while latency stays roughly constant
Above saturation, latency rises sharply with load
– e.g., thru = 3/s, lat = .14 s; thru = 6/s, lat = .14 s; thru = 7/s, lat = .4 s
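This saturation behavior matches standard open-queue intuition (a textbook illustration, not from the slides or the PRESS measurements): in an M/M/1 approximation the mean response time is W = 1/(μ − λ), which grows slowly at light load and blows up as the arrival rate λ approaches the service rate μ. A minimal sketch with a hypothetical service rate:

```python
# Textbook M/M/1 illustration: mean response time W = 1 / (mu - lam)
# rises slowly at light load and explodes near saturation.
MU = 8.0  # hypothetical service rate, requests/sec (an assumption)

for lam in (1.0, 3.0, 6.0, 7.0, 7.9):
    w = 1.0 / (MU - lam)
    print(f"load = {lam:4.1f}/s  ->  mean latency = {w:5.2f}s")
```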
How to Represent Latency?
Average response time over a given time period
– Make a distinction between “render time” and “time to first byte”?
Deviation from baseline latency
– Impose a greater penalty for deviations toward longer wait times?
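A minimal sketch of the two candidate representations, assuming a list of measured response times and a known baseline value (both placeholders); the asymmetric weighting factor is an illustrative choice.

```python
def average_response_time(samples):
    """Plain average response time over the measurement window."""
    return sum(samples) / len(samples)

def baseline_deviation(samples, baseline, slow_weight=2.0):
    """Mean deviation from the baseline latency, penalizing deviations
    toward longer wait times more heavily than faster-than-baseline ones."""
    total = 0.0
    for s in samples:
        delta = s - baseline
        total += slow_weight * delta if delta > 0 else abs(delta)
    return total / len(samples)

samples = [0.14, 0.15, 0.40, 1.20, 0.13]   # illustrative page-load times (sec)
print(average_response_time(samples))       # -> 0.404
print(baseline_deviation(samples, 0.14))    # slow outliers dominate the score
```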
Response Time with Load Shedding Policy
[Figure: response time (sec) vs. time between FAILURE and REPAIR — response time climbs toward the 8 s abandonment threshold; once the load-shedding threshold is crossed, X users receive a “server too busy” message instead]
Load Shedding Issues
Load shedding means returning 0% data quality – a different kind of performability metric
To combine load shedding and latency, define a “demerit” system:
– “Server too busy” msg – 3 demerits
– 8 sec response time – 1 demerit/sec
Such systems quickly lose generality, however
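A minimal sketch of one way such a demerit score could be computed, using the example weights from the slide. The per-request record format is an assumption, and charging one demerit per second of response time is one plausible reading of “1 demerit/sec”.

```python
# Demerit-style score combining latency and load shedding; lower is better.
BUSY_DEMERITS = 3.0        # flat cost when a user sees "server too busy"
DEMERITS_PER_SEC = 1.0     # cost per second of response time

def demerit_score(requests):
    """Total demerits accumulated over a test run."""
    total = 0.0
    for r in requests:
        if r["shed"]:                      # request shed under load
            total += BUSY_DEMERITS
        else:
            total += DEMERITS_PER_SEC * r["response_time"]
    return total

run = [
    {"shed": False, "response_time": 0.14},
    {"shed": False, "response_time": 8.0},   # hit the abandonment threshold
    {"shed": True,  "response_time": 0.0},   # "server too busy" message
]
print(demerit_score(run))   # -> 0.14 + 8.0 + 3.0 = 11.14
```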
Further Work
Collect more experimental results!
Compare throughput- and latency-based results for the normal and high-availability versions of PRESS
Evaluate the usefulness of “demerit” systems to describe the user experience (latency and data quality)