Latency as a Performability Metric for Internet Services Pete Broadwell

Presentation transcript:


Outline
1. Performability background/review
2. Latency-related concepts
3. Project status
   – Initial test results
   – Current issues

Motivation
A goal of the ROC project: develop metrics to evaluate new recovery techniques
Problem: the basic concept of availability assumes a system is either “up” or “down” at a given time
“Nines” only describe the fraction of uptime over a certain interval

Why Is Availability Insufficient?
Availability doesn’t describe the durations or frequencies of individual outages
   – Both can strongly influence user perception of the service, as well as revenue
Availability doesn’t capture a system’s capacity to support degraded service
   – degraded performance during failures
   – reduced data quality during high load (Web)
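
The first point can be made concrete with a small arithmetic sketch. The two outage logs below are hypothetical (they are not measurements from this talk): one has a single long outage, the other has a hundred short blips, and both yield the same availability figure.

```python
from dataclasses import dataclass

@dataclass
class OutageLog:
    interval_s: float   # length of the observation window, in seconds
    outages_s: list     # duration of each individual outage, in seconds

def availability(log):
    """Fraction of the interval during which the system was up."""
    return 1.0 - sum(log.outages_s) / log.interval_s

DAY = 24 * 60 * 60  # 86,400 seconds

# Hypothetical logs: same total downtime, very different user experience.
one_long = OutageLog(DAY, [864.0])          # a single 14.4-minute outage
many_short = OutageLog(DAY, [8.64] * 100)   # one hundred 8.64-second blips

for name, log in [("one long", one_long), ("many short", many_short)]:
    print(name, round(availability(log), 4), "outages:", len(log.outages_s))
```

Both logs report the same fraction of uptime, so the availability number alone cannot distinguish them.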

What is “performability”?
Combination of performance and dependability measures
Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults [1]
   – Concept from the traditional fault-tolerant systems community, ca.
   – Has since been applied to other areas, but is still not in widespread use
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
p_i(t) = probability that the system is in state i at time t
w_i(t) = reward (disk I/O operations/sec)
λ = failure rate of a single disk drive
μ = disk repair rate
D = number of data disks
[1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
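
A minimal sketch of how such a model produces a performability number, assuming the usual Markov reward formulation in which the expected reward at time t is the sum over states of p_i(t) · w_i(t). The three-state chain, the per-step probabilities `lam` and `mu`, and the reward values are illustrative assumptions, not the model or figures from Kari's thesis, and the rewards are held constant over time for simplicity.

```python
def step(p, P):
    """Advance the state distribution one time step: p(t+1)[j] = sum_i p(t)[i] * P[i][j]."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def expected_reward(p, w):
    """Performability at one instant: sum_i p_i(t) * w_i."""
    return sum(p_i * w_i for p_i, w_i in zip(p, w))

def raid5_transition_matrix(lam, mu, D):
    """Assumed 3-state chain for an array of D data disks plus one parity disk.

    State 0: all D+1 disks healthy
    State 1: one disk failed (degraded mode, repair in progress)
    State 2: second failure before repair completed (data loss, absorbing)
    lam and mu are per-step failure and repair probabilities (assumptions).
    """
    f0 = (D + 1) * lam   # first-order probability that any of D+1 disks fails
    f1 = D * lam         # probability that one of the remaining D disks fails
    return [
        [1.0 - f0, f0, 0.0],
        [mu, 1.0 - mu - f1, f1],
        [0.0, 0.0, 1.0],
    ]

P = raid5_transition_matrix(lam=1e-5, mu=0.1, D=4)
w = [1000.0, 600.0, 0.0]   # hypothetical rewards in I/O operations/sec per state
p = [1.0, 0.0, 0.0]        # start with all disks healthy

for _ in range(1000):
    p = step(p, P)

print("state probabilities:", p)
print("expected throughput:", expected_reward(p, w))
```

The expected throughput starts at the healthy-state reward and drifts downward as probability mass moves into the degraded and data-loss states.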

Visualizing Performability
[Figure: throughput (I/O operations/sec) plotted against time. Labels: normal throughput, degraded throughput, average throughput; events marked FAILURE, DETECT, RECOVER, REPAIR.]

Metrics for Web Services
Throughput: requests/sec
Latency: render time, time to first byte
Data quality [1]
   – harvest (response completeness)
   – yield (% of queries answered)
[1] E. Brewer, Lessons from Giant-Scale Services, 2001
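
The two data-quality metrics can be sketched as simple ratios, following Brewer's definitions (yield = queries completed / queries offered; harvest = data available / complete data). The function names and the sample numbers are illustrative assumptions.

```python
def yield_fraction(queries_answered, queries_offered):
    """Yield: fraction of offered queries that received an answer."""
    if queries_offered == 0:
        return 1.0  # convention chosen here: no queries offered means nothing was dropped
    return queries_answered / queries_offered

def harvest_fraction(data_available, data_complete):
    """Harvest: fraction of the complete data reflected in a response."""
    if data_complete == 0:
        return 1.0  # convention chosen here: an empty data set is trivially complete
    return data_available / data_complete

# Hypothetical example: 1000 queries offered, 950 answered,
# each answer drawn from 3 of 4 data partitions.
print(yield_fraction(950, 1000))
print(harvest_fraction(3, 4))
```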

Applications of Metrics
Modeling the expected failure-related performance of a system, prior to deployment
Benchmarking the performance of an existing system during various recovery phases
Comparing the reliability gains offered by different recovery strategies

Related Projects
HP: Automating Data Dependability
   – uses “time to data access” as one objective for storage systems
Rutgers: PRESS/Mendosus
   – evaluated throughput of the PRESS server during injected failures
IBM: Autonomic Storage
Numerous ROC projects

Arguments for Using Latency as a Metric
Originally, performability metrics were meant to capture end-user experience [1]
Latency better describes the experience of an end user of a web site
   – response time >8 sec = site abandonment = lost income ($$) [2]
Throughput describes the raw processing ability of a service
   – best used to quantify expenses
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
[2] Zona Research and Keynote Systems, The Need for Speed II, 2001

Current Progress
Using the Mendosus fault injection system on a 4-node PRESS web server (both from Rutgers)
Running latency-based performability tests on the cluster
   – Inject faults during load test
   – Record page-load times before, during and after faults

Test Setup
Normal version: cooperative caching
HA version: cooperative caching + heartbeat monitoring
[Figure: test clients connected through an emulated switch to the PRESS web server running under Mendosus; arrows labeled Request, Caching info, Page, Response.]

Effect of Component Failure on Performability Metrics
[Figure: performability metric plotted against time, with throughput and latency curves and the FAILURE and REPAIR events marked.]

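Observations
Below saturation, throughput depends more on load than latency does
Above saturation, latency depends more on load than throughput does
[Figure: timeline with three measured operating points: throughput = 6/s, latency = 0.14 s; throughput = 3/s, latency = 0.14 s; throughput = 7/s, latency = 0.4 s.]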

How to Represent Latency?
Average response time over a given time period
   – Make a distinction between “render time” & “time to first byte”?
Deviation from baseline latency
   – Impose a greater penalty for deviations toward longer wait times?
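
Both representations can be sketched in a few lines. The asymmetric weighting below is only one possible way to penalize slow responses more heavily; the function names, the `slow_weight` parameter, and its value of 2.0 are assumptions made for illustration, not a scheme proposed in the talk.

```python
from statistics import mean

def average_response_time(samples_s):
    """Mean response time (seconds) over the samples collected in one period."""
    return mean(samples_s)

def deviation_penalty(samples_s, baseline_s, slow_weight=2.0):
    """Mean absolute deviation from the baseline latency, with slower-than-baseline
    deviations multiplied by slow_weight (assumed penalty factor)."""
    total = 0.0
    for s in samples_s:
        diff = s - baseline_s
        total += slow_weight * diff if diff > 0 else -diff
    return total / len(samples_s)

# Hypothetical samples: two responses at the baseline, one slow response.
samples = [0.14, 0.14, 0.40]
print(average_response_time(samples))
print(deviation_penalty(samples, baseline_s=0.14))
```

With `slow_weight=1.0` the second function reduces to a plain mean absolute deviation, so the weight isolates how much extra cost is assigned to longer waits.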

Response Time with Load Shedding Policy
[Figure: response time (sec) plotted against time, with the FAILURE and REPAIR events, the 8 s abandonment threshold, and the load-shedding threshold marked; X users get a “server too busy” message.]

Load Shedding Issues
Load shedding means returning 0% data quality, a different kind of performability metric
To combine load shedding and latency, define a “demerit” system:
   – “Server too busy” msg: 3 demerits
   – 8 sec response time: 1 demerit/sec
Such systems quickly lose generality, however
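
A minimal sketch of such a demerit system, using the two example weights from the slide. How the per-second charge applies is an assumption here: every served request is charged 1 demerit per second of response time, so an 8-second response costs 8 demerits. The names and the use of `None` to mark a shed request are also illustrative.

```python
from typing import Iterable, Optional

BUSY_DEMERITS = 3.0         # flat cost of a "server too busy" message (from the slide)
DEMERITS_PER_SECOND = 1.0   # cost per second of response time (interpretation assumed)

def request_demerits(response_time_s: Optional[float]) -> float:
    """Demerits for one request; None means the request was shed."""
    if response_time_s is None:
        return BUSY_DEMERITS
    return DEMERITS_PER_SECOND * response_time_s

def total_demerits(requests: Iterable[Optional[float]]) -> float:
    """Sum demerits over a sequence of requests."""
    return sum(request_demerits(r) for r in requests)

# Hypothetical trace: one shed request, one 8-second response, one fast response.
print(total_demerits([None, 8.0, 0.5]))
```

Under these weights a shed request costs less than any response slower than 3 seconds, which shows how strongly the conclusions depend on the chosen constants.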

Further Work
Collect more experimental results!
Compare throughput- and latency-based results of the normal and high-availability versions of PRESS
Evaluate the usefulness of “demerit” systems for describing the user experience (latency and data quality)
