Metrics and Techniques for Evaluating the Performability of Internet Services Pete Broadwell

Outline
1. Introduction to performability
2. Performability metrics for Internet services
   – Throughput-based metrics (Rutgers)
   – Latency-based metrics (ROC)
3. Analysis and future directions

Motivation
Goal of ROC project: develop metrics to evaluate new recovery techniques
Problem: the concept of availability assumes a system is either “up” or “down” at a given time
Availability doesn’t capture a system’s capacity to support degraded service:
– degraded performance during failures
– reduced data quality during high load

What is “performability”?
A combination of performance and dependability measures
Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults [1]
– Concept from the traditional fault-tolerant systems community, ca. 1978
– Has since been applied to other areas, but is still not in widespread use

[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
p_i(t) = probability that the system is in state i at time t
λ = failure rate of a single disk drive
D = number of data disks
μ = disk repair rate
w_i(t) = reward (disk I/O operations/sec)

[1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
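To make the reward formalism concrete, here is a minimal Python sketch (not from the talk) that evaluates expected reward for a three-state RAID-5 DTMC; the rates, rewards, and one-hour time step are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative, not from the talk): a three-state DTMC
# for a RAID-5 array — state 0 = all disks up, state 1 = one disk failed
# (degraded), state 2 = second failure before repair (data loss).
lam = 1e-4   # per-disk failure rate, per hour (hypothetical)
mu = 0.1     # repair rate, per hour (hypothetical)
D = 4        # data disks; n = D + 1 disks including parity
n = D + 1

# One-step (one-hour) transition probabilities approximated from the rates
P = np.array([
    [1 - n * lam, n * lam,                0.0],
    [mu,          1 - mu - (n - 1) * lam, (n - 1) * lam],
    [0.0,         0.0,                    1.0],  # data loss is absorbing
])
w = np.array([1000.0, 600.0, 0.0])  # reward w_i: disk I/O ops/sec per state

p = np.array([1.0, 0.0, 0.0])       # start in state 0 (all disks working)
hours = 24 * 365                    # evaluate over one year
total = 0.0
for _ in range(hours):
    total += p @ w                  # accumulate expected reward E[w]
    p = p @ P                       # p(t+1) = p(t) P

print(f"time-averaged expected reward: {total / hours:.1f} I/O ops/sec")
```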

Performability for Online Services: Rutgers Study
Rich Martin (UCB alum) et al. wanted to quantify tradeoffs between web server designs, using a single metric for both performance and availability
Approach:
– Performed fault injection on PRESS, a locality-aware, cluster-based web server
– Measured throughput of the cluster during simulated faults and during normal operation

Degraded Service During a PRESS Component Fault
[Figure: throughput (requests/sec) over time during a component fault, showing the phases FAILURE, DETECT, STABILIZE, RECOVER, REPAIR (human operator), and an optional RESET]

Calculation of Average Throughput, Given Faults
[Figure: throughput (requests/sec) over time, showing normal throughput, degraded throughput during a fault, and the resulting average throughput]

Behavior of a Performability Metric
Effect of improving degraded performance
[Figure: performability vs. performance during faults]

Behavior of a Performability Metric
Effect of improving component availability (shorter MTTR, longer MTTF)
[Figure: performability vs. MTTR and MTTF]
Availability = MTTF / (MTTF + MTTR)
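A quick worked example with hypothetical numbers (not from the talk): a component with MTTF = 1,000 hours and MTTR = 2 hours has availability 1000 / (1000 + 2) ≈ 0.998; halving MTTR to 1 hour raises this to 1000 / 1001 ≈ 0.999.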

Behavior of a Performability Metric
Effect of improving overall performance (includes normal operation)
[Figure: performability vs. overall performance]
Most performability metrics scale linearly as component availability, degraded performance, and overall performance increase

Results of Rutgers Study: Design Comparisons

An Alternative Metric: Response Latency
Originally, performability metrics were meant to capture end-user experience [1]
Latency better describes the experience of an end user of a web site
– response time > 8 sec = site abandonment = lost income $$ [2]
Throughput describes the raw processing ability of a service
– best used to quantify expenses

[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
[2] Zona Research and Keynote Systems, The Need for Speed II, 2001

Effect of Component Failure on Response Latency
[Figure: response latency (sec) over time between FAILURE and REPAIR; latencies above 8 s fall into an “abandonment region,” with a lower “annoyance region?” below it]

Issues With Latency As a Performability Metric
Modeling concerns:
– Human element: retries and abandonment
– Queuing issues: buffering and timeouts
– Unavailability of the load balancer due to faults
– Burstiness of workload
Latency is more accurately modeled at the service than end-to-end [1]
Alternate approach: evaluate an existing system

[1] M. Merzbacher and D. Patterson, Measuring End-User Availability on the Web: Practical Experience, 2002

Analysis
Queuing behavior may have a significant effect on latency-based performability evaluation (see the sketch below)
– Long component MTTRs = longer waits, lower latency-based score
– High performance in the normal case = faster queue drain after repair, higher latency-based score
More study is needed!
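A minimal fluid-approximation sketch (hypothetical numbers, not from the talk) illustrates both bullets: requests pile up while capacity is degraded, and a faster server drains the backlog sooner after repair.

```python
# Minimal sketch (hypothetical numbers): a FIFO queue fed at a constant
# arrival rate, with service capacity degraded during a fault of length MTTR.
def backlog_after_repair(arrival_rate, normal_rate, degraded_rate, mttr_s):
    """Return queue length at repair time and seconds needed to drain it."""
    backlog = max(0.0, (arrival_rate - degraded_rate) * mttr_s)
    drain_rate = normal_rate - arrival_rate   # spare capacity after repair
    drain_time = backlog / drain_rate if drain_rate > 0 else float("inf")
    return backlog, drain_time

# Longer MTTR -> bigger backlog -> longer waits (lower latency-based score)
for mttr in (60, 600):
    b, t = backlog_after_repair(arrival_rate=800, normal_rate=1000,
                                degraded_rate=500, mttr_s=mttr)
    print(f"MTTR={mttr:4d}s: backlog={b:8.0f} reqs, drain time={t:6.0f}s")

# Higher normal-case performance -> faster queue drain after repair
for normal in (900, 1500):
    b, t = backlog_after_repair(arrival_rate=800, normal_rate=normal,
                                degraded_rate=500, mttr_s=300)
    print(f"normal={normal:4d} req/s: drain time after repair={t:6.0f}s")
```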

Future Work
– Further collaboration with Rutgers on collecting new measurements for latency-based performability analysis
– Development of more realistic fault and workload models, and of other performability factors such as data quality
– Research into methods for conducting automated performability evaluations of web services

Metrics and Techniques for Evaluating the Performability of Internet Services Pete Broadwell

Back-of-the-Envelope Latency Calculations
Attempted to infer average request latency for the PRESS servers from the Rutgers data set (see the sketch below)
– Required many simplifying assumptions, relying upon knowledge of the PRESS server design
– Hoped to expose areas in which throughput- and latency-based performability evaluations differ
Assumptions:
– FIFO queuing with no timeouts or overflows
– Independent faults, constant workload (also assumed by the throughput-based model)
Current models do not capture the “completeness” of data returned to the user
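The talk does not show the actual calculation; one plausible back-of-the-envelope approach consistent with the FIFO/constant-workload assumptions is to estimate average latency from measured capacity with the M/M/1 mean response time W = 1 / (μ − λ). All numbers below are hypothetical.

```python
# Back-of-the-envelope sketch (hypothetical numbers, not the talk's actual
# calculation): estimate average request latency from measured capacity,
# using the M/M/1 mean response time W = 1 / (mu - lam) under FIFO queuing.
def avg_latency(service_rate, arrival_rate):
    """Mean response time for an M/M/1 queue; infinite if overloaded."""
    if arrival_rate >= service_rate:
        return float("inf")          # queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

lam = 800.0                          # request arrival rate (req/s)
normal_mu = 1000.0                   # cluster capacity, all nodes up
degraded_mu = 850.0                  # capacity with one node failed

print(f"normal:   {avg_latency(normal_mu, lam) * 1000:.1f} ms")
print(f"degraded: {avg_latency(degraded_mu, lam) * 1000:.1f} ms")
```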

Comparison of Performability Metrics

Rutgers calculations for long-term performability
Goal: a metric that scales linearly with both
– performance (throughput), and
– availability [MTTF / (MTTF + MTTR)]
T_n = normal throughput for the server
A_I = ideal availability (0.99999)
Average throughput (AT) = T_n during normal operation + per-component throughput during failure
Average availability (AA) = AT / T_n
Performability = T_n × [log(A_I) / log(AA)]
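A minimal sketch of the slide's metric in Python, using hypothetical component MTTF/MTTR values and throughputs (not the study's measured numbers):

```python
import math

# Sketch of the long-term performability metric from the slide, with
# illustrative inputs only.
T_n = 3000.0                 # normal throughput (requests/sec)
A_I = 0.99999                # ideal availability ("five nines")

# Per-component (MTTF hours, MTTR hours, throughput while that component
# is failed) — hypothetical values.
components = [
    (500.0,  1.0, 1800.0),   # e.g., one cluster node down: degraded service
    (2000.0, 4.0,    0.0),   # e.g., switch down: service unavailable
]

# Fraction of time each component is down, assuming independent faults
down = [mttr / (mttf + mttr) for mttf, mttr, _ in components]
up = 1.0 - sum(down)

# Average throughput: normal throughput while fault-free, plus degraded
# throughput weighted by each component's downtime fraction
AT = up * T_n + sum(f * t_deg for f, (_, _, t_deg) in zip(down, components))

AA = AT / T_n                # average availability
performability = T_n * (math.log(A_I) / math.log(AA))
print(f"AT = {AT:.0f} req/s, AA = {AA:.5f}, performability = {performability:.1f}")
```

Note the design of the metric: if AA equals the ideal availability A_I, the log ratio is 1 and performability equals the normal throughput T_n; lower average availability shrinks the score.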

Results of Rutgers study: performance comparison

Results of Rutgers study: availability comparison

Results of Rutgers study: performability comparison