Managing the Performance Impact of Administrative Utilities Paper by S. Parekh, K. Rose, J. Hellerstein, S. Lightstone, M. Huras, and V. Chang Presentation.

Presentation transcript:

Managing the Performance Impact of Administrative Utilities
Paper by S. Parekh, K. Rose, J. Hellerstein, S. Lightstone, M. Huras, and V. Chang
Presentation and discussion led by N. Tchervenski
CS 848, University of Waterloo, November 1, 2006

Outline
- Introduction: performance impact of administrative utilities
- Proposed solution
- Architecture and control theory
- Tests performed
- Conclusion
- Discussion

Performance Impact of Administrative Utilities
- Administrative utilities
  - Are essential to the system
  - Have a performance impact
- With 24/7 operation, there is never a good time to suffer performance degradation
- Solution: find a way to slow the utilities down

Example of DB Running a Backup
* Throughput and response time averaged over 60 s intervals

How to Slow Down a Utility
- Performance impact is dynamic, for both utilities and regular workloads (WLs)
- Low-level approach
  - Per-resource quotas / priorities
  - Difficult to manage
- Admin utility performance policy: at most x% degradation of production work
  - How to throttle utilities?
  - SIS: self-imposed sleep
  - How to translate the policy requirement into throttling units?

SIS – Self-imposed Sleep

Action Interval and Sleep Fraction
- Action interval = workTime + sleepTime
- With the action interval held constant, only the sleep fraction is needed:
  - Sleep fraction = sleepTime / action interval
  - Sleep fraction = 0: unthrottled; 1: stopped
- Suggested value for the action interval: at least a few iterations of the "main loop" of the utility
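A minimal Python sketch of how a utility's main loop might apply self-imposed sleep, assuming the definitions on this slide. The names `sleep_time_for` and `run_throttled`, and the `sleep` / `clock` parameters, are hypothetical and not taken from the paper or from DB2.

```python
import time


def sleep_time_for(sleep_fraction, action_interval):
    """Seconds to sleep per action interval: sleepTime = fraction * interval."""
    if not 0.0 <= sleep_fraction <= 1.0:
        raise ValueError("sleep_fraction must be in [0, 1]")
    return sleep_fraction * action_interval


def run_throttled(work_units, do_work, get_sleep_fraction,
                  action_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Process work units, sleeping for part of every action interval.

    get_sleep_fraction is called after each unit, so an external throttle
    manager can change the fraction while the utility is running.
    """
    interval_start = clock()
    for unit in work_units:
        do_work(unit)  # one iteration of the utility's main loop
        pause = sleep_time_for(get_sleep_fraction(), action_interval)
        work_time = action_interval - pause
        # Once the work share of the interval is used up, sleep the rest.
        if clock() - interval_start >= work_time:
            if pause > 0:
                sleep(pause)
            interval_start = clock()
```

In this sketch the sleep is only checked between work units, so a sleep fraction of 1 still lets one unit run per interval. That is a simplification of the "stopped" case on the slide.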

Throttle Manager Architecture
[Figure: throttle manager architecture. Labels: X%, sleepTime, action interval = const, linear model based on PI controller]

Degradation Estimator
- Baseline estimator: system performance without utilities
  - Degradation = 1 – performance / baseline
- How to determine the baseline?
  - Stop all utilities (drawbacks: WL surges, short-term performance, underutilized resources)
  - Linear fit of Performance = f(sleepTime) = Q1*sleepTime + Q0
- Recursive least squares with exponential forgetting
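The linear fit with recursive least squares and exponential forgetting could be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the regressor is the sleep fraction rather than sleepTime (equivalent up to scaling when the action interval is constant), the baseline is read off the fitted line at sleep fraction 1 (utility stopped), and the class name, forgetting factor, and initial variance are hypothetical.

```python
class BaselineEstimator:
    """Fits performance = Q1 * sleep_fraction + Q0 by recursive least squares."""

    def __init__(self, forgetting=0.95, initial_variance=1000.0):
        self.lam = forgetting      # forgetting factor in (0, 1]; assumed value
        self.theta = [0.0, 0.0]    # parameter estimates [Q1, Q0]
        self.P = [[initial_variance, 0.0], [0.0, initial_variance]]

    def update(self, sleep_fraction, performance):
        x = [sleep_fraction, 1.0]
        P = self.P
        # P is symmetric, so P*x also serves as (x^T * P) transposed.
        Px = [P[0][0] * x[0] + P[0][1] * x[1],
              P[1][0] * x[0] + P[1][1] * x[1]]
        denom = self.lam + x[0] * Px[0] + x[1] * Px[1]
        gain = [Px[0] / denom, Px[1] / denom]
        error = performance - (self.theta[0] * x[0] + self.theta[1] * x[1])
        self.theta = [self.theta[0] + gain[0] * error,
                      self.theta[1] + gain[1] * error]
        # Dividing by the forgetting factor discounts older samples.
        self.P = [[(P[i][j] - gain[i] * Px[j]) / self.lam for j in range(2)]
                  for i in range(2)]

    def baseline(self):
        """Predicted performance with the utility fully throttled."""
        q1, q0 = self.theta
        return q1 * 1.0 + q0

    def degradation(self, performance):
        """Degradation = 1 - performance / baseline, as on the slide."""
        base = self.baseline()
        return 1.0 - performance / base if base > 0 else 0.0
```

A forgetting factor closer to 1 averages over more history and gives a smoother but slower-adapting baseline. A smaller value adapts faster and is noisier.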

Linear Fit Example of Sleep/Throughput
- Steady workload; backup throttling kept constant for 20-minute intervals
[Figure labels: estimated baseline, actual baseline]

Controller
- Goal: current degradation = degradation limit
  - Error = degradation limit – current degradation
- PI controller used
  - Throttling(k+1) = Kp * error(k) + Ki * Sum(error(i), i=0..k)
  - Kp: proportional gain, used to increase the speed of response
  - Ki: integral gain, eliminates steady-state error
  - Kp, Ki and the control interval can be hard-coded or determined at runtime
  - Kp and Ki can be estimated using pole placement from control theory, but experiments are necessary to confirm the resulting values [2]
- Experiments in this paper:
  - Control interval = 20 seconds
  - Kp and Ki the same across all experiments
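The PI control law on this slide can be sketched in Python as below. The slide does not say whether "Throttling" denotes the sleep fraction or the share of time the utility is allowed to work. This sketch assumes the latter (so sleep fraction = 1 - output), which lets positive gains produce negative feedback. The gain values, the clamping to [0, 1], and the toy plant in `simulate` are all hypothetical and not taken from the paper.

```python
class PIController:
    """u(k+1) = Kp * error(k) + Ki * sum(error(i), i=0..k)."""

    def __init__(self, kp, ki, degradation_limit):
        self.kp = kp
        self.ki = ki
        self.limit = degradation_limit
        self.error_sum = 0.0

    def step(self, current_degradation):
        """Return the utility's allowed work fraction for the next interval."""
        error = self.limit - current_degradation
        self.error_sum += error
        raw = self.kp * error + self.ki * self.error_sum
        return min(1.0, max(0.0, raw))  # keep the output a valid fraction


def simulate(controller, max_degradation=0.6, steps=60):
    """Toy plant (assumed): degradation is proportional to the work fraction."""
    work_fraction = 0.0
    degradation = 0.0
    for _ in range(steps):
        degradation = max_degradation * work_fraction
        work_fraction = controller.step(degradation)
    return work_fraction, degradation


if __name__ == "__main__":
    work, deg = simulate(PIController(kp=0.5, ki=0.5, degradation_limit=0.3))
    print(f"work fraction {work:.3f}, degradation {deg:.3f}")
```

In the toy plant the integral term drives the error to zero, so the degradation settles at the limit. A real system would also need the gains tuned to it and protection against integral windup while the output is clamped.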

Tests Performed
- Testbed description
  - DB2 v8.1, 4-CPU RS/6000, 2 GB RAM, AIX 4.3, 8 physical disks
  - Workload similar to TPC-C
  - Initial "warm-up" period of 10 minutes, to stabilize system / bufferpools / etc.
  - Utility used: parallelized BACKUP, with multiple processes reading from multiple tablespaces and multiple other processes writing to separate disks

OS Priorities vs SIS (Sleep Fraction)
- No performance gain from changing the OS priority of the backup process
- OS priority works for CPU-intensive WLs; here we have an I/O-intensive WL, and the CPU is idle 80% of the time
- Linear effect when throttling using sleep
[Figure labels: WL alone, 100% throttling]

Dynamic Effect of SIS. Does "Turning the Knob" Actually Do Something?
- As in the previous slide, we don't get back to 100% throughput when fully throttled, but we're close
[Figure labels: backup started, 15 tps avg]

Feedback Control
- x = 30% degradation policy

Feedback Control Effectiveness
- Without BACKUP: 15 tps
- With x = 30%, steady workload (25 users)
  - 9.4 tps
  - 38% degradation
  - Why the throttling slump? Does the throttling system compensate for the decreasing resource demands of the backup?
- With x = 30%, workload surge at 1500 s, from 10 users to 25 users
  - Pre-surge degradation of 36%
  - Post-surge degradation of 19%
  - Still good results, close to the 30% policy

Causes for Deviation
- Baseline estimator: actual throughput is 15.1 tps vs. a projected value of 13.2 tps
- Because of system stochastics, degradation is not always estimated correctly
  - For example, the drop of the throttle at t = 1800 s
  - Quick to self-correct: correct results in the long term
  - Short-term violations could be avoided, at the cost of adaptation speed, by adjusting the forgetting factor in the online estimator
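A short worked calculation shows how much the baseline error matters, using Degradation = 1 - performance / baseline. It assumes that the 9.4 tps steady-workload figure and both baseline values (15.1 tps actual, 13.2 tps projected) come from the same run. The slides suggest this but do not state it.

```python
def degradation(performance, baseline):
    """Degradation = 1 - performance / baseline."""
    return 1.0 - performance / baseline


throughput = 9.4           # tps under the x = 30% policy, steady workload
actual_baseline = 15.1     # tps actually achieved without the utility
projected_baseline = 13.2  # tps predicted by the baseline estimator

actual = degradation(throughput, actual_baseline)        # about 0.377
estimated = degradation(throughput, projected_baseline)  # about 0.288

print(f"degradation against the actual baseline:    {actual:.1%}")
print(f"degradation against the projected baseline: {estimated:.1%}")
```

Under that assumption, the controller sees roughly 29% degradation, close to its 30% target, while the degradation measured against the actual baseline is about 38%. The gap comes from the underestimated baseline rather than from the control law.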

Conclusion
- Administrative utilities must be run, but there is no timeslot for them
- Proposed an application-based throttling mechanism: only the application's code needs to change, and it is OS/system independent
- Easy for administrators: they just specify a degradation policy
- Applicable to various systems
- Main requirements
  - Utility work must be identifiable (put the sleep there)
  - Performance can be measured, and without much overhead

Limitations and Future Work
- Test on multiple utilities
  - Throttle each utility separately?
- Propose and analyze different approaches for the controller
  - PI algorithm, recursive least squares estimator, etc. How to specify parameters for them?
  - Automate determination of controller parameters, as they are system dependent

Discussion
- Why the throttling slump in the feedback control?
- Even when backup is fully throttled, the system may not reach its earlier peak performance, since it needs more time to stabilize (i.e. the bufferpools again). This may be a better explanation for the difference between the projected baseline and the actual baseline.
- Even if tasks were CPU intensive, assigning them a priority through the OS is not guaranteed to work, since they may interact with other parts of the engine (issue queries, etc.). Can't slow the engine for that.
- Evidently this works, since it has been implemented in DB2 v8 and v9 for backup / rebalance / auto-runstats, all I/O-intensive tasks.
- Other ways to limit/control the impact of backup on the DB system: controlling bufferpools / memory. Automatic tuning of memory is introduced in DB2 v9.
- How to handle peak loads? How do we guarantee QoS?
  - Can we monitor not only TPS output, but also try to "expect" what the WL performance would be, based on # of clients, # of queries compiled/executed, bufferpool activity/misses?

References
[1] Sujay Parekh, Kevin Rose, Joseph L. Hellerstein, Sam Lightstone, Matthew Huras, and Victor Chang. Managing the performance impact of administrative utilities. In Self-Managing Distributed Systems - 14th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM 2003), number 2867 in Lecture Notes in Computer Science. Springer-Verlag.
[2] Diao, Y., Gandhi, N., Hellerstein, J.L., Parekh, S., Tilbury, D.M.: Using MIMO feedback control to enforce policies for interrelated metrics with application to the Apache web server. In: Proceedings of Network Operations and Management. (2002)