Reference-Driven Performance Anomaly Identification

Reference-Driven Performance Anomaly Identification
Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li
University of Rochester
SIGMETRICS 2009 11/21/2018

Performance Anomalies
- Complex software systems (like operating systems and distributed systems):
  - Many system features and configuration settings
  - Wide-ranging workload behaviors and concurrency
  - Their interactions
- Performance anomalies: low performance against expectation
  - Due to implementation errors, misconfigurations, or mismanaged interactions
- Anomalies degrade system performance and make system behaviors undependable

An Example Identified by Our Research
- Linux anticipatory I/O scheduler:

    /* max time we may wait to anticipate a read (default around 6ms) */
    #define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)

- HZ is the number of timer ticks per second, so (HZ/150) ticks is around 6.7 ms.
- However, the integer division is inaccurate:
  - HZ defaults to 1000 in earlier Linux versions, so the anticipation timeout is 6 ticks (6 ms).
  - HZ defaults to 250 in Linux 2.6.23, so the timeout becomes one tick (4 ms).
- Premature timeouts lead to additional disk seeks.
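The truncation described on this slide can be checked with a few lines of C. This is only a minimal sketch: `antic_expire_ticks` and `ticks_to_ms` are hypothetical helper names that mirror the macro's arithmetic for an arbitrary HZ value; they are not kernel functions.

```c
#include <assert.h>

/* Hypothetical helper (not kernel code): same arithmetic as
 * default_antic_expire, with HZ passed in as a parameter.
 * Integer division truncates, and a zero result is raised to one tick. */
static int antic_expire_ticks(int hz)
{
    assert(hz > 0);
    return (hz / 150) ? hz / 150 : 1;
}

/* Hypothetical helper: convert a tick count back to milliseconds. */
static double ticks_to_ms(int ticks, int hz)
{
    return 1000.0 * ticks / hz;
}
```

With `hz = 1000` the helper returns 6 ticks, which `ticks_to_ms` converts to 6 ms; with `hz = 250` it returns 1 tick, or 4 ms. Both fall short of the intended 1000/150 ≈ 6.7 ms, and the gap is much larger at HZ = 250.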

Challenges and Goals
- Challenges:
  - Often involving semantics of multiple system components
  - No obvious failure symptoms; normal performance isn’t always known or even clearly defined
  - Performance anomaly identifications are relatively rare: 4% of resolved Linux 2.4/2.6 I/O bugs are performance-oriented
- Goals:
  - Systematic techniques to identify performance anomalies; improve performance dependability
  - Consider wide-ranging configurations and workload conditions

Reference-driven Anomaly Identification
- Given two executions T (target) and R (reference):
  - If T performs much worse than R against expectation, we identify T as anomalous to R.
- Examples:
- How to systematically derive the expectations?

Change Profiles
- Goal: derive expected performance deviations between reference and target (or with a change of system parameters)
- Approach: inference from real system measurements
- Change profile: probabilistic distribution of performance deviations
- p-value(-0.5) = 0.039
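One way to read the p-value on this slide is as the probability mass of the change profile at or below the observed deviation. The sketch below estimates that empirically from measured deviations. It is an illustration under stated assumptions, not the authors' implementation: `profile_p_value` is a hypothetical name, the profile is assumed to be a plain array of sampled deviations, and more negative values are assumed to mean worse performance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: empirical p-value of an observed deviation.
 * Assumption: `deviations` holds performance deviations sampled from
 * real measurements (the change profile), and lower means worse.
 * Returns the fraction of samples that are at or below `observed`. */
static double profile_p_value(const double *deviations, size_t n,
                              double observed)
{
    size_t at_or_below = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (deviations[i] <= observed)
            at_or_below++;
    }
    return (double)at_or_below / (double)n;
}
```

A small return value means the observed degradation is rare among the measured deviations, which is what marks a target as a candidate anomaly.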

Scalable Anomaly Quantification
- Approach:
  - Construct single-parameter profiles through real system measurements
  - Analytically synthesize multiple single-parameter profiles for scalability
- Convolution-like synthesis
  - Assumes independent performance effects of different parameters
  - Assemble the multi-parameter performance deviation distribution using convolutions of single-parameter change profiles
- Generally applicable bounding analysis
  - Bound the multi-parameter p-value anomaly from single-parameter p-values (no need for parameter independence)
  - Find a tight bound (small p-value) through a Monte Carlo method
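The convolution step can be illustrated with discrete profiles. The sketch below assumes two things the slide does not spell out: each single-parameter change profile is a histogram over equal-width deviation bins, and deviations from independent parameters combine additively, so that bin `i` of one profile and bin `j` of the other contribute to bin `i + j` of the result. `convolve_profiles` is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: discrete convolution of two change profiles.
 * Assumptions: a[i] and b[j] are probabilities over equal-width
 * deviation bins, the two parameters have independent effects, and
 * their deviations add. `out` must have room for na + nb - 1 bins. */
static void convolve_profiles(const double *a, size_t na,
                              const double *b, size_t nb,
                              double *out)
{
    for (size_t k = 0; k < na + nb - 1; k++)
        out[k] = 0.0;

    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++)
            out[i + j] += a[i] * b[j];
    }
}
```

Convolving further single-parameter profiles into `out` one at a time would give a multi-parameter profile without measuring every parameter combination, which is the scalability argument made on the slide.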

Evaluation
- Linux I/O case study:
  - Five workload parameters and three system configuration parameters
  - Performance measurements at 300 sampled executions; use each other as references to identify anomalies
  - Anomalies are target executions with p-values of 0.05 or less
  - Validate through cause analysis; an anomaly without a validated cause is a probable false positive
- Results:
  - Linux 2.6.10: 35 identified; 34 validated; 1 probable false positive
  - Linux 2.6.23: 12 identified; 9 validated; 3 probable false positives
  - Linux 2.6.23 (target) vs. 2.6.10 (reference): 15 identified; all validated

Comparison
- Bounding analysis for multi-parameter anomaly quantification
- Convolution synthesis assuming parameter independence
- Rank target-reference anomaly using raw performance difference
- Convolution identifies more anomalies, but with higher false positives

Anomaly Cause Analysis
- Given a symptom (anomalous performance degradation from reference to target), root cause analysis is still challenging
  - The root cause sometimes lies in complex component interactions
  - The most useful hints often relate to low-level system activities
  - Efficient mechanisms are available to acquire a large number of system metrics (some anomaly-related), but they are difficult to sift through
- Approach: reference-driven filtering of anomaly-related metrics
  - Compare the metric manifestations of an anomalous target and its normal reference
  - Those that differ significantly may be anomaly-related
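The filtering idea can be sketched as a ranking of metrics by how much they differ between target and reference. The slide does not say how the difference is measured, so the score used here (absolute difference of one summary value per metric, normalized by the larger magnitude) is an assumption chosen for illustration, and `struct metric_diff`, `diff_score`, and `rank_metrics` are hypothetical names.

```c
#include <stdlib.h>

/* One system metric, summarized as a single value in each execution.
 * Hypothetical layout for illustration only. */
struct metric_diff {
    const char *name;
    double target;     /* value in the anomalous target execution */
    double reference;  /* value in the normal reference execution */
};

static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Assumed score: relative difference in [0, 1] for same-sign values. */
static double diff_score(const struct metric_diff *m)
{
    double t = abs_d(m->target), r = abs_d(m->reference);
    double scale = t > r ? t : r;

    return scale > 0.0 ? abs_d(m->target - m->reference) / scale : 0.0;
}

/* qsort comparator: larger scores sort first. */
static int by_score_desc(const void *a, const void *b)
{
    double sa = diff_score(a), sb = diff_score(b);

    return (sa < sb) - (sa > sb);
}

/* Sort metrics so the most strongly differing ones come first. */
static void rank_metrics(struct metric_diff *metrics, size_t n)
{
    qsort(metrics, n, sizeof *metrics, by_score_desc);
}
```

After `rank_metrics`, the head of the array holds the metrics most likely to be anomaly-related under this score, which gives a short list to inspect by hand rather than the full set of collected metrics.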

System Events and Metrics
Traced events:
- Process management: creation of a kernel thread; process fork or clone; process exit; process wait; process signal; wake up a process; CPU context switch
- System call: enter a system call; exit a system call
- Memory system: allocating pages; freeing pages; swapping pages in; swapping pages out
- File system: file exec; file open; file close; file read; file write; file seek; file ioctl; file prefetch operation; start waiting for a data buffer; end waiting for a data buffer
- I/O scheduling: I/O request arrival at the block level; re-queue an I/O request; dispatch an I/O request; remove an I/O request; I/O request completion
- SCSI device: SCSI read request; SCSI write request
- Interrupt: enter an interrupt handler; exit an interrupt handler
- Network socket: socket call; socket send; socket receive; socket creation
Derived system metrics:
- Inter-arrival time of each type of event
- Delays between causal events:
  - delay between a system call enter and exit
  - delay between file system buffer wait start and end
  - delay between a block-level I/O request arrival and its dispatch
  - delay between a block-level I/O request dispatch and its completion
- Parameters of events:
  - file prefetch size
  - SCSI I/O request size
  - file offset of each I/O operation to the block device
- I/O concurrency:
  - system call level
  - block level
  - SCSI device level
Up to 1361 metrics in Linux 2.6.23

A Case Result
- Top ranked metrics: anticipatory I/O timeouts and anticipation breaks
- Anomaly cause: incorrect timeout setting when timer ticks per second (HZ) changes from 1000 to 250 in Linux 2.6.23

    #define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)

Effects of Anomaly Corrections
- Anomaly corrections lead to predictable performance behavior patterns

Related Work
- Peer differencing for debugging
  - Delta debugging [Zeller’02]: differencing program runs of various inputs
  - PeerPressure [Wang et al.’04]: differencing Windows registry settings
  - Triage [Tucek et al.’07]: differencing basic block execution frequency
  → Target program/system failures; failure symptoms easily identifiable; correct peers presumably known
- Our performance anomaly identification
  - Challenge: both anomalous and normal performance behaviors are hard to identify in complex systems
  - Key contribution: scalable construction of performance deviation profiles

Summary
- Principled use of references in performance anomaly identification
  - Scalable construction of performance deviation profiles to identify anomaly symptoms
  - Target-reference differencing of system metric manifestations to help identify anomaly causes
- Identified real performance problems in Linux and in a J2EE-based distributed system