Reference-Driven Performance Anomaly Identification

1 Reference-Driven Performance Anomaly Identification
Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li
University of Rochester
SIGMETRICS 2009

2 Performance Anomalies
Complex software systems (such as operating systems and distributed systems) have:
- Many system features and configuration settings
- Wide-ranging workload behaviors and concurrency
- Interactions among all of the above
Performance anomalies:
- Low performance against expectation
- Caused by implementation errors, mis-configurations, mis-managed interactions, ...
- Anomalies degrade system performance and make system behavior undependable

3 An Example Identified by Our Research
The Linux anticipatory I/O scheduler sets its anticipation timeout as:

/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)

HZ is the number of timer ticks per second, so (HZ / 150) ticks is around 6.7 ms. However, the integer division is inaccurate:
- HZ defaulted to 1000 in earlier Linux versions, so the anticipation timeout is 6 ticks (6 ms).
- With the later default of HZ = 250, the timeout becomes a single tick (only 4 ms).
- Premature timeouts lead to additional disk seeks.
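To make the truncation concrete, here is a small standalone C program (a hypothetical demo, not kernel code) that evaluates the macro under both HZ defaults; it prints a 6 ms timeout for HZ=1000 but only 4 ms for HZ=250, well short of the intended ~6.7 ms:

#include <stdio.h>

/* Hypothetical standalone re-creation of the kernel macro,
   for illustration only */
#define ANTIC_EXPIRE(hz) (((hz) / 150) ? (hz) / 150 : 1)

int main(void) {
    int settings[] = { 1000, 250 };
    for (int i = 0; i < 2; i++) {
        int hz = settings[i];
        int ticks = ANTIC_EXPIRE(hz);
        /* one tick lasts 1000/HZ milliseconds */
        printf("HZ=%4d: timeout = %d tick(s) = %d ms\n",
               hz, ticks, ticks * 1000 / hz);
    }
    return 0;
}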

4 Challenges and Goals
Challenges:
- Often involves the semantics of multiple system components
- No obvious failure symptoms; normal performance isn't always known or even clearly defined
- Performance anomaly identifications are relatively rare: only 4% of resolved Linux 2.4/2.6 I/O bugs are performance-oriented
Goals:
- Systematic techniques to identify performance anomalies and improve performance dependability
- Consider wide-ranging configurations and workload conditions

5 Reference-Driven Anomaly Identification
Given two executions T (target) and R (reference): if T performs much worse than R against expectation, we identify T as anomalous relative to R.
How do we systematically derive the expectations?

6 Change Profiles
Goal: derive the expected performance deviation between reference and target (or across a change of system parameters).
Approach: inference from real system measurements.
A change profile is a probabilistic distribution of performance deviations. For example, p-value(-0.5) = 0.039 means a deviation of -0.5 (a 50% slowdown) or worse has probability 0.039 under the profile.
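A minimal sketch of how such a profile might be represented and queried; the deviation samples below are invented, and the paper's actual profiles are built from systematic measurements:

#include <stdio.h>

/* Sketch: an empirical change profile is a set of observed
   performance deviations (target vs. reference) collected from
   sampled execution pairs. The one-sided p-value of an observed
   deviation d is the probability mass at or below d. */
static double p_value(const double *dev, int n, double d) {
    int at_or_below = 0;
    for (int i = 0; i < n; i++)
        if (dev[i] <= d)
            at_or_below++;
    return (double)at_or_below / n;
}

int main(void) {
    /* deviation = (target_perf - reference_perf) / reference_perf */
    double profile[] = { -0.62, -0.48, -0.21, -0.10, -0.04,
                          0.00,  0.03,  0.07,  0.12,  0.25 };
    int n = sizeof profile / sizeof profile[0];
    printf("p-value(-0.5) = %.3f\n", p_value(profile, n, -0.5));
    return 0;
}

A target whose measured deviation yields a small p-value under the profile (the deck later uses 0.05 or less) is flagged as anomalous.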

7 Scalable Anomaly Quantification
Approach:
- Construct single-parameter profiles through real system measurements
- Analytically synthesize multiple single-parameter profiles for scalability
Convolution-like synthesis (assuming independent performance effects of different parameters): assemble the multi-parameter performance deviation distribution using convolutions of single-parameter change profiles, as sketched below.
Generally applicable bounding analysis: bound the multi-parameter anomaly p-value from single-parameter p-values (no need for parameter independence); find a tight bound (a small p-value) through a Monte Carlo method.
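A minimal sketch of the convolution step for two parameters, assuming their deviation effects are independent and compose additively (for instance, deviations kept on a log scale); the bin values are made up, not measured:

#include <stdio.h>

#define NBINS 5

int main(void) {
    /* single-parameter change profiles as probability mass over
       equally spaced deviation bins (illustrative values) */
    double p1[NBINS] = { 0.05, 0.15, 0.60, 0.15, 0.05 };
    double p2[NBINS] = { 0.10, 0.20, 0.40, 0.20, 0.10 };
    double joint[2 * NBINS - 1] = { 0.0 };

    /* discrete convolution: deviations add, probabilities multiply */
    for (int i = 0; i < NBINS; i++)
        for (int j = 0; j < NBINS; j++)
            joint[i + j] += p1[i] * p2[j];

    /* p-value of a large combined slowdown: mass in the lowest bins */
    double p = joint[0] + joint[1];
    printf("synthesized p-value (two lowest bins): %.4f\n", p);
    return 0;
}

When the independence assumption is doubtful, the bounding analysis above instead derives an upper bound on the multi-parameter p-value from the single-parameter p-values.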

8 Evaluation: Linux I/O Case Study
Setup:
- Five workload parameters and three system configuration parameters
- Performance measured over 300 sampled executions, which serve as references for one another to identify anomalies
- Anomalies are target executions with p-values of 0.05 or less
- Validated through cause analysis; an identification without a validated cause counts as a probable false positive
Results:
- Linux – 35 identified; 34 validated; 1 probable false positive
- Linux – 12 identified; 9 validated; 3 probable false positives
- Linux (target) vs. (reference) – 15 identified; all validated

9 Comparison
Approaches compared:
- Bounding analysis for multi-parameter anomaly quantification
- Convolution synthesis assuming parameter independence
- Ranking target-reference anomalies by raw performance difference
Convolution identifies more anomalies, but at the cost of more false positives.

10 Anomaly Cause Analysis
Given the symptom (anomalous performance degradation from reference to target), root cause analysis is still challenging:
- The root cause sometimes lies in complex component interactions
- The most useful hints often relate to low-level system activities
- Efficient mechanisms exist to acquire large numbers of system metrics (some anomaly-related), but they are difficult to sift through
Approach: reference-driven filtering of anomaly-related metrics. Compare the metric manifestations of an anomalous target and its normal reference; metrics that differ significantly may be anomaly-related (see the sketch below).
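A sketch of the filtering idea, with hypothetical metric names and values; a real implementation would compare whole metric distributions across many events rather than single summary numbers:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Sketch: rank each system metric by how much its manifestation in
   the anomalous target run differs from the normal reference run. */
typedef struct { const char *name; double target, reference; } metric_t;

static double rel_diff(double t, double r) {
    return fabs(t - r) / (fabs(r) > 1e-9 ? fabs(r) : 1e-9);
}

static int cmp(const void *a, const void *b) {
    const metric_t *ma = a, *mb = b;
    double da = rel_diff(ma->target, ma->reference);
    double db = rel_diff(mb->target, mb->reference);
    return (da < db) - (da > db);   /* larger differences first */
}

int main(void) {
    /* hypothetical per-run metric summaries */
    metric_t m[] = {
        { "anticipation_timeout_rate",  240.0,  18.0  },
        { "disk_seek_distance_avg",     5.1e6,  4.9e6 },
        { "syscall_enter_interarrival", 1.2e-3, 1.1e-3 },
    };
    int n = sizeof m / sizeof m[0];
    qsort(m, n, sizeof m[0], cmp);
    for (int i = 0; i < n; i++)
        printf("%-28s rel. diff = %.2f\n", m[i].name,
               rel_diff(m[i].target, m[i].reference));
    return 0;
}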

11 System Events and Metrics
Traced events:
- Process management: creation of a kernel thread; process fork or clone; process exit; process wait; process signal; wake up a process; CPU context switch
- System call: enter a system call; exit a system call
- Memory system: allocating pages; freeing pages; swapping pages in; swapping pages out
- File system: file exec; file open; file close; file read; file write; file seek; file ioctl; file prefetch operation; start waiting for a data buffer; end waiting for a data buffer
- I/O scheduling: I/O request arrival at the block level; re-queue an I/O request; dispatch an I/O request; remove an I/O request; I/O request completion
- SCSI device: SCSI read request; SCSI write request
- Interrupt: enter an interrupt handler; exit an interrupt handler
- Network socket: socket call; socket send; socket receive; socket creation
Derived system metrics (a computation sketch follows the list):
- Inter-arrival time of each type of event
- Delays between causal events: between a system call enter and its exit; between file system buffer wait start and end; between a block-level I/O request arrival and its dispatch; between a block-level I/O request dispatch and its completion
- Event parameters: file prefetch size; SCSI I/O request size; file offset of each I/O operation to a block device
- I/O concurrency: at the system call level, the block level, and the SCSI device level
Up to 1361 metrics in Linux in total.
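As an illustration, two of the derived metrics, per-event-type inter-arrival time and the delay between causal events, could be computed from a timestamped trace roughly as follows (all timestamps invented):

#include <stdio.h>

int main(void) {
    /* timestamps (seconds) of one event type, e.g. block-level
       I/O request arrivals (illustrative values) */
    double arrive[]   = { 0.010, 0.013, 0.021, 0.034 };
    /* matching dispatch times for the same requests */
    double dispatch[] = { 0.011, 0.015, 0.025, 0.036 };
    int n = sizeof arrive / sizeof arrive[0];

    double gap = 0.0, delay = 0.0;
    for (int i = 1; i < n; i++)
        gap += arrive[i] - arrive[i - 1];     /* inter-arrival time */
    for (int i = 0; i < n; i++)
        delay += dispatch[i] - arrive[i];     /* causal-event delay */

    printf("mean inter-arrival: %.4f s\n", gap / (n - 1));
    printf("mean arrival-to-dispatch delay: %.4f s\n", delay / n);
    return 0;
}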

12 A Case Result
The top-ranked metrics relate to anticipatory I/O timeouts and anticipation breaks.
Anomaly cause: an incorrect timeout setting when the number of timer ticks per second (HZ) changed from 1000 to 250 in Linux:

#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)

13 Effects of Anomaly Corrections
Anomaly corrections lead to predictable performance behavior patterns.

14 Related Work
Peer differencing for debugging:
- Delta debugging [Zeller'02]: differencing program runs over various inputs
- PeerPressure [Wang et al.'04]: differencing Windows registry settings
- Triage [Tucek et al.'07]: differencing basic-block execution frequencies
These target program/system failures, where failure symptoms are easily identifiable and correct peers are presumably known.
Our performance anomaly identification:
- Challenge: both anomalous and normal performance behaviors are hard to identify in complex systems
- Key contribution: scalable construction of performance deviation profiles

15 Summary
- Principled use of references in performance anomaly identification
- Scalable construction of performance deviation profiles to identify anomaly symptoms
- Target-reference differencing of system metric manifestations to help identify anomaly causes
- Identified real performance problems in Linux and a J2EE-based distributed system

