1
I/O System Performance Debugging Using Model-driven Anomaly Characterization
Kai Shen, Ming Zhong, Chuanpeng Li
Dept. of Computer Science, Univ. of Rochester
URCS Systems Group Meeting
2
Motivation
- Implementations of complex systems (e.g., operating systems) contain performance "problems": over-simplification, mishandling of special cases, and so on. These problems degrade system performance and make its behavior unpredictable.
- Such problems are hard to identify and understand in complex systems:
  - many system features and configuration settings
  - dynamic workload behaviors
  - problems manifest only under special conditions
- Goal: comprehensively identify performance problems over wide ranges of system configurations and workload conditions.
3
Bird’s Eye View of Our Approach
- Construct models to predict system performance:
  - "simple": model system components following their high-level design algorithms
  - "comprehensive": cover wide ranges of system configurations and workload conditions
- Model-driven anomaly characterization:
  - discover performance anomalies, i.e., discrepancies between model prediction and measured actual performance
  - characterize them and attribute them to possible causes
- What can you do with the anomaly characterizations?
  - make the system perform better and more predictably through debugging
  - identify problematic settings so they can be avoided
4
Operating System Support for Disk I/O-Bound Online Servers
- Server processing accesses large disk-resident data. Examples: Web servers serving large Web data, index searching, database-driven server systems.
- Complex workload characteristics affect performance.
- Operating system support:
  - I/O prefetching
  - disk I/O scheduling (elevator, anticipatory, ...)
  - file system layout and metadata management
  - memory caching
5
A “Simple” Yet “Comprehensive” Throughput Model
- Decompose a complex system into weakly coupled sub-components (layers).
- Each layer transforms the workload and alters the I/O throughput.
- Consider wide ranges of workloads and server concurrency levels (a minimal sketch of the layered view follows below).
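To make the layered decomposition concrete, here is a minimal Python sketch; the layer functions, dictionary keys, and seek/transfer constants are illustrative assumptions, not the paper's actual model:

```python
def prefetch_layer(workload):
    """Prefetching enlarges sequential reads up to the prefetch depth."""
    if workload["prefetch_enabled"] and workload["sequential"]:
        workload["request_kb"] = max(workload["request_kb"],
                                     workload["prefetch_depth_kb"])
    return workload

def disk_layer(workload):
    """Estimate per-request service time with a simple seek + transfer model."""
    seek_ms = 0.5 if workload["sequential"] else 8.0        # assumed averages
    workload["service_ms"] = seek_ms + workload["request_kb"] / 50.0  # ~50 KB/ms
    return workload

def predict_throughput(workload):
    """Compose the layers; throughput = bytes served / predicted service time."""
    for layer in (prefetch_layer, disk_layer):
        workload = layer(workload)
    return workload["request_kb"] / workload["service_ms"]  # KB/ms, roughly MB/s

print(predict_throughput({"sequential": True, "prefetch_enabled": True,
                          "prefetch_depth_kb": 128, "request_kb": 16}))
```

The point of the composition is that each layer only needs to know how it reshapes the request stream handed to it, which is what keeps the model "simple" while still spanning many configurations.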
6
Model-Driven Anomaly Characterization
- An OS implementation may deviate from the model prediction: over-simplification, mishandling of special cases, and so on.
- A "performance bug" may manifest only under specific system configurations or workload conditions.
7
Parameter Sampling
- We choose a set of system configurations and workload properties on which to check for performance anomalies.
- Sample parameters are drawn from a multi-dimensional parameter space (dimensions such as system configuration x, system configuration y, workload property z).
- If we choose samples randomly and independently, the chance of missing a bug decreases exponentially as the number of samples increases (see the bound below).
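A back-of-the-envelope justification (my addition, not from the slides): if a bug manifests on a fraction p of the parameter space, then n independent uniform samples all miss it with probability

```latex
\[
  \Pr[\text{miss}] \;=\; (1 - p)^n \;\le\; e^{-pn}
\]
% e.g., p = 0.05 and n = 400 samples give (0.95)^{400} \approx 1.2 \times 10^{-9}
```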
8
Sampling Parameter Space
- Workload properties:
  - server concurrency
  - I/O access pattern
  - application inter-I/O think time
- OS configurations:
  - prefetching: enabled (with a prefetching depth) / disabled
  - I/O scheduling: elevator or anticipatory
  - memory caching: enabled / disabled
(A sketch of a random sampler over this space follows below.)
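A hypothetical sketch of drawing one random sample from the space above; the dimension names and value ranges are assumptions for illustration, not the exact ranges behind the paper's 400-sample run:

```python
import random

def draw_sample():
    return {
        # workload properties
        "concurrency":   random.choice([1, 2, 4, 8, 16, 32, 64, 128, 256]),
        "access":        random.choice(["sequential", "random", "strided"]),
        "think_time_ms": random.uniform(0.0, 10.0),
        # OS configurations
        "prefetch_kb":   random.choice([0, 32, 64, 128, 256]),  # 0 = disabled
        "scheduler":     random.choice(["elevator", "anticipatory"]),
        "caching":       random.choice([True, False]),
    }

samples = [draw_sample() for _ in range(400)]  # 400 samples, as in the slides
```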
9
Anomaly Clustering
- Anomalous settings may be due to multiple causes (bugs):
  - it is hard to draw conclusions from the full set of anomalous settings
  - it is desirable to cluster anomalous settings into groups, each likely attributable to an individual cause
- Existing clustering algorithms (EM, K-means) do not handle cross-intersecting clusters.
- We perform hyper-rectangle clustering (a greedy sketch follows below).
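Here is a minimal, hedged Python sketch of greedy hyper-rectangle clustering; it is my simplification of the idea, not the paper's exact algorithm, and all names are invented. Points are tuples of numeric parameter values, and `anomalous`/`normal` are the labeled sample sets:

```python
def inside(box, p):
    return all(lo <= v <= hi for (lo, hi), v in zip(box, p))

def grow(box, p):
    return [(min(lo, v), max(hi, v)) for (lo, hi), v in zip(box, p)]

def purity(box, anomalous, normal):
    """Fraction of covered samples that are anomalous."""
    a = sum(inside(box, p) for p in anomalous)
    n = sum(inside(box, p) for p in normal)
    return a / (a + n)

def cluster(anomalous, normal, min_purity=0.9):
    remaining, boxes = list(anomalous), []
    while remaining:
        seed = remaining[0]
        box = [(v, v) for v in seed]      # degenerate box at the seed point
        # try to absorb nearby anomalies while the box stays mostly anomalous
        for p in sorted(remaining[1:],
                        key=lambda q: sum(abs(a - b) for a, b in zip(seed, q))):
            grown = grow(box, p)
            if purity(grown, anomalous, normal) >= min_purity:
                box = grown
        boxes.append(box)
        remaining = [p for p in remaining if not inside(box, p)]
    return boxes
```

Because boxes are grown independently, two clusters may overlap; that is exactly the cross-intersecting case that EM and K-means handle poorly.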
10
Anomaly Characterization
- It is hard to derive useful debugging information from a raw group of anomalous settings; succinct characterizations are desirable.
- Characterization is easy after hyper-rectangle clustering: simply project the hyper-rectangle onto each dimension (see the sketch below).
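Continuing the hypothetical sketch above, projecting a box onto its dimensions yields a one-line, human-readable characterization; the dimension names here are assumed for illustration:

```python
def characterize(box, names):
    return " AND ".join(f"{lo} <= {name} <= {hi}"
                        for name, (lo, hi) in zip(names, box))

# characterize([(128, 256), (256, 1024)], ["concurrency", "stream_kb"])
# -> "128 <= concurrency <= 256 AND 256 <= stream_kb <= 1024"
```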
11
Experimental Setup
- A micro-benchmark that can be configured to exhibit any desired workload pattern
- Linux kernel
- Parameter sampling (400 samples)
- Anomaly clustering and characterization (one characterization per possible bug)
- Human debugging (assisted by a kernel tracing tool)
12
Result – Top 50 Model/Measurement Errors out of 400 Samples
Error defined as the relative difference between measured throughput and model-predicted throughput.
[Figure: model/measurement error (0%–100%) for the 50 sample settings with the largest errors, ranked by error, for original Linux and after cumulative bug fixes #1; #1, #2; #1, #2, #3; and #1, #2, #3, #4.]
13
Result – Anomaly #1
- Workload property: concurrency 128 and above; stream length 256 KB and above
- System configuration: prefetching enabled
- The cause:
  - when the disk queue is "congested", prefetching is cancelled
  - however, prefetching sometimes includes synchronously requested data, which must then be resubmitted as single-page "makeup" I/O
- Solutions: either do not cancel a prefetch that includes synchronously requested data, or block reads when the disk queue is "congested" (a hedged sketch of this logic follows below)
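To make the cause concrete, here is a hedged Python pseudo-sketch of the decision described above; the function, field names, and fix are illustrative assumptions, and the real logic lives in the Linux kernel's readahead path:

```python
def pages_to_submit(prefetch_pages, queue_congested, fixed=False):
    if not queue_congested:
        return prefetch_pages                    # normal prefetch
    if fixed:
        # fix: never drop pages a thread is synchronously waiting for
        return [p for p in prefetch_pages if p["sync_waiter"]]
    return []   # buggy behavior: cancel the whole prefetch; the sync pages
                # come back later as single-page "makeup" I/O requests

pages = [{"page": i, "sync_waiter": i == 0} for i in range(32)]
print(len(pages_to_submit(pages, queue_congested=True)))              # 0
print(len(pages_to_submit(pages, queue_congested=True, fixed=True)))  # 1
```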
14
Result – Anomalies #2, #3, #4
- Anomaly #2: concerns the anticipatory I/O scheduler, which uses the average seek distance of past requests to estimate seek time
- Anomaly #3: concerns the elevator I/O scheduler, which always searches from block address 0 for the next request after a "reset"
- Anomaly #4: a large I/O operation is often split into small disk requests, yet the anticipation timer is started after the first disk request returns
15
Result – Overall Predictability
[Figure: I/O throughput (0–35 MB/s) over ranked sample parameter settings, comparing model prediction against measured performance, for original Linux and for Linux after the four bug fixes.]
16
Support for Real Applications
- Index searching from the Ask Jeeves search engine: search workload following a 2002 Ask Jeeves trace; anticipatory I/O scheduler.
- Apache Web server: media-clips workload following the IBM 1998 World Cup trace; elevator I/O scheduler.
[Figure: I/O throughput (MB/s) versus server concurrency (1–256) for both applications, comparing original Linux, bug fix #1, and fix combinations #1+#2 and #1+#2+#4 (index searching) and #1+#3 (Apache).]
17
Related Work
- I/O system performance modeling:
  - storage devices [Ruemmler & Wilkes 1994] [Kotz et al. 1994] [Worthington et al. 1994] [Shriver et al. 1998] [Uysal et al. 2001]
  - OS I/O subsystem [Cao et al. 1995] [Shenoy & Vin 1998] [Shriver et al. 1999]
- Performance debugging:
  - fine-grain system instrumentation & simulation [Goldberg & Hennessy 1993] [Rosenblum et al. 1997]
  - analyzing online traces [Chen et al. 2002] [Aguilera et al. 2003]
- Correctness (non-performance) debugging:
  - code analysis [Engler et al. 2001] [Li et al. 2004]
  - configuration debugging [Nagaraja et al. 2004] [Wang et al. 2004]
18
Summary
- Model-driven anomaly characterization: a systematic approach to assist performance debugging for complex systems over wide ranges of runtime conditions.
- For disk I/O-bound online servers, we discovered several performance bugs in the Linux kernel.
- A Linux kernel patch for bug fix #1 is available.