Experience with Using a Performance Predictor During Development: A Distributed Storage System Tale
Lauro Beltrão Costa *, João Brunet +, Lile Hattori #, Matei Ripeanu *
* NetSysLab/ECE, UBC (University of British Columbia)
+ DSC, UFCG (Federal University of Campina Grande)
# Microsoft Corp.
How Is It (Typically) Done?
Profilers monitor behaviour; the code regions they pinpoint as taking too long receive attention
Developers decide when they have reached "good-enough" efficiency
High performance must be reached while keeping resource cost low
An Example
[Figure: application time (seconds, log scale) on a 20-node cluster, trading more storage nodes against more application nodes]
The target performance is not obvious
There is wide performance variation among configurations
Experience with using a performance predictor during the software development process:
What are the limitations and challenges of using a performance predictor as part of the development process?
Context: A Distributed Storage System
MosaStore, a distributed storage system
One manager, several clients, several storage servers
Approximately 11,000 lines of code; around 15 developers involved over time
Code & papers at: MosaStore.net
Sources of Complexity
Multiple components with complex interactions
Complex data and control paths
Contention (network, component level)
Variability in the environment
Deployment choices (configuration, provisioning)
Performance Predictor
Supporting Storage Configuration for I/O Intensive Workflows, L. B. Costa, S. Al-Kiswany, H. Yang, M. Ripeanu, ICS '14
Energy Prediction for I/O Intensive Workflow Applications, H. Yang, L. B. Costa, M. Ripeanu, MTAGS '14
Development Flow
Performance Anomalies
Case 1: Lack of Randomness
Case 2: Lock Overhead
Case 3: Connection Timeout
[Figure: benchmark time (seconds), actual vs. predicted: large mismatch]
Case 3: Connection Timeout
Context: A client tries to establish a TCP connection
Problem: Too many clients try to connect at once; SYN packets are dropped; the OS waits 3 seconds before retrying
Detection: The developers logged and verified the service time of each component
Fix: A different implementation allowing a custom timeout
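The fix above can be sketched as follows. This is a minimal illustration, not MosaStore's actual code (which is not shown in the talk): instead of letting a dropped SYN stall the client for the OS's ~3-second retransmission backoff, the client connects with a short application-level timeout and retries. The function name and parameter values are hypothetical.

```python
import socket

def connect_with_timeout(host, port, timeout=0.2, retries=5):
    """Retry TCP connects with a short application-level timeout
    instead of waiting out the OS's ~3 s SYN-retransmission backoff."""
    for _ in range(retries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)          # custom timeout, not the OS default
        try:
            s.connect((host, port))
            s.settimeout(None)         # back to blocking mode for normal I/O
            return s
        except OSError:                # covers timeout and refused connects
            s.close()                  # SYN likely dropped; retry quickly
    raise ConnectionError(f"could not connect to {host}:{port}")
```

Under heavy connection storms this trades a few extra SYNs for a bounded, predictable connect latency, which is what brought the benchmark back in line with the prediction.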
Case 3: Impact
[Figure: benchmark time (seconds) before and after the fix]
Use of the predictor made the performance improvements possible
Some Other Cases
Pipeline and Reduce patterns
Up to 30% performance improvement
Up to 10x smaller variance
[Figure: benchmark time (seconds)]
Limitations and Challenges
1. Obtaining accurate predictions
A well-known challenge in the area
2. Using the predictor during development
Lack of interest after the initial improvements
There is still a decision related to overhead: it takes too long
Benefits of Integrating a Performance Predictor
Brings confidence in the performance results obtained
Successful in pointing out scenarios that needed improvement
Supports the improvement effort
Code & papers at: NetSysLab.ece.ubc.ca
Concluding Remarks
Every tool reflects a trade-off between the cost and the benefit of employing it
Our study provides information to support these decisions
The predictor helps with this non-functional requirement: up to 30% improvement, 10x less variability
The predicted target performance is still not perfect: it offers guidance, not a perfect final target
Backup Slides
Debugging Support
Case 1: Lack of Randomness
Case 2: Lock Overhead
Synthetic Benchmarks
Storage System Model
MosaStore Deployment
MosaStore Execution Path
Synthetic Benchmarks
Common patterns in the structure of workflows
I/O-only, to stress the storage system
Debugging Support
The granularity of the predictor is per component (storage, client, manager)
Developers turn on a logging option that measures the time from the reception of a request until its response
Once the buggy component and request are spotted, regular debugging starts
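The logging option described above can be sketched as a per-component wrapper that times each request from reception to response. This is a hedged illustration only; the decorator, component names, and handler are hypothetical and not taken from MosaStore.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("service-time")

def log_service_time(component):
    """Log the time from the reception of a request until its response,
    tagged with the component name, so per-component service times can
    be compared against the predictor's expectations."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(request, *args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(request, *args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                log.info("%s handled %r in %.6f s", component, request, elapsed)
        return wrapper
    return decorator

@log_service_time("manager")          # hypothetical manager-side handler
def handle_lookup(request):
    return f"metadata-for-{request}"
```

A component whose logged service times diverge from the predicted ones is the one to debug further with regular tools.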
Case 1: Lack of Randomness
Context: The client obtains the list of storage nodes from the manager
Problem: The manager used the same seed, so the list of storage nodes was not shuffled; clients accessed storage nodes in the same order; some nodes became hot-spots while others sat idle
Detection: The developers logged and verified the service time of each component
Fix: Change the algorithm that shuffles the list of storage nodes to use a different seed every time it is invoked
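The bug and its fix can be sketched as follows. This is an illustrative reconstruction, not MosaStore's code: the function names and node list are made up, and the constant seed 42 merely stands in for "the same seed every time".

```python
import random

STORAGE_NODES = ["node-a", "node-b", "node-c", "node-d"]

def allocation_buggy(nodes):
    # Re-seeding with a constant reproduces the bug: every client
    # receives the list in the same "shuffled" order, so the first
    # nodes in the order become hot-spots and the rest sit idle.
    rng = random.Random(42)
    order = list(nodes)
    rng.shuffle(order)
    return order

def allocation_fixed(nodes):
    # The fix: draw a fresh seed (here: OS entropy, the default) on
    # every invocation, so successive clients see different orders
    # and the load spreads across the storage nodes.
    rng = random.Random()
    order = list(nodes)
    rng.shuffle(order)
    return order
```

With the buggy version, two calls always return the identical order; with the fix, each call returns an independent permutation of the same nodes.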
Case 2: Lock Overhead
Context: Clients access the manager for a file's metadata
Problem: Too many clients were accessing the metadata, and a lock was held over large portions of the code
Detection: The developers logged and verified the service time of each component
Fix: Reduce the lock scope
Storage System Model
[Diagram: queueing model linking an application driver and scheduler to client, manager, and storage network services over a network core, each with in, out, and service queues]
Properties: general, uniform, coarse
MosaStore Deployment
[Diagram: compute nodes each run an application task over local storage, aggregated into a shared workflow-optimized storage; a workflow runtime engine stages data in/out of a backend filesystem (e.g., GPFS, NFS); storage hints (e.g., location information) and application hints (e.g., indicating access patterns) flow through the POSIX API]
MosaStore Execution Path