Published by Suzanna Lawrence. Modified over 9 years ago.
1
CSE 598B: Self-* Systems
Path-Based Failure and Evolution Management
Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer (UC Berkeley, Stanford U, Tellme Networks, eBay Inc.)
Presented by: Arjun R. Nath
2
The Problem...
- Computing systems are increasing in complexity.
- They tend towards large, distributed systems, sometimes involving thousands of machines.
- Basic system management is becoming increasingly difficult.
- Tasks ranging from detecting and diagnosing failures to understanding application behaviour are becoming very difficult.
3
...the Problem
- Existing techniques such as code-level debuggers, program slicing, process profiling and application logs fail to characterize overall system behaviour.
- Distributed debuggers are available but focus on a homogeneous subset of the system.
4
Goal of the paper
- Techniques to help us understand large distributed systems.
- Improve:
  - availability
  - reliability
  - manageability
- Why are we looking at this paper? (Self-* context)
  - This paper is about techniques for monitoring large, complex, distributed systems.
5
Two main principles
Path-Based Measurement:
- Model the system as a collection of paths through heterogeneous components.
- Make local observations along the paths and store them. The stored observations can be accessed via queries and visualization techniques. (The focus is on correctness rather than performance.)
Statistical Behaviour Analysis:
- Large volumes of system requests are stored for statistical analysis, using classical techniques to identify deviations from normal behaviour. This can be applied to live systems or used for offline analysis.
6
What is a "Path"?
- Associated with a request
- Control flow
- Resources
- Paths may have inter-path dependencies: shared state, shared database tables, shared filesystems, shared memory.
- Multiple paths may be grouped together in sessions.
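One way to picture a path as data is sketched below. This is only an illustration of the idea on this slide, not the representation used by any of the systems in the paper; the class and field names (`Observation`, `Path`, `latency_ms`, and so on) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One local observation made by a component (all fields are hypothetical)."""
    request_id: str      # identifier of the request this observation belongs to
    component: str       # component that made the observation
    timestamp: float     # when the request reached the component
    latency_ms: float    # time spent in the component (a resource measurement)
    ok: bool = True      # False if the component reported an error


@dataclass
class Path:
    """All observations recorded for a single request."""
    request_id: str
    observations: list = field(default_factory=list)

    def components(self):
        """Control flow: components in the order the request visited them."""
        ordered = sorted(self.observations, key=lambda o: o.timestamp)
        return [o.component for o in ordered]

    def total_latency_ms(self):
        """Resources: total time the request spent across components."""
        return sum(o.latency_ms for o in self.observations)

    def failed(self):
        """A path counts as failed if any observation reported an error."""
        return any(not o.ok for o in self.observations)
```

Sessions and inter-path dependencies are left out of this sketch; a session could be modelled as a list of `Path` objects that share a user or connection.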
7
Coarse grained paths
8
Fine grained paths
9
How do paths help?
- Failure Management
- Evolution (of the system)
10
Failure Management...
Detection:
- Reduce the downtime associated with detection delays.
- Paths can help in noticing developing problems before they become severe.
- The key is to define "normal" behaviour statistically and then check for deviations.
Diagnosis:
- Isolate problems using solely the recorded path observations, then drive the diagnosis process with the path information.
- Paths help identify which components are involved in a given failure and aid in identifying causes.
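The detection and diagnosis ideas above can be sketched in a few lines. This is a deliberately simple stand-in, not the statistical machinery the paper's tools use: "normal" is taken to be the mean latency plus `k` standard deviations of a baseline sample, and diagnosis ranks components by the share of paths through them that failed. The dictionary keys (`id`, `components`, `latency_ms`, `failed`) are assumptions made for the example.

```python
from statistics import mean, stdev


def latency_threshold(baseline_ms, k=3.0):
    """Upper bound of 'normal' latency: baseline mean plus k standard deviations."""
    return mean(baseline_ms) + k * stdev(baseline_ms)


def detect_deviations(paths, threshold_ms):
    """Return the ids of paths whose latency exceeds the threshold."""
    return [p["id"] for p in paths if p["latency_ms"] > threshold_ms]


def rank_components(paths):
    """Rank components by the fraction of paths through them that failed."""
    seen = {}
    failed = {}
    for p in paths:
        for c in set(p["components"]):
            seen[c] = seen.get(c, 0) + 1
            if p["failed"]:
                failed[c] = failed.get(c, 0) + 1
    rates = {c: failed.get(c, 0) / n for c, n in seen.items()}
    # Highest failure rate first; ties are broken alphabetically.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))
```

A component that appears only in failed paths rises to the top of the ranking, while components shared by failed and successful paths score lower. That is the intuition behind using paths to narrow down which components are involved in a failure.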
11
...Failure Management
Impact Analysis:
- Helps in knowing the scale of the problem, and so in estimating time-to-repair.
- Shows which other paths are at risk.
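As a rough illustration of impact analysis, the sketch below takes a suspect component and reports what share of recorded paths traverse it and which request types are therefore at risk. The `type` and `components` keys are hypothetical; the paper does not prescribe this layout.

```python
def impact(paths, component):
    """Return (share of paths through the component, sorted request types at risk)."""
    affected = [p for p in paths if component in p["components"]]
    share = len(affected) / len(paths) if paths else 0.0
    at_risk = sorted({p["type"] for p in affected})
    return share, at_risk
```

A large share suggests a wide outage; a small share touching a single request type suggests a contained one.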
12
Evolution (of the system)
- It's very difficult to get an overall picture of how a complex distributed system changes with time:
  - Software/hardware upgrades, patches, code changes, etc.
  - Systems evolve through changes to their components and also through changes in how they interact.
- Paths help in revealing system structure and dependencies, and in tracking changes.
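One way paths could reveal structural change is sketched below: derive the set of component-to-component calls from the paths recorded before and after an upgrade, then compare the two sets. The function names and the list-of-component-names input format are assumptions for illustration only.

```python
def edges(paths):
    """Collect every (caller, callee) pair seen in consecutive path steps."""
    found = set()
    for components in paths:
        found.update(zip(components, components[1:]))
    return found


def diff_structure(old_paths, new_paths):
    """Report dependencies that appeared or disappeared between two sets of paths."""
    old, new = edges(old_paths), edges(new_paths)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```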
13
Implementation
14
Implementation: Architecture
15
...Implementation...
Tracers: track a request through the target system.
- Each request has an associated identifier that is maintained throughout the path.
- IDs may be stored in extensible headers (HTTP, SOAP).
- Tracers are platform specific but can be generic to applications using the same platform (J2EE, .NET).
- Pinpoint, ObsLogs and SuperCal all have tracers.
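The identifier-in-a-header idea can be sketched as follows. This is a minimal illustration, not the tracer of Pinpoint, ObsLogs or SuperCal: the header name `X-Request-Id`, the helper names, and the plain-list `sink` standing in for the aggregator are all assumptions.

```python
import time
import uuid

HEADER = "X-Request-Id"  # assumed header name, chosen only for this example


def ensure_request_id(headers):
    """Reuse the id already carried by the request, or assign one at the entry point."""
    if HEADER not in headers:
        headers[HEADER] = uuid.uuid4().hex
    return headers[HEADER]


def trace(component, headers, sink):
    """Record that `component` handled the request; return headers to forward."""
    request_id = ensure_request_id(headers)
    sink.append({
        "request_id": request_id,
        "component": component,
        "timestamp": time.monotonic(),
    })
    return headers
```

Each component calls `trace` with the incoming headers and forwards the returned headers downstream, so every observation for one request carries the same identifier.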
16
…Implementation: tools.. Three systems that support path-based analysis
17
...Implementation
Aggregator and Repository:
- The Aggregator receives observations from the tracers,
- reconstructs paths using the IDs,
- and stores them in the Repository.
- There may also be a Central Repository that collects from distributed repositories.
Analysis Engines and Visualization:
- Single- and multi-path analysis
- Dedicated engines for various statistical tests
- Support for some data mining tools
- Visualization: Tukey's boxplots generated using Octave
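The aggregator's reconstruction step can be sketched as a group-by on the request identifier followed by an ordering on time. This assumes each observation is a dictionary with `request_id`, `component` and `timestamp` keys, which is a layout invented for the example rather than one taken from the paper.

```python
from collections import defaultdict


def reconstruct_paths(observations):
    """Group observations by request id and order each group by timestamp."""
    grouped = defaultdict(list)
    for obs in observations:
        grouped[obs["request_id"]].append(obs)
    return {
        request_id: [o["component"] for o in sorted(group, key=lambda o: o["timestamp"])]
        for request_id, group in grouped.items()
    }
```

Observations may arrive interleaved and out of order from different machines; sorting within each group is what restores the order in which the request visited the components, provided the timestamps are comparable across those machines.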
18
...Implementation
- A trend specific to recognition time in Tellme application A suggests a regression in a speech grammar in that application.
- The Tukey boxplots shown illustrate a distribution's center, spread, and asymmetries by using rectangles to show the upper and lower quartiles and the median, and by explicitly plotting each outlier.
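The numbers behind a Tukey boxplot can be computed as in the sketch below. The slides say the plots were generated with Octave; this Python version only computes the summary statistics and does not draw anything. The 1.5 x IQR fence is the conventional Tukey rule, and the choice of the "inclusive" quantile method is an assumption, since the slides do not say which quartile convention was used.

```python
from statistics import quantiles


def tukey_summary(values, fence=1.5):
    """Quartiles, median and outliers as used in a Tukey boxplot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}
```

Comparing these summaries across time windows or software versions is one way a shift such as the recognition-time trend mentioned above would show up.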
19
Limitations and constraints
- Cannot resolve fault causes at a very detailed level.
- Overheads can be high for fine-grained paths.
- Need to decide which observations to include in paths. This is an iterative process.
- Can be difficult to implement, especially for existing systems.
20
It's important to understand that path-based analysis is an aid to fault detection and recovery and not a solution in itself. It is meant to be used in combination with traditional fault handling techniques.
21
Conclusion
- As systems get more complex, path-based analysis tools will have increasing importance.
- Path-based fault analysis complements traditional techniques.
- Hardly any fully functional path-based fault management tools are available.
- This paper:
  - Has breadth but lacks depth in some places.
  - Needs more data from production environment experiments.
  - Should have concentrated on 1 or 2 implementations and included more details.
  - Gives little information on SuperCal and ObsLogs.
22
Other related stuff
- "Pinpoint" project at Stanford: http://swig.stanford.edu/pinpoint.shtml (some interesting papers here)
- Magpie project (Microsoft)
- Quest Software: JProbe, a Java performance profiler
- Borland's Optimizeit Enterprise Suite
23
That’s all folks. Thank you.