Published by Beryl Scott. Modified over 9 years ago.
1
CompSci 296.2 Self-Managing Systems
Shivnath Babu
2
Today
Wrap up sample projects
ROC discussion
3
Sample Projects
NIMO
Fa
Combining structured & unstructured data
Projects using Nagios
Projects using the IBM Autonomic Computing Toolkit
4
NIMO: NonInvasive Modeling for Optimization
Build performance models for scientific apps
  – Automatic, online, and noninvasive
Projects
  – Study many scientific apps (e.g., 140 bio apps in BioPortal)
      Characterize behavior, good models
  – “Steal app”, build and refine model
  – Incorporate NIMO in a “grid” scheduler (Condor, Globus)
  – Optimization problems in scheduling workflows
5
Fa
Testbed to study:
  – Whether we can automate problem prediction, diagnosis
  – Relationship among problems, causes, data, & models
Projects
  – Models for predicting performance problems (online)
  – Models and mechanisms for root-cause queries
  – Others
6
Structured and Unstructured Data
Combined querying/mining of structured and unstructured system data
  – Structured data: time series of CPU utilization
  – Unstructured data (free text): system error log
Ex: Characterize system state when a specific error occurs
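The example on this slide can be sketched in a few lines. The following is only a minimal illustration, assuming a hypothetical CPU time series and a hypothetical error log (all timestamps, values, and messages are made up); it collects the CPU readings taken shortly before each occurrence of a given error and compares them with the overall average.

```python
from statistics import mean

# Hypothetical structured data: (timestamp in seconds, CPU utilization in percent).
cpu_samples = [(0, 20.0), (10, 25.0), (20, 90.0), (30, 95.0),
               (40, 30.0), (50, 92.0), (60, 97.0)]

# Hypothetical unstructured data: (timestamp in seconds, free-text log line).
error_log = [
    (31, "ERROR: connection pool exhausted"),
    (45, "WARN: slow response"),
    (61, "ERROR: connection pool exhausted"),
]

def state_when(error_substring, samples, log, window=15):
    """Return CPU readings taken within `window` seconds before each matching log line."""
    readings = []
    for log_time, text in log:
        if error_substring in text:
            readings.extend(
                util for t, util in samples
                if log_time - window <= t <= log_time
            )
    return readings

during_error = state_when("connection pool exhausted", cpu_samples, error_log)
overall = [util for _, util in cpu_samples]
print(f"mean CPU near error: {mean(during_error):.1f}%")
print(f"mean CPU overall:    {mean(overall):.1f}%")
```

A substring match stands in for real text mining here; a project along these lines would presumably need something more robust for free-text logs.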
7
Add New Features to Current Systems
Add problem-prediction capability to Nagios
Add root-cause querying to Nagios
Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework
Remember the “mechanism projects”
  – Undo, virtualization, active probing
8
ROC: Recovery-Oriented Computing
Complaints about current systems
  – Focus only on performance
      Availability & maintainability are neglected
  – Focus on MTTF (mean time to failure) of individual components
      MTTR (mean time to repair) is neglected
  – MTTF of system << MTTF of individual components
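The last complaint can be made concrete with a small calculation. This is a sketch under stated assumptions, not something taken from the slides: components fail independently at constant rates, and the system fails as soon as any one component fails (a series system). Under those assumptions the failure rates add, so the system MTTF is the reciprocal of the summed component rates. The component count and MTTF values are hypothetical.

```python
def series_mttf(component_mttfs):
    """MTTF of a series system, assuming independent components with
    constant failure rates: the rates (1 / MTTF) add up."""
    total_rate = sum(1.0 / m for m in component_mttfs)
    return 1.0 / total_rate

# Hypothetical system: 1000 components, each with an MTTF of 100,000 hours.
components = [100_000.0] * 1000
system = series_mttf(components)
print(f"component MTTF: {components[0]:,.0f} hours")
print(f"system MTTF:    {system:,.0f} hours")
```

With these numbers each component lasts over a decade on average, yet the system as a whole fails roughly every four days.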
9
ROC Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”)
People/HW/SW failures are facts, not problems
Recovery/repair is how we cope with the above facts
ROC focus is on fast repair
  – Vs. the old focus on longer time between failures
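One way to see why fast repair deserves as much attention as longer time between failures is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below uses hypothetical numbers and shows that cutting MTTR by a factor of ten yields the same availability as raising MTTF by a factor of ten.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the long-run fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical baseline: a failure every 1000 hours, 10 hours to repair.
baseline = availability(1000, 10)
longer_mttf = availability(10_000, 10)  # 10x longer time between failures
faster_repair = availability(1000, 1)   # 10x faster repair

print(f"baseline:  {baseline:.6f}")
print(f"10x MTTF:  {longer_mttf:.6f}")
print(f"MTTR / 10: {faster_repair:.6f}")
```

Both improvements give the same ratio of MTTF to MTTR, which is all the formula depends on.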
10
ROC Principles
Recovery experiments: benchmarking recovery
Pinpoint: Automatic problem diagnosis
Recursive restart: Innovative use of reboot
App and system undo
Defense in depth: ROC at hardware level
11
Discussion
Strong point: Comprehensive, relates to other fields
Margin of safety for systems
  – Current examples?
  – How to incorporate?
Negative point: Evolution vs. revolution?
  – What approach is the project taking?
At what level should we support Undo?
  – Transaction, application, system
  – Pros and cons
Benchmarking availability/recovery (TOC?)
  – How can you claim that a system is 99.999% available?
Dealing with the automation irony
  – Fire drills
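To give the 99.999% question some scale, the sketch below converts an availability target into a yearly downtime budget and checks a set of outages against it. It assumes a 365-day year, and the outage durations are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

def downtime_budget_minutes(target_availability):
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - target_availability) * MINUTES_PER_YEAR

def measured_availability(outage_minutes, observed_minutes):
    """Availability over an observation period, given outage durations."""
    return 1.0 - sum(outage_minutes) / observed_minutes

budget = downtime_budget_minutes(0.99999)

# Hypothetical year with two short outages (durations in minutes).
outages = [2.0, 4.5]
observed = measured_availability(outages, MINUTES_PER_YEAR)

print(f"five-nines budget: {budget:.2f} minutes/year")
print(f"measured:          {observed:.6f}")
print(f"meets 99.999%:     {observed >= 0.99999}")
```

The budget comes to a little over five minutes per year, so two brief outages are already enough to miss the target. A claim of this kind therefore depends heavily on how long the system was observed and on how an outage is defined.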