Autonomic Systems Sukumar Ghosh Department of Computer Science The University of Iowa.



Preamble
Large distributed systems are witnessing explosive growth:
- Peer-to-peer networks
- Sensor networks
- 2G/3G/4G cellular networks
- Cloud computing infrastructure
- Grids
Also, the growth of the processor population has vastly outpaced the growth of the human population.

Examples
Skype is used by 200 million users worldwide. The scale, dynamism and uncertainty present significant reconfiguration and management challenges.

Examples
The Computing Grid (LCG) for the Large Hadron Collider at CERN will handle more than one petabyte of data every month. The data will be sent out to 140 different computer centers in 33 different countries for storage and analysis.

Examples
Autonomic virtual machine mapping in a data center. An autonomic controller dynamically manages the mapping of virtual machines onto physical hosts in accordance with policies specified by the user.
[Figure: a policy drives the mapping of virtual machines onto physical hosts]
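The controller described above can be sketched as a small control loop. This is only an illustration under stated assumptions: the policy is taken to be a single per-host load cap (`max_load`), placement uses first-fit decreasing, and the names `map_vms` and `control_step` are hypothetical, not taken from any real data center product.

```python
from typing import Dict, List


def map_vms(vm_load: Dict[str, int], hosts: List[str], max_load: int) -> Dict[str, str]:
    """First-fit decreasing placement under a per-host load cap (the assumed policy)."""
    used = {h: 0 for h in hosts}
    mapping: Dict[str, str] = {}
    # Place the heaviest virtual machines first.
    for vm, load in sorted(vm_load.items(), key=lambda kv: kv[1], reverse=True):
        for h in hosts:
            if used[h] + load <= max_load:
                used[h] += load
                mapping[vm] = h
                break
        else:
            raise RuntimeError(f"policy cannot be met for {vm}")
    return mapping


def host_loads(vm_load: Dict[str, int], hosts: List[str], mapping: Dict[str, str]) -> Dict[str, int]:
    """Total load currently placed on each host."""
    used = {h: 0 for h in hosts}
    for vm, h in mapping.items():
        used[h] += vm_load[vm]
    return used


def control_step(vm_load: Dict[str, int], hosts: List[str],
                 mapping: Dict[str, str], max_load: int) -> Dict[str, str]:
    """Keep the current mapping while it satisfies the policy; otherwise remap."""
    if set(mapping) == set(vm_load):
        if all(u <= max_load for u in host_loads(vm_load, hosts, mapping).values()):
            return mapping
    return map_vms(vm_load, hosts, max_load)
```

Calling `control_step` periodically with fresh load measurements gives the "dynamic" behavior: nothing moves while the policy holds, and a violation triggers a new placement. A real controller would also weigh migration cost, which this sketch ignores.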

The problem
Who will manage these networks?
Management includes:
- Fault handling
- System reconfiguration on demand
- Adapting to environmental changes
Employing people for everything is unrealistic:
- Slow and error-prone
- Not enough bodies in the IT workforce
- Not profitable from a business perspective

The preferred solution
Large systems have to manage themselves. Otherwise they are not practical or profitable.
Self-management is much more than the traditional perception of fault tolerance. Changes in environment, user demands, and security breaches are no longer catastrophic exceptions but expected events, and they add to the adversarial scenario.
Everything is dynamic, and changes need to be dealt with on the fly.

Types of triggers
- Failure: crash, transient, Byzantine, security, etc.
- Environment changes: processes join or leave, user demands change
Let F denote a trigger.

Types of remedies
P = predicate reflecting “desirable” configurations
Q = the weakest predicate generated by F, so P ⊆ Q
- Masking: P = Q
- Non-masking: P ⊂ Q; F takes the system from P into Q, and the system recovers from Q back to P
[Arora and Gouda 1993]

Autonomic systems
Dictionary meaning of autonomic (au·to·nom·ic):
1. controlled by automatic responses: describes functions of the nervous system not under voluntary control, e.g. the regulation of heartbeat or gland secretions
2. without thought: describes an action or response that occurs without conscious control
Stresses the philosophy of self-management.
Can computing systems behave in a similar manner?

A bit of history
Fault-tolerant computing system design started with space expeditions in the 1960s (the Self-Testing And Repairing computer for the Voyager mission; see the STAR paper by Avizienis, 1971).
The autonomic computing initiative was started by IBM in 2001 to reduce the barrier that complexity poses to the further growth of systems.
Related paradigms:
- Organic computing
- Evolutionary computing
- Amorphous computing
Autonomic communication stresses only the networking aspects of autonomic computing.
The living cell is as complex as any man-made computer. Yet the living cell is not algorithmically controlled in any practical sense: it is not digital or deterministic. See

Self-star properties
- Self-management
- Self-healing
- Self-organizing
- Self-optimizing
- Self-protecting
- Self-
These (and similar self-) properties are collectively called self-* properties, and they characterize an autonomic system.

Self-stabilization
Somehow, the autonomic systems community forgot to include self-stabilization (which dates back to 1974) in its wish list of self-star properties.
Self-stabilizing systems are capable of eventual recovery to a legal configuration from arbitrary initial configurations. Such systems are suitable for ad hoc deployment: they tolerate arbitrary transient failures that can corrupt their data state, as long as the code remains unchanged.

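Self-stabilization
[Figure: any transient fault takes the system from a legal configuration to a faulty configuration; recovery returns it to a legal configuration; with no fault it stays in a legal configuration]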
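As a concrete instance of recovery from an arbitrary configuration, the following sketch simulates Dijkstra's K-state token ring, the 1974 protocol usually cited as the origin of self-stabilization. The slides do not present this protocol; it is brought in here as an illustration. The choice K = n + 1, the randomly chosen central scheduler, and the function names are assumptions of the sketch. A configuration is taken to be legal when exactly one machine is privileged.

```python
import random


def privileged(states, i):
    """True if machine i holds a privilege (token) in Dijkstra's K-state ring."""
    if i == 0:
        return states[0] == states[-1]
    return states[i] != states[i - 1]


def step(states, i, k):
    """Execute the move of privileged machine i."""
    if i == 0:
        states[0] = (states[0] + 1) % k   # the distinguished machine increments
    else:
        states[i] = states[i - 1]         # every other machine copies its predecessor


def enabled(states):
    """Indices of all machines that currently hold a privilege."""
    return [i for i in range(len(states)) if privileged(states, i)]


def is_legal(states):
    """Legal configuration: exactly one privilege in the ring."""
    return len(enabled(states)) == 1


def stabilize(states, k, rng, max_steps=100_000):
    """Let a central scheduler fire one privileged machine at a time until legal."""
    steps = 0
    while not is_legal(states):
        step(states, rng.choice(enabled(states)), k)
        steps += 1
        if steps > max_steps:
            raise RuntimeError("did not stabilize within max_steps")
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    n = 8
    ring = [rng.randrange(n + 1) for _ in range(n)]   # arbitrary (possibly corrupted) state
    print("moves until legal:", stabilize(ring, n + 1, rng))
```

The starting state plays the role of the faulty configuration: no reset or external repair is invoked, and the same code that runs in normal operation drives the ring back to a single circulating privilege.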

Self-organization
The ability to react fast to topology changes and restore the system to a legal configuration. Self-organizing systems efficiently handle join and leave operations of processes.
[Figure: each join / leave (p) is followed by a period of self-organization in progress; f_p is a local aggregate function for the neighborhood of p]

Self-organization: before
Node 25 contacts node 119 to join the system.
[Figure labels: succ(119), pre(119)]

Self-organization: after
Time complexity of join is O(N). Too large!
To qualify as “self-organizing”, a join or leave should be completed in sublinear time (Dolev 2007).

Self-organization in Chord: before
The joining node contacts node 119 to join the system.

Self-organization in Chord: after
Time complexity of join is O(log N). It is self-organizing.
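The gap between O(N) and O(log N) can be seen by counting the hops a joining node needs to locate its position, which is the lookup step of a join. The sketch below compares a ring that only has successor pointers with Chord-style finger tables. It is a simplified simulation, not the Chord implementation: the 16-bit identifier space, the globally computed finger tables, and all function names are assumptions made for illustration.

```python
import bisect

M = 16               # identifier bits (assumed for this sketch)
SPACE = 1 << M


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def successor_of(ids, point):
    """First node at or clockwise after `point`; `ids` must be sorted."""
    i = bisect.bisect_left(ids, point % SPACE)
    return ids[i % len(ids)]


def build_fingers(ids):
    """Finger k of node n is the first node at or after n + 2^k."""
    return {n: [successor_of(ids, n + (1 << k)) for k in range(M)] for n in ids}


def lookup_ring(ids, start, key):
    """Find the node responsible for `key` by following successor pointers only."""
    node, hops = start, 0
    while True:
        succ = successor_of(ids, node + 1)
        if in_half_open(key, node, succ):
            return succ, hops
        node, hops = succ, hops + 1


def lookup_chord(fingers, start, key):
    """Find the node responsible for `key` via the closest preceding finger."""
    node, hops = start, 0
    while not in_half_open(key, node, fingers[node][0]):
        # The successor always qualifies here, so the search below cannot fail.
        node = next(f for f in reversed(fingers[node]) if in_open(f, node, key))
        hops += 1
    return fingers[node][0], hops
```

Both functions return the same responsible node; only the hop counts differ. With finger tables the finger index used shrinks at every hop, so a lookup never takes more than `M` hops, while the successor-only walk can visit almost every node.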

Self-organization vs Self-stabilization
[Figure labels: self-stabilizing systems, self-organizing systems]

Self-organization vs Self-stabilization
[Figure: after a fault, the system is self-organizing but not self-stabilizing to the legal configuration (“single ring”)]

Self-optimization
Processes collectively try to maximize or minimize a cost metric related to the system configuration.
Example: minimum spanning tree construction.
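To make the cost metric concrete, the sketch below computes a minimum spanning tree with Kruskal's algorithm. It is a centralized reference computation, not a distributed or self-optimizing protocol; it shows the target configuration that such a protocol would be expected to converge to. The function name and the edge format `(weight, u, v)` are assumptions of the sketch.

```python
def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm.

    n: number of nodes, labelled 0 .. n-1
    edges: iterable of (weight, u, v) tuples for undirected links
    Returns (total_weight, tree_edges) for a connected graph.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, tree = 0, []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:              # the edge joins two components, so keep it
            parent[ru] = rv
            total += w
            tree.append((u, v, w))
    return total, tree
```

The total weight returned is the quantity the processes would collectively minimize.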

Self-optimization
The perception of the cost may be global or individual. In traditional solutions, all processes cooperate. When processes are selfish, the perception of the cost is individual. Game theory is rich in dealing with such issues.

Network Creation Game
- N nodes, each represented by a vertex; node i can buy (undirected) links to a set of other nodes (s_i).
- One agent buys a link, but anyone can use it.
Cost to node i: c(i) = α·|s_i| + Σ_j d(i, j)
- Pay $α for each link you buy.
- Pay $1 for every hop to every node; d(i, j) is the distance from i to j.
(Fabrikant et al., PODC 2003)

Example
(Convention: arrow from the node buying the link)
[Figure: example networks with node costs c(i) = α + 13 and c(i) = 2α + 9]

Some questions
Will the system of processes reach a Nash equilibrium?
If so, what is the relationship between the equilibrium topology and α?
Fabrikant et al. (PODC 2003) discuss some cases and make some conjectures.
Moscibroda, Schmid and Wattenhofer (PODC 2006) showed examples where the system may never reach an equilibrium.
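For very small networks these questions can be explored by brute force. The sketch below computes the cost of a node in the network creation game (α per link bought plus the hop distance to every other node) and checks whether any single node can lower its own cost by buying a different set of links. The graph encoding, the function names, and the infinite cost assigned to a disconnected node are assumptions of the sketch; the search is exponential in the number of nodes, so it is only meant for toy instances.

```python
from collections import deque
from itertools import combinations


def distances_from(i, n, strategies):
    """BFS hop counts from node i over the undirected graph of all bought links."""
    adj = {v: set() for v in range(n)}
    for buyer, bought in strategies.items():
        for other in bought:
            adj[buyer].add(other)
            adj[other].add(buyer)      # anyone can use a link, whoever paid for it
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def cost(i, n, strategies, alpha):
    """alpha per link bought by i, plus the hop distance from i to every node."""
    dist = distances_from(i, n, strategies)
    if len(dist) < n:
        return float("inf")            # assumption: an unreachable node costs infinity
    return alpha * len(strategies[i]) + sum(dist.values())


def is_nash_equilibrium(n, strategies, alpha):
    """True if no single node can lower its own cost by changing the links it buys."""
    for i in range(n):
        current = cost(i, n, strategies, alpha)
        others = [v for v in range(n) if v != i]
        for k in range(len(others) + 1):
            for choice in combinations(others, k):
                trial = dict(strategies)
                trial[i] = set(choice)
                if cost(i, n, trial, alpha) < current - 1e-9:
                    return False
    return True


# A 4-node star in which the center (node 0) buys every link.
star = {0: {1, 2, 3}, 1: set(), 2: set(), 3: set()}
```

With `alpha = 2` no node in `star` gains by deviating. With `alpha = 0.5` a leaf pays less overall by buying a direct link to another leaf, so the star is no longer an equilibrium. This shows how the equilibrium topology depends on α.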

No equilibrium
The shortest path tree computation by the three nodes has no equilibrium configuration. The edge costs shown are for (black, white, grey).

No equilibrium
Max flow tree: each node tries to push the maximum flow to the root r.
[Figure: edge labels for (white, black): (9,7), (7,0), (6,7), (9,0), (6,9), (9,1), (7,9)]

Research questions
- What are the necessary conditions for the existence of such non-equilibrium configurations?
- What are the sufficient conditions?
- Are such conditions locally detectable?

Research issues
- Algorithms for implementing self-* properties relevant to specific systems or applications (algorithmic research: what is possible, what is impossible, bounds, complexity, etc.)
- New types of properties that may be meaningful (can a system learn from its failure history and be smarter? How can a system gracefully degrade?)
- New approaches to solving problems (can we reverse engineer some natural phenomenon to implement some of the self-* properties?)

Sample research problems
N processes in a P2P network. Each process j has a preferred set of peers nbr(j), but a degree δ << |nbr(j)| << N.
How will each process choose its neighbors so that the total communication cost (number of hops) to its preferred set of peers is minimum?

Sample research problem (handling churn in a P2P network)
Nodes join and leave at a high rate R per unit time.
How can we devise an efficient replication mechanism so that (1) at least one copy of each object always exists, and (2) it is accessible to all peers?

Self-healing
As it stands now, it seems to be as generic as the term “fault-tolerance.”
No clear definition has emerged, but mostly local recovery from “minor failures” (not necessarily limited to join or leave) is implied.
Some allow graceful degradation after healing.

Graceful degradation
[Figure: P, Q, and a degraded configuration P’]
P, Q are predicates on the global states.
Other interpretations are possible too.

Self-healing
On August 15, 2007, Skype was down for 48 hours.
Skype designers claimed that Skype was self-healing. So, what went wrong?
The company described it as a “failure in their self-healing mechanism.”
Villu Arak. What happened on August 16,

Example of self-healing
The system monitors the failure of components and proactively protects itself from major failures.
Example: fine-grained component-level restarts (micro-reboots) help increase availability (Candea, Cutler, Fox, 2004).

Micro-reboot in Mercury OS
- Failure monitor (M): continuously performs liveness checks and tells R of any failure.
- Recovery module (R): uses the reboot tree to decide which component must be rebooted. Prevents infinite reboots.
(Mercury OS: Candea, Cutler, Fox, 2004)

The Reboot Tree
- Reboot the failed component.
- If that doesn't work, move to the parent.
- Repeat until the entire system is rebooted.
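The escalation policy above can be sketched in a few lines. This is an illustration under stated assumptions, not the Mercury recovery module: the tree is a plain child-to-parent dictionary, the component names are hypothetical, and `reboot` is a caller-supplied function that reboots a subtree and reports whether the failure was cured. Bounding the attempts by the depth of the tree is one simple way to avoid rebooting forever.

```python
from typing import Callable, Dict, List, Optional, Tuple


def recover(failed: str,
            parent: Dict[str, Optional[str]],
            reboot: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Reboot the failed component, escalating to its ancestors until cured.

    Returns the list of reboot attempts (in order) and whether one of them
    cured the failure. The loop ends at the root, so attempts are bounded
    by the depth of the reboot tree.
    """
    attempts: List[str] = []
    node: Optional[str] = failed
    while node is not None:
        attempts.append(node)
        if reboot(node):              # reboot this subtree; True means cured
            return attempts, True
        node = parent[node]           # doesn't work: move to the parent
    return attempts, False            # even a full system reboot did not help


# Hypothetical reboot tree: system -> comms -> {radio, tracker}
REBOOT_TREE: Dict[str, Optional[str]] = {
    "system": None,
    "comms": "system",
    "radio": "comms",
    "tracker": "comms",
}
```

For example, if a fault reported in `radio` is only cured by restarting `comms`, then `recover("radio", REBOOT_TREE, reboot)` tries `radio` first and stops after `comms`, leaving `tracker`'s siblings outside that subtree untouched.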

Self-healing with learning
Refinement: the system gradually learns about failures while it is running, predicts or anticipates failures, and eventually protects itself proactively. Thus the system “gets better with time.”
It drops its protective gear when there is no failure. (By profiling failures at run time, the system potentially lowers the overhead of healing when there is no failure.)

Self-protection
Mainly refers to protection from external threats. The remedy depends on the actual system and the nature of the threats. Identity theft, viruses, and hacking are common threats for IT installations, but the threats may be different in a sensor network.
The system should successfully recognize such threats and defend itself using local knowledge.

Self-protection
Biology and nature provide helpful hints. For example, systems with diversity, modularity and redundancy are less susceptible to failure from external attacks.
[Figure labels: linux, windows, xyz]

New challenges: cyber-physical systems
Deal with the interaction between distributed computing and physical processes.
Examples: UAVs, collision avoidance systems, cooperating mobile robots.
Such systems must continuously self-organize, adapt to changes, and guarantee real-time response, safety, etc.

Conclusions
Many other self-* properties are possible:
- Self-aware (learning about one's own behavior)
- Self-scaling
- Self-configuring
- Self-repairing
The definitions need to be cleaned up.

Conclusions
[Figure labels: autonomic systems algorithms; biology & nature; control theory; ?; ?]

Robot swarm
- EU-funded I-SWARM project (University of Karlsruhe)
- Spy fly project at Harvard