1 A Programming model for failure- prone, Collaborative robots Nels Eric Beckman Jonathan Aldrich School of Computer Science Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
Dr. Kalpakis CMSC 621, Advanced Operating Systems. Fall 2003 URL: Distributed System Architectures.
Advertisements

Replication. Topics r Why Replication? r System Model r Consistency Models r One approach to consistency management and dealing with failures.
RPC Robert Grimm New York University Remote Procedure Calls.
Department of Computer Science and Engineering University of Washington Brian N. Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gun Sirer, Marc E. Fiuczynski,
Distributed System Structures Network Operating Systems –provide an environment where users can access remote resources through remote login or file transfer.
Remote Procedure Call Design issues Implementation RPC programming
1 Fault-Tolerance Techniques for Mobile Agent Systems Prepared by: Wong Tsz Yeung Date: 11/5/2001.
Ivan Lanese Computer Science Department University of Bologna/INRIA Italy Fault in the Future Joint work with Gianluigi Zavattaro and Einar Broch Johnsen.
COS 461 Fall 1997 Where We Are u so far: networking u rest of semester: distributed computing –how to use networks to do interesting things –connection.
Consensus Algorithms Willem Visser RW334. Why do we need consensus? Distributed Databases – Need to know others committed/aborted a transaction to avoid.
Tutorials 2 A programmer can use two approaches when designing a distributed application. Describe what are they? Communication-Oriented Design Begin with.
1 Failure Handling in a modal Language Nels Eric Beckman Research Talk Institute for Software Research October 30, 2006.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
USER LEVEL INTERPROCESS COMMUNICATION FOR SHARED MEMORY MULTIPROCESSORS Presented by Elakkiya Pandian CS 533 OPERATING SYSTEMS – SPRING 2011 Brian N. Bershad.
Research at Intel Distributed Localization of Modular Robot Ensembles Robotics: Science and Systems 25 June 2008 Stanislav Funiak, Michael Ashley-Rollman.
Communication in Distributed Systems –Part 2
Implementing Remote Procedure Calls an introduction to the fundamentals of RPCs, made during the advent of the technology. what is an RPC? what different.
Ivan Lanese Computer Science Department University of Bologna/INRIA Italy Fault in the Future Joint work with Gianluigi Zavattaro and Einar Broch Johnsen.
16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.
Remote Procedure Calls. 2 Client/Server Paradigm Common model for structuring distributed computations A server is a program (or collection of programs)
DISTRIBUTED PROCESS IMPLEMENTAION BHAVIN KANSARA.
4/25/ Application Server Issues for the Project CSEP 545 Transaction Processing for E-Commerce Philip A. Bernstein Copyright ©2003 Philip A. Bernstein.
Bootstrap and Autoconfiguration (DHCP)
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
1 Chapter Eight Exception Handling. 2 Objectives Learn about exceptions and the Exception class How to purposely generate a SystemException Learn about.
Benefits of PL/SQL. 2 home back first prev next last What Will I Learn? In this lesson, you will learn to: –List and explain the benefits of PL/SQL –List.
Networked File System CS Introduction to Operating Systems.
CMSC 202 Exceptions. Aug 7, Error Handling In the ideal world, all errors would occur when your code is compiled. That won’t happen. Errors which.
CS 390- Unix Programming Environment CS 390 Unix Programming Environment Topics to be covered: Distributed Computing Fundamentals.
DCE (distributed computing environment) DCE (distributed computing environment)
Remote Procedure Calls Adam Smith, Rodrigo Groppa, and Peter Tonner.
Implementing Remote Procedure Calls Authored by Andrew D. Birrell and Bruce Jay Nelson Xerox Palo Alto Research Center Presented by Lars Larsson.
(Business) Process Centric Exchanges
Distributed Transactions Chapter 13
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Practical Byzantine Fault Tolerance
 Remote Procedure Call (RPC) is a high-level model for client-sever communication.  It provides the programmers with a familiar mechanism for building.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved RPC Tanenbaum.
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
“Dynamic fault handling mechanisms for service-oriented applications” Fabrizio Montesi, Claudio Guidi, Ivan Lanese and Gianluigi Zavattaro Department of.
Distributed System Services Fall 2008 Siva Josyula
CSIT 220 (Blum)1 Remote Procedure Calls Based on Chapter 38 in Computer Networks and Internets, Comer.
Replication (1). Topics r Why Replication? r System Model r Consistency Models r One approach to consistency management and dealing with failures.
1 Joint work with Claudio Antares Mezzina and Jean-Bernard Stefani Controlled Reversibility and Compensations Ivan Lanese Focus research group Computer.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Energy Efficient Data Management for Wireless Sensor Networks with Data Sink Failure Hyunyoung Lee, Kyoungsook Lee, Lan Lin and Andreas Klappenecker †
Source Level Debugging of Parallel Programs Roland Wismüller LRR-TUM, TU München Germany.
Implementing Remote Procedure Calls Andrew D. Birrell and Bruce Jay Nelson Xerox Palo Alto Research Center Published: ACM Transactions on Computer Systems,
Concurrent Object-Oriented Programming Languages Chris Tomlinson Mark Scheevel.
Scalable Shape Sculpting via Hole Motion: Motion Planning for Claytronics Michael De Rosa SSS/Speaking Skills Talk 11/4/2005.
Distributed Computing & Embedded Systems Chapter 4: Remote Method Invocation Dr. Umair Ali Khan.
Chapter 29: Program Security Dr. Wayne Summers Department of Computer Science Columbus State University
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
The Structuring of Systems Using Upcalls By David D. Clark Presented by Samuel Moffatt.
Topic 4: Distributed Objects Dr. Ayman Srour Faculty of Applied Engineering and Urban Planning University of Palestine.
Java Programming Language
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Exception Handling Imran Rashid CTO at ManiWeber Technologies.
Chapter 29: Program Security
Remote Procedure Call Hank Levy 1.
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Remote Procedure Call Hank Levy 1.
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Remote Procedure Call Hank Levy 1.
CMSC 202 Exceptions.
Presentation transcript:

1 A Programming model for failure- prone, Collaborative robots Nels Eric Beckman Jonathan Aldrich School of Computer Science Carnegie Mellon University SDIR 2007 April 14 th, 2007

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 2 Failure Blocks: Increasing Application Liveness In the Claytronics domain, failure will be commonplace. Certain Applications: Failure of one catom causes others to be useless. Our model: An extension to remote procedure calls. Helps developers preserve liveness. Developers Signify where liveness is a concern. Specify liveness preserving actions. When failure is automatically detected, those actions are taken.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 3 Outline In our domain, the rate of failure will be high ‘Hole Motion’ (An example failure scenario) Existing RPC systems do not help us to preserve liveness Our model has two key pieces The failure block The compensating action

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 4 Rate of Failure in Catoms will be High Due to the large numbers involved: Per-unit cost must be low, which implies A lack of hardware error detection features. Rate of mechanical imperfections will be high. Probability of some catom failing becomes high. Interaction with the physical world: Dust particles? Other unintended interactions?

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 5 Outline In our domain, the rate of failure will be high ‘Hole Motion’ (An example failure scenario) Existing RPC systems do not help us to preserve liveness Our model has two key pieces The failure block The compensating action

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 6 The Hole Motion* Algorithm A Motion-Planning Technique The Idea: Randomly send holes through the mass of catoms. Holes ‘stick’ to areas that should shrink. They are more likely to be created from areas that should grow. *De Rosa, Goldstein, Lee, Campbell, Pillai. Scalable Shape Sculpting Via Hole Motion: Motion Planning in Lattice-Constrained Modular Robots. IEEE ICRA May 2006.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 7 In Detail... At each ‘hole time- step,’ catoms around the hole have a leader. They only accept commands from this leader. This protects the hole’s integrity. L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 8 In Detail... In order to become the ‘leader,’ this catom calls ‘setLeader’ on its neighbors. The same method is called recursively on other would-be group members. L nextCatom->setLeader(me);

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 9 In Detail... Catom on the stack fails: Catoms i and j may have already set L as their leader! But the only communication path to L is gone. Lg i j nextCatom->setLeader(me);

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 10 In Detail... Instead, suppose operation returned normally... L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 11 In Detail... Instead, suppose operation returned normally... L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 12 In Detail... Instead, suppose operation returned normally... L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 13 In Detail... Instead, suppose operation returned normally... L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 14 In Detail... Instead, suppose operation returned normally... L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 15 In Detail... Now L fails: Catoms g-j (and all the rest) expect commands from L! For all practical purposes, 12 catoms have failed. Lgh i j

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 16 Outline In our domain, the rate of failure will be high ‘Hole Motion’ (An example failure scenario) Existing RPC systems do not help us to preserve liveness Our model has two key pieces The failure block The compensating action

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 17 Existing RPC Systems, Not a Perfect Fit Weak Failure Detection Usually a timeout mechanism. Our model uses active failure detection. No Callee-Side Failure Handling Caller can catch timeout exception; not callee. But the callee could be left in an invalid state. Our model provides callee with compensating actions.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 18 Existing RPC Systems, Not a Perfect Fit Only detect failure on the stack of RPC calls. Our model designates catoms as being a part of the group for a lexical ‘amount of time.’ They are still a part of this group when the thread moves to a different location. Failures on the stack and off are dealt with in the same manner.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 19 Outline In our domain, the rate of failure will be high ‘Hole Motion’ (An example failure scenario) Existing RPC systems do not help us to preserve liveness Our model has two key pieces The failure block The compensating action

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 20 The Model: Two Key Pieces fail_block, which specifies The logical ‘time period’ during which live- ness concerns exist The members of the group (implicitly) Where control should return in the event of a failure push_comp, which allows The specification of code to be executed in the event of catom failure

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 21 The fail_block Primitive fail_block b Evaluates the code in block b. In the event of a detected failure The entire block throws an exception. Execution continues from the catom where the failure block is evaluated.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 22 The fail_block Primitive At runtime, the entire operation is given a unique ‘operation ID.’ When a RPC is called from within block Callee becomes ‘part’ of the operation. Callee and caller add one another as collaborators. They ‘ping’ each other regularly to detect failure. Applies recursively. In the event a failure is detected, they share the information about the demise of that operation.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 23 The fail_block Primitive If b is successfully executed An ‘end’ message is sent out. Collaborators stop detecting failure for that OID.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 24 ‘Demo’ fail_block { // catom 1 lnode->setBoss(this); rnode->setBoss(this); }... setBoss(Catom h) myLeader = h; } Group Members: {} Op ID: Failure Detect

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 25 ‘Demo’ fail_block { // catom 1 lnode->setBoss(this); rnode->setBoss(this); }... setBoss(catom h) { myLeader = h; } Group Members: {1} Op ID: Failure Detect

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 26 ‘Demo’ fail_block { // catom 1 lnode->setBoss(this); rnode->setBoss(this); }... setBoss(catom h) { myLeader = h; } Group Members: {1,2} Op ID: Failure Detect

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 27 ‘Demo’ fail_block { // catom 1 lnode->setBoss(this); rnode->setBoss(this); }... setBoss(catom h) { myLeader = h; } Group Members: {1,2,4} Op ID: Failure Detect

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 28 ‘Demo’ fail_block { // catom 1 lnode->setBoss(this); rnode->setBoss(this); }... setBoss(catom h) { myLeader = h; } Group Members: {1,2,4} Op ID: end Failure Detect

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 29 The push_comp Primitive push_comp b On whichever catom it is called: Suspend code in block b. This code will be evaluated (purely for its side- effects) in the event that a failure is detected. Called ‘compensating actions*’ or ‘compensations.’ *Westley Weimer and George C. Necula. Finding and preventing runtime error handling mistakes. In OOPSLA ’04: Proceedings of the 19 th annual ACM SIGPLAN conference on Object- oriented programming, systems, languages, and applications, pages 419–431, New York, NY, USA, ACM Press.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 30 The push_comp Primitive Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom. OID: i OID: jOID: n......

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 31 The push_comp Primitive OID: i OID: jOID: n e_1 push_comp e_1 Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 32 The push_comp Primitive OID: i OID: jOID: n e_1 push_comp e_2 e_2 Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 33 The push_comp Primitive OID: i OID: jOID: n e_1 push_comp e_3 e_2 e_3 Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 34 The push_comp Primitive OID: i OID: jOID: n e_1 FAILURE OID i!! e_2 e_3 Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 35 The push_comp Primitive OID: i OID: jOID: n e_1 e_2 e_3.run() Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 36 The push_comp Primitive OID: i OID: jOID: n e_1 e_2.run() Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 37 The push_comp Primitive OID: i OID: jOID: n e_1.run() Each catom has several stacks of compensations, one for each OID, and compensating actions are executed from top to bottom.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 38 ‘Demo,’ Continued fail_block { // catom 1 lnode->recurse(this,LEFT); rnode->recurse(this,RIGH); } recurse(catom ldr,Dir d){ myLeader = ldr; push_comp( myLeader = -1);... nnode->recurse(lead); }

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 39 ‘Demo,’ Continued fail_block { // catom 1 lnode->recurse(this,LEFT); rnode->recurse(this,RIGH); } recurse(catom ldr,Dir d){ myLeader = ldr; push_comp( myLeader = -1);... nnode->recurse(lead); }

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 40 ‘Demo,’ Continued fail_block { // catom 1 lnode->recurse(this,LEFT); rnode->recurse(this,RIGH); } recurse(catom ldr,Dir d){ myLeader = ldr; push_comp( myLeader = -1);... nnode->recurse(lead); } FAI L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 41 ‘Demo,’ Continued fail_block { // catom 1 lnode->recurse(this,LEFT); rnode->recurse(this,RIGH); } recurse(catom ldr,Dir d){ myLeader = ldr; push_comp( myLeader = -1);... nnode->recurse(lead); } FAI L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 42 ‘Demo,’ Continued fail_block { // catom 1 lnode->recurse(this,LEFT); rnode->recurse(this,RIGH); } recurse(catom ldr,Dir d){ myLeader = ldr; push_comp( myLeader = -1);... nnode->recurse(lead); } FAI L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 43 try { fail_block { // catom 1 lnode->recurse(this,LEFT); rnode->recurse(this,RIGH); } } catch(OpFailure) {...} recurse(catom ldr,Dir d){ myLeader = ldr; push_comp( myLeader = -1);... nnode->recurse(lead); } ‘Demo,’ Continued

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 44 Conclusion Failure Blocks An extension to RPC for recovering from node failures. Within a failure block RPC calls add the callee to the current operation. Callee and caller detect failure in one another. Compensating actions can be stored, executed in the event of failure. Targeted at modular robotic systems where failure is high but availability is important.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 45 References N. Beckman and J. Aldrich. A Programming Model for Failure-Prone, Collaborative Robots. To appear in the 2nd International Workshop on Software Development and Integration in Robotics (SDIR). Rome, Italy. April 14, De Rosa, Goldstein, Lee, Campbell, Pillai. Scalable Shape Sculpting Via Hole Motion: Motion Planning in Lattice-Constrained Modular Robots. IEEE ICRA May Achour Mostefaoui, Eric Mourgaya, and Michel Raynal. Asynchronous implementation of failure detectors. In 2003 International Conference on Dependable Systems and Networks (DSN’03), page 351, Westley Weimer and George C. Necula. Finding and preventing runtime error handling mistakes. In OOPSLA ’04: Proceedings of the 19 th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 419–431, New York, NY, USA, ACM Press. Michel Reynal. A short introduction to failure detectors for asynchronous distributed systems. SIGACT News, 36(1):53–70, 2005.

46 The end

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 47 Scenario One 123

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 48 Scenario One 123 host2->foo()

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 49 Scenario One 123

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 50 Scenario One 123 host3->bar()

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 51 Scenario One 123

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 52 ‘Demo’ fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); } Group Members: {} Op ID: Regular Ping

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 53 ‘Demo’ fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); } Group Members: {1} Op ID: Regular Ping

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 54 ‘Demo’ fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); } Group Members: {1,2} Op ID: Regular Ping

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 55 ‘Demo’ fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); } Group Members: {1,2,3} Op ID: Regular Ping

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 56 ‘Demo’ fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); } Group Members: {1,2,3,4} Op ID: Regular Ping

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 57 ‘Demo’ fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); } Group Members: {} Op ID: Regular Ping end !

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 58 At a Macroscopic Level... (Video)

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 59 ‘Demo,’ Continued... doWork(HostAddr a) { myLeader = a; push_comp { if(myLeader == a) myLeader = null; } Failure!

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 60 ‘Demo,’ Continued... doWork(HostAddr a) { myLeader = a; push_comp { if(myLeader == a) myLeader = null; } myLeade r = null;...

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 61 Outline The Rate of Failure Will be High Two Failure Scenarios We Would Like to Handle Existing RPC Systems Do Not Meet Our Needs Our Model Has Two Key Pieces fail_block push_comp Our Model Does Not Require Consistency

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 62 Our System Does Not Require Consistency Our model has a nice feature: We do not require consistency in failure detection! This has been proven to be impossible in ‘time-free’ systems.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 63 What is Consistency? fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3->doWork(h1); }

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 64 What is Consistency? fail_block { (* host 1 *) host2->foo(); host4->bar(); }... foo() { host3- >doWork(h1); } OID: 9, failure! OID: 9, end!

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 65 Our System Does Not Require Consistency Domain Assumption: The ultimate goal of any application is to perform actuator movements. Additionally, The thread of control must migrate to a catom in order to issue an actuator command. If a thread migrates to a catom that has detected or knows about a failure, that thread will not continue normally.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 66 Our System Does Not Require Consistency Therefore, if inconsistency occurs, we know: In between detection and fail_block completion, no actuator movements were necessary on any hosts that knew about the failure. In the sense that actuator movements are the ultimate goal in the domain, their work was already done.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 67 Our System Does Not Require Consistency What if we won’t make an actuator movement on a host, but we need to know it performed its duty? E.g., structural catoms This is a question of live-ness versus other goals. fail_block should be used precisely when live-ness is a chief concern.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 68 Assumptions Movement: When a movement occurs, you are required to talk with these surrounding hosts and they will be able to figure out the new location to ping. Goals: Actuator movements are the ultimate goal of most applications in this domain.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 69 What about Transactions? Semantics of roll-back suggest a transactional model. Similarly, it seems that Two-Phase commit could give us consistency. But 2PC has one or two extra rounds of communication Application doesn’t make progress! Our model has no extra blocking rounds. 2PC can block indefinitely if the coordinator fails In non-blocking protocols the number of failures is bounded. It is not clear how error detection and 2PC could be combined.

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 70 In Detail... But, after this field has been set, failure of the leader leaves the catoms in a dead state. L

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 71 The push_comp Primitive We call this suspended code ‘compensating actions.’ Borrowed terminology from Weimar and Necula. Originally used to ensure proper clean-up for file handlers, etc. in exceptional circumstances. (However, our compensating actions are only executed when a failure is detected.)

SDIR07: A Programming Model for Failure-Prone Collaborative Robots 72 Why Server-Side Failure Handling? The client may think the server has failed, when it hasn’t. Allow server to return to a stable state. Failure detectors unreliable in ‘time-free’ systems. The client may have failed. An catom on the return route may have failed.