Error Scope on a Computational Grid: Theory and Practice Douglas Thain Computer Sciences Department University of Wisconsin USC Reliability Workshop July.

Presentation transcript:

Error Scope on a Computational Grid: Theory and Practice Douglas Thain Computer Sciences Department University of Wisconsin USC Reliability Workshop July 2002

Outline
  – An Exercise: Condor + Java
  – Bad News: Error Explosion
  – A Theory of Error Propagation
  – Down with Generic Errors!
  – Condor Revisited
  – Parting Thoughts

An Exercise: Coupling Condor and Java
The Condor Project, est
  – Production high-throughput computing facility.
  – Provides a stable execution environment on a Grid of unstable, autonomous resources.
The Java Language, est
  – Production language, compiler, and interpreter.
  – Provides a standard instruction set and libraries on any processor and system.
The Grid, est. ????
  – Execute any code anywhere at any time.
  – Dependable, consistent, pervasive, inexpensive...
  – Are we there yet?

The Condor High Throughput Computing System
HTC != HPC
  – Measured in sims/week, frames/month, cycles/year.
All participants are autonomous.
  – Users give constraints on usable machines.
  – Machines give constraints on jobs and users.
  – ClassAds: a language for matchmaking.
If you are willing to re-link jobs...
  – Remote system calls for transparent mobility.
  – Binary checkpointing for migration and fault-tolerance.
  – Can’t relink? All other features available.
Special “universes” support software environments.
  – PVM, MPI, Master-Worker, Vanilla, Globus, Java

[Figure: Condor architecture. Submission site: home file system, user agent (schedd) with policy control, job agent (shadow). Execution site: machine agent (startd) with policy control, job agent (starter), the job. Between the sites: the matchmaker, which receives “I want...” and “I have...” and sends a notify. Labeled links: fork (three times), claiming protocol, execution protocol.]

Java Universe
Execution:
  – User specifies .class and .jar files.
  – Machine provides the JVM details.
Input and Output:
  – Know all of your files? Condor transfers whole files for you.
  – Need online I/O? Link program with Chirp I/O Library. Execution site provides proxy to home site.

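[Figure: Java Universe I/O path. Execution site: job agent (starter) with I/O proxy, which forks the JVM; inside the JVM, a wrapper, the job, and the I/O library. Submission site: job agent (shadow) with I/O server, and the home file system. Labeled links: local RPC (Chirp), secure remote I/O, local system calls.]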

Initial Experience
Bad news! Any kind of error sent the job back to the user with an exception message:
  – NullPointerException - Program is faulty.
  – OutOfMemory - Program outgrew machine.
  – ClassNotFoundError - Machine incorrectly installed.
  – ConnectionRefused - Network temporarily unavailable.
Users were frustrated because they had to evaluate whether the job failed or the system failed.
These messages were correct in the sense that they were true.
These were not bugs. We deliberately trapped all possible errors and passed them up the chain.

What’s the Problem? To reason about this problem, we began to construct a theory of error propagation. This theory offers some common definitions and four principles that outline a design discipline. We re-examined the Java Universe according to this theory. Our most serious mistake: We failed to propagate errors according to their scope.

We are NOT Talking About:
Fault Tolerance
  – What algorithms are fault-resistant?
  – How many disks can I lose without losing data?
  – How many copies should I make for five nines?
Language Structures
  – Should I use Objects or Strings to represent errors?
  – Should I use Exceptions or Signals to communicate errors?
These are important and valuable questions, but we are asking something different!

We ARE Talking About:
  – Where is the problem?
  – How should a program respond to an error?
  – Who should receive an error message?
  – What information should an error carry?
  – How can we even reason about this stuff?

Engineering Perspective
Fault – A physical disruption of the machine.
Error – An information state that reflects a fault.
Failure – A violation of documented/guaranteed behavior.
Fault – (A failure in one’s underlying components.)

Interface Perspective
Implicit Error
  – A result presented as valid, but found to be false.
  – Example: sqrt(3) -> 2.
Explicit Error
  – A result describing an inability to carry out the request.
  – Example: open(“file”) -> ENOENT.
Escaping Error
  – A return to a higher level of abstraction.
  – Example: read -> virt mem failure -> process abort.
  – Example: server out of memory -> shutdown socket
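The three kinds of error above can be contrasted in a few lines of Java. This is only a sketch: the class ErrorKinds, the lookup methods, and the one-entry table are hypothetical names invented for illustration, and treating an unchecked exception as the "return to a higher level" is an assumption about how the idea maps onto Java.

```java
// Hypothetical illustration; none of these names come from the talk.
class ErrorKinds {
    // A tiny table standing in for a resource that may not hold the answer.
    static final java.util.Map<String, Integer> TABLE = java.util.Map.of("a", 1);

    // Implicit error: a missing key is silently reported as 0, so the
    // caller cannot tell "absent" apart from a genuine value of 0.
    static int lookupImplicit(String key) {
        return TABLE.getOrDefault(key, 0);
    }

    // Explicit error: the result itself states that the request
    // could not be carried out.
    static java.util.OptionalInt lookupExplicit(String key) {
        Integer value = TABLE.get(key);
        return value == null
                ? java.util.OptionalInt.empty()
                : java.util.OptionalInt.of(value);
    }

    // Escaping error: the int return type has no room for "no answer",
    // so control leaves the interface and goes to a higher level.
    static int lookupEscaping(String key) {
        Integer value = TABLE.get(key);
        if (value == null) {
            throw new IllegalStateException("table cannot answer for key: " + key);
        }
        return value;
    }
}
```

With this sketch, ErrorKinds.lookupImplicit("missing") returns a plausible-looking 0, which is exactly the kind of result the slide warns against.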

[Figure: A program issues a load to the virtual memory system, which draws data from physical memory and a backing store. A parent process observes either a normal exit or an abnormal exit.]
The virtual memory system could return a default value, but that creates an implicit error.
It would like to return an explicit error, but a load instruction has no exit code.
Escaping error: tell the parent that the program could not complete.

Interface Contracts
int load( int address );
The implementor must either compute a result that conforms to the contract or cause an escaping error.

Exceptions
int open( String filename ) throws FileNotFound, AccessDenied;
A language with exceptions provides more structure to the contract.
A declared exception is an explicit error.
Yet, escaping errors are still possible.
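A minimal sketch of such a contract follows, assuming a toy in-memory file system. ContractSketch, ToyFileSystem, addFile, and the placeholder descriptor value 3 are all hypothetical; DiskOffline is modelled here as an unchecked exception purely to show a condition that sits outside the declared contract.

```java
// Hypothetical illustration; the toy file system is an assumption, not Condor code.
class ContractSketch {
    // Declared (checked) exceptions: the explicit errors of the contract.
    static class FileNotFound extends Exception {
        FileNotFound(String name) { super(name); }
    }
    static class AccessDenied extends Exception {
        AccessDenied(String name) { super(name); }
    }
    // Unchecked, so it is deliberately absent from the declared contract.
    static class DiskOffline extends RuntimeException {
        DiskOffline() { super("disk offline"); }
    }

    interface FileSystem {
        int open(String filename) throws FileNotFound, AccessDenied;
    }

    static class ToyFileSystem implements FileSystem {
        boolean diskOnline = true;
        private final java.util.Map<String, Boolean> readableByName =
                new java.util.HashMap<>();

        void addFile(String name, boolean readable) {
            readableByName.put(name, readable);
        }

        @Override
        public int open(String filename) throws FileNotFound, AccessDenied {
            // Not expressible in the contract: escape instead of inventing a result.
            if (!diskOnline) throw new DiskOffline();
            Boolean readable = readableByName.get(filename);
            if (readable == null) throw new FileNotFound(filename);
            if (!readable) throw new AccessDenied(filename);
            return 3; // placeholder descriptor for the sketch
        }
    }
}
```

A caller is forced by the compiler to handle FileNotFound and AccessDenied, while DiskOffline passes through any caller that does not manage the disk.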

[Figure: A program calls open on a virtual file system backed by memory and disk. INTERFACE: Success, FileNotFound, AccessDenied. IMPLEMENTATION: MemoryCorrupt, DiskOffline, PigeonLost. A parent process observes either a normal exit or an abnormal exit.]

Error Scope In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment. We use the term scope to draw the distinction.

Error Scope
The scope of an error is the portion of the system that it invalidates.
An error must be delivered to the process responsible for managing that scope.

Error                      Scope         Handler
FileNotFound               File          Calling Function
RPC Disconnect             Process       Parent Process
Cache Coherency Problem    Machine       Hypervisor or Operator
PVM Node Crash             PVM Cluster   Parent Process
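The table can be written down directly as data. The sketch below assumes the four rows above are the whole universe of errors; ScopeTable, Scope, and scopeOf are hypothetical names, and representing errors as strings is only a shortcut for the example.

```java
// Hypothetical illustration of the scope table; names are invented for the sketch.
class ScopeTable {
    // Each scope records which party manages it.
    enum Scope {
        FILE("calling function"),
        PROCESS("parent process"),
        MACHINE("hypervisor or operator"),
        PVM_CLUSTER("parent process");

        final String handler;

        Scope(String handler) {
            this.handler = handler;
        }
    }

    // Map an error to the portion of the system that it invalidates.
    static Scope scopeOf(String error) {
        switch (error) {
            case "FileNotFound":            return Scope.FILE;
            case "RPC Disconnect":          return Scope.PROCESS;
            case "Cache Coherency Problem": return Scope.MACHINE;
            case "PVM Node Crash":          return Scope.PVM_CLUSTER;
            default:
                // An unlisted error has no known scope; refuse to guess.
                throw new IllegalArgumentException("unknown error: " + error);
        }
    }
}
```

Looking up ScopeTable.scopeOf("RPC Disconnect").handler names the party that should receive the error, rather than whichever function happened to observe it.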

Error Detail
The detail of an error describes in phenomenological terms the cause of the error.
In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.
Suppose open returns AccessDenied...
  – File is not accessible - Ok.
  – Library containing ‘open’ is not accessible - Problem!

What To Do With An Error?
A program cannot possibly know what to do with an error outside its scope.
  – Should sin(x) deal with “math library not available?”
Propagate an error to the manager of the scope as directly as possible.
Sometimes, a direct mechanism:
  – Signal, exception, dropped connection, message.
Sometimes, an indirect mechanism:
  – Touch a file, then exit by any means available.

Principles for Error Design
Principle 1:
  – A routine must not generate an implicit error as a result of receiving an explicit error.
Principle 2:
  – An escaping error converts a potential implicit error into an explicit error at a higher level.
Principle 3:
  – An escaping error must be propagated to the program that manages the error’s scope.
Principle 4:
  – Error interfaces must be concise and finite.
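Principles 1 and 2 can be illustrated with a routine that reads from a source which may be unavailable. This is a sketch under assumptions: PrincipleOne, Source, SourceUnavailable, and the three read variants are hypothetical names, and an unchecked exception again stands in for the escaping error.

```java
// Hypothetical illustration of Principles 1 and 2; names are invented.
class PrincipleOne {
    static class SourceUnavailable extends Exception {
        SourceUnavailable(String message) { super(message); }
    }

    interface Source {
        int read() throws SourceUnavailable;
    }

    // Violates Principle 1: the explicit error is swallowed and -1 is
    // presented to the caller as if it were data (an implicit error).
    static int readBad(Source source) {
        try {
            return source.read();
        } catch (SourceUnavailable e) {
            return -1;
        }
    }

    // Respects Principle 1: the explicit error stays explicit.
    static int readGood(Source source) throws SourceUnavailable {
        return source.read();
    }

    // Principle 2: when the contract has no room for the error (like a
    // memory load), escape rather than fabricate a value.
    static int loadOrEscape(Source source) {
        try {
            return source.read();
        } catch (SourceUnavailable e) {
            throw new IllegalStateException("cannot satisfy load contract", e);
        }
    }
}
```

The difference shows up at the call site: readBad hands back a number the caller will happily compute with, while the other two make the failure impossible to mistake for a result.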

Return to Condor
What did we do wrong?
  – We failed to carefully consider the scope of an error.
  – We fell prey to the deadly generic error.
What’s the solution?
  – Identify error scopes in Condor.
  – Find more direct mechanisms to send escaping errors to the managing process.

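[Figure: Error scopes in Condor across the components schedd, shadow, starter, JVM, and program. Scopes: program scope, virtual machine scope, remote resource scope, local resource scope, job scope. Resources shown: code, data, input data, program arguments, program image, output space, I/O server, user policy, owner policy, Java package, memory and CPU.]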

Scope in Condor

Detail                       Scope             Handler
Program exited normally.     Program           User
Null pointer exception.      Program           User
Out of memory.               Virtual Machine   JVM
Java misconfigured.          Remote Resource   Starter
Home file system offline.    Local Resource    Shadow
Program image corrupt.       Job               Schedd

Scope in Condor: JVM Exit Code

Detail                       Scope             Handler   Exit Code
Program exited normally.     Program           User      (x)
Null pointer exception.      Program           User      1
Out of memory.               Virtual Machine   JVM       1
Java misconfigured.          Remote Resource   Starter   1
Home file system offline.    Local Resource    Shadow    1
Program image corrupt.       Job               Schedd    1

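[Figure: Result reporting in the Java Universe. Inside the JVM, a wrapper runs the job with the I/O library and writes a result file containing the program result, or the error and its scope. The job agent (starter) combines the JVM result with the result file and passes the starter result plus program result to the job agent (shadow), next to the home file system.]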
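Because the JVM exit code is 1 for nearly every failure, the wrapper has to record the scope somewhere else. The sketch below shows one way this might look; it is not the Condor wrapper. WrapperSketch, classify, run, writeResult, and the result-line format are hypothetical, and the mapping from Java throwable types to scopes (in particular LinkageError to "Java misconfigured") is an assumption made for the example.

```java
// Hypothetical sketch of a wrapper that records a result and its scope.
class WrapperSketch {
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE, LOCAL_RESOURCE, JOB }

    // Assumed mapping from a throwable to a scope. LOCAL_RESOURCE and JOB
    // errors are not detected here; in the table they belong to the
    // shadow and the schedd, outside the JVM.
    static Scope classify(Throwable t) {
        if (t instanceof OutOfMemoryError) return Scope.VIRTUAL_MACHINE;
        if (t instanceof LinkageError) return Scope.REMOTE_RESOURCE; // e.g. NoClassDefFoundError
        return Scope.PROGRAM; // e.g. NullPointerException
    }

    // Run the job and describe the outcome in one line of text.
    static String run(Runnable job) {
        try {
            job.run();
            return "result=normal scope=" + Scope.PROGRAM;
        } catch (Throwable t) {
            return "result=" + t.getClass().getSimpleName() + " scope=" + classify(t);
        }
    }

    // Indirect mechanism: leave the result in a file for the starter to
    // read, since the exit code alone cannot carry the scope.
    static void writeResult(java.nio.file.Path file, String line)
            throws java.io.IOException {
        byte[] bytes = (line + System.lineSeparator())
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        java.nio.file.Files.write(file, bytes);
    }
}
```

A starter-like process could then read the file and decide whether to hand the outcome to the user (program scope) or to retry the job elsewhere (larger scopes).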

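[Figure: Error paths in the revised Java Universe, with the components JVM (wrapper, the job, I/O library), starter with I/O proxy, shadow, and home file system. Errors inside program scope travel through the result file and the JVM result as starter result plus program result. Errors of larger scope travel separately; the I/O library reaches the I/O proxy over local I/O (Chirp).]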

Half-Way Conclusion Small but powerful changes drastically improved the Java Universe. Our mistake was to represent all possible errors explicitly in the closest interface. Error scope is an analytic tool that helps the designer decide how to propagate an error. But, we were initially confused by the presence of the deadly generic error.

The Deadly Generic Error
Whereas, a program may fail in more ways than we can possibly imagine...
And whereas, generality and flexibility are virtues of programming...
Be it therefore resolved that interfaces should return general, flexible, arbitrary values:
  – int open( String name ) throws IOException;

What’s Wrong with Generality?
The structure and types of errors are as essential to an interface as the arguments and return values.
Every error requires a different recovery mechanism, according to its scope:
  – EINTR - try again right away
  – ETIMEDOUT - will be available again in the future
  – EPERM - you can’t at all without talking to a person
  – ESTALE - must kill process
A program must know the *specific* details of an error in order to take the right action.
Guesses don’t work.
  – Exit on unknown errors? Program is brittle.
  – Retry on unknown errors? Program waits endlessly.
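The four error codes and their recoveries listed above translate into a small dispatch. This is only a sketch: RecoverySketch, Errno, Action, and recoveryFor are hypothetical names, and the enum stands in for real errno values, which Java does not expose directly.

```java
// Hypothetical illustration: one recovery action per specific error.
class RecoverySketch {
    enum Errno { EINTR, ETIMEDOUT, EPERM, ESTALE }

    enum Action { RETRY_NOW, RETRY_LATER, ASK_A_PERSON, KILL_PROCESS }

    // The right action depends on the specific error; there is
    // intentionally no catch-all "generic error" branch.
    static Action recoveryFor(Errno error) {
        switch (error) {
            case EINTR:     return Action.RETRY_NOW;    // try again right away
            case ETIMEDOUT: return Action.RETRY_LATER;  // available again in the future
            case EPERM:     return Action.ASK_A_PERSON; // needs human intervention
            case ESTALE:    return Action.KILL_PROCESS; // process cannot continue
        }
        // Reached only if a new constant is added without a recovery.
        throw new AssertionError("no recovery defined for " + error);
    }
}
```

If the interface reported only a generic failure, this method could not be written at all; the caller would be left guessing between exiting and retrying.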

An Example of Generality
int open( String name ) throws IOException;
int write( int data ) throws IOException;

An Example of Generality
Java defines several types of IOException:
  – AccessDenied, FileNotFound, EndOfFile...
Can open throw...?
  – FileNotFound
  – EndOfFile
  – DiskFull
Can write throw...?
  – AccessDenied
  – FileNotFound
  – DiskFull
Trick Question!

My Disk Runneth Over!
What can a program expect for a full disk?
  – DiskFullException
  – OutOfSpaceException
  – It’s really neither! (How would we know?)
What should an implementor do when the disk fills up?
  – There is no appropriate exception to throw.
  – Making up an exception is not useful.
  – Only solution: an escaping error. (Example later.)

Advice for Constructing Error Interfaces
Export a small set of expected error types.
  – Bad Arguments
  – Lost Connection
  – No Such File
Choose an internal error management strategy. You know the cost of retry vs the cost of failure.
  – Retry internally
  – Abort process
  – Drop connection
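The advice above might be sketched as a routine that exports exactly three error types and retries transient failures internally. Everything here is an assumption for illustration: SmallErrorInterface, Transport, fetch, and the maxAttempts parameter are hypothetical, and "retry internally" is only one of the three strategies the slide lists.

```java
// Hypothetical illustration of a small exported error set with internal retry.
class SmallErrorInterface {
    // The only errors a caller ever sees.
    static class BadArguments extends Exception { }
    static class LostConnection extends Exception { }
    static class NoSuchFile extends Exception { }

    // Assumed lower layer that reports failures generically.
    interface Transport {
        byte[] fetch(String name) throws java.io.IOException;
    }

    static byte[] read(Transport transport, String name, int maxAttempts)
            throws BadArguments, LostConnection, NoSuchFile {
        if (transport == null || name == null || name.isEmpty() || maxAttempts < 1) {
            throw new BadArguments();
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transport.fetch(name);
            } catch (java.io.FileNotFoundException e) {
                throw new NoSuchFile(); // retrying cannot help
            } catch (java.io.IOException e) {
                // Treated as transient: retry internally, do not export it.
            }
        }
        // The retry budget is spent; report one concise, expected error.
        throw new LostConnection();
    }
}
```

The bounded retry count is the design choice to notice: the implementor, who knows the cost of retry versus the cost of failure, makes that decision once instead of pushing a generic IOException onto every caller.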

A Better Interface
int open( String name ) throws AccessDenied, FileNotFound;
int write( int data ) throws DiskFull;

Conclusion Small but powerful changes drastically improved the Java Universe. Our mistake was to represent all possible errors explicitly in the closest interface. Error scope is an analytic tool that helps the designer decide how to propagate an error. An error discipline saves precious resources: time and aggravation!

For more information... “Error Scope” Paper – Douglas Thain – Miron Livny – Condor Software, Manuals, Papers, and More – Questions now?