Error Scope on a Computational Grid Douglas Thain University of Wisconsin 4 March 2002.



Overview
• We have added a Java Universe to Condor. (More from Todd.)
• Adding this code forced us to think about the fundamental problem of coupling systems and representing errors.
• A lesson: One must consider the scope of an error as well as its detail.

Java for Scientific Computing
• Java is emerging as a tool for large scale (Grande) scientific computing.
    – More accessible to domain scientists.
    – Simplified porting.
    – Faster development, debugging.
• User communities are forming:
    – ACM Java Grande Conference
    – The Java Grande Forum

The Hype:
• Java: “Write once, run anywhere!”
• Condor: “Submit once, run everywhere!”
• The Grid: Uniform, dependable, consistent, pervasive, and inexpensive computing.

The Reality:
• Coupling systems is not trivial!
• The easy part: Putting java in front of the program name.
• The tricky parts: Dealing with unexpected events!
    – Bad java installation.
    – Unavailable file system.
    – Temporary resource exhaustion.

Architecture
• Execution:
    – User just specifies “java” universe.
    – Execution site gives details of JVM.
• I/O:
    – Know all of your files? Condor transfers whole files for you.
    – Need online I/O? Link program with Chirp I/O Library. Execution site provides proxy to home site.

[Diagram labels: starter, shadow, Home File System; Execution Site, Submission Site]

[Diagram labels: starter, shadow, Home File System, JVM (Fork); Execution Site, Submission Site]

[Diagram labels: starter, shadow, Home File System, JVM (Fork), The Job; Execution Site, Submission Site]

[Diagram labels: JVM (Fork), starter, shadow, Home File System, I/O Library, The Job, I/O Server, I/O Proxy; Secure Remote I/O, Local System Calls, Local I/O (Chirp); Execution Site, Submission Site]

Initial Experience
• Bad news: Nearly any unexpected failure would cause the job to be returned to the user:
    – Out of memory at execution site.
    – Java misconfigured at execution site.
    – I/O proxy can’t initialize.
    – Home file system offline.

What do Users Want?
• This was correct in a certain sense: The information was true. But, still frustrating.
• Users want to know when their program fails by design (e.g., NullPointerException), but not when it fails due to the environment.

What Did We Do Wrong?
• We thought that we were very careful to propagate errors:
    – I/O errors: server->proxy->library->job
    – JVM exit code: JVM->starter->home
• But, we failed to draw a distinction:
    – Errors that are a natural property of the program.
    – Errors that were an incidental result of the environment.

Scope and Detail
• The scope of an error is the portion of the system that it invalidates.
• The detail of an error describes its philosophical cause.
• An error must be delivered to the handler that manages its scope.

Examples

Detail                       Scope             Handler
Program exited normally.     Program           User
Null pointer exception.      Program           User
Out of memory.               Remote Machine    Condor
Home file system offline.    Home Machine      Condor
Program image corrupt.       Job               User
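The idea of routing an error by its scope can be illustrated with a minimal Python sketch. This is not Condor code: the names `Scope`, `ScopedError`, `HANDLER_FOR_SCOPE`, and `route` are hypothetical, chosen only to mirror the columns of the Examples table.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Portion of the system that an error invalidates (hypothetical names)."""
    PROGRAM = "program"
    JOB = "job"
    REMOTE_MACHINE = "remote machine"
    HOME_MACHINE = "home machine"


# Which party manages each scope, copied from the Examples table.
HANDLER_FOR_SCOPE = {
    Scope.PROGRAM: "user",
    Scope.JOB: "user",
    Scope.REMOTE_MACHINE: "condor",
    Scope.HOME_MACHINE: "condor",
}


@dataclass(frozen=True)
class ScopedError:
    detail: str   # what went wrong
    scope: Scope  # what the error invalidates


def route(error: ScopedError) -> str:
    """Return the handler that manages the error's scope."""
    return HANDLER_FOR_SCOPE[error.scope]
```

In this sketch the detail string never influences routing. Only the scope decides whether the user or Condor sees the error, which is the distinction the slides draw.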

An Example
• With this understanding, we reconsidered many elements of the Java Universe.
• One example: The JVM exit code is not a useful result. It gives results that ignore error scope.
• Solution: Trap the program exit at a higher level. Report the result and scope on a separate channel.

JVM Exit Code

Detail                       Scope             Handler    Exit Code
Program exited normally.     Program           User       (x)
Null pointer exception.      Program           User       1
Out of memory.               Remote Machine    Condor     1
Home file system offline.    Home Machine      Condor     1
Program image corrupt.       Job               User       1

[Diagram labels: JVM, starter, shadow, Home File System, Wrapper, I/O Library, The Job, Result File; JVM Result; Program Result or Error and Scope; Starter Result + Program Result]

[Diagram labels: JVM, starter, shadow, Home File System, Wrapper, I/O Library, The Job, Result File, I/O Proxy; JVM Result; Starter Result + Program Result; Local I/O (Chirp); Errors of Larger Scope; Errors Inside Program Scope]
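The wrapper and result file in these diagrams can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Java Universe wrapper itself: `run_wrapped`, the JSON layout of the result file, and the mapping of `MemoryError` to the remote machine scope are all hypothetical.

```python
import json


def run_wrapped(program, result_path):
    """Run `program` and record its outcome and scope in a result file.

    The result file is a channel separate from the process exit code,
    so the caller can tell program failures from environment failures.
    """
    try:
        program()
        result = {"scope": "program", "detail": "Program exited normally."}
    except MemoryError:
        # Assumption: resource exhaustion is blamed on the execution machine.
        result = {"scope": "remote machine", "detail": "Out of memory."}
    except Exception as exc:
        # Any other exception is treated as the program's own failure.
        result = {"scope": "program", "detail": f"{type(exc).__name__}: {exc}"}
    with open(result_path, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result
```

A caller standing in for the starter could then read the result file and decide, from the recorded scope, whether to return the job to the user or to handle the error itself.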

Conclusion
• We started building the Java Universe with some naive assumptions about errors.
• On encountering practical difficulties, we thought more abstractly about errors and developed the notion of scope and detail.
• By routing errors according to their scope, we made the system more robust and usable.
• Details in an upcoming paper.

Deeper Problems
• Systems have deep semantic differences that cross multiple functions.
• Consider this self-cleaning program:
    – Open a file.
    – Delete the file.
    – Close the file.
• Works on UNIX, fails on WinNT.
• Can we really provide a uniform interface?
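The self-cleaning program above might look like this in Python. It is a minimal sketch; the function name `self_cleaning_program` and the file name `scratch.tmp` are made up for illustration. On POSIX systems the delete succeeds while the file is still open. On Windows, `os.remove` on a file opened this way raises an `OSError` subclass, which the sketch reports rather than hides.

```python
import os


def self_cleaning_program(directory):
    """Open a file, delete it while still open, then close it.

    Returns True if the delete succeeded while the file was open,
    False if the platform refused.
    """
    path = os.path.join(directory, "scratch.tmp")
    handle = open(path, "w")
    try:
        try:
            os.remove(path)  # Delete while the file is still open.
            deleted = True
        except OSError:
            deleted = False
    finally:
        handle.close()
    if not deleted:
        os.remove(path)  # Clean up once the file is closed.
    return deleted
```

The same three calls give different results depending on the platform, so a layer that claims a uniform interface has to choose which behavior to present.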

More Info:
• Demo on Wednesday Morning Room 3381 CS anytime
• The Condor Project:
• These slides:
• Douglas Thain
• Questions now?