Systems Seminar Schedule
Monday, 18 February, 4pm:
– “New Wine in Old Bottles” - Douglas Thain
4 March:
– No seminar: Paradyn/Condor Week
Tuesday, 19 March, 3pm:
– “The Microsoft .NET System” - Mike Litzkow
Tuesday, 2 April, 3pm:
– “Condor and the Grid” - Miron Livny
Monday, 15 April, 4pm:
– “Exploiting Gray-Box Knowledge of Buffer-Cache Management” - Nathan Burnett
Monday, 29 April, 4pm:
– “Bridging the Information Gap in Storage Protocol Stacks” - Tim Denehy

New Wine in Old Bottles: Java on Condor Douglas Thain University of Wisconsin 18 February 2002

Abstract We have added Java support to Condor. I’ll tell you how it works and how to use it. There are some nifty features for end users. Adding this code forced us to think about the fundamental problem of coupling systems and representing errors. A lesson: One must consider the scope of an error as well as its detail.

Disclaimer: This is still rough around the edges. (Someone had to go first!)

Outline Why Java and Condor? Architecture Initial Experience A Little Error Theory Changes for the Better Conclusions

Java for Scientific Computing Java is emerging as a tool for large scale (Grande) scientific computing. – More accessible to domain scientists. – Simplified porting. – Faster development, debugging. User communities are forming: – ACM Java Grande Conference – The Java Grande Forum A. Globus, E. Langhirt, M. Livny, R. Ramamurthy, M. Solomon, and S. Traugott. JavaGenes and Condor: Cycle-Scavenging genetic algorithms. ACM Conf on Java Grande, 2000.

Limitations
Java floating point and complex arithmetic do not yet satisfy all of the scientific community.
– Arguments continue between industry and academia.
Java programs are still slower than comparable programs in C/C++/Fortran.
– WAT compilers and JIT compilers are catching up.
– You choose: 2x slowdown vs 5x machines.
Can we really harness 5x machines while still maintaining platform independence?

Condor for Scientific Computing Condor creates a high-throughput computing system on a community of computers. A high-throughput computing system seeks to maximize the amount of work done over a long period of time. A community of computers may be any collection of machines that agree to work together.

Condor Enables Ordinary Users
[Figure: Condor pools under an INFN Central Manager and a UWCS Central Manager. Labels: condor schedd, Job, and many condor startd machines, each with RAM and cpu.]

226 Condor Pools 5576 Condor Hosts Top 10 Condor Pools:

The Hype: Java: – “Write once, run anywhere!” Condor: – “Submit once, run everywhere!” The Grid: – Uniform, dependable, consistent, pervasive, and inexpensive computing.

The Reality Coupling systems is not trivial! The easy part: – Putting java in front of the program name. The tricky parts: – Java installation messes. – Unavailable file systems. – Distinguishing program errors from environmental errors.

Outline Why Java and Condor? Architecture Initial Experience A Little Error Theory Changes for the Better Conclusions

[Figure: Condor architecture. Components: schedd, startd, Match Maker, shadow, starter, The Job, Home File System. Inputs: Machine Policies, Job Policies. Protocols: Matchmaking Protocol, Claiming Protocol, Activation Protocol, Execution Protocol. Edges labeled Fork. Annotations: "Creates the execution environment." and "Exports the details, policy, and I/O services."]

[Figure: Java Universe architecture. Components: shadow, starter, JVM, Wrapper, The Job, I/O Library, I/O Server, I/O Proxy, Home File System. Edges labeled Fork. Channels: Secure Remote I/O, Local I/O (Chirp), Local System Calls.]

User Interface
condor_status -java
Name JavaVendor Ver State Activity LoadAv Mem
aish.cs.wisc. Sun Microsy Owner Idle
anfrom.cs.wis Sun Microsy Owner Idle
babe.cs.wisc. Sun Microsy Claimed Busy
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX
Total

User Interface
universe = java
executable = Main.class
jar_files = MyLibrary.jar
input = infile
output = outfile
arguments = Main
queue
condor_submit

I/O Interface Input, output, and error files are automatically transferred to/from the execution site. Any other named files may be transferred as well. To do online I/O without transferring whole files, you must make small changes to the code: – FileInputStream -> ChirpInputStream – FileOutputStream -> ChirpOutputStream

[Figure: software stack. Labels: Application, Java Standard Libraries, Java Virtual Machine, Operating System, C Standard Library, Chirp I/O Library, JNI.]
Added a new library on existing interfaces. User must call new constructors.
Java symbols are fully qualified, so transparent replacement of classes is not possible.
Could replace native methods in the JVM, but this ties us to open-source JVMs.
Could trap real system calls, but these are complex (asynchronous, nonblocking, threaded) and may be difficult to distinguish from the JVM’s own operations.

Outline Why Java and Condor? Architecture Initial Experience A Little Error Theory Changes for the Better Conclusions

Initial Experience Bad news: Nearly any unexpected failure would cause the job to be returned to the user: – Out of memory at execution site. – Java misconfigured at execution site. – I/O proxy can’t initialize. – Home file system offline.

Initial Experience Although this was correct in some sense -- the information was true -- it was very frustrating. Users want to know when their program fails by design (e.g., NullPointerException), but not if it fails due to the environment. What did we do wrong?

Outline Why Java and Condor? Architecture Initial Experience A Little Error Theory Changes for the Better Conclusions

A Little Error Theory Build on standard definitions from fault tolerance and programming languages. Some brief examples to get the idea. Return to Condor and use the theory to understand our design mistakes.

Fault Tolerance Terminology
Failure
– An externally-visible deviation from specifications.
Error
– An internal data state that leads to a failure.
Fault
– An external event that creates an error.
A. Avizienis and J.C. Laprie. Dependable computing: From concepts to design diversity. Proceedings of the IEEE 74(5), May 1986.

Example
[Figure: a Client asks a Server "What is sqrt(4)?". The Server begins "Hmm, sqrt(4) is...", then continues "Hmm, sqrt(9) is...", and replies "Answer: 3". Labels: FAULT, ERROR, FAILURE.]

Implicit errors
– The system claims to have reached a valid result, but an auditor claims it is invalid. Example: sqrt(3)==2
Explicit errors
– The system tells us it cannot complete the desired action. Example: file not found.
Escaping errors
– The system detects an error, but has no method of reporting it, so it escapes by an alternate route. Example: core dump, kernel panic.
John B. Goodenough. Exception handling: issues and a proposed notation. CACM 18(12), December.
K. Ekanadham and A. Bernstein. Some new transitions in hierarchical level structures. Operating Systems Review 12(4), 1978.
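The three kinds of error can be sketched in Java. This is only an illustration: the class `ErrorKinds`, the exception `KeyNotFound`, and the lookup table are hypothetical names invented for the example, not part of Condor.

```java
import java.util.HashMap;
import java.util.Map;

class ErrorKinds {
    // Hypothetical checked exception: an explicit error named in the contract.
    static class KeyNotFound extends Exception {
        KeyNotFound(String key) { super("key not found: " + key); }
    }

    private static final Map<String, Integer> TABLE = new HashMap<>();
    static { TABLE.put("four", 4); }

    // Implicit error: a missing key silently becomes 0, which looks like a
    // valid result. Only an outside auditor could tell that it is wrong.
    static int lookupImplicit(String key) {
        Integer value = TABLE.get(key);
        return value == null ? 0 : value;
    }

    // Explicit error: the routine tells the caller it cannot complete.
    static int lookupExplicit(String key) throws KeyNotFound {
        Integer value = TABLE.get(key);
        if (value == null) throw new KeyNotFound(key);
        return value;
    }

    // Escaping error: the signature has no way to report a corrupt table,
    // so the condition leaves by an alternate route (an unchecked Error).
    static int lookupEscaping(String key, boolean tableCorrupt) {
        if (tableCorrupt) throw new Error("lookup table corrupt");
        return lookupImplicit(key);
    }

    public static void main(String[] args) {
        System.out.println("implicit: " + lookupImplicit("nine"));
        try {
            lookupExplicit("nine");
        } catch (KeyNotFound e) {
            System.out.println("explicit: " + e.getMessage());
        }
    }
}
```

The `tableCorrupt` flag merely simulates a detected internal problem; a real system would detect it through some integrity check.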

[Figure: a Program issues "load data" to the Virtual Memory System, which is backed by Physical Memory and a Backing Store. A Parent Process observes Normal Exit or Abnormal Exit.]
Could return a default value, but that creates an implicit error.
Would like to return an explicit error, but a load instruction has no exit code.
Escaping error: Tell the parent that the program could not complete.

Interface Contracts
int load( int address );
The implementor must either compute a result that conforms to the contract or cause an escaping error.
C. Hoare. An axiomatic basis for computer programming. CACM 12(10), October.
B. Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.

Exceptions int open( String filename ) throws FileNotFound, AccessDenied; A language with exceptions provides more structure to the contract. A declared exception is an explicit error. Yet, escaping errors are still possible.

[Figure: a Program calls open on a Virtual File System backed by Memory and Disk. INTERFACE: Success, FileNotFound, AccessDenied. IMPLEMENTATION: MemoryCorrupt, DiskOffline, PigeonLost. A Parent Process observes Normal Exit or Abnormal Exit.]

Error Scope In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment. We use the term scope to draw the distinction.

Error Scope
The scope of an error is the portion of the system that it invalidates. An error must be delivered to the process responsible for managing that scope.

Error                     Scope         Handler
FileNotFound              File          Calling Function
RPC Disconnect            Process       Parent Process
Cache Coherency Problem   Machine       Hypervisor or Operator
PVM Node Crash            PVM Cluster   Parent Process
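One way to read the table is as a two-step lookup: classify the error by scope, then deliver it to whoever manages that scope. The sketch below encodes exactly the four rows listed; the class and enum names are hypothetical and chosen only for illustration.

```java
class ErrorScopes {
    enum Scope { FILE, PROCESS, MACHINE, PVM_CLUSTER }
    enum Handler { CALLING_FUNCTION, PARENT_PROCESS, HYPERVISOR_OR_OPERATOR }

    // Step 1: the portion of the system that the error invalidates.
    static Scope scopeOf(String error) {
        switch (error) {
            case "FileNotFound": return Scope.FILE;
            case "RPC Disconnect": return Scope.PROCESS;
            case "Cache Coherency Problem": return Scope.MACHINE;
            case "PVM Node Crash": return Scope.PVM_CLUSTER;
            default: throw new IllegalArgumentException("unknown error: " + error);
        }
    }

    // Step 2: the party responsible for managing that scope.
    static Handler handlerFor(Scope scope) {
        switch (scope) {
            case FILE: return Handler.CALLING_FUNCTION;
            case PROCESS: return Handler.PARENT_PROCESS;
            case MACHINE: return Handler.HYPERVISOR_OR_OPERATOR;
            case PVM_CLUSTER: return Handler.PARENT_PROCESS;
            default: throw new AssertionError(scope);
        }
    }

    public static void main(String[] args) {
        String error = "RPC Disconnect";
        Scope scope = scopeOf(error);
        System.out.println(error + " -> " + scope + " -> " + handlerFor(scope));
    }
}
```

Keeping the two steps separate makes the point of the slide visible: the handler is chosen by scope alone, never by the error's detail.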

Error Detail The detail of an error describes in phenomenological terms the cause of the error. In the right hands, the detail is useful. In the wrong hands, the detail can be misleading. Suppose open returns AccessDenied... – File is not accessible - Ok. – Library containing ‘open’ is not accessible - Problem!

Lessons
Principle 1:
– A routine must not generate an implicit error as a result of receiving an explicit error.
Principle 2:
– An escaping error converts a potential implicit error into an explicit error at a higher level.
Principle 3:
– An escaping error must be propagated to the program that manages the error’s scope.
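Principle 1 is easy to violate by accident. A minimal sketch, using only standard `java.io` classes: `countBytesMasking` swallows an explicit `IOException` and returns 0, which a caller cannot tell apart from an empty input, while `countBytes` lets the explicit error through. The class name `PrincipleOne` and the always-failing stream are hypothetical, made up to simulate an unreachable file system.

```java
import java.io.IOException;
import java.io.InputStream;

class PrincipleOne {
    // Respects Principle 1: an explicit error from read() stays explicit.
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Violates Principle 1: the explicit error is turned into an implicit
    // one, because 0 looks like the valid answer for an empty stream.
    static int countBytesMasking(InputStream in) {
        try {
            return countBytes(in);
        } catch (IOException e) {
            return 0;
        }
    }

    // Stand-in for an unreachable file system: every read fails.
    static InputStream failingStream() {
        return new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("home file system offline");
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("masked result: " + countBytesMasking(failingStream()));
        try {
            countBytes(failingStream());
        } catch (IOException e) {
            System.out.println("explicit error: " + e.getMessage());
        }
    }
}
```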

Outline Why Java and Condor? Architecture Initial Experience A Little Error Theory Changes for the Better Conclusions

Java and Condor Revisited What did we do wrong? We focussed on error detail without considering error scope.

Java and Condor Revisited To fix the system, we revisited the notion of error scope throughout. Two examples: – JVM exit code – I/O errors

JVM Exit Code

Detail                                   Scope             Exit Code
Program exited by completing main        Program           0
Program exited through System.exit(x)    Program           x
Exception: Null pointer.                 Program           1
Exception: Out of memory.                Virtual Machine   1
Exception: Java Misconfigured.           Remote Resource   1
Exception: Home file system offline.     Local Resource    1
Exception: Program image corrupt.        Job               1
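The table shows five different scopes collapsing onto exit code 1. A hedged sketch of the alternative is to carry the scope alongside the exit code, so that two results with the same code remain distinguishable. `JobResult` and its fields are hypothetical names for illustration, not Condor's actual data structures.

```java
final class JobResult {
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE, LOCAL_RESOURCE, JOB }

    final int jvmExitCode;  // what the operating system reports for the JVM
    final Scope scope;      // which part of the system the outcome invalidates
    final String detail;    // human-readable description of the outcome

    JobResult(int jvmExitCode, Scope scope, String detail) {
        this.jvmExitCode = jvmExitCode;
        this.scope = scope;
        this.detail = detail;
    }

    // Only program-scope results describe what the user's own code did.
    boolean isProgramResult() {
        return scope == Scope.PROGRAM;
    }

    public static void main(String[] args) {
        JobResult nullPointer = new JobResult(1, Scope.PROGRAM, "Exception: Null pointer.");
        JobResult outOfMemory = new JobResult(1, Scope.VIRTUAL_MACHINE, "Exception: Out of memory.");
        // Same exit code, different scope.
        System.out.println(nullPointer.jvmExitCode == outOfMemory.jvmExitCode);
        System.out.println(nullPointer.isProgramResult() + " vs " + outOfMemory.isProgramResult());
    }
}
```

What the system does with a non-program result (retry elsewhere, wait, or notify the user) is a policy decision that this sketch deliberately leaves out.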

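[Figure: result path in the Java Universe. Components: shadow, starter, JVM, Wrapper, I/O Library, The Job, Home File System, Result File. Edges labeled Fork. Annotations: "JVM Result"; "Result of Execution Attempt + Result of Program, if any"; "Starter Result + Program Result".]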

I/O Error Scope All Java I/O operations throw a single exception type -- IOException. Our mistake: convert all detected errors into IOExceptions and pass them to the program. Makes sense for FileNotFound, but not for ProxyUnavailable or CredentialsExpired.
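A minimal sketch of the corrected routing, assuming the I/O library knows the detail of each failure: errors inside program scope become an `IOException` the program can catch, and errors outside it become an unchecked `Error` that passes through the program's `IOException` handlers. `IoErrorRouting`, `Detail`, and `EnvironmentError` are hypothetical names, not the actual Chirp library classes.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

class IoErrorRouting {
    enum Detail { FILE_NOT_FOUND, PROXY_UNAVAILABLE, CREDENTIALS_EXPIRED }

    // Hypothetical Error for conditions outside the program's scope.
    static class EnvironmentError extends Error {
        EnvironmentError(String detail) { super(detail); }
    }

    // Route a detected failure according to its scope.
    static void report(Detail detail) throws IOException {
        switch (detail) {
            case FILE_NOT_FOUND:
                // Program scope: the program asked for a file that is not there.
                throw new FileNotFoundException("file not found");
            case PROXY_UNAVAILABLE:
            case CREDENTIALS_EXPIRED:
                // Outside program scope: nothing the program can do about it.
                throw new EnvironmentError(detail.name());
            default:
                throw new AssertionError(detail);
        }
    }

    // A program that handles I/O failures in the usual way.
    static String programOpen(Detail simulatedFailure) {
        try {
            report(simulatedFailure);
            return "opened";
        } catch (IOException e) {
            return "program handled: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(programOpen(Detail.FILE_NOT_FOUND));
        try {
            programOpen(Detail.PROXY_UNAVAILABLE);
        } catch (EnvironmentError e) {
            System.out.println("escaped the program: " + e.getMessage());
        }
    }
}
```

Because `EnvironmentError` is not an `IOException`, the program's ordinary error handling never sees it, which is the behavior the slide argues for.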

[Figure: I/O error routing. Components: starter, JVM, Wrapper, I/O Library, The Job, Result File. Annotations: "JVM Result"; "Result of Execution Attempt + Result of Program, if any"; "To I/O Proxy"; "Error Outside Program Scope"; "Error Inside Program Scope".]

Outline Why Java and Condor? Architecture Initial Experience A Little Error Theory Changes for the Better Conclusions

Conclusion We started building the Java Universe with some naive assumptions about errors. On encountering practical difficulties, we thought more abstractly about errors and developed the notion of scope and detail. By routing errors according to their scope, we made the system more robust and usable.

Food for Thought There isn’t always an easy way to propagate an error to the scope handler. – Escaping error to parent process: Raise a POSIX signal. – Escaping error to the starter: Throw a Java Error, trapped by the Wrapper, placed in file, read after process exits.
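The second route (throw a Java Error, trap it in the Wrapper, place it in a file) might look roughly like the sketch below. It rests on stated simplifications: every `Error` is treated as outside program scope, every `RuntimeException` as inside it, and the one-line result format is invented here. `WrapperSketch` is a hypothetical name, not the actual Condor wrapper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WrapperSketch {
    // Run the job, trap whatever escapes it, and record the outcome in a
    // file that the parent can read after the process exits.
    static void runAndRecord(Runnable job, Path resultFile) throws IOException {
        String result;
        try {
            job.run();
            result = "scope=program detail=completed";
        } catch (Error e) {
            // Simplifying assumption: an Error signals the environment.
            result = "scope=environment detail=" + e.getMessage();
        } catch (RuntimeException e) {
            // The program failed by its own doing, e.g. NullPointerException.
            result = "scope=program detail=" + e.getClass().getSimpleName();
        }
        Files.write(resultFile, result.getBytes(StandardCharsets.UTF_8));
    }

    static String readResult(Path resultFile) throws IOException {
        return new String(Files.readAllBytes(resultFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path resultFile = Files.createTempFile("result", ".txt");
        runAndRecord(() -> { throw new Error("proxy unavailable"); }, resultFile);
        System.out.println(readResult(resultFile));
        Files.delete(resultFile);
    }
}
```

Writing to a file sidesteps the limited vocabulary of the exit code: the parent learns both the scope and the detail regardless of how the JVM itself terminated.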

Food for Thought
The mere use of exceptions in a program does not imply disciplined error management. For example, throws IOException is a very vague statement about an interface. What is an implementor allowed to throw?
– Can open() throw FileNotFound? (Probably.)
– Can read() throw FileNotFound? (Asking for trouble.)
– What about ConnectionRefused?

Food for Thought
A contract can govern more than simply the interface specification. Consider this self-cleaning program:
fd = open("file");
unlink("file");
close(fd);
Works on UNIX, fails on WinNT. Can an interface (code+docs) really state all the necessary semantic information? Should it?

Deployment As of February 14th, the Java Universe is running on 515 RedHat 7.2 machines. Will be rolled out as part of Condor on all platforms in the regular release schedule. Sun JDK on UNIX machines. Sun JDK on WinNT machines. “Is the Java Universe available on my machine?” – condor_status -java

c2 cluster tux lab istat skywalker.cs.wisc.edu

Acknowledgements Although we may take credit (or blame) for the most recent changes, the Condor architecture has dealt with errors for many years. Much credit goes to the core designers, esp. Mike Litzkow, Todd Tannenbaum, and Derek Wright.

More Info: The Condor Project: – These slides: – Douglas Thain – Questions now?