Issues #1 and #3 Fault Tolerance Working Group December 2015 MPI Forum.

Slides:



Advertisements
Similar presentations
End-to-End Arguments in System Design
Advertisements

RMA Considerations for MPI-3.1 (or MPI-3 Errata)
PC Encryption installation progress/password screen Includes comments from: Encryption team Sarah Deane Tony Stieber Selected people who took part in the.
DL Windows Software “Rules” Import a CSV File From Excel
1 CS345 Operating Systems Φροντιστήριο Άσκησης 1.
Exception Handling. Background In a perfect world, users would never enter data in the wrong form, files they choose to open would always exist, and code.
Update on ULFM Fault Tolerance Working Group MPI Forum, San Jose CA December, 2014.
Keyval and callbacks Rules and Behaviors. Background Keyvals are used for caching attributes on {COMM,TYPE,WIN} objects Discussion will focus on COMM.
System Optimization Agent Procedures using Kaseya Developed By: Jason Aparcana Advisor : Dr. S. Masoud Sadjadi School of Computing and Information Sciences.
1 JAC : Aspect Oriented Programming in Java An article review by Yuval Nir and Limor Lahiani.
An Introduction to Java Programming and Object- Oriented Application Development Chapter 8 Exceptions and Assertions.
Author Instructions How to upload a full session proposal with abstracts – two step process.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development.
Microsoft VB 2005: Reloaded, Advanced Chapter 5 Input Validation, Error Handling, and Exception Handling.
Exception Handling Introduction Exception handling is a mechanism to handle exceptions. Exceptions are error like situations. It is difficult to decide.
ATM User Interface Design. Requirements A bank customer is able to access his or her account using an automatic teller machine. To be able to use an ATM.
Task Scheduling and Distribution System Saeed Mahameed, Hani Ayoub Electrical Engineering Department, Technion – Israel Institute of Technology
Exceptions in Java Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Slides prepared by Rose Williams, Binghamton University Chapter 13 Interfaces and Inner Classes.
Chapter 81 Exception Handling Chapter 8. 2 Reminders Project 5 due Oct 10:30 pm Project 3 regrades due by midnight tonight Discussion groups now.
End-To-End Arguments in System Design J.H. Saltzer, D.P. Reed, and D. Clark Presented by: Ryan Huebsch CS294-4 P2P Systems – 9/29/03.
Computer Science 162 Section 1 CS162 Teaching Staff.
G Robert Grimm New York University Pulling Back: How to Go about Your Own System Project?
Lessons Learned Implementing User-Level Failure Mitigation in MPICH Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji Argonne National Laboratory User-level.
Simplifying the Recovery Model of User- Level Failure Mitigation Wesley Bland ExaMPI ‘14 New Orleans, LA, USA November 17, 2014.
SMS Gateway OZEKI NG Document version: v Adding SMS functionality to Sharepoint.
Microsoft Visual Basic 2005 CHAPTER 8 Using Procedures and Exception Handling.
© Janice Regan, CMPT 128, Jan CMPT 128 Introduction to Computing Science for Engineering Students Creating a program.
Microsoft Visual Basic 2012 Using Procedures and Exception Handling CHAPTER SEVEN.
What is Sure BDCs? BDC stands for Batch Data Communication and is also known as Batch Input. It is a technique for mass input of data into SAP by simulating.
Microsoft Visual Basic 2008 CHAPTER 8 Using Procedures and Exception Handling.
CMSC 202 Exceptions. Aug 7, Error Handling In the ideal world, all errors would occur when your code is compiled. That won’t happen. Errors which.
ASENT_IMPORT.PPT Importing Part Lists and Board Data Last revised 10/28/2009.
Software Engineering General architecture. Architectural components:  Program organisation overview Major building blocks in a system Definition of each.
11 Web Services. 22 Objectives You will be able to Say what a web service is. Write and deploy a simple web service. Test a simple web service. Write.
CL1 Proposal Redefine “install”. Add update artifact. Remove inconsistencies introduced by “baseUninstall” package type.
Open Your Mind... OpenOffice OpenOffice is the future of word processing in the world of Web 2.0—the read/write web.
Cohesion and Coupling CS 4311
Introduction to Exception Handling and Defensive Programming.
Spring/2002 Distributed Software Engineering C:\unocourses\4350\slides\DefiningThreads 1 RMI.
Chapter 10 Chapter 10: Managing the Distributed File System, Disk Quotas, and Software Installation.
Exception Handling Programmers must deal with errors and exceptional situations: User input errors Device errors Empty disk space, no memory Component.
Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji Argonne National Laboratory {wbland, huiweilu, sseo, May 5, 2015 Lessons Learned Implementing.
End-To-End Arguments in System Design J.H. Saltzer, D.P. Reed, and D. Clark Presented by: Amit Mondal.
COP4020 Programming Languages Exception Handling Prof. Robert van Engelen (modified by Prof. Em. Chris Lacher)
Web Services Error Handling and Debugging. Agenda Simple SOAP faults Advanced SOAP faults SOAP headers and faults Error handling From a Service Perspective.
GLOBAL EDGE SOFTWERE LTD1 R EMOTE F ILE S HARING - Ardhanareesh Aradhyamath.
Files Tutor: You will need ….
Exceptions in C++. Exceptions  Exceptions provide a way to handle the errors generated by our programs by transferring control to functions called handlers.
Sending large message counts (The MPI_Count issue)
MPI Communicator Assertions Jim Dinan Point-to-Point WG March 2015 MPI Forum Meeting.
Firoze Abdur Rakib. Syntax errors, also known as parsing errors, are perhaps the most common kind of error you encounter while you are still learning.
Fault Tolerance (2). Topics r Reliable Group Communication.
CS241 Systems Programming Discussion Section Week 2 Original slides by: Stephen Kloder.
The HDF Group Single Writer/Multiple Reader (SWMR) 110/17/15.
Exception Handling. VB.NET has an inbuilt class that deals with errors. The Class is called Exception. When an exception error is found, an Exception.
Exceptions in OO Programming Introduction Errors Exceptions in Java Handling exceptions The Try-Catch-Finally mechanism Example code Exception propagation.
ECE122 L23: Exceptions December 6, 2007 ECE 122 Engineering Problem Solving with Java Lecture 24 Exceptions.
Error Handler Rework Fault Tolerance Working Group.
Upgrade on Windows 7. DownloadSoftware Download Software from link provided in Webliography: e/
Answer to Summary Questions
Jim Fawcett CSE687-OnLine – Object Oriented Design Summer 2017
Jim Fawcett CSE687 – Object Oriented Design Spring 2001
Human Computer Interaction Lecture 21,22 User Support
Jim Fawcett CSE687 – Object Oriented Design Spring 2015
Finalization 17: Finalization
Programming in C# CHAPTER - 7
HP Printer Technical Support Number +1 (844)
Error Handling.
Xen and the Art of Virtualization
Presentation transcript:

Issues #1 and #3 Fault Tolerance Working Group December 2015 MPI Forum

Background ● Section 8.3 is imprecise about where MPI_ERRORS_ARE_FATAL is applied. ● The first and second sentences are contradictory ( MPI_ABORT accepts a communicator argument) ● The second sentence is more permissive for FT implementations and good software engineering.

Motivation ● Allows the application to clean itself up after an error. ○ Flush state to disk, close files, etc. ● Step 0 for any FT solution ○ Could be part of ULFM, but this is more generic. ○ Allows very basic FT implementations and applications ○ Without these changes, if any communicator uses MPI_ERRORS_ARE_FATAL, all use it. ○ Even if users don’t want FT, they still might want to clean up reliably ● Have some users at BSC who would already like to see this available

Changes Summary – Issue 1 MPI_ERRORS_ARE_FATAL causes all connected processes to abort (as if called by MPI_ABORT). Better define how communicators inherit error handlers for each communicator creation function. Windows all inherit their error handler from MPI_WIN_NULL Mirrors files General text cleanup to have definitions in a single place. ADVICE: HQI should detect aborts correctly

Changes Summary – Issue 3 Add new error handler MPI_ERRORS_ABORT which aborts only a communicator Define MPI_COMM_SELF as default communicator for error handlers (instead of MPI_COMM_WORLD)

Objection from Bordeaux Rolf pointed out a backward compatibility issue An application is catching its own errors via MPI_ERRORS_RETURN on MPI_COMM_WORLD, but does not change MPI_ERRORS_ARE_FATAL on MPI_COMM_SELF. A simple error (MPI_ALLOC_MEM) would abort in this model where it did not before. The forum considered this problem and straw voted to break backward compatibility here. We looked at it in the working group and came up with alternatives, but all required adding a bunch of new semantics. (Discussion on issue)

Burden on Implementations ● Must provide MPI_ERRORS_ABORT ○ In practice, could be the same as MPI_ERRORS_ARE_FATAL ● Change error handler propagation for windows ○ Minor change to inherit from MPI_WIN_NULL ● Can still abort all processes if that’s what the implementation wants to do ○ Implementations that choose to do better now have clear instructions ● Optional: ○ Improve MPI_ABORT to only abort a subcommunicator

Read Through Changes

Backup Slides

Previous Objections ● This prevents the application from handling an exception safely! ( MPI_SEND kills apps) ○ If you want to be safe, set your error handler to MPI_ERRORS_RETURN on communicators that you want to keep alive. ○ Completing MPI_SEND does not imply that the message has been received (or that the receiver is alive). ● This has a large burden on implementations! ○ This is just as optional as FT has always been. Feel free to abort.

Why MPI_Send can succeed when the receiver is dead Returning from MPI_Send does not imply that the message is received, only buffered somewhere.