A 100,000 Ways to Fail. Al Geist, Computer Science and Mathematics Division, Oak Ridge National Laboratory. July 9, 2002, Fast-OS Workshop, Advanced Scientific Computing Research.


1 A 100,000 Ways to Fail
Al Geist
Computer Science and Mathematics Division
Oak Ridge National Laboratory
July 9, 2002
Fast-OS Workshop
Advanced Scientific Computing Research
Research sponsored by Mathematics, Information and Computational Sciences Office, U.S. Department of Energy

2 Fault Tolerance on 100,000 processors
Mean time to failure (MTTF) of a 100,000-CPU system will be measured in minutes.
  – a problem when MTTF is O(synch time)
  – or less than application startup time
How long do I wait
  – to find out if something has gone wrong?
  – memory is a million cycles away.
Validation of applications: what happens when...
  – it has a subtle synchronization error at p > 78,347?
  – a HW failure loses my asynchronous message or changes its bits (Myrinet)?
It worked fine on the validation test sizes.
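The "measured in minutes" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes independent node failures with a constant failure rate, so failure rates add and the system MTTF is the node MTTF divided by the node count. The 5-year node MTTF is an illustrative assumption, not a figure from the talk.

```python
def system_mttf_hours(node_mttf_hours: float, n_nodes: int) -> float:
    """MTTF of the whole machine, assuming independent node failures with a
    constant rate: failure rates add, so the MTTF divides by the node count."""
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    return node_mttf_hours / n_nodes


# Assumed, illustrative figure: each node fails on average once every 5 years.
NODE_MTTF_HOURS = 5 * 365 * 24  # 43,800 hours

for n in (1_000, 10_000, 100_000):
    minutes = system_mttf_hours(NODE_MTTF_HOURS, n) * 60
    print(f"{n:>7} nodes: system MTTF ~ {minutes:.0f} minutes")
```

Under these assumptions a 100,000-node machine fails roughly every 26 minutes, which is the regime the slide describes.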

3 When you are a Failure and don’t know it
Speed is not a problem if the answer doesn’t have to be right.
  – Do you care? Numerical accuracy (pgon)
  – Do you know? Validation issues
Scientists seem overly concerned about getting the right answer.
  – Concern for their reputation and integrity
  – Some even feel the answer is the product
Such is not the case for those who report how fast their computers are (or who run Enron).
Do I have to beat the right answer out of you?
The Uncertainty Principle of Large-scale Computing

4 Fault Tolerance – today’s system approach
There are three main steps in traditional fault tolerance:
Detection that something has gone wrong
  – System: detection in hardware
  – Framework: detection by runtime environment
  – Library: detection in math or communication library
Notification of the application
  – Interrupt: signal sent to job
  – Error code returned by application routine
Recovery of the application from the fault
  – Restart: from checkpoint or from beginning
  – Migration of task to other hardware
  – Reassignment of work to remaining tasks
Now we are cooking!
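The last recovery option on this slide, reassigning work to the remaining tasks, can be sketched in a few lines. This is a minimal illustration, not any particular runtime's mechanism: the task names, the integer work items, and the round-robin policy are all assumptions made for the example.

```python
from __future__ import annotations


def reassign_work(assignments: dict[str, list[int]],
                  failed: set[str]) -> dict[str, list[int]]:
    """Hand the work items of failed tasks to the survivors, round-robin.

    `assignments` maps a (hypothetical) task name to its work items.
    The input is not modified; a new mapping is returned.
    """
    survivors = sorted(t for t in assignments if t not in failed)
    if not survivors:
        raise RuntimeError("no surviving tasks to take over the work")
    new_assignments = {t: list(assignments[t]) for t in survivors}
    # Collect everything that was owned by a failed task.
    orphaned = [item
                for t in sorted(failed) if t in assignments
                for item in assignments[t]]
    # Spread the orphaned items evenly over the survivors.
    for i, item in enumerate(orphaned):
        new_assignments[survivors[i % len(survivors)]].append(item)
    return new_assignments


work = {"t0": [0, 1], "t1": [2, 3], "t2": [4, 5]}
print(reassign_work(work, failed={"t1"}))
```

Note that this only moves the work descriptions; any partial results held by the failed task are gone and have to be recomputed.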

5 Fault Tolerance – application recovery
There are three main steps in traditional fault tolerance:
Detection that something has gone wrong
  – Application depends on runtime to do this
Notification of the application
  – Interrupt: application gets a signal (limited info)
  – Error code returned by library (got to check for it)
Recovery of the application from the fault
  – Restart: app typically needs to include a restart routine
  – Run through failure: app needs a fault tolerant programming model (e.g. FT-MPI)
    - Reassignment of work to remaining tasks
    - Lost information/state/messages a big concern for run through
Not another hurdle!
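The "restart" path on this slide requires the application to carry its own restart routine. The sketch below shows the shape of such a loop under simplifying assumptions: the failure is simulated by a hypothetical `TaskFailure` exception standing in for a signal or error code, the checkpoint lives in memory rather than on stable storage, and the "work" is a trivial running sum.

```python
import copy


class TaskFailure(Exception):
    """Stands in for the signal or error code a real runtime would deliver."""


def run_with_restart(n_steps, fail_at, checkpoint_every=10):
    """Run n_steps of work, rolling back to the last checkpoint on failure.

    `fail_at` is a set of step numbers at which a simulated fault fires once.
    Returns (result, number_of_restarts).
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    pending_failures = set(fail_at)
    restarts = 0
    while state["step"] < n_steps:
        try:
            if state["step"] in pending_failures:
                pending_failures.discard(state["step"])  # each fault fires once
                raise TaskFailure(f"fault at step {state['step']}")
            state["total"] += state["step"]  # the "work" done in this step
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)  # save a restart point
        except TaskFailure:
            state = copy.deepcopy(checkpoint)  # roll back and redo lost steps
            restarts += 1
    return state["total"], restarts


total, restarts = run_with_restart(100, fail_at={37, 64})
print(total, restarts)
```

Every restart repeats the steps since the last checkpoint, so the cost of this approach grows with both the checkpoint interval and the failure rate.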

6 Fault Tolerance – a new perspective
Checkpointing and restarting a 100,000 processor system could take longer than the time to the next failure.
It isn’t a good use of the resource to restart 99,999 nodes just because one failed.
A new perspective on fault tolerance is needed.
Development of algorithms that can be scale invariant and naturally fault tolerant
  – i.e. failure anywhere can be ignored?
ORNL has developed a few naturally fault tolerant algorithms. Many such algorithms exist.
Approach can also address validation.
Finite difference example
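To make "failure anywhere can be ignored" concrete, here is a small illustration using a 1D finite difference relaxation (Jacobi iteration for a steady-state problem with fixed end values). It is not one of the ORNL algorithms the slide refers to; it is a generic self-correcting iteration chosen for the example, and the failure schedule is an assumption. A "failed" cell simply loses its value, and the iteration carries on with no rollback.

```python
def relax_with_failures(n_cells, left, right, n_iters, failures):
    """Jacobi relaxation of u'' = 0 on n_cells interior points.

    `failures` maps an iteration number to the index of a cell whose value
    is lost (reset to 0.0) at that iteration. Nothing is rolled back.
    """
    u = [0.0] * n_cells
    for it in range(n_iters):
        if it in failures:
            u[failures[it]] = 0.0  # the failed cell's state is lost
        padded = [left] + u + [right]
        # Each cell becomes the average of its two neighbours.
        u = [0.5 * (padded[i - 1] + padded[i + 1])
             for i in range(1, n_cells + 1)]
    return u


n = 10
u = relax_with_failures(n, left=0.0, right=1.0, n_iters=1000,
                        failures={50: 3, 120: 7, 200: 5})
exact = [(k + 1) / (n + 1) for k in range(n)]  # straight line between the ends
print(max(abs(a - b) for a, b in zip(u, exact)))
```

The lost values act like a perturbation that the iteration damps out on its own, so the run still converges to the same answer, at the cost of some extra iterations.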

7 Need for Adaptive System Software
KISS petaflop system software. Do we need 100,000 copies of Linux? Can a microkernel do the job?
  – Dynamically configure environment to app needs
  – Less to break, less to watch
Needs to automatically detect and adapt to “changes” in the system. Note that problems happen at petaflop speeds!
  – Cost of hardware support for detection
  – Migration of tasks away from bad spots
  – Reroute messages around failures
Distributed control for fault tolerance
Harness Distributed Control
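"Reroute messages around failures" can be sketched as a shortest-path search that skips nodes known to be down. This is a centralized toy version using breadth-first search; the ring topology and node numbering are assumptions for the example, and a real system would need a distributed form of the same idea.

```python
from collections import deque


def route(links, src, dst, failed=frozenset()):
    """Return a shortest path from src to dst that avoids failed nodes.

    `links` maps each node to a list of its neighbours.
    Returns a list of nodes, or None if no route survives.
    """
    if src in failed or dst in failed:
        return None
    previous = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical 6-node ring: node i is linked to i-1 and i+1 (mod 6).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(route(ring, 0, 2))              # direct route through node 1
print(route(ring, 0, 2, failed={1}))  # detour the long way round the ring
```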

8 System Software Environments Breakout Report June 27, 2002

9 Anemic Areas of Existing Research
What are the underfunded critical research issues?
1. Security – protecting systems from compromise
2. Fault tolerance – being able to run through failure
  – Three steps: detection, notification, system recovery
  – Linkage between projects and groups for the fault tolerance chain of events through recovery
3. Validation of result – how to know that the app got the right answer, given that the system runs through failure

10 Potential Gaps in Research
What are the gaps in the MICS research portfolio related to peta-scale computing?
1. What happens after Linux?
2. OS that supports the programming model – if it is going to change from distributed memory message passing then…
3. Alternative to the microkernel approach
  – “Concrete” holds everything together but is really heavy.
  – Local address space vs. more expansive address support
4. Runtime and programming models need to give feedback to the OS so it can reconfigure to optimize for their needs
5. Experimental architecture research testbed
IRIX, Linux, DOS, OS/2, next?

