A 100,000 Ways to Fail
Al Geist, Computer Science and Mathematics Division, Oak Ridge National Laboratory
July 9, 2002, Fast-OS Workshop

Presentation transcript:

A 100,000 Ways to Fail
Al Geist
Computer Science and Mathematics Division
Oak Ridge National Laboratory
July 9, 2002
Fast-OS Workshop
Advanced Scientific Computing Research
Research sponsored by Mathematics, Information and Computational Sciences Office, U.S. Department of Energy

Fault Tolerance on 100,000 processors

Mean time to failure (MTTF) of a 100,000-CPU system will be measured in minutes.
- A problem when MTTF is O(synch time)
- Or less than application startup time

How long do I wait
- to find out if something has gone wrong?
- Memory is a million cycles away.

Validation of applications, when...
- it has a subtle synchronization error at p > 78,347
- a HW failure loses my asynchronous message or changes its bits (Myrinet)?

It worked fine on the validation test sizes.
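The "measured in minutes" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes node failures are independent with a constant rate, so that system MTTF is roughly the per-node MTTF divided by the node count. The per-node MTTF values (1, 5 and 10 years) are illustrative assumptions, not figures from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def system_mttf_minutes(node_mttf_years, nodes):
    """Approximate system MTTF in minutes.

    Assumes independent node failures with a constant failure rate,
    so the rates add and the system MTTF is node MTTF / node count.
    """
    return node_mttf_years * MINUTES_PER_YEAR / nodes


# Hypothetical per-node MTTF values, for illustration only.
for years in (1, 5, 10):
    minutes = system_mttf_minutes(years, 100_000)
    print(f"node MTTF {years:>2} y -> system MTTF {minutes:.1f} min")
```

Under these assumptions, even a node that fails only once every ten years gives a 100,000-node system less than an hour between failures.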

When you are a Failure and don’t know it

Speed is not a problem if the answer doesn’t have to be right.
- Do you care? Numerical accuracy (pgon)
- Do you know? Validation issues

Scientists seem overly concerned about getting the right answer.
- Concern for their reputation and integrity
- Some even feel the answer is the product

Such is not the case for those who report how fast their computers are (or who run Enron).

Do I have to beat the right answer out of you?

The Uncertainty Principle of Large-scale Computing

Fault Tolerance - today’s system approach

There are three main steps in traditional fault tolerance:

Detection that something has gone wrong
- System: detection in hardware
- Framework: detection by runtime environment
- Library: detection in math or communication library

Notification of the application
- Interrupt: signal sent to job
- Error code returned by application routine

Recovery of the application from the fault
- Restart: from checkpoint or from beginning
- Migration of task to other hardware
- Reassignment of work to remaining tasks

Now we are cooking!

Fault Tolerance - application recovery

There are three main steps in traditional fault tolerance:

Detection that something has gone wrong
- Application depends on runtime to do this

Notification of the application
- Interrupt: application gets a signal (limited info)
- Error code returned by library (got to check for it)

Recovery of the application from the fault
- Restart: app typically needs to include a restart routine
- Run through failure: app needs a fault tolerant programming model (e.g. FT-MPI)
  - Reassignment of work to remaining tasks
  - Lost information/state/messages are a big concern for run through

Not another hurdle!
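The "error code you have to check" and "restart routine" items can be illustrated with a small sketch. Everything here is hypothetical: the `OK`/`FAILED` codes, the JSON checkpoint file, and the `run` driver are invented for illustration and do not correspond to FT-MPI or any real library API. The step function returns a status code, the application checks it, and on failure it rolls back to the last checkpoint.

```python
import json
import os
import tempfile

OK, FAILED = 0, 1  # hypothetical status codes returned by a "library" call


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, step_fn, every=5, max_restarts=3):
    """Accumulate step_fn results, rolling back to the checkpoint on failure."""
    restarts = 0
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        status, value = step_fn(state["step"])
        if status != OK:                    # notification: app must check the code
            restarts += 1
            if restarts > max_restarts:
                raise RuntimeError("too many failures")
            state = load_checkpoint(path)   # recovery: restart from checkpoint
            continue
        state["total"] += value
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"], restarts
```

Note that the work done between the last checkpoint and the failure is repeated. That repeated work is the cost the restart approach pays, and it grows with the checkpoint interval.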

Fault Tolerance - a new perspective

Checkpointing and restarting a 100,000 processor system could take longer than the time to the next failure.

It isn’t a good use of the resource to restart 99,999 nodes just because one failed.

A new perspective on fault tolerance is needed: development of algorithms that can be scale invariant and naturally fault tolerant, i.e. failure anywhere can be ignored?

ORNL has developed a few naturally fault tolerant algorithms. Many such algorithms exist. The approach can also address validation.

[Figure: finite difference example]
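One way to picture "failure can be ignored" is an asynchronous (chaotic) relaxation on a finite difference grid. The sketch below is not the ORNL algorithm, which the slides do not describe. It is a minimal illustration, with assumed parameters, of an iteration that still converges when a random fraction of updates is simply dropped. It models lost updates, not a node that stays dead permanently.

```python
import random


def relax(n=10, sweeps=3000, drop_prob=0.2, seed=0):
    """Solve u'' = 0 on n interior points with u(0) = 0 and u(1) = 1.

    Uses in-place relaxation and randomly skips updates to mimic
    lost work or lost messages. A skipped point keeps its stale value.
    """
    rng = random.Random(seed)
    u = [0.0] * (n + 2)
    u[-1] = 1.0                              # fixed boundary values
    for _ in range(sweeps):
        for i in range(1, n + 1):
            if rng.random() < drop_prob:
                continue                     # "failure": update is ignored
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


def max_error(u):
    """Largest deviation from the exact linear solution u_i = i / (n + 1)."""
    n = len(u) - 2
    return max(abs(u[i] - i / (n + 1)) for i in range(n + 2))


print(max_error(relax()))                    # small despite 20% dropped updates
```

No checkpoint or rollback is involved: the iteration absorbs the dropped updates and only takes more sweeps to reach the same answer.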

Need for Adaptive System Software

KISS petaflop system software. Do we need 100,000 copies of Linux? Can a microkernel do the job?
- Dynamically configure environment to app needs
- Less to break, less to watch

Needs to automatically detect and adapt to “changes” in the system. Note problems happen at petaflop speeds!
- Cost of hardware support for detection
- Migration of tasks away from bad spots
- Reroute messages around failures

Distributed control for fault tolerance

[Figure: Harness Distributed Control]
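"Reroute messages around failures" can be sketched as a path search that skips nodes known to be down. The function name `route`, the adjacency-list format, and the ring topology in the usage lines are assumptions made for illustration. This is not how Harness or any particular interconnect does routing.

```python
from collections import deque


def route(links, src, dst, failed=frozenset()):
    """Return a shortest-hop path from src to dst that avoids failed nodes.

    links maps each node to a list of its neighbours.
    Returns None when no path survives the failures.
    """
    if src in failed or dst in failed:
        return None
    prev = {src: None}                  # visited set plus back-pointers
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:     # walk back-pointers to the source
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in links.get(node, ()):
            if nxt not in failed and nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical 6-node ring: each node links to its two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(route(ring, 0, 2))                # direct route through node 1
print(route(ring, 0, 2, failed={1}))    # detour the long way round
```

A centralized search like this needs a global view of which nodes have failed. The slide's call for distributed control suggests that at 100,000 nodes this knowledge would have to be maintained and acted on locally.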

System Software Environments Breakout Report June 27, 2002

Anemic Areas of Existing Research

What are the underfunded critical research issues?
1. Security: protecting systems from compromise
2. Fault tolerance: being able to run through failure
   - Three steps: detection, notification, system recovery
   - Linkage between projects and groups for the fault tolerance chain of events through recovery
3. Validation of result: how do we know that the app got the right answer, given that the system runs through failure?

Potential Gaps in Research

What are the gaps in the MICS research portfolio related to peta-scale computing?
1. What happens after Linux?
2. An OS that supports the programming model, if it is going to change from distributed memory message passing then…
3. Alternative to the microkernel approach
   - “Concrete” holds everything together but is really heavy.
   - Local address space vs more expansive address support
4. Runtime and programming models need to give feedback to the OS so it can reconfigure to optimize needs
5. Experimental architecture research testbed

[Figure: IRIX, Linux, DOS, OS/2, next?]