Report on 2002 Fault Tolerance Workshop
Patricia D. Hough, Computational Sciences and Mathematics Research Department, Sandia National Laboratories

Presentation transcript:


Motivation
• Large COTS systems are prone to failures
  » Lots of parts; complex configurations
  » Applications stress the systems
  » Few options for application survival
• University resources are untapped
  » DOE researchers unfamiliar with fault tolerance experts
  » University researchers unfamiliar with DOE problem domain
Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.

Basic Info
• June 10-11, 2002 in Albuquerque, NM
• ~40 attendees
  » Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
• Interest exceeded capacity
• Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
• Sponsored by the CSRI

Agenda
• 11 invited talks + 2 hours focused discussion on:
  » Application descriptions and needs
  » System monitoring
  » MPI fault tolerance
  » Traditional approaches with a twist
• Topics not covered
  » Checkpoint-free algorithms
  » Preventative measures
  » System services
  » Migration
  » Redistribution
  » Validation
  » Run-time environments

Conclusions
• MPI support is needed
  » Programming model needs to be considered
  » Balance research with timely delivery of capabilities
• New ideas are needed
  » Leverage hardware
  » More systematic, integrated approach
• There are still outstanding issues
  » Transparency vs. intrusiveness
  » Can traditional approaches be made scalable?
• Workshop was a great success!

For more information…