Presentation transcript:

Slide 1: Scalable Fault Tolerance for Petascale Systems
Lawrence Livermore National Laboratory, Science & Technology Principal Directorate - Computation Directorate
3/20/2008
Greg Bronevetsky, Bronis de Supinski, Peter Lindstrom, Adam Moody, Martin Schulz (CAR - CASC)
Performance Measures x.x, x.x, and x.x

Slide 2: Enabling Fault Tolerance for Petascale Systems
- Problem: Reliability is a key concern for petascale systems
  - Current fault tolerance approaches scale poorly and use significant I/O bandwidth
- Deliverables:
  - Efficient application checkpointing software for upcoming petascale systems
  - High-performance I/O system designs for future petascale systems
- Ultimate objective: Reliable software on unreliable petascale hardware

Slide 3: Our team has extensive experience implementing scalable fault tolerance and compression techniques
- Funding request: $500k/year (none from other directorates)
- Team members:
  - Peter Lindstrom (0.25 FTE): floating-point compression
  - Adam Moody (0.5 FTE): checkpointing / HPC systems
  - Martin Schulz (0.25 FTE): checkpointing / HPC systems
  - Greg Bronevetsky (0.25 FTE): checkpointing / soft errors
- External collaborators (anticipated): Sally McKee (Cornell University)

Slide 4: Checkpoints on current systems are limited by the I/O bottleneck
- Current practice: drinking the ocean through a straw
  [Diagram: compute network -> I/O nodes -> parallel file system]
  - BG/L: 20 minutes per checkpoint (pre-upgrade)
  - Zeus: 26 minutes
  - Argonne BG/P: 30 minutes (target)
- Alternative: flash and disks on the compute network and I/O nodes
  [Diagram: storage elements attached to the compute network and I/O nodes]
  - Extra level of cache between the compute nodes and the parallel file system
  - Thunder checkpoint: to parallel file system, 80 minutes; to local disks, 1 minute
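
The times quoted on this slide follow from a simple model: checkpoint time is roughly the total data written divided by the aggregate bandwidth of the target storage. The sketch below works through that arithmetic with illustrative numbers only; the memory size, bandwidth figures, and node count are assumptions, not measurements from BG/L, Zeus, BG/P, or Thunder.

    /* Back-of-the-envelope model: time ~ data written / aggregate bandwidth.
     * All numbers are assumptions for illustration. */
    #include <stdio.h>

    int main(void) {
        double ckpt_data_gb = 50.0 * 1024.0;  /* assume ~50 TB of application state       */
        double pfs_bw_gbs   = 40.0;           /* assume ~40 GB/s aggregate PFS bandwidth   */
        double local_bw_gbs = 1.0 * 2048.0;   /* assume 2048 nodes x ~1 GB/s local media   */

        printf("to parallel file system: %.1f minutes\n", ckpt_data_gb / pfs_bw_gbs / 60.0);   /* ~21 min  */
        printf("to node-local storage:   %.1f minutes\n", ckpt_data_gb / local_bw_gbs / 60.0); /* ~0.4 min */
        return 0;
    }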

Slide 5: Checkpoint scalability must be improved to support coming systems such as Sequoia
- Total RAM on current systems:
    BG/L     54 TB
    Purple   49 TB
    Ranger   123 TB
- Checkpoint Size Reduction
  - Incremental Checkpointing: save only state that changed since the last checkpoint; changes detected via runtime or compiler
  - Checkpoint Compression: floating point-specific; sensitive to relationships between data
- Scalable Checkpoint Coordination
  - Subsets of processors checkpoint together
  - I/O pressure spread evenly over time
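
As a concrete illustration of the incremental-checkpointing idea on this slide (save only state that changed since the last checkpoint), here is a minimal sketch that detects changes by hashing fixed-size blocks. The block size, hash choice, and output format are assumptions, not part of any LLNL tool; the project's own mechanisms (runtime- or compiler-based change detection, floating-point compression) are not shown.

    /* Minimal sketch of block-level incremental checkpointing using a
     * hash-based change detector; a production system would more likely
     * use page protection or compiler instrumentation instead. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096

    /* FNV-1a hash over one block of application state. */
    uint64_t block_hash(const unsigned char *p, size_t n) {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
        return h;
    }

    /* Write only the blocks of `data` whose hash differs from the hashes
     * recorded at the previous checkpoint; returns blocks written. */
    size_t incremental_checkpoint(FILE *out, const unsigned char *data,
                                  size_t size, uint64_t *prev_hash) {
        size_t written = 0;
        for (size_t b = 0; b * BLOCK_SIZE < size; b++) {
            size_t off = b * BLOCK_SIZE;
            size_t len = (size - off < BLOCK_SIZE) ? size - off : BLOCK_SIZE;
            uint64_t h = block_hash(data + off, len);
            if (h != prev_hash[b]) {              /* block changed since last checkpoint */
                fwrite(&b, sizeof b, 1, out);     /* record the block index...           */
                fwrite(data + off, 1, len, out);  /* ...followed by its contents         */
                prev_hash[b] = h;
                written++;
            }
        }
        return written;
    }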

Slide 6: Application-specific APIs will enable novel fault tolerance solutions like those used in ddcMD
- Application semantics improve performance
- Programmers can identify:
  - Data that doesn't need to be saved
  - Types of data structures (key for high-performance compression)
  - Matrix relationships
  - Recomputation vs. storage
- Fault detection algorithms
  - Critical for soft errors
  - Example: ddcMD corrects cache errors on BG/L
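
To make the bullet list above concrete, the interface sketch below shows what an application-specific API of this kind could look like to a developer. Every name and signature here is hypothetical (none exist in ddcMD or any LLNL library); it only illustrates how a programmer might declare data that need not be saved, data types that enable compression, recomputable state, and a soft-error validator.

    /* Hypothetical application-level fault tolerance interface (sketch only). */
    #include <stddef.h>

    typedef enum {
        FT_SAVE_RAW,      /* opaque bytes: always checkpointed as-is              */
        FT_SAVE_FLOAT64,  /* double array: eligible for FP-aware compression      */
        FT_SKIP,          /* scratch data: safe to omit from checkpoints          */
        FT_RECOMPUTE      /* derived data: rebuilt on restart instead of saved    */
    } ft_policy_t;

    /* Associate a memory region with a checkpointing policy. */
    void ft_register(const char *name, void *ptr, size_t bytes, ft_policy_t policy);

    /* Provide a routine that rebuilds FT_RECOMPUTE data after restart. */
    void ft_set_recompute(const char *name, void (*rebuild)(void *ptr, size_t bytes));

    /* Application-supplied check that flags silent data corruption,
     * in the spirit of ddcMD's cache-error detection on BG/L. */
    void ft_set_validator(const char *name, int (*is_valid)(const void *ptr, size_t bytes));

    /* Example use in a solver: the field is compressed and saved,
     * the workspace is skipped entirely. */
    void example_setup(double *field, size_t n, double *workspace, size_t wn) {
        ft_register("field", field, n * sizeof *field, FT_SAVE_FLOAT64);
        ft_register("workspace", workspace, wn * sizeof *workspace, FT_SKIP);
    }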

Slide 7: Our project will create a paradigm shift in LLNL application reliability
- Current LLNL practice: users write their own checkpointing code
  - Wastes programmer time
  - Checkpointing at global barriers is unscalable
- Current automated solutions do not scale
  - Very large checkpoints
  - No information about the application
- This project will:
  - Match I/O demands to I/O capacity
  - Minimize programmer effort
  - Scale checkpointing to petascale systems
  - Enable application-specific fault tolerance solutions

Slide 8: Fault tolerance is critical for Sequoia and all future platforms
- CAR S&T Strategy 1.1: "Perform the research to develop new algorithms that can best exploit likely HPC hardware characteristics, including ... fault-tolerant algorithms that can withstand processor failure"
  - Project enables application fault tolerance
- Target audience: application developers
  - pf3d uses Adam Moody's in-memory checkpointer
  - ddcMD implements complex error tolerance schemes
- Deliverables:
  - Efficient application checkpointing software for upcoming petascale systems (e.g., Sequoia)
  - High-performance I/O system designs for future petascale systems
  - Application-specific fault tolerance APIs
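
The mention above of pf3d's in-memory checkpointer points at a multilevel pattern: write most checkpoints to fast node-local storage and only occasionally flush one to the parallel file system. The loop below is a generic sketch of that pattern under stated assumptions, not the interface of the actual checkpointer; the paths, flush interval, and write_state stub are all illustrative.

    /* Generic sketch of multilevel checkpointing: frequent, cheap checkpoints
     * to node-local storage, occasional flushes to the parallel file system. */
    #include <mpi.h>
    #include <stdio.h>

    /* Application-specific state dump (stubbed here); returns 1 on success. */
    static int write_state(const char *path) {
        FILE *f = fopen(path, "wb");
        if (!f) return 0;
        /* fwrite application state here */
        return fclose(f) == 0;
    }

    int checkpoint(int step, int rank) {
        const int flush_interval = 10;   /* assume every 10th checkpoint goes to the PFS */
        char path[256];

        if (step % flush_interval == 0) {
            /* Slow but durable: survives whole-system failure (path is illustrative). */
            snprintf(path, sizeof path, "/p/lscratch/app/ckpt.%d.%d", step, rank);
        } else {
            /* Fast but local: lost on node failure unless paired with a
             * cross-node redundancy scheme. */
            snprintf(path, sizeof path, "/tmp/app/ckpt.%d.%d", step, rank);
        }

        int ok = write_state(path), all_ok = 0;
        /* A checkpoint only counts if every rank completed it. */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
        return all_ok;
    }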