Scalable Fault Tolerance for Petascale Systems
Lawrence Livermore National Laboratory, Science & Technology Principal Directorate - Computation Directorate

Presentation transcript:

Slide 1: Scalable Fault Tolerance for Petascale Systems
3/20/2008
Greg Bronevetsky, Bronis de Supinski, Peter Lindstrom, Adam Moody, Martin Schulz
CAR - CASC
Performance Measures x.x, x.x, and x.x

Slide 2: Enabling Fault Tolerance for Petascale Systems
- Problem: reliability is a key concern for petascale systems; current fault tolerance approaches scale poorly and use significant I/O bandwidth
- Deliverables:
  - Efficient application checkpointing software for upcoming petascale systems
  - High-performance I/O system designs for future petascale systems
- Ultimate objective: reliable software on unreliable petascale hardware

Slide 3: Our team has extensive experience implementing scalable fault tolerance and compression techniques
- Funding request: $500k/year (none from other directorates)
- Team members:
  - Peter Lindstrom (.25 FTE): floating point compression
  - Adam Moody (.5 FTE): checkpointing/HPC systems
  - Martin Schulz (.25 FTE): checkpointing/HPC systems
  - Greg Bronevetsky (.25 FTE): checkpointing/soft errors
- External collaborators (anticipated): Sally McKee (Cornell University)

Slide 4: Checkpoints on current systems are limited by the I/O bottleneck
- Current practice: drinking the ocean through a straw (compute nodes -> compute network -> I/O nodes -> parallel file system)
  - BG/L: 20 minutes per checkpoint (pre-upgrade)
  - Zeus: 26 minutes
  - Argonne BG/P: 30 minutes (target)
- Thunder checkpoint:
  - To parallel file system: 80 minutes
  - To local disks: 1 minute
- Alternative: flash and disks on the compute network and I/O nodes act as storage elements, an extra level of cache between the compute nodes and the parallel file system (a minimal sketch of this two-level idea follows)
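The alternative amounts to two-level checkpointing: commit the checkpoint quickly to node-local storage, then drain it to the parallel file system. Below is a minimal C sketch of that idea, not the project's actual design; the paths (/l/ssd, /p/lscratch) are invented, and a real system would drain asynchronously rather than in the same call path.

/* Two-level checkpoint sketch: fast node-local write, then drain to the
 * parallel file system. Paths and the synchronous drain are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_DIR "/l/ssd"       /* hypothetical fast node-local storage */
#define PFS_DIR   "/p/lscratch"  /* hypothetical parallel file system    */

static void write_file(const char *path, const void *buf, size_t n) {
    FILE *f = fopen(path, "wb");
    if (!f) { perror(path); exit(1); }
    if (fwrite(buf, 1, n, f) != n) { perror("fwrite"); exit(1); }
    fclose(f);
}

int main(void) {
    size_t n = (size_t)1 << 20;                /* stand-in application state */
    double *state = calloc(n, sizeof *state);
    if (!state) { perror("calloc"); return 1; }

    /* Step 1: fast local write; the application resumes computing once this
     * returns (minutes instead of the 80-minute Thunder PFS number above). */
    write_file(LOCAL_DIR "/ckpt.0", state, n * sizeof *state);

    /* Step 2: drain to the parallel file system. A real design would do this
     * asynchronously (helper thread or I/O nodes) to hide the PFS latency. */
    write_file(PFS_DIR "/ckpt.0", state, n * sizeof *state);

    free(state);
    return 0;
}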

Slide 5: Checkpoint scalability must be improved to support coming systems such as Sequoia

             BG/L    Purple   Ranger
  Total RAM  54 TB   49 TB    123 TB

- Checkpoint Size Reduction
  - Incremental checkpointing: save only state that changed since the last checkpoint; changes detected via runtime or compiler (see the dirty-page sketch below)
  - Checkpoint compression: floating point-specific, sensitive to relationships between data
- Scalable Checkpoint Coordination
  - Subsets of processors checkpoint together
  - I/O pressure spread evenly over time (see the MPI sketch below)
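A common runtime mechanism for incremental checkpointing is dirty-page tracking with mprotect and a SIGSEGV handler. The C sketch below is an assumption about how the "changes detected via runtime" bullet could work, not the project's implementation; the page count and the "save" step are placeholders.

/* Dirty-page tracking sketch for incremental checkpointing: write-protect
 * the tracked region; the SIGSEGV handler records which pages the app
 * modifies and unprotects them so the write can retry. At checkpoint time,
 * only dirty pages need saving. Linux/POSIX; no threads or error handling. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16
static char *region;         /* tracked memory    */
static long  pagesz;
static int   dirty[NPAGES];  /* one flag per page */

static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    long page = ((char *)si->si_addr - region) / pagesz;
    dirty[page] = 1;                               /* remember the write */
    mprotect(region + page * pagesz, pagesz,
             PROT_READ | PROT_WRITE);              /* let the write retry */
}

static void checkpoint(void) {
    for (int p = 0; p < NPAGES; p++)
        if (dirty[p])
            printf("would save page %d\n", p);     /* placeholder "save" */
    memset(dirty, 0, sizeof dirty);
    mprotect(region, NPAGES * pagesz, PROT_READ);  /* re-arm tracking */
}

int main(void) {
    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region, NPAGES * pagesz, PROT_READ);  /* arm tracking */
    region[0] = 1;             /* dirties page 0 */
    region[5 * pagesz] = 2;    /* dirties page 5 */
    checkpoint();              /* saves only pages 0 and 5 */
    return 0;
}

The compiler-based variant mentioned on the slide would instead instrument stores to the tracked data rather than relying on page protection.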

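For the coordination side, one minimal way to have "subsets of processors checkpoint together" is to split MPI_COMM_WORLD into groups and let each group take its turn writing. The sketch below is an illustration under stated assumptions: the group count is arbitrary, and the global barrier that serializes turns is a stand-in, since a real scheme would stagger groups by time or tokens so non-writing groups keep computing.

/* Staggered checkpoint coordination sketch: split the job into NGROUPS
 * subsets; each subset coordinates and writes together while the others
 * wait. NGROUPS and the "write" step are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define NGROUPS 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks with the same color checkpoint together. */
    int color = rank % NGROUPS;
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    for (int turn = 0; turn < NGROUPS; turn++) {
        if (color == turn) {
            MPI_Barrier(group);  /* coordinate only within the subset */
            printf("rank %d: writing checkpoint with group %d\n", rank, color);
            /* ... write this subset's checkpoint data ... */
        }
        MPI_Barrier(MPI_COMM_WORLD);  /* serialize turns (sketch only) */
    }

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}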
Slide 6: Application-specific APIs will enable novel fault tolerance solutions like those used in ddcMD
- Application semantics improve performance
- Programmers can identify (a hypothetical API sketch follows this slide):
  - Data that doesn't need to be saved
  - Types of data structures: key for high-performance compression (e.g. matrix relationships)
  - Recomputation vs. storage
  - Fault detection algorithms: critical for soft errors (e.g. ddcMD corrects cache errors on BG/L)
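To make the API idea concrete, here is a hypothetical C interface of the kind this slide describes. Every name (ft_register, ft_exclude, ft_verify, FT_MATRIX_DOUBLE) is invented for illustration; the slide does not define an actual interface. The stubs only print what a real checkpointer would record and act on.

/* Hypothetical application-specific fault tolerance API; all names are
 * invented for illustration, not an actual LLNL interface. */
#include <stdio.h>
#include <stddef.h>

typedef enum { FT_OPAQUE, FT_MATRIX_DOUBLE } ft_type;  /* structure hints aid compression */

static void ft_register(void *buf, size_t n, ft_type t) {
    printf("save %zu bytes at %p (type %d)\n", n, buf, (int)t);
}
static void ft_exclude(void *buf, size_t n) {
    printf("skip %zu bytes at %p (recomputed on restart)\n", n, buf);
}
static void ft_verify(int (*check)(void)) {
    printf("soft-error check %s\n", check() ? "passed" : "failed");
}

static double field[256][256];  /* essential state: save and compress */
static double scratch[256];     /* temporary: cheaper to recompute    */

static int check_invariants(void) { return 1; /* e.g. energy bounds */ }

int main(void) {
    ft_register(field, sizeof field, FT_MATRIX_DOUBLE);  /* data-structure type      */
    ft_exclude(scratch, sizeof scratch);                 /* recomputation vs storage */
    ft_verify(check_invariants);                         /* ddcMD-style detection    */
    return 0;
}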

Slide 7: Our project will create a paradigm shift in LLNL application reliability
- LLNL practice: users write their own checkpointing code
  - Wastes programmer time
  - Checkpointing at global barriers is unscalable
- Current automated solutions do not scale
  - Very large checkpoints
  - No information about the application
- This project will:
  - Match I/O demands to I/O capacity
  - Minimize programmer effort
  - Scale checkpointing to petascale systems
  - Enable application-specific fault tolerance solutions

Slide 8: Fault tolerance is critical for Sequoia and all future platforms
- CAR S&T Strategy 1.1: "Perform the research to develop new algorithms that can best exploit likely HPC hardware characteristics, including ... fault-tolerant algorithms that can withstand processor failure"
  - This project enables application fault tolerance
- Target audience: application developers
  - pf3d uses Adam Moody's in-memory checkpointer
  - ddcMD implements complex error tolerance schemes
- Deliverables:
  - Efficient application checkpointing software for upcoming petascale systems (e.g. Sequoia)
  - High-performance I/O system designs for future petascale systems
  - Application-specific fault tolerance APIs

