Lawrence Livermore National Laboratory
Science & Technology Principal Directorate - Computation Directorate

Scalable Fault Tolerance for Petascale Systems
3/20/2008
Greg Bronevetsky, Bronis de Supinski, Peter Lindstrom, Adam Moody, Martin Schulz
CAR - CASC
Performance Measures x.x, x.x, and x.x
Enabling Fault Tolerance for Petascale Systems

Problem:
- Reliability is a key concern for petascale systems
- Current fault tolerance approaches scale poorly and use significant I/O bandwidth

Deliverables:
- Efficient application checkpointing software for upcoming petascale systems
- High-performance I/O system designs for future petascale systems

Ultimate objective: reliable software on unreliable petascale hardware
Our team has extensive experience implementing scalable fault tolerance and compression techniques

Funding request: $500k/year (none from other directorates)

Team members:
- Peter Lindstrom (0.25 FTE): floating-point compression
- Adam Moody (0.5 FTE): checkpointing / HPC systems
- Martin Schulz (0.25 FTE): checkpointing / HPC systems
- Greg Bronevetsky (0.25 FTE): checkpointing / soft errors

External collaborators (anticipated):
- Sally McKee (Cornell University)
Checkpoints on current systems are limited by the I/O bottleneck

Current practice: drinking the ocean through a straw
[Diagram: compute nodes -> compute network -> I/O nodes -> parallel file system]
- BG/L: 20 minutes per checkpoint (pre-upgrade)
- Zeus: 26 minutes
- Argonne BG/P: 30 minutes (target)

Alternative: flash or disks on the compute network or I/O nodes (sketched after this slide)
[Diagram: compute nodes -> compute network -> I/O nodes -> storage elements -> parallel file system]
- Extra level of cache between compute nodes and the parallel file system
- Thunder checkpoint: 80 minutes to the parallel file system, 1 minute to local disks
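The cache idea can be illustrated with a small two-level checkpoint writer: every checkpoint lands on fast node-local storage, and only an occasional checkpoint is copied to the slower parallel file system. This is a minimal sketch under assumed conditions; the paths, the flush interval, and the function names are illustrative inventions, not part of any LLNL tool.

#include <stdio.h>

/* Hypothetical two-level checkpoint writer: every checkpoint goes to fast
 * node-local storage; only every Nth checkpoint is also copied to the
 * (much slower) parallel file system.  Paths and interval are assumptions. */
#define LOCAL_DIR  "/tmp/ckpt"          /* node-local disk or flash */
#define PFS_DIR    "/p/lscratch/ckpt"   /* parallel file system (example path) */
#define PFS_EVERY  10                   /* flush to PFS every 10th checkpoint */

static int write_file(const char *path, const void *buf, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(buf, 1, len, f);
    fclose(f);
    return (n == len) ? 0 : -1;
}

int checkpoint(int id, const void *state, size_t len)
{
    char path[256];

    /* Level 1: always write to node-local storage (fast, absorbs most I/O). */
    snprintf(path, sizeof(path), "%s/ckpt_%d.dat", LOCAL_DIR, id);
    if (write_file(path, state, len) != 0) return -1;

    /* Level 2: occasionally copy to the parallel file system so the job can
     * survive the loss of a whole node, not just a process failure. */
    if (id % PFS_EVERY == 0) {
        snprintf(path, sizeof(path), "%s/ckpt_%d.dat", PFS_DIR, id);
        if (write_file(path, state, len) != 0) return -1;
    }
    return 0;
}

int main(void)
{
    double state[1024] = {0};           /* stand-in for application state */
    for (int id = 1; id <= 20; id++)
        if (checkpoint(id, state, sizeof(state)) != 0)
            fprintf(stderr, "checkpoint %d failed\n", id);
    return 0;
}

Writing locally on every checkpoint keeps the common case fast (Thunder's 1 minute versus 80 minutes), while the periodic copy to the parallel file system protects against losing an entire node or its disk.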
Checkpoint scalability must be improved to support coming systems such as Sequoia

Total RAM:
- BG/L: 54 TB
- Purple: 49 TB
- Ranger: 123 TB

Checkpoint Size Reduction:
- Incremental checkpointing: save only state that changed since the last checkpoint; changes detected via runtime or compiler (see the sketch after this slide)
- Checkpoint compression: floating point-specific, sensitive to relationships between data

Scalable Checkpoint Coordination:
- Subsets of processors checkpoint together
- I/O pressure spread evenly over time
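To make the incremental-checkpointing bullet concrete, here is a minimal sketch of change detection by block hashing: the application state is split into fixed-size blocks, one hash per block is kept from the previous checkpoint, and only blocks whose hash changed are written. The block size, the FNV-1a hash, and the function names are assumptions for illustration; the project itself proposes detecting changes via the runtime or compiler rather than by hashing.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096   /* granularity of change detection (assumed) */

/* FNV-1a hash of one block; stands in for runtime/compiler dirty tracking. */
static uint64_t block_hash(const unsigned char *p, size_t len)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Write only blocks whose hash differs from the previous checkpoint.
 * 'prev_hash' holds one hash per block and persists across calls. */
size_t incremental_checkpoint(FILE *out, const void *state, size_t len,
                              uint64_t *prev_hash)
{
    const unsigned char *p = state;
    size_t nblocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
    size_t written = 0;

    for (size_t b = 0; b < nblocks; b++) {
        size_t off = b * BLOCK_SIZE;
        size_t n = (len - off < BLOCK_SIZE) ? (len - off) : BLOCK_SIZE;
        uint64_t h = block_hash(p + off, n);
        if (h != prev_hash[b]) {                 /* block changed since last checkpoint */
            fwrite(&off, sizeof(off), 1, out);   /* record offset, then data */
            fwrite(p + off, 1, n, out);
            prev_hash[b] = h;
            written += n;
        }
    }
    return written;   /* bytes actually saved this round */
}

int main(void)
{
    static double state[1 << 16];                /* stand-in application state */
    static uint64_t prev[sizeof(state) / BLOCK_SIZE + 1];
    FILE *out = fopen("ckpt.inc", "wb");
    if (!out) return 1;

    state[42] = 3.14;                            /* first round: everything is "new" */
    printf("wrote %zu bytes\n", incremental_checkpoint(out, state, sizeof(state), prev));

    state[42] = 2.71;                            /* second round: only one block changed */
    printf("wrote %zu bytes\n", incremental_checkpoint(out, state, sizeof(state), prev));

    fclose(out);
    return 0;
}

On the second call only the single 4 KB block containing the modified value is written, which is exactly the checkpoint size reduction this slide is after.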
Application-specific APIs will enable novel fault tolerance solutions like those used in ddcMD

Application semantics improve performance. Programmers can identify (see the API sketch after this slide):
- Data that doesn't need to be saved
- Types of data structures (key for high-performance compression)
- Matrix relationships
- Recomputation vs. storage
- Fault detection algorithms (critical for soft errors; e.g., ddcMD corrects cache errors on BG/L)
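A sketch of what such an application-specific API could look like is shown below. Every name in it (ft_register, ft_hint_t, the hint values, ft_register_detector, setup_fault_tolerance) is a hypothetical illustration of the idea on this slide, not the API this project or ddcMD actually provides.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical hints a programmer can attach to a region of memory. */
typedef enum {
    FT_SAVE,            /* must be saved in every checkpoint          */
    FT_SKIP,            /* scratch data, never needs to be saved      */
    FT_RECOMPUTE,       /* cheaper to recompute on restart than save  */
    FT_FLOAT_ARRAY      /* eligible for floating-point compression    */
} ft_hint_t;

/* Stub: a real library would record the region for later checkpointing. */
static void ft_register(const char *name, void *data, size_t bytes, ft_hint_t hint)
{
    (void)data;
    printf("registered %-10s %zu bytes, hint=%d\n", name, bytes, hint);
}

/* Stub: a real library would run this application-supplied check to catch
 * silent data corruption, e.g. the cache-error check ddcMD performs on BG/L. */
static void ft_register_detector(int (*check)(void)) { (void)check; }

/* Example usage inside a fictional molecular dynamics code. */
void setup_fault_tolerance(double *positions, double *velocities,
                           double *forces, size_t n_atoms,
                           int (*energy_check)(void))
{
    ft_register("positions",  positions,  n_atoms * sizeof(double), FT_FLOAT_ARRAY);
    ft_register("velocities", velocities, n_atoms * sizeof(double), FT_FLOAT_ARRAY);
    /* Forces are recomputed from positions each step; no need to save them. */
    ft_register("forces",     forces,     n_atoms * sizeof(double), FT_RECOMPUTE);
    ft_register_detector(energy_check);
}

static int energy_ok(void) { return 1; }         /* placeholder corruption check */

int main(void)
{
    static double pos[1000], vel[1000], frc[1000];
    setup_fault_tolerance(pos, vel, frc, 1000, energy_ok);
    return 0;
}

The point of the interface is that the checkpoint library learns which data can be skipped or recomputed and which arrays are floating-point (and therefore compressible), instead of blindly saving the whole address space.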
Our project will create a paradigm shift in LLNL application reliability

Current LLNL practice:
- Users write their own checkpointing code, which wastes programmer time
- Checkpointing at global barriers is unscalable
- Current automated solutions do not scale: very large checkpoints, no information about the application

This project will:
- Match I/O demands to I/O capacity
- Minimize programmer effort
- Scale checkpointing to petascale systems
- Enable application-specific fault tolerance solutions
Fault tolerance is critical for Sequoia and all future platforms

CAR S&T Strategy 1.1: "Perform the research to develop new algorithms that can best exploit likely HPC hardware characteristics, including … fault-tolerant algorithms that can withstand processor failure"

Project enables application fault tolerance. Target audience: application developers
- pf3d uses Adam Moody's in-memory checkpointer
- ddcMD implements complex error tolerance schemes

Deliverables:
- Efficient application checkpointing software for upcoming petascale systems (e.g., Sequoia)
- High-performance I/O system designs for future petascale systems
- Application-specific fault tolerance APIs